DIRAC on Human Glioblastoma — Vertical Tutorial¶
Goal: Run DIRAC on a real Human Glioblastoma spatial multi-omics sample (RNA + Protein/ADT), produce a joint embedding, cluster with Mclust, visualize spatial maps, and save all outputs (H5AD/CSV/PDF) for downstream analysis.
Table of Contents¶
Overview
Environment & Data
End-to-End Script (copy/paste)
Parameter Notes & Good Defaults
Outputs & File Saving
Reproducibility
Troubleshooting & Tips
1. Overview¶
Data: Human Glioblastoma with RNA and Protein (ADT) modalities.
Integration: DIRAC learns a joint embedding from RNA PCA + ADT expression using a spatial k-NN graph.
Clustering:
mclust(via R) to obtain discrete domains (heren_clusters = 9).Visualization: spatial clustering maps (PDF), colored by cluster assignments.
Artifacts: save AnnData objects, cluster CSV, and figures to a dedicated output folder.
2. Environment & Data¶
Dependencies (minimum)
Python ≥ 3.8
scanpy,anndata,numpy,matplotlib,pandasR with package ``mclust`` (required for
sd.utils.mclust_R)DIRAC codebase locally available If not installed, follow: https://dirac-tutorial.readthedocs.io/en/latest/install.html
Paths used
data_path:../DIRAC-main/data/Glioblastomadata_name:Human_Glioblastomaout_path:./Resultssave_path:./Results/Human_Glioblastoma_DIRAC(auto-created)
Expected files
Human_Glioblastoma_RNA.h5adHuman_Glioblastoma_Protein.h5ad
Download link:
3. End-to-End Script¶
At-a-glance map:
3.1–3.3 = prep & preprocessing
3.4 = build graph
3.5 = DIRAC training (joint embedding)
3.6 = write embedding
3.7 = clustering + spatial plot
3.8 = (Optional) Cross-omics mixing check — UMAP on the joint embedding
3.8 = save outputs (H5AD/CSV/PDF)
It uses GPU if available; otherwise set use_gpu=False.
3.1 Imports, fonts, paths, and seed¶
[1]:
import os
import pandas as pd
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import anndata
import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42
mpl.rcParams['ps.fonttype'] = 42
import sodirac as sd
sd.utils.seed_torch(seed=6)
data_path = "../DIRAC-main/data/Glioblastoma"
data_name = "Human_Glioblastoma"
methods = "DIRAC"
out_path = "./Results"
n_clusters = 9
save_path = os.path.join(out_path, f"{data_name}_{methods}")
if not os.path.exists(save_path):
os.makedirs(save_path)
3.2 Load RNA/Protein and align cell order¶
[2]:
adata_RNA = sc.read(os.path.join(data_path,f"{data_name}_RNA.h5ad"))
adata_RNA.raw = adata_RNA
adata_RNA.var_names_make_unique()
adata_Protein = sc.read(os.path.join(data_path,f"{data_name}_Protein.h5ad"))
adata_Protein.raw = adata_Protein
adata_Protein = adata_Protein[adata_RNA.obs_names]
# adata_RNA = adata_RNA.copy()
adata_RNA.obsm["spatial"] = adata_RNA.obsm["spatial"].astype(float)
# adata_Protein = adata_Protein.copy()
adata_Protein.obsm["spatial"] = adata_Protein.obsm["spatial"].astype(float)
/home/project/11003054/changxu/Conda_envs/dirac-env/lib/python3.9/site-packages/anndata/_core/anndata.py:1820: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
/home/project/11003054/changxu/Conda_envs/dirac-env/lib/python3.9/site-packages/anndata/_core/anndata.py:1820: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
/tmp/ipykernel_3355960/3436773580.py:11: ImplicitModificationWarning: Setting element `.obsm['spatial']` of view, initializing view as actual.
adata_Protein.obsm["spatial"] = adata_Protein.obsm["spatial"].astype(float)
3.3 Preprocess (RNA: filter/HVG/PCA; ADT: normalize/log/scale)¶
[3]:
######### Data processing
sc.pp.filter_genes(adata_RNA, min_cells=3)
sc.pp.normalize_total(adata_RNA)
sc.pp.log1p(adata_RNA)
sc.pp.highly_variable_genes(adata_RNA, n_top_genes=2000)
sc.pp.scale(adata_RNA)
sc.tl.pca(adata_RNA, n_comps=100)
sc.pp.normalize_total(adata_Protein)
sc.pp.log1p(adata_Protein)
sc.pp.scale(adata_Protein)
3.4 Build the spatial k-NN graph¶
[4]:
edge_index = sd.utils.get_single_edge_index(adata_RNA.obsm["spatial"], n_neighbors = 8)
Average neighbors per node (directed): 8.00 (edges=46048, nodes=5756)
3.5 Configure and train DIRAC¶
[5]:
dirac = sd.main.integrate_app(save_path = save_path, use_gpu = True, subgraph = False)
samples = dirac._get_data(dataset_list = [adata_RNA.obsm['X_pca'], adata_Protein.X], edge_index = edge_index)
models = dirac._get_model(samples, opt_GNN = "GAT")
data_z, combine_recon = dirac._train_dirac_integrate(samples = samples, models = models, scale_loss = 0.04, epochs = 300)
Found 2 unique domains.
DIRAC integrate..: 100%|█| 300/300 [0
3.6 Write back the joint embedding¶
[6]:
adata_RNA.obsm["DIRAC_embed"] = combine_recon
adata_Protein.obsm["DIRAC_embed"] = combine_recon
3.7 Cluster (MCLUST via R) and make a spatial plot¶
[7]:
# Optional: palette for up to ~30 clusters
colormaps = ["#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00", "#A65628", "#FFFF33", "#F781BF", "#999999",
"#E5D8BD", "#B3CDE3", "#CCEBC5", "#FED9A6", "#FBB4AE", "#8DD3C7", "#BEBADA", "#80B1D3", "#B3DE69", "#FCCDE5",
"#BC80BD", "#FFED6F", "#8DA0CB", "#E78AC3", "#E5C494", "#CCCCCC", "#FB9A99", "#E31A1C", "#CAB2D6", "#6A3D9A", "#B15928"]
sd.utils.mclust_R(adata=adata_RNA, num_cluster = n_clusters, used_obsm="DIRAC_embed",key_added="DIRAC")
sc.pl.spatial(adata_RNA, color=["DIRAC"], palette = colormaps, frameon=False, spot_size=200, show=False)
plt.savefig(os.path.join(save_path, f"{data_name}_{methods}_spatial.pdf"), bbox_inches='tight', dpi = 300)
R[write to console]: __ __
____ ___ _____/ /_ _______/ /_
/ __ `__ \/ ___/ / / / / ___/ __/
/ / / / / / /__/ / /_/ (__ ) /_
/_/ /_/ /_/\___/_/\__,_/____/\__/ version 6.0.0
Type 'citation("mclust")' for citing this R package in publications.
fitting ...
|======================================================================| 100%
3.8 (Optional) Cross-omics mixing check — UMAP on the joint embedding¶
What/why: After clustering the RNA layer, you may want to verify that RNA and Protein cells are well mixed in the shared embedding (i.e., integration worked). The snippet below:
tags each cell with its modality,
builds a combined AnnData from the joint embedding,
runs neighbors/UMAP + Leiden and MCLUST on the embedding,
plots UMAP colored by Omics, embed_mclust, and embed_leiden.
Assumption:
data_zis the stacked embedding returned by DIRAC for all cells in the same order asdataset_list=[RNA, Protein], i.e.data_z = [RNA_rows; Protein_rows]. If your method returns a different ordering, reorderadata.obsaccordingly before concatenation.
[8]:
# Label each modality for mixing diagnostics
adata_RNA.obs["Omics"] = data_name + "_RNA"
adata_RNA.obs["Omics"] = adata_RNA.obs["Omics"].astype("category")
adata_Protein.obs["Omics"] = data_name + "_Protein"
adata_Protein.obs["Omics"] = adata_Protein.obs["Omics"].astype("category")
# Build a combined AnnData on top of the joint embedding
# X := data_z so neighbors/UMAP use the learned space
adata = anndata.AnnData(data_z)
adata.obs = pd.concat([adata_RNA.obs, adata_Protein.obs], axis=0)
# Also store the embedding in obsm for methods that expect a named rep
adata.obsm[f"{methods}_embed"] = data_z
# UMAP over the joint space
sc.pp.neighbors(adata, use_rep="X")
sc.tl.umap(adata)
# Visualize modality mixing and clusters in UMAP
sc.pl.umap(adata, color="Omics", palette=colormaps, show=False)
plt.savefig(os.path.join(save_path, f"{data_name}_{methods}_embed_umap.pdf"), bbox_inches="tight", dpi=300)
3.9 Save all artifacts (H5AD / CSV)¶
[9]:
adata_RNA.write(os.path.join(save_path, f"{data_name}_{methods}_RNA.h5ad"), compression="gzip")
adata_Protein.write(os.path.join(save_path, f"{data_name}_{methods}_Protein.h5ad"), compression="gzip")
adata_RNA.obs['DIRAC'].to_csv(os.path.join(save_path, f"{data_name}_{methods}.csv"))
4. Parameter Notes & Good Defaults¶
n_neighbors (graph): 6–12 are common. More neighbors → smoother neighborhoods; fewer → sharper boundaries.
RNA HVGs (``n_top_genes``): 2k–3k typical; trial 1–5k for sensitivity.
PCA components: 50–100 usually sufficient; large tissues can benefit from 100+.
GNN (``opt_GNN``):
"GAT"(attentional, expressive),"SAGE"(efficient),"GCN"(simple).n_hiddens / n_outputs: start at 128 / 64; increase for complex tissues, reduce for memory.
epochs: 200–400 typical for stable embeddings.
lamb (regularization): ↑ if overfitting; ↓ if underfitting.
scale_loss: balances reconstruction and other terms; small tweaks (0.02–0.06) can help.
n_clusters (MCLUST): use prior knowledge or grid-search; Leiden is an alternative if R/mclust is unavailable.
spot_size: adapt to the platform’s spot diameter for best visual density.
Optional subgraph mode
For very large samples or limited VRAM: initialize
integrate_app(..., subgraph=True)and passnum_partsin_get_data(...)to split the graph.Larger
num_parts→ lower peak memory but more overhead.
5. Outputs & File Saving¶
Figures:
Results/Human_Glioblastoma_DIRAC/Human_Glioblastoma_DIRAC_Combined_spatial.pdfAnnData:
Results/Human_Glioblastoma_DIRAC/Human_Glioblastoma_DIRAC_RNA.h5adResults/Human_Glioblastoma_DIRAC/Human_Glioblastoma_DIRAC_Protein.h5adLabels (CSV):
Results/Human_Glioblastoma_DIRAC/Human_Glioblastoma_DIRAC.csv(column =DIRAC)
What’s inside
obsm["DIRAC_embed"]: shared embedding for downstream neighbors/UMAP.obs["DIRAC"]: MCLUST cluster labels (strings).unsoften contains plotting metadata afterscanpycalls.
6. Reproducibility¶
Set deterministic seeds:
sd.utils.seed_torch(seed=6)(already in the script).Minor non-determinism may persist on GPU due to cuDNN; results should be qualitatively stable.
Ready to go: Copy the End-to-End Script into your notebook, adjust paths if needed, and run. The tutorial above explains each choice so users can reproduce and extend your results confidently.