QuickStart

DIRAC provides horizontal (label transfer across datasets/platforms) and vertical (multi-omics integration) analysis. This QuickStart focuses on two simulated datasets and walks you through the end-to-end workflow:

NSF: vertical integration of spatial multi-omics (RNA, ADT, optional ATAC) with joint embedding, clustering, adjusted Rand index (ARI) evaluation, and optional subgraph training.
scMultiSim: horizontal annotation (RNA→RNA, RNA+ATAC→RNA+ATAC), confidence-based discovery of novel cell types, and UMAP mixing checks.

What you’ll learn

Preprocess single-cell/spatial data (normalize_total → log1p → scale; optional PCA/LSI).
Build spatial graphs (k-NN; optional radius; multi-batch vs single-sample).
Train DIRAC (annotate_app / integrate_app) and write back embeddings/predictions.
Evaluate results (Accuracy/Precision/Recall/F1, ARI) and visualize (spatial, UMAP).
Use confidence filtering to mark low-confidence predictions as "unassigned" and catch missing/novel cell types.
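The normalize_total → log1p → scale chain listed above can be illustrated without any single-cell library. The sketch below is a NumPy-only approximation of what scanpy's `sc.pp.normalize_total`, `sc.pp.log1p`, and `sc.pp.scale` do in sequence; the function name `preprocess_counts`, the `target_sum=1e4` choice, and the clipping value are assumptions made for illustration, not settings taken from the DIRAC notebooks.

```python
import numpy as np

def preprocess_counts(counts, target_sum=1e4, max_value=10.0):
    """Normalize each cell, log-transform, then z-score each feature.

    Hypothetical helper: cells are rows, features are columns.
    """
    counts = np.asarray(counts, dtype=float)

    # Library-size normalization: every cell sums to target_sum afterwards.
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # leave empty cells at zero instead of dividing by 0
    x = counts / totals * target_sum

    # log(1 + x) compresses the dynamic range of the normalized counts.
    x = np.log1p(x)

    # Per-feature standardization; constant features end up as all zeros.
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0
    x = (x - mean) / std

    # Clip extreme z-scores so single outliers do not dominate.
    return np.clip(x, -max_value, max_value)

# Toy count matrix: 50 cells x 20 features.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(2.0, size=(50, 20))
scaled = preprocess_counts(toy_counts)
```

In the notebooks themselves the scanpy calls would normally be used directly on the AnnData object; this version only makes the arithmetic of each step visible.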

Prerequisites

Python ≥ 3.9 with scanpy, anndata, numpy, pandas, and matplotlib (torch is also needed, since DIRAC uses it internally).
A local clone of the DIRAC codebase, added to sys.path so the notebooks can import it.
(Optional) R + mclust for MCLUST clustering.

Data

NSF (simulated spatial multi-omics): RNA/ADT (+ ATAC) .h5ad files.
scMultiSim (simulated multi-omics): source_* and target_* .h5ad files. The mask setting controls sparsity; for example, mask = 0.3 means roughly 30% of the RNA/ATAC entries are randomly set to zero.

Note

Place datasets under DIRAC-main/data/ in folders referenced by the notebooks (see each notebook’s first cell for exact paths).

At a glance

Load reference/target AnnData.
Preprocess per modality.
Build spatial graphs.
Pack data with _get_data(...) (optionally num_parts_* for subgraphs).
Build model with _get_model(...) (e.g., opt_GNN="SAGE" or "GAT").
Train (_train_dirac_integrate or _train_dirac_annotate).
Evaluate + visualize; save .h5ad/figures/metrics.
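To make the "build spatial graphs" step above concrete, here is a minimal sketch of a k-nearest-neighbor graph built from 2D coordinates such as those stored in obsm["spatial"]. It is a generic brute-force construction, not DIRAC's own graph builder; the function name `knn_edges` and the (2, n_edges) edge layout are assumptions chosen for illustration.

```python
import numpy as np

def knn_edges(coords, n_neighbors=8):
    """Return a (2, n_cells * n_neighbors) array of directed edges i -> j.

    Hypothetical helper: brute-force distances, so memory grows as O(n^2).
    """
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    if not 0 < n_neighbors < n:
        raise ValueError("n_neighbors must be between 1 and n_cells - 1")

    # Pairwise squared Euclidean distances between all spots.
    diff = coords[:, None, :] - coords[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)

    # A spot must not be its own neighbor.
    np.fill_diagonal(dist2, np.inf)

    # Indices of the n_neighbors closest spots for every row.
    nbrs = np.argsort(dist2, axis=1)[:, :n_neighbors]
    src = np.repeat(np.arange(n), n_neighbors)
    return np.stack([src, nbrs.ravel()])

# Toy example: a regular 5 x 5 grid of spots.
xs, ys = np.meshgrid(np.arange(5), np.arange(5))
grid = np.column_stack([xs.ravel(), ys.ravel()])
edges = knn_edges(grid, n_neighbors=4)
```

For real tissues with many thousands of spots, a tree-based neighbor search is usually preferable to the dense distance matrix used here.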

The following notebooks are included:

DIRAC Spatial Multi-Omics — Vertical Integration

DIRAC Spatial Multi-Omics — Horizontal Integration

Notebook details

notebooks/run-NSF.ipynb — Spatial multi-omics (NSF)

Load RNA and ADT (optionally ATAC).
Preprocess (HVGs/PCA for RNA; standardization for ADT; optional LSI for ATAC).
Build spatial k-NN graph (tune n_neighbors; radius graph optional).
Two-omics integration (RNA+ADT), then extend to three omics (add ATAC).
Optional subgraph training (subgraph=True; control num_parts).
Cluster with MCLUST or Leiden; compute ARI vs ground truth.
Plot spatial maps and UMAP; save embeddings and outputs.
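The ARI evaluation mentioned in the list above compares a clustering against ground-truth labels while correcting for chance agreement. As a sketch of the computation, the following NumPy function derives the adjusted Rand index from the contingency table of the two labelings; the name `adjusted_rand_index` is illustrative, and in practice scikit-learn's `adjusted_rand_score` gives the same quantity.

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    if labels_true.size < 2:
        return 1.0  # fewer than two cells: no pairs to disagree on

    # Map arbitrary labels to 0..k-1 and count co-occurrences.
    _, t = np.unique(labels_true, return_inverse=True)
    _, p = np.unique(labels_pred, return_inverse=True)
    table = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(table, (t, p), 1)

    def pairs(x):
        # Number of unordered pairs that can be drawn from x items.
        return x * (x - 1) / 2.0

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    total = pairs(labels_true.size)

    expected = sum_rows * sum_cols / total
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings have one cluster
    return (sum_cells - expected) / (max_index - expected)

# Same partition with swapped label names vs. a maximally crossed partition.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which cells are grouped together, the cluster IDs returned by MCLUST or Leiden do not need to match the ground-truth label names.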

notebooks/run-scMultiSim.ipynb — Horizontal annotation (scMultiSim)

Single-modality (RNA→RNA) and dual-modality (RNA+ATAC→RNA+ATAC).
Preprocess each modality; concatenate features and specify split_list.
Build multi-batch graph for source (if needed) + single-sample graph for target.
Run annotate_app to transfer labels; write back embeddings/predictions.
Confidence filtering (e.g., confidence_threshold=0.9) marks low-confidence cells as "unassigned", which flags candidate novel or missing cell types.
Optional mixing check: UMAP colored by Omics and cluster labels.
Report metrics (Accuracy/Precision/Recall/F1) and Unassigned Rate; save results (NPZ/JSON, figures, .h5ad).
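The confidence filtering and reporting steps above can be sketched as follows. The code assumes the annotation step yields a per-cell matrix of class probabilities; that input format, the helper names `filter_by_confidence` and `macro_metrics`, and the decision to count "unassigned" cells as errors in the metrics are all assumptions for illustration rather than DIRAC's documented behavior.

```python
import numpy as np

def filter_by_confidence(probs, class_names, confidence_threshold=0.9,
                         unassigned="unassigned"):
    """Pick the most probable class per cell; mask low-confidence calls."""
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)
    # Fancy indexing returns a copy, so class_names is not modified.
    labels = np.asarray(class_names, dtype=object)[probs.argmax(axis=1)]
    labels[confidence < confidence_threshold] = unassigned
    return labels, confidence

def macro_metrics(y_true, y_pred, classes, unassigned="unassigned"):
    """Accuracy, macro precision/recall/F1, and the unassigned rate."""
    y_true = np.asarray(y_true, dtype=object)
    y_pred = np.asarray(y_pred, dtype=object)
    precision, recall, f1 = [], [], []
    for c in classes:
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "precision": float(np.mean(precision)),
        "recall": float(np.mean(recall)),
        "f1": float(np.mean(f1)),
        "unassigned_rate": float(np.mean(y_pred == unassigned)),
    }

# Toy example with two classes and four cells.
classes = ["T cell", "B cell"]
probs = [[0.95, 0.05], [0.55, 0.45], [0.08, 0.92], [0.30, 0.70]]
truth = ["T cell", "T cell", "B cell", "B cell"]
pred, conf = filter_by_confidence(probs, classes, confidence_threshold=0.9)
report = macro_metrics(truth, pred, classes)
```

Here two of the four cells fall below the threshold, so recall drops while precision on the assigned cells stays high. A high unassigned rate concentrated in one cluster is the pattern that would suggest a cell type missing from the reference.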

Tips & troubleshooting

Ensure required fields exist: obs["cell.type"] (labels), obsm["spatial"] (coords), and obs["batch"] for multi-batch graphs.
Verify that split_list matches the order in which features were concatenated: (0, dim_RNA) for RNA, followed by (dim_RNA, dim_RNA+dim_ATAC) for ATAC.
Start with n_neighbors=8–12; adjust for platform resolution/spot density.
Use subgraphs for large tissues (num_parts_* = max(1, n//200), where n is the number of cells or spots).
Start from the model defaults (n_hiddens=128, n_outputs=64, epochs=200–400) and tune from there; adjust the balance via lamb/scale_loss.
For portability, save arrays via NPZ and label maps via JSON.
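Two of the tips above, the split_list layout and the subgraph-count heuristic, are easy to get wrong by one index, so a small sketch may help. The helper names `build_split_list` and `suggest_num_parts` are hypothetical, and returning a list of (start, end) tuples is an assumption about the container type; only the ranges themselves and the max(1, n//200) rule come from the tips.

```python
import numpy as np

def build_split_list(feature_dims):
    """Turn per-modality feature counts into (start, end) column ranges."""
    bounds = np.concatenate([[0], np.cumsum(feature_dims)])
    return [(int(a), int(b)) for a, b in zip(bounds[:-1], bounds[1:])]

def suggest_num_parts(n_cells, cells_per_part=200):
    """Subgraph-count heuristic from the tips: max(1, n // 200)."""
    return max(1, n_cells // cells_per_part)

# Toy features: 300 cells with 50 RNA and 30 ATAC dimensions.
rna = np.zeros((300, 50))
atac = np.ones((300, 30))
combined = np.hstack([rna, atac])  # concatenation order: RNA first, then ATAC

split_list = build_split_list([rna.shape[1], atac.shape[1]])

# Each range should slice its own modality back out of the combined matrix.
rna_back = combined[:, split_list[0][0]:split_list[0][1]]
atac_back = combined[:, split_list[1][0]:split_list[1][1]]

num_parts = suggest_num_parts(combined.shape[0])
```

Slicing the combined matrix with each range and comparing against the original modality, as done here, is a quick sanity check to run before training.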