.. _quickstart:

QuickStart
==========

``DIRAC`` provides **horizontal** (label transfer across datasets/platforms) and **vertical** (multi-omics integration) analysis.  
This QuickStart focuses on two **simulated datasets** and walks you through the end-to-end workflow:

- **NSF** — spatial multi-omics (RNA, ADT, optional ATAC) with joint embedding, clustering, ARI evaluation, and optional subgraph training.  
- **scMultiSim** — horizontal annotation (RNA→RNA, RNA+ATAC→RNA+ATAC), confidence-based *novel type discovery*, and UMAP mixing checks.

What you’ll learn
-----------------
- Preprocess single-cell/spatial data (``normalize_total → log1p → scale``; optional PCA/LSI).  
- Build spatial graphs (k-NN; optional radius; multi-batch vs single-sample).  
- Train DIRAC (``annotate_app`` / ``integrate_app``) and write back embeddings/predictions.  
- Evaluate results (Accuracy/Precision/Recall/F1, ARI) and visualize (spatial, UMAP).  
- Use **confidence filtering** to mark low-confidence predictions as ``"unassigned"`` and catch **missing/novel** cell types.

Prerequisites
-------------
- Python ≥ 3.9, and: ``scanpy``, ``anndata``, ``numpy``, ``pandas``, ``matplotlib`` (and ``torch`` inside DIRAC).  
- Local clone of the DIRAC codebase (added to ``sys.path``).  
- (Optional) R + ``mclust`` for MCLUST clustering.

Data
----
- **NSF** (simulated spatial multi-omics): RNA/ADT (+ ATAC) ``.h5ad`` files.  
- **scMultiSim** (simulated multi-omics, e.g. *mask = 0.3* means ~30% random zeros in RNA/ATAC): ``source_*`` and ``target_*`` ``.h5ad`` files.

.. note::
   Place datasets under ``DIRAC-main/data/`` in folders referenced by the notebooks (see each notebook’s first cell for exact paths).

At a glance
-----------
1. Load reference/target AnnData.  
2. Preprocess per modality.  
3. Build spatial graphs.  
4. Pack data with ``_get_data(...)`` (optionally ``num_parts_*`` for subgraphs).  
5. Build model with ``_get_model(...)`` (e.g., ``opt_GNN="SAGE"`` or ``"GAT"``).  
6. Train (``_train_dirac_integrate`` or ``_train_dirac_annotate``).  
7. Evaluate + visualize; save ``.h5ad``/figures/metrics.

----

The following notebooks are included:

.. nbgallery::
   notebooks/run-NSF.ipynb
   notebooks/run-scMultiSim.ipynb

Notebook details
----------------

**notebooks/run-NSF.ipynb — Spatial multi-omics (NSF)**
   - Load **RNA** and **ADT** (optionally **ATAC**).  
   - Preprocess (HVGs/PCA for RNA; standardization for ADT; LSI optional for ATAC).  
   - Build spatial **k-NN** graph (tune ``n_neighbors``; radius graph optional).  
   - Two-omics integration (RNA+ADT), then extend to **three omics** (add ATAC).  
   - Optional **subgraph** training (``subgraph=True``; control ``num_parts``).  
   - Cluster with **MCLUST** or Leiden; compute **ARI** vs ground truth.  
   - Plot **spatial** maps and **UMAP**; save embeddings and outputs.

**notebooks/run-scMultiSim.ipynb — Horizontal annotation (scMultiSim)**
   - Single-modality (**RNA→RNA**) and dual-modality (**RNA+ATAC→RNA+ATAC**).  
   - Preprocess each modality; **concatenate** features and specify ``split_list``.  
   - Build **multi-batch** graph for source (if needed) + **single-sample** graph for target.  
   - Run ``annotate_app`` to **transfer labels**; write back embeddings/predictions.  
   - **Confidence filtering** (e.g., ``confidence_threshold=0.9``) to mark low-confidence cells as ``"unassigned"`` (novel/missing types).  
   - Optional **mixing check**: UMAP colored by *Omics* and cluster labels.  
   - Report metrics (Accuracy/Precision/Recall/F1) and **Unassigned Rate**; save results (NPZ/JSON, figures, ``.h5ad``).

Tips & troubleshooting
----------------------
- Ensure required fields exist: ``obs["cell.type"]`` (labels), ``obsm["spatial"]`` (coords), and ``obs["batch"]`` for multi-batch graphs.  
- Verify ``split_list`` aligns with feature concatenation (``(0, dim_RNA)``, ``(dim_RNA, dim_RNA+dim_ATAC)``).  
- Start with ``n_neighbors=8–12``; adjust for platform resolution/spot density.  
- Use subgraphs for large tissues (``num_parts_* = max(1, n//200)``).  
- Tune model defaults: ``n_hiddens=128``, ``n_outputs=64``, ``epochs=200–400``, balance via ``lamb``/``scale_loss``.  
- For portability, save arrays via **NPZ** and label maps via **JSON**.