dirac.main

class sodirac.main.integrate_app(save_path: str = './Results/', subgraph: bool = True, use_gpu: bool = True, **kwargs)[source]

Bases: object

High-level API for multi-omics graph integration.

This class prepares data (optionally with subgraph sampling), builds an integration model, trains it in an unsupervised manner, and returns embeddings/reconstructions.

__init__(save_path: str = './Results/', subgraph: bool = True, use_gpu: bool = True, **kwargs) None[source]

Initialize the integration app.

Parameters:
  • save_path (str, default './Results/') – Directory to write outputs (figures, checkpoints, etc.). Must be writable.

  • subgraph (bool, default True) – If True, use ClusterData/ClusterLoader for sampling. If False, use a full-batch DataLoader for small graphs.

  • use_gpu (bool, default True) – If True, selects cuda when available; otherwise CPU.

  • **kwargs (Any) – Not used by this class; passed through to the superclass initializer.

Side Effects

Sets self.device, self.subgraph, and self.save_path.
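The device rule described for use_gpu can be sketched without any dependency. This is an illustration only, not the library's code: select_device is a hypothetical helper, and cuda_available stands in for whatever availability check the class performs (for example torch.cuda.is_available()).

```python
def select_device(use_gpu: bool, cuda_available: bool) -> str:
    """Return "cuda" only when a GPU is both requested and available."""
    # Hypothetical helper; the real class stores its choice in self.device.
    return "cuda" if (use_gpu and cuda_available) else "cpu"


# A GPU was requested but none is present, so the sketch falls back to CPU.
print(select_device(use_gpu=True, cuda_available=False))  # prints: cpu
```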

_get_data(dataset_list: list, edge_index, domain_list=None, batch=None, num_parts: int = 10, num_workers: int = 1, batch_size: int = 1)[source]

Process multi-omics node features and construct a graph dataset.

Parameters:
  • dataset_list (list of (ndarray | torch.Tensor)) – List of feature matrices, one per modality/layer. Each element must be shaped (n_nodes, n_features_i) (rows = nodes, cols = features).

  • edge_index (torch.LongTensor) – Graph connectivity in COO format with shape (2, E). Will be made undirected via to_undirected.

  • domain_list (list[np.ndarray] | None, optional) – Optional per-modality integer domain labels of length n_nodes. If None, each dataset is treated as its own domain (0..n-1).

  • batch (None | pandas.Series | np.ndarray | list, optional) – Optional per-node batch labels of length n_nodes. Non-numeric labels are categorical-encoded. If None, a zero vector is used for each modality.

  • num_parts (int, default 10) – Number of partitions for ClusterData when self.subgraph=True.

  • num_workers (int, default 1) – Number of workers for the loaders.

  • batch_size (int, default 1) – Batch size for ClusterLoader when self.subgraph=True.

Returns:

A dictionary with the following keys:

  • graph_ds : dict

    Underlying graph data object/dict from GraphDataset with additional modality tensors (e.g., data_1, domain_1, batch_1…).

  • graph_dl : ClusterLoader | DataLoader

    A ClusterLoader if self.subgraph=True; otherwise a full-batch DataLoader with a single item.

  • n_samples : int

    Number of input datasets/modalities.

  • n_inputs_list : list[int]

    Feature dimensions for each dataset [n_features_0, n_features_1, ...].

  • n_domains : int

    Number of unique domains inferred from domain_list.

Return type:

dict

Raises:

ValueError – If node counts differ across dataset_list; if batch length mismatches data; or an unsupported batch type is provided.

Notes

Sets self.n_samples, self.n_inputs_list, and self.num_domains. Prints the number of unique domains detected.
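The checks and defaults documented for this method can be illustrated on plain nested lists. The sketch below is not the library's implementation: summarize_inputs is a hypothetical helper, it encodes every batch label (not only non-numeric ones) for simplicity, and the sorted category order used for the codes is an assumption.

```python
def summarize_inputs(dataset_list, domain_list=None, batch=None):
    """Hypothetical helper mirroring the documented checks; not part of sodirac."""
    if not dataset_list:
        # Extra guard added for this sketch only.
        raise ValueError("dataset_list must not be empty")

    # Every modality must describe the same nodes (rows = nodes).
    n_nodes = len(dataset_list[0])
    for i, data in enumerate(dataset_list):
        if len(data) != n_nodes:
            raise ValueError(f"dataset {i} has {len(data)} nodes, expected {n_nodes}")

    # Default domains: dataset i is treated as its own domain i (0..n-1).
    if domain_list is None:
        domain_list = [[i] * n_nodes for i in range(len(dataset_list))]

    # Default batch: a zero vector; otherwise encode labels as integer codes.
    if batch is None:
        batch_codes = [0] * n_nodes
    else:
        if len(batch) != n_nodes:
            raise ValueError(f"batch has {len(batch)} entries, expected {n_nodes}")
        categories = sorted(set(batch))  # assumption: codes follow sorted order
        lookup = {label: code for code, label in enumerate(categories)}
        batch_codes = [lookup[label] for label in batch]

    return {
        "n_samples": len(dataset_list),
        "n_inputs_list": [len(data[0]) for data in dataset_list],
        "n_domains": len({d for domains in domain_list for d in domains}),
        "batch_codes": batch_codes,
    }


# Two modalities over the same 3 nodes, with 4 and 2 features respectively.
rna = [[0.0] * 4 for _ in range(3)]
protein = [[0.0] * 2 for _ in range(3)]
summary = summarize_inputs([rna, protein], batch=["b", "a", "b"])
```

Here summary["n_inputs_list"] corresponds to the documented n_inputs_list key, and the default domain assignment yields one domain per modality.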

_get_model(samples, n_hiddens: int = 128, n_outputs: int = 64, opt_GNN='GAT', dropout_rate=0.1, use_skip_connections=True, use_attention=True, n_attention_heads=4, use_layer_scale=False, layer_scale_init=0.01, use_stochastic_depth=False, stochastic_depth_rate=0.1, combine_method='concat')[source]

Build the integration model with the provided hyperparameters.

Parameters:
  • samples (dict) – Output from _get_data. Must contain n_inputs_list and n_domains.

  • n_hiddens (int, default 128) – Hidden dimension for GNN layers.

  • n_outputs (int, default 64) – Output/embedding dimension per node.

  • opt_GNN (str, default 'GAT') – GNN backbone option consumed by integrate_model.

  • dropout_rate (float, default 0.1) – Dropout rate inside the model.

  • use_skip_connections (bool, default True) – Whether to enable residual/skip connections (if supported).

  • use_attention (bool, default True) – Whether to use attention (if supported by the chosen backbone).

  • n_attention_heads (int, default 4) – Number of attention heads (if applicable).

  • use_layer_scale (bool, default False) – If True, enable layer scale with initialization layer_scale_init.

  • layer_scale_init (float, default 1e-2) – Initialization value for layer scaling.

  • use_stochastic_depth (bool, default False) – Enable stochastic depth.

  • stochastic_depth_rate (float, default 0.1) – Drop probability for stochastic depth.

  • combine_method ({'concat','sum','attention'}, default 'concat') – How to combine multi-modal features inside the model.

Returns:

models – The model instance returned by integrate_model(...), ready for training.

Return type:

Any

_train_dirac_integrate(samples, models, epochs: int = 500, optimizer_name: str = 'adam', lr: float = 0.001, tau: float = 0.9, wd: float = 0.05, scheduler: bool = True, lamb: float = 0.0005, scale_loss: float = 0.025)[source]

Train the integration model and evaluate embeddings/reconstructions.

Parameters:
  • samples (dict) – Output from _get_data with keys like graph_ds, graph_dl, n_inputs_list, n_domains.

  • models (Any) – Model returned by _get_model / integrate_model.

  • epochs (int, default 500) – Training epochs.

  • optimizer_name (str, default 'adam') – Optimizer identifier consumed by the trainer.

  • lr (float, default 1e-3) – Learning rate.

  • tau (float, default 0.9) – Momentum/EMA or contrastive temperature parameter (per trainer definition).

  • wd (float, default 5e-2) – Weight decay.

  • scheduler (bool, default True) – Whether to use a learning-rate scheduler.

  • lamb (float, default 5e-4) – Loss coefficient used by the trainer.

  • scale_loss (float, default 0.025) – Additional loss scaling used by the trainer.

Returns:

  • data_z (torch.Tensor) – Node embeddings; typically shaped (n_nodes, n_outputs).

  • combine_recon (Any) – Reconstruction(s) as returned by train_integrate.evaluate; may be a tensor or a structure of tensors.
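The three methods above are meant to be chained: data preparation, model construction, then training. A minimal sketch of that chain follows, assuming the two documented return values come back as a tuple. run_integration is a hypothetical wrapper, and app is any object exposing the documented methods.

```python
def run_integration(app, dataset_list, edge_index, epochs=500):
    """Chain the three documented steps on an integrate_app-like object."""
    # Step 1: build the graph dataset and loader from the feature matrices.
    samples = app._get_data(dataset_list=dataset_list, edge_index=edge_index)
    # Step 2: build the model; the values shown are the documented defaults.
    models = app._get_model(samples, n_hiddens=128, n_outputs=64, opt_GNN="GAT")
    # Step 3: train, then collect embeddings and reconstructions.
    data_z, combine_recon = app._train_dirac_integrate(
        samples, models, epochs=epochs, lr=1e-3
    )
    return data_z, combine_recon
```

With the package installed, app would be an instance such as integrate_app(save_path='./Results/', subgraph=True, use_gpu=True) from sodirac.main.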

class sodirac.main.annotate_app(save_path: str = './Results/', subgraph: bool = True, use_gpu: bool = True, **kwargs)[source]

Bases: integrate_app

High-level API for annotation / domain adaptation on graphs.

Prepares labeled source (and unlabeled target) graphs, builds an annotation model, supports semi-supervised training, optional novel-class discovery, and evaluation on source/target/test.

_get_data(source_data, source_label, source_edge_index, target_data, target_edge_index, source_domain=None, target_domain=None, test_data=None, test_edge_index=None, weighted_classes=False, split_list=None, num_workers: int = 1, batch_size: int = 1, num_parts_source: int = 1, num_parts_target: int = 1)[source]

Process labeled source and (optional) unlabeled target into loaders.

Parameters:
  • source_data (ndarray | torch.Tensor) – Source node features with shape (n_source_nodes, n_features).

  • source_label (array-like) – Source labels; numeric or categorical. Non-numeric labels are encoded to 0-based integer codes. A mapping is stored in self.pairs.

  • source_edge_index (torch.LongTensor) – COO connectivity for the source graph, shape (2, E_source); made undirected.

  • target_data (ndarray | torch.Tensor | None) – Optional target node features with shape (n_target_nodes, n_features).

  • target_edge_index (torch.LongTensor or None) – Optional COO connectivity for target graph, shape (2, E_target); made undirected if provided.

  • source_domain (array-like[int] or None, default None) – Optional per-node domain labels for source. Defaults to zeros.

  • target_domain (array-like[int] or None, default None) – Optional per-node domain labels for target. Defaults to ones when target_data is provided.

  • test_data (ndarray | torch.Tensor | None, default None) – Optional test node features with shape (n_test_nodes, n_features).

  • test_edge_index (torch.LongTensor or None, default None) – Required if test_data is provided.

  • weighted_classes (bool, default False) – If True, compute inverse-frequency class weights for source labels.

  • split_list (list[tuple[int,int]] or None, default None) – Optional feature splits for multi-modal inputs, e.g., [(0,1000),(1000,1500)].

  • num_workers (int, default 1) – DataLoader workers for source/target loaders.

  • batch_size (int, default 1) – Batch size for ClusterLoader.

  • num_parts_source (int, default 1) – ClusterData partitions for source graph.

  • num_parts_target (int, default 1) – ClusterData partitions for target graph.

Returns:

A dictionary containing:

  • source_graph_ds : dict

    Graph data object/dict for source (from GraphDataset_unpaired).

  • source_graph_dl : ClusterLoader

    Loader over source clusters.

  • target_graph_ds : dict | None

    Graph data for target, or None if no target.

  • target_graph_dl : ClusterLoader | None

    Loader for target, or None if no target.

  • test_graph_ds : torch_geometric.data.Data | None

    Test graph object if both test_data and test_edge_index are provided.

  • class_weight : torch.FloatTensor | None

    Class weights when weighted_classes=True.

  • n_labels : int

    Number of unique labels in source.

  • n_inputs : int

    Feature dimension.

  • n_domains : int

    Number of domains inferred from source_domain/target_domain.

  • split_list : list[tuple[int,int]] | None

    Echo of the provided split_list.

Return type:

dict

Notes

If source_label is categorical, self.pairs stores a mapping {code: original_label}; otherwise self.pairs is None. Sets self.n_labels, self.n_inputs, and self.n_domains. Prints the number of unique domains.
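The label handling described here, together with weighted_classes=True, can be sketched in plain Python. Both helpers are hypothetical. The sorted category order and the normalization n_total / (n_classes * count) are assumptions, one common choice among several, and may differ from what the library computes.

```python
from collections import Counter


def encode_labels(labels):
    """Return (codes, pairs); pairs maps code -> original label, or None."""
    # Integer labels are passed through and no mapping is kept.
    if all(isinstance(label, int) for label in labels):
        return list(labels), None
    categories = sorted(set(labels))  # assumption: sorted category order
    pairs = dict(enumerate(categories))  # {code: original_label}
    lookup = {label: code for code, label in pairs.items()}
    return [lookup[label] for label in labels], pairs


def inverse_frequency_weights(codes):
    """One weight per class, larger for rarer classes (assumed normalization)."""
    counts = Counter(codes)
    n_total, n_classes = len(codes), len(counts)
    return [n_total / (n_classes * counts[code]) for code in sorted(counts)]


codes, pairs = encode_labels(["T cell", "B cell", "T cell", "T cell"])
weights = inverse_frequency_weights(codes)
```

In this example the rarer "B cell" class receives the larger weight, which is the purpose of inverse-frequency weighting.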

_get_model(samples, n_hiddens: int = 128, n_outputs: int = 64, opt_GNN: str = 'SAGE', s: int = 32, m: float = 0.1, easy_margin: bool = False, dropout_rate: float = 0.1, use_skip_connections: bool = False, use_attention: bool = True, n_attention_heads: int = 2, use_layer_scale: bool = False, layer_scale_init: float = 0.01, use_stochastic_depth: bool = False, stochastic_depth_rate: float = 0.1, combine_method: str = 'concat')[source]

Build the annotation model (classifier/domain-adaptation).

Parameters:
  • samples (dict) – Output from annotate_app._get_data; must include n_domains, n_labels, and either n_inputs (int) or split_list for multi-modal cases.

  • n_hiddens (int, default 128) – Hidden dimension.

  • n_outputs (int, default 64) – Embedding dimension before the classification head.

  • opt_GNN (str, default 'SAGE') – GNN backbone identifier consumed by annotate_model.

  • s (int, default 32) – Scale parameter for margin-based head (if applicable).

  • m (float, default 0.10) – Margin parameter for margin-based head.

  • easy_margin (bool, default False) – Use easy margin variant if supported.

  • dropout_rate (float, default 0.1) – Dropout rate.

  • use_skip_connections (bool, default False) – Enable skip/residual connections (if supported).

  • use_attention (bool, default True) – Enable attention (if supported).

  • n_attention_heads (int, default 2) – Number of attention heads when applicable.

  • use_layer_scale (bool, default False) – Enable layer scaling.

  • layer_scale_init (float, default 1e-2) – Initial value for layer scale.

  • use_stochastic_depth (bool, default False) – Enable stochastic depth.

  • stochastic_depth_rate (float, default 0.1) – Drop probability for stochastic depth.

  • combine_method ({'concat','sum','attention'}, default 'concat') – Feature fusion strategy for multi-modal inputs.

Returns:

models – Model instance returned by annotate_model(...).

Return type:

Any
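When split_list is given, the per-modality input sizes follow from the column ranges. The sketch below assumes half-open (start, end) ranges, which is consistent with the contiguous example [(0,1000),(1000,1500)]. input_dims_from_splits and split_row are hypothetical helpers, and how annotate_model actually consumes the splits is not shown here.

```python
def input_dims_from_splits(split_list, n_inputs):
    """Per-modality feature counts from half-open (start, end) column ranges."""
    if split_list is None:
        return [n_inputs]  # single-modality case: one block of n_inputs columns
    dims = []
    for start, end in split_list:
        if not 0 <= start < end <= n_inputs:
            raise ValueError(f"invalid split ({start}, {end}) for {n_inputs} features")
        dims.append(end - start)
    return dims


def split_row(row, split_list):
    """Slice one node's feature vector into its per-modality pieces."""
    return [row[start:end] for start, end in split_list]


dims = input_dims_from_splits([(0, 1000), (1000, 1500)], n_inputs=1500)
pieces = split_row(list(range(6)), [(0, 4), (4, 6)])
```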

_train_dirac_annotate(samples, models, n_epochs: int = 200, optimizer_name: str = 'adam', lr: float = 0.001, wd: float = 0.005, scheduler: bool = True, filter_low_confidence: bool = True, confidence_threshold: float = 0.5)[source]

Train the annotation model (semi-supervised/domain adaptation) and evaluate.

Parameters:
  • samples (dict) – Output from _get_data. Expected keys include source_graph_ds, source_graph_dl, optional target_graph_dl and test_graph_ds, and possibly class_weight.

  • models (Any) – Model returned by _get_model / annotate_model.

  • n_epochs (int, default 200) – Number of training epochs.

  • optimizer_name (str, default 'adam') – Optimizer identifier.

  • lr (float, default 1e-3) – Learning rate.

  • wd (float, default 5e-3) – Weight decay.

  • scheduler (bool, default True) – Whether to enable learning-rate scheduling.

  • filter_low_confidence (bool, default True) – If True, mark predictions with confidence < confidence_threshold as "unassigned" in the returned target_pred_filtered / test_pred_filtered.

  • confidence_threshold (float, default 0.5) – Confidence threshold in [0, 1].

Returns:

A dictionary with the following keys (some values may be None if target/test data are absent): source_feat, target_feat, target_output, target_prob, target_pred, target_pred_filtered, target_confs, target_mean_uncert, test_feat, test_output, test_prob, test_pred, test_pred_filtered, test_confs, test_mean_uncert, pairs, pairs_filter, and low_confidence_threshold.

Return type:

dict
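The effect of filter_low_confidence can be illustrated as follows. This is a sketch under stated assumptions, not the trainer's code: confidence is taken to be the maximum softmax probability, and softmax and filter_predictions are hypothetical helpers operating on plain lists of logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_predictions(logit_rows, pairs, confidence_threshold=0.5):
    """Return (preds, confs, filtered); low-confidence rows become "unassigned"."""
    preds, confs, filtered = [], [], []
    for logits in logit_rows:
        probs = softmax(logits)
        code = max(range(len(probs)), key=probs.__getitem__)
        conf = probs[code]  # assumption: confidence = top class probability
        label = pairs[code]
        preds.append(label)
        confs.append(conf)
        filtered.append(label if conf >= confidence_threshold else "unassigned")
    return preds, confs, filtered


pairs = {0: "A", 1: "B", 2: "C"}
preds, confs, filtered = filter_predictions(
    [[4.0, 0.0, 0.0], [0.1, 0.0, 0.05]], pairs, confidence_threshold=0.5
)
```

The first row is a confident "A" and is kept; the second row is nearly uniform, so its prediction is replaced by "unassigned" in the filtered output while the unfiltered prediction is preserved.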

_train_dirac_novel(samples, minemodel, num_novel_class: int = 3, pre_epochs: int = 100, n_epochs: int = 200, num_parts: int = 30, resolution: float = 1, s: int = 64, m: float = 0.1, weights: dict = {'alpha1': 1, 'alpha2': 1, 'alpha3': 1, 'alpha4': 1, 'alpha5': 1, 'alpha6': 1, 'alpha7': 1, 'alpha8': 1})[source]

Discover novel target classes and retrain with expanded label space.

Parameters:
  • samples (dict) – Output from _get_data; must include keys source_graph_ds, source_graph_dl, target_graph_ds, target_graph_dl, class_weight (optional), n_labels, and feature sizes n_inputs.

  • minemodel (Any) – Initial annotation model (from _get_model).

  • num_novel_class (int, default 3) – Number of novel classes to discover in target.

  • pre_epochs (int, default 100) – Supervised pretraining epochs on source.

  • n_epochs (int, default 200) – Training epochs for the novel-phase.

  • num_parts (int, default 30) – Number of partitions for the (new) target ClusterData.

  • resolution (float, default 1) – Louvain resolution for clustering.

  • s (int, default 64) – Scale parameter for the (re)built model head.

  • m (float, default 0.1) – Margin parameter for the (re)built model head.

  • weights (dict, default {"alpha1":1, ..., "alpha8":1}) – Loss weights dictionary consumed by _train_novel.

Returns:

A dictionary with the following keys: source_feat, target_feat, target_output, target_prob, target_pred, target_confs, target_mean_uncert, test_feat, and test_pred. The test_* entries may be None if a test set is not provided.

Return type:

dict
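The default for weights in the signature is a dict literal, which Python creates once and shares across calls. Passing a freshly built dictionary avoids surprises should that object ever be modified. The sketch below shows one way to do this. make_loss_weights and run_novel_discovery are hypothetical helpers, and the meaning of each alpha term is defined by the trainer, not here.

```python
def make_loss_weights(**overrides):
    """Build a fresh alpha1..alpha8 dict of ones, then apply overrides."""
    weights = {f"alpha{i}": 1 for i in range(1, 9)}
    unknown = set(overrides) - set(weights)
    if unknown:
        raise KeyError(f"unknown loss weights: {sorted(unknown)}")
    weights.update(overrides)
    return weights


def run_novel_discovery(app, samples, minemodel, num_novel_class=3, **overrides):
    """Call the documented method with an explicitly constructed weights dict."""
    return app._train_dirac_novel(
        samples,
        minemodel,
        num_novel_class=num_novel_class,
        weights=make_loss_weights(**overrides),
    )
```

For example, run_novel_discovery(app, samples, model, alpha3=0.5) would change only the third loss weight and leave the others at 1.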