dirac.main

class sodirac.main.integrate_app(save_path: str = './Results/', subgraph: bool = True, use_gpu: bool = True, **kwargs)

Bases: object

High-level API for multi-omics graph integration.

This class prepares data (optionally with subgraph sampling), builds an integration model, trains it in an unsupervised manner, and returns embeddings/reconstructions.

__init__(save_path: str = './Results/', subgraph: bool = True, use_gpu: bool = True, **kwargs) → None

Initialize the integration app.

Parameters:

save_path (str, default './Results/') – Directory to write outputs (figures, checkpoints, etc.). Must be writable.
subgraph (bool, default True) – If True, use ClusterData/ClusterLoader for sampling. If False, use a full-batch DataLoader for small graphs.
use_gpu (bool, default True) – If True, selects cuda when available; otherwise CPU.
**kwargs (Any) – Ignored; forwarded to super.

Side Effects

Sets self.device, self.subgraph, and self.save_path.
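
Because save_path must be writable, it can help to check the directory before constructing the app. The following is a minimal sketch using only the standard library; the helper name ensure_writable_dir and the demo directory name are hypothetical and not part of sodirac.

```python
import os
import tempfile


def ensure_writable_dir(path: str) -> str:
    """Create `path` if needed and confirm that files can be written there."""
    os.makedirs(path, exist_ok=True)
    # Creating a throwaway file fails with OSError if the directory is not writable.
    with tempfile.NamedTemporaryFile(dir=path):
        pass
    return os.path.abspath(path)


# Hypothetical output location used only for this demonstration.
save_path = ensure_writable_dir(os.path.join(tempfile.gettempdir(), "dirac_results_demo"))
```

The resulting path could then be passed as the save_path argument of integrate_app.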

_get_data(dataset_list: list, edge_index, domain_list=None, batch=None, num_parts: int = 10, num_workers: int = 1, batch_size: int = 1)

Process multi-omics node features and construct a graph dataset.

Parameters:

dataset_list (list of (ndarray | torch.Tensor)) – List of feature matrices, one per modality/layer. Each element must be shaped (n_nodes, n_features_i) (rows = nodes, cols = features).
edge_index (torch.LongTensor) – Graph connectivity in COO format with shape (2, E). Will be made undirected via to_undirected.
domain_list (list[np.ndarray] | None, optional) – Optional per-modality integer domain labels of length n_nodes. If None, each dataset is treated as its own domain (0..n-1).
batch (None | pandas.Series | np.ndarray | list, optional) – Optional per-node batch labels of length n_nodes. Non-numeric labels are categorical-encoded. If None, a zero vector is used for each modality.
num_parts (int, default 10) – Number of partitions for ClusterData when self.subgraph=True.
num_workers (int, default 1) – Number of workers for the loaders.
batch_size (int, default 1) – Batch size for ClusterLoader when self.subgraph=True.
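
The defaults described for domain_list and batch can be illustrated with NumPy. This is a sketch of the documented behavior, not the library's code: the helper names default_domains and encode_batch are hypothetical, and the use of np.unique for categorical encoding is an assumption (the actual encoder may assign codes differently).

```python
import numpy as np


def default_domains(dataset_list):
    """One constant label vector per dataset: 0 for the first, 1 for the second, ..."""
    return [np.full(x.shape[0], i, dtype=np.int64) for i, x in enumerate(dataset_list)]


def encode_batch(batch, n_nodes):
    """Return integer batch codes of length n_nodes; zeros when batch is None."""
    if batch is None:
        return np.zeros(n_nodes, dtype=np.int64)
    values = np.asarray(batch)
    if np.issubdtype(values.dtype, np.number):
        return values.astype(np.int64)
    # Non-numeric labels: map each distinct label to a 0-based code (assumed scheme).
    _, codes = np.unique(values, return_inverse=True)
    return codes.astype(np.int64)


# Two toy modalities that share the same 6 nodes but differ in feature count.
rna = np.random.rand(6, 20)
atac = np.random.rand(6, 35)

domains = default_domains([rna, atac])
batch_codes = encode_batch(["b1", "b1", "b2", "b2", "b3", "b3"], n_nodes=6)
```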

Returns:

A dictionary with the following keys:

graph_ds (dict) – Underlying graph data object/dict from GraphDataset with additional modality tensors (e.g., data_1, domain_1, batch_1…).
graph_dl (ClusterLoader | DataLoader) – A ClusterLoader if self.subgraph=True; otherwise a full-batch DataLoader with a single item.
n_samples (int) – Number of input datasets/modalities.
n_inputs_list (list[int]) – Feature dimensions for each dataset, [n_features_0, n_features_1, ...].
n_domains (int) – Number of unique domains inferred from domain_list.

Return type:

dict

Raises:

ValueError – If node counts differ across dataset_list; if batch length mismatches data; or an unsupported batch type is provided.
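
The checks behind these errors can be reproduced before calling _get_data, which makes shape problems easier to locate. The sketch below is an assumption about what such validation might look like; check_inputs is a hypothetical helper and the error messages are illustrative.

```python
import numpy as np


def check_inputs(dataset_list, batch=None):
    """Return the shared node count, or raise ValueError on inconsistent inputs."""
    node_counts = {np.asarray(x).shape[0] for x in dataset_list}
    if len(node_counts) != 1:
        raise ValueError(f"All datasets must share one node count, got {sorted(node_counts)}")
    (n_nodes,) = node_counts
    if batch is not None and len(batch) != n_nodes:
        raise ValueError(f"batch has length {len(batch)}, expected {n_nodes}")
    return n_nodes


# Consistent inputs pass and report the node count.
n_nodes = check_inputs([np.zeros((5, 3)), np.zeros((5, 7))], batch=list("aabbc"))

# Mismatched node counts (5 vs 4) are rejected.
try:
    check_inputs([np.zeros((5, 3)), np.zeros((4, 7))])
except ValueError as err:
    message = str(err)
```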

Notes

Sets self.n_samples, self.n_inputs_list, and self.num_domains. Prints the number of unique domains detected.

_get_model(samples, n_hiddens: int = 128, n_outputs: int = 64, opt_GNN='GAT', dropout_rate=0.1, use_skip_connections=True, use_attention=True, n_attention_heads=4, use_layer_scale=False, layer_scale_init=0.01, use_stochastic_depth=False, stochastic_depth_rate=0.1, combine_method='concat')

Build the integration model with the provided hyperparameters.

Parameters:

samples (dict) – Output from _get_data. Must contain n_inputs_list and n_domains.
n_hiddens (int, default 128) – Hidden dimension for GNN layers.
n_outputs (int, default 64) – Output/embedding dimension per node.
opt_GNN (str, default 'GAT') – GNN backbone option consumed by integrate_model.
dropout_rate (float, default 0.1) – Dropout rate inside the model.
use_skip_connections (bool, default True) – Whether to enable residual/skip connections (if supported).
use_attention (bool, default True) – Whether to use attention (if supported by the chosen backbone).
n_attention_heads (int, default 4) – Number of attention heads (if applicable).
use_layer_scale (bool, default False) – If True, enable layer scale with initialization layer_scale_init.
layer_scale_init (float, default 1e-2) – Initialization value for layer scaling.
use_stochastic_depth (bool, default False) – Enable stochastic depth.
stochastic_depth_rate (float, default 0.1) – Drop probability for stochastic depth.
combine_method ({'concat','sum','attention'}, default 'concat') – How to combine multi-modal features inside the model.
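
To make the combine_method options concrete, the sketch below shows how 'concat' and 'sum' differ in output shape for two per-modality embeddings. It is only an illustration with NumPy arrays: how integrate_model implements these options internally is not documented here, and 'attention' is omitted because it would involve learned weights. The function name combine is hypothetical.

```python
import numpy as np


def combine(features, method="concat"):
    """Fuse a list of (n_nodes, d) arrays into a single array."""
    if method == "concat":
        # Feature dimensions add up: (n_nodes, d_1 + d_2 + ...).
        return np.concatenate(features, axis=1)
    if method == "sum":
        # All inputs must share the same d; output stays (n_nodes, d).
        return np.sum(np.stack(features, axis=0), axis=0)
    raise ValueError(f"Unsupported method in this sketch: {method!r}")


z_a = np.ones((5, 64))
z_b = 2 * np.ones((5, 64))

concat_out = combine([z_a, z_b], "concat")
sum_out = combine([z_a, z_b], "sum")
```

Under this reading, 'concat' doubles the width of the fused representation for two modalities, while 'sum' keeps it fixed but requires equal dimensions.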

Returns:

models – The model instance returned by integrate_model(...), ready for training.

Return type:

Any

_train_dirac_integrate(samples, models, epochs: int = 500, optimizer_name: str = 'adam', lr: float = 0.001, tau: float = 0.9, wd: float = 0.05, scheduler: bool = True, lamb: float = 0.0005, scale_loss: float = 0.025)

Train the integration model and evaluate embeddings/reconstructions.

Parameters:

samples (dict) – Output from _get_data with keys like graph_ds, graph_dl, n_inputs_list, n_domains.
models (Any) – Model returned by _get_model / integrate_model.
epochs (int, default 500) – Training epochs.
optimizer_name (str, default 'adam') – Optimizer identifier consumed by the trainer.
lr (float, default 1e-3) – Learning rate.
tau (float, default 0.9) – Momentum/EMA or contrastive temperature parameter (per trainer definition).
wd (float, default 5e-2) – Weight decay.
scheduler (bool, default True) – Whether to use a learning-rate scheduler.
lamb (float, default 5e-4) – Loss coefficient used by the trainer.
scale_loss (float, default 0.025) – Additional loss scaling used by the trainer.

Returns:

data_z (torch.Tensor) – Node embeddings; typically shaped (n_nodes, n_outputs).
combine_recon (Any) – Reconstruction(s) as returned by train_integrate.evaluate; may be a tensor or a structure of tensors.

class sodirac.main.annotate_app(save_path: str = './Results/', subgraph: bool = True, use_gpu: bool = True, **kwargs)

Bases: integrate_app

High-level API for annotation / domain adaptation on graphs.

Prepares labeled source (and unlabeled target) graphs, builds an annotation model, supports semi-supervised training, optional novel-class discovery, and evaluation on source/target/test.

_get_data(source_data, source_label, source_edge_index, target_data, target_edge_index, source_domain=None, target_domain=None, test_data=None, test_edge_index=None, weighted_classes=False, split_list=None, num_workers: int = 1, batch_size: int = 1, num_parts_source: int = 1, num_parts_target: int = 1)

Process labeled source and (optional) unlabeled target into loaders.

Parameters:

source_data (ndarray | torch.Tensor) – Source node features with shape (n_source_nodes, n_features).
source_label (array-like) – Source labels; numeric or categorical. Non-numeric labels are encoded to 0-based integer codes. A mapping is stored in self.pairs.
source_edge_index (torch.LongTensor) – COO connectivity for the source graph, shape (2, E_source); made undirected.
target_data (ndarray | torch.Tensor or None) – Optional target node features with shape (n_target_nodes, n_features).
target_edge_index (torch.LongTensor or None) – Optional COO connectivity for target graph, shape (2, E_target); made undirected if provided.
source_domain (array-like[int] or None, default None) – Optional per-node domain labels for source. Defaults to zeros.
target_domain (array-like[int] or None, default None) – Optional per-node domain labels for target. Defaults to ones when target_data is provided.
test_data (ndarray | torch.Tensor or None, default None) – Optional test node features with shape (n_test_nodes, n_features).
test_edge_index (torch.LongTensor or None, default None) – COO connectivity for the test graph. Required if test_data is provided.
weighted_classes (bool, default False) – If True, compute inverse-frequency class weights for source labels.
split_list (list[tuple[int,int]] or None, default None) – Optional feature splits for multi-modal inputs, e.g., [(0,1000),(1000,1500)].
num_workers (int, default 1) – DataLoader workers for source/target loaders.
batch_size (int, default 1) – Batch size for ClusterLoader.
num_parts_source (int, default 1) – ClusterData partitions for source graph.
num_parts_target (int, default 1) – ClusterData partitions for target graph.
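
The weighted_classes option refers to inverse-frequency class weights. The sketch below shows one common normalization, weight_c = N / (K * count_c), where N is the number of labeled nodes and K the number of classes present. Which normalization sodirac actually applies is not stated here, so treat the formula and the helper name inverse_frequency_weights as assumptions.

```python
import numpy as np


def inverse_frequency_weights(labels, n_classes):
    """Weight each class by N / (K * count); classes with no samples get weight 0."""
    counts = np.bincount(labels, minlength=n_classes).astype(np.float64)
    weights = np.zeros(n_classes, dtype=np.float64)
    present = counts > 0
    weights[present] = counts.sum() / (present.sum() * counts[present])
    return weights


# Class 0 appears four times; classes 1 and 2 appear twice each.
labels = np.array([0, 0, 0, 0, 1, 1, 2, 2])
class_weight = inverse_frequency_weights(labels, n_classes=3)
```

Rarer classes receive larger weights, so their errors contribute more to a weighted classification loss.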

Returns:

A dictionary containing:

source_graph_ds (dict) – Graph data object/dict for source (from GraphDataset_unpaired).
source_graph_dl (ClusterLoader) – Loader over source clusters.
target_graph_ds (dict | None) – Graph data for target, or None if no target.
target_graph_dl (ClusterLoader | None) – Loader for target, or None if no target.
test_graph_ds (torch_geometric.data.Data | None) – Test graph object if both test_data and test_edge_index are provided.
class_weight (torch.FloatTensor | None) – Class weights when weighted_classes=True.
n_labels (int) – Number of unique labels in source.
n_inputs (int) – Feature dimension.
n_domains (int) – Number of domains inferred from source_domain/target_domain.
split_list (list[tuple[int,int]] | None) – Echo of the provided split_list.

Return type:

dict
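
The split_list entry describes how a concatenated feature matrix is divided into modalities. Assuming each tuple is a half-open column range [start, stop), as the example [(0,1000),(1000,1500)] suggests, the split can be sketched as follows; split_features is a hypothetical helper.

```python
import numpy as np


def split_features(x, split_list):
    """Slice the columns of x into one block per (start, stop) pair."""
    return [x[:, start:stop] for start, stop in split_list]


# 4 nodes with 1500 concatenated features: 1000 from one modality, 500 from another.
x = np.random.rand(4, 1500)
blocks = split_features(x, [(0, 1000), (1000, 1500)])
block_shapes = [b.shape for b in blocks]
```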

Notes

If source_label is categorical, self.pairs stores a mapping {code: original_label}; otherwise self.pairs is None. Sets self.n_labels, self.n_inputs, and self.n_domains. Prints the number of unique domains.
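
The role of self.pairs can be illustrated with a small encoder that returns integer codes together with a {code: original_label} mapping, or None for numeric labels. This is a sketch of the documented contract only; the sorted code assignment via np.unique and the name encode_labels are assumptions.

```python
import numpy as np


def encode_labels(labels):
    """Return (codes, pairs); pairs maps each code back to its label, or is None."""
    values = np.asarray(labels)
    if np.issubdtype(values.dtype, np.number):
        return values.astype(np.int64), None
    categories, codes = np.unique(values, return_inverse=True)
    pairs = {code: str(name) for code, name in enumerate(categories)}
    return codes.astype(np.int64), pairs


codes, pairs = encode_labels(["T cell", "B cell", "T cell", "NK"])

# The mapping lets integer predictions be translated back to readable names.
decoded = [pairs[c] for c in codes.tolist()]
```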

_get_model(samples, n_hiddens: int = 128, n_outputs: int = 64, opt_GNN: str = 'SAGE', s: int = 32, m: float = 0.1, easy_margin: bool = False, dropout_rate: float = 0.1, use_skip_connections: bool = False, use_attention: bool = True, n_attention_heads: int = 2, use_layer_scale: bool = False, layer_scale_init: float = 0.01, use_stochastic_depth: bool = False, stochastic_depth_rate: float = 0.1, combine_method: str = 'concat')

Build the annotation model (classifier/domain-adaptation).

Parameters:

samples (dict) – Output from annotate_app._get_data; must include n_domains, n_labels, and either n_inputs (int) or split_list for multi-modal cases.
n_hiddens (int, default 128) – Hidden dimension.
n_outputs (int, default 64) – Embedding dimension before the classification head.
opt_GNN (str, default 'SAGE') – GNN backbone identifier consumed by annotate_model.
s (int, default 32) – Scale parameter for margin-based head (if applicable).
m (float, default 0.10) – Margin parameter for margin-based head.
easy_margin (bool, default False) – Use easy margin variant if supported.
dropout_rate (float, default 0.1) – Dropout rate.
use_skip_connections (bool, default False) – Enable skip/residual connections (if supported).
use_attention (bool, default True) – Enable attention (if supported).
n_attention_heads (int, default 2) – Number of attention heads when applicable.
use_layer_scale (bool, default False) – Enable layer scaling.
layer_scale_init (float, default 1e-2) – Initial value for layer scale.
use_stochastic_depth (bool, default False) – Enable stochastic depth.
stochastic_depth_rate (float, default 0.1) – Drop probability for stochastic depth.
combine_method ({'concat','sum','attention'}, default 'concat') – Feature fusion strategy for multi-modal inputs.
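
The parameters s, m, and easy_margin resemble those of an additive angular margin (ArcFace-style) classification head. This page does not confirm that annotate_model uses that formulation, so the NumPy sketch below is only an assumption about what such a head might compute: the target-class logit becomes s * cos(theta + m), while all other logits stay at s * cos(theta). The function name margin_logits is hypothetical.

```python
import numpy as np


def margin_logits(embeddings, class_weights, labels, s=32.0, m=0.1, easy_margin=False):
    """Scaled cosine logits with an angular margin applied to the target class."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)
    sin = np.sqrt(1.0 - cos ** 2)
    phi = cos * np.cos(m) - sin * np.sin(m)  # cos(theta + m)
    if easy_margin:
        # Apply the margin only where the angle is below 90 degrees.
        phi = np.where(cos > 0, phi, cos)
    else:
        # Fall back to a linear penalty where theta + m would exceed pi.
        phi = np.where(cos > np.cos(np.pi - m), phi, cos - np.sin(np.pi - m) * m)
    one_hot = np.eye(class_weights.shape[0])[labels]
    return s * (one_hot * phi + (1.0 - one_hot) * cos)


rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))      # 4 nodes, 8-dimensional embeddings
proto = rng.normal(size=(3, 8))    # 3 class prototype vectors
labels = np.array([0, 2, 1, 0])

plain = margin_logits(emb, proto, labels, m=0.0)
with_margin = margin_logits(emb, proto, labels, m=0.1)
```

In this formulation a larger m makes the target class harder to satisfy during training, and s controls how sharply the softmax separates classes.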

Returns:

models – Model instance returned by annotate_model(...).

Return type:

Any

_train_dirac_annotate(samples, models, n_epochs: int = 200, optimizer_name: str = 'adam', lr: float = 0.001, wd: float = 0.005, scheduler: bool = True, filter_low_confidence: bool = True, confidence_threshold: float = 0.5)

Train the annotation model (semi-supervised/domain adaptation) and evaluate.

Parameters:

samples (dict) – Output from _get_data. Expected keys include source_graph_ds, source_graph_dl, optional target_graph_dl and test_graph_ds, and possibly class_weight.
models (Any) – Model returned by _get_model / annotate_model.
n_epochs (int, default 200) – Number of training epochs.
optimizer_name (str, default 'adam') – Optimizer identifier.
lr (float, default 1e-3) – Learning rate.
wd (float, default 5e-3) – Weight decay.
scheduler (bool, default True) – Whether to enable learning-rate scheduling.
filter_low_confidence (bool, default True) – If True, mark predictions with confidence < confidence_threshold as "unassigned" in the returned target_pred_filtered / test_pred_filtered.
confidence_threshold (float, default 0.5) – Confidence threshold in [0, 1].
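
The effect of filter_low_confidence can be sketched on a small probability matrix. The example assumes that confidence means the maximum class probability per node, which this page does not state explicitly; filter_predictions and the pairs values are hypothetical.

```python
import numpy as np


def filter_predictions(prob, pairs, confidence_threshold=0.5, unassigned="unassigned"):
    """Return (pred, confs, filtered): codes, max probabilities, and thresholded names."""
    pred = prob.argmax(axis=1)
    confs = prob.max(axis=1)
    filtered = np.array([pairs[int(c)] for c in pred], dtype=object)
    # Predictions below the threshold are replaced by the "unassigned" marker.
    filtered[confs < confidence_threshold] = unassigned
    return pred, confs, filtered


prob = np.array([
    [0.90, 0.05, 0.05],   # confident
    [0.40, 0.35, 0.25],   # below the 0.5 threshold
    [0.10, 0.20, 0.70],   # confident
])
pairs = {0: "B cell", 1: "NK", 2: "T cell"}

pred, confs, filtered = filter_predictions(prob, pairs, confidence_threshold=0.5)
```

Raising confidence_threshold trades coverage for precision: more nodes end up "unassigned", but the remaining labels are backed by higher probabilities.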

Returns:

A dictionary with the following keys (some values may be None if target/test data are absent): source_feat, target_feat, target_output, target_prob, target_pred, target_pred_filtered, target_confs, target_mean_uncert, test_feat, test_output, test_prob, test_pred, test_pred_filtered, test_confs, test_mean_uncert, pairs, pairs_filter, and low_confidence_threshold.

Return type:

dict

_train_dirac_novel(samples, minemodel, num_novel_class: int = 3, pre_epochs: int = 100, n_epochs: int = 200, num_parts: int = 30, resolution: float = 1, s: int = 64, m: float = 0.1, weights: dict = {'alpha1': 1, 'alpha2': 1, 'alpha3': 1, 'alpha4': 1, 'alpha5': 1, 'alpha6': 1, 'alpha7': 1, 'alpha8': 1})

Discover novel target classes and retrain with expanded label space.

Parameters:

samples (dict) – Output from _get_data; must include keys source_graph_ds, source_graph_dl, target_graph_ds, target_graph_dl, class_weight (optional), n_labels, and feature sizes n_inputs.
minemodel (Any) – Initial annotation model (from _get_model).
num_novel_class (int, default 3) – Number of novel classes to discover in target.
pre_epochs (int, default 100) – Supervised pretraining epochs on source.
n_epochs (int, default 200) – Training epochs for the novel-phase.
num_parts (int, default 30) – Number of partitions for the (new) target ClusterData.
resolution (float, default 1) – Louvain resolution for clustering.
s (int, default 64) – Scale parameter for the (re)built model head.
m (float, default 0.1) – Margin parameter for the (re)built model head.
weights (dict, default {"alpha1":1, ..., "alpha8":1}) – Loss weights dictionary consumed by _train_novel.
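
Because the default for weights is a dict literal in the signature, passing a freshly built dictionary avoids accidentally sharing state between calls. The sketch below builds such a dictionary from the documented keys alpha1 to alpha8; make_weights is a hypothetical helper, and what each alpha controls inside _train_novel is not documented here.

```python
# Documented default: every loss weight alpha1..alpha8 equals 1.
DEFAULT_WEIGHTS = {f"alpha{i}": 1 for i in range(1, 9)}


def make_weights(**overrides):
    """Return a new weights dict with selected entries overridden."""
    unknown = set(overrides) - set(DEFAULT_WEIGHTS)
    if unknown:
        raise KeyError(f"Unknown loss weights: {sorted(unknown)}")
    return {**DEFAULT_WEIGHTS, **overrides}


# Down-weight one term and switch another off, leaving the rest at 1.
weights = make_weights(alpha3=0.5, alpha8=0)
```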

Returns:

A dictionary with the following keys: source_feat, target_feat, target_output, target_prob, target_pred, target_confs, target_mean_uncert, test_feat, and test_pred. The test_* entries may be None if a test set is not provided.

Return type:

dict