dirac.utils
- sodirac.utils.append_categorical_to_data(X: numpy.ndarray | scipy.sparse.csr.csr_matrix, categorical: numpy.ndarray) -> Tuple[numpy.ndarray | scipy.sparse.csr.csr_matrix, numpy.ndarray]
Append a one-hot encoding of a categorical vector to each sample in X.
- Parameters:
X (np.ndarray or sparse.csr_matrix) – Shape [cells, features]. Feature matrix.
categorical (np.ndarray) – Shape [cells,]. Categorical labels per cell.
- Returns:
Xa (np.ndarray or sparse.csr_matrix) – Shape [cells, features + n_categories]. Matrix with one-hot appended.
categories (np.ndarray) – Shape [n_categories,]. Category names in the order used for one-hot.
Examples
>>> X_aug, cats = append_categorical_to_data(X, adata.obs["batch"].values)
Notes
Uses pd.Categorical(…).codes to derive integer label indices, then a one-hot encoding (via make_one_hot) that is concatenated to X.
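The one-hot append described in the Notes can be sketched for the dense case as follows. This is a minimal illustration, not the library's implementation: `append_one_hot` is a hypothetical helper, `np.unique` stands in for `pd.Categorical(…).codes`, and an identity-matrix lookup stands in for `make_one_hot`.

```python
import numpy as np

def append_one_hot(X, categorical):
    """Append a one-hot encoding of `categorical` to a dense X (hypothetical helper)."""
    # np.unique returns the sorted categories and one integer code per sample
    categories, codes = np.unique(categorical, return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for code i
    one_hot = np.eye(len(categories), dtype=X.dtype)[codes]
    return np.hstack([X, one_hot]), categories

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
batch = np.array(["b", "a", "b"])
X_aug, cats = append_one_hot(X, batch)
```

The returned `cats` array gives the column order of the appended block, which mirrors the role of `categories` in the documented return value.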
- sodirac.utils.get_adata_asarray(adata: anndata.AnnData) -> numpy.ndarray | scipy.sparse.csr.csr_matrix
Materialize adata.X as an array or CSR matrix (no view).
- Parameters:
adata (anndata.AnnData) – AnnData object with .X of shape [cells, genes].
- Returns:
X – Concrete in-memory copy of adata.X with matching type.
- Return type:
np.ndarray or sparse.csr_matrix
Notes
Preserves the dense/sparse form of the original .X.
- sodirac.utils.build_classification_matrix(X: numpy.ndarray | scipy.sparse.csr.csr_matrix, model_genes: numpy.ndarray, sample_genes: numpy.ndarray, gene_batch_size: int = 512) -> numpy.ndarray | scipy.sparse.csr.csr_matrix
Reindex a count matrix to the model’s gene order, filling missing genes with zeros.
- Parameters:
X (np.ndarray or sparse.csr_matrix) – Shape [cells, genes]. Count matrix for the sample.
model_genes (np.ndarray) – Expected gene identifiers in model order.
sample_genes (np.ndarray) – Gene identifiers (columns) for X.
gene_batch_size (int, default 512) – Number of genes to copy per batch (speed vs. memory trade-off).
- Returns:
N – Shape [cells, len(model_genes)]. Reindexed counts; zeros for absent genes.
- Return type:
np.ndarray or sparse.csr_matrix
Notes
If model_genes exactly matches sample_genes, returns X unchanged. Otherwise, constructs a new matrix with columns in model order and copies overlapping genes in batches to control memory usage.
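The batched reindexing described in the Notes can be sketched for dense input as below. The helper name `reindex_to_model_genes` and the dictionary lookup are assumptions made for illustration; the actual function may locate overlapping genes differently and also handles sparse input.

```python
import numpy as np

def reindex_to_model_genes(X, model_genes, sample_genes, gene_batch_size=512):
    """Dense sketch: reorder columns of X to model_genes, zeros for absent genes."""
    if np.array_equal(model_genes, sample_genes):
        return X  # identical gene order: nothing to do
    N = np.zeros((X.shape[0], len(model_genes)), dtype=X.dtype)
    col_of = {g: j for j, g in enumerate(sample_genes)}
    # (model column, sample column) pairs for genes present in both
    pairs = [(i, col_of[g]) for i, g in enumerate(model_genes) if g in col_of]
    # Copy overlapping genes a batch at a time to bound temporary memory
    for start in range(0, len(pairs), gene_batch_size):
        batch = pairs[start:start + gene_batch_size]
        dst = [p[0] for p in batch]
        src = [p[1] for p in batch]
        N[:, dst] = X[:, src]
    return N

X = np.array([[1, 2, 3], [4, 5, 6]])
sample_genes = np.array(["g1", "g2", "g3"])
model_genes = np.array(["g3", "gX", "g1"])
N = reindex_to_model_genes(X, model_genes, sample_genes, gene_batch_size=1)
```

Here `gX` is absent from the sample, so its column stays zero, and `g2` is dropped because the model does not expect it.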
- sodirac.utils.knn_smooth_pred_class(X: numpy.ndarray, pred_class: numpy.ndarray, grouping: numpy.ndarray | None = None, k: int = 15) -> numpy.ndarray
Smooth class labels by majority vote among k-nearest neighbors.
- Parameters:
X (np.ndarray) – Shape [N, features]. Embedding used for neighbor search.
pred_class (np.ndarray) – Shape [N,]. Class labels to be smoothed.
grouping (np.ndarray, optional) – Shape [N,]. Group IDs restricting neighbors to within-group only. If None, all cells are considered a single group.
k (int, default 15) – Number of neighbors to use.
- Returns:
smooth_pred_class – Shape [N,]. Smoothed class labels.
- Return type:
np.ndarray
Notes
For each group (or globally), builds a kNN graph and assigns to each cell the majority class among its k neighbors. Whether a cell counts as its own neighbor follows the scikit-learn defaults used in the implementation.
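A brute-force sketch of kNN majority voting, for a single group, may make the idea concrete. It assumes Euclidean distance and counts each point as its own neighbor; both are assumptions of this sketch, and `knn_majority_vote` is a hypothetical name rather than part of the package.

```python
import numpy as np
from collections import Counter

def knn_majority_vote(X, labels, k=3):
    """Replace each label by the most common label among the k nearest points."""
    # Pairwise Euclidean distances, shape [N, N]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # The k closest points per row; distance 0 puts each point first in its own row
    nn = np.argsort(d, axis=1, kind="stable")[:, :k]
    return np.array([Counter(labels[row].tolist()).most_common(1)[0][0] for row in nn])

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels = np.array(["A", "B", "A", "B", "B", "A"])
smoothed = knn_majority_vote(X, labels, k=3)
```

In this toy input the isolated "B" in the first cluster and the isolated "A" in the second are outvoted by their neighbors.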
- sodirac.utils.knn_smooth_pred_class_prob(X: numpy.ndarray, pred_probs: numpy.ndarray, names: numpy.ndarray, grouping: numpy.ndarray | None = None, k: Callable[[int], int] | int = 15, dm: numpy.ndarray | None = None, **kwargs: Any) -> numpy.ndarray
Smooth class probabilities by kNN regression with RBF distance weights.
- Parameters:
X (np.ndarray) – Shape [N, features]. Embedding used for neighbor search.
pred_probs (np.ndarray) – Shape [N, C]. Class prediction probabilities per cell.
names (np.ndarray) – Shape [C,]. Class names corresponding to columns of pred_probs.
grouping (np.ndarray, optional) – Shape [N,]. Group IDs restricting neighbors to within-group only. If None, all cells are considered a single group.
k (Callable[[int], int] or int, default 15) – If callable, receives the group size and returns k for that group; otherwise a fixed k is used.
dm (np.ndarray, optional) – Shape [N, N]. Precomputed distance matrix to set the RBF kernel parameter efficiently.
**kwargs (Any) – Additional kwargs forwarded to KNeighborsRegressor.
- Returns:
smooth_pred_class – Shape [N,]. Class labels from argmax of smoothed probabilities.
- Return type:
np.ndarray
Examples
>>> smooth = knn_smooth_pred_class_prob(X, probs, class_names, grouping=clusters, k=15)
Notes
Uses RBFWeight to set kernel width from median pairwise distance, then applies weighted kNN regression to smooth class probabilities within each group. Class labels are taken as argmax of the smoothed probabilities.
- sodirac.utils.argmax_pred_class(grouping: numpy.ndarray, prediction: numpy.ndarray) -> numpy.ndarray
Assign groupwise majority class to all elements in each group.
- Parameters:
grouping (np.ndarray) – Shape [N,]. Group IDs for each element.
prediction (np.ndarray) – Shape [N,]. Predicted class for each element.
- Returns:
assigned_classes – Shape [N,]. Majority class per group applied to all elements.
- Return type:
np.ndarray
Examples
>>> grouping = np.array([0,0,0,1,1,1,2,2,2,2])
>>> prediction = np.array(['A','A','A','B','A','B','C','A','B','C'])
>>> argmax_pred_class(grouping, prediction)
array(['A','A','A','B','B','B','C','C','C','C'], dtype=object)
Notes
Useful when leveraging cluster assignments from another method to simplify cell-level labels to cluster-level majorities.
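The groupwise majority assignment can be sketched in a few lines. `groupwise_majority` is a hypothetical stand-in written for illustration; how the library breaks ties between equally frequent classes is not stated in this documentation, so the tie behavior here (first class encountered wins) is an assumption.

```python
import numpy as np
from collections import Counter

def groupwise_majority(grouping, prediction):
    """Assign every element the most common prediction within its group."""
    assigned = np.empty(len(prediction), dtype=object)
    for g in np.unique(grouping):
        mask = grouping == g
        # most_common(1) returns [(label, count)] for the top label in this group
        assigned[mask] = Counter(prediction[mask].tolist()).most_common(1)[0][0]
    return assigned

grouping = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
prediction = np.array(["A", "A", "A", "B", "A", "B", "C", "A", "B", "C"])
assigned = groupwise_majority(grouping, prediction)
```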
- sodirac.utils.compute_entropy_of_mixing(X: numpy.ndarray, y: numpy.ndarray, n_neighbors: int, n_iters: int | None = None, **kwargs: Any) -> numpy.ndarray
Compute entropy of group mixing in local neighborhoods.
- Parameters:
X (np.ndarray) – Shape [N, P]. Feature matrix used for neighbor search.
y (np.ndarray) – Shape [N,]. Discrete group labels.
n_neighbors (int) – Number of neighbors drawn when computing local distributions.
n_iters (int, optional) – Number of random query points to evaluate. If None, uses all points.
**kwargs (Any) – Additional keyword arguments forwarded to NearestNeighbors.
- Returns:
entropy_of_mixing – Shape [n_iters,]. Entropy values per query point (in nats).
- Return type:
np.ndarray
Notes
For each query point, counts group membership among its k-nearest neighbors and computes entropy of the resulting probability vector.
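The per-point computation in the Notes can be sketched with a brute-force neighbor search. This sketch evaluates every point (the `n_iters=None` case), uses Euclidean distance, and counts the query point among its own neighbors; those choices and the name `entropy_of_mixing` are assumptions of the example, not documented behavior.

```python
import numpy as np

def entropy_of_mixing(X, y, n_neighbors):
    """Entropy (in nats) of group labels among each point's nearest neighbors."""
    groups = np.unique(y)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1, kind="stable")[:, :n_neighbors]
    out = np.zeros(len(X))
    for i, row in enumerate(nn):
        # Fraction of neighbors belonging to each group
        p = np.array([(y[row] == g).mean() for g in groups])
        p = p[p > 0]  # 0 * log(0) is treated as 0
        out[i] = -(p * np.log(p)).sum()
    return out

X = np.array([[0.0], [0.1], [10.0], [10.1]])
y = np.array([0, 1, 0, 0])
H = entropy_of_mixing(X, y, n_neighbors=2)
```

The first two points sit in a perfectly mixed neighborhood (entropy log 2), while the last two see only one group (entropy 0).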
- sodirac.utils.pp_adatas(adata_sc: anndata.AnnData, adata_sp: anndata.AnnData, genes: Iterable[str] | None = None, gene_to_lowercase: bool = True) -> None
Preprocess single-cell and spatial AnnData to align genes and compute density priors.
- Parameters:
adata_sc (anndata.AnnData) – Single-cell AnnData.
adata_sp (anndata.AnnData) – Spatial expression AnnData.
genes (Iterable[str], optional) – Marker genes to use. If None, all genes from adata_sc are considered.
gene_to_lowercase (bool, default True) – If True, lowercases all gene names to align case between modalities.
- Return type:
None
Notes
Filters out all-zero genes in both datasets.
Stores shared training genes in .uns["training_genes"].
Stores overlapping genes in .uns["overlap_genes"].
Computes uniform and RNA-count-based density priors in adata_sp.obs.
- class sodirac.utils.RBFWeight(alpha: float | None = None)
Bases: object
Radial basis function (Gaussian) weight generator for distances.
- Parameters:
alpha (float, optional) – RBF parameter (1 / (2 * sigma^2)). If not set, must call set_alpha.
Notes
Weights follow: w(r) = exp(-(alpha * r)^2).
- set_alpha(X: numpy.ndarray, n_max: int | None = None, dm: numpy.ndarray | None = None) -> None
Estimate alpha from the median pairwise distance.
- Parameters:
X (np.ndarray) – Shape [N, P]. Observations.
n_max (int, optional) – Max observations to subsample for the median distance computation.
dm (np.ndarray, optional) – Shape [N, N]. Precomputed distance matrix; if provided, speeds up estimation.
- Return type:
None
References
Gretton et al., “A Kernel Two-Sample Test”, JMLR 13(Mar):723–773, 2012.
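The two formulas stated for this class can be combined into a short numerical sketch. It assumes that sigma is the median of the pairwise Euclidean distances (upper triangle of the distance matrix) and applies the weight formula exactly as written in the Notes; `median_alpha` and `rbf_weights` are hypothetical names, and subsampling via `n_max` is omitted.

```python
import numpy as np

def median_alpha(X):
    """alpha = 1 / (2 * sigma^2), with sigma assumed to be the median pairwise distance."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sigma = np.median(d[np.triu_indices_from(d, k=1)])  # each pair counted once
    return 1.0 / (2.0 * sigma ** 2)

def rbf_weights(r, alpha):
    """w(r) = exp(-(alpha * r)^2), as given in the Notes."""
    return np.exp(-((alpha * r) ** 2))

X = np.array([[0.0], [1.0], [3.0]])
alpha = median_alpha(X)  # pairwise distances are 1, 3, 2, so the median is 2
w = rbf_weights(np.array([0.0, 2.0, 8.0]), alpha)
```

Weights start at 1 for zero distance and decay smoothly, so closer neighbors contribute more in weighted kNN regression.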
- sodirac.utils.adata_to_cluster_expression(adata: anndata.AnnData, cluster_label: str, scale: bool = True, add_density: bool = True) -> anndata.AnnData
Aggregate single-cell expression to cluster-level expression by cluster_label.
- Parameters:
adata (anndata.AnnData) – Single-cell AnnData.
cluster_label (str) – Column in adata.obs defining clusters.
scale (bool, default True) – If True, sums counts per cluster (proportional to cluster size). If False, takes mean per cluster.
add_density (bool, default True) – If True, adds normalized cluster sizes to .obs['cluster_density'].
- Returns:
aggregated – AnnData with one observation per cluster and the same variables as input.
- Return type:
anndata.AnnData
Notes
Only cluster_label is preserved in .obs of the returned AnnData (plus cluster_density if requested).
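The aggregation itself can be sketched on a plain matrix, without AnnData. In this illustration `cluster_expression` is a hypothetical helper, and "normalized cluster sizes" is assumed to mean cluster size divided by the total number of cells.

```python
import numpy as np

def cluster_expression(X, clusters, scale=True):
    """Sum (scale=True) or average (scale=False) rows of X per cluster."""
    labels, codes = np.unique(clusters, return_inverse=True)
    sums = np.zeros((len(labels), X.shape[1]))
    np.add.at(sums, codes, X)            # accumulate each cell into its cluster row
    sizes = np.bincount(codes)
    agg = sums if scale else sums / sizes[:, None]
    density = sizes / sizes.sum()        # assumed meaning of "cluster_density"
    return labels, agg, density

X = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]])
clusters = np.array(["b", "a", "b"])
labels, agg_sum, density = cluster_expression(X, clusters, scale=True)
_, agg_mean, _ = cluster_expression(X, clusters, scale=False)
```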
- sodirac.utils._edge_list_to_tensor(edge_list: Sequence[Sequence[int]]) -> torch.LongTensor
Convert a list of directed edges to a tensor of shape [2, E].
- Parameters:
edge_list (sequence of (i, j)) – List-like edge pairs.
- Returns:
edge_index – Shape [2, E]. Empty [2, 0] if input is empty.
- Return type:
torch.LongTensor
Notes
Validates shape and dtype; does not deduplicate edges.
- sodirac.utils._to_tuple_list(edges: List[Tuple[int, int]] | torch.Tensor | numpy.ndarray) -> List[Tuple[int, int]]
Normalize various edge representations to a list of (i, j) tuples.
- Parameters:
edges (list/array/tensor) – List of pairs [[i, j], …] / [(i, j), …] or a 2D tensor/array of shape [2, E] or [E, 2].
- Returns:
edge_list – List of directed edge tuples.
- Return type:
List[Tuple[int, int]]
Notes
Accepts torch or numpy arrays in either [2, E] or [E, 2] layout.
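A sketch of the layout normalization for lists and NumPy arrays is shown below (torch tensors are left out). `to_tuple_list` is a hypothetical stand-in; in particular, how the library resolves the ambiguous 2 x 2 case is not documented here, so reading it as [E, 2] is an assumption.

```python
import numpy as np

def to_tuple_list(edges):
    """Normalize [[i, j], ...], [E, 2], or [2, E] input to a list of (i, j) tuples."""
    arr = np.asarray(edges)
    if arr.size == 0:
        return []
    if arr.ndim != 2 or 2 not in arr.shape:
        raise ValueError("expected shape [2, E] or [E, 2]")
    # A [2, E] array holds sources in row 0 and targets in row 1; transpose it.
    # A 2 x 2 input is ambiguous; this sketch reads it as [E, 2].
    if arr.shape[1] != 2:
        arr = arr.T
    return [(int(i), int(j)) for i, j in arr]

from_2xE = to_tuple_list(np.array([[0, 1, 2], [1, 2, 0]]))
from_Ex2 = to_tuple_list([[0, 1], [1, 2], [2, 0]])
```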
- sodirac.utils.get_multi_edge_index(pos: numpy.ndarray, regions: numpy.ndarray, graph_methods: str = 'knn', n_neighbors: int | None = None, n_radius: float | None = None, verbose: bool = True) -> torch.LongTensor
Build intra-region graphs (no cross-region edges) and merge them.
- Parameters:
pos (np.ndarray) – Shape [N, d]. Coordinates.
regions (np.ndarray) – Shape [N,]. Region label for each node.
graph_methods ({"knn", "radius"}, default "knn") – Graph construction method.
n_neighbors (int, optional) – Required if graph_methods == "knn". Number of neighbors (> 0).
n_radius (float, optional) – Required if graph_methods == "radius". Neighborhood radius (> 0).
verbose (bool, default True) – If True, prints average directed neighbors per node.
- Returns:
edge_index – Shape [2, E]. Directed edges (i, j). Empty [2, 0] if none.
- Return type:
torch.LongTensor
Notes
Uses PyG knn_graph or radius_graph per region and remaps indices to global node IDs before concatenation.
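The local-to-global index remapping can be illustrated with NumPy alone. In this sketch a brute-force kNN (`knn_edges`, hypothetical) stands in for PyG's `knn_graph`, and the edge direction (point to its neighbor) is a choice made for the example, not a statement about PyG's convention.

```python
import numpy as np

def knn_edges(pos, k):
    """Directed (i, j) pairs where j is one of the k nearest other points to i."""
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-edges
    k = min(k, len(pos) - 1)
    nn = np.argsort(d, axis=1, kind="stable")[:, :k]
    src = np.repeat(np.arange(len(pos)), k)
    return np.stack([src, nn.ravel()])           # shape [2, E], local indices

def multi_region_edges(pos, regions, k):
    """Build a kNN graph inside each region and merge with global node IDs."""
    blocks = []
    for r in np.unique(regions):
        idx = np.flatnonzero(regions == r)       # global IDs of this region's nodes
        local = knn_edges(pos[idx], k)
        blocks.append(idx[local])                # remap local indices to global IDs
    return np.concatenate(blocks, axis=1) if blocks else np.zeros((2, 0), dtype=int)

pos = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [11.0, 10.0]])
regions = np.array(["a", "a", "a", "b", "b"])
edge_index = multi_region_edges(pos, regions, k=1)
```

Because each graph is built on one region's coordinates only, no edge can connect nodes from different regions.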
- sodirac.utils.get_single_edge_index(pos: numpy.ndarray, graph_methods: str = 'knn', n_neighbors: int | None = None, n_radius: float | None = None, verbose: bool = True) -> torch.LongTensor
Build a graph on a single region or the whole set.
- Parameters:
pos (np.ndarray) – Shape [N, d]. Coordinates.
graph_methods ({"knn", "radius"}, default "knn") – Graph construction method.
n_neighbors (int, optional) – Required if graph_methods == "knn". Number of neighbors (> 0).
n_radius (float, optional) – Required if graph_methods == "radius". Neighborhood radius (> 0).
verbose (bool, default True) – If True, prints average directed neighbors per node.
- Returns:
edge_index – Shape [2, E]. Directed edges (i, j). Empty [2, 0] if none.
- Return type:
torch.LongTensor
- sodirac.utils.get_expr_edge_index(expr: numpy.ndarray, n_neighbors: int = 20, mode: str = 'connectivity', metric: str = 'correlation', include_self: bool = False) -> List[Tuple[int, int]]
Build a kNN graph from a feature/expression matrix using scikit-learn.
- Parameters:
expr (np.ndarray) – Shape [N, P]. Feature matrix.
n_neighbors (int, default 20) – Number of neighbors.
mode ({"connectivity", "distance"}, default "connectivity") – Graph construction mode.
metric (str, default "correlation") – Distance/affinity metric passed to kneighbors_graph.
include_self (bool, default False) – Whether to include self-edges.
- Returns:
edges – Directed edges (row -> col) as a list of (i, j) pairs.
- Return type:
List[Tuple[int, int]]
Notes
Returns COO order from the sparse adjacency.
- sodirac.utils.edge_lists_intersection(edges1: List[Tuple[int, int]] | torch.Tensor, edges2: List[Tuple[int, int]] | torch.Tensor) -> List[Tuple[int, int]]
Compute the direction-sensitive intersection of two edge sets.
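- Parameters:
edges1 (List[Tuple[int, int]] or torch.Tensor) – First edge set.
edges2 (List[Tuple[int, int]] or torch.Tensor) – Second edge set.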
- Returns:
edges – Intersection as a list of directed edges.
- Return type:
List[Tuple[int, int]]
Notes
Converts inputs to tuple lists, then intersects as Python sets.
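Direction sensitivity means that (i, j) and (j, i) are treated as different edges. A minimal sketch with Python sets, using a hypothetical `directed_intersection` helper, is shown below; sorting the result is an addition for reproducible output and is not claimed for the library function.

```python
def directed_intersection(edges1, edges2):
    """Edges present in both inputs, comparing (i, j) pairs exactly."""
    set1 = {tuple(e) for e in edges1}
    set2 = {tuple(e) for e in edges2}
    return sorted(set1 & set2)

edges1 = [(0, 1), (1, 0), (1, 2)]
edges2 = [(1, 0), (2, 1), (1, 2)]
common = directed_intersection(edges1, edges2)
```

Here (0, 1) is dropped because only its reverse appears in `edges2`, and (2, 1) is dropped because only its reverse appears in `edges1`.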
- sodirac.utils.get_consensus_edges(spatial: numpy.ndarray, *omics: numpy.ndarray, target_neighbors: int = 8, max_iter: int = 20) -> torch.LongTensor
Intersect spatial and feature kNN graphs to target a desired average degree.
- Parameters:
spatial (np.ndarray) – Shape [N, d]. Coordinates.
*omics (np.ndarray) – One or more matrices, each shape [N, p_k], concatenated for feature kNN.
target_neighbors (int, default 8) – Desired average number of neighbors in the intersection (0 < target < N).
max_iter (int, default 20) – Maximum number of binary-search iterations.
- Returns:
edge_index – Shape [2, E]. Intersection edges (i, j). Empty [2, 0] if none.
- Return type:
torch.LongTensor
Notes
Binary-searches the neighbor count used for both spatial and feature graphs until the intersection’s average degree approaches target_neighbors.
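The binary search can be sketched as follows, under several assumptions: a brute-force Euclidean kNN replaces the spatial and feature graph builders, a single feature matrix is used instead of concatenated omics blocks, and the search keeps the intersection whose average degree is closest to the target. The names `knn_edge_set` and `consensus_edges` are hypothetical. The search is valid because kNN edge sets are nested as k grows, so the intersection size never decreases with k.

```python
import numpy as np

def knn_edge_set(X, k):
    """Set of directed (i, j) edges to the k nearest other points."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1, kind="stable")[:, :k]
    return {(i, int(j)) for i in range(len(X)) for j in nn[i]}

def consensus_edges(spatial, features, target_neighbors=8, max_iter=20):
    n = len(spatial)
    lo, hi = 1, n - 1
    best, best_gap = set(), float("inf")
    for _ in range(max_iter):
        if lo > hi:
            break
        k = (lo + hi) // 2
        edges = knn_edge_set(spatial, k) & knn_edge_set(features, k)
        avg_degree = len(edges) / n
        gap = abs(avg_degree - target_neighbors)
        if gap < best_gap:                 # remember the closest intersection so far
            best, best_gap = edges, gap
        if avg_degree < target_neighbors:
            lo = k + 1                     # too sparse: use more neighbors
        elif avg_degree > target_neighbors:
            hi = k - 1                     # too dense: use fewer neighbors
        else:
            break
    return sorted(best)

rng = np.random.default_rng(0)
spatial = rng.normal(size=(6, 2))
features = spatial * 2.0                   # same neighbor structure as spatial
edges = consensus_edges(spatial, features, target_neighbors=2)
```

With identical neighbor structure in both inputs, the intersection at k equals the kNN graph itself, so the search settles on k = 2.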
- sodirac.utils.tfidf(X: numpy.ndarray | scipy.sparse.csr_matrix) -> numpy.ndarray | scipy.sparse.csr_matrix
Apply TF-IDF normalization (Seurat v3-style).
- Parameters:
X (np.ndarray or sparse.csr_matrix) – Input matrix of shape [cells, features].
- Returns:
X_tfidf – TF-IDF normalized matrix with the same sparsity type as input.
- Return type:
np.ndarray or sparse.csr_matrix
Notes
idf = n_cells / feature_sum; tf is row-normalized counts.
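The two definitions in the Notes translate directly into a dense NumPy sketch. `tfidf_dense` is a hypothetical name, and the sketch assumes every row and column has a nonzero sum (all-zero features would cause division by zero).

```python
import numpy as np

def tfidf_dense(X):
    """tf = row-normalized counts; idf = n_cells / per-feature sum."""
    X = np.asarray(X, dtype=float)
    idf = X.shape[0] / X.sum(axis=0)
    tf = X / X.sum(axis=1, keepdims=True)
    return tf * idf

X = np.array([[1.0, 1.0], [3.0, 1.0]])
X_tfidf = tfidf_dense(X)
```

In this toy matrix the feature sums are 4 and 2, so idf is [0.5, 1.0]; the rarer second feature is weighted up relative to the first.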
- sodirac.utils.lsi(adata: anndata.AnnData, n_comps: int = 20, use_highly_variable: bool | None = None, **kwargs: Any) -> anndata.AnnData
Compute LSI embeddings following the Seurat v3 approach.
- Parameters:
- Returns:
adata – Input AnnData with adata.obsm["X_lsi"] added.
- Return type:
anndata.AnnData
Notes
Applies TF-IDF, L1 normalization, log1p scaling, randomized SVD, and per-cell z-scoring (mean 0, std 1).
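The pipeline in the Notes can be sketched on a dense matrix. Several details are assumptions of this sketch rather than documented behavior: an exact SVD (`np.linalg.svd`) stands in for randomized SVD, the `scale` factor applied before `log1p` is hypothetical, and the z-scoring uses the sample standard deviation (`ddof=1`).

```python
import numpy as np

def lsi_dense(X, n_comps=3, scale=1e4):
    X = np.asarray(X, dtype=float)
    # TF-IDF: row-normalized counts times n_cells / feature_sum
    X = (X / X.sum(axis=1, keepdims=True)) * (X.shape[0] / X.sum(axis=0))
    # L1-normalize each cell, then log1p (scale factor is an assumption)
    X = X / np.abs(X).sum(axis=1, keepdims=True)
    X = np.log1p(X * scale)
    # Left singular vectors; exact SVD stands in for randomized SVD
    U = np.linalg.svd(X, full_matrices=False)[0][:, :n_comps]
    # Per-cell z-scoring across components
    U = U - U.mean(axis=1, keepdims=True)
    U = U / U.std(axis=1, ddof=1, keepdims=True)
    return U

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(6, 5)) + 1   # strictly positive toy counts
emb = lsi_dense(counts, n_comps=3)
```

Each row of `emb` ends up with mean 0 and unit standard deviation, matching the per-cell z-scoring step.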
- sodirac.utils._optimize_cluster(adata: anndata.AnnData, resolution: List[float] = []) -> float
Optimize Leiden resolution by maximizing the Calinski–Harabasz score.
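- Parameters:
adata (anndata.AnnData) – AnnData to be clustered.
resolution (List[float], default []) – Candidate Leiden resolutions to evaluate.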
- Returns:
res – Best resolution; the value is also printed.
- Return type:
float
Notes
Runs sc.tl.leiden for each resolution and computes CH score on .X.
- sodirac.utils._priori_cluster(adata: anndata.AnnData, eval_cluster_n: int = 7, res_min: float = 0.01, res_max: float = 2.5, res_step: float = 0.01) -> float
Find a Leiden resolution that yields a target number of clusters.
- Parameters:
adata (anndata.AnnData) – AnnData to be clustered. Overwrites adata.obs["leiden"].
eval_cluster_n (int, default 7) – Desired number of Leiden clusters.
res_min (float, default 0.01) – Minimum resolution to try (inclusive).
res_max (float, default 2.5) – Maximum resolution to try (inclusive).
res_step (float, default 0.01) – Step size between consecutive resolutions.
- Returns:
res – Resolution that first (from high to low) yields exactly eval_cluster_n, or the last tried value if none match.
- Return type:
float
Notes
Search order is descending to mirror original behavior.
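The descending search can be sketched independently of Scanpy. Here `count_clusters` is a hypothetical callable standing in for "run Leiden at this resolution and count the clusters", and `toy_cluster_count` is a made-up stand-in used only to exercise the loop.

```python
def find_resolution(count_clusters, target_n=7, res_min=0.01, res_max=2.5, res_step=0.01):
    """Scan resolutions from res_max down to res_min; stop at the first exact match."""
    n_steps = int(round((res_max - res_min) / res_step))
    res = res_max
    for s in range(n_steps, -1, -1):
        # Recompute from the step index to avoid accumulating float error
        res = round(res_min + s * res_step, 10)
        if count_clusters(res) == target_n:
            break
    return res  # matching resolution, or the last value tried if none match

def toy_cluster_count(res):
    """Toy stand-in for Leiden: cluster count grows with resolution."""
    return int(res * 10) + 1

best = find_resolution(toy_cluster_count, target_n=7)
fallback = find_resolution(lambda res: 1, target_n=7)
```

Because the scan runs from high to low, the result is the largest resolution on the grid that yields the target count; when nothing matches, the loop ends at `res_min`.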
- sodirac.utils.mclust_R(adata: anndata.AnnData, num_cluster: int, modelNames: str = 'EEE', used_obsm: str = 'DIRAC_embed', random_seed: int = 2020, key_added: str = 'DIRAC') -> None
Cluster embeddings using R’s mclust via rpy2.
- Parameters:
adata (anndata.AnnData) – AnnData with embeddings in adata.obsm[used_obsm] of shape [N, d].
num_cluster (int) – Number of clusters for Mclust.
modelNames (str, default 'EEE') – Covariance model string (see mclust docs).
used_obsm (str, default 'DIRAC_embed') – Key in adata.obsm to cluster.
random_seed (int, default 2020) – Random seed for NumPy and R.
key_added (str, default 'DIRAC') – Column name to store cluster labels in adata.obs.
- Return type:
None
Notes
Requires rpy2 and R package mclust. Writes integer labels to adata.obs[key_added] as categorical.
- sodirac.utils.seed_torch(seed: int = 1029) -> None
Set random seeds for Python, NumPy, and PyTorch (CPU & CUDA) for reproducibility.
- Parameters:
seed (int, default 1029) – Seed value used across Python’s random, NumPy, and PyTorch.
- Return type:
None
Notes
- Sets:
os.environ['PYTHONHASHSEED']
torch.manual_seed / cuda.manual_seed / cuda.manual_seed_all
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
- sodirac.utils.combine_multimodal_adatas(adatas: Dict[str, anndata.AnnData], *, prefixes: Dict[str, str] | None = None, align_obs: bool = True, preserve_obsm: Iterable[str] = ('spatial',), dtype: numpy.dtype = numpy.float32) -> anndata.AnnData
Concatenate features across modalities for the same set of cells.
- Parameters:
adatas (Dict[str, AnnData]) – Mapping from modality name (e.g., "RNA", "ATAC", "ADT") to AnnData. Dict insertion order determines feature block order.
prefixes (Dict[str, str], optional) – Per-modality prefixes for feature names (default: f"{mod.upper()}_").
align_obs (bool, default True) – If True, reindex each AnnData to match the first modality’s cells. If False, raises on mismatch.
preserve_obsm (Iterable[str], default ("spatial",)) – Keys in .obsm to copy from the first modality if present.
dtype (np.dtype, default np.float32) – Output matrix dtype.
- Returns:
combined – AnnData with dense X (features concatenated), .var describing feature types and original names, .obs from the reference modality, and .uns with combination metadata.
- Return type:
anndata.AnnData
Notes
Converts all blocks to dense arrays before horizontal concatenation.
- sodirac.utils.ctg(adata_sc: anndata.AnnData, cluster_label: str, n_genes: int = 150, *, min_cells: int = 3, method: str = 'wilcoxon', use_raw: bool = False) -> List[str]
Select top marker genes per cluster using Scanpy’s rank_genes_groups.
- Parameters:
adata_sc (anndata.AnnData) – Single-cell AnnData with expression matrix and metadata.
cluster_label (str) – Column in adata_sc.obs with cluster identities.
n_genes (int, default 150) – Number of top-ranked genes to collect per cluster before de-duplication.
min_cells (int, keyword-only, default 3) – Minimum cells a gene must be expressed in prior to ranking.
method ({"wilcoxon", "t-test", "logreg"}, keyword-only, default "wilcoxon") – Statistical test for differential expression.
use_raw (bool, keyword-only, default False) – Whether to use adata.raw for testing.
- Returns:
markers – Unique list of top marker genes across clusters.
- Return type:
List[str]
Notes
Works on a copy of the input AnnData to avoid in-place modifications. Applies minimal preprocessing (filter, normalize_total, log1p) before DE.