Consensus Clustering

InMoose implements consensus clustering [Monti2003], an unsupervised cluster discovery algorithm. The InMoose implementation is based on an earlier implementation by Žiga Sajovic.
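
To give an intuition of the method, the following minimal sketch builds the consensus matrix at the heart of [Monti2003]: the data are clustered on many random subsamples, and entry (i, j) is the fraction of subsamples containing both i and j in which the two samples were assigned to the same cluster. This is an illustration written for this page, not the InMoose code. The function `consensus_matrix` and its `runs` argument are hypothetical names, and subsamples are assumed to be drawn without replacement.

```python
import numpy as np


def consensus_matrix(n_samples, runs):
    """Build a consensus matrix from resampled clustering runs.

    `runs` is an iterable of (indices, labels) pairs: `indices` lists the
    samples drawn in one resampling iteration (no duplicates assumed) and
    `labels` gives their cluster assignments in that iteration.
    """
    together = np.zeros((n_samples, n_samples))  # times i and j shared a cluster
    sampled = np.zeros((n_samples, n_samples))  # times i and j were drawn together
    for indices, labels in runs:
        indices = np.asarray(indices)
        labels = np.asarray(labels)
        same = (labels[:, None] == labels[None, :]).astype(float)
        sampled[np.ix_(indices, indices)] += 1.0
        together[np.ix_(indices, indices)] += same
    # pairs that were never drawn together get a consensus of 0 in this sketch
    return np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)


# three toy resampling iterations over 4 samples
runs = [
    ([0, 1, 2], [0, 0, 1]),
    ([1, 2, 3], [0, 1, 1]),
    ([0, 1, 3], [0, 0, 1]),
]
C = consensus_matrix(4, runs)
```

In this toy example, samples 0 and 1 are always clustered together when both are drawn, so `C[0, 1]` is 1, whereas samples 1 and 2 never are, so `C[1, 2]` is 0.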

Cohort Stratification

We illustrate the clustering-based stratification capabilities of InMoose.

We start by simulating RNA-Seq data, using the sim module of InMoose.

>>> import numpy as np
>>> import pandas as pd
>>> from inmoose.sim import sim_rnaseq
>>> 
>>> # number of genes
>>> N = 1000
>>> # number of samples
>>> M = 1000
>>> assert M % 10 == 0
>>> P = M // 10  # 10% of M, helper variable
>>> 
>>> # 3 batches: 20% 30% 50% of the samples
>>> batch = (2 * P) * [0] + (3 * P) * [1] + (5 * P) * [2]
>>> batch = np.array([f"batch{b}" for b in batch])
>>> batch0 = batch == "batch0"
>>> batch1 = batch == "batch1"
>>> batch2 = batch == "batch2"
>>> 
>>> # 2 condition groups
>>> #   - group0: 50% of batch0, 33% of batch1, 40% of batch2
>>> #   - group1: 50% of batch0, 67% of batch1, 60% of batch2
>>> group = P * [0] + P * [1] + P * [0] + (2 * P) * [1] + (2 * P) * [0] + (3 * P) * [1]
>>> group = np.array([f"group{g}" for g in group])
>>> assert len(batch) == M and len(group) == M
>>> 
>>> # store clinical metadata (i.e. batch and group) as a DataFrame
>>> clinical = pd.DataFrame({"batch": batch, "group": group})
>>> clinical.index = [f"sample{i}" for i in range(M)]
>>> 
>>> # simulate data
>>> # random_state passes a seed to the PRNG for reproducibility
>>> counts = sim_rnaseq(N, M, batch=batch, group=group, random_state=42).T
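
As a sanity check of the simulated design, the proportion of each group within each batch can be tabulated with pandas. The snippet below rebuilds `batch` and `group` so that it runs on its own; `design` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

M = 1000
P = M // 10

# same batch and group layout as in the simulation above
batch = np.array(
    [f"batch{b}" for b in (2 * P) * [0] + (3 * P) * [1] + (5 * P) * [2]]
)
group = np.array(
    [
        f"group{g}"
        for g in P * [0] + P * [1] + P * [0] + (2 * P) * [1] + (2 * P) * [0] + (3 * P) * [1]
    ]
)
clinical = pd.DataFrame({"batch": batch, "group": group})

# fraction of each group within each batch (each row sums to 1)
design = pd.crosstab(clinical["batch"], clinical["group"], normalize="index")
print(design.round(2))
```

Each row of `design` corresponds to a batch and each column to a group, which makes the imbalance between batches easy to read off before clustering.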

We then run the consensus clustering algorithm.

>>> from inmoose.consensus_clustering.consensus_clustering import consensusClustering
>>> from sklearn.cluster import AgglomerativeClustering
>>> 
>>> cc = consensusClustering(AgglomerativeClustering)
>>> cc.compute_consensus_clustering(counts, random_state=None)
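
Consensus clustering is typically run over a range of cluster numbers k. [Monti2003] propose to compare them through the empirical CDF of the consensus matrix entries: the area under that CDF is computed for each k, together with its relative change between consecutive values of k. The sketch below computes this area for a given consensus matrix with plain NumPy. It is an illustration under stated assumptions: `cdf_area` is a hypothetical helper, and how consensus matrices are obtained from the InMoose object is not covered here.

```python
import numpy as np


def cdf_area(consensus, n_bins=100):
    """Approximate the area under the empirical CDF of the consensus values.

    Only the entries above the diagonal are used, since the matrix is
    symmetric and its diagonal is trivially 1.
    """
    upper = consensus[np.triu_indices_from(consensus, k=1)]
    grid = np.linspace(0.0, 1.0, n_bins + 1)
    # CDF(c) = fraction of sample pairs whose consensus value is <= c
    cdf = (upper[None, :] <= grid[:, None]).mean(axis=1)
    # trapezoidal rule over the grid
    return float(np.sum((cdf[1:] + cdf[:-1]) / 2.0 * np.diff(grid)))


# toy consensus matrix with two perfectly stable clusters: {0, 1} and {2, 3}
perfect = np.array(
    [
        [1.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
    ]
)
area = cdf_area(perfect)
```

Computing this area for each candidate k and inspecting how much it grows from one k to the next is one way to judge where adding clusters stops improving the consensus.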

We can now look at the clusters found.

>>> from anndata import AnnData
>>> from inmoose.utils import Factor
>>> import scanpy as sc
>>> 
>>> ad = AnnData(counts, obs=clinical)
>>> for k in range(2, 11):
...     # Factor ensures that cluster IDs are interpreted as categorical data
...     ad.obs[f"k={k}"] = Factor(cc.predict(k))
... 
>>> # compute the PCA
>>> sc.tl.pca(ad)
>>> # plot the PCA
>>> sc.pl.pca(ad, color=[f"k={k}" for k in range(2, 11)], return_fig=True).show()

References

[Monti2003]

S. Monti, P. Tamayo, J. Mesirov, T. Golub. 2003. Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning 52(1). doi:https://doi.org/10.1023/A:1023949509487

Code documentation

consensusClustering(cluster[, mink, maxk, ...])

Implementation of consensus clustering, following the paper by Monti et al. [Monti2003]: https://link.springer.com/content/pdf/10.1023%2FA%3A1023949509487.pdf