================================ Consensus Clustering ================================ InMoose implements consensus clustering [Monti2003]_, an unsupervised cluster discovery algorithm. InMoose implementation is based on a previous `implementation by Žiga Sajovic `_ Cohort Stratification ===================== .. currentmodule:: inmoose.consensus_clustering We illustrate the clustering-based stratification capabilities of InMoose. We start by simulating RNA-Seq data, using the :mod:`sim` module of InMoose. .. repl:: import numpy as np import pandas as pd from inmoose.sim import sim_rnaseq # number of genes N = 1000 # number of samples M = 1000 assert M % 10 == 0 P = M // 10 # 10% of M, helper variable # 3 batches: 20% 30% 50% of the samples batch = (2 * P) * [0] + (3 * P) * [1] + (5 * P) * [2] batch = np.array([f"batch{b}" for b in batch]) batch0 = batch == "batch0" batch1 = batch == "batch1" batch2 = batch == "batch2" # 2 condition groups # - group 1: 50% batch 1, 33% batch 2, 60% batch 3 # - group 2: 50% batch 1, 67% batch 2, 40% batch 3 group = P * [0] + P * [1] + P * [0] + (2 * P) * [1] + (2 * P) * [0] + (3 * P) * [1] group = np.array([f"group{g}" for g in group]) assert len(batch) == M and len(group) == M # store clinical metadata (i.e. batch and group) as a DataFrame clinical = pd.DataFrame({"batch": batch, "group": group}) clinical.index = [f"sample{i}" for i in range(M)] # simulate data # random_state passes a seed to the PRNG for reproducibility counts = sim_rnaseq(N, M, batch=batch, group=group, random_state=42).T We then run the consensus clustering algorithm. .. repl:: from inmoose.consensus_clustering.consensus_clustering import consensusClustering from sklearn.cluster import AgglomerativeClustering cc = consensusClustering(AgglomerativeClustering) cc.compute_consensus_clustering(counts, random_state=None) We can now look at the clusters found. .. repl-quiet:: from matplotlib import rcParams # repl default config replaces '.' by '-' in the savefig.directory :/ rcParams['savefig.directory'] = rcParams['savefig.directory'].replace("readthedocs-org", "readthedocs.org") .. repl:: from anndata import AnnData from inmoose.utils import Factor import scanpy as sc ad = AnnData(counts, obs=clinical) for k in range(2, 11): # Factor ensures that cluster ID are interpreted as categorical data ad.obs[f"k={k}"] = Factor(cc.predict(k)) # compute the PCA sc.tl.pca(ad) # plot the PCA sc.pl.pca(ad, color=[f"k={k}" for k in range(2, 11)], return_fig=True).show() References ========== .. [Monti2003] S. Monti, P. Tamayo, J. Mesirov, T. Golub. 2003. Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. *Machine Learning* 52(1). :doi:`https://doi.org/10.1023/A:1023949509487` Code documentation ================== .. autosummary:: :toctree: generated/ ~consensus_clustering.consensusClustering