Differential Expression Analysis

InMoose offers a Python port of the well-known R Bioconductor packages:

Note that not all features of the R packages are necessarily ported. Extending the functionality of these modules will be based on user requests, so do not hesitate to open an issue if your favorite feature is missing.

In addition, InMoose provides a meta-analysis feature to combine the results from different differential expression analysis tools.

Please refer to [Colange2024] for a detailed comparison of InMoose implementation with the original R implementations.

Differential Expression Meta-Analysis

We illustrate the differential expression meta-analysis capabilities of InMoose along two approaches:

  • the Aggregate Data (AD) approach consists in running classical differential expression tools on individual cohorts then combining the results through e.g. random-effect models.

  • the Individual Sample Data (ISD) consists in merging individual cohorts into a large meta-cohort, accounting for batch effects to eliminate inter-cohort biases, then running a classical differential expression analysis on the resulting meta-cohort.

We start by simulating RNA-Seq data, using the sim module of InMoose.

>>> import numpy as np
>>> import pandas as pd
>>> from inmoose.sim import sim_rnaseq
>>> 
>>> # number of genes
>>> N = 10000
>>> # number of samples
>>> M = 2000
>>> assert M % 10 == 0
>>> P = M // 10  # 10% of M, helper variable
>>> 
>>> # 3 batches: 20% 30% 50% of the samples
>>> batch = (2 * P) * [0] + (3 * P) * [1] + (5 * P) * [2]
>>> batch = np.array([f"batch{b}" for b in batch])
>>> batch0 = batch == "batch0"
>>> batch1 = batch == "batch1"
>>> batch2 = batch == "batch2"
>>> 
>>> # 2 condition groups
>>> #   - group 1: 50% batch 1, 33% batch 2, 60% batch 3
>>> #   - group 2: 50% batch 1, 67% batch 2, 40% batch 3
>>> group = P * [0] + P * [1] + P * [0] + (2 * P) * [1] + (2 * P) * [0] + (3 * P) * [1]
>>> group = np.array([f"group{g}" for g in group])
>>> assert len(batch) == M and len(group) == M
>>> 
>>> # store clinical metadata (i.e. batch and group) as a DataFrame
>>> clinical = pd.DataFrame({"batch": batch, "group": group})
>>> clinical.index = [f"sample{i}" for i in range(M)]
>>> 
>>> # simulate data
>>> # random_state passes a seed to the PRNG for reproducibility
>>> counts = sim_rnaseq(N, M, batch=batch, group=group, random_state=42).T

We then run the two meta-analysis approaches on the obtained data.

>>> from inmoose.deseq2 import DESeq, DESeqDataSet
>>> from inmoose.diffexp import meta_de
>>> from inmoose.pycombat import pycombat_seq
>>> 
>>> # run AD meta-analysis on batches 1, 2 and 3
>>> # first run the differential expression analysis on each batch individually
>>> cohorts = [DESeqDataSet(counts[b], clinical.loc[b], design="~ group")
...            for b in [batch0, batch1, batch2]]
>>> individual_de = [DESeq(c).results() for c in cohorts]
[INFO] estimating size factors
[INFO] estimating dispersions
[INFO] gene-wise dispersion estimates
[INFO] mean-dispersion relationship
[INFO] final dispersion estimates
[INFO] fitting model and testing
[INFO] estimating size factors
[INFO] estimating dispersions
[INFO] gene-wise dispersion estimates
[INFO] mean-dispersion relationship
[INFO] final dispersion estimates
[INFO] fitting model and testing
[INFO] estimating size factors
[INFO] estimating dispersions
[INFO] gene-wise dispersion estimates
[INFO] mean-dispersion relationship
[INFO] final dispersion estimates
[INFO] fitting model and testing
>>> # then aggregate the results
>>> ad = meta_de([de for de in individual_de])
>>> 
>>> # run ISD meta-analysis on merged batches
>>> # first correct batch effects to properly merge batches
>>> # transpositions account for the difference of formats for pycombat_seq and deseq2
>>> harmonized_counts = pycombat_seq(counts.T, batch).T
[INFO] Found 3 batches
[INFO] Adjusting for 0 covariate(s) or covariate level(s)
[INFO] Estimating dispersions
[INFO] Fitting the GLM model
[INFO] shrinkage off - using GLM estimates for parameters
[INFO] Adjusting the data
>>> # then run the differential expression analysis on the merged batches
>>> isd = DESeq(DESeqDataSet(harmonized_counts, clinical, design="~ group")).results()
[INFO] estimating size factors
[INFO] estimating dispersions
[INFO] gene-wise dispersion estimates
[INFO] mean-dispersion relationship
[INFO] final dispersion estimates
[INFO] fitting model and testing
[INFO] 
        -- replacing outliers and refitting for 18 genes
        -- DESeq argument 'minReplicatesForReplace' = 7
        -- original counts are preserved in dds.counts()
        
[INFO] estimating dispersions
[INFO] fitting model and testing

We can now compare the results obtained by the two approaches.

>>> import matplotlib.pyplot as plt
>>> from scipy.stats import pearsonr
>>> 
>>> ax = plt.gca()
>>> ax.axline((0,0), slope=1, ls="--", lw=0.3, c=".4")
<matplotlib.lines.AxLine object at 0x75c46929c350>
>>> ax.scatter(isd["log2FoldChange"], ad["combined logFC"], s=3)
<matplotlib.collections.PathCollection object at 0x75c469881f40>
>>> corr = pearsonr(isd["log2FoldChange"], ad["combined logFC"]).statistic
>>> ax.annotate(f"Pearson correlation: {100*corr:.1f}%", (2, -4), size="small")
Text(2, -4, 'Pearson correlation: 98.0%')
>>> ax.set_title("ISD", loc="center")
Text(0.5, 1.0, 'ISD')
>>> ax.set_ylabel("AD", loc="center")
Text(0, 0.5, 'AD')
>>> 
>>> plt.show()
_images_repl/mpl-0_1.svg

It is possible to combine results obtained from different tools, as long as the results of the differential expression analysis are stored as DEResults. All three modules limma, edgepy and deseq2 return sub-classes of DEResults, thus allowing users to perform cross-technology meta-analysis (e.g. by combining results from limma with results from deseq2).

References

[Colange2024]

M. Colange, G. Appé, L. Meunier, S. Weill, A. Nordor, A. Behdenna. 2024. Differential Expression Analysis with InMoose, the Integrated Multi-Omic Open-Source Environment in Python. Bioarxiv. doi:10.1101/2024.11.14.623578

Code documentation

DEResults(df, *args, **kwargs)

A class to store the results of differential expression analysis

meta_de(de_results[, alpha, min_common_genes])

Combine logFC and p-values of differential expression analyses