edgepy

This module is a partial port in Python of the R Bioconductor edgeR package. Only the functionalities necessary to inmoose.pycombat.pycombat_seq() and differential expression analysis have been ported so far.

Differential Expression Analysis Example

We give below an example of how to use edgepy to perform a differential expression analysis on the pasilla dataset.

>>> from inmoose.data.pasilla import pasilla
>>> from inmoose.edgepy import DGEList, glmLRT, topTags
>>> from patsy import dmatrix
>>> 
>>> # load the pasilla dataset as an AnnData
>>> pas = pasilla()
>>> 
>>> # extract the count matrix and the annotation dataframe from the AnnData object
>>> counts = pas.X.T
>>> anno = pas.obs
>>> # build the design matrix
>>> design = dmatrix("~condition", data=anno)
>>> 
>>> # build a DGEList object
>>> dge_list = DGEList(counts=counts, samples=anno, group_col="condition", genes=pas.var)
>>> # estimate the dispersions
>>> dge_list.estimateGLMCommonDisp(design=design)
<inmoose.edgepy.DGEList.DGEList object at 0x7f5a73866ad0>
>>> 
>>> # fit the GLM
>>> fit = dge_list.glmFit(design=design)
>>> 
>>> # run a differential expression analysis based on LRT
>>> lrt = glmLRT(fit)
>>> 
>>> topTags(lrt)
Coefficient: ['condition[T.untreated]']
                logFC    logCPM          LR        PValue
gene_id                                                  
FBgn0000003  4.545498  5.865508  393.136662  1.718024e-87
FBgn0000008  4.086658  4.382093  284.081650  9.686265e-64
FBgn0000014 -2.949233  6.940623  257.805485  5.162684e-58
FBgn0000015  3.126516  8.439420  247.983708  7.145234e-56
FBgn0000017 -3.016862  4.143248  219.727035  1.037348e-49
FBgn0000018  3.439618  4.173629  218.438723  1.981268e-49
FBgn0000022  4.801664  2.531135  208.330035  3.178423e-47
FBgn0000024 -2.731359  4.811910  200.507111  1.618712e-45
FBgn0000028 -3.453653  2.469176  187.327970  1.217864e-42
FBgn0000032  3.559548  3.203304  186.856505  1.543539e-42

References

[Chen2016]

Y. Chen, A.T.L Lun, G.K. Smyth. 2016. From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Research 5, 1438. doi:10.12688/f1000research.8987.2

[Gibbons1975]

J.D. Gibbons, J.W. Pratt. 1975. P-values: interpretation and methodology. The American Statistician 29, 20-25. doi:10.1080/00031305.1975.10479106

[Lun2016]

A.T.L. Lun, Y. Chen, G.K. Smyth. 2016. It’s DE-licious: a recipe for differential expression analyses of RNA-seq experiments using quasi-likelihood methods in edgeR. Methods in Molecular Biology 1418, 391-416. doi:10.1007/978-1-4939-3578-9_19

[Lund2012]

S.P. Lund, D. Nettleton, D.J. McCarthy, G.K. Smyth. 2012. Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical Applications in Genetics and Molecular Biology Volume 11, Issue 5, Article 8. doi:10.1515/1544-6115.1826

[Lun2017]

A.T.L. Lun, G.K. Smyth. 2017. No counts, no variance: allowing for loss of degrees of freedom when assessing biological variability from RNA-seq data. Statistical Applications in Genetics and Molecular Biology 16(2), 83-93. doi:10.1515/sagmb-2017-0010

[McCarthy2012]

D. J. McCarthy, Y. Chen, G. K. Smyth. 2012. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research 40, 4288-4297. doi:10.1093/nar/gks042

[Phipson2016]

B. Phipson, S. Lee, I.J. Majewski, W. S. Alexander, G.K. Smyth. 2016. Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression. Annals of Applied Statistics 10, 946-963. doi:10.1214/16-AOAS920

[Robinson2008]

M.D. Robinson, g.K. Smyth. 2008. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9, 321-332. doi:10.1093/biostatistics/kxm030

Code documentation

DGEList(counts[, lib_size, norm_factors, ...])

A class for storing read counts and associated information from digital gene expression or sequencing technologies.

addPriorCount(y[, lib_size, offset, prior_count])

Add a library size-adjusted prior count to each observation.

adjustedProfileLik(dispersion, y, design, offset)

Compute adjusted profile log-likelihoods for the dispersion parameters of genewise negative binomial GLMs.

aveLogCPM(y[, lib_size, offset, ...])

Compute average log2 counts per million for each row of counts.

binomTest(y1, y2[, n1, n2, p])

Computes p-values for differential abundance for each gene between two digital libraries, conditioning on the total counts for each gene.

designAsFactor(design)

Convert a general design matrix into a oneway layout if that is possible.

dispCoxReid(y[, design, offset, weights, ...])

Estimate a common dispersion parameter across multiple negative binomial GLMs, by maximizing the Cox-Reid adjusted profile likelihood.

dispCoxReidInterpolateTagwise(y, design, ...)

Estimate genewise dispersion parameters across multiple negative binomial GLMs using weighted Cox-Reid adjusted profile likelihood and cubic spline interpolation over a genewise grid.

estimateGLMCommonDisp(y[, design, offset, ...])

Estimate a common negative binomial dispersion parameter for a DGE dataset with a general experimental design.

estimateGLMTagwiseDisp(y, dispersion[, ...])

Compute an empirical Bayes estimate of the negative binomial dispersion parameter for each tag, with expression levels specified by a log-linear model.

exactTest(self[, pair, dispersion, ...])

Compute genewise exact tests for differences in the means between two groups of negative-binomially distributed counts.

exactTestBetaApprox(y1, y2[, dispersion])

Compute genewise p-values for differences in the means between two groups of negative-binomially distributed counts.

exactTestByDeviance(y1, y2[, dispersion])

Compute genewise p-values for differences in the means between two groups of negative-binomially distributed counts.

exactTestBySmallP(y1, y2[, dispersion])

Compute genewise p-values for differences in the means between two groups of negative-binomially distributed counts.

exactTestDoubleTail(y1, y2[, dispersion, ...])

Compute genewise p-values for differences in the means between two groups of negative-binomially distributed counts.

glmFit(y[, design, dispersion, offset, ...])

Fit a negative binomial generalized log-linear model to the read counts for each gene.

glmLRT(glmfit[, coef, contrast])

Conduct genewise statistical tests for a given coefficient or coefficient contrast.

glmQLFit(y[, design, dispersion, offset, ...])

Fit a quasi-likelihood negative binomial generalized log-linear model to count data.

glmQLFTest(glmfit[, coef, contrast, ...])

Conduct genewise statistical tests for a given coefficient or contrast

mglmLevenberg(y, design[, dispersion, ...])

Fit genewise negative binomial GLMs with log-link using Levenberg damping to ensure convergence.

mglmOneGroup(y[, dispersion, offset, ...])

Fit single-group negative-binomial GLMs genewise.

mglmOneWay(y[, design, group, dispersion, ...])

Fit multiple negative binomial GLMs with log-link by Fisher scoring with a single explanatory factor in the model.

movingAverageByCol(x[, width, full_length])

Moving average smoother for matrix columns.

nbinomDeviance(y, mean[, dispersion, weights])

Residual deviances for row-wise negative binomial GLMs.

plotQLDisp(glmfit[, xlab, ylab, col_shrunk, ...])

Plot the genewise quasi-likelihood dispersion against the gene abundance (in log2 counts per million)

predFC(y, design[, prior_count, offset, ...])

Compute estimated coefficients for a negative binomial GLM in such a way that the log-fold-changes are shrunk towards zero.

splitIntoGroups(y[, group])

Split the counts according to group

systematicSubset(n, order_by)

Take a systematic subset of indices stratified by a ranking variable

topTags(self[, n, adjust_method, sort_by, ...])

Extract the most differentially expressed genes (or sequence tags) from a test object, ranked either by p-value or by absolute log-fold-change.

validDGEList(y)

Check for standard components of DGEList object