edgepy

This module is a partial port in Python of the R Bioconductor edgeR package. Only the functionalities necessary to inmoose.pycombat.pycombat_seq() and differential expression analysis have been ported so far.

Differential Expression Analysis Example

We give below an example of how to use edgepy to perform a differential expression analysis on the pasilla dataset.

>>> from inmoose.data.pasilla import pasilla
>>> from inmoose.edgepy import DGEList, glmLRT, topTags
>>> from patsy import dmatrix
>>> 
>>> # load the pasilla dataset as an AnnData
>>> pas = pasilla()
>>> 
>>> # extract the count matrix and the annotation dataframe from the AnnData object
>>> counts = pas.X.T
>>> anno = pas.obs
>>> # build the design matrix
>>> design = dmatrix("~condition", data=anno)
>>> 
>>> # build a DGEList object
>>> dge_list = DGEList(counts=counts, samples=anno, group_col="condition", genes=pas.var)
>>> # estimate the dispersions
>>> dge_list.estimateGLMCommonDisp(design=design)
<inmoose.edgepy.DGEList.DGEList object at 0x7f5a73866ad0>
>>> 
>>> # fit the GLM
>>> fit = dge_list.glmFit(design=design)
>>> 
>>> # run a differential expression analysis based on LRT
>>> lrt = glmLRT(fit)
>>> 
>>> topTags(lrt)
Coefficient: ['condition[T.untreated]']
                logFC    logCPM          LR        PValue
gene_id                                                  
FBgn0000003  4.545498  5.865508  393.136662  1.718024e-87
FBgn0000008  4.086658  4.382093  284.081650  9.686265e-64
FBgn0000014 -2.949233  6.940623  257.805485  5.162684e-58
FBgn0000015  3.126516  8.439420  247.983708  7.145234e-56
FBgn0000017 -3.016862  4.143248  219.727035  1.037348e-49
FBgn0000018  3.439618  4.173629  218.438723  1.981268e-49
FBgn0000022  4.801664  2.531135  208.330035  3.178423e-47
FBgn0000024 -2.731359  4.811910  200.507111  1.618712e-45
FBgn0000028 -3.453653  2.469176  187.327970  1.217864e-42
FBgn0000032  3.559548  3.203304  186.856505  1.543539e-42

References

[Chen2016]

Y. Chen, A.T.L Lun, G.K. Smyth. 2016. From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Research 5, 1438. doi:10.12688/f1000research.8987.2

[Gibbons1975]

J.D. Gibbons, J.W. Pratt. 1975. P-values: interpretation and methodology. The American Statistician 29, 20-25. doi:10.1080/00031305.1975.10479106

[Lun2016]

A.T.L. Lun, Y. Chen, G.K. Smyth. 2016. It’s DE-licious: a recipe for differential expression analyses of RNA-seq experiments using quasi-likelihood methods in edgeR. Methods in Molecular Biology 1418, 391-416. doi:10.1007/978-1-4939-3578-9_19

[Lund2012]

S.P. Lund, D. Nettleton, D.J. McCarthy, G.K. Smyth. 2012. Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical Applications in Genetics and Molecular Biology Volume 11, Issue 5, Article 8. doi:10.1515/1544-6115.1826

[Lun2017]

A.T.L. Lun, G.K. Smyth. 2017. No counts, no variance: allowing for loss of degrees of freedom when assessing biological variability from RNA-seq data. Statistical Applications in Genetics and Molecular Biology 16(2), 83-93. doi:10.1515/sagmb-2017-0010

[McCarthy2012]

D. J. McCarthy, Y. Chen, G. K. Smyth. 2012. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research 40, 4288-4297. doi:10.1093/nar/gks042

[Phipson2016]

B. Phipson, S. Lee, I.J. Majewski, W. S. Alexander, G.K. Smyth. 2016. Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression. Annals of Applied Statistics 10, 946-963. doi:10.1214/16-AOAS920

[Robinson2008]

M.D. Robinson, g.K. Smyth. 2008. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9, 321-332. doi:10.1093/biostatistics/kxm030

Code documentation

`DGEList`(counts[, lib_size, norm_factors, ...])	A class for storing read counts and associated information from digital gene expression or sequencing technologies.
`addPriorCount`(y[, lib_size, offset, prior_count])	Add a library size-adjusted prior count to each observation.
`adjustedProfileLik`(dispersion, y, design, offset)	Compute adjusted profile log-likelihoods for the dispersion parameters of genewise negative binomial GLMs.
`aveLogCPM`(y[, lib_size, offset, ...])	Compute average log2 counts per million for each row of counts.
`binomTest`(y1, y2[, n1, n2, p])	Computes p-values for differential abundance for each gene between two digital libraries, conditioning on the total counts for each gene.
`designAsFactor`(design)	Convert a general design matrix into a oneway layout if that is possible.
`dispCoxReid`(y[, design, offset, weights, ...])	Estimate a common dispersion parameter across multiple negative binomial GLMs, by maximizing the Cox-Reid adjusted profile likelihood.
`dispCoxReidInterpolateTagwise`(y, design, ...)	Estimate genewise dispersion parameters across multiple negative binomial GLMs using weighted Cox-Reid adjusted profile likelihood and cubic spline interpolation over a genewise grid.
`estimateGLMCommonDisp`(y[, design, offset, ...])	Estimate a common negative binomial dispersion parameter for a DGE dataset with a general experimental design.
`estimateGLMTagwiseDisp`(y, dispersion[, ...])	Compute an empirical Bayes estimate of the negative binomial dispersion parameter for each tag, with expression levels specified by a log-linear model.
`exactTest`(self[, pair, dispersion, ...])	Compute genewise exact tests for differences in the means between two groups of negative-binomially distributed counts.
`exactTestBetaApprox`(y1, y2[, dispersion])	Compute genewise p-values for differences in the means between two groups of negative-binomially distributed counts.
`exactTestByDeviance`(y1, y2[, dispersion])	Compute genewise p-values for differences in the means between two groups of negative-binomially distributed counts.
`exactTestBySmallP`(y1, y2[, dispersion])	Compute genewise p-values for differences in the means between two groups of negative-binomially distributed counts.
`exactTestDoubleTail`(y1, y2[, dispersion, ...])	Compute genewise p-values for differences in the means between two groups of negative-binomially distributed counts.
`glmFit`(y[, design, dispersion, offset, ...])	Fit a negative binomial generalized log-linear model to the read counts for each gene.
`glmLRT`(glmfit[, coef, contrast])	Conduct genewise statistical tests for a given coefficient or coefficient contrast.
`glmQLFit`(y[, design, dispersion, offset, ...])	Fit a quasi-likelihood negative binomial generalized log-linear model to count data.
`glmQLFTest`(glmfit[, coef, contrast, ...])	Conduct genewise statistical tests for a given coefficient or contrast
`mglmLevenberg`(y, design[, dispersion, ...])	Fit genewise negative binomial GLMs with log-link using Levenberg damping to ensure convergence.
`mglmOneGroup`(y[, dispersion, offset, ...])	Fit single-group negative-binomial GLMs genewise.
`mglmOneWay`(y[, design, group, dispersion, ...])	Fit multiple negative binomial GLMs with log-link by Fisher scoring with a single explanatory factor in the model.
`movingAverageByCol`(x[, width, full_length])	Moving average smoother for matrix columns.
`nbinomDeviance`(y, mean[, dispersion, weights])	Residual deviances for row-wise negative binomial GLMs.
`plotQLDisp`(glmfit[, xlab, ylab, col_shrunk, ...])	Plot the genewise quasi-likelihood dispersion against the gene abundance (in log2 counts per million)
`predFC`(y, design[, prior_count, offset, ...])	Compute estimated coefficients for a negative binomial GLM in such a way that the log-fold-changes are shrunk towards zero.
`splitIntoGroups`(y[, group])	Split the counts according to group
`systematicSubset`(n, order_by)	Take a systematic subset of indices stratified by a ranking variable
`topTags`(self[, n, adjust_method, sort_by, ...])	Extract the most differentially expressed genes (or sequence tags) from a test object, ranked either by p-value or by absolute log-fold-change.
`validDGEList`(y)	Check for standard components of DGEList object