edgepy

This module is a partial port in Python of the R Bioconductor edgeR package. Only the functionalities necessary to inmoose.pycombat.pycombat_seq() and differential expression analysis have been ported so far.

Differential Expression Analysis Example

We give below an example of how to use edgepy to perform a differential expression analysis on the pasilla dataset.

>>> from inmoose.data.pasilla import pasilla
>>> from inmoose.edgepy import DGEList, glmLRT, topTags
>>> from patsy import dmatrix
>>> 
>>> # load the pasilla dataset as an AnnData
>>> pas = pasilla()
>>> 
>>> # extract the count matrix and the annotation dataframe from the AnnData object
>>> counts = pas.X.T
>>> anno = pas.obs
>>> # build the design matrix
>>> design = dmatrix("~condition", data=anno)
>>> 
>>> # build a DGEList object
>>> dge_list = DGEList(counts=counts, samples=anno, group_col="condition", genes=pas.var)
>>> # estimate the dispersions
>>> dge_list.estimateGLMCommonDisp(design=design)
<inmoose.edgepy.DGEList.DGEList object at 0x7f155a35dcd0>
>>> 
>>> # fit the GLM
>>> fit = dge_list.glmFit(design=design)
>>> 
>>> # run a differential expression analysis based on LRT
>>> lrt = glmLRT(fit)
>>> 
>>> topTags(lrt)
Coefficient: ['condition[T.untreated]']
                logFC    logCPM          LR        PValue
gene_id                                                  
FBgn0000003  4.545505  5.865509  393.393494  1.510489e-87
FBgn0000008  4.086659  4.382091  284.242373  8.935808e-64
FBgn0000014 -2.949233  6.940623  257.991933  4.701464e-58
FBgn0000015  3.126516  8.439421  248.166909  6.517421e-56
FBgn0000017 -3.016858  4.143243  219.858071  9.712747e-50
FBgn0000018  3.439621  4.173627  218.563065  1.861317e-49
FBgn0000022  4.801665  2.531154  208.407308  3.057399e-47
FBgn0000024 -2.731362  4.811909  200.637661  1.515937e-45
FBgn0000028 -3.453657  2.469177  187.408001  1.169845e-42
FBgn0000032  3.559537  3.203302  186.943820  1.477260e-42

References

[Chen2016]

Y. Chen, A.T.L Lun, G.K. Smyth. 2016. From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Research 5, 1438. doi:10.12688/f1000research.8987.2

[Gibbons1975]

J.D. Gibbons, J.W. Pratt. 1975. P-values: interpretation and methodology. The American Statistician 29, 20-25. doi:10.1080/00031305.1975.10479106

[Lun2016]

A.T.L. Lun, Y. Chen, G.K. Smyth. 2016. It’s DE-licious: a recipe for differential expression analyses of RNA-seq experiments using quasi-likelihood methods in edgeR. Methods in Molecular Biology 1418, 391-416. doi:10.1007/978-1-4939-3578-9_19

[Lund2012]

S.P. Lund, D. Nettleton, D.J. McCarthy, G.K. Smyth. 2012. Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical Applications in Genetics and Molecular Biology Volume 11, Issue 5, Article 8. doi:10.1515/1544-6115.1826

[Lun2017]

A.T.L. Lun, G.K. Smyth. 2017. No counts, no variance: allowing for loss of degrees of freedom when assessing biological variability from RNA-seq data. Statistical Applications in Genetics and Molecular Biology 16(2), 83-93. doi:10.1515/sagmb-2017-0010

[McCarthy2012]

D. J. McCarthy, Y. Chen, G. K. Smyth. 2012. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research 40, 4288-4297. doi:10.1093/nar/gks042

[Phipson2016]

B. Phipson, S. Lee, I.J. Majewski, W. S. Alexander, G.K. Smyth. 2016. Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression. Annals of Applied Statistics 10, 946-963. doi:10.1214/16-AOAS920

[Robinson2008]

M.D. Robinson, g.K. Smyth. 2008. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9, 321-332. doi:10.1093/biostatistics/kxm030

Code documentation

DGEList(counts[, lib_size, norm_factors, ...])

A class for storing read counts and associated information from digital gene expression or sequencing technologies.

addPriorCount(y[, lib_size, offset, prior_count])

Add a library size-adjusted prior count to each observation.

adjustedProfileLik(dispersion, y, design, offset)

Compute adjusted profile log-likelihoods for the dispersion parameters of genewise negative binomial GLMs.

aveLogCPM(y[, lib_size, offset, ...])

Compute average log2 counts per million for each row of counts.

binomTest(y1, y2[, n1, n2, p])

Computes p-values for differential abundance for each gene between two digital libraries, conditioning on the total counts for each gene.

designAsFactor(design)

Convert a general design matrix into a oneway layout if that is possible.

dispCoxReid(y[, design, offset, weights, ...])

Estimate a common dispersion parameter across multiple negative binomial GLMs, by maximizing the Cox-Reid adjusted profile likelihood.

dispCoxReidInterpolateTagwise(y, design, ...)

Estimate genewise dispersion parameters across multiple negative binomial GLMs using weighted Cox-Reid adjusted profile likelihood and cubic spline interpolation over a genewise grid.

estimateGLMCommonDisp(y[, design, offset, ...])

Estimate a common negative binomial dispersion parameter for a DGE dataset with a general experimental design.

estimateGLMTagwiseDisp(y, dispersion[, ...])

Compute an empirical Bayes estimate of the negative binomial dispersion parameter for each tag, with expression levels specified by a log-linear model.

exactTest(self[, pair, dispersion, ...])

Compute genewise exact tests for differences in the means between two groups of negative-binomially distributed counts.

exactTestBetaApprox(y1, y2[, dispersion])

Compute genewise p-values for differences in the means between two groups of negative-binomially distributed counts.

exactTestByDeviance(y1, y2[, dispersion])

Compute genewise p-values for differences in the means between two groups of negative-binomially distributed counts.

exactTestBySmallP(y1, y2[, dispersion])

Compute genewise p-values for differences in the means between two groups of negative-binomially distributed counts.

exactTestDoubleTail(y1, y2[, dispersion, ...])

Compute genewise p-values for differences in the means between two groups of negative-binomially distributed counts.

glmFit(y[, design, dispersion, offset, ...])

Fit a negative binomial generalized log-linear model to the read counts for each gene.

glmLRT(glmfit[, coef, contrast])

Conduct genewise statistical tests for a given coefficient or coefficient contrast.

glmQLFit(y[, design, dispersion, offset, ...])

Fit a quasi-likelihood negative binomial generalized log-linear model to count data.

glmQLFTest(glmfit[, coef, contrast, ...])

Conduct genewise statistical tests for a given coefficient or contrast

mglmLevenberg(y, design[, dispersion, ...])

Fit genewise negative binomial GLMs with log-link using Levenberg damping to ensure convergence.

mglmOneGroup(y[, dispersion, offset, ...])

Fit single-group negative-binomial GLMs genewise.

mglmOneWay(y[, design, group, dispersion, ...])

Fit multiple negative binomial GLMs with log-link by Fisher scoring with a single explanatory factor in the model.

movingAverageByCol(x[, width, full_length])

Moving average smoother for matrix columns.

nbinomDeviance(y, mean[, dispersion, weights])

Residual deviances for row-wise negative binomial GLMs.

plotQLDisp(glmfit[, xlab, ylab, col_shrunk, ...])

Plot the genewise quasi-likelihood dispersion against the gene abundance (in log2 counts per million)

predFC(y, design[, prior_count, offset, ...])

Compute estimated coefficients for a negative binomial GLM in such a way that the log-fold-changes are shrunk towards zero.

splitIntoGroups(y[, group])

Split the counts according to group

systematicSubset(n, order_by)

Take a systematic subset of indices stratified by a ranking variable

topTags(self[, n, adjust_method, sort_by, ...])

Extract the most differentially expressed genes (or sequence tags) from a test object, ranked either by p-value or by absolute log-fold-change.

validDGEList(y)

Check for standard components of DGEList object