inmoose.limma.eBayes

inmoose.limma.eBayes(fit, proportion=0.01, stdev_coef_lim=(0.1, 4), trend=False, robust=False, winsor_tail_p=(0.05, 0.1))

Empirical Bayes statistics for differential expression

Given a linear model fit from lmFit(), compute moderated t-statistics, moderated F-statistics, and log-odds of differential expression by empirical Bayes moderation of the standard errors towards a global value.

This function is used to rank genes in order of evidence for differential expression. They use an empirical Bayes method to squeeze the genewise residual variances towards a common value (or towards a global trend) [Smyth2004] [Phipson2016]. The degrees of freedom for the individual variances are increased to reflect the extra information gained from the empirical Bayes moderation, resulting in increased statistical power to detect differential expression.

This function accepts as input a MArrayLM fitted model fit produced by lmFit(). The columns of fit define a set of contrasts which are to be tested equal to zero. The fitted model object may have been processed by contrasts_fit() before being passed to eBayes() to convert the coefficients of the original design matrix into an arbitrary number of contrasts.

The empirical Bayes moderated t-statistics test each individual contrast equal to zero. For each gene (row), the moderated F-statistic tests whether all the contrasts are zero. The F-statistic is an overall test computed from the set of t-statistics for that probe. This is exactly analogous to the relationship between t-tests and F-statistics in conventional ANOVA, except that the residual mean squares have been moderated between genes.

The estimates s2_prior and df_prior are computed by fitFDist(). s2_post is the weighted average of s2_prior and sigma**2 with weights proportional to df_prior and df_residual respectively. The log-odds of differential expression lods was called the B-statistic by [Loennstedt2002]. The F-statistics F are computed by classifyTestsF() with fstats_only=True.

eBayes() does not compute ordinary t-statistics because they always have worse performance than the moderated versions. The ordinary (unmoderated) t-statistics can, however, be easily extracted from the linear model output for comparison purposes – see the example code below.

The use of eBayes() with trend=True is known as the limma-trend method [Law2014] [Phipson2016]. With this option, an intensity-dependent trend is fitted to the prior variances s2_prior. Specifically, squeezeVar() is called with the covariate equal to Amean, the average log2-intensity for each gene. The trend that is fitted can be examined by plotSA(). limma-trend is useful for processing expression values that show a mean-variance relationship. This is often useful for microarray data, and it can also be applied to RNA-Seq counts that have been converged to log2-counts per million (logCPM) values [Law2014]. When applied to RNA-Seq logCPM values, limma-trend give similar results to the voom() method. The voom method incorporates the mean-variance trend into the precision weights, whereas limma-trend incorporates the trend into the empirical Bayed moderation. limma-trend is somewhat simpler than voom because it assumes that the sequencing depths (library sizes) are wildly different between the samples and it applies the mean-variance trend on a genewise basis instead to individual observations. limma-trend is recommended for RNA-Seq analysis when the library sizes are reasonably consistent (less than 3-fold difference from smallest to largest) because of its simplicity and speed.

If robust=True, then the robust empirical Bayes procedure of [Phipson2016] is used. This is frequently useful to protect the empirical Bayes procedure against hyper-variable or hypo-variable genes, especially when analysing RNA-Seq data. See squeezeVar() for more details.

Parameters:

fit (MArrayLM) – fitted model object produce by lmFit() or contrasts_fit()
proportion (float) – value between 0 and 1, assumed proportion of genes which are differentially expressed
stdev_coef_lim (pair of floats) – assumed lower and upper limits for the standard deviations of log2-fold-changes for differentially expressed genes
trend (bool or array_like) – whether an intensity-dependent trend should be allowed for the prior variance. If False then the prior variance is constant. Alternatively, trend can be a row-wise array, which will be used as the covariate for the prior variance.
robust (bool) – whether the estimation of df_prior and var_prior should be robustified against outlier sample variances.

Returns:

an object containing everything found in fit plus the following added attributes:

t: matrix of moderated t-statistics
p_value: matrix of two-sided p-values corresponding to the t-statistics
lods matrix giving the log-odds of differential expression (on the natural log scale)
s2_prior: estimated prior value for sigma**2. A row-wise array if covariate is non-None, otherwise a single value
df_prior: degrees of freedom associated with s2_prior. A row-wise array if robust=True, otherwise a single value
df_total: row-wise array giving the total degrees of freedom associated with the t-statistics for each gene. Equal to df_prior + df_residual or sum(df_residual), whichever is smaller.
s2_post: row-wise array giving the posterior values for sigma**2
var_prior: column-wise array giving estimated prior values for the variance of the log2-fold-changes for differentially expressed gene for each contrast. Used for evaluating lods.
F: row-wise array of moderated F-statistics for testing all contrasts defined by the columns of fit simultaneously equal to zero.
F_p_value: row-wise array giving p-values corresponding to F

The matrices t, p_value, lods have the same dimensions as the input object fit, with rows corresponding to genes and columns to coefficients or constrats. The vector s2_prior, df_prior, df_total, F and F_p_value correspond to rows, with length equal to the number of genes. The vector var_prior corresponds to columns, with length equal to the number of contrasts. If s2_prior or df_prior have length 1, then the same value applies to all genes.

s2_prior, df_prior and var_prior contain empirical Bayes hyperparameters used to obtain df_total, s2_post and lods.

Return type:

MArrayLM