inmoose.limma.eBayes
- inmoose.limma.eBayes(fit, proportion=0.01, stdev_coef_lim=(0.1, 4), trend=False, robust=False, winsor_tail_p=(0.05, 0.1))
Empirical Bayes statistics for differential expression
Given a linear model fit from
lmFit(), compute moderated t-statistics, moderated F-statistics, and log-odds of differential expression by empirical Bayes moderation of the standard errors towards a global value.This function is used to rank genes in order of evidence for differential expression. They use an empirical Bayes method to squeeze the genewise residual variances towards a common value (or towards a global trend) [Smyth2004] [Phipson2016]. The degrees of freedom for the individual variances are increased to reflect the extra information gained from the empirical Bayes moderation, resulting in increased statistical power to detect differential expression.
This function accepts as input a
MArrayLMfitted modelfitproduced bylmFit(). The columns offitdefine a set of contrasts which are to be tested equal to zero. The fitted model object may have been processed bycontrasts_fit()before being passed toeBayes()to convert the coefficients of the original design matrix into an arbitrary number of contrasts.The empirical Bayes moderated t-statistics test each individual contrast equal to zero. For each gene (row), the moderated F-statistic tests whether all the contrasts are zero. The F-statistic is an overall test computed from the set of t-statistics for that probe. This is exactly analogous to the relationship between t-tests and F-statistics in conventional ANOVA, except that the residual mean squares have been moderated between genes.
The estimates
s2_prioranddf_priorare computed byfitFDist().s2_postis the weighted average ofs2_priorandsigma**2with weights proportional todf_prioranddf_residualrespectively. The log-odds of differential expressionlodswas called the B-statistic by [Loennstedt2002]. The F-statisticsFare computed byclassifyTestsF()withfstats_only=True.eBayes()does not compute ordinary t-statistics because they always have worse performance than the moderated versions. The ordinary (unmoderated) t-statistics can, however, be easily extracted from the linear model output for comparison purposes – see the example code below.The use of
eBayes()withtrend=Trueis known as the limma-trend method [Law2014] [Phipson2016]. With this option, an intensity-dependent trend is fitted to the prior variancess2_prior. Specifically,squeezeVar()is called with thecovariateequal toAmean, the average log2-intensity for each gene. The trend that is fitted can be examined byplotSA(). limma-trend is useful for processing expression values that show a mean-variance relationship. This is often useful for microarray data, and it can also be applied to RNA-Seq counts that have been converged to log2-counts per million (logCPM) values [Law2014]. When applied to RNA-Seq logCPM values, limma-trend give similar results to thevoom()method. The voom method incorporates the mean-variance trend into the precision weights, whereas limma-trend incorporates the trend into the empirical Bayed moderation. limma-trend is somewhat simpler thanvoombecause it assumes that the sequencing depths (library sizes) are wildly different between the samples and it applies the mean-variance trend on a genewise basis instead to individual observations. limma-trend is recommended for RNA-Seq analysis when the library sizes are reasonably consistent (less than 3-fold difference from smallest to largest) because of its simplicity and speed.If
robust=True, then the robust empirical Bayes procedure of [Phipson2016] is used. This is frequently useful to protect the empirical Bayes procedure against hyper-variable or hypo-variable genes, especially when analysing RNA-Seq data. SeesqueezeVar()for more details.- Parameters:
fit (MArrayLM) – fitted model object produce by
lmFit()orcontrasts_fit()proportion (float) – value between 0 and 1, assumed proportion of genes which are differentially expressed
stdev_coef_lim (pair of floats) – assumed lower and upper limits for the standard deviations of log2-fold-changes for differentially expressed genes
trend (bool or array_like) – whether an intensity-dependent trend should be allowed for the prior variance. If
Falsethen the prior variance is constant. Alternatively,trendcan be a row-wise array, which will be used as the covariate for the prior variance.robust (bool) – whether the estimation of
df_priorandvar_priorshould be robustified against outlier sample variances.
- Returns:
an object containing everything found in
fitplus the following added attributes:t: matrix of moderated t-statisticsp_value: matrix of two-sided p-values corresponding to the t-statisticslodsmatrix giving the log-odds of differential expression (on the natural log scale)s2_prior: estimated prior value forsigma**2. A row-wise array ifcovariateis non-None, otherwise a single valuedf_prior: degrees of freedom associated withs2_prior. A row-wise array ifrobust=True, otherwise a single valuedf_total: row-wise array giving the total degrees of freedom associated with the t-statistics for each gene. Equal todf_prior + df_residualorsum(df_residual), whichever is smaller.s2_post: row-wise array giving the posterior values forsigma**2var_prior: column-wise array giving estimated prior values for the variance of the log2-fold-changes for differentially expressed gene for each contrast. Used for evaluatinglods.F: row-wise array of moderated F-statistics for testing all contrasts defined by the columns offitsimultaneously equal to zero.F_p_value: row-wise array giving p-values corresponding toF
The matrices
t,p_value,lodshave the same dimensions as the input objectfit, with rows corresponding to genes and columns to coefficients or constrats. The vectors2_prior,df_prior,df_total,FandF_p_valuecorrespond to rows, with length equal to the number of genes. The vectorvar_priorcorresponds to columns, with length equal to the number of contrasts. Ifs2_priorordf_priorhave length 1, then the same value applies to all genes.s2_prior,df_priorandvar_priorcontain empirical Bayes hyperparameters used to obtaindf_total,s2_postandlods.- Return type: