limma

This module is a partial port in Python of the R Bioconductor limma package.

Introduction

limma is a package for the analysis of gene expression microarray data, especially the use of linear models for analysing designed experiments and the assessment of differential expression. limma provides the ability to analyse comparisons between many RNA targets simultaneously in arbitrarily complicated designed experiments. Empirical Bayesian methods are used to provide stable results even when the number of arrays is small. The linear model and differential expression functions apply to all gene expression technologies, including microarrays, RNA-Seq and quantitative PCR.

There are three types of documentation available:

the LIMMA User’s Guide can be reached through the “User Guides and Package Vignettes” links at the top of the LIMMA contents page.

an overview of limma functions grouped by purpose is contained in the chapters of the present page

the Code documentation section gives an alphabetical index of detailed help topics

Classes Defined in this Module

This module defines the following data classes:

MArrayLM: store the result of fitting gene-wise linear models to the normalized intensities or log-ratios. Usually created by lmFit(). Objects of this class normally contain only one row for each unique probe.
TestResults: store the results of testing a set of contrasts equal to zero for each probe. Usually created by decideTests(). Objects of this class normally contain one row for each unique probe.

Linear Models for Microarrays

This section gives an overview of the LIMMA functions available to fit linear models and to interpret the results. This section covers models for two color arrays in terms of log-ratios or for single-channel arrays in terms of log-intensities. If you wish to fit models to the individual channel log-intensities from two color arrays, see Individual Channel Analysis of Two-Color Microarrays.

The core of this module is the fitting of gene-wise linear models to microarray data. The basic idea is to estimate log-ratios between two or more target RNA samples simultaneously. See the LIMMA User’s Guide for several case studies.

Fitting Models

The main function for model fitting is lmFit(). This is the recommended interface for most users. lmFit() produces a fitted model object of class MArrayLM containing coefficients, standard errors and residual standard errors for each gene. lmFit() calls one of the following three functions to do the actual computations:

lm_series() Straightforward least squares fitting of a linear model for each gene.
mrlm() An alternative to lm_series() using robust regression as implemented by the rlm function in the MASS package.
gls_series() Generalized least squares taking into account correlations between duplicate spots (i.e. replicate spots on the same array) or related arrays. The function duplicateCorrelation() is used to estimate the inter-duplicate or inter-block correlation before using gls_series().

All the function which fit linear models use getEAWP() to extract data from microarray data objects, and unwrapdups() which provides a unified method for handling duplicate spots.

Forming the Design Matrix

lmFit() has two main arguments: the expression data and the design matrix. The design matrix is essentially an indicator matrix which specifies which target RNA samples were applied to each channel on each array. There is considerable freedom in choosing the design matrix – there is always more than one choice which is correct provided it is interpreted correctly.

Design matrices for Affymetrix or single-color arrays can be created using the function patsy.dmatrix(). The function modelMatrix() is provided to assist with the creation of an appropriate design matrix for two-color microarray experiments. For direct two-color designs, without a common reference, the design matrix often needs to be created by hand.

Making Comparisons of Interest

Once a linear model has been fit using an appropriate design matrix, the function makeContrasts() may be used to form a contrast matrix to make comparisons of interest. The fit and the contrast matrix are used by contrasts_fit() to compute fold changes and t-statistics for the contrasts of interest. This is a way to compute all possible pairwise comparisons between treatments, for example in an experiment which compares many treatments to a common reference.

Assessing Differential Expression

After fitting a linear model, the standard errors are moderated using a simple empirical Bayes model using eBayes() or treat(). A moderated t-statistic and a log-odds of differential expression is computed for each contrast for each gene. treat() tests whether log-fold-changes are greater than a threshold rather than merely different to zero.

eBayes() and treat() use internal functions squeezeVar(), fitFDist(), tmixture_matrix() and tmixture_vector().

Summarizing Model Fits

After the above steps, the results may be displayed or further processed using:

topTable() Presents a list of the genes most likely to be differentially expressed for a given contrast or set of contrasts.
topTableF() Presents a list of the genes most likely to be differentially expressed for a given set of contrasts. Equivalent to topTable() with coef set to all the coefficients, coef=range(fit.shape[1]).
volcanoplot() Volcano plot of fold change versus the B-statistic for any fitted coefficient.
plotlines() Plots fitted coefficients or log-intensity values for time-course data.
genas() Estimates biological correlation between two coefficients.
write_fit() Writes a MArrayLM object to a file. Note that if fit is a MArrayLM object, either write_fit() or write_table() can be used to write the results to a delimited text file.

For multiple testing functions which operate on linear model fits, see Hypothesis Testing for Linear Models.

Model Selection

selectModel() provides a means to choose between alternative linear models using AIC or BIC information criteria.

Individual Channel Analysis of Two-Color Microarrays

This section gives an overview of the LIMMA functions fit linear models to two-color microarray data in terms of the log-intensities rather than log-ratios.

The function intrapotCorrelation() estimates the intra-spot correlation between the two channels. The regression function lmscFit() takes the correlation as an argument and fits linear models to the two-color data in terms of the individual log-intensities. The output of lmscFit() is a MArrayLM object just the same as from lmFit(), so inference proceeds the same way as for log-ratios once the linear model is fitted. See Linear Models for Microarrays.

The function targetsA2C() converts two-color format target data frames to single channel format, i.e. converts from array-per-line to channel-per-line, to facilitate the formulation of the design matrix.

Hypothesis Testing for Linear Models

LIMMA provides a number of functions for multiple testing across both contrasts and genes. The starting point is a MArrayLM object, called fit say, resulting from fitting a linear model and running eBayes() and, optionally, contrasts_fit(). See Linear Models for Microarrays or Individual Channel Analysis of Two-Color Microarrays for details.

Multiple Testing across Genes and Contrasts

The key function is decideTests(). This function writes an object of class TestResults, which is basically a matrix of \(-1\), \(0\) or \(1\) elements, of the same dimension as fit_coefficients, indicating whether each coefficient is significantly different from zero. A number of multiple testing strategies are provided. decideTests() calls classifyTestsF() to implement the nested F-test strategy.

selectModel() chooses between linear models for each probe using AIC or BIC criteria. This is an alternative to hypothesis testing and can choose between non-nested models.

A number of other functions are provided to display the results of decideTests(). The function heatDiagram() displays the results in a heat-map style display. This allows visual comparison of the results across many different conditions in the linear model.

The functions vennCounts() and vennDiagram() provide Venn diagrams style summaries of the results.

Summary and show() method exists for objects of class TestResults.

The results from decideTests() can also be included when the results of a linear model fit are written to a file using write_fit().

References

[Law2014]

C.W. Law, Y. Chen, W. Shi, G.K. Smyth. 2014. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology 15(R29). doi:10.1186/gb-2014-15-2-r29

[Loennstedt2002]

I. Loennstedt, T.P. Speed. 2002. Replicated microarray data. Statistica Sinica 12(31-46).

[Michaud2008]

J. Michaud, K.M. Simpson, R. Escher, K. Buchet-Poyau, R. Beissbarth, C. Carmichael, M.E. Ritchie, F. Schutz, P. Cannon, M. Liu, X. Shen, Y. Ito, W.H. Raskind, M.S. Horwitz, M. Osato, D.R. Turner, T.P. Speed, M. Kavallaris, G.K. Smyth, H.S. Scott. 2008. Integrative analysis of RUNX1 downstream pathways and target genes. BMC Genomics 9(363). doi:10.1186/1471-2164-9-363

[Ritchie2015]

M.E. Ritchie, B. Phipson, D. Wu, Y. Hu, C.W. Law, W. Shi, G.K. Smyth. 2015. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7). doi:10.1093/nar/gkv007

[Sartor2006]

M.A. Sartor, C.R. Tomlinson, S.C. Wesselkamper, S. Sivaganesan, G.D. Leikauf, M. Medvedovic. 2006. Intensity-based hierarchical Bayes method improves testing for differentially expressed genes in microarray experiments. BMC Bioinformatics 7(538). doi:10.1186/1471-2105-7-538

[Smyth2004]

G.K. Smyth. 2004. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3(1). doi:10.2202/1544-6115.1027

Code documentation

`MArrayLM`(coefficients, stdev_unscaled, ...)	A class to store the results of fitting gene-wise linear models to a set of microarrays
`TestResults`(df, args, *kwargs)	A matrix-based class for storing the results of simultaneous tests.
`classifyTestsF`(self[, cor_matrix, df, ...])	For each gene, classify a series of related t-statistics as significantly up or down using nested F-tests.
`contrasts_fit`(fit[, contrasts, coefficients])	Compute contrasts from linear model fit
`eBayes`(fit[, proportion, stdev_coef_lim, ...])	Empirical Bayes statistics for differential expression
`fitFDist`(x, df1[, covariate])	Moment estimation of the parameters of a scaled F-distribution given one of the degrees of freedom.
`lmFit`(obj[, design, ndups, spacing, block, ...])	Fit linear models for each gene given a series of arrays
`lm_series`(M[, design, ndups, spacing, weights])	Fit linear model to microarray data by ordinary least squares
`makeContrasts`(contrasts, levels)	Construct the contrast matrix corresponding to specified contrasts of a set of parameters.
`squeezeVar`(var, df[, covariate, robust, ...])	Squeeze a set of sample variances together by computing empirical Bayes posterior means This function implements empirical Bayes algorithms proposed by [Smyth2004] and [Phipson2016].
`topTable`(fit[, coef, number, genelist, ...])	Extract a table of the top-ranked genes from a linear model fit
`tmixture_matrix`(tstat, stdev_unscaled, df, ...)	Estimate scale factor in mixture of t-distributions
`tmixture_vector`(tstat, stdev_unscaled, df, ...)	Estimate scale factor in mixture of t-distributions