inmoose.limma.topTable

inmoose.limma.topTable(fit, coef=None, number=10, genelist=None, adjust_method='fdr_bh', sort_by='B', resort_by=None, p_value=1, fc=None, lfc=None, confint=False)

Extract a table of the top-ranked genes from a linear model fit

This function summarizes the linear model fit object produced by lmFit(), lm_series() or mrlm() by selecting the top-ranked genes for any given contrast, or for a set of contrasts. It assumes that the linear model fit has already been processed by eBayes().

If coef has a single value, then the moderated t-statistics and p-values for that coefficient or contrast are used. If coef takes two or more values, the moderated F-statistics for that set of coefficients or contrasts are used. If coef=None, then all the coefficients or contrasts in the fitted model are used, except that any coefficient named Intercept will be removed.

The p-values for the coefficient/contrast of interest are adjusted for multiple testing by a call to statsmodels.stats.multitest.multipletests(). The "fdr_bh" method, which controls the expected false discovery rate (FDR) below the specified value, is the default adjustment method because it is the most likely to be appropriate for microarray studies. Note that the adjusted p-values from this method are bounds on the FDR rather than p-values in the usual sense. Because they relate to FDRs rather than rejection probabilities, they are sometimes called q-values.

Note that if there is no good evidence for differential expression in the experiment, it is quite possible for all the adjusted p-values to be large, or even for all of them to equal one. In particular, all the adjusted p-values can equal one if the smallest p-value is no smaller than 1/ngenes, where ngenes is the number of genes with non-missing p-values.
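To make the adjustment concrete, here is a minimal pure-Python sketch of the Benjamini-Hochberg step-up adjustment that the "fdr_bh" method name refers to. It is an illustration only, not the code path used by topTable or statsmodels; the helper name bh_adjust and the input values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (illustrative sketch only)."""
    n = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values with some small entries.
some_signal = bh_adjust([0.01, 0.04, 0.03, 0.5])

# Made-up p-values whose smallest entry equals 1/ngenes (here 1/4):
# every adjusted value comes out as one.
no_signal = bh_adjust([0.25, 0.5, 0.75, 1.0])
```

In the second call each ratio p * n / rank equals one, so every adjusted value is one even though the raw p-values differ.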

The sort_by argument specifies the criterion used to select the top genes. The choices are "logFC" to sort by the (absolute) coefficient representing the log-fold-change; "A" to sort by average expression level (over all arrays) in descending order; "T" or "t" for the absolute t-statistic; "P" or "p" for p-values; or "B" for the lods or B-statistic.

Normally the genes appear in order of selection in the output table. If a different order is wanted, then the resort_by argument may be useful. For example, topTable(fit, sort_by="B", resort_by="logFC") selects the top genes according to log-odds of differential expression and then orders the selected genes by log-ratio in decreasing order. Similarly, topTable(fit, sort_by="logFC", resort_by="logFC") would select the genes by absolute log-fold-change and then sort them from most positive to most negative.
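The two-stage behaviour described above (select first, then reorder the selection) can be sketched in plain Python. This is only an illustration of the ordering logic, assuming a hypothetical helper select_then_resort and invented toy values; it does not call topTable.

```python
# Toy values, invented purely for illustration.
genes = [
    {"gene": "g1", "logFC": 0.5, "B": 4.0},
    {"gene": "g2", "logFC": -2.0, "B": 3.0},
    {"gene": "g3", "logFC": 1.5, "B": 2.0},
    {"gene": "g4", "logFC": 3.0, "B": -1.0},
]


def select_then_resort(rows, sort_key, resort_key, number):
    """Pick the top `number` rows by sort_key, then reorder them by resort_key."""
    selected = sorted(rows, key=sort_key, reverse=True)[:number]
    return sorted(selected, key=resort_key, reverse=True)


# Analogue of sort_by="B", resort_by="logFC": select by B, order by signed logFC.
by_b = select_then_resort(
    genes, lambda r: r["B"], lambda r: r["logFC"], number=3
)

# Analogue of sort_by="logFC", resort_by="logFC": select by |logFC|,
# then order from most positive to most negative.
by_abs = select_then_resort(
    genes, lambda r: abs(r["logFC"]), lambda r: r["logFC"], number=3
)

names_by_b = [r["gene"] for r in by_b]
names_by_abs = [r["gene"] for r in by_abs]
```

Note that the two calls select different genes: g4 has the largest fold change but the lowest B, so it only appears when selection is by absolute log-fold-change.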

topTable output for all probes in original (unsorted) order can be obtained by topTable(fit, sort_by="none", number=np.inf).

By default, the top number probes are listed. Alternatively, by specifying p_value and number=np.inf, all genes with adjusted p-values below the specified cutoff can be listed.
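The interaction between the cutoff and the row limit can be sketched as follows. The helper list_genes is hypothetical and the adjusted p-values are invented; the use of a non-strict comparison (<=) is an assumption, chosen so that the default cutoff of 1 keeps every gene.

```python
import math


def list_genes(adj_pvalues, p_value=1.0, number=10):
    """Return indices of genes passing the cutoff, best first, capped at `number`."""
    ranked = sorted(range(len(adj_pvalues)), key=lambda i: adj_pvalues[i])
    # Assumption: the comparison is non-strict, so p_value=1 keeps all genes.
    keep = [i for i in ranked if adj_pvalues[i] <= p_value]
    if math.isinf(number):
        return keep  # no cap: every gene passing the cutoff is listed
    return keep[: int(number)]


adj = [0.2, 0.01, 0.04, 0.9]  # made-up adjusted p-values

all_significant = list_genes(adj, p_value=0.05, number=math.inf)
top_one = list_genes(adj, p_value=0.05, number=1)
everything = list_genes(adj, number=math.inf)
```

With an infinite row limit the cutoff alone decides how many genes are listed; with a finite limit the cutoff and the limit both apply.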

The arguments fc and lfc give the ability to filter genes by log-fold change, but see the Notes below.
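As a rough sketch of how the two cutoffs relate, the parameter descriptions state that lfc equals log2(fc), that fc takes precedence when both are given, and that a gene is kept when at least one absolute log-fold-change exceeds the threshold. The helpers resolve_lfc and passes_cutoff below are hypothetical names used only for illustration.

```python
import math


def resolve_lfc(fc=None, lfc=None):
    """Return the log2-fold-change threshold implied by fc and lfc."""
    if fc is not None:
        return math.log2(fc)  # fc takes precedence over lfc
    return lfc  # may be None, meaning no fold-change filtering


def passes_cutoff(logfcs, threshold):
    """True if at least one absolute log-fold-change exceeds the threshold."""
    if threshold is None:
        return True
    return any(abs(x) > threshold for x in logfcs)


threshold = resolve_lfc(fc=2)          # a 2-fold change is 1 on the log2 scale
both_given = resolve_lfc(fc=4, lfc=0.5)  # fc wins

kept = passes_cutoff([0.3, -1.4], threshold)     # |-1.4| > 1
dropped = passes_cutoff([0.3, -0.8], threshold)  # no coefficient exceeds 1
```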

Notes

Although this function enables users to set both p-values and fold-change cutoffs, the use of fold-change cutoffs is not generally recommended. If the fold changes and p-values are not highly correlated, then the use of a fold change cutoff can increase the false discovery rate above the nominal level. Users wanting to use fold change thresholding are usually recommended to use treat() and topTreat() instead.

In general, the adjusted p-values returned by adjust_method="fdr_bh" remain valid as FDR bounds only when the genes remain sorted by p-value. Resorting the table by log-fold-change can increase the false discovery rate above the nominal level for genes at the top of the resorted table.

Parameters:
  • fit (MArrayLM) – linear model fit produced by lmFit(), lm_series(), gls_series() or mrlm()

  • coef (int or str) – column number or column name specifying which coefficient or contrast of the linear model is of interest. Can also be a vector of column subscripts, in which case the gene ranking is by F-statistic for that set of contrasts.

  • number (int) – maximum number of genes to list

  • genelist (array_like) – data frame or array containing gene information. Defaults to fit.genes

  • adjust_method ({"none", "fdr_bh", "fdr_by", "holm"}) – method used to adjust the p-values for multiple testing. See statsmodels.stats.multitest.multipletests() for the complete list of options. A None value will result in the default adjustment method, which is "fdr_bh".

  • sort_by ({ "logFC", "log2FoldChange", "AveExpr", "t", "P", "p", "B", "none" }) – string specifying which statistic to rank the genes by

  • resort_by ({ "logFC", "AveExpr", "t", "P", "p", "B", "none" }) – string specifying statistic to sort the selected genes by in the output data frame

  • p_value (float) – cutoff value for adjusted p-values. Only genes with lower p-values are listed

  • fc (float, optional) – minimum fold-change required

  • lfc (float, optional) – minimum log2-fold-change required, equal to log2(fc). fc and lfc are alternative ways to specify a fold-change cutoff; if both are specified, fc takes precedence. If specified, the results will include only genes with (at least one) absolute log-fold-change greater than lfc

  • confint (bool or float) – whether 95% confidence intervals should be output for logFC. Alternatively, can be a value between 0 and 1 specifying the required confidence level.

Returns:

DEResults if coef has a single value, otherwise pd.DataFrame. A dataframe with a row for each of the number top genes and the following columns:

  • genelist: one or more columns of probe annotation, if genelist was included as input

  • log2FoldChange: estimate of the log2-fold-change corresponding to the effect or contrast (DEResults only)

  • CI_L: left limit of confidence interval for logFC, if confint=True or confint is a numeric value

  • CI_R: right limit of confidence interval for logFC, if confint=True or confint is a numeric value

  • AveExpr: average log2-expression for the probe over all arrays and channels, same as Amean in the MArrayLM object

  • stat: moderated t-statistic (DEResults only)

  • F: moderated F-statistic (pd.DataFrame only)

  • pvalue: raw p-value

  • adj_pvalue: adjusted p-value or q-value

  • B: log-odds that the gene is differentially expressed

Return type:

DEResults or pd.DataFrame