inmoose.edgepy.exactTest

inmoose.edgepy.exactTest(self, pair=(0, 1), dispersion='auto', rejection_region='doubletail', big_count=900, prior_count=0.125)

Compute genewise exact tests for differences in the means between two groups of negative-binomially distributed counts.

This function tests for differential expression between two groups of count libraries. It implements the exact test proposed by [Robinson2008] for a difference in mean between two groups of negative binomial random variables. The function accepts two groups of count libraries and performs a test for each row of data. For each row, the test is conditional on the sum of counts for that row. The test can be viewed as a generalization of the well-known exact binomial test (implemented in binomTest()) to overdispersed counts.

This function is the main user-level function, and produces an object containing all the necessary components for downstream analysis. It calls one of the low-level functions exactTestDoubleTail(), exactTestBetaApprox(), exactTestBySmallP() or exactTestByDeviance() to do the p-value computation. The low-level functions all assume that the libraries have been normalized to have the same size, i.e. to have the same expected column sum under the null hypothesis. exactTest() equalizes the library sizes using equalizeLibSizes before calling the low-level functions.
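As an illustration of what "the same expected column sum" means, the following minimal sketch rescales every library to a common size, taken here as the geometric mean of the library sizes. It is a deliberate simplification and not the library's method: equalizeLibSizes may adjust counts differently, and the helper name scale_to_common_lib_size and the toy counts are hypothetical.

```python
from math import exp, log


def scale_to_common_lib_size(counts):
    """Rescale counts so that every library (column) has the same total.

    counts: list of rows (genes), each a list of per-library counts.
    Hypothetical helper for illustration only; plain proportional scaling,
    not the adjustment performed by equalizeLibSizes.
    """
    n_libs = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_libs)]
    # Common size: geometric mean of the library sizes (illustrative choice).
    common = exp(sum(log(size) for size in lib_sizes) / n_libs)
    scaled = [
        [row[j] * common / lib_sizes[j] for j in range(n_libs)]
        for row in counts
    ]
    return scaled, common


# Toy data: 2 genes x 4 libraries with sizes 100, 200, 50 and 400.
counts = [[10, 20, 5, 40], [90, 180, 45, 360]]
scaled, common = scale_to_common_lib_size(counts)
col_sums = [sum(row[j] for row in scaled) for j in range(4)]
print(common, col_sums)
```

After scaling, every column sum equals the common size, which is the condition the low-level functions assume.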

The functions exactTestDoubleTail(), exactTestBySmallP() and exactTestByDeviance() correspond to different ways to define the two-sided rejection region when the two groups have different numbers of samples. exactTestBySmallP() implements the method of small probabilities as proposed by [Robinson2008]. This method corresponds exactly to binomTest() as the dispersion approaches zero, but gives poor results when the dispersion is very large. exactTestDoubleTail() computes two-sided p-values by doubling the smaller tail probability. exactTestByDeviance() uses the deviance goodness-of-fit statistic to define the rejection region, and is therefore equivalent to a conditional likelihood ratio test.
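The doubling rule can be sketched in plain Python for a single gene. The sketch relies on an assumption that is not stated on this page: with equalized libraries and a shared mean and dispersion, the group-1 total conditional on the overall total is taken to follow a beta-binomial distribution with parameters n1/dispersion and n2/dispersion. The function names are hypothetical, and this is not the code of exactTestDoubleTail().

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k, s, a, b):
    """P(X = k) for X ~ BetaBinomial(s, a, b)."""
    log_comb = lgamma(s + 1) - lgamma(k + 1) - lgamma(s - k + 1)
    return exp(log_comb + log_beta(k + a, s - k + b) - log_beta(a, b))


def double_tail_pvalue(s1, s2, n1, n2, dispersion):
    """Two-sided p-value obtained by doubling the smaller tail probability.

    s1, s2: total counts in each group; n1, n2: number of libraries per group.
    Requires dispersion > 0 (assumed beta-binomial conditional distribution).
    """
    s = s1 + s2
    a, b = n1 / dispersion, n2 / dispersion
    pmf = [betabinom_pmf(k, s, a, b) for k in range(s + 1)]
    lower = sum(pmf[: s1 + 1])  # P(X <= s1)
    upper = sum(pmf[s1:])       # P(X >= s1)
    return min(1.0, 2.0 * min(lower, upper))


# Balanced counts: both tails exceed 0.5, so the p-value is capped at 1.0.
print(double_tail_pvalue(5, 5, n1=2, n2=2, dispersion=0.1))
# Strongly unbalanced counts give a small p-value.
print(double_tail_pvalue(20, 2, n1=2, n2=2, dispersion=0.1))
```

Doubling only requires the two cumulative tails, which is consistent with the speed advantage attributed to the "doubletail" rejection region.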

Note that rejection_region="smallp" is no longer recommended. It is preserved as an option only for backward compatibility with earlier versions of edgeR. rejection_region="deviance" has good theoretical statistical properties but is relatively slow to compute. rejection_region="doubletail" is just slightly more conservative than rejection_region="deviance", but is recommended because of its much greater speed. For general remarks on different types of rejection regions for exact tests, see [Gibbons1975].

exactTestBetaApprox() implements an asymptotic beta distribution approximation to the conditional count distribution. It is called by the other functions for rows with both group counts greater than big_count.

Parameters:
  • pair (pair of ints or of strings) – the pair of groups to be compared. If strings, they should be the names of two groups (e.g. two levels of self.samples["group"]). If integers, the groups to be compared are the levels of self.samples["group"] at those indices. If None, the first two levels of self.samples["group"] (a factor) are used. Note that the first group listed in the pair is the baseline for the comparison, so if the pair is ("A","B") then the comparison is B - A, and genes with positive log-fold changes are up-regulated in group B compared with group A (and vice versa for genes with negative log-fold changes).

  • dispersion (array_like of floats, or {"auto", "common", "trended", "tagwise"}) – an array of dispersions or a string indicating that dispersions should be taken from the data object. If floats, the array can be either of length one or of length equal to the number of genes. Defaults to "auto" to use the most complex dispersions found in the data object.

  • rejection_region ({"doubletail", "smallp", "deviance"}) – type of rejection region for two-sided exact test.

  • big_count (int) – count size above which asymptotic beta approximation will be used.

  • prior_count (float) – average prior count used to shrink log-fold-changes. Larger values produce more shrinkage.
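The direction of the comparison and the effect of prior_count can be illustrated with a small sketch for one gene. It makes several simplifying assumptions that are not taken from this page: libraries have equal sizes, group levels are ordered alphabetically, and the prior count is simply added to each group mean. The helper names resolve_pair and shrunken_log2_fc are hypothetical and do not reproduce the computation in exactTest().

```python
from math import log2


def resolve_pair(groups, pair=(0, 1)):
    """Map integer indices to group levels; strings are returned unchanged."""
    levels = sorted(set(groups))  # assumed level order: alphabetical
    return tuple(levels[p] if isinstance(p, int) else p for p in pair)


def shrunken_log2_fc(counts, groups, pair=(0, 1), prior_count=0.125):
    """log2 fold change of the second group of the pair over the first (baseline)."""
    baseline, other = resolve_pair(groups, pair)

    def mean_of(level):
        values = [c for c, g in zip(counts, groups) if g == level]
        return sum(values) / len(values)

    # Adding the same prior count to both means pulls the ratio toward 1.
    return log2((mean_of(other) + prior_count) / (mean_of(baseline) + prior_count))


groups = ["A", "A", "B", "B"]
counts = [1, 3, 8, 8]  # toy counts for one gene: mean 2 in A, mean 8 in B

print(resolve_pair(groups, (0, 1)))                      # ('A', 'B')
print(shrunken_log2_fc(counts, groups, prior_count=0))   # B - A, no shrinkage
print(shrunken_log2_fc(counts, groups, prior_count=2))   # shrunk toward zero
print(shrunken_log2_fc(counts, groups, pair=("B", "A"))) # sign flips: A - B
```

Swapping the order of the pair flips the sign of the log-fold change, and a larger prior_count moves it closer to zero.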

Returns:

dataframe with two additional components:

  • comparison, string giving the names of the two groups being compared.

  • genes, dataframe containing annotation for each gene; taken from self.

The dataframe has the same rows as self and contains the following columns:

  • "log2FoldChange", log2-fold-change of expression between conditions being tested.

  • "lfcSE", standard error of log2-fold-change.

  • "logCPM", average log2-counts per million.

  • "pvalue", the two-sided p-values.

Return type:

DGEExact