inmoose.deseq2.nbinomWaldTest

inmoose.deseq2.nbinomWaldTest(obj, betaPrior=False, betaPriorVar=None, modelMatrix=None, modelMatrixType=None, betaTol=1e-08, maxit=100, useOptim=True, quiet=False, useT=False, df=None, useQR=True, minmu=0.5)

Wald test for the GLM coefficients

This function tests for significance of coefficients in a Negative Binomial GLM, using previously calculated sizeFactors or normalizationFactors and dispersion estimates. See DESeq() for the GLM formula.

The fitting proceeds as follows: standard maximum likelihood estimates for GLM coefficients (synonymous with “beta”, “log2 fold change”, “effect size”) are calculated. Then, optionally, a zero-centered normal prior distribution (flag betaPrior) is assumed for the coefficients other than the intercept.

Note that this posterior log2 fold change estimation is now not the default setting for nbinomWaldTest(), as the standard workflow for coefficient shrinkage has moved to an additional function lfcShrink().

To calculate Wald test p-values, the coefficients are scaled by their standard errors and then compared to a standard normal distribution. The DESeqDataSet.results() method without any argument will automatically perform a contrast of the last level of the last variable in the design formula over the first level. The contrast argument of DESeqDataSet.results() can be used to generate other comparisons.

The Wald test can be replaced with nbinomLRT() for an alternative test of significance.

Notes

The variance of the prior distribution for each non-intercept coefficient is calculated using the observed distribution of the maximum likelihood coefficients. The final coefficients are then maximum a posterior estimates using this prior (Tikhonov/ridge regularization). See below for details on the prior variance and the methods section of the original DESeq2 manuscript for more details. The use of a prior has little effect on genes with high counts and helps to moderate the large spread in coefficients for genes with low counts.

The prior variance is calculated by matching the 0.05 upper quantile of the observed MLE coefficients to a zero-centered normal distribution. In a change of methods since the 2014 paper, the weighted upper quantile is calculated using the wtd_quantile() function from the Hmisc package (function has been copied into DESeq2 code to avoid extra dependencies). The weights are the inverse of the expected variance of log counts, so the inverse of \(1 / \overline{\mu} + \\alpha_{tr}\) using the mean of normalized counts and the trended dispersion fit. The weighting ensures that noisy estimates of log fold changes from small count genes do not overly influence the calculation of the prior variance (see estimateBetaPriorVar()). The final prior variance for a factor level is the average of the estimated prior variance over all contrasts of all levels of the factor.

When a log2 fold change prior is used (betaPrior=True), then nbinomWaldTest() will by default use expanded model matrices, as described in the modelMatrixType argument, unless this arguments is used to override the default behavior. This ensures that log2 fold changes will be independent of the choice of the reference level. In this case, the beta prior variance for each factor is calculated as the average of the mean squared maximum likelihood estimates for each level and every possible contrast.

Parameters:
  • obj (DESeqDataSet) – a DESeqDataSet object

  • betaPrior (bool) – whether to use a zero-mean normal prior on the non-intercept coefficients

  • betaPriorVar (ndarray, optional) – a vector with length equal to the number of model terms including the intercept, giving the variance of the prior on the sample beta on the log2 scale. If None, it is estimated from the data

  • modelMatrix (ndarray, optional) – a design matrix. Typically left None and created by the function

  • modelMatrixType (str) –

    either "standard" or "expanded", which describe how the model matrix is formed.

    "standard" means as created by patsy.dmatrix() using the design formula.

    "expanded" includes an indicator variable for each level of factors in addition to an intercept. betaPrior must be set to True in order for expanded model matrices to be fit.

  • betaTol (float) – control parameter defining convergence

  • maxit (int) – the maximum number of iterations to allow for convergence of the coefficient vector

  • useOptim (bool) – whether to use the native optimization function on rows that do not converged within maxit iterations

  • quiet (bool) – whether to print messages at each step

  • useT (bool) – whether to use a t-distribution as a null distribution, for significance testing of the Wald statistics. If False, a standard normal null distribution is used. See next argument df for information about which t is used. If useT=True then further calls to DESeqDataSet.results() will make use of obj.var["tDegreesFreedom"] that is stored by nbinomWaldTest().

  • df (array-like) – the degrees of freedom for the t-distribution. It must be broadcastable to the number of columns of obj. If not specified, the degrees of freedom will be set by the number of samples minus the number of columns of the design matrix used for dispersion estimation. If weights are included in obj.layers, then the sum of the weights is used in lieu of the number of samples.

  • useQR (bool) – whether to use the QR decomposition of the design matrix while fitting the GLM

  • minmu (float) – lower bound on the estimated count while fitting the GLM

Returns:

the input DESeqDataSet with results columns accessible with the DESeqDataSet.results() method. The coefficients and standard errors are reported on a log2 scale.

Return type:

DESeqDataSet