inmoose.deseq2.DESeq

inmoose.deseq2.DESeq(obj, test='Wald', fitType='parametric', sfType='ratio', betaPrior=False, full=None, reduced=None, quiet=False, minReplicatesForReplace=7, modelMatrixType=None, useT=False, minmu=None, parallel=False)

Differential expression analysis based on the Negative Binomial distribution.

This function performs a default analysis through the steps:

For complete details on each step, see the documentation of the corresponding functions. After DESeq() returns a DESeqDataSet, results tables (log2 fold change and p-values) can be generated using the DESeqDataSet.results() function. Shrunken LFC can then be generated using lfcShrink().

Notes

The differential expression analysis uses a generalized linear model of the form:

\[ \begin{align}\begin{aligned}K_{ij} &\sim \mathcal{NB}(\mu_{ij}, \alpha_i)\\\mu_{ij} &\sim s_j \; q_{ij}\\\log q_{ij} &= x_{j.} \; \beta_i\end{aligned}\end{align} \]

where counts \(K_{ij}\) for gene \(i\) and sample \(j\) are modeled using a Negative Binomial distribution with fitted mean \(\mu_{ij}\) and a gene-specific dispersion parameter \(\alpha_i\). The fitted mean is composed of a sample-specific size factor \(s_j\) and a parameter \(q_{ij}\) proportional to the expected true concentration of fragments for sample \(j\). The coefficients \(\beta_i\) give the log2 fold changes for gene \(i\) for each column of the model matrix \(X\). The sample-specific size factors can be replaced by gene-specific normalization factors for each sample using DESeqDataSet.normalizationFactors.

For details on the fitting of the log2 fold changes and calculation of p-values, see nbinomWaldTest() if using test="Wald", or nbinomLRT() if using test="LRT".

Experiments without replicates do not allow for estimation of the dispersion of counts around the expected value for each group, which is critical for differential expression analysis.

The argument minReplicatesForReplace is used to decide which samples are eligible for automatic replacement in the case of extreme Cook’s distance. By default, DESeq() will replace outliers if the Cook’s distance is large for a sample which has 7 or more replicates (including itself). Outlier replacement is turned off entirely for fitType="glmGamPoi". This replacement is performed by replaceOutliers(). This default behavior helps to prevent filtering genes based on Cook’s distance when there are many degrees of freedom. See DESeqDataSet.results() for more information about filtering using Cook’s distance, and the “Dealing with outliers” section of the module documentation. Unlike the behavior of replaceOutliers(), here original counts are kept in the matrix returned by counts, original Cook’s distances are kept in obj.layers["cooks"], and the replacement counts used for fitting are kept in obj.layers["replaceCounts"].

Note that is a log2 fold change prior is used (betaPrior=True), then expanded model matrices will be used for fitting. These are described in nbinomWaldTest() and in the module documentation. The contrast argument of DESeqDataSet.results() should be used for generating result tables.

Parameters:
  • obj (DESeqDataSet) – DESeqDataSet object

  • test (str, optional) – either "Wald" (default) or "LRT", to choose between Wald significance tests (nbinomWaldTest()) or the likelihood ratio test on the difference in deviance between a full and a reduced model formula (nbinomLRT()).

  • fitType (str, optional) – either "parametric" (default), "local", "mean" or "glmGamPoi" for the type of fitting of dispersions to the mean intensity. See DESeqDataSet.estimateDispersions() for details.

  • sfType (str, optional) – either "ratio" (default), "poscounts" or "iterate" for the type of size factor estimation. See DESeqDataSet.estimateSizeFactors() for details.

  • betaPrior (bool, optional) – whether to put a zero-mean normal prior on the non-intercept coefficients. See nbinomWaldTest() for details on the calculation of the beta prior. Defaults to False, and shrunken LFCs are obtained afterwards using lfcShrink().

  • full (formula or design matrix, optional) – for test="LRT", the full model formula, which is restricted to the formula in obj.design. Alternatively, it can be a model matrix constructed by the user. Advanced use: specifying a model matrix for full and test="Wald" is possible if betaPrior=False.

  • reduced (formula or design matrix, optional) – for test="LRT", a reduced formula to compare against, i.e. the full formula with the terms of interest removed. Alternatively, it can be a model matrix constructed by the user.

  • quiet (bool) – whether to print messages at each step

  • minReplicatesForReplace (int, optional) – the minimum number of replicates required in order to use replaceOutliers() on a sample. If there are samples with that many replicates, the model will be refit after replacing outliers, flagged by Cook’s distance. Set it to inf in order to never replace outliers. It is set to inf if fitType="glmGamPoi".

  • modelMatrixType (str, optional) – either "standard" or "expanded". Describes how the model matrix X of the GLM formula is formed. If “standard”, model matrix is built directly from the design formula. If “expanded”, model matrix includes an indicator variable for each level of factors in addition to an intercept. betaPrior must be True in order for expanded model matrices to be fitted.

  • useT (bool, optional) – passed to nbinomWaldTest(), where Wald statistics are assumed to follow a standard Normal

  • minmu (float, optional) – lower bound on the estimated count for fitting gene-wise dispersion and for use with nbinomWaldTest and nbinomLRT. If fitType == "glmGamPoi" then defaults to 1e-6 (as this fitType is optimized for single cell data, for which a lower minmu is recommended), otherwise defaults to 0.5

  • parallel (bool) – unimplemented

Returns:

the input obj, updated with differential expression analysis data

Return type:

DESeqDataSet