inmoose.deseq2.DESeq
- inmoose.deseq2.DESeq(obj, test='Wald', fitType='parametric', sfType='ratio', betaPrior=False, full=None, reduced=None, quiet=False, minReplicatesForReplace=7, modelMatrixType=None, useT=False, minmu=None, parallel=False)
Differential expression analysis based on the Negative Binomial distribution.
This function performs a default analysis through the steps:
estimation of size factors (
DESeqDataSet.estimateSizeFactors())estimation of dispersion (
DESeqDataSet.estimateDispersions())negative binomial GLM fitting and Wald statistics (
nbinomWaldTest()ornbinomLRT())
For complete details on each step, see the documentation of the corresponding functions. After
DESeq()returns aDESeqDataSet, results tables (log2 fold change and p-values) can be generated using theDESeqDataSet.results()function. Shrunken LFC can then be generated usinglfcShrink().Notes
The differential expression analysis uses a generalized linear model of the form:
\[ \begin{align}\begin{aligned}K_{ij} &\sim \mathcal{NB}(\mu_{ij}, \alpha_i)\\\mu_{ij} &\sim s_j \; q_{ij}\\\log q_{ij} &= x_{j.} \; \beta_i\end{aligned}\end{align} \]where counts \(K_{ij}\) for gene \(i\) and sample \(j\) are modeled using a Negative Binomial distribution with fitted mean \(\mu_{ij}\) and a gene-specific dispersion parameter \(\alpha_i\). The fitted mean is composed of a sample-specific size factor \(s_j\) and a parameter \(q_{ij}\) proportional to the expected true concentration of fragments for sample \(j\). The coefficients \(\beta_i\) give the log2 fold changes for gene \(i\) for each column of the model matrix \(X\). The sample-specific size factors can be replaced by gene-specific normalization factors for each sample using
DESeqDataSet.normalizationFactors.For details on the fitting of the log2 fold changes and calculation of p-values, see
nbinomWaldTest()if usingtest="Wald", ornbinomLRT()if usingtest="LRT".Experiments without replicates do not allow for estimation of the dispersion of counts around the expected value for each group, which is critical for differential expression analysis.
The argument
minReplicatesForReplaceis used to decide which samples are eligible for automatic replacement in the case of extreme Cook’s distance. By default,DESeq()will replace outliers if the Cook’s distance is large for a sample which has 7 or more replicates (including itself). Outlier replacement is turned off entirely forfitType="glmGamPoi". This replacement is performed byreplaceOutliers(). This default behavior helps to prevent filtering genes based on Cook’s distance when there are many degrees of freedom. SeeDESeqDataSet.results()for more information about filtering using Cook’s distance, and the “Dealing with outliers” section of the module documentation. Unlike the behavior ofreplaceOutliers(), here original counts are kept in the matrix returned bycounts, original Cook’s distances are kept inobj.layers["cooks"], and the replacement counts used for fitting are kept inobj.layers["replaceCounts"].Note that is a log2 fold change prior is used (
betaPrior=True), then expanded model matrices will be used for fitting. These are described innbinomWaldTest()and in the module documentation. Thecontrastargument ofDESeqDataSet.results()should be used for generating result tables.- Parameters:
obj (DESeqDataSet) – DESeqDataSet object
test (str, optional) – either
"Wald"(default) or"LRT", to choose between Wald significance tests (nbinomWaldTest()) or the likelihood ratio test on the difference in deviance between a full and a reduced model formula (nbinomLRT()).fitType (str, optional) – either
"parametric"(default),"local","mean"or"glmGamPoi"for the type of fitting of dispersions to the mean intensity. SeeDESeqDataSet.estimateDispersions()for details.sfType (str, optional) – either
"ratio"(default),"poscounts"or"iterate"for the type of size factor estimation. SeeDESeqDataSet.estimateSizeFactors()for details.betaPrior (bool, optional) – whether to put a zero-mean normal prior on the non-intercept coefficients. See
nbinomWaldTest()for details on the calculation of the beta prior. Defaults toFalse, and shrunken LFCs are obtained afterwards usinglfcShrink().full (formula or design matrix, optional) – for
test="LRT", the full model formula, which is restricted to the formula inobj.design. Alternatively, it can be a model matrix constructed by the user. Advanced use: specifying a model matrix for full andtest="Wald"is possible ifbetaPrior=False.reduced (formula or design matrix, optional) – for
test="LRT", a reduced formula to compare against, i.e. the full formula with the terms of interest removed. Alternatively, it can be a model matrix constructed by the user.quiet (bool) – whether to print messages at each step
minReplicatesForReplace (int, optional) – the minimum number of replicates required in order to use
replaceOutliers()on a sample. If there are samples with that many replicates, the model will be refit after replacing outliers, flagged by Cook’s distance. Set it toinfin order to never replace outliers. It is set toinfiffitType="glmGamPoi".modelMatrixType (str, optional) – either
"standard"or"expanded". Describes how the model matrix X of the GLM formula is formed. If “standard”, model matrix is built directly from the design formula. If “expanded”, model matrix includes an indicator variable for each level of factors in addition to an intercept.betaPriormust beTruein order for expanded model matrices to be fitted.useT (bool, optional) – passed to
nbinomWaldTest(), where Wald statistics are assumed to follow a standard Normalminmu (float, optional) – lower bound on the estimated count for fitting gene-wise dispersion and for use with nbinomWaldTest and nbinomLRT. If
fitType == "glmGamPoi"then defaults to 1e-6 (as thisfitTypeis optimized for single cell data, for which a lowerminmuis recommended), otherwise defaults to 0.5parallel (bool) – unimplemented
- Returns:
the input
obj, updated with differential expression analysis data- Return type: