inmoose.deseq2.varianceStabilizingTransformation
- inmoose.deseq2.varianceStabilizingTransformation(obj, blind=True, fitType='parametric')
Apply a variance stabilizing transformation (VST) to the count data
This function calculates a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are approximately homoskedastic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size. The rlog() transformation is less sensitive to size factors, which can be an issue when size factors vary widely. These transformations are useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis.
Variance stabilizing transformation was originally described in [Anders2010].
Details
For each sample (i.e. row of dds.counts()), the full variance function is calculated from the raw variance (by scaling according to the size factor and adding the shot noise). We recommend a blind estimation of the variance function, i.e. one ignoring conditions. This is performed by default, and can be modified using the blind argument.
Note that neither the rlog() transformation nor the VST is used by the differential expression estimation in DESeq(), which always operates on the raw count data, through generalized linear modeling which incorporates knowledge of the variance-mean dependence. The rlog() transformation and VST are offered as separate functionality which can be used for visualization, clustering or other machine learning tasks. See the transformation section of the vignette for more details.
The transformation does not require that one has already estimated size factors and dispersions.
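The "scaling and shot noise" step described above can be sketched as follows. This is a minimal illustration of the variance model from [Anders2010] under a negative binomial assumption, not the code used by the library; the function name and arguments are hypothetical.

```python
def count_variance(q, size_factor, dispersion):
    """Variance of a raw count, given its normalized mean q (hypothetical helper).

    Assumes the negative binomial model, where the raw variance on the
    normalized scale is dispersion * q**2.
    """
    shot_noise = size_factor * q          # Poisson part, equal to the expected raw count
    raw_variance = dispersion * q ** 2    # biological variability on the normalized scale
    # The raw variance is scaled by the squared size factor, then shot noise is added.
    return shot_noise + size_factor ** 2 * raw_variance


# A sample sequenced twice as deeply (size factor 2) as the reference:
print(count_variance(q=100.0, size_factor=2.0, dispersion=0.1))
```

With an expected raw count mu = size_factor * q, the result is the familiar mu + dispersion * mu**2.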
A typical workflow is shown in Section Variance stabilizing transformation in the vignette.
If estimateDispersions() was called with:
- fitType="parametric": a closed-form expression for the variance stabilizing transformation is used on the normalized count data.
- fitType="local": the reciprocal of the square root of the variance of the normalized counts, as derived from the dispersion fit, is then numerically integrated, and the integral (approximated by a spline function) is evaluated for each count value in the column, yielding a transformed value.
- fitType="mean": a VST is applied for Negative Binomial distributed counts, \(k\), with a fixed dispersion, \(a\): \((2 \operatorname{asinh}(\sqrt{a k}) - \log(a) - \log(4)) / \log(2)\).
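The closed-form cases can be illustrated with a short sketch. The first function evaluates the fitType="mean" formula exactly as written above. The second is an assumption: it supposes the parametric trend has the form dispersion(q) = asympt_disp + extra_pois / q and uses the antiderivative of 1 / sqrt(variance) that follows from it, which may differ from what the library actually evaluates. Both function names are hypothetical.

```python
import math


def vst_mean_fit(k, a):
    """VST for negative binomial counts k with a fixed dispersion a."""
    return (2.0 * math.asinh(math.sqrt(a * k)) - math.log(a) - math.log(4.0)) / math.log(2.0)


def vst_parametric(q, asympt_disp, extra_pois):
    """Closed-form VST, assuming dispersion(q) = asympt_disp + extra_pois / q.

    Under that assumption the variance is (1 + extra_pois) * q + asympt_disp * q**2,
    and this expression has a derivative proportional to 1 / sqrt(variance).
    """
    c = 1.0 + extra_pois
    inner = c + 2.0 * asympt_disp * q + 2.0 * math.sqrt(asympt_disp * q * (c + asympt_disp * q))
    return math.log(inner / (4.0 * asympt_disp)) / math.log(2.0)


# Compare both transformations with log2 over a range of counts.
for k in (1.0, 100.0, 1e4, 1e6):
    print(k, vst_mean_fit(k, 0.1), vst_parametric(k, 0.1, 2.0), math.log2(k))
```

With extra_pois set to 0 the second function reduces to the first, and both approach log2 of the count as the count grows.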
In all cases, the transformation is scaled such that, for large counts, it becomes asymptotically equal to the logarithm to base 2 of normalized counts.
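The numerical route used for fitType="local", together with the rescaling to log2, can be sketched as below. This is a simplified illustration under stated assumptions, not the library's implementation: size factors are taken to be 1, linear interpolation stands in for the spline, the two anchor counts used for rescaling are fixed hypothetical values, and the name make_numeric_vst is invented for this example.

```python
import math
from bisect import bisect_right


def make_numeric_vst(dispersion, max_count, n_grid=1000, anchors=(1e3, 1e4)):
    """Build a VST by numerically integrating 1 / sqrt(variance).

    dispersion: callable returning the fitted dispersion at a normalized mean q.
    With size factors of 1 (assumption), variance(q) = q + dispersion(q) * q**2.
    """
    # Grid evenly spaced on the asinh scale: dense near 0, sparse for large counts.
    top = math.asinh(max_count)
    grid = [math.sinh(top * i / (n_grid - 1)) for i in range(1, n_grid)]  # q = 0 is skipped
    integrand = [1.0 / math.sqrt(q + dispersion(q) * q * q) for q in grid]

    # Cumulative trapezoidal integral, starting from the first grid point.
    cumulative = [0.0]
    for i in range(1, len(grid)):
        width = grid[i] - grid[i - 1]
        cumulative.append(cumulative[-1] + width * (integrand[i] + integrand[i - 1]) / 2.0)

    def integral(q):
        # Linear interpolation between grid points; values outside the grid are clamped.
        q = min(max(q, grid[0]), grid[-1])
        j = min(bisect_right(grid, q), len(grid) - 1)
        t = (q - grid[j - 1]) / (grid[j] - grid[j - 1])
        return cumulative[j - 1] + t * (cumulative[j] - cumulative[j - 1])

    # Affine rescaling so that the result matches log2 at two large anchor counts.
    h1, h2 = anchors
    eta = (math.log2(h2) - math.log2(h1)) / (integral(h2) - integral(h1))
    xi = math.log2(h1) - eta * integral(h1)
    return lambda q: eta * integral(q) + xi


# Constant dispersion of 0.1, which makes the result comparable to the fitType="mean" formula.
vst = make_numeric_vst(lambda q: 0.1, max_count=1e5)
for q in (1.0, 100.0, 1e4):
    print(q, vst(q), math.log2(q))
```

The integral alone is only defined up to an affine map, so the choice of anchors is what ties the transformed values to the log2 scale for large counts.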
The variance stabilizing transformation from a previous dataset can be “frozen” and reapplied to new samples. The frozen VST is accomplished by saving the dispersion function accessible with dispersionFunction(), assigning this to the DESeqDataSet with the new samples, and running varianceStabilizingTransformation() with blind set to False. Then the dispersion function from the previous dataset will be used to transform the new sample(s).
Limitations: In order to preserve normalization, the same transformation has to be used for all samples. As a result, the variance stabilization is only approximate. The more the size factors differ, the more residual dependence of the variance on the mean will be found in the transformed data. rlog() is a transformation which can perform better in these cases. As shown in the vignette, meanSdPlot from the package vsn can be used to see whether this is a problem.
- param obj:
a DESeqDataSet or matrix of counts
- type obj:
DESeqDataSet or matrix
- param blind:
whether to blind the transformation to the experimental design. blind=True should be used for comparing samples in a manner unbiased by prior information on samples, for example to perform sample QA (quality assurance). blind=False should be used for transforming data for downstream analysis, where the full use of the design information should be made. blind=False will skip re-estimation of the dispersion trend, if this has already been calculated. If many genes have large differences in counts due to the experimental design, it is important to set blind=False for downstream analysis. Defaults to True.
- type blind:
bool
- param fitType:
in case dispersions have not yet been estimated for obj, this parameter is passed on to estimateDispersions() (options described there). Defaults to "parametric".
- type fitType:
{ “parametric”, “local”, “mean” }
- returns:
a DESeqTransform if a DESeqDataSet was provided, or a matrix if a count matrix was provided. Note that for DESeqTransform output, the matrix of transformed values is stored in vsd.layers.
- rtype:
DESeqTransform or matrix