inmoose.deseq2.DESeqDataSet.DESeqDataSet.estimateSizeFactors
- DESeqDataSet.estimateSizeFactors(type_='ratio', locfunc=<function median>, geoMeans=None, controlGenes=None, normMatrix=None, quiet=False)
Estimate the size factors of a DESeqDataSet.
This function estimates the size factors using the “median ratio method”, described by Equation 5 in [Anders2010].
The estimated size factors can be accessed through the DESeqDataSet.sizeFactors property of DESeqDataSet. Alternative library size estimators can also be supplied through this property.
See DESeq() for a description of the use of size factors in the GLM. One should call this function after building a DESeqDataSet, unless size factors are manually specified with the property DESeqDataSet.sizeFactors. Alternatively, gene-specific normalization factors for each sample can be provided using DESeqDataSet.normalizationFactors, which will always preempt DESeqDataSet.sizeFactors in calculations.
Internally, the function calls estimateSizeFactorsForMatrix(), which provides more details on the calculation.
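The median ratio method can be illustrated with a minimal NumPy sketch. This is not inmoose's implementation: the function name `median_ratio_size_factors` is hypothetical, the count matrix is assumed to be laid out as samples × genes, and genes containing a zero are simply skipped because their geometric mean is zero.

```python
import numpy as np


def median_ratio_size_factors(counts):
    """Sketch of the median ratio method on a (samples x genes) count matrix.

    Hypothetical helper for illustration only; not part of inmoose.
    """
    counts = np.asarray(counts, dtype=float)
    # A gene with any zero has a geometric mean of zero, so its ratio is
    # undefined. Keep only genes that are positive in every sample.
    positive = (counts > 0).all(axis=0)
    if not positive.any():
        raise ValueError("every gene contains at least one zero count")
    log_counts = np.log(counts[:, positive])
    # Per-gene log geometric mean across samples: the "pseudosample".
    log_geo_means = log_counts.mean(axis=0)
    # Size factor of a sample: median over genes of sample / pseudosample.
    return np.exp(np.median(log_counts - log_geo_means, axis=1))


# The second sample is sequenced twice as deeply as the first.
counts = np.array([[10, 20, 30], [20, 40, 60]])
size_factors = median_ratio_size_factors(counts)
```

In this toy matrix every count of the second sample is twice that of the first, so the second size factor comes out twice as large as the first.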
- Parameters:
obj (DESeqDataSet) – the input dataset
type ("ratio", "poscounts" or "iterate") – the algorithm to estimate the size factors.
"ratio" uses the standard median ratio method introduced in DESeq. The size factor is the median ratio of the sample over a “pseudosample”: for each gene, the geometric mean of all samples. "poscounts" and "iterate" offer alternative estimators, which can be used even when all genes contain a sample with a zero (a problem for the default method, as the geometric mean then becomes zero, and the ratio undefined).
The "poscounts" estimator deals with a gene with some zeros by calculating a modified geometric mean, taking the n-th root of the product of the non-zero counts. This evolved out of use cases with Paul McMurdie’s phyloseq package for metagenomic samples.
The "iterate" estimator iterates between estimating the dispersion with a design of ~1, and finding a size factor vector by numerically optimizing the likelihood of the ~1 model.
locfunc – a function to compute a location for a sample. By default, the median is used.
geoMeans (array-like, optional) – by default, the geometric means of the counts are calculated within the function. A vector of geometric means from another count matrix can be provided for a “frozen” size factor calculation. The size factors will be scaled to have a geometric mean of 1 when supplying geoMeans.
controlGenes (array-like, optional) – index vector specifying those genes to use for size factor estimation (e.g. housekeeping or spike-in genes)
normMatrix (ndarray, optional) – a matrix of normalization factors which do not yet control for library size. Providing normMatrix will estimate size factors on the count matrix divided by normMatrix and store the product of the size factors and normMatrix as DESeqDataSet.normalizationFactors. It is recommended to divide out the sample-wise geometric mean of normMatrix so the sample-wise factors are roughly centered on 1.
quiet (bool) – controls verbosity, defaults to False
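The modified geometric mean behind "poscounts", and the rescaling applied when geometric means are supplied, can be sketched in NumPy as follows. This is an illustration under stated assumptions, not inmoose's code: both function names are hypothetical, the count matrix is assumed to be samples × genes, n is taken to be the total number of samples, and each sample's ratios are assumed to be computed only over genes where both its count and the geometric mean are positive.

```python
import numpy as np


def poscounts_geo_means(counts):
    """Modified geometric mean per gene: n-th root of the product of the
    non-zero counts, with n the number of samples (assumption).

    Hypothetical helper for illustration only; not part of inmoose.
    """
    counts = np.asarray(counts, dtype=float)
    n_samples = counts.shape[0]
    # Replace zeros by 1 so they contribute nothing to the log-sum
    # (equivalently, a factor of 1 in the product).
    safe = np.where(counts > 0, counts, 1.0)
    geo_means = np.exp(np.log(safe).sum(axis=0) / n_samples)
    # A gene that is zero everywhere gets a geometric mean of 0.
    geo_means[(counts == 0).all(axis=0)] = 0.0
    return geo_means


def size_factors_from_geo_means(counts, geo_means, locfunc=np.median):
    """Size factors from supplied geometric means, rescaled so that their
    own geometric mean is 1. Hypothetical helper, same caveats as above.
    """
    counts = np.asarray(counts, dtype=float)
    geo_means = np.asarray(geo_means, dtype=float)
    size_factors = np.empty(counts.shape[0])
    for i, row in enumerate(counts):
        # Use only genes where the ratio is defined for this sample.
        usable = (row > 0) & (geo_means > 0)
        if not usable.any():
            raise ValueError(f"sample {i} has no usable gene")
        log_ratios = np.log(row[usable]) - np.log(geo_means[usable])
        size_factors[i] = np.exp(locfunc(log_ratios))
    # Scale the size factors to have a geometric mean of 1.
    return size_factors / np.exp(np.mean(np.log(size_factors)))


# Every gene except the last contains a zero in one of the samples.
counts = np.array([[0, 4, 8], [4, 0, 2]])
geo_means = poscounts_geo_means(counts)
size_factors = size_factors_from_geo_means(counts, geo_means)
```

Passing geometric means computed on a different count matrix to the second function would correspond to the “frozen” calculation described for geoMeans.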
- Returns:
the input obj, with the size factors filled in
- Return type: