inmoose.deseq2.DESeqDataSet.DESeqDataSet.estimateSizeFactors

DESeqDataSet.estimateSizeFactors(type_='ratio', locfunc=<function median>, geoMeans=None, controlGenes=None, normMatrix=None, quiet=False)

Estimate the size factors of a DESeqDataSet

This function estimates the size factors using the “median ratio method”, described by Equation 5 in [Anders2010].

The estimated size factors can be accessed through the DESeqDataSet.sizeFactors property. Alternative library size estimators can also be supplied through this property.

See DESeq() for a description of the use of size factors in the GLM. Call this function after building a DESeqDataSet, unless size factors are specified manually through the DESeqDataSet.sizeFactors property. Alternatively, gene-specific normalization factors for each sample can be provided through the DESeqDataSet.normalizationFactors property, which always takes precedence over DESeqDataSet.sizeFactors in calculations.

Internally, the function calls estimateSizeFactorsForMatrix(), which provides more details on the calculation.
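The median ratio method described above can be sketched in plain NumPy. This is an illustration only, not inmoose's implementation: the function name median_ratio_size_factors and the layout (genes in rows, samples in columns) are assumptions of the sketch, and genes with a zero in any sample are simply skipped because their geometric mean is zero.

```python
import numpy as np


def median_ratio_size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count matrix.

    Hypothetical helper for illustration; the layout is an assumption.
    """
    counts = np.asarray(counts, dtype=float)
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)  # log(0) evaluates to -inf

    # Log of the per-gene geometric mean (the "pseudosample").
    # It is -inf for any gene that has a zero count in some sample.
    log_geo_means = log_counts.mean(axis=1)

    usable = np.isfinite(log_geo_means)
    if not usable.any():
        raise ValueError("every gene has a zero count in at least one sample")

    # Per-sample median of log(count / geometric mean) over the usable genes.
    log_ratios = log_counts[usable, :] - log_geo_means[usable, None]
    return np.exp(np.median(log_ratios, axis=0))


counts = np.array([
    [10, 20, 40],
    [100, 200, 400],
    [5, 10, 20],
    [0, 3, 7],  # contains a zero, so it is excluded from the estimate
])
sf = median_ratio_size_factors(counts)
```

In this toy matrix each sample is an exact multiple of the first, so the estimated factors recover those multiples up to a common scale.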

Parameters:
  • obj (DESeqDataSet) – the input dataset

  • type_ ("ratio", "poscounts" or "iterate") –

    the algorithm to estimate the size factors

    "ratio" uses the standard median ratio method introduced in DESeq. The size factor of a sample is the median, over genes, of the ratio of its counts to a “pseudosample” whose value for each gene is the geometric mean across all samples.

    "poscounts" and "iterate" offer alternative estimators, which can be used even when every gene has a zero count in at least one sample (a problem for the default method, as the geometric mean then becomes zero, and the ratio undefined).

    The "poscounts" estimator handles a gene with some zeros by computing a modified geometric mean: the n-th root of the product of the non-zero counts. This evolved out of use cases with Paul McMurdie’s phyloseq package for metagenomic samples.

    The "iterate" estimator iterates between estimating the dispersion with a design of ~1, and finding a size factor vector by numerically optimizing the likelihood of the ~1 model.

  • locfunc – a function to compute a location for a sample. By default, the median is used.

  • geoMeans (array-like, optional) – by default, the geometric means of the counts are calculated within the function. A vector of geometric means from another count matrix can be provided for a “frozen” size factor calculation. The size factors will be scaled to have a geometric mean of 1 when supplying geoMeans.

  • controlGenes (array-like, optional) – index vector specifying those genes to use for size factor estimation (e.g. housekeeping or spike-in genes)

  • normMatrix (ndarray, optional) – a matrix of normalization factors which do not yet control for library size. Providing normMatrix will estimate size factors on the count matrix divided by normMatrix and store the product of the size factors and normMatrix as DESeqDataSet.normalizationFactors. It is recommended to divide out the sample-wise geometric mean of normMatrix so the sample-wise factors are roughly centered on 1.

  • quiet (bool) – controls verbosity, defaults to False
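The modified geometric mean used by the "poscounts" estimator can be sketched as follows. This is a minimal illustration, not inmoose's code: the helper name poscounts_geo_means and the genes-in-rows layout are assumptions, and n is taken here to be the total number of samples (including those with a zero count), which is one reading of the description above.

```python
import numpy as np


def poscounts_geo_means(counts):
    """Per-gene modified geometric mean for a genes x samples matrix.

    Hypothetical helper: n-th root of the product of the non-zero counts,
    with n assumed to be the total number of samples.
    """
    counts = np.asarray(counts, dtype=float)
    n_samples = counts.shape[1]
    positive = counts > 0

    # log of the positive entries; zero entries contribute 0 to the sum
    log_pos = np.zeros_like(counts)
    np.log(counts, out=log_pos, where=positive)

    geo = np.exp(log_pos.sum(axis=1) / n_samples)
    geo[~positive.any(axis=1)] = 0.0  # genes with only zeros keep a mean of 0
    return geo


counts = np.array([
    [0, 4, 16],  # one zero: cube root of 4 * 16
    [0, 0, 0],   # all zeros
    [8, 8, 8],   # no zeros: ordinary geometric mean
])
geo = poscounts_geo_means(counts)
```

Means computed this way stay positive for any gene with at least one non-zero count, so the ratio step of the default method remains defined for those genes.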

Returns:

the input obj, with the size factors filled in

Return type:

DESeqDataSet