pycombat_seq: batch effect correction for RNA-Seq data

ComBat-Seq (Zhang et al., 2020) [1] follows in the footsteps of ComBat, but specifically targets RNA-Seq data. Conceptually, ComBat-Seq is based on the same mathematical framework as ComBat, except that it replaces the normal distribution used for microarray data with a negative binomial distribution, to account for the specificities of RNA-Seq expression data. pycombat_seq() is a direct port of ComBat-Seq to Python. Since ComBat-Seq relies on the Bioconductor edgeR package, the relevant parts of edgeR have been ported as well. Because it closely follows the original R implementation, pycombat_seq() gives results very similar to those of ComBat-Seq in terms of batch effect correction. Additionally, pycombat_seq() is as fast as, if not faster than, the original R implementation. It also offers additional capabilities, such as fixing a given batch as reference (see the ref_batch parameter).
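To illustrate why count data call for a negative binomial rather than a normal model, the following minimal sketch simulates overdispersed counts with NumPy. It assumes the mean/dispersion parameterization commonly used for RNA-Seq counts (variance = mu + phi * mu^2); the values of mu and phi are arbitrary illustrative choices and are not taken from pycombat_seq.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 100.0   # assumed mean count for one gene (illustrative value)
phi = 0.5    # assumed dispersion (illustrative value)

# NumPy parameterizes the negative binomial by (n, p);
# convert from the (mean, dispersion) parameterization.
n = 1.0 / phi
p = n / (n + mu)
counts = rng.negative_binomial(n, p, size=100_000)

# Under this parameterization the variance grows quadratically with the mean.
expected_var = mu + phi * mu**2

print("sample mean:      ", counts.mean())
print("sample variance:  ", counts.var())
print("expected variance:", expected_var)
```

The sample variance is far larger than the sample mean, which neither a Poisson model (variance equal to the mean) nor a normal model fitted to raw counts captures well.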

Code documentation

inmoose.pycombat.pycombat_seq(counts, batch, covar_mod=None, shrink=False, shrink_disp=False, gene_subset_n=None, ref_batch=None, na_cov_action='raise')

pycombat_seq is a model derived from ComBat that uses negative binomial regression and specifically targets RNA-Seq count data.

Parameters:

counts (matrix) – raw count matrix (dataframe or numpy array) from genomic studies (dimensions gene x sample)
batch (array or list or inmoose.utils.factor.Factor) – Batch indices. Must have as many elements as the number of columns in the expression matrix.
covar_mod (list or matrix, optional) – model matrix (dataframe, list or numpy array) for one or multiple covariates to include in the linear model (signal from these variables is kept in the data after adjustment). Covariates have to be categorical; they cannot be continuous values (default: None).
shrink (bool, optional) – whether to apply shrinkage on parameter estimation (default: False)
shrink_disp (bool, optional) – whether to apply shrinkage on dispersion (default: False)
gene_subset_n (int, optional) – number of genes to use in empirical Bayes estimation, only useful when shrink=True (default: None)
ref_batch (any, optional) – batch id of the batch to use as reference (default: None)
na_cov_action (str, optional) –
Option to choose how to handle missing covariates:
- "raise": raise an error and stop if any covariate is missing
- "remove": remove samples with missing covariates and raise a warning
- "fill": handle missing covariates by creating a distinct covariate per batch
(default: "raise")
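The constraints on counts and batch listed above can be checked before calling the function. The sketch below uses a hypothetical helper, check_inputs, which is not part of inmoose. Only the length check on batch comes from the parameter description; the non-negative integer checks are an assumption about what "raw count matrix" implies.

```python
import numpy as np

def check_inputs(counts, batch):
    """Hypothetical helper (not part of inmoose): basic sanity checks."""
    counts = np.asarray(counts)
    if counts.ndim != 2:
        raise ValueError("counts must be a 2-D gene x sample matrix")
    # Documented constraint: one batch label per column (sample).
    if counts.shape[1] != len(batch):
        raise ValueError(
            f"batch has {len(batch)} elements but counts has "
            f"{counts.shape[1]} columns"
        )
    # Assumption: raw counts are non-negative integers.
    if (counts < 0).any():
        raise ValueError("counts must be non-negative")
    if not np.all(np.mod(counts, 1) == 0):
        raise ValueError("counts must contain integer values")
    return counts

# Toy data: 3 genes x 4 samples, two samples per batch.
counts = np.array([[10, 12, 30, 28],
                   [0, 3, 5, 7],
                   [100, 90, 250, 260]])
batch = [0, 0, 1, 1]

checked = check_inputs(counts, batch)
print(checked.shape)
```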

Returns:

the input expression matrix adjusted for batch effects, with the same type as the input data

Return type:

matrix
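A minimal usage sketch, assuming inmoose is installed and importable as shown in the signature above. The count matrix is simulated and the batch labels (1 and 2) are arbitrary; the import is guarded so that the data preparation part still runs without inmoose.

```python
import numpy as np

rng = np.random.default_rng(42)

n_genes, n_per_batch = 50, 4

# Simulated raw counts (gene x sample). The second batch has a higher
# mean count to mimic a batch effect (illustrative values only).
batch1 = rng.negative_binomial(10, 0.1, size=(n_genes, n_per_batch))
batch2 = rng.negative_binomial(10, 0.05, size=(n_genes, n_per_batch))
counts = np.hstack([batch1, batch2])

# One batch label per column of the count matrix.
batch = [1] * n_per_batch + [2] * n_per_batch

corrected = None
try:
    from inmoose.pycombat import pycombat_seq
except ImportError:
    pycombat_seq = None  # inmoose not available; skip the correction step

if pycombat_seq is not None:
    # Default call; pass ref_batch=1 to use batch 1 as the reference instead.
    corrected = pycombat_seq(counts, batch)
    print(corrected.shape)
```

Per the description above, the returned matrix has the same type as the input, so here corrected would be a numpy array with the same gene x sample layout as counts.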