pycombat_seq: batch effect correction for RNA-Seq data
ComBat-Seq (Zhang et al., 2020) [1] follows in the footsteps of ComBat, but
specifically targets RNA-Seq data. Conceptually, ComBat-Seq is based on the same
mathematical framework as ComBat, except that it replaces the normal
distribution used for microarray data with a negative binomial distribution to
account for the specificities of RNA-Seq expression data.
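The choice of a negative binomial distribution can be illustrated with a short simulation. The sketch below is not part of pycombat_seq; it assumes the common mean/dispersion parameterization, in which the variance equals mean + dispersion * mean², and uses a hypothetical helper `sample_nb` built on NumPy. It shows that such counts have a variance far larger than their mean, which is one property of count data that a normal model does not capture naturally.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_nb(mean, dispersion, size, rng):
    """Draw negative binomial counts with a given mean and dispersion.

    Assumed parameterization: variance = mean + dispersion * mean**2.
    (Hypothetical helper for illustration, not an InMoose function.)
    """
    n = 1.0 / dispersion   # NumPy's "number of successes" parameter
    p = n / (n + mean)     # NumPy's success probability
    return rng.negative_binomial(n, p, size=size)


mean, dispersion = 100.0, 0.5
counts = sample_nb(mean, dispersion, size=100_000, rng=rng)

sample_mean = counts.mean()
sample_var = counts.var()
# Variance implied by the assumed parameterization
expected_var = mean + dispersion * mean**2

print(f"sample mean:       {sample_mean:.1f}")
print(f"sample variance:   {sample_var:.1f}")
print(f"expected variance: {expected_var:.1f}")
```

The sample variance should land close to the expected value and well above the mean, unlike a Poisson model where the two would be roughly equal.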
pycombat_seq() is a direct port of ComBat-Seq to Python. Since ComBat-Seq
relies on the Bioconductor edgeR package, the relevant parts of
edgeR have been ported as well. Because it closely follows the original
implementation in R, pycombat_seq() produces results very similar to those of
ComBat-Seq in terms of batch effect correction. Additionally,
pycombat_seq() is as fast as, if not faster than, the original implementation
in R. It also features additional capabilities, such as fixing a given batch as
the reference.
Code documentation
- inmoose.pycombat.pycombat_seq(data, batch, group=None, covar_mod=None, full_mod=True, shrink=False, shrink_disp=False, gene_subset_n=None, ref_batch=None)
pycombat_seq is an improved model based on ComBat that uses negative binomial regression and specifically targets RNA-Seq count data.
- Parameters:
data (matrix) – raw count matrix (dataframe or numpy array) from genomic studies (dimensions gene x sample)
batch (array or list or inmoose.utils.factor.Factor) – batch indices. Must have as many elements as the number of columns in the expression matrix.
group (array or list or inmoose.utils.factor.Factor, optional) – vector/factor for biological condition of interest (default: None)
covar_mod (matrix, optional) – model matrix for multiple covariates to include in the linear model (signal from these variables is kept in the data after adjustment)
full_mod (bool, optional) – if True, include condition of interest in model
shrink (bool, optional) – whether to apply shrinkage on parameter estimation
shrink_disp (bool, optional) – whether to apply shrinkage on dispersion
gene_subset_n (int, optional) – number of genes to use in empirical Bayes estimation, only useful when shrink = True
ref_batch (any, optional) – batch id of the batch to use as reference (default: None)
- Returns:
the input expression matrix adjusted for batch effects, with the same type as the input data
- Return type:
matrix
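As a minimal usage sketch, the call documented above could be exercised as follows. The count matrix, the batch labels and the group labels are simulated purely for illustration, and the artificial batch effect (tripling the counts of the second batch) is an assumption of this example, not a property of real data. The import is guarded so that the input preparation still runs when InMoose is not installed.

```python
import numpy as np

rng = np.random.default_rng(42)
n_genes, n_samples = 50, 8

# Simulated raw counts: genes in rows, samples in columns (illustrative data only)
counts = rng.negative_binomial(5, 0.05, size=(n_genes, n_samples))
# Inflate the last four samples to mimic a batch effect (assumption of this sketch)
counts[:, 4:] *= 3

# One batch label per column of the count matrix
batch = [0, 0, 0, 0, 1, 1, 1, 1]
# Biological condition of interest, balanced across the two batches
group = [0, 0, 1, 1, 0, 0, 1, 1]

# The batch vector must have as many elements as the matrix has columns
if counts.shape[1] != len(batch):
    raise ValueError("batch must have one entry per sample (column)")

try:
    from inmoose.pycombat import pycombat_seq
except ImportError:
    pycombat_seq = None  # InMoose not installed: skip the correction step

adjusted = None
if pycombat_seq is not None:
    # Positional arguments follow the documented order: data, then batch
    adjusted = pycombat_seq(counts, batch, group=group)
    print(np.shape(adjusted))
```

According to the parameter list, a reference batch could be selected by additionally passing ref_batch with one of the batch ids, for example ref_batch=0 with the labels used here.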