pycombat_seq: batch effect correction for RNA-Seq data
ComBat-Seq (Zhang et al., 2020) [1] follows in the footsteps of ComBat, but
specifically targets RNA-Seq data. Conceptually, ComBat-Seq is based on the same
mathematical framework as ComBat, except that it replaces the normal
distribution used for microarray data with a negative binomial distribution, to
account for the specific properties of RNA-Seq count data.
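To make the role of the negative binomial distribution more concrete, the sketch below simulates counts whose variance exceeds their mean (overdispersion), which a Poisson model cannot capture. It assumes the usual mean-dispersion parameterization, where the variance is mu + phi * mu^2. The helper name `nb_variance` and the values of `mu` and `phi` are purely illustrative and are not part of the inmoose API.

```python
import numpy as np


def nb_variance(mu, dispersion):
    """Variance of a negative binomial with mean `mu` and dispersion `phi`."""
    return mu + dispersion * mu ** 2


rng = np.random.default_rng(0)
mu, phi = 100.0, 0.5  # illustrative mean and dispersion (assumed values)

# NumPy parameterizes the distribution by (n, p); convert from (mu, phi).
n = 1.0 / phi
p = n / (n + mu)
counts = rng.negative_binomial(n, p, size=200_000)

print("sample mean:       ", counts.mean())
print("sample variance:   ", counts.var())
print("NB model variance: ", nb_variance(mu, phi))
print("Poisson variance:  ", mu)  # a Poisson model would force variance == mean
```

The sample variance is far larger than the sample mean here, which is the kind of behavior a count-based model such as the negative binomial is meant to accommodate.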
pycombat_seq() is a direct port of ComBat-Seq to Python. Since ComBat-Seq
relies on the Bioconductor edgeR package, the relevant parts of
edgeR have been ported along. Because it closely follows the original
implementation in R, pycombat_seq() yields results very similar to those of
ComBat-Seq in terms of batch effect correction. Additionally,
pycombat_seq() is as fast as, if not faster than, the original implementation
in R. It also features additional capabilities, such as fixing a given batch as
reference.
Code documentation
- inmoose.pycombat.pycombat_seq(data, batch, covar_mod=None, shrink=False, shrink_disp=False, gene_subset_n=None, ref_batch=None, na_cov_action='raise')
pycombat_seq is an improved model based on ComBat that uses negative binomial regression and specifically targets RNA-Seq count data.
- Parameters:
data (matrix) – raw count matrix (dataframe or numpy array) from genomic studies (dimensions gene x sample)
batch (array or list or inmoose.utils.factor.Factor) – batch indices. Must have as many elements as the number of columns in the expression matrix.
covar_mod (list or matrix, optional) – model matrix (dataframe, list or numpy array) for one or multiple covariates to include in the linear model (signal from these variables is kept in the data after adjustment). Covariates have to be categorical; they cannot be continuous values (default: None).
shrink (bool, optional) – whether to apply shrinkage on parameter estimation
shrink_disp (bool, optional) – whether to apply shrinkage on dispersion
gene_subset_n (int, optional) – number of genes to use in empirical Bayes estimation, only useful when shrink = True
ref_batch (any, optional) – batch id of the batch to use as reference (default: None)
na_cov_action (str) – how to handle missing covariates (default: "raise"):
    - "raise": raise an error if covariates are missing and stop the code
    - "remove": remove samples with missing covariates and raise a warning
    - "fill": handle missing covariates by creating a distinct covariate per batch
- Returns:
the input expression matrix adjusted for batch effects, with the same type as the input data
- Return type:
matrix
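A minimal usage sketch, based only on the signature documented above, might look as follows. The count matrix is synthetic, and the sizes, batch labels, and the artificial inflation of one batch are assumptions made for illustration. The import is guarded so that the snippet still runs where inmoose is not installed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic raw counts: genes in rows, samples in columns (illustrative sizes).
n_genes, n_samples = 50, 12
counts = rng.negative_binomial(n=5, p=0.1, size=(n_genes, n_samples))

# One batch label per column of the count matrix.
batch = [0] * 4 + [1] * 4 + [2] * 4

# Artificially inflate batch 1 so that there is a batch effect to correct.
counts[:, 4:8] *= 3

# The batch vector must have as many elements as the matrix has columns.
assert counts.shape[1] == len(batch)

try:
    from inmoose.pycombat import pycombat_seq
except ImportError:
    pycombat_seq = None  # inmoose is not installed; skip the correction step

if pycombat_seq is not None:
    # Default behavior: no batch is used as reference.
    adjusted = pycombat_seq(counts, batch)
    # Use batch 0 as the reference batch.
    adjusted_ref = pycombat_seq(counts, batch, ref_batch=0)
    print(adjusted.shape, adjusted_ref.shape)
```

Covariates, if any, would be passed through covar_mod, keeping in mind that they must be categorical.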