pycombat_seq: batch effect correction for RNA-Seq data
ComBat-Seq [Zhang2020] follows in the footsteps of ComBat, but specifically
targets RNA-Seq data. Conceptually, ComBat-Seq is based on the same mathematical
framework as ComBat, except that it replaces the normal distribution used for
microarray data with a negative binomial distribution to account for the
specificities of RNA-Seq expression data. pycombat_seq() is a direct port
of ComBat-Seq to Python. Since ComBat-Seq relies on the Bioconductor
edgeR package, the relevant parts of edgeR have been ported
along with it. Because it closely follows the original implementation in R,
pycombat_seq() yields results very similar to those of ComBat-Seq in terms of
batch effect correction. Additionally, pycombat_seq() is as fast as, if not
faster than, the original implementation in R. It also offers additional
capabilities, such as using a given batch as reference.
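To illustrate why a negative binomial distribution suits count data better than a normal one, here is a minimal sketch using NumPy only. It is not taken from ComBat-Seq or pycombat_seq: the helper `simulate_nb_counts` and the chosen mean and dispersion values are assumptions made for this illustration. It draws counts under the usual mean/dispersion parameterization, where the variance equals `mean + dispersion * mean**2`, and shows that the variance far exceeds the mean.

```python
import numpy as np


def simulate_nb_counts(mean, dispersion, size, rng):
    """Draw negative binomial counts with a given mean and dispersion.

    Hypothetical helper for illustration only. Under this parameterization,
    variance = mean + dispersion * mean**2.
    """
    n = 1.0 / dispersion   # NumPy's "number of successes" parameter
    p = n / (n + mean)     # NumPy's success probability
    return rng.negative_binomial(n, p, size=size)


rng = np.random.default_rng(0)
mean, dispersion = 100.0, 0.2  # arbitrary values chosen for the illustration
nb_counts = simulate_nb_counts(mean, dispersion, size=100_000, rng=rng)

expected_variance = mean + dispersion * mean**2
print("sample mean:      ", nb_counts.mean())
print("sample variance:  ", nb_counts.var())
print("expected variance:", expected_variance)
```

The sample variance is much larger than the sample mean (overdispersion), which neither a Poisson model nor a normal model with a fixed variance would capture for such integer counts.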
Code documentation
- inmoose.pycombat.pycombat_seq(counts, batch, covar_mod=None, shrink=False, shrink_disp=False, gene_subset_n=None, ref_batch=None, na_cov_action='raise')
pycombat_seq is an improved model based on ComBat that uses negative binomial regression and specifically targets RNA-Seq count data.
- Parameters:
counts (matrix) – raw count matrix (dataframe or numpy array) from genomic studies (dimensions gene x sample)
batch (array or list or inmoose.utils.factor.Factor) – batch indices. Must have as many elements as the number of columns in the expression matrix.
covar_mod (list or matrix, optional) – model matrix (dataframe, list or numpy array) for one or multiple covariates to include in the linear model (signal from these variables is kept in the data after adjustment). Covariates must be categorical; continuous values are not supported (default: None).
shrink (bool, optional) – whether to apply shrinkage on parameter estimation
shrink_disp (bool, optional) – whether to apply shrinkage on dispersion
gene_subset_n (int, optional) – number of genes to use in empirical Bayes estimation, only useful when shrink = True
ref_batch (any, optional) – batch id of the batch to use as reference (default: None)
na_cov_action (str) – how to handle missing covariates (default: "raise"):
"raise" – raise an error if covariates are missing and stop
"remove" – remove samples with missing covariates and raise a warning
"fill" – handle missing covariates by creating a distinct covariate per batch
- Returns:
the input expression matrix adjusted for batch effects, of the same type as the input data
- Return type:
matrix
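A minimal usage sketch follows, assuming the inmoose package is installed. The toy data generator `make_toy_counts`, its batch factor of 1.5 and its dispersion of 0.1 are hypothetical choices made for this illustration; only the `pycombat_seq(counts, batch)` call comes from the signature documented above. The call is guarded so that the data generation part still runs when inmoose is not available.

```python
import numpy as np


def make_toy_counts(n_genes=50, n_per_batch=4, seed=0):
    """Build a toy genes x samples count matrix with two batches.

    Hypothetical helper for illustration: batch 1 is scaled by an
    arbitrary factor of 1.5 to mimic a batch effect.
    """
    rng = np.random.default_rng(seed)
    batch = [0] * n_per_batch + [1] * n_per_batch
    base = rng.uniform(20, 200, size=n_genes)            # per-gene baseline mean
    factor = np.where(np.array(batch) == 1, 1.5, 1.0)    # per-sample batch factor
    mu = base[:, None] * factor[None, :]                 # genes x samples means
    n = 1.0 / 0.1                                        # dispersion of 0.1 (assumed)
    counts = rng.negative_binomial(n, n / (n + mu))
    return counts, batch


counts, batch = make_toy_counts()

try:
    from inmoose.pycombat import pycombat_seq

    # One batch label per column of the count matrix.
    adjusted = pycombat_seq(counts, batch)
    print("adjusted matrix shape:", adjusted.shape)
except ImportError:
    print("inmoose is not installed; skipping the correction step")
```

To use one batch as the reference, the documented `ref_batch` parameter can be passed with the corresponding batch id, for example `ref_batch=0` with the labels above.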