pycombat_seq: batch effect correction for RNA-Seq data

ComBat-Seq (Zhang et al., 2020) [1] follows in the footsteps of ComBat, but specifically targets RNA-Seq data. Conceptually, ComBat-Seq is based on the same mathematical framework as ComBat, except that it replaces the normal distribution used for microarray data with a negative binomial distribution, to account for the specificities of RNA-Seq expression data. pycombat_seq() is a direct port of ComBat-Seq to Python. Since ComBat-Seq relies on the Bioconductor edgeR package, the relevant parts of edgeR have been ported along with it. Because it closely follows the original implementation in R, pycombat_seq() yields results very similar to those of ComBat-Seq in terms of batch effect correction. Additionally, pycombat_seq() is as fast as, if not faster than, the original implementation in R. It also offers additional capabilities, such as fixing a given batch as the reference.
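To illustrate why a negative binomial distribution suits count data better, the minimal sketch below compares its variance with that of a Poisson distribution of the same mean. It assumes the usual mean-dispersion parameterization (variance = mean + dispersion x mean^2). The function name nb_variance is purely illustrative and is not part of inmoose.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial count with the given mean and dispersion.

    Assumes the mean-dispersion parameterization:
        var = mean + dispersion * mean**2
    With dispersion == 0 this reduces to the Poisson case (var == mean).
    """
    return mean + dispersion * mean ** 2


# Compare the Poisson variance (dispersion 0) with an overdispersed case.
for mean in (10, 100, 1000):
    print(mean, nb_variance(mean, 0.0), nb_variance(mean, 0.25))
```

The gap between the two variances grows with the mean, which is the overdispersion that a normal or Poisson model would not capture.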

Code documentation

inmoose.pycombat.pycombat_seq(data, batch, group=None, covar_mod=None, full_mod=True, shrink=False, shrink_disp=False, gene_subset_n=None, ref_batch=None)

pycombat_seq is an improved model based on ComBat that uses negative binomial regression and specifically targets RNA-Seq count data.

Parameters:
  • data (matrix) – raw count matrix (dataframe or numpy array) from genomic studies (dimensions genes x samples)

  • batch (array or list or inmoose.utils.factor.Factor) – Batch indices. Must have as many elements as the number of columns in the expression matrix.

  • group (array or list or inmoose.utils.factor.Factor, optional) – vector/factor for biological condition of interest (default: None)

  • covar_mod (matrix, optional) – model matrix for multiple covariates to include in the linear model (signal from these variables is kept in the data after adjustment)

  • full_mod (bool, optional) – if True, include condition of interest in model

  • shrink (bool, optional) – whether to apply shrinkage on parameter estimation

  • shrink_disp (bool, optional) – whether to apply shrinkage on dispersion

  • gene_subset_n (int, optional) – number of genes to use in empirical Bayes estimation, only useful when shrink=True

  • ref_batch (any, optional) – batch id of the batch to use as reference (default: None)

Returns:

The input expression matrix adjusted for batch effects, with the same type as the input data.

Return type:

matrix
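A minimal usage sketch follows, based only on the signature documented above. The count data are simulated and carry no biological meaning, and simulate_counts is a hypothetical helper written for this example, not part of inmoose. The call to pycombat_seq is guarded so that the snippet still runs when numpy or inmoose is not installed.

```python
import random


def simulate_counts(n_genes, batches, seed=0):
    """Hypothetical helper: build a genes x samples table of integer counts.

    Samples whose batch label is 1 get their counts doubled to mimic a
    crude batch effect.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_genes):
        base = rng.randint(20, 200)
        row = []
        for b in batches:
            scale = 2 if b == 1 else 1
            row.append(rng.randint(base // 2, base * 2) * scale)
        counts.append(row)
    return counts


# One batch label and one condition label per sample (column).
batch = [0, 0, 0, 0, 1, 1, 1, 1]
group = [0, 0, 1, 1, 0, 0, 1, 1]
counts = simulate_counts(50, batch)

try:
    import numpy as np
    from inmoose.pycombat import pycombat_seq
except ImportError:
    # numpy or inmoose is not available: skip the correction step.
    adjusted = None
else:
    # data has dimensions genes x samples; batch has one entry per column.
    adjusted = pycombat_seq(np.array(counts), batch, group=group)
    print(adjusted.shape)
```

Passing group keeps the biological condition in the model, so that its signal is preserved while the batch effect is adjusted. Other options, such as ref_batch, can be passed in the same call.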