Cohort Quality Control
In addition to its batch effect correction features, InMoose can generate a HTML report to assess how well batch effects are corrected, and how they correlate with co-variates.
We illustrate its usage with data freely available on NCBI Gene Expression Omnibus, namely:
GSE18520
GSE66957
GSE69428
The corresponding expression files are stored on InMoose repository in the data subfolder.
import pandas as pd
from inmoose.cohort_qc import CohortMetric, QCReport
from inmoose.pycombat import pycombat_norm
dataset_1 = pd.read_pickle("data/GSE18520.pickle")
dataset_2 = pd.read_pickle("data/GSE66957.pickle")
dataset_3 = pd.read_pickle("data/GSE69428.pickle")
datasets = [dataset_1, dataset_2, dataset_3]
# merge all three datasets into a single one, keeping only common genes
df_expression = pd.concat(datasets, join="inner", axis=1)
batch = [j for j, ds in enumerate(datasets) for _ in range(len(ds.columns))]
# run pycombat_norm
df_corrected = pycombat_norm(df_expression, batch)
# compute cohort metrics
cohort_metric = CohortMetric(
clinical_df=pd.DataFrame({"batch": batch}, index=df_expression.columns),
batch_column="batch",
data_expression_df=df_corrected,
data_expression_df_before=df_expression,
)
cohort_metric.process()
# build QC report
report = QCReport(cohort_metric)
report.save_report("report.html")
The snippet above generates the following report: