
Batch Effect in Single-cell RNA-seq: Frequently Asked Questions and Answers

Anurag Srivastava
December 18, 2023

The batch effect in single-cell RNA-seq experiments occurs when cells from distinct biological conditions are processed separately. It appears as systematic shifts in gene expression patterns and in dropout rates (almost 80% of gene expression values are zero) that stem primarily from technical rather than biological differences among the analyzed cells. Such variation can alter detection rates, distort distances between transcription profiles, and result in false discoveries.

Causes of Batch Effect in Single-cell RNA-seq

Batch effects can originate from multiple sources in various experimental settings, including single-cell RNA-sequencing (scRNA-seq) studies. These effects may arise from differences in sequencing platforms, timing, reagents, or experimental conditions across laboratories. They pose a significant challenge in scRNA-seq data analysis, potentially confounding biological interpretations.

Mitigating batch effects is crucial for ensuring the accuracy and reliability of downstream biological analysis in scRNA-seq experiments. Sound experimental design is the first line of defense against these effects. However, when mitigation through design is not feasible, computational approaches become essential.

In scRNA-seq, correcting batch effects is a crucial step in data preprocessing. By applying computational methods, researchers can effectively reduce the impact of technical variability and confounding factors, improving the robustness of downstream biological analysis. These approaches help ensure that biological signals are accurately captured and interpreted, facilitating meaningful insights into cellular heterogeneity, gene expression dynamics, and regulatory networks at the single-cell level.

Addressing Frequently Asked Questions on Batch Effect in Single-cell RNA-seq

Even though batch effect correction is a critical step in the processing of single-cell data, users often have questions about the process. Common queries include differentiating between normalization and batch effect correction, detecting the batch effect in single-cell datasets, methods for performing batch effect correction, and recognizing signs of overcorrection. This blog seeks to address these frequently asked questions in handling batch effects in single-cell RNA-seq.

What is the Difference between Normalization and Batch Effect Correction?

Normalization operates on the raw count matrix (e.g., cells x genes), whereas most batch effect removal methods work on dimensionality-reduced data to shorten computation time. Some methods (e.g., ComBat and Scanorama) can instead correct the full expression matrix. The two steps also address different sources of technical variation. Normalization corrects for differences in sequencing depth across cells, library size, and amplification bias caused by gene length. Batch effect correction, in contrast, addresses variation introduced by different sequencing platforms, timing, reagents, or conditions/laboratories.
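To make the distinction concrete, here is a minimal sketch of library-size normalization followed by a log transform, written with NumPy only. The function name `normalize_log1p` and the target of 10,000 counts per cell are illustrative assumptions, not a prescription from any particular pipeline. Note that the function never sees a batch label: it removes depth differences between cells, not batch differences.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to the same total count, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    library_size = counts.sum(axis=1, keepdims=True)
    # Guard against empty cells to avoid division by zero
    library_size[library_size == 0] = 1.0
    scaled = counts / library_size * target_sum
    return np.log1p(scaled)

# Toy matrix: 3 cells x 4 genes; cell 2 is cell 1 sequenced twice as deep
raw = np.array([[1, 0, 3, 6],
                [2, 0, 6, 12],
                [0, 5, 5, 0]])
norm = normalize_log1p(raw)
```

After normalization, the first two cells have identical profiles because they differed only in sequencing depth. A shift that affects every cell in one batch would survive this step, which is why batch correction is handled separately.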

How to Observe the Batch Effect in Single-cell RNA-seq Data?

The most common ways to identify the batch effect in single-cell RNA-seq datasets are the following:

1. Principal Component Analysis

Performing principal component analysis (PCA) on raw single-cell data helps identify a batch effect through the top principal components (PCs). A scatter plot of these top PCs reveals the batch effect when samples separate by batch rather than by biological source.
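The PCA check can also be expressed as a number rather than a plot. The sketch below, using NumPy only, simulates two batches with a technical shift, computes the top PCs via SVD, and measures how much of PC1's variance is explained by the batch label. The simulated data, the function names, and the size of the shift are all assumptions made for illustration.

```python
import numpy as np

def top_pcs(X, n_pcs=2):
    """Return cell coordinates on the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # center each gene
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_pcs] * S[:n_pcs]

def batch_variance_explained(pc, batch):
    """Fraction of one PC's variance explained by batch (between-group R^2)."""
    total = ((pc - pc.mean()) ** 2).sum()
    between = sum(
        (batch == b).sum() * (pc[batch == b].mean() - pc.mean()) ** 2
        for b in np.unique(batch)
    )
    return float(between / total)

rng = np.random.default_rng(0)
n_cells, n_genes = 200, 50
batch = np.repeat([0, 1], n_cells // 2)
X = rng.normal(size=(n_cells, n_genes))
X[batch == 1] += 2.0          # simulated technical shift for batch 1

pcs = top_pcs(X)
r2_pc1 = batch_variance_explained(pcs[:, 0], batch)
```

A value of `r2_pc1` close to 1 means the leading axis of variation is essentially the batch label, which is the numerical counterpart of seeing two batch-colored clouds on the PC1/PC2 scatter plot.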

2. t-SNE/UMAP Plot Examination

An effective approach to identifying a batch effect is to perform clustering analysis and visualize the cell groups on a t-SNE or UMAP plot, labeling cells by sample group (e.g., case/control) and by batch, both before and after batch correction. The rationale is that, in the presence of an uncorrected batch effect, cells from the same batch tend to cluster together, so cells of the same type from different batches form separate clusters instead of grouping by biological similarity. After batch correction, cells from different batches are expected to mix within biologically meaningful clusters, without such fragmentation.

3. Quantitative Metrics

Quantitative metrics are used to evaluate how well a batch correction method minimizes technical variation and harmonizes datasets from different experimental batches. Calculated on the data before and after batch correction, these metrics show whether the integration of cells from different samples has improved. Common metrics include normalized mutual information (NMI), adjusted Rand index (ARI), principal component regression on the batch covariate (PCR_batch), graph-based integration local inverse Simpson's index (graph iLISI), and the k-nearest neighbor batch effect test (kBET).
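As a rough illustration of what a neighborhood-based mixing metric measures, the sketch below computes the inverse Simpson's index of batch labels among each cell's k nearest neighbors and averages it over cells. This is a simplified stand-in written for this post, not the published LISI or kBET implementation: it uses hard k-nearest neighbors with equal weights, and the function name and parameters are assumptions. In this unscaled form the score ranges from 1 (each neighborhood contains a single batch) up to the number of batches (perfect mixing); benchmarking toolkits typically rescale such scores to a 0 to 1 range.

```python
import numpy as np

def knn_batch_mixing(embedding, batch, k=15):
    """Mean inverse Simpson's index of batch labels over k-nearest-neighbor sets."""
    embedding = np.asarray(embedding, dtype=float)
    batch = np.asarray(batch)
    # Pairwise squared Euclidean distances (fine for small toy data)
    sq = (embedding ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * embedding @ embedding.T
    np.fill_diagonal(d2, np.inf)                 # exclude the cell itself
    nn = np.argsort(d2, axis=1)[:, :k]
    labels = np.unique(batch)
    scores = []
    for neighbors in nn:
        # Proportion of each batch among this cell's neighbors
        p = np.array([(batch[neighbors] == b).mean() for b in labels])
        scores.append(1.0 / (p ** 2).sum())
    return float(np.mean(scores))

rng = np.random.default_rng(1)
batch = np.repeat([0, 1], 100)

separated = rng.normal(size=(200, 2))
separated[batch == 1] += 10.0                    # batches far apart
mixed = rng.normal(size=(200, 2))                # batches drawn from one cloud

score_separated = knn_batch_mixing(separated, batch)
score_mixed = knn_batch_mixing(mixed, batch)
```

Comparing the score on the embedding before and after correction gives a single number to track alongside the t-SNE/UMAP plots.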

How to Perform Batch Effect Correction?

Multiple algorithms have been developed for batch effect correction. Here, we discuss commonly used, publicly available algorithms for correcting the batch effect.

a. Seurat 3: This algorithm uses canonical correlation analysis (CCA) to project data into a subspace, identifying correlations across datasets. The mutual nearest neighbors (MNNs) calculated in this subspace then serve as anchors to correct and align cells during batch integration.

b. Harmony: This algorithm utilizes PCA for dimensionality reduction. Harmony iteratively removes batch effects by clustering similar cells across batches in each iteration. The process maximizes the diversity of batches within each cluster and calculates a correction factor for each cell, allowing for efficient and accurate detection of true biological connections across datasets.

Figure showing batch effect correction with Seurat 3 and Harmony method (Adapted from paper)

c. MNN Correct: This algorithm maps cells between datasets, reconstructing data in a shared space by detecting mutual nearest neighbors (MNNs). Assuming the batches share cell types, the differences observed between paired cells are attributed to the batch effect and used to estimate its strength. This estimate then guides the merging of the batches. Although it yields a normalized gene expression matrix for downstream analysis, the method demands significant computational resources because neighbors are computed in the high-dimensional gene expression space.

d. LIGER: This algorithm employs integrative non-negative matrix factorization to achieve a low-dimensional representation of input data consisting of batch-specific and shared factors. The process begins with clustering, and following that, a shared factor neighborhood graph is established to connect cells with similar neighborhoods. Identified clusters have their factor loading quantiles normalized to a chosen reference dataset, typically the one with the largest number of cells, accomplishing batch correction.

Figure showing batch effect correction with MNN Correct and LIGER method (Adapted from paper)

e. scGen: This algorithm employs a variational autoencoder (VAE) model trained on a reference dataset to correct the actual data. The VAE model demonstrates favorable performance in batch correction applications. Like MNN Correct, scGen produces a normalized gene expression matrix, facilitating downstream analysis.

f. Scanorama: This algorithm searches for MNNs in dimensionally reduced spaces and uses them in a similarity-weighted approach to guide batch integration. Scanorama yields both corrected expression matrices and embeddings, and has shown notable performance on more complex data in various studies.

Figure showing batch effect correction with Scanorama and scGen method (Adapted from paper)
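Several of the methods above (Seurat 3, MNN Correct, and Scanorama) build on mutual nearest neighbors. The sketch below shows the core idea on toy data with NumPy: find pairs of cells, one from each batch, that are in each other's nearest-neighbor lists, then average the differences across pairs to get a crude global batch vector. It is a deliberately simplified illustration, not the implementation used by any of these tools, which add steps such as cosine normalization, locally weighted correction vectors, or anchor scoring. All names and the simulated shift are assumptions.

```python
import numpy as np

def mutual_nearest_neighbors(A, B, k=1):
    """Return (i, j) pairs where A[i] and B[j] are in each other's k nearest neighbors."""
    # Squared distances between every cell in A and every cell in B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    a_to_b = np.argsort(d2, axis=1)[:, :k]       # for each A cell: nearest B cells
    b_to_a = np.argsort(d2.T, axis=1)[:, :k]     # for each B cell: nearest A cells
    pairs = []
    for i in range(A.shape[0]):
        for j in a_to_b[i]:
            if i in b_to_a[j]:
                pairs.append((i, int(j)))
    return pairs

def estimate_batch_shift(A, B, pairs):
    """Average difference vector across MNN pairs (a single global batch vector)."""
    diffs = np.array([B[j] - A[i] for i, j in pairs])
    return diffs.mean(axis=0)

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 10))                    # batch 1: 50 cells, 10 features
true_shift = np.zeros(10)
true_shift[0] = 0.5                              # simulated batch effect on one feature
B = A + true_shift + rng.normal(scale=0.05, size=A.shape)   # batch 2

# k=1 keeps the toy example unambiguous; real data typically uses a larger k
pairs = mutual_nearest_neighbors(A, B, k=1)
shift_est = estimate_batch_shift(A, B, pairs)
B_corrected = B - shift_est                      # move batch 2 onto batch 1
```

The pairwise distance matrix in this sketch grows with the product of the two batch sizes, which hints at why the text describes neighbor searches in full gene expression space as computationally demanding.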

What Are the Key Signs of Overcorrection?

One common issue in batch correction is overcorrection, where real biological variation is removed along with the technical variation. Some signs of an ‘overcorrected’ dataset include:

a. A significant portion of cluster-specific markers comprising genes with widespread high expression across various cell types, such as ribosomal genes.

b. A substantial overlap among markers specific to clusters.

c. Notable absence of expected cluster-specific markers; for instance, the lack of canonical markers for a particular T-cell subtype known to be present in the dataset.

d. The scarcity or absence of differential expression hits associated with pathways expected based on the composition of samples in terms of cell types and experimental conditions.

These signs serve as indicators of potential overcorrection issues during batch correction processes.
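The first three signs can be screened for with a few lines of plain Python once marker lists per cluster are available. The sketch below is an assumption-laden illustration: the function names are made up for this post, ribosomal genes are detected by the human `RPS`/`RPL` symbol prefixes, and the marker lists are toy inputs rather than real results. No thresholds are suggested, since what counts as "too much" depends on the dataset.

```python
from itertools import combinations

def ribosomal_fraction(markers):
    """Fraction of marker genes with a ribosomal-looking symbol (RPS*/RPL*)."""
    if not markers:
        return 0.0
    ribo = [g for g in markers if g.upper().startswith(("RPS", "RPL"))]
    return len(ribo) / len(markers)

def jaccard(a, b):
    """Jaccard overlap between two marker lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overcorrection_report(cluster_markers, expected_markers):
    """Summarize three overcorrection signs from per-cluster marker lists."""
    found = set().union(*cluster_markers.values()) if cluster_markers else set()
    return {
        "ribosomal_fraction": {
            c: ribosomal_fraction(m) for c, m in cluster_markers.items()
        },
        "marker_overlap": {
            (c1, c2): jaccard(cluster_markers[c1], cluster_markers[c2])
            for c1, c2 in combinations(sorted(cluster_markers), 2)
        },
        "missing_expected": set(expected_markers) - found,
    }

# Toy marker lists for three clusters (illustrative only)
cluster_markers = {
    "c0": ["CD3D", "CD3E", "IL7R", "CCR7"],
    "c1": ["RPS12", "RPL13", "RPS27", "CD3D"],
    "c2": ["RPS12", "RPL13", "RPS27", "MS4A1"],
}
expected = {"CD3D", "MS4A1", "CD8A"}

report = overcorrection_report(cluster_markers, expected)
```

In this toy input, clusters `c1` and `c2` are dominated by ribosomal genes and share most of their markers, and `CD8A` is missing everywhere, which together would prompt a closer look at the correction settings.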

Is the Batch Effect Correction in Single-cell RNA-seq Data the Same as in Bulk RNA-seq, or Is It Different?

The purpose of batch correction is the same in both cases: to identify and mitigate technical variation. The distinction between the two sequencing methods is primarily algorithmic. Techniques used in bulk RNA-seq might be insufficient to correct the batch effect in single-cell RNA-seq because of data size (on the order of 10k cells versus 10 total samples) or data sparsity. Conversely, single-cell RNA-seq techniques might be excessive for the smaller experimental designs associated with bulk RNA-seq.

Effective Batch Correction With Polly

Typically, batch correction is executed using a specific method and validated only through visualization. Polly-processed single-cell data incorporates both batch effect correction methods and quantitative metrics, which allows for a thorough evaluation of the correction method's effectiveness. In Polly's single-cell processing pipeline, the ‘Harmony’ method is commonly employed for batch effect correction, and its efficacy is assessed using quantitative metrics. In these metrics, values closer to 1 indicate better mixing of cells from the different batches. The example below shows a dataset before and after batch effect correction.

Before and after batch effect correction by Polly’s single-cell processing pipeline
Quantitative metrics for batch effect correction by Polly’s single-cell processing pipeline

'Polly Verified' ensures the absence of batch effects in the harmonized datasets delivered to our customers. At Elucidata, we believe in data transparency, and all the data delivered to our customers or partners is accompanied by a ‘Polly Verified’ report, ensuring high quality. A sample report can be accessed here, and a sneak peek into the report is shown below.

Polly Verified report of single-cell datasets

For researchers working with single-cell datasets, several algorithms excel at addressing batch effects in single-cell RNA sequencing (scRNA-seq) within a given study. These algorithms effectively tackle part of the challenge by correcting batch effects within the same study, enhancing the reliability of the results.

However, the issue persists when integrating data across multiple studies. Fully eliminating batch effects across studies, particularly considering diverse experimental designs and handling, remains a daunting task at present. Nevertheless, researchers can employ quantitative metrics to better evaluate the efficacy of batch correction methods.

These metrics let researchers assess the extent to which batch effects have been mitigated and make informed decisions about data integration strategies, limiting the impact of residual batch effects on downstream analyses in single-cell RNA-seq studies.

To find out more, reach out to us at info@elucidata.io, or request a demo here.
