Enhancing Data Quality: QC Filters for Single Cell RNA-seq Analysis

Shrushti Joshi

October 30, 2023

Quality control (QC) is a critical preliminary stage in single-cell RNA-seq (scRNA-Seq) data analysis, serving two primary objectives:

Evaluation of sample quality through metrics to determine the suitability for subsequent analyses.
Elimination of low-quality data and noise to enhance analysis accuracy and interpretability.

In this solution brief, we discuss the prevalent metrics and techniques used for QC and filtering of cell barcodes in single-cell RNA-seq data on our biomedical data curation platform, Polly. These choices determine which cell barcodes are included in downstream analysis and can therefore affect clustering outcomes and visualization.


Frequent Quality Control Filters for scRNA-seq on Polly

In scRNA-seq, various techniques exhibit differences in transcript length and sequence coverage. Some methods, such as Smart-seq and Quartz-seq, capture complete transcript sequences, while others, like Drop-seq (3’-end only), STRT-seq (5’-end only), and Chromium (3’-end only) focus on partial sequences. These techniques collectively form a pipeline that transforms limited-scale input into high-dimensional output, shedding light on cellular mechanisms and trajectory dynamics.

This analysis follows a structured workflow, divided into two main sections: pre-processing and downstream analysis. Common quality control filters are the gatekeepers of data integrity, ensuring that the information derived from complex datasets remains accurate and reliable.

Pre-filtering Prerequisites for scRNA-seq Data Analysis on Polly

Before embarking on the data filtering process for single-cell RNA-seq data, two essential steps should be undertaken:

Running Cell Ranger: The initial step involves employing Cell Ranger, a suite of analysis pipelines designed to process Chromium single-cell data. These pipelines handle tasks such as read alignment, the generation of feature-barcode matrices, secondary analysis like clustering, and more. The feature-barcode matrix produced in this step is the foundation for subsequent quality control (QC) and filtering of cell barcodes. For detailed information, please refer to the relevant webpage.
Plotting Distribution of Potential Filtering Metrics: Before establishing the specific metrics and their corresponding thresholds for filtering, it is advisable to visualize the distribution of relevant data points. This practice provides insights into overall data quality and aids in identifying any unexpected phenotypes. Various metrics, as detailed in the subsequent section, can be examined through visualization techniques like violin plots, box plots, or density plots. This step enhances the understanding of data characteristics and informs subsequent filtering decisions.
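The per-barcode metrics that are usually plotted at this stage can be derived directly from the feature-barcode matrix. The following is a minimal sketch in plain Python, not Polly's or Cell Ranger's implementation. The function name `per_barcode_qc`, the toy matrix, and the use of the `MT-` gene-name prefix to mark mitochondrial genes (the human gene symbol convention) are assumptions made for illustration.

```python
from typing import Dict, List


def per_barcode_qc(
    counts: List[List[int]], genes: List[str], mt_prefix: str = "MT-"
) -> List[Dict[str, float]]:
    """Compute QC metrics per barcode.

    counts[g][c] is the UMI count of gene g in barcode c (genes x barcodes).
    mt_prefix is an assumed naming convention for mitochondrial genes.
    """
    n_barcodes = len(counts[0])
    metrics = []
    for c in range(n_barcodes):
        column = [row[c] for row in counts]
        total = sum(column)                           # total UMI count
        n_features = sum(1 for v in column if v > 0)  # genes detected
        mt = sum(
            v for v, g in zip(column, genes) if g.upper().startswith(mt_prefix)
        )
        pct_mt = 100.0 * mt / total if total else 0.0  # avoid division by zero
        metrics.append(
            {"total_umis": total, "n_features": n_features, "pct_mt": pct_mt}
        )
    return metrics


# Toy matrix: 4 genes x 3 barcodes (hypothetical values).
genes = ["ACTB", "GAPDH", "MT-CO1", "MT-ND1"]
counts = [
    [10, 0, 1],
    [5, 0, 0],
    [3, 0, 8],
    [2, 0, 1],
]
qc = per_barcode_qc(counts, genes)
for barcode_index, m in enumerate(qc):
    print(barcode_index, m)
```

Each of the three values per barcode can then be passed to a plotting library to draw the violin, box, or density plots described above. For real datasets, a sparse-matrix library would be more practical than nested lists.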

QC Metrics and Filtering Approaches on Polly

In the analysis of single-cell data, the adoption of common metrics and filtering methods is pivotal. Below, we explore these practices in brief, providing insights into their rationale and potential caveats where applicable.

Filtering Cell Barcodes by UMI Counts: The UMI counts associated with a cell barcode signify the observed transcript number in a droplet. Barcodes with exceptionally high UMI counts might indicate multiplets (multiple cells in one droplet), while low UMI counts may suggest ambient RNAs in empty droplets. Filtering by UMI counts helps eliminate non-single-cell barcodes. The choice of UMI count thresholds varies, with some studies utilizing data-driven thresholds. However, in highly heterogeneous samples, a one-size-fits-all threshold may exclude real single cells with varying RNA contents.
Filtering Cells by Number of Features: Similar to UMI counts, barcodes with an unusually high number of features may represent multiplets, while those with fewer features may indicate empty droplets. The choice of feature count thresholds should consider sample heterogeneity, as imposing a uniform threshold may erroneously exclude cells expressing a wide or limited array of genes.
Filtering Cells by Percent of Mitochondrial (mt) Reads: An elevated fraction of mitochondrial (mt) transcripts in a cell is associated with an unhealthy state or with cellular leakage. Filtering on the percentage of mt reads uses either arbitrary or data-driven thresholds. However, mt gene expression varies between cell types, so filtering solely on this metric may introduce bias, particularly for cell types with naturally high mitochondrial content, such as cardiomyocytes.
Filtering Genes Based on Expression Thresholds: Setting expression thresholds is essential to filter out genes with little to no detectable expression. Genes with minimal or absent expression may arise from technical artifacts, and including them in the analysis can introduce noise and lead to misinterpretation of the data. Eliminating such genes shifts the focus to biologically relevant signals.
Identifying the Detection Rate: Detecting and excluding genes with low detection rates is another crucial step. These are genes expressed in only a few cells within the dataset. Prioritizing consistently detected genes enhances the robustness of the analysis, as genes with sporadic expression may not contribute significantly to the overall biological patterns.
Detecting and Filtering Doublets: To ensure the accuracy of the analysis, specialized algorithms like DoubletFinder and Scrublet are employed for doublet detection. Doublets or multiplets are barcodes that capture genetic material from more than one cell, and their presence can confound the interpretation of single-cell data. Identifying and removing these doublets is essential for accurate downstream analysis.
Handling the Batch Effects: It's crucial to address batch effects when dealing with data from multiple sequencing runs or experimental conditions. Batch effects can introduce variability and bias into the data, so methods such as batch correction or integration are needed to harmonize it. This ensures that the results reflect true biological differences rather than technical variation.
Dimensionality Reduction: Dimensionality reduction techniques, such as Principal Component Analysis (PCA), are used to simplify complex data and visualize cell clusters. This step is crucial for identifying and potentially removing outliers or contaminating cells that might skew the analysis results. Researchers can focus on the most informative features and patterns by reducing the data's dimensionality.
Evaluating the Replicates: To ensure the data's reliability, researchers assess the consistency between replicates. Replicates are data generated from different runs or samples, and evaluating their consistency verifies that the data is comparable and suitable for downstream analyses. Inconsistencies between replicates can indicate data quality issues and should be addressed.
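The threshold-based filters in the list above (UMI counts, number of features, percent mt reads, and gene detection rate) can be sketched as follows. This is an illustrative sketch, not the procedure Polly applies. It assumes one possible data-driven rule, flagging barcodes whose log-transformed UMI or feature counts lie more than `n_mads` median absolute deviations (MADs) from the median, together with a fixed `max_pct_mt` cutoff and a `min_cells` detection cutoff. All function names, parameter defaults, and toy values are hypothetical.

```python
import math
from statistics import median
from typing import Dict, List, Sequence


def mad_outliers(values: Sequence[float], n_mads: float = 3.0) -> List[bool]:
    """Flag values more than n_mads MADs away from the median.

    Note: if the MAD is zero, every value that differs from the median is flagged.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return [abs(v - med) > n_mads * mad for v in values]


def keep_barcodes(
    metrics: List[Dict[str, float]], max_pct_mt: float = 20.0, n_mads: float = 3.0
) -> List[int]:
    """Return indices of barcodes that pass the UMI, feature, and mt filters."""
    # Log-transform so that low and high outliers are treated more symmetrically.
    log_umis = [math.log1p(m["total_umis"]) for m in metrics]
    log_feats = [math.log1p(m["n_features"]) for m in metrics]
    umi_out = mad_outliers(log_umis, n_mads)
    feat_out = mad_outliers(log_feats, n_mads)
    return [
        i
        for i, m in enumerate(metrics)
        if not umi_out[i] and not feat_out[i] and m["pct_mt"] <= max_pct_mt
    ]


def genes_passing_detection(counts: List[List[int]], min_cells: int = 3) -> List[int]:
    """Return indices of genes detected (count > 0) in at least min_cells barcodes.

    counts[g][c] is the UMI count of gene g in barcode c (genes x barcodes).
    """
    return [
        g
        for g, row in enumerate(counts)
        if sum(1 for v in row if v > 0) >= min_cells
    ]


# Hypothetical per-barcode metrics: one low-count barcode, one very high-count
# barcode, and one barcode with a high mitochondrial fraction.
metrics = [
    {"total_umis": 1000, "n_features": 500, "pct_mt": 5.0},
    {"total_umis": 1100, "n_features": 520, "pct_mt": 4.0},
    {"total_umis": 950, "n_features": 480, "pct_mt": 30.0},
    {"total_umis": 1050, "n_features": 510, "pct_mt": 6.0},
    {"total_umis": 20, "n_features": 15, "pct_mt": 50.0},
    {"total_umis": 9000, "n_features": 3000, "pct_mt": 3.0},
]
kept = keep_barcodes(metrics)

# Hypothetical genes x barcodes matrix for the detection-rate filter.
gene_counts = [
    [1, 0, 2, 3],
    [0, 0, 0, 1],
    [5, 5, 5, 5],
]
kept_genes = genes_passing_detection(gene_counts, min_cells=3)
print(kept, kept_genes)
```

As the caveats above note, a single set of thresholds may remove genuine cells in heterogeneous samples or in cell types with naturally high mt content, so cutoffs like these are best chosen after inspecting the metric distributions, and per sample or per cell population where appropriate.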

Polly-Verified Datasets: Ensuring Quality in scRNA-seq Data

Quality control is critical in scRNA-seq data analysis, ensuring that only high-quality cells and genes are used for downstream analysis. By implementing standard QC filters and considering the unique characteristics of your dataset, you can enhance the reliability and biological relevance of your scRNA-seq results, leading to more accurate insights into cellular heterogeneity and gene expression patterns.

Polly is a transformative asset in elevating the quality of data. It excels in curating multi-omics and assay data, rendering them ML-ready and analysis-ready. This process is driven by a Polly-verified curation engine, overseen by skilled experts who harmonize a wide spectrum of data types, enrich metadata, and ensure consistent data processing while maintaining affordability. The ML-Ready data is securely stored on cloud-based Atlas data stores, optimized for efficient analysis and data management.

Polly's state-of-the-art technology caters to approximately 26 diverse R&D data types, meeting the requirements of teams involved in pre-clinical drug discovery and diagnostics R&D. It's the trusted choice for over 25 research organizations, including four of the 10 largest pharmaceutical companies, who leverage Polly and its associated solutions to expedite their discovery programs. Numerous other data-driven healthcare enterprises rely on Polly-verified processes to harmonize and securely store public and proprietary biomedical data. In a nutshell, Polly, with its user-friendly interface and advanced capabilities, ensures high-quality scRNA-seq data.

