Enhancing Data Quality: QC Filters for Single Cell RNA-seq Analysis

Shrushti Joshi
October 30, 2023

Quality control (QC) is a critical preliminary stage in single-cell RNA-seq (scRNA-Seq) data analysis, serving two primary objectives:

  1. Evaluation of sample quality through metrics to determine the suitability for subsequent analyses.
  2. Elimination of low-quality data and noise to enhance analysis accuracy and interpretability.

In this solution brief, we discuss the metrics and techniques commonly employed for QC and filtering of cell barcodes in single-cell RNA-seq data on our biomedical data curation platform, Polly. These methods determine which cell barcodes are carried into downstream analysis and can therefore affect clustering outcomes and visualization.

Figure: Quality control is crucial in scRNA-seq.

Common Quality Control Filters for scRNA-seq on Polly

In scRNA-seq, the various protocols differ in transcript length and sequence coverage. Some methods, such as Smart-seq and Quartz-seq, capture complete transcript sequences, while others, like Drop-seq (3’-end only), STRT-seq (5’-end only), and Chromium (3’-end only), focus on partial sequences. Whichever protocol is used, the pipeline transforms a limited amount of input material into a high-dimensional expression matrix, shedding light on cellular mechanisms and trajectory dynamics.

This analysis follows a structured workflow, divided into two main sections: pre-processing and downstream analysis. Common quality control filters are the gatekeepers of data integrity, ensuring that the information derived from complex datasets remains accurate and reliable.

Figure: QC at each step of the scRNA-seq workflow.

Pre-filtering Prerequisites for scRNA-seq Data Analysis on Polly

Before embarking on the data filtering process for single-cell RNA-seq data, two essential steps should be undertaken:

  1. Running Cell Ranger: The initial step involves employing Cell Ranger, a suite of analysis pipelines designed to process Chromium single-cell data. These pipelines handle tasks such as read alignment, generation of feature-barcode matrices, and secondary analyses such as clustering. The feature-barcode matrix produced in this step is the foundation for subsequent quality control (QC) and filtering of cell barcodes. For detailed information, please refer to 10x Genomics' Cell Ranger documentation.
  2. Plotting Distribution of Potential Filtering Metrics: Before establishing the specific metrics and their corresponding thresholds for filtering, it is advisable to visualize the distribution of the relevant data points. This provides insight into overall data quality and helps identify unexpected phenotypes. The metrics detailed in the next section can be examined with visualization techniques such as violin plots, box plots, or density plots (a minimal plotting sketch follows this list). This step improves the understanding of the data's characteristics and informs subsequent filtering decisions.
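The distributions can be inspected with a few lines of Scanpy. The sketch below is illustrative only: the matrix path, the "MT-" gene prefix (human gene symbols), and the choice of plotted metrics are assumptions to adapt to the dataset at hand.

```python
import scanpy as sc

# Load the Cell Ranger feature-barcode matrix (path is hypothetical)
adata = sc.read_10x_mtx("cellranger_output/filtered_feature_bc_matrix/")

# Flag mitochondrial genes; the "MT-" prefix assumes human gene symbols
adata.var["mt"] = adata.var_names.str.startswith("MT-")

# Compute per-cell QC metrics: total UMI counts, detected genes, percent mitochondrial reads
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)

# Violin plots of the candidate filtering metrics
sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True,
)
```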

QC Metrics and Filtering Approaches on Polly

In the analysis of single-cell data, the adoption of common metrics and filtering methods is pivotal. Below, we explore these practices in brief, providing insights into their rationale and potential caveats where applicable.

  1. Filtering Cell Barcodes by UMI Counts: The UMI counts associated with a cell barcode reflect the number of transcripts observed in a droplet. Barcodes with exceptionally high UMI counts might indicate multiplets (multiple cells in one droplet), while very low UMI counts may indicate ambient RNA in empty droplets. Filtering by UMI counts helps eliminate non-single-cell barcodes (a combined sketch of this and the next four filters appears after the list). The choice of UMI count thresholds varies, with some studies using data-driven thresholds; however, in highly heterogeneous samples, a one-size-fits-all threshold may exclude real single cells whose RNA content naturally varies.
  2. Filtering Cells by Number of Features: Similar to UMI counts, barcodes with an unusually high number of features may represent multiplets, while those with fewer features may indicate empty droplets. The choice of feature count thresholds should consider sample heterogeneity, as imposing a uniform threshold may erroneously exclude cells expressing a wide or limited array of genes.
  3. Filtering Cells by Percent of Mitochondrial (mt) Reads: Elevated levels of mitochondrial transcripts are associated with stressed or dying cells whose cytoplasmic RNA has leaked out. Filtering on the percentage of mt reads relies on arbitrary or data-driven thresholds. However, mt gene expression varies between cell types, and filtering based solely on this metric may introduce bias, particularly for cell types with naturally high mitochondrial content, such as cardiomyocytes.
  4. Filtering Genes by Expression Thresholds: Setting expression thresholds is essential to filter out genes with little to no detectable expression. Genes with minimal or absent expression may arise from technical artifacts, and including them in the analysis can introduce noise and lead to misinterpretation of the data. Eliminating such genes shifts the focus to biologically relevant signals.
  5. Filtering Genes by Detection Rate: Detecting and excluding genes with low detection rates, that is, genes expressed in only a few cells within the dataset, is another crucial step. Prioritizing consistently detected genes enhances the robustness of the analysis, as sporadically expressed genes rarely contribute meaningfully to the overall biological patterns.
  6. Detecting and Filtering Doublets: To ensure the accuracy of the analysis, specialized algorithms such as DoubletFinder and Scrublet are employed for doublet detection (a minimal Scrublet sketch follows the list). Doublets or multiplets are barcodes that capture genetic material from two or more cells in the same droplet, and their presence can confound the interpretation of single-cell data. Identifying and removing them is essential for accurate downstream analysis.
  7. Handling Batch Effects: It is crucial to address batch effects when combining data from multiple sequencing runs or experimental conditions. Batch effects can introduce variability and bias, so batch correction or data integration methods are essential to harmonize the data and ensure that results reflect true biological differences rather than technical variation (see the integration sketch after the list).
  8. Dimensionality Reduction: Dimensionality reduction techniques, such as Principal Component Analysis (PCA), are used to simplify complex data and visualize cell clusters. This step is crucial for identifying and potentially removing outliers or contaminating cells that might skew the analysis results. Researchers can focus on the most informative features and patterns by reducing the data's dimensionality.
  9. Evaluating the Replicates: To ensure the data's reliability, researchers assess the consistency between replicates. Replicates are data generated from different runs or samples, and evaluating their consistency verifies that the data is comparable and suitable for downstream analyses. Inconsistencies between replicates can indicate data quality issues and should be addressed.
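To make the first five filters concrete, here is a minimal Scanpy sketch. All thresholds (a minimum of 500 UMIs, 200 to 6,000 detected genes, under 10% mitochondrial reads, detection in at least 3 cells) are illustrative assumptions, not recommendations, and should be chosen from the metric distributions plotted earlier.

```python
import scanpy as sc

# Assumes `adata` already carries the QC metrics computed with
# sc.pp.calculate_qc_metrics (n_genes_by_counts, total_counts, pct_counts_mt).

# 1. Filter barcodes by UMI counts (illustrative lower bound for empty droplets)
sc.pp.filter_cells(adata, min_counts=500)

# 2. Filter barcodes by number of detected features
sc.pp.filter_cells(adata, min_genes=200)
adata = adata[adata.obs["n_genes_by_counts"] < 6000, :].copy()  # crude multiplet cut-off

# 3. Filter barcodes by percent of mitochondrial reads
adata = adata[adata.obs["pct_counts_mt"] < 10, :].copy()

# 4 and 5. Drop genes with little or no expression / a low detection rate
sc.pp.filter_genes(adata, min_cells=3)

print(adata)  # remaining cells x genes after filtering
```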
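For doublet detection (filter 6), a typical Scrublet call looks like the sketch below; the 6% expected doublet rate and the other parameters are assumptions taken from Scrublet's published examples, and DoubletFinder offers an analogous workflow in R.

```python
import scrublet as scr

# Run Scrublet on the raw counts (cells x genes) of the filtered object from above
scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets(
    min_counts=2,
    min_cells=3,
    min_gene_variability_pctl=85,
    n_prin_comps=30,
)

# Store the results and drop the predicted doublets
adata.obs["doublet_score"] = doublet_scores
adata.obs["predicted_doublet"] = predicted_doublets
adata = adata[~adata.obs["predicted_doublet"], :].copy()
```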
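Batch correction and dimensionality reduction (points 7 and 8) can be sketched in the same Scanpy session. The snippet below assumes a hypothetical "batch" column in adata.obs and uses Harmony (via Scanpy's external API, which requires the harmonypy package) as one of several possible integration methods; the parameter choices are illustrative.

```python
import scanpy as sc

# Standard normalization ahead of PCA
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Keep highly variable genes so PCA focuses on informative features
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var["highly_variable"]].copy()

# 8. Dimensionality reduction with PCA
sc.pp.scale(adata, max_value=10)
sc.pp.pca(adata, n_comps=50)

# 7. Batch correction with Harmony; assumes adata.obs["batch"] records
#    the sequencing run or experimental condition
sc.external.pp.harmony_integrate(adata, key="batch")

# Downstream steps (neighbors, clustering, UMAP) can then use the corrected embedding
sc.pp.neighbors(adata, use_rep="X_pca_harmony")
```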

Polly-Verified Datasets: Ensuring Quality in scRNA-seq Data

Quality control is critical in scRNA-seq data analysis, ensuring that only high-quality cells and genes are used for downstream analysis. By implementing standard QC filters and considering the unique characteristics of your dataset, you can enhance the reliability and biological relevance of your scRNA-seq results, leading to more accurate insights into cellular heterogeneity and gene expression patterns.

Polly is a transformative asset in elevating the quality of data. It excels in curating multi-omics and assay data, rendering them ML-ready and analysis-ready. This process is driven by a Polly-verified curation engine, overseen by skilled experts who harmonize a wide spectrum of data types, enrich metadata, and ensure consistent data processing while maintaining affordability. The ML-Ready data is securely stored on cloud-based Atlas data stores, optimized for efficient analysis and data management.

Polly's state-of-the-art technology caters to approximately 26 diverse R&D data types, meeting the requirements of teams involved in pre-clinical drug discovery and diagnostics R&D. It is the trusted choice of over 25 research organizations, including four of the 10 largest pharmaceutical companies, which leverage Polly and its associated solutions to expedite their discovery programs. Numerous other data-driven healthcare enterprises rely on Polly-verified processes to harmonize and securely store public and proprietary biomedical data. In a nutshell, Polly, with its user-friendly interface and advanced capabilities, ensures high-quality scRNA-seq data.
