Curation – The Missing Link to Single Cell Data Analysis

Deepthi Das
February 24, 2022

Single-cell RNA sequencing (scRNA-seq) has revolutionized the way we study gene expression at the cellular level. By sequencing the RNA of individual cells, we can now gain a deeper understanding of cellular heterogeneity and biological processes at single-cell resolution.

However, the abundance of data generated by scRNA-seq experiments brings challenges that must be overcome to unlock its full potential.

Discovery teams working on single-cell data often get stuck for days or even weeks on the initial step of sourcing relevant datasets from public data portals. Storing and analyzing this data is another roadblock.

Let’s take a quick glance at the recurring challenges scientists face while performing single-cell analysis (SCA) and some solutions that could streamline their discovery process.

Single Cell RNA Data Analysis Workflow

Challenges of Working with Publicly Available Single-Cell Data

  • Semi-structured, raw scRNA-seq data from public repositories are difficult to retrieve and integrate for cell-type and cell-function annotation. Each repository processes data differently and may lack adequate metadata annotations, which directly affects the findability of these datasets.
  • There is a lack of standards for depositing cell-level metadata. Although guidelines have recently been proposed for single-cell data deposition, they focus primarily on describing experimental aspects of the study. In most cases, even the cell types that investigators assigned to each cellular barcode are not reported.
  • Even preliminary exploration of single-cell data has extensive memory requirements. Researchers also spend a significant amount of time downloading the data, packages, and libraries into a computational environment.
  • Results generated by different pipelines written by different users are hard to reproduce. More importantly, comparing and interpreting different datasets requires a standard processing pipeline.

Here Are the Factors Which Can Streamline the Discovery Process

1. Metadata Harmonization

Standard metadata fields make it much easier to identify relevant datasets. Key annotations include tissue, disease, number of samples, platform or sequencing technology (10x or Smart-seq), organism, sample cohorts, and cell types.
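As an illustration of what harmonization involves, here is a minimal Python sketch that maps differently named metadata fields and values onto one standard schema. The alias tables and the `harmonize` helper are hypothetical and deliberately tiny; a real curation effort would more likely map values to established ontology terms.

```python
# Hypothetical alias tables (assumptions for illustration only).
# Keys are lower-cased raw field names; values are the standard field names.
FIELD_ALIASES = {
    "tissue": "tissue", "organ": "tissue", "tissue_type": "tissue",
    "disease": "disease", "condition": "disease", "diagnosis": "disease",
    "organism": "organism", "species": "organism",
    "platform": "platform", "technology": "platform",
}

# Per-field mapping from lower-cased raw values to a controlled vocabulary.
VALUE_ALIASES = {
    "tissue": {"lung": "lung", "pulmonary": "lung"},
    "organism": {"human": "Homo sapiens", "homo sapiens": "Homo sapiens",
                 "mouse": "Mus musculus", "mus musculus": "Mus musculus"},
    "platform": {"10x": "10x", "10x genomics": "10x",
                 "smartseq": "Smart-seq", "smart-seq": "Smart-seq"},
}


def harmonize(record):
    """Return a copy of `record` with standard field names and values."""
    clean = {}
    for raw_key, raw_value in record.items():
        key = FIELD_ALIASES.get(raw_key.strip().lower().replace(" ", "_"))
        if key is None:
            continue  # drop fields that are outside the standard schema
        value = str(raw_value).strip()
        # Unmapped values are kept as-is so nothing is silently invented.
        clean[key] = VALUE_ALIASES.get(key, {}).get(value.lower(), value)
    return clean


# Two records describing comparable samples with inconsistent annotations.
records = [
    {"Organ": "Lung", "Species": "human", "Technology": "10X Genomics"},
    {"tissue_type": "lung", "organism": "Homo sapiens", "platform": "smartseq"},
]
harmonized = [harmonize(r) for r in records]
```

After harmonization, both records share the same field names and vocabulary, so a single query such as "tissue is lung and organism is Homo sapiens" finds both.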

2. Scalable Infrastructure and Integrative Platform for Analysis

An ideal cloud platform for processing the data would store different data formats such as h5ad or h5seurat and run compute-intensive processing workflows such as Cell Ranger, Scanpy, and Seurat. It would also integrate with applications and open-source algorithms such as NicheNet (ligand-receptor analysis), CCA and Harmony (batch correction), and SingleR and SCSA (automated cell type annotation).

3. Specific Pipeline for Consistent Analysis

A standard single-cell analysis workflow (such as Scanpy or Seurat) should be used to analyze all the datasets so that comparative studies can be carried out between single-cell data from different sources. The datasets can then be stored in a single format, such as h5ad, which is widely used in the single-cell sequencing community. The chosen format should be able to store large amounts of data and allow fast querying of parts of a file without loading the complete file into memory.
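To show why a shared pipeline makes datasets comparable, here is a dependency-free Python sketch of two steps such a pipeline typically fixes: scaling each cell to a common total count and applying a log(1 + x) transform. The function names and the target of 10,000 counts per cell are assumptions for illustration; in practice these steps would normally be done with a library such as Scanpy (for example `sc.pp.normalize_total` followed by `sc.pp.log1p`) or Seurat.

```python
import math

# Shared parameter applied to every dataset. The value 10_000 is an assumption,
# chosen because counts-per-10k is a common normalization target.
TARGET_SUM = 10_000


def normalize_cell(counts, target_sum=TARGET_SUM):
    """Scale one cell's raw counts to `target_sum`, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)  # empty cell: nothing to scale
    return [math.log1p(c * target_sum / total) for c in counts]


def process_dataset(matrix):
    """Apply the identical normalization to every cell (row) of a dataset."""
    return [normalize_cell(row) for row in matrix]


# Two toy datasets (cells x genes) with the same expression proportions
# but a tenfold difference in sequencing depth.
shallow = [[1, 0, 3], [2, 2, 0]]
deep = [[10, 0, 30], [20, 20, 0]]

processed_shallow = process_dataset(shallow)
processed_deep = process_dataset(deep)
```

Because both datasets pass through the same function with the same parameters, the depth difference disappears and the processed values can be compared directly. If each dataset had been normalized with a different target or transform, that comparison would not be meaningful.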

4. Find All the Relevant Data in One Place

A resource or repository that collates single-cell data across diverse areas, especially oncology, would save researchers a lot of time and effort and help them derive meaningful insights.

5. Interactive Visualization Tools

Interactive visualization tools let researchers view their scRNA-seq data in a variety of ways, making it easier to quickly identify and remove outliers or low-quality cells. This reduces the time and effort required for manual curation.
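The per-cell quality metrics that such tools commonly display can be sketched in a few lines of Python. This is a minimal illustration, not a recommended protocol: the helper names, the toy gene list, and both thresholds are assumptions, and appropriate cutoffs depend on the tissue and the sequencing protocol.

```python
# Illustrative thresholds (assumptions); real cutoffs are dataset-specific.
MIN_GENES = 2              # minimum number of detected genes per cell
MAX_MITO_FRACTION = 0.2    # maximum share of counts from mitochondrial genes


def qc_metrics(counts, gene_names):
    """Compute simple per-cell QC metrics from a raw count vector."""
    total = sum(counts)
    n_genes = sum(1 for c in counts if c > 0)
    # Mitochondrial gene symbols conventionally start with "MT-" (human)
    # or "mt-" (mouse), so the prefix check is case-insensitive.
    mito = sum(c for c, g in zip(counts, gene_names)
               if g.upper().startswith("MT-"))
    mito_fraction = mito / total if total else 0.0
    return {"total_counts": total, "n_genes": n_genes,
            "mito_fraction": mito_fraction}


def is_low_quality(metrics):
    """Flag cells with too few detected genes or too much mitochondrial signal."""
    return (metrics["n_genes"] < MIN_GENES
            or metrics["mito_fraction"] > MAX_MITO_FRACTION)


genes = ["CD3E", "MS4A1", "LYZ", "MT-CO1"]
cells = {
    "cell_a": [5, 0, 3, 1],  # several genes detected, low mitochondrial share
    "cell_b": [0, 0, 1, 9],  # dominated by mitochondrial counts
    "cell_c": [0, 4, 0, 0],  # only one gene detected
}

report = {name: qc_metrics(counts, genes) for name, counts in cells.items()}
kept = [name for name, metrics in report.items() if not is_low_quality(metrics)]
```

Plotting the values in `report` (for example, detected genes against mitochondrial fraction) is what lets a researcher see where the low-quality cells separate from the rest before committing to a threshold.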

6. Collaboration and Sharing

Sharing scRNA-seq data and curation tools among researchers avoids duplicated effort and can play a significant role in streamlining the curation process for single-cell RNA-seq data analysis.

In conclusion, curation is a critical step in scRNA-seq data analysis and should not be overlooked. By organizing, cleaning, and standardizing the data, curation helps to ensure that the results of scRNA-seq data analysis are accurate and reliable. As the field of scRNA-seq continues to grow and evolve, the importance of curation will only increase, making it a key component of the data analysis pipeline.

One-Stop Solution for Your Curation Problems

Elucidata’s data-centric ML Ops platform, Polly, allows users to carry out integrative analysis on single-cell data. We have the world’s largest collection of ML-ready single-cell and bulk RNA-seq data. Polly hosts highly curated datasets that follow standard ontologies, with harmonized metadata, standardized and normalized data processed through consistent pipelines, and accurate expert-annotated cell types, to ensure reliable results and empower scientists in achieving their research goals.

Reach out to us or email us at info@elucidata.io to learn more!

