FAIR Data

Leveraging RNA-seq Data in Life Sciences R&D

Pooja Viswanathan
February 12, 2024

RNA-seq data is the gene expression data that forms the basis of transcriptomics. The increasing availability of single-cell and bulk RNA sequencing has propelled biomedical research into a new era. This blog delves into the nuances of these techniques and their pivotal role in advancing our understanding of cellular processes, disease mechanisms, and the discovery of new therapeutic targets. We explore the challenges biomedical researchers face in applying these datasets to their research areas, and shed light on how to resolve them to maximize the impact of RNA-seq data and accelerate research progress.

Understanding Bulk RNA-seq

Bulk RNA-seq is a principal technique in transcriptomic analysis. It provides a holistic view of gene expression within a given sample, averaged across all the cells it contains. From the extraction of RNA to the generation of expression profiles, researchers leverage bulk RNA-seq to uncover global patterns of gene expression across a range of biological conditions and samples. Applications of such data range from identifying differentially expressed genes in disease states to unraveling regulatory networks critical for cellular functions.

Understanding Single-cell RNA-seq

Single-cell RNA-seq (scRNA-seq) has revolutionized our ability to scrutinize gene expression at cellular resolution and has advanced the field of precision medicine. scRNA-seq empowers researchers to discern subtle differences between cells, unravel developmental trajectories, and identify rare cell types crucial for understanding complex biological systems. The technique reveals biomarkers and pathophysiological mechanisms in disease and contributes to the development of precise, individualized treatment.

Importance of RNA-seq Data in Life Sciences R&D

In life sciences R&D, RNA sequencing technology has been leveraged to great success. It enables the investigation of a wide variety of research questions aimed at understanding healthy and diseased biological systems. Here, we introduce three examples of advancements in different fields of research.

1. Single-cell Differential Expression Analysis

Single-cell RNA sequencing (scRNA-seq) offers unique access to specific cell types, which is especially impactful in immunology, where specific cell types respond in specific ways to immunological triggers. These immune responses dictate downstream physiological effects in both health and disease. B cells are immune cells with highly specific biology and receptors that are specific to each individual. Their differentiation brings about great genetic diversity, and studying their migration, differentiation, and evolution over time lends insight into immune responses. B cell receptors (BCRs) can be sequenced using scRNA-seq and studied using phylogenetic trees that represent their evolution with each mutation (Hoehn and Kleinstein, 2024).

2. Gene Expression Profiling

scRNA-seq measures gene expression at the single-cell level to reveal the heterogeneity of gene expression in individual cells or homologous cell types. This is achieved by tagging individual cells with a barcode, so that each sequenced RNA fragment can be traced back to the cell it came from. scRNA-seq provides valuable information on the characteristics of single cells and their gene expression profiles in healthy and diseased organs. In hepatitis B virus infections, scRNA-seq analysis of liver tissue revealed that increases in regulatory T cells (Treg) and exhausted T cells (Tex) are associated with the extent of liver damage (Zhang et al., 2023).
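
The barcoding idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any specific protocol or pipeline: the read pairs, the barcode strings, and the function name count_by_cell are made up for the example, and real workflows also correct sequencing errors in barcodes and collapse duplicate molecules.

```python
from collections import defaultdict

def count_by_cell(reads):
    """Group reads by cell barcode and count reads per gene in each cell.

    `reads` is an iterable of (cell_barcode, gene) pairs. Both values are
    hypothetical stand-ins for what an alignment step would report.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for barcode, gene in reads:
        counts[barcode][gene] += 1
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {barcode: dict(genes) for barcode, genes in counts.items()}

# Toy input: three reads from one cell, two from another.
example_reads = [
    ("AAAC", "CD19"),
    ("AAAC", "CD19"),
    ("AAAC", "MS4A1"),
    ("TTTG", "FOXP3"),
    ("TTTG", "CD19"),
]
cell_counts = count_by_cell(example_reads)
```

The result is a small cell-by-gene count table keyed by barcode, which is the shape of data that downstream single-cell analyses typically start from.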

3. Biomarker Discovery

RNA-seq can reveal genetic biomarkers of diseases. When a genetic component is suspected in a disease, genomic and transcriptomic studies help delineate the relationship between genetic variation and specific abnormal cellular mechanisms. RNA-seq analysis of small RNA unveiled differential expression of small nucleolar RNA (snoRNA) transcripts in schizophrenia, in turn revealing sex-based differences in the disease (Ragan et al., 2017). Alterations in a class of snoRNAs indicated sex-specific dysregulation in brain regions in schizophrenia. These were further associated with functional loss in synaptic connections (Smalheiser et al., 2014).

Challenges in Working with RNA-seq Data

While RNA-seq technologies have opened new frontiers in genomics, they come with their own set of challenges. This section explores the hurdles researchers face, including issues related to poor data quality, the difficulty of finding relevant datasets, challenges in data analysis and visualization, and the overall management of voluminous and complex RNA-seq data. Overcoming these challenges is imperative for maximizing the utility of transcriptomics data in life sciences R&D.

1. Lack of Integration and Standardization

RNA-seq data comes from massive public repositories such as the Gene Expression Omnibus (GEO) and from data produced in-house across different sectors. Pharmaceutical companies and research institutions generate their own bulk and single-cell RNA-seq data from specific experiments and laboratory samples. To fully exploit the potential of these data for research, they must be properly integrated. The heterogeneity in formats, acquisition methods, and experimental design poses specific challenges for integration.

2. Lack of Annotations and Sample Information

Further, data in public repositories often lack the metadata or annotations that would allow proper indexing and search. Finding appropriate data for research or meta-analysis therefore becomes difficult, and requires pre-processing datasets to ensure appropriate metadata tagging and quality checks on the labels.
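
As a rough illustration of what a metadata completeness check can look like, here is a minimal Python sketch. The required field names, the sample records, and both function names are assumptions chosen for the example; they do not describe the schema of any particular repository or platform.

```python
# Hypothetical list of fields every sample record is expected to carry.
REQUIRED_FIELDS = ("disease", "tissue", "organism")

def missing_fields(sample, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in one sample record."""
    return [f for f in required if not str(sample.get(f) or "").strip()]

def completeness_report(samples, required=REQUIRED_FIELDS):
    """Map each sample ID to its missing fields, skipping complete samples."""
    report = {}
    for sample in samples:
        gaps = missing_fields(sample, required)
        if gaps:
            report[sample["sample_id"]] = gaps
    return report

# Toy records: S1 is fully annotated, S2 has an empty and a missing field.
samples = [
    {"sample_id": "S1", "disease": "hepatitis B", "tissue": "liver",
     "organism": "Homo sapiens"},
    {"sample_id": "S2", "disease": "", "tissue": "liver"},
]
report = completeness_report(samples)
```

A report like this makes it easy to see which samples need curation before a dataset can be indexed and searched reliably.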

3. High Data Volume

Bulk RNA-seq produces data at massive volumes, which can strain the computational resources available for analysis and management. Handling such datasets securely and efficiently requires robust computational techniques. Without them, research timelines slow down and more resources are needed to keep up.

Addressing Challenges in Working with RNA-seq Data

Despite these challenges, innovative solutions are enhancing the potential of RNA-seq data. Advanced techniques with integrated machine learning algorithms play a crucial role in uncovering patterns and associations within complex datasets. These methods can process large volumes of data, identify relevant features, and target analyses with high accuracy.

Polly is a comprehensive data harmonization platform by Elucidata, designed to address the challenges associated with RNA-seq data.

This section sheds light on Polly's role in making RNA-seq data more accessible and usable. Polly standardizes RNA-seq data, enabling seamless integration from diverse sources. Polly accelerates data analysis, ensures data quality, enhances collaboration, and contributes to the reproducibility of results. 

At the core of Polly's capabilities is its harmonization engine that standardizes transcriptomics data from diverse public and in-house sources. Polly's harmonization engine tackles issues related to data variability by aligning datasets, ensuring uniformity in format, and incorporating standardized metadata. This harmonization process significantly reduces the time and effort required for data cleaning and enables researchers to focus on the analysis and interpretation of results.

1. Data Standardization and Harmonization

Polly is a leader in data standardization and harmonization. Its harmonization engine can batch-process data across a wide variety of formats and unify them. It fills in missing metadata fields and data labels to ensure metadata completeness. Researchers can specify data sources ranging from public repositories to in-house proprietary data, and Polly standardizes and harmonizes data across these sources. Polly implements about 50 quality checks to ensure the highest data quality.

2. Data Processing 

Polly allows precise curation with flexible bioinformatics pipelines such as STAR and Kallisto, as well as proprietary pipelines of choice, to achieve consistent data quality. Researchers can customize the quality check mechanisms, cut-offs, and log-fold thresholds used in the processing pipelines. Polly also allows curation of metadata, cohorts, or comparisons within cohorts to streamline the search for biologically relevant signatures.
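
To make the idea of cut-offs and log-fold thresholds concrete, here is a minimal Python sketch of a filter on differential expression results. It is an illustration only and not Polly's implementation: the function names, the result fields, the gene names, the pseudocount, and the default thresholds (absolute log2 fold change of at least 1 and adjusted p-value of at most 0.05) are assumptions chosen for the example.

```python
import math

def log2_fold_change(mean_case, mean_control, pseudocount=1.0):
    """Log2 ratio of mean expression, with a pseudocount to avoid division by zero."""
    return math.log2((mean_case + pseudocount) / (mean_control + pseudocount))

def filter_genes(results, min_abs_lfc=1.0, max_padj=0.05):
    """Keep genes that pass both the fold-change and the adjusted p-value cut-off."""
    return [
        r["gene"]
        for r in results
        if abs(r["log2fc"]) >= min_abs_lfc and r["padj"] <= max_padj
    ]

# Toy results table with placeholder gene names.
results = [
    {"gene": "GENE_A", "log2fc": 2.3, "padj": 0.001},
    {"gene": "GENE_B", "log2fc": 0.4, "padj": 0.001},   # fold change too small
    {"gene": "GENE_C", "log2fc": -1.8, "padj": 0.20},   # not significant
    {"gene": "GENE_D", "log2fc": -1.5, "padj": 0.01},
]
selected = filter_genes(results)
```

Exposing the thresholds as parameters is what lets each team tighten or relax the filter for its own question without touching the rest of the pipeline.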

3. Data Management

Polly seamlessly integrates data from different sources into existing in-house infrastructure that can hold large volumes of data. The data can be analyzed and visualized on a central Atlas on Polly or a proprietary platform of choice.

4. Enhanced Collaboration and Reproducibility

Thanks to the data normalization methods and quality checks that Polly implements, collaboration becomes much simpler. Technical variations and artifacts are removed consistently, producing datasets that can be analyzed confidently to give consistent results. This multimodality and smooth data integration enhance collaboration across different departments. Standardized data ensures the reproducibility of results, a critical aspect of scientific research. Polly's harmonization engine contributes to the robustness and reliability of transcriptomics data analysis. By securing and maintaining data standards at every step, Polly supports reproducible results that researchers can rely on.

Ensuring Data Quality for RNA-seq Data on Polly

Data quality is the cornerstone of effective data-centric discovery. All datasets on Polly undergo about 50 rigorous QA/QC checks for metadata completeness, metadata accuracy, schema compliance, technical artifacts, and more, to ensure the highest-quality data. These are called 'Polly Verified' datasets and are delivered in a transparent manner, accompanied by a detailed verification report on the checks conducted. All datasets are available with raw counts and associated metadata, thoroughly checked for completeness and quality metrics.

Bulk and single-cell data are harmonized with a configurable, transparent, and granular curation process. Polly consistently processes data in custom pipelines of choice and prepares them for various downstream use cases, including meta-analysis, rare transcript discovery, and integrative multi-omics analysis. Bulk RNA-seq data can be further processed according to ontologies at the dataset and sample level, such as disease, tissue, organism, cell line, cell type, and drug. scRNA-seq data can be further processed according to specific cell types or genes, without concerns about data doubling or incompleteness.

Case Studies and Success Stories

Our first case study illustrates the transformative role Polly has played in scRNA-seq research. The focus is on Polly's ability to accelerate RNA-seq data analysis while substantially reducing costs.

  1. A Boston-based pharmaceutical company approached Elucidata to use publicly available scRNA-seq data for identifying and validating gene targets for inflammatory disease in a specific cell type. They wanted to fast-track the target identification and validation to move swiftly towards drug discovery.
  2. Their specific challenges were to find relevant datasets for inflammatory disease, harmonize the datasets to a consistent, analysis-ready format, and perform suitable expert analysis on scRNA-seq data.
  3. Elucidata found a three-fold solution to meet these challenges. 
    - First, we curated the data on Polly to form an Atlas with accurate and complete metadata, including disease labels to find relevant data for inflammatory disease and cell type of interest. 
    - Second, we provided expert data management and analysis. We validated the targets the company had identified by shortlisting differentially expressed genes, checking for overlaps and statistical significance. 
    - Third, we performed unbiased meta-analysis by selecting relevant data, merging them according to relevant criteria, and forming metadata classes. We then used classifier models to select features, prioritized surface markers and validated the targets by confirming their expression in other datasets. 
  4. Impact: Elucidata identified four new targets and validated five pre-identified targets with four-fold acceleration.
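
The overlap check mentioned in the steps above can be sketched as follows. This is a simplified Python illustration, not the analysis performed in the case study: the function name recurrent_genes, the placeholder gene names, and the rule of requiring a gene to appear in at least two datasets are assumptions made for the example.

```python
from collections import Counter

def recurrent_genes(gene_sets, min_datasets=2):
    """Return genes found in at least `min_datasets` of the given gene sets.

    Each element of `gene_sets` is the collection of differentially expressed
    genes from one dataset. The result is sorted by gene name.
    """
    # set(genes) makes sure a gene is counted once per dataset.
    tally = Counter(gene for genes in gene_sets for gene in set(genes))
    return sorted(g for g, n in tally.items() if n >= min_datasets)

# Toy input: differentially expressed genes from three datasets.
dataset_hits = [
    {"GENE_A", "GENE_B", "GENE_C"},
    {"GENE_B", "GENE_C"},
    {"GENE_C", "GENE_D"},
]
shared = recurrent_genes(dataset_hits)
```

Genes that recur across independent datasets are stronger candidates than genes seen only once, which is why an overlap step is a common part of target validation.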

The second case study illustrates how Polly expedites the identification of therapeutic targets, underscoring its efficiency and efficacy in real-life scenarios.

  • A global AI Biotech company uses machine learning models to identify new treatments, de-risk and accelerate clinical trials. The company needed numerous bulk RNA-seq datasets across relevant disease areas to be processed consistently through a custom STAR pipeline within a short time-frame.
  • Their specific challenges were integrating bulk data, the high costs of processing data at scale, and maintaining data quality through all the steps of their analysis.
  • Elucidata rose to the challenges by building a custom STAR pipeline on Polly. The pipeline delivered high-quality bulk RNA-seq datasets at high throughputs within the constraints of a short timeline and limited budget.
  • Using this pipeline, 7,000 bulk RNA-seq samples could be processed every month. The datasets were harmonized with uniform annotations across 30 metadata fields of choice, and the pipeline was optimized for Adaptive Genome Index Selection based on sample FASTQ reads. Quality check results were provided to identify low-quality or failed samples at every step. We also optimized data storage and enabled parallel processing.
  • Elucidata delivered harmonized data for their ML-powered platform and saved more than $1.4 million in annual costs of data curation and processing.

Polly Paving the Way for Future Discoveries

In conclusion, bulk RNA-seq and single-cell RNA-seq technologies have ushered in a new era of transcriptomic exploration, revolutionizing our understanding of gene expression in diverse biological contexts. However, the challenges of working with RNA-seq data necessitate innovative solutions. Polly provides these solutions, addressing the challenges of working with RNA-seq data and powering RNA-seq research forward. As we embrace the potential of Polly in making RNA-seq data more accessible, usable, and of the highest quality, we anticipate a future where the complexities of the transcriptome are deciphered with unparalleled precision and efficiency.

To revolutionize your RNA-seq and transcriptomics research with Polly, visit our pages for bulk RNA-seq datasets and single-cell RNA-seq datasets. Join the community of researchers who have embraced Polly and experience the power of unified, harmonized RNA-seq data analysis. Connect with us or reach out to us at info@elucidata.io to learn more.
