Preprocessing of Bulk RNA-seq Datasets from GEO for Accurate Analysis

August 30, 2023

RNA sequencing (RNA-seq) is the preferred technique for transcriptomic investigation of tissue slices, biopsies, or pooled cell populations because it enables highly accurate quantification of gene expression. Bulk RNA-seq analysis helps explain changes in gene expression between samples by measuring the average expression level of each gene across a large number of input cells, from hundreds to millions. It also gives researchers visibility into previously undetected changes occurring in disease states, in response to therapeutics, under different environmental conditions, and across a broad range of other study designs.

This blog delves into the detailed process of preprocessing and curating bulk RNA-seq data extracted from GEO, aiming to ensure precise and efficient downstream analysis.

What Are the Challenges Associated with GEO Datasets?

Many public repositories host gene expression, RNA-seq, and other high-throughput functional genomics data submitted by the research community. The Gene Expression Omnibus (GEO) is the most widely used repository for finding RNA-seq data because of the sheer volume of data it holds. However, finding and using the relevant data for downstream analyses poses several challenges.

Findability and usability are the two key challenges associated with GEO datasets. They stem from two underlying issues:
  1. Lack of Annotation: GEO data is poorly annotated, making it difficult to interpret and compare results across different studies. The non-uniform file formats and data structures make it time-consuming to extract and analyze the data.
  2. Data Inconsistency and Lack of Data Quality: Data on GEO is sourced from research labs across the globe. The researchers use different methods and platforms, which can result in variable data quality and inconsistencies across datasets.

FAIRifying, curating, and pre-processing data to a standard format help mitigate these challenges.

What Is Data Pre-processing and Why Is It Important?

The accuracy of alignment and quantification methods for bulk RNA-seq data processing can impact downstream analysis, such as differential expression analysis, functional annotation, and pathway analysis.

Alignment refers to mapping the sequence reads to a reference genome or transcriptome. In contrast, quantification refers to estimating the abundance of transcripts or genes based on the aligned reads. Inaccurate alignment or quantification can lead to false positives or false negatives in downstream analyses, resulting in incorrect conclusions.
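
To make the idea of quantification concrete, here is a minimal sketch of one common abundance measure, transcripts per million (TPM), computed from read counts and transcript lengths. The function name and the example numbers are illustrative assumptions. Real quantifiers typically use effective lengths and resolve ambiguously mapped reads, which this sketch does not attempt.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to transcripts per million (TPM).

    counts: reads assigned to each transcript (illustrative input).
    lengths_bp: transcript lengths in base pairs, in the same order.
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")

    # Step 1: length-normalize, giving reads per kilobase of transcript.
    rates = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]

    # Step 2: rescale so the values in one sample sum to one million.
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [rate / total * 1e6 for rate in rates]


# Made-up numbers: counts scale with length, so all three TPM values are equal.
example_counts = [100, 200, 300]
example_lengths = [1000, 2000, 3000]
tpm = counts_to_tpm(example_counts, example_lengths)
```

Because a longer transcript attracts more reads at the same abundance, skipping the length normalization would overstate the expression of long transcripts. That is one way a quantification error can surface downstream as a false positive.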

Therefore, it is crucial to use alignment and quantification methods that are both accurate and efficient to ensure reliable analysis of bulk RNA sequencing data.

Pre-processing Methods of Bulk RNA-seq Data

Pre-processing bulk RNA-seq data is the first major step toward reliable data analysis. A few of the tools available for this are Kallisto, STAR, Salmon, and BWA. These tools employ different algorithms to align and quantify RNA-seq reads.

Check out this blog to learn more about the popular tools.

Each tool has its own advantages and limitations, depending on the experimental design and the quality of the RNA-seq data being analyzed.

We at Elucidata have optimized a fast and efficient method to preprocess bulk RNA-seq data sourced from GEO using the Kallisto pipeline. Kallisto is a lightweight and fast tool for RNA-seq analysis that uses a pseudo-alignment algorithm to determine the abundance of transcripts in a sample. It is more memory-efficient than STAR, another popular tool.
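
The intuition behind pseudo-alignment can be shown with a toy sketch: instead of locating where a read maps base by base, it asks which transcripts are compatible with the read's k-mers. The code below is a simplified illustration and not Kallisto's actual implementation (Kallisto works on a de Bruijn graph of the transcriptome and follows this step with statistical abundance estimation). The transcript names, sequences, and function names are made up for the example.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts compatible with all indexed k-mers of a read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            # Toy choice: ignore k-mers that occur in no transcript.
            continue
        # Intersect the compatibility sets across the read's k-mers.
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Hypothetical transcripts that share a prefix and differ at the 3' end.
transcripts = {
    "tx_A": "ACGTACGTTAGC",
    "tx_B": "ACGTACGAAGGC",
}
index = build_kmer_index(transcripts, k=5)

shared = pseudoalign("ACGTACG", index, k=5)   # read from the shared prefix
unique = pseudoalign("CGTTAGC", index, k=5)   # read from the tx_A-specific end
```

Here `shared` contains both transcripts, while `unique` contains only `tx_A`. Skipping exact base-level alignment in this way is a large part of why pseudo-alignment tools are fast and light on memory.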

Read how Kallisto compares to STAR here.  

The data from GEO is ingested on Polly, the biomedical data platform of Elucidata, using automated pipelines.

Bulk RNA-Seq Data on Polly

Polly's Bulk RNA-Seq OmixAtlas comprises ~40,000 curated bulk RNA-Seq studies collected from GEO. The OmixAtlas covers a wide spectrum of diseases (cancer, metabolic, and autoimmune conditions), ~1774 drugs, and ~877 tissues. All datasets are available with a raw counts data matrix and associated metadata. The datasets are provided in the .gct (Gene Cluster Text) file format, which stores sample metadata and expression data in a single file.
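
To show how one file can carry both metadata and expression values, here is a minimal sketch that writes a small counts matrix in the GCT v1.3 layout as it is commonly documented: a version line, a dimensions line, a header row, the sample metadata rows, and then one row per gene. The gene IDs, sample IDs, and metadata fields are placeholders, and the exact fields in Polly's files may differ.

```python
def to_gct(gene_ids, sample_ids, counts, sample_metadata):
    """Serialize counts plus per-sample metadata as GCT v1.3 text.

    counts: one list of values per gene, ordered like sample_ids.
    sample_metadata: dict of field name -> list of values, one per sample.
    This sketch writes no per-gene (row) metadata columns.
    """
    lines = ["#1.3"]
    # Dimensions: data rows, data columns, row metadata fields, column metadata fields.
    dims = (len(gene_ids), len(sample_ids), 0, len(sample_metadata))
    lines.append("\t".join(str(n) for n in dims))
    # Header row: the literal "id" followed by the sample IDs.
    lines.append("\t".join(["id"] + list(sample_ids)))
    # One row per sample metadata field.
    for field, values in sample_metadata.items():
        lines.append("\t".join([field] + [str(v) for v in values]))
    # One row per gene with its raw counts.
    for gene, row in zip(gene_ids, counts):
        lines.append("\t".join([gene] + [str(v) for v in row]))
    return "\n".join(lines) + "\n"


# Placeholder data for illustration only.
gct_text = to_gct(
    gene_ids=["GENE_A", "GENE_B"],
    sample_ids=["sample_1", "sample_2"],
    counts=[[10, 0], [250, 310]],
    sample_metadata={"tissue": ["liver", "lung"], "disease": ["normal", "normal"]},
)
```

Keeping the metadata rows next to the counts means a dataset can be filtered by tissue or disease without joining a separate annotation file.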

Figure: Steps in the processing pipeline

Curation of Bulk RNA-Seq Data on Polly

Public sources like GEO freely distribute biological data submitted by the research community. However, this data lacks standardization and a curation procedure, which makes findability and reusability a challenge.

Polly's NLP-based curation models are used to curate all datasets and their corresponding samples, which are then harmonized using specific biomedical ontologies. This process ensures that the metadata associated with the data is complete, accurate, and labeled to facilitate efficient retrieval and reuse. The labels cover commonly used search criteria (Organism, Disease, Tissue, Cell Line, Cell Type, and Drug) to improve findability.
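
As a rough illustration of what harmonization achieves, the sketch below maps free-text sample annotations onto a single canonical label per field. It uses a plain dictionary lookup as a stand-in, which is not how Polly's NLP-based models work, and the synonym table and function names are made-up assumptions rather than entries from a real ontology.

```python
# Hypothetical synonym table: field -> {normalized raw text: canonical label}.
SYNONYMS = {
    "organism": {
        "human": "Homo sapiens",
        "homo sapiens": "Homo sapiens",
        "mouse": "Mus musculus",
        "mus musculus": "Mus musculus",
    },
    "tissue": {
        "liver": "liver",
        "hepatic tissue": "liver",
        "lung": "lung",
        "pulmonary tissue": "lung",
    },
}


def harmonize(field, raw_value):
    """Return the canonical label for a raw value, or None if it is unknown."""
    # Normalize case and collapse repeated whitespace before the lookup.
    key = " ".join(raw_value.strip().lower().split())
    return SYNONYMS.get(field, {}).get(key)


def harmonize_sample(sample):
    """Harmonize every field of one sample's metadata."""
    return {field: harmonize(field, value) for field, value in sample.items()}


raw_sample = {"organism": "  Human ", "tissue": "Hepatic  tissue"}
curated = harmonize_sample(raw_sample)
```

Once two labs' "Human" and "homo sapiens" resolve to the same label, a single search returns both datasets. Values that the table does not recognize come back as `None`, so they can be flagged for review instead of being guessed.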


The quantity of both public and private multi-omics datasets has grown at an unprecedented rate over the past two decades. Efficient and effective data mining and machine learning techniques are needed to extract usable information and insights from them.

Elucidata's biomedical data platform, Polly, hosts the world's most extensive collection of ML-ready bulk RNA-seq data that are highly curated using our curation infrastructure PollyBERT and processed consistently using Kallisto. Our curation pipelines, high-quality, accurately annotated data, standard workflows, and scientific expertise are used by industries and academia across the globe to accelerate their drug discovery process.

Reach out to us to learn more about how to accelerate your research!
