Analyzing Transcriptomics Data from GEO Datasets

Shrushti Joshi
July 26, 2023

The Gene Expression Omnibus (GEO) database is a crucial resource for transcriptomic research. It stores a vast amount of publicly available gene expression data, including microarray, RNA sequencing, and other high-throughput sequencing data. Researchers worldwide can upload and access data, enabling the exploration of gene expression patterns, molecular mechanisms, and disease associations. GEO promotes collaboration, data reuse, and scientific discovery in the field of genomics.

GEO accepts data submissions from researchers worldwide, making it a globally collaborative resource. Researchers can upload their gene expression data to GEO, ensuring that valuable data generated from various experiments and studies are publicly accessible.

Data Retrieval and Preprocessing

Accessing and selecting data on the GEO database is straightforward. Researchers can easily navigate the GEO website and utilize its search tools to discover specific datasets of interest.

Here's a step-by-step guide to accessing and selecting data on GEO:

  1. Visit the GEO website: Go to the GEO database website.
  2. Find the dataset of interest: Use the search options available on the GEO website to find the dataset that needs analysis. Search by keywords, organism, platform, or other criteria to narrow down the results.
  3. Access the dataset: Once identified, click on its accession number or title to access its details.
  4. Navigate to GEO2R: Locate the "Analysis Tools" section on the dataset page and click the "GEO2R" link. This will open the GEO2R analysis page.
  5. Configure the analysis parameters:
    Select groups
    : GEO2R allows the researcher to compare two or more groups of samples within the dataset. Choose the groups that are to be compared from the available options.
    Choose normalization
    : Select the appropriate normalization method, such as "None," "Quantile," or "TMM."
    Adjust for multiple testing
    : Decide whether to apply multiple-testing correction methods such as "Benjamini & Hochberg (FDR)" or "Bonferroni."
  6. Run the analysis: Once the parameters have been configured, click the "Submit" button to start the analysis. GEO2R will perform differential expression analysis based on the selected groups and parameters.
  7. Explore the results:
    Result table
    : GEO2R will generate a table listing the genes ordered by statistical significance, typically represented by p-values. The table will include gene symbols, fold change, adjusted p-values, and more.
    Graphical plots
    : GEO2R also provides graphical plots, including scatterplots, volcano plots, and heatmaps, to help visualize the differentially expressed genes and assess data quality.
  8. Interpret the results: Analyze the results table and plots to identify genes significantly differentially expressed across the experimental conditions. Pay attention to statistical significance, fold change values, and other relevant information.
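
The multiple-testing correction and result filtering described in the steps above can be illustrated with a minimal Python sketch. It is not GEO2R's own code: the function names, the example thresholds, and the `(gene, log2 fold change, p-value)` row layout are assumptions made for illustration. It shows how Benjamini & Hochberg adjusted p-values are derived from raw p-values and how a result table might then be filtered.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_genes(rows, alpha=0.05, min_abs_log_fc=1.0):
    """Filter rows of (gene, log2 fold change, raw p-value).

    The thresholds are illustrative defaults, not values prescribed by GEO2R.
    """
    adjusted = benjamini_hochberg([p for _, _, p in rows])
    return [
        (gene, log_fc, adj_p)
        for (gene, log_fc, _), adj_p in zip(rows, adjusted)
        if adj_p <= alpha and abs(log_fc) >= min_abs_log_fc
    ]


# Hypothetical result rows, for illustration only.
example_rows = [
    ("GENE_A", 2.1, 0.01),
    ("GENE_B", -0.3, 0.04),
    ("GENE_C", -1.8, 0.03),
    ("GENE_D", 1.2, 0.005),
]
print(significant_genes(example_rows))
```

Filtering on both the adjusted p-value and the fold change mirrors the advice in step 8 to weigh statistical significance together with effect size.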

GEO2R operates independently of curated datasets and works directly on Series Matrix data files. This means the tool can access and analyze nearly any GEO Series, irrespective of data type or quality. Users should therefore be mindful of the limitations and considerations associated with GEO2R.

Transcriptome Assembly and Quantification

The GEO database can be used to retrieve raw RNA-seq data, perform transcriptome assembly and quantification, and gain insights into the gene expression profiles of selected datasets. Transcriptome assembly and quantification using the GEO database involves several steps.

  1. Identify and access the dataset: Explore the GEO database and search for RNA-seq datasets containing the required transcriptome data. Click on its accession number or title to access the dataset's details page.
  2. Retrieve the raw data: Look for the "Supplementary Files" or "Data Files" section on the dataset's details page. Download the raw sequencing data files, typically available in FASTQ or SRA format. These files contain the raw RNA-seq reads generated from the experiment. It is crucial to process the raw data before assembly and quantification. This includes quality control, adapter trimming, and read filtering. Tools like FastQC, Trimmomatic, or Cutadapt can be utilized to perform these preprocessing steps.
  3. Transcriptome assembly: Use a transcriptome assembly tool to construct the transcriptome from the preprocessed RNA-seq reads. Popular tools for transcriptome assembly include Trinity, StringTie, and Cufflinks. Consult the documentation of the chosen tool for the specific commands and parameters required.
  4. Transcript quantification: Once the transcriptome is assembled, the expression levels of the transcripts are quantified. Salmon, Kallisto, and RSEM are commonly used for transcript quantification. Salmon and Kallisto estimate transcript abundances with alignment-free or pseudo-alignment methods, whereas RSEM works from read alignments.
  5. Obtain gene-level expression: Depending on the analysis, gene-level expression values can be obtained from the transcript-level quantification results. For example, tximport can aggregate transcript-level estimates into gene-level counts, while Sleuth and DESeq2 can perform differential expression analysis on the results.
  6. Perform downstream analysis: This involves differential gene expression analysis, functional enrichment analysis, pathway analysis, or other analyses relevant to your research objectives. Various bioinformatics tools and software packages, such as DESeq2, edgeR, or clusterProfiler, are available for these analyses.
  7. Validate results and interpret findings: It is crucial to validate and interpret the findings in the context of research questions. If necessary, compare results with existing literature, perform additional experimental validation, and explore the biological implications of the differentially expressed genes or functional annotations obtained.
  8. Perform differential gene expression analysis: Use statistical tools such as limma, DESeq2, or edgeR to identify differentially expressed genes between experimental conditions or sample groups. These tools utilize appropriate statistical models to determine significant differences in gene expression levels.
  9. Interpret the results: Analyze the differential gene expression analysis results, which typically include lists of differentially expressed genes with corresponding statistical measures. Pay attention to fold changes, adjusted p-values, and other relevant statistics to identify genes of interest.
  10. Functional annotation: Annotate the differentially expressed genes with functional information to gain insights into their biological roles. Tools like DAVID, Enrichr, or clusterProfiler can perform functional enrichment analysis by identifying overrepresented gene ontology (GO) terms, pathways, or functional categories.
  11. Pathway analysis: Utilize pathway analysis tools such as Gene Set Enrichment Analysis (GSEA), Reactome, or the KEGG pathway database to investigate the involvement of differentially expressed genes in specific biological pathways. These tools can help uncover critical pathways or biological processes affected by experimental conditions.
  12. Validate and visualize results: If possible, validate findings through literature review and experimental validation. Additionally, generate visualizations such as volcano plots, heatmaps, or pathway diagrams to facilitate the interpretation and communication of results.
  13. Conclude and generate insights: Based on the results of differential gene expression analysis, functional annotation, and pathway analysis, conclude the biological significance of the identified genes and pathways. Relate findings to the research question or hypothesis that drove the analysis.
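
The over-representation idea behind the functional annotation step can be sketched with a hypergeometric test. This is a minimal illustration, not the implementation used by DAVID, Enrichr, or clusterProfiler. The function names and the toy gene identifiers are hypothetical, and a real analysis would also correct the resulting p-values for multiple testing across gene sets.

```python
from math import comb


def enrichment_p_value(universe_size, set_size, n_de, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the number of gene-set members found among n_de genes drawn
    without replacement from a universe of universe_size genes.
    """
    max_overlap = min(set_size, n_de)
    total = comb(universe_size, n_de)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, n_de - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def over_representation(de_genes, gene_set, universe):
    """Test whether gene_set is over-represented among de_genes."""
    universe = set(universe)
    # Genes outside the background universe are ignored.
    de = set(de_genes) & universe
    members = set(gene_set) & universe
    return enrichment_p_value(len(universe), len(members), len(de), len(de & members))


# Toy example with hypothetical gene identifiers.
background = [f"g{i}" for i in range(10)]
print(over_representation(["g0", "g1", "g2"], ["g0", "g1", "g2"], background))
```

The choice of background universe strongly affects the p-value, so it should reflect the genes that could actually have been detected in the experiment.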

Transcriptome analysis using the GEO database comes with several challenges researchers may encounter.

Challenges Faced While Using GEO

  Data Retrieval and Preprocessing   Transcriptome Assembly and Quantification
  ---------------------------------- ----------------------------------------------------
  10-Minute Timeout                  Data Quality
  Publication Discrepancy            Sample Size and Replicates
  Missing Samples                    Data Heterogeneity
  Sample Comparability               Incomplete Metadata
  Data Type Restriction              Batch Effects
  Contrast Selection                 Data Integration
  Within-Series Restriction          Lack of Computational Resources and Analysis Tools
  Failed Jobs

Polly for Analyzing Transcriptomics Data

Polly is an advanced AI-powered assistant for researchers, scientists, and data analysts. With its deep understanding of scientific concepts, natural language processing capabilities, and access to vast amounts of knowledge, Polly is the go-to companion for tackling complex research tasks and accelerating scientific discovery. Here's how Polly can help:

1. Data Retrieval:

Polly can quickly search the GEO database and retrieve the relevant transcriptome datasets based on specified criteria, saving the time and effort of manually browsing the database.

2. Transcriptome Assembly and Quantification:

Polly can guide the selection of appropriate tools and parameters for transcriptome assembly and quantification based on the dataset characteristics and research goals. It employs advanced curation models that automatically extract and annotate relevant information from the raw data, such as sample characteristics, experimental conditions, treatment groups, or any other pertinent details. These curated metadata fields are generated using machine learning algorithms and data processing techniques, which helps keep them accurate and consistent across samples.

[Figure: Data overview of a GEO dataset in Polly]

[Figure: Metadata table on Polly]

In addition to the curated metadata, the source metadata fields are also included. Source metadata refers to the information provided by the original data contributors or researchers who generated the dataset. This metadata may include sample identifiers, experimental protocols, sample descriptions, or any other information relevant to the dataset.

Users can navigate to the "details" page of a specific dataset ID within the Omixatlas interface to access the sample-level metadata. On this page, all the metadata fields associated with each sample in the dataset will be visible.

Polly gives various options for curation, QC and report generation.

3. Differential Gene Expression Analysis:

Polly can assist in performing differential gene expression analysis by recommending suitable statistical analysis methods and guiding users through the necessary steps. It can help interpret the analysis results, including fold changes, p-values, and adjusted statistics, which makes it easier to identify significantly differentially expressed genes. To generate transcript-level expression counts, Polly utilizes Kallisto, a popular tool for RNA-Seq quantification. Kallisto pseudoaligns the high-quality reads to a transcriptome index using the "kallisto quant" command. This process assigns each read to the transcripts it is compatible with.

After mapping, the counts are aggregated at the gene level by summing up the transcript-level counts associated with each gene. This step ensures that the final expression counts represent the overall expression of genes rather than individual transcripts.
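
The gene-level aggregation described above amounts to a grouped sum. The sketch below illustrates the idea and is not Polly's actual code: the function name, the dictionary-based inputs, and the identifiers are hypothetical. Skipping transcripts that lack a gene mapping is also an assumption made for this example.

```python
from collections import defaultdict


def gene_level_counts(transcript_counts, tx2gene):
    """Sum transcript-level counts into gene-level counts.

    transcript_counts: dict of transcript ID -> estimated count
    tx2gene: dict of transcript ID -> gene ID
    """
    gene_counts = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        gene_id = tx2gene.get(transcript_id)
        if gene_id is None:
            # Assumption: transcripts without a known gene are skipped.
            continue
        gene_counts[gene_id] += count
    return dict(gene_counts)


# Hypothetical identifiers, for illustration only.
counts = {"tx1": 10.0, "tx2": 5.5, "tx3": 2.0, "tx4": 7.0}
mapping = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
print(gene_level_counts(counts, mapping))
```

In practice the transcript-to-gene mapping would come from the same annotation release used to build the quantification index, so that identifiers match.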

By leveraging Kallisto's capabilities, Polly accurately quantifies gene expression levels based on transcript-level counts, providing researchers with valuable information about gene expression patterns in their RNA-seq datasets.

4. Functional Annotation and Pathway Analysis:

Polly can provide access to a wide range of functional annotation and pathway analysis tools. It can assist in interpreting enriched gene ontology terms, pathways, and functional categories associated with the differentially expressed genes. At the feature level, Polly maps Ensembl gene IDs to their corresponding HGNC symbol, MGI symbol, or RGD symbol. This mapping converts gene identifiers to more recognizable and standardized symbols, making the results easier to interpret.

To ensure accurate quantification, duplicate genes are handled by dropping counts using the Mean Average Deviation (MAD) Score. This process helps eliminate redundancy and ensures a unique count value represents each gene.
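
One way to read the duplicate-handling step is sketched below. The text does not specify the exact rule, so this example rests on two loudly stated assumptions: that "MAD" means the mean absolute deviation of a gene's counts across samples, and that for each duplicated gene symbol the row with the highest score is kept while the others are dropped. The function names and data are hypothetical, and Polly's actual procedure may differ.

```python
def mean_absolute_deviation(values):
    """Mean absolute deviation of a list of numbers around their mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


def drop_duplicate_genes(rows):
    """Keep one row per gene symbol.

    rows: list of (gene_symbol, [count per sample]).
    Assumption: among duplicates, the row with the highest MAD is kept.
    """
    best = {}
    for symbol, counts in rows:
        score = mean_absolute_deviation(counts)
        if symbol not in best or score > best[symbol][0]:
            best[symbol] = (score, counts)
    return {symbol: counts for symbol, (_, counts) in best.items()}


# Hypothetical count rows, for illustration only.
rows = [
    ("G1", [1, 1, 1]),
    ("G1", [0, 5, 10]),
    ("G2", [3, 4, 5]),
]
print(drop_duplicate_genes(rows))
```

Keeping the most variable row is a common heuristic because rows with little variation across samples carry less information for downstream differential analysis.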

Furthermore, Polly enhances the analysis by annotating each sample with relevant metadata.

5. Data Visualization:

Polly can generate visualizations such as volcano plots, heatmaps, and pathway diagrams to help visualize and communicate the results effectively. It can also provide interactive visualizations that allow intuitive exploration of the data.

[Figure: Metadata chart for data visualization on Polly]

6. Data Integration:

Polly can help with data integration by recommending appropriate methods to handle batch effects, normalize data from different studies or platforms, and integrate transcriptome data for more comprehensive analyses.

7. Computational Resources:

Polly can leverage its computational power to handle large datasets and perform computationally intensive analyses. Researchers can offload the computational burden to Polly and focus more on the analysis and interpretation of the results.

By utilizing Polly's capabilities, researchers can streamline the entire transcriptome analysis process, from data retrieval to downstream analysis and interpretation. Its assistance can save time, provide expert guidance, and simplify the complex tasks involved in transcriptome analysis, ultimately enhancing the efficiency and accuracy of research.

Polly aims to empower researchers by augmenting their capabilities, accelerating the pace of discovery, and facilitating breakthroughs in various scientific fields.

Book a demo to learn more!
