Importance of Gene Expression Omnibus (GEO) Data in Biomedical Research

Pooja Viswanathan
January 25, 2024

The Gene Expression Omnibus (GEO) is a comprehensive repository of gene expression datasets that offers researchers a broad view of molecular processes. Because these molecular processes are the basis of human biology in health and disease, this data is highly relevant to biomedical research. GEO datasets are a reservoir of information that fuels research endeavors globally. A data-driven revolution is at the doorstep of biomedical research, and GEO data is key to it.

What is GEO (Gene Expression Omnibus)?

GEO (Gene Expression Omnibus) was established by the National Center for Biotechnology Information (NCBI), a component of the National Library of Medicine at the National Institutes of Health (NIH). It serves as a public repository committed to archiving and freely distributing molecular data.

GEO was conceived in response to the growing field of genomics and the need for a centralized repository to house the burgeoning volume of gene expression data. While GEO datasets originated as a collection of gene expression data collected through microarray experiments, they soon expanded to include data on genome copy number variation, genome-wide profiling of DNA-binding proteins and sequence-based functional genomics. These encompass data acquired through various high-throughput experiments, from microarrays to methylation and chromatin accessibility studies. Each dataset is a curated collection of biologically and statistically comparable samples. Besides functioning as a dynamic repository, it further offers a comprehensive toolkit for useful analyses.

Functions and Features of GEO (Gene Expression Omnibus)

Today the repository holds more than 200,000 Series, more than 25,000 Platforms, and nearly 7 million Samples. These three record types are the core objects behind the datasets. Series cover expression profiling by several methods, including array, genome tiling array, high-throughput sequencing, and single nucleotide polymorphism (SNP) array. They also include methylation profiles, genome binding/occupancy profiles, and non-coding RNA and protein profiles.
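As a small illustration of the three core record types, the sketch below classifies an accession by its prefix. It relies on GEO's accession convention (GSE for Series, GPL for Platforms, GSM for Samples); the function name classify_accession is hypothetical and not part of any GEO tool.

```python
import re

# Accession prefixes GEO uses for its three core record types.
RECORD_TYPES = {"GSE": "Series", "GPL": "Platform", "GSM": "Sample"}

# A valid accession is one of the prefixes followed by digits only.
_ACCESSION = re.compile(r"^(GSE|GPL|GSM)(\d+)$")


def classify_accession(accession):
    """Return the GEO record type for an accession, or None if it does not match.

    Hypothetical helper for illustration; input is trimmed and upper-cased.
    """
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        return None
    return RECORD_TYPES[match.group(1)]
```

For example, classify_accession("GSE12345") returns "Series", while an unrecognized string returns None.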

The data profiles are just as diverse and come from techniques such as in situ oligonucleotide, RT-PCR, spotted DNA and cDNA. These datasets span more than 1000 species, dominated by humans but also including less commonly studied organisms such as rice species, Toxoplasma gondii and Aedes aegypti.

GEO provides a user-friendly interface for searching, viewing, and downloading datasets. Its functionalities include tools for data visualization, analysis, and comparison, empowering researchers to derive meaningful insights from the wealth of information available. GEO facilitates comparative analyses, allowing researchers to juxtapose gene expression profiles across experiments and platforms. Such comparative analysis highlights useful trends in the data. This function is supported with freely available data downloads promoting open science and collaboration.

Impact of GEO Data in Biomedical Research

Illustrating the practical impact of GEO data, various case studies showcase its application in elucidating disease mechanisms, identifying biomarkers, and uncovering potential therapeutic targets. Researchers have used GEO datasets to identify distinct molecular subtypes of cancers towards precise diagnosis and treatment. By comparing healthy and diseased tissues at the molecular level, GEO enables the identification of potential diagnostic biomarkers and therapeutic targets. GEO datasets have motivated the discovery of biomarkers in neurodegenerative disorders as well. 

Studies investigating diseases such as Alzheimer’s, Parkinson’s, and Huntington’s leverage GEO data to understand the underlying molecular mechanisms. Researchers analyze gene expression patterns in different regions of the brain, identifying key genes and pathways associated with disease progression and severity. This opens avenues for understanding the disease and developing targeted interventions and potential therapies. 

In cardiovascular research, GEO datasets provide a window into the molecular events underlying heart diseases. By analyzing gene expression changes in diseased hearts, researchers identify potential biomarkers for early detection and gain insights into the mechanisms of cardiac disorders, such as heart failure and atherosclerosis.

Understanding molecular mechanisms has allowed greater insight into the pathophysiology of various infectious diseases such as influenza, COVID-19 and HIV/AIDS. GEO datasets have also been pivotal in advancing immunological research, where researchers study gene expression profiles in immune cells in response to infections, vaccinations, or autoimmune conditions. These datasets contribute to our understanding of immune responses, aiding vaccine development, immunotherapies, and treatments for autoimmune diseases.

In the realm of drug discovery and development, GEO datasets have been indispensable. Pharmaceutical researchers use these datasets to assess the effects of drugs on gene expression profiles, predict potential side effects, and identify molecular signatures associated with drug response or resistance. This accelerates drug development by providing crucial information about the biological effects of candidate compounds. Researchers use GEO datasets to identify better therapeutic targets specific to diseases and dysregulated pathways. Their versatility allows researchers to tailor their analyses to the unique demands of their respective fields.

Case Studies: Real-world Examples

As the number of GEO datasets has grown since the platform was established, so has their diversity. This diversity enables path-breaking research to find biomarkers in diseases that affect multiple systems and organs. For example, the role of FOXD2-AS1 in cancer was identified from the literature and from data in the GEO database. FOXD2-AS1 expression was correlated with poor overall survival and disease-free survival (Zhang, Liang et al, 2020). It was also associated with larger tumor size and with TNM stage (Tumor, Node, Metastasis, an overall staging system used for cancer reporting). Another success story fueled by GEO data is the discovery of the association between NRXN3 and Alzheimer disease (Zheng, Li et al, 2018). GEO data reporting gene expression in areas of the brain showed that, among many differentially expressed genes in Alzheimer disease, low expression of aging-related NRXN3 was particularly associated with the cognitive decline that occurs in the disease.

How to Access and Retrieve Data from GEO

GEO offers both simple and advanced search interfaces. Because GEO data are extensively indexed and described by many separate fields, a search for specific data can be refined by constructing fielded queries. Genes of interest can be looked up and examined using the gene-level browser, GEO Profiles. Sequence-based queries can be performed on the GEO BLAST interface. There is also a study-level browser for accessing datasets, the Dataset Browser. Data can be freely downloaded as single datasets or in bulk through the File Transfer Protocol (FTP) site. Raw data files are provided in native formats where possible, and data are available in multiple formats including tab-delimited tables, plain text and XML. Queries can also be run programmatically, and this route is well supported.
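The fielded queries and bulk downloads described above can be sketched in a few lines. This is a minimal sketch, not an official client: it assumes the NCBI E-utilities esearch endpoint with the GEO DataSets database name "gds", the [Organism] and [Entry Type] field qualifiers, and the Series directory layout on the GEO download site in which the last three digits of the accession are replaced by "nnn". The helper names are hypothetical, and the URL patterns should be checked against the current GEO documentation before use. The code only builds strings and makes no network calls.

```python
from urllib.parse import urlencode

# Assumed endpoint for programmatic Entrez searches (NCBI E-utilities).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_fielded_query(keywords, organism=None, entry_type=None):
    """Combine free-text keywords with optional fielded restrictions."""
    parts = [keywords]
    if organism:
        parts.append(f'"{organism}"[Organism]')
    if entry_type:
        parts.append(f"{entry_type}[Entry Type]")
    return " AND ".join(parts)


def esearch_url(term, retmax=20):
    """Build an esearch URL against the GEO DataSets database ("gds")."""
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def series_download_directory(accession):
    """Return the assumed download directory for a Series accession.

    Assumption: GSE12345 lives under .../series/GSE12nnn/GSE12345/ and
    accessions with three or fewer digits live under .../series/GSEnnn/.
    """
    digits = accession[3:]
    if not accession.startswith("GSE") or not digits.isdigit():
        raise ValueError(f"not a Series accession: {accession!r}")
    stub = "GSE" + digits[:-3] + "nnn"
    return f"https://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{accession}/"
```

A query for human melanoma Series could then be built with build_fielded_query("melanoma", organism="Homo sapiens", entry_type="gse") and passed to esearch_url.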

Challenges with Accessing and Retrieving Data from GEO

While GEO datasets have already proven to be powerful in biomedical research, some crucial challenges remain. The richness of the datasets presents problems such as data heterogeneity and metadata inconsistencies. As our experimental technology evolves, keeping datasets consistent becomes even more relevant. 

1. Ensuring Data Quality in GEO Datasets

Ensuring the quality of data within GEO is a meticulous process involving a series of measures. Quality assurance extends beyond data processing: it begins with the planning and execution of experiments and continues through the harmonization and standardization of the datasets. These measures include the detection and correction of outliers, normalization techniques to mitigate batch effects, validation protocols, and the application of statistical methods to validate the reliability of results. The scientific community has set up continuous monitoring and feedback loops to help refine and improve data quality, but more work is needed.
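Two of the measures above, outlier detection and normalization, can be illustrated with a deliberately simple sketch. It is not GEO's or any specific pipeline's procedure. The assumptions are: expression values are non-negative, normalization is a log2 transform followed by per-sample median centering (which removes sample-level offsets but is not a full batch-effect correction), and outlier samples are flagged with a modified z-score based on the median absolute deviation, using the commonly cited 3.5 cutoff. All function names are hypothetical.

```python
import math
import statistics


def log2_median_center(samples):
    """Log2-transform each sample and subtract its own median.

    samples: dict mapping sample name -> list of non-negative values.
    """
    centered = {}
    for name, values in samples.items():
        logged = [math.log2(v + 1.0) for v in values]
        med = statistics.median(logged)
        centered[name] = [x - med for x in logged]
    return centered


def flag_outlier_samples(samples, threshold=3.5):
    """Flag samples whose median log2 expression is far from the rest.

    Uses the modified z-score 0.6745 * (x - median) / MAD.
    Returns a sorted list of flagged sample names.
    """
    medians = {
        name: statistics.median([math.log2(v + 1.0) for v in values])
        for name, values in samples.items()
    }
    center = statistics.median(medians.values())
    mad = statistics.median([abs(m - center) for m in medians.values()])
    if mad == 0:
        return []  # no spread between samples, nothing to flag
    return sorted(
        name
        for name, m in medians.items()
        if abs(0.6745 * (m - center) / mad) > threshold
    )
```

In practice, flagged samples would be inspected rather than dropped automatically, since a large shift can reflect real biology as well as a technical problem.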

2. Role of Metadata Standards in GEO Data

Metadata is the contextual information accompanying datasets. It provides critical details about experimental conditions, the characteristics of the sample, and the steps taken to process the raw data. Standardizing this information ensures consistency, comparability and interpretability across different datasets. GEO utilizes metadata standards such as Minimum Information About a Microarray Experiment (MIAME) and Minimum Information about a Sequencing Experiment (MINSEQE). These standards establish guidelines for the comprehensive reporting of metadata, so that researchers can fully understand and reproduce the experiments. However, the metadata connected to gene expression studies within GEO typically do not use controlled vocabularies to describe biological fields such as tissue type, cell type, gene/protein, cell line, drug/small-molecule, and disease. As a result, the bulk of GEO datasets still carry unstandardized metadata, use inconsistent protocols for data processing and are not machine-readable.
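To make the controlled-vocabulary problem concrete, the sketch below maps free-text tissue labels to a small set of canonical terms and reports the labels it cannot map. The synonym table is hypothetical and tiny; a real pipeline would map to identifiers from a community ontology and would need far broader coverage. The function names are also hypothetical.

```python
# Hypothetical synonym table: free-text label -> canonical tissue term.
TISSUE_SYNONYMS = {
    "liver": "liver",
    "hepatic tissue": "liver",
    "brain": "brain",
    "whole brain": "brain",
    "heart": "heart",
    "cardiac tissue": "heart",
}


def normalize_tissue(raw_label):
    """Return the canonical tissue term, or None if the label is unknown."""
    # Lower-case, turn underscores into spaces, and collapse whitespace.
    key = " ".join(raw_label.lower().replace("_", " ").split())
    return TISSUE_SYNONYMS.get(key)


def harmonize_samples(samples):
    """Annotate sample records with a canonical tissue term.

    samples: list of dicts with a free-text "tissue" field.
    Returns (harmonized_records, unmapped_labels).
    """
    harmonized, unmapped = [], []
    for record in samples:
        term = normalize_tissue(record.get("tissue", ""))
        if term is None:
            unmapped.append(record.get("tissue", ""))
        harmonized.append({**record, "tissue_term": term})
    return harmonized, unmapped
```

Keeping the unmapped labels visible, rather than guessing, lets a curator extend the vocabulary deliberately.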

3. Data Harmonization

Further data curation practices must be adopted by any software that interfaces closely with GEO. For example, search engines must process large numbers of microarray and RNA-seq samples before they can be applied to gene expression profiles and co-expression modules. Integrating datasets and performing meta-analyses across studies is still difficult. The remaining challenges can be addressed by better integration and harmonization procedures, the use of community-accepted controlled vocabularies and standards, and better annotation strategies. Making GEO datasets ready for machine learning algorithms and powerful AI-assisted data analytics will greatly accelerate research progress.
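One very simple way to combine studies for a meta-analysis is sketched below: keep only the genes measured in every study, and z-score each gene within its own study before concatenating samples, so that study-level differences in scale do not dominate. This is an illustrative stand-in, not the method used by GEO or any particular platform, and it does not replace dedicated batch-correction methods. It assumes gene identifiers are already harmonized across studies; the function names are hypothetical.

```python
import statistics


def zscore(values):
    """Z-score a list of values; constant inputs map to all zeros."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


def combine_studies(studies):
    """Merge studies on their shared genes after within-study z-scoring.

    studies: list of dicts mapping gene -> list of expression values
             (one value per sample in that study).
    Returns a dict mapping each shared gene -> concatenated z-scores.
    """
    if not studies:
        return {}
    shared = set(studies[0])
    for study in studies[1:]:
        shared &= set(study)
    combined = {}
    for gene in sorted(shared):
        merged = []
        for study in studies:
            merged.extend(zscore(study[gene]))
        combined[gene] = merged
    return combined
```

Genes missing from any one study are dropped here; whether to drop or impute them is a design choice that depends on the analysis.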

Importance of Data Curation for GEO Datasets

Data is not merely a collection of numbers and symbols. Meticulous data curation plays an important role in ensuring that the data can be applied appropriately to research questions. Data curation involves active and ongoing management of data throughout its lifecycle. In the context of GEO datasets, this encompasses processes such as data acquisition, validation, cleaning, transformation, and annotation. The aim of curation is to enhance the quality of the data, making it more accessible and usable for researchers.

The following are the advantages of good data curation in GEO datasets:

  • Interoperability
    Curation becomes even more important when the data is sourced from different experiments, or combined across different sources. GEO datasets bring together many kinds of molecular information, including gene expression patterns, epigenetic modifications, and more. This diversity demands rigorous curation and harmonization to ensure that the datasets can be effectively compared.
  • Reproducibility
    Reproducibility is a cornerstone of scientific enquiry, and effective data curation goes a long way towards supporting reproducible research. Well-curated GEO datasets are transparent and well documented, so researchers can understand the nuances of the experimental methods and replicate or build upon prior studies with confidence.
  • Accessibility
    GEO datasets are valuable resources for researchers across the globe. Effective curation enhances the accessibility and usability of these datasets. Clear descriptions, standardized formats, and well-organized repositories make it easier for researchers to navigate, locate, and utilize the datasets relevant to their specific research questions.
  • Accuracy
    The sheer volume of data that is supported in GEO also requires good quality checks. Without careful curation, inaccuracies, inconsistencies, and artifacts may emerge, potentially misleading researchers. 
  • Fostering Interdisciplinary Collaboration
    Another advantage of data curation is to foster interdisciplinary collaboration, by advancing standards and greater convergence in datasets. Effective curation bridges the gap between biologists, bioinformaticians, and clinicians.
  • Durability
    Good data curation promotes long-term preservation, making datasets accessible for future generations of researchers, thus protecting the time and resources invested.
  • Upholding Ethical Standards
    Beyond technical considerations, good data curation in GEO datasets upholds ethical standards. Clear documentation of experimental protocols, participant consent, and adherence to ethical guidelines ensures that the use of these datasets maintains the rights and privacy of individuals contributing to scientific knowledge. 

Data Harmonization and Curation for GEO Datasets on Polly

At Elucidata, we believe that harmonization and curation of GEO datasets are the only way to appropriately leverage the power of GEO data. Our data harmonization platform Polly provides a basis for the integration of GEO datasets. As we have already discussed, data curation standards form the basis of scientific rigor. Detailed information about experimental design and procedure, together with good quality assurance measures, demonstrates a commitment to transparency and reproducibility in the community. Moreover, adherence to such standards makes datasets easier to discover. Researchers can efficiently search, navigate and identify datasets relevant to their research questions. This accelerates the pace of research and saves crucial resources.

Polly applies advanced algorithms and machine learning to implement data harmonization across datasets. Such integration ensures that data from various sources can be combined and analyzed uniformly. It enhances user experience, improves accessibility and fosters collaboration, and in doing so makes each dataset that much more powerful. We recognize the primary position that data curation has in the application of GEO datasets to research. Our curation services can be customized to the last detail, including inclusion criteria and analysis needs. We also perform harmonization to ensure that data is processed uniformly, annotated with the right metadata (with missing annotations filled in), and thoroughly quality checked. This harmonized data is stored on Atlas, where it can be further explored with our powerful analysis tools or exported to be analyzed according to research needs.

Polly allows users to streamline entire analysis workflows in this way, from data search to analysis and interpretation. Our analysis tools include several bioinformatics techniques, such as gene expression profiling, functional enrichment analysis and pathway analysis. Polly has played a crucial role in melanoma research, supporting successful meta-analysis to identify key genes and pathways associated with melanoma progression. Where GEO datasets have already expanded the field of possible research questions, Polly accelerates the path to answers through its support of the research method from start to end. 

In biomedical research, GEO data has triggered an exciting space of discovery. With advancements in curation and integration strategies closely following, we predict GEO data will continue to support scientific enterprise and collaboration, and shape the future of medicine. 

Book a Demo with us or reach out at info@elucidata.io for more information. 
