All You Need to Know about Gene Expression Omnibus (GEO)

Kriti Srivastava
July 26, 2023

In multi-omics research, the Gene Expression Omnibus (GEO) is one of the richest public sources of data. This blog is your guide to GEO's vast repository, which spans data types such as gene expression, non-coding RNA, and methylation profiles generated by microarray and sequencing technologies.

In this blog, we explain how to navigate this resource, find the datasets that matter to your work, download them in various formats, and cite them correctly. Read on to find out more.

What Is GEO?

The Gene Expression Omnibus (GEO) database is an international public repository that archives and freely distributes high-throughput gene expression and other functional genomics data sets. Created in 2000 as a worldwide resource for gene expression studies, GEO has evolved with rapidly changing technologies and now accepts high-throughput data for many other data applications, including those that examine genome methylation, chromatin structure, and gene–protein interactions. GEO supports community-derived reporting standards that specify the provision of several critical study elements, including raw data, processed data, and descriptive metadata.

The database provides access to data for tens of thousands of studies. It offers web-based tools and strategies that help users locate data relevant to their interests and then visualize and analyze it. Here we look at GEO's repository, the data types it hosts, and how to search for, download, and cite this data.

Data on GEO

Most of the high-throughput functional genomics data on GEO are derived from microarray or RNA-Seq experiments. The GEO DataSets database stores original submitter-supplied records (Series, Samples, and Platforms) as well as curated DataSets. See the GEO Overview page for information about these record types and how they are related.

GEO was designed around the standard features of most high-throughput and parallel molecular abundance-measuring technologies today. These include data generated from microarray and high-throughput sequence technologies, for example:

  • Gene expression profiling by microarray or next-generation sequencing
  • Non-coding RNA profiling by microarray or next-generation sequencing
  • Chromatin immunoprecipitation (ChIP) profiling by microarray or next-generation sequencing
  • Genome methylation profiling by microarray or next-generation sequencing
  • High-throughput RT-PCR
  • Genome variation profiling by array (arrayCGH)
  • SNP arrays (see human subject FAQ)
  • Serial Analysis of Gene Expression (SAGE)
  • Protein arrays
Figure 1 - Bar chart representing the number of datasets for some of the data types available on Gene Expression Omnibus (GEO)

It is important to note that the numbers provided are approximate and may vary as new datasets are continually added to GEO. Furthermore, GEO accommodates data sourced from many organisms, encompassing animals, plants, and microorganisms, thereby broadening the spectrum of datasets accessible through the repository.

Polly is a data-centric platform that curates biomedical data and can curate 26+ data types.

Reach out to us to learn more about how to accelerate your research!

Using the Data on GEO

How to Search for GEO Datasets?

You can search for GEO datasets in the database using many attributes, including keywords, organisms, datatype, and authors. GEO DataSets and GEO Profiles are part of NCBI's network of Entrez databases. As with these other databases, data of interest may be located by entering keywords into the GEO DataSets or GEO Profiles search boxes. However, challenges in finding relevant data on GEO include the vast number of datasets to navigate, inconsistent metadata, varying data formats and quality, and the absence of a unified search interface. These factors make it challenging to quickly and accurately identify the most suitable datasets for specific research needs. The hurdles above can be overcome with the help of Elucidata's Polly, which follows the FAIR principles, making data easily discoverable and highly reusable.

The Advanced Search Builder page, linked at the head of the GEO DataSets and GEO Profiles pages, assists in constructing complex queries. For a complex query, specify the search terms, their fields, and the Boolean operations to be performed on the words to maximize the possibility of finding relevant datasets. However, the effectiveness of this search is limited by the lack of a structured curation process and inconsistent or incomplete metadata, which makes it challenging to assess the suitability of a dataset for a specific research question.
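As a rough illustration of such a complex query, the sketch below assembles a fielded Boolean search string in Python. The helper name build_geo_query is hypothetical, and the field tags [Organism] and [Entry Type] are assumed to match the ones offered by the Advanced Search Builder, so check them there before relying on the result.

```python
def build_geo_query(keywords, organism=None, entry_type=None):
    """Combine keywords (OR) and fielded terms (AND) into one search string.

    Hypothetical helper: the field tags below are assumptions taken from
    the GEO DataSets Advanced Search Builder, not a guaranteed API.
    """
    # Alternative keywords are grouped with OR inside parentheses.
    parts = ["(" + " OR ".join(keywords) + ")"]
    if organism:
        # Quote multi-word values so they are searched as a phrase.
        parts.append(f'"{organism}"[Organism]')
    if entry_type:
        # Entry types such as gse (Series) restrict the kind of record.
        parts.append(f"{entry_type}[Entry Type]")
    # Every group must match, so the groups are joined with AND.
    return " AND ".join(parts)


query = build_geo_query(
    ["liver", "hepatocyte"], organism="Homo sapiens", entry_type="gse"
)
print(query)
```

The printed string can be pasted into the GEO DataSets search box. Grouping synonyms with OR and then narrowing with AND is one common way to widen recall without losing the organism and record-type restrictions.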

Polly's curation infrastructure enables curating biomolecular data at scale, keeping in mind the importance of data quality. It reduces the time taken to determine data usability from public sources.  

How to Download Data from GEO?

All the data in GEO can be downloaded in a variety of formats through several mechanisms. The download options are listed below.

  1. Original Submitter's Website: In some cases, GEO provides links to the original submitter's website where the data was initially deposited. Users can visit these external websites to access and download the data directly.
  2. GEO FTP Site: GEO hosts an FTP site that allows users to download data files in various formats. Users can navigate through the directory structure to find the specific dataset of interest and download the associated files. Formats available for download include raw data files, processed data files, metadata files, and supplementary files.
  3. GEO DataSets: GEO DataSets is a web-based tool with a user-friendly interface for searching and downloading data. Users can open the record of interest on the GEO DataSets website and download it in formats such as SOFT (Simple Omnibus Format in Text), MINiML (XML), or tab-delimited Series Matrix TXT files.
  4. SRA (Sequence Read Archive): For datasets containing sequencing data, GEO often links to the Sequence Read Archive (SRA) maintained by the National Center for Biotechnology Information (NCBI). Users can retrieve the raw sequencing runs from SRA and convert them to FASTQ format, for example with the SRA Toolkit.
  5. GEO2R: GEO2R is an online analysis tool provided by GEO that allows users to perform basic analysis on GEO datasets. While the primary function of GEO2R is analysis, it also allows downloading the processed data from selected datasets.
  6. Programmatic Access: GEO record metadata can be accessed and retrieved programmatically using the Entrez Programming Utilities (E-Utils) suite.
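The FTP and programmatic options above can be sketched in Python as follows. This is a minimal example under stated assumptions: the function names are hypothetical, the directory layout (Series grouped into folders such as GSE12nnn) and the E-Utils esearch endpoint with db=gds reflect how NCBI publishes GEO at the time of writing, and the Series Matrix file name assumes a single-platform Series. GSE12345 is only the placeholder accession used elsewhere in this blog. The code builds URLs without fetching them.

```python
from urllib.parse import urlencode

# Assumed base locations; verify against the current NCBI documentation.
FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/geo/series"
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def series_dir(accession):
    """Return the download directory for a Series accession (GSExxx)."""
    digits = accession[3:]
    if not accession.startswith("GSE") or not digits.isdigit():
        raise ValueError(f"not a Series accession: {accession!r}")
    # The last three digits are replaced by 'nnn' to name the parent folder.
    stub = "GSE" + digits[:-3] + "nnn"
    return f"{FTP_ROOT}/{stub}/{accession}"


def series_matrix_url(accession):
    """Series Matrix file; multi-platform Series use a different file name."""
    return f"{series_dir(accession)}/matrix/{accession}_series_matrix.txt.gz"


def esearch_url(accession):
    """E-Utils query that looks up the record in the GEO DataSets database."""
    params = {"db": "gds", "term": f"{accession}[ACCN]", "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


print(series_matrix_url("GSE12345"))
print(esearch_url("GSE12345"))
```

Either URL can then be passed to a download tool or HTTP client of your choice. Keeping URL construction separate from the download step makes it easy to check the paths in a browser first.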

GEO stores data in the format shared by the author, so the available download options and formats vary from dataset to dataset. It is therefore essential to explore the dataset of interest on the GEO website to determine which options and formats it offers.

Downloading data from Polly takes a single click! You can easily download the GCT/h5ad file locally, ready to be explored. Talk to us to know more!

Citing GEO in Research Papers

Citing data obtained from the Gene Expression Omnibus (GEO) is crucial for providing proper attribution and allowing readers to access the referenced dataset. When including GEO data in your research papers, following the appropriate citation format is essential. Here is a suggested method for citing GEO in your papers:

To cite a dataset from GEO, start with the author's name(s) followed by the publication year. Next, provide the title of the dataset, mentioning it is from the Gene Expression Omnibus (GEO) database. Include the dataset's accession number, a unique identifier assigned by GEO, and conclude with the URL where the dataset can be accessed. This format ensures precise and accurate citation of GEO data in your papers. It is recommended that submitters cite their Series accession number (GSExxx) because that record summarizes the experiment and includes links to all other relevant data.

For example, a citation might look like this: Smith AB, Johnson CD (2022). Gene expression profiling of human liver samples. Gene Expression Omnibus (GEO) database (Accession Number GSE12345), Available on NCBI.
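If you cite many datasets, the suggested format can be generated rather than typed by hand. The sketch below is one possible approach, not an official GEO tool: the function names are hypothetical, the URL pattern is assumed to be the accession viewer address GEO currently uses, and the authors, title, and accession are the illustrative placeholders from the example above rather than a real record.

```python
def geo_record_url(accession):
    """Assumed URL pattern for viewing a GEO record by accession."""
    return f"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={accession}"


def format_geo_citation(authors, year, title, accession):
    """Assemble authors, year, title, accession, and URL in the suggested order."""
    author_text = ", ".join(authors)
    return (
        f"{author_text} ({year}). {title}. "
        f"Gene Expression Omnibus (GEO) database "
        f"(Accession Number {accession}). {geo_record_url(accession)}"
    )


# Placeholder values copied from the illustrative citation in this blog.
citation = format_geo_citation(
    authors=["Smith AB", "Johnson CD"],
    year=2022,
    title="Gene expression profiling of human liver samples",
    accession="GSE12345",
)
print(citation)
```

Adjust the punctuation and field order to match the reference style required by your target journal.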

Including this citation provides proper credit to the dataset's authors and acknowledges GEO as the data source. The accession number is a specific identifier, enabling others to locate the dataset quickly. Additionally, providing the URL ensures direct access to the dataset on the GEO website.

Unlock the Full Potential of GEO Datasets With Polly

Even though GEO is a treasure trove of data, finding relevant data on GEO is like 'finding a needle in a haystack.' The immense amount of data is stored in a loosely defined author-shared format, making it a challenge to pinpoint relevant datasets because of the lack of effective biocuration. Fortunately, this problem has already been solved by Polly. Polly, our data-centric ML Ops platform, hosts over 1.5 million highly curated ML-ready biomolecular datasets from various repositories like GEO.

Finding data on Polly is much easier than GEO because of our curated standard fields that enable knowledge graph-backed filters on the GUI. These filters allow users to narrow their search results based on various essential metadata terms associated with the dataset, such as disease, tissue, cell type, drug, etc. Since knowledge graphs back the filters, they also allow users to include terms related/similar to their terms of interest. This leads to finding more datasets relevant to the user's requirements.

Reach out to us to learn more about how to accelerate your research!

