AskGEO: Tool For Recommending GEO Studies

Saksham Malhotra
June 12, 2020

Introduction

The Gene Expression Omnibus (GEO), maintained by NCBI, is a public repository for the free distribution of next-generation sequencing and other forms of high-throughput functional genomics data submitted by researchers across the world. Around 60,000 microarray and high-throughput sequencing studies are available on GEO; however, there is no effective way to find datasets of interest.

The GEO platform provides keyword-based search. The results it returns are diverse and so numerous that going through them manually is nearly impossible for a researcher. Moreover, matches are based on keywords present in the experiment design or the title of a study, which does not convey the study's full complexity.

Researchers may have a gene signature of interest, obtained through an experiment or heavily cited in the literature. We created a system, AskGEO, that takes this gene signature and finds studies across the complete GEO database in which a similar set of genes is co-expressed. In this way, AskGEO's search for relevant datasets takes into account the expression data in each dataset rather than relying on the external information provided by GEO.
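To make the idea of a signature-based search concrete, here is a minimal sketch in Python. The post does not say how AskGEO measures co-expression, so this assumes the mean pairwise Pearson correlation among the signature genes as the score. The function names, the dataset layout (genes as rows, samples as columns), and the study labels are all hypothetical.

```python
import numpy as np

def coexpression_score(expr, genes, signature):
    """Mean pairwise Pearson correlation among the signature genes in one dataset.

    expr: 2-D array with one row per gene and one column per sample.
    genes: gene symbols, one per row of expr.
    signature: gene symbols of interest.
    Returns NaN when fewer than two signature genes are present.
    (Assumed scoring rule; not taken from the AskGEO methodology.)
    """
    index = {gene: i for i, gene in enumerate(genes)}
    rows = [index[gene] for gene in signature if gene in index]
    if len(rows) < 2:
        return float("nan")
    corr = np.corrcoef(expr[rows])                      # gene-by-gene correlation matrix
    upper = corr[np.triu_indices(len(rows), k=1)]       # each gene pair counted once
    return float(np.mean(upper))

def rank_datasets(datasets, signature):
    """Order dataset names by descending score, skipping datasets that cannot be scored."""
    scores = {
        name: coexpression_score(expr, genes, signature)
        for name, (expr, genes) in datasets.items()
    }
    scorable = [name for name, score in scores.items() if not np.isnan(score)]
    return sorted(scorable, key=lambda name: scores[name], reverse=True)
```

A real system would also need to handle genes with constant expression (for which the correlation is undefined) and differences in normalization between datasets, neither of which this sketch addresses.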

The user can further refine the recommendations by giving keywords, which are searched for in the publications linked to these datasets. By combining the actual gene expression values from a dataset with the textual information in its linked publication, we can create a powerful tool that generates relevant suggestions. This is shown by the results obtained with two signatures, representative of two different biological conditions, which were used to validate the tool.
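The keyword refinement step could look something like the following Python sketch. It assumes case-insensitive substring matching against the publication text and re-ranks by the number of keywords matched. The post does not specify the matching rule, so both choices, along with the function name and the study labels, are illustrative only.

```python
def refine_by_keywords(ranked, publication_text, keywords):
    """Re-rank datasets by how many keywords appear in their linked publication.

    ranked: dataset names, already ordered by the signature-based score.
    publication_text: mapping from dataset name to the text of its linked publication.
    keywords: user-supplied terms (matched case-insensitively as substrings).
    Datasets matching no keyword are dropped; ties keep their original order.
    (Assumed matching rule; not taken from the AskGEO methodology.)
    """
    wanted = [keyword.lower() for keyword in keywords]
    hits = {}
    for name in ranked:
        text = publication_text.get(name, "").lower()
        hits[name] = sum(1 for keyword in wanted if keyword in text)
    kept = [name for name in ranked if hits[name] > 0]
    # sorted() is stable, so the signature-based order breaks ties.
    return sorted(kept, key=lambda name: hits[name], reverse=True)


ranked = ["study_a", "study_b", "study_c"]
texts = {
    "study_a": "Hepatic steatosis in mouse liver",
    "study_b": "Breast cancer cell lines",
    "study_c": "Liver fibrosis and steatosis in mouse models",
}
refined = refine_by_keywords(ranked, texts, ["liver", "fibrosis"])
```

Here `study_c` matches both keywords and moves ahead of `study_a`, which matches only one, while `study_b` is dropped.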

Solution - AskGEO

Polly provides a query and search engine, AskGEO, that allows users to find the right datasets for their analysis from its Data Lakes and helps them run analyses on top of those datasets. The methodology facilitates the systematic curation and processing of publicly available gene expression datasets from GEO. AskGEO runs a signature-based and keyword-based search that helps the user identify studies related to a biological phenomenon from the entire GEO repository.

Figure 1:
(A) Number of transcriptomics datasets in GEO by organism and platform
(B) Number of RNASeq GEO studies processed through Polly Compute and present in Polly data lake
(C) Number of studies by platform for three organisms

Building the AskGEO search engine involved:

  • Standardizing the processing of 40,000 datasets
  • Building a gene co-expression database
  • Using gene signatures to recommend datasets
  • Validating the recommendations for two gene signatures and a random gene signature
  • and more...

(A) Examples of metadata annotation for different GEO studies obtained using the model
(B) t-SNE plot of annotations of different GEO studies. Annotations with similar themes, shown by color, cluster together
(C) Workflow for recommending datasets for a gene signature. The metadata terms are optional and can be used to improve the recommendations

Get in Touch

Learn the methods used to create the system - AskGEO, which is able to generate recommendations for GEO studies based on a gene signature of interest, while overcoming the bottleneck of normalizing different datasets coming from different sources.

