GEO OmixAtlas: Standardized and Curated Biomolecular Data from GEO Databases on Polly

Jayashree
October 31, 2022

The variety and volume of data produced by biological research have grown dramatically in recent years. However, this abundance has brought new challenges, especially in curation. Data curation involves identifying data sources and then cleaning and transforming the raw data. Engineers and stakeholders typically face two main challenges with respect to data: ensuring data quality, and building systems that scale well, i.e., enabling data curation at scale.

When it comes to data quality, curation is an essential mechanism for generating high-quality datasets. It works by capturing metadata, so the accuracy of the metadata annotation is critical. Many public repositories freely distribute biological data submitted by the research community, one of which is GEO (Gene Expression Omnibus).

However, the findability and usability of GEO data are unsatisfactory. Why is it a herculean task for researchers and data scientists to derive accurate, high-quality data from the GEO database? Where does it fall short? How does Polly help in finding relevant datasets from GEO?

Read on to find answers to these questions and to understand the kind of data that Elucidata brings to the fore through its platform, Polly.

A Brief Overview of GEO

GEO is one of the largest public repositories for high-throughput gene expression data, and it also covers studies that examine genome methylation, chromatin structure, and genome–protein interactions. It is an international repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community.

Querying Data on GEO

GEO data can be retrieved in many ways:


  • However, given the large volumes of data stored in these databases, it is often useful to perform more
    refined queries using the advanced search in order to filter down to the most relevant data.
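As a rough illustration of what a refined query can look like when issued programmatically, the sketch below builds (but does not send) an NCBI E-utilities `esearch` URL against the GEO DataSets database (`gds`). The helper name `build_geo_search_url` and the example search terms are assumptions made for this post, and the `[Organism]` and `[Entry Type]` field tags should be checked against the GEO advanced-search documentation before use.

```python
from typing import Optional
from urllib.parse import urlencode

# Public E-utilities search endpoint; "gds" is the GEO DataSets database.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(
    keyword: str,
    organism: Optional[str] = None,
    entry_type: Optional[str] = None,
    retmax: int = 20,
) -> str:
    """Build an esearch URL for GEO DataSets without sending any request."""
    clauses = [keyword]
    if organism:
        # Restrict matches to the organism field rather than free text.
        clauses.append(f'"{organism}"[Organism]')
    if entry_type:
        # For example "gse" for series records.
        clauses.append(f"{entry_type}[Entry Type]")
    params = {"db": "gds", "term": " AND ".join(clauses), "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_geo_search_url(
    "breast cancer", organism="Homo sapiens", entry_type="gse"
)
print(url)
```

Restricting terms to specific fields narrows the results, but it only helps when the submitted metadata in those fields is accurate.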


Challenges Faced While Using GEO

While the query process looks straightforward, it is not so in practice.

1. The main challenge while working with GEO is the difficulty of retrieving data. Some of the most useful metadata for each dataset in GEO is stored in unstructured English text that is difficult for researchers to utilize effectively. Unless the correct keywords are used while searching, the search results might be completely off.

2. Data is downloaded in the form of compressed files. To find out whether the relevant data fields are present, one must go through the downloaded files individually. These files must also be analyzed one by one: they cannot be loaded directly into a processing pipeline because the data is not standardized.

3. The data on GEO does not follow a particular ontology. So, it is often necessary to find the synonyms, acronyms, and abbreviations of the keyword of interest to improve the search results.

4. Keyword searches can often be misleading because of incorrect metadata tags placed on the records.
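To make the synonym problem from point 3 concrete, here is a minimal sketch of the manual workaround: expanding one keyword into an OR-joined search clause. The `SYNONYMS` table and the `expand_keyword` helper are hypothetical; in practice the variants would have to come from an ontology or a curated thesaurus.

```python
# Hypothetical synonym table, keyed by a lower-cased canonical keyword.
SYNONYMS = {
    "non-small cell lung cancer": [
        "NSCLC",
        "non-small cell lung carcinoma",
    ],
}


def expand_keyword(keyword, synonyms=SYNONYMS):
    """Return a parenthesized OR clause covering the keyword and its variants."""
    variants = [keyword] + synonyms.get(keyword.lower(), [])
    seen = set()
    quoted = []
    for variant in variants:
        key = variant.lower()
        if key not in seen:  # drop case-insensitive duplicates, keep order
            seen.add(key)
            quoted.append(f'"{variant}"')
    return "(" + " OR ".join(quoted) + ")"


print(expand_keyword("Non-small cell lung cancer"))
```

Even with such an expansion, the search is only as good as the synonym list, which is why records tagged with an unexpected term are easy to miss.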


Find Standardized and Curated Datasets from GEO Databases on Polly

Elucidata’s proprietary curation platform, built on top of NLP-based AI models, generates harmonized metadata annotations with scientific context at an accuracy matching that of human experts. Polly’s curation infrastructure, PollyBERT, enriches the way we access metadata from various data sources. The model is trained on ~17 billion words and has ~660 million parameters. The curated data is hosted in a structured format in our repository, OmixAtlas, as GEO OmixAtlas.

All datasets on Polly go through a two-step process:

1. Data Engineering: This includes transforming data to fit a proprietary data schema that is uniform across several datatypes.

2. Metadata Harmonization: This means tagging each sample and dataset with a uniform ontology.
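The idea behind metadata harmonization can be sketched with a simple dictionary lookup that maps free-text sample labels onto one canonical term. This is only a toy stand-in and not how Polly's NLP-based models work; the `TISSUE_VOCAB` table, the `harmonize` helper, and the sample records are all hypothetical.

```python
# Hypothetical vocabulary: canonical term -> free-text variants in raw metadata.
TISSUE_VOCAB = {
    "liver": {"liver", "hepatic tissue", "liver biopsy"},
    "lung": {"lung", "pulmonary tissue", "lung parenchyma"},
}


def harmonize(raw_value, vocab):
    """Map a free-text label to its canonical term, or None if unrecognized."""
    # Normalize case and collapse repeated whitespace before matching.
    cleaned = " ".join(raw_value.lower().split())
    for canonical, variants in vocab.items():
        if cleaned in variants:
            return canonical
    return None  # left unannotated, e.g. for manual review


samples = [
    {"id": "S1", "tissue": "Liver Biopsy"},
    {"id": "S2", "tissue": " hepatic  tissue"},
    {"id": "S3", "tissue": "unknown"},
]
harmonized = [{**s, "tissue": harmonize(s["tissue"], TISSUE_VOCAB)} for s in samples]
print(harmonized)
```

Once every sample carries the same term for the same concept, samples from different datasets can be compared and filtered together.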

Data Engineering and Metadata Harmonization

Impact: Fewer Missing Annotations and More Harmonized Data

  • On Polly, fewer than 1% of annotations are missing and 99% of the data is harmonized. The data is annotated using ontologies, and manual QC checks are performed to ensure that the metadata is of high quality.
  • The samples are tagged with relevant information such as disease, tissue (source biomaterial), cell line, etc.
  • Samples are tagged uniformly with the same vocabulary.
  • Datasets are processed uniformly with the same molecular identifiers. We use the same standardized pipeline for processing the data.
  • GEO on Polly offers powerful search and querying capabilities, and integration with Shiny apps and other data dashboards.
  • GEO OmixAtlas allows connected queries, i.e., checking for experimental data dealing with a particular disease in a specific organism, or with a disease-drug combination. This is nearly impossible on the GEO database because of the lack of curation.
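A connected query becomes a plain filter once every dataset carries harmonized fields. The sketch below shows this on a few in-memory records. The `DatasetRecord` type, the `connected_query` helper, and the records themselves are invented for illustration and do not represent Polly's actual API or data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetRecord:
    """A dataset described by harmonized (uniform-vocabulary) fields."""
    dataset_id: str
    organism: str
    disease: str
    drug: Optional[str] = None


# Invented records, for illustration only.
RECORDS = [
    DatasetRecord("DS001", "Homo sapiens", "asthma", "budesonide"),
    DatasetRecord("DS002", "Mus musculus", "asthma"),
    DatasetRecord("DS003", "Homo sapiens", "asthma"),
    DatasetRecord("DS004", "Homo sapiens", "psoriasis", "methotrexate"),
]


def connected_query(records, **criteria):
    """Return the records whose fields match every given criterion exactly."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


human_asthma = connected_query(RECORDS, organism="Homo sapiens", disease="asthma")
asthma_drug = connected_query(RECORDS, disease="asthma", drug="budesonide")
print([r.dataset_id for r in human_asthma])
print([r.dataset_id for r in asthma_drug])
```

Exact matching is enough here only because the fields share one vocabulary. The same filter applied to free-text metadata would miss every record that spells the disease or organism differently.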

Conclusion

Polly’s curation infrastructure enables curation of biomolecular data at scale without losing sight of data quality. It cuts down the time needed to assess the usability of data from public sources.

Our data-centric MLOps platform, Polly, hosts over 1.5 million highly curated, ML-ready biomolecular datasets from repositories such as GEO. This level of curation lets you find the relevant datasets in seconds with a simple keyword search. Additionally, the curation fields help you filter the data to get very streamlined results. Since the data is highly curated and harmonized, various analyses can also be carried out easily.

Reach out to us to learn more about how to accelerate your research!
