GEO OmixAtlas: Standardized and Curated Biomolecular Data from GEO Databases on Polly

Jayashree

October 31, 2022

The variety and volume of data produced by biological research have grown dramatically in recent years. This abundance has brought new challenges, especially in curation. Data curation involves identifying data sources and cleaning and transforming the raw data. Engineers and stakeholders typically face two main challenges with respect to data: ensuring data quality, and building systems that scale well, i.e., that enable data curation at scale.

When it comes to data quality, data curation is an essential mechanism for generating high-quality datasets through metadata capture, and the accuracy of metadata annotation is critical. Many public repositories freely distribute biological data submitted by the research community; one of them is GEO (Gene Expression Omnibus).

However, the findability and usability of GEO data are unsatisfactory. Why is it a herculean task for researchers and data scientists to derive accurate high-quality data from the GEO database? Where does it fall short? How does Polly help in finding relevant datasets from GEO?

Read on to find answers to these questions and to understand the kind of data that Elucidata brings to the fore through its platform, Polly.

A Brief Overview of GEO

GEO is one of the largest public repositories for high-throughput gene expression data, and it also covers studies that examine genome methylation, chromatin structure, and genome–protein interactions. It is an international repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community.

Querying Data on GEO

GEO data can be retrieved in several ways. You can look up a record directly by its accession number, or search for relevant data in the GEO DataSets and GEO Profiles databases.

However, given the large volume of data stored in these databases, it is often useful to refine queries with the advanced search in order to filter down to the most relevant data.
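As an illustration of a refined query, here is a minimal sketch that builds a search URL for the GEO DataSets database through the public NCBI E-utilities `esearch` endpoint. The helper name `build_geo_search_url` and the example keyword are our own assumptions, and the field tags shown (`[Organism]`, `[Entry Type]`) are only a small subset of what the advanced search supports. The code only constructs the URL; it does not send a request.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(keyword, organism=None, entry_type=None, retmax=20):
    """Build an esearch URL for GEO DataSets (db=gds).

    Hypothetical helper for illustration: it combines a free-text keyword
    with optional organism and entry-type filters using AND.
    """
    clauses = [f'"{keyword}"[All Fields]']
    if organism:
        clauses.append(f'"{organism}"[Organism]')
    if entry_type:
        # For example "gse" restricts results to series records.
        clauses.append(f"{entry_type}[Entry Type]")
    params = {
        "db": "gds",
        "term": " AND ".join(clauses),
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_geo_search_url("breast cancer", organism="Homo sapiens", entry_type="gse")
print(url)
```

Even with such filters, the keyword itself is still matched against free text, which is where the challenges described next come in.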

Challenges Faced While Using GEO

While the query process looks pretty straightforward, it is not so in practice.

1. The main challenge while working with GEO is the difficulty of retrieving data. Some of the most useful metadata for each dataset in GEO is stored in unstructured English text that is difficult for researchers to utilize effectively. Unless the correct keywords are used while searching, the search results might be completely off.

2. Data is downloaded as compressed files. To find out whether the relevant data fields are present, one must go through the downloaded files individually. The files must also be analyzed one at a time; they cannot be loaded directly into a processing pipeline because the data is not standardized.

3. The data on GEO does not follow a particular ontology, so it is often necessary to search for synonyms, acronyms, and abbreviations of the keyword of interest to improve the search results.

4. Keyword searches can often be misleading because of incorrect metadata tags on the records.
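To make the synonym problem in point 3 concrete, the sketch below expands a keyword into an OR-joined search term. The synonym table is a small hand-written example and purely an assumption for illustration; in practice the variants would have to be collected from an ontology or by hand, which is exactly the effort the lack of a uniform vocabulary imposes on the researcher.

```python
# Hypothetical, hand-written synonym table for illustration only.
SYNONYMS = {
    "non-small cell lung cancer": ["NSCLC", "non-small cell lung carcinoma"],
}


def expand_keyword(keyword, synonyms=SYNONYMS):
    """Return an OR-joined search term covering the keyword and its known synonyms."""
    variants = [keyword] + synonyms.get(keyword.lower(), [])
    # Quote each variant so that multi-word phrases are kept together.
    return "(" + " OR ".join(f'"{v}"' for v in variants) + ")"


print(expand_keyword("non-small cell lung cancer"))
```

A keyword with no entry in the table is returned unchanged apart from the quoting, so any variant the researcher does not already know about is still missed.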


Find Standardized and Curated Datasets from GEO Databases on Polly

Elucidata’s proprietary curation platform, built on top of NLP-based AI models, generates harmonized metadata annotations with scientific context at an accuracy matching that of human experts. Polly’s curation infrastructure, PollyBERT, enriches the way we access metadata from various data sources. The model has ~660 million parameters and was trained on ~17 billion words. The curated data is hosted in a structured format in our repository, OmixAtlas, as GEO OmixAtlas.

All datasets on Polly go through a two-step process:

1. Data Engineering: This includes transforming data to fit a proprietary data schema that is uniform across several data types.

2. Metadata Harmonization: This means tagging each sample and dataset with a uniform ontology.
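The idea behind step 2 can be sketched with a toy example. The code below maps free-text tissue labels onto one controlled term per concept using a plain dictionary lookup. This is only a stand-in to show what "tagging with a uniform ontology" means for downstream use; it is not how Polly's models work, and the vocabulary, field names, and sample records are all made up for illustration.

```python
# Hypothetical mapping from free-text labels to one controlled term per concept.
TISSUE_VOCAB = {
    "liver": "liver",
    "hepatic tissue": "liver",
    "pbmc": "peripheral blood mononuclear cell",
    "peripheral blood mononuclear cells": "peripheral blood mononuclear cell",
}


def harmonize_tissue(raw_label, vocab=TISSUE_VOCAB):
    """Normalize a free-text tissue label; return None when no term matches."""
    key = " ".join(raw_label.lower().split())  # lowercase and collapse whitespace
    return vocab.get(key)


# Made-up sample records with inconsistent free-text labels.
samples = [
    {"id": "sample_1", "tissue": "Hepatic  tissue"},
    {"id": "sample_2", "tissue": "PBMC"},
    {"id": "sample_3", "tissue": "unknown prep"},
]

# Add a harmonized field next to the original label instead of overwriting it.
harmonized = [{**s, "tissue_harmonized": harmonize_tissue(s["tissue"])} for s in samples]
for row in harmonized:
    print(row["id"], "->", row["tissue_harmonized"])
```

Labels that cannot be mapped come back as `None`, which makes missing annotations easy to count and route to manual review.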

Figure: Data Engineering and Metadata Harmonization

Impact: Fewer Missing Annotations and More Harmonized Data

Using Polly, <1% of our annotations are missing and 99% of the data is harmonized. The data is annotated using ontologies, and manual QC is performed to ensure that the metadata is of high quality.
Samples are tagged with relevant information such as disease, tissue (source biomaterial), and cell line.
Samples are tagged uniformly with the same vocabulary.
Datasets are processed uniformly with the same molecular identifiers, using the same standardized pipeline.
GEO on Polly offers powerful search and querying capabilities, as well as integration with Shiny apps and other data dashboards.
GEO OmixAtlas allows connected queries, for example, finding experimental data on a particular disease in a specific organism, or on a disease-drug combination. This is nearly impossible on the GEO database because of the lack of curation.
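A connected query becomes simple once every record carries the same harmonized fields. The sketch below filters a handful of in-memory records on several fields at once. The record type, field names, and data are hypothetical and do not reflect Polly's actual schema or API; the point is only that exact matching on curated fields replaces guesswork over free text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical harmonized dataset record (not Polly's actual schema)."""
    dataset_id: str
    disease: str
    organism: str
    drug: str = ""


def connected_query(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


# Made-up records for illustration.
records = [
    DatasetRecord("DS1", "asthma", "Mus musculus"),
    DatasetRecord("DS2", "asthma", "Homo sapiens", drug="budesonide"),
    DatasetRecord("DS3", "psoriasis", "Homo sapiens"),
]

hits = connected_query(records, disease="asthma", organism="Homo sapiens")
print([r.dataset_id for r in hits])
```

The same function covers a disease-drug combination by passing `disease` and `drug` together.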

Conclusion

Polly’s curation infrastructure enables the curation of biomolecular data at scale while maintaining data quality. It cuts down the time taken to figure out whether data from public sources is usable.

Our data-centric ML Ops platform, Polly, hosts over 1.5 million highly curated, ML-ready biomolecular datasets from various repositories such as GEO. This level of curation ensures that you get all the relevant datasets in seconds with a simple keyword search. Additionally, the curation fields help you filter the data to get streamlined results. Since the data is highly curated and harmonized, various analyses can also be carried out easily.

Reach out to us to know more about how to accelerate your research!
