The variety and volume of data produced by biological research have grown dramatically in recent years. This abundance of data has brought new challenges, especially in curation. Data curation involves identifying data sources and cleaning and transforming the raw data. Engineers and stakeholders typically face two main challenges with respect to data: ensuring data quality, and building systems that scale well, i.e., enabling data curation at scale.
On the data-quality front, curation is the essential mechanism for generating high-quality datasets through metadata capture, and the accuracy of metadata annotation is critical. Many public repositories freely distribute biological data submitted by the research community; one of them is GEO (Gene Expression Omnibus).
However, the findability and usability of GEO data are unsatisfactory. Why is it such a herculean task for researchers and data scientists to derive accurate, high-quality data from the GEO database? Where does it fall short? And how does Polly help in finding relevant datasets from GEO?
Read on to find answers to these questions and to understand the kind of data that Elucidata brings to the fore through its platform, Polly.
GEO is one of the largest open-access repositories for high-throughput gene expression data, including studies that examine genome methylation, chromatin structure, and genome–protein interactions. It is an international repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community.
GEO data can be retrieved in several ways: through the web search interface, by direct accession lookup, via bulk FTP downloads, or programmatically through NCBI's E-utilities and packages such as GEOquery (R) or GEOparse (Python).
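As a concrete illustration, here is a minimal sketch of programmatic retrieval using the open-source GEOparse package (`pip install GEOparse`). The accession GSE68849 is only an illustrative placeholder, and the fields printed are the loosely structured free-text metadata that GEO actually stores.

```python
import os
import GEOparse

# Download and parse the compressed SOFT file for one series.
# GSE68849 is an illustrative accession; substitute any series of interest.
os.makedirs("./geo_cache", exist_ok=True)
gse = GEOparse.get_GEO(geo="GSE68849", destdir="./geo_cache")

# Series-level metadata arrives as a dict of free-text fields.
print(gse.metadata.get("title", []))
print(gse.metadata.get("summary", []))

# Sample-level metadata: each GSM carries its own loosely structured fields,
# which is exactly the unstructured text that makes keyword search unreliable.
for gsm_name, gsm in gse.gsms.items():
    print(gsm_name, gsm.metadata.get("characteristics_ch1", []))
```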
While the query process looks pretty straightforward, it is not so in practice.
1. The main challenge when working with GEO is retrieving the right data. Much of the most useful metadata for each dataset is stored as unstructured English text that is difficult for researchers to use effectively. Unless the correct keywords are used in the search, the results can be completely off.
2. Data is downloaded as compressed files, and the only way to find out whether the relevant fields are present is to go through the downloaded files one by one. Because the data is not standardized, the files cannot be loaded into a processing pipeline and must instead be analyzed individually.
3. The data on GEO does not follow a particular ontology, so it is often necessary to search for synonyms, acronyms, and abbreviations of the keyword of interest to improve the results (see the search sketch after this list).
4. Keyword searches can also be misleading because records are sometimes tagged with incorrect metadata.
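To see how sensitive keyword search is to phrasing, here is a minimal sketch that counts hits in the GEO DataSets index through NCBI's public E-utilities (`esearch.fcgi` with `db=gds`). The query terms are arbitrary examples, and real usage should respect NCBI's rate limits.

```python
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def count_geo_hits(term: str) -> int:
    """Return the number of GEO DataSets (db=gds) records matching a query."""
    params = {"db": "gds", "term": term, "retmode": "json", "retmax": 0}
    resp = requests.get(ESEARCH, params=params, timeout=30)
    resp.raise_for_status()
    return int(resp.json()["esearchresult"]["count"])

# The same biological concept phrased differently can return very different
# result sets, so synonym expansion is often necessary.
for query in [
    "hepatocellular carcinoma",
    "HCC",
    "liver cancer",
    '"hepatocellular carcinoma" OR HCC OR "liver cancer"',
]:
    print(query, "->", count_geo_hits(query))
```

Each phrasing of the same concept typically returns a different result set, which is exactly the synonym problem described in point 3 above.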
Elucidata’s proprietary curation platform, built on top of NLP-based AI models, generates harmonized metadata annotations with scientific context at an accuracy matching that of human experts. Polly’s curation infrastructure, PollyBERT, enriches the way we access metadata from various data sources. The model is trained on ~17 billion words and has ~660 million parameters. The curated data is hosted in a structured format in our repository, OmixAtlas, as the GEO OmixAtlas.
All datasets on Polly go through a 2-step process:
1. Data Engineering: transforming the data to fit a proprietary data schema that is uniform across data types.
2. Metadata Harmonization: tagging each sample and dataset with a uniform ontology (a toy sketch of what this normalization looks like follows below).
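To make "tagging with a uniform ontology" concrete, here is a toy sketch of the kind of normalization harmonization implies. This is not Polly's actual pipeline, which relies on NLP models such as PollyBERT; the mapping table, field names, and ontology IDs below are purely illustrative.

```python
# Toy illustration: map free-text tissue annotations onto one controlled vocabulary.
RAW_TO_ONTOLOGY = {
    "liver": "UBERON:0002107",                           # liver
    "hepatic tissue": "UBERON:0002107",
    "liver biopsy": "UBERON:0002107",
    "pbmc": "CL:2000001",                                # peripheral blood mononuclear cell
    "peripheral blood mononuclear cells": "CL:2000001",
}

def harmonize_tissue(raw_label: str) -> str:
    """Map a free-text tissue label to an ontology term, or flag it as unmapped."""
    return RAW_TO_ONTOLOGY.get(raw_label.strip().lower(), "UNMAPPED")

samples = ["Liver biopsy", "hepatic tissue", "PBMC", "brain cortex"]
print({s: harmonize_tissue(s) for s in samples})
# {'Liver biopsy': 'UBERON:0002107', 'hepatic tissue': 'UBERON:0002107',
#  'PBMC': 'CL:2000001', 'brain cortex': 'UNMAPPED'}
```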
Polly’s curation infrastructure enables curation of biomolecular data at scale while keeping data quality front and center. It cuts down the time needed to figure out whether data from public sources is usable.
Our data-centric MLOps platform, Polly, hosts over 1.5 million highly curated, ML-ready biomolecular datasets from repositories such as GEO. This level of curation ensures that you get all the relevant datasets in seconds with a simple keyword search. The curation fields additionally let you filter the data for very streamlined results, and because the data is highly curated and harmonized, various analyses can be carried out easily.
Reach out to us to learn more about how Polly can accelerate your research!