Solving Biomedical Data Findability Issues Using Polly

Kriti Srivastava
June 21, 2023

Researchers are producing an unprecedented amount of genomic data to gain insights into the functioning of the genome and its impact on human health and disease. By 2025, approximately 40 exabytes of storage will be required to accommodate the global genome sequencing data. To put this into perspective, a mere 5 exabytes would be sufficient to store all the words ever spoken by humans. Such is the magnitude of biological data being generated! Here we discuss how biomedical data can be made findable and reusable in an efficient manner.

Challenges in Finding Biomedical Datasets in Public Databases

Biomedical data is inherently diverse as it stems from various experimental techniques, platforms, and omics technologies. The enormity of this data is overwhelming, further compounded by multiple data access policies. Insufficient annotation and curation also pose challenges, discouraging researchers from navigating this vast data repository. The sheer volume and diversity of available datasets make it time-consuming to locate specific ones. Hence, effective data management and retrieval systems are essential to optimize the utilization of this generated data. Let’s have a look at how these challenges impact finding relevant datasets on GEO.

Finding the Relevance of a Dataset: The Journey on GEO

Judging the relevance of a dataset can be an uphill task, particularly on a public database such as GEO, which holds valuable information that is often hard to reach because of the challenges described above. Let's walk through a real-world example.
Researchers working on novel therapies for neurodegenerative diseases will want to find datasets that study how a particular drug affects disease progression. However, finding such a dataset in a public database is challenging because these datasets often lack proper clinical metadata. Let's follow a researcher's journey to establish the relevance of a patient-derived dataset (e.g., GSE97709_GPL1730) for neurodegenerative disease on a public database such as GEO.

On GEO:

  • Once we search for GSE97709 on GEO, we move to the GEO Accession viewer page, where we can see information related to this dataset, such as Title, Summary, Overall Design, etc. To understand whether this is a clinically relevant dataset, a user needs to read the full Summary to check for the term 'patient' or 'patient-derived' to ensure it is indeed a patient sample.
A snapshot of the details of the dataset GSE97709_GPL1730 on GEO.
  • Often, this information is not present in the Summary, and the user must scour the metadata to figure it out. Information related to each sample and project must be accessed individually through different clicks and pages. Looking for relevant datasets from a general query can take up to half an hour and at least 45 clicks per session.
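
The manual check described above, reading a Summary for the term 'patient' or 'patient-derived', can be sketched in a few lines of Python. This is only an illustration of the idea: the keyword list is a hypothetical starting point, and the summaries are made-up text, not the actual GSE97709 record.

```python
import re

# Hypothetical keyword list; a real screen would need a broader, curated vocabulary.
PATIENT_TERMS = ("patient-derived", "patient", "donor")


def mentions_patient_samples(summary: str) -> bool:
    """Return True if the free-text summary contains any patient-related term."""
    # Allow an optional plural "s" and match on word boundaries, ignoring case.
    pattern = r"\b(" + "|".join(re.escape(t) for t in PATIENT_TERMS) + r")s?\b"
    return re.search(pattern, summary, flags=re.IGNORECASE) is not None


# Made-up summaries for illustration only.
summaries = {
    "example_A": "Fibroblasts from three patients were reprogrammed into neurons.",
    "example_B": "Mouse cortical neurons were treated with compound X.",
}

flags = {name: mentions_patient_samples(text) for name, text in summaries.items()}
print(flags)
```

A keyword scan like this only works when the term actually appears in the text, which is exactly the limitation that makes unstructured summaries hard to search.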

As we can see, finding datasets on a public database such as GEO is time-consuming. To judge the relevance of any dataset, the user needs to read through the whole accession view page to understand the experimental design and confirm whether the dataset is clinically relevant. Even on that single page, the information is scattered across sections that the user must go through before shortlisting the dataset for their analysis. Many other details are saved in separate .tsv and .xlsx files, which must be downloaded locally to understand the makeup of the dataset. Let's improve data findability by adding a straightforward step to raw data pre-processing and see how it works wonders in finding data!

How Can Data Findability Be Improved?

Voila! Curation is the answer!
Curation is the extraction of knowledge from unstructured data into a structured, computable form. This process helps to find datasets easily by improving data organization, standardizing metadata, and enhancing data quality. Navigating and locating relevant data is easier once the datasets are organized into categories or topics. Standardized metadata ensures consistent annotation, enabling researchers to use specific search criteria and filters to refine their search. Curation also involves quality control, ensuring that datasets are reliable and scientifically sound. With organized data, standardized metadata, and improved quality, researchers can quickly identify and access datasets that align with their research needs, streamlining the process of finding relevant data. Let's look at how Polly helps solve the findability challenges.
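
To make the idea of standardized metadata concrete, here is a minimal sketch of one curation step: mapping free-text disease labels to a single controlled term. The synonym table, the sample records, and the "unmapped" fallback are all assumptions made for illustration; this is not Polly's curation pipeline, which would draw its terms from an ontology rather than a hand-written dictionary.

```python
# Hypothetical synonym table: free-text label -> one controlled term.
# A real pipeline would take these mappings from an ontology, not a small dict.
SYNONYMS = {
    "alzheimer's disease": "Alzheimer Disease",
    "alzheimers": "Alzheimer Disease",
    "ad": "Alzheimer Disease",
    "parkinson's disease": "Parkinson Disease",
    "pd": "Parkinson Disease",
}


def harmonize(raw_label: str) -> str:
    """Normalize whitespace and case, then look up the controlled term."""
    key = raw_label.strip().lower()
    # Labels with no known mapping are flagged instead of guessed.
    return SYNONYMS.get(key, "unmapped")


# Made-up sample annotations as they might appear at the source.
raw_samples = [
    {"sample": "S1", "disease": "Alzheimers"},
    {"sample": "S2", "disease": " AD "},
    {"sample": "S3", "disease": "Parkinson's disease"},
    {"sample": "S4", "disease": "healthy control"},
]

curated = [{**s, "disease": harmonize(s["disease"])} for s in raw_samples]
for row in curated:
    print(row["sample"], "->", row["disease"])
```

Once S1 and S2 carry the same term, a single filter on that term finds both, which is the practical payoff of consistent annotation.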

How to Resolve the Challenges that You Face on GEO

Polly is an AI-enabled cloud platform that provides access to FAIR multi-omics data from various public sources, including GEO. The data on Polly is curated using standardized pipelines and ontologies, ensuring consistent and analysis-ready data. The data engineering and metadata harmonization process transforms messy data into structured data in the OmixAtlas (OA) data warehouse. OA contains millions of curated FAIR datasets from public and licensed sources, ready for machine learning and analysis. Polly's customizable interface lets users quickly find relevant data without multiple searches. OAs are tailored to specific research needs, like diseases or gene targets. Consistent processing allows reliable comparison and cohort creation across all data on Polly.

Searching the Same Dataset on Polly:

  • When we search for GSE97709_GPL1730 on Polly, we get the output below, which shows the number of samples and the main keywords associated with the dataset. The different sections let us navigate to the relevant details with a single click, which is much faster than searching for datasets on GEO.
  • In addition to metadata tags, one will see the number of samples, GEO ID, dataset title, and other information the author provided. The Options and View Details buttons are used to explore a dataset further. Datasets containing patient-derived samples are also tagged as donor datasets, so these valuable studies can be identified quickly.
A snapshot of the dataset GSE97709_GPL1730 on Polly.
  • The View Details button: clicking it opens a new tab containing the dataset overview, the metadata table, and the metadata charts. The dataset overview page has all the experimental details available to read, including the abstract. The metadata table shows all the recorded metadata for each sample in the dataset.
The overview of the same dataset has all relevant details in an easily readable format.
  • All the source metadata fields at the sample level are available on the new 'Details' page. The 'Metadata Table' provides all the source metadata columns with clean names. All the information in the table can be read and interpreted easily, without reading paragraphs or going through the SRA Run Selector to search for relevant terms. These fields add value to the datasets and make it smoother for the user to explore datasets of relevance.
The metadata table includes all source metadata columns with intuitive labels.
  • An interactive sunburst plot is available under the 'metadata charts' section within the 'Details' page. A total of 4 features can be plotted in one go. The sunburst plot, by default, allows users to see how the experimental factors vary for different samples. However, any metadata field not belonging to the list of experimental factors can also be viewed. Together, these pages allow the user to dive deeper into a dataset without downloading it or doing any pre-processing first. This means less time is spent cleaning data and more time gaining insights. After exploring the dataset, it can be downloaded locally, or the user can continue to explore using the integrated Phantasus application for bulk RNA-seq data and Cellxgene for single-cell RNA-seq data.
An interactive sunburst chart provides a quick visualization of the whole dataset.
  • When two or more metadata fields vary in different combinations, working out how they relate to one another can become confusing; in such cases, a sunburst plot of the factors lets users explore the data more deeply.
  • The sunburst plot helps users decide whether a dataset is of interest, making their job easier.
  • It also ensures that users do not refrain from considering a dataset due to the sheer messiness of the data at the source.
  • The Experimental Factors field is also available in the 'Table View' of the OmixAtlas. With the experimental factors field open along with the sunburst plot, it becomes easier for users to ascertain the relevance of the experimental setup. They can choose datasets relevant to their problem statements at a glance without reading through publications or sifting through metadata tables. The plot also enables them to analyze data faster on the Polly UI to create cohorts of their liking.
Table view showing the experimental factors field along with many other fields.
  • All the curated fields on Polly use ontologies such as MeSH to determine the field values. This ensures that whenever the user is looking for neurodegenerative disease datasets, they can easily find them, since they are all tagged with the same term.
  • Polly has point-and-click filters for the curated fields, such as disease, drug, and organism, to help users quickly identify the datasets they want to use. Filtering this way takes at most 5-10 minutes and returns relevant results.
  • Relevant datasets can also be found using the search bar at the top of the page, which is powered by Elasticsearch. It lets the user look for keywords across the source metadata fields, such as title, description, and overall design, as well as our curated metadata fields, such as tissue, drug, and cell line. Unlike the checkbox filters, the search bar performs a fuzzy search: for example, typing transcriptomics also returns results for transcriptome and transcript. The search bar also supports operators for more advanced searches.
Different operators available on Polly
  • On Polly, the metadata for every dataset is available in a cleaner, more harmonized format than at the source, GEO. The column names are readable and intuitive, making it simple for users to get an overview of the metadata. This also allows samples from different datasets to be compared.
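
The search behaviour described in the list above, where typing transcriptomics also surfaces transcriptome and transcript across both source and curated fields, can be imitated with a toy example. The sketch below uses a simple shared-prefix rule as a stand-in; it is not how Polly's search engine works internally, and the records, field names, and the 8-character threshold are all hypothetical.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Count how many leading characters two words have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def matches(query: str, text: str, min_prefix: int = 8) -> bool:
    """Loose match: any word in the text shares a long enough prefix with the query."""
    q = query.lower()
    return any(shared_prefix_len(q, word) >= min_prefix for word in text.lower().split())


# Hypothetical records mixing a source field (title) and a curated field (tissue).
datasets = [
    {"id": "DS1", "title": "Brain transcriptome of treated mice", "tissue": "brain"},
    {"id": "DS2", "title": "Transcript levels after drug exposure", "tissue": "liver"},
    {"id": "DS3", "title": "Plasma proteomics in ageing", "tissue": "blood"},
]


def search(query: str, records: list) -> list:
    """Return ids of records where any field other than the id matches the query."""
    return [
        r["id"]
        for r in records
        if any(matches(query, str(value)) for key, value in r.items() if key != "id")
    ]


hits = search("transcriptomics", datasets)
print(hits)
```

An exact-match checkbox filter would return nothing for this query, since no record contains the literal word; the looser rule is what brings back the two related datasets.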

All the sample-level metadata from the source is now available on Polly, so users can get the whole picture by accessing the metadata table.
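
For readers curious about what a sunburst chart summarizes, the sketch below counts samples for each nested combination of metadata fields, one ring per field. It is an illustration under stated assumptions, not Polly's charting code: the field names and sample values are made up, and the four-field limit reflects one reading of the "4 features" mentioned earlier.

```python
from collections import Counter

# Hypothetical sample-level metadata; field names and values are made up.
samples = [
    {"condition": "disease", "treatment": "drug", "cell_type": "neuron"},
    {"condition": "disease", "treatment": "drug", "cell_type": "neuron"},
    {"condition": "disease", "treatment": "vehicle", "cell_type": "neuron"},
    {"condition": "control", "treatment": "vehicle", "cell_type": "neuron"},
    {"condition": "control", "treatment": "vehicle", "cell_type": "astrocyte"},
]


def sunburst_counts(rows, fields, max_fields=4):
    """Count samples for every ring segment; fields are ordered from inner to outer ring."""
    if len(fields) > max_fields:
        # Assumed limit, based on the four features mentioned in the text.
        raise ValueError(f"at most {max_fields} fields can be plotted at once")
    counts = Counter()
    for row in rows:
        path = tuple(row[f] for f in fields)
        # Each prefix of the path is one segment: inner ring first, then outward.
        for depth in range(1, len(path) + 1):
            counts[path[:depth]] += 1
    return counts


counts = sunburst_counts(samples, ["condition", "treatment", "cell_type"])
for path, n in sorted(counts.items()):
    print(" / ".join(path), "=", n)
```

Reading the counts from the inner ring outward shows at a glance which combinations of factors a dataset actually covers, for example that only the disease samples received the drug.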

Key Takeaways

  • Polly is easier to search than GEO because of the enhancements made to each dataset, such as metadata harmonization, ML-ready formatting, and clinical tagging. Its best feature is the smooth transition from a problem statement to the right datasets, from which insights can be gained without pre-processing.
  • Finding datasets on Polly is almost 80% faster than on a public database like GEO. Because the data in GEO is unstructured, the user must navigate through many tabs to get a complete picture of a dataset.
  • Polly integrates with third-party tools like Phantasus and Cellxgene to give a quick view of the relevant dataset without downloading it locally.


If you are scouring datasets to find relevant ones for downstream analysis, now is the time to reach out.

Connect with us to learn more about how to accelerate your drug discovery process using curated data.
