Power of Meticulously Curated Datasets in Accelerating Biomedical Research

Deepthi Das
March 28, 2024
“Data that is loved tends to survive.” - Kurt Bollacker

…And give great insights!!


Data serves as the cornerstone of bioinformatic breakthroughs. However, messy data can pose significant obstacles on the journey from raw data to meaningful insights. Information sourced from diverse public repositories often lacks consistent formatting and vital metadata annotations. The absence of contextual and structured information diminishes the findability and reusability of relevant data. Read this blog to understand how meticulously curated datasets can significantly impact and accelerate biomedical R&D by improving the findability and reusability of data.

Don't Repositories Have ‘Curated Datasets’?


Yes, they do. In the realm of data repositories, the concept of 'curated datasets' is akin to a library where books are meticulously organized based on various criteria. However, like libraries, these curated datasets may not always align perfectly with specific user queries, presenting a limitation to their utility. This disparity prompts a comparison to platforms like Google Scholar, renowned for its deep indexing capabilities that facilitate more relevant findings. Similarly, deep curation, considering downstream analyses, becomes imperative to extract value from the data housed within these repositories.  

Need for Curated Datasets

Let’s look at some specific challenges associated with public repositories that necessitate deep curation to unlock the value of their data.

  • Data Format: Datasets in repositories may be stored in specific formats that are not immediately usable by all researchers. For example, genomic data might be stored in formats like FASTQ or BAM, which require specialized software and expertise to interpret.
  • Data Quality: Even curated datasets may contain errors or inconsistencies that need to be addressed before they can be reliably used for analysis. Quality control processes are essential for ensuring that the data is accurate and reliable.
  • Metadata: While repositories may provide curated datasets, the accompanying metadata describing the samples, experimental conditions, and other relevant information may be incomplete or insufficient. Without comprehensive metadata, it can be challenging to interpret and analyze the data effectively.
  • Normalization and Preprocessing: Biological data often requires normalization and preprocessing to account for various experimental factors and technical biases. Researchers may need to perform additional processing steps to ensure that the data is suitable for their specific analysis.
  • Integration with Other Datasets: Researchers may need to integrate data from multiple sources to address specific research questions. This process can be complex and may require additional computational resources and expertise.
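The first two challenges above can be made concrete with a minimal sketch. FASTQ, mentioned earlier, stores each read as four lines (header, sequence, separator, quality string), and even reading it reliably requires knowing the format's conventions. The snippet below is a toy illustration, not production bioinformatics code; the example read is invented:

```python
# Minimal FASTQ parsing sketch: each record spans exactly 4 lines
# (header, sequence, separator, quality string).
def parse_fastq(text):
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, _, qual = lines[i:i + 4]
        yield {"id": header[1:], "seq": seq, "qual": qual}

def mean_phred(qual):
    # Phred+33 encoding: quality score = ord(char) - 33.
    return sum(ord(c) - 33 for c in qual) / len(qual)

record = next(parse_fastq("@read1\nACGT\n+\nIIII"))
print(record["id"], mean_phred(record["qual"]))  # 'I' encodes Phred 40
```

In practice researchers rely on dedicated tools (e.g., samtools, Biopython) for this, which is exactly the specialized-software barrier the list describes.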

While repositories play a crucial role in making biological data available to the research community, ensuring the usability of these datasets requires addressing various technical, quality, and accessibility challenges. This is where Elucidata steps in, utilizing cutting-edge AI models to address data quality issues, enabling researchers to fully leverage the wealth of public biomedical data for their research objectives.

Curated Datasets on Polly

Polly, Elucidata's data harmonization platform, effortlessly overcomes the significant data quality challenges found in publicly available datasets from diverse sources such as GEO, PRIDE, CPTAC, and various publications. By employing advanced AI algorithms, Polly harmonizes multi-omics and assay data, transforming them into machine learning (ML)-compatible formats. Trained experts utilize Polly's robust harmonization engine to curate diverse data types, annotate metadata, and ensure consistent processing, all while keeping costs affordable. The resulting ML-ready datasets are stored in Polly's Atlas or any preferred platform, facilitating seamless analysis and management.
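To picture what metadata harmonization involves, consider a toy example: mapping free-text sample annotations onto a controlled vocabulary so that equivalent samples become comparable. The synonym table and field names below are invented for illustration and are not Polly's actual vocabulary or pipeline:

```python
# Toy metadata harmonizer: normalizes free-text tissue labels to a
# controlled vocabulary so equivalent samples can be grouped together.
SYNONYMS = {
    "hepatic tissue": "liver",
    "liver biopsy": "liver",
    "pbmc": "peripheral blood mononuclear cell",
    "pbmcs": "peripheral blood mononuclear cell",
}

def harmonize(sample):
    raw = sample.get("tissue", "").strip().lower()
    # Fall back to the lowercased original if no mapping is known.
    return {**sample, "tissue": SYNONYMS.get(raw, raw)}

samples = [{"id": "GSM1", "tissue": "Hepatic tissue"},
           {"id": "GSM2", "tissue": "PBMCs"}]
print([harmonize(s)["tissue"] for s in samples])
# → ['liver', 'peripheral blood mononuclear cell']
```

Real harmonization engines map to standard ontologies and handle far messier inputs, but the principle is the same: inconsistent labels become a single queryable term.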

| Raw Data at Source (GEO) | Datasets on Polly |
| --- | --- |
| Lack of standardization makes parsing and utilization difficult. | Data stored in a structured, consistent format. |
| Comprehensive but unrefined data; unstructured and inconsistent metadata. | Meticulously curated and standardized data; accurate and complete metadata. |
| Significant cleaning, standardization, and interpretation needed by researchers before reliable analysis. | Ready for immediate use, freeing researchers from data preparation tasks. |

Case Study: How Curation Impacts Data Retrieval

[GEO vs. CREEDS vs. Polly]

To demonstrate the benefits of meticulous deep curation, we analyze the effectiveness of data retrieval across data from three distinct sources, each containing the same datasets from CREEDS:

1.  Unprocessed data directly from GEO
2.  Data manually curated by CREEDS
3.  The same datasets but curated through our Polly Harmonization Engine

These sources represent varying levels of data quality, with raw GEO data at the lower end and the data curated by the Polly Harmonization Engine at the higher end. The experiment was carried out using state-of-the-art Named Entity Recognition (NER) models that can process text-based queries against the data corpus.
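The retrieval setup can be sketched roughly as follows: an entity extractor pulls biomedical terms from the query, and datasets are ranked by how many of those entities their metadata covers. This toy dictionary lookup stands in for a real NER model, and all dataset IDs, terms, and function names here are illustrative:

```python
# Toy stand-in for NER-driven search: extract known entity terms from
# a query, then rank datasets by how many entities their metadata covers.
ENTITY_LEXICON = {"breast cancer", "liver", "rna-seq", "mus musculus"}

def extract_entities(query):
    q = query.lower()
    return {term for term in ENTITY_LEXICON if term in q}

def search(query, corpus):
    wanted = extract_entities(query)
    # Score each dataset by metadata overlap with the query entities.
    scored = [(len(wanted & meta), ds) for ds, meta in corpus.items()]
    return [ds for score, ds in sorted(scored, reverse=True) if score > 0]

corpus = {
    "GSE0001": {"breast cancer", "rna-seq"},
    "GSE0002": {"liver", "mus musculus"},
}
print(search("RNA-seq studies of breast cancer", corpus))  # → ['GSE0001']
```

The sketch makes the study's point tangible: however good the entity extraction, retrieval only succeeds if the metadata it matches against is complete and consistently annotated.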

The experiment demonstrated a significant improvement in search responses with the Polly Harmonized version of the data corpus, in contrast to the other two sources. The NER model-enabled search against the Polly Harmonized corpus accurately retrieved the datasets for most of the tested queries. Conversely, the raw GEO data and the CREEDS data showed much greater variance across queries and lower overall scores. The Polly Harmonized data delivered precise responses to queries while reducing the likelihood of overlooking relevant datasets.

Read this whitepaper for more details on this case study.
[Figure: Comparison of retrieval accuracy (F1 scores) across the three data sources evaluated.]

This study, conducted on a representative sample of real queries, emphasizes the vital role of data quality in retrieving pertinent information from a data collection. It is not enough for a language-understanding AI to comprehend user questions accurately; the underlying knowledge base must also be meticulously curated, annotated, and structured to aid in finding relevant data. Both halves of the search process must work in concert to translate user queries efficiently and return contextually precise responses. The results underscore the importance of high-quality, deeply curated metadata in navigating large-scale biomedical datasets.

Connect with us or reach out to us at info@elucidata.io to learn more.
