The Curious Case of Mis-labelling Data: Orangutan Genomic Data Mix-Up

Recently, a correction was issued for a landmark Nature publication that reported orangutan genomes. A research team using the data from the 2011 paper noticed that the names given to some of the samples didn’t match the animals’ reported sex. For example, the paper reported that an orangutan named Dolly was male, but according to zoo records, Dolly was female. Strengthening the case, the team found that some of the genomes marked as male lacked a Y chromosome. They rechecked the data and found that the 2011 paper had misidentified all but two of the orangutan genomes. Banes suggested that some of the mistakes appear to be the result of typos.

This was a very obvious mix-up. A genome labelled male but missing a Y chromosome should catch the attention of scientists. Yet the paper was cited in 641 articles and remained unchallenged for around 11 years. Errors like this exist in many published papers. This case was about orangutans, which is important in its own right, but imagine if it had been a biomedical paper and people were developing therapies based on the published data.
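A mismatch of this kind can be caught with a simple automated check. The following is a minimal sketch, assuming that mean read depth over the autosomes and over the Y chromosome has already been computed for each sample. The `Sample` class, the function names, the 0.2 threshold, and the sample values are all hypothetical and chosen for illustration; none of them come from the orangutan study.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    name: str
    reported_sex: str      # "male" or "female", as recorded in the metadata
    autosome_depth: float  # mean read depth over the autosomes
    chr_y_depth: float     # mean read depth over the Y chromosome


def infer_sex(sample: Sample, min_y_ratio: float = 0.2) -> str:
    """Infer sex from the Y-to-autosome depth ratio.

    The threshold is an illustrative assumption, not a validated cutoff.
    """
    if sample.autosome_depth <= 0:
        raise ValueError(f"{sample.name}: autosome depth must be positive")
    ratio = sample.chr_y_depth / sample.autosome_depth
    return "male" if ratio >= min_y_ratio else "female"


def find_sex_mismatches(samples, min_y_ratio: float = 0.2):
    """Return the names of samples whose reported sex disagrees with the data."""
    return [
        s.name
        for s in samples
        if infer_sex(s, min_y_ratio) != s.reported_sex.strip().lower()
    ]


# Made-up values for illustration only.
samples = [
    Sample("sample_A", "male", 30.0, 14.0),
    Sample("sample_B", "male", 30.0, 0.3),    # labelled male, almost no Y reads
    Sample("sample_C", "female", 28.0, 0.1),
]

print(find_sex_mismatches(samples))  # ['sample_B']
```

Flagged samples would still need a manual review against the original records, since a label can be wrong for reasons other than a sample swap.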

Towards FAIR Digital Transformation

In this age of data reuse, where projects like TCGA and TARGET are carried out specifically to generate data that can be mined for years to come, maintaining data quality from end to end is of the utmost importance. Research teams perform high-throughput experiments and store the data in folders on the cloud. In most cases, the data is generated to answer a particular question and then stored away. When it is revisited after some time, for instance when publishing the results or running a related experiment, the notations and labels may no longer be obvious. Even if one team labels the data well, another team trying to reuse it might not be able to follow all the labels unless a specific ontology is used.

The immediate need is to switch from a folder structure on the cloud to a data warehouse. A data warehouse is a central repository of information that can be analyzed to make more informed decisions. However, a data warehouse does more than just store your data. It requires that data be stored and organized in predefined formats. This increases the findability, interoperability, and integration possibilities of your data. It also reduces the manual work involved in generating reports or visualizations based on your data.
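To make the idea of predefined formats concrete, here is a minimal sketch of a check that could run before a metadata record is loaded into a warehouse. The field names, the controlled vocabularies, and the `validate_record` function are hypothetical examples, not the schema of any particular product or study.

```python
# Required fields and controlled vocabularies (illustrative assumptions).
REQUIRED_FIELDS = ("sample_id", "species", "sex")
CONTROLLED_VOCABULARY = {
    "species": {"Pongo abelii", "Pongo pygmaeus"},
    "sex": {"male", "female", "unknown"},
}


def validate_record(record: dict) -> list:
    """Return a list of problems found in one metadata record (empty if none)."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            errors.append(f"missing required field: {field}")
    for field, allowed in CONTROLLED_VOCABULARY.items():
        value = record.get(field)
        if value is not None and str(value).strip() and value not in allowed:
            errors.append(f"{field}={value!r} is not in the controlled vocabulary")
    return errors


good = {"sample_id": "S1", "species": "Pongo abelii", "sex": "male"}
bad = {"sample_id": "S2", "species": "orangutan", "sex": "M"}

print(validate_record(good))  # []
for problem in validate_record(bad):
    print(problem)
```

Rejecting free-text values such as "M" or "orangutan" at load time is what keeps labels consistent across teams, at the cost of some upfront effort in agreeing on the vocabulary.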

High Time For A Mindset Shift

The scientific community has become more aware of the need to label and store data in a reusable manner. Incentives for publishing data that follows the FAIR guidelines have been offered, and there are initiatives for tackling missing data. But life science research, in academia and industry alike, also needs to recognize the data it produces as independent assets with the potential to power various AI/ML initiatives. Big pharma-AI partnerships are steadily increasing to accelerate drug discovery and repurposing. The focus is on using internally generated and public data to feed automated pipelines for target identification, precision medicine development, biomarker identification, and more. Data generation, storage, and management have to be carried out with this end goal in mind to maximize the ROI on experiments and data. As data volume increases, data extraction, transformation, and storage need to be automated to maintain data hygiene and quality from lab to analysis to reuse.

We are passionate about driving the FAIR transformation. To that end, we recently hosted our annual event, DataFAIR2022, where a stellar team of experts from Massachusetts General Hospital, Exelixis, and Alnylam Pharmaceuticals, who have led transformative FAIRification efforts within their enterprises, came together for a panel. They discussed what the infrastructure for handling big biomedical data should look like, the challenges they faced while adopting FAIR approaches to data management, and some wins on the journey to FAIR transformation. Check out the panel discussion here.

