The Curious Case of Mis-labelling Data: Orangutan Genomic Data Mix-Up

Deepthi Das
November 18, 2022

Recently, a correction emerged for a landmark Nature publication that reported orangutan genomes. A research team using data from the 2011 Nature paper noticed that the names given to some of the samples didn’t match the animals’ reported sex. For example, the paper reported that an orangutan named Dolly was male, but according to zoo records, Dolly was female. Strengthening the case, they found that some of the genomes marked as male lacked a Y chromosome. They rechecked the data and found that the 2011 paper had misidentified all but two of the orangutan genomes. Banes, one of the researchers who spotted the discrepancies, suggested that some of the mistakes seem to be the result of typos.

This was a very obvious mix-up. A genome labelled male but missing a Y chromosome should have caught the attention of scientists. However, the paper was cited in 641 articles and remained unchallenged for around 11 years! Errors like this exist in many published papers. This case involved orangutans, which is important in its own right, but imagine if this were a biomedical paper and people were developing therapies based on the published data.
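A consistency check of the kind that could flag such a mix-up is simple to automate. The sketch below is illustrative only: the `Sample` fields, the `MIN_MALE_Y_RATIO` threshold, and the sample values are assumptions made for this example, not details taken from the orangutan dataset or from the correction itself.

```python
from dataclasses import dataclass

# Hypothetical threshold: minimum ratio of Y-chromosome read depth to mean
# autosomal read depth expected for a male sample. Illustrative value only.
MIN_MALE_Y_RATIO = 0.2


@dataclass
class Sample:
    name: str
    reported_sex: str            # "male" or "female", as given in the metadata
    y_to_autosome_ratio: float   # Y depth divided by mean autosomal depth


def inferred_sex(sample: Sample) -> str:
    """Infer sex from relative Y-chromosome coverage."""
    if sample.y_to_autosome_ratio >= MIN_MALE_Y_RATIO:
        return "male"
    return "female"


def find_sex_mismatches(samples: list[Sample]) -> list[str]:
    """Return names of samples whose reported sex disagrees with the data."""
    return [s.name for s in samples if s.reported_sex.lower() != inferred_sex(s)]


# Made-up values for demonstration.
samples = [
    Sample("sample_A", "male", 0.48),
    Sample("sample_B", "male", 0.01),    # labelled male, almost no Y coverage
    Sample("sample_C", "female", 0.00),
]
mismatches = find_sex_mismatches(samples)
print(mismatches)
```

Running a check like this when data is deposited, rather than years later, turns a label error into a failed validation instead of a published result.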

Towards FAIR Digital Transformation


In this age of data reuse, where projects like TCGA and TARGET are carried out specifically to generate data that can be mined for years to come, maintaining data quality from end to end is of the utmost importance. Research teams perform high-throughput experiments and store the data in folders on the cloud. In most cases, the data is used to answer a particular question and then stored away. If it is revisited after some time, for instance when publishing the results or running a related experiment, the notations and labels may no longer be obvious. Even if one team labels the data well, another team trying to reuse it might not be able to follow all the labels unless a specific ontology is used.

The immediate need is to switch from a folder structure on the cloud to a data warehouse. A data warehouse is a central repository of information that can be analyzed to make more informed decisions. However, a data warehouse does more than just store your data. It requires that data be stored and organized in predefined formats. This increases the findability, interoperability, and integration possibilities of your data. It also reduces the manual work involved in generating reports or visualizations from your data.
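To make "predefined formats" concrete, here is a minimal sketch of validating a metadata record before it is loaded into a warehouse. The field names, the controlled vocabularies, and the `validate_record` helper are all hypothetical and chosen for illustration; a real deployment would draw its vocabularies from a published ontology rather than hard-coded sets.

```python
from datetime import date

# Hypothetical schema: required fields and controlled vocabularies.
REQUIRED_FIELDS = ("sample_id", "species", "sex", "collection_date")
ALLOWED_SEX = {"male", "female", "unknown"}
ALLOWED_SPECIES = {"Pongo abelii", "Pongo pygmaeus"}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one metadata record."""
    errors = []
    # Every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            errors.append(f"missing field: {field}")
    # Values must come from the controlled vocabularies.
    sex = record.get("sex")
    if sex and sex not in ALLOWED_SEX:
        errors.append(f"sex '{sex}' not in controlled vocabulary")
    species = record.get("species")
    if species and species not in ALLOWED_SPECIES:
        errors.append(f"species '{species}' not in controlled vocabulary")
    # Dates must be ISO 8601 (YYYY-MM-DD) so they sort and parse uniformly.
    collected = record.get("collection_date")
    if collected:
        try:
            date.fromisoformat(collected)
        except ValueError:
            errors.append(f"collection_date '{collected}' is not YYYY-MM-DD")
    return errors


good = {"sample_id": "S1", "species": "Pongo abelii",
        "sex": "female", "collection_date": "2011-01-27"}
bad = {"sample_id": "S2", "species": "Pongo abelii",
       "sex": "M", "collection_date": ""}

print(validate_record(good))
print(validate_record(bad))
```

Rejecting a record with a free-text label such as "M" at load time is what keeps the stored data interpretable by a second team later.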

High Time For A Mindset Shift

The scientific community has become more aware of the need to label and store data in a reusable manner. Incentives for publishing data that follows the FAIR guidelines have been offered, and there are initiatives for tackling missing data. But life science research, in academia and industry alike, needs to recognize the data it produces as an independent asset with the potential to power various AI/ML initiatives. Big pharma-AI partnerships are steadily increasing to accelerate drug discovery and repurposing. The focus is on using internally generated and public data to feed automated pipelines for target identification, precision medicine development, biomarker identification, and more. Data generation, storage, and management have to be carried out with this end goal in mind to maximize the ROI on experiments and data. As data volumes increase, extraction, transformation, and storage need to be automated to maintain data hygiene and quality from lab to analysis to reuse.

We are passionate about driving the FAIR transformation. To that end, we recently hosted our annual event, DataFAIR2022, where a stellar panel of experts from Massachusetts General Hospital, Exelixis, and Alnylam Pharmaceuticals, who have led transformative FAIRification efforts within their enterprises, discussed what the infrastructure to handle big biomedical data should look like, the challenges they faced while adopting FAIR approaches to data management, and some wins on the journey to FAIR transformation. Check out the panel discussion here.
