
Why do you need custom data curation?

Swetabh Pathak
September 21, 2021

Curating large volumes of semi-structured or unstructured biomedical data for drug discovery can be expensive, cumbersome, and resource-intensive. The biomedical research community broadly agrees on the need for better and more extensive curation, though not always on how universal that curation can be. Traditionally, curation has been treated as a single-step process: do it once, and it works for all use cases. Increasingly, however, data-driven teams are treating curation as a tailored process, carried out against the specific requirements of each use case.

High-quality curation done as needed can be a superpower for data-driven bioinformatics teams. In this post, we talk about 3 reasons why custom curation is here to stay.

You want to work with the latest and greatest dataset

This is probably the most obvious reason. Every week, many interesting datasets are published; research groups have become high-throughput data generation engines. Bioinformaticians often want to analyze multiple datasets per week and move fast, so curation can quickly become a bottleneck. For example, your team might want to analyze the latest single-cell atlas. For the liver alone, a simple Google search turns up more than 10; if you are looking at normal cells more broadly, the number is an order of magnitude higher. On Polly, we host more than 1,000 atlases for normal tissues.
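To make the scale concrete, here is a minimal sketch, assuming Biopython is installed and you supply a contact email, of how candidate GEO series can be enumerated programmatically rather than by manual search. The query string is illustrative, not a validated filter.

```python
# Minimal sketch: counting candidate GEO series programmatically.
# Assumes Biopython is installed; the query string is illustrative only.
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI requires a contact email

# Search GEO DataSets (db="gds") for liver single-cell series.
query = 'single cell[All Fields] AND liver[All Fields] AND "gse"[Filter]'
handle = Entrez.esearch(db="gds", term=query, retmax=20)
result = Entrez.read(handle)
handle.close()

print(f"Matching series: {result['Count']}")
for gds_id in result["IdList"]:
    print(gds_id)
```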

Without a dedicated curation effort, researchers have to restrict themselves to the one or two datasets they can analyze. Often, this is a choice driven not by scientific reasoning but by resource constraints. If you have access to a dedicated curation effort, internally or through a platform like Polly, you can analyze many datasets and shortlist the most appropriate ones to explore further. More information usually means better scientific decisions.

You need to process the raw data with a specific pipeline

Researchers tend to use data from multiple (public, premium, and proprietary) sources. But data processed differently can't easily be compared, so research groups often want to re-process raw data with their own 'custom' pipelines that enable 'apples-to-apples' comparisons.

Re-processing raw data is not trivial, especially if your team doesn't have a scalable way to do it. Some teams are well equipped because they have productionized their raw-data processing pipelines; many others aren't. For them, it is worth productionizing processing pipelines the way we do on Polly, using Nextflow.
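On Polly this step runs as Nextflow pipelines; as a language-agnostic illustration of the underlying idea, here is a minimal Python sketch in which every dataset, whatever its source, passes through one fixed normalization step so downstream comparisons stay apples-to-apples. The file paths and function names are hypothetical.

```python
# Minimal sketch of a single, fixed processing step applied to every
# dataset, regardless of source, so results stay comparable.
# Paths and function names here are hypothetical.
import numpy as np
import pandas as pd

def normalize_counts(counts: pd.DataFrame) -> pd.DataFrame:
    """Log-CPM normalize a genes-by-samples raw count matrix."""
    cpm = counts / counts.sum(axis=0) * 1e6   # counts per million
    return np.log2(cpm + 1)                   # stabilize variance

def reprocess(raw_paths: list[str]) -> dict[str, pd.DataFrame]:
    """Run every dataset through the *same* pipeline version."""
    processed = {}
    for path in raw_paths:
        counts = pd.read_csv(path, index_col=0)
        processed[path] = normalize_counts(counts)
    return processed

# Example (hypothetical files): datasets from public, premium, and
# proprietary sources become comparable via one shared processing path.
# matrices = reprocess(["geo_gse123.csv", "inhouse_run42.csv"])
```

The point of the sketch is not the normalization method itself but the structure: one versioned function that every matrix flows through, which is exactly what a workflow manager like Nextflow enforces at scale.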

The metadata isn’t compatible with your question

Typically, the standard metadata provided with a dataset (e.g. age, gender, cell type, cell line, drug) is not enough for the question at hand. Standard metadata works for 'standard' use cases. It falls short if, say, your team is looking for datasets relevant to CAR-T, or is profiling blood samples for research on women's reproductive health. In the latter case, the type/source of the blood sample, i.e. menstrual blood, cervicovaginal blood, or whole blood, is important, but it's unlikely that such a metadata field exists 'out of the box'.

Depending on the number of datasets (tens or thousands), a new 'custom' metadata field can be added manually or through some kind of automation. In either case, process and QC matter: you want to be sure the field is attributed consistently across datasets.
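Continuing the blood-sample example above, here is a minimal sketch, assuming sample descriptions sit in a pandas DataFrame with a free-text description column (a hypothetical layout), of a rule-based pass that adds a blood_source field and flags anything ambiguous for manual QC.

```python
# Minimal sketch: adding a custom 'blood_source' metadata field with a
# QC pass. Column names and keyword rules are hypothetical.
import pandas as pd

# Controlled vocabulary, checked most-specific-first.
RULES = [
    ("menstrual", "menstrual blood"),
    ("cervicovaginal", "cervicovaginal blood"),
    ("whole blood", "whole blood"),
]

def annotate_blood_source(text: str) -> str:
    """Map a free-text description onto the controlled vocabulary."""
    text = text.lower()
    hits = [label for keyword, label in RULES if keyword in text]
    if len(hits) == 1:
        return hits[0]
    return "needs_review"   # ambiguous or unmatched -> manual QC queue

samples = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "description": [
        "Menstrual blood, day 2 of cycle",
        "Whole blood, healthy donor",
        "Peripheral sample, donor 7",    # no keyword -> flagged
    ],
})
samples["blood_source"] = samples["description"].map(annotate_blood_source)

# QC: every sample must carry a value from the controlled vocabulary.
flagged = samples[samples["blood_source"] == "needs_review"]
print(f"{len(flagged)} of {len(samples)} samples need manual review")
```

Whether the annotation step is manual or automated, the QC gate is the same: no dataset ships until every sample carries exactly one value from the controlled vocabulary.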

In summary, data curation is not a one-time process, nor is it one-size-fits-all. Empower your team to do it often and well. Science is, after all, iterative, and so is data curation.
