The advent of the digital era, advancements in data sharing, and high-throughput technologies have broadened the scope of data usage, data sharing, and collaborative research. Different stakeholders (data generators, data custodians, data managers, and data consumers) are involved in the data lifecycle. Data has become an independent asset, and its stakeholders might be spread across different geographies, subject domains, etc. It is paramount to ensure data quality and to make sure that data does not lose its integrity as it moves through the different stages of its life cycle. Hence, at Elucidata, we have set up a system that continuously and critically assesses the quality of our data and its metadata, so that our data keeps the promise of reliability and interoperability that we make to our customers.
Data quality traits can be categorized into two groups based on whether those traits are inherent to the data (intrinsic) or not (extrinsic). Ensuring intrinsic data quality (e.g., removing bias in the data, capturing an optimum number of data points in an experiment, etc.) is the mandate of data generators, and it can mostly only be improved at the source. Extrinsic data quality (e.g., ensuring the correctness of the metadata fields, accurate annotations, etc.) generally depends on data custodians and data managers and can be improved through data curation.
Data quality is notional and cannot always be quantified in absolute terms. Large-scale data ingestion tends to introduce errors, which creep in because experimenters fill in metadata details differently and curators process certain information differently. Therefore, a continuous monitoring and correction system is needed to ensure the highest level of extrinsic data quality and a streamlined ingestion and curation process.
Extrinsic data quality assurance needs careful consideration in several aspects, such as:
1. Standardization - deals with conformance of field names and values to ontologies and controlled vocabularies as well as the formats specified for those fields. It enables data to be more searchable and findable.
2. Accuracy of information - pertains to the correctness and plausibility of the data. It builds trust in the data.
3. Integrity - highlights the truthfulness and concordance of data.
4. Breadth - pertains to ensuring sufficient curated fields for a user to understand and use the data for analysis. This enables data to be more accessible to the user.
5. Completeness - primarily deals with eliminating the missingness in data points. It enables the user to use all the available data and not lose information to incompleteness.
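As a rough illustration of how one of these aspects could be measured, the sketch below computes per-field completeness as the fraction of samples with a usable value. The field names and the list of tokens treated as "missing" are assumptions made for this example, not the rules used on Polly.

```python
from typing import Dict, List, Optional, Sequence

# Assumed placeholder values that should count as missing (hypothetical list).
MISSING_TOKENS = {"", "na", "n/a", "none", "not available", "unknown"}


def is_missing(value: Optional[object]) -> bool:
    """Treat None and common placeholder strings as missing."""
    return value is None or str(value).strip().lower() in MISSING_TOKENS


def completeness(samples: List[dict], fields: Sequence[str]) -> Dict[str, float]:
    """Return, for each field, the fraction of samples with a non-missing value."""
    total = len(samples)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(not is_missing(sample.get(field)) for sample in samples) / total
        for field in fields
    }
```

A field with a low score is a candidate for further curation, while a score of 1.0 means every sample carries a value for that field.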
At Elucidata, we have devised a data quality assessment approach to ensure that all these aspects are taken care of and that the data is FAIR (Findable, Accessible, Interoperable, and Reusable) before it reaches the consumable stage on our data-centric ML Ops platform, Polly.
The process has two main parts:
1. The Validation Layer - understanding the data issues and creating the rulesets needed for the computational program to highlight errors in the data
2. The Correction Layer - correcting the errors that were found in the validation layer
Each of the validation and correction layers is further divided into processes that can be handled computationally and those that need manual intervention. The validation layer gives us a typical list of errors, which range from schema issues to contextual anomalies. Though individually trivial, these errors accumulate and magnify, making it difficult to access, query, and integrate data across the system. To improve the data quality, we need to correct them. Adopting a two-pronged approach, our team groups the errors into those that can be handled systematically and those that need human expertise.
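The two layers and the split between automated and manual handling can be sketched as follows. This is a minimal illustration under stated assumptions: the controlled vocabulary, the synonym map, and the single "tissue" field are hypothetical stand-ins for the expert-curated rulesets.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical ruleset: allowed values and known fixes for one field.
CONTROLLED_TISSUES = {"liver", "lung", "breast"}
KNOWN_SYNONYMS = {"hepatic tissue": "liver", "Lung": "lung", "mammary gland": "breast"}


@dataclass
class LabelError:
    sample_id: str
    field: str
    value: str


def validate(samples: List[Dict[str, str]]) -> List[LabelError]:
    """Validation layer: flag tissue labels outside the controlled vocabulary."""
    return [
        LabelError(sample["sample_id"], "tissue", sample["tissue"])
        for sample in samples
        if sample["tissue"] not in CONTROLLED_TISSUES
    ]


def correct(
    samples: List[Dict[str, str]], errors: List[LabelError]
) -> Tuple[List[Dict[str, str]], List[LabelError]]:
    """Correction layer: fix what the ruleset covers, queue the rest for curators."""
    by_id = {sample["sample_id"]: dict(sample) for sample in samples}  # copy, do not mutate input
    manual_queue: List[LabelError] = []
    for error in errors:
        replacement = KNOWN_SYNONYMS.get(error.value)
        if replacement is None:
            manual_queue.append(error)  # needs human expertise
        else:
            by_id[error.sample_id][error.field] = replacement  # systematic fix
    return list(by_id.values()), manual_queue
```

Errors that the ruleset can resolve are corrected in place, and everything else ends up in a queue that a manual curator would review.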
The expert-curated rulesets and guidelines corrected errors more smoothly than expected. Each dataset had multiple sample labels, and each sample had multiple descriptive labels. Across Polly, all labels for seven repos (TCGA, GEO, cBioPortal, DepMap, ImmPort, GDC, CPTAC) were assessed at the dataset level (~9.9 million) and the sample level (~34.87 million), and 8% of labels were erroneous. Of those, we corrected 99% of the errors. We could automate the curation of more than 94% of the errors distributed across labels, with the remainder handled by manual curators.
Given the sheer volume of datasets (~400k) at our disposal, error collection using simple iteration over an initial run of six OmixAtlases took an unreasonable ~7 hours to execute. To address this, we implemented multiprocessing on Polly and lowered the execution time to under 4 minutes. This demonstrates the computational efficiency of the system we have developed.
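The move from simple iteration to multiprocessing can be sketched with Python's standard library. This is an illustrative outline, not the implementation running on Polly: the required fields, the per-dataset check, and the function names are all hypothetical. Because each dataset is checked independently, the work splits cleanly across worker processes.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Dict, List

# Hypothetical required metadata fields for this example.
REQUIRED_FIELDS = ("dataset_id", "organism", "disease")


def find_errors(dataset: Dict[str, str]) -> List[str]:
    """Check one dataset's metadata and describe each missing required field."""
    dataset_id = dataset.get("dataset_id", "<unknown>")
    return [
        f"{dataset_id}: missing {field}"
        for field in REQUIRED_FIELDS
        if not dataset.get(field)
    ]


def collect_errors_serial(datasets: List[Dict[str, str]]) -> List[str]:
    """Baseline: iterate over the datasets one at a time."""
    errors: List[str] = []
    for dataset in datasets:
        errors.extend(find_errors(dataset))
    return errors


def collect_errors_parallel(
    datasets: List[Dict[str, str]],
    workers: int = 4,
    chunksize: int = 1000,
    executor_factory=ProcessPoolExecutor,
) -> List[str]:
    """Fan the same check out over a pool of workers; result order is preserved."""
    with executor_factory(max_workers=workers) as pool:
        per_dataset = pool.map(find_errors, datasets, chunksize=chunksize)
        return [error for errors in per_dataset for error in errors]
```

Sending datasets to workers in chunks keeps inter-process overhead small when there are hundreds of thousands of items. On platforms that start worker processes by spawning a fresh interpreter, the parallel call should sit under an `if __name__ == "__main__":` guard. The `executor_factory` parameter is only there so the same logic can be exercised with a thread pool in tests.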
Additionally, to apply more stringent checks on ingested data that capture standardization, accuracy, and integrity, we use the ‘pydantic’ library, which allows us to perform schema checks, field-specific checks, and certain logical checks across multiple fields on the dataset metadata.
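The three kinds of checks can be illustrated with a small pydantic model. This sketch assumes pydantic v2 is installed; the model, its field names, and the organism vocabulary are hypothetical and do not reflect the actual Polly schema.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError, field_validator, model_validator

# Hypothetical controlled vocabulary for the example.
ALLOWED_ORGANISMS = {"Homo sapiens", "Mus musculus"}


class DatasetMetadata(BaseModel):
    # Schema check: required fields and their types.
    dataset_id: str
    organism: str
    disease: str
    tissue: Optional[str] = None
    total_num_samples: int
    sample_ids: List[str]

    # Field-specific check: the value must come from the controlled vocabulary.
    @field_validator("organism")
    @classmethod
    def organism_in_vocabulary(cls, value: str) -> str:
        if value not in ALLOWED_ORGANISMS:
            raise ValueError(f"organism {value!r} is not in the controlled vocabulary")
        return value

    # Logical check across fields: the declared count must match the listed samples.
    @model_validator(mode="after")
    def sample_count_matches(self) -> "DatasetMetadata":
        if self.total_num_samples != len(self.sample_ids):
            raise ValueError("total_num_samples does not match the number of sample_ids")
        return self


def collect_errors(record: dict) -> List[str]:
    """Validate one metadata record and return readable error messages."""
    try:
        DatasetMetadata(**record)
    except ValidationError as exc:
        return [
            f"{'.'.join(str(part) for part in err['loc']) or 'record'}: {err['msg']}"
            for err in exc.errors()
        ]
    return []
```

An empty list from `collect_errors` means the record passed all three kinds of checks; otherwise each message names the offending field, or `record` for a cross-field failure.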
The quality validations described above are embedded into the system by packaging the validation and correction algorithms into our ingestion libraries. Any data that comes into Polly has to first pass the quality assessment mechanism. Additionally, for the rare event of ambiguous data evading the automatic surveillance and finding its way into Polly, we have set up a manual validation layer with expert human auditors continuously inspecting the quality of data in Polly.
In conclusion, we hold the promise of “richly curated biomedical molecular data” close to our hearts and make the data richer with every passing day. Connect with us to learn more about the 1.5M highly curated ML-ready biomolecular datasets on our Polly platform!