Elucidata's Unique Data Quality Assurance Process

Shruti Desai, Deepthi Das

The advent of the digital era, advancements in data sharing, and high-throughput technologies have broadened the scope of data usage, data sharing, and collaborative research. Different stakeholders (data generators, data custodians, data managers, and data consumers) are involved in the data lifecycle. Data has become an independent asset, and its stakeholders might be spread across different geographies, subject domains, etc. It is paramount to ensure data quality and to make sure that data doesn’t lose its integrity as it moves through the different stages of its lifecycle. Hence, at Elucidata, we have set up a system that continuously and critically assesses the quality of the data and its metadata, so that our data keeps the promise of reliability and interoperability that we make to our customers.

Data Quality

Data quality traits can be categorized into two groups based on whether those traits are inherent to the data (intrinsic) or not (extrinsic). Ensuring intrinsic data quality (e.g., removing bias in the data, capturing the optimum number of data points in an experiment, etc.) is the mandate of data generators, and it can mostly be improved only at the source. Extrinsic data quality (e.g., ensuring the correctness of the metadata fields, accurate annotations, etc.) generally depends on data custodians and data managers and can be improved through data curation.

Color-coded flowchart to identify extrinsic and intrinsic data traits

Why Do We Need a Continuous Data Quality Assessment Process?

Data quality is notional and cannot always be quantified in absolute terms. When data is ingested at high volume, errors tend to creep in because of variation in how experimenters fill in metadata details or how each curator processes certain information. Therefore, a continuous monitoring and correction system is needed to ensure the highest level of extrinsic data quality and a streamlined ingestion and curation process.

How Do We Assess and Ensure Extrinsic Data Quality?

Extrinsic data quality assurance needs careful consideration in several aspects, such as:

  1. Standardization - deals with conformance of field names and values to ontologies and controlled vocabularies as well as the formats specified for those fields. It enables data to be more searchable and findable.
  2. Accuracy of information - pertains to the correctness and plausibility of the data. It builds trust in the data.
  3. Integrity - highlights the truthfulness and concordance of data.
  4. Breadth - pertains to ensuring sufficient curated fields for a user to understand and use the data for analysis. This enables data to be more accessible to the user.
  5. Completeness - primarily deals with eliminating the missingness in data points. It enables the user to use all the available data and not lose information to incompleteness.
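
To make a few of these aspects concrete, here is a minimal sketch of how completeness, standardization, and breadth could be scored on sample metadata. It is an illustration under stated assumptions, not the checks that run on Polly: the function names, the list of "missing" tokens, the field names, and the tiny tissue vocabulary are all hypothetical.

```python
from typing import Any, Iterable

# Hypothetical tokens treated as "missing"; a real pipeline would define its own.
MISSING_TOKENS = {"", "na", "n/a", "none", "unknown"}


def is_missing(value: Any) -> bool:
    """Return True if a metadata value carries no usable information."""
    return value is None or str(value).strip().lower() in MISSING_TOKENS


def completeness(records: list[dict], field: str) -> float:
    """Fraction of records with a non-missing value for `field`."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if not is_missing(record.get(field)))
    return filled / len(records)


def standardization(records: list[dict], field: str, vocabulary: set[str]) -> float:
    """Fraction of non-missing values of `field` found in a controlled vocabulary."""
    values = [r[field] for r in records if not is_missing(r.get(field))]
    if not values:
        return 0.0
    return sum(1 for value in values if value in vocabulary) / len(values)


def breadth(records: list[dict], expected_fields: Iterable[str]) -> float:
    """Fraction of expected fields that are populated in at least one record."""
    expected = list(expected_fields)
    if not expected:
        return 0.0
    covered = sum(
        1 for field in expected
        if any(not is_missing(r.get(field)) for r in records)
    )
    return covered / len(expected)


# Illustrative sample-level metadata (made-up values).
TISSUE_VOCABULARY = {"liver", "lung"}
samples = [
    {"tissue": "liver", "disease": "normal"},
    {"tissue": "Liver", "disease": ""},
    {"tissue": "lung", "disease": "carcinoma"},
    {"tissue": None, "disease": "NA"},
]

print("tissue completeness:", completeness(samples, "tissue"))
print("tissue standardization:", standardization(samples, "tissue", TISSUE_VOCABULARY))
print("breadth:", breadth(samples, ["tissue", "disease", "cell_type"]))
```

Scoring each aspect separately keeps the results actionable: a low standardization score points to vocabulary mapping, while a low completeness score points to missing values at the source or in curation.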

At Elucidata, we have devised a data quality assessment approach to ensure that all these aspects are taken care of and that the data is FAIR (Findable, Accessible, Interoperable, and Reusable) before it reaches the consumable stage on our data-centric ML Ops platform, Polly.

The process has two main parts:

  1. The Validation Layer - understanding the data issues and creating the rulesets needed for the computational program to highlight errors in the data
  2. The Correction Layer - correcting the errors that were found in the validation layer

Both the validation layer and the correction layer are further divided into processes that can be handled computationally and those that need manual intervention. The validation layer gives us a typical list of errors, which range from schema issues to contextual anomalies. Though individually trivial, these errors accumulate and magnify, making it difficult to access, query, and integrate data across the system. To improve data quality, these errors need to be corrected. Adopting a two-pronged approach, our team separates the errors that can be handled systematically from those that need human expertise.
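
The two-pronged grouping described above can be sketched roughly as follows. This is a simplified illustration, not our production code: the `LabelError` record, the error kinds, the synonym table, and the fixer functions are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LabelError:
    """One error reported by the validation layer (hypothetical structure)."""
    dataset_id: str
    field: str
    value: str
    kind: str  # e.g. "synonym", "whitespace", "contextual"


# Hypothetical synonym table mapping free text to a controlled term.
SYNONYMS = {"human": "Homo sapiens", "mouse": "Mus musculus"}


def fix_synonym(value: str) -> Optional[str]:
    """Map a known synonym to its controlled term, or None if unknown."""
    return SYNONYMS.get(value.strip().lower())


def fix_whitespace(value: str) -> Optional[str]:
    """Collapse repeated whitespace and trim the ends."""
    return " ".join(value.split())


# Error kinds that a rule can correct; every other kind goes to a curator.
AUTO_FIXERS: dict[str, Callable[[str], Optional[str]]] = {
    "synonym": fix_synonym,
    "whitespace": fix_whitespace,
}


def triage(errors: list[LabelError]):
    """Split errors into automatically corrected ones and a manual queue."""
    corrected: list[tuple[LabelError, str]] = []
    manual_queue: list[LabelError] = []
    for error in errors:
        fixer = AUTO_FIXERS.get(error.kind)
        new_value = fixer(error.value) if fixer else None
        if new_value is None:
            # No rule applies, or the rule could not resolve this value.
            manual_queue.append(error)
        else:
            corrected.append((error, new_value))
    return corrected, manual_queue


example_errors = [
    LabelError("D1", "organism", "Human ", "synonym"),
    LabelError("D2", "disease", "breast   cancer ", "whitespace"),
    LabelError("D3", "organism", "hamster", "synonym"),
    LabelError("D4", "tissue", "liver", "contextual"),
]
corrected, manual_queue = triage(example_errors)
print(len(corrected), "auto-corrected;", len(manual_queue), "sent to curators")
```

Note that a rule that cannot resolve a value falls through to the manual queue rather than guessing, so automation never silently introduces a wrong label.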

The two-pronged approach to group and correct errors in data

With the expert-curated rulesets and guidelines, error correction went more smoothly than expected. Each dataset had multiple sample labels, and each sample had multiple descriptive labels. Across Polly, all labels for seven repositories (TCGA, GEO, cBioPortal, DepMap, ImmPort, GDC, CPTAC) were assessed at the dataset level (~9.9 million) and the sample level (~34.87 million), and 8% of the labels were found to be erroneous. Of those, we corrected 99% of the errors. We could automate the curation of more than 94% of the errors distributed across labels, and the remainder was handled by manual curators.

Given the sheer volume of datasets (~400k) at our disposal, collecting errors by simple serial iteration over an initial run of six OmixAtlases took about 7 hours, which was impractical. To address this, we implemented multiprocessing on Polly and lowered the execution time to under 4 minutes, a speedup of more than 100-fold. This demonstrates the computational efficiency of the system we have developed.
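
As a rough illustration of the idea, the sketch below parallelizes per-dataset error collection with Python's standard `multiprocessing.Pool`. It is a generic pattern under stated assumptions, not the implementation on Polly: `validate_dataset`, `collect_errors`, and the required-field list are hypothetical, and the real checks are far richer.

```python
from multiprocessing import Pool

# Hypothetical required fields; stands in for the real validation rules.
REQUIRED_FIELDS = ("dataset_id", "organism", "disease")


def validate_dataset(metadata: dict) -> list[str]:
    """Return the error messages for one dataset's metadata."""
    dataset_id = metadata.get("dataset_id", "<unknown>")
    return [
        f"{dataset_id}: missing '{field}'"
        for field in REQUIRED_FIELDS
        if not metadata.get(field)
    ]


def collect_errors(datasets: list[dict], workers: int = 1, chunksize: int = 1000) -> list[str]:
    """Collect errors for all datasets, serially or across worker processes."""
    if workers <= 1:
        # Serial baseline: simple iteration over every dataset.
        per_dataset = map(validate_dataset, datasets)
    else:
        # Each dataset is validated independently, so the work splits
        # cleanly across processes; chunks reduce inter-process overhead.
        with Pool(processes=workers) as pool:
            per_dataset = pool.map(validate_dataset, datasets, chunksize=chunksize)
    # Flatten the per-dataset lists; order follows the input order.
    return [error for errors in per_dataset for error in errors]
```

When using more than one worker, call `collect_errors(datasets, workers=8)` from inside an `if __name__ == "__main__":` block so that worker processes can import the script safely. The achievable speedup depends on the number of cores and on how expensive each per-dataset check is.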

Additionally, to apply more stringent checks on ingested data that capture data quality in terms of standardization, accuracy, and integrity, we use the ‘pydantic’ library, which allows us to perform schema checks, field-specific checks, and certain logical checks across multiple fields on the dataset metadata.
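
The three kinds of checks can be sketched with pydantic as shown below. This is a minimal example that assumes pydantic v2; the `DatasetMetadata` model, its field names, and the small organism vocabulary are hypothetical and do not reflect the actual Polly schema.

```python
from pydantic import BaseModel, Field, ValidationError, field_validator, model_validator

# Hypothetical controlled vocabulary; real checks would use full ontologies.
ALLOWED_ORGANISMS = {"Homo sapiens", "Mus musculus"}


class DatasetMetadata(BaseModel):
    # Schema checks: every field must be present and have the declared type.
    dataset_id: str
    organism: str
    disease: str
    total_num_samples: int = Field(gt=0)
    num_control_samples: int = Field(default=0, ge=0)

    # Field-specific check: the value must come from the vocabulary.
    @field_validator("organism")
    @classmethod
    def organism_in_vocabulary(cls, value: str) -> str:
        if value not in ALLOWED_ORGANISMS:
            raise ValueError(f"'{value}' is not in the controlled vocabulary")
        return value

    # Logical check across multiple fields.
    @model_validator(mode="after")
    def controls_do_not_exceed_total(self):
        if self.num_control_samples > self.total_num_samples:
            raise ValueError("num_control_samples exceeds total_num_samples")
        return self


def validate_metadata(record: dict) -> list[str]:
    """Return readable error messages for one metadata record."""
    try:
        DatasetMetadata(**record)
    except ValidationError as exc:
        return [
            f"{'.'.join(str(part) for part in err['loc']) or '<record>'}: {err['msg']}"
            for err in exc.errors()
        ]
    return []


record = {
    "dataset_id": "D1",
    "organism": "Homo sapien",  # typo, so the vocabulary check fails
    "disease": "normal",
    "total_num_samples": 10,
}
for message in validate_metadata(record):
    print(message)
```

Returning messages instead of raising lets a validation run gather every error across many datasets in one pass, which is what a downstream correction step needs.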

Increasing the Reliability of Data on Polly

The quality validations described above are embedded into the system by packaging the validation and correction algorithms into our ingestion libraries. Any data that comes into Polly has to first pass the quality assessment mechanism. Additionally, for the rare event of ambiguous data evading the automatic surveillance and finding its way into Polly, we have set up a manual validation layer with expert human auditors continuously inspecting the quality of data in Polly.

In conclusion, we hold the promise of “richly curated biomedical molecular data” close to our hearts and make the data richer with every passing day. Connect with us to learn more about the 1.5M highly curated ML-ready biomolecular datasets on our Polly platform!
