Data Quality in Early-Stage Drug Development: Paving the Way for Reliable Research Outcomes

Pawan Verma
February 23, 2024

Ensuring data quality is crucial in early-stage drug development for reliable research outcomes. This blog explores the importance of data integrity, accuracy, and completeness in identifying promising drug candidates and meeting regulatory standards. By prioritizing data quality, researchers can navigate the complexities of drug development with confidence and precision.

Factors Affecting Data Quality and Research Outcomes

In real-world applications, data is often ‘dirty’, making data quality a critical factor in whether a machine learning system can accurately predict the phenomenon it claims to measure. In high-stakes AI applications, the importance of data quality is magnified, as errors have a heightened downstream impact on the accuracy of predictions.

Data quality has a ‘domino effect’: errors in data propagate easily and compound, producing a negative downstream impact and increasing technical debt over time.

The following are common data-domino triggers that must be managed in any data-driven organization or when building data-intensive applications.

  • Insufficient Domain Expertise/ Experience: Insufficient domain expertise or experience can lead to a misunderstanding of the nuances and complexities inherent to the data being analyzed. This can result in the development of models that are based on incorrect assumptions or that overlook critical variables.
    Such models are likely to perform poorly when applied to real-world situations, potentially leading to a cascade of errors as subsequent decisions and models are built on this flawed foundation.
  • Limited Training Data: When training data is scarce, models may not be exposed to enough examples to generalize well from the training data to real-world scenarios. This limitation can lead to overfitting, where a model performs well on training data but is unable to predict accurately on unseen data.

  • Data Bias and Noise: Bias can occur through the exclusion of certain groups within the data or through the overrepresentation of others, leading to models that perform well for some populations but poorly for others. Noise, on the other hand, can obscure the true signal that a model is supposed to learn, making it difficult for the application to identify the underlying patterns it needs to make accurate predictions.
    Both bias and noise can result in models that are less robust and that propagate errors throughout their use.

  • Incomplete Data: Incomplete data arises when critical information is missing from a dataset, which can stem from subpar data curation practices. When datasets are not curated with rigor and attention to detail, they may lack key pieces of information that are crucial for accurate analysis. This missing data can lead to skewed analytical outcomes, as models may over-rely on the information that is available, which may not accurately reflect the real-world scenario.
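
Many of these triggers can be caught early with simple profiling. Below is a minimal sketch, assuming a tabular dataset loaded with pandas; the column names and the 20% missingness cutoff are illustrative assumptions, not a prescribed standard.

```python
import pandas as pd

def profile_quality(df: pd.DataFrame, label_col: str) -> dict:
    """Summarize missingness and label balance for a tabular dataset."""
    missing_frac = df.isna().mean()  # per-column fraction of missing values
    label_counts = df[label_col].value_counts(normalize=True)
    return {
        "columns_over_20pct_missing": missing_frac[missing_frac > 0.2].to_dict(),
        "label_distribution": label_counts.to_dict(),
        # A dominant class hints at bias: models may underperform on the
        # underrepresented groups.
        "max_class_share": float(label_counts.max()),
    }

# Toy data with hypothetical column names
df = pd.DataFrame({
    "expression": [1.2, None, 3.4, 2.2, None],
    "disease": ["melanoma", "melanoma", "melanoma", "healthy", "melanoma"],
})
print(profile_quality(df, label_col="disease"))
```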

How Do You Define Good Quality for Biomedical Data?

Characteristics of data quality can be divided into two types: intrinsic, which are qualities inherent to the data itself, and extrinsic, which are qualities not directly related to the data's inherent properties.

Intrinsic Data Quality

Intrinsic data quality characteristics are built into the data itself. Enhancing these aspects typically falls to those who generate biomedical data, such as researchers or healthcare professionals conducting studies. Once data is collected, its intrinsic quality is usually fixed and cannot be enhanced.

High-quality intrinsic data is more adaptable to various applications. Quality assurance measures taken during the collection and processing stages of biomedical data can greatly enhance its intrinsic quality. These intrinsic qualities serve as benchmarks to assess if the data meets the necessary standards for analysis.

Key Contributors to Intrinsic Data Quality

1. Experiment Design:

  1. Clearly defined experimental variables
  2. Sufficient number of samples to statistically isolate the effect of the variable(s) of interest from confounding biological or technical factors (see the power-analysis sketch after this list)
  3. Sufficient number of replicates (technical and biological)
  4. Controls (negative and positive)

2. Metadata:

  1. Annotations on the biological system and the samples being studied
  2. Experimental factors, observational variables, and confounding factors
  3. Instruments and technologies used, molecules measured, etc.

3. Measurement:

  1. Use of appropriate technology platforms designed to measure the features of interest at the desired resolution
  2. Stringent quality controls
  3. Measurements that are dependable for downstream analysis
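
To make the sample-size point concrete, here is a minimal sketch using statsmodels' power analysis to estimate how many samples per group are needed to detect a given effect; the effect size, significance level, and target power shown are illustrative assumptions.

```python
from statsmodels.stats.power import TTestIndPower

# Estimate the per-group sample size needed to detect a standardized
# effect (Cohen's d) in a two-group comparison.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.8,  # expected effect size (assumption)
    alpha=0.05,       # significance level
    power=0.8,        # desired probability of detecting the effect
)
print(f"Samples needed per group: {n_per_group:.0f}")  # ~26
```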

Extrinsic Data Quality

Extrinsic data quality refers to the aspects influenced by the systems and procedures that engage with the data post-creation. It encompasses all elements that don't affect the data's inherent quality. 
Enhancing extrinsic data quality typically falls under the responsibility of data custodians and managers, often achieved through meticulous data curation. High levels of extrinsic data quality simplify the process for users to evaluate and utilize pertinent data.

Key Contributors to Extrinsic Data Quality

1. Standardization:

  1. Consistent field names that contain a specific type of metadata
  2. Permissibility of the values in metadata fields (e.g., usage of accepted ontologies; see the validation sketch after this list)

2. Accuracy:

  1. Correctness of the values present in a metadata field 
  2. Correctness of measurements

3. Data Integrity:

  1. Protection of metadata fields from accidental or malicious modification or destruction
  2. Retention of all metadata provided by data generators
  3. Availability of all eligible data from the source
  4. Inclusion of measurements from all samples in a dataset

4. Breadth:

  1. Presence of essential metadata fields for most use cases
  2. Conformance of the metadata to information standards defined by the community

5. Completeness:

  1. Availability of all relevant metadata fields
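
Several of these extrinsic qualities lend themselves to programmatic checks. Below is a minimal sketch that validates a single metadata record for completeness (required fields present) and standardization (values drawn from a permitted vocabulary); the field names and vocabulary are illustrative, not Polly's actual schema.

```python
# Illustrative requirements; not an actual production schema.
REQUIRED_FIELDS = {"disease", "organism", "tissue"}
PERMITTED_ORGANISMS = {"Homo sapiens", "Mus musculus"}  # e.g., terms from an accepted ontology

def validate_metadata(record: dict) -> list:
    """Return a list of extrinsic-quality violations for one metadata record."""
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:  # completeness: all relevant fields available
        errors.append(f"missing required fields: {sorted(missing)}")
    organism = record.get("organism")
    if organism is not None and organism not in PERMITTED_ORGANISMS:
        # standardization: values must come from the permitted vocabulary
        errors.append(f"organism {organism!r} is not a permitted term")
    return errors

print(validate_metadata({"disease": "melanoma", "organism": "human"}))
# ["missing required fields: ['tissue']", "organism 'human' is not a permitted term"]
```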

Ensuring Good Data Quality at Elucidata

At Elucidata, data quality is ensured at nearly every stage of the data delivery process, from ingestion at the source to delivery in the customer’s Atlas on Polly or on a platform of their choice.

Polly is Elucidata’s biomedical data harmonization platform. Polly's powerful harmonization engine processes measurements, links them to ontology-backed metadata, and transforms datasets into a consistent data schema.

Here’s how we ensure good data quality with Polly:

1. Data Integrity with Ontology-Backed Metadata

Data from various public sources often comes in inconsistent formats and with incomplete metadata annotations, leaving out crucial context and making it challenging to use. Additionally, finding specific data within a vast collection is difficult because the descriptions of experimental setups and biological details vary greatly from one study to another, lacking a unified language.

A foundational aspect of Data Quality is the application of ontologies and standardized vocabularies for the annotation of metadata fields such as disease, organism, cell line, tissue, cell type, drugs, and various perturbations. These annotations provide crucial information about the biological entities and interventions being studied. They are essential not only for understanding the focus of the data but also for facilitating the discovery of both new and existing relevant datasets.

On Polly, the implementation of ontologies to annotate metadata fields is critical. It ensures uniformity of terminology across varied data sources and enhances the ability to efficiently navigate and interrogate the data, offering insights and connections that might not be readily apparent without this level of organization.
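
As an illustration of the idea (not Polly's actual implementation), the sketch below maps free-text disease labels from different sources onto a single ontology term through a curated synonym table; the synonyms and the MONDO ID shown are examples.

```python
# Illustrative synonym table; real systems use full ontologies (e.g., MONDO).
SYNONYM_TO_ONTOLOGY = {
    "breast cancer":     ("MONDO:0007254", "breast cancer"),
    "breast carcinoma":  ("MONDO:0007254", "breast cancer"),
    "carcinoma, breast": ("MONDO:0007254", "breast cancer"),
}

def annotate_disease(raw_label: str):
    """Map a free-text disease label to (ontology_id, canonical_label), or None."""
    return SYNONYM_TO_ONTOLOGY.get(raw_label.strip().lower())

for label in ["Breast Cancer", "carcinoma, breast", "unknown tumour"]:
    print(label, "->", annotate_disease(label))
```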

2. Consistently Processed and Quality Controlled

Ensuring data quality in biomedical research, particularly in omics studies, requires consistent processing and rigorous quality control measures. This involves the use of sophisticated bioinformatics tools and methodologies to standardize data processing, improve accuracy, and minimize technical variations.

Consistently Processed: Data must be processed uniformly to ensure comparability across different datasets. This includes using standardized protocols for data normalization, alignment, and quantification. Tools like STAR (for RNA sequencing data alignment) and Kallisto (for quantifying gene expression levels) are crucial in this step, as they provide reliable and efficient ways to process large omics datasets.

Quality-Controlled: Quality control (QC) metrics are essential to evaluate the integrity and usability of the data. QC metrics can include assessments of read quality, alignment rates, and the presence of potential contaminants. Implementing rigorous QC checks at various stages of data processing helps in identifying and correcting issues that could compromise data quality.

At Elucidata, a standardized NGS pipeline is used to process raw data from public sources such as the Sequence Read Archive (SRA).
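
To give a flavor of QC gating (the metric names and thresholds here are illustrative, not those of Elucidata's pipeline), the sketch below flags samples whose metrics fall below minimum cutoffs before the data moves downstream.

```python
# Illustrative thresholds; actual pipelines tune these per assay.
QC_THRESHOLDS = {
    "mean_read_quality": 30.0,  # minimum mean Phred score
    "alignment_rate": 0.70,     # minimum fraction of reads aligned
}

def qc_flags(sample_metrics: dict) -> list:
    """Return human-readable flags for metrics that fail their cutoff."""
    return [
        f"{metric} = {sample_metrics.get(metric)} below threshold {cutoff}"
        for metric, cutoff in QC_THRESHOLDS.items()
        if sample_metrics.get(metric, 0.0) < cutoff
    ]

print(qc_flags({"mean_read_quality": 34.2, "alignment_rate": 0.55}))
# ['alignment_rate = 0.55 below threshold 0.7']
```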

3. Structured into an Extensible and Usable ‘Data Model’

In organizations, where large and diverse datasets are generated from various experimental techniques, data models provide a structured framework for organizing, storing, analyzing, and interpreting complex biological information. The concept of creating “standards” through a common data model is recognized as good data management and stewardship practice.

What is a Data Standard?

A data standard outlines the information that should be captured and exposed to users for effective re-use of data.

The notion of defining standards for the sharing and reuse of biological data is not new. Over the last 20 years, several organizations and consortia have defined standards, often called minimum information standards, for pre-clinical and clinical data. Some are general guidelines that outline the types of information to capture for data gathered with specific technologies, such as MIAME for microarray data, MINSEQE for high-throughput sequencing data, MIAPE for proteomics data, MIFlowCyt for flow cytometry data, and CIMR for metabolomics data. Others not only provide guidelines on the type of information to capture but also standardize the format in which it should be captured, e.g. MIxS for genomic data and SDTM (CDISC) for clinical data.

What do these Data Quality Standards Mean for Polly?

Polly is a platform that offers machine learning-ready biomedical data from various public and private repositories. As this data is integrated into Polly and made accessible to our users, it's crucial to align the data's content with the needs of Polly's data users. Therefore, establishing data standards for Polly is essential for several reasons:

  • To identify and define the specific information that users deem valuable for each data category.
  • To strategize and focus on curating the most critical information for each data type hosted on Polly.
  • To evaluate new data sources, whether public or private, and assess the curation efforts they would require.
  • To ensure a user-centric approach in the data ingestion, preprocessing, and curation phases on Polly.

Framework for Defining Data Standards on Polly

Establishing Information Categories

Defining data standards through a data model contributes to the efficiency and effectiveness of research and development efforts in areas such as drug discovery and personalized medicine.

Data models provide a structured representation of data and its relationships, facilitating understanding, communication, and implementation.

Most importantly, data models serve as a framework for how data should be collected, harmonized, and stored for efficient retrieval and analysis.
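
As a sketch of what such a data model can look like in code, the example below defines datasets and samples with explicit relationships and one simple consumption journey (retrieval by a standardized field); the classes, fields, and dataset ID are illustrative, not Polly's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    sample_id: str
    tissue: str
    disease: str

@dataclass
class Dataset:
    dataset_id: str
    organism: str
    samples: list = field(default_factory=list)

    def samples_for_disease(self, disease: str) -> list:
        """One consumption journey: retrieve samples by a standardized field."""
        return [s for s in self.samples if s.disease == disease]

ds = Dataset("DS-001", "Homo sapiens")  # hypothetical dataset ID
ds.samples.append(Sample("S1", "lung", "adenocarcinoma"))
ds.samples.append(Sample("S2", "lung", "healthy"))
print(ds.samples_for_disease("adenocarcinoma"))
```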

Based on a review of available information standards for pre-clinical and clinical data, we have identified a framework of twelve information categories that helps us define the overall data standards for different types of data on Polly.

At Elucidata, we follow a holistic approach to generating data-type-specific data models: existing data models are reviewed, and key data access patterns are identified through extensive literature review and data audits to define specific consumption journeys for different data types. In a nutshell, the consumption journey dictates how the data is modeled and stored on Polly.

Polly-verified Data: Ensuring Gold-standard Quality

At Elucidata, the emphasis on data quality is paramount, and this is reflected in the comprehensive measures taken at every stage of the data lifecycle. From the initial ingestion of data to its final delivery on Polly, Elucidata employs a multi-faceted approach to ensure that data integrity, standardization, and quality control are maintained at the highest levels. 

‘Polly-verified Data’ by Elucidata stands as the gold standard in data quality for early-stage drug development. Achieving this involves a meticulous process on our biomedical data harmonization platform, Polly. Our commitment to data quality is backed by a thorough Quality Assurance check comprising around 50 steps, ensuring reliability and accuracy. Each harmonized dataset comes with a verification report covering data quality and assurance checks for UMAP visualizations, gene count distributions, data matrices, metadata information, and more. You can look at a sample report here.
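
To give a flavor of one automated check such a report might contain (an illustration, not Elucidata's actual QA code), the sketch below summarizes the gene-count distribution of an expression matrix and confirms the data matrix has no missing entries.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(lam=5, size=(100, 2000))  # samples x genes; random stand-in data

genes_detected = (counts > 0).sum(axis=1)  # genes with nonzero counts, per sample
report = {
    "n_samples": counts.shape[0],
    "n_genes": counts.shape[1],
    "median_genes_detected_per_sample": float(np.median(genes_detected)),
    "min_genes_detected_per_sample": int(genes_detected.min()),
    # Data-matrix integrity: no missing entries
    "has_missing_values": bool(np.isnan(counts.astype(float)).any()),
}
print(report)
```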

Our Polly-verified Data epitomizes precision, adhering to stringent standards for consistency, accuracy, and completeness. Through these efforts, Polly not only upholds the integrity of the data it manages but also fosters an environment where data-driven insights can thrive, ultimately accelerating scientific discovery and innovation in biomedicine.

Connect with us or reach out to us at info@elucidata.io to learn more. 
