
Don't Feed Junk to Your Model

Deepthi Das, Trisha Dhawan
January 19, 2023

If an ML model that identifies the target audience for an ad gives an inaccurate output, the cost is lost revenue. If an ML model that diagnoses or predicts a disease gives an inaccurate output, the stakes are much higher: it could cost human lives. There is no easy way to get an accurate domain-trained model, but one factor that improves accuracy is in your hands: what you feed your model. Curious to explore more? Read on.


How ML Models Have Evolved to Work on Biomedical Data - The Use Cases

With the advent of high-throughput technologies for pre-clinical and clinical experimentation, the volume of data generated daily runs into terabytes. For drug discovery and the derivation of biological insights, this data is a goldmine. Accessing, processing, and using it manually is impractical, so there is a pressing need for automation in the data processing pipeline.

It is now widely recognized that ML models can, and need to, be used to accelerate data analysis and shorten the ‘bench to bedside’ time in drug development.

In 2020, about 18 candidates that were identified using AI-driven tools entered clinical pipelines.

Multiple projects that apply machine learning algorithms such as random forests, deep learning algorithms such as convolutional neural networks (CNNs), or sophisticated NLP algorithms to biomedical data are underway. Examples include the development of microbiome therapeutics, precision medicine for a rheumatoid arthritis blockbuster, and the scanning of research publications and databases for biomarkers of stroke.

Evidently, AI-driven data solutions hold the potential to solve these challenges.

However, training AI/ML models specifically for biomedical data is no mean feat. ML models are constantly being improved and are increasingly used for the applications discussed above. But as a model's size and number of parameters increase, so does its opacity. This creates a serious risk of inaccurate results or wrong predictions that cannot be traced or rectified, because we do not understand exactly how the model reached a specific conclusion. Hence, it is imperative that the data fed to the model is high quality, accurate, and unbiased, as well as sufficient and representative, so that the model's output is as accurate as possible. The complexity and heterogeneity of biological data add a further layer of intricacy to the training process.

The Challenges in Using Big Biological Data to Train ML Models

The general challenges of big data, namely the 3 Vs (volume, velocity, and variety), apply to ML in the biological context as well.


Figure: The 3 Vs of big data in life sciences

Apart from these, some distinct characteristics of the complex data landscape of biology pose unique challenges when the data is used to train models.

1. Bias

Bias can be introduced into the machine learning process at different phases of a model's development. Insufficient data, inconsistent data collection, and poor data practices can all lead to bias in the model's decisions. For example, a study revealed that only 50% of the articles published in 2014 reported the ‘age’ and ‘sex’ of the mice that were used in the experiments.

This is an alarming finding: a model trained on such data could produce biased outcomes because important information is missing. The same study also found evidence of different levels of sex bias across research areas: the strongest male bias was observed in cardiovascular disease models, and the strongest female bias was found in infectious disease models. Since age and sex play a decisive role in most disease progressions, scientists should correct for such biases in a training set.
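As a rough illustration of what checking for this kind of gap might look like, the sketch below counts how often ‘age’ and ‘sex’ are filled in and how the reported sexes are split. The function name `audit_metadata`, the field names, and the sample records are all hypothetical; real sample metadata will need its own field mapping.

```python
from collections import Counter

def audit_metadata(samples, fields=("age", "sex")):
    """Report how often each metadata field is filled in, and the sex split."""
    total = len(samples)
    if total == 0:
        raise ValueError("no samples to audit")
    # Fraction of samples with a non-empty value for each field
    completeness = {
        f: sum(1 for s in samples if s.get(f) not in (None, "")) / total
        for f in fields
    }
    # Count of each reported sex, ignoring samples where it is missing
    sex_counts = Counter(
        s["sex"] for s in samples if s.get("sex") not in (None, "")
    )
    return completeness, sex_counts

# Made-up records, for illustration only
samples = [
    {"id": "m1", "age": 8, "sex": "male"},
    {"id": "m2", "age": None, "sex": "male"},
    {"id": "m3", "age": 10, "sex": "female"},
    {"id": "m4", "age": 12, "sex": None},
]
completeness, sex_counts = audit_metadata(samples)
print(completeness)
print(dict(sex_counts))
```

A low completeness value or a lopsided sex split is a prompt to look closer at the dataset, not proof of bias on its own.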

“By 2022, 85% of AI projects will deliver erroneous outcomes due to bias in data, algorithms, or the teams responsible for managing them.” ~ Gartner, Inc.

2. Lack of a Standard Ontology

In life sciences research, domain-specific knowledge is often encoded in ontologies and in the data- and knowledge-bases that use ontologies for annotation. Hundreds of ontologies have been developed, spanning almost all biological and biomedical research domains.

For instance, one researcher may follow ChEBI or PubChem to document drug information, while another records it using common drug names. All of these terms make sense to a domain expert, but for a model to ingest them, they have to be unified into a single machine-identifiable format. Biomedical terms are specific, and training the model on these terms is critical for it to make the right decisions.
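One simple way to picture this unification is a synonym lookup that maps every spelling of a drug to one canonical identifier. The sketch below is a minimal illustration: the table `SYNONYM_TO_ID`, the `DRUG:` identifiers, and the function `normalize_drug` are invented for this example and are not real ChEBI or PubChem identifiers. A real pipeline would build the table from an ontology release.

```python
# Hypothetical lookup table: keys are lowercase synonyms, values are made-up
# canonical IDs (not real ChEBI or PubChem identifiers).
SYNONYM_TO_ID = {
    "aspirin": "DRUG:0001",
    "acetylsalicylic acid": "DRUG:0001",
    "asa": "DRUG:0001",
    "paracetamol": "DRUG:0002",
    "acetaminophen": "DRUG:0002",
}

def normalize_drug(name):
    """Map a free-text drug name to a canonical ID, or None if unknown."""
    key = " ".join(name.lower().split())  # lowercase and collapse whitespace
    return SYNONYM_TO_ID.get(key)

records = ["Aspirin", "acetylsalicylic  acid", "Acetaminophen", "ibuprofen"]
normalized = [normalize_drug(r) for r in records]
print(normalized)
```

Names that return `None` are the ones a curator would need to review before the data reaches a model.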

3. Data Availability on Specific Use Cases like Rare Diseases

Rare diseases like diastrophic dysplasia, a genetic disorder that affects 1 in 100,000 newborns, will have very limited data for training and evaluating ML models. This can lead to overfitting and poor generalization of the models. Moreover, in many cases, this limited data is not curated, making finding relevant data the first bottleneck in the model training process.
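A quick back-of-the-envelope calculation shows why so few positive examples are a problem. The sketch below takes the 1-in-100,000 figure above and a hypothetical cohort of one million newborns, then computes what a trivial model that always predicts "no disease" would score. The function name and the cohort size are assumptions made for illustration.

```python
def always_negative_metrics(n_samples, one_in):
    """Metrics for a model that predicts 'no disease' for every sample."""
    expected_cases = n_samples / one_in            # expected true positives in the cohort
    accuracy = (n_samples - expected_cases) / n_samples
    recall = 0.0                                   # it never finds a single case
    return expected_cases, accuracy, recall

# Hypothetical cohort of 1,000,000 newborns at a 1-in-100,000 incidence
cases, accuracy, recall = always_negative_metrics(1_000_000, 100_000)
print(cases, accuracy, recall)
```

With roughly ten expected cases in a million records, accuracy alone says almost nothing about whether the model has learned the disease, which is why recall and careful validation matter for rare conditions.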

4. Contextual Understanding of Data to Find Relevant Patterns

Some biological terms look very similar to words used in general text but have specific meanings in context.

Take ‘fish and chips’: in general text, it is a food item; in a biological context, it can stand for Fluorescence in situ hybridization (FISH) and microarray chips. Acronyms, too, take their meaning from the context in which they are used. For example, IBD could be ‘Inflammatory Bowel Disease’ or ‘Identity-By-Descent.’ This contextual understanding of a term is crucial for training a semantic model for knowledge graph generation or an NLP model for publication scanning.
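To make the IBD example concrete, here is a deliberately naive sketch that picks an expansion by counting cue words in the surrounding sentence. The cue-word sets in `SENSES` and the function `disambiguate` are hypothetical and hand-picked for illustration; production systems typically rely on trained language models rather than keyword overlap.

```python
import re

# Hypothetical cue words per expansion; illustrative, not a curated vocabulary.
SENSES = {
    "Inflammatory Bowel Disease": {"colitis", "crohn", "gut", "inflammation", "patients"},
    "Identity-By-Descent": {"allele", "haplotype", "pedigree", "segments", "relatedness"},
}

def disambiguate(sentence, senses=SENSES):
    """Return the expansion whose cue words overlap most with the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(tokens & cues) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)  # on a tie, the first listed sense wins
    return best if scores[best] > 0 else None  # None means no cue word matched

a = disambiguate("IBD patients with ulcerative colitis showed gut inflammation.")
b = disambiguate("Shared IBD segments were inferred from haplotype data.")
c = disambiguate("IBD was mentioned once.")
print(a, b, c, sep="\n")
```

The third sentence has no cue words, so the function declines to guess. That is the case where a model without domain context is most likely to go wrong.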

What to Feed Your ML Model to Generate Accurate Results

It seems fairly straightforward to say that a model trained on high-quality data gives accurate results. But what are the attributes of this ‘high-quality data’? Does it only mean data that is machine-readable? Obviously not. As discussed above, there are several aspects of data that go beyond structuring data in a machine-readable format.

Here is a checklist of high-quality training data attributes that you can look at or save for later:

  • Accuracy: The data should be consistent and free of errors. All data points should be available and statistical measures taken to ensure that the data is valid and accurate.
  • Precise labels: The data should be labeled and annotated correctly. Labels allow analysts to isolate variables within datasets, which enables the selection of the right features to optimize predictive ML models.
  • Recency: Some studies/techniques become obsolete with time. The discovery of gene-protein interactions and other aspects of epigenetics has introduced evidence that the environment can influence inheritable traits, something once considered a genetic impossibility. Using more recent data in the relevant cases would help tackle this.
  • No bias: Ensure that the data has no inherent experimental bias. This can be assessed by checking the metadata. For example, when trying to identify a target in a disease progression using differential gene expression, it is important to ensure that the training datasets are evenly distributed across different ethnicities and genders.
  • Relevance: The data used should be relevant to the problem at hand. For example, if you want to train a model to detect breast cancer, you probably shouldn't feed it X-rays of the leg or CT scans of the brain. It is also essential to ask your model the right questions to ensure accurate results.
  • Domain-specific pre-training: Models such as BioBERT are pre-trained on biological corpora, making biomedical text-mining tasks faster and better. At Elucidata, we optimized our specialized BERT model, PollyBERT, and compared it with other large language models for Named Entity Recognition (NER) tasks. We found that it outperformed GPT-J-6B and OPT-175B by far!
  • Diversity: ML models are trained on large volumes of data to improve the accuracy and prediction of the model; however, without emphasis on attributes such as diversity and data distribution, the model remains biased.
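Several items on this checklist can be turned into simple automated checks. The sketch below is one possible starting point, assuming each record is a dictionary with hypothetical `label`, `year`, and `sex` fields; the function name, thresholds, and sample records are invented for illustration and would need to be adapted to a real dataset.

```python
from collections import Counter

def check_training_data(records, allowed_labels, min_year, group_field):
    """Return simple quality indicators for a list of record dicts."""
    required = ("label", "year", group_field)
    # Accuracy / completeness: records missing any required field
    incomplete = [r for r in records if any(r.get(k) is None for k in required)]
    # Precise labels: labels outside the agreed vocabulary
    bad_labels = [r for r in records
                  if r.get("label") is not None and r["label"] not in allowed_labels]
    # Recency: records older than the chosen cutoff year
    stale = [r for r in records if r.get("year") is not None and r["year"] < min_year]
    # Diversity / bias: how records are spread across one metadata field
    groups = Counter(r[group_field] for r in records if r.get(group_field) is not None)
    return {
        "n_incomplete": len(incomplete),
        "n_bad_labels": len(bad_labels),
        "n_stale": len(stale),
        "group_counts": dict(groups),
    }

# Made-up records, for illustration only
records = [
    {"label": "tumor", "year": 2021, "sex": "female"},
    {"label": "normal", "year": 2009, "sex": "male"},
    {"label": "tumour", "year": 2020, "sex": "male"},
    {"label": "normal", "year": 2022, "sex": None},
]
report = check_training_data(records, {"tumor", "normal"}, min_year=2015, group_field="sex")
print(report)
```

Checks like these only flag candidates for review. Relevance, in particular, still needs a domain expert's judgment.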

Each one of these data attributes seems like an obvious choice. What’s the catch, then? We have great models. Why are we not accurately predicting genetic diseases faster and designing precision medicines sooner?

One of the major hurdles is that this high-quality data, with all the attributes mentioned above, is hiding in a sea of biological data that has been, and is still being, generated. It is neither curated (standardized and harmonized) nor FAIR (Findable, Accessible, Interoperable, and Reusable). A researcher has to spend hours finding relevant data and thoroughly reviewing each dataset to check whether it fits the above criteria. Imagine doing this for the thousands of relevant datasets needed to train a model. This is where we add value!

At Elucidata, we strive to support and accelerate biological discoveries using ML models by providing highly curated, ML-ready data. We have the world’s largest collection of single-cell and bulk RNA-seq datasets, which help you take a leap towards faster insight discovery. Talk to us to explore more!

This post was originally published in Polly Bits, our biweekly newsletter on LinkedIn.
