AI Approaches for Long-Tailed Biomedical Data

Artificial intelligence (AI) is increasingly used in biology because inexpensive technology now lets us generate and store vast quantities of biomedical data. Recent years have seen an explosion of omics data that capture many different but complementary biological layers, including genomics, epigenomics, transcriptomics, proteomics, and metabolomics. Large amounts of data are also available on the structures of biomolecules (proteins, metabolites, RNAs, and so on) and small molecules, on interactions between molecules, and on patients, in the form of clinical data from electronic health records. These data, along with imaging and textual data from public and proprietary sources and the literature, are being mined to develop domain-specific models for a variety of use cases: understanding health and disease in pre-clinical and clinical drug discovery, and biomedical Natural Language Processing (NLP) applications.

As AI practitioners increasingly use these data, it is important to be aware that many interesting phenomena in biology involve events that are rare. Most diseases are rare events since they occur in a very small proportion of the total population (Figure 1), and even among diseases, some are rarer than others.

Figure 1: Distribution of diseases in the population [1]

Out of hundreds of millions of known variants in human DNA, only a small percentage are pathogenic, and many of these are rare. The inherent variation among individuals and differences in the prevalence of events of interest, such as diseases, often show up as long-tail events in real-world biomedical datasets that sample large populations or combine data collected from different sources. For example, a long tail is apparent in the distribution of diseases mentioned in the datasets submitted to a disease-agnostic public gene expression database, the Gene Expression Omnibus (GEO) (Figure 2).

Figure 2: Distribution of top 50 diseases mentioned in a disease-agnostic public gene expression database (Gene Expression Omnibus)

If we randomly sample the metadata from all datasets in GEO to train a biocuration model that predicts the presence of a disease term, the model is likely to perform well on the frequently mentioned disease terms but not on the terms that are in the minority. Similarly, the distribution of length of stay in the ICU in a large electronic medical record dataset, MIMIC-III, shows a long tail (Figure 3), and any model designed to predict a patient's length of stay in the ICU will need to account for both majority and minority classes.

Figure 3: Distribution of length of stay in the ICU in an electronic medical record dataset, MIMIC-III [2]
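Before training on data like the GEO disease terms above, it can help to measure how long the tail actually is. The following is a minimal sketch, not a standard definition: the function name `head_and_tail`, the 80% cut-off, and the toy label counts are all assumptions made for illustration.

```python
from collections import Counter

def head_and_tail(labels, head_share=0.8):
    """Split classes into head and tail.

    The head is the smallest set of most frequent classes that together
    cover at least `head_share` of all samples; the rest is the tail.
    """
    counts = Counter(labels).most_common()  # sorted by frequency, descending
    total = sum(n for _, n in counts)
    covered, head = 0, []
    for label, n in counts:
        if covered >= head_share * total:
            break
        head.append(label)
        covered += n
    tail = [label for label, _ in counts[len(head):]]
    return head, tail

# Hypothetical disease-term counts, not real GEO numbers.
labels = (["cancer"] * 70 + ["diabetes"] * 15 + ["rare_a"] * 8
          + ["rare_b"] * 5 + ["rare_c"] * 2)
head, tail = head_and_tail(labels)
print("head classes:", head)
print("tail classes:", tail)
```

In this toy example, two classes cover most of the samples while three classes share the remainder, which is the pattern a randomly sampled training set would inherit.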

Limitations of Conventional AI Approaches for Long-Tailed Data Problems

Current AI techniques are not well equipped to handle long-tailed data distributions. Supervised learning models trained by the common practice of empirical risk minimization tend to perform well on common inputs (i.e., the head of the distribution) but struggle where examples are sparse (the tail). The trained model can be easily biased towards head classes with massive training data, leading to poor model performance on tail classes that have limited data (Figure 4). In such cases, big data problems essentially become small data problems.

Figure 4: Performance of supervised learning models with long-tailed data
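A small sketch shows why aggregate metrics can hide this bias. It assumes an extreme, hypothetical model that always predicts the head class; the class names and the 95/5 split are made up for illustration.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that the model predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical long-tailed test set: 95 head examples, 5 tail examples.
y_true = ["common"] * 95 + ["rare"] * 5
# A model biased towards the head class predicts it for every input.
y_pred = ["common"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
macro_recall = sum(recall.values()) / len(recall)

print(f"accuracy:     {accuracy:.2f}")
print(f"recall:       {recall}")
print(f"macro recall: {macro_recall:.2f}")
```

Overall accuracy looks high because the head dominates the average, while recall on the tail class is zero. Per-class or macro-averaged metrics make the failure visible.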

The conventional way to mitigate the “small data problem” is to collect more training data and retrain the models. However, this process is very costly in terms of time, compute, and data labeling. It also does not scale well: it has been observed that the marginal benefit of additional data tapers off rapidly. As a rule of thumb, one would need a 10-fold increase in training data volume to achieve a 2-fold subjective improvement in model performance. Because simply adding more data suffers from “diseconomies of scale,” the focus of the research and developer community has naturally shifted from the quantity of data to its quality. This shift aligned well with the concurrent emergence of the Data-centric AI paradigm.
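To get a feel for the 10-fold/2-fold rule of thumb, the sketch below extrapolates it. It assumes the rule holds as a power law at every scale (each further 2-fold improvement costs another 10-fold increase in data), which is an illustrative assumption rather than something the rule guarantees; the function name `data_multiplier` is hypothetical.

```python
import math

def data_multiplier(improvement, data_factor=10.0, gain_factor=2.0):
    """Data increase needed for a target improvement, assuming every
    `data_factor`-fold increase in data yields a `gain_factor`-fold gain."""
    doublings = math.log2(improvement) / math.log2(gain_factor)
    return data_factor ** doublings

for target in (2, 4, 8):
    print(f"{target}x improvement -> about {data_multiplier(target):,.0f}x data")
```

Under this assumption, the implied scaling exponent is log10(2), roughly 0.3, so an 8-fold improvement would already require about a thousand times more data.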

Data-Centric AI for Long-Tailed Data

Data-centric AI, which is fast emerging as an alternative to the model-centric paradigm, helps alleviate the problem of “diseconomies of scale” by iteratively turning big data into good data while the model itself remains relatively fixed. Good data is mainly curated through the following iterative steps, typically integrated with the MLOps processes:

Data is defined consistently by making the definition of labels unambiguous; this helps in getting high-quality training data for the long tail.
Important cases are covered, which includes good coverage of long-tailed events.
The model receives timely feedback from production data about data drift and concept drift.
The dataset is sized appropriately for best model performance.
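The coverage and drift-feedback steps above can be approximated with a simple comparison of class shares. This is a minimal sketch under stated assumptions: it presumes a small labeled audit sample from production is available, uses total variation distance as one possible drift score, and all names and counts are hypothetical.

```python
from collections import Counter

def class_shares(labels):
    """Relative frequency of each class in a list of labels."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

def total_variation(train_labels, prod_labels):
    """Half the summed absolute difference in class shares (0 = identical)."""
    p, q = class_shares(train_labels), class_shares(prod_labels)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

# Hypothetical label samples, for illustration only.
train = ["common"] * 90 + ["rare"] * 10
production = ["common"] * 70 + ["rare"] * 20 + ["unseen_disease"] * 10

drift = total_variation(train, production)
unseen = set(production) - set(train)  # classes with no training coverage
print(f"label drift score: {drift:.2f}")
print("classes missing from training data:", unseen)
```

A rising score, or a non-empty set of unseen classes, would be a signal to collect and label more examples from the tail before retraining.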

As the steps above make clear, what to label and how to label it are both important considerations when curating good data. While the ‘how to label’ aspect is critical for the quality of labels in general, the ‘what to label’ aspect addresses the issue of imbalanced training data. Since this article is about the challenges of long-tailed data, we focus on the ‘what’ part: identifying the data points in the long tail that should be labeled. Because annotated data is scarce, especially in the tail, this identification is typically handled within the MLOps processes using semi-supervised learning-based approaches. Broadly, two such approaches are used in this context: active learning and weakly supervised learning. Let us see how each can be adapted to long-tailed data distributions.

Active Learning

Active learning, in general, reduces the amount of labeled data needed to maximize model performance by searching a pool of unlabeled data for the most informative data points to label. In this context, that search can be directed at edge cases, that is, the long tail of the data. Typically, the following steps are used to implement an active learning feedback loop that examines the difficult edge cases where the model is failing and collects more of that data:

Getting inferences from the model trained on the currently labeled dataset.
Finding the distribution patterns corresponding to edge cases where the model performs poorly, then finding more such edge-case data points in the production data stream, labeling them, and adding them to the training dataset. This is called mining the long tail.
Training the model on the revised dataset and evaluating its performance in production.
Repeating the above steps until the model performs well on the production data stream.
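The loop above can be sketched in a few lines. This is a toy illustration under several assumptions, not the pipeline described in this article: the model is a one-dimensional nearest-centroid classifier, "edge cases" are approximated by least-confidence sampling, the `oracle` function stands in for a human annotator, and the loop runs for a fixed number of rounds instead of checking production performance.

```python
import math

def train_centroids(labeled):
    """Fit a nearest-centroid model: one mean value per class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_proba(centroids, x):
    """Softmax over negative distances to each class centroid."""
    scores = {y: -abs(x - c) for y, c in centroids.items()}
    top = max(scores.values())
    exp = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exp.values())
    return {y: e / total for y, e in exp.items()}

def least_confident(centroids, pool, k):
    """Pick the k pool points whose top predicted probability is lowest."""
    confidence = lambda x: max(predict_proba(centroids, x).values())
    return sorted(pool, key=confidence)[:k]

def active_learning_loop(labeled, pool, oracle, rounds=2, k=2):
    labeled, pool = list(labeled), list(pool)  # work on copies
    for _ in range(rounds):
        model = train_centroids(labeled)            # train on current labels
        for x in least_confident(model, pool, k):   # mine uncertain points
            labeled.append((x, oracle(x)))          # label and add them
            pool.remove(x)
    return train_centroids(labeled), labeled        # final retrained model

def oracle(x):
    """Stand-in for a human annotator (hypothetical labeling rule)."""
    return "rare" if x > 2.5 else "common"

seed = [(0.0, "common"), (0.2, "common"), (5.0, "rare")]
pool = [0.0, 0.1, 2.4, 2.6, 5.0, 5.1]
first_pick = least_confident(train_centroids(seed), pool, k=2)
model, labeled = active_learning_loop(seed, pool, oracle)
print("first points sent for labeling:", first_pick)
print("labeled set size after two rounds:", len(labeled))
```

In the first round, the points closest to the boundary between the two classes are selected, because that is where the toy model is least sure. In practice, the selection criterion, the stopping rule, and the annotator would all be specific to the task.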

Weak or Semi-Supervised Learning

Active learning requires manual intervention, informed by some knowledge of the data distribution and by feedback on model performance, which makes it difficult to scale to large volumes of unlabeled data. Weak or semi-supervised learning can also be used to label data and progressively improve the quality of the labels, which alleviates this problem. This is an active area of research; typically, a combination of labeled and unlabeled data is used to train the model iteratively. However, this approach does not work well with imbalanced data, for the same reason that supervised learning fails on long-tailed data: the edge cases are under-represented. Techniques such as class-imbalanced sampling combined with weak labeling functions, often built on domain knowledge (e.g., biomedical ontological information), could be explored to make semi-supervised learning robust to long-tailed data distributions.
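As a rough sketch of how these two ideas can fit together, the example below pairs a keyword-based weak labeling function with inverse-frequency sampling weights. Everything in it is hypothetical: the two-entry `ONTOLOGY` dictionary is a stand-in for a real biomedical ontology, and the texts are invented dataset descriptions.

```python
from collections import Counter

# Hypothetical mini-ontology: disease label -> synonyms (illustrative only).
ONTOLOGY = {
    "breast cancer": ["breast cancer", "breast carcinoma"],
    "cystic fibrosis": ["cystic fibrosis", "mucoviscidosis"],
}

def weak_label(text):
    """Return the label whose synonyms occur most often, or None to abstain."""
    lowered = text.lower()
    votes = Counter()
    for label, synonyms in ONTOLOGY.items():
        votes[label] = sum(lowered.count(s) for s in synonyms)
    label, hits = votes.most_common(1)[0]
    return label if hits > 0 else None

def balanced_weights(labels):
    """Inverse-frequency weights: every class gets the same total weight."""
    counts = Counter(labels)
    return [1.0 / (len(counts) * counts[y]) for y in labels]

texts = [
    "RNA-seq of breast cancer cell lines",
    "Breast carcinoma biopsy, tumor vs normal",
    "Breast cancer xenograft treated with drug",
    "Single-cell profiling of breast cancer",
    "Airway epithelium from mucoviscidosis patients",
    "Yeast heat-shock time course",
]
labels = [weak_label(t) for t in texts]
kept = [(t, y) for t, y in zip(texts, labels) if y is not None]
weights = balanced_weights([y for _, y in kept])
for (text, label), w in zip(kept, weights):
    print(f"{w:.3f}  {label:16s} {text}")
```

The resulting weights could be passed to `random.choices(kept, weights=weights, k=...)` so that training batches draw the tail class as often as the head class. Real labeling functions are noisier than this keyword match, so their outputs are usually combined and validated before use.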

At Elucidata, we are exploring the above approaches in the context of automated annotation of biomolecular datasets with BERT-based NLP models. As discussed above, training an NLP model for high-fidelity performance is very challenging because of the long-tailed phenomenon inherent to biomolecular data, which results in poor-quality annotations. An active-learning-based approach has helped us address this problem by generating training data that covers the long tail, which in turn has resulted in very high-quality dataset annotations.

Contact us if you want to take advantage of our data-centric platform, Polly, to find and analyze relevant datasets or learn more about using our 1.5 million curated datasets to train your models.
