Data-centricity Is the Future of Biomedical NLP

Natural Language Processing (NLP) has a variety of applications in the biomedical domain, including Named Entity Recognition, Document Classification, Relationship Discovery, and Information Extraction. The global market size for NLP applications in healthcare and life sciences is expected to grow from USD 1.8 billion in 2021 to USD 4.3 billion by 2026. Until a few years ago, much of this NLP work relied on a diverse set of task-specific models. However, with the advent of foundational models based on transformers, the field has converged massively. Nowadays, many NLP models are built by fine-tuning these foundational models for specific tasks. Even so, harnessing the power of NLP in the biomedical domain will require a lot more than just the development of larger and larger language models.

NLP Models - Bigger Is Not Always Better

The success of pre-trained language models since the introduction of the Transformer architecture in 2017 has led to a kind of arms race in the NLP domain. Every few months, we see the release of an even bigger model with many more parameters. However, increasing the number of parameters beyond a point becomes prohibitive in terms of cost and time. According to some estimates, training GPT-3 cost around USD 12 million and took a few weeks on a large cluster of powerful GPUs. Encouragingly, some studies have shown that much smaller models such as T0, with 11 billion parameters, perform better than GPT-3, with 175 billion parameters, on many language tasks.

The BERT-Large model released by Google had 340 million parameters, GPT-3 by OpenAI has 175 billion parameters, and PaLM by Google has 540 billion parameters. Efforts are in full swing to train even larger language models!
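To give a rough sense of scale, the short sketch below converts the parameter counts quoted above into the memory needed just to hold the weights. The 2 bytes per parameter figure is an assumption (16-bit weights), and the helper name `weight_memory_gb` is purely illustrative; real training needs considerably more memory for gradients, optimizer state, and activations.

```python
# Parameter counts as quoted in the article.
PARAM_COUNTS = {
    "BERT-Large": 340e6,
    "GPT-3": 175e9,
    "PaLM": 540e9,
}

# Assumption: each weight is stored in 16 bits (2 bytes).
BYTES_PER_PARAM = 2


def weight_memory_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Memory in gigabytes (10^9 bytes) needed to store the weights alone."""
    return n_params * bytes_per_param / 1e9


for name, n_params in PARAM_COUNTS.items():
    print(f"{name}: about {weight_memory_gb(n_params):,.2f} GB of weights")
```

Under this assumption, even storing GPT-3's weights takes hundreds of gigabytes, which is one reason why repeatedly training ever larger models is out of reach for most teams.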

Comparative performance of GPT-3 (175 billion parameters) and the T0 model (11 billion parameters). T0 clearly beats GPT-3 on some tasks.
Image source: https://huggingface.co/blog/large-language-models

NLP Models Need Re-training with Biomedical Data

Soon after the release of the BERT model in 2018, it became clear that this pre-trained model could not be used directly for biomedical applications, since it had been trained on general English texts that lack sufficient biomedical context. After all, ML models can only learn the patterns they see in the data! To address this issue, two models, PubMedBERT and BioBERT, were subsequently released after training the original BERT model architecture on millions of texts from the PubMed database, which is the standard repository for biomedical research articles. Interestingly, these two models trained specifically on biomedical texts significantly outperform the original pre-trained BERT model on most biomedical tasks. This raises the question of what happens when a new, larger foundational model is released. Do we need to train the new architecture on biomedical data all over again? Doing so is very expensive and time-consuming. It is also not clear a priori whether this approach will yield a gain over existing language models pre-trained on PubMed data that is large enough to justify the vast resources put into the training process. Added to this is the limited availability of high-quality labeled data in the biomedical domain, which leads to poor performance by even such large language models.

What We Need: a Data-centric Approach

Schematic of a data-centric approach and how it is different from a model-centric approach.

The primary issue in the BioNLP domain, then, is:

How can we further improve prediction accuracy without changing the underlying foundational model?

Labeling biomedical data for NLP tasks is usually a tedious process and prone to human error. This is where data-centric models come into the picture: they can significantly improve final performance by iteratively improving the quality of the data while leaving the final predictive model unchanged.

With predictive models reaching saturation in their development, further improvement in model performance will naturally have to come from improved data quality. The scarcity of high-quality data for biomedical NLP tasks makes a data-centric approach a strong candidate for making progress. Data-centric models are based on the premise that a relatively small amount of data is enough to achieve human-level accuracy, which makes a careful analysis of this data practical.

A data-centric approach has many facets, one of which is to design algorithms that automatically catch data points that may be wrongly labeled. These can either be sent back for human review or temporarily removed from the training process. Data-centric models also allow ML researchers to encode contextual intelligence into the models, which is crucial for practical applications in the biomedical domain.
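One common way to catch possibly mislabeled data points can be sketched as follows. This is a minimal illustration, not the method used at Elucidata: it assumes we already have out-of-sample predicted class probabilities for every example (for instance from cross-validation), and it flags an example when the model prefers a different class and its confidence in the given label falls below the average confidence for that class. The function name `flag_suspect_labels`, the toy probabilities, and the two hypothetical classes (0 = "disease", 1 = "tissue") are all assumptions made for illustration.

```python
import numpy as np


def flag_suspect_labels(pred_probs, labels):
    """Return indices of examples whose given label looks doubtful.

    pred_probs: array of shape (n_examples, n_classes) holding
        out-of-sample predicted probabilities (assumed to be available).
    labels: integer class labels as assigned by annotators; every class
        must appear at least once.
    """
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_classes = pred_probs.shape[1]

    # Per-class threshold: mean predicted probability of class c
    # over the examples that annotators labeled as c.
    thresholds = np.array(
        [pred_probs[labels == c, c].mean() for c in range(n_classes)]
    )

    # Model confidence in the label each example was actually given.
    self_confidence = pred_probs[np.arange(len(labels)), labels]
    predicted = pred_probs.argmax(axis=1)

    # Suspect: the model prefers another class AND its confidence in the
    # given label is below what is typical for that class.
    suspect = (predicted != labels) & (self_confidence < thresholds[labels])

    # Most doubtful (lowest self-confidence) first.
    idx = np.flatnonzero(suspect)
    return idx[np.argsort(self_confidence[idx])]


# Toy data: the last example is labeled 0 ("disease"),
# but the model is confident that it is 1 ("tissue").
example_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.20, 0.80],
    [0.05, 0.95],
]
example_labels = [0, 0, 1, 1, 0]

suspects = flag_suspect_labels(example_probs, example_labels)
print(suspects.tolist())  # [4]
```

The flagged indices could then be routed to a domain expert for review or held out of the next training round, matching the two options described above.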

Data-centric Models at Elucidata Achieve Human-level Accuracy

Through a careful understanding of the data and the intelligent design of data-centric models, we have achieved human-level accuracy on specific NER tasks, such as identifying cell type, cell line, tissue, and disease, from biomedical abstracts available in public data repositories like GEO.

Data-centric models developed at Elucidata achieve human-level accuracy for several biomedical NER tasks. As can be seen, their performance is much better than that of the basic BERT-based models.

The development of foundational models has profoundly influenced the NLP domain, and research for biomedical applications in particular. But as we advance, it makes a lot more practical sense to focus on developing small data-centric models instead of waiting for the development of larger and larger models. This is where the domain focus of startups like Elucidata plays a critical role, since the development of data-centric models requires close collaboration between ML researchers and domain experts.