Large Language Models (or LLMs) are enormous models, usually tens to hundreds of gigabytes in size, trained on huge amounts of data, generally on the terabyte to petabyte scale.
One of the most famous LLMs right now is GPT-3, with about 175 billion parameters and a 45 TB training corpus. Some researchers calculated that training GPT-3 from scratch on 1,024 A100 GPUs would still take about 34 days [1]. It’s easy to see why most individuals and organizations cannot afford to train an LLM.
The silver lining with LLMs is that they are few-shot learners: thanks to the sheer number of parameters and the size of the training dataset, they can perform tasks with little or no fine-tuning.
Recently, we came across an article that showed some interesting results using prompt design. The purpose of the article was to highlight the task-solving capabilities of LLMs that meet two conditions: the model has more than 100 billion parameters, and the training compute exceeds 10^23 FLOPs. This leads to emergent abilities in the model, giving it the capability to solve tasks it was never specifically trained for [3].
In the past, we have seen that vanilla natural language models tend to struggle in the biomedical domain; they usually need some fine-tuning or pre-training with biomedical data to give good results. This leads to our hypothesis: the Quality of the dataset is more important than both the size of the model and the size of the corpus.
This led us to explore LLMs on biomedical text data, specifically for the Named Entity Recognition (NER) task, and to compare them against the BERT models we use for NER on textual metadata from biomolecular datasets.
Before we jump to the experiments and numbers, let’s first define two more terms - domain-specificity and task-specificity, and what we mean by Quality of a dataset.
Domain-specificity can be defined as the similarity between the domain of our dataset and the domain of the use case or problem statement we are developing a model for.
Task-specificity can be defined as the similarity between the task that needs to be performed for our use case or problem statement and the task our dataset is built for.
According to our definition, the Quality of a dataset is a property that affects how well a model can learn from that dataset, and it can be increased by increasing the dataset's domain-specificity and task-specificity.
Task-specificity should be an intuitive factor for increasing model performance. To learn patterns, a model needs a task to perform and needs to learn from the mistakes it makes while performing that task. While it is possible for a model to learn different tasks from just one general task, empirically we have seen that fine-tuning for specific tasks yields better results.
Why domain-specificity helps us train a better model can also be explained using the distributions of our training and test sets. Empirical Risk Minimization learning algorithms (which cover almost all the supervised learning models we use) assume that the training and test sets are sampled from the same distribution. In the real world, once a model is deployed we have no control over the test set, but by increasing domain-specificity according to the use case, we can increase the similarity between the distributions of train and test examples, since examples from the same domain tend to be similar. This becomes an even bigger factor for biomedical data because of the many words that are unique to this domain.
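As a quick refresher (this is the standard textbook formulation, not something specific to our setup), ERM picks the hypothesis that minimizes the average loss over the training samples, while what we actually care about is the expected loss over the distribution the test examples come from:

```latex
% Empirical risk minimized during training, over n samples drawn from D:
\hat{h} \;=\; \arg\min_{h \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(x_i), y_i\big),
\qquad (x_i, y_i) \sim \mathcal{D}

% True risk we care about at test time -- low empirical risk only transfers
% to low true risk when test examples are also drawn from D:
R(h) \;=\; \mathbb{E}_{(x,y) \sim \mathcal{D}} \big[\, \ell\big(h(x), y\big) \,\big]
```

Increasing domain-specificity is essentially our way of keeping that same-distribution assumption from breaking once the model is in production.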
Prompt-based learning is a method in which we describe the task to be performed, with or without a few examples, and pass this description to a pre-trained LLM; the model fills in the missing information, which is then used as the output for the task.
To create our prompt, we used the QaNER paper as a reference. It proposes a Question Answering based NER framework and a method for converting NER problems into QA problems, which we used to create our prompts [4].
Along with the methods from QaNER, we also created 10 prompts and empirically selected the best one using a small test set. This was done so that the prompt itself would not be a bottleneck and we could make an apples-to-apples comparison.
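To make this concrete, here is a minimal sketch of how a QaNER-style prompt for tissue extraction could look. The question wording, the helper name build_prompt, and the example text are illustrative, not the exact prompts we settled on.

```python
# A minimal sketch of converting tissue NER into a QA-style prompt, in the
# spirit of QaNER. The question wording and examples here are illustrative,
# not the exact prompts used in our experiments.

def build_prompt(context: str, examples=None) -> str:
    """Build a QA-style prompt asking the LLM to extract the tissue entity."""
    question = "What tissue is mentioned in the text?"
    prompt = ""
    # Optional in-context examples enable few-shot (e.g. 2-shot) prompting.
    for ex_context, ex_answer in (examples or []):
        prompt += f"Context: {ex_context}\nQuestion: {question}\nAnswer: {ex_answer}\n\n"
    # The unanswered query the LLM is expected to complete.
    prompt += f"Context: {context}\nQuestion: {question}\nAnswer:"
    return prompt

# 0-shot usage:
print(build_prompt("RNA-seq of primary hepatocytes isolated from human liver."))
```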
The task we want our BERT model to perform is NER, and for that, we use BERT in a token classification setting. In token classification, the model classifies each token as a certain entity type or as a regular token.
To fine-tune our model, we used a manually labelled dataset we created, along with Hugging Face models and training APIs. The manually labelled corpus used for fine-tuning PollyBERT was relatively small: about 4k samples from GEO and only a few MBs in size.
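For readers unfamiliar with the setup, here is a minimal sketch of token-classification fine-tuning with Hugging Face. The BIO label set, the public PubMedBERT checkpoint id, the training arguments, and the single toy example are illustrative stand-ins for our internal GEO dataset and configuration.

```python
# A minimal sketch of fine-tuning a BERT model for token classification (NER).
# Labels, model id, hyperparameters and the toy example are illustrative, not
# the exact setup used to train PollyBERT.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)
from datasets import Dataset

labels = ["O", "B-TISSUE", "I-TISSUE"]           # BIO tags for the tissue entity
model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name,
                                                        num_labels=len(labels))

def encode(example):
    # Tokenize pre-split words and align one BIO label per sub-word token;
    # special and padding tokens get the ignore index -100.
    enc = tokenizer(example["tokens"], is_split_into_words=True,
                    truncation=True, padding="max_length", max_length=64)
    enc["labels"] = [
        -100 if word_id is None else example["ner_tags"][word_id]
        for word_id in enc.word_ids()
    ]
    return enc

# Toy stand-in for the ~4k manually labelled GEO samples.
train_dataset = Dataset.from_dict({
    "tokens": [["RNA-seq", "of", "human", "liver", "tissue"]],
    "ner_tags": [[0, 0, 0, 1, 2]],
}).map(encode, remove_columns=["tokens", "ner_tags"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="pollybert-ner", num_train_epochs=1),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
).train()
```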
To test prompt design, we picked GPT-J-6B and OPT-175B to represent two different sizes of LLMs, but size is not the only difference between them. Both GPT-J-6B and OPT-175B were trained on the Pile dataset, an 800GB corpus of natural language data from many different sources. An important fact to note here is that OPT-175B was trained on only a subset of the Pile, and this subset did not include PubMed Central and some other categories.
For our BERT model, we initialized with PubMedBERT weights. PubMedBERT is a BERT-based model pre-trained on a 21GB corpus of PubMed abstracts. The PubMedBERT model was then fine-tuned for NER using a manually labelled training dataset built for extracting tissues from a given piece of text.
The task we chose for comparing these models was NER: extracting tissue entities from a piece of text. We used the prompt designed in the earlier section in a 0-shot learning setting with OPT-175B and GPT-J-6B (due to API limitations).
Since we had the capability to perform 2-shot learning with GPT-J-6B, we also tried that on the same test set to see if 2-shot learning could help the model perform better.
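In practice, both settings reduce to generating a completion from a causal language model and reading off the answer. Below is a rough sketch of how this looks for GPT-J-6B, reusing the build_prompt helper from the earlier sketch; the decoding settings are illustrative, and OPT-175B is far too large to run locally like this, so only the GPT-J-6B path is shown.

```python
# A rough sketch of 0-shot / 2-shot tissue extraction with GPT-J-6B via
# Hugging Face transformers; decoding settings are illustrative, and
# build_prompt comes from the earlier prompt-design sketch.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

def extract_tissue(context: str, examples=None) -> str:
    prompt = build_prompt(context, examples)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    # Keep only the newly generated tokens, i.e. the model's answer.
    answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
    return answer.strip().split("\n")[0]

# 0-shot:
print(extract_tissue("Single-cell profiling of mouse lung epithelium."))
# 2-shot: pass two (context, answer) pairs as in-context examples.
print(extract_tissue("Bulk RNA-seq of kidney biopsies.",
                     examples=[("Microarray of human breast cancer samples.", "breast"),
                               ("ATAC-seq on cortical neurons.", "cortex")]))
```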
To quantify performance, we use both accuracy and F1-score as our metrics for NER. One thing to note about these scores: we count partial matches between the prediction and the ground truth.
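To spell out what partial matching means, here is one plausible implementation of the scoring, shown as a sketch over per-example predictions; our actual evaluation code may differ in details, such as how a wrong span on a positive example is counted.

```python
# One plausible implementation of partial-match scoring, shown as a sketch;
# the real evaluation may differ in details. A prediction counts as correct
# when it overlaps the ground-truth tissue string as a substring (in either
# direction), or when both are empty.
def is_correct(pred: str, gold: str) -> bool:
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if not gold:
        return not pred                      # no tissue present, none predicted
    return bool(pred) and (pred in gold or gold in pred)

def evaluate(preds, golds):
    pos = [(p, g) for p, g in zip(preds, golds) if g.strip()]     # has a tissue
    neg = [(p, g) for p, g in zip(preds, golds) if not g.strip()]
    tp = sum(is_correct(p, g) for p, g in pos)
    fn = len(pos) - tp
    tn = sum(is_correct(p, g) for p, g in neg)
    fp = len(neg) - tn
    accuracy = (tp + tn) / len(golds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

print(evaluate(["liver", "", "lungs"], ["human liver", "", "kidney"]))  # ~(0.67, 0.67)
```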
The test set is made from textual metadata from GEO that has been manually labelled by Subject Matter Experts (SMEs). It is a perfectly balanced set: 100 examples containing tissue entities and 100 examples not containing any tissue entities.
Now coming to the numbers: PollyBERT achieved an accuracy of 87% and an F1 score of 0.84, GPT-J-6B gave us an accuracy of 51.7% and an F1 score of 0.515, and OPT-175B was not able to give any predictions, so its accuracy and F1 score both stand at 0.
For 2-shot learning with GPT-J-6B, we saw a further drop in performance: an accuracy of 47% and an F1 score of 0.38. Why this drop happened is not clear to us; we suspect it might simply be an artifact of the test set.
From these results, it’s clear that PollyBERT is the better-performing model by a huge margin, even though it has 10 to 1,000 times fewer parameters and a significantly smaller training corpus than both GPT-J-6B and OPT-175B.
From our results, we have seen that PollyBERT performs much better than the LLMs when it comes to NER for tissue entities. We also saw performance decrease as the number of parameters increased. These results are the complete opposite of what we expected, given that the condition for emergence was only met by OPT-175B.
In our view, this was due to the difference in the training datasets of these models. Biomedical data is very different from general language data; it contains many named entities that are found only in the biomedical domain. To identify these entities for NER, a model needs to have seen them in its corpus, and since their occurrence in a general language corpus is negligible, this leads to poor performance.
If we recall the corpora of these models: we used data from GEO for PollyBERT, GPT-J-6B had PubMed Central data, and OPT-175B lacked any biomedical data, while all of them were tested on GEO. On top of that, while the dataset for PollyBERT was built for tissue NER, both GPT-J-6B and OPT-175B were trained on an autoregressive task.
Looking at domain-specificity and task-specificity here, the dataset for GPT-J-6B was more domain-specific than that of OPT-175B, which is why GPT-J-6B performed better than OPT-175B. The dataset for PollyBERT was both more task-specific and more domain-specific (since it shares the same source as the test set, i.e., GEO) than that of GPT-J-6B, which led to a huge jump in performance.
Beyond data quality, there is also a case to be made for manual labelling instead of spending more time creating and experimenting with prompts, specifically for biomedical NER tasks.
Now, let’s assume we have an LLM that has been trained extensively on biomedical data and is capable of performing a variety of tasks with prompt-based learning. If we go with prompt-based learning, we would decide on a framework for designing prompts, such as QaNER, and then begin experimenting with different prompts on a test set. The problem with this method is that there is no guarantee we can design a working prompt at all, let alone the best possible one.
On the other hand, manual labelling has a well-defined process for getting labelled data with consistent results. Due to this reliability and consistency, manual labelling should be the preferred method for fine-tuning language models for biomedical NER tasks.
We are focused on improving not just the accuracy but also the speed of the PollyBERT model. Read this blog to find out how some recently tried optimizations made PollyBERT 10x faster!