How PollyBERT Outperforms Large Language Models in BioNER

Vishal Samal
October 11, 2022

Large Language Models (or LLMs) are enormous models, usually tens or hundreds of gigabytes in size, trained on huge amounts of data, generally on the terabyte to petabyte scale.

One of the most famous LLMs right now is GPT-3, with about 175 billion parameters and a 45TB training corpus. Some researchers calculated that retraining GPT-3 on 1024 A100 GPUs would still take about 34 days [1]. It's easy to see why most individuals and organizations cannot afford to train an LLM.

The silver lining with LLMs is that they are few-shot learners: thanks to the sheer number of parameters and the size of the training dataset, they can perform tasks without being fine-tuned, or with only minimal tuning.

Recently, we came across an article that showed some interesting results using prompt design. The article highlights the task-solving capabilities of LLMs that meet two conditions - the model has more than 100 billion parameters and its training compute exceeds 10^23 FLOPs. Crossing these thresholds leads to emergent abilities, giving the model the capability to solve tasks it was never specifically trained for [3].

In the past, we have seen that vanilla natural language models tend to struggle in the biomedical domain; they usually need some fine-tuning or pre-training on biomedical data to give good results. Because of this, our hypothesis is that the Quality of the dataset matters more than either the size of the model or the size of its training corpus.

This motivated us to explore LLMs on biomedical text, specifically for the Named Entity Recognition (NER) task, and compare them against the BERT models we use for NER on textual metadata from biomolecular datasets.

Defining the Quality of a Dataset

Before we jump to the experiments and numbers, let's first define two terms - domain-specificity and task-specificity - and what we mean by the Quality of a dataset.

Domain-specificity can be defined as the similarity between the domain of our dataset and the domain of the use case or problem statement we are developing a model for.

Task-specificity can be defined as the similarity between the task that needs to be performed for our use case or problem statement and the task our dataset is built for.

According to our definition, the Quality of a dataset is the property that determines how well a model can learn from it, and it can be increased by increasing the dataset's domain-specificity and task-specificity.

Task-specificity is an intuitive factor for improving model performance. To learn patterns, a model needs a task to perform and mistakes to learn from while performing that task. While it is possible for a model to learn different tasks from a single general training objective, empirically we have seen that fine-tuning for the specific task yields better results.

Why domain-specificity helps us train a better model can also be explained through the distributions of our training and test sets. Empirical Risk Minimization learning algorithms (which cover almost all the supervised models we use) assume that the training and test sets are sampled from the same distribution. In the real world, once a model is in production we have no control over the test set, but by increasing domain-specificity for the use case we can bring the training and test distributions closer together, since examples from the same domain tend to be similar. This matters even more for biomedical data, because of the many words that are unique to the domain.
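
As a toy illustration of this intuition, one can ask how much of the serving-time vocabulary the training corpus has ever seen. The snippet below is a rough sketch of such a proxy - the example sentences are made up and this is not a formal measure of domain similarity.

```python
# Rough sketch: vocabulary overlap as a crude proxy for domain similarity.
# The example texts are made up for illustration.
def vocab_coverage(train_texts, test_texts):
    train_vocab = {w for t in train_texts for w in t.lower().split()}
    test_vocab = {w for t in test_texts for w in t.lower().split()}
    return len(train_vocab & test_vocab) / len(test_vocab)

general_corpus = ["the quick brown fox jumps over the lazy dog"]
geo_metadata = ["total rna was extracted from hepatocellular carcinoma liver biopsies"]

# Low coverage suggests a domain mismatch between training and serving data.
print(vocab_coverage(general_corpus, geo_metadata))
```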

What Is Prompt-based Learning and How Do We Make Our Prompt?

Prompt-based learning is a method in which we describe the task to be performed, with or without some examples. This description is then given to a pre-trained LLM, which fills in the missing information, and that completion is used as the output for the task.

To create our prompt, we used the QaNER paper as a reference. It proposes a Question Answering based NER framework and a method to convert NER problems into QA problems, which we used to create our prompts [4].
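
To make this concrete, the sketch below shows one way a tissue-NER example can be cast as a QA-style prompt in the spirit of QaNER. The question wording and example text are illustrative assumptions, not the prompts we actually used.

```python
# Illustrative sketch of casting tissue NER as a QA-style prompt (QaNER-inspired).
# The question wording and example text are assumptions, not our actual prompt.
QUESTION = "Which tissue is mentioned in the text?"

def ner_to_qa_prompt(context, shots=()):
    parts = []
    for shot_context, shot_answer in shots:          # optional few-shot examples
        parts.append(f"Text: {shot_context}\nQuestion: {QUESTION}\nAnswer: {shot_answer}")
    parts.append(f"Text: {context}\nQuestion: {QUESTION}\nAnswer:")
    return "\n\n".join(parts)

print(ner_to_qa_prompt("Samples were collected from human lung epithelium."))
```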

Along with the methods from QaNER, we also created 10 candidate prompts and empirically selected the best one using a small test set. This was done so that the prompt is not a bottleneck and we get an apples-to-apples comparison.
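
The selection step itself can be as simple as scoring each candidate prompt on the small test set and keeping the winner. The loop below is a sketch of that idea, where query_llm and score are placeholders rather than our actual evaluation code.

```python
# Sketch of empirical prompt selection; `query_llm` and `score` are placeholders.
def select_best_prompt(candidate_prompts, small_test_set, query_llm, score):
    best_prompt, best_score = None, float("-inf")
    for template in candidate_prompts:
        predictions = [query_llm(template.format(text=text)) for text, _ in small_test_set]
        golds = [gold for _, gold in small_test_set]
        current = score(predictions, golds)          # e.g. F1 with partial matches
        if current > best_score:
            best_prompt, best_score = template, current
    return best_prompt, best_score
```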

How Do We Fine-tune PollyBERT?

The task we want our BERT model to perform is NER, and for that we use BERT in a token classification setting. In token classification, the model classifies each token as a certain entity type or as a regular token.
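
As a minimal sketch of what token classification looks like in practice, the snippet below runs an untuned token-classification head over a sentence with Hugging Face Transformers. The checkpoint name and the BIO-style tissue label set are assumptions for illustration, not the exact PollyBERT configuration.

```python
# Minimal token-classification sketch with Hugging Face Transformers.
# Checkpoint and label scheme are illustrative assumptions, not the exact PollyBERT setup.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-TISSUE", "I-TISSUE"]               # assumed BIO labels for tissue entities
checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))

inputs = tokenizer("RNA was extracted from mouse liver tissue.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                  # shape: (1, seq_len, num_labels)

# One predicted label per (sub-word) token; a fine-tuned head would mark "liver" as a tissue.
for token, label_id in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
                           logits.argmax(dim=-1)[0].tolist()):
    print(token, labels[label_id])
```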

To fine-tune our model, we used a manually labelled dataset we created, together with Huggingface models and training APIs. The manually labelled corpus used for fine-tuning PollyBERT was relatively small: about 4k samples from GEO, only a few MBs in size.
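
A fine-tuning run along these lines could look roughly like the sketch below using the Hugging Face Trainer. The file names, dataset fields, and hyperparameters are assumptions for illustration rather than the settings we actually used.

```python
# Rough fine-tuning sketch with the Hugging Face Trainer API.
# File names, dataset fields, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

labels = ["O", "B-TISSUE", "I-TISSUE"]
checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))

# Assumed JSON files with word lists ("tokens") and per-word label ids ("ner_tags").
dataset = load_dataset("json", data_files={"train": "train.json", "validation": "dev.json"})

def tokenize_and_align(example):
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    # Align word-level labels to sub-word tokens; special tokens get -100 (ignored by the loss).
    enc["labels"] = [-100 if w is None else example["ner_tags"][w] for w in enc.word_ids()]
    return enc

tokenized = dataset.map(tokenize_and_align)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pollybert-tissue-ner", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```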

Experiments

To test prompt design, we picked GPT-J-6B and OPT-175B to represent two different sizes of LLMs, but size is not the only difference between them. GPT-J-6B and OPT-175B were both trained on the Pile, an 800GB corpus that includes natural language data from many different sources. An important fact to note here is that OPT-175B was trained on only a subset of the Pile, and this subset did not include PubMed Central and some other categories.

PollyBERT

For our BERT model, we initialized a BERT model with PubMedBERT weights. PubMedBERT is a BERT-based model pre-trained on a 21GB corpus of PubMed abstracts. We then fine-tuned it for NER using a manually labelled training dataset created for extracting tissues from a given piece of text.

The task we chose for comparing these models was to perform NER and extract tissue entities from a piece of text. We used the same prompt designed in the earlier section in a 0-shot learning setting with OPT-175B and GPT-J-6B (due to API limitations).

Since we had the capability to perform 2-shot learning with GPT-J-6B, we also ran that on the same test set to see whether 2-shot learning helps the model perform better.
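
For reference, a 2-shot prompt in this setting simply prepends two solved examples before the query. The example texts below are made up and only illustrate the format, not the actual prompts we sent to GPT-J-6B.

```python
# Illustrative 2-shot prompt format; the example texts are made up.
two_shot_prompt = (
    "Text: Biopsies were collected from the liver of each patient.\n"
    "Question: Which tissue is mentioned in the text?\nAnswer: liver\n\n"
    "Text: Total RNA was isolated from mouse brain cortex.\n"
    "Question: Which tissue is mentioned in the text?\nAnswer: brain cortex\n\n"
    "Text: Gene expression was profiled in primary lung fibroblasts.\n"
    "Question: Which tissue is mentioned in the text?\nAnswer:"
)
print(two_shot_prompt)
```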

To quantify performance, we use both accuracy and F1-score as our metrics for NER. One thing to note about these scores is that we count partial matches between the prediction and the ground truth.
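
Our reading of "partial match" is sketched below: a prediction counts as correct if it overlaps the ground-truth entity as a substring in either direction. The matching and scoring rules here are assumptions for illustration, not the precise evaluation code we ran.

```python
# Sketch of accuracy and F1 with partial (substring) matches; the exact rules
# below are illustrative assumptions, not our precise evaluation code.
def is_partial_match(pred, gold):
    p, g = pred.strip().lower(), gold.strip().lower()
    return bool(p) and bool(g) and (p in g or g in p)

def accuracy_and_f1(examples):
    """examples: list of (predicted_entity_or_None, gold_entity_or_None) pairs."""
    tp = fp = fn = tn = 0
    for pred, gold in examples:
        if pred and gold and is_partial_match(pred, gold):
            tp += 1
        elif pred and gold:              # both present but no overlap
            fp += 1
            fn += 1
        elif pred:                       # predicted an entity where there is none
            fp += 1
        elif gold:                       # missed an entity
            fn += 1
        else:                            # correctly predicted "no entity"
            tn += 1
    accuracy = (tp + tn) / len(examples)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1

print(accuracy_and_f1([("liver", "liver tissue"), (None, None), ("lung", None)]))
```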

The test set is made from textual metadata from GEO that has been manually labelled by Subject Matter Experts (or SMEs). It is a perfectly balanced set: 100 examples containing tissue entities and 100 examples containing none.

Results

Now coming to the numbers: PollyBERT achieved an accuracy of 87% and an F1 score of 0.84, GPT-J-6B gave an accuracy of 51.7% and an F1 score of 0.515, and OPT-175B was not able to give any predictions, so its accuracy and F1 score both stand at 0.

[Figure: F1 score vs. model parameter count]

For 2-shot learning with GPT-J-6B, we saw a further drop in performance: an accuracy of 47% and an F1 score of 0.38. Why this drop happened is not clear to us; we suspect it might just be an artefact of the test set.

[Figure: 2-shot GPT-J-6B compared with PollyBERT]

From these results, it is clear that PollyBERT is the better-performing model by a huge margin, even though it has 10 to 1000 times fewer parameters and a significantly smaller training corpus than either GPT-J-6B or OPT-175B.

Conclusion

Effect of Quality of the Dataset

From our results, we have seen that PollyBERT performs much better than the LLMs when it comes to NER for tissue entities. We also saw a decrease in performance with an increase in the number of parameters. These results are the complete opposite of what we expected, given that the condition for emergence was only met by OPT-175B.

In our view, this was due to the difference in the training datasets for these models. Biomedical data is very different from general language data: it contains many named entities that are only found in the biomedical domain. To identify these entities for NER, a model needs to have seen them in its corpus, and since they occur only negligibly in a general language corpus, performance suffers.

If we recall the corpora for these models: we used data from GEO for PollyBERT, GPT-J-6B had PubMed Central data, and OPT-175B lacked any biomedical data, while all of them were tested on GEO. On top of that, while the dataset for PollyBERT was built for tissue NER, both GPT-J-6B and OPT-175B were trained on an autoregressive task.

In terms of domain-specificity and task-specificity, the dataset for GPT-J-6B was more domain-specific than that of OPT-175B, which is why GPT-J-6B performed better than OPT-175B. The dataset for PollyBERT was both more task-specific and more domain-specific than GPT-J-6B's (since it shares the same source as the test set, i.e., GEO), which led to a huge jump in performance.

Manual Labelling vs Prompt Design

Beyond the quality of the data, there is also a case to be made for manual labelling instead of spending more time creating and experimenting with prompts, specifically for biomedical NER tasks.

Now, let's assume we have an LLM that has been trained extensively on biomedical data and is capable of performing a variety of tasks with prompt-based learning. If we go with prompt-based learning, we decide on a framework for designing prompts, like QaNER, and then begin experimenting with different prompts on a test set. The problem with this method is that there is no guarantee we can design a working prompt in the first place, let alone the best possible one.

On the other hand, manual labelling has a well-defined process for getting labelled data with consistent results. Because of this reliability and consistency, manual labelling should be the preferred method of fine-tuning language models for biomedical NER tasks.

We are focused on improving not just the accuracy but also the speed of the PollyBERT model. Read this blog to find out how some recent optimizations made PollyBERT 10x faster!