ChatGPT in Drug Discovery

Vishal Samal, Shrushti Joshi
July 5, 2023

In today's digital age, where information is abundant and constantly evolving, curation has become more crucial than ever. Curators are the gatekeepers of the vast sea of knowledge, sifting through the overwhelming volume of data to deliver relevant, valuable, and engaging content to their audience. But with the exponential growth of information, curators face the challenge of efficiently and effectively navigating this vast landscape.

Enter ChatGPT, an AI-powered language model developed by OpenAI and launched in November 2022.

With its advanced natural language processing capabilities, ChatGPT empowers curators to streamline their processes, find relevant material more easily, evaluate and organize information, and ultimately provide a better experience for their readers. ChatGPT has captivated the public's imagination like few other innovations. Its progress has surprised the machine learning community, surpassing the previous benchmark set by BERT models released in 2018.

“It won’t be a surprise to see, in the next 24 months, multiple billion-dollar companies built on top of OpenAI’s foundational models. The startups that will be the most successful won’t be the best at prompt engineering, which is the focus today; instead, success will be found in what novel data and use cases they incorporate into OpenAI’s models. This anonymous data and application will be the moat that establishes the next set of AI unicorns.” ~David Shim

The Life Sciences community is particularly interested in understanding the implications of ChatGPT for their work. In this blog, we dive into the world of curation and explore how ChatGPT can revolutionize this practice.

Why is Data Curation Important?

As public data repositories accept data in flexible arrangements, significant variations arise in how data is submitted. Consequently, intelligent systems are necessary to extract and categorize pertinent information from the provided metadata on these repositories.

Elucidata specializes in ingesting and providing omics datasets from diverse sources in standardized machine learning (ML)-ready formats to expedite drug development.

To address this challenge, Elucidata employs Biological Natural Language Processing (BioNLP) systems to curate its platform’s vast array of metadata, thereby automating the process. This system comprises two primary components:

  • One component extracts relevant information from public data.
  • The other harmonizes the extracted data to a standardized vocabulary.

By leveraging these BioNLP systems, Elucidata establishes a consistent format across all its diverse data sources, significantly reducing the effort required to render public data usable. This standardized approach not only enhances the efficiency of data curation but also contributes to accelerating research and analysis in the field of drug development.
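
As a rough illustration of the two components, here is a minimal Python sketch. It is not Elucidata's implementation: a simple keyword lookup stands in for the trained extraction model, and the small `VOCABULARY` dictionary and the function names are assumptions made for this example only.

```python
import re

# Illustrative vocabulary only: maps free-text variants to a standardized term.
# A real system would rely on a full ontology, not a hand-written dictionary.
VOCABULARY = {
    "breast cancer": "Breast Neoplasms",
    "breast carcinoma": "Breast Neoplasms",
    "t2d": "Diabetes Mellitus, Type 2",
    "type 2 diabetes": "Diabetes Mellitus, Type 2",
}


def extract_mentions(text: str) -> list[str]:
    """Component 1 (toy stand-in): find known disease mentions in free text."""
    lowered = text.lower()
    found = []
    for variant in VOCABULARY:
        # Word boundaries avoid matching inside longer words.
        if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
            found.append(variant)
    return found


def harmonize(mentions: list[str]) -> list[str]:
    """Component 2: map each mention to its standardized term, de-duplicated."""
    standardized = []
    for mention in mentions:
        term = VOCABULARY[mention]
        if term not in standardized:
            standardized.append(term)
    return standardized


def curate(text: str) -> list[str]:
    """Run extraction, then harmonization, on one metadata string."""
    return harmonize(extract_mentions(text))
```

Calling `curate("Tissue: breast carcinoma biopsy; patient has T2D")` returns the two standardized terms, while text with no known mention returns an empty list.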

Data Curation Using BioNLP Systems (Before ChatGPT)

To streamline and standardize curation, we use BioNLP systems that automatically extract relevant entities from metadata, abstracts, and publications.

Figure: Workflow of Curation Process using Bio-NLP Systems

The high-level process of training a model involves several key steps.

  • First comes the field definition and manual curation phase, in which the field to be extracted is defined and guidelines are created to generate training data. This data undergoes a double-blinded review process to ensure its reliability for training task-specific models.
  • In the training phase, a large corpus of texts, such as publications, is preprocessed by dividing them into smaller paragraphs. These paragraphs are then used to train BERT models specific to the task.
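
The preprocessing step in the training phase could look something like the sketch below. The function name and the `max_words` cap are assumptions for illustration; the actual splitting rules used in the pipeline are not described here.

```python
import re


def split_into_paragraphs(document: str, max_words: int = 200) -> list[str]:
    """Split a document on blank lines, then cap each piece at max_words."""
    chunks = []
    for block in re.split(r"\n\s*\n", document):
        words = block.split()
        # Long paragraphs are cut into consecutive windows of max_words.
        # Empty blocks produce no words and are therefore skipped.
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks
```

Keeping each chunk short matters because BERT-style models accept only a limited number of input tokens at a time.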

BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art natural language processing (NLP) model introduced by researchers at Google AI in 2018. It revolutionized the field of NLP by significantly advancing the understanding of contextual language representations.

Figure: Workflow of Normalization Process

  • After the models are trained, they are tested on a separate, held-out dataset known as test data.
  • The accuracy of the models is evaluated, and if they meet the desired performance standards, they can be used for curation in production. However, it is often necessary to repeat the training process or obtain more training data to improve the models. Typically, around 5 to 15 iterations are required to develop "production-ready" models.
  • In production, the task-specific model trained earlier extracts relevant entities from metadata, publications, and abstracts.
  • In the final stage, the extracted entities are standardized against ontologies such as MeSH, PubChem, BTO, and others. A dedicated model trained explicitly for this purpose, called the "normalization" model, maps each extracted entity to the corresponding ontology term. This process yields "normalized entities": terms selected from a controlled vocabulary.
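
The test-and-iterate step in the list above boils down to comparing predictions with curated labels on the test data. A minimal sketch follows; the 0.9 threshold is a hypothetical value chosen for the example, since the actual acceptance criteria are not stated here.

```python
def accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of test samples whose predicted label equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("test set is empty")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def is_production_ready(score: float, threshold: float = 0.9) -> bool:
    """Hypothetical acceptance rule: the threshold is an assumption."""
    return score >= threshold
```

If `is_production_ready` returns `False`, the model goes back for another training iteration or for more training data.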

There are two limitations to using the current process for developing models.

  • First, each model can curate only one field, so development time constrains how many new fields we can add over time.
  • Second, BERT's architecture and size impose limitations, making it relatively less effective in understanding context and extracting information.

Enhancing Curation with ChatGPT

ChatGPT has two advantages over BERT.

  1. ChatGPT is an LLM designed to follow instructions, making it flexible regarding the tasks it can perform.
  2. Being an LLM, it has higher accuracy when performing said tasks.

One of the initial applications we explored with ChatGPT is its use in curating various fields using prompts. Our experimentation has revealed that ChatGPT performs better than BERT-based models while requiring significantly less development time.

Experiment 1

To evaluate this, we conducted an experiment on information extraction, specifically extracting disease information from samples within an omics dataset. We selected datasets from GEO (Gene Expression Omnibus) to create a test set for comparison.

Both BERT and ChatGPT were employed to extract disease labels. For BERT, we utilized a custom pipeline designed explicitly for disease extraction. On the other hand, with ChatGPT, we used a prompt that provided instructions on the process of extracting disease from the metadata.
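
To give a sense of what prompt-based extraction involves, here is a minimal sketch of assembling a prompt from sample metadata and tidying the model's reply. The prompt wording, field names, and function names are all assumptions for illustration; this is not the prompt used in the experiment, and the call to the model itself is deliberately left out.

```python
from typing import Optional


def build_disease_prompt(sample_metadata: dict[str, str]) -> str:
    """Assemble an instruction prompt from a sample's metadata fields."""
    fields = "\n".join(f"{key}: {value}" for key, value in sample_metadata.items())
    return (
        "You are curating biomedical sample metadata.\n"
        "Identify the disease studied in the sample described below.\n"
        "Answer with the disease name only, or 'none' if no disease is mentioned.\n\n"
        f"{fields}"
    )


def parse_disease_answer(answer: str) -> Optional[str]:
    """Clean up the model's free-text answer; 'none' or empty becomes None."""
    cleaned = answer.strip().strip(".").strip()
    return None if cleaned.lower() in {"", "none"} else cleaned
```

Changing the field being curated then means editing the instruction text, not training a new model, which is where the saving in development time comes from.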

Figure: Accuracy of Sample Level Disease Labels
Figure: Development Time for Sample Level Disease Labels

The BERT pipeline performed significantly worse than ChatGPT with the developed prompt, and creating and testing the prompt took notably less time than developing the BERT pipeline.

Experiment 2

In the second experiment, the objective was a classification task: identifying the presence or absence of a donor in a given experiment. BERT proved highly effective at this task, yielding excellent results. The experimental setup remained the same, pairing a fine-tuned BERT model against ChatGPT with a prompt to allow a direct comparison between the two.

Despite the relatively long development time required, ChatGPT emerged as the superior choice in this scenario, outperforming BERT. Additionally, ChatGPT offered the added advantage of being independent of the data source used for testing the model, thereby reducing development time in the long run.

Figure: Accuracy of Donor Labels
Figure: Development Time for Donor Labels

The potential impact of ChatGPT is substantial, with the prospect of significant time and resource savings on the horizon. While the technology is still in its early stages, the future looks promising as ChatGPT holds the key to unlocking novel possibilities and enhancing efficiency within the curation workflow at Elucidata. By harnessing the power of this advanced language model, the company stands to experience transformative changes in its information extraction endeavors.

Understand more about our ML-ready omics datasets and discover how our innovative solutions can optimize your research workflows. Together, let's unlock new frontiers in data-driven drug R&D.
