In today's digital age, where information is abundant and constantly evolving, curation has become more crucial than ever. Curators are the gatekeepers of the vast sea of knowledge, sifting through the overwhelming volume of data to deliver relevant, valuable, and engaging content to their audience. But with the exponential growth of information, curators face the challenge of efficiently and effectively navigating this vast landscape.
Enter ChatGPT, an AI-powered language model developed by OpenAI and launched in November 2022.
With its advanced natural language processing capabilities, ChatGPT can help curators streamline their processes, make information easier to find, evaluate, and organize, and ultimately provide a better experience for their readers. ChatGPT has captivated the public's imagination like few other innovations. Its progress has surprised the machine learning community, surpassing the previous benchmark set by BERT models released in 2018.
“It won’t be a surprise to see, in the next 24 months, multiple billion-dollar companies built on top of OpenAI’s foundational models. The startups that will be the most successful won’t be the best at prompt engineering, which is the focus today; instead, success will be found in what novel data and use cases they incorporate into OpenAI’s models. This anonymous data and application will be the moat that establishes the next set of AI unicorns.” ~David Shim
The Life Sciences community is particularly interested in understanding the implications of ChatGPT for their work. In this blog, we dive into the world of curation and explore how ChatGPT can revolutionize this practice.
As public data repositories accept data in flexible arrangements, significant variations arise in how data is submitted. Consequently, intelligent systems are necessary to extract and categorize pertinent information from the provided metadata on these repositories.
Elucidata specializes in ingesting and providing omics datasets from diverse sources in standardized machine learning (ML)-ready formats to expedite drug development.
To address this challenge, Elucidata employs Biological Natural Language Processing (BioNLP) systems to curate its platform’s vast array of metadata, thereby automating the process. This system comprises two primary components:
By leveraging these BioNLP systems, Elucidata establishes a consistent format across all its diverse data sources, significantly reducing the effort required to render public data usable. This standardized approach not only enhances the efficiency of data curation but also contributes to accelerating research and analysis in the field of drug development.
To streamline and standardize the curation process, we employ BioNLP systems that automatically extract relevant entities from metadata, abstracts, and publications.
The high-level process of training a model involves several key steps.
BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art natural language processing (NLP) model introduced by researchers at Google AI in 2018. It revolutionized the field of NLP by significantly advancing the understanding of contextual language representations.
There are two limitations to using the current process for developing models.
ChatGPT has two advantages over BERT.
One of the initial applications we explored with ChatGPT is its use in curating various fields using prompts. Our experimentation has revealed that ChatGPT performs better than BERT-based models while requiring significantly less development time.
To evaluate this, we conducted an experiment on information extraction, specifically extracting disease information from samples within an omics dataset. We selected datasets from GEO (Gene Expression Omnibus) to create a test set for comparison.
Both BERT and ChatGPT were employed to extract disease labels. For BERT, we utilized a custom pipeline designed explicitly for disease extraction. On the other hand, with ChatGPT, we used a prompt that provided instructions on the process of extracting disease from the metadata.
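To make the prompt-based approach concrete, here is a minimal sketch of how a disease-extraction prompt could be assembled from one sample's metadata. The instruction wording, the function names, and the metadata field names are all assumptions made for illustration; the prompt actually used in the experiment is not reproduced in this post, and no model is called here.

```python
import json

# Hypothetical instruction text; the wording of the real prompt is not given in this post.
INSTRUCTIONS = (
    "You are curating sample metadata from an omics dataset. "
    "Identify the disease studied in the sample described below. "
    "Answer with the disease name only, or 'none' if no disease is mentioned."
)


def build_disease_prompt(sample_metadata: dict) -> str:
    """Combine the fixed instructions with one sample's metadata."""
    # sort_keys keeps the prompt text stable across runs for the same sample.
    metadata_text = json.dumps(sample_metadata, indent=2, sort_keys=True)
    return f"{INSTRUCTIONS}\n\nSample metadata:\n{metadata_text}\n\nDisease:"


def normalize_answer(raw_answer: str) -> str:
    """Lower-case a model reply and trim whitespace and trailing periods."""
    return raw_answer.strip().rstrip(".").strip().lower()


# Illustrative sample; the field names are assumptions, not a GEO schema.
sample = {
    "title": "Liver biopsy, patient 12",
    "characteristics": "diagnosis: hepatocellular carcinoma",
}
prompt = build_disease_prompt(sample)
print(prompt)
```

The resulting string would be sent to the model as the user message, and the reply passed through normalize_answer before being compared with curated labels. Keeping the instructions separate from the metadata means the same prompt can be reused for samples from any source.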
The BERT pipeline performed significantly worse than the prompt we developed for ChatGPT, and creating and testing the prompt took notably less time than building the BERT pipeline.
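A comparison of this kind can be scored in several ways, and the post does not say which metric was used. As one plausible option, the sketch below computes exact-match accuracy of extracted disease labels against a curated test set. The function name and the labels are placeholders for illustration only; they are not the experiment's data or results.

```python
def exact_match_accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of samples whose predicted label equals the curated label.

    Labels are compared case-insensitively after trimming whitespace.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("the test set is empty")
    matches = sum(
        p.strip().lower() == e.strip().lower()
        for p, e in zip(predicted, expected)
    )
    return matches / len(expected)


# Placeholder labels, NOT real outputs from BERT or ChatGPT.
expected = ["asthma", "none", "breast cancer", "type 2 diabetes"]
system_a = ["asthma", "none", "breast cancer", "obesity"]
system_b = ["Asthma", "asthma", "lung cancer", "obesity"]

print(exact_match_accuracy(system_a, expected))  # 0.75
print(exact_match_accuracy(system_b, expected))  # 0.25
```

Exact matching is strict: a synonym or a more specific disease name counts as an error, so a real evaluation would likely map labels to an ontology before comparing them.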
In the second experiment, the objective was a classification task involving identifying the presence or absence of a donor in a given experiment. BERT proved to be highly effective in this task, yielding excellent results. The experimental setup remained consistent, utilizing a fine-tuned BERT model alongside ChatGPT with a prompt, facilitating a direct comparison between the two.
Despite the relatively long development time required, ChatGPT emerged as the superior choice in this scenario, outperforming BERT. Additionally, ChatGPT offered the added advantage of being independent of the data source used for testing the model, thereby reducing development time in the long run.
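For a yes/no task such as donor presence, precision, recall, and F1 are a common way to compare two classifiers. The sketch below shows how those numbers could be computed; it assumes each prediction has already been reduced to a boolean, and both the function name and the sample labels are illustrative placeholders rather than values from the experiment.

```python
def binary_prf(
    predicted: list[bool], expected: list[bool]
) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class (donor present)."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    pairs = list(zip(predicted, expected))
    true_pos = sum(p and e for p, e in pairs)
    false_pos = sum(p and not e for p, e in pairs)
    false_neg = sum(e and not p for p, e in pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Placeholder labels, NOT real experiment data: True means a donor is present.
expected = [True, True, False, False, True]
predicted = [True, False, False, True, True]

precision, recall, f1 = binary_prf(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Running the same function on the outputs of the fine-tuned model and of the prompt, over the same test set, gives directly comparable scores.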
The potential impact of ChatGPT is substantial, with the prospect of significant time and resource savings on the horizon. While the technology is still in its early stages, the future looks promising as ChatGPT holds the key to unlocking novel possibilities and enhancing efficiency within the curation workflow at Elucidata. By harnessing the power of this advanced language model, the company stands to experience transformative changes in its information extraction endeavors.
Understand more about our ML-ready omics datasets and discover how our innovative solutions can optimize your research workflows. Together, let's unlock new frontiers in data-driven drug R&D.
Book a demo to learn more!