
Tech Stack for Generative AI in Life Sciences R&D

Vishal Samal
November 23, 2023

Over the past year, there has been a notable rise in the number of large language models (LLMs) made available to the public, granting widespread access to foundation models. A recent addition to this lineup is scGPT, a foundation model trained specifically on omics data. To harness models like scGPT effectively, however, you first need to consider the selection, fine-tuning, and utilization of foundation models, along with the technology needed to facilitate that process.

Selecting the Right Use Case for GenAI

The initial step involves determining whether generative AI is appropriate for your specific challenge. Training LLMs is resource-intensive, and given that generative AI models tend to hallucinate facts in order to answer user queries, it is important to pick the right use case.

Possible Use Cases and Techniques

Generative AI finds its most suitable application as a reasoning tool rather than as a knowledge repository, since it is susceptible to hallucinations. Techniques such as retrieval augmented generation (RAG) can be employed to minimize these hallucinations. Scenarios like hypothesis generation, where the model offers suggestions rather than making definitive decisions or predictions, are therefore particularly well suited to generative AI.
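As a rough illustration, here is a minimal RAG sketch in Python: documents are embedded, the most relevant ones are retrieved for a query, and the retrieved text is injected into the prompt so the model answers from evidence rather than from memory alone. The embedding model, example documents, and prompt format are all illustrative assumptions, not a prescribed setup.

```python
# Minimal RAG sketch: ground the LLM's answer in retrieved documents
# instead of relying on (possibly hallucinated) parametric knowledge.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

documents = [
    "TP53 is a tumor suppressor gene frequently mutated in cancers.",
    "scGPT is a foundation model trained on single-cell omics data.",
    "GEO is a public repository of functional genomics data.",
]
doc_vectors = embedder.encode(documents)  # shape: (n_docs, dim)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = embedder.encode([query])[0]
    scores = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q)
    )
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "What kind of data was scGPT trained on?"
context = "\n".join(retrieve(query))
# The retrieved context is prepended to the prompt so the LLM answers
# from the supplied evidence rather than from memory alone.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```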


Challenges with Training and Using LLMs

Training and fine-tuning demand substantial volumes of both labeled and unlabeled data, along with considerable computational resources; requirements can sometimes run to thousands of accelerators (such as GPUs) and terabytes of data. If you are considering fine-tuning your own model, ensure you have access to compute and curated data at that scale.

Fine-tuning your own LLM or utilizing an existing model comes with the challenge of potential model hallucinations. The problem at hand must therefore allow for a margin of error, and human experts need to oversee both the process and the generated output. This supervision is particularly critical in domains like drug discovery, where certain decisions risk multimillion-dollar losses and potential patient harm.

Creating a High-quality Dataset and Fine-tuning a Model

The quality of a generative AI model is influenced by many factors, but three stand out: dataset size, dataset quality, and the amount of computation (measured in floating-point operations, or FLOPs). It has been consistently demonstrated that a large model trained on abundant, high-quality data for an extended period outperforms other models.
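For intuition, a common back-of-the-envelope estimate puts training compute at roughly 6 × N × D FLOPs for a model with N parameters trained on D tokens. The sketch below applies that approximation with illustrative numbers; the model size, token count, and per-GPU throughput are assumptions, not recommendations.

```python
# Back-of-the-envelope training compute, using the common approximation
# FLOPs ≈ 6 × N (parameters) × D (training tokens).
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

n_params = 7e9   # a 7B-parameter model (illustrative)
n_tokens = 1e12  # one trillion training tokens (illustrative)
flops = training_flops(n_params, n_tokens)

# Rough wall-clock on a GPU cluster, assuming a sustained throughput
# per accelerator (~300 TFLOP/s here, an optimistic mixed-precision figure).
per_gpu_flops = 300e12
n_gpus = 1024
seconds = flops / (per_gpu_flops * n_gpus)
print(f"{flops:.2e} FLOPs ≈ {seconds / 86400:.1f} days on {n_gpus} GPUs")
```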

How to Create a High-quality Dataset?

Defining high-quality data is subjective and should be revisited for each specific problem, but it is well established that improving data quality improves performance. To build a large corpus of high-quality data, gather information from diverse sources to ensure training-data diversity. The data must then be preprocessed, which involves removing artifacts and correcting biases in the data distribution.
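As a sketch of what such preprocessing can look like, the snippet below deduplicates records by content hash and drops near-empty entries. The record schema, length threshold, and example data are assumptions; real omics pipelines layer domain-specific QC on top of this.

```python
# A minimal preprocessing sketch: deduplicate records and drop obvious
# artifacts before training.
import hashlib

def clean_corpus(records: list[dict]) -> list[dict]:
    seen: set[str] = set()
    cleaned = []
    for rec in records:
        text = rec["text"].strip()
        # Drop near-empty or boilerplate-like entries.
        if len(text) < 20:
            continue
        # Exact-duplicate removal via content hashing; near-duplicate
        # detection (e.g., MinHash) would be the next refinement.
        digest = hashlib.sha256(text.lower().encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        cleaned.append(rec)
    return cleaned

corpus = [
    {"text": "Sample annotated with cell type: T cell", "source": "GEO"},
    {"text": "Sample annotated with cell type: T cell", "source": "GEO"},
    {"text": "??", "source": "scrape"},
]
print(len(clean_corpus(corpus)))  # -> 1
```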

Tools to Fine-tune a Model

After assembling a substantial, high-quality dataset, the next step is fine-tuning a model on it. Thanks to strong support from the open-source community, tools like Hugging Face Transformers, combined with cloud platforms such as AWS or GCP, give us access to the techniques and computational resources essential for fine-tuning these models.
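A condensed fine-tuning sketch with Hugging Face Transformers might look like the following. The base model (gpt2 as a stand-in for a domain foundation model), the corpus file name, and the hyperparameters are placeholders; a real run requires GPUs and careful tuning.

```python
# Fine-tuning sketch with the Hugging Face Trainer API.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # stand-in for a domain foundation model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# "curated_corpus.txt" is a hypothetical file holding the cleaned corpus.
dataset = load_dataset("text", data_files={"train": "curated_corpus.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned-model",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```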

Figure: Model performance increases as the number of markers increases, which is how data quality is controlled here.

Deploying and Monitoring

After obtaining the model, deployment and continuous monitoring are necessary during its use. Regular performance checks, with fine-tuning based on the resulting metrics, contribute to the continued refinement of the model's capabilities. This iterative process keeps the model effective and relevant despite the knowledge cut-off of its training data.
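One lightweight way to operationalize this is a rolling quality metric over live traffic that flags degradation. In the sketch below, the scoring signal (e.g., an expert accepting or rejecting an answer), the window size, and the alert threshold are all illustrative assumptions.

```python
# A minimal monitoring sketch: track a rolling quality metric on live
# traffic and flag degradation (e.g., from knowledge cut-off drift).
from collections import deque

class RollingMonitor:
    def __init__(self, window: int = 500, alert_below: float = 0.8):
        self.scores: deque = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, score: float) -> None:
        """score: 1.0 if an expert/automatic check accepted the answer."""
        self.scores.append(score)

    def healthy(self) -> bool:
        if len(self.scores) < 50:  # wait for enough samples
            return True
        return sum(self.scores) / len(self.scores) >= self.alert_below

monitor = RollingMonitor()
for accepted in [1.0] * 60 + [0.0] * 40:
    monitor.record(accepted)
if not monitor.healthy():
    print("Rolling acceptance below threshold; trigger review/fine-tuning.")
```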

Tools for Monitoring

Platforms such as AWS SageMaker provide valuable tools for this purpose, enabling ongoing evaluation. Additionally, benchmark test sets like TruthfulQA and MMLU serve as crucial metrics, allowing a comprehensive assessment of the model's accuracy.
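A hedged sketch of such an evaluation loop is shown below: iterate over a multiple-choice benchmark and compute accuracy. The `ask_model` function stands in for whatever inference endpoint (a SageMaker endpoint, for instance) you deploy, and the MMLU subject chosen here is arbitrary.

```python
# Benchmark-evaluation sketch: score a model on an MMLU subject.
from datasets import load_dataset

def ask_model(question: str, choices: list[str]) -> int:
    """Placeholder: return the index of the model's chosen answer."""
    raise NotImplementedError

mmlu = load_dataset("cais/mmlu", "college_biology", split="test")

correct = 0
for row in mmlu:
    pred = ask_model(row["question"], row["choices"])
    correct += int(pred == row["answer"])
print(f"Accuracy: {correct / len(mmlu):.2%}")
```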

Utilization of Generative AI in Life Sciences

Having seen how to identify the appropriate use case and prepare a generative AI model to address it effectively, we now present an example of generative AI in action. This architecture revolves around consolidating omics data from sources such as GEO and TCGA to construct a reasoning engine powered by LLMs. The engine is designed to provide comprehensive answers to complex user queries.


To address LLM hallucinations caused by missing data in training sets, we employ vector stores and databases accessible via APIs for precise data retrieval. We have also integrated specialized tools, such as cell type annotation models fine-tuned exclusively for that task. Similarly, other models are provided to the LLM as tools for classification and for generating statistics or plots.
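To make the tool-use pattern concrete, here is a simplified Python sketch in which tools are registered in a dispatch table and a query is routed to one of them. The tool names, placeholder implementations, and keyword-based router are illustrative assumptions; a production system would let the LLM emit structured tool calls instead.

```python
# A simplified sketch of tool-augmented generation: the system picks a
# registered tool, and the tool's output is folded back into the answer.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {}

def tool(name: str):
    def register(fn: Callable[[str], str]):
        TOOLS[name] = fn
        return fn
    return register

@tool("annotate_cell_type")
def annotate_cell_type(payload: str) -> str:
    # In practice this wraps a fine-tuned cell type annotation model.
    return f"predicted cell type for '{payload}': T cell (placeholder)"

@tool("query_vector_store")
def query_vector_store(payload: str) -> str:
    # In practice this hits the omics vector store / database API.
    return f"top matching GEO/TCGA records for '{payload}' (placeholder)"

def answer(query: str) -> str:
    # A real system would let the LLM choose the tool; here we route
    # with a trivial keyword heuristic purely for illustration.
    name = "annotate_cell_type" if "cell type" in query else "query_vector_store"
    evidence = TOOLS[name](query)
    return f"LLM answer grounded in tool output: {evidence}"

print(answer("Which cell type does this expression profile suggest?"))
```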

Our approach covers the essential elements of effective generative AI use: the LLM handles data retrieval and manipulation, and an expert monitors the output to reduce the risk of hallucination. Augmenting LLMs with tools and databases in this way increases user productivity while mitigating the challenges that come with generative AI.

P.S. Our Co-founder and CTO, Swetabh Pathak, elaborated on this during our annual event, DataFAIR. Here is the recording if you want to watch it. You can reach out to info@elucidata.io or subscribe to our newsletter here to stay updated on Elucidata's GenAI efforts.
