Company & Culture

Biopharma and AI: 2022 & Beyond

Deepthi Das
December 21, 2022

What a year 2022 was! A confluence of remarkable biomedical discoveries, AI-led innovation, and a massive paradigm shift toward incorporating AI across the industry. The world is at an exciting juncture where big data is generating valuable, actionable insights. It was an eventful year, marked by breakthroughs such as:

  • Early Diagnosis of Life-threatening Health Issues:
    An AI algorithm, trained on a vast number of electronic health records, predicted the development of sepsis in patients and directly contributed to reducing sepsis-related deaths by nearly 20%.
  • Unveiling the Protein Structure - 2 Major Milestones:
    DeepMind’s AlphaFold 2, trained on over 170,000 proteins from a public repository of protein sequences and structures, predicted the molecular structures of just about every known protein (about 200 million estimated protein shapes). Meta’s ESMFold, trained on genetic data from large-scale metagenomic screens of soil, seawater, and other sources, predicted the structures of around 600 million putative proteins.
  • Predicting Viruses with Pandemic Potential:
    Researchers are using the data collected during the COVID-19 pandemic to analyze viral mutations, predicting when a new variant such as Omicron will emerge and become dominant. Other algorithms are programmed to examine viruses currently spreading through the animal kingdom to identify which ones might jump to humans, potentially helping researchers avert the next pandemic.

These are just a few examples of the milestones achieved in 2022. Over the last decade, there has been a palpable shift: pharma companies are focusing deeply on how data, whether generated in-house or by researchers globally, can be reused rather than repeating every experiment firsthand.

Pharma Companies Are in a Race to Use Big Data

Well-established pharmaceutical companies like J&J, GSK, AstraZeneca, Novartis, Pfizer, Sanofi, and Eli Lilly have already made significant investments in AI technology: equity investments, acquisitions of or partnerships with AI-focused companies, building internal capabilities, or a combination of these approaches. The intent is to use AI/ML to find relevant drug candidates, targets, biomarkers, or disease signatures.

However, they quickly run up against the fact that finding relevant data is one of the most crucial and time-consuming steps in the insight-discovery pipeline. The lack of harmonized metadata, data labels, standardized ontologies, and complete annotations makes data reuse incredibly difficult.

80% of the time in any big data analysis goes into data wrangling!

The next big question: how does one improve the data-wrangling process? Simple: data curation!
Curation is an umbrella term for the process of creating, organizing, and maintaining datasets so they can be accessed and used by researchers mining for information. It gets even better with an automated curation process. Yes, you read that right!

Automated curation uses domain-specific ML models that are trained on biomedical data and can recognize the context in which a word is used. It is no mean feat for a model to differentiate between ‘fish and chips’ in everyday writing and ‘FISH’ (fluorescence in situ hybridization) and ‘chips’ (DNA microarrays) in biomedical literature. Once trained, these models curate data automatically, saving a ton of time and resources.
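To make that concrete, here is a minimal sketch of context-aware biomedical entity tagging, the kind of model that powers automated curation. It assumes the Hugging Face transformers library; the model named below is just one publicly available biomedical NER model, not necessarily what any given curation pipeline uses.

    # Minimal sketch: context-aware biomedical entity recognition.
    # Assumes: pip install transformers torch
    from transformers import pipeline

    # "d4data/biomedical-ner-all" is one public biomedical NER model on the
    # Hugging Face Hub; any comparable domain-specific model would do.
    ner = pipeline(
        "token-classification",
        model="d4data/biomedical-ner-all",
        aggregation_strategy="simple",  # merge sub-word tokens into entities
    )

    # In biomedical text, "FISH" should resolve to the assay, not the animal.
    text = "FISH analysis confirmed HER2 amplification in the tumor samples."
    for entity in ner(text):
        print(entity["word"], "->", entity["entity_group"], f"({entity['score']:.2f})")

A general-purpose English model has no reason to read ‘FISH’ as an assay; a domain-trained one picks that up from surrounding cues like ‘analysis’ and ‘amplification’.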

So, this automated curation process labels big data, such as bulk RNA-seq or scRNA-seq data (>10 GB per dataset), and puts these publicly available but otherwise unfindable and unusable resources a few clicks of your mouse away. Literally! Imagine shaving a significant chunk off that 80% spent on data wrangling!
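What does that labeling look like in practice? Below is a hypothetical sketch of metadata harmonization: mapping the free-text labels attached to a public RNA-seq dataset onto controlled-vocabulary terms so the dataset becomes findable. The field names and mapping table are illustrative, not any particular platform’s schema.

    # Hypothetical sketch: harmonizing free-text dataset metadata onto
    # ontology terms. The mapping below is a toy stand-in; a real pipeline
    # would query ontologies such as NCBI Taxonomy, Cell Ontology, and EFO.
    RAW_METADATA = {
        "organism": "human",
        "tissue": "PBMCs",
        "assay": "single cell rna sequencing",
    }

    ONTOLOGY_MAP = {
        "human": ("Homo sapiens", "NCBITaxon:9606"),
        "pbmcs": ("peripheral blood mononuclear cell", "CL:2000001"),
        "single cell rna sequencing": ("single-cell RNA sequencing", "EFO:0008913"),
    }

    def harmonize(raw: dict) -> dict:
        """Map raw metadata values to (label, ontology ID) pairs where known."""
        harmonized = {}
        for field, value in raw.items():
            label, term_id = ONTOLOGY_MAP.get(value.lower(), (value, "unmapped"))
            harmonized[field] = {"label": label, "ontology_id": term_id}
        return harmonized

    print(harmonize(RAW_METADATA))

Once every dataset carries the same standardized labels, a query like ‘human PBMC scRNA-seq’ can surface the right data directly instead of after weeks of manual digging.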

This is the Herculean challenge that a few AI-based start-ups like ours are looking to solve, so that pharma companies can focus on the bigger ones: using AI to make research data FAIR (Findable, Accessible, Interoperable, and Reusable) and deliver relevant data in a standardized schema for downstream analysis. At Elucidata, we have made huge strides in that direction over the past few years. We have partnered with a variety of biopharma companies pursuing targets in domains such as oncology, genomics, and therapeutics, and helped them accelerate target ID time by ~75%, reduce data analysis time by ~25x, and de-risk advancement to Phase II, thereby lowering trial costs and saving ~$3M. These are just a few of our success stories.

We deserve to take a moment to pause and pat ourselves on the back for how far we have come, while remembering that we still have a long way to go. As Robert Frost wrote,

The woods are lovely…but I have miles to go before I sleep.

The outcomes of integrating science with technology never cease to amaze us.
Here's to more exciting and adventurous years ahead!

This post was originally published in Polly Bits, our biweekly newsletter on LinkedIn.
