The real challenge isn’t building an ML model, but building an integrated ML system and having control over it in production.
The nuclei of the trillions of cells in a human body could be packed into a handful of poppy seeds. Yet unpacking the internal mechanisms of a single nucleus, that is, unthreading a genome, yields enough data to keep a data center the size of a football stadium running for days. There are currently around 80,000 known proteins in the human body, and a protein's structure determines the key functions of the cell; understanding cellular behavior therefore requires deciphering protein structure. This is foundational even to drug discovery for outbreaks like COVID-19 that have devastated the entire world. The RNA genome of SARS-CoV-2 alone has 29,811 nucleotides, encoding 29 proteins. Connecting this ever-growing set of data points is a feature-engineering nightmare, and in biomedical research the lack of structured data has been a roadblock for researchers trying to tap into the advantages of machine learning.
Generating genome data has long been an uphill task, mainly due to the scarcity of genome sequencers. Sequencers are now relatively affordable, so smaller organizations and individual researchers can easily get their hands on these machines. This newfound accessibility adds to biology's big-data problem: more sequencers mean more data, and hence higher dimensionality. Having no data is a problem, but having swathes of unstructured data is a bigger one. Today we have the hardware and state-of-the-art techniques such as machine learning, but when a pandemic engulfs the world, there is little time for experimentation. Though it is expensive, AI research labs can still crunch models with billions of parameters on modern GPUs, TPUs, and ML accelerators.
Recently, machine learning has even found its way into cancer research. Identifying the right therapeutic options or drugs requires sifting through thousands of tumors from various tissues and then mapping them onto human cancer cell lines. Typically, logic-based modeling is used to find combinations that respond to certain drugs. Now, machine learning can be used to gauge the relative importance of different data types in predicting drug response. A typical ML-based drug discovery pipeline moves from curated molecular data, through feature engineering, to models trained and validated against measured drug responses.
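The shape of such a pipeline can be sketched in a few lines. The sketch below is purely illustrative: the data is synthetic, standing in for molecular features (e.g. mutation or expression values) across cell lines, and the response variable mimics a dose-response summary such as an IC50.

```python
# Minimal sketch of an ML drug-response pipeline on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Toy stand-in: 200 "cell lines" x 50 molecular features; the response
# depends mostly on the first feature plus noise.
X = rng.normal(size=(200, 50))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),  # normalize features before modeling
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])
pipeline.fit(X_train, y_train)
print(f"R^2 on held-out cell lines: {pipeline.score(X_test, y_test):.2f}")
```

In a real setting, the synthetic arrays would be replaced by curated multi-omics matrices, and the held-out evaluation would be stratified by tissue or drug.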
In a paper titled "A Landscape of Pharmacogenomic Interactions in Cancer", the authors combined logic-based modeling and machine learning to model drug response. Thousands of cancer cell lines were profiled for drug sensitivity in these experiments: 265 drugs were screened across 990 cancer cell lines, producing 212,774 dose-response curves. Alongside conventional methods, the authors used machine learning to assess the contribution of each molecular data type in explaining variation in drug response.
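One common way to ask "which molecular data type matters most?" is to train a model on all feature blocks and compare their importances. The sketch below is a hedged illustration on synthetic data, not the paper's actual method: a binary "mutation" block and a continuous "expression" block, with the response driven mostly by one mutation feature.

```python
# Illustrative sketch: comparing the contribution of two synthetic
# molecular data types via random-forest feature importances.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 300
mutations = rng.integers(0, 2, size=(n, 10)).astype(float)  # binary calls
expression = rng.normal(size=(n, 10))                        # continuous
X = np.hstack([mutations, expression])
# Response is driven mostly by one mutation feature.
y = 3.0 * mutations[:, 0] + 0.5 * expression[:, 0] + rng.normal(size=n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
imp = model.feature_importances_
print("mutation share:  ", round(imp[:10].sum(), 2))
print("expression share:", round(imp[10:].sum(), 2))
```

Summing importances per block gives a rough answer to how much each data type explains; the mutation block dominates here by construction.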
Making sense of huge swathes of multi-omics data, be it transcriptomics or genomics, is key to accelerating drug discovery. The vastness of the data, combined with its unstructured nature, makes data preparation hard, all the more so in niche fields like computational biology, where one can't afford to have a hobbyist label protein structures. Domain expertise is crucial, which in turn makes interdisciplinary collaboration a huge challenge. Having machine learning in their arsenal could therefore prove handy for bioinformatics experts, but making sense of data should demand the least possible expertise in data ingestion, model versioning, and a whole host of other software-engineering chores. This is where MLOps, or Machine Learning Operations, comes in.
MLOps unifies ML system development (Dev) and ML system operation (Ops). It allows the user to automate and monitor all steps of ML deployment.
MLOps emerged from attempts to solve deep-rooted data-analytics problems. For years, engineers have struggled to automate the monitoring of workflows in production; what starts as a software-engineering hiccup in a cubicle can cascade into a fatal misdiagnosis. Large datasets, inexpensive on-demand compute resources, state-of-the-art algorithms, and many other ingredients for applying ML effectively are already at our disposal today. For instance, omics, the study of genes (genomics), proteins (proteomics), mRNA (transcriptomics), and other aspects of biology, explores the functionality of molecules in a cell. The tools developed in these fields have allowed us to understand molecules better and even enable drug discovery.
For example, a mix-up in file formats can put an entire genomic data analysis in jeopardy. Omics data in public and private repositories is often unstructured and not always analysis-ready. Researchers have to spend an enormous amount of time grappling with different file formats (CSV, Excel, GCT, SOFT, H5AD, etc.) and different conventions for metadata representation.
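The format grappling is concrete: the same expression matrix may arrive as a plain CSV or as a GCT file, whose first two lines carry a version tag and the matrix dimensions. A small loader, sketched below with illustrative file contents, can normalize both into one genes-by-samples table.

```python
# Hedged sketch: reading the same expression matrix from CSV or GCT v1.2.
import io
import pandas as pd

def read_expression(text: str, fmt: str) -> pd.DataFrame:
    """Return a genes x samples DataFrame from CSV or GCT v1.2 text."""
    if fmt == "csv":
        return pd.read_csv(io.StringIO(text), index_col=0)
    if fmt == "gct":
        # GCT v1.2: a version line ("#1.2"), a dimensions line, then a
        # tab-separated table whose first columns are Name and Description.
        df = pd.read_csv(io.StringIO(text), sep="\t", skiprows=2)
        return df.drop(columns=["Description"]).set_index("Name")
    raise ValueError(f"unsupported format: {fmt}")

gct = "#1.2\n2\t2\nName\tDescription\tS1\tS2\nTP53\tna\t1.5\t2.0\nBRCA1\tna\t0.3\t0.9\n"
csv = "gene,S1,S2\nTP53,1.5,2.0\nBRCA1,0.3,0.9\n"

a = read_expression(gct, "gct")
b = read_expression(csv, "csv")
print(a.values.tolist() == b.values.tolist())  # same matrix either way
```

Platforms that curate omics data essentially perform this normalization, plus metadata harmonization, at scale, so downstream models see one consistent shape regardless of the source format.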
Tools like Elucidata’s Polly facilitate the ML community’s efforts to bring advantages to the field of biology. Polly creates a unique, centralized ecosystem that enables a diverse team of biologists, bioinformaticians and data scientists to get access to ML-ready data and focus on insight discovery in place of data cleaning, preparation and processing.
In the life sciences, the focus is often on generating data rather than analyzing it. Programs like the NIH Common Fund's Bridge to Artificial Intelligence (Bridge2AI) aim to generate new flagship biomedical datasets that are ethically sourced and accessible. Initiatives such as Bridge2AI signal the growing demand for AI in medicine. Combine high-quality biomedical data with MLOps tools, and what one has is a robust ML product that can rewrite the history of mankind!