
Role of MLOps in Biomedical Research

Biomedical research, with its rapid pace of innovation, stands to gain substantially from MLOps in drug discovery. Work in this field is driven by the need to deliver timely, high-quality healthcare solutions. How can Machine Learning Operations (MLOps) transform the way biomedical researchers predict ADMET properties and identify promising drug candidates?

The growing reliance on data-driven approaches in research and clinical applications makes MLOps a necessity in biomedical research. In this context, machine learning models need to be robust, scalable, and reproducible to advance medical research and improve patient outcomes. One of the key applications of MLOps in biomedical research is ADMET property prediction, which involves assessing the Absorption, Distribution, Metabolism, Excretion, and Toxicity of chemical compounds. Accurate ADMET predictions are vital for drug discovery and development, helping researchers identify promising drug candidates and avoid costly failures at later stages.
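
To make this concrete, below is a minimal sketch of how an ADMET-style classifier might be assembled, assuming RDKit and scikit-learn are available. The SMILES strings, descriptor choice, and toxicity labels are illustrative only, not real assay data or our production model.

```python
# A toy ADMET-style toxicity classifier: a few RDKit physicochemical
# descriptors feed a scikit-learn random forest. Illustrative data only.
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestClassifier

def featurize(smiles: str) -> list[float]:
    """Compute descriptors commonly used as ADMET model inputs."""
    mol = Chem.MolFromSmiles(smiles)
    return [
        Descriptors.MolWt(mol),         # molecular weight
        Descriptors.MolLogP(mol),       # lipophilicity
        Descriptors.TPSA(mol),          # topological polar surface area
        Descriptors.NumHDonors(mol),    # hydrogen-bond donors
        Descriptors.NumHAcceptors(mol), # hydrogen-bond acceptors
    ]

# Hypothetical training set: SMILES strings with binary toxicity labels.
smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]
labels = [0, 1, 0, 0]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit([featurize(s) for s in smiles], labels)

print(model.predict([featurize("CC(C)Cc1ccc(cc1)C(C)C(=O)O")]))  # e.g. ibuprofen
```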

Another significant use case is predicting the efficacy of a particular drug for a specific individual, known as personalized medicine. By leveraging patient-specific data and advanced ML models, researchers can predict how different individuals will respond to a given treatment, leading to more effective and tailored therapies.

Additionally, MLOps plays a crucial role in stratifying patients, grouping them based on genetic, phenotypic, or clinical characteristics. This stratification enables precise diagnosis, targeted treatment, and improved disease management.

MLOps practices enhance the effectiveness and scalability of these applications by ensuring that ML models are continuously updated with new data, rigorously tested, and efficiently deployed. In biomedical research, this translates into reliable predictions, faster insights, and easier handling of large-scale data processing and analysis.

What is MLOps?

MLOps encompasses the principles, practices, and tools necessary to streamline the lifecycle of ML projects, from data preparation and model training to deployment and monitoring. It also offers significant operational benefits, reducing overhead and enabling more reliable predictions and more accurate insights in biomedical research.

Lately, ML has become increasingly pervasive across industries. However, deploying ML models into production environments brings its own set of challenges. Traditionally, the development and deployment of software applications have been governed by DevOps practices, emphasizing collaboration, automation, and continuous integration/continuous deployment (CI/CD). MLOps extends these principles to machine learning, ensuring that ML systems are developed, deployed, and maintained efficiently and reliably.

ML Project Steps

The steps in an ML project typically include the following (a minimal end-to-end sketch follows the list):

  • Data Preparation: Collecting, cleaning, and preprocessing data to make it suitable for training ML models.
  • Model Training: Developing and training ML models using the prepared data.
  • Model Evaluation: Assessing the performance of the trained models to ensure they meet the required standards.
  • Deployment: Deploying the models into production environments where they can be used for making predictions.
  • Monitoring and Maintenance: Continuously monitoring the performance of deployed models and making necessary updates to maintain their effectiveness.
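
Below is a minimal sketch of these five stages end to end, using scikit-learn on a bundled public dataset; the dataset, the AUC acceptance threshold, and the artifact filename are illustrative assumptions, not a production biomedical pipeline.

```python
# The five ML project stages in miniature, using scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import joblib

# 1. Data preparation: load, split, and (via the pipeline) scale features.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Model training.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# 3. Model evaluation against a required standard (here, AUC >= 0.9).
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
assert auc >= 0.9, f"Model below baseline: AUC={auc:.3f}"

# 4. Deployment: serialize the model artifact for a serving environment.
joblib.dump(model, "model.joblib")

# 5. Monitoring and maintenance would track live predictions against this
#    AUC baseline and trigger retraining when performance degrades.
```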

By adopting MLOps practices, biomedical researchers can ensure that their ML models are not only accurate and reliable but also scalable and maintainable, thereby accelerating the pace of medical advancements and improving healthcare around the globe. 

Challenges in MLOps and Maintaining the Infrastructure

MLOps encounters a wide range of challenges spanning different domains, including those unique to machine learning as well as issues common in software engineering. The most prevalent include:

Data Preparation

  1. Data Quality: Ensuring the availability of clean, consistent, and accurately labeled data for model training can be challenging.
  2. Data Integration: Integrating diverse data sources (e.g., clinical data, genomic data) can be complex and requires robust data harmonization techniques.

Model Training

  1. Availability of High-Quality Labeled Data: Obtaining sufficient labeled data for training ML models in biomedical applications is often challenging due to the high cost and effort required in data annotation.
  2. Computational Resources: Training models, especially deep learning models, can be computationally expensive and time-consuming, requiring significant computational power and infrastructure.

Model Evaluation

  1. Defining Appropriate Metrics: Selecting performance metrics that accurately reflect model efficacy in biomedical contexts can be difficult.
  2. Generalization: Ensuring that a model generalizes correctly to new, unseen data requires rigorous validation and testing processes.

Deployment

  1. System Integration: Integrating ML models into existing healthcare systems and workflows can be technically challenging and warrants careful planning.
  2. Scalability: Establishing robust, scalable, and reliable computing infrastructure capable of handling high workloads can be onerous.
  3. Latency: Balancing latency and throughput requirements to meet performance expectations requires careful optimization.
  4. Monitoring: Implementing comprehensive logging and monitoring mechanisms for pipelines and production usage is essential for maintaining system health.

Monitoring and Maintenance

  1. Model Drift Detection: Continuously monitoring models for performance degradation over time, known as model drift and caused mostly by changes in the incoming data, is essential to maintain accuracy and reliability (see the sketch after this list).
  2. Updating Models: Updating deployed models without disrupting ongoing services requires careful planning and robust CI/CD pipelines.
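
As a sketch of the drift check mentioned above, the snippet below compares a feature's training-time distribution with recent production values using a two-sample Kolmogorov-Smirnov test; the synthetic data and alert threshold are illustrative assumptions.

```python
# Data-drift detection for one feature via a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_values = rng.normal(loc=0.0, scale=1.0, size=5000)    # reference window
production_values = rng.normal(loc=0.4, scale=1.0, size=1000)  # shifted live data

statistic, p_value = ks_2samp(training_values, production_values)
if p_value < 0.01:  # the threshold is a policy choice, not a universal constant
    print(f"Drift suspected (KS={statistic:.3f}, p={p_value:.2e}); flag for review")
```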

Addressing MLOps Challenges at Elucidata

Elucidata has been tackling these challenges by creating and utilizing various ML pipelines and MLOps flows tailored to different projects. This includes curating biomedical datasets and streamlining and speeding up the audit process of identifying suitable datasets for specific needs. It also covers end-to-end data journeys, spanning data preparation, model training, deployment, monitoring, and maintenance.

Elucidata’s cloud platform, Polly, delivers harmonized medical data to accelerate key research milestones. Data is ingested into Polly from a number of different sources, each of which may have its own format and structure for presenting the data.

[Figure: Steps from data uptake to user-ready data in Polly]

In MLOps, various components form the backbone of the data infrastructure and are essential for ensuring seamless operations:

  1. Connectors: These are responsible for establishing ETL (Extract, Transform, Load) pipelines dedicated to data ingestion. Their primary function is to fetch data from diverse sources and prepare it for further processing.
  2. Curation: This step involves the meticulous annotation and labeling of data, utilizing both machine learning techniques and manual curation processes. Curation is pivotal for ensuring the quality and relevance of the data for downstream tasks.
  3. Ingestion: Ingestion mechanisms include the development of APIs tailored for seamlessly incorporating data into the system. Additionally, they incorporate data validation processes to ensure the integrity and reliability of the ingested data.
  4. Indexing: Indexing functionalities involve the implementation of ETL pipelines dedicated to indexing data within the storage layer. These pipelines facilitate efficient data retrieval and query processing.
  5. Storage: Data storage is a critical aspect of MLOps infrastructure, involving the storage of processed data in repositories such as ElasticSearch and Deltalakes. These storage solutions offer scalability and performance optimizations tailored for handling large volumes of data.
  6. Consumption: This component focuses on exposing the processed data to clients for utilization. It encompasses the development of interfaces, such as Polly Python and Polly Frontend, through which clients can access and interact with the data effectively.

Together, these components form a comprehensive framework for managing data within MLOps, ensuring its integrity, accessibility, and usability throughout the entire lifecycle.
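
To illustrate the connector pattern in step 1, here is a hypothetical extract-transform-load sketch; the source URL, field names, and validation rule are placeholders, not Polly's actual API.

```python
# A skeletal connector: fetch records from a source, normalize them to a
# shared schema, and hand them to an ingestion step with basic validation.
import json
from urllib.request import urlopen

def extract(source_url: str) -> list[dict]:
    """Fetch raw records from a source-specific endpoint."""
    with urlopen(source_url) as response:
        return json.load(response)

def transform(record: dict) -> dict:
    """Map source-specific fields onto a common internal schema."""
    return {
        "dataset_id": record.get("accession"),
        "organism": (record.get("organism") or "").lower(),
        "assay": record.get("assay_type", "unknown"),
    }

def load(records: list[dict]) -> None:
    """Stand-in for the ingestion API, with a simple validation rule."""
    for r in records:
        assert r["dataset_id"], "record failed validation: missing dataset_id"
    print(f"ingested {len(records)} records")

def run_connector(source_url: str) -> None:
    load([transform(r) for r in extract(source_url)])

# run_connector("https://example.org/api/datasets")  # hypothetical endpoint
```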

Data Selection and Preparation

The first step in the ingestion process is data curation. It involves transforming the data into a consistent, machine-readable format, annotating datasets as well as samples, and making them accessible on the platform. This generally involves selecting data to train on, cleaning it to remove inconsistencies, standardizing it according to set conventions, and ensuring that it is labeled properly. Proper data selection is crucial to avoid skewed models, which can lead to inaccurate predictions and poor generalization.
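
As a small illustration of this cleaning and standardization step, the pandas sketch below maps messy sample metadata onto a consistent schema; the column names and controlled vocabulary are assumed conventions, not our production standard.

```python
# Standardizing sample metadata: consistent column names, stripped and
# lowercased values mapped to a controlled vocabulary, unusable rows dropped.
import pandas as pd

raw = pd.DataFrame({
    "Tissue": ["Liver ", "liver", "LUNG", None],
    "disease": ["NAFLD", "nafld", "copd", "copd"],
})

vocabulary = {"liver": "liver", "lung": "lung"}  # hypothetical controlled terms

clean = (
    raw.rename(columns=str.lower)        # consistent column names
       .dropna(subset=["tissue"])        # drop samples missing key fields
       .assign(
           tissue=lambda df: df["tissue"].str.strip().str.lower().map(vocabulary),
           disease=lambda df: df["disease"].str.upper(),
       )
)
print(clean)
```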

[Figure: Data selection and preparation process]

Elucidata stands at the forefront of biomedical research with its vast experience in processing biomedical data, and hosts an extensive repository of meticulously cleaned, labeled, curated, and harmonized datasets spanning various types such as Bulk RNA-Seq and Single Cell. When a novel variation is encountered, the team selects appropriate data from various sources, including open repositories like GEO and CPTAC, and prepares it for model training. This process includes cleaning the data, labeling or curating it manually, and presenting it in a standard format.

Model Training and Tuning

Once the data is cleaned, properly labeled, harmonized, and stored securely, the next step is to prepare it for model training. This includes loading the data in a format suitable for the selected model and normalizing it if required. This data serves as the source for training, validating, and testing our models. Model training, often considered one of the easiest steps in the entire ML pipeline, involves fine-tuning the selected algorithm using a subset of meticulously prepared data. Despite its relative simplicity compared to other phases, such as data preparation and deployment, its importance cannot be overstated. During training, we experiment with hyperparameters like learning rates, epochs, and batch sizes to optimize model performance. The ultimate goal is to minimize the difference between the model's predicted values and actual outcomes, a process guided by robust loss functions.

While opinions vary among practitioners, many agree that even though training is straightforward in concept, achieving optimal performance requires meticulous attention to detail and iterative refinement. This phase lays the foundation for subsequent stages, including deployment and ongoing monitoring, where the true impact and reliability of the model are validated in real-world applications.

By focusing on rigorous experimentation and parameter tuning during training, we ensure that our models not only meet but exceed performance expectations, paving the way for successful deployment and operationalization in diverse biomedical research settings.

Model Training Process

A subset of the prepared and loaded data is used to train and tune the models, and the trained models are then tested on a separate set of datasets. As part of training, we experiment with different hyperparameters like learning rate, number of epochs, and batch size to optimize model performance. Finding optimal hyperparameters takes time, but it is crucial for model performance. During training, a suitable loss function measures the difference between the model's predicted values and actual values, with the goal of minimizing this difference.
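
A minimal sketch of such a hyperparameter search, using scikit-learn's GridSearchCV with a gradient-descent classifier whose log loss is the function being minimized; the grid values and synthetic dataset are illustrative assumptions.

```python
# Hyperparameter search: learning rate and iteration budget are tuned by
# cross-validated grid search; log loss is the loss being minimized.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = GridSearchCV(
    SGDClassifier(loss="log_loss", random_state=0),
    param_grid={
        "eta0": [0.001, 0.01, 0.1],              # initial learning rate
        "max_iter": [500, 1000],                 # epoch budget
        "learning_rate": ["constant", "optimal"] # schedule
    },
    cv=3,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```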

[Figure: Model training process]

Our training images are stored on Amazon ECR. Models are trained using these images, with the underlying infrastructure running on ECS, and the resulting training artifacts are pushed back to S3.
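
A hedged sketch of the final hand-off in that flow, pushing a run's artifacts to S3 with boto3; the bucket name, object keys, and local file paths are hypothetical placeholders.

```python
# Upload the model weights and metrics produced by a training run to S3.
import boto3

s3 = boto3.client("s3")

def push_training_artifacts(run_id: str) -> None:
    bucket = "example-training-artifacts"  # hypothetical bucket name
    for local_path, key in [
        ("model.joblib", f"runs/{run_id}/model.joblib"),
        ("metrics.json", f"runs/{run_id}/metrics.json"),
    ]:
        s3.upload_file(local_path, bucket, key)
```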

To date, Elucidata has built more than 25 ML models that have been used in production for various use cases:

  1. Disease Subtype Prediction: Predicting specific sub-types of diseases based on patient data to enable more precise diagnoses and treatment plans.
  2. Patient Stratification: Grouping patients based on genetic, phenotypic, or clinical characteristics to target treatments more effectively.
  3. Automated Annotations on Datasets: Automatically annotating datasets, such as identifying cell types in single-cell RNA sequencing data, to accelerate research workflows.
  4. ML Models for Powering Chatbots: Developing ML models to power chatbots that answer biomedical research-related questions, enhancing accessibility to information and resources.

By implementing MLOps practices, Elucidata ensures that these models are developed, deployed, and maintained efficiently, paving the way for advances in biomedical research and healthcare solutions.

Production Deployment and Monitoring

The deployment and monitoring phase is often considered the hardest part of the MLOps lifecycle. Once the best model is selected after training, we deploy it to an AWS SageMaker endpoint, from where it is used to make predictions. The performance of these models is monitored continuously to ensure they perform above baseline expectations and to detect any data drift or concept drift.
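
For illustration, a minimal sketch of calling a deployed SageMaker endpoint through boto3's sagemaker-runtime client; the endpoint name and payload schema are hypothetical.

```python
# Invoke a deployed SageMaker endpoint and parse the JSON prediction.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="example-admet-model",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps({"features": [310.4, 2.1, 78.9, 1, 5]}),
)
prediction = json.loads(response["Body"].read())
print(prediction)
```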

[Figure: Model development process]

We utilize various tools to monitor the entire MLOps pipeline and model quality. This includes Amazon-managed services like CloudWatch for logging and performance metrics, SageMaker Model Monitor for monitoring model endpoints, and a Prometheus/Grafana setup for monitoring underlying machine performance.
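
As an example of feeding such dashboards, here is a hedged sketch that publishes a custom model-quality metric to CloudWatch with boto3; the namespace, metric, and dimension names are hypothetical.

```python
# Publish one model-quality observation so CloudWatch dashboards and
# alarms can track it over time.
import boto3

cloudwatch = boto3.client("cloudwatch")

def report_model_accuracy(model_name: str, accuracy: float) -> None:
    cloudwatch.put_metric_data(
        Namespace="Example/ModelQuality",  # hypothetical namespace
        MetricData=[{
            "MetricName": "PredictionAccuracy",
            "Dimensions": [{"Name": "ModelName", "Value": model_name}],
            "Value": accuracy,
            "Unit": "Percent",
        }],
    )
```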

Production Use and Cost Considerations

Models deployed in production are used daily, which incurs operational costs, including computational resources and infrastructure overhead. Continuous monitoring ensures that any deviations from expected performance are promptly identified and addressed. Over time, usage of deployed models may fluctuate, leading to periods of lower utilization where costs must still be borne. Ongoing cost management is therefore crucial to optimize resources and maintain efficiency.

Model Development Process

[Figure: Flowchart depicting curation in production]

Once automated curation of the dataset is complete, it undergoes manual validation by curators using Elucidata's in-house curation tool. This manual validation helps eliminate inaccuracies and allows tracking of performance metrics effectively. If any drift is detected during monitoring, Elucidata conducts thorough analyses to assess model conditions and determine if any upstream systems require auditing or adjustments.

Addressing Infrastructure Challenges

In MLOps, navigating technological infrastructure and financial constraints can pose significant challenges. At Elucidata, we leverage our expertise in cost optimization and scalability to tackle these challenges proactively. Employing a range of strategies and solutions, we optimize resource utilization, manage scalability effectively, and mitigate expenses, thereby ensuring operational efficiency and sustainable growth. Our approach encompasses several key initiatives, including the following:

Scalability and Reliability

Running ML pipelines demands robust and scalable infrastructure. This entails infrastructure that can dynamically scale to meet substantial compute demands and maintain high availability to prevent failures.

At Elucidata, on any given day, we handle up to 5,000 vCPUs or 20 TB of RAM. This necessitates flexible infrastructure that can scale seamlessly with demand. Moreover, processing and delivering data within promised SLAs is crucial to avoid downtime or operational setbacks.

Based on these requirements, Elucidata has designed its infrastructure on top of AWS services like ECS, AWS Batch, and Kubernetes clusters. For some pipelines, we use openly available pipeline orchestrators like Prefect and Nextflow, or our in-house orchestrators for specific needs. This underlying infrastructure gives us practically unlimited scalability and the compute capacity we require.
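
To illustrate the orchestration pattern, here is a minimal Prefect (2.x) flow; the task bodies and dataset ID are illustrative stubs rather than one of our production pipelines.

```python
# A skeletal two-step pipeline expressed as a Prefect flow. Retries on the
# fetch task absorb transient infrastructure failures; the paths are stubs.
from prefect import flow, task

@task(retries=2)
def fetch_dataset(dataset_id: str) -> str:
    """Stand-in for pulling raw data from a source repository."""
    return f"raw/{dataset_id}"

@task
def process(raw_path: str) -> str:
    """Stand-in for a processing step that emits a curated artifact."""
    return raw_path.replace("raw/", "processed/")

@flow
def processing_pipeline(dataset_id: str) -> str:
    return process(fetch_dataset(dataset_id))

if __name__ == "__main__":
    print(processing_pipeline("GSE12345"))  # hypothetical GEO accession
```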

[Figure: Services involved in designing the infrastructure]

We use various AWS services like S3, EBS, EFS, and FSx for storage, depending on requirements, and have built data lakes using AWS services to store and analyze this data. AWS CloudWatch and Prometheus/Grafana are utilized for monitoring and logging. Using these AWS services as building blocks to architect our solution provides a scalable and reliable bedrock.

A detailed analysis of our system architecture is beyond the scope of this blog; we recommend reading “How Polly’s Curated Biomedical Molecular Data Streamlines MLOps for Drug Discovery”, which dives into the technical architecture of our data processing pipeline infrastructure.

Cost Optimization and Performance Enhancement

Elucidata has achieved significant cost savings and performance improvements through a hybrid cloud setup compared to a pure cloud solution: we have reduced the cost of our processing pipelines by 85% while achieving 2.5X faster processing. This hybrid approach lets us leverage the benefits of both on-premises and cloud infrastructure, optimizing costs without compromising performance or scalability.

By focusing on scalability, reliability, and cost-efficiency in our infrastructure strategy, we ensure that our MLOps capabilities remain robust and adaptable to the dynamic demands of biomedical research and healthcare applications.

Selecting the appropriate storage solutions is a critical decision. For instance, AWS offers various options, including:

  • S3 object storage: Ideal for cost-effective long-term data storage. However, it may not be suitable for IO-intensive workloads due to slower read/write speeds, potentially leading to increased compute costs and processing delays.
  • Instance-attached local EBS storage: Well-suited for IO-intensive tasks as data is stored locally, facilitating faster read/write operations. Nonetheless, it lacks the ability to share data among tasks running on different instances, a common requirement in workflow pipelines.
  • EFS storage solution: Provides shared storage with costs based on used capacity. While it offers flexibility, it tends to be slower than EBS. Increasing throughput is possible but comes with additional costs.

Careful consideration is essential when selecting the appropriate storage solution and provisioning capacity and throughput for large-scale processing pipelines. Without proper architectural alignment with the workload, costs can escalate rapidly, underscoring the importance of thoughtful design to avoid unnecessary expenditure.

Further, continuous attention is required to fine-tune pipelines so that each step is allocated optimal resources. Under-resourcing pipeline steps increases execution time and failure rates; conversely, over-provisioning wastes resources and increases costs. We actively monitor resource allocation and utilization in pipelines using tools like AWS CloudWatch and Prometheus to address these challenges, and we consistently enhance our pipelines for maximum optimization.

Elucidata is also currently exploring bare-metal hybrid cloud infrastructure to provide low-cost, high-performance compute and storage. Bare-metal instances offer higher performance than virtualized environments and are available at lower cost than virtualized cloud instances. Moreover, local network storage in bare-metal data centers provides high throughput and IOPS at comparatively low cost.

This has facilitated faster processing and a substantial reduction in processing cost compared to a pure cloud solution. Since the initial results from this experiment have been promising, we will continue to invest in this hybrid infrastructure.

In a nutshell, this field requires continuous, consistent research, and Elucidata remains committed in time and effort to optimizing infrastructure cost. With this sharp focus, our teams continuously develop the unique skill set needed to produce the best results in biomedical research at minimal cost.

Conclusion

Elucidata's commitment to advancing MLOps practices has been pivotal in driving innovation and enhancing efficacy within the biotech sector. By optimizing processes across data preparation, model training, deployment, and monitoring, we have established a robust and scalable infrastructure capable of managing complex biomedical data pipelines. Our continuous research and refinement efforts ensure cost-efficient solutions while maintaining high standards of reliability and performance.

Addressing the challenges inherent in MLOps, such as data integration complexities, computational resource management, and maintaining model accuracy over time, underscores our dedication to delivering exceptional outcomes. Through our expertise in leveraging MLOps, we empower researchers and clinicians to accelerate discoveries and improve patient outcomes in biomedical research and healthcare.

Looking ahead, Elucidata remains steadfast in its commitment to advancing the frontier of MLOps, fostering collaboration, and driving impactful innovations in biotechnology. For further insights into our MLOps methodologies or to explore collaboration opportunities, contact us or learn more at [email protected]. We look forward to partnering with you to accelerate your research journey and achieve transformative results in biomedicine.
