Biomedical research, with its cutting-edge innovations and forward-looking outlook, stands to benefit enormously from MLOps in drug discovery. Research in this field is driven by the need to deliver timely, high-quality healthcare solutions. How can Machine Learning Operations (MLOps) transform the way biomedical researchers predict ADMET properties and identify promising drug candidates?
The intensifying reliance on data-driven approaches in research and clinical applications necessitates the adoption of MLOps in biomedical research. In this context, machine learning models must be robust, scalable, and reproducible to advance medical research and improve patient outcomes. One key application of MLOps in biomedical research is ADMET property prediction: assessing the Absorption, Distribution, Metabolism, Excretion, and Toxicity of chemical compounds. Accurate ADMET predictions are vital for drug discovery and development, helping researchers identify promising drug candidates and avoid costly late-stage failures.
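To make the ADMET screening idea concrete, here is a deliberately naive sketch: it derives crude atom counts from a SMILES string and applies a Lipinski-style filter. Real ADMET pipelines use cheminformatics libraries such as RDKit; the parsing and the screening rule below are simplifications for illustration only.

```python
# Illustrative sketch: crude molecular descriptors from a SMILES string,
# followed by a Lipinski-style screen. Real pipelines use a cheminformatics
# library (e.g. RDKit); this naive text parsing is for illustration only.

def crude_descriptors(smiles: str) -> dict:
    """Count a few atom symbols directly from the SMILES text."""
    s = smiles.upper()
    return {
        "carbons": s.count("C") - s.count("CL"),  # avoid counting chlorine
        "nitrogens": s.count("N"),
        "oxygens": s.count("O"),
    }

def passes_simple_filter(desc: dict) -> bool:
    """Lipinski-style rule: H-bond acceptors (N + O) should not exceed 10."""
    return desc["nitrogens"] + desc["oxygens"] <= 10

aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
d = crude_descriptors(aspirin)
print(d, passes_simple_filter(d))
```

In practice, descriptors like these become the input features for the ML models that predict each ADMET endpoint.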
Another significant use case is predicting the efficacy of a particular drug for a specific individual, known as personalized medicine. By leveraging patient-specific data and advanced ML models, researchers can predict how different individuals will respond to a given treatment, leading to more effective and tailored therapies.
Additionally, MLOps plays a crucial role in patient stratification: grouping patients based on genetic, phenotypic, or clinical characteristics. This stratification enables precise diagnosis, targeted treatment, and improved disease management.
MLOps practices enhance the effectiveness and scalability of these applications by ensuring that ML models are continuously updated with new data, rigorously tested, and efficiently deployed. In biomedical research, this translates into reliable predictions, faster insights, and easier handling of large-scale data processing and analysis.
MLOps encompasses the principles, practices, and tools necessary to streamline the lifecycle of ML projects, from data preparation and model training to deployment and monitoring. It also offers significant operational benefits: reduced overhead, more reliable predictions, and more accurate insights in biomedical research.
Lately, ML has become increasingly pervasive across industries. However, deploying ML models into production environments entails a distinct set of challenges. Traditionally, the development and deployment of software applications have been governed by DevOps practices, emphasizing collaboration, automation, and continuous integration/continuous deployment (CI/CD). MLOps extends these principles to machine learning, ensuring that ML systems are developed, deployed, and maintained efficiently and reliably.
The steps in an ML project typically include data preparation, model training, evaluation, deployment, and ongoing monitoring.
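These stages can be sketched as a chain of functions. The stubs below are illustrative (a "model" that is simply the mean of the data); in a real project each stage would be a pipeline task managed by an orchestrator.

```python
# Minimal sketch of the ML project lifecycle as chained stages.
# Each function is a stand-in for a real pipeline task.

def prepare_data(raw):            # clean: drop missing values
    return [x for x in raw if x is not None]

def train(data):                  # fit a trivial "model": the mean
    return sum(data) / len(data)

def evaluate(model, data):        # mean absolute error against the data
    return sum(abs(x - model) for x in data) / len(data)

def deploy(model):                # stand-in for pushing to an endpoint
    return {"endpoint": "local", "model": model}

def monitor(endpoint, new_data):  # score fresh data against the model
    return evaluate(endpoint["model"], new_data)

raw = [1.0, None, 2.0, 3.0]
model = train(prepare_data(raw))
endpoint = deploy(model)
print(model, monitor(endpoint, [2.0, 4.0]))
```

The value of MLOps lies in automating, versioning, and monitoring each of these handoffs rather than in any single stage.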
By adopting MLOps practices, biomedical researchers can ensure that their ML models are not only accurate and reliable but also scalable and maintainable, thereby accelerating the pace of medical advancements and improving healthcare around the globe.
MLOps teams encounter a wide range of challenges across different domains, including issues unique to machine learning as well as those common in software engineering.
Elucidata has been tackling these challenges by creating and applying ML pipelines and MLOps flows tailored to different projects. This includes curating biomedical datasets and streamlining the process of auditing and identifying suitable datasets for specific needs. It also covers end-to-end data journeys, spanning data preparation, model training, deployment, monitoring, and maintenance.
Elucidata’s cloud platform, Polly, delivers harmonized medical data to accelerate key research milestones. Data is ingested into Polly from a number of different sources, each of which may have its own format and structure.
Steps from data intake to user-ready data in Polly
In MLOps, various components form the backbone of the data infrastructure and are essential for ensuring seamless operations.
Together, these components form a comprehensive framework for managing data within MLOps, ensuring its integrity, accessibility, and usability throughout the entire lifecycle.
The first step in the ingestion process is data curation. This involves transforming the data into a consistent, machine-readable format, annotating datasets and samples, and making them accessible on the platform. It generally means selecting data to train on, cleaning it to remove inconsistencies, standardizing it according to set conventions, and ensuring it is labeled properly. Careful data selection is crucial to avoid skewed models, which lead to inaccurate predictions and poor generalization.
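A minimal sketch of this curation step might normalize heterogeneous sample records into one schema and drop records that fall outside a controlled vocabulary. The field names and vocabulary below are hypothetical, not Polly's actual schema.

```python
# Sketch of curation: map heterogeneous metadata fields onto a standard
# schema, normalize values, and enforce a controlled vocabulary.
# Field aliases and the vocabulary are hypothetical examples.

FIELD_ALIASES = {"Tissue": "tissue", "tissue_type": "tissue",
                 "Disease": "disease", "condition": "disease"}
CONTROLLED_VOCAB = {"lung", "liver"}

def curate(record: dict):
    out = {}
    for key, value in record.items():
        std_key = FIELD_ALIASES.get(key, key.lower())
        out[std_key] = str(value).strip().lower()
    # Drop records whose tissue label is missing or outside the vocabulary
    if out.get("tissue") not in CONTROLLED_VOCAB:
        return None
    return out

samples = [{"Tissue": " Lung ", "Disease": "NSCLC"},
           {"tissue_type": "brain", "condition": "glioma"}]
curated = [c for s in samples if (c := curate(s))]
print(curated)
```

In production this logic is far richer (ontology mapping, manual review), but the principle is the same: one consistent, machine-readable record per sample.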
Elucidata stands at the forefront of biomedical research with its vast experience in processing biomedical data, and hosts an extensive repository of meticulously cleaned, labeled, curated, and harmonized datasets spanning types such as Bulk RNASeq and Single Cell. When a novel variation is encountered, the team selects appropriate data from sources including open repositories like GEO and CPTAC and prepares it for model training. This includes cleaning the data, labeling or curating it manually, and presenting it in a standard format.
Once the data is cleaned, properly labeled, harmonized, and stored securely, the next step is to prepare it for model training. This includes loading data in a format suitable for the selected model and normalizing it if required. This data serves as the source for training, validating, and testing our models. Model training, often considered one of the easiest steps in the entire ML pipeline, involves fine-tuning the selected algorithm on a subset of the meticulously prepared data. Despite its relative simplicity compared to phases such as data preparation and deployment, its importance cannot be overstated. During training, we experiment with hyperparameters like learning rate, number of epochs, and batch size to optimize model performance. The ultimate goal is to minimize the difference between the model's predictions and actual outcomes, as measured by a robust loss function.
While opinions vary among practitioners, many agree that even though training is straightforward in concept, achieving optimal performance requires meticulous attention to detail and iterative refinement. This phase lays the foundation for subsequent stages, including deployment and ongoing monitoring, where the true impact and reliability of the model are validated in real-world applications.
By focusing on rigorous experimentation and parameter tuning during training, we ensure that our models not only meet, but exceed, performance expectations, paving the way for successful deployment and operationalization in diverse biomedical research settings.
A subset of the prepared data is used to train and tune the models, which are then tested on held-out datasets. Finding optimal hyperparameters takes time, but it is crucial for model performance: the loss function measures the gap between predicted and actual values, and training seeks to minimize it.
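The interplay of learning rate, epochs, batch size, and loss can be seen in a toy gradient-descent loop. This fits a single weight to synthetic data by minimizing mean squared error; the data and hyperparameter values are illustrative, not from any Elucidata model.

```python
# Toy mini-batch gradient descent: fit y = w * x by minimizing mean
# squared error. Shows the roles of learning rate, epochs, and batch size.
# Data and hyperparameters are illustrative.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # true relationship: y = 2x
lr, epochs, batch_size = 0.05, 200, 2

w = 0.0
for _ in range(epochs):
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Gradient of mean squared error (w*x - y)^2 with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad  # a learning rate too large here would diverge

loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
print(round(w, 3), loss)
```

Doubling the learning rate in this toy would make the second batch update unstable, which is exactly the kind of sensitivity that makes hyperparameter search time-consuming at scale.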
Our training images are stored in ECR. Models are trained using these images on infrastructure running on ECS, and the resulting training artifacts are pushed back to S3.
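As a hypothetical sketch, the parameters for launching such a containerized training run might be assembled in the style of boto3's `ecs.run_task`. The image URI, cluster, task definition, and bucket names below are placeholders, not Elucidata's actual configuration.

```python
# Hypothetical parameters for an ECS training task, in the shape expected
# by boto3's ecs.run_task. All names and URIs here are placeholders.

def build_training_task(image_uri: str, cluster: str, output_bucket: str) -> dict:
    return {
        "cluster": cluster,
        "taskDefinition": "model-training",  # assumed task definition name
        "launchType": "FARGATE",
        "overrides": {
            "containerOverrides": [{
                "name": "trainer",
                "environment": [
                    {"name": "TRAINING_IMAGE", "value": image_uri},
                    # Training artifacts are written back to S3 afterwards
                    {"name": "ARTIFACT_URI",
                     "value": f"s3://{output_bucket}/artifacts/"},
                ],
            }]
        },
    }

task = build_training_task(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/train:latest",
    "ml-cluster", "example-artifacts")
print(task["launchType"])
```

Keeping this launch configuration in version-controlled code, rather than clicked together in a console, is itself a core MLOps practice.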
To date, Elucidata has built more than 25 ML models used in production across a variety of use cases.
By implementing MLOps practices, Elucidata ensures that these models are developed, deployed, and maintained efficiently, paving the way for advances in biomedical research and healthcare solutions.
The deployment and monitoring phase is often considered the hardest part of the MLOps lifecycle. Once the best model is selected after training, we deploy it to an AWS SageMaker endpoint, from which predictions are served. Model performance is monitored continuously to ensure it stays above baseline expectations and to detect any data drift or concept drift.
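One common way to quantify data drift (not necessarily the method used behind SageMaker Model Monitor's defaults) is the Population Stability Index: compare the distribution of a feature at serving time against its training baseline. The sketch below uses a common rule-of-thumb alert threshold of 0.2; the data is synthetic.

```python
# Sketch of data-drift detection via the Population Stability Index (PSI).
# Bins are derived from the baseline; 0.2 is a common alert threshold.
import math

def psi(expected, actual, bins: int = 4) -> float:
    lo, hi = min(expected), max(expected)

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / (hi - lo) * bins) if hi > lo else 0
            counts[max(0, min(idx, bins - 1))] += 1
        # Small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-4) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [0.1 * i for i in range(100)]       # training-time distribution
shifted = [5.0 + 0.1 * i for i in range(100)]  # serving-time distribution
print(round(psi(baseline, baseline), 4), psi(baseline, shifted) > 0.2)
```

When a check like this fires, the remediation is usually retraining on fresh data or auditing the upstream feeds, which is exactly the workflow described below.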
We utilize various tools to monitor the entire MLOps pipeline and model quality. This includes Amazon managed services like CloudWatch for logging and performance metrics, Model Monitor for monitoring model endpoints, and a Prometheus/Grafana setup for monitoring underlying machine performance.
Models deployed in production are used daily, which incurs operational costs for computational resources and infrastructure. Continuous monitoring ensures that any deviations from expected performance are promptly identified and addressed. Over time, usage of deployed models may fluctuate, leading to periods of lower utilization where costs must still be borne; ongoing cost management is therefore crucial to optimize resources and maintain efficiency.
Once automated curation of the dataset is complete, it undergoes manual validation by curators using Elucidata's in-house curation tool. This manual validation helps eliminate inaccuracies and allows tracking of performance metrics effectively. If any drift is detected during monitoring, Elucidata conducts thorough analyses to assess model conditions and determine if any upstream systems require auditing or adjustments.
In MLOps, navigating technological infrastructure and financial constraints can pose significant challenges. At Elucidata, we leverage our expertise in cost optimization and scalability to tackle these challenges proactively. Employing a range of strategies and solutions, we optimize resource utilization, manage scalability effectively, and mitigate expenses, thereby ensuring operational efficiency and sustainable growth. Our approach encompasses several key initiatives, including the following:
Running ML pipelines demands robust and scalable infrastructure. This entails infrastructure that can dynamically scale to meet substantial compute demands and maintain high availability to prevent failures.
At Elucidata, on any given day, we handle up to 5,000 vCPUs or 20 TB of RAM. This necessitates flexible infrastructure that can scale seamlessly with demand. Moreover, processing and delivering data within promised SLAs is crucial to avoid downtime or operational setbacks.
Based on these requirements, Elucidata has built its infrastructure on AWS services such as ECS, AWS Batch, and Kubernetes clusters. For some pipelines, we use openly available orchestrators like Prefect and Nextflow, or our in-house orchestrators for specific needs. This underlying infrastructure gives us practically unlimited scalability and the compute capacity we require.
We use various AWS storage services (S3, EBS, EFS, FSx) depending on requirements, and we have built data lakes on AWS to store and analyze this data. AWS CloudWatch and Prometheus/Grafana are used for monitoring and logging. Using these AWS services as building blocks provides a scalable and reliable bedrock for our solution.
Since a detailed analysis of our system architecture is beyond the scope of this blog, we recommend reading “How Polly’s Curated Biomedical Molecular Data Streamlines MLOps for Drug Discovery”, which dives into the technical architecture of our data-processing pipeline infrastructure.
Elucidata has achieved significant cost savings and performance improvements through a hybrid cloud setup as compared to a pure cloud solution. We have reduced the cost of our processing pipelines by 85% while achieving a 2.5X faster processing speed. This hybrid approach allows us to leverage the benefits of both on-premises as well as cloud infrastructure, and optimize costs without compromising on performance or scalability.
By focusing on scalability, reliability, and cost-efficiency in its infrastructure strategy, Elucidata ensures that our MLOps capabilities remain robust and adaptable to the dynamic demands of biomedical research and healthcare applications.
Selecting the appropriate storage solution is a critical decision. AWS, for instance, offers options including S3 for object storage, EBS for block storage, and EFS and FSx for managed file systems.
Careful consideration is essential when selecting a storage solution and provisioning capacity and throughput for large-scale processing pipelines. Without proper architectural alignment with the workload, costs can escalate rapidly, underscoring the importance of thoughtful design to avoid unnecessary expenditure.
Further, continuous attention is required to fine-tune pipelines and allocate optimal resources. Under-resourcing pipeline steps increases execution time and failure rates; over-provisioning wastes resources and increases costs. We actively monitor resource allocation and utilization in pipelines using tools like AWS CloudWatch and Prometheus to address these challenges, and we consistently refine our pipelines for maximum optimization.
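A simple right-sizing heuristic illustrates this balance: size each step to its observed peak utilization plus a safety headroom. The thresholds and headroom below are illustrative, not the values we use in production.

```python
# Sketch of right-sizing a pipeline step from observed peak utilization,
# as one might derive from CloudWatch metrics. Headroom is illustrative.
import math

def recommend_cpus(allocated_cpus: int, peak_utilization: float,
                   headroom: float = 0.2) -> int:
    """Suggest an allocation sized to the observed peak plus headroom."""
    needed = allocated_cpus * peak_utilization * (1 + headroom)
    return max(1, math.ceil(needed))

# A step peaking at 30% of 16 vCPUs is over-provisioned and can shrink...
print(recommend_cpus(16, 0.30))
# ...while one peaking at 95% of 8 vCPUs risks slowdowns and needs more.
print(recommend_cpus(8, 0.95))
```

Applied across thousands of daily pipeline runs, even small per-step adjustments like this compound into meaningful cost savings.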
Elucidata is also exploring bare-metal hybrid cloud infrastructure to provide low-cost, high-performance compute and storage. Bare-metal instances deliver higher performance than virtualized environments and are available at lower cost than virtualized cloud instances. Moreover, local network storage in bare-metal data centers provides high throughput and IOPS at comparatively low cost.
This has yielded faster processing and a substantial reduction in processing cost compared to a pure cloud solution. Since the initial results from this experiment have been promising, we will continue to invest in this hybrid infrastructure.
In a nutshell, this field requires continuous research, and Elucidata remains committed to investing the time and effort needed to optimize infrastructure costs. With this sharp focus, our teams continuously develop the skill set needed to deliver high-quality results in biomedical research at minimal cost.
Elucidata's commitment to advancing MLOps practices has been pivotal in driving innovation and enhancing efficiency within the biotech sector. By optimizing processes across data preparation, model training, deployment, and monitoring, we have established a robust and scalable infrastructure capable of managing complex biomedical data pipelines. Our continuous research and refinement ensure cost-efficient solutions while maintaining high standards of reliability and performance.
Addressing the challenges inherent in MLOps, such as data integration complexity, computational resource management, and maintaining model accuracy over time, underscores our dedication to delivering exceptional outcomes. Through our expertise in MLOps, we empower researchers and clinicians to accelerate discoveries and improve patient outcomes in biomedical research and healthcare.
Looking ahead, Elucidata remains steadfast in its commitment to advancing the frontier of MLOps, fostering collaboration, and driving impactful innovations in biotechnology. For further insights into our MLOps methodologies or to explore collaboration opportunities, contact us or learn more at [email protected]. We look forward to partnering with you to accelerate your research journey and achieve transformative results in biomedicine.