Earlier this year, our co-founder and CEO, Abhishek Jha, was the guest of honor during an Ask Me Anything-style Q&A session on Slack hosted by Bits in Bio [1]. Bits in Bio is a community devoted to people who build tools that help scientists unlock new insights. During these conversations, participants interview fellow community members about their companies, work, and plans for the future.
In part two of this two-part series, the conversation dove into the role of machine learning in drug discovery and the challenges associated with applying these technologies. In part one, Abhishek gave an overview of our mission and shared his inspiration for founding Elucidata. You can read part one here.
Please note that some of the questions have been paraphrased for clarity and reordered to improve the flow of this article.
Where do you see the major applications of ML in drug discovery? How about major challenges?
ML applications span the broad spectrum of drug discovery, from hypothesis testing to drug property prediction. A major challenge is that sparsely annotated data makes it difficult for ML models to learn, so ML teams spend a large chunk of their time cleaning and wrangling data. Another challenge is that the hits output by ML models rarely provide feasible or exciting avenues for chemists and biologists to follow up on. Unlike internet companies such as Grammarly, drug discovery applications do not have access to billions of data points; our sample sizes tend to be in the hundreds at best. Moreover, any event of interest is a rare event. Therefore, it becomes even more critical to have clean and linked data. This is one of the major challenges that can compromise the promise of ML and AI in drug discovery.
I think the opportunity is immense. A lot of verticals (manufacturing, etc.) suffer from similar problems. One challenge for us (since we are small) is to remain focused and deliver value for our customers. But there is a whole world beyond drug discovery where our technology will be helpful!
The FAIR principles put the onus on organizations that own and publish data to make it “machine-actionable”, i.e., a machine can read the metadata that describes the data, which enables the machine to access and utilize the data for various applications.
Currently, for most organizations, data generation, storage, analysis, and insight derivation are owned by different stakeholders. A significant bottleneck is the disconnect between these stakeholders. FAIRly stored, managed, and shared data facilitates data reuse and enables verification of the credibility and accuracy of the data and of the insights derived from it. Further, it enables interdisciplinary collaboration and innovation, accelerating drug discovery.
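To make "machine-actionable" concrete, here is a minimal, hypothetical sketch of what FAIR-style metadata enables. The catalog entries, field names, accession IDs, and URLs below are all illustrative assumptions, not any particular standard or Elucidata's actual schema; the point is simply that when each dataset carries structured, machine-readable metadata, a program (rather than a person) can find and retrieve the right data.

```python
# Hypothetical catalog: each dataset is described by machine-readable
# metadata (identifier, standard vocabulary terms, access URL, license),
# so software can discover and access it without human intervention.
datasets = [
    {
        "id": "DS-0001",                      # illustrative accession
        "organism": "Homo sapiens",
        "assay": "RNA-seq",
        "access_url": "https://example.org/data/DS-0001",
        "license": "CC-BY-4.0",
    },
    {
        "id": "DS-0002",
        "organism": "Mus musculus",
        "assay": "proteomics",
        "access_url": "https://example.org/data/DS-0002",
        "license": "CC-BY-4.0",
    },
]

def find_datasets(catalog, **criteria):
    """Return every dataset whose metadata matches all given fields."""
    return [
        d for d in catalog
        if all(d.get(field) == value for field, value in criteria.items())
    ]

# A machine queries the metadata, then follows each hit's access_url.
hits = find_datasets(datasets, organism="Homo sapiens", assay="RNA-seq")
print([d["id"] for d in hits])  # → ['DS-0001']
```

Without such metadata, the same "query" is a person emailing colleagues or grepping file names; with it, discovery, access, and reuse become programmable, which is what lets downstream ML pipelines consume the data at scale.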
I appreciate the question. It aligns with my own research experience as well as our experience at Elucidata. Most (not all) of bioinformatics is what I would call business intelligence: analyzing 1-5 datasets at a time, looking at differential expression and pathways. This type of work is very helpful for driving programs. But, increasingly, we are seeing asks from our customers and others that rely on large collections of datasets (in the hundreds) to learn classifications. Some of our customers have used such models to answer specific questions around patient segmentation for a target they had validated. Some of the papers being published are even more audacious, so this list is growing as we speak. It is important for predictive ML models to address a narrow question. At the same time, we see such models as beyond our scope: our customers develop them, and we feed ML-ready datasets into those models. Getting the data ready for ML is our core focus.
Yes. We try to put guardrails around what we can do; for example, we restrict ourselves to digital text in English. And it is worth stressing that NLP is just a part of it; we also do data engineering at scale. That has led to some interesting discoveries. Folks would cut and paste Excel sheets into their doc files, and we can't do much there. Another example: extracting information from an ELN is far tougher than from clinical trial text, primarily because ELNs are written in a hurry.
It is a huge challenge. But that is what gets us excited!!
We are working with Pistoia. We are a member too. A funny thing about "standards" is that everyone has one. And that undermines the whole premise of it. We are big believers in using the community to converge upon broadly agreed-upon standards.
We wrote a case study in collaboration with Pistoia which may interest our readers.
This is very interesting. We also learn a lot from such models. We talked a lot about it at this year’s DataFAIR, which is our annual event where our community can discuss the challenges of making data FAIR, and the promise of it.
We and others have seen that smaller models can outperform large models (with billions of parameters) if they are trained on the relevant dataset. This is the core promise of data-centric AI. Happy to get into more details if anyone is interested. More specifically, we outperform BERT [2] quite significantly. It is a very exciting time to be dealing with such challenges. So much is happening, and so fast!
Yes. One of our customers had access to PPMI and they used it quite effectively. We believe it is valuable. We cleaned it and linked it to other datasets to make PPMI more valuable and usable. Providing PPMI out of the box on Polly is tricky because of the usual constraints. But we can talk more about it should you be interested.
It is hard to be a purist. We do consulting and provide services too. But we are very clear about what we are 10x best at, which is cleaning and linking biomedical data. We do at times create data processing pipelines or custom tools because it helps our customers. But a continuous challenge is to focus as narrowly as possible so that we can create something highly differentiated. We have been reasonably good at it so far, but it is a journey! :)
We work with large pharma companies. But we love working with companies that are very young. Stealth mode. Day 0. Seed/Series A companies. Companies that do not have a name yet. They have been a big part of our story and traction. We have a very strong services team which enables us to do it. Happy to talk more. We have a number of academic customers too.
This list is not prescriptive. But often we see that folks dive straight into the model, and that has not served anyone well. It requires a lot of planning and thought to invest in data quality and usability before diving into the models.
[1] https://bitsinbio.substack.com/p/introducing-bits-in-bio
[2] https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html