
What is the Role of Machine Learning in Drug Discovery?

Earlier this year, our co-founder and CEO, Abhishek Jha, was the guest of honor at an Ask Me Anything-style Q&A session on Slack hosted by Bits in Bio [1], a community for people who build tools that help scientists unlock new insights. During these conversations, participants interview fellow community members about their companies, work, and plans for the future.

In part two of this two-part series, the conversation dove into the role of machine learning in drug discovery and the challenges associated with applying these technologies. In part one, Abhishek gave an overview of our mission and shared his inspiration for founding Elucidata. You can read part one here.

Please note that some of the questions have been paraphrased for clarity and reordered to improve the flow of this article.

Where do you see the major applications of ML in drug discovery? How about major challenges?

ML applications span the broad spectrum of drug discovery, from hypothesis testing to drug property prediction. A major challenge is that sparsely annotated data makes it difficult for ML models to learn, so ML teams spend a large chunk of their time cleaning and wrangling data. Another challenge is that the hits output by ML models rarely give chemists and biologists feasible or exciting avenues to follow up on. Unlike internet companies such as Grammarly, drug discovery applications do not have access to billions of data points; our sample sizes tend to be in the hundreds at best, and any event of interest is a rare event. That makes clean, linked data even more critical, and it is one of the major challenges that can compromise the promise of ML and AI in drug discovery.

What about beyond drug discovery?

I think the opportunity is immense. A lot of verticals (manufacturing, for example) suffer from similar problems. One challenge for us, since we are small, is to remain focused and deliver value for our customers. But there is a whole world beyond drug discovery where our technology will be helpful!

Can you talk about FAIR data standards and why you see those as especially important in bio?

The FAIR principles [3] put the onus on organizations that own and publish data to make it “machine-actionable”, i.e., a machine can read the metadata that describes the data, and this enables the machine to access and utilize the data for various applications.

Currently, for most organizations, data generation, storage, analysis, and insight derivation are owned by different stakeholders, and the disconnect between these stakeholders is a significant bottleneck. FAIRly stored, managed, and shared data facilitates reuse, enables verification of the credibility and accuracy of both the data and the insights derived from it, and supports interdisciplinary collaboration and innovation, accelerating drug discovery.
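To make the idea of “machine-actionable” metadata a little more concrete, here is a minimal Python sketch. The catalog schema, field names, and URLs are purely illustrative (they are not a real FAIR standard or Elucidata's format); the point is simply that a program can find and access datasets from their metadata alone, without a human in the loop.

```python
# A minimal sketch of "machine-actionable" metadata: a program reads the
# metadata alone and decides whether (and how) to access the underlying data.
# The schema and field names here are illustrative, not a real standard.

import json

catalog_json = """
[
  {
    "id": "ds-001",
    "title": "Liver RNA-seq, drug-treated vs control",
    "organism": "Homo sapiens",
    "assay": "RNA-seq",
    "license": "CC-BY-4.0",
    "access_url": "https://example.org/datasets/ds-001/counts.csv"
  },
  {
    "id": "ds-002",
    "title": "Mouse proteomics time course",
    "organism": "Mus musculus",
    "assay": "proteomics",
    "license": "CC-BY-4.0",
    "access_url": "https://example.org/datasets/ds-002/intensities.csv"
  }
]
"""

def findable_datasets(catalog, organism, assay):
    """Filter a metadata catalog without ever touching the raw data files."""
    return [
        entry for entry in catalog
        if entry["organism"] == organism and entry["assay"] == assay
    ]

catalog = json.loads(catalog_json)
for entry in findable_datasets(catalog, organism="Homo sapiens", assay="RNA-seq"):
    # The metadata tells the machine where the data lives and under what terms.
    print(entry["id"], entry["access_url"], entry["license"])
```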

There’s often a dichotomy drawn between traditional bioinformatics and ML. How do you think about using omics data for developing ML models for drug discovery?

I appreciate the question. It is aligned with my own research experience as well as our experience at Elucidata. Most (not all) of bioinformatics is what I would call business intelligence: analyzing one to five datasets at a time, looking at differential expression and pathways. This type of work is very helpful for driving programs. But increasingly we are seeing asks, from our customers and beyond, that rely on large collections of datasets (hundreds at a time) to learn classifications. Some of our customers have used this to answer specific questions around patient segmentation for a target they had already validated, and some of the papers being published are even more audacious, so this list is growing as we speak. It is important for predictive ML models to have a narrow question. At the same time, we see such models as beyond our scope: our customers develop them, and we feed ML-ready datasets into those models. Getting the data ready for ML is our core focus.
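As a rough sketch of the kind of large-scale analysis described above (as opposed to differential expression on a handful of datasets), the example below trains a simple patient-segmentation classifier with scikit-learn. The file and column names are hypothetical, and it assumes the hard part, harmonizing hundreds of datasets into one clean, ML-ready table, has already been done upstream.

```python
# Illustrative sketch only: assumes an "ML-ready" table where expression values
# from many harmonized datasets share one gene space and one label column.
# File name and column names are hypothetical.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("mlready_expression.csv")        # samples x (genes + metadata)
X = data.drop(columns=["patient_id", "segment"])    # gene expression features
y = data["segment"]                                 # e.g. responder / non-responder

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Cross-validated accuracy is only meaningful if the curation upstream has
# removed batch effects and harmonized labels across the source datasets.
scores = cross_val_score(model, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f}")
```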

Are certain settings more challenging for your NLP models? (e.g., language in clinical trials vs. ELN entries)

Yes. We try to put guardrails around what we can do: for example, digital text, in English. And it is worth stressing that NLP is just a part of it; we also do data engineering at scale. That has led to some interesting discoveries, such as folks cutting and pasting Excel sheets into their doc files, where we can't do much. One example is that extracting information from an ELN is far tougher than clinical trial text, primarily because ELNs are written in a hurry.

It is a huge challenge. But that is what gets us excited!!
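For readers curious what a guardrail like “digital text, in English” might look like in code, here is a minimal, illustrative filter using the langdetect package. It is a sketch, not Elucidata's pipeline; real systems add many more checks (document format, OCR quality, section structure, and so on).

```python
# Illustrative guardrail: only hand a document to the NLP pipeline if it is
# non-empty machine-readable text and is detected as English.
from langdetect import detect  # pip install langdetect

def passes_guardrails(text: str) -> bool:
    if not text or not text.strip():
        return False              # empty extraction often means scanned or embedded content
    try:
        return detect(text) == "en"
    except Exception:
        return False              # the detector can fail on very short or odd inputs

docs = [
    "Patients were randomized 1:1 to receive the study drug or placebo.",
    "",                           # e.g. an Excel sheet pasted into a doc as an image
]
print([passes_guardrails(d) for d in docs])  # expected: [True, False]
```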

Do you see value in working with pre-competitive players like Pistoia Alliance and others around data standardization and harmonization in the life science space?

We are working with Pistoia, and we are a member too. A funny thing about "standards" is that everyone has their own [4], which undermines the whole premise. We are big believers in using the community to converge on broadly agreed-upon standards.

We wrote a case study in collaboration with Pistoia which may interest our readers.

We've seen huge amounts of unstructured, messy data lead to amazing results in other domains (I'm thinking GPT-3 and Dall-E). What about drug discovery makes you think that's the wrong approach here? 

This is very interesting. We also learn a lot from such models. We talked about this a lot at this year’s DataFAIR, our annual event where our community discusses the challenges of making data FAIR and the promise of doing so.

We and others have seen that we can outperform large models (with billions of parameters) if the training is done on the relevant dataset. This is the core promise of data-centric AI. Happy to get into more details if anyone is interested. More specifically, we outperform BERT [2] quite significantly. It is a very exciting time to be dealing with such challenges. So much is happening, and so fast!
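For readers who want a feel for the general pattern (not our actual models or data), here is a hedged sketch of fine-tuning a pretrained transformer on a small, curated, domain-relevant text classification set using the Hugging Face transformers and datasets libraries. The checkpoint name and toy examples are placeholders; the point is that a modest amount of relevant, well-labeled data often matters more than raw model size.

```python
# Sketch of data-centric fine-tuning: start from any pretrained checkpoint and
# adapt it on a small but carefully curated, domain-relevant labeled set.
# Checkpoint name and examples below are placeholders, not a real training set.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

examples = {
    "text": ["EGFR expression was elevated in tumor samples.",
             "No significant change in glucose uptake was observed."],
    "label": [1, 0],
}

model_name = "bert-base-uncased"   # any pretrained checkpoint works as a starting point
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the curated examples into model inputs.
dataset = Dataset.from_dict(examples).map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
)
trainer.train()
```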

I see on your website that you provide clinical data from sources like “PPMI” - do you see any efforts in using such data to inform pre-clinical research/drug discovery to improve the prediction of “clinical success”?

Yes. One of our customers had access to PPMI and used it quite effectively. We believe it is valuable; we cleaned it and linked it to other datasets to make it more valuable and usable. Providing PPMI out of the box on Polly is tricky because of the usual constraints, but we can talk more about it should you be interested.
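In practice, “linking” a clinical cohort like PPMI to other datasets largely comes down to harmonizing identifiers and vocabularies so tables can be joined reliably. The toy pandas sketch below illustrates the idea; the files and column names are hypothetical, and real harmonization also involves controlled vocabularies, units, and ontology mapping, not just a join key.

```python
# Toy sketch of linking a clinical table to an omics metadata table.
# Files and columns are hypothetical placeholders.

import pandas as pd

clinical = pd.read_csv("ppmi_clinical.csv")    # e.g. PATNO, diagnosis, updrs_score
omics = pd.read_csv("omics_metadata.csv")      # e.g. sample_id, patient_id, assay

# Normalize the join key so the same patient matches across sources.
clinical["patient_id"] = clinical["PATNO"].astype(str).str.strip()
omics["patient_id"] = omics["patient_id"].astype(str).str.strip()

linked = omics.merge(clinical, on="patient_id", how="inner")
print(linked[["sample_id", "assay", "diagnosis", "updrs_score"]].head())
```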

Do you spend a lot of time consulting your clients on the potential use cases/business analysis for an ML approach (and potential model design) or do you try to stay at the pure data service level?

It is hard to be a purist. We do consult and provide services too. But we are very clear about what we are 10x best at, which is cleaning and linking biomedical data. We do at times create data processing pipelines or custom tools because it helps our customers. But a continuous challenge is to stay focused as narrowly as possible so that we can create something highly differentiated. We have been reasonably good at it so far, but it is a journey! :)

Can Elucidata work with companies that are just beginning their data science/ML journey, or is your product better suited to advanced ML/deep learning teams?

We work with large pharma companies, but we love working with companies that are very young: stealth mode, day 0, seed/Series A companies, companies that do not even have a name yet. They have been a big part of our story and traction. We have a very strong services team, which enables us to do it. Happy to talk more. We have a number of academic customers too.

How should organizations decide which biological problems could be solved using machine learning? What are the factors you consider?
  1. How narrowly can it be defined?
  2. How critical is it to business?
  3. What kind of underlying data can be used to train and test the model?
  4. What is the quality of data?

This list is not prescriptive. But often we see folks dive straight into the model, and that has not served anyone well. It is worth investing a lot of planning and thought in data quality and usability before diving into the models.
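On the data-quality point in particular, a simple audit of annotation completeness, done before any modeling, goes a long way. The sketch below uses pandas with hypothetical metadata fields and a threshold chosen purely for illustration.

```python
# Toy pre-modeling audit: how completely are samples annotated for the fields
# the model would actually need? Field names and threshold are hypothetical.

import pandas as pd

metadata = pd.read_csv("sample_metadata.csv")
required = ["disease", "tissue", "treatment", "outcome_label"]

completeness = metadata[required].notna().mean().sort_values()
print(completeness)  # fraction of samples annotated, per field

# A field that is mostly missing is a sign to invest in curation
# before investing in models.
usable = completeness[completeness >= 0.8].index.tolist()
print("fields usable for modeling:", usable)
```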

References

[1] https://bitsinbio.substack.com/p/introducing-bits-in-bio 

[2] https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html 

[3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175/ 

[4] https://xkcd.com/927/
