Earlier this year, our co-founder and CEO, Abhishek Jha, was the guest of honor during an Ask Me Anything-style Q&A session on Slack hosted by Bits in Bio [1]. Bits in Bio is a community devoted to people who build tools that help scientists unlock new insights. During these conversations, participants interview fellow community members about their companies, work, and plans for the future.
In part two of this two-part series, the conversation dove into the role of machine learning in drug discovery and the challenges associated with applying these technologies. In part one, Abhishek gave an overview of our mission and shared his inspiration for founding Elucidata. You can read part one here.
Please note that some of the questions have been paraphrased for clarity and reordered to improve the flow of this article.
Where do you see the major applications of ML in drug discovery? How about major challenges?
ML applications span the broad spectrum of drug discovery, from hypothesis testing to drug property prediction. A major challenge is that sparsely annotated data makes it difficult for ML models to learn, so ML teams spend a large chunk of their time cleaning and wrangling data. Another challenge is that the hits output by ML models rarely provide feasible or exciting avenues for chemists and biologists to follow up on. Unlike internet companies such as Grammarly, drug discovery applications do not have access to billions of data points; our sample sizes tend to be in the hundreds at best. Moreover, any event of interest is a rare event. Therefore, it becomes even more critical to have clean and linked data. This is one of the major challenges that can compromise the promise of ML and AI in drug discovery.
I think the opportunity is immense. A lot of verticals (manufacturing, etc.) suffer from similar problems. One challenge for us (since we are small) is to remain focused and deliver value for our customers. But there is a whole world beyond drug discovery where our technology will be helpful!
The FAIR principles put the onus on organizations that own and publish data to make it “machine-actionable”, i.e., a machine can read the metadata that describes the data, which enables the machine to access and utilize the data for various applications.
Currently, for most organizations, data generation, storage, analysis, and insight derivation are owned by different stakeholders. A significant bottleneck is the disconnect between these stakeholders. FAIRly stored, managed, and shared data facilitates data reuse and enables verification of the credibility and accuracy of the data and of the insights derived from it. Further, it enables interdisciplinary collaboration and innovation, accelerating drug discovery.
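To make "machine-actionable" concrete, here is a minimal, hypothetical sketch of what FAIR-style metadata enables. The catalog entries, field names, accession IDs, and URLs below are all illustrative assumptions, not any particular standard or Elucidata's actual schema; the point is simply that when each dataset carries structured, machine-readable metadata, a program (rather than a person) can find and retrieve the right data.

```python
# Hypothetical catalog: each dataset is described by machine-readable
# metadata (identifier, standard vocabulary terms, access URL, license),
# so software can discover and access it without human intervention.
datasets = [
    {
        "id": "DS-0001",                      # illustrative accession
        "organism": "Homo sapiens",
        "assay": "RNA-seq",
        "access_url": "https://example.org/data/DS-0001",
        "license": "CC-BY-4.0",
    },
    {
        "id": "DS-0002",
        "organism": "Mus musculus",
        "assay": "proteomics",
        "access_url": "https://example.org/data/DS-0002",
        "license": "CC-BY-4.0",
    },
]

def find_datasets(catalog, **criteria):
    """Return every dataset whose metadata matches all given fields."""
    return [
        d for d in catalog
        if all(d.get(field) == value for field, value in criteria.items())
    ]

# A machine queries the metadata, then follows each hit's access_url.
hits = find_datasets(datasets, organism="Homo sapiens", assay="RNA-seq")
print([d["id"] for d in hits])  # → ['DS-0001']
```

Without such metadata, the same "query" is a person emailing colleagues or grepping file names; with it, discovery, access, and reuse become programmable, which is what lets downstream ML pipelines consume the data at scale.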
I appreciate the question. It aligns with my own research experience as well as our experience at Elucidata. Most (not all) of bioinformatics is what I would call business intelligence: analyzing 1-5 datasets at a time, looking at differential expression and pathways. This type of work is very helpful for driving programs. But, increasingly, we are seeing asks from our customers and others that rely on large collections of datasets (in the hundreds) to learn classifications. Some of our customers have used such models to answer specific questions around patient segmentation for a target they had validated. Some of the papers being published are even more audacious, so this list is growing as we speak. It is important for predictive ML models to address a narrow question. At the same time, we see such models as beyond our scope: our customers develop them, and we feed ML-ready datasets into those models. Getting the data ready for ML is our core focus.
Yes. We try to put guardrails around what we can do; for example, we restrict ourselves to digital text in English. And it is worth stressing that NLP is just a part of it; we also do data engineering at scale. That has led to some interesting discoveries. Folks would cut and paste Excel sheets into their doc files, and we can't do much there. Another example: extracting information from an ELN is far tougher than from clinical trial text, primarily because ELNs are written in a hurry.
It is a huge challenge. But that is what gets us excited!!
We are working with Pistoia. We are a member too. A funny thing about "standards" is that everyone has one. And that undermines the whole premise of it. We are big believers in using the community to converge upon broadly agreed-upon standards.
We wrote a case study in collaboration with Pistoia which may interest our readers.
This is very interesting. We also learn a lot from such models. We talked a lot about it at this year’s DataFAIR, which is our annual event where our community can discuss the challenges of making data FAIR, and the promise of it.
We and others have seen that smaller models can outperform large models (with billions of parameters) if they are trained on the relevant dataset. This is the core promise of data-centric AI. Happy to get into more details if anyone is interested. More specifically, we outperform BERT [2] quite significantly. It is a very exciting time to be dealing with such challenges. So much is happening, and so fast!
Yes. One of our customers had access to PPMI and they used it quite effectively. We believe it is valuable. We cleaned it and linked it to other datasets to make PPMI more valuable and usable. Providing PPMI out of the box on Polly is tricky because of the usual constraints. But we can talk more about it should you be interested.
It is hard to be a purist. We do consulting and provide services too. But we are very clear about what we are 10x best at, which is cleaning and linking biomedical data. We do at times create data processing pipelines or custom tools because it helps our customers. But a continuous challenge is to focus as narrowly as possible so that we can create something highly differentiated. We have been reasonably good at it so far, but it is a journey! :)
We work with large pharma companies. But we love working with companies that are very young. Stealth mode. Day 0. Seed/Series A companies. Companies that do not have a name yet. They have been a big part of our story and traction. We have a very strong services team which enables us to do it. Happy to talk more. We have a number of academic customers too.
This list is not prescriptive. But often we see that folks dive straight into the model, and that has not served anyone well. It requires a lot of planning and thought to invest in data quality and usability before diving into the models.
[1] https://bitsinbio.substack.com/p/introducing-bits-in-bio
[2] https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html