Few life science researchers today haven’t used PubMed to search for datasets or articles, or spent hours working out which of the 1,000–10,000 results are actually relevant to their research. Fine-tuning a search is challenging, and you end up manually reading descriptions or abstracts to judge whether a dataset is relevant. In the age of the AI revolution, this is an enormous waste of a researcher’s time.
With PubMed alone indexing around two biomedical papers every minute, and over a million every year, gone are the days when we could search for biomedical data manually!
We already use ML models in everyday life via chatbots, spam filters, search engines, grammar correction software, and more. They also play a role in narrowing down search results in biomedical data. Still, the process is not as effective as in other areas, because existing large language models are trained on general text rather than biomedical literature. To these models, “fish and chips” would generally mean an item on a restaurant menu, whereas in a biomedical context it could mean the Fluorescence In Situ Hybridization (FISH) technique and the microchips used to run the assays. This is why we need models trained on biomedical data.
At Elucidata, we have embraced a data-centric approach to increase the accuracy of existing models. We have developed small models trained exclusively on highly curated biomedical data, intended to augment general language models and improve overall performance. The principle is not hard to grasp: you are what you eat, and your model is only as good as the data you feed it. So we use models that improve data quality and remove biases, and eureka! The results start improving dramatically.
Generally, text data free of errors and spelling mistakes is considered good enough to train a model. When training a model on biomedical data, however, factors beyond correct keywords, such as proper labels, fixed ontologies, the recency of the research, the rationale for choosing a dataset, and the number of available data points, play a huge role in determining the quality of the input data. This is explained in more detail here. At Elucidata, a team of curators and domain experts collaborates closely with ML experts and engineers to ensure that the relevance and quality of the data used in our data-centric models meet our high standards.
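To make the point about proper labels and fixed ontologies concrete, here is a minimal sketch (not Elucidata’s production pipeline; the mapping and labels are invented for illustration) of normalizing free-text metadata labels to a fixed vocabulary before they are used as training data:

```python
# Toy illustration: mapping free-text tissue labels to a fixed ontology
# vocabulary before training. The mapping below is a small invented example;
# a real pipeline would draw on curated ontology resources.

TISSUE_ONTOLOGY = {
    "liver": "UBERON:0002107",
    "hepatic tissue": "UBERON:0002107",
    "lung": "UBERON:0002048",
    "pulmonary tissue": "UBERON:0002048",
}

def normalize_tissue(label):
    """Map a free-text tissue label to a fixed ontology ID, or None if unknown."""
    return TISSUE_ONTOLOGY.get(label.strip().lower())

raw_labels = ["Liver", "hepatic tissue ", "LUNG", "brain"]
normalized = [normalize_tissue(label) for label in raw_labels]
print(normalized)
```

Labels that fail to normalize (here, "brain") come back as `None` and can be routed to human curators, which is one way a curation team and an automated pipeline complement each other.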
Through a careful understanding of the data, and by adding one iterative data-centric model to improve data quality and another for post-processing, we have shown that we can surpass the accuracy of PubMedBERT (a popular ML model for biomedical text) on certain tasks, such as identifying cell type, cell line, tissue, and disease in biomedical abstracts from public repositories like GEO.
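The task described above can be pictured with a deliberately simplified sketch: a dictionary lookup stands in for the actual ML models, and the lexicon entries are invented examples, but the input and output have the same shape as entity identification over a GEO-style abstract:

```python
# Simplified stand-in for entity identification in a biomedical abstract.
# A real system would use a trained model; here a tiny invented lexicon
# maps known terms to entity types (cell line, tissue, disease).

ENTITY_LEXICON = {
    "hela": ("cell line", "HeLa"),
    "breast cancer": ("disease", "breast cancer"),
    "liver": ("tissue", "liver"),
}

def tag_entities(abstract):
    """Return (entity_type, term) pairs whose lexicon key appears in the text."""
    text = abstract.lower()
    return [hit for key, hit in ENTITY_LEXICON.items() if key in text]

abstract = "RNA-seq of HeLa cells and liver biopsies from breast cancer patients."
print(tag_entities(abstract))
```

The extracted pairs are exactly the kind of structured fields (cell line, tissue, disease) that turn a free-text abstract into something you can filter a dataset search on.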
This is a huge milestone, because it means we can substantially automate and streamline your dataset search with minimal human intervention. Think of how many hours you can save if you can filter your dataset search by cell type, cell line, tissue, disease, drugs, and more, and get relevant datasets within seconds! That is the power of curated data and a data-centric approach.
Contact us if you want to learn more about using our 1.5 million curated datasets to train your models, or to take advantage of our data-centric platform, Polly, to find and analyze relevant datasets.