How Curation of Biomedical Data Can Accelerate the Drug Discovery Process?

Pranav Divakar, Jayashree
May 17, 2023

The healthcare systems continually require new drugs to address the unmet medical needs across diverse therapeutic areas. Pharmaceutical industries primarily strive to deliver new drugs to the market through the complex activities of drug discovery and development. $150M - $2.6B & 10-15 years of time is needed to bring a drug to market. Let us take a quick look at the current drug discovery process before we talk about how the process can be accelerated.

There are about 4 ways drugs are discovered:

  • Testing molecular compounds against a large number of diseases & finding possible beneficial effects against a few diseases.
  • Breaking down gene sequences & finding new insights into a disease process allows researchers to design a product to stop or reverse the effect of the disease.
  • Treating certain diseases with a particular drug that can have unanticipated effects & thus finding new drugs Ex: Viagra - serendipity.
  • New technologies, such as those that provide new ways to target medical products to specific sites within the body or to manipulate genetic material.

One common element across the above ways is the process of collecting & curating large amounts of biomedical micro-data to test various hypotheses.

Fun Fact: Did you know there are about 10,000 compounds that are tested before 97.5% of compounds are eliminated in the pre-clinical trial phase which generally takes ~4-6 years & you are left with just 5 compounds by the time you get to the clinical trial stage.
How Curation of Biomedical Data Can Accelerate the Drug Discovery Process
An illustration showing the different stages involved in developing a drug.
Image credit: Genome Research Limited

💡$150M - $2.6B & 10-15 years of time is needed to bring a drug to market.

Out of the 15 years in development time of a successful compound, about 6 years are devoted to the drug discovery and preclinical phase, 6.7 years to clinical trials, and 2.2 years to the approval phase.

Current Challenges with the Drug Discovery Process

Various elements slow down the process of drug discovery, but some of the crucial ones are:

  • Inability to identify better targets and molecules.
  • Time consumed in deciding & making iterations. Validating & invalidating each compound is a long and cumbersome process.
  • Lack of access to large volumes of diverse yet contextualized, processed & accurate biomedical data where confidence has been assessed, and applicability has been established.
  • The time required to cleanse data sets, map metadata & reformat multiple data sets into a common structure to let machines read data.
  • The lack of skilled resources to clean and analyze also poses a problem.  Ex: Different skill sets are required to analyze Metabolomics vs. Proteomics data.
  • The multiple regulatory laws that must be followed while designing your own study & collecting your own data.

It is safe to say that data plays a massive role in any drug to be discovered, developed & brought to market! And biomedical data is further critical because that’s the basis of research & hypothesis building.

Current State of Biomedical Data

  • Data is available in different formats across public and private forums - but not in one place. It is scattered offline, in the cloud & in repositories.
  • Metadata is inconsistent and non-standard, thus making it difficult to bring data from different sources & find insights.
  • Public data is not machine-readable due to various inconsistencies.
  • Public data is available but not always usable. Most public data is not standardized & often limited to one omics type.

Current Needs of Biomedical Data - FAIR Data

Using the FAIR - Findable, Accessible, Interoperable, and Reusable- principles to store, manage and share data is a step toward being better prepared for expediting the discovery of safer, better drugs.
  • F - Findability through standardized metadata and availability of multiple datasets in the curated form.
  • A - Accessibility by making it available in one place, allowing one to combine different data sets.
  • I - Interoperability through providing structured data sets in machine-readable formats and making available formats that can be integrated into diverse computational environments.
  • R - Reusability. Since it’s already formatted, the data has standard metadata and can be integrated seamlessly into most environments.

This emphasizes that it is time to adopt tools and software to store, manage and share data using FAIR guiding principles to enable scientists to use it better.

Benefits of Curated FAIR data:

We are surrounded by data but starved for insights.

Anything that reduces the need for users to spend time and effort on discovering, cleaning, re-formatting, and analyzing, thus making it easier and quicker for scientific discovery to move forward, can be called curation. Combine this with FAIR guiding principles, and you get the most efficient data for analysis!

There are a bunch of benefits of using curated FAIR data:

  • Easy access to large data sets with standardized metadata available on one platform, thus making discovery easier and more contextualized.
  • Easy access to structured and formatted data sets from people research teams worldwide.
  • The data needs to be contextualized, accurate, confidence assessed, and established applicability - so that Machine Learning can be applied.
  • The availability of diverse data sets across multiple omics and the ability to use the data sets in any computational environment, thus reducing operational overload.

Importance of Curating Biomedical Data at a Large Scale

Curation is a necessary step to make sense of data using ML, and it takes time, sometimes becoming a bottleneck. Without a dedicated curation effort, researchers have to restrict themselves to 1 or 2 datasets they can analyze, thus making it a resource-constraint-driven process - a huge red flag!

Moreover, as datasets are assembled from multiple sources, each might be processed differently & it becomes hard to compare multiple datasets. Hence, it becomes a long & cumbersome process for researchers to re-process data into the required format & then run this data using their own custom pipelines to compare effectively. This is not a quick activity and can immediately become a huge overload if the team isn’t equipped with the right skill sets/ scalable way of doing it.

Also, many times, metadata isn’t accurate or thorough for a question that the research team is trying to answer. There needs to be a regular metadata update by adding new fields, ideally through automation. This consistent updation of metadata is extremely important to keep contextualized discoverability of the data set easy.

Thus, biomedical data needs to be curated & formatted, ideally in a single place, for research teams to harness its power of it.

How Does Data Curation Accelerate Drug Discovery?

  • Curation helps researchers discover & process large volumes of formatted data in significantly short periods of time at each step of the drug discovery process.
  • It decreases the time required for validating & invalidating a compound during each stage of the drug discovery process. Streamlined data processing & analysis can make this iterative process faster.
  • Curation makes data from multiple omics available, enabling comprehensive insights. Drug discovery research is essentially a systems biology problem best approached with diverse data types. In the future, along with omics data, we will potentially see a rise in the demand for clinical & phenotypic data as well. For such effective practices, any dataset must be curated and easily accessible.
  • Accessible curated data can speed up pre-clinical research and reduce the overall cost of a research program.
  • Publicly available curated data will help scientists re-validate hypotheses from their previous work. This will allow them to enrich their existing data, leading to newer insights.
  • Data Curation will bring multi-disciplinary teams of research scientists and computational scientists on the same page, enabling more collaboration and contextualized insights than ever before.

If you are spending time scouring datasets just to find out relevant ones for downstream analysis, now is the time to reach out.

Connect with us to learn more about how to accelerate your drug discovery process using curated data.

Blog Categories

Request Demo