FAIR Data

Data-driven Target Identification Using Public Datasets

Soumya Luthra
May 10, 2019

Elucidata’s mission is to accelerate drug discovery by using a data-driven approach. In line with this, over the last 4 years, we have worked with several partners to improve the understanding of biological systems and design novel therapies. These organizations include big and small pharmaceutical companies, early-stage biotech startups and academic labs with diverse research interests.

While working with our partners, one consistent trend we have noticed is the increasing use of public datasets. More than 70% of our projects involve significant analysis of one or more public datasets. This is especially true for biotech startups, which have fewer resources to conduct their own experiments.

Using Relevant Public Datasets Helps Shorten Timelines

A lot has been said about the expanse of publicly available biological data. Data accumulation at EMBL-EBI has increased by more than 7 orders of magnitude in less than 10 years. TCGA, GTEx, GEO, Metabolomics Workbench, PRIDE, etc. are some of the most popular publicly available resources that generate and/or aggregate biological data.

In our experience supporting drug programs over the past couple of years, we have seen the benefits of using public data to aid and supplement any research effort, along with the challenges it brings. It allows scientists to ask questions and generate hypotheses without having to invest time, money and other resources to generate their own data. Public datasets are also often used to support hypotheses and findings from independent experiments.

For discovery programs working on a tight budget, using published datasets effectively and efficiently can be the differentiator between advancing to the next step and dying a natural death.

Finding and Using Relevant and High-quality Datasets Is a Challenge

In spite of the availability of numerous molecular data resources, they aren't used to their full potential. The biggest roadblock in getting started is identifying the most relevant publications and datasets for your context from the massive sea of data out there. The search capabilities provided by data repositories return thousands of related hits, and sifting through them requires significant manual effort.

It could take weeks for a scientist to analyze all the datasets and find the most relevant one. Machine learning techniques coupled with a scalable technology platform can reduce this to minutes.
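As an illustration, here is a minimal sketch of how such a ranking could work: score each dataset's metadata text against a free-text query using TF-IDF and cosine similarity. The accessions and descriptions below are made up, and a real system would index millions of records with far more sophisticated models.

```python
# A minimal, illustrative sketch of metadata-based dataset ranking.
# Accessions and descriptions are hypothetical, not real repository records.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each entry stands in for the title/summary text of a public dataset.
datasets = {
    "DS-001": "RNA-seq of pancreatic cancer cell lines treated with gemcitabine",
    "DS-002": "Metabolomics of liver tissue from fasted mice",
    "DS-003": "Expression profiling of KRAS-mutant pancreatic tumors",
}

query = "pancreatic cancer KRAS expression"

# Vectorize dataset descriptions and the query into one TF-IDF space.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(list(datasets.values()) + [query])

# Cosine similarity between the query (last row) and every dataset.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
for accession, score in sorted(zip(datasets, scores), key=lambda x: x[1], reverse=True):
    print(f"{accession}\t{score:.3f}")
```

Even this toy example surfaces the KRAS pancreatic tumor dataset first; the point is that a scored, ranked shortlist replaces an unranked wall of search hits.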

Even if the scientist has read tens of papers and identified the most relevant datasets, challenges around data handling, storage, analysis, and integration follow. The size and complexity of biological data call for extensive in-house data storage infrastructure, computing resources and analytical expertise to mine it for meaningful insights. This is often a big ask for academic labs and small biotech companies.

Public portals such as cBioPortal, GEPIA, ARCHS4, TCGA Firehose, the UCSC Xena browser, etc. are some recent efforts to overcome these challenges. They provide easy access to the datasets as well as the results of pre-defined analysis pipelines run on them, eliminating the need for in-house resources to perform a preliminary analysis. Each of these tools and portals, however, is rigid in its own way and lacks the ability to customize an analysis to the specific needs of a project, which is often the requirement.

Unified Recommendation Engines for Public Datasets Might Be the Answer

So how do we solve these challenges? There are already efforts such as CREEDS and GEM-TREND, which can recommend signatures (sets of genes with a characteristic expression pattern) and datasets when a scientist has data of their own. However, these algorithms do not take metadata into account, which hampers the specificity of the dataset search. Efforts such as CREEDS have shown that manual curation combined with algorithms works much better than algorithms alone. A machine learning solution that takes both metadata and high-quality manual curation into account will therefore provide far better results than recommendations based on data alone. We need a platform that can integrate different repositories, enable manual curation, create personalized models and enable powerful analysis in one click.
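To make the idea concrete, here is a hedged sketch of one way such a blend could look: an algorithmic similarity score combined with a manual-curation quality score via a simple weighted sum. The weights, field names and scores are illustrative assumptions, not a description of any existing system.

```python
# Illustrative only: blend an algorithmic similarity score with a
# manual-curation quality score. Weights and values are assumptions.
def combined_score(similarity, curation_quality, w_sim=0.7, w_cur=0.3):
    """Weighted blend: well-curated, well-annotated datasets get a boost."""
    return w_sim * similarity + w_cur * curation_quality

candidates = [
    {"accession": "DS-001", "similarity": 0.88, "curation_quality": 0.4},
    {"accession": "DS-003", "similarity": 0.82, "curation_quality": 0.9},
]
ranked = sorted(
    candidates,
    key=lambda c: combined_score(c["similarity"], c["curation_quality"]),
    reverse=True,
)
print([c["accession"] for c in ranked])  # DS-003 overtakes DS-001
```

The design point is that a well-curated dataset with slightly lower raw similarity can outrank a poorly annotated one, which is exactly the behavior the CREEDS experience suggests.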

A platform like Polly can return a short list of publications based on the user's personalized history and rank them in order of relevance. The scientist, instead of wading through thousands of search results, would look at results curated for them by the platform. This could also account for the localized ecosystem: what are my peers in my lab reading? Once the user selects the relevant datasets, the platform would suggest the analysis tools and pipelines that can be run on the data. The user would be able to run the desired tools and pipelines in one click without spending hours wrangling data, while retaining complete control over the parameters used by the pipelines.
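As a sketch of that personalization step, the re-ranker below nudges datasets whose metadata keywords overlap with what the user (or their lab) has recently read. The history structure and boost factor are illustrative assumptions, not Polly's actual API.

```python
# Illustrative sketch of history-based re-ranking; not Polly's actual API.
def personalize(ranked, history_keywords, boost=0.1):
    """Boost datasets that share keywords with the user's recent activity."""
    rescored = [
        (accession, score + boost * len(set(keywords) & history_keywords))
        for accession, score, keywords in ranked
    ]
    return sorted(rescored, key=lambda x: x[1], reverse=True)

history = {"pancreatic", "kras", "rna-seq"}  # from the user's and lab's reading
ranked = [
    ("DS-002", 0.86, ["metabolomics", "liver"]),
    ("DS-003", 0.84, ["pancreatic", "kras", "tumor"]),
]
print(personalize(ranked, history))  # DS-003 rises above DS-002
```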

A platform that solves these challenges can inform the research efforts of a drug program most effectively and help identify potential targets with faster iterations. In our view, such a platform would help meet our mission of accelerating drug discovery by using data.
