Polly: Elucidata’s ML-Ops Platform for Biomedical R&D

Jayashree

September 23, 2022

Data, data everywhere, not a byte to use!

2 trillion GB of data is generated every year, but 80% of it is unstructured and thus unusable. In other words, biomedical data is unFAIR.

A vast amount of biological multi-omics data is generated worldwide at any given point. This data has enormous potential for discovery and reuse across R&D projects, but it is extremely hard to search and to keep up with newly emerging datasets.

What is Polly?

Polly is a data-centric MLOps platform for biomedical data that provides access to FAIR (Findable, Accessible, Interoperable, and Reusable) multi-omics data from public and proprietary sources. Data from these sources is harmonized and curated using ML models, ensuring that it is machine-actionable and analysis-ready. Polly’s cloud infrastructure enables seamless data analysis, visualization, and sharing by offering a toolbox of scalable, easy-to-customize bioinformatics pipelines.

Our Technology: How Does Data Become ML-ready?

Data is ingested from sources such as publications and databases (public ones like GEO, or proprietary ones) and made machine-actionable on Polly. All datasets are stored in a consistent, analysis-ready file format.

Every source has its own protocols for accessing data. One option would be to download the data manually and keep it in our infrastructure, but that would make the data untraceable and force us to track new datasets by hand. Dedicated ETL pipelines called connectors are designed to solve most of these challenges.

Connectors enable us to download datasets from a particular source and keep track of any new datasets. Apart from downloading data, a connector is also responsible for data harmonization, i.e., the process of combining data with varying file formats, naming conventions, and columns and transforming it into one cohesive dataset. These ETL pipelines thus provide seamless data ingestion and metadata harmonization.
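The two connector responsibilities described above (tracking new datasets and harmonizing their metadata) can be sketched roughly as follows. This is a minimal illustration, not Polly's implementation: the `Connector` class, the `COLUMN_ALIASES` mapping, and the column names are all hypothetical, and a real connector would fetch listings from a source's API rather than receive them as a dictionary.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from source-specific column names to one shared schema.
COLUMN_ALIASES = {
    "organism": "organism",
    "species": "organism",
    "tissue": "tissue",
    "tissue_type": "tissue",
    "disease": "disease",
    "condition": "disease",
}


def harmonize_record(raw: dict) -> dict:
    """Rename known columns to the shared schema and normalize string values."""
    harmonized = {}
    for key, value in raw.items():
        canonical = COLUMN_ALIASES.get(key.strip().lower())
        if canonical is None:
            continue  # this sketch drops columns the shared schema does not know
        if isinstance(value, str):
            value = value.strip().lower()
        harmonized[canonical] = value
    return harmonized


@dataclass
class Connector:
    """Toy connector that remembers which dataset IDs it has already ingested."""
    source_name: str
    seen_ids: set = field(default_factory=set)

    def sync(self, listing: dict) -> dict:
        """Ingest only unseen datasets; return them harmonized, keyed by ID."""
        new_datasets = {}
        for dataset_id, raw in listing.items():
            if dataset_id in self.seen_ids:
                continue  # already ingested on an earlier sync
            new_datasets[dataset_id] = harmonize_record(raw)
            self.seen_ids.add(dataset_id)
        return new_datasets
```

Calling `sync` repeatedly with a growing listing returns only the datasets that appeared since the previous call, which is the traceability that manual downloads lack.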

Metadata annotation is crucial for improving the quality of datasets. With more than a million datasets currently on Polly, annotating all of them manually would not scale. We therefore use an MLOps pipeline that annotates most of our datasets automatically.

BERT is one of the most widely adopted models in natural language processing (NLP), and its strong benchmark results have led to its use across many NLP tasks. Such language models help scan biomedical literature and extract information that is later used to enhance search. PollyBERT, built on top of BERT, enriches the way we access metadata from various data sources.

A central pillar of PollyBERT (Polly’s curation infrastructure) is the use of ontologies and controlled vocabularies to annotate metadata fields such as disease, organism, cell line, tissue, cell type, drugs, genotypic perturbation, and chemical perturbation. Access to these annotations gives users powerful mechanisms to query the data. Through our curation pipeline, the metadata is harmonized using ontologies, and the data is saved in accessible formats: either as GCT files, which support a wide range of omics and non-omics data, or as h5ad files, which support larger, more complex data such as single-cell RNA-seq.
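To illustrate why ontology-backed annotation makes querying more powerful, here is a small sketch that maps free-text organism values to a controlled term and then filters datasets by that term. It deliberately swaps in a plain dictionary lookup for the ML model: PollyBERT's actual inference is not shown in the text. The function names and the dataset IDs are hypothetical; the two NCBI Taxonomy identifiers are real.

```python
# Illustrative controlled vocabulary: synonym -> (ontology ID, preferred label).
# A production system would load terms from ontology files, not hard-code them.
ORGANISM_VOCAB = {
    "homo sapiens": ("NCBITaxon:9606", "Homo sapiens"),
    "human": ("NCBITaxon:9606", "Homo sapiens"),
    "mus musculus": ("NCBITaxon:10090", "Mus musculus"),
    "mouse": ("NCBITaxon:10090", "Mus musculus"),
}


def annotate_organism(free_text):
    """Return (ontology_id, preferred_label) for a free-text organism, or None."""
    key = " ".join(free_text.lower().split())  # collapse case and whitespace
    return ORGANISM_VOCAB.get(key)


def filter_by_organism(datasets, ontology_id):
    """Return IDs of datasets whose free-text organism maps to ontology_id."""
    matches = []
    for dataset_id, organism_text in datasets.items():
        term = annotate_organism(organism_text)
        if term is not None and term[0] == ontology_id:
            matches.append(dataset_id)
    return matches
```

Because "Human" and "HOMO  SAPIENS" resolve to the same identifier, a single query on that identifier finds both datasets, which a plain string match on the raw metadata would not.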

The manual curation infrastructure generates the training data used to build these machine learning models. The models are deployed on AWS SageMaker and can be accessed via APIs.
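As a rough sketch of what calling such a deployed model could look like, the function below sends text to a SageMaker endpoint through a runtime client's `invoke_endpoint` call and parses a JSON reply. The endpoint name and the request and response JSON shapes are assumptions made for illustration; the text does not describe Polly's actual API contract.

```python
import json


def annotate_via_endpoint(runtime_client, endpoint_name: str, text: str) -> dict:
    """Send text to a SageMaker endpoint and parse its JSON response.

    runtime_client is expected to behave like boto3's "sagemaker-runtime"
    client. The {"text": ...} payload schema is hypothetical.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"text": text}).encode("utf-8"),
    )
    # The response body is a stream; read it fully before decoding the JSON.
    return json.loads(response["Body"].read())
```

With boto3 installed and AWS credentials configured, the client would typically be created with `boto3.client("sagemaker-runtime")`. Passing the client in as an argument keeps the function easy to exercise with a stub when no endpoint is available.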

The clean, curated and annotated data is stored in a repository on Polly called OmixAtlas.

OmixAtlas - The Data Warehouse

OmixAtlas is a collection of millions of datasets from public, proprietary, and licensed sources that have been curated, harmonized, and made ready for downstream machine learning and analytical applications. It is one central location to access data spanning over 26 data types from over 30 public repositories and licensed sources. Our offerings can be categorized as Public OmixAtlas or Enterprise OmixAtlas.

Public OmixAtlas

These datasets can be accessed through the GUI or programmatically with Polly Python. Computational resources can be scaled to the complexity of the job using Polly's notebooks, dockers, and machine types.

Polly Python:

Polly-python is a library that makes it easy for users to search and access rich multi-omics data linked with metadata.

With Polly-python, one can:

interact with all the Polly functionalities (Workspaces, OmixAtlas, computational machines on Polly)
build queries that are not limited to the functions in Polly-python
easily use the APIs in multiple programming languages
easily integrate or dockerize with third-party products such as apps and libraries
use it conveniently outside Polly (on different cloud computing platforms) with controlled data consumption metrics
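As a loose illustration of building a metadata query programmatically, the helper below assembles a SQL-style query string from a dictionary of filters. It assumes a SQL-like query syntax, which the text does not specify, and the function, table, and column names are all hypothetical. It only builds a string and makes no call into Polly-python, whose query interface should be checked in the library's own documentation.

```python
def build_metadata_query(table: str, filters: dict, limit: int = 10) -> str:
    """Build a SQL-style SELECT with equality filters joined by AND.

    Only filter values are escaped; table and column names are assumed to
    come from trusted code, not from user input.
    """
    clauses = []
    for column, value in filters.items():
        escaped = str(value).replace("'", "''")  # double single quotes in values
        clauses.append(f"{column} = '{escaped}'")
    query = f"SELECT * FROM {table}"
    if clauses:
        query += " WHERE " + " AND ".join(clauses)
    return f"{query} LIMIT {limit}"


# Hypothetical table and column names, for illustration only.
example_query = build_metadata_query(
    "example_repo.datasets",
    {"organism": "Homo sapiens", "disease": "Crohn's disease"},
    limit=5,
)
```

Filtering on harmonized fields such as organism and disease is what the ontology-based curation makes reliable, since every dataset uses the same term for the same concept.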

Polly Notebooks:

Polly Notebook is a scalable analytics platform that allows us to perform data analysis remotely in a Jupyter-like notebook. It provides the flexibility to select the compute capacity and the environment as needed.

Polly CLI:

Polly CLI (Command Line Interface) is a tool that enables bioinformaticians to interact with Polly services using commands in their command-line shell. It lets them upload data and run jobs on the Polly cloud infrastructure, scaling computation resources as needed. It also allows them to start and stop jobs, monitor them, and view logs.

Features:

List Workspaces
Transfer files between a local machine and Polly workspaces
Launch a batch job, get the status and logs
Manage dockers on Polly
Build dockers in the cloud and get status and logs
Publish Polly environments and apps

Polly for Different Personas:

Contact us if you want to learn more about using our 1.5 million curated datasets to train your models, or about using our data-centric platform, Polly, to find and analyze relevant datasets.



info@elucidata.io
