Fast-track Time to Insight with Harmonized CPTAC Data

Polly incorporates all accessible metadata from both CPTAC sources (PDC and GDC) and harmonizes it into a unified data model to accelerate analysis.


Find and Query ML-ready Datasets from CPTAC 10x Faster

Polly meticulously curates metadata and ensures efficient and swift data querying.


Perform Multi-omics Studies with Ease Using CPTAC Datasets

Polly houses consistently processed CPTAC data, enriched with detailed metadata that conforms to ontologies and controlled vocabularies, streamlining multi-omics analysis.



Polly Makes CPTAC Data Usable & Actionable

Use Polly's data concierge service for tailored matches (based on your inclusion/exclusion criteria) from harmonized CPTAC datasets across ~10 cancer types.

Our experts can help you swiftly locate the datasets of interest by performing complex queries on Polly’s metadata-annotated CPTAC data, all within minutes.

CPTAC metadata is split between GDC and PDC, with only ~25% overlap. Polly consolidates both sources into a comprehensive superset of metadata for each dataset, ensuring completeness and facilitating multi-omics analysis.

Polly hosts CPTAC data processed through the Common Data Analysis Pipeline, with 30+ metadata fields at the dataset, sample, and feature levels, making the data ML-ready for downstream analysis.

Polly also ensures data integrity and quality by running ~50 QA/QC checks across datasets, covering lexical errors, schema compliance, metadata validation, technical artifacts, and more.

Analyze and visualize harmonized proteomics and transcriptomics data from CPTAC using Polly's Python package or pre-configured or custom applications.

Collaborate with our experts to perform multi-omics analyses or metadata-based exploration, build interactive dashboards, and delve deeper into data for enhanced insights.


How Does Polly Harmonize CPTAC Datasets?

Polly harmonizes CPTAC datasets processed through the Common Data Analysis Pipeline, linking proteomics data (PDC) and transcriptomics data (GDC) to ontology-backed metadata. Following rigorous quality checks, it stores the high-quality, ML-ready data on Polly's Atlas or any custom platform for analysis.
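To picture what consolidating GDC and PDC metadata into a superset can look like, here is a minimal sketch, not Polly's actual implementation. The sample IDs, field names, and values are hypothetical, and the rule of keeping the GDC value when the two sources disagree is an arbitrary assumption made for illustration.

```python
# Hypothetical per-sample metadata; not real CPTAC records.
gdc = {
    "S1": {"disease": "lung adenocarcinoma", "tissue": "lung"},
    "S2": {"disease": "glioblastoma"},
}
pdc = {
    "S1": {"tissue": "lung", "tumor_stage": "Stage II"},
    "S3": {"disease": "colon adenocarcinoma", "tissue": "colon"},
}


def merge_metadata(gdc_records, pdc_records):
    """Build a field-wise union per sample and report conflicting values."""
    merged, conflicts = {}, []
    for sample_id in sorted(set(gdc_records) | set(pdc_records)):
        record = dict(gdc_records.get(sample_id, {}))
        for field, value in pdc_records.get(sample_id, {}).items():
            if field in record and record[field] != value:
                # Assumption: keep the GDC value, but surface the disagreement.
                conflicts.append((sample_id, field, record[field], value))
                continue
            record[field] = value
        merged[sample_id] = record
    return merged, conflicts


merged, conflicts = merge_metadata(gdc, pdc)
```

In this toy input, sample S1 ends up with fields from both sources, while S2 and S3 each keep whatever their single source provided. Conflicts are returned rather than silently resolved so that a curator can review them.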

The Polly Difference

CPTAC vs. Polly

Polly offers a superior alternative to CPTAC by providing meticulously harmonized proteomics and transcriptomics data in a queryable format.
With Polly, researchers can seamlessly explore and analyze data without the hassle of reconciling incomplete metadata from multiple sources like GDC and PDC.
Datasets are indexed as GCT files in Polly's Atlas, each presenting a log2-transformed data matrix along with its metadata fields, giving researchers accessible and comprehensive resources.
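As a rough illustration of working with such a file, the sketch below reads a tiny GCT-style table using only the Python standard library and converts the log2 values back to linear scale. It assumes the GCT 1.3 layout (version line, dimensions line, header, sample-metadata rows, then data rows). The function name `read_gct13` and the example content are hypothetical, and this is not Polly's own reader.

```python
import csv
import io


def read_gct13(text):
    """Parse GCT 1.3 text into sample IDs, sample metadata, and a data matrix."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if rows[0][0] != "#1.3":
        raise ValueError("expected a GCT 1.3 file")
    n_rows, n_cols, n_row_meta, n_col_meta = (int(x) for x in rows[1][:4])
    first_data_col = 1 + n_row_meta  # skip the id column and row-metadata columns
    sample_ids = rows[2][first_data_col:]
    if len(sample_ids) != n_cols:
        raise ValueError("header does not match the declared sample count")

    # Sample (column) metadata rows sit between the header and the data rows.
    sample_meta = {sid: {} for sid in sample_ids}
    for row in rows[3:3 + n_col_meta]:
        for sid, value in zip(sample_ids, row[first_data_col:]):
            sample_meta[sid][row[0]] = value

    matrix = {}
    for row in rows[3 + n_col_meta:3 + n_col_meta + n_rows]:
        matrix[row[0]] = [float(v) for v in row[first_data_col:]]
    return sample_ids, sample_meta, matrix


# Hypothetical two-feature, two-sample file with one metadata field per axis.
example = "\n".join([
    "#1.3",
    "2\t2\t1\t1",
    "id\tgene_symbol\tS1\tS2",
    "disease\tNA\tlung adenocarcinoma\tnormal",
    "P1\tTP53\t1.0\t3.0",
    "P2\tEGFR\t-0.5\t0.5",
])

sample_ids, sample_meta, matrix = read_gct13(example)

# Undo the log2 transform when linear-scale abundances are needed.
linear = {feature: [2.0 ** v for v in values] for feature, values in matrix.items()}
```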

Request Demo

Of the datasets are consistently processed.


Decrease in time spent on data curation.


Metadata fields annotated on every dataset.


Richer metadata after harmonization.

Snapshot of a Polly Harmonized Dataset

Compare a harmonized dataset on Polly with unannotated data from CPTAC.

Why Choose Polly to Access CPTAC Datasets?

Metadata Accuracy

Metadata on Polly’s datasets is 99% accurate, with curated fields such as disease, tissue, cell type, cell line, and organism linked to their ontologies. Datasets are also checked for logical errors, lexical errors, schema mismatches, publication information, and more.

Metadata Completeness

Polly ensures complete metadata coverage by capturing all metadata from PDC and GDC. Each dataset includes 6 standard fields linked to standard ontologies and over 30 harmonized fields covering the dataset, sample, and feature levels.

Data Quality

Polly ensures the highest-quality data, fit for downstream analysis, by performing a rigorous ~50-step QA/QC check on each dataset. All datasets are checked for standard file format, sample number mismatches, duplicate IDs, inconsistent metadata, and more.
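A few of the checks named above can be sketched in a handful of lines. This is an illustrative example only and covers three checks rather than the full ~50-step process. The function name `run_checks`, the inputs, and the message wording are all hypothetical.

```python
from collections import Counter


def run_checks(matrix_sample_ids, metadata, required_fields):
    """Return a list of human-readable issues found in one dataset."""
    issues = []

    # Check 1: duplicate sample IDs in the data matrix.
    dupes = sorted(sid for sid, n in Counter(matrix_sample_ids).items() if n > 1)
    if dupes:
        issues.append(f"duplicate sample IDs: {dupes}")

    # Check 2: samples present in the matrix but not the metadata, or vice versa.
    no_meta = sorted(set(matrix_sample_ids) - set(metadata))
    no_data = sorted(set(metadata) - set(matrix_sample_ids))
    if no_meta or no_data:
        issues.append(f"sample mismatch: no metadata for {no_meta}, no data for {no_data}")

    # Check 3: required metadata fields that are missing or empty.
    for sid in sorted(metadata):
        for field in required_fields:
            if not metadata[sid].get(field):
                issues.append(f"{sid}: missing required field '{field}'")

    return issues


# Hypothetical dataset with one problem of each kind.
issues = run_checks(
    matrix_sample_ids=["S1", "S2", "S2"],
    metadata={"S1": {"disease": "glioblastoma"}, "S3": {"disease": ""}},
    required_fields=["disease"],
)
```

Returning a list of issues instead of raising on the first failure lets every problem in a dataset be reported in a single pass.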
