Use Polly’s Data Concierge services to search its expansive corpus of over 2,400 curated metadata files for PRIDE datasets and find the data that best matches your inclusion/exclusion criteria.
Our experts can run extensive searches and query Polly’s metadata-annotated proteomics collection from PRIDE to help you swiftly locate the datasets of interest, all within minutes.
Polly extracts data from PRIDE, which often comes in inconsistent formats and without a data matrix, and engineers it into a consistent GCT (Gene Cluster Text) format.
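To make the target format concrete, here is a minimal sketch of serializing a feature-by-sample matrix as GCT text. It assumes the simple GCT v1.2 layout (a version line, a dimensions line, a header row, then one row per feature); the source does not state which GCT version Polly emits, and the function name `to_gct` and the example identifiers are hypothetical.

```python
def to_gct(row_ids, descriptions, sample_ids, matrix):
    """Serialize a feature-by-sample matrix as GCT v1.2 text (illustrative sketch)."""
    if not (len(row_ids) == len(descriptions) == len(matrix)):
        raise ValueError("row_ids, descriptions and matrix must have the same length")

    # Line 1: format version. Line 2: number of data rows and data columns.
    lines = ["#1.2", f"{len(row_ids)}\t{len(sample_ids)}"]
    # Line 3: fixed "Name" and "Description" columns followed by the sample IDs.
    lines.append("\t".join(["Name", "Description", *sample_ids]))

    for row_id, description, values in zip(row_ids, descriptions, matrix):
        if len(values) != len(sample_ids):
            raise ValueError(f"row {row_id!r} has {len(values)} values, "
                             f"expected {len(sample_ids)}")
        # A missing measurement is written as an empty field.
        cells = ["" if value is None else str(value) for value in values]
        lines.append("\t".join([row_id, description, *cells]))

    return "\n".join(lines) + "\n"


# Hypothetical example: two proteins measured in three samples.
gct_text = to_gct(
    row_ids=["P1", "P2"],
    descriptions=["protA", "protB"],
    sample_ids=["s1", "s2", "s3"],
    matrix=[[1.0, 2.5, None], [0.0, 3.0, 4.0]],
)
```

Keeping the dimensions in the file header lets a reader validate the matrix shape before parsing any values.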
The datasets are processed in custom pipelines (based on user requirements) and are annotated with 30+ metadata fields at the dataset, sample, and feature levels, making them ML-ready and fit for downstream analysis.
Polly also ensures data integrity and quality by performing ~50 QA/QC checks for schema compliance, metadata validation, technical artifacts, and more, across datasets.
Analyze and visualize harmonized proteomics data from PRIDE through Polly’s Python package or through pre-configured or custom applications that enable data querying and analysis.
Work with our experts to perform metadata-based exploration, build knowledge graphs, develop interactive dashboards, and more, to deep-dive into data for better insights.
Polly curates data from PRIDE in custom pipelines (via MaxQuant or as required by the user). Polly’s powerful harmonization engine processes measurements, links to ontology-backed metadata, and transforms datasets into a consistent data schema. It further performs rigorous quality checks and then stores pristine quality, ML-ready data on an Atlas on Polly, or a platform of choice for further analysis.
Extracting data from PRIDE is extremely inefficient and time-consuming. The bulk of datasets on PRIDE have insufficient metadata, and only ~25% of the datasets are annotated for disease, rendering programmatic search futile. Furthermore, the data is available in diverse formats (like mzTab, mzIdentML, mzML, etc.) and most often does not include the data matrix. This makes extensive quality checks and harmonization necessary before any downstream use.
Polly’s harmonization engine transforms the unstructured data into ML-ready datasets, labels them with rich metadata, and processes them consistently in pipelines to ensure the highest-quality data and maximum integrity, fit for insight generation.
Decrease in time spent on data curation.
Metadata fields annotated on every dataset.
Of the datasets are consistently processed.
Times as much metadata after harmonization.
Compare a harmonized dataset on Polly with un-annotated data from PRIDE.
Polly’s datasets come with 99% accuracy and have curated fields like disease, tissue, cell type, cell line, organism, etc., linked to their ontologies. Also, they are checked for logical errors, lexical errors, schema mismatch, publication information and more.
Polly delivers datasets with 100% metadata completeness and 0 empty metadata fields. All datasets are linked to ontology-backed metadata at dataset, sample / cell, and feature levels with 6 standard metadata fields at dataset level, and 15 standard fields at the sample level.
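As an illustration of what a metadata completeness figure measures, the sketch below computes the share of required metadata fields that are actually filled across a set of sample records. The function name, the record layout, and the example values are assumptions made for illustration; the field names are taken from the curated fields mentioned above, and this is not Polly’s own implementation.

```python
def is_filled(value):
    """Treat None and blank strings as empty; any other value counts as filled."""
    return value is not None and str(value).strip() != ""


def metadata_completeness(records, required_fields):
    """Return the fraction of required metadata cells that are filled (0.0 to 1.0)."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 1.0  # nothing is required, so nothing is missing
    filled = sum(
        1
        for record in records
        for field in required_fields
        if is_filled(record.get(field))
    )
    return filled / total


# Hypothetical sample-level records: the second one has two empty fields.
samples = [
    {"disease": "lung cancer", "tissue": "lung", "organism": "Homo sapiens"},
    {"disease": "", "tissue": "liver", "organism": None},
]
score = metadata_completeness(samples, ["disease", "tissue", "organism"])
```

A dataset with 100% completeness would return 1.0 from this function, meaning no required field is empty in any record.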
Polly ensures the highest-quality data, fit for downstream analysis, by performing a rigorous ~50-step QA/QC check on each dataset. All datasets are checked for standard file format, sample number mismatch, duplication of IDs, inconsistent metadata, and more.
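Two of the checks named above, sample number mismatch and duplication of IDs, can be sketched in a few lines. This is a simplified illustration under assumed inputs (plain lists of sample and feature IDs); the function name `qc_report` and the message wording are hypothetical and do not represent Polly’s actual QA/QC suite.

```python
from collections import Counter


def qc_report(matrix_sample_ids, metadata_sample_ids, feature_ids):
    """Return a list of human-readable QC issues; an empty list means all checks passed."""
    issues = []

    # Check 1: the data matrix and the sample metadata should describe the same samples.
    if len(matrix_sample_ids) != len(metadata_sample_ids):
        issues.append(
            f"sample number mismatch: matrix has {len(matrix_sample_ids)}, "
            f"metadata has {len(metadata_sample_ids)}"
        )
    without_metadata = sorted(set(matrix_sample_ids) - set(metadata_sample_ids))
    if without_metadata:
        issues.append("samples without metadata: " + ", ".join(without_metadata))

    # Check 2: feature IDs must be unique within a dataset.
    duplicates = sorted(fid for fid, n in Counter(feature_ids).items() if n > 1)
    if duplicates:
        issues.append("duplicate feature IDs: " + ", ".join(duplicates))

    return issues


# Hypothetical inputs: one sample lacks metadata and one feature ID is repeated.
problems = qc_report(["s1", "s2", "s3"], ["s1", "s2"], ["P1", "P2", "P1"])
```

Returning a list of issues rather than raising on the first failure lets a single pass report every problem found in a dataset.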