Noteworthy Proteomics Datasets For Biomarker Discovery and Target Identification

Anurag Srivastava
January 5, 2024

Acquiring proteomics datasets is challenging: they must be mined from diverse sources such as PRIDE, ProteomeXchange, MassIVE, and iProX. That is only the beginning. Only 25% of datasets are fully annotated, with disease labels a critical omission, so researchers spend significant time filling in what is missing.

Additionally, databases differ in their schemas, are semi-structured, and hold data in formats such as mzTab, mzIdentML, mzML, SDRF, and ISA-TAB, which makes the data harder to use.
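To make the annotation gap concrete, here is a minimal sketch of how missing disease labels could be flagged in an SDRF-style file. SDRF is tab-delimited, and the column names below follow the usual SDRF-Proteomics conventions. The sample rows, the file contents, and the helper name `samples_missing_disease` are invented for illustration, and treating "not available" and "not applicable" as missing is an assumption rather than something stated in this article.

```python
import csv
import io

# Hypothetical SDRF-style table (tab-separated). All rows are invented.
SDRF_TEXT = (
    "source name\tcharacteristics[organism]\tcharacteristics[disease]\tcomment[data file]\n"
    "sample 1\tHomo sapiens\tobesity\trun_01.raw\n"
    "sample 2\tHomo sapiens\tnot available\trun_02.raw\n"
    "sample 3\tHomo sapiens\t\trun_03.raw\n"
)

# Values treated as "no disease label" (an assumption for this sketch).
MISSING = {"", "not available", "not applicable"}


def samples_missing_disease(sdrf_text, column="characteristics[disease]"):
    """Return the source names of samples whose disease column is empty or a placeholder."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    return [
        row["source name"]
        for row in reader
        if (row.get(column) or "").strip().lower() in MISSING
    ]


missing = samples_missing_disease(SDRF_TEXT)
print(missing)
```

A check like this only finds gaps. Filling them still requires reading the publication or contacting the submitters.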

Elucidata’s biomedical data platform, Polly, harmonizes proteomics data and links it to metadata extracted from PRIDE, enabling efficient programmatic searches. This curated data streamlines complex queries and makes dataset discovery quick, a capability available on Polly but not on PRIDE.

In this 'Monthly Dataset Roundup,' we highlight noteworthy proteomics datasets in various disorders. Polly's high-quality ML-ready proteomics datasets are valuable resources for unraveling molecular mechanisms, predicting biomarkers, and identifying potential targets in numerous disorders.

Dataset 1

Plasma proteome profiling reveals dynamics of inflammatory and lipid homeostasis markers after Roux-en-Y gastric bypass surgery

Dataset ID: PXD009348
Year of Publication: 2018
Experiment Type: Proteomics
Total Samples: 47
Organism: Homo sapiens
Reference Link: Publication
Metadata table showing metadata features curated on Polly

Summary

Obesity-related diseases affect half of the global population, and bariatric surgery is among the few interventions that produce enduring weight loss and cardio-metabolic effects. In this study, the authors explore the impact of Roux-en-Y gastric bypass surgery on the plasma proteome. They hypothesize that specific proteins or protein patterns may act as key mediators and markers of the metabolic response.

Utilizing mass spectrometry (MS)-based proteomics in two longitudinal studies involving 47 morbidly obese patients, the authors provide quantitative information on over 1,700 proteins. A global correlation matrix, representing about 200,000 relationships, unveils functional connections between proteins, categorizing them into physiological processes. The primary classes of significantly altered proteins include markers of systemic inflammation and those involved in lipid metabolism.
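As a rough illustration of what a global correlation matrix involves, the sketch below computes Pearson correlations for every unique pair of proteins across samples. The abundance values and protein names are invented, and this is not the authors' pipeline. The summary does not say how many proteins entered the matrix, but as plain arithmetic, n proteins give n(n-1)/2 unique pairs, so roughly 630 proteins would yield about 200,000 relationships.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(abundance):
    """Map each unique protein pair to its correlation across samples."""
    return {
        (p, q): pearson(abundance[p], abundance[q])
        for p, q in combinations(sorted(abundance), 2)
    }


def n_pairs(n_proteins):
    """Number of unique protein pairs in a symmetric correlation matrix."""
    return n_proteins * (n_proteins - 1) // 2


# Hypothetical abundances: one list per protein, one value per sample.
abundance = {
    "protein_A": [1.0, 2.0, 3.0, 4.0],
    "protein_B": [2.0, 4.0, 6.0, 8.0],
    "protein_C": [4.0, 3.0, 2.0, 1.0],
}

corr = pairwise_correlations(abundance)
for pair, r in corr.items():
    print(pair, round(r, 3))
print(n_pairs(633))  # number of unique pairs among 633 proteins
```

Pairs with coefficients near +1 or -1 correspond to the correlative and anti-correlative behaviors that such a matrix is meant to surface.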

The study illuminates robust correlative and anti-correlative behaviors among circulating proteins and their associations with clinical parameters. Particularly, a group of inflammation-related proteins exhibits distinct inverse relationships with proteins consistently associated with insulin sensitivity.

Dataset 2

Analysis of 1508 plasma samples by capillary-flow data-independent acquisition profiles proteomics of weight loss and maintenance

Dataset ID: PXD013231
Year of Publication: 2019
Experiment Type: Proteomics
Total Samples: 1508
Organism: Homo sapiens
Reference Link: Publication
Metadata table showing metadata features curated on Polly

Summary

A comprehensive and high-throughput analysis of the plasma proteome holds the potential to offer a holistic assessment of an individual's health. Building on the experience and the evaluation of recent large-scale plasma MS-based proteomic studies, the authors identified two primary challenges: the slow and delicate nano-flow liquid chromatography and the irreproducibility of identification in data-dependent acquisition.

To address these challenges, the authors propose a robust capillary-flow data-independent acquisition (DIA) MS solution. This platform allows the measurement of 31 plasma proteomes per day. The study applied this approach to a large-scale analysis of the diet, obesity, and genes dietary study, comprising 1508 samples. Demonstrating robustness, the complete acquisition was achieved using a single analytical column. In total, 565 proteins (459 identified with two or more peptide sequences) were profiled, with a dataset completeness of 74%. On average, 408 proteins (5246 peptides) were identified per acquisition (319 proteins in 90% of all acquisitions).

The workflow's reproducibility was assessed using 34 quality control pools acquired at regular intervals, resulting in a dataset completeness of 92% and a coefficient of variation (CV) of 10.9% for protein measurements. The study successfully profiled 20 apolipoproteins, revealing distinct changes. Weight loss and weight maintenance showed sustained effects on low-grade inflammation as well as on steroid hormone and lipid metabolism, indicating beneficial effects.
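The two quality metrics quoted here can be sketched as follows. The summary does not spell out the exact definitions the authors used, so this example assumes the common ones: completeness as the fraction of non-missing values in a protein-by-run matrix, and CV as the sample standard deviation divided by the mean, in percent. The matrix values are invented.

```python
from statistics import mean, stdev


def completeness(matrix):
    """Fraction of cells in a protein-by-run matrix that hold a measured value."""
    values = [v for row in matrix for v in row]
    return sum(v is not None for v in values) / len(values)


def cv_percent(values):
    """Coefficient of variation (%) over the measured values of one protein."""
    observed = [v for v in values if v is not None]
    return 100.0 * stdev(observed) / mean(observed)


# Hypothetical quality control matrix: rows are proteins, columns are QC runs,
# None marks a protein that was not quantified in that run.
qc = [
    [10.0, 11.0, 9.0, 10.0],
    [5.0, None, 5.5, 4.5],
    [20.0, 22.0, None, 18.0],
]

print(f"completeness: {completeness(qc):.1%}")
for row in qc:
    print(f"CV: {cv_percent(row):.1f}%")
```

Whether a study reports the mean or the median CV across proteins varies, so published figures should be compared with that in mind.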

When compared to other large-scale plasma weight loss studies, this approach demonstrated high robustness and quality in identifying biomarker candidates. The tracking of nonenzymatic glycation indicated a delayed, slight reduction of glycation in the weight maintenance phase. Using stable isotope references, the study directly and absolutely quantified 60 proteins in the DIA data. This study represents the first large-scale plasma DIA study and is one of the largest clinical research proteomic studies to date. The application of this fast and robust workflow holds great potential for advancing biomarker discovery in plasma.

Dataset 3

Alzheimer’s disease progression characterized by alterations in the molecular profiles and biogenesis of brain extracellular vesicles

Dataset ID: PXD015578
Year of Publication: 2020
Experiment Type: Proteomics
Total Samples: 18
Organism: Homo sapiens
Reference Link: Publication
Metadata table showing metadata features curated on Polly

Summary

The role of brain intercellular communication mechanisms, specifically extracellular vesicles (EV), in the progression of Alzheimer’s disease (AD) remains elusive. In this study, the authors aimed to elucidate the contributions of brain EV to the progressive course of AD through unbiased proteome-wide analyses of EV derived from the temporal lobe. Simultaneously, complementary portions of the remaining brain were subjected to proteome-label quantitation.

Additionally, relevant proteins identified were further screened using multiple reaction monitoring. The study reveals altered EV biogenesis during preclinical AD, leading to the genesis of a specific population of EV containing MHC class-type markers. Notably, the study identifies the significant presence of the prion protein PrP in these brain vesicles during preclinical AD. The sequestration of amyloid protein APP in brain EV coincides with the observed PrP patterns. Conversely, the active incorporation of the mitophagy protein GABARAP in these brain vesicles is disrupted as AD progresses.

Similarly, disrupted incorporation of LAMP1 in brain EV is evident from the initial manifestation of AD clinical symptoms, although the levels of the protein remain significantly upregulated in the temporal lobe of diseased brains. The findings suggest that impaired autophagy in preclinical AD coincides with the appearance of proinflammatory and neuropathological features in brain extracellular vesicles, persisting moderately throughout the entire AD progression. These data underscore the significance of brain EV in establishing AD neuropathology, representing a crucial step toward therapeutic interventions involving these vesicles in human dementias.

Dataset 4

Quantitative proteomics of human heart samples collected in vivo reveals the remodeled protein landscape of dilated left atrium without atrial fibrillation

Dataset ID: PXD008722
Year of Publication: 2020
Experiment Type: Proteomics
Total Samples: 21
Organism: Homo sapiens
Reference Link: Publication
Metadata table showing metadata features curated on Polly

Summary

Genetic and genomic research has significantly advanced our understanding of heart disease. However, we still lack comprehensive, in-depth, quantitative maps of protein expression in the hearts of living humans. To address this gap, the authors conducted a study using samples obtained during valve replacement surgery in patients with mitral valve prolapse (MVP).

The primary goals were to define inter-chamber differences, explore the intersection of proteomic data with genetic or genomic datasets, and assess the impact of left atrial dilation on the proteome in patients with no history of atrial fibrillation (AF). Biopsies were collected from the right atria (RA), left atria (LA), and left ventricle (LV) of seven male patients with mitral valve regurgitation and dilated LA but no history of AF. High-resolution MS was employed, with peptides pre-fractionated by reverse-phase high-pressure liquid chromatography before MS measurement on a Q-Exactive-HF Orbitrap instrument. The authors identified 7,314 proteins based on 130,728 peptides. Results were validated in an independent set of biopsies collected from three additional individuals.

Comparative analysis against data from post-mortem samples demonstrated enhanced quantitative power and confidence levels in samples collected from living hearts. The combined analysis with data from genome-wide association studies suggested candidate gene associations to MVP, identified higher abundance in the ventricle for proteins associated with cardiomyopathies, and revealed the dilated LA proteome, which showed a differential representation of molecules previously associated with AF in non-AF hearts.

This study represents the largest dataset of cardiac protein expression from human samples collected in vivo. It offers a comprehensive resource providing insights into the molecular fingerprints of MVP and facilitates novel inferences between genomic data and disease mechanisms. The authors propose that the over-representation of proteins in the ventricle is not due to redundancy but to functional necessity. They conclude that changes in the abundance of proteins known to associate with AF are not sufficient for arrhythmogenesis.

Dataset 5

Investigation of specific proteins related to different types of coronary atherosclerosis

Dataset ID: PXD028664
Year of Publication: 2021
Experiment Type: Proteomics
Total Samples: 30
Organism: Homo sapiens
Reference Link: Publication
Metadata table showing metadata features curated on Polly

Summary

Coronary heart disease is a complex condition arising from the intricate interplay between genetic and environmental factors. This complexity poses a significant challenge in identifying potential disease candidate proteins and their associated risk markers. Atherosclerosis, a key component of heart disease, encompasses a range of conditions, such as stable coronary artery disease (SCAD) and acute myocardial infarction (AMI), the latter being the progressive stage of SCAD.

Given this, accurate and timely diagnosis of atherosclerosis becomes crucial for effective disease management and prognosis. This study focuses on the search for specific protein markers to differentially diagnose coronary atherosclerosis. Thirty male patients, aged 45 to 55 and diagnosed with atherosclerosis, were analyzed using tandem mass tag mass spectrometry (TMTMS). The study excluded those additionally diagnosed with hypertension or with type 1 or type 2 diabetes. The authors applied Mfuzz analysis to select target proteins for precise diagnosis of atherosclerosis; the selected proteins were significantly associated with high lipid metabolism.

Subsequently, the authors verified these target proteins using parallel reaction monitoring (PRM). The receiver operating characteristic (ROC) curve was calculated using a random forest analysis. The TMTMS identified 1,147 proteins, of which 907 were quantifiable. In the PRM study, six proteins related to the lipid metabolism pathway (ALB, SHBG, APOC2, APOC3, APOC4, and SAA4) were selected for verification. The specific changes detected in these six proteins contribute to accurate diagnosis in patients with atherosclerosis, especially in cases with varying disease types.
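For readers unfamiliar with ROC analysis, here is a minimal sketch of how an ROC curve and its area under the curve (AUC) are derived from classifier scores. The scores stand in for the class probabilities a random forest would output. The labels, scores, and the choice of which group is the positive class are invented, and no model is trained here, so this does not reproduce the study's analysis.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, one per score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical data: 1 marks one patient group, 0 the other; scores are
# invented stand-ins for predicted probabilities from a classifier.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

points = roc_points(labels, scores)
print(points)
print(auc(points))
```

An AUC of 0.5 corresponds to chance-level separation and 1.0 to perfect separation, which is why the metric is widely used to compare candidate marker panels.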

Dataset 6

Intracranial aneurysm biomarker candidates identified by a proteome-wide study

Dataset ID: PXD013442
Year of Publication: 2020
Experiment Type: Proteomics
Total Samples: 30
Organism: Homo sapiens
Reference Link: Publication
Sunburst chart showing metadata features curated on Polly

Summary

The scientific understanding of intracranial aneurysm (IA) formation, rupture, and the subsequent development of cerebral vasospasm remains incomplete. Aberrant protein expression may play a pivotal role in driving structural alterations in the vasculature associated with IA. Deciphering the molecular mechanisms underlying these events is essential for identifying early detection biomarkers and, consequently, improving treatment outcomes.

To unravel differential protein expression in three clinical subgroups of IA patients—(1) unruptured aneurysm, (2) ruptured aneurysm without vasospasm and (3) ruptured aneurysm with vasospasm—the authors conducted untargeted quantitative proteomic analysis on aneurysm tissue and serum samples from these subgroups, along with control subjects. Candidate molecules were validated in a larger patient cohort using enzyme-linked immunosorbent assays.

A total of 937 and 294 proteins were identified from aneurysm tissue and serum samples, respectively. Dysregulation of several proteins known to maintain the structural integrity of the vasculature was observed in the context of the aneurysm. Specifically, ORM1, a glycoprotein, was significantly upregulated in both tissue and serum samples from unruptured aneurysm patients. Further validation in a larger cohort (n = 26) confirmed ORM1 as a potential biomarker for screening unruptured aneurysms. Samples from ruptured aneurysms with vasospasm exhibited significant upregulation of MMP9, a protease, compared to ruptured aneurysms without vasospasm.

The study validated MMP9 as a potential biomarker for vasospasm in a larger cohort (n = 52). This study represents the first global proteomic analysis covering the entire clinical spectrum of IA. Furthermore, it suggests ORM1 and MMP9 as potential biomarkers for unruptured aneurysms and cerebral vasospasm, respectively.

Unraveling the molecular mechanisms is vital for gaining valuable insights into the progression of diseases. Polly's harmonization engine produces high-quality, ML-ready multi-modal datasets tailored to customer needs. It processes raw data consistently, transforms it into a harmonized form with metadata annotation, and performs rigorous quality checks.

Ready to expedite your research journey? Connect with us to explore how Polly can potentially reduce your analysis time by up to 80%, accelerating the drug discovery process.
