Noteworthy Datasets on Colorectal Cancer

Harsha Malavia, Deepthi Das
April 28, 2022

The ‘Monthly Dataset Roundup’ series features datasets on Polly that are of scientific value, with the aim of promoting the sharing and reuse of useful cancer data. This month, we feature a selection of large colorectal cancer (CRC) datasets, curated versions of which can be found and analyzed on Polly.

Polly hosts a large variety of curated datasets that users can filter by disease, organism, tissue, or data type.

Dataset 1

The Cancer Genome Atlas colorectal adenocarcinoma dataset (TCGA-COAD)

A visual summary of the TCGA-COAD data collection on Polly

Dataset ID: COAD-*

Year of Publication: 2012

Total Samples: 2549 samples from 460 patients

Experiment type: Multiomics (CNV, miRNA, Transcriptomics, Proteomics, Methylation, Mutation)

Organism: Homo sapiens

Reference link: Publication

TCGA-COAD is a large collection of multi-omics data of colorectal adenocarcinoma from CRC patients, published as part of a larger effort to build a research community focused on connecting cancer phenotypes to genotypes.

The collection consists of multiple omics studies carried out on tissue from the same patients, along with samples used to study the response of CRC to various drugs.

Samples from the TCGA-COAD collection can be downloaded from Polly in the GCT file format or, depending on the data type, analyzed directly on the platform using GUI-based apps or a programming environment.
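For readers who want to work with a downloaded GCT file outside the platform, here is a minimal sketch of a parser. It assumes the plain GCT v1.2 layout (a version line, a dimensions line, then a tab-separated table whose first two columns are Name and Description). Files exported from Polly may use a different GCT version with embedded metadata, which this sketch does not handle. The function name `read_gct_v12` and the example values are illustrative, not part of any Polly API.

```python
import csv
import io


def read_gct_v12(handle):
    """Parse a GCT v1.2 file into (sample_names, {feature_name: [values]})."""
    reader = csv.reader(handle, delimiter="\t")
    version = next(reader)[0].strip()
    if version != "#1.2":
        raise ValueError(f"expected GCT version #1.2, got {version!r}")
    # Second line declares the number of data rows and sample columns.
    n_rows, n_cols = (int(x) for x in next(reader)[:2])
    header = next(reader)
    samples = header[2:]  # first two columns are Name and Description
    if len(samples) != n_cols:
        raise ValueError("sample count does not match the declared dimensions")
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = [float(v) for v in row[2:]]
    if len(matrix) != n_rows:
        raise ValueError("row count does not match the declared dimensions")
    return samples, matrix


# Tiny made-up example: the gene symbols are real, the values are invented.
example = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "APC\tna\t5.1\t4.8\t2.0\n"
    "KRAS\tna\t7.3\t7.9\t9.4\n"
)
samples, matrix = read_gct_v12(io.StringIO(example))
print(samples)
print(matrix["KRAS"])
```

To read a real file, pass an open file handle instead of the in-memory string, for example `open(path, newline="")`.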

Dataset 2

RNA-seq of tumor-educated platelets enables blood-based pan-cancer, multiclass, and molecular pathway cancer diagnostics.

Dataset ID: GSE68086_GPL16791

Year of Publication: 2015

Total Samples: 283 cells

Experiment type: Single cell RNA-Seq

Organism: Homo sapiens

Reference link: Publication

The dataset is composed of scRNA-seq data of tumor-educated platelets (TEPs) from the blood of patients with various types of cancer, along with platelets from healthy controls.


Tumor-educated blood platelets (TEPs) are implicated as central players in the systemic and local responses to tumor growth, thereby altering their RNA profile.

The researchers report RNA-sequencing data of 283 blood platelet samples, including 228 tumor-educated platelet (TEP) samples collected from patients with six different malignant tumors (non-small cell lung cancer, colorectal cancer, pancreatic cancer, glioblastoma, breast cancer, and hepatobiliary carcinomas). RNA-sequencing data of blood platelets isolated from 55 healthy individuals is also reported. This dataset highlights the potential of TEP RNA-based 'liquid biopsies' in patients with several types of cancer, including for pan-cancer, multiclass cancer, and companion diagnostics.

Using these data, the researchers distinguished 228 patients with localized and metastasized tumors from 55 healthy individuals with 96% accuracy. Across six different tumor types, the location of the primary tumor was correctly identified with 71% accuracy.

Also, MET or HER2-positive, and mutant KRAS, EGFR, or PIK3CA tumors were accurately distinguished using surrogate TEP mRNA profiles.

The results indicate that blood platelets provide a valuable platform for pan-cancer, multiclass cancer, and companion diagnostics, possibly enabling clinical advances in blood-based "liquid biopsies".

UMAP of the TEPs derived from cancer patients along with healthy volunteers
Polly can be used to easily carry out differential gene expression analysis of single-cell RNA-seq data using Cellxgene, a GUI-based tool hosted on the Polly platform.

Dataset 3

Gene expression profiles of breast, colorectal, prostate, and non-small cell lung cancer

Dataset ID: GSE103512_GPL13158

Year of Publication: 2017

Total Samples: 280

Experiment type: Transcriptomics

Organism: Homo sapiens

Reference link: Publication


The tumor microenvironment is an important factor in cancer immunotherapy response. To further understand how a tumor affects the local immune system, the researchers analyzed immune gene expression profiles from 280 formalin-fixed, paraffin-embedded normal and tumor samples of four cancer types.

Regulatory T cells (Tregs) were found to be one of the main drivers of immune gene expression differences between normal and tumor tissue, leading to the conclusion that Treg gene expression is highly indicative of the overall tumor immune environment.

PCA clustering of samples based on the cancer type
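To illustrate how a PCA plot like this separates samples, here is a minimal sketch that projects samples onto their first principal component using power iteration. It uses only the Python standard library to stay self-contained. The expression values are synthetic, and the two groups (`type_A`, `type_B`) are hypothetical stand-ins rather than the four cancer types in this dataset.

```python
import math
import random


def first_pc_scores(samples, n_iter=100, seed=0):
    """Project samples (rows) onto their first principal component.

    The leading loading vector is found by power iteration on X^T X,
    where X is the gene-centered expression matrix.
    """
    n, p = len(samples), len(samples[0])
    # Center each gene (column) so PCA captures variance, not mean expression.
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    x = [[row[j] - means[j] for j in range(p)] for row in samples]

    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(p)]  # random starting direction
    for _ in range(n_iter):
        xv = [sum(xi[j] * v[j] for j in range(p)) for xi in x]          # X v
        v = [sum(x[i][j] * xv[i] for i in range(n)) for j in range(p)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return [sum(xi[j] * v[j] for j in range(p)) for xi in x]


# Synthetic data: 8 samples x 5 genes; two genes are shifted upward in type_B.
rng = random.Random(42)
labels = ["type_A"] * 4 + ["type_B"] * 4
data = []
for label in labels:
    shift = 4.0 if label == "type_B" else 0.0
    data.append([
        8.0 + shift + rng.gauss(0, 0.3),
        6.0 + shift + rng.gauss(0, 0.3),
        7.0 + rng.gauss(0, 0.3),
        5.0 + rng.gauss(0, 0.3),
        9.0 + rng.gauss(0, 0.3),
    ])

pc1 = first_pc_scores(data)
for label, score in zip(labels, pc1):
    print(f"{label}\t{score:+.2f}")
```

The sign of a principal component is arbitrary, so only the separation between the groups along PC1 is meaningful, not which group lands on the positive side.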

Dataset 4

Gene expression profiling of colorectal cancer liver metastases (CRLM).

The dataset classifies CRC liver metastases of differing TP53 mutation status using a de novo classification system called liver metastasis subtype (LMS).

Dataset ID: GSE159216_GPL17586

Year of Publication: 2021

Total Samples: 280

Experiment type: Transcriptomics

Organism: Homo sapiens

Reference link: Publication


Gene expression-based subtyping has the potential to form a new paradigm for stratified treatment of colorectal cancer. However, the established frameworks are based on the transcriptomic profiles of primary tumors, and metastatic heterogeneity is a challenge. Here the researchers aimed to develop a de novo metastasis-oriented framework.

High-resolution microarray gene expression profiling was performed on 283 liver metastases from 171 patients treated by hepatic resection, including multiregional and/or multi-metastatic samples from each of 47 patients.

Using this dataset, they developed a de novo liver metastasis subtype (LMS) framework that recapitulated the main distinction between epithelial-like and mesenchymal-like tumors, with a strong immune and stromal component only in the latter.

LMS1 metastases had several transcriptomic features of cancer aggressiveness, including secretory progenitor cell origin, oncogenic addictions, and microsatellite instability in a microsatellite stable background, as well as frequent RAS/TP53 co-mutations.

LMS5 showed a mesenchymal phenotype with higher immune system activation while LMS1-4 showed epithelial characteristics.

LMS5 shows a mesenchymal phenotype compared to LMS1-4, which show an epithelial phenotype, as evident from the clustering of samples along PC1
Differential expression analysis of LMS5 samples vs. the other subtypes shows a high number of DE genes
Pathway analysis of LMS5 compared to the other subtypes shows upregulation of immune activity-related pathways
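The kind of two-group comparison behind a differential expression analysis can be sketched as follows. This is a simplified illustration, not the method used on Polly or in the publication: it computes a log2 fold change and Welch's t statistic per gene, assuming the values are already log2-scaled (as is typical for processed microarray data). The gene names and numbers are hypothetical. P-values and multiple-testing correction are omitted, since they would need a t-distribution function from a library such as SciPy.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent groups with unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Hypothetical log2 expression values; names and numbers are illustrative only.
expression = {
    "GENE_IMMUNE": {"LMS5": [9.1, 9.4, 8.9, 9.6], "other": [6.2, 6.0, 6.5, 6.1]},
    "GENE_FLAT": {"LMS5": [7.0, 7.2, 6.9, 7.1], "other": [7.1, 6.9, 7.0, 7.2]},
}

results = {}
for gene, groups in expression.items():
    # Values are log2-scaled, so the log2 fold change is a difference of means.
    log2_fc = mean(groups["LMS5"]) - mean(groups["other"])
    results[gene] = (log2_fc, welch_t(groups["LMS5"], groups["other"]))

for gene, (log2_fc, t_stat) in results.items():
    print(f"{gene}\tlog2FC={log2_fc:+.2f}\tt={t_stat:+.2f}")
```

In a volcano plot, each gene's log2 fold change goes on the x-axis and a significance measure derived from such a test goes on the y-axis.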

Dataset 5

Profiling of CD8+T cells upon treatment with extracellular vesicles derived from colorectal cancer and normal patients with different body mass index

Dataset ID: GSE152508_GPL20844

Year of Publication: 2020

Total Samples: 15

Experiment type: Transcriptomics

Organism: Homo sapiens

Reference link: Publication


Colorectal cancer (CRC) is one of the most widely diagnosed cancers worldwide. It has been shown that the body mass index (BMI) of patients can influence the tumor microenvironment, treatment response, and overall survival rates. Nevertheless, the mechanism by which BMI affects tumorigenesis, particularly the tumor microenvironment, is still elusive.

Here the researchers postulated that extracellular vesicles (EVs) from CRC patients and non-CRC volunteers with different BMI could affect immune cells differently, particularly CD8+ T cells.

The changes in CD8+ T cells upon treatment with different types of extracellular vesicles, isolated from obese and non-obese volunteers with and without CRC, were studied using RNA-seq.

This study highlights possible differences in the regulatory mechanisms of cancer patient-derived EVs, especially their effect on CD8+ T cells.

PCA plot showed the CRC patient-derived EV samples clustering together, regardless of the BMI status of the EV source
Heatmap of gene expression across CD8+ T cells treated with different EVs
Volcano plot representing differentially expressed genes in CD8+ T cells treated with EVs derived from obese volunteers with CRC vs. obese volunteers without CRC
