Integrative Analysis of Spatial and Single-cell RNA-seq Datasets to Characterize Tumor Microenvironment

The tumor microenvironment (TME) is a specialized ecosystem created by and for tumor cells. It is a complex community composed of multiple cell types (tumor cells, immune cells, stromal cells such as fibroblasts, and the endothelial cells of blood vessels) and surrounding tissue components (the stroma and extracellular matrix).

A dynamic crosstalk between these components creates a unique microenvironment that is increasingly conducive to the development and progression of the tumor.

The TME has been shown to be essential for generating heterogeneity, driving clonal evolution, and enhancing multi-drug resistance in tumor cells. Variation in TME composition between patients has also been linked to variation in therapeutic outcomes across a variety of cancers. Thus, the TME has attracted great research and clinical interest as a therapeutic target in cancer.

This blog explores the challenges of working with spatial transcriptomics data and walks through an approach to mitigating them. Let's dig in.

Challenges Around Utilization of Spatial Transcriptomics Data

Traditionally, single-cell technologies have been used to unravel the cellular heterogeneity of the TME, providing a more comprehensive understanding of tumor biology. However, the tissue context of the TME dictates how these cells interact with each other and with acellular components, and this context is lost in single-cell analyses. As a result, features of tumor organization, such as tumor edge vs. core regions and tertiary lymphoid structures (TLS), are difficult to evaluate. Additionally, rare cell types and those that cannot withstand harsh dissociation protocols are under-represented in single-cell data.

Spatial transcriptomics (ST) technologies are poised as powerful discovery tools for decoding the TME ecosystem and bringing the current therapeutic research into an entirely new paradigm. However, there are a few challenges to effectively utilizing spatial data:

  1. Limited Expertise: Given the relative nascency of high-throughput ST technologies, the average research laboratory lacks the bioinformatics expertise to effectively analyze spatial datasets. Furthermore, new computational methods tailored towards spatial data are actively being developed. These methods require systematic evaluation to determine their reliability. 
  2. Low resolution: Popular platforms like 10X Visium sample the gene expression on a tissue section in "spots" containing 7-10 cells on average. Deconvolving the cell type composition within each spot is the first hurdle in using this data, and its accuracy largely determines the quality of everything downstream. Accurate deconvolution requires reference single-cell datasets that biologically match the tumor samples of interest. This in turn requires auditing multiple public sources, a significant amount of metadata curation, and benchmarking single-cell data integration.
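
To illustrate what deconvolution means for a multi-cell spot, here is a deliberately simplified Python sketch. It is not how CARD or any production tool works: it assumes only two cell types, treats a spot's expression as a linear mix of two reference profiles, and estimates the mixing fraction by ordinary least squares. The function name and the toy profiles are hypothetical.

```python
def estimate_mixing_fraction(spot, profile_a, profile_b):
    """Least-squares estimate of the fraction f of cell type A in a spot.

    Assumes the toy model: spot ~= f * profile_a + (1 - f) * profile_b,
    where all three arguments are equal-length lists of expression values.
    The estimate is clipped to the valid range [0, 1].
    """
    diff = [a - b for a, b in zip(profile_a, profile_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("reference profiles are identical; f is not identifiable")
    numer = sum((x - b) * d for x, b, d in zip(spot, profile_b, diff))
    return min(1.0, max(0.0, numer / denom))


# Toy reference profiles over three genes (made-up numbers).
tumor_profile = [10.0, 0.0, 5.0]
immune_profile = [0.0, 8.0, 5.0]

# A spot built as 70% tumor and 30% immune expression.
spot_expression = [7.0, 2.4, 5.0]
tumor_fraction = estimate_mixing_fraction(spot_expression, tumor_profile, immune_profile)
```

The sketch also shows why a biologically matched reference matters: if the reference profiles do not resemble the cells actually present in the tissue, the estimated fractions are unreliable no matter how good the solver is.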

Mitigating Challenges with Elucidata

Elucidata offers technology and services to help scientists go from data to insights. Our bioinformatics experts developed a customized pipeline to dissect tumor organization in spatial datasets. The pipeline leverages high-quality reference scRNA-seq datasets available on Polly, Elucidata’s data harmonization platform, to deconvolve the cell type composition of each spot and annotate the most malignant tumor regions. It then annotates the tumor core vs. edge using a combination of gene expression and histology.

Key steps of the solution are as follows: 

  1. Dataset Identification & Pre-processing: Our approach relies on integrative analysis of spatial transcriptomics and single-cell RNA-seq datasets. The first step to a sensible, effective analysis is ensuring that the input data is of high quality.  

    With Polly, our teams have access to a curated collection of multiple types of omics datasets, including spatial transcriptomics and scRNA-seq data. Elucidata’s expert-driven data audits efficiently identify datasets of interest across public sources. 

    Subsequently, these datasets undergo rigorous QC processing with customized pipelines, annotation with rich metadata, and integration using our LLM-powered harmonization technology. This process significantly reduces the time and effort required for data cleaning, allowing researchers to focus on scientific discovery.

    To study the TME, we selected a breast cancer spatial transcriptomics dataset generated using the 10X Visium technology and two types of reference scRNA-seq datasets:

    1. a breast cancer dataset to map tumor cell proportions on the spatial data, and
    2. a normal breast dataset to serve as the reference when scoring malignancy in the cancer data.
  2. Cell Type Deconvolution: After QC processing of the spatial data, we used CARD (Conditional AutoRegressive-based Deconvolution) to perform spatially informed cell type deconvolution with the cancer reference scRNA-seq dataset. In addition to the cell-type expression information available in the reference single-cell data, CARD leverages neighborhood similarity and spatial correlation in cell-type composition through a conditional autoregressive modeling assumption, which improves the power of the deconvolution analysis.

    Taking advantage of Polly curated metadata, we matched the spatial and single-cell data with respect to the breast cancer clinical subtype in order to minimize potential variation beyond cell type heterogeneity. Spots were deconvolved into multiple cancerous and non-cancerous cell types. The cumulative proportion of cancerous cell types in a spot was used to define its cancer cell type proportion. A preliminary annotation of tumor vs normal spots was performed at this stage by placing a cutoff on the cumulative cancer cell type proportion.   
  3. Copy Number Variation Analysis to Score Malignancy: To identify the most malignant spots in the breast cancer spatial samples, we used copy number variation (CNV) analysis, a common approach in the literature for scoring malignancy relative to a normal reference. CNV profiles for individual spots in the spatial data were estimated with the inferCNV tool, using the normal breast scRNA-seq dataset as the reference. The CNV profiles were used to derive two scores per spot:
    1. CNV signal, which represents the distance of a spot’s CNV profile from the average CNV profile of the reference (normal) cells, and
    2. CNV correlation, which represents the correlation coefficient between the CNV profile of a spot and the average CNV profile of the other preliminary tumor spots in the sample.
  4. Annotation of Malignant Regions and Identification of Gene Expression Markers: The cancer cell type proportion and CNV scores of each spot were used to classify it as malignant or non-malignant under a stringent classification scheme. Spots with a high cancer cell type proportion (> 70%) or CNV signal above the first quartile and showing high CNV correlation (correlation coefficient > 0.5) were classified as malignant. Differential expression analysis between the malignant and non-malignant spots was then performed to identify marker genes for the malignant region in each spatial sample.
  5. Annotation and Transcriptional Characterization of Tumor Core vs. Edge Regions: Once marker genes for malignant regions within a spatial section were identified, we used a deep learning-based tool, TESLA, to further dissect the tumor core vs. edge regions. This tool first enhances gene expression to super-resolution using imputation based on tissue histology, then integrates gene expression markers and histology to annotate the tumor region and separate it into core and edge. Differentially expressed genes between core and edge regions were subjected to pathway enrichment analyses to identify the transcriptional programs associated with each region.
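
To make the scoring and classification logic of the deconvolution, CNV scoring, and malignancy annotation steps more concrete, here is a minimal, dependency-free Python sketch. It is not the pipeline's actual code: CARD and inferCNV are not called here, and their outputs are assumed to be already available as plain dictionaries. Several details are assumptions of this sketch rather than statements from the analysis: the cell type labels, the preliminary tumor cutoff (`prelim_cutoff`), the use of Euclidean distance for the CNV signal and Pearson correlation for the CNV correlation, and the reading of the rule as (high proportion OR CNV signal above the first quartile) AND high CNV correlation.

```python
from math import sqrt
from statistics import mean, quantiles

# Hypothetical labels for the cancerous cell types in the reference data.
CANCER_TYPES = frozenset({"cancer_basal", "cancer_cycling"})


def cancer_proportion(proportions, cancer_types=CANCER_TYPES):
    """Cumulative proportion of cancerous cell types in one deconvolved spot."""
    return sum(p for cell_type, p in proportions.items() if cell_type in cancer_types)


def euclidean(a, b):
    """Euclidean distance between two equal-length CNV profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def mean_profile(profiles):
    """Element-wise mean of a list of CNV profiles."""
    return [mean(column) for column in zip(*profiles)]


def classify_spots(cancer_prop, cnv, normal_mean,
                   prelim_cutoff=0.5, prop_cutoff=0.7, corr_cutoff=0.5):
    """Label each spot 'malignant' or 'non-malignant'.

    cancer_prop: spot id -> cumulative cancer cell type proportion
    cnv:         spot id -> CNV profile (list of floats)
    normal_mean: average CNV profile of the normal reference cells
    prelim_cutoff is a hypothetical value; the text does not state it.
    """
    # Preliminary tumor spots: cutoff on the cumulative cancer proportion.
    prelim_tumor = [s for s in cnv if cancer_prop[s] > prelim_cutoff]
    # CNV signal: distance of each spot from the normal reference profile.
    signal = {s: euclidean(profile, normal_mean) for s, profile in cnv.items()}
    first_quartile = quantiles(list(signal.values()), n=4)[0]

    labels = {}
    for spot, profile in cnv.items():
        # CNV correlation: against the mean of the *other* preliminary tumor spots.
        others = [cnv[t] for t in prelim_tumor if t != spot]
        corr = pearson(profile, mean_profile(others)) if others else 0.0
        high_prop_or_signal = (cancer_prop[spot] > prop_cutoff
                               or signal[spot] > first_quartile)
        is_malignant = high_prop_or_signal and corr > corr_cutoff
        labels[spot] = "malignant" if is_malignant else "non-malignant"
    return labels


# Toy example with four spots and four genomic bins (made-up numbers).
cancer_prop = {"s1": 0.9, "s2": 0.8, "s3": 0.6, "s4": 0.05}
cnv = {
    "s1": [1.30, 0.70, 1.20, 1.00],
    "s2": [1.25, 0.75, 1.15, 1.00],
    "s3": [1.20, 0.80, 1.10, 1.00],
    "s4": [1.00, 1.02, 0.98, 1.00],
}
labels = classify_spots(cancer_prop, cnv, normal_mean=[1.0, 1.0, 1.0, 1.0])
# s1, s2 and s3 are labeled malignant; s4 is labeled non-malignant.
```

In the toy data, spot `s3` falls below the 70% proportion threshold but is still labeled malignant because its CNV signal and CNV correlation are both high, which is the case the combined rule is meant to capture under this reading.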

Key Outcomes

  • Malignant vs. non-malignant annotations showed good agreement with ground truth labels. Our annotation process was consistently able to discriminate between normal and cancerous spots as annotated by authors of the original study based on histology. This gives confidence that the approach can be applied to other samples lacking such annotation. 
  • Differentially expressed genes identified between tumor core and edge indicated a distinct microenvironment between the two regions and helped in identifying pathways enriched among edge vs core genes.  


  • Spatial transcriptomics datasets available on Polly are consistently annotated with at least 30 metadata fields, providing valuable insights into associations between the transcriptional landscape, microenvironment, and patient outcomes across multiple studies.
  • Curated, analysis-ready single-cell datasets on Polly can be used for cell type annotation to uncover cell type heterogeneity and their localization/interaction patterns in spatial datasets of various cancers.
  • Processing spatial transcriptomics data through custom downstream processing pipelines enables the study of detailed structures such as the tumor core, tumor edge, and other aspects of the tumor microenvironment.
  • Extending such approaches to data from other types of cancer can help characterize the spatial cellular and molecular heterogeneity between the tumor edge and core and improve our understanding of tumor-TME dynamics.
Connect with us or reach out to us at info@elucidata.io to learn more.
