Dataset Information

This report has been verified by Polly as per framework version 1.0.

Dataset information	Value
Dataset ID	SCP1963_raw_custom_processed
Title	Single-cell transcriptome landscape of circulating CD4+ T cell populations in autoimmune diseases
Abstract	CD4+ T cells are key mediators of various autoimmune diseases; however, their role in disease progression remains unclear due to cellular heterogeneity. Here, we evaluated CD4+ T cell subpopulations using decomposition-based transcriptome characterization and canonical clustering strategies. This approach identified 12 independent gene programs governing whole CD4+ T cell heterogeneity, which can explain the ambiguity of canonical clustering. In addition, we performed a meta-analysis using public single-cell datasets of over 1.8 million peripheral CD4+ T cells from 953 individuals by projecting cells onto the reference and cataloging cell frequency and qualitative alterations of the populations in 20 diseases. The analyses revealed that the 12 transcriptional programs were useful in characterizing each autoimmune disease and predicting its clinical status. Moreover, genetic variants associated with autoimmune diseases showed disease-specific enrichment within the 12 gene programs. The results collectively provide a landscape of single-cell transcriptomes of CD4+ T cell subpopulations involved in autoimmune disease.
Description	Single-cell transcriptome landscape of circulating CD4+ T cell populations in autoimmune diseases
Publication Link	https://doi.org/10.1016/j.xgen.2023.100473
Number of cells	103037
Number of genes	23119
Number of samples	13
Organism	Homo sapiens
Tissue	blood
Disease	Myasthenia Gravis; Multiple Sclerosis; Lupus Erythematosus, Systemic; Normal
Cell Lines	none
Cell Type	naive thymus-derived CD4-positive, alpha-beta T cell; central memory CD4-positive, alpha-beta T cell; effector memory CD4-positive, alpha-beta T cell; regulatory T cell; effector memory CD4-positive, alpha-beta T cell, terminally differentiated
Drug	none
Marker genes for cell type are available	True
Doublet detection method	scrublet
Normalization method	log1p: true; target_sum: none; scaling_applied: true; max_value: none; zero_center: false
Remove gene groups	none
Batch correction method and key	batch_removal_method: harmony; batch_key: sample_id
Regress covariates	none

QUALITY CONTROL CONTENT

Distribution of Key Quality Control Metrics
UMAP Visualization of Cells Colored by Sample
Stacked barplot of cell types distributed across samples
Stacked barplot of cell-types distributed across clusters
Distribution of (a) Cell Counts (b) Median Gene Counts (c) Median Mitochondrial Genes, across Samples
Gene Counts Distribution
UMI Count Distribution
UMI vs Gene counts distribution scatter plots colored by density
Batch Mixing Metrics
Cell Type Annotation Metrics

1. Distribution of Key Quality Control Metrics

Figure 1: These violin plots display the distribution of quality control metrics for each cell. Metrics include the number of genes detected, total transcript counts, and the percentage of mitochondrial transcripts.

A good-quality dataset typically has a reasonable number of genes detected per cell and a moderate total transcript count. High mitochondrial transcript percentages can indicate low-quality or dying cells. Please note: some datasets do not contain mitochondrial genes (MT-), in which case the panel for the percentage of mitochondrial transcripts may be empty.
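
As an illustration of the three metrics described above, the sketch below computes them per cell from a small count matrix. The gene panel, the counts, and the function name `qc_metrics` are hypothetical; mitochondrial genes are assumed to be identified by the `MT-` prefix mentioned above.

```python
def qc_metrics(counts, gene_names):
    """Per-cell QC: genes detected, total counts, % mitochondrial counts."""
    mito_idx = [i for i, g in enumerate(gene_names) if g.startswith("MT-")]
    results = []
    for row in counts:
        total = sum(row)
        n_genes = sum(1 for v in row if v > 0)
        mito = sum(row[i] for i in mito_idx)
        # Guard against empty cells to avoid division by zero.
        pct_mito = 100.0 * mito / total if total else 0.0
        results.append(
            {"n_genes": n_genes, "total_counts": total, "pct_mito": pct_mito}
        )
    return results

genes = ["CD4", "FOXP3", "MT-CO1", "MT-ND1"]  # hypothetical gene panel
cells = [[5, 0, 3, 2], [8, 2, 0, 0]]          # hypothetical raw counts
metrics = qc_metrics(cells, genes)
```

If no gene name starts with `MT-`, `pct_mito` is 0 for every cell, which corresponds to the empty mitochondrial panel noted above.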

2. UMAP visualization of cells colored by sample

Figure 2: UMAP embedding of cells, colored by sample of origin.

If cells from the same sample cluster together, distinctly from cells of other samples, it may indicate the presence of batch effects. Ideally, cells should mix and group according to their biological characteristics rather than their sample of origin, indicating that the data is free of significant batch effects and that the samples are comparable.

3. Stacked barplot of cell types distributed across samples

Figure 3: The bar plot showcases the distribution and abundance of different cell types within each sample. Each color in a bar represents a different cell type with the height of the color segment indicating the count of that cell type in the sample.

A uniform distribution of cell types across samples may suggest that sample preparation and preprocessing were effective, with minimal bias or variation in the processing steps. If the experimental design deliberately enriches a cell type in a sample, a non-uniform distribution is also valid.

4. Stacked barplot of cell-types distributed across clusters

Figure 4: The bar plot showcases the distribution and abundance of different cell types within each cluster. Each color in a bar represents a different cell-type with the height of the color segment indicating the count of that cell-type in the cluster.

Generally, each cluster should contain only one cell type, which indicates accurate cell-type annotation. Corner cases arise when the authors have provided only a cell ID to cell-type mapping and no marker genes. These need to be rectified manually.
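
One way to quantify the "one cell type per cluster" expectation is the fraction of cells in each cluster that carry the dominant label. The sketch below is illustrative only; the labels, the cluster assignments, and the name `cluster_purity` are assumptions and not part of the report.

```python
from collections import Counter, defaultdict

def cluster_purity(clusters, cell_types):
    """Return {cluster: (dominant cell type, fraction of cells with it)}."""
    by_cluster = defaultdict(Counter)
    for cl, ct in zip(clusters, cell_types):
        by_cluster[cl][ct] += 1
    purity = {}
    for cl, counts in by_cluster.items():
        label, n = counts.most_common(1)[0]
        purity[cl] = (label, n / sum(counts.values()))
    return purity

# Hypothetical per-cell cluster assignments and cell-type labels.
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
cell_types = ["naive", "naive", "naive", "memory", "Treg", "Treg", "Treg", "Treg"]
purity = cluster_purity(clusters, cell_types)
```

A purity of 1.0 corresponds to a single-color bar in the stacked bar plot; lower values mark clusters worth inspecting.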

5. Distribution of (a) Cell Counts (b) Median Gene Counts (c) Median Mitochondrial Genes, across Samples

Figure 5a: The bar plot visualizes the total count of cells detected in each sample. Each bar corresponds to a different sample, with its height representing the number of cells.

This plot provides an understanding of the sample distribution in terms of cellularity. A wide variance in cell numbers across samples might indicate inconsistencies in cell isolation, sample preparation, or sequencing depth. Consistent cell counts across samples, however, would suggest a more uniform sampling process.

Figure 5b: The bar plot illustrates the median number of genes detected in each sample. Each bar represents a different sample, and its height corresponds to the median gene counts.

Consistently low gene counts might indicate low sequencing depth or poor-quality samples. On the other hand, large variances between samples or cell types might point to technical biases or true biological differences.

Figure 5c: The bar plot showcases the median percentage of mitochondrial gene transcripts across samples.

Consistently high mitochondrial gene percentages across samples might indicate a widespread issue with cell viability, while sporadic high values could suggest sample-specific issues, which can be addressed by removing the affected samples before downstream analysis.
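
The three per-sample summaries in Figures 5a to 5c can be sketched as a single grouping step. The input tuples and the function name `per_sample_summary` are hypothetical and serve only to show how the bar heights would be derived.

```python
from collections import defaultdict
from statistics import median

def per_sample_summary(cells):
    """cells: iterable of (sample_id, n_genes, pct_mito) tuples."""
    grouped = defaultdict(list)
    for sample_id, n_genes, pct_mito in cells:
        grouped[sample_id].append((n_genes, pct_mito))
    summary = {}
    for sample_id, values in grouped.items():
        summary[sample_id] = {
            "n_cells": len(values),                               # Figure 5a
            "median_genes": median(g for g, _ in values),         # Figure 5b
            "median_pct_mito": median(m for _, m in values),      # Figure 5c
        }
    return summary

# Hypothetical per-cell QC values for two samples.
toy_cells = [
    ("S1", 1200, 2.0), ("S1", 1500, 4.0), ("S1", 900, 3.0),
    ("S2", 400, 12.0), ("S2", 600, 18.0),
]
summary = per_sample_summary(toy_cells)
```

In this toy input, sample S2 has fewer cells, fewer detected genes and a higher mitochondrial percentage, the kind of sample-specific pattern the text describes.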

6. Gene Counts Distribution

Figure 6: The plot provides a smoothed representation of the distribution of detected genes across cells.

This plot gives an idea about the average gene richness in cells. High variability might indicate a mix of high and low-quality cells.

7. Unique molecular identifier (UMI) Count Distribution

Figure 7: The plot provides a smoothed representation of the distribution of UMIs across cells.

This plot offers insight into the typical transcriptomic depth of the dataset. A broad distribution might indicate variability in sequencing depth across cells.

8. UMI vs Gene counts distribution scatter plots colored by density

Figure 8: The scatter plot provides a visual representation of the relationship between the number of unique molecular identifiers (UMIs) and the number of genes detected in single cells. The color intensity indicates the density of data points in a particular region of the plot, allowing for the identification of trends and patterns.

Ideally, one would expect to see a positive correlation between UMIs and genes, indicating that cells with more transcripts also express more unique genes. Areas with higher density may represent the most typical cells in the dataset, while outliers could indicate low-quality cells or potential doublets.
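
The expected positive relationship can be checked numerically with a Pearson correlation between per-cell UMI counts and gene counts. The values below are hypothetical and the helper `pearson` is written out only to keep the sketch self-contained.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-cell totals.
umis = [1000, 2000, 3000, 4000, 5000]
genes = [500, 900, 1300, 1600, 2000]
r = pearson(umis, genes)
```

A coefficient close to 1 matches the trend described above; cells far from the main trend are candidates for closer inspection.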

9. Batch Mixing Metrics

	NMI	ARI	PCR_batch	Graph_iLISI	kBET_accept_rate	batch_correction_score
uncorrected	0.7051	0.8182	0.8793	0.0244	0.0772	0.0000
corrected	0.9713	0.9910	0.9855	0.0748	0.7651	1.0000

Table 1: Table displaying batch mixing metrics

These metrics are adopted from a benchmarking study of single-cell integration methods (Luecken et al. 2022). Values closer to 1 indicate better mixing of cells from the different batches.
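
The report does not state how batch_correction_score is derived. One reading that reproduces the 0.0000 and 1.0000 values in Table 1 is that each metric is min-max scaled across the two rows and the scaled values are averaged. The sketch below illustrates that assumption; it is not a documented description of the pipeline, and the function name `minmax_overall` is hypothetical.

```python
# Metric values copied from Table 1, in the order
# NMI, ARI, PCR_batch, Graph_iLISI, kBET_accept_rate.
metrics = {
    "uncorrected": [0.7051, 0.8182, 0.8793, 0.0244, 0.0772],
    "corrected":   [0.9713, 0.9910, 0.9855, 0.0748, 0.7651],
}

def minmax_overall(rows):
    """Min-max scale each metric across rows, then average per row."""
    names = list(rows)
    n_metrics = len(rows[names[0]])
    totals = {name: 0.0 for name in names}
    for j in range(n_metrics):
        col = [rows[name][j] for name in names]
        lo, hi = min(col), max(col)
        for name in names:
            # A metric with identical values in all rows contributes 0.
            scaled = (rows[name][j] - lo) / (hi - lo) if hi > lo else 0.0
            totals[name] += scaled
    return {name: totals[name] / n_metrics for name in names}

scores = minmax_overall(metrics)
```

Under this assumption the overall score only ranks the two rows against each other, so the individual metrics remain the more informative columns.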

10. Cell Type Annotation Metrics

sc_cluster	prediction	sctype_score	sctype_confidence	diff_exp_cell_markers
0	CD4+ naive T cell	0.3867	0.4287	DACT1,EDA,FAM13A,LRRN3,PECAM1
1	CD4+ central memory T cell	0.1992	0.4951	CCR6,KLRB1
2	CD4+ naive T cell	-0.2365	-0.5666	LRRN3,PECAM1
3	CD4+ effector memory T cell	2.4715	1.8014	CCL5,CEBPD,CST7,GZMA,GZMK
4	CD4+ naive T cell	-0.0367	-0.2476	LRRN3,PECAM1
5	regulatory T cell	3.5664	3.5823	FCRL3,FOXP3,HLA-DRB1,IKZF2,TIGIT
6	CD4+ central memory T cell	1.2004	2.8250	CCR6,KLRB1,PHLDA3
7	terminally differentiated effector memory CD4+ alpha-beta T cell	4.8192	3.7242	GZMH,NKG7
8	CD4+ naive T cell	0.0473	-0.1134	STAT1
9	CD4+ naive T cell	0.0363	-0.1309	LRRN3,PECAM1
10	CD4+ naive T cell	1.5244	2.2458	IFI44L,MX1,STAT1
11	CD4+ naive T cell	-0.0147	-0.2124
12	CD4+ naive T cell	1.7115	2.5446	LRRN3,PECAM1,SOX4
13	CD4+ naive T cell	0.0420	-0.1218
14	CD4+ naive T cell	-0.1009	-0.3500

Table 2: sctype score and differentially expressed genes per cell annotation

Disclaimer: The cell type annotation for the clusters that have a negative sctype_score and/or sctype_confidence value has been manually re-annotated based on specific markers that were prominent in those clusters. The updated annotations are stored in the "uns" slot of the final h5ad file.
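
The clusters covered by this disclaimer can be listed directly from Table 2 by applying the stated rule (negative sctype_score and/or negative sctype_confidence). The snippet below is a small illustration using the values from the table; the variable names are arbitrary.

```python
# (sc_cluster, sctype_score, sctype_confidence) copied from Table 2.
table2 = [
    (0, 0.3867, 0.4287), (1, 0.1992, 0.4951), (2, -0.2365, -0.5666),
    (3, 2.4715, 1.8014), (4, -0.0367, -0.2476), (5, 3.5664, 3.5823),
    (6, 1.2004, 2.8250), (7, 4.8192, 3.7242), (8, 0.0473, -0.1134),
    (9, 0.0363, -0.1309), (10, 1.5244, 2.2458), (11, -0.0147, -0.2124),
    (12, 1.7115, 2.5446), (13, 0.0420, -0.1218), (14, -0.1009, -0.3500),
]

# A cluster is flagged when either value is negative.
flagged = [c for c, score, conf in table2 if score < 0 or conf < 0]
```

For this dataset the rule selects clusters 2, 4, 8, 9, 11, 13 and 14.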

Cell type predictions are made using the author-reported cell types. Next to each prediction, the marker genes of the assigned cell type that are differentially expressed in the corresponding cluster are listed (where found). Differentially expressed genes were identified by running the Scanpy rank_genes_groups function with the following settings:
Log-fold change cutoff: 1.0
Statistical test: t-test
Adjusted p-value cutoff (Benjamini-Hochberg): 0.05
By default, the "normalized_counts" layer is used for DE testing. DE genes per cluster are identified separately within each batch, and the results from all batches are summarized at the cluster level.
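
The two cutoffs can be illustrated with a plain-Python Benjamini-Hochberg adjustment followed by a filter. The gene-level values below are hypothetical, and whether the pipeline uses strict or inclusive comparisons at the cutoffs is not stated in the report; strict comparisons are assumed here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical (gene, log fold change, raw p-value) results for one cluster.
de = [
    ("FOXP3", 2.4, 0.001),
    ("TIGIT", 1.3, 0.02),
    ("ACTB", 0.2, 0.0005),
    ("IKZF2", 1.8, 0.3),
]
padj = benjamini_hochberg([p for _, _, p in de])

# Assumed strict cutoffs: log fold change > 1.0 and adjusted p < 0.05.
significant = [g for (g, lfc, _), q in zip(de, padj) if lfc > 1.0 and q < 0.05]
```

In this toy input a gene with a very small p-value is still excluded when its log fold change is below the cutoff, and a gene with a large fold change is excluded when its adjusted p-value is too high.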

QUALITY ASSURANCE CONTENT

Metadata Information
Data Matrix
Cell Clusters in UMAP Embeddings Colored by Samples: Re-Processed and Polly Datasets
Cell Clusters in UMAP Embeddings Colored by 'Author Cell Types': Comparison Between Polly and Re-Processed Datasets
Cell Clusters in UMAP Embeddings Colored by 'Curated Cell Types': Comparison Between Polly Dataset and Re-Processed Data
Violin plot visualization for doublet
Cell Type Frequency Distribution

1. Metadata Information

Metadata information	Value
Polly curated metadata fields are present at dataset level	Pass
Polly curated metadata fields are present at sample level	Pass
Polly curated metadata fields are present in output file	Pass
Custom fields are present in output file	Pass
Publication Link is provided	Pass
Publication Link is valid	Pass
Dataset-Level vs Sample-Level Metadata: concordance check	Pass
Accuracy of raw counts availability tag	NA

2. Data Matrix

Data Matrix	Value
Unique Cell Barcodes	Pass
Unique Gene Identifiers	Pass
Embeddings are available	Pass
Gene Identifier Format	Pass
Raw counts are available in output file	Pass
Raw vs Processed Counts are different	Pass
Valid Raw Counts	Pass
Concordance of number of cells in raw and processed counts matrices in output file	NA
Valid Columns	Pass
Highly Variable Genes is available	Pass
Valid Processed Counts	Pass
UMAP/tSNE Projections are available	Both present
QC Metrics are available	Pass
Reproducibility of Gene Counts	Pass
Reproducibility of UMI Counts	Pass
Cluster information is available	Pass
Number of Clusters	15
Minimum genes per cell threshold	350
Minimum cells per gene threshold	2
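
The last two rows of the table describe filtering thresholds. The sketch below shows how such thresholds are typically applied to a cells x genes count matrix. The toy matrix, the smaller thresholds used for it, the function name `filter_matrix`, and the order of the two steps (cells first, then genes) are assumptions; the report itself lists 350 genes per cell and 2 cells per gene.

```python
def filter_matrix(counts, min_genes, min_cells):
    """Drop cells with fewer than min_genes detected genes, then drop genes
    detected in fewer than min_cells of the remaining cells."""
    kept_cells = [
        row for row in counts if sum(1 for v in row if v > 0) >= min_genes
    ]
    n_genes = len(counts[0])
    kept_gene_idx = [
        g for g in range(n_genes)
        if sum(1 for row in kept_cells if row[g] > 0) >= min_cells
    ]
    filtered = [[row[g] for g in kept_gene_idx] for row in kept_cells]
    return filtered, kept_gene_idx

# Hypothetical 4 cells x 4 genes matrix, filtered with toy thresholds.
toy = [[1, 0, 2, 0], [0, 0, 3, 0], [4, 1, 0, 0], [2, 0, 1, 1]]
filtered, kept_gene_idx = filter_matrix(toy, min_genes=2, min_cells=2)
```

With the thresholds from the table, the same call would read `filter_matrix(counts, min_genes=350, min_cells=2)`.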

3. Cell Clusters in UMAP Embeddings Colored by Samples: Re-Processed and Polly Datasets

Figure 1a: UMAP embedding of the existing Polly data, with cells colored by sample.

Figure 1b: UMAP embedding of the re-processed data, with cells colored by sample, used to validate the reproducibility of results.

The plot visualizes the distribution of samples across clusters. The plots for the Polly and re-processed datasets should appear very similar. The plot for the Polly dataset can additionally be used to check for batch effects.

Sample Clustering: If cells from the same sample are spread across clusters rather than grouped closely together, this suggests the absence of batch effects.
Batch Effect Evidence: Conversely, if cells from the same sample cluster together, there might be evidence of batch effects.
Biological Variation Check: Sample-wise separation can also reflect inherent biological differences between samples, so this possibility should be ruled out before concluding that a batch effect is present.
Distribution Visualization: The plot also illustrates how samples are spread across different clusters, providing insight into their distribution.
Limitation of Re-processed Dataset: The UMAP/tSNE plot of the re-processed dataset may not be a valid basis for assessing batch effects, because the re-processing is intended primarily as a reproducibility check.

4. Cell Clusters in UMAP Embeddings Colored by 'Author Cell Types': Comparison Between Polly and Re-Processed Datasets

Figure 2a: UMAP embedding of the existing Polly data, with cells colored by author cell type.

Figure 2b: UMAP embedding of the re-processed data, with cells colored by author cell type, used to validate the reproducibility of results.

Cell Type Distribution (author-defined): The plot visualizes the distribution of author-defined cell types across clusters. As a quality check, the plots for the Polly and re-processed datasets should appear very similar.
Cell Type Similarity: The UMAP plot also reveals the degree of similarity between cell types. If cell types A and B lie close together, their gene expression patterns are similar, which suggests biological similarity between them.

5. Cell Clusters in UMAP Embeddings Colored by 'Curated Cell Types': Comparison Between Polly Dataset and Re-Processed Data

Figure 5a: UMAP embedding of the existing Polly data, with cells colored by curated cell type.

Figure 5b: UMAP embedding of the re-processed data, with cells colored by curated cell type, used to validate the reproducibility of results.

Cell Type Distribution by Elucidata (Curation Experts): The plot visualizes how curated cell types are distributed among clusters. As a quality check, the plots for the Polly and re-processed datasets should appear very similar.
Cell Type Relationships: The plot shows the proximity of different cell types within the embedding. If cell types A and B lie close together, their gene expression patterns are similar, which suggests biological similarity between them.

6. Sample Wise Distribution of Number of Genes Expressing Using a Barplot

Figure 5: Sanity check of detected doublets

To assess the validity of doublet predictions, we plot the distribution of detected genes in predicted doublets versus singlets for each sample (the number of genes detected is typically expected to be higher in heterotypic doublets). If doublets have been removed, the plot shows only the distribution for singlets.
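
The comparison behind this sanity check can be sketched as a per-sample median of detected genes, split by doublet call. The input tuples and the function name `median_genes_by_call` are hypothetical.

```python
from collections import defaultdict
from statistics import median

def median_genes_by_call(cells):
    """cells: iterable of (sample_id, n_genes, is_doublet) tuples.
    Returns the median number of detected genes per sample and call."""
    grouped = defaultdict(lambda: {"singlet": [], "doublet": []})
    for sample_id, n_genes, is_doublet in cells:
        key = "doublet" if is_doublet else "singlet"
        grouped[sample_id][key].append(n_genes)
    return {
        sample_id: {k: (median(v) if v else None) for k, v in calls.items()}
        for sample_id, calls in grouped.items()
    }

# Hypothetical per-cell values; S2 has no predicted doublets.
toy_cells = [
    ("S1", 1000, False), ("S1", 1200, False), ("S1", 1100, False),
    ("S1", 2100, True), ("S1", 1900, True),
    ("S2", 800, False), ("S2", 900, False),
]
medians = median_genes_by_call(toy_cells)
```

A clearly higher median for predicted doublets than for singlets in the same sample is consistent with the expectation stated above; a `None` entry corresponds to a sample whose plot shows singlets only.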

7. Cell Type Frequency Distribution

	Cell type (reported in publication)	Cell type (Polly curated)	Number of cells
0	["CD4+ central memory T cell"]	["central memory CD4-positive, alpha-beta T cell"]	22047
1	["CD4+ effector memory T cell"]	["effector memory CD4-positive, alpha-beta T cell"]	8793
2	["CD4+ naive T cell"]	["naive thymus-derived CD4-positive, alpha-beta T cell"]	59918
3	["regulatory T cell"]	["regulatory T cell"]	6823
4	["terminally differentiated effector memory CD4+ alpha-beta T cell"]	["effector memory CD4-positive, alpha-beta T cell, terminally differentiated"]	5456

Table 2: Table displaying author cell types, curated cell types and the number of cells for each cell-type

Authors frequently supply cell types that may not adhere to ontological standards or utilize abbreviations and marker gene names. These are substituted with ontological terms. The table offers insight into the degree of alignment between the ontological terms and the terms provided by the authors.
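
As a quick consistency check, the per-cell-type counts in the table can be summed and compared with the number of cells reported in the Dataset Information section (103037). The snippet below uses the values from the table; the variable names are arbitrary.

```python
# Polly curated cell types and cell counts copied from the table above.
cell_type_counts = {
    "central memory CD4-positive, alpha-beta T cell": 22047,
    "effector memory CD4-positive, alpha-beta T cell": 8793,
    "naive thymus-derived CD4-positive, alpha-beta T cell": 59918,
    "regulatory T cell": 6823,
    "effector memory CD4-positive, alpha-beta T cell, terminally differentiated": 5456,
}

total = sum(cell_type_counts.values())

# Share of each cell type in the dataset.
fractions = {name: n / total for name, n in cell_type_counts.items()}
```

The counts add up to the reported total, and naive CD4+ T cells account for a little over half of the dataset.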

DATA EXPLORATION CONTENT

Variance Ratio Explained by Top 10 Principal Components
Expression of Marker Genes Across Cell Types
Expression of Marker Genes Across Clusters
Sunburst plots for metadata fields
UMAP plots for categorical metadata
UMAP plots for Polly curated metadata

1. Variance Ratio Explained by Top 10 Principal Components

Figure 1: The bar chart illustrates the variance ratio explained by each of the top 10 principal components (PCs). Each bar represents the proportion of the total variance in the data attributed to the corresponding principal component.

The PCA variance plot highlights the proportion of total variance captured by each of the top 10 principal components. This helps in understanding how much of the total variance in the data is captured by the initial components.
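
The variance ratio of a component is its variance divided by the total variance of the data. The sketch below illustrates this with hypothetical component variances; the numbers and the function name `explained_variance_ratio` are assumptions and are not taken from this dataset.

```python
from itertools import accumulate

def explained_variance_ratio(component_variances, total_variance):
    """Fraction of total variance captured by each principal component."""
    return [v / total_variance for v in component_variances]

# Hypothetical variances of the top 3 components and of the whole dataset.
top_variances = [4.0, 2.0, 1.0]
total_variance = 8.0

ratios = explained_variance_ratio(top_variances, total_variance)
cumulative = list(accumulate(ratios))
```

The cumulative sum shows how much of the total variance the leading components capture together, which is what the bar chart helps to judge.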

2. Expression of Marker Genes Across Cell Types

Figure 2: The dot plot showcases the expression levels (typically represented by dot color intensity) and prevalence, i.e. the fraction of cells expressing the gene (typically represented by dot size), of specific marker genes across different cell types.

Marker genes that are predominantly expressed in specific cell types validate the identified cell populations and help in characterizing and annotating them.
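
The two quantities a dot plot summarizes for each gene and group are the mean expression and the fraction of cells in which the gene is detected. The sketch below computes both for one gene; the expression values, the labels, and the function name `dotplot_stats` are hypothetical.

```python
from collections import defaultdict

def dotplot_stats(expr, groups):
    """expr: per-cell expression of one gene; groups: per-cell group label.
    Returns mean expression and fraction of expressing cells per group."""
    by_group = defaultdict(list)
    for value, group in zip(expr, groups):
        by_group[group].append(value)
    return {
        group: {
            "mean_expression": sum(values) / len(values),
            "fraction_expressing": sum(1 for x in values if x > 0) / len(values),
        }
        for group, values in by_group.items()
    }

# Hypothetical normalized expression of one marker gene in 8 cells.
foxp3 = [2.0, 0.0, 4.0, 0.0, 0.0, 0.0, 1.0, 0.0]
labels = ["Treg"] * 4 + ["naive"] * 4
stats = dotplot_stats(foxp3, labels)
```

A marker that validates a population shows both a higher mean and a higher expressing fraction in that population than in the others, as in the toy values here.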

3. Expression of Marker Genes Across Clusters

Figure 3: The dot plot showcases the expression levels (typically represented by dot color intensity) and prevalence, i.e. the fraction of cells expressing the gene (typically represented by dot size), of specific marker genes across different clusters.

This visualization aids in understanding the heterogeneity within the dataset and can hint at different cellular states or subtypes within a cell type.

4. Sunburst plots for metadata fields

Figure 4: A Sunburst plot illustrating the distribution of data. It reflects user-defined custom fields if specified; otherwise, it represents standard fields.

5. UMAP plots for categorical metadata

Figure 5: The UMAP visualization represents cells in a reduced dimensional space, with colors indicating various categorical attributes.

6. UMAP plots for Polly curated metadata

Figure 6: This UMAP visualization represents cells in a reduced dimensional space, with colors indicating the Polly curated fields.
