Figures and data

SCellBOW workflow.
a, Schematic overview of the SCellBOW workflow for identifying cellular clusters and assessing the aggressiveness of the predicted clusters. For SCellBOW clustering, a corpus was first created from the gene expression matrix, where cells were analogous to documents and genes to words. Next, the pre-trained model was retrained with the vocabulary of the target dataset. Clustering was then performed on the embeddings generated by the neural network. For SCellBOW phenotype algebra, vectors were created for the reference (total tumor) and the queries. Each query vector was then subtracted from the reference vector, and the predicted risk score was calculated using a bootstrapped random survival forest. Finally, survival probability was evaluated, and phenotypes were stratified by the median predicted risk score.
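The corpus-creation step can be illustrated with a minimal sketch. It assumes that each gene name is repeated once per rounded unit of expression and that the resulting words are shuffled; the legend does not specify the exact encoding, so the function name `cells_to_documents`, the toy genes, and the toy counts are all hypothetical.

```python
import random


def cells_to_documents(matrix, gene_names, seed=0):
    """Turn each cell (row) into a 'document' made of gene-name 'words'.

    Assumption: a gene appears once per rounded unit of expression, and the
    words are shuffled so that gene order carries no meaning.
    """
    rng = random.Random(seed)
    corpus = []
    for row in matrix:
        words = []
        for gene, value in zip(gene_names, row):
            words.extend([gene] * int(round(value)))
        rng.shuffle(words)
        corpus.append(words)
    return corpus


# Toy expression matrix: 2 cells (rows) x 3 genes (columns).
genes = ["CD3E", "MS4A1", "LYZ"]
expression = [
    [2, 0, 1],
    [0, 3, 0],
]
corpus = cells_to_documents(expression, genes)
```

Each inner list of `corpus` plays the role of one document, which a document-embedding model could then consume in place of natural-language text.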

Evaluation of single-cell representations using SCellBOW.
a-c, UMAP plots for the normal prostate (a), PBMC (b), and pancreas (c) datasets. The coordinates are colored by cell types.
d-f, UMAP plots for normal prostate (d), PBMC (e), and pancreas (f) datasets, where the coordinates are colored by SCellBOW clusters. CL is used as an abbreviation for cluster.
g-i, Radial plot for the percentage of contribution of different methods towards ARI for various resolutions ranging from 0.2 to 2.0. ItClust is a resolution-independent method; thus, the ARI is kept constant across all the resolutions.
j, Box plot for the NMI of different methods across different resolutions ranging from 0.2 to 2.0 in steps of 0.2.
k, Bar plot for the cell type silhouette index (SI) for different methods. The default resolution was set to 1.0.
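For readers unfamiliar with the ARI reported in these panels, the following is a minimal standard-library sketch of the adjusted Rand index computed from a contingency table of annotated cell types against cluster labels. The labels are toy values, and in practice a library routine such as scikit-learn's `adjusted_rand_score` would normally be used instead.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # degenerate case: no pairs of cells to compare
    # Pairs of cells that share both a true label and a predicted label.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a true label, and pairs that share a predicted label.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # both labelings are trivial (e.g. a single group each)
    return (sum_cells - expected) / (max_index - expected)


cell_types = ["T", "T", "B", "B", "NK", "NK"]
perfect_clusters = [1, 1, 0, 0, 2, 2]   # same partition, different label names
single_cluster = [0, 0, 0, 0, 0, 0]     # everything merged into one cluster

ari_perfect = adjusted_rand_index(cell_types, perfect_clusters)
ari_merged = adjusted_rand_index(cell_types, single_cluster)
```

The index depends only on the partition, not on the label names, which is why it can compare cluster identifiers with cell type annotations directly.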

Evaluation of in-house splenocytes and matched PBMC dataset.
a, Schematic of the experiment highlighting the organ sites used for tissue collection and sample processing. In this matched PBMC-splenocyte CITE-seq experiment, PBMCs and splenocytes were collected, followed by high-throughput sequencing and downstream analyses.
b-c, UMAP plots for SCellBOW embedding colored by donors (b) and cell types (c).
d, UMAP plots for the SCellBOW embedding compared with those of different benchmarking methods. The coordinates in all plots are colored by cell types annotated using Azimuth.
e, Bar plot for ARI, NMI, and cell type SI at resolution 1.0.
f-g, Alluvial plots for Azimuth cell types mapped to SCellBOW clusters (f) and Scanpy clusters (g). The resolution of SCellBOW was set to 1.0. CL is used as an abbreviation for cluster.

Phenotype algebra on GBM and BRCA known molecular subtypes.
a, Heatmap for GSVA scores of three molecular subtypes of GBM (CLA, MES, and PRO), grouped by SCellBOW clusters at resolution 1.0.
b, UMAP plot for the embedding of BRCA target dataset colored by PAM50 molecular subtype.
c, Survival plot for GBM molecular subtypes based on phenotype algebra.
d, Violin plot for predicted risk scores for GBM molecular subtypes.
e, Survival plot for BRCA molecular subtypes based on phenotype algebra. The total tumor is denoted by T.
f, Violin plot for predicted risk scores for BRCA molecular subtypes.
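The subtraction and stratification steps of phenotype algebra can be sketched as follows. This is an illustration under stated assumptions: each vector is taken to be the mean of toy cell embeddings (the legends do not state how the vectors are built), the subtype names `A` and `B` are hypothetical, and the risk scores are placeholder numbers standing in for the output of the survival model, which is not reproduced here.

```python
from statistics import median


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def subtract(reference, query):
    """Reference minus query, element by element."""
    return [r - q for r, q in zip(reference, query)]


# Toy 2-D embeddings for two hypothetical subtypes, A and B.
subtype_cells = {
    "A": [[1.0, 2.0], [3.0, 4.0]],
    "B": [[5.0, 6.0]],
}

# Reference vector T: all tumor cells pooled together.
all_cells = [vec for cells in subtype_cells.values() for vec in cells]
reference = mean_vector(all_cells)

# One algebra vector per query: total tumor minus that subtype.
algebra_vectors = {
    f"T-{name}": subtract(reference, mean_vector(cells))
    for name, cells in subtype_cells.items()
}

# Placeholder risk scores (NOT model output), stratified by their median.
risk_scores = {"T": 0.5, "T-A": 0.8, "T-B": 0.2}
cutoff = median(risk_scores.values())
risk_group = {k: ("high" if v > cutoff else "low") for k, v in risk_scores.items()}
```

Under this reading, a query whose removal from the total tumor raises the predicted risk relative to T would be interpreted as less aggressive than the tumor as a whole, and vice versa.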

Phenotype algebra on mCRPC known molecular subtypes based on AR- and NE-activity.
a, Schematic of the transdifferentiation states underlying lineage plasticity that occurs during mCRPC progression from an ARPC to NEPC.
b, Scatter plot of GSVA scores for the ARPC and NEPC gene sets. K-means clustering was used to allocate cells into the three high-level categories: ARAH, ARAL, and NEPC.
c, UMAP plot for projection of SCellBOW embedding colored by ARAH, ARAL, and NEPC.
d, Heatmap showing the top differentially expressed genes (y-axis) between each high-level category (x-axis) and all other cells, tested with a Wilcoxon rank-sum test.
e, Survival plot for mCRPC cancer phenotypes based on phenotype algebra. The total tumor is denoted by T.
f, Violin plot for predicted risk scores for mCRPC phenotypes: ARAH, ARAL, and NEPC.
g, Survival plot for mCRPC tumor microenvironment phenotypes based on phenotype algebra. The total tumor is denoted by T.
h, Violin plot for predicted risk scores for mCRPC tumor microenvironment phenotypes: ARAH, ARAL, and NEPC.

Phenotype algebra on He et al. (ref. 70) mCRPC data based on SCellBOW clusters.
a, UMAP plot for projection of embeddings with coloring based on the SCellBOW clusters at resolution 0.8. CL is used as an abbreviation for cluster.
b, Violin plot of cluster-wise risk scores for SCellBOW clusters, predicted by phenotype algebra.
c, Patient and organ site distribution across the SCellBOW clusters.
d, Distribution of cells from the three high-level groups (ARAH, ARAL, and NEPC) across the SCellBOW clusters.
e, Bubble plot of row-scaled GSVA scores for custom-curated gene sets containing activated and repressed AR- and NE-signatures.
f, Correlation plot between six phenotypic categories, derived from DSP gene expression, and the SCellBOW clusters, derived from scRNA-seq gene expression. The six phenotypic categories are defined by Brady et al. based on the activity of AR and NE programs.
g, Top gene sets correlated with SCellBOW clusters. Signatures were collected from the C2 "curated", C5 "Gene Ontology", and H "hallmark" gene sets from MSigDB (ref. 94). Gene sets were ranked by row-scaled GSVA scores of one cluster against all others.
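Row scaling and ranking of gene-set scores can be sketched as below. The sketch assumes that "row-scaled" means a z-score of each gene set across clusters and that gene sets are ranked for a cluster by that z-score; the legend does not spell out the exact procedure. The gene-set names, cluster names, and scores are toy placeholders, not results.

```python
from statistics import mean, pstdev


def row_scale(values):
    """Z-score one row (one gene set across all clusters)."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0] * len(values)  # constant row: nothing to rank
    return [(v - mu) / sd for v in values]


clusters = ["CL0", "CL1", "CL2"]

# Toy GSVA-like scores: rows are hypothetical gene sets, columns are clusters.
scores = {
    "SET_A": [0.1, 0.5, 0.3],
    "SET_B": [0.4, 0.2, 0.3],
    "SET_C": [0.2, 0.3, 0.1],
}

scaled = {name: row_scale(row) for name, row in scores.items()}


def top_gene_sets(cluster, k=2):
    """Gene sets with the highest row-scaled score in the given cluster."""
    idx = clusters.index(cluster)
    ranked = sorted(scaled, key=lambda name: scaled[name][idx], reverse=True)
    return ranked[:k]


top_for_cl1 = top_gene_sets("CL1")
```

Scaling each row first puts gene sets with different score ranges on a common footing, so the ranking reflects how strongly a set is enriched in one cluster relative to the others rather than its absolute score.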