A novel machine learning algorithm selects proteome signature to specifically identify cancer exosomes

  1. Bingrui Li
  2. Fernanda G Kugeratski
  3. Raghu Kalluri (corresponding author)
  1. Department of Cancer Biology, University of Texas MD Anderson Cancer Center, United States
  2. Department of Bioengineering, Rice University, United States
  3. Department of Molecular and Cellular Biology, Baylor College of Medicine, United States
6 figures and 4 additional files

Figures

Figure 1
Overview of the study.

Figure 2
Proteomic characterization of exosomes derived from 285 cell lines from four studies.

(A) Overlapping proteins among four different studies of cell line-derived exosomes. (B) PCA plot of cancer and control cell line-derived exosomes. (C) Positivity for eight commonly used exosomal protein biomarkers in various cell lines. The percentage of samples expressing each protein is shown in the boxes. Darker red indicates a higher percentage. (D) Annotation of the proteins detected in more than 90% of all samples. (E) GO and KEGG pathway enrichment analysis of the proteins detected in more than 90% of all samples. (F) Plasma membrane proteins detected in more than 90% of all samples.
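The "detected in more than 90% of all samples" criterion used in panels (D) to (F) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the set-per-sample data layout, and the placeholder protein names (PROT_A, PROT_B, PROT_C) are assumptions made only for illustration.

```python
def detection_frequency(samples):
    """Return {protein: fraction of samples in which the protein was detected}.

    `samples` is a list with one collection of detected protein names per sample
    (an assumed layout, not the format used in the study).
    """
    counts = {}
    for detected in samples:
        for protein in set(detected):
            counts[protein] = counts.get(protein, 0) + 1
    n = len(samples)
    return {protein: count / n for protein, count in counts.items()}


def frequently_detected(samples, threshold=0.9):
    """Proteins detected in strictly more than `threshold` of the samples."""
    freq = detection_frequency(samples)
    return sorted(p for p, f in freq.items() if f > threshold)


# Toy data with placeholder names: PROT_A in 10/10 samples,
# PROT_B in 9/10, PROT_C in 5/10.
toy_samples = (
    [{"PROT_A", "PROT_B", "PROT_C"}] * 5
    + [{"PROT_A", "PROT_B"}] * 4
    + [{"PROT_A"}]
)

print(frequently_detected(toy_samples))
```

Because the comparison is strict, a protein detected in exactly 90% of the samples is excluded, which matches the "more than 90%" wording of the caption.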

Figure 3
Proteomic characterization of exosomes derived from cell lines and tissues.

(A) Proteins detected at higher frequency in cancer cell line-derived exosomes. (B) Positivity for 11 commonly used exosomal protein biomarkers in various tissues. (C) Overlapping proteins (>90% frequency) between cell line- and tissue-derived exosomes. (D) Positivity for five plasma membrane proteins detected in more than 90% of both cell line- and tissue-derived exosomes.

Figure 4 with 1 supplement
Identification of signature proteins of plasma- or serum-derived exosomes and evaluation of the random forest classifier.

(A) Overlapping exosome proteins detected in the plasma and serum of 205 cancer and 51 control samples from five studies. (B) Heat map of 46 overlapping exosome proteins in cancer and control plasma or serum samples. (C) AUROC score of the random forest classifier when including various numbers of protein features. (D) Comparison of AUROC across different models. (E) Classification error matrix of the 75% training set using a random forest classifier for the 18 selected proteins. The number of samples is indicated in each box. (F) AUROC score of the random forest classifier trained using 75% of the dataset. Other metrics are indicated on the right. (G) Classification error matrix of the 25% testing set using a random forest classifier for the 18 selected proteins. The number of samples is indicated in each box. (H) AUROC score of the random forest classifier tested using 25% of the dataset. Other metrics are indicated on the right.
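The two evaluation outputs named in this caption, the AUROC score and the classification error matrix, can be computed from predicted scores as in the sketch below. It covers only the evaluation step, not the random forest itself, and it is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions for illustration.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive sample scores higher
    than a random negative sample, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_counts(labels, scores, threshold=0.5):
    """Counts for a 2x2 classification error matrix.

    A sample is predicted positive (cancer) when its score is >= threshold;
    the threshold value is an assumption, not taken from the study.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Toy held-out set: 1 = cancer, 0 = control (made-up values).
toy_labels = [1, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.1]

cm = confusion_counts(toy_labels, toy_scores)
sensitivity = cm["tp"] / (cm["tp"] + cm["fn"])
specificity = cm["tn"] / (cm["tn"] + cm["fp"])
print(auroc(toy_labels, toy_scores), cm, sensitivity, specificity)
```

On the toy data, five of the six cancer/control pairs are ranked correctly, so the AUROC is 5/6. The error matrix depends on the chosen threshold, whereas the AUROC does not.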

Figure 4—figure supplement 1
Machine learning models for plasma-derived exosomes.

(A) PCA plot of cancer and control plasma or serum-derived exosomes. (B) Protein abundance of 18 selected protein features in 205 cancer- and 51 control plasma or serum-derived exosomes. Significance was determined by the Wilcoxon rank-sum test. *p < 0.05, ***p < 0.001, ****p < 0.0001, ns: not significant.

Figure 5
Identification of signature proteins expressed by plasma- or serum-derived exosomes for classifying five common cancer types and evaluation of the random forest classifier.

(A) PCA plot of plasma or serum-derived exosomes from five cancer types. (B) AUROC score of the random forest classifier when including various numbers of protein features. (C–D) Classification error matrices of the 60% training set and the 40% testing set for classifying the five cancer types using a random forest classifier for the five selected proteins. The number of samples is indicated in each box. (E) Protein abundance of five selected protein features in 158 samples across five cancer types. Significance was determined by the Kruskal-Wallis test. ****p < 0.0001.
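The Kruskal-Wallis test used in panel (E) compares abundance across more than two groups by ranking the pooled values. The sketch below computes only the H statistic, using mid-ranks for ties and omitting the tie correction. The p-value, which requires a chi-squared distribution with (number of groups - 1) degrees of freedom, is not computed. The function names and the toy values are assumptions, not data from the study.

```python
def midranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of groups."""
    pooled = [v for group in groups for v in group]
    n_total = len(pooled)
    ranks = midranks(pooled)
    weighted = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Toy abundances for three hypothetical cancer types (made-up values).
toy_groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(kruskal_wallis_h(toy_groups))
```

For the toy groups the rank sums are 6, 15 and 24, giving H = 7.2. In practice a statistics library would normally be used, since it also applies the tie correction and returns the p-value.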

Figure 6
Identification of signature proteins expressed by urine-derived exosomes and evaluation of the random forest classifier.

(A) Overlapping exosome proteins detected in the urine from 261 cancer and 124 control samples from four studies. (B) PCA plot of cancer and control urine-derived exosomes. (C) AUROC score of the random forest classifier when including various numbers of protein features. (D) Protein abundance of 17 selected protein features in 261 cancer- and 124 control urine-derived exosomes. (E) Classification error matrix of the 75% training set using a random forest classifier for the 17 selected proteins. The number of samples is indicated in each box. (F) AUROC score of the random forest classifier trained using 75% of the dataset. Other metrics are indicated on the right. (G) Classification error matrix of the 25% testing set using a random forest classifier for the 17 selected proteins. The number of samples is indicated in each box. (H) AUROC score of the random forest classifier tested using 25% of the dataset. Other metrics are indicated on the right. Significance was determined by the Wilcoxon rank-sum test. **p < 0.01, ***p < 0.001, ****p < 0.0001, ns: not significant.

Cite this article

  1. Bingrui Li
  2. Fernanda G Kugeratski
  3. Raghu Kalluri
(2024)
A novel machine learning algorithm selects proteome signature to specifically identify cancer exosomes
eLife 12:RP90390.
https://doi.org/10.7554/eLife.90390.3