The theory of massively repeated evolution and full identifications of cancer-driving nucleotides (CDNs)

  1. Lingjie Zhang
  2. Tong Deng
  3. Zhongqi Liufu
  4. Xueyu Liu
  5. Bingjie Chen
  6. Zheng Hu
  7. Chenli Liu
  8. Miles E Tracy
  9. Xuemei Lu  Is a corresponding author
  10. Hai-Jun Wen  Is a corresponding author
  11. Chung-I Wu  Is a corresponding author
  1. State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, China
  2. State Key Laboratory of Genetic Resources and Evolution/Yunnan Key Laboratory of Biodiversity Information, Kunming Institute of Zoology, The Chinese Academy of Sciences, China
  3. GMU-GIBH Joint School of Life Sciences, Guangzhou Medical University, China
  4. CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Institute of Advanced Technology, Chinese Academy of Sciences, China
  5. Innovation Center for Evolutionary Synthetic Biology, Sun Yat-sen University, China
  6. Department of Ecology and Evolution, University of Chicago, United States
14 figures, 5 tables and 2 additional files

Figures

Two modes of DNA sequence evolution.

(A) A hypothetical example of DNA sequences in organismal evolution. (B) Cancer evolution that experiences the same number of mutations as in (A) but with many short branches. (C) A common pattern of sequence variation in cancer evolution. (D) In cancer evolution, the same mutation at the same site may occasionally be seen in multiple sequences. The recurrent sites could be either mutational or functional hotspots, their distinction being the main objective of this study.

Figure 2 with 1 supplement
The average Ai and Si values across different i ranges (X-axis).

(Top): The average of Ai and Si in the log scale. Color lines - full data; gray lines - CpG sites removed. The dash lines are linear extrapolations. Bottom: The Ai / Si ratio as a function of i. The drop of Ai / Si ratio at i [8, 9] is due to the potential synonymous CDNs, see Supplementary file 1.

Figure 2—figure supplement 1
Mutation and context landscape across 12 cancer types.

(A) Single nucleotide changes within a local context of 3 bp across 12 cancer types. Mutations are grouped into 6 nucleotide change directions based on base complementarity, with colors representing each type. The Y-axis indicates the total number of each change type across the 12 cancers. The black box highlights the most abundant C>T (complementary G>A) changes occurring at CpG sites, while purple boxes designate the second-most prevalent nucleotide change at TCW (W = A or T) context which is potentially associated with APOBEC family of cytidine deaminases. (B) Context abundance within the human reference genome. For each coding region site, we extract the local context by extending 1 bp to either side. Contexts are then collapsed into 32 categories with centered base being C or T based on base complementarity. The Y-axis shows the site count for each context along the X axis. The four red bars (ACG, CCG, GCG, TCG) highlight the abundance of CpG sites in the human genome.

Site-level mutation rate variation obtained from Dig Sherman et al., 2022, a published AI tool.

(A) Each dot represents the expected SNVs (Y-axis) at a site where missense mutations occurred i times in the corresponding cancer population. The boxplot shows the overall distribution of mutability at i, with the red dashed line denoting the average. There is no observable trend that sites of higher i are more mutable (The blank areas are due to the absence of CDNs with mutation recurrence counts of 8 or 9 in CNS cancer mutation data, see Supplementary file 1). (B) A detailed look at the coding region of PAX3 gene in colon cancer. The expected mutability of sites in the 200 bp window is plotted. The three mutated sites in this window, marked by green and red (a CDN site) stars, are not particularly mutable. Overall, the mutation rate varies by about tenfold as is generally known for CpG sites.

Figure 4 with 1 supplement
Conventional analyses of local contexts at recurrence sites.

(A) From top panel down - For the 64 (43) 3-mer motifs, their mutational rates are shown on the X-axis. The most mutable motif over the average mutability (α) is 4.69. For the 1024 (=45) 5-mer and 16,384 (47) 7-mer motifs, the α values are, respectively, 8.79 and 11.52. The most mutable motifs, as expected, are dominated by CpG’s. (B) Each dot represents the motif surrounding a high-recurrence site. The recurrence number is shown on the X-axis and the mutability of the associated motif’s mutability (mutations per 0.1 M) is shown on the Y-axis. The average mutation rate across all motifs of given length category is indicated by a red horizontal dashed line. The absence of a trend indicates that the high recurrence sites are not associated with the mutability of the motif. (C) The analysis is extended to longer motifs surrounding each CDN (21, 41, 61, 81, and 101 bp). For each length group, all pairwise comparisons are enumerated. The observed distributions (black bars and points) are compared to the expected Poisson distributions (red bars and curves) and no difference is observed. Thus, local sequences of CDNs do not show higher-than-expected similarity.

Figure 4—figure supplement 1
Sliding window to explore the consensus sequences between recurrence sites.

The blue arrow indicates the positive strand of reference genome, with a mutated site highlighted by the red box. The green strip represents a sliding window covering the mutated site. With each stride of 10 bp, we extract the sequence context and conduct a pairwise comparison between all recurrence sites. The presence of consensus motifs would skew the Hamming distance distribution away from the expected Poisson model, reflecting non-randomness.

Patient level analysis for mutation load and mutational signatures.

(A) Boxplot depicting the distribution of mutation load among patients with recurrent mutations. The X-axis denotes the count of recurrent mutations, while the Y-axis depicts the normalized z-score of mutation load (see Methods). The green dashed line indicates the mean mutation load. In short, the mutation load does not influence the mutation recurrence among patients. (B) Signature analysis in patients with mutations of recurrences ≥i* (X-axis). For lung cancer (left), the upper panel presents the number of patients for each group, while the lower panel depicts the relative contribution of mutational signatures. For breast cancer, APOBEC-related signatures (SBS2 and SBS13) are notably elevated in all groups of patients with i*≥3, while patients with mutations of recurrence ≥ 20 in CNS cancer exhibit an increased exposure to SBS11 (Blough et al., 2011; Lin et al., 2021; Noeuveglise et al., 2023). Again, patients with higher mutation recurrences do not differ in their mutation signatures.

i* values (Y-axis, log scale) against sample sizes (n), X-axis across different shape parameter k’s.

The Y axis presents the i* values under different sample sizes (n) of the X-axis in log scale. Five shape parameters (k) of the gamma-Poisson model are used. In the literatures on the evolution of mutation rate, k is usually greater than 1. The inset figure illustrates how i*/ n (prevalence) would decrease with increasing sample sizes. The prevalence would approach the asymptotic line of [logLAEu] when n reaches 106. In short, more CDNs (those with lower prevalence) will be discovered as n increases. Beyond n=106, there will be no gain.

Analysis of CDNs with expanded sample set in GENIE.

(A) Schematic illustrating the impact of sample size expansion on the number of discovered CDNs. The two vertical lines show the cutoffs of i*/n at (3/1000) vs. (12/100,000). The Y axis shows that the potential number of sites would decrease with i*/n, which is a function of selective advantage. The area between the two cutoffs below the line represents the new CDNs to be discovered when n reaches 100,000. The power of n=100,000 is even larger if the distribution follows the blue dashed line. (B) The prevalence (i/n) of sites is well correlated between datasets of different n (TCGA with n<1000 and GENIE with generally tenfold higher), as it should be. Sites are displayed by color. ‘1-hit’: CDNs identified in GENIE but remain in singleton in TCGA, ‘2-hit’: CDNs identified in GENIE but present in doubleton in TCGA. ‘CDN both’: CDNs identified in both databases. (C–E) CDNs discovered in GENIE (n>9000) but absent in TCGA (n<1000). The newly discovered CDNs may fall in TCGA as 0–2 hit sites. The numbers in the middle column show the percentage of lower recurrence (non-CDN) sites in TCGA that are detected as CDNs in the GENIE database, which has much larger n’s.

Appendix 1—figure 1
The trend of φi,k with each increase of recurrence (i, the x-axis) under different shape parameters of the gamma distribution (k, designated by different colors).
Appendix 1—figure 2
The gamma distribution of recurrences (i) under different shapes.

With E(u)=5 × 10–6, we set the shape parameter k to 0.2, 1 and 5, represented by three distinct colors. The site number of synonymous recurrence i (Si) is indicated on Y-axis. In the context of a large sample size (n=106), the Si distribution clearly distinguishes between different k values, mitigating the overdispersion issue encountered in smaller sample sizes. The inset depicts the distribution on a log10 scale for i≥10, with a horizontal dashed line indicating Si=1, where i* is the CDN cutoff.

Author response image 1
5-prime.
Author response image 2
3-prime.
Author response image 3
5-prime shuffled.
Author response image 4
3-prime shuffled.
Author response image 5
random sequences from coding regions.

Tables

Table 1
An example of Ai and Si (from lung cancer, n=1035).
All sitesCpG sites removed
iAiSiAi / SiAiSiAi / Si
02254062378042812.892137538470140123.04
1195958693932.82168371568212.96
229469693.0421886433.4
399214.7168164.25
42312317117
516016 : 0909 : 0
610010 : 0606 : 0
7505 : 0505 : 0
8808 : 0606 : 0
9404 : 0303 : 0
≥3178228.09122177.18
≥47917954154
[10-20]717404 : 0
≥20606 : 0404 : 0
  1. Note –The ratio of Ai/ Si is provided as a measure of selection strength.

Table 2
Summary for modeling outlier sites in six cancer types.
Cancer TypeS3pαS4S5
Lung*--0.0------
Breast0.128.75E-04 (8.21E-04)88.6 (32.0)0.102 (0.068)0.004 (0.004)
CNS0.022.73E-04 (1.09E-04)295.1 (57.0)0.448 (0.173)0.026 (0.015)
Kidney0.033.03E-05 (2.98E-05)304.1 (108.0)0.067 (0.056)0.005 (0.006)
Upper-AD tract0.470.002 (0.001)48.9 (10.7)0.174 (0.078)0.005 (0.003)
Large intestine1.030.009 (0.001)51.6 (1.4)0.998 (0.087)0.026 (0.003)
  1. Note – For each cancer type, p stands for the proportion of highly mutable sites, with mutation rate being α-fold of the average. S3 gives the expected number without mutable outliers (P=0). S4 and S5 denote the expected number with the best (p, α) pairs with the standard deviation in parentheses. For lung cancer, S2 and S3 do not fit the outlier model (Table 2—source data 1); therefore, we set P=0.

Table 2—source data 1

The outlier model parameters and expected Si values for 6 cancer types analyzed.

pMinor’ and ‘alpha’ correspond to (‘p, a’) as described in the main text. ‘Eu’ represents the average mutation rate per site per patient for the given cancer type (‘ccType’). ‘s2Expt’ ‘s3Expt’ ‘4’ ‘s5’ and ‘s6’ are the expected Si values for i = 2, 3, 4, 5, 6, respectively. ‘s2Obsv’ and ‘s3Obsv’ represent the observed Si values for S2 and S3.

https://cdn.elifesciences.org/articles/99340/elife-99340-table2-data1-v1.zip
Appendix 1—table 1
literature support for CDN genes in breast cancer.
Gene IdGene NameSupport
AKT1v-akt murine thymoma viral oncogene homolog 1① ② ③
CDC42BPACDC42 binding protein kinase alpha (DMPK-like)Unbekandt and Olson, 2014; Collins et al., 2018; Kwa et al., 2021; Jiang et al., 2023
CDH1cadherin 1, type 1, E-cadherin (epithelial)① ② ③
ERBB2v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2① ② ③
ERBB3v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 3Holbro et al., 2003; Xue et al., 2006; Hamburger, 2008; Sithanandam and Anderson, 2008; Stern, 2008; Huang et al., 2010
FGFR2fibroblast growth factor receptor 2① ② ③
FOXA1forkhead box A1① ② ③
GATA3GATA binding protein 3① ② ③
HIST1H3Bhistone cluster 1, H3b① ② ③ Xie et al., 2019; Wang et al., 2023*
KIF1Bkinesin family member 1BMunirajan et al., 2008; Yu and Feng, 2010; Liu et al., 2022
KRASKirsten rat sarcoma viral oncogene homolog① ② ③
NUP93nucleoporin 93 kDaBersini et al., 2020; Nataraj et al., 2022
PIK3CAphosphatidylinositol-4,5-bisphosphate 3-kinase, catalytic subunit alpha① ② ③
PTENphosphatase and tensin homolog① ② ③
RARS2arginyl-tRNA synthetase 2, mitochondrialWang et al., 2020*
SF3B1splicing factor 3b, subunit 1, 155 kDa① ② ③
TP53tumor protein p53① ② ③
  1. The serial number corresponds to the inclusion of target gene in the following driver gene list: ① CGC Tier-1 list, ② IntOGen, ③ Bailey’s list.

  2. The inclusion necessitates that the target gene is annotated as a cancer driver in breast cancer.

  3. *

    ambiguous, meaning the literature indicates an association between the candidate gene and breast cancer, but lacks explicit experimental evidence.

Appendix 1—table 2
k estimated from 12 cancer types.
Cancer typek
Breast5.05
CNS2.59
Endometrium5.49
Kidney7.70
Large intestine4.76
Liver5.23
Lung2.62
Ovary4.30
Prostate3.60
Stomach4.17
Upper-AD tract4.14
Urinary tract6.14
merged set*3.27
  1. Note:- Estimation of k is derived from negative binomial regression, based on synonymous changes aggregated by the 3 bp local context at mutated sites across all coding genes. The estimation method is implemented in package dndscv.

  2. *

    The merged set contains mutation information from all 12 cancer types.

Author response table 1
TRAIN accuracyTEST accuracy
Randomrandom
Model# layers# parametersALLdonneracceptorCDSALLdonneracceptorCDS
resNet1463,7210.770.770.760.790.710.720.690.72
resNet216564,4450.960.960.950.950.760.770.770.73
deepGRU23144,9250.860.850.840.890.790.790.760.8
deepLSTM624,4450.850.820.870.860.790.760.820.78

Additional files

Supplementary file 1

All CDN sites with population allele frequency annotation.

CDN_sites.i ≥ 20’ presents CDN sites with total hits ≥20 in each cancer type. ‘CDN.Missense.thres_3’ provides all CDN sites analyzed in this study, ranked in decreasing order based on the highest recurrence across 12 cancer types. ‘CDN.Missense.gnomAD’ presents the gnomAD population allele frequency of all missense CDNs. ‘Synonymous_high_hits’ lists the synonymous mutations potentially under selection, while ‘Synonymous_high_hits.gnomAD’ provides their corresponding allele frequency annotations from gnomAD.

https://cdn.elifesciences.org/articles/99340/elife-99340-supp1-v1.xlsx
MDAR checklist
https://cdn.elifesciences.org/articles/99340/elife-99340-mdarchecklist1-v1.pdf

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Lingjie Zhang
  2. Tong Deng
  3. Zhongqi Liufu
  4. Xueyu Liu
  5. Bingjie Chen
  6. Zheng Hu
  7. Chenli Liu
  8. Miles E Tracy
  9. Xuemei Lu
  10. Hai-Jun Wen
  11. Chung-I Wu
(2024)
The theory of massively repeated evolution and full identifications of cancer-driving nucleotides (CDNs)
eLife 13:RP99340.
https://doi.org/10.7554/eLife.99340.3