Research Article

Characterization of cancer-driving nucleotides (CDNs) across genes, cancer types, and patients

State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, China
Center for Excellence in Animal Evolution and Genetics, The Chinese Academy of Sciences, China
GMU-GIBH Joint School of Life Sciences, Guangzhou Medical University, China
CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, China
Cancer Center, Clifford Hospital, Jinan University, China
Cancer Research Institute, School of Basic Medical Sciences, Southern Medical University, China
Department of Ecology and Evolution, University of Chicago, United States

Dec 17, 2024

https://doi.org/10.7554/eLife.99341.3


eLife Assessment

This valuable study is a companion to a paper introducing a theoretical framework and methodology for identifying cancer-driving nucleotides (CDNs). The evidence that recurrent SNVs or CDNs are common in true cancer driver genes is convincing, with more limited evidence that many more undiscovered cancer driver mutations will have CDNs, and that this approach could identify these undiscovered driver genes with about 100,000 samples.

https://doi.org/10.7554/eLife.99341.3.sa0

Significance of the findings:

Valuable: Findings that have theoretical or practical implications for a subfield

Strength of evidence:

Convincing: Appropriate and validated methodology in line with current state-of-the-art

During the peer-review process the editor and reviewers write an eLife Assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate).

Abstract

A central goal of cancer genomics is to identify, in each patient, all the cancer-driving mutations. Among them, point mutations are referred to as cancer-driving nucleotides (CDNs), which recur in cancers. The companion study shows that the probability of i recurrent hits in n patients would decrease exponentially with i; hence, any mutation with i ≥ 3 hits in The Cancer Genome Atlas (TCGA) database is a high-probability CDN. This study characterizes the 50–150 CDNs identifiable for each cancer type of TCGA (while anticipating 10 times more undiscovered ones) as follows: (i) CDNs tend to code for amino acids of divergent chemical properties. (ii) At the genic level, far more CDNs (more than fivefold) fall on noncanonical than canonical cancer-driving genes (CDGs). Most undiscovered CDNs are expected to be on unknown CDGs. (iii) CDNs tend to be more widely shared among cancer types than canonical CDGs, mainly because of the higher resolution at the nucleotide than the whole-gene level. (iv) Most important, among the 50–100 coding region mutations carried by a cancer patient, 5–8 CDNs are expected but only 0–2 CDNs have been identified at present. This low level of identification has hampered functional tests and gene-targeted therapy. We show that, by expanding the sample size to 10⁵, most CDNs can be identified. Full CDN identification will then facilitate the design of patient-specific targeting against multiple CDN-harboring genes.

Introduction

Tumorigenesis in each patient is driven by mutations in the patient’s genome. Hence, a central goal of cancer genomics is to identify all driving mutations in each patient. This task is particularly challenging because each driving mutation is present in only a small fraction of patients. As the number of driver mutations in each patient has been estimated to be >5 (Armitage and Doll, 1954; Bozic et al., 2010; Hanahan and Weinberg, 2011; Belikov, 2017; Anandakrishnan et al., 2019), the total number of driver mutations summed over all patients must be quite high.

This study, together with the companion paper (Zhang et al., 2024), is based on one simple premise: in the massively repeated evolution of cancers, any advantageous cancer-driving mutation should recur frequently, say, i times in n patients. The converse that nonrecurrent mutations are not advantageous is part of the same premise. We focus on point mutations, referred to as cancer-driving nucleotides (CDNs), and formulate the maximum of i (denoted i^*) in n patients if mutations are not advantageous. For example, in The Cancer Genome Atlas (TCGA) database with n generally in the range of 500–1000, i^* = 3. Hence, any point mutation with i ≥ 3 is a CDN. At present, a CDN would have a prevalence of 0.3% among cancer patients. If the sample size approaches 10⁶, a CDN only needs to be prevalent at 5 × 10^–5, the theoretical limit (Zhang et al., 2024).

Although there are many other driver mutations (e.g., fusion genes, chromosomal aberrations, epigenetic changes, etc.), CDNs should be sufficiently numerous and quantifiable to lead to innovations in functional tests and treatment strategies. Given the current sample sizes of various databases (Cerami et al., 2012; Weinstein et al., 2013; Tate et al., 2019; de Bruijn et al., 2023), each cancer type has yielded 50–150 CDNs while the CDNs to be discovered should be at least 10 times more numerous. The number of CDNs currently observed in each patient is 0–2 for most cancer types. This low level of discovery has limited functional studies and hampered treatment strategies.

While we are proposing the scale-up of sample size to discover most CDNs, we now characterize CDNs that have been discovered. The main issues are the distributions of CDNs among genes, across cancer types, and, most important, among patients. In this context, cancer driver genes (CDGs) would be a generic term. We shall use ‘canonical CDGs’ (or conventional CDGs) for the driver genes in the union set of three commonly used lists (Bailey et al., 2018; Sondka et al., 2018; Martínez-Jiménez et al., 2020). In parallel, CDN-harboring genes, referred to as ‘CDN genes’, constitute a new and expanded class of CDGs.

The first issue is that CDNs are not evenly distributed among genes. The canonical cancer drivers such as TP53, KRAS, and EGFR tend to have many CDNs. However, the majority of CDNs, especially those yet-to-be-identified ones, may be rather evenly distributed with each gene harboring only 1–2 CDNs. Hence, the number of genes with tumorigenic potential may be far larger than realized so far. The second issue is the distribution of CDNs and CDGs among cancer types. It is generally understood that the canonical CDGs are not widely shared among cancer types. However, much (but not all) of the presumed cancer-type specificity may be due to low statistical resolution at the genic level.

The third issue concerns the distribution of CDNs among patients. Clearly, the CDN load of a patient is crucial in diagnosis and treatment. However, the conventional diagnosis at the gene level may have two potential problems. One is that many CDNs do not fall in canonical CDGs as signals from one or two CDNs get diluted. Second, a canonical CDG, when mutated, may be mutated at a non-CDN site. In those patients, the said CDG does not drive tumorigenesis. We shall clarify the relationships between CDN mutations and genes that may or may not harbor them.

The characterizations of discovered CDNs are informative and offer a road map for expanding the CDN list. A complete CDN list for each cancer type will be most useful in functional test, diagnosis, and treatment. A full list of mutations that drive the evolution of complex traits is at the center of evolutionary genetics. Such phenomena as complex human diseases (e.g., diabetes) (Vujkovic et al., 2020; Lagou et al., 2023; Xue et al., 2023; Suzuki et al., 2024), the genetics of speciation (Chen et al., 2022b; Wang et al., 2022; Wu, 2022), and the evolution of viruses in epidemics (Deng et al., 2022; Ruan et al., 2022; Cao et al., 2023; Ruan et al., 2023) are all prime examples in need of a full list. Thanks to their massively repeated evolution, cancers could be the first complex systems well resolved at the genic level.

Results

In molecular evolution, a gene under positive selection is recognized by its elevated evolutionary rate (Figure 1A and C). There have been numerous methods for determining the extent of rate elevation (Li et al., 1985; Nei and Gojobori, 1986; Yang and Swanson, 2002; Lawrence et al., 2013; Martincorena et al., 2017; Pan et al., 2022; Sherman et al., 2022; Wang et al., 2022; Ruan et al., 2023), and cancer evolution studies have adopted many of them. However, no model has been developed to take advantage of the massively repeated evolution of cancers (Figure 1B), which happens in tens of millions of people at any time.

Figure 1

Mutations in organismal evolution vs. cancer evolution.

(A, B) A hypothetical example of DNA sequence evolution in organisms vs. in cancer with the same number of mutations. (C) Mutation distribution in two species in the organismal evolution of (A). (D, E) Mutation distribution in cancer evolution among 10 sequences may follow the patterns in D and E. (F) Another pattern of mutation distribution in cancer evolution, which has a recurrent site but too few total mutations. Mutations of (F) are cancer-driving nucleotides (CDNs) missed in the conventional screens.

In the whole-gene analysis, Figure 1C–E are identical, each with A:S = 10:1, where A and S denote nonsynonymous and synonymous mutations, respectively. However, the presence of a four-hit site in Figure 1E is far less likely to be neutral than Figure 1C and D. Although the ratio in Figure 1F, A:S = 4:1, is statistically indistinguishable from the neutral ratio of about 2.5:1, Figure 1F in fact has much more power to reject the neutral ratio than Figure 1C and D. After all, the probability that multiple hits are at the same site in a big genome is obviously very small.

The analyses of CDNs across the whole genome

For the entire coding regions in the cancer genome data, we define A_i (or S_i) as the number of nonsynonymous (or synonymous) sites that harbor a mutation with i recurrences. Table 1 presents the distribution of A_i and S_i across the 12 cancer types with n > 300 (Weinstein et al., 2013).
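The definition of A_i and S_i can be illustrated with a minimal sketch. It assumes mutation calls are available as (patient, site, nonsynonymous-or-not) records; the record layout, the function name `recurrence_spectrum`, and the toy data are hypothetical and are not taken from the study's pipeline. A₀ and S₀ (sites with no hits) require the total site counts described in ‘Methods’ and are not computed here.

```python
from collections import Counter

def recurrence_spectrum(mutations):
    """Tally A_i and S_i from (patient_id, site_id, is_nonsynonymous) records.

    Returns two Counters mapping i (number of patients hit) to the number
    of sites with exactly i recurrences, for i >= 1.
    """
    patients_per_site = {}
    for patient_id, site_id, is_nonsyn in mutations:
        # A set ensures each patient is counted once per site and class.
        patients_per_site.setdefault((site_id, is_nonsyn), set()).add(patient_id)

    a_spectrum, s_spectrum = Counter(), Counter()
    for (_, is_nonsyn), patients in patients_per_site.items():
        i = len(patients)
        (a_spectrum if is_nonsyn else s_spectrum)[i] += 1
    return a_spectrum, s_spectrum

# Hypothetical toy input: site_a is hit in three patients (one call duplicated).
toy = [
    ("p1", "site_a", True), ("p2", "site_a", True), ("p3", "site_a", True),
    ("p3", "site_a", True),            # duplicate call, ignored
    ("p1", "site_b", True),            # nonsynonymous singleton
    ("p2", "site_c", False),           # synonymous singleton
]
a_i, s_i = recurrence_spectrum(toy)
print(dict(a_i), dict(s_i))
```

In this toy input, site_a contributes to A₃, site_b to A₁, and site_c to S₁.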

Table 1

Mutation recurrences (A_is and S_is) in 12 cancer types.

	Lung	Breast	Central nervous system	Kidney	Upper aerodigestive tract	Colon	Endometrium	Prostate	Stomach	Urinary tract	Ovary	Liver	Average
Patients #	1035	963	873	711	688	571	465	465	423	404	404	367	614
*A₀	22540623	21683136	20783835	22247653	21580444	20601026	20766001	21300810	20892755	21628918	22278124	22618059	21576782
*S₀	7804281	9388418	10298911	8781483	9333283	10428913	10375596	9754331	10243634	9426888	8746002	8255268	9403084
A/S_0	2.89	2.31	2.02	2.53	2.31	1.98	2.00	2.18	2.04	2.29	2.55	2.74	2.29
A₁	195958	44696	25122	25669	66924	94634	78870	9583	78834	66153	21138	25731	61109
S₁	69393	16732	10182	9317	26151	38606	31982	3613	32538	26546	7227	9398	23474
A/S_1	2.82	2.67	2.47	2.76	2.56	2.45	2.47	2.65	2.42	2.49	2.92	2.74	2.60
A₂	2946	233	287	56	489	1662	1052	29	1176	816	51	46	737
S₂	969	62	75	11	159	736	386	9	489	308	9	12	249
A/S_2	3.04	3.76	3.83	5.09	3.08	2.26	2.73	3.22	2.40	2.65	5.67	3.83	2.74
A₃	99	18	42	14	28	91	52	6	79	60	9	9	42.3
S₃	21	2	6	1	5	28	11	0	14	9	0	0	8.08
A/S_3	4.71	9	7	14	5.6	3.25	4.73	6:0	5.64	6.67	9:0	9:0	5.23
^†A_{i ≥3}	178	51	84	18	77	148	142	14	124	100	26	23	82.1
^†A_{i ≥4}	79	33	42	4	49	57	90	8	45	40	17	14	39.8
A₄	23	10	8	2	14	23	21	3	23	11	4	3	11.1
A₅	16	6	10	2	10	6	20	2	9	9	3	5	8.2
A_6-9	27	10	10	0	13	9	32	2	7	12	6	2	10.8
A_{[10, 20)}	7	3	10	0	9	11	9	1	6	5	4	4	5.75
A_≥20	6	4	4	0	3	8	8	0	0	3	0	0	3
^‡Total	202828	45669	26596	25841	68387	98931	81898	9706	81678	68297	21387	25944	63097
SiteNbr	22739705	21728116	20809328	22273396	21647934	20697470	20846065	21310436	20972889	21695987	22299339	22643859	21638710
nE(u)	9.07E-03	1.79E-03	1.00E-03	1.06E-03	2.83E-03	3.84E-03	3.15E-03	3.72E-04	3.27E-03	2.88E-03	8.28E-04	1.14E-03	2.6E-03

* See ‘Methods’ for the calculations of A₀ and S₀.

† A_i and S_i are as defined in the text.

‡ ‘Total’ represents the total number of missense mutations, or $\sum_{i \geq 1} i \cdot A_i$. ‘SiteNbr’ (site number) refers to the count of missense sites, $\sum_{i \geq 0} A_i$. nE(u) is calculated based on synonymous mutations, representing the expected number of neutral mutations per site in a population of size n.

For neutral mutations, we define i^* as the threshold at or above which the expected number of A_i sites would be <1, that is, $E [A_{i \geq i^{*}}] < 1$. The corollary is that all $A_{i \geq i^{*}}$ sites are advantageous CDNs. (Since S_i is ~A_i/2.3, the same i^* would apply to S_i as well: $E [S_{i \geq i^{*}}] < 1$.) As i^* is a function of the number of patients (n), it is shown mathematically in the companion study (Zhang et al., 2024) that i^* = 3 for n < 1000. Interestingly, while $E [A_{i \geq 3}]$ is < 1, $E [A_{i \geq 4}]$ is ≪ 1, on the order of 0.001. Hence, i^* = 4 may be considered unnecessarily stringent.
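To give a feel for how such a threshold behaves, the following is a deliberately simplified sketch and not the derivation of the companion study. It assumes that every site receives neutral hits independently as a Poisson variable with the same mean nE(u), which ignores mutation-rate variation among sites; the expected values reported by the authors (see ‘Methods’) therefore need not match it. The function names are hypothetical; the inputs are the breast cancer SiteNbr and nE(u) entries of Table 1.

```python
import math

def expected_sites_at_least(k, n_sites, lam, n_terms=60):
    """Expected number of sites with >= k hits if hits per site ~ Poisson(lam).

    The upper tail is summed term by term to avoid the cancellation that
    1 - CDF would suffer for very small tail probabilities.
    """
    pmf = math.exp(-lam) * lam**k / math.factorial(k)
    tail = 0.0
    for j in range(k, k + n_terms):
        tail += pmf
        pmf *= lam / (j + 1)  # Poisson recursion: P(j+1) = P(j) * lam / (j+1)
    return n_sites * tail

def i_star(n_sites, lam, max_i=50):
    """Smallest i such that the expected number of sites with >= i hits is < 1."""
    for k in range(1, max_i + 1):
        if expected_sites_at_least(k, n_sites, lam) < 1:
            return k
    raise ValueError("no threshold found up to max_i")

# Breast cancer column of Table 1 (uniform-rate assumption is ours).
breast_sites, breast_lam = 21728116, 1.79e-3
print(i_star(breast_sites, breast_lam))
```

Under this toy model the threshold is driven almost entirely by nE(u): each additional hit multiplies the expected site count by roughly nE(u)/i, which is why the expectation collapses so quickly as i grows.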

We should note that this study is constrained by n < 1000 in TCGA databases. (Databases with larger n values are also used where the actual n values are often uncertain.) At i^* = 3, we could detect only a fraction (<10%; see below) of CDNs. Many more tumorigenic mutations may be found in the i = 1 or 2 classes although not every one of them is a CDN. Since these two classes of mutations are far more numerous, they should account for the bulk of CDNs to be discovered. Indeed, Table 1 shows 76 $A_{i \geq 3}$ CDN mutations per cancer type but 681 A₂ and 56,648 A₁ mutations in the lower recurrence groups. If n reaches 10⁵–10⁶, most of the undiscovered CDNs in the A₁ and A₂ classes should be identified (Zhang et al., 2024).

In Table 2, we estimate the proportion of the A₁ and A₂ mutations that are possible CDNs. The relationships of A₃/S₃ > A₂/S₂, A₂/S₂ > A₁/S₁, and A₁/S₁ > A₀/S₀ are almost always observed in Table 1, holding in 32 (3 × 8 + 2 × 4) of the 36 comparisons. The use of A/S ratios may still underestimate the selective advantages of A₁–A₃ mutations because S₁–S₃ mutations may have slight advantages as well (Zhang et al., 2024). Assuming S₁ is truly neutral, we use S₀ to S₁ as the basis to calculate the excess of A₁–A₃ in Table 2, where Obs(A_i) > Exp(A_i) holds in 35 of the 36 comparisons. The implication is that hundreds, and likely low thousands, of A₁ and A₂ sites should be CDNs, whereas we have only confidently identified ~76 strong CDNs, on average, for a cancer type. (Note that A₁ excesses are less reliable since a 1% error in the calculation would mean 566 CDNs.)
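The count of 32 out of 36 can be re-derived from the A_i and S_i rows of Table 1 with a short script. This is a sketch for checking the arithmetic only: the S₀ entries are read as plain integers (e.g., 7804281 for lung), which reproduces the printed A/S_0 ratios, and a ratio with S_i = 0 is treated as infinite. The variable and function names are our own.

```python
import math

CANCERS = ["Lung", "Breast", "CNS", "Kidney", "UADT", "Colon", "Endometrium",
           "Prostate", "Stomach", "Urinary tract", "Ovary", "Liver"]

# Rows A_0..A_3 and S_0..S_3 of Table 1, one value per cancer type.
A = {
    0: [22540623, 21683136, 20783835, 22247653, 21580444, 20601026,
        20766001, 21300810, 20892755, 21628918, 22278124, 22618059],
    1: [195958, 44696, 25122, 25669, 66924, 94634,
        78870, 9583, 78834, 66153, 21138, 25731],
    2: [2946, 233, 287, 56, 489, 1662, 1052, 29, 1176, 816, 51, 46],
    3: [99, 18, 42, 14, 28, 91, 52, 6, 79, 60, 9, 9],
}
S = {
    0: [7804281, 9388418, 10298911, 8781483, 9333283, 10428913,
        10375596, 9754331, 10243634, 9426888, 8746002, 8255268],
    1: [69393, 16732, 10182, 9317, 26151, 38606,
        31982, 3613, 32538, 26546, 7227, 9398],
    2: [969, 62, 75, 11, 159, 736, 386, 9, 489, 308, 9, 12],
    3: [21, 2, 6, 1, 5, 28, 11, 0, 14, 9, 0, 0],
}

def ratio(a, s):
    """A/S ratio; infinite when no synonymous site is observed."""
    return math.inf if s == 0 else a / s

def count_increasing_steps():
    """Count comparisons where A_i/S_i > A_(i-1)/S_(i-1) for i = 1, 2, 3."""
    holds, failures = 0, []
    for c, name in enumerate(CANCERS):
        for i in (1, 2, 3):
            if ratio(A[i][c], S[i][c]) > ratio(A[i - 1][c], S[i - 1][c]):
                holds += 1
            else:
                failures.append((name, i))
    return holds, failures

holds, failures = count_increasing_steps()
print(holds, failures)
```

The four exceptions are the A₁/S₁ step in lung and liver and the A₂/S₂ step in colon and stomach, which is consistent with the 3 × 8 + 2 × 4 breakdown.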

Table 2

Excess of A_is of each i class.

Recurrences	Lung	Breast	Central nervous system	Kidney	Upper aerodigestive tract	Colon	Endometrium	Prostate	Stomach	Urinary tract	Ovary	Liver
*A₁_o	195958	44696	25122	25669	66924	94634	78870	9583	78834	66153	21138	25731
*^†A₁_e	198627	38586	20532	23582	60316	76049	63860	7888	66194	60751	18396	25720
Excess	–2669	6110	4590	2087	6608	18585	15010	1695	12640	5402	2742	11
^‡Ratio (%)	–1.36	13.67	18.27	8.13	9.87	19.64	19.03	17.69	16.03	8.17	12.97	0.04
A₂_o	2946	233	287	56	489	1662	1052	29	1176	816	51	46
A₂_e	1750	69	20	25	169	280	196	3	210	171	15	29
Excess	1195.61	164.36	266.72	31.01	320.48	1381.54	855.77	26.08	966.42	645.41	35.81	16.75
Ratio (%)	40.58	70.54	92.93	55.37	65.54	83.13	81.35	89.93	82.18	79.09	70.22	36.42
A₃_o	99	18	42	14	28	91	52	6	79	60	9	9
A₃_e	15.43	0.12	0.02	0.03	0.47	1.03	0.60	0.00	0.66	0.48	0.01	0.03
Excess	83.57	17.88	41.98	13.97	27.53	89.97	51.40	6.00	78.34	59.52	8.99	8.97
Ratio (%)	84.42	99.32	99.95	99.81	98.32	98.86	98.84	99.98	99.16	99.20	99.86	99.63
A₄_o	23	10	8	2	14	23	21	3	23	11	4	3
A₄_e	0.13593	0.00022	1.98E-05	2.81E-05	0.00132	0.00381	0.00185	4.00E-07	0.00210	0.00135	1.04E-05	3.78E-05
Excess	22.8641	9.99978	7.99998	1.99997	13.9987	22.9962	20.9981	3	22.9979	10.9987	3.99999	2.99999
Ratio (%)	99.41	100	100	100	99.99	99.98	99.99	100.00	99.99	99.99	100	100.00

* The notation of ‘o’ and ‘e’ following A_i represents the observed A_i and expected A_i.

† See ‘Methods’ for the calculation of expected A_i values.

‡ Ratio is the proportion of observed sites in excess, that is, the proportion of putative CDNs in the observation.
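The ‘Excess’ and ‘Ratio (%)’ rows of Table 2 follow directly from the observed and expected counts. A minimal sketch, with a function name of our choosing, is given below; small differences from the printed table arise because the expected values are shown rounded.

```python
def excess_and_ratio(observed, expected):
    """Return (excess, ratio in %) where excess = observed - expected
    and ratio = 100 * excess / observed."""
    excess = observed - expected
    return excess, 100.0 * excess / observed

# Lung, A_3 class: 99 observed vs. 15.43 expected (Table 2).
a3_excess, a3_ratio = excess_and_ratio(99, 15.43)
# Lung, A_1 class: the only case where the observed count falls below expectation.
a1_excess, a1_ratio = excess_and_ratio(195958, 198627)
print(round(a3_excess, 2), round(a3_ratio, 2), a1_excess, round(a1_ratio, 2))
```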

CDNs and the amino acids affected

We now ask whether the amino acid changes associated with CDNs bear the signatures of positive selection. Amino acids that have divergent physico-chemical properties have been shown to be under strong selection, both positive and negative (Chen et al., 2019a; Chen et al., 2019b; Chen et al., 2022b). We note that, in almost all cases in cancer evolution, when a codon is altered, only one nucleotide of the triplet codon is changed. Among the 190 amino acid (AA, 20×19/2) pairs, only 75 of the pairs differ by 1 bp (Tang et al., 2004). For example, Pro (CCN) and Ala (GCN) may differ by only 1 bp but Pro and Gly (GGN) must differ by at least 2 bp. These 75 AA changes, referred to as the elementary AA changes (Grantham, 1974; Li et al., 1985; Yang et al., 2003; Meyer et al., 2021), account for almost all AA substitutions in somatic evolution.
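The count of 75 elementary pairs can be checked by enumerating single-nucleotide changes in the standard genetic code. The sketch below is our own illustration (the function name is hypothetical); it builds the standard codon table in the usual T, C, A, G order and collects every unordered pair of distinct amino acids connected by one base change, excluding stop codons.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def elementary_pairs():
    """Unordered amino acid pairs whose codons can differ by exactly 1 bp."""
    pairs = set()
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # skip stop codons
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                other = CODON_TABLE[codon[:pos] + base + codon[pos + 1:]]
                if other != "*" and other != aa:
                    pairs.add(frozenset((aa, other)))
    return pairs

pairs = elementary_pairs()
print(len(pairs))
```

Consistent with the example in the text, Pro–Ala is in the set whereas Pro–Gly is not.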

In a series of studies (Tang et al., 2004; Chen et al., 2019a; Chen et al., 2019b), we have defined the physico-chemical distances between AAs of the 75 elementary pairs as $\Delta U_i$, where i = 1–75. $\Delta U_i$ reflects 47 measures of AA differences including hydrophobicity, size, charge, etc., and ranges between 0 and 1. The most similar pair, Ser and Thr, has $\Delta U_i$ = 0, and the most dissimilar pair is Asp and Tyr with $\Delta U_i$ = 1. These studies show that $\Delta U_i$ is a strong determinant of the evolutionary rates of DNA sequences and that large-step changes (i.e., large $\Delta U_i$ values) are more acutely ‘recognized’ by natural selection. These large-step changes are either highly deleterious or highly advantageous. Most strikingly, advantageous mutations are enriched with AA pairs of $\Delta U_i$ > 0.8 (Chen et al., 2019a).

To analyze the properties of CDNs, we choose six cancer types from Table 1 that have the largest sample sizes (n > 500) but skip kidney since kidney cancers have unusually low CDN counts. In Figure 2, we divide the CDNs into groups according to the number of recurrences, i. CDNs of similar i values are merged into the same group in the descending order of i, until there are at least 10 CDNs in the group. The six cancer types show two clear trends: (1) the proportion of CDNs with $\Delta U_i$ > 0.8 (red color segments) increases in groups with higher recurrences; and (2) in contrast, the proportion of CDNs with $\Delta U_i$ < 0.4 (green segments) decreases as recurrences increase. These two trends would mean that highly recurrent CDNs tend to involve larger AA distances ($\Delta U_i$ > 0.8) and similar AAs tend not to manifest strong fitness increases. In general, CDNs alter amino acids in ways that expose the changes to strong selection.
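The grouping rule (merge recurrence classes in descending order of i until a group holds at least 10 CDNs) can be sketched as follows. This is one plausible reading of the rule, not the authors' code: the function name, the toy counts, and in particular the handling of a final under-sized group (folded into the preceding group) are our assumptions.

```python
def group_by_recurrence(counts, min_size=10):
    """Merge recurrence classes, from high i to low i, into groups of
    at least `min_size` CDNs.

    counts: dict mapping recurrence i -> number of CDN sites with that i.
    Returns a list of (list_of_i_values, total_sites) tuples.
    """
    groups, current_is, current_n = [], [], 0
    for i in sorted(counts, reverse=True):
        current_is.append(i)
        current_n += counts[i]
        if current_n >= min_size:
            groups.append((current_is, current_n))
            current_is, current_n = [], 0
    if current_is:  # leftover classes that never reached min_size
        if groups:
            last_is, last_n = groups.pop()
            groups.append((last_is + current_is, last_n + current_n))
        else:
            groups.append((current_is, current_n))
    return groups

# Hypothetical counts of CDN sites per recurrence class.
toy_counts = {25: 3, 12: 5, 7: 4, 5: 16, 4: 23, 3: 99}
print(group_by_recurrence(toy_counts))
```

With these toy counts, the three sparsest high-recurrence classes are merged into a single group of 12 sites, while the i = 5, 4, and 3 classes each form their own group.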

Figure 2

$\Delta U_i$ analysis across six cancer types.

$\Delta U_i$, ranging between 0 and 1 (Tang et al., 2004; Chen et al., 2019a), is a measure of physico-chemical differences among the 20 amino acids (see the text). The most similar amino acids have $\Delta U_i$ near 0 and the most dissimilar ones have $\Delta U_i$ near 1. Each panel corresponds to one cancer type, with each horizontal bar representing the $\Delta U_i$ distribution of one recurrence group. The numbers on the left of the panel are i values and on the right are the number of sites. Note that the proportion of dark red segments increases as i increases. This figure shows that mutations at high recurrence sites (larger i values) code for amino acids that are chemically very different from the wild type.

CDNs in relation to the genes harboring them

We shall use the term ‘CDN genes’ for genes having at least one CDN site. Since CDN genes contribute to tumorigenesis when harboring a CDN mutation, they should be considered cancer drivers as well. CDN genes have two desirable qualities for recognition as driver genes. First, CDNs are straightforward and unambiguous to define (e.g., i ≥ 3 for n < 1000). In the literature, there have been multiple definitions of CDGs (Reimand and Bader, 2013; Porta-Pardo and Godzik, 2014; Mularoni et al., 2016; Arnedo-Pac et al., 2019), resulting in only modest overlaps among cancer gene lists (Appendix 1—figure 1). Second, the evolutionary fitness of CDN, and hence the tumorigenic potentials of CDN genes, can be computed (Appendix 2, section ‘Quantifying evolutionary fitness of CDN’).

We now present the analyses of CDN genes using the same six cancers of Figure 2. Two types of CDN genes are shown in Table 3. Type I genes fulfill the conventional criterion of fast evolution with the whole-gene Ka/Ks (or dN/dS) significantly larger than 1 (Martincorena et al., 2017). Averaged across cancer types, type I overlaps by 95.7% with the canonical CDG list, which is the union of three popular lists (Bailey et al., 2018; Sondka et al., 2018; Martínez-Jiménez et al., 2020). Type I genes are mostly well-known canonical CDGs (e.g., TP53, PIK3CA, and EGFR).

Table 3

Distribution of cancer-driving nucleotides (CDNs) among genes.

CDN calls based on i^*=3	Lung	Breast	Central nervous system	Upper aerodigestive tract	Colon	Endometrium	Mean	^†Total	Overlap with the conventional set	Criteria of classification
# of patients (n)	1035	963	873	688	571	465	-	-	-
CDN count	178	50	83	77	148	142	113.3	495	-
# CDN-carrying genes (type I fulfills the convention of ^‡Ka/Ks > 1; type II does not**)
Type I (Ka/Ks >1^**)	10	8	12	13	10	21	12.33	45	95.7%	Conventional
Type II (Ka/Ks ~1)	79	9	12	19	86	35	40	229	26.1%	This study only
All CDN genes	89	17	24	32	96	56	52.33	258	47%	Both types
Genes with 1–2 CDNs (% all CDN genes)	80 (89.9 %)	14 (82.4 %)	19 (79.2 %)	27 (84.4 %)	90 (93.8 %)	45 (80.4 %)	45.8 (85 %)	250 (96.9%)		A subset of both types
Number of driver genes in three major CDG lists
*Other criteria:									–	Variable (see legends)
IntOGen	118	100	100	106	86	72	97	321
Bailey et al.	36	29	32	38	20	55	35	134
CGC Tier 1	30	32	32	24	44	23	30.83	118

* intOGen, Bailey et al., and CGC Tier 1 are the three major CDG lists adopted here for comparison (Bailey et al., 2018; Sondka et al., 2018; Martínez-Jiménez et al., 2020).

† ‘Total’ refers to the cumulative number of unique genes identified across all six cancer types.

‡ Here, ** denotes significant Ka/Ks results with a corrected q-value < 0.1 based on dndscv analysis.

Type II (CDN genes) is the new class of CDGs. These genes have CDNs but do not meet the conventional criteria of whole-gene analysis. Obviously, if a gene has only one or two CDNs plus some sporadic hits, the whole-gene Ka/Ks would not be significantly greater than 1. As shown in Table 3, over 80% of CDN genes have only 1–2 CDN sites. The salient result is that type II genes outnumber type I genes by a ratio of 5:1 (229:45, column 8, Table 3). Furthermore, type II genes overlap with the canonical CDG list by only 23%.

Type II genes represent a new class of cancer drivers that concentrate their tumorigenic strength on a small number of CDN sites. They have been missed by the conventional whole-gene definition of cancer drivers. One such example is the FGFR3 gene in lung cancer. This gene of 809 codons has only eight hits, among which one is a CDN (i = 3) in lung cancer. It is noticed solely for this CDN. In Appendix 2, section ‘Functional annotation of new cancer drivers’, we briefly annotate these new CDGs for comparisons with the canonical driver genes. Possible functional tests in the future can be found in ‘Discussion’.

We now briefly discuss the driver genes listed in previous studies as shown at the lower part of Table 3 (Bailey et al., 2018; Sondka et al., 2018; Martínez-Jiménez et al., 2020). From the total number of CDGs listed, it is clear that the overlaps are limited. As analyzed before (Wu et al., 2016), conventional gene lists overlap mainly by a core set of high Ka/Ks genes. This core set has not changed much as various criteria such as the replication timing, expression profiles, and epigenetic features are introduced. These criteria are the reasons for the many CDGs recognized by only a small subset of CDG lists. CDN genes, in contrast, can be objectively defined as CDN mutations (i recurrences in n samples) themselves are unambiguous.

Variation in CDN number and tumorigenic contribution among genes

By and large, the distribution of CDNs among genes is very uneven. Figure 3A shows 10 genes with at least six CDNs, whereas 87 genes have only one CDN. Two genes stand out for the number of CDNs they harbor, TP53 and PIK3CA, which also happen to be the only genes mutated in >15% of all cancer patients surveyed (Kandoth et al., 2013). Clearly, the prevalence of mutations in a gene is a function of the number of strong CDNs it harbors.

Figure 3

Distribution of cancer-driving nucleotides (CDNs) among genes.

(A) Out of 119 CDN-carrying genes (red bars), 87 have only one CDN. For the rest, *TP53* possesses the most CDNs with three others having more than 10 CDNs. (B) CDN number in *TP53* among patients. The dark bar represents the observed patient number with corresponding CDNs of the X-axis. The gray bar shows the expected patient distribution. Clearly, *TP53* only needs to contribute one CDN to drive tumorigenesis. Hence, *TP53* (and other canonical driver genes; see text), while prevalent, does not contribute disproportionately to the tumorigenesis of each patient.

Although a small number of genes have an unusually high number of CDNs, these genes may not drive the tumorigenesis in proportion to their CDN numbers in individual patients. Figure 3B shows the number of CDN mutations on TP53 that occur in any single patient. Usually, only one CDN change is observed in a patient, whereas two or three CDN mutations are expected. It thus appears that CDNs on the same gene are redundant in their tumorigenic effects such that the second hit may not yield additional advantages. This pattern of disproportionately lower contribution by CDN-rich genes also holds for other genes such as EGFR and KRAS. Consequently, the large number of genes with only one or two CDN sites are disproportionately important in driving the tumorigenesis of individual patients.

CDNs in relation to the cancer types: The pan-cancer properties

In the current literature, CDGs (however they are defined) generally meet the statistical criteria for driver genes in only one or a few cancer types. However, genes may in fact contribute to tumorigenesis but are insufficiently prevalent to meet the statistical requirements for CDGs. Many genes are indeed marginally qualified as drivers in some tissues and barely miss the statistical cutoff in others. To see if genes that drive tumorigenesis in multiple tissues are more common than currently understood, we need to raise the sensitivity of cancer driver detection. Thus, CDNs may provide the resolution.

To test the pan-cancer-driving capacity of CDNs, we define i_max as the largest i value among the 12 cancer types for each CDN. The number of cancer types where the said mutation can be detected (i.e., i > 0) is designated NC12. Figure 4 presents the relationship between the observed NC12 of each CDN and the i_max of that CDN. Clearly, many CDNs are observed in multiple cancer types (NC12 > 3), even though they qualify as drivers in only a single cancer type. This happens frequently when a mutation has i > 3 in one cancer type but has i < 3 in others. One extreme example is C394 and G395 in IDH1. In central nervous system (CNS), both sites show i ≫ 3, while in six other cancer types (lung, breast, large intestine, prostate, urinary tract, liver), their hits are i < 3 but > 0. Conditional on a specific site informed by a cancer type, a mutation in another cancer type should be very unlikely if the mutation is not tumorigenic in multiple tissues. Hence, the pattern in Figure 4 is interpreted as reflecting drivers in multiple cancer types, but with varying statistical strength.
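The two per-site summaries defined above can be written down in a few lines. The sketch below is illustrative only: the function name, the label strings, and the toy hit counts are hypothetical, and the threshold i ≥ 3 is the CDN criterion used throughout the text.

```python
def summarize_site(hits_by_cancer, i_star=3):
    """Summarize one mutation across cancer types.

    hits_by_cancer: dict mapping cancer type -> recurrence i in that type.
    Returns i_max, NC12 (number of types with i > 0), and a label saying
    whether the site reaches the CDN threshold in zero, one, or several types.
    """
    hits = list(hits_by_cancer.values())
    i_max = max(hits)
    nc12 = sum(1 for h in hits if h > 0)
    n_cdn_types = sum(1 for h in hits if h >= i_star)
    if n_cdn_types == 0:
        label = "non-CDN"
    elif n_cdn_types == 1:
        label = "CDN in one cancer type"
    else:
        label = "CDN in multiple cancer types"
    return {"i_max": i_max, "NC12": nc12, "label": label}

# Hypothetical site: strongly recurrent in one tissue, sporadic in two others.
cancer_types = ["Lung", "Breast", "CNS", "Kidney", "UADT", "Colon", "Endometrium",
                "Prostate", "Stomach", "Urinary tract", "Ovary", "Liver"]
toy_hits = {name: 0 for name in cancer_types}
toy_hits.update({"CNS": 40, "Lung": 2, "Breast": 1})
summary = summarize_site(toy_hits)
print(summary)
```

A site like this toy example would be called a CDN in a single cancer type while still being detected (0 < i < 3) in others, which is the situation the text interprets as a driver of varying statistical strength.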

Figure 4

Sharing of cancer-driving nucleotides (CDNs) across cancer types.

The X-axis shows *i_max*, which is the largest i a CDN reaches among the 12 cancer types. The Y-axis shows the number of cancer types where the mutation also occurs. Each dot is a CDN, and the number of dots in the cloud is given. The blue and red dots denote, respectively, mutations classified as a CDN in one or multiple cancer types. Gray dots are non-CDNs. The table in the lower panel summarizes the number of sites and the number of genes harboring these sites.

Examining Figure 4 more carefully, we could see that CDNs with a larger i_max in one cancer type are more likely to be identified as CDNs in multiple cancer types (red dots, r = 0.97, p=9.23 × 10^–5, Pearson’s correlation test). Of 22 sites with i_max > 20, 15 are identified as CDNs (i ≥ 3) in multiple cancer types, with a median NC12 of 9. On the opposite end, two CDNs with i_max > 20 are observed in only one cancer type (EGFR: T2573 in lung and FGFR2: C755 in endometrium cancer). The bimodal pattern suggests that a few cancer driver mutations are tissue specific, whereas most others appear to have pan-cancer-driving potentials.

To conclude, when a driver is observed in more than one cancer type, it is often a cancer driver in many others, but insufficiently powerful to meet the statistical criteria for driver mutations. This pan-cancer property can be seen at the higher resolution of CDN, but is often missed at the whole-gene level. Cancers of the same tissue in different patients, often reported to have divergent mutation profiles (Nik-Zainal et al., 2012; Roberts and Gordenin, 2014), should be a good test of this hypothesis.

CDNs in relation to individual patients and therapeutic strategies

In previous sections, the focus is on the population of cancer patients; for example, how many in the patient population have certain mutations. We now direct the attention to individual patients. It would be necessary to pinpoint the CDN mutations in each patient in order to delineate the specific evolutionary path and to devise the treatment strategy. We shall first address the cancer-driving power of CDN vs. non-CDN mutations in the same gene.

Efficacy of targeted therapy against CDNs vs. non-CDNs

In general, a patient would have many point mutations, only a few of which are strong CDNs. We may ask whether most mutations on the canonical genes, such as EGFR, are CDNs. Presumably, synonymous, and likely many nonsynonymous, mutations on canonical genes may not be CDNs. It would be logical to hypothesize that patients whose EGFR has a CDN mutation (group 1 patients) should benefit from the gene-targeted therapy more than patients with a non-CDN mutation on the same gene (group 2 patients). In the second group, EGFR may be a nondriver of tumorigenesis.

Published data (André et al., 2017; Choudhury et al., 2023) are re-analyzed as shown in Figure 5. The hypothesis that patients of group 2 would not benefit as much as those of group 1 is supported by the analysis. This pattern further strengthens the underlying assumption that non-CDN mutations, even on canonical genes, are not cancer drivers.

Figure 5

Survival analysis of non-small cell lung cancer (NSCLC) patients based on EGFR mutation status.

Patient data were retrieved from the GENIE database (https://genie-public-beta.cbioportal.org/) and stratified into three groups based on *EGFR* mutation profiles: Group 1 comprises patients with *EGFR* CDN mutations; group 2 includes patients with nonsynonymous mutations in *EGFR* that are not cancer-driving nucleotides (CDNs); the *EGFR^WT* group consists of patients with no *EGFR* mutations (see ‘Methods’). Patients of groups 1 and 2 received *EGFR*-targeted therapies in accordance with the guidelines for managing *EGFR* mutant NSCLC (Passaro et al., 2022; Choudhury et al., 2023). Survival analysis using the Kaplan–Meier method revealed a significantly higher survival rate for group 1 patients compared to group 2 and the *EGFR^WT* group (p<0.001).
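For readers unfamiliar with the Kaplan–Meier method named in the legend, a minimal product-limit estimator is sketched below. It is a generic textbook illustration and not the analysis code behind Figure 5; the function name and the toy follow-up times are hypothetical and unrelated to the GENIE data.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time for each patient.
    events: 1 if death was observed at that time, 0 if censored.
    Returns a list of (time, survival_probability) at each observed death time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival, curve, idx = 1.0, [], 0
    while idx < len(data):
        t = data[idx][0]
        deaths = leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while idx < len(data) and data[idx][0] == t:
            deaths += data[idx][1]
            leaving += 1
            idx += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve

# Hypothetical toy cohort: the patient at t = 2 is censored.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print(curve)
```

Curves estimated this way for each group would then be compared with a test such as the log-rank test, which is not implemented here.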

Number of CDNs in each patient

We postulate that a full set of CDNs should be able to inform about the cause of each cancer as well as the design of gene-targeted therapy. In Table 4, the known CDNs based on TCGA are tallied. Note that only a few CDNs fall on the canonical driver genes, whereas most CDNs fall on the nonconventional ones.

Table 4

Numbers of patients with cancer-driving nucleotides (CDNs) vs. number of patients with any non-synonymous mutations in the same genes.

	Lung		Breast		Central nervous system		Upper aerodigestive tract		Colon		Endometrium
	CDN*^† (178)	Gene^† ^‡ (89)	CDN (50)	Gene (17)	CDN (83)	Gene (24)	CDN (77)	Gene (32)	CDN (148)	Gene (96)	CDN (142)	Gene (56)
n₀	342 (33%) ^§	53 (5.3%)	492 (51.1%)	415 (43.1%)	235 (26.9%)	163 (18.7%)	268 (39%)	140 (20.3%)	102 (17.9%)	42 (7.4%)	42 (9%)	14 (3%)
n₁	411 (39.7%)	70 (6.8%)	379 (39.4%)	395 (41%)	359 (41.1%)	306 (35.1%)	268 (39%)	229 (33.3%)	159 (27.8%)	79 (13.8%)	108 (23.2%)	59 (12.7%)
n₂	192 (18.6%)	84 (8.1%)	73 (7.6%)	114 (11.8%)	225 (25.8%)	293 (33.6%)	101 (14.7%)	171 (24.9%)	140 (24.5%)	93 (16.3%)	169 (36.3%)	101 (21.7%)
n_>2	90 (8.7%)	826 (79.8%)	18 (1.9%)	38 (3.9%)	53 (6.1%)	110 (12.6%)	50 (7.3%)	147 (21.4%)	170 (29.8%)	357 (62.5%)	146 (31.4%)	291 (62.6%)
Total n	1035	1035	963	963	873	873	688	688	571	571	465	465
Mean #	1.06	7.19	0.61	0.78	1.12	1.44	0.93	1.63	1.96	4.6	2.17	3.7

* n_i designates the number of patients with i CDN mutations.
† The number in the parentheses is the total number of CDNs or genes.
‡ In this column, n_i designates the number of patients with any nonsynonymous mutation in the same gene as the CDN column.
§ There are 684 CDNs summed over all cancer types. The percentage is n_i/Total n.

In most cancer types, 10–30% of patients, shown in the n₀ row of Table 4, have no known CDNs (and >50% among breast cancer patients). Hence, the current practice is to rely on missense mutations, regardless of CDNs or non-CDNs, on the canonical genes. The CDN column vs. the gene column in Table 4 addresses this issue. For example, the CDN column suggests that 33% of lung cancer patients (the n₀ row) would not respond well to gene-targeted therapy, whereas the gene column shows only 5.3%. The difference is due to a higher, and likely inflated, detection rate of candidate drivers in the gene column. We suggest that patients who have a non-CDN mutation on a driver gene would not respond to the targeted therapy against that gene, as demonstrated in Figure 5. In the above example, 27.7% (33% − 5.3%) of patients may be subjected to the targeted treatment but may not respond well.
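The lung cancer comparison can be repeated for the other cancer types in Table 4. The short Python sketch below subtracts the n₀ percentage of the gene column from that of the CDN column. The percentages are copied from Table 4; the names `N0_PERCENT` and `n0_gap` are ours and serve only as an illustration.

```python
# Percentage of patients in the n0 row of Table 4,
# given as (CDN column, gene column) for each cancer type.
N0_PERCENT = {
    "Lung": (33.0, 5.3),
    "Breast": (51.1, 43.1),
    "Central nervous system": (26.9, 18.7),
    "Upper aerodigestive tract": (39.0, 20.3),
    "Colon": (17.9, 7.4),
    "Endometrium": (9.0, 3.0),
}


def n0_gap(cdn_pct, gene_pct):
    """Percentage points of patients who carry no CDN but do carry a
    nonsynonymous mutation in one of the CDN-bearing genes."""
    return round(cdn_pct - gene_pct, 1)


gaps = {cancer: n0_gap(*pcts) for cancer, pcts in N0_PERCENT.items()}
for cancer, gap in gaps.items():
    print(f"{cancer}: {gap} percentage points")
```

Under this reading, the gap is largest in lung cancer and smallest in endometrial cancer.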

Prevalence vs. potency of CDN-bearing genes in driving tumorigenesis

The last question is the relationship between mutation prevalence and tumorigenic strength (or potency) among CDN-bearing genes. For example, when a patient is diagnosed to have five CDNs in five genes, what may be their relative contributions to the tumorigenesis? Are they equally valid candidates for targeted therapy? It would seem logical that canonical CDGs with many CDNs should be the targets. However, because these genes would contribute at most one CDN to the tumorigenesis (Figure 3B), targeting a high-prevalence gene may not yield more benefits to the patients than targeting a low-prevalence gene that has a CDN.

The implication is that prevalence and potency of CDNs may not be strongly correlated. Some genes may be prevalently mutated in the patient population but, in each affected patient, these genes may not be more potent than the less prevalent genes with a CDN mutation. Potency can be tested in vitro by gene editing or in vivo by targeting treatment. In this interpretation, targeting a CDN of low prevalence (say, i = 3) may be as effective in treatment as targeting a high-prevalence CDN with i = 20. The model and Table 5 present this hypothesis based on cancer hallmarks.

Table 5

Gene numbers for different cancer hallmarks.

	Gene number
Hallmark	All records	Breast	Colon
Angiogenesis	78	8	6
Cell division control	107	12	10
Cell replicative immortality	44	4	3
Change of cellular energetics	70	10	4
Escaping immune response to cancer	51	1	1
Escaping programmed cell death	202	32	20
Genome instability and mutations	106	10	7
Invasion and metastasis	206	52	27
Proliferative signaling	176	40	20
Senescence	48	3	5
Suppression of growth	130	11	12
Tumor-promoting inflammation	54	2	3

Data downloaded from COSMIC (https://cancer.sanger.ac.uk/cosmic/download), see ‘Methods’.

The hallmarks of cancer were first proposed by Hanahan and Weinberg, 2000 with several updates (Hanahan and Weinberg, 2011; Hanahan, 2022). Each hallmark is a cancer phenotype, and Table 5 lists the number of genes involved in each hallmark (see ‘Methods’). While each hallmark may be associated with a number of genes, many genes are also involved in multiple hallmarks. As even the highly prevalent genes would usually have at most one mutation in each patient, we assume that each gene is associated with one hallmark in each patient.

If tumorigenesis requires a mutation in most (but perhaps not all) of the hallmarks, then the number of mutation combinations would be the product of all numbers in the corresponding column. For breast cancer, it would be 8 × 12 × 4 × ... × 11 × 2 ≈ 1.7 × 10¹¹. In other words, the number of possible mutation combinations that can drive breast cancer is well over a billion. Hence, two breast cancers are unlikely to have the same set of CDGs or CDNs. In this view, the prevalence of a gene would be inversely proportional to the hallmark gene number. For example, genes of ‘invasion and metastasis’ in breast cancer would have a prevalence of <1/52. In contrast, the potency in tumorigenesis should depend on the hallmark phenotype itself and be independent of the gene number for that hallmark. In this example, each gene of ‘invasion and metastasis’ may have low prevalence but could still be highly potent in each patient.
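The product can be checked directly from Table 5. The Python sketch below multiplies the hallmark gene numbers for the breast and colon columns, under the simplifying assumption that exactly one gene is drawn from every hallmark. The list and function names are ours.

```python
from math import prod

# Hallmark gene numbers copied from Table 5, in the row order of the table.
BREAST = [8, 12, 4, 10, 1, 32, 10, 52, 40, 3, 11, 2]
COLON = [6, 10, 3, 4, 1, 20, 7, 27, 20, 5, 12, 3]


def n_combinations(gene_counts):
    """Number of ways to pick one gene from each hallmark."""
    return prod(gene_counts)


breast_total = n_combinations(BREAST)
colon_total = n_combinations(COLON)
print(f"Breast: {breast_total:.2e}")
print(f"Colon:  {colon_total:.2e}")
```

If only a subset of hallmarks were required, the count would be a sum over such products, so these figures are illustrative rather than exact.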

In short, the prevalence and potency of CDNs may be poorly correlated. The hypothesis can be functionally tested (by gene editing in vitro or targeting treatment in vivo) in conjunction with the data on the attraction (i.e., co-occurrences) vs. repulsion (lack of co-occurrences) of CDNs.

Discussion

The companion study presents the theory that computes the limit of recurrences (i/n, i times in n patients) reachable by neutral mutations. Above the cutoff (e.g., 3/1000), a recurrent mutation is deemed an advantageous CDN (Zhang et al., 2024). At present, the power of CDN analysis is hampered by the still small sample sizes, generally between 300 and 3000. We show that, when n reaches 10⁵, a mutation only has to recur 12 times to be shown as a CDN, that is, 25 times more sensitive than 3/1000. In short, nearly all CDNs should be discovered with n ≥ 10⁵.
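The 25-fold figure follows from comparing the two cutoffs as frequencies. As a short worked check, using only the numbers quoted above:

```latex
\frac{3/1000}{12/10^{5}} = \frac{3 \times 10^{-3}}{1.2 \times 10^{-4}} = 25
```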

In this study, we apply the theory on existing data to characterize the discovered CDNs. Based on TCGA data, this study concludes that each cancer patient carries only 1–2 CDNs, whereas 6–10 drivers are usually hypothesized to be present in each cancer genome (Hanahan and Weinberg, 2011; Vogelstein et al., 2013; Campbell et al., 2020). This deficit signifies the current incomplete understanding of cancer-driving potentials. Across patients of the same cancer type, about 50–150 CDNs have been discovered for each cancer type, representing perhaps only 10% of all possible CDNs. Given a complete set of CDNs, it should be possible to delineate the path of tumor evolution for each individual patient.

A direct functional test of CDNs would be to introduce putative cancer-driving mutations and observe the evolution of tumors. Such a task of introducing multiple mutations that are collectively needed to drive tumorigenesis has been done only recently and only for the best-known cancer-driving mutations (Ortmann et al., 2015; Takeda et al., 2015; Hodis et al., 2022). In most tumors, the correct combination of mutations needed is not known. Clearly, CDNs, with their strong tumorigenic strength, are suitable candidates.

Many CDNs in a patient may not fall on conventional CDGs, whereas these conventional CDGs may have passenger or weak mutations. Therefore, the efforts in gene-targeting therapy may well be shifted to the CDN-harboring genes. Given a complete set of CDNs, many more driver genes can be identified. Since many driver genes cannot be targeted for biological or technical reasons (Dang et al., 2017; Danesi et al., 2021; Waarts et al., 2022), a large set of CDGs will be desirable. The goal is that each cancer patient would have multiple targetable CDGs, all driven by CDNs they carry. In that case, the probability of resistance mutations eluding multiple targeted drugs should be diminished (Chen et al., 2022a; Zhai et al., 2022; Bian et al., 2023; Lin et al., 2023; Zhu et al., 2023).

In this context, we should comment on the feasibility of targeting CDNs that may occur in either oncogenes (ONCs) or tumor suppressor genes (TSGs). It is generally accepted that ONCs drive tumorigenesis thanks to the gain-of-function (GOF) mutations, whereas TSGs derive their tumorigenic powers by loss-of-function (LOF) mutations. Nevertheless, since LOF mutations are likely to be widespread on TSGs, they are less likely to recur as CDNs. The even distributions of nonsense mutations along the length of many TSGs provide such evidence. Importantly, as gene targeting aims to diminish gene functions, GOF mutations are perceived to be targetable, whereas LOF mutations are not. By extension, ONCs should be targetable but TSGs are not, an assertion we address below.

The data suggest that missense mutations on TSGs may often be of the GOF kind. If missense mutations are far more prevalent than nonsense mutations in tumors, the missense mutations cannot possibly be LOF mutations. (After all, it is not possible to lose more functions than nonsense mutations.) In a separate study (Deng et al., 2022), we compare missense and nonsense mutations (referred to as the escape-route analysis). For example, AAA to CAA (K to Q) is a missense mutation while the same AAA codon to TAA (K to stop) is a nonsense mutation. We found many cases where the missense mutations on TSGs are more prevalent (>10×) than nonsense mutations. We interpret these missense mutations to be of the GOF kind because they could not possibly ‘lose’ more functions than the nonsense mutations.
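To make the missense vs. nonsense comparison concrete, the Python sketch below classifies single-nucleotide changes of a codon using the standard genetic code, in which AAA encodes lysine (K), CAA glutamine (Q), and TAA a stop codon. It illustrates the idea only and is not the escape-route analysis of the separate study; the function names `classify` and `escape_routes` are ours.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify(ref_codon, alt_codon):
    """Label a codon change as synonymous, nonsense, or missense."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    # Any other amino acid change (a stop-loss change would also land here).
    return "missense"


def escape_routes(codon):
    """Count the nine single-nucleotide changes of a codon by consequence."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                alt = codon[:pos] + base + codon[pos + 1:]
                counts[classify(codon, alt)] += 1
    return counts


print(classify("AAA", "CAA"), classify("AAA", "TAA"))
print(escape_routes("AAA"))
```

For a codon such as AAA, both missense and nonsense changes are one mutational step away, so their relative prevalence in tumors can be compared at the same site.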

Another interesting pattern may be the distributions of CDNs across different cancer types. Cancer evolution in different tissues represents parallel evolution driven by similar selection for cell proliferation but under different ecological conditions. Figure 4 suggests that CDNs previously identified to be cancer-specific may have pan-cancer effects. In different cancer types, the same CDNs may drive tumorigenesis but the strength may not be sufficient to raise the data above the statistical threshold.

The CDN approach has two additional applications. First, it can be used to find CDNs in non-coding regions. Although the number of whole-genome sequences at present is still insufficient for systematic CDN detection, the preliminary analysis suggests that the density of CDNs in noncoding regions is orders of magnitude lower than in coding regions. Second, CDNs can also be used in cancer screening with the advantage of efficiency as the targeted mutations are fewer. For the same reason, the false-negative rate should be much lower too. Indeed, the false-positive rate should be far lower than the gene-based screen which often shows a false-positive rate of >50% (Appendix 2, ‘The specificity of CDNs in cancer detection’).

Cancer evolution falls within the realm of ultra-microevolution (Wu et al., 2016). The repeated evolution addresses the single most severe criticism of evolutionary studies, namely all evolutionary events have a sample size of one. Such repeated evolution offers the opportunity to uncover the full list of mutations underlying complex traits that is at the heart of molecular evolution. The genetics of speciation (Wu and Ting, 2004; Pan et al., 2022; Wang et al., 2022; Wu, 2023) and the emergence of major viral strains (such as COVID-19) (Deng et al., 2022; Ruan et al., 2022; Cao et al., 2023; Ruan et al., 2023) are both phenomena of complex gene interactions. The two companion studies may thus unite evolutionary biology and cancer medicine.

Methods

Data preparation

Single-nucleotide variant (SNV) data for TCGA patients were downloaded from the GDC Data Portal (https://portal.gdc.cancer.gov/, data version 28 February 2022); only mutations identified by at least two pipelines were included in this study. Mutations exceeding a 1‰ frequency in the Genome Aggregation Database (gnomAD, version v2.1.1) were excluded to minimize potential false positives arising from germline variants. Patients with more than 3000 coding region point mutations were filtered out as potential hypermutator phenotypes. This filtering process yielded a final set of 7369 patients across 12 cancer types for subsequent analysis. The calculation of A_i and S_i follows the same method as described in the companion paper (Zhang et al., 2024).
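The three filters can be sketched in a few lines of Python. This is a minimal illustration, not the scripts used in the study: the record keys (`patient`, `n_callers`, `gnomad_af`) are hypothetical, and applying the hypermutator filter after the other two is our assumption, since the order is not stated.

```python
from collections import Counter


def filter_mutations(mutations, min_callers=2, max_gnomad_af=0.001,
                     max_coding_mutations=3000):
    """Apply the caller, gnomAD, and hypermutator filters.

    Each mutation is a dict with hypothetical keys:
      'patient'   - patient identifier
      'n_callers' - number of pipelines that reported the mutation
      'gnomad_af' - gnomAD allele frequency, or None if absent from gnomAD
    """
    kept = [m for m in mutations
            if m["n_callers"] >= min_callers
            and (m["gnomad_af"] or 0.0) <= max_gnomad_af]
    # Assumption: the per-patient load is counted after the first two filters.
    load = Counter(m["patient"] for m in kept)
    return [m for m in kept if load[m["patient"]] <= max_coding_mutations]
```

Calling `filter_mutations(records)` with the defaults reproduces the thresholds quoted above (two pipelines, 1‰, 3000 mutations).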

For CDN analysis in noncancerous tissues, mutation profiles for normal tissues were retrieved from SomaMutDB (Sun et al., 2022). Mutations from different samples originating from the same individual were consolidated. Donors above the age of 80 were excluded from our dataset. The mutation processing followed the same pipeline as previously described. In total, we have mutation profiles from 487 donors serving as a negative control.

The canonical lists of CDGs were obtained from three distinct data sources. The CGC Tier 1 genes, encompassing genes with the highest confidence of driver status, were retrieved from the COSMIC Cancer Gene Census (https://cancer.sanger.ac.uk/census; Sondka et al., 2018). The IntOGen driver gene list, which employs an integrated pipeline for gene discovery, was downloaded from https://www.intogen.org/download (Martínez-Jiménez et al., 2020). Bailey’s driver gene list comprises 299 CDGs identified through a PanSoftware strategy, with further experimental validation confirming their role in driving cell lines (Bailey et al., 2018). The consistency of cancer types across all studies was manually verified using oncotree (#/home). For the analysis of driver gene overlap, only drivers from the same cancer type were compared.

The hallmark annotation of genes was downloaded from COSMIC (https://cancer.sanger.ac.uk/cosmic/download), encompassing 331 genes with annotated dysregulated biological processes. It is important to note that these hallmarks are manually annotated as part of an ongoing effort to characterize the role of genes in cancer based on literature evidence. The actual scale of hallmark genes may be substantially larger than the current version.

For gene-level selection analysis, we utilized the R package 'dndscv' to quantify selection signals for missense and nonsense mutations in a given gene (Martincorena et al., 2017). Specifically, the package calculates the Ka/Ks ratio, denoted as 'w' in the final results, for a given mutation impact (missense or nonsense). The significance of selection is presented as q values after Benjamini–Hochberg (BH) adjustment. Genes with w > 1 and q < 0.1 were identified as being significantly under positive selection.
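The selection rule (w > 1 and q < 0.1) can be illustrated independently of the R package. The Python sketch below implements the Benjamini–Hochberg adjustment by hand and applies the two cutoffs. It is not the 'dndscv' code, which reports its own q values, and the gene tuples are hypothetical inputs.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (q values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


def positively_selected(genes, w_cut=1.0, q_cut=0.1):
    """genes: list of (name, w, p_value); returns the names passing both cutoffs."""
    q_values = bh_adjust([p for _, _, p in genes])
    return [name for (name, w, _), q in zip(genes, q_values)
            if w > w_cut and q < q_cut]
```

A gene with a strong signal but w < 1 is excluded, because the criterion targets positive rather than negative selection.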

We employ i^* = 3 as a cutoff for identifying CDNs across various cancer types. The specific value of i^* is detailed in Eq. 10 of the companion paper (Zhang et al., 2024). Here, i^* = 3 is chosen consistently across all cancer types, taking into account the abundance of sites under positive selection given i = 3 in Table 2. Throughout our analysis, emphasis is placed on CDNs of the missense category, where missense mutations with a recurrence ≥3 are identified as CDNs. For $\Delta U_{i}$ analysis, the reference table for 75 single-step amino acid changes was obtained from Chen et al., 2019a, and the $\Delta U_{i}$ for each CDN is derived by mapping the amino acid change to the reference table.
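The recurrence cutoff amounts to counting, for each missense site, the number of patients carrying the mutation. A minimal Python sketch follows. The input layout of (patient, site) pairs is a hypothetical choice of ours, and each patient is assumed to count once per site.

```python
from collections import Counter


def call_cdns(missense_records, i_star=3):
    """Return {site: recurrence} for sites recurring in at least i_star patients.

    missense_records: iterable of (patient_id, site) pairs, where site is any
    hashable label such as (gene, position, alternate base).
    """
    # set() removes duplicate reports of the same mutation in the same patient.
    recurrence = Counter(site for _, site in set(missense_records))
    return {site: i for site, i in recurrence.items() if i >= i_star}


site_a = ("GENE_A", 100, "T")  # hypothetical sites for illustration
site_b = ("GENE_B", 25, "G")
records = [("P1", site_a), ("P2", site_a), ("P3", site_a), ("P3", site_a),
           ("P1", site_b), ("P2", site_b)]
print(call_cdns(records))
```

Here only `site_a` reaches the cutoff, recurring in three distinct patients.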

Calculation of A_i_e

We employ Eq. 9 from the companion paper to calculate the expected value for A_i under neutrality. For a given site, the cumulative probability for recurrence $x \leq i - 1$ could be expressed as

F (x \leq i - 1) = 1 - {(1 - \frac{1}{1 + n E (u)})}^{i - 1}

where n is the population size of a given cancer type, and E(u) is the mutation rate per site per patient derived from singleton synonymous mutations:

S_{1} = L_{S} \cdot n E (u) e^{- (n - 1) E (u)}

Then, by expectation, the number of sites with recurrence of at least i ( $A_{x \geq i}$ ) can be represented by

A_{x \geq i} = L_{A} - L_{A} \cdot F (x \leq i - 1)

Following the same logic, we will have $A_{x \geq i + 1}$ as

A_{x \geq i + 1} = L_{A} - L_{A} \cdot F (x \leq i)

Then the expected value for A_i_e is

A_{i\_e} = A_{x \geq i} - A_{x \geq i + 1} = L_{A} \cdot [F (x \leq i) - F (x \leq i - 1)]

L_A and L_S are missense and synonymous sites, respectively. The calculation procedure is described in methods of the companion paper (Zhang et al., 2024).

With Equation S1, Equation S2, Equation S3, we could solve for the expected number of sites with missense mutation recurrence i.
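The relation between S₁ and E(u) given above has no closed-form inverse, so E(u) must be found numerically. The Python sketch below does this by bisection. It is one possible approach rather than the procedure of the companion paper, and the input values in the example are hypothetical.

```python
import math


def solve_mutation_rate(s1, l_s, n, tol=1e-15):
    """Solve S1 = L_S * n*u * exp(-(n - 1)*u) for u = E(u) by bisection.

    The right-hand side increases with u on [0, 1/(n - 1)], so the smaller
    root, which is the one of interest when mutation rates are low, lies there.
    """
    def excess(u):
        return l_s * n * u * math.exp(-(n - 1) * u) - s1

    lo, hi = 0.0, 1.0 / (n - 1)
    if excess(hi) < 0:
        raise ValueError("S1 exceeds the maximum attainable under this relation")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical inputs: 1000 patients, 10 million synonymous sites.
u_hat = solve_mutation_rate(s1=9990.0, l_s=1e7, n=1000)
print(f"E(u) is approximately {u_hat:.3e}")
```

The resulting E(u) can then be substituted into the expression for F to obtain the expected A_i.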

Survival analysis of EGFR-targeted therapy

The mutation and clinical profiles of 23,253 patients were retrieved from the GENIE project (Cerami et al., 2012; de Bruijn et al., 2023), with 7216 patients harboring EGFR mutations. Survival months were calculated as the time elapsed between the date of sequencing and the date of the last contact (or date of death). In cases where patients had multiple sequencing reports, the earliest one was selected. For CDN calling, we applied Eq. 10 from the companion paper (Zhang et al., 2024). With $ε = 0.01$ , we set the CDN cutoff i^* = 14. To mitigate potential biases from other common drivers in lung cancer, patients with indels in exons 19 and 20 of EGFR, G12/13 mutations in KRAS, V600 mutations in BRAF, exon 20 insertions in HER2, or fusions in MET, ALK, ROS1, RET, and NTRK were filtered out. The final survival analysis was conducted using GraphPad Prism 8.
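For readers who wish to reproduce the survival curves without GraphPad Prism, the Kaplan–Meier estimator itself is short. The Python sketch below is a generic textbook implementation, not the analysis run in this study. It assumes that each patient contributes a follow-up time in months and an event flag (1 for death, 0 for censoring at last contact).

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times  - follow-up time of each patient (e.g., months since sequencing)
    events - 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each observed death time.
    """
    data = list(zip(times, events))
    curve, survival = [], 1.0
    for t in sorted({time for time, event in data if event}):
        at_risk = sum(1 for time, _ in data if time >= t)
        deaths = sum(1 for time, event in data if time == t and event)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


print(kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0]))
```

Comparing groups, as in Figure 5, additionally requires a test such as the log-rank test, which is not shown here.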

Annotation for noncanonical CDN genes

We conducted functional annotation and enrichment analysis for newly identified noncanonical CDN genes using four independent databases (Gene Ontology, KEGG, Disease Ontology, and Reactome) with R packages (clusterProfiler, DOSE, ReactomePA). For each analysis, we set a p-value cutoff of 0.05 and a q-value cutoff of 0.2, with p-value adjustment method set to ‘BH’. To explore the connections between noncanonical CDN genes and canonical CDGs, enrichment analyses were performed alongside cancer drivers from IntOGen. Specifically, for enrichment annotations related to cancer hallmarks, the corresponding genes were subjected to manual confirmation using CancerGeneNET (https://signor.uniroma2.it/CancerGeneNet/).

Appendix 1

Appendix 1—figure 1

The overlap of cancer drivers from IntOGen, Bailey et al. and CGC Tier 1 (Bailey et al., 2018; Sondka et al., 2018; Martínez-Jiménez et al., 2020).

Driver genes (dots) for 12 cancer types were extracted from each driver list, indicated by three different region colors. The area size of each region is proportional to the gene number, with 384 genes for IntOGen, 168 for Bailey et al. and 137 for CGC Tier 1. Genes with a significant positive selection signal in the merged mutation set are marked in red, while nonsignificant ones are colored in blue. Notably, genes shared across the three driver sets are largely those with a significant Ka/Ks > 1.

Appendix 2

1. Quantifying evolutionary fitness of CDN

We leverage Eqs. 2 and 4 from the companion paper (Zhang et al., 2024) and rewrite A_i as follows:

A_{i} = A_{i}^{neut} + A_{i}^{*}

where A_i represents the observed site number with missense recurrence of i, which could be further decomposed into two components: $A_{i}^{neut}$ , the site number with missense recurrence of i under neutral mutational force, and $A_{i}^{*}$ , which occurs under positive selection. For $A_{i}^{*}$ , we have

A_{i}^{*} = f \cdot L_{A} \cdot g (i, k) {[w \cdot n E (u)]}^{i}

With Equations S4 and S5, A_i could be expressed as

A_{i} \sim \frac{Γ (k + i)}{Γ (i + 1) Γ (k)} \cdot \frac{L_{A}}{k^{i}} {[n E (u)]}^{i} \cdot (1 + f \cdot w^{i}) = 2.3 \cdot S_{i} \cdot (1 + f \cdot w^{i})

where f denotes the fraction of missense sites under positive selection (f ≪ 1), and w represents the selective advantage (s) scaled by the population size of progenitor cancer cell (N). First, we aimed to estimate f from the discrepancy between $R_{0} = A_{0} / S_{0}$ and $R_{1} = A_{1} / S_{1}$ . The number of sites under positive selection in A₁ ( $A_{1}^{*}$ ) could be approximated from A₀ as $A_{0} \cdot f$ , which could also be derived from the excess of mutations from A₁ as $A_{1} \cdot (R_{1} - R_{0}) / R_{1}$ , then we have

A_{0} \cdot f = A_{1} \cdot (R_{1} - R_{0}) / R_{1}

f = \frac{A_{1}}{A_{0}} \cdot (1 - \frac{R_{0}}{R_{1}})

Based on the average statistics in Table 1, f could be estimated from Equation S6 to be 3.13 × 10^–4. With synonymous recurrence sites as neutral reference, $A_{i} \sim 2.3 \cdot S_{i} \cdot (1 + f \cdot w^{i})$ . Given A₃/S₃ = 5.25, we will have w = 16, which means A₃ would be 4096 (16³) fold higher than the neutral expectation.
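The value of w can be recovered by inverting the relation A_i/S_i ≈ 2.3 · (1 + f · w^i). The Python sketch below performs this inversion with the figures quoted in the text (f = 3.13 × 10⁻⁴, A₃/S₃ = 5.25) and implements the estimator of f. The function names are ours.

```python
def estimate_f(a0, a1, s0, s1):
    """f = (A1 / A0) * (1 - R0 / R1), with R0 = A0/S0 and R1 = A1/S1."""
    r0, r1 = a0 / s0, a1 / s1
    return (a1 / a0) * (1 - r0 / r1)


def estimate_w(ratio_i, f, i, neutral_ratio=2.3):
    """Solve A_i / S_i = neutral_ratio * (1 + f * w**i) for w."""
    return ((ratio_i / neutral_ratio - 1) / f) ** (1 / i)


w = estimate_w(ratio_i=5.25, f=3.13e-4, i=3)
print(f"w = {w:.2f}, w^3 = {w ** 3:.0f}")
```

With these inputs w comes out very close to 16, in line with the 16³ = 4096 figure quoted above.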

2. Functional annotation of new cancer drivers

A limitation in cancer driver discovery lies in the modeling of background mutation process, which often necessitates a balance between current knowledge and unknown mutational mechanisms. Consequently, genes recognized as canonical drivers in one cancer type may be categorized as noncanonical in others due to the lack of statistical significance. Of the 229 noncanonical drivers identified across six cancer types, 19 genes have been previously recognized as canonical drivers in different cancer types, while 23 genes were classified as drivers in IntOGen through a combination of diverse statistical methods.

For the newly identified noncanonical CDN genes in this study, we explore their potential functional relevance to cancer through a two-step procedure. First, we annotate these genes considering gene ontology, pathway, disease association, and protein–protein interaction with known cancer drivers. Subsequently, we conduct manual curation by reviewing the published literature for evidence related to cancer. Appendix 2—figure 1 illustrates how noncanonical CDGs are enriched in cancer-related biological processes in lung and colon cancers. The results reveal that processes such as cell migration/adhesion, epithelial-to-mesenchymal transition (EMT), cell proliferation, energy metabolism, immune response, and DNA transcription activity are among the most enriched processes in both cancer types. Additionally, other cancer hallmark-related processes, such as cell cycle control, DNA stability, and response to stresses, are also enriched among noncanonical CDN genes.

In Appendix 2—figure 2, we present the top 10 genes most connected to known cancer drivers across four independent enrichment analyses. Here, we take two previously unidentified driver genes as examples to illustrate their functional roles relating to cancer. PIK3R2 (phosphoinositide-3-kinase regulatory subunit 2), which encodes p85β of class I PI3K, is often highly expressed in most tumors (Liu et al., 2022). This gene has a CDN mutation of G1117A (with amino acid change of G373R) in endometrium cancer, which is also present in lung and urinary tract cancers. PIK3R2 has been reported as an oncogene, with its overexpression triggering cell transformation in culture and promoting cancer progression in mouse models (Vallejo-Díaz et al., 2019). SLC7A5 (solute carrier family 7 member 5), with a CDN mutation of G1480A (V494I) in colon cancer, plays a critical oncogenic role in maintaining intracellular amino acid levels for an elevated protein synthesis rate in KRAS-mutant cells. Depletion of SLC7A5 suppresses intestinal tumorigenesis in mice and resensitizes tumors to protein synthesis inhibition (Najumudeen et al., 2021). In conclusion, although excluded from the canonical driver list due to the lack of statistical significance, noncanonical CDN genes may still undergo positive selection at the site level. Ongoing research may provide further experimental evidence for these genes as part of the effort to identify the complete set of cancer drivers.

Appendix 2—figure 1

Noncanonical cancer driver genes (CDGs) in colon and lung cancer along with associated biological processes (Y-axis).

For each gene, we examine its annotation results from GO analysis and search for cancer-related evidence in the literature. Biological processes are summarized and curated in relation to cancer hallmarks. Each connection between gene ID and biological process is depicted by a blue block in the grid.

Appendix 2—figure 2

Top 10 noncanonical cancer driver genes (CDGs) with the highest enrichment records with IntOGen’s driver list from four enrichment analyses.

Panels (**A–D**) correspond to Gene Ontology, KEGG, Disease Ontology, and Reactome analyses, respectively. The X-axis represents the number of enrichment records for each gene, while genes are listed on the Y-axis according to their enrichment record number. Genes with different occurrences across the top sets of the four analyses are marked in red (three hits), blue (two hits), and black (one hit).

3. The specificity of CDNs in cancer detection

Generally, in a given sample, each mutated gene would harbor one mutation. Therefore, we measure the false-positive rate for CDNs (or CDGs) as the proportion of individuals harboring nonsynonymous mutations at CDNs (or in CDGs). Across 487 individuals from the noncancerous set, 446 are devoid of any mutations at CDNs, yielding a specificity of 91.6% (false-positive rate: 8.4%). In contrast, for CDGs downloaded from CGC, the specificity is 37.6% (false-positive rate: 62.4%). The high specificity implies the potential application of CDNs in biopsy and companion diagnostics, as well as the possibility of integration with other early screening pipelines. Furthermore, when we compared the mutations of recurrences i ≥ 3 in SomaMutDB with CDNs identified in our analysis, no overlap was observed. The high exclusiveness of CDNs between cancer and noncancer implies that positive selection operates in a specific manner in cancer, distinct from normal tissues.
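The specificity for CDNs follows directly from the two counts quoted above. As a short Python check, with a function name of our own choosing:

```python
def specificity(n_without_hits, n_controls):
    """Fraction of noncancerous individuals carrying no mutation at the screened sites."""
    return n_without_hits / n_controls


spec = specificity(446, 487)
print(f"specificity = {spec:.1%}, false-positive rate = {1 - spec:.1%}")
```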

Data availability

The scripts for generating the key results of this study and the accompanying paper (Zhang et al., 2024) are available at GitLab (copy archived at Zhang, 2024). Example files for breast cancer analysis have also been included. The complete set of CDNs can be found in Supplementary file 1 of the accompanying paper (Zhang et al., 2024).

References

(2019) Estimating the number of genetic mutations (hits) required for carcinogenesis based on the distribution of somatic mutations
PLOS Computational Biology 15:e1006881.

https://doi.org/10.1371/journal.pcbi.1006881
1. André F
2. Arnedos M
3. Baras AS
4. Baselga J
5. Bedard PL
6. Berger MF
7. Bierkens M
8. Calvo F
9. Cerami E
10. Chakravarty D
11. Dang KK
12. Davidson NE
13. Del Vecchio Fitz C
14. Dogan S
15. DuBois RN
16. Ducar MD
17. Futreal PA
18. Gao J
19. Garcia F
20. Gardos S
21. Gocke CD
22. Gross BE
23. Guinney J
24. Heins ZJ
25. Hintzen S
26. Horlings H
27. Hudeček J
28. Hyman DM
29. Kamel-Reid S
30. Kandoth C
31. Kinyua W
32. Kumari P
33. Kundra R
34. Ladanyi M
35. Lefebvre C
36. LeNoue-Newton ML
37. Lepisto EM
38. Levy MA
39. Lindeman NI
40. Lindsay J
41. Liu D
42. Lu Z
43. MacConaill LE
44. Maurer I
45. Maxwell DS
46. Meijer GA
47. Meric-Bernstam F
48. Micheel CM
49. Miller C
50. Mills G
51. Moore ND
52. Nederlof PM
53. Omberg L
54. Orechia JA
55. Park BH
56. Pugh TJ
57. Reardon B
58. Rollins BJ
59. Routbort MJ
60. Sawyers CL
61. Schrag D
62. Schultz N
63. Shaw KRM
64. Shivdasani P
65. Siu LL
66. Solit DB
67. Sonke GS
68. Soria JC
69. Sripakdeevong P
70. Stickle NH
71. Stricker TP
72. Sweeney SM
73. Taylor BS
74. ten Hoeve JJ
75. Thomas SB
76. Van Allen EM
77. Van ’T Veer LJ
78. van de Velde T
79. van Tinteren H
80. Velculescu VE
81. Virtanen C
82. Voest EE
83. Wang LL
84. Wathoo C
85. Watt S
86. Yu C
87. Yu TV
88. Yu E
89. Zehir A
90. Zhang H
91. The AACR Project GENIE Consortium
(2017) AACR Project GENIE: powering precision medicine through an international consortium
Cancer Discovery 7:818–831.

https://doi.org/10.1158/2159-8290.CD-17-0151
1. Armitage P
2. Doll R
(1954) The age distribution of cancer and a multi-stage theory of carcinogenesis
British Journal of Cancer 8:1–12.

https://doi.org/10.1038/bjc.1954.1
(2019) OncodriveCLUSTL: a sequence-based clustering method to identify cancer drivers
Bioinformatics 35:4788–4790.

https://doi.org/10.1093/bioinformatics/btz501
1. Bailey MH
2. Tokheim C
3. Porta-Pardo E
4. Sengupta S
5. Bertrand D
6. Weerasinghe A
7. Colaprico A
8. Wendl MC
9. Kim J
10. Reardon B
11. Ng PKS
12. Jeong KJ
13. Cao S
14. Wang Z
15. Gao J
16. Gao Q
17. Wang F
18. Liu EM
19. Mularoni L
20. Rubio-Perez C
21. Nagarajan N
22. Cortés-Ciriano I
23. Zhou DC
24. Liang WW
25. Hess JM
26. Yellapantula VD
27. Tamborero D
28. Gonzalez-Perez A
29. Suphavilai C
30. Ko JY
31. Khurana E
32. Park PJ
33. Van Allen EM
34. Liang H
35. MC3 Working Group
36. Cancer Genome Atlas Research Network
37. Lawrence MS
38. Godzik A
39. Lopez-Bigas N
40. Stuart J
41. Wheeler D
42. Getz G
43. Chen K
44. Lazar AJ
45. Mills GB
46. Karchin R
47. Ding L
(2018) Comprehensive characterization of cancer driver genes and mutations
Cell 173:371–385.

https://doi.org/10.1016/j.cell.2018.02.060
1. Belikov AV
(2017) The number of key carcinogenic events can be predicted from cancer incidence
Scientific Reports 7:12170.

https://doi.org/10.1038/s41598-017-12448-7
1. Bian S
2. Wang Y
3. Zhou Y
4. Wang W
5. Guo L
6. Wen L
7. Fu W
8. Zhou X
9. Tang F
(2023) Integrative single-cell multiomics analyses dissect molecular signatures of intratumoral heterogeneities and differentiation states of human gastric cancer
National Science Review 10:nwad094.

https://doi.org/10.1093/nsr/nwad094
1. Bozic I
2. Antal T
3. Ohtsuki H
4. Carter H
5. Kim D
6. Chen S
7. Karchin R
8. Kinzler KW
9. Vogelstein B
10. Nowak MA
(2010) Accumulation of driver and passenger mutations during tumor progression
PNAS 107:18545–18550.

https://doi.org/10.1073/pnas.1010978107
1. Campbell PJ
2. Getz G
3. Korbel JO
4. Stuart JM
5. Jennings JL
6. Stein LD
7. Perry MD
8. Nahal-Bose HK
9. Ouellette BFF
10. Li CH
(2020) Pan-cancer analysis of whole genomes
Nature 578:82–93.

https://doi.org/10.1038/s41586-020-1969-6
1. Cao Y
2. Chen L
3. Chen H
4. Cun Y
5. Dai X
6. Du H
7. Gao F
8. Guo F
9. Guo Y
10. Hao P
11. He S
12. He S
13. He X
14. Hu Z
15. Hoh BP
16. Jin X
17. Jiang Q
18. Jiang Q
19. Khan A
20. Kong HZ
21. Li J
22. Li SC
23. Li Y
24. Lin Q
25. Liu J
26. Liu Q
27. Lu J
28. Lu X
29. Luo S
30. Nie Q
31. Qiu Z
32. Shi T
33. Song X
34. Su J
35. Tao SC
36. Wang C
37. Wang CC
38. Wang GD
39. Wang J
40. Wu Q
41. Wu S
42. Xu S
43. Xue Y
44. Yang W
45. Yang Z
46. Ye K
47. Ye YN
48. Yu L
49. Zhao F
50. Zhao Y
51. Zhai W
52. Zhang D
53. Zhang L
54. Zheng H
55. Zhou Q
56. Zhu T
57. Zhang YP
(2023) Was Wuhan the early epicenter of the COVID-19 pandemic?-A critique
National Science Review 10:nwac287.

https://doi.org/10.1093/nsr/nwac287
1. Cerami E
2. Gao J
3. Dogrusoz U
4. Gross BE
5. Sumer SO
6. Aksoy BA
7. Jacobsen A
8. Byrne CJ
9. Heuer ML
10. Larsson E
11. Antipin Y
12. Reva B
13. Goldberg AP
14. Sander C
15. Schultz N
(2012) The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data
Cancer Discovery 2:401–404.

https://doi.org/10.1158/2159-8290.CD-12-0095
- PubMed
- Google Scholar
1. Chen Q
2. He Z
3. Lan A
4. Shen X
5. Wen H
6. Wu CI
(2019a) Molecular evolution in large steps-codon substitutions under positive selection
Molecular Biology and Evolution 36:1862–1873.

https://doi.org/10.1093/molbev/msz108
- PubMed
- Google Scholar
1. Chen Q
2. Lan A
3. Shen X
4. Wu CI
(2019b) Molecular evolution in small steps under prevailing negative selection: a nearly universal rule of codon substitution
Genome Biology and Evolution 11:2702–2712.

https://doi.org/10.1093/gbe/evz192
- PubMed
- Google Scholar
1. Chen B
2. Wu X
3. Ruan Y
4. Zhang Y
5. Cai Q
6. Zapata L
7. Wu CI
8. Lan P
9. Wen H
(2022a) Very large hidden genetic diversity in one single tumor: evidence for tumors-in-tumor
National Science Review 9:wac250.

https://doi.org/10.1093/nsr/nwac250
- PubMed
- Google Scholar
1. Chen Q
2. Yang H
3. Feng X
4. Chen Q
5. Shi S
6. Wu CI
7. He Z
(2022b) Two decades of suspect evidence for adaptive molecular evolution-negative selection confounding positive-selection signals
National Science Review 9:wab217.

https://doi.org/10.1093/nsr/nwab217
- PubMed
- Google Scholar
1. Choudhury NJ
2. Lavery JA
3. Brown S
4. de Bruijn I
5. Jee J
6. Tran TN
7. Rizvi H
8. Arbour KC
9. Whiting K
10. Shen R
11. Hellmann M
12. Bedard PL
13. Yu C
14. Leighl N
15. LeNoue-Newton M
16. Micheel C
17. Warner JL
18. Ginsberg MS
19. Plodkowski A
20. Girshman J
21. Sawan P
22. Pillai S
23. Sweeney SM
24. Kehl KL
25. Panageas KS
26. Schultz N
27. Schrag D
28. Riely GJ
29. on behalf of the AACR GENIE BPC Core Team
(2023) The GENIE BPC NSCLC cohort: a real-world repository integrating standardized clinical and genomic data for 1,846 patients with non–small cell lung cancer
Clinical Cancer Research 29:3418–3428.

https://doi.org/10.1158/1078-0432.CCR-23-0580
- Google Scholar
1. Danesi R
2. Fogli S
3. Indraccolo S
4. Del Re M
5. Dei Tos AP
6. Leoncini L
7. Antonuzzo L
8. Bonanno L
9. Guarneri V
10. Pierini A
11. Amunni G
12. Conte P
(2021) Druggable targets meet oncogenic drivers: opportunities and limitations of target-based classification of tumors and the role of Molecular Tumor Boards
ESMO Open 6:100040.

https://doi.org/10.1016/j.esmoop.2020.100040
- PubMed
- Google Scholar
1. Dang CV
2. Reddy EP
3. Shokat KM
4. Soucek L
(2017) Drugging the “undruggable” cancer targets
Nature Reviews. Cancer 17:502–508.

https://doi.org/10.1038/nrc.2017.36
- PubMed
- Google Scholar
1. de Bruijn I
2. Kundra R
3. Mastrogiacomo B
4. Tran TN
5. Sikina L
6. Mazor T
7. Li X
8. Ochoa A
9. Zhao G
10. Lai B
11. Abeshouse A
12. Baiceanu D
13. Ciftci E
14. Dogrusoz U
15. Dufilie A
16. Erkoc Z
17. Garcia Lara E
18. Fu Z
19. Gross B
20. Haynes C
21. Heath A
22. Higgins D
23. Jagannathan P
24. Kalletla K
25. Kumari P
26. Lindsay J
27. Lisman A
28. Leenknegt B
29. Lukasse P
30. Madela D
31. Madupuri R
32. van Nierop P
33. Plantalech O
34. Quach J
35. Resnick AC
36. Rodenburg SYA
37. Satravada BA
38. Schaeffer F
39. Sheridan R
40. Singh J
41. Sirohi R
42. Sumer SO
43. van Hagen S
44. Wang A
45. Wilson M
46. Zhang H
47. Zhu K
48. Rusk N
49. Brown S
50. Lavery JA
51. Panageas KS
52. Rudolph JE
53. LeNoue-Newton ML
54. Warner JL
55. Guo X
56. Hunter-Zinck H
57. Yu TV
58. Pilai S
59. Nichols C
60. Gardos SM
61. Philip J
62. Kehl KL
63. Riely GJ
64. Schrag D
65. Lee J
66. Fiandalo MV
67. Sweeney SM
68. Pugh TJ
69. Sander C
70. Cerami E
71. Gao J
72. Schultz N
73. AACR Project GENIE BPC Core Team, AACR Project GENIE Consortium
(2023) Analysis and visualization of longitudinal genomic and clinical data from the AACR Project GENIE biopharma collaborative in cBioPortal
Cancer Research 83:3861–3867.

https://doi.org/10.1158/0008-5472.CAN-23-0816
- PubMed
- Google Scholar
1. Deng S
2. Xing K
3. He X
(2022) Mutation signatures inform the natural host of SARS-CoV-2
National Science Review 9:wab220.

https://doi.org/10.1093/nsr/nwab220
- PubMed
- Google Scholar
1. Grantham R
(1974) Amino acid difference formula to help explain protein evolution
Science 185:862–864.

https://doi.org/10.1126/science.185.4154.862
- PubMed
- Google Scholar
1. Hanahan D
2. Weinberg RA
(2000) The hallmarks of cancer
Cell 100:57–70.

https://doi.org/10.1016/s0092-8674(00)81683-9
- PubMed
- Google Scholar
1. Hanahan D
2. Weinberg RA
(2011) Hallmarks of cancer: the next generation
Cell 144:646–674.

https://doi.org/10.1016/j.cell.2011.02.013
- PubMed
- Google Scholar
1. Hanahan D
(2022) Hallmarks of cancer: new dimensions
Cancer Discovery 12:31–46.

https://doi.org/10.1158/2159-8290.CD-21-1059
- PubMed
- Google Scholar
1. Hodis E
2. Torlai Triglia E
3. Kwon JYH
4. Biancalani T
5. Zakka LR
6. Parkar S
7. Hütter J-C
8. Buffoni L
9. Delorey TM
10. Phillips D
11. Dionne D
12. Nguyen LT
13. Schapiro D
14. Maliga Z
15. Jacobson CA
16. Hendel A
17. Rozenblatt-Rosen O
18. Mihm MC Jr
19. Garraway LA
20. Regev A
(2022) Stepwise-edited, human melanoma models reveal mutations’ effect on tumor and microenvironment
Science 376:eabi8175.

https://doi.org/10.1126/science.abi8175
- PubMed
- Google Scholar
1. Kandoth C
2. McLellan MD
3. Vandin F
4. Ye K
5. Niu B
6. Lu C
7. Xie M
8. Zhang Q
9. McMichael JF
10. Wyczalkowski MA
11. Leiserson MDM
12. Miller CA
13. Welch JS
14. Walter MJ
15. Wendl MC
16. Ley TJ
17. Wilson RK
18. Raphael BJ
19. Ding L
(2013) Mutational landscape and significance across 12 major cancer types
Nature 502:333–339.

https://doi.org/10.1038/nature12634
- PubMed
- Google Scholar
1. Lagou V
2. Jiang L
3. Ulrich A
4. Zudina L
5. González KSG
6. Balkhiyarova Z
7. Faggian A
8. Maina JG
9. Chen S
10. Todorov PV
11. Sharapov S
12. David A
13. Marullo L
14. Mägi R
15. Rujan R-M
16. Ahlqvist E
17. Thorleifsson G
18. Gao Η
19. Εvangelou Ε
20. Benyamin B
21. Scott RA
22. Isaacs A
23. Zhao JH
24. Willems SM
25. Johnson T
26. Gieger C
27. Grallert H
28. Meisinger C
29. Müller-Nurasyid M
30. Strawbridge RJ
31. Goel A
32. Rybin D
33. Albrecht E
34. Jackson AU
35. Stringham HM
36. Corrêa IR Jr
37. Farber-Eger E
38. Steinthorsdottir V
39. Uitterlinden AG
40. Munroe PB
41. Brown MJ
42. Schmidberger J
43. Holmen O
44. Thorand B
45. Hveem K
46. Wilsgaard T
47. Mohlke KL
48. Wang Z
49. GWA-PA Consortium
50. Shmeliov A
51. den Hoed M
52. Loos RJF
53. Kratzer W
54. Haenle M
55. Koenig W
56. Boehm BO
57. Tan TM
58. Tomas A
59. Salem V
60. Barroso I
61. Tuomilehto J
62. Boehnke M
63. Florez JC
64. Hamsten A
65. Watkins H
66. Njølstad I
67. Wichmann H-E
68. Caulfield MJ
69. Khaw K-T
70. van Duijn CM
71. Hofman A
72. Wareham NJ
73. Langenberg C
74. Whitfield JB
75. Martin NG
76. Montgomery G
77. Scapoli C
78. Tzoulaki I
79. Elliott P
80. Thorsteinsdottir U
81. Stefansson K
82. Brittain EL
83. McCarthy MI
84. Froguel P
85. Sexton PM
86. Wootten D
87. Groop L
88. Dupuis J
89. Meigs JB
90. Deganutti G
91. Demirkan A
92. Pers TH
93. Reynolds CA
94. Aulchenko YS
95. Kaakinen MA
96. Jones B
97. Prokopenko I
98. Meta-Analysis of Glucose and Insulin-Related Traits Consortium (MAGIC)
(2023) GWAS of random glucose in 476,326 individuals provide insights into diabetes pathophysiology, complications and treatment stratification
Nature Genetics 55:1448–1461.

https://doi.org/10.1038/s41588-023-01462-3
- PubMed
- Google Scholar
1. Lawrence MS
2. Stojanov P
3. Polak P
4. Kryukov GV
5. Cibulskis K
6. Sivachenko A
7. Carter SL
8. Stewart C
9. Mermel CH
10. Roberts SA
11. Kiezun A
12. Hammerman PS
13. McKenna A
14. Drier Y
15. Zou L
16. Ramos AH
17. Pugh TJ
18. Stransky N
19. Helman E
20. Kim J
21. Sougnez C
22. Ambrogio L
23. Nickerson E
24. Shefler E
25. Cortés ML
26. Auclair D
27. Saksena G
28. Voet D
29. Noble M
30. DiCara D
31. Lin P
32. Lichtenstein L
33. Heiman DI
34. Fennell T
35. Imielinski M
36. Hernandez B
37. Hodis E
38. Baca S
39. Dulak AM
40. Lohr J
41. Landau D-A
42. Wu CJ
43. Melendez-Zajgla J
44. Hidalgo-Miranda A
45. Koren A
46. McCarroll SA
47. Mora J
48. Crompton B
49. Onofrio R
50. Parkin M
51. Winckler W
52. Ardlie K
53. Gabriel SB
54. Roberts CWM
55. Biegel JA
56. Stegmaier K
57. Bass AJ
58. Garraway LA
59. Meyerson M
60. Golub TR
61. Gordenin DA
62. Sunyaev S
63. Lander ES
64. Getz G
(2013) Mutational heterogeneity in cancer and the search for new cancer-associated genes
Nature 499:214–218.

https://doi.org/10.1038/nature12213
- PubMed
- Google Scholar
1. Li WH
2. Wu CI
3. Luo CC
(1985) A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes
Molecular Biology and Evolution 2:150–174.

https://doi.org/10.1093/oxfordjournals.molbev.a040343
- PubMed
- Google Scholar
1. Lin J
2. Zhan G
3. Liu J
4. Maimaitiyiming Y
5. Deng Z
6. Li B
7. Su K
8. Chen J
9. Sun S
10. Zheng W
11. Yu X
12. He F
13. Cheng X
14. Wang L
15. Shen B
16. Yao Z
17. Yang X
18. Zhang J
19. He W
20. Wu H
21. Naranmandura H
22. Chang KJ
23. Min J
24. Ma J
25. Björklund M
26. Xu PF
27. Wang F
28. Hsu CH
(2023) YTHDF2-mediated regulations bifurcate BHPF-induced programmed cell deaths
National Science Review 10:wad227.

https://doi.org/10.1093/nsr/nwad227
- PubMed
- Google Scholar
1. Liu Y
2. Wang D
3. Li Z
4. Li X
5. Jin M
6. Jia N
7. Cui X
8. Hu G
9. Tang T
10. Yu Q
(2022) Pan-cancer analysis on the role of PIK3R1 and PIK3R2 in human tumors
Scientific Reports 12:5924.

https://doi.org/10.1038/s41598-022-09889-0
- Google Scholar
1. Martincorena I
2. Raine KM
3. Gerstung M
4. Dawson KJ
5. Haase K
6. Van Loo P
7. Davies H
8. Stratton MR
9. Campbell PJ
(2017) Universal patterns of selection in cancer and somatic tissues
Cell 171:1029–1041.

https://doi.org/10.1016/j.cell.2017.09.042
- PubMed
- Google Scholar
(2020) A compendium of mutational cancer driver genes
Nature Reviews. Cancer 20:555–572.

https://doi.org/10.1038/s41568-020-0290-x
- PubMed
- Google Scholar
1. Meyer D
2. Kames J
3. Bar H
4. Komar AA
5. Alexaki A
6. Ibla J
7. Hunt RC
8. Santana-Quintero LV
9. Golikov A
10. DiCuccio M
11. Kimchi-Sarfaty C
(2021) Distinct signatures of codon and codon pair usage in 32 primary tumor types in the novel database CancerCoCoPUTs for cancer-specific codon usage
Genome Medicine 13:122.

https://doi.org/10.1186/s13073-021-00935-6
- PubMed
- Google Scholar
(2016) OncodriveFML: a general framework to identify coding and non-coding regions with cancer driver mutations
Genome Biology 17:128.

https://doi.org/10.1186/s13059-016-0994-0
- PubMed
- Google Scholar
1. Najumudeen AK
2. Ceteci F
3. Fey SK
4. Hamm G
5. Steven RT
6. Hall H
7. Nikula CJ
8. Dexter A
9. Murta T
10. Race AM
11. Sumpton D
12. Vlahov N
13. Gay DM
14. Knight JRP
15. Jackstadt R
16. Leach JDG
17. Ridgway RA
18. Johnson ER
19. Nixon C
20. Hedley A
21. Gilroy K
22. Clark W
23. Malla SB
24. Dunne PD
25. Rodriguez-Blanco G
26. Critchlow SE
27. Mrowinska A
28. Malviya G
29. Solovyev D
30. Brown G
31. Lewis DY
32. Mackay GM
33. Strathdee D
34. Tardito S
35. Gottlieb E
36. CRUK Rosetta Grand Challenge Consortium
37. Takats Z
38. Barry ST
39. Goodwin RJA
40. Bunch J
41. Bushell M
42. Campbell AD
43. Sansom OJ
(2021) The amino acid transporter SLC7A5 is required for efficient growth of KRAS-mutant colorectal cancer
Nature Genetics 53:16–26.

https://doi.org/10.1038/s41588-020-00753-3
- PubMed
- Google Scholar
1. Nei M
2. Gojobori T
(1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions
Molecular Biology and Evolution 3:418–426.

https://doi.org/10.1093/oxfordjournals.molbev.a040410
- PubMed
- Google Scholar
1. Nik-Zainal S
2. Alexandrov LB
3. Wedge DC
4. Van Loo P
5. Greenman CD
6. Raine K
7. Jones D
8. Hinton J
9. Marshall J
10. Stebbings LA
11. Menzies A
12. Martin S
13. Leung K
14. Chen L
15. Leroy C
16. Ramakrishna M
17. Rance R
18. Lau KW
19. Mudie LJ
20. Varela I
21. McBride DJ
22. Bignell GR
23. Cooke SL
24. Shlien A
25. Gamble J
26. Whitmore I
27. Maddison M
28. Tarpey PS
29. Davies HR
30. Papaemmanuil E
31. Stephens PJ
32. McLaren S
33. Butler AP
34. Teague JW
35. Jönsson G
36. Garber JE
37. Silver D
38. Miron P
39. Fatima A
40. Boyault S
41. Langerød A
42. Tutt A
43. Martens JWM
44. Aparicio SAJR
45. Borg Å
46. Salomon AV
47. Thomas G
48. Børresen-Dale A-L
49. Richardson AL
50. Neuberger MS
51. Futreal PA
52. Campbell PJ
53. Stratton MR
54. Breast Cancer Working Group of the International Cancer Genome Consortium
(2012) Mutational processes molding the genomes of 21 breast cancers
Cell 149:979–993.

https://doi.org/10.1016/j.cell.2012.04.024
- PubMed
- Google Scholar
1. Ortmann CA
2. Kent DG
3. Nangalia J
4. Silber Y
5. Wedge DC
6. Grinfeld J
7. Baxter EJ
8. Massie CE
9. Papaemmanuil E
10. Menon S
11. Godfrey AL
12. Dimitropoulou D
13. Guglielmelli P
14. Bellosillo B
15. Besses C
16. Döhner K
17. Harrison CN
18. Vassiliou GS
19. Vannucchi A
20. Campbell PJ
21. Green AR
(2015) Effect of mutation order on myeloproliferative neoplasms
The New England Journal of Medicine 372:601–612.

https://doi.org/10.1056/NEJMoa1412098
- PubMed
- Google Scholar
1. Pan Y
2. Zhang C
3. Lu Y
4. Ning Z
5. Lu D
6. Gao Y
7. Zhao X
8. Yang Y
9. Guan Y
10. Mamatyusupu D
11. Xu S
(2022) Genomic diversity and post-admixture adaptation in the Uyghurs
National Science Review 9:wab124.

https://doi.org/10.1093/nsr/nwab124
- PubMed
- Google Scholar
1. Passaro A
2. Leighl N
3. Blackhall F
4. Popat S
5. Kerr K
6. Ahn MJ
7. Arcila ME
8. Arrieta O
9. Planchard D
10. de Marinis F
11. Dingemans AM
12. Dziadziuszko R
13. Faivre-Finn C
14. Feldman J
15. Felip E
16. Curigliano G
17. Herbst R
18. Jänne PA
19. John T
20. Mitsudomi T
21. Mok T
22. Normanno N
23. Paz-Ares L
24. Ramalingam S
25. Sequist L
26. Vansteenkiste J
27. Wistuba II
28. Wolf J
29. Wu YL
30. Yang SR
31. Yang JCH
32. Yatabe Y
33. Pentheroudakis G
34. Peters S
(2022) ESMO expert consensus statements on the management of EGFR mutant non-small-cell lung cancer
Annals of Oncology 33:466–487.

https://doi.org/10.1016/j.annonc.2022.02.003
- PubMed
- Google Scholar
1. Porta-Pardo E
2. Godzik A
(2014) e-Driver: a novel method to identify protein regions driving cancer
Bioinformatics 30:3109–3114.

https://doi.org/10.1093/bioinformatics/btu499
- PubMed
- Google Scholar
1. Reimand J
2. Bader GD
(2013) Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers
Molecular Systems Biology 9:637.

https://doi.org/10.1038/msb.2012.68
- PubMed
- Google Scholar
1. Roberts SA
2. Gordenin DA
(2014) Hypermutation in human cancer genomes: footprints and mechanisms
Nature Reviews. Cancer 14:786–800.

https://doi.org/10.1038/nrc3816
- PubMed
- Google Scholar
1. Ruan Y
2. Wen H
3. Hou M
4. He Z
5. Lu X
6. Xue Y
7. He X
8. Zhang YP
9. Wu CI
(2022) The twin-beginnings of COVID-19 in Asia and Europe-one prevails quickly
National Science Review 9:wab223.

https://doi.org/10.1093/nsr/nwab223
- PubMed
- Google Scholar
1. Ruan Y
2. Wen H
3. Hou M
4. Zhai W
5. Xu S
6. Lu X
(2023) On the epicenter of COVID-19 and the origin of the pandemic strain
National Science Review 10:wac286.

https://doi.org/10.1093/nsr/nwac286
- PubMed
- Google Scholar
1. Sherman MA
2. Yaari AU
3. Priebe O
4. Dietlein F
5. Loh PR
6. Berger B
(2022) Genome-wide mapping of somatic mutation rates uncovers drivers of cancer
Nature Biotechnology 40:1634–1643.

https://doi.org/10.1038/s41587-022-01353-8
- PubMed
- Google Scholar
1. Sondka Z
2. Bamford S
3. Cole CG
4. Ward SA
5. Dunham I
6. Forbes SA
(2018) The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers
Nature Reviews. Cancer 18:696–705.

https://doi.org/10.1038/s41568-018-0060-1
- PubMed
- Google Scholar
1. Sun S
2. Wang Y
3. Maslov AY
4. Dong X
5. Vijg J
(2022) SomaMutDB: a database of somatic mutations in normal human tissues
Nucleic Acids Research 50:D1100–D1108.

https://doi.org/10.1093/nar/gkab914
- PubMed
- Google Scholar
1. Suzuki K
2. Hatzikotoulas K
3. Southam L
4. Taylor HJ
5. Yin X
6. Lorenz KM
7. Mandla R
8. Huerta-Chagoya A
9. Melloni GEM
10. Kanoni S
11. Rayner NW
12. Bocher O
13. Arruda AL
14. Sonehara K
15. Namba S
16. Lee SSK
17. Preuss MH
18. Petty LE
19. Schroeder P
20. Vanderwerff B
21. Kals M
22. Bragg F
23. Lin K
24. Guo X
25. Zhang W
26. Yao J
27. Kim YJ
28. Graff M
29. Takeuchi F
30. Nano J
31. Lamri A
32. Nakatochi M
33. Moon S
34. Scott RA
35. Cook JP
36. Lee J-J
37. Pan I
38. Taliun D
39. Parra EJ
40. Chai J-F
41. Bielak LF
42. Tabara Y
43. Hai Y
44. Thorleifsson G
45. Grarup N
46. Sofer T
47. Wuttke M
48. Sarnowski C
49. Gieger C
50. Nousome D
51. Trompet S
52. Kwak S-H
53. Long J
54. Sun M
55. Tong L
56. Chen W-M
57. Nongmaithem SS
58. Noordam R
59. Lim VJY
60. Tam CHT
61. Joo YY
62. Chen C-H
63. Raffield LM
64. Prins BP
65. Nicolas A
66. Yanek LR
67. Chen G
68. Brody JA
69. Kabagambe E
70. An P
71. Xiang AH
72. Choi HS
73. Cade BE
74. Tan J
75. Broadaway KA
76. Williamson A
77. Kamali Z
78. Cui J
79. Thangam M
80. Adair LS
81. Adeyemo A
82. Aguilar-Salinas CA
83. Ahluwalia TS
84. Anand SS
85. Bertoni A
86. Bork-Jensen J
87. Brandslund I
88. Buchanan TA
89. Burant CF
90. Butterworth AS
91. Canouil M
92. Chan JCN
93. Chang L-C
94. Chee M-L
95. Chen J
96. Chen S-H
97. Chen Y-T
98. Chen Z
99. Chuang L-M
100. Cushman M
101. Danesh J
102. Das SK
103. de Silva HJ
104. Dedoussis G
105. Dimitrov L
106. Doumatey AP
107. Du S
108. Duan Q
109. Eckardt K-U
110. Emery LS
111. Evans DS
112. Evans MK
113. Fischer K
114. Floyd JS
115. Ford I
116. Franco OH
117. Frayling TM
118. Freedman BI
119. Genter P
120. Gerstein HC
121. Giedraitis V
122. González-Villalpando C
123. González-Villalpando ME
124. Gordon-Larsen P
125. Gross M
126. Guare LA
127. Hackinger S
128. Hakaste L
129. Han S
130. Hattersley AT
131. Herder C
132. Horikoshi M
133. Howard A-G
134. Hsueh W
135. Huang M
136. Huang W
137. Hung Y-J
138. Hwang MY
139. Hwu C-M
140. Ichihara S
141. Ikram MA
142. Ingelsson M
143. Islam MT
144. Isono M
145. Jang H-M
146. Jasmine F
147. Jiang G
148. Jonas JB
149. Jørgensen T
150. Kamanu FK
151. Kandeel FR
152. Kasturiratne A
153. Katsuya T
154. Kaur V
155. Kawaguchi T
156. Keaton JM
157. Kho AN
158. Khor C-C
159. Kibriya MG
160. Kim D-H
161. Kronenberg F
162. Kuusisto J
163. Läll K
164. Lange LA
165. Lee KM
166. Lee M-S
167. Lee NR
168. Leong A
169. Li L
170. Li Y
171. Li-Gao R
172. Ligthart S
173. Lindgren CM
174. Linneberg A
175. Liu C-T
176. Liu J
177. Locke AE
178. Louie T
179. Luan J
180. Luk AO
181. Luo X
182. Lv J
183. Lynch JA
184. Lyssenko V
185. Maeda S
186. Mamakou V
187. Mansuri SR
188. Matsuda K
189. Meitinger T
190. Melander O
191. Metspalu A
192. Mo H
193. Morris AD
194. Moura FA
195. Nadler JL
196. Nalls MA
197. Nayak U
198. Ntalla I
199. Okada Y
200. Orozco L
201. Patel SR
202. Patil S
203. Pei P
204. Pereira MA
205. Peters A
206. Pirie FJ
207. Polikowsky HG
208. Porneala B
209. Prasad G
210. Rasmussen-Torvik LJ
211. Reiner AP
212. Roden M
213. Rohde R
214. Roll K
215. Sabanayagam C
216. Sandow K
217. Sankareswaran A
218. Sattar N
219. Schönherr S
220. Shahriar M
221. Shen B
222. Shi J
223. Shin DM
224. Shojima N
225. Smith JA
226. So WY
227. Stančáková A
228. Steinthorsdottir V
229. Stilp AM
230. Strauch K
231. Taylor KD
232. Thorand B
233. Thorsteinsdottir U
234. Tomlinson B
235. Tran TC
236. Tsai F-J
237. Tuomilehto J
238. Tusie-Luna T
239. Udler MS
240. Valladares-Salgado A
241. van Dam RM
242. van Klinken JB
243. Varma R
244. Wacher-Rodarte N
245. Wheeler E
246. Wickremasinghe AR
247. van Dijk KW
248. Witte DR
249. Yajnik CS
250. Yamamoto K
251. Yamamoto K
252. Yoon K
253. Yu C
254. Yuan J-M
255. Yusuf S
256. Zawistowski M
257. Zhang L
258. Zheng W
259. VA Million Veteran Program
260. Raffel LJ
261. Igase M
262. Ipp E
263. Redline S
264. Cho YS
265. Lind L
266. Province MA
267. Fornage M
268. Hanis CL
269. Ingelsson E
270. Zonderman AB
271. Psaty BM
272. Wang Y-X
273. Rotimi CN
274. Becker DM
275. Matsuda F
276. Liu Y
277. Yokota M
278. Kardia SLR
279. Peyser PA
280. Pankow JS
281. Engert JC
282. Bonnefond A
283. Froguel P
284. Wilson JG
285. Sheu WHH
286. Wu J-Y
287. Hayes MG
288. Ma RCW
289. Wong T-Y
290. Mook-Kanamori DO
291. Tuomi T
292. Chandak GR
293. Collins FS
294. Bharadwaj D
295. Paré G
296. Sale MM
297. Ahsan H
298. Motala AA
299. Shu X-O
300. Park K-S
301. Jukema JW
302. Cruz M
303. Chen Y-DI
304. Rich SS
305. McKean-Cowdin R
306. Grallert H
307. Cheng C-Y
308. Ghanbari M
309. Tai E-S
310. Dupuis J
311. Kato N
312. Laakso M
313. Köttgen A
314. Koh W-P
315. Bowden DW
316. Palmer CNA
317. Kooner JS
318. Kooperberg C
319. Liu S
320. North KE
321. Saleheen D
322. Hansen T
323. Pedersen O
324. Wareham NJ
325. Lee J
326. Kim B-J
327. Millwood IY
328. Walters RG
329. Stefansson K
330. Ahlqvist E
331. Goodarzi MO
332. Mohlke KL
333. Langenberg C
334. Haiman CA
335. Loos RJF
336. Florez JC
337. Rader DJ
338. Ritchie MD
339. Zöllner S
340. Mägi R
341. Marston NA
342. Ruff CT
343. van Heel DA
344. Finer S
345. Denny JC
346. Yamauchi T
347. Kadowaki T
348. Chambers JC
349. Ng MCY
350. Sim X
351. Below JE
352. Tsao PS
353. Chang K-M
354. McCarthy MI
355. Meigs JB
356. Mahajan A
357. Spracklen CN
358. Mercader JM
359. Boehnke M
360. Rotter JI
361. Vujkovic M
362. Voight BF
363. Morris AP
364. Zeggini E
(2024) Genetic drivers of heterogeneity in type 2 diabetes pathophysiology
Nature 627:347–357.

https://doi.org/10.1038/s41586-024-07019-6
- PubMed
- Google Scholar
1. Takeda H
2. Wei Z
3. Koso H
4. Rust AG
5. Yew CCK
6. Mann MB
7. Ward JM
8. Adams DJ
9. Copeland NG
10. Jenkins NA
(2015) Transposon mutagenesis identifies genes and evolutionary forces driving gastrointestinal tract tumor progression
Nature Genetics 47:142–150.

https://doi.org/10.1038/ng.3175
- PubMed
- Google Scholar
1. Tang H
2. Wyckoff GJ
3. Lu J
4. Wu CI
(2004) A universal evolutionary index for amino acid changes
Molecular Biology and Evolution 21:1548–1556.

https://doi.org/10.1093/molbev/msh158
- PubMed
- Google Scholar
1. Tate JG
2. Bamford S
3. Jubb HC
4. Sondka Z
5. Beare DM
6. Bindal N
7. Boutselakis H
8. Cole CG
9. Creatore C
10. Dawson E
11. Fish P
12. Harsha B
13. Hathaway C
14. Jupe SC
15. Kok CY
16. Noble K
17. Ponting L
18. Ramshaw CC
19. Rye CE
20. Speedy HE
21. Stefancsik R
22. Thompson SL
23. Wang S
24. Ward S
25. Campbell PJ
26. Forbes SA
(2019) COSMIC: the catalogue of somatic mutations in cancer
Nucleic Acids Research 47:D941–D947.

https://doi.org/10.1093/nar/gky1015
- PubMed
- Google Scholar
(2019) The opposing roles of PIK3R1/p85α and PIK3R2/p85β in cancer
Trends in Cancer 5:233–244.

https://doi.org/10.1016/j.trecan.2019.02.009
- PubMed
- Google Scholar
(2013) Cancer genome landscapes
Science 339:1546–1558.

https://doi.org/10.1126/science.1235122
- PubMed
- Google Scholar
1. Vujkovic M
2. Keaton JM
3. Lynch JA
4. Miller DR
5. Zhou J
6. Tcheandjieu C
7. Huffman JE
8. Assimes TL
9. Lorenz K
10. Zhu X
11. Hilliard AT
12. Judy RL
13. Huang J
14. Lee KM
15. Klarin D
16. Pyarajan S
17. Danesh J
18. Melander O
19. Rasheed A
20. Mallick NH
21. Hameed S
22. Qureshi IH
23. Afzal MN
24. Malik U
25. Jalal A
26. Abbas S
27. Sheng X
28. Gao L
29. Kaestner KH
30. Susztak K
31. Sun YV
32. DuVall SL
33. Cho K
34. Lee JS
35. Gaziano JM
36. Phillips LS
37. Meigs JB
38. Reaven PD
39. Wilson PW
40. Edwards TL
41. Rader DJ
42. Damrauer SM
43. O’Donnell CJ
44. Tsao PS
45. HPAP Consortium
46. Regeneron Genetics Center
47. VA Million Veteran Program
48. Chang K-M
49. Voight BF
50. Saleheen D
(2020) Discovery of 318 new risk loci for type 2 diabetes and related vascular outcomes among 1.4 million participants in a multi-ancestry meta-analysis
Nature Genetics 52:680–691.

https://doi.org/10.1038/s41588-020-0637-y
- PubMed
- Google Scholar
(2022) Targeting mutations in cancer
The Journal of Clinical Investigation 132:e154943.

https://doi.org/10.1172/JCI154943
- PubMed
- Google Scholar
1. Wang X
2. He Z
3. Guo Z
4. Yang M
5. Xu S
6. Chen Q
7. Shao S
8. Li S
9. Zhong C
10. Duke NC
11. Shi S
(2022) Extensive gene flow in secondary sympatry after allopatric speciation
National Science Review 9:wac280.

https://doi.org/10.1093/nsr/nwac280
- Google Scholar
(2013) The cancer genome atlas pan-cancer analysis project
Nature Genetics 45:1113–1120.

https://doi.org/10.1038/ng.2764
- PubMed
- Google Scholar
1. Wu CI
2. Ting CT
(2004) Genes and speciation
Nature Reviews. Genetics 5:114–122.

https://doi.org/10.1038/nrg1269
- PubMed
- Google Scholar
1. Wu CI
2. Wang HY
3. Ling S
4. Lu X
(2016) The ecology and evolution of cancer: the ultra-microevolutionary process
Annual Review of Genetics 50:347–369.

https://doi.org/10.1146/annurev-genet-112414-054842
- PubMed
- Google Scholar
1. Wu C-I
(2022) What are species and how are they formed?
National Science Review 9:nwad017.

https://doi.org/10.1093/nsr/nwad017
- PubMed
- Google Scholar
1. Wu CI
(2023) The genetics of race differentiation-should it be studied?
National Science Review 10:wad068.

https://doi.org/10.1093/nsr/nwad068
- PubMed
- Google Scholar
1. Xue D
2. Narisu N
3. Taylor DL
4. Zhang M
5. Grenko C
6. Taylor HJ
7. Yan T
8. Tang X
9. Sinha N
10. Zhu J
11. Vandana JJ
12. Nok Chong AC
13. Lee A
14. Mansell EC
15. Swift AJ
16. Erdos MR
17. Zhong A
18. Bonnycastle LL
19. Zhou T
20. Chen S
21. Collins FS
(2023) Functional interrogation of twenty type 2 diabetes-associated genes using isogenic human embryonic stem cell-derived β-like cells
Cell Metabolism 35:1897–1914.

https://doi.org/10.1016/j.cmet.2023.09.013
- PubMed
- Google Scholar
1. Yang Z
2. Swanson WJ
(2002) Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes
Molecular Biology and Evolution 19:49–57.

https://doi.org/10.1093/oxfordjournals.molbev.a003981
- PubMed
- Google Scholar
1. Yang Z
2. Ro S
3. Rannala B
(2003) Likelihood models of somatic mutation and codon substitution in cancer genes
Genetics 165:695–705.

https://doi.org/10.1093/genetics/165.2.695
- PubMed
- Google Scholar
1. Zhai W
2. Lai H
3. Kaya NA
4. Chen J
5. Yang H
6. Lu B
7. Lim JQ
8. Ma S
9. Chew SC
10. Chua KP
11. Alvarez JJS
12. Chen PJ
13. Chang MM
14. Wu L
15. Goh BKP
16. Chung AY-F
17. Chan CY
18. Cheow PC
19. Lee SY
20. Kam JH
21. Kow AW-C
22. Ganpathi IS
23. Chanwat R
24. Thammasiri J
25. Yoong BK
26. Ong DB-L
27. de Villa VH
28. Dela Cruz RD
29. Loh TJ
30. Wan WK
31. Zeng Z
32. Skanderup AJ
33. Pang YH
34. Madhavan K
35. Lim TK-H
36. Bonney G
37. Leow WQ
38. Chew V
39. Dan YY
40. Tam WL
41. Toh HC
42. Foo RS-Y
43. Chow PK-H
(2022) Dynamic phenotypic heterogeneity and the evolution of multiple RNA subtypes in hepatocellular carcinoma: the PLANET study
National Science Review 9:wab192.

https://doi.org/10.1093/nsr/nwab192
- Google Scholar
Software
1. Zhang L
(2024) CDN_V1, version swh:1:rev:967361fff2b70ae2a39360e5546c18710dc3700f
Software Heritage.

https://archive.softwareheritage.org/swh:1:dir:537fa75d5dbe96ca6724820877ba5255b2d9cac3;origin=https://gitlab.com/ultramicroevo/cdn_v1;visit=swh:1:snp:f4700c8f857c51a5745c5f3ef4b6c6dbddc3b4c0;anchor=swh:1:rev:967361fff2b70ae2a39360e5546c18710dc3700f
1. Zhang L
2. Deng T
3. Liufu Z
4. Liu X
5. Chen B
6. Hu Z
7. Liu C
8. Lu X
9. Wen H
10. Wu CI
(2024) The theory of massively repeated evolution and full identifications of cancer-driving nucleotides (CDNs)
eLife 13:e99340.

https://doi.org/10.7554/eLife.99340
- Google Scholar
1. Zhu H
2. Lin Y
3. Lu D
4. Wang S
5. Liu Y
6. Dong L
7. Meng Q
8. Gao J
9. Wang Y
10. Song N
11. Suo Y
12. Ding L
13. Wang P
14. Zhang B
15. Gao D
16. Fan J
17. Gao Q
18. Zhou H
(2023) Proteomics of adjacent-to-tumor samples uncovers clinically relevant biological events in hepatocellular carcinoma
National Science Review 10:wad167.

https://doi.org/10.1093/nsr/nwad167
- PubMed
- Google Scholar

Article and author information

Author details

Lingjie Zhang

State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China

Contribution
Conceptualization, Data curation, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing – original draft

Competing interests
No competing interests declared

ORCID iD: 0000-0002-6506-4457
Tong Deng

State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China

Contribution
Validation, Visualization

Competing interests
No competing interests declared
Zhongqi Liufu
1. State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
2. Center for Excellence in Animal Evolution and Genetics, The Chinese Academy of Sciences, Kunming, China
Contribution
Data curation, Validation

Competing interests
No competing interests declared
Xiangnyu Chen

State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China

Contribution
Data curation, Validation

Competing interests
No competing interests declared

ORCID iD: 0000-0001-5078-8906
Shijie Wu

State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China

Contribution
Data curation, Validation

Competing interests
No competing interests declared
Xueyu Liu

State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China

Contribution
Data curation, Validation, Visualization

Competing interests
No competing interests declared
Changhao Shi

State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China

Contribution
Validation, Visualization

Competing interests
No competing interests declared
Bingjie Chen
1. State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
2. GMU-GIBH Joint School of Life Sciences, Guangzhou Medical University, Guangzhou, China
Contribution
Data curation, Validation

Competing interests
No competing interests declared
Zheng Hu

CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

Contribution
Resources, Investigation, Project administration

Competing interests
No competing interests declared

ORCID iD: 0000-0003-1552-0060
Qichun Cai

Cancer Center, Clifford Hospital, Jinan University, Guangzhou, China

Contribution
Resources, Validation, Investigation

Competing interests
No competing interests declared
Chenli Liu

CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

Contribution
Resources, Supervision, Investigation, Project administration

Competing interests
No competing interests declared
Mengfeng Li

Cancer Research Institute, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China

Contribution
Resources, Supervision, Project administration

Competing interests
No competing interests declared
Miles E Tracy

State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China

Contribution
Writing – review and editing

Competing interests
No competing interests declared
Xuemei Lu

Center for Excellence in Animal Evolution and Genetics, The Chinese Academy of Sciences, Kunming, China

Contribution
Conceptualization, Supervision, Investigation, Project administration

Competing interests
No competing interests declared

ORCID iD: 0000-0001-6044-6002
Chung-I Wu
1. State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
2. Department of Ecology and Evolution, University of Chicago, Chicago, United States
Contribution
Conceptualization, Resources, Supervision, Funding acquisition, Validation, Investigation, Project administration, Writing – review and editing

For correspondence
ciwu@uchicago.edu

Competing interests
No competing interests declared

ORCID iD: 0000-0001-7263-4238
Hai-Jun Wen

State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China

Contribution
Funding acquisition, Validation, Investigation, Methodology

For correspondence
wenhj5@mail.sysu.edu.cn

Competing interests
No competing interests declared

ORCID iD: 0000-0001-8676-1254

Funding

National Natural Science Foundation of China (32150006)

Chung-I Wu

Guangdong Key R&D Project of China (2022B1111030001)

Hai-Jun Wen

National Natural Science Foundation of China (32293193)

Chung-I Wu

National Natural Science Foundation of China (32293190)

Chung-I Wu

National Natural Science Foundation of China (82341092)

Hai-Jun Wen

National Key Research and Development Program of China (2021YFC0863300)

Chung-I Wu

National Key Research and Development Program of China (2021YFC0863400)

Chung-I Wu

National Natural Science Foundation of China (32200493)

Chung-I Wu

National Natural Science Foundation of China (32370659)

Chung-I Wu

Guangdong Basic and Applied Basic Research Foundation (2023A1515010016)

Chung-I Wu

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We wish to acknowledge the support of the First Affiliated Hospital; the Seventh Affiliated Hospital of Sun Yat-sen University; the Cancer Center of Clifford Hospital, Jinan University; Cancer Hospital Chinese Academy of Medical Sciences, Shenzhen Center; and Guangdong Academy of Medical Sciences, Guangdong Provincial People’s Hospital for the startup of the Cancer Driving Nucleotide (CDN) project. We thank the Kunming Institute of Zoology for discussions of the ideas behind CDNs. We thank Weiwei Zhai, Qianfei Wang, and Weini Huang for insightful comments and suggestions. We also acknowledge the American Association for Cancer Research (AACR) and The Cancer Genome Atlas (TCGA) project, which have provided invaluable datasets and resources that have significantly enriched our understanding of cancer biology and improved patient outcomes. This work was supported by the National Natural Science Foundation of China (32150006, 32293193, 32293190, 32370659, and 32200493 to CIW; 82341092 to HJW), the National Key Research and Development Projects of the Ministry of Science and Technology of China (2021YFC0863300, 2021YFC0863400), the Guangdong Key Research and Development Program (no. 2022B1111030001), and the Guangdong Basic and Applied Basic Research Foundation (no. 2023A1515010016).

Version history

Sent for peer review: May 29, 2024
Preprint posted: June 2, 2024
Reviewed Preprint version 1: September 4, 2024
Reviewed Preprint version 2: October 25, 2024
Version of Record published: December 17, 2024

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.99341. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.