Research Article

Heterogeneity of the GFP fitness landscape and data-driven protein design

Institute of Science and Technology Austria, Austria
Synthetic Biology Group, MRC London Institute of Medical Sciences, United Kingdom
Institute of Clinical Sciences, Faculty of Medicine and Imperial College Centre for Synthetic Biology, Imperial College London, United Kingdom
Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Russian Federation
Department of Chemistry, Center for Structural Biology, Vanderbilt University, United States
Gregor Mendel Institute, Austrian Academy of Sciences, Vienna BioCenter, Austria
Institute for Drug Discovery, Medical School, Leipzig University, Germany
LabGenius, United Kingdom
Evolutionary and Synthetic Biology Unit, Okinawa Institute of Science and Technology Graduate University, Japan

May 5, 2022

Open access

Abstract

Studies of protein fitness landscapes reveal biophysical constraints guiding protein evolution and empower prediction of functional proteins. However, generalisation of these findings is limited by the scarcity of systematic data on fitness landscapes of proteins with a defined evolutionary relationship. We characterized the fitness peaks of four orthologous fluorescent proteins with a broad range of sequence divergence. While two of the four studied fitness peaks were sharp, the other two were considerably flatter, being almost entirely free of epistatic interactions. Mutationally robust proteins, characterized by a flat fitness peak, were not optimal templates for machine-learning-driven protein design; instead, predictions were more accurate for fragile proteins with epistatic landscapes. Our work provides insights for the practical application of fitness landscape heterogeneity in protein engineering.

Editor's evaluation

Using high throughput mutagenesis, this work shows that evolutionary distance between homologous genes is not predictive of how these genes' functions will change in response to similar mutations. This suggests that the starting gene sequence will influence how the synthetic design of new protein functions can occur and also supports a role for conditionality in the natural evolution of protein functions.

https://doi.org/10.7554/eLife.75842.sa0

Introduction

Understanding the relationship between genotype and phenotype, the fitness landscape, elucidates the fundamental laws of heredity (Canale et al., 2018; de Visser and Krug, 2014; Ferretti et al., 2018; Fragata et al., 2019; Wright, 1932) and may ultimately create novel methods of protein design (Alley et al., 2019; Bryant et al., 2021; Hirabayashi and Arai, 2019; Wrenbeck et al., 2017; Wu et al., 2019). The fitness landscape is often conceptualised as a multidimensional surface (de Visser and Krug, 2014; Ferretti et al., 2018; Kondrashov and Kondrashov, 2015; Wright, 1932) with one dimension representing fitness, or another phenotype, and the other dimensions each representing a genotype’s locus. Originally, the fitness landscape was introduced to describe the relationship between fitness and the entire genome (de Visser and Krug, 2014; Wright, 1932). Over time, the usefulness of the concept of the fitness landscape led to the adaptation of this term to describe the relationship between protein function and its protein-coding gene sequence (Biswas et al., 2021; Ogden et al., 2019; Romero and Arnold, 2009; Wittmann et al., 2021; Zheng et al., 2020). Absolute knowledge of the fitness landscape would reveal the phenotypes conferred by any arbitrary genotype (de Visser and Krug, 2014; Ferretti et al., 2018; Fragata et al., 2019), with immense and obvious practical implications (Alley et al., 2019; Bryant et al., 2021; Hirabayashi and Arai, 2019; Kemble et al., 2019; Wrenbeck et al., 2017; Wu et al., 2019). However, sparse experimental data, even for specific genes, and the concomitant lack of understanding of the rules by which fitness landscapes are formed, limit the accuracy of phenotype predictions based on sequence alone (Lässig et al., 2017; but see Bryant et al., 2021; Rocklin et al., 2017; Senior et al., 2020; Wu et al., 2019).

While several experimentally characterized fitness landscapes for specific proteins have been reported (Hartman and Tullman-Ercek, 2019; Jacquier et al., 2013; Kuo et al., 2020; Melamed et al., 2013; Olson et al., 2014; Sarkisyan et al., 2016), such surveys of large proteins are still hindered by the enormity of the genotype space (de Visser and Krug, 2014; Wright, 1932). Even for the Green Fluorescent Protein (GFP), which is only ~250 amino acids long, there are 20²⁵⁰ possible genotypes. Without complex epistatic interactions between amino acid sites the fitness landscape could be deduced from the independent contribution of each amino acid at each site (Kondrashov and Kondrashov, 2015), requiring just 5000 (20*250) measurements of the effects of all single mutations in GFP. However, epistatic interactions between amino acid sites are common (Russ et al., 2020) and many of them are too complex to predict with available data (Pokusaeva et al., 2019). Despite some advances in the development of data-driven approaches to protein design (Biswas et al., 2021; Biswas et al., 2018; Bryant et al., 2021; Kemble et al., 2019), it is still not clear what fraction of the 20²⁵⁰ sequences of the GFP, or any other gene, must be characterized to approach the coveted absolute knowledge of the fitness landscape (Kemble et al., 2019; Sailer et al., 2020; Zhou and McCandlish, 2020).
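The two counts quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes a length of exactly 250 residues and 20 amino acid states per site, as in the text; the variable names are illustrative only.

```python
import math

SEQ_LENGTH = 250      # approximate GFP length used in the text
N_AMINO_ACIDS = 20    # amino acid states available at each site

# Full genotype space: every site can hold any of the 20 amino acids.
n_genotypes = N_AMINO_ACIDS ** SEQ_LENGTH

# Without epistasis, one measurement per amino acid state per site suffices.
n_additive_measurements = N_AMINO_ACIDS * SEQ_LENGTH

# Python integers are arbitrary precision, so log10 works on the exact count.
genotype_order = math.floor(math.log10(n_genotypes))

print(f"genotype space: ~10^{genotype_order} sequences")
print(f"measurements needed under additivity: {n_additive_measurements}")
```

The gap between the two numbers is what makes epistasis so costly: once sites interact, the additive shortcut no longer determines the landscape.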

Despite lack of data, experiments and theory provide some insights on the global fitness landscape (Fragata et al., 2019; Kemble et al., 2019). Each extant genotype, one that is found in an extant species, is a point of high fitness, or a fitness peak, on the highly dimensional and extraordinarily large genotype space (de Visser and Krug, 2014; Fragata et al., 2019; Smith, 1970; Wright, 1932). These extant genotypes had a common ancestor, so they must be connected by ridges of high fitness (Gong et al., 2013; Smith, 1970; Povolotskaya and Kondrashov, 2010). Nevertheless, only an infinitesimally small fraction of all genotypes are functional (fewer than 10^–11), those that correspond to fitness peaks and ridges, and the remaining genotypes confer low fitness (Keefe and Szostak, 2001). The fitness peaks are sharp (Bank et al., 2015; Melamed et al., 2013; Sarkisyan et al., 2016) and the ridges are narrow (Gong et al., 2013; Kumar et al., 2017; Pokusaeva et al., 2019; Sailer et al., 2020) and, on average, only a few random mutations in a wildtype sequence reduce its fitness to zero (Hartman and Tullman-Ercek, 2019; Kemble et al., 2019). The sharpness of the peaks is enhanced by negative epistasis, such that a genotype with several random mutations has a lower fitness than expected if mutations acted independently (Haddox et al., 2018; Sarkisyan et al., 2016). Thus, a random walk from a fitness peak eventually leads to an area of the genotype space where only an infinitesimally small fraction of sequences are functional, likely explaining why accurate prediction of functional genotypes at a substantial distance away from a functional genotype remains a challenge (Alley et al., 2019; Hirabayashi and Arai, 2019; Russ et al., 2020; Wu et al., 2019).

The shape of fitness peaks and ridges, and their distribution in genotype space, has implications for fundamental questions in evolution (de Visser and Krug, 2014) and practical applications (Sardanyés et al., 2008). Evolution starting at a sharp fitness peak is expected to proceed at a different pace than evolution on a flat one (Bershtein et al., 2006; Codoñer et al., 2006; de Visser et al., 2003; Draghi et al., 2010; Wagner, 2008). Furthermore, it has been suggested that flat fitness peaks, representing robust genotypes, may be evolutionarily preferable to sharp peaks, which represent fragile genotypes (Bershtein et al., 2006; de Visser et al., 2003; Draghi et al., 2010; Klug et al., 2019; Zheng et al., 2020). However, how different shapes of fitness peaks may be distributed in genotype space has not been explored (Chan et al., 2017; Kemble et al., 2019). Furthermore, the exploration of the fitness landscape of specific proteins is one of the approaches in protein engineering (Bryant et al., 2021; Romero and Arnold, 2009; Russ et al., 2020; Wittmann et al., 2021). Such studies explore the fitness landscape of the protein of interest through a deep mutational scan of a known protein sequence. This information is then used to predict novel functional protein sequences that are designed by introducing mutations into the original sequence. Here, we explored the interplay of the heterogeneity of fitness peaks of orthologous sequences and prediction of novel functional protein sequences (Figure 1a). To this end, we compared the fitness peaks of four GFPs that had different levels of sequence divergence from each other. We then used this information to accurately predict novel functional GFPs at considerable sequence divergence from any known GFP sequence.

Figure 1 (with 3 supplements)

Comparison of four GFP fitness peaks.

(a) A conceptual representation of the GFP fitness landscape following the visualization proposed by Wright, 1932. The black dotted lines represent the unknown regions of the fitness landscape and the green lines the surveyed local fitness peak. Wildtype GFPs (black +) and the predicted functional GFPs (green +) are shown at an approximate scale of sequence divergence from each other. Between the four wildtype fitness peaks there must also be some unknown but large number of other functional and wildtype GFP sequences. For clarity, we do not draw them in the figure. (b) Amino acid sequence identity between different orthologs, displayed in percent. (c) Distribution of fluorescence of mutant libraries (colour), control wildtype protein sequences (white), and protein sequences containing loss-of-function mutations in the chromophore (black).

Results

To complement the available data on the avGFP fitness peak (Sarkisyan et al., 2016; GFP from Aequorea victoria, Hydrozoa), we experimentally characterized three additional GFP sequences, each with a different degree of sequence divergence from avGFP: amacGFP (Aequorea macrodactyla, Hydrozoa), cgreGFP (Clytia gregaria, Hydrozoa), and ppluGFP2 (Pontellina plumata, Copepoda), with 18%, 59%, and 82% sequence divergence, respectively (Figure 1b; Table 1). For simplicity, we refer to all of these sequences as ‘wildtype’, even though only cgreGFP and ppluGFP2 were identical to the true wildtype sequences, while avGFP and amacGFP contain one and three amino acid substitutions, respectively. amacGFP, ppluGFP2, and cgreGFP were subject to a similar experimental pipeline (Figure 2) as avGFP (Sarkisyan et al., 2016). For each sequence a library of genotypes containing random mutations in the respective GFP sequence was generated by error-prone PCR, in which each GFP gene variant was labelled downstream of its stop codon by a primary barcode, a random combination of nucleotides. This mutant library was expressed as a fusion protein with the red fluorescent protein mKate2 in E. coli cells, which were then sorted based on green fluorescence intensity within a narrow red fluorescence gate, to control for gene expression level and other errors (Figure 2—figure supplement 1). The DNA barcodes of the sorted cells were sequenced and these data were used to perform a statistical analysis estimating the level of fluorescence of tens of thousands of GFP genotypes. Three notable improvements to the original experimental pipeline were implemented: gene sequence-agnostic library sequencing, genome integration of the construct, and use of secondary barcodes that introduced internal replicates in the experiment (see Materials and methods). These changes resulted in more physiologically relevant expression levels, made the pipeline more scalable, and reduced the variance of fluorescent genotype measurements by a factor of 7 (Figure 1c; Table 1). The new dataset contained 25,000–35,000 genotypes for each of the three additional fitness peaks, with each mutant genotype harboring on average 3–4 mutations relative to its respective wildtype sequence (Table 1). These data, together with data from avGFP, were then used in our comparative study of the GFP fitness peaks (Figure 1a).

Figure 2 (with 1 supplement)

Flowthrough of the experimental methodology.

Table 1

The dataset in numbers.

The avGFP data is from Sarkisyan et al., 2016.

Gene	amacGFP	cgreGFP	ppluGFP2	avGFP
Number of protein genotypes surveyed	35,500	26,165	32,260	51,715
Average (median) number of AA substitutions per genotype	4.37 (3)	4.23 (3)	3.7 (2)	3.93 (4)
Average (median) number of barcode replicates per protein genotype	8.7 (5)	6.8 (5)	12 (7)	1.2 (1)
Amino acid identity	avGFP: 82% cgreGFP: 43% ppluGFP2: 17%	avGFP: 41% amacGFP: 43% ppluGFP2: 19%	avGFP: 18% amacGFP: 17% cgreGFP: 19%	amacGFP: 82% cgreGFP: 41% ppluGFP2: 18%
False positive rate*	0.55% (9 of 1635)	0.75% (14 of 1860)	0.49% (11 of 2242)	0.24% (2 of 839)
False negative rate*	0% (0 of 1084)	0% (0 of 1583)	0% (0 of 2744)	0.08% (2 of 2444)
Mean wildtype log10 fluorescence level ± standard deviation	3.97±0.031 (3.96±0.030 for amacGFP:V12L)	4.50±0.028	4.23±0.027	3.72±0.082
Fraction of genotypes in which epistasis cannot be ascertained†	7.4%	15.9%	4.5%	16.5%
Fraction of genotypes displaying |epistasis|>0.3 (>1) ‡	5.3% (0.2%)	14.4% (5.6%)	6.8% (0.9%)	21.4% (11.6%)
Mutational LD50, loss of function §	5.8 (5.7 for amacGFP:V12L)	3.2	6.2	4.1
Mutational LD50, loss of wildtype-level fluorescence level §	1.7 (1.8 for amacGFP:V12L)	0.9	1.7	2.2
Proportion of machine-learning predicted genotypes displaying epistasis <–0.3 (<–1)	78% (46%)	57% (21%)	81% (64%)	NA

* False positive rates refer to the fraction of genotypes which are expected to be dark or dim due to chromophore mutations but which were assigned a bright fitness; false negative rates refer to genotypes encoding wildtype protein which were assigned dim or dark fitnesses.

† Calculation of epistasis requires knowledge of a genotype’s expected fluorescence, i.e. the sum of contributions of individual mutations. For genotypes with multiple mutations, all individual mutations comprising the genotype must have been measured in isolation.

‡ An absolute epistasis value of 0.3 or 1 implies a two-fold or ten-fold difference between the observed and expected fluorescence levels, respectively.

§ “Mutational LD50, loss of function” refers to the number of mutations at which 50% of genotypes are rendered non-functional (i.e. assigned to the darkest FACS gate), obtained by fitting a logistic curve to the fraction of non-functional genotypes at each mutational step (see values in Supplementary file 1) and solving for f(x)=0.5; “Mutational LD50, loss of wildtype-level fluorescence level” refers instead to the number of mutations at which 50% of genotypes maintain a fluorescence level within two standard deviations of the WT level.
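The LD50 definition in the last footnote can be illustrated with a minimal sketch. It assumes a two-parameter logistic, f(x) = 1 / (1 + exp(-k (x - x0))), and fits it by ordinary least squares on logit-transformed fractions. This is a simplification chosen to keep the example dependency-free and is not necessarily the fitting procedure used for Table 1. The function name and the input data are hypothetical.

```python
import math

def mutational_ld50(fraction_nonfunctional):
    """Estimate the mutation count at which half of the genotypes lose function.

    fraction_nonfunctional maps number of mutations -> fraction of
    non-functional genotypes. Fits logit(f) = k * (x - x0) by ordinary
    least squares and returns x0, the point where f(x) = 0.5.
    """
    xs, ys = [], []
    for n_mut, frac in fraction_nonfunctional.items():
        if 0.0 < frac < 1.0:  # the logit is undefined at exactly 0 or 1
            xs.append(n_mut)
            ys.append(math.log(frac / (1.0 - frac)))
    if len(xs) < 2:
        raise ValueError("need at least two points with 0 < fraction < 1")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -intercept / slope  # logit(f) = 0 exactly where f = 0.5

# Synthetic fractions generated from a logistic with midpoint 4.0 (illustration only)
fractions = {n: 1.0 / (1.0 + math.exp(-1.2 * (n - 4.0))) for n in range(1, 9)}
ld50 = mutational_ld50(fractions)
print(f"estimated LD50: {ld50:.2f} mutations")
```

With real data, mutational steps where every or no genotype is functional are skipped by the logit transform, so a nonlinear fit on the raw fractions may be preferable.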

The four fitness peaks shared substantial similarities (also see Biswas et al., 2018 for sfGFP). In all cases, synonymous variants had no measurable effect on fitness, which may be a consequence of the experimental design aimed to be insensitive to expression levels and, thus, they were pooled for all subsequent analyses (Figure 1—figure supplement 1). Mutations in the chromophore eliminated fluorescence (Figure 1—figure supplement 2) and mutations of buried amino acid residues had a stronger effect than mutations of residues on the protein surface (Figure 3b; Figure 1—figure supplement 2). In all four fitness peaks, a threshold effect of accumulating multiple random mutations was found, such that the median level of fluorescence dropped sharply once a certain number of mutations was reached (Figure 3a; Supplementary file 1). However, the fitness peak shape differed substantially among different GFP sequences. Only 3–4 mutations were necessary to reach this threshold for avGFP and cgreGFP, so the corresponding fitness peaks were sharp (Table 1). By contrast, the fitness peaks of amacGFP and ppluGFP2 were substantially flatter, with each tolerating twice as many mutations (Figure 3a). Furthermore, we compared the sharpness of the wildtype fitness peaks with the fitness peaks corresponding to sequences harbouring a single mutation relative to the wildtype sequence. Fitness peaks of most single-mutation neighbours with high levels of fluorescence were sharper than the respective wildtype fitness peaks (Figure 3c), suggesting a local optimization of robustness of each wildtype sequence (Draghi et al., 2010; Zheng et al., 2020). Notably, the shape of the wildtype fitness peak showed no straightforward relationship with its respective level of fluorescence (Table 1), as may have been expected (Johnson et al., 2019).

Figure 3 (with 2 supplements)

Distributions of fluorescence.

(a) Fluorescence level distributions of genotypes at varying distances from the wildtype and the logistic curves fitted to the median fluorescence for each category (black line). (b) Distribution of fluorescence of genotypes with a single amino acid mutation at exposed (colour) versus buried (white) sites. (c) Starting from genotypes with one mutation away from the wildtype sequence, each line represents the median fluorescence as a function of sequence divergence away from such genotypes. Only points with at least 15 available genotypes are shown. The median fluorescence levels at varying distances from the wildtype sequence (black lines) are shown for comparison. (d) The fraction of genotypes without epistasis was calculated as the ratio of the number of observed functional genotypes to the number of genotypes expected to be functional under the assumption of no epistatic interactions between amino acid sites. In the absence of epistasis, the expectation is a constant value of 100% independent of the number of amino acid changes relative to the wildtype.

We compared the fluorescence of each genotype to the expected level under an assumption that each mutation influences fluorescence level independently, that is without any epistasis (Equation 1):

$$\mathrm{epistasis} = \mathrm{Effect}_{\mathrm{observed}} - \mathrm{Effect}_{\mathrm{expected}} = (F_{m} - F_{wt}) - \sum_{i} (F_{i} - F_{wt}) \cdot x_{i} \qquad (1)$$

where F_i, F_m, and F_wt are measured levels of fluorescence of a genotype with a single mutation i, of genotype m, and of the wildtype sequence, respectively, and x_i = 1 when mutation i is contained within the genotype m and x_i = 0 when it is not. We then calculated the fraction of genotypes that do not require epistatic interactions to predict their fluorescence. On all four fitness peaks, genotypes with two mutations away from the wildtype sequence rarely exhibited any epistatic interactions. However, a notable difference between the fitness peaks was observed when considering genotypes with multiple mutations. The level of fluorescence for a vast majority of genotypes with >5 mutations cannot be explained without epistasis in sharp fitness peaks, avGFP and cgreGFP. By contrast, few genotypes with >5 mutations in flat fitness peaks required epistasis to explain their fluorescence level, with amacGFP requiring almost no epistasis at all (Figure 3d, Figure 3—figure supplement 1). Interestingly, the sharpness of the fitness peaks and the concomitant extent of epistatic interactions did not correlate with the sequence divergence between the fitness peaks. Indeed, the two closest sequences (82% identity), derived from the same genus, are the sharp, epistatic avGFP peak and the flat, non-epistatic amacGFP peak (Figure 3d).
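Equation 1 translates directly into code. The following is a minimal sketch that assumes fluorescence values are on the log10 scale (as in Table 1) and that single-mutant measurements are stored in a dictionary keyed by mutation label; the function name, labels, and numbers are hypothetical.

```python
def epistasis(f_genotype, f_wildtype, single_effects, mutations):
    """Observed minus expected effect of a multi-mutant genotype (Equation 1).

    single_effects maps a mutation label to the fluorescence F_i of the
    genotype carrying only that mutation. Returns None when any constituent
    single mutant was not measured, since epistasis cannot then be ascertained.
    """
    if any(m not in single_effects for m in mutations):
        return None
    observed = f_genotype - f_wildtype
    # Summing only over mutations present in the genotype is the x_i = 1 case.
    expected = sum(single_effects[m] - f_wildtype for m in mutations)
    return observed - expected

# Hypothetical log10 fluorescence values, for illustration only
f_wt = 4.0
singles = {"A10V": 3.9, "K52R": 3.8}
e = epistasis(3.2, f_wt, singles, ["A10V", "K52R"])
fold_difference = 10 ** abs(e)

print(f"epistasis = {e:.2f}, fold difference = {fold_difference:.1f}")
```

Here the double mutant is dimmer than the two single effects predict, so the value is negative. On the log10 scale an absolute value of 0.3 corresponds to roughly a two-fold difference between observed and expected fluorescence, matching the thresholds used in Table 1.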

Flat fitness peaks correspond to mutationally robust proteins, those that are capable of withstanding multiple mutations without losing function, while sharp fitness peaks correspond to mutationally fragile ones. The observed differences in mutational robustness of different proteins may be explained by thermodynamic stability (Bershtein et al., 2006; Echave and Wilke, 2017; Gong et al., 2013; Kurahashi et al., 2018; Poelwijk et al., 2019; Sarkisyan et al., 2016). Therefore, we performed an array of assays aimed at the biophysical characterisation of the four wildtype proteins and an additional genotype, amacGFP:V12L, which differed from amacGFP by the V12L mutation that was extremely common in the amacGFP mutant library. We have assayed the thermal stability of the proteins, using Differential Scanning Fluorimetry (DSF), Differential Scanning Calorimetry (DSC), Circular dichroism (CD), as well as simple measurements of fluorescence in a qPCR machine at different temperatures. We also assayed refolding kinetics of urea-denatured proteins (Pédelacq et al., 2006). Finally, we assessed oligomeric states of each of the proteins using multi-angle light scattering with size-exclusion chromatography (SEC-MALS).

The different methods yielded complementary results (Figure 4; Table 2). Specifically, we observed that the most mutationally fragile protein, cgreGFP, was also the most kinetically unstable protein and the most mutationally robust protein, ppluGFP2, was also the most kinetically stable (Table 2; Figure 4; Figure 4—figure supplement 1). These data tentatively suggest that the shape of the GFP fitness peaks, as characterized by mutational robustness, may in part be shaped by the underlying protein stability. This relationship does not appear to be perfect, as the mutationally fragile avGFP is stable, while amacGFP has mutational robustness comparable to ppluGFP2 (Table 1), but a substantially lower stability (Table 2). Indeed, there may be other factors that influence this relationship, such as the impact of protein folding on the GFP chromophore maturation, GFP folding hysteresis (Andrews et al., 2009), the oligomeric protein state and the propensity of the mutant genotypes to aggregate. Indeed, avGFP is the only exclusive monomer from among the four wildtype sequences (Table 2; Figure 4—figure supplement 2) while ppluGFP2 is exclusively tetrameric. The propensity for aggregation also appears variable between the genotypes, with amacGFP showing the highest aggregation of non-fluorescent genotypes (Figure 4—figure supplement 2). Furthermore, our measurements of GFP thermostability through refolding may not reflect GFP folding as it occurs in vivo in the course of translation.

Figure 4 (with 5 supplements)

Thermal sensitivity of GFP orthologs.

(a) Thermal unfolding measured by differential scanning fluorimetry (DSF) showing the first derivative of the ratio of 350/330 nm emission. Shaded areas indicate standard deviation of triplicates. (b) Melting curves of green fluorescence emission (510 nm) as a function of temperature measured on a qPCR machine. Shaded areas indicate standard deviations of eight technical replicates. (c) Thermal aggregation measured by DSF showing the first derivative of the light scattering. Shaded areas indicate standard deviation of triplicates. (d) Specific heat capacities measured by differential scanning calorimetry in duplicate. (e) Circular dichroism (CD) spectra measured before (30 °C) and after (98 °C) the melting curves depicted in (f); vertical dotted lines indicate the wavelengths monitored in (f). (f) CD melting curves monitored at 218 nm (and additionally 208 nm in the case of avGFP, where 218 nm did not show a transition), fitted with a logistic curve. In (a), (b), (c), (d), (f), vertical dashed lines indicate the melting temperature, except ppluGFP2 in (b). In (a), (b), (d), (f), temperature was increased at a rate of 1 °C per minute, in (c), at a rate of ~2 °C per minute, the slowest allowed by the LightCycler.

Table 2

Biophysical and biochemical characterisation of wildtype GFPs.

	amacGFP:V12L	amacGFP	cgreGFP	ppluGFP2	avGFP
Unfolding Tm (DSF)	80.8 °C	82.6 °C	74.1 °C	91.8 °C	86.8 °C
Aggregation Tm (DSF)	79.5 °C	82.0 °C	73.9 °C	90.2 °C	86.6 °C
Tm (CD)	80.4 °C	82.6 °C	71.2 °C	86.4 °C	83.7 °C
Transition slope (CD)	0.86	0.72	1.27	0.63	0.67
Tm (DSC)	80.2 °C	82.4 °C	72.9 °C	90.3 °C	86.3 °C
Enthalpy of denaturation (DSC)	744 kJ/mol	768 kJ/mol	755 kJ/mol	515 kJ/mol	1012 kJ/mol
Fluorescence loss Tm (qPCR)	81.1 °C	82.6 °C	72.9 °C	-	87.5 °C
Urea denaturation: initial rate*	–0.87	–0.35	–0.18	–0.02	–0.009
Kinetic parameters for urea denaturation curves*	a₁=0.71 k₁=0.96 h^–1 a₂=0.28 k₂=0.25 h^–1	a₁=0.52 k₁=0.54 h^–1 a₂=0.43 k₂=0.12 h^–1	-	a₁=0.92 k₁=0.02 h^–1	a₁=0.92 k₁=0.01 h^–1
Refolding: initial rate†	0.01	0.01	0.000014	0.05	0.007
Kinetic parameters for refolding curves†	a₁=–0.35 k₁=0.025 s^–1 a₂=–0.36 k₂=0.005 s^–1 a₃=–0.38 k₃=0.001 s^–1	a₁=–0.057 k₁=0.057 s^–1 a₂=–0.39 k₂=0.013 s^–1 a₃=–0.63 k₃=0.002 s^–1	a₁=0.16 k₁=0.036 s^–1 a₂=–0.45 k₂=0.01 s^–1 a₃=–0.87 k₃=0.001 s^–1	a₁=–0.32 k₁=0.14 s^–1 a₂=–0.45 k₂=0.02 s^–1 a₃=–0.21 k₃=0.003 s^–1	a₁=–0.4 k₁=0.016 s^–1 a₂=–0.36 k₂=0.001 s^–1 a₃=–0.31 k₃=0.001 s^–1
Expected monomer size	28.1 kDa	28.1 kDa	27.4 kDa	25.7 kDa	27.9 kDa
Primary oligomeric state (SEC-MALS)	Monomer (67%), dimer (31%)	Monomer (51%), dimer (46%)	Dimer (>99%)	Tetramer (>97%)	Monomer (>99%)

* Curves monitoring loss of fluorescence in 9 M urea were fitted with two exponential functions in the case of amacGFP and amacGFP:V12L and one exponential function for avGFP and ppluGFP2, while cgreGFP fluorescence loss could not be well modeled using only exponential functions (see Figure 4—figure supplement 1). Initial rates were estimated by calculating the derivative at time t=0.

† Curves monitoring the recovery of fluorescence after urea denaturation over the course of 20 minutes were fitted with three exponential functions (see Figure 4—figure supplement 1). Initial rates were estimated by calculating the derivative at time t=0.

We then used a computational approach to further explore the relationship between protein stability and the shape of the fitness landscapes. We solved the crystal structure of amacGFP and analysed it along with structures already available for other proteins. We found that mutations causing a substantial reduction of fluorescence tended to have a larger effect on protein stability (Figure 4—figure supplement 3), estimated by predicted ΔΔG (two-sided Mann–Whitney U test, p<10^–6). Furthermore, we found a statistically significant correlation between predicted ΔΔG and the effect of a mutation, which was stronger in sharp fitness peaks, avGFP and cgreGFP, and weaker in the flat fitness peaks, amacGFP and ppluGFP2 (Figure 4—figure supplement 3; Spearman’s correlation r=0.6 and r=0.3, respectively). Interestingly, the V12L mutation in amacGFP:V12L appears to have shifted the distribution of the mutation effects, substantially increasing the effect of mutations on the barrel lid in proximity to residue 12, without impacting the overall mutational robustness (Figure 4—figure supplement 4). Across the whole landscape, epistatically interacting amino acid residues were slightly more likely to be spatially proximal (Melamed et al., 2013; Sarkisyan et al., 2016) and the effect was more pronounced in the flatter fitness peaks (Figure 4—figure supplement 5). Taken together, these data suggest that the heterogeneity in the shape of the orthologous GFP fitness peaks may be related to the stability of the underlying protein sequences.

The apparent lack of a relationship between sequence divergence and fitness peak shape suggests that the shape changes on a scale that is smaller than the distances between the four GFP proteins. Therefore, the difference in the impact of mutations on different fitness peaks should be independent of the sequence divergence between them. We found that the probability that a neutral mutation in one protein becomes deleterious in another one was independent of the sequence divergence, except when considering two protein sequences that were different by a single amino acid change (Figure 5a). A complementary pattern was observed when we asked whether the same pairs of sites interact epistatically in different proteins. The probability that a pair of epistatically interacting sites in one protein is also epistatically interacting in another protein was largely independent of sequence divergence between the two proteins, with the two highly similar sequences showing the highest degree of similarity (Figure 5b). Taken together, these data indicate that underlying rules that determine epistatic interactions and fitness peak shape change on a scale smaller than 20% of sequence divergence.

Figure 5

Differences in mutational effects in GFP orthologues.

(a) The proportion of single amino acid mutations which were observed to be neutral (maintaining fluorescence within two standard deviations of the wildtype level) in one GFP sequence and deleterious (reducing fluorescence by over five standard deviations) in another GFP, out of all mutations surveyed in both. The total number of amino acid states considered is indicated beneath the bars. (b) For each pairwise GFP comparison, we first selected all pairs of amino acid sites for which epistasis was measured. Then, out of these pairs of sites we calculated the number that had epistasis >0.3 in either of the two GFP genes (reported underneath each bar). Finally, we calculated the percent of epistatically interacting sites that were measured to be epistatically interacting in both (y-axis). In (a) and (b), pairs of genes are arranged in order of increasing sequence divergence.
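The quantity plotted in panel (a) can be sketched as follows. The sketch is directional (neutral in protein A, deleterious in protein B) and assumes that mutations have already been mapped to a shared alignment so that the same label refers to the same amino acid state in both proteins; the labels and fluorescence values below are hypothetical, and the two- and five-standard-deviation thresholds are taken from the legend.

```python
def neutral_to_deleterious_fraction(effects_a, effects_b, wt_a, sd_a, wt_b, sd_b):
    """Fraction of shared mutations that are neutral in A but deleterious in B.

    effects_a and effects_b map an aligned mutation label to the measured
    log10 fluorescence of the corresponding single mutant.
    Neutral: within 2 SD of the wildtype level.
    Deleterious: more than 5 SD below the wildtype level.
    """
    shared = effects_a.keys() & effects_b.keys()  # only mutations surveyed in both
    if not shared:
        return float("nan")
    hits = sum(
        1
        for m in shared
        if abs(effects_a[m] - wt_a) <= 2 * sd_a
        and effects_b[m] < wt_b - 5 * sd_b
    )
    return hits / len(shared)

# Hypothetical single-mutant measurements, for illustration only
effects_a = {"10V": 3.96, "52R": 3.99, "77L": 3.10}
effects_b = {"10V": 3.00, "52R": 3.70, "77L": 3.71, "90G": 3.50}
fraction = neutral_to_deleterious_fraction(
    effects_a, effects_b, wt_a=3.97, sd_a=0.031, wt_b=3.72, sd_b=0.082
)
print(f"neutral in A, deleterious in B: {fraction:.2f}")
```

Only "10V" satisfies both conditions among the three shared mutations in this toy input. A symmetric version would also count mutations that are neutral in B and deleterious in A.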

The identification of two mutationally robust proteins presented an opportunity to predict novel GFP sequences. Two lines of reasoning led us to hypothesize that it would be easier to create functional genotypes by introducing mutations into mutationally robust, rather than fragile, proteins. First, robust proteins had a higher fraction of fit genotypes with >5 mutations and, therefore, it should be easier to find other genotypes that are farther away. Second, a robust protein should be more tolerant of mistakes in predictions.

Prediction of functional genotypes many mutations away from known functional sequence is akin to looking for a needle in a haystack. There are $(\binom{222}{48}) ∙ 19^{48}$ or ~10¹¹⁰ genotypes that are 48 mutations away from a 222 amino acid long ppluGFP2. Out of all of these sequences, only an infinitesimally small proportion is expected to be functional, perhaps as few as 10^–11 (Keefe and Szostak, 2001) and finding any appreciable number of these sequences requires extraordinary precision. Therefore, we used a machine learning approach, training neural networks on the genotype-to-phenotype relationships revealed by our data (see Materials and methods, Figure 6). We split this data into non-overlapping training and validation sets. Models were trained on the training set and after training, model goodness was calculated as the coefficient of determination between predicted and actual fluorescence values for all genotypes in the validation set. We started with a linear model fitted to the one-hot encoded protein sequences. The validation score of the resulting models indicated that between 59% and 82% of the variance could be explained in all landscapes by the simple linear contribution of mutations in the protein sequence (Figure 6—figure supplement 1). This simple estimate of the fluorescence, which is called fitness potential (Kimura and Crow, 1978; Milkman, 1978), is simply the summed contribution of weighted mutations and does not account for possible interactions between them. We then trained models of increasing capacity and aimed at maximising the validation score while reducing overfitting. In all landscapes, the majority of genotypes were either non-functional or of near-wildtype brightness; this scarcity of genotypes with intermediate fluorescence levels suggests that an abrupt threshold function transforms the fitness potential into the final fluorescence level, as has been observed previously (Pokusaeva et al., 2019; Sarkisyan et al., 2016). 
Therefore, we trained sigmoid models, which captured an additional 13%, 78%, 53%, and 39% of the variance left unexplained by the linear model for amacGFP, avGFP, cgreGFP, and ppluGFP2, respectively (Figure 6—figure supplement 1). This minor transformation of the fitness potential noticeably improved the models’ power, especially for the two genes that display the highest levels of epistatic interactions, avGFP and cgreGFP. To capture the functions that transform the fitness potential into the predicted fluorescence, we trained models with an output subnetwork of several sigmoid nodes; the resulting functions are shown in Figure 6—figure supplement 1. Theorising that models accounting for interactions between residues would further improve predictive power, we optimised the architecture of two-layered networks, one for each dataset, using a grid search approach. This resulted in models capturing 88%, 95%, 86%, and 90% of the variance for amacGFP, avGFP, cgreGFP, and ppluGFP2, respectively, as shown in Figure 6b. Using the trained model as the evaluation function of a genetic algorithm, we made fitness peak-specific predictions, using the data of each fitness peak to predict fluorescent genotypes containing up to 48 mutations relative to the wildtype sequence.
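
The relationship between one-hot encoding, the additive fitness potential, and a threshold-like sigmoid transform can be illustrated with a minimal Python sketch. This is not the implementation used in the study: the amino acid ordering, the `toy_weights` helper, and the sigmoid parameters (`midpoint`, `steepness`) are hypothetical choices made only so that the example runs, whereas in practice the weights would be fitted to measured fluorescence.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 states


def one_hot(sequence):
    """Flatten a protein sequence into a binary vector of length 20 * len(sequence)."""
    vector = []
    for residue in sequence:
        vector.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vector


def toy_weights(wildtype, penalty=-1.0):
    """Hypothetical weights: 0 for each wildtype state, `penalty` for any other state."""
    return [0.0 if aa == wt else penalty for wt in wildtype for aa in AMINO_ACIDS]


def fitness_potential(encoded, weights, bias=0.0):
    """Additive model: weighted sum of amino acid states, with no interaction terms."""
    return bias + sum(w * x for w, x in zip(weights, encoded))


def sigmoid_fluorescence(potential, low=0.0, high=1.0, midpoint=-1.5, steepness=4.0):
    """Threshold-like map from fitness potential to predicted fluorescence."""
    return low + (high - low) / (1.0 + math.exp(-steepness * (potential - midpoint)))


def r_squared(observed, predicted):
    """Coefficient of determination between observed and predicted values."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    wildtype = "MSKGE"  # toy five-residue sequence, not a real GFP
    weights = toy_weights(wildtype)
    for variant in ("MSKGE", "MSKGA", "MAKGA", "AAKGA"):
        potential = fitness_potential(one_hot(variant), weights)
        print(variant, potential, round(sigmoid_fluorescence(potential), 3))
```

With these toy parameters, each mutation lowers the potential by the same amount, yet the predicted fluorescence drops abruptly once the potential crosses the midpoint. This mirrors the scarcity of intermediate fluorescence values described above.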

Figure 6 (with 1 supplement)

Neural network structure.

(a) 1. Each genotype in the dataset was denoted by the mutations it contained relative to its parental wildtype sequence. 2. Genotypes were one-hot encoded. For each position in the sequence, a binary vector indicated present (red = 1) and absent (white = 0) amino acid states. 3. One-hot encoded sequences were flattened and provided to the neural network as input. 4. The first hidden layer contained linear nodes followed by a dropout layer of the same size. 5. The second hidden layer contained sigmoid nodes followed by a dropout layer of the same size. Grey arrows indicate layer widths that were optimised by a random search. Greyed-out neurons without output connections represent randomly inactivated neurons in dropout layers. During training, randomly inactivated neurons prevented overfitting. At inference time, randomly inactivated neurons allowed the model to provide different estimates of the fluorescence each time a prediction was run on a genotype. 6. Linear node outputting predicted fluorescence values. For each predicted genotype, the median of several fluorescence estimates was used as the final fluorescence level. (b) Correlations between observed and predicted levels of fluorescence with an optimized architecture. Datapoint density is represented in color.
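
The use of dropout at inference time described in the caption, where repeated stochastic forward passes are summarised by their median, can be sketched as follows. This is a simplified illustration under stated assumptions, not the network from the study: it uses a single sigmoid hidden layer rather than the two hidden layers of the figure, and the weights, `drop_rate`, and `n_passes` are hypothetical values.

```python
import math
import random
import statistics


def forward_with_dropout(x, w_hidden, w_out, drop_rate, rng):
    """One stochastic forward pass with dropout left switched on."""
    keep = 1.0 - drop_rate
    hidden = []
    for row in w_hidden:
        pre_activation = sum(w * xi for w, xi in zip(row, x))
        activation = 1.0 / (1.0 + math.exp(-pre_activation))
        # Inverted dropout: silence the unit with probability drop_rate
        # and rescale surviving units so the expected output is unchanged.
        hidden.append(activation / keep if rng.random() < keep else 0.0)
    return sum(w * h for w, h in zip(w_out, hidden))


def predict_median(x, w_hidden, w_out, drop_rate=0.5, n_passes=25, seed=0):
    """Median of several stochastic estimates, plus the estimates themselves."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)
    estimates = [
        forward_with_dropout(x, w_hidden, w_out, drop_rate, rng)
        for _ in range(n_passes)
    ]
    return statistics.median(estimates), estimates


if __name__ == "__main__":
    x = [1.0, 0.0, 1.0]  # toy input vector (hypothetical)
    w_hidden = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4], [0.9, 0.1, -0.7]]
    w_out = [1.0, 2.0, 3.0]
    median, estimates = predict_median(x, w_hidden, w_out)
    print(median, min(estimates), max(estimates))
```

The spread of the individual estimates gives a rough indication of how sensitive a prediction is to the random inactivation of neurons, while the median is less affected by occasional extreme passes than the mean would be.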

Amino acids observed in homologous sequences, or extant states, are more likely to be neutral when introduced into a sequence of interest (Pokusaeva et al., 2019; Figure 3—figure supplement 2). Therefore, one approach to predicting a novel functional sequence would be to favour the introduction of extant amino acid states. However, we wanted to push our predictions into uncharted regions of the GFP fitness landscape, avoiding the genotype space between known GFP sequences (the space between fitness peaks in Figure 1a). Thus, we aimed to predict genotypes as distant as possible from any known functional GFP sequence, corresponding to an area of GFP genotype space not known to have been explored by evolution. For experimental verification, we therefore selected, from among the predictions made by the machine learning algorithm, the sequences with the maximum number of amino acid states that were observed in our libraries but not present in any natural GFP (Source data 8).
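
The selection criterion in the preceding paragraph, favouring predicted sequences that carry the most amino acid states absent from natural GFPs, can be sketched in Python. The function names and the toy four-residue sequences are hypothetical, and the sketch assumes that all sequences are pre-aligned and of equal length; it does not reproduce the actual selection procedure or data of the study.

```python
def mutations(candidate, wildtype):
    """(position, amino acid) states at which the candidate differs from wildtype."""
    return {
        (i, aa)
        for i, (aa, wt) in enumerate(zip(candidate, wildtype))
        if aa != wt
    }


def extant_states(natural_sequences):
    """All (position, amino acid) states seen in aligned natural homologues."""
    return {(i, aa) for seq in natural_sequences for i, aa in enumerate(seq)}


def rank_by_novelty(candidates, wildtype, natural_sequences):
    """Sort candidates by the number of mutations not found in any natural sequence."""
    extant = extant_states(natural_sequences)
    scored = [(len(mutations(c, wildtype) - extant), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    wildtype = "MSKG"             # toy sequences, not real GFPs
    natural = ["MSKG", "MTKG"]    # hypothetical alignment of natural homologues
    predicted = ["MTKG", "MAKG", "MAKW"]
    for novel_count, sequence in rank_by_novelty(predicted, wildtype, natural):
        print(sequence, novel_count)
```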

Contrary to our expectation, experimental verification showed that the accuracy of our predictions was substantially higher for genotypes predicted using data from the sharp cgreGFP fitness peak (Figure 7). For genotypes with 48 mutations (>20% sequence divergence of GFP), our predictions had an 8% accuracy when using data for the mutationally robust ppluGFP2 and a 50–60% accuracy for the mutationally fragile cgreGFP (Figure 7). These results would be relatively trivial if the predictions were based on universally neutral mutations (Kondrashov and Kondrashov, 2015), that is, mutations that are neutral in any GFP sequence. However, three lines of evidence show that our high rate of successful prediction cannot be explained by universally neutral mutations (also see Poelwijk et al., 2019). First, individual mutations that were strongly deleterious in some genetic contexts were successfully incorporated into functional predictions, and these conditionally deleterious mutations comprised a significant fraction of all mutations used in predictions, particularly in the case of the epistatic cgreGFP protein (Figure 7—figure supplement 1). Second, the mutations used in successful predictions occur in evolution at about half the rate of neutral synonymous substitutions (dN of 0.057 vs dS of 0.11, two-sided Mann-Whitney U-test, p<0.00001), demonstrating that they are under negative selection. Finally, successful identification of universally neutral mutations would lead to successful prediction of distant derivatives of any GFP sequence, not just the mutationally fragile cgreGFP. Furthermore, the ML-designed variants derived from the more robust amacGFP and ppluGFP2 proteins were rendered non-fluorescent by negative epistasis substantially more frequently than those derived from the fragile and epistatic cgreGFP (Table 1). This indicates that the success of the neural network depended on its ability to learn epistatic interactions from the data, which were abundant in cgreGFP but rare in amacGFP and ppluGFP2, and to avoid unfavorable epistatic interactions, rather than on universally neutral mutations.

Figure 7 (with 1 supplement)

Predicting functional GFP mutants.

Violin plots show the distribution of fluorescence of all genotypes (black) and of combinations of only individually neutral mutations (color). Experimental measurements of the level of fluorescence in genotypes predicted by the neural network are shown as black dots (12 genotypes per distance). Black dashed lines show the median fluorescence values for each group. Red dashed lines indicate the cutoff of detectable fluorescence. Photos of agar plates with *E. coli* spots expressing predicted GFP variants are shown on the right. Spots of bacteria expressing GFP variants are arranged in circles around the wildtype gene, with the number of mutations increasing with distance from the centre (6, 12, 18, 24, 30, 36, 42, 48 mutations). For each group of genotypes, the brightest ones were inoculated at the top, with fluorescence decreasing clockwise.

Discussion

Experimental surveys of the fitness landscape of a protein of interest are increasingly used in protein engineering to discover novel sequences with specific functions (Bryant et al., 2021; Romero and Arnold, 2009; Russ et al., 2020; Wittmann et al., 2021). While this approach remains challenging for proteins with a function that cannot be easily ascertained in a high-throughput manner (Romero and Arnold, 2009), it is likely to be more widely used in the future due to advances in experimental (Romero and Arnold, 2009) and analytical (Wittmann et al., 2021; Wu et al., 2019) tools. Our description of the heterogeneity of fitness peaks of orthologous GFPs suggests some practical considerations for such surveys of other proteins. Researchers applying such methods to their protein of interest will inevitably have to choose a specific protein sequence to assay experimentally (Romero and Arnold, 2009). When the goal is to discover as many distant functional proteins as possible (e.g. Bryant et al., 2021; Romero and Arnold, 2009; Russ et al., 2020; Wittmann et al., 2021), it may seem natural to select a structurally or mutationally robust protein. Indeed, a robust protein, one that is known to maintain function upon the introduction of many mutations, seems a good starting point for introducing even more mutations. Our results counter this intuition, and our recommendation is to select a fragile protein as the original template for a mutational scan. For the fitness landscape data to be useful to a downstream model predicting distant sequences, they have to contain information about epistatic interactions between mutations. Thus, a useful fitness landscape should contain many genotypes that have been rendered non-functional through negative epistatic interactions among a handful of mutations. Our results for cgreGFP demonstrate this principle. Out of 188 genotypes six mutations away from cgreGFP that were expected to be functional by an additive model, only 61 (32%) were actually fluorescent. The rest were non-fluorescent, revealing extensive negative epistasis among those mutations. By contrast, our model correctly predicted 100% (12/12) of genotypes six mutations away from cgreGFP, having learned to avoid combinations of individually neutral mutations that together create a non-functional genotype. Without this information, as in the case of amacGFP, which shows almost no epistatic interactions within the surveyed genotypes, the model cannot learn which genotypes to avoid (Figure 7). The reported ~20% prediction accuracy at 40 mutations for sfGFP is also consistent with the sharpness of its fitness peak (Biswas et al., 2018).

Without direct knowledge of the mutational robustness of a protein sequence, our data indicate that researchers may rely on thermodynamic stability to choose the initial template protein, although the relationship between mutational and thermodynamic robustness may be more complex (Table 2). However, given that for many proteins it is likely to be easier to measure stability than mutational robustness, choosing a structurally unstable protein from several available candidates may prove to be an acceptable compromise.

Despite the high accuracy of prediction with our model, there are still substantial limits to the prediction of functional proteins. Indeed, the relatively accurate prediction of functional GFP sequences up to 20% divergence from cgreGFP does not imply an ability to predict all functional sequences among the $\binom{235}{48} \cdot 19^{48}$, or ~10¹¹², possible sequences at this level of divergence. The substantial heterogeneity between fitness peaks of the highly similar avGFP and amacGFP (18% divergence) suggests that predictions based on a single fitness peak may be less accurate for sequences not governed by the same set of epistatic interactions (Alley et al., 2019; Lee et al., 2018). However, the understanding of the heterogeneity of such predictions would require random sampling of all 10¹¹⁰ sequences, which is not presently feasible.

The four proteins in our study function in different species, so the heterogeneity of their fitness landscapes may be related to some aspect of the environment in which these species are typically found, such as temperature. By contrast, our experimental setup measures the fluorescence of these proteins in an identical, controlled, but artificial environment. If the mutational robustness of GFP, that is, the shape of its fitness peak, were adapted to environmental conditions, this would imply that mutational robustness is driven by natural selection. Indeed, experimental data suggest that mutational robustness of GFP is a selectable trait (Zheng et al., 2020), so a correlation between sequence divergence and mutational robustness might have been expected. However, selectable traits are generally expected to evolve slowly, which translates to the expectation that similar sequences confer similar phenotypes. The lack of a correlation in our data between sequence similarity and mutational robustness implies that selection for the phenotype of mutational robustness in GFP is weak or that it changes rapidly on an evolutionary timescale.

The heterogeneity of the shape of the fitness peaks is remarkable. Up to 17% of all genotypes six random mutations away from the ppluGFP2 wildtype sequence have the same level of fluorescence as the wildtype. By contrast, only 0.9% of such genotypes derived from the fragile cgreGFP exhibited wildtype fluorescence (Supplementary file 1). However, it remains unclear whether this heterogeneity influences protein evolution. It is tempting to suggest that these data indicate that ppluGFP2 is a more ‘evolvable’ protein compared to cgreGFP. However, 2% of all genotypes six mutations away from wildtype cgreGFP were observed to be functional, which still corresponds to ~2×10¹⁷ ($\binom{235}{6} \cdot 19^{6} \cdot 0.02$) functional genotypes in total, so even such a relatively fragile protein may not be restricted in its long-term evolution (Povolotskaya and Kondrashov, 2010). What fraction of all genotypes ~250 amino acids in length are functional GFPs, and what factors govern differences in the shape of fitness peaks of orthologous proteins, remain unknown.
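
The genotype counts quoted in the text follow from simple combinatorics: a genotype k mutations away from a sequence of length L is specified by choosing k positions and, at each, one of the 19 alternative amino acids. The short Python sketch below reproduces the orders of magnitude; the function name `genotypes_at_distance` is introduced here for illustration only.

```python
import math


def genotypes_at_distance(length, n_mutations, alphabet_size=20):
    """Number of sequences exactly n_mutations substitutions from a reference."""
    positions = math.comb(length, n_mutations)            # which sites are mutated
    substitutions = (alphabet_size - 1) ** n_mutations    # 19 alternatives per site
    return positions * substitutions


if __name__ == "__main__":
    # 48 mutations away from a 222-residue and a 235-residue protein
    for length in (222, 235):
        count = genotypes_at_distance(length, 48)
        print(length, "residues, 48 mutations: ~10^%.1f" % math.log10(count))

    # 2% of all genotypes six mutations away from a 235-residue protein
    six_away = genotypes_at_distance(235, 6)
    print("2%% of 6-mutation genotypes: %.2e" % (0.02 * six_away))
```

Python integers have arbitrary precision, so the counts are computed exactly and only converted to a logarithm or floating-point value for display.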

Author details

Louisa Gonzalez Somermeyer

Aubin Fleiss

Alexander S Mishin

Nina G Bozhanova

Anna A Igolkina

Jens Meiler

Maria-Elisenda Alaball Pujol

Ekaterina V Putintseva

Karen S Sarkisyan (for correspondence)

Fyodor A Kondrashov (for correspondence)
