Research Article

Integrating genotypes and phenotypes improves long-term forecasts of seasonal influenza A/H3N2 evolution

Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, United States
Molecular and Cell Biology Program, University of Washington, United States
Virology Surveillance and Diagnosis Branch, Influenza Division, National Center for Immunization and Respiratory Diseases (NCIRD), Centers for Disease Control and Prevention (CDC), United States
WHO Collaborating Centre for Reference and Research on Influenza, Crick Worldwide Influenza Centre, The Francis Crick Institute, United Kingdom
Influenza Virus Research Center, National Institute of Infectious Diseases, Japan
The WHO Collaborating Centre for Reference and Research on Influenza, The Peter Doherty Institute for Infection and Immunity, Department of Microbiology and Immunology, The University of Melbourne, The Peter Doherty Institute for Infection and Immunity, Australia
Biozentrum, University of Basel, Switzerland
Swiss Institute of Bioinformatics, Switzerland

Sep 2, 2020

https://doi.org/10.7554/eLife.60067

Open access
Copyright information

Figures
Tables
Additional files

16 figures, 6 tables and 2 additional files

Figures

Figure 1

Download asset Open asset

Schematic representation of the fitness model for simulated H3N2-like populations wherein the fitness of strains at timepoint t determines the estimated frequency of strains with similar sequences one year in the future at timepoint u.

Strains are colored by their amino acid sequence composition such that genetically similar strains have similar colors (Materials and methods). (A) Strains at timepoint t, $𝐱 (t)$ , are shown in their phylogenetic context and sized by their frequency at that timepoint. The estimated future population at timepoint u, $\hat{𝐱} (u)$ , is projected to the right with strains scaled in size by their projected frequency based on the known fitness of each simulated strain. (B) The frequency trajectories of strains at timepoint t to u represent the predicted the growth of the dark blue strains to the detriment of the pink strains. (C) Strains at timepoint u, $𝐱 (u)$ , are shown in the corresponding phylogeny for that timepoint and scaled by their frequency at that time. (D) The observed frequency trajectories of strains at timepoint u broadly recapitulate the model’s forecasts while also revealing increased diversity of sequences at the future timepoint that the model could not anticipate, e.g. the emergence of the light blue cluster from within the successful dark blue cluster. Model coefficients minimize the earth mover’s distance between amino acid sequences in the observed, $𝐱 (u)$ , and estimated, $\hat{𝐱} (u)$ , future populations across all training windows.

Figure 2

Download asset Open asset

Time-series cross-validation scheme for simulated populations.

Models were trained in six-year sliding windows (gray lines) and validated on out-of-sample data from validation timepoints (filled circles). Validation results from 30 years of data were used to iteratively tune model hyperparameters. After fixing hyperparameters, model coefficients were fixed at the mean values across all training windows. Fixed coefficients were applied to 9 years of new out-of-sample test data (open circles) to estimate true forecast errors.

Figure 3

Download asset Open asset

Phylogeny of H3N2-like HA sequences sampled between the 24th and 30th years of simulated evolution.

The phylogenetic structure and rate of accumulated epitope and non-epitope mutations match patterns observed in phylogenies of natural sequences. Sample dates were annotated as the generation in the simulation divided by 200 and added to 2000, to acquire realistic date ranges that were compatible with our modeling machinery.

Figure 4

Download asset Open asset

Simulated population model coefficients and distances between projected and observed future populations as measured in amino acids (AAs).

(A) Coefficients are shown per validation timepoint (solid circles, N = 33) with the mean ± standard deviation in the top-left corner. For model testing, coefficients were fixed to their mean values from training/validation and applied to out-of-sample test data (open circles, N = 18). (B) Distances between projected and observed populations are shown per validation timepoint (solid black circles) or test timepoint (open black circles). The mean ± standard deviation of distances per validation timepoint are shown in the top-left of each panel. Corresponding values per test timepoint are in the top-right. The naive model’s distances to the future for validation and test timepoints (light gray) were 8.97 ± 1.35 AAs and 9.07 ± 1.70 AAs, respectively. The corresponding lower bounds on the estimated distance to the future (dark gray) were 4.57 ± 0.61 AAs and 4.85 ± 0.82 AAs. Source data are in Table 4—source data 1 and 2.

Figure 5 with 1 supplement

Download asset Open asset

Simulated population model coefficients and distances to the future for individual biologically-informed fitness metrics and the best composite model.

(A) Coefficients and (B) distances are shown per validation and test timepoint as in Figure 4. Source data are in Table 4—source data 1 and 2.

Figure 5—figure supplement 1

Download asset Open asset

Composite model coefficients and distances to the future for models fit to simulated populations.

(A) Coefficients and (B) distances are shown per validation timepoint and test timepoint as in Figure 4. Source data are in Table 4—source data 1 and 2.

Figure 6 with 3 supplements

Download asset Open asset

Test of best model for simulated populations (true fitness) using 9 years (18 timepoints) of previously unobserved test data and fixed model coefficients.

These timepoints correspond to the open circles in Figure 2. (A) The correlation of log estimated clade frequency fold change, $\log_{10} \frac{\hat{x} (t + Δ t)}{x (t)}$ , and log observed clade frequency fold change, $\log_{10} \frac{x (t + Δ t)}{x (t)}$ , shows the model’s ability to capture clade-level dynamics without explicitly optimizing for clade frequency targets. (B) The rank of the estimated closest strain based on its distance to the future in the best model was in the top 20th percentile for 89% of 18 timepoints, confirming that the model makes a good choice when forced to select a single representative strain for the future population. (C) Absolute forecast error for clades shown in A by their initial frequency with a mean LOESS fit (solid black line) and 95% confidence intervals (gray shading) based on 100 bootstraps. (D) The correlation of all strains at all timepoints by the percentile rank of their observed and estimated distances to the future. The corresponding results for the naive model are shown in Figure 6—figure supplement 3.

Figure 6—figure supplement 1

Download asset Open asset

Validation of best model for simulated populations of H3N2-like viruses.

Validation of best model for simulated populations of H3N2-like viruses for 33 timepoints. These timepoints correspond to the closed circles in Figure 2. (A) The correlation of estimated and observed clade frequency fold changes shows the model’s ability to capture clade-level dynamics without explicitly optimizing for clade frequency targets. (B) The rank of the estimated closest strain based on its distance to the future for 33 timepoints. The estimated closest strain was in the top 20th percentile of observed closest strains for 100% of timepoints, confirming that the model makes a good choice when forced to select a single representative strain for the future population. (C) Absolute forecast error for clades shown in A by their initial frequency with a mean LOESS fit (solid black line) and 95% confidence intervals (gray shading) based on 100 bootstraps. (D) The correlation of all strains at all timepoints by the percentile rank of their observed and estimated distances to the future. The corresponding results for the naive model are shown in Figure 6—figure supplement 2.

Figure 6—figure supplement 2

Download asset Open asset

Validation of naive model for simulated populations of H3N2-like viruses.

Validation of naive model for simulated populations of H3N2-like viruses for 33 timepoints. These timepoints correspond to the closed circles in Figure 2. Note that the naive model sets future frequencies to current frequencies such that there is no estimated fold change in frequencies for the first panel.

Figure 6—figure supplement 3

Download asset Open asset

Test of naive model for simulated populations of H3N2-like viruses.

Test of naive model for simulated populations of H3N2-like viruses for 18 timepoints. These timepoints correspond to the open circles in Figure 2. Note that the naive model sets future frequencies to current frequencies such that there is no estimated fold change in frequencies for the first panel.

Figure 7

Download asset Open asset

Time-series cross-validation scheme for natural populations.

Models were trained in six-year sliding windows (gray lines) and validated on out-of-sample data from validation timepoints (filled circles). Validation results from 25 years of data were used to iteratively tune model hyperparameters. After fixing hyperparameters, model coefficients were fixed at the mean values across all training windows. Fixed coefficients were applied to four years of new out-of-sample test data (open circles) to estimate true forecast errors.

Figure 8 with 1 supplement

Download asset Open asset

Natural population model coefficients and distances to the future for individual biologically-informed fitness metrics.

(A) Coefficients and (B) distances are shown per validation timepoint (N = 23) and test timepoint (N = 8) as in Figure 4. The naive model’s distance to the future (light gray) was 6.40 ± 1.36 AAs for validation timepoints and 6.82 ± 1.74 AAs for test timepoints. The corresponding lower bounds on the estimated distance to the future (dark gray) were 2.60 ± 0.89 AAs and 2.28 ± 0.61 AAs. Source data are in Table 6—source data 1 and 2.

Figure 8—figure supplement 1

Download asset Open asset

Comparison of epitope-based models with knowledge of the future.

Model coefficients and distances to the future for antigenic novelty models fit to natural populations. (A) Coefficients and (B) distances are shown per validation timepoint and test timepoint as in Figure 4. The epitope antigenic novelty model relies on previously published epitope sites (Luksza and Lässig, 2014). The ‘oracle’ antigenic novelty model relies on sites of beneficial mutations that were manually identified from the entire training and validation time period (Methods). The improved performance of the ‘oracle’ model indicates that the sequence-based antigenic novelty metric can be effective when sites of beneficial mutations are known prior to forecasting. Source data are in Table 6—source data 1 and 2.

Figure 9 with 1 supplement

Download asset Open asset

Natural population model coefficients and distances to the future for composite fitness metrics.

(A) Coefficients and (B) distances are shown per validation timepoint (N = 23) and test timepoint (N = 8) as in Figure 4. Source data are in Table 6—source data 1 and 2.

Figure 9—figure supplement 1

Download asset Open asset

Composite models fit to most recent data from natural populations.

Model coefficients and distances to the future for best composite models and a FRA-based composite fit to recent data from natural populations as in Figure 4. (A) Coefficients and (B) distances are shown per test timepoint (N = 8). In contrast to the results for these models based on fixed coefficients from training/validation, these coefficients were learned for each six-year window prior to the corresponding test timepoint. The corresponding distances reflect the model’s performance with updated coefficients on what is effectively new validation data. The naive model’s distance to the future was 6.82 ± 1.74 AAs for these timepoints.

Figure 10 with 3 supplements

Download asset Open asset

Test of best model for natural populations of H3N2 viruses, the composite model of HI antigenic novelty and mutational load, across eight timepoints.

These timepoints correspond to the open circles in Figure 7. (A) The correlation of estimated and observed clade frequency fold changes shows the model’s ability to capture clade-level dynamics without explicitly optimizing for clade frequency targets. (B) The rank of the estimated closest strain based on its distance to the future for eight timepoints. The estimated closest strain was in the top 20th percentile of observed closest strains for 100% of timepoints. (C) Absolute forecast error for clades shown in A by their initial frequency with a mean LOESS fit (solid black line) and 95% confidence intervals (gray shading) based on 100 bootstraps. (D) The correlation of all strains at all timepoints by the percentile rank of their observed and estimated distances to the future. The corresponding results for the naive model are shown in Figure 10—figure supplement 3.

Figure 10—figure supplement 1

Download asset Open asset

Validation of best model for natural populations of H3N2 viruses.

Validation of best model for natural populations of H3N2 viruses for 23 timepoints using the composite model of mutational load and LBI. These timepoints correspond to the closed circles in Figure 7. (A) The correlation of estimated and observed clade frequency fold changes shows the model’s ability to capture clade-level dynamics without explicitly optimizing for clade frequency targets. (B) The rank of the estimated closest strain based on its distance to the future for 23 timepoints. The estimated closest strain was in the top 20th percentile of observed closest strains for 87% of timepoints, confirming that the model makes a good choice when forced to select a single representative strain for the future population. (C) Absolute forecast error for clades shown in A by their initial frequency with a mean LOESS fit (solid black line) and 95% confidence intervals (gray shading) based on 100 bootstraps. (D) The correlation of all strains at all timepoints by the percentile rank of their observed and estimated distances to the future. The corresponding results for the naive model are shown in Figure 10—figure supplement 2.

Figure 10—figure supplement 2

Download asset Open asset

Validation of naive model for natural populations of H3N2 viruses.

Validation of naive model for natural populations of H3N2 viruses for 23 timepoints. These timepoints correspond to the closed circles in Figure 7. Note that the naive model sets future frequencies to current frequencies such that there is no estimated fold change in frequencies for the first panel.

Figure 10—figure supplement 3

Download asset Open asset

Test of naive model for natural populations of H3N2 viruses.

Test of naive model for natural populations of H3N2 viruses for eight timepoints. These timepoints correspond to the open circles in Figure 7. Note that the naive model sets future frequencies to current frequencies such that there is no estimated fold change in frequencies for the first panel.

Figure 11 with 1 supplement

Download asset Open asset

Observed distance to natural H3N2 populations one year into the future for each vaccine strain (green) and the observed (blue) and estimated closest strains to the future by the mutational load and LBI model (orange), the HI antigenic novelty and mutational load model (purple), and the naive model (black).

Vaccine strains were assigned to the validation or test timepoint closest to the date they were selected by the WHO. The weighted distance to the future for each strain was calculated from their amino acid sequences and the frequencies and sequences of the corresponding population one year in the future. Vaccine strain names are abbreviated from A/Fujian/411/2002, A/Wellington/1/2004, A/California/7/2004, A/Wisconsin/67/2005, A/Brisbane/10/2007, A/Perth/16/2009, A/Victoria/361/2011, A/Texas/50/2012, A/Switzerland/9715293/2013, A/HongKong/4801/2014, A/Singapore/Infimh-16-0019/2016, and A/Switzerland/8060/2017. Source data are available in Figure 11—source data 1.

Figure 11—source data 1 Weighted distances to the future per strain by strain type and timepoint.: https://cdn.elifesciences.org/articles/60067/elife-60067-fig11-data1-v2.csv
Download elife-60067-fig11-data1-v2.csv

Figure 11—figure supplement 1

Download asset Open asset

Relative improvement of model selections over vaccine strains.

Relative distance to future H3N2 populations between vaccine strains and corresponding observed and estimated closest strains at each timepoint as in Figure 11. Strains with relative distances greater than zero were farther from the future than the selected vaccine strain, while strains below zero were closer to the future. Source data are available in Figure 11—source data 1.

Figure 12

Download asset Open asset

Snapshot of live forecasts on nextstrain.org from our best model (HI antigenic novelty and mutational load) for April 1, 2021.

The observed frequency trajectories for currently circulating clades are shown up to April 1, 2020. Our model forecasts growth of the clades 3c3.A and A1b/94N and decline of all other major clades.

Figure 13

Download asset Open asset

Snapshot of the last two years of seasonal influenza H3N2 evolution on nextstrain.org showing the estimated distance per strain to the future population.

Distance to the future is calculated for each strain as the Hamming distance of HA amino acid sequences to all other circulating strains weighted by the other strain’s projected frequencies under the best fitness model (HI antigenic novelty and mutational load).

Figure 14

Download asset Open asset

Bootstrap distributions of the mean difference of distances to the future between biologically-informed and naive models for simulated populations.

Empirical differences in distances to the future were sampled with replacement and mean values for each bootstrap sample were calculated across n = 10,000 bootstrap iterations. The horizontal gray line indicates a difference of zero between a given model and its corresponding naive model. Each model is annotated by the mean ± the standard deviation of the bootstrap distribution. Models are also annotated by the p-value representing the proportion of bootstrap samples with values less than zero (see Methods).

Figure 15

Download asset Open asset

Bootstrap distributions of the mean difference of distances to the future between biologically-informed and naive models for natural populations.

Empirical differences in distances to the future were sampled with replacement and mean values for each bootstrap sample were calculated across n = 10,000 bootstrap iterations. The horizontal gray line indicates a difference of zero between a given model and its corresponding naive model. Each model is annotated by the mean ± the standard deviation of the bootstrap distribution. Models are also annotated by the p-value representing the proportion of bootstrap samples with values less than zero (see Materials and methods).

Author response image 1

Download asset Open asset

Distribution of distances to the future by Hemisphere for the best natural model (HI antigenic novelty and mutational load) subtracted from the corresponding distance to the future for the naive model at the same timepoint.

Tables

Table 1

Summary of models used with simulated and natural populations.

Models are labeled by the type of population they were applied to, the type of data they were based on, and the component of influenza fitness they represent.

Model	Populations	Data type	Fitness category	Originally implemented by
true fitness	simulated	simulated populations	positive control	this study
naive	simulated, natural	HA sequences	negative control	this study
epitope antigenic novelty	simulated, natural	HA sequences	antigenic drift	Luksza and Lässig, 2014
epitope ancestor	simulated, natural	HA sequences	antigenic drift	Luksza and Lässig, 2014
HI antigenic novelty	natural	serological assays	antigenic drift	this study
mutational load	simulated, natural	HA sequences	functional constraint	Luksza and Lässig, 2014
deep mutational scanning (DMS) mutational effects	natural	DMS assays	functional constraint	Lee et al., 2018
local branching index (LBI)	simulated, natural	HA sequences	clade growth	Neher et al., 2014
delta frequency	simulated, natural	HA sequences	clade growth	this study

Table 2

Number of epitope and non-epitope mutations per branch by trunk or side branch status for simulated populations.

Epitope sites were defined previously described (Luksza and Lässig, 2014). Annotation of trunk and side branch was performed as previously described (Bedford et al., 2015). Mutations were calculated for the full validation tree for simulated sequences samples between October of years 10 and 40.

branch type	epitope mutations	non-epitope mutations	epitope-to-non-epitope ratio
side branch	590	1327	0.44
trunk	23	12	1.92

Table 3

Number of epitope and non-epitope mutations per branch by trunk or side branch status for natural populations.

branch type	epitope mutations	non-epitope mutations	epitope-to-non-epitope ratio
side branch	485	1177	0.41
trunk	50	32	1.56

Table 4

Simulated population model coefficients and performance on validation and test data ordered from best to worst by distance to the future in the validation analysis.

Coefficients are the mean ± standard deviation for each metric in a given model across 33 training windows. Distance to the future (mean ± standard deviation) measures the distance in amino acids between estimated and observed future populations. Distances annotated with asterisks (*) were significantly closer to the future than the naive model as measured by bootstrap tests (see Methods and Figure 14). The number of times (and percentage of total times) each model outperformed the naive model measures the benefit of each model over a model than estimates no change between current and future populations. Test results are based on 18 timepoints not observed during model training and validation. Source data are in Table 4—source data 1 and 2.

		Distance to future (AAs)		Model > naive
Model	Coefficients	Validation	Test	Validation	Test
true fitness	9.37 +/− 0.92	6.82 +/− 1.52*	7.38 +/− 1.89*	32 (97%)	16 (89%)
LBI	1.31 +/− 0.33	7.24 +/− 1.66*	7.10 +/− 1.19*	32 (97%)	18 (100%)
+ mutational load	−1.77 +/− 0.49
LBI	2.26 +/− 1.06	7.57 +/− 1.85*	7.51 +/− 1.20*	29 (88%)	17 (94%)
delta frequency	1.46 +/− 0.44	8.13 +/− 1.44*	8.65 +/− 1.99*	26 (79%)	13 (72%)
epitope ancestor	0.35 +/− 0.07	8.20 +/− 1.39*	8.17 +/− 1.52*	29 (88%)	17 (94%)
+ mutational load	−1.57 +/− 0.13
mutational load	−1.49 +/− 0.12	8.27 +/− 1.35*	8.20 +/− 1.50*	29 (88%)	17 (94%)
epitope antigenic novelty	0.03 +/− 0.19	8.33 +/− 1.35*	8.22 +/− 1.51*	28 (85%)	17 (94%)
+ mutational load	−1.38 +/− 0.39
epitope ancestor	0.14 +/− 0.11	8.96 +/− 1.35	9.03 +/− 1.68*	20 (61%)	13 (72%)
naive	0.00 +/− 0.00	8.97 +/− 1.35	9.07 +/− 1.70	0 (0%)	0 (0%)
epitope antigenic novelty	−0.03 +/− 0.19	9.03 +/− 1.37	9.07 +/− 1.69	14 (42%)	7 (39%)

Table 4—source data 1 Coefficients for models fit to simulated populations.: https://cdn.elifesciences.org/articles/60067/elife-60067-table4-data1-v2.csv
Download elife-60067-table4-data1-v2.csv
Table 4—source data 2 Distances to the future for models fit to simulated populations.: https://cdn.elifesciences.org/articles/60067/elife-60067-table4-data2-v2.csv
Download elife-60067-table4-data2-v2.csv

Table 5

Comparison of composite and individual model distances to the future by bootstrap test (see Materials and methods).

The effect size of differences between models in amino acids is given by the mean and standard deviation of the bootstrap distributions. The p values represent the proportion of n = 10,000 bootstrap samples where the mean difference was greater than or equal to zero.

sample	error type	individual model	composite model	bootstrap mean	bootstrap std	p value
simulated	validation	true fitness	mutational load + LBI	0.42	0.23	0.9644
simulated	validation	mutational load	mutational load + LBI	−1.03	0.21	<0.0001
simulated	validation	LBI	mutational load + LBI	−0.33	0.14	0.0091
simulated	test	true fitness	mutational load + LBI	−0.28	0.26	0.1392
simulated	test	mutational load	mutational load + LBI	−1.11	0.25	<0.0001
simulated	test	LBI	mutational load + LBI	−0.42	0.16	0.0001
natural	validation	mutational load	mutational load + LBI	−0.69	0.28	0.0036
natural	validation	LBI	mutational load + LBI	−0.23	0.09	0.0025
natural	validation	mutational load	mutational load + HI antigenic novelty	−0.31	0.18	0.0417
natural	validation	HI antigenic novelty	mutational load + HI antigenic novelty	−0.18	0.11	0.0513
natural	test	mutational load	mutational load + LBI	1.19	0.79	0.9432
natural	test	LBI	mutational load + LBI	−0.70	0.24	<0.0001
natural	test	mutational load	mutational load + HI antigenic novelty	−0.56	0.33	0.0133
natural	test	HI antigenic novelty	mutational load + HI antigenic novelty	−0.24	0.18	0.0999

Table 6

All model coefficients and performance on validation and test data for natural populations ordered from best to worst by distance to the future, as in Table 4.

Distances annotated with asterisks (*) were significantly closer to the future than the naive model as measured by bootstrap tests (see Materials and methods and Figure 15). Distances annotated with carets (∧) were not tested for significance relative to the naive model. Validation results are based on 23 timepoints. Test results are based on eight timepoints not observed during model training and validation. Model results for additional variants of fitness metrics including those based on epitope mutations and DMS preferences are included for reference. Source data are in Table 6—source data 1 and 2.

		Distance to future (AAs)		Model > naive
Model	Coefficients	Validation	Test	Validation	Test
mutational load	−0.68 +/− 0.34	5.44 +/− 1.80*	7.70 +/− 3.53	18 (78%)	4 (50%)
+ LBI	1.03 +/− 0.40
LBI	1.12 +/− 0.51	5.68 +/− 1.91*	8.40 +/− 3.97	17 (74%)	2 (25%)
oracle antigenic novelty	0.80 +/− 0.21	5.71 +/− 1.27^{^}	8.06 +/− 2.49^{^}	18 (78%)	2 (25%)
HI antigenic novelty	0.89 +/− 0.23	5.82 +/− 1.50*	5.97 +/− 1.47*	17 (74%)	6 (75%)
+ mutational load	−1.01 +/− 0.42
HI antigenic novelty	0.90 +/− 0.23	5.84 +/− 1.51*	5.99 +/− 1.46*	16 (70%)	6 (75%)
+ mutational load	−1.00 +/− 0.44
+ LBI	−0.04 +/− 0.09
HI antigenic novelty	0.83 +/− 0.20	6.01 +/− 1.50*	6.21 +/− 1.44*	16 (70%)	7 (88%)
delta frequency	0.79 +/− 0.47	6.13 +/− 1.71*	6.90 +/− 2.30	16 (70%)	5 (62%)
mutational load	−0.99 +/− 0.30	6.14 +/− 1.37*	6.53 +/− 1.39	17 (74%)	6 (75%)
Koel epitope antigenic novelty	0.28 +/− 0.36	6.22 +/− 1.26^{^}	6.72 +/− 1.51^{^}	18 (78%)	4 (50%)
naive	0.00 +/− 0.00	6.40 +/− 1.36	6.82 +/− 1.74	0 (0%)	0 (0%)
DMS entropy	−0.03 +/− 0.10	6.40 +/− 1.36^{^}	6.81 +/− 1.73^{^}	9 (39%)	6 (75%)
DMS mutational load	−0.02 +/− 0.13	6.45 +/− 1.42^{^}	6.82 +/− 1.73^{^}	7 (30%)	5 (62%)
epitope ancestor	0.53 +/− 0.52	6.60 +/− 1.34	6.53 +/− 1.51	12 (52%)	4 (50%)
+ mutational load	−0.77 +/− 0.32
DMS mutational effects	1.25 +/− 0.84	6.75 +/− 1.95	7.80 +/− 2.97	11 (48%)	4 (50%)
Wolf epitope antigenic novelty	0.31 +/− 0.51	6.83 +/− 1.30^{^}	6.97 +/− 1.41^{^}	4 (17%)	3 (38%)
epitope ancestor	0.23 +/− 0.51	6.89 +/− 1.39^{^}	6.82 +/− 1.67^{^}	8 (35%)	4 (50%)
epitope antigenic novelty	0.57 +/− 0.77	6.89 +/− 1.42	6.46 +/− 1.31	7 (30%)	4 (50%)
+ mutational load	−0.77 +/− 0.27
epitope antigenic novelty	0.52 +/− 0.73	7.13 +/− 1.47	6.70 +/− 1.51	7 (30%)	5 (62%)

Table 6—source data 1 Coefficients for models fit to natural populations.: https://cdn.elifesciences.org/articles/60067/elife-60067-table6-data1-v2.csv
Download elife-60067-table6-data1-v2.csv
Table 6—source data 2 Distances to the future for models fit to natural populations. Clustering of amino acid sequences for visualization.: https://cdn.elifesciences.org/articles/60067/elife-60067-table6-data2-v2.csv
Download elife-60067-table6-data2-v2.csv

Additional files

Supplementary file 1 GISAID accessions and metadata including originating and submitting labs for natural strains used across all timepoints.: https://cdn.elifesciences.org/articles/60067/elife-60067-supp1-v2.xlsx
Download elife-60067-supp1-v2.xlsx
Transparent reporting form: https://cdn.elifesciences.org/articles/60067/elife-60067-transrepform-v2.docx
Download elife-60067-transrepform-v2.docx

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Mendeley

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

John Huddleston
John R Barnes
Thomas Rowe
Xiyan Xu
Rebecca Kondor
David E Wentworth
Lynne Whittaker
Burcu Ermetal
Rodney Stuart Daniels
John W McCauley
Seiichiro Fujisaki
Kazuya Nakamura
Noriko Kishida
Shinji Watanabe
Hideki Hasegawa
Ian Barr
Kanta Subbarao
Pierre Barrat-Charlaix
Richard A Neher
Trevor Bedford

(2020)

Integrating genotypes and phenotypes improves long-term forecasts of seasonal influenza A/H3N2 evolution

eLife 9:e60067.

https://doi.org/10.7554/eLife.60067

Share this article

Cite this article

Schematic representation of the fitness model for simulated H3N2-like populations wherein the fitness of strains at timepoint t determines the estimated frequency of strains with similar sequences one year in the future at timepoint u.

Time-series cross-validation scheme for simulated populations.

Phylogeny of H3N2-like HA sequences sampled between the 24th and 30th years of simulated evolution.

Simulated population model coefficients and distances between projected and observed future populations as measured in amino acids (AAs).

Simulated population model coefficients and distances to the future for individual biologically-informed fitness metrics and the best composite model.

Composite model coefficients and distances to the future for models fit to simulated populations.

Test of best model for simulated populations (true fitness) using 9 years (18 timepoints) of previously unobserved test data and fixed model coefficients.

Validation of best model for simulated populations of H3N2-like viruses.

Validation of naive model for simulated populations of H3N2-like viruses.

Test of naive model for simulated populations of H3N2-like viruses.

Time-series cross-validation scheme for natural populations.

Natural population model coefficients and distances to the future for individual biologically-informed fitness metrics.

Comparison of epitope-based models with knowledge of the future.

Natural population model coefficients and distances to the future for composite fitness metrics.

Composite models fit to most recent data from natural populations.

Test of best model for natural populations of H3N2 viruses, the composite model of HI antigenic novelty and mutational load, across eight timepoints.

Validation of best model for natural populations of H3N2 viruses.

Validation of naive model for natural populations of H3N2 viruses.

Test of naive model for natural populations of H3N2 viruses.

Figure 11—source data 1

Relative improvement of model selections over vaccine strains.

Snapshot of live forecasts on nextstrain.org from our best model (HI antigenic novelty and mutational load) for April 1, 2021.

Snapshot of the last two years of seasonal influenza H3N2 evolution on nextstrain.org showing the estimated distance per strain to the future population.

Bootstrap distributions of the mean difference of distances to the future between biologically-informed and naive models for simulated populations.

Bootstrap distributions of the mean difference of distances to the future between biologically-informed and naive models for natural populations.

Distribution of distances to the future by Hemisphere for the best natural model (HI antigenic novelty and mutational load) subtracted from the corresponding distance to the future for the naive model at the same timepoint.

Summary of models used with simulated and natural populations.

Number of epitope and non-epitope mutations per branch by trunk or side branch status for simulated populations.

Number of epitope and non-epitope mutations per branch by trunk or side branch status for natural populations.

Simulated population model coefficients and performance on validation and test data ordered from best to worst by distance to the future in the validation analysis.

Table 4—source data 1

Table 4—source data 2

Comparison of composite and individual model distances to the future by bootstrap test (see Materials and methods).

All model coefficients and performance on validation and test data for natural populations ordered from best to worst by distance to the future, as in Table 4.

Table 6—source data 1

Table 6—source data 2

Supplementary file 1

Transparent reporting form

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)