Research Article

Defining hierarchical protein interaction networks from spectral analysis of bacterial proteomes

Department of Pathology and Immunology, Washington University School of Medicine, United States
Duchossois Family Institute, University of Chicago, United States
Department of Genetics, Washington University School of Medicine, United States
Department of Developmental Biology, Washington University School of Medicine, United States
The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, United States
Department of Pathology, University of Chicago, Chicago, United States
Center for the Physics of Evolving Systems, University of Chicago, Chicago, United States

Aug 17, 2022

Open access
Copyright information

Abstract
Editor's evaluation
Introduction
Results
Discussion
Materials and methods
Data availability
References
Article and author information
Metrics

Abstract

Cellular behaviors emerge from layers of molecular interactions: proteins interact to form complexes, pathways, and phenotypes. We show that hierarchical networks of protein interactions can be defined from the statistical pattern of proteome variation measured across thousands of diverse bacteria and that these networks reflect the emergence of complex bacterial phenotypes. Our results are validated through gene-set enrichment analysis and comparison to existing experimentally derived databases. We demonstrate the biological utility of our approach by creating a model of motility in Pseudomonas aeruginosa and using it to identify a protein that affects pilus-mediated motility. Our method, SCALES (Spectral Correlation Analysis of Layered Evolutionary Signals), may be useful for interrogating genotype-phenotype relationships in bacteria.

Editor's evaluation

Since the inception of comparative genomics, mining phyletic patterns has been a powerful approach for the discovery of previously unknown biological interactions. The authors use a combination of singular value decomposition of the phyletic pattern matrix and random forests classification method to uncover potential protein-protein interactions. The work illustrates the utility of such methods, which are finding increasing application in addressing various computational biological problems, such as predicting protein-protein interactions from genomic information.

https://doi.org/10.7554/eLife.74104.sa0

Introduction

A fundamental problem in biology is to understand how proteins interact to create a complex phenotype (Barabási and Oltvai, 2004; Chuang et al., 2010; Hartwell et al., 1999; Costanzo et al., 2016). Biochemical and genetic studies have illustrated that complex behaviors emerge from layers of protein interactions: proteins interact to form complexes, complexes interact to form pathways, and pathways interact to create phenotypes (Papin et al., 2004; Ravasz, 2009; Nurse, 2008). Current strategies for identifying protein-protein interactions (PPIs) span both experiment and computation. Experimental methods are rapidly becoming more high-throughput and comprehensive (Rajagopala et al., 2014; Schoenrock et al., 2017; Häuser et al., 2014; Koo et al., 2017; Luck et al., 2017). Computational methods based on statistical patterns of co-occurrence or co-proximity of functionally related genes first appeared shortly after publication of whole genome sequences and are an active area of research (Eisen, 2017; Pellegrini et al., 2017; Enright et al., 2017; Valencia and Pazos, 2002). More recent efforts have advanced the state-of-the-art in computational methods by incorporating evolutionary models (phylogenomics), interaction models borrowed from statistical physics, or spectral methods borrowed from signal processing (Nagy et al., 2020; Franceschini et al., 2016; Moi et al., 2020; Croce et al., 2019; Cong et al., 2019; Green et al., 2021).

Understanding how a phenotype emerges from a collection of proteins requires the ability to relate different ‘scales’ of interactions, from pairwise to higher-order. While higher-order interactions can be inferred from pairwise interactions, recent experimental evidence shows that such inferences are incomplete (Kuzmin et al., 2018). Using bacteria as a model, here we show that both pairwise and higher-order interactions can be extracted from coevolutionary statistics to create a single multi-scale hierarchical model describing the emergence of complex phenotypes.

Results

Global patterns of covariation between bacterial orthologs arise from phylogeny

The power of using the kingdom bacteria as a model for evolutionary analysis is the availability of thousands of high-quality proteomes originating from diverse organisms. To broadly sample extant bacterial diversity, we downloaded all bacterial proteomes from the UniProt Reference Proteome Database (UniProt Consortium, 2019; downloaded May 20, 2020). Each proteome was annotated using orthologous gene groups (OGGs)—a robust and computationally tractable way of inferring orthologs—and rare OGGs present in less than 1% of proteomes were filtered (Overbeek et al., 1999). The resulting data matrix, D^OGG, consisted of 7047 proteomes (rows) and 10,177 OGGs (columns), where each entry is the number of times an OGG was observed in a proteome (Figure 1A, Materials and methods; Huerta-Cepas et al., 2017; Huerta-Cepas et al., 2019).

Figure 1

Download asset Open asset

Shallow components of covariation measured across bacterial orthologs reflect broad phylogenetic relationships.

(A) D^OGG. Rows are 7047 bacterial proteomes, columns are 10,177 orthologous gene groups (OGGs), entries are the number of annotations of an OGG within a bacterial proteome (Figure 1—source data 1). (B) Percent variance explained versus spectral component (singular value decomposition [‘SVD] component’) number; fit is to a power-law distribution with the indicated exponent (γ). (C) Contributions of bacterial proteomes (colored dots) onto SVD components 1 through 4 (percent variance indicated in parenthesis on axis labels). Dots are colored according to phylum designation (color key).

Figure 1—source data 1 D^OGG matrix shown in Figure 1A.: https://cdn.elifesciences.org/articles/74104/elife-74104-fig1-data1-v2.zip
Download elife-74104-fig1-data1-v2.zip

To explore the structure of ortholog covariation across bacteria, we analyzed D^OGG using a technique called singular value decomposition (SVD), a generalized form of principal components analysis (PCA; Klema and Laub, 1980). SVD defines a spectrum of components of covariation (an ‘SVD spectrum’) where component 1 (SVD₁) explains more data variance than any other component, SVD₂ the second most, and so on (Figure 1B). We observed that bacteria sharing the same phylum clustered together on SVD₁ through SVD₄ suggesting that the most dominant patterns of bacterial variation arise from phylogeny (Figure 1C). These observations are consistent with prior reports that strong phylogenetic signals confound PPI inference using comparative genomics (Schäfer and Strimmer, 2005; Sul et al., 2018). Taken together, SVD₁ to SVD₄ explained only 17% of the overall data variance, motivating us to ask what information lies in the remaining data variance—more phylogenetic signal, functional signal, or noise?

Global to local patterns of bacterial covariation progressively reveal phylogeny, pathways, and protein complexes

We analyzed the SVD spectrum for information regarding (i) phylogeny, (ii) indirect interactions between proteins reflecting biological pathways, and (iii) direct (physical) interactions between proteins reflecting protein complexes. The workflow for our analysis was as follows. SVD applied to D^OGG output three separate matrices: bacterial projections onto left-singular vectors (U^OGG), a set of singular values, and OGG projections onto right-singular vectors (V^OGG; Figure 2A). We determined the statistical similarity of two bacterial proteomes across the SVD spectrum by dividing U^OGG into five-component spectral windows and computing the correlated SVD projections (‘spectral correlations’) for all pairs of proteomes within a spectral window (Figure 2B and C). We determined the extent to which two proteins in a proteome statistically interact across the SVD spectrum by (i) approximating projections of proteins onto right-singular vectors by averaging their constituent OGG content to create V^protein, (ii) dividing V^protein into five-component spectral windows, and (iii) computing protein-protein spectral correlations (Figure 2—figure supplement 1, Figure 2D and E). We then measured the information shared between spectral correlations and the three categories of biological benchmarks using mutual information (MI) and boostrap statistical support (Figure 2F; Materials and methods). For an estimation of the background MI attributed to finite sampling, we shuffled the SVD projections by random permutation across a row of V^OGG or U^OGG and computed the MI of the shuffled projections with the benchmark (Figure 2—figure supplement 2). This operation maintained the distribution of SVD projections while destroying spectral correlations arising from biology, thereby providing a null expectation for MI arising solely from sampling constraints. Collections of biological benchmarks were created from curating publicly available databases and previously published results (Materials and methods) (NCBI Resource Coordinators, 2018; Szklarczyk et al., 2019; Gene Ontology Consortium, 2021; Keseler et al., 2017; Cong et al., 2019). We focused our analysis of protein interactions on the proteome of Escherichia coli K12 because of the wealth of existing data regarding indirect and direct PPIs for this organism.

Figure 2 with 2 supplements see all

Download asset Open asset

Workflow for relating patterns of ortholog covariation with phylogeny and protein interactions.

(A) Singular value decomposition (SVD) performed on ***D^OGG*** yields ***U^OGG*** (rows are proteomes, columns are ‘left singular vectors’ [LSVs]) and ***V^OGG*** (columns are OGGs, rows are ‘right singular vectors’ [RSVs]). ***U^OGG*** is used to relate information in the SVD spectrum with phylogenetic benchmarks. ***V^OGG*** is used to relate information in the SVD spectrum with protein interaction benchmarks. (**B,C**) Spectral correlations between bacterial proteomes are calculated by defining spectral windows of five LSVs each (panel A) and computing correlated projections between proteomes across all spectral windows (panel B). (**D,E**) Spectral correlations between proteins within a proteome are calculated by (i) approximating the projections of proteins onto RSVs, (ii) defining spectral windows of five RSVs each (panel C), and (iii) computing correlated projections between proteins across all spectral windows (panel D). (F) The final step is to compute the information shared between spectral correlations and biological benchmarks of phylogenetic relationships (left panel) or protein interactions (right panel). Shown are example distributions of spectral correlations that have ‘high’ and ‘low’ amounts of MI with a benchmark.

We found that the information for the different types of benchmarks was distributed distinctly across the SVD spectrum. The top tens of components contained phylogenetic information, the top hundreds contained indirect PPI information, and the top thousands contained direct PPI information (Figure 3A). Even SVD components 2996 through 3000 harboring 0.025% data variance contain non-random biological information reflecting direct PPIs (Figure 3B). Beyond component 3000, the MI shared between protein spectral correlations and PPIs converged to the estimation of background MI. Comparing the relative distributions of these different types of information across the SVD spectrum, we observed an order of phylum, class, order, family, genus, indirect PPIs, mixed indirect/direct PPIs, and direct PPIs (Figure 3C, Materials and methods). The ordering of these distributions across the SVD spectrum was robust to subsampling D^OGG to account for uneven phylogenetic distributions of bacterial strains in the UniProt database (Figure 3—figure supplement 1). These results show that global to local patterns of bacterial covariation reflect an intuitive hierarchy of biological scale—phylogeny, pathway, protein complex.

Figure 3 with 1 supplement see all

Download asset Open asset

Shallow to deep spectral components of ortholog covariation reflect global to local biological ‘scales’.

(A) Distribution of information (y-axis, ‘mutual information [MI] density’) for each benchmark (see legend) measured across the singular value decomposition (SVD) spectrum (x-axis, ‘spectral position’). Gray line in each plot is the distribution of background MI (see Figure 2—figure supplement 2). Lines and shaded contours represent the mean±2 standard deviations for bootstraps of each benchmark. (B) Information shared between three benchmarks of direct protein-protein interactions (PPIs) (x-axis) and spectral correlations computed across SVD₂₉₉₆ to SVD₃₀₀₀. Each dot is the MI value for a single bootstrap of the indicated benchmark. Colored dots are non-random MI, gray dots are MI values for background spectral correlations. Values of statistical significance are shown above each distribution (p-value, Student’s t-test). (C) The degree to which information within the SVD spectrum reflects a biological benchmark, reported by the ‘cumulative density of MI’ (y-axis). As a curve for a benchmark approaches a value of ‘1’, deeper spectral components contain progressively less information regarding the benchmark. Colors follow those of panel A. Shaded regions are ±2 standard deviations of the mean MI value.

Figure 3—source data 1 NCBI taxonomic strings for each organism used to generate phylogenetic benchmarks.: https://cdn.elifesciences.org/articles/74104/elife-74104-fig3-data1-v2.xlsx
Download elife-74104-fig3-data1-v2.xlsx
Figure 3—source data 2 Benchmarks of protein-protein interactions (PPIs) in Escherichia coli K12.: https://cdn.elifesciences.org/articles/74104/elife-74104-fig3-data2-v2.xlsx
Download elife-74104-fig3-data2-v2.xlsx

A statistical approach for predicting the organization of emergent phenotype

Given the results shown in Figure 3, we hypothesized that by relating deep and shallow SVD components, we could create a statistical representation of emergence—the integration of local scales of protein interactions into global scales reflecting collective biological functions. Operationally, the way we related statistical information across shallow and deeper SVD components involved five steps. First, we removed SVD components enriched for phylogenetic signal and noise (components 1–33 and 3001–7047) (Figure 4A). Second, we computed spectral correlations between all proteins in a proteome across the remaining components (Figure 4B). Third, we removed spurious spectral correlations by developing a model of statistical significance. The model defined an optimal spectral window of 100 components for computing spectral correlations as well as a threshold for what constitutes ‘statistically significant’ spectral correlations (Figure 4—figure supplement 1, Materials and methods). Fourth, we defined the position in the SVD spectrum at which the spectral correlation between two proteins first dropped below the significance threshold; we term this position the ‘spectral depth’ of correlation between two proteins. Fifth, we discard all spectral correlations, significant or not, deeper than the spectral depth (Figure 4C). The effect of the last step is to condition spectral correlations found in deep regions of the SVD spectrum upon those found in shallow regions. Our rationale was that for a local interaction, like that in a protein complex, to contribute to a biological hierarchy, it must also contribute to a more global interaction network like a pathway.

Figure 4 with 1 supplement see all

Download asset Open asset

Workflow for computing the ‘spectral depth’ between pairs of proteins.

(A) Spectral components enriched for indirect and direct protein interactions (25th to 75th interquartile range of cumulative mutual information [MI] density) are selected, thereby filtering components enriched for phylogeny and noise. (B) Spectral correlations are computed for all pairs of proteins within a proteome; spurious spectral correlations introduced by finite sampling are filtered. Plotted here are spectral correlations (y-axis) as a function of spectral position (x-axis) for two pairs of proteins in *Escherichia coli* K12: MotA-MotB and MotA-CheR. Dashed line reflects the threshold defining statistically significant spectral correlations. (C) ‘Spectral depth’ is the spectral position at which the correlation value first reaches the threshold of statistical significance. Spectral depths of MotA-MotB and MotA-CheR are shown.

Following this five-step approach, we attempted to statistically re-derive the pathway of directed motility in E. coli K12. We chose to study this pathway because (i) there is a wealth of previous biological data regarding its constituent parts and interactions and (ii) it is an illustrative example of a global phenotype that emerges from several layers of protein interactions. From the KEGG hierarchy of directed motility in E. coli K12 (KEGG hierarchy, BRITE ECO:02035), there are three levels of molecular interactions. At the lowest levels, physical interactions between proteins create small units of collective structure and function, such as a basal body, rod, ring, motor, and filament. Integration of these structures and their individual functions produces the flagellum, a machine that turns to move the cell. Integration of the flagellum and the chemotaxis system produces directed motility—the ability to move purposefully along a chemical gradient. We used the flagellar filament, FliC, as a ‘bait’ for identifying spectrally correlated proteins in E. coli K12 putatively related to motility. Aside from this, no other biological knowledge was used: that is, no information regarding components or interactions producing complexes or collective structures contributing to directed motility. We found that 75 proteins were spectrally correlated with FliC (Figure 5A). We computed the spectral depth across all pairs of proteins (Figure 5B). We observed that the pattern of spectral depths was heterogeneous: a majority of protein pairs exhibited shallow spectral depths while sparse groups of proteins exhibited deeper spectral depths (Figure 5B). To depict the structure of spectral depths across all pairs of proteins, we thresholded spectral depth at three different levels: 50, 300, and 1000. These thresholds were chosen to reflect areas of the SVD spectrum found to be enriched for different types of biological information per Figure 3C. If two proteins had a spectral depth deeper than the threshold, they were considered to ‘statistically interact’ (Figure 5C, Supplementary file 1). Network graphs at each thresholded spectral depth were created where nodes are proteins and edges between nodes represent a statistically significant spectral correlation between two proteins.

Figure 5

Download asset Open asset

Pattern of spectral correlations with flagellar filament, FliC, in *Escherichia coli* K12.

(A) Proteins that shared significant spectral correlations with FliC after filtering for phylogeny and noise. (B) Hierarchically clustered spectral depth matrix for all pairs of proteins in panel A. (C) Set of matrices derived from thresholding the spectral depth matrix. Red pixels indicate that two proteins have a spectral depth deeper than the indicated threshold and therefore ‘statistically interact’. White pixels indicate that two proteins have a spectral depth of interaction shallower than the indicated threshold and therefore do not statistically interact.

At a spectral depth of 50, we observed a single densely connected network devoid of obvious substructure (Figure 6A, top panel). Gene-set enrichment analysis (GSEA) indicated that this network was enriched for functional terms related to ‘flagellar system’ (p<10^–45; Huang et al., 2009a; Huang et al., 2009b, Materials and methods). Progressing to spectral depth of 300, we observed that the network at 50 fractured into four discrete subnetworks (Figure 6A, middle panel). These subnetworks were significantly enriched for terms related to ‘chemotaxis signaling’ (p<10^–15), ‘flagellum’ (p<10^–56), ‘LPS biosynthesis’ (p<10^–3), or ‘cyclic di-GMP signaling’ (p<10^–21). Progressing to spectral depth of 1000, the subnetworks at 300 fractured further yielding 9 discrete subnetworks. Each subnetwork was significantly enriched for terms related to a specific function such as ‘cyclic di-GMP catabolism’ (p<10^–25) and ‘cyclic di-GMP synthesis’ (p<10^–13) or ‘chemotransmission’ (p<10^–4) and ‘chemoreception’ (p<10^–12 ; Figure 6A, bottom panel). Notably, considering additional spectral depths revealed further details of relevant hierarchical functional relationships within this pathway (Figure 6—figure supplement 1). Taken together the three network diagrams derived at spectral depths 50, 300, and 1000 depict a hierarchy of structure and function. Subnetworks observed at deeper spectral depths integrate to form the subnetworks observed at shallower spectral depths. As the subnetworks coalesced, the p-value associated with GSEA remained highly significant while the ontology of the significantly enriched terms changed. We interpret these observations to mean that as we ascend the statistical hierarchy, collective structures corresponding to new biological functions emerge from the integration of functional units at lower levels.

Figure 6 with 4 supplements see all

Download asset Open asset

A statistically derived hierarchical model of *Escherichia coli* K12 motility.

(A) Statistical interaction networks defined at spectral depths 50 (top), 300 (middle), and 1000 (bottom). Nodes (yellow circles) are proteins; edges (red lines) reflect statistical interactions between proteins; contours are drawn around groups of connected proteins; assignment of function associated with contours is based on gene-set enrichment analysis (GSEA) with p-value of most-enriched GSEA term in parenthesis (Supplementary file 1). (B) Comparison of the statistically derived model (left) to the KEGG model (BR:eco02035, right) of *E. coli* K12 motility. Venn diagrams represent the overlap between the sets of proteins in the indicated subnetwork of panel A (left) and the indicated KEGG category (right).

To validate our statistical model of motility, we compared it to the experimentally derived model of E. coli K12 motility detailed within the KEGG database (BR:eco02035; Kanehisa and Goto, 2000; Kanehisa, 2019; Kanehisa et al., 2021; Figure 6B). The two models were similar in several ways. First, 44 of 55 of the proteins listed in the KEGG hierarchy also appeared in the statistical hierarchy. Second, 7 of the 12 categories listed in the KEGG hierarchy had a one-to-one correspondence with a subnetwork of the statistical model sharing an overlapping set of proteins and similar descriptive label. Finally, both hierarchies shared a conserved architecture consisting of the integration of chemoreception and chemotransmission into chemotaxis signaling, the integration of flagellar substructures into the flagellum, and at the most global level the integration of chemotaxis and the flagellum. The most striking difference was that our statistical hierarchy included subnetworks related to cyclic-di-GMP signaling and LPS biosynthesis which were absent from the KEGG hierarchy. Prior experimental studies have provided direct genetic evidence that these systems are involved in E. coli K12 motility (Paul et al., 2010; Walker et al., 2004). Overall, of the 75 proteins in our hierarchical model of E. coli K12 motility, 44 (59%) were represented in the KEGG hierarchy, 28 (37%) were missing from the KEGG hierarchy but supported by prior experimental evidence in the literature, and only 3 (4%) remained unvalidated (CsgG, PpdD, TorS; Supplementary file 1). Taken together, these results illustrate that our approach for extracting a hierarchy of spectral correlations from the SVD spectrum produced a valid multi-scale, hierarchical model of directed motility in E. coli K12.

We performed four additional analyses to test the robustness and generalizability of our approach. First, we produced a hierarchical model of motility in E. coli K12 using MotB, the flagellar motor protein, as a query. We found a similar architecture as observed using FliC as the query with chemotaxis signaling, flagellum, and cyclic-di-GMP signaling modules appearing at spectral depth 300, and more fine-grained subnetworks appearing in deeper layers (Figure 6—figure supplement 2 Supplementary file 2). To test generalizability across different bacteria, we created a model of motility in Bacillus subtilis 168 using its flagellar filament protein as a query (Hag). This analysis again produced a hierarchical model of motility that (i) recapitulated the corresponding KEGG hierarchy, (ii) identified proteins missing from the KEGG hierarchy that are known effectors of B. subtilis motility, and (iii) identified a small number of putative motility effectors (Figure 6—figure supplement 3 Supplementary file 3). Next, we tested if our method could generalize to non-physically coupled pathways. We produced a model of amino acid metabolism in E. coli K12 using the query protein HisG, an enzyme involved in histidine biosynthesis. The resultant hierarchical model identified 130 proteins that were densely connected at spectral depth of 50. Progressing to deeper spectral depths revealed modules corresponding to specific functions, such as amino acid and nucleotide biosynthesis. At yet deeper spectral depths, modules enriched for proteins involved in the synthesis of specific amino acids became evident (Figure 6—figure supplement 4 Supplementary file 4). Taken together, these analyses demonstrated that our approach of defining a hierarchy of spectral correlations produced valid hierarchical models of biological pathways across different query proteins, organisms, and pathways.

Inferring functions for an uncharacterized protein from a model of emergent organization

We hypothesized that our statistical models could be used to assign general and specific functions to previously uncharacterized proteins in bacteria. To test this idea, we applied our approach to Pseudomonas aeruginosa, an organism with many uncharacterized proteins. We defined a hierarchical protein interaction network using PilA, a core structural element of the pilus, as a query; 141 proteins were spectrally correlated with PilA. At a spectral depth of 50, these proteins were found to collectively associate with motility of P. aeruginosa (Figure 7—figure supplement 1A Supplementary file 5; Burrows, 2004). Progressing to the spectral depth of 300, the global structure fractured into subnetworks associated with specific functions. Two major subnetworks related to specific types of motility, ‘pilus motility’ (p<10^–17) and ‘bacterial flagellum’ (p<10^–21), were evident, consistent with the ability of P. aeruginosa to use pilus or flagellar-based mechanisms of movement (Kearns et al., 2004; Rashid and Kornberg, 2004; Figure 7A, Figure 7—figure supplement 1B). Four uncharacterized proteins (Q9I5G6, Q9I5R2, Q9I0G2, Q9I0G1) were included in the pilus subnetwork that had not previously been associated with PilA in P. aeruginosa (Figure 7B Supplementary file 6). Our hierarchical models suggested that these proteins may contribute to the general function of directed motility by affecting the specific function of pilus-based motility but not flagellar-based motility.

Figure 7 with 1 supplement see all

Download asset Open asset

Prediction and validation of a novel effector of twitch motility in *Pseudomonas aeruginosa*.

(A) Statistical network derived by applying a spectral depth threshold of 300 to the set of 141 protein in *P. aeruginosa* (strain PAO1) that were significantly correlated with PilA across SVD₃₄ to SVD₁₃₄ (see Figure 7—figure supplement 1). Nodes, edges, shaded contours, gene-set enrichment analysis (GSEA) enrichment terms, and p-values are defined in the same manner as in Figure 6A. (B) The pilus motility subnetwork from panel A. Nodes representing PilA and Q9I5G6 are colored blue and green, respectively. (C) Time-course of pilus-based motility for parent (WT), two transposon mutants of Q9I5G6 (Tn(1) Q9I5G6, Tn(2) Q9I5G6), and transposon mutants complemented with Q9I5G6 (Tn(1) Q9I5G6+Q9I5G6, Tn(2) Q9I5G6+Q9I5G6). Inset shows results of flagellar motility for the parent strain (WT), and the two transposon mutants of Q9I5G6 (Tn(1), Tn(2)) 24 hr post-inoculation. Representative images of the crystal-violet stained plates are shown. SVD, singular value decomposition.

To test these predictions, we assayed single-gene transposon mutants of P. aeruginosa (PAO1) for twitch-based motility mediated by the pilus or flagellar-based motility (Materials and methods) (Kearns et al., 2004; Rashid and Kornberg, 2004; Little et al., 2018). Transposon mutants of Q9I5R2, Q9I0G2, and Q9I0G1 exhibited motility that was not significantly different from the parent strain in both assays (Supplementary file 7). In contrast, we found that two different transposon mutants of Q9I5G6 exhibited significantly reduced pilus-based motility over 24, 48, and 72 hr compared to the parent strain (Figure 7C, p<10^–4 by Dunnett’s multiple comparisons test). This phenotype resembled that of a knockout strain of PilA and was reversed upon trans-complementation. In contrast, flagellar-based motility of the transposon mutants in Q9I5G6 was not significantly different from that of the parent strain (Figure 7C—inset, p>0.05). These results illustrate that Q9I5G6 is a previously unappreciated effector of directed motility in P. aeruginosa that specifically impacts twitch-based motility. Compared to the background expectation of finding a protein that affects twitch motility (22 ‘pilus assembly proteins’ in BRITE KO02035 out of 5564 proteins in the PAO1 proteome equating to a 0.4% background rate of association), our experimental results represent a statistically significant enrichment (25% association rate, p<10^–4 by chi-squared). Moreover, the results of our statistical approach shown in Figure 7B illustrate a far higher enrichment of functional association with identifying 19 effectors of twitch motility out of 22 proteins in the ‘pilus motility module’. Overall, these experiments provide a proof of concept of how our hierarchical models may aid in discovering novel genotype-phenotype relationships.

Using spectral correlations to predict proteome-wide functional and physical protein interaction networks

Microbiome science has taught us that diverse bacteria affect human and environmental health. Therefore, there is a critical need to expand our knowledge of biology more broadly beyond the few well-studied model organisms. Comparative genomics has been critical to inferring functionally relevant interaction networks in newly sequenced organisms. However, these methods are limited in their ability to isolate functional and physical interaction networks from each other as well as from the admixed phylogenetic signal and noise due to finite sampling (Schäfer and Strimmer, 2005; Sul et al., 2018; Nagy et al., 2020). Our spectral approach offers a unique opportunity to isolate interactions arising at a specific biological scale. We hypothesized that spectral correlations could be used to accurately predict proteome-wide functional and physical interaction networks across diverse bacteria.

To test this hypothesis, we classified all possible pairs of proteins in the E. coli K12 proteome as functionally interacting (indirect PPI), physically interacting (direct PPI), or not-interacting (Materials and methods). These predictions were based on protein-protein spectral correlations across three different spectral windows chosen to isolate phylogenetic, indirect PPI, and direct PPI information, respectively (Figure 8—figure supplement 1). For comparison, we predicted the same interaction classes using quantitative features of various existing computational and experimental methods. To compare the various methods, we computed F-scores for each interaction class using a composite gold-standard benchmark as well as three individual database-wide benchmarks (Figure 8—figure supplement 2; Rajagopala et al., 2014; Babu et al., 2014; Babu et al., 2018; Hu et al., 2009). F-score is the harmonic mean of precision and recall; to achieve an F-score of 1, the predictions must be both accurate and complete.

We found that our predictions based on windowed spectral correlations produced significantly greater F-scores for all three interaction classes compared to 18 alternative methods in all four validation tasks (Figure 8A, Figure 8—figure supplement 3, statistical comparisons by Wilcoxon rank-sum test). Of the alternative methods tested, the most mathematically similar to ours is SVD-phy (Franceschini et al., 2016). SVD-phy is a PCA-based approach that also uses SVD but assumes that only the top (most global) components are useful for predicting PPIs. In contrast our approach to selecting specific SVD components is data-driven, guided by the results shown in Figure 3C. We found that the performance of SVD-phy depended on how many of the top components were retained. Including the most shallow SVD components yielded predictions at or below the median rank of all methods across all three interaction classes. Including the shallowest 100 SVD components yielded better indirect PPI prediction without improvement of predicting direct PPIs. Including components beyond SVD₁₀₀ improved the prediction of direct PPIs while decreasing F-scores for indirect PPIs. These results show that SVD-phy can be tuned for predicting either indirect or direct interactions, but not both simultaneously. In addition, the tuned performance on either class was inferior compared to our method of windowed spectral correlations.

Figure 8 with 3 supplements see all

Download asset Open asset

Mutual information (MI) windowed spectral correlations (MIWSCs) enable accurate classification of indirect and direct protein-protein interactions (PPIs).

See Figure 8—figure supplement 1 for definition of ‘MIWSCs’. (A) F-scores for predicting interaction classes for *Escherichia coli* K12 protein pairs using random forest (RF) models trained on MIWSCs or features from several different methods (see legend). Violin plots show distribution of F-scores for models trained and validated on 50 random partitions of the gold-standard dataset. Numbering indicates the rank of the median F-score for models trained on each feature (Figure 8—source data 1). (B) Precision (left) and recall (right) for predictions of any (indirect or direct), direct, or indirect PPIs in 12 bacteria using RF models trained on the MIWSCs of *E. coli* K12 proteins benchmarked against the experimentally supported PPIs in the STRING database (experimental score > 0). Comparisons are made to a set of 10,000 randomly selected pairs and to the ‘medium confidence’ predictions (score > 400) in the STRING database subchannels for GC, GN, and GF. Vertical dashed line indicates the median value for the best performing method. ** in legend indicates an organism that was not part of the input dataset ***D^OGG*** (Figure 8—source data 1). (C) Precision-recall curves were constructed for the methods of GC, GN, GF by thresholding the subchannel scores at 150 (‘low confidence’), 400 (‘medium confidence’), 700 (‘high confidence’), and 900 (‘highest confidence’). The precision versus recall is plotted for any (indirect or direct), direct, or indirect PPIs predicted using the RF models trained on MIWSCs. Symbols and whiskers represent the median and 25–75 percentile range, respectively, for the predictions produced for the 12 organisms in panel B (Figure 8—source data 1). (D) Percent of predicted direct PPIs in *Mycobacterium tuberculosis* H37Rv supported by an absent (0), low (0–0.4), or high (>0.4) composite score (left) or an absent (0) or present (>0) experimental subchannel score (right) in the STRING database. Comparisons were made between the methods of random selection (Random), amino acid coevolution (Cong et al., 2019), or RF models trained on MIWSC features of *E. coli* K12 proteins (MIWSC). Numbers of predicted interactions in each bin are indicated (Figure 8—source data 1).

Figure 8—source data 1 Data and statistical support for random forest (RF) model validation studies, related to Figure 8Figure 8.: https://cdn.elifesciences.org/articles/74104/elife-74104-fig8-data1-v2.xlsx
Download elife-74104-fig8-data1-v2.xlsx

Our choice of SVD windows used to compute spectral correlations was guided by MI distributions generated using known E. coli K12 PPIs. This observation led to the concern that our approach to PPI prediction may not generalize well to other organisms. On the other hand, the underlying spectral correlations were defined from a large ensemble of bacterial proteomes, suggesting that our approach may accurately predict PPIs across diverse bacteria. To test this idea, we predicted proteome-wide direct PPIs for 11 additional phylogenetically diverse bacteria, including one organism (Azotobacter vinelandii) that was not represented in D^OGG (Materials and methods). We computed the precision and recall for our predictions of any (indirect or direct), indirect, or direct PPIs using the experimentally supported PPIs in the STRING database as a benchmark (score > 0 in the STRING ‘experimental’ subchannel). We compared these results to those produced by the well-established methods of gene co-occurrence, gene neighborhood, and gene fusion by thresholding the corresponding STRING subchannel scores at different confidence levels (low, medium, high, highest). Figure 8B shows this comparison at the ‘medium’ confidence level. Figure 8C summarizes the entire analysis showing the precision-recall curves across all thresholds tested. We found that the sets of any indirect or direct PPIs produced by our method exhibited a higher precision for a given recall compared to the established methods. These results show that though the SVD windows used to compute spectral correlations were chosen based on analysis of PPIs in the E. coli K12 proteome, our approach for predicting PPIs performs as well or better than established methods across different organisms.

The state-of-the-art PPI prediction method is that produced by Cong and colleagues (Coev+) that infers direct PPIs from proteome-wide amino acid coevolution (AA Coev) (Cong et al., 2019). Briefly, they predict direct PPIs from primary sequence using a method called ‘direct coupling analysis’ (DCA) borne out of the field of statistical physics. Using the direct PPIs for Mycobacterium tuberculosis H37Rv found in the STRING database as a benchmark, we found that our approach exhibited significantly greater precision and recall whether using the STRING composite score (as done by Cong et al.) or the STRING experimental subscores (Figure 8D, Materials and methods). While we limited our comparative analysis to only M. tuberculosis, these results illustrate that our spectral approach to purifying biological information arising from a specific scale can produce predictions that are on par or better than a more computationally expensive alternative that leverages a higher-resolution genomic feature.

Discussion

We developed a statistical method that predicts hierarchical protein interaction networks reflecting the emergence of complex functions in bacteria. This method involves two aspects: (i) spectral decomposition of variation across bacterial proteomes into components enriched for specific biological scales, (ii) relating covariation structure across components to define hierarchical statistical networks. These hierarchical networks closely resembled the known organization for several well-studied bacterial phenotypes. We call our approach SCALES—Spectral Correlation Analysis of Layered Evolutionary Signals. Having shown that SCALES can be useful for guiding biological discovery, we have developed the following resources: (i) a precomputed database of proteome-wide indirect (122,725,727) and direct (19,546,063) PPI predictions for all 7047 UniProt reference bacterial proteomes; (ii) a tool for predicting indirect and direct PPIs for a user-input proteome; (iii) a tool for generating and interrogating a hierarchical model for a query protein of interest. All of these can be found at scales.cri.uchicago.edu.

The challenge of recovering functional interactions using comparative genomics

The admixture of signals arising from phylogeny, function, and noise negatively impact the accuracy of PPI predictions using comparative genomics (Schäfer and Strimmer, 2005; Sul et al., 2018). One approach to address this problem is to use PCA which assumes that only global covariation is not statistical noise (Wigner, 1967; Franceschini et al., 2016). A known source of variability between orthologs is phylogenetic relatedness. As SVD achieves spectral clustering and phylogenetic reconstruction achieves hierarchical clustering, we do expect some level of coherence between the two approaches. However, our results also illustrate that relevant biological signal is contained throughout the SVD spectrum, including components harboring a minutiae of data variance. Components in different regions of the spectrum contained information about different biological scales—shallow components phylogeny, deeper components pathways, even deeper components protein complexes, and the very deepest noise, thereby signifying a ‘cross-over’ from phylogenetic to functional information. Our interpretation of these results is that while statistical variance reflecting large-scale properties can be ‘compressed’ into just a few shallow components, information about ‘local’ biological scales is distributed broadly across the spectrum. As a result, discarding global components enriched for phylogeny improved prediction of functionally relevant interactions. These results demonstrate that percent variance per component is not a good proxy for relevant biological signal. In the absence of a suitable alternative, we used knowledge in public databases to define the information content of each component. Future work will be needed to provide a theoretical basis for finding relevant signal in cases where such prior knowledge is not available.

Extracting a multi-scale, hierarchical model of biology from coevolutionary statistics

Understanding the origins of complex biological functions requires defining hierarchical relationships describing how protein interactions integrate to create scales of biological organization. While the use of SVD is fundamental to our method, SVD itself does not define hierarchical relationships; SVD defines orthogonal components of variance ordered according to the amount of variance explained. Two results of our study were key for being able to use the SVD spectrum to produce hierarchical models. First, different components contain information regarding different biological scales. Second, the information in different components could be related by tracking the persistence of spectral correlations across components (‘spectral depth’). These two results enabled extracting hierarchical statistical relationships that predict the integration of PPIs into complex structures approximating pathways and phenotypes. To our knowledge, this is a fundamentally new way of constructing hierarchical models. Instead of predicting pairwise interaction networks and then inferring higher-order networks, we infer an entire hierarchy directly from coevolutionary statistics.

Comparison of our results with AA Coev

Recently, Cong and colleagues reported a method, AA Coev, for inferring direct PPIs from bacterial genome sequences. Their method represented a significant advance for two reasons: (i) it considered coevolution at the resolution of amino acids and (ii) it applied DCA, to entire bacterial proteomes. Like our method, AA Coev tries to extract signal related to a specific scale of biology from admixed signals of phylogeny and noise using coevolutionary statistics of genomes. Specifically, DCA is a statistical physical approach that isolates local (i.e. direct) interactions. However, we observed significant differences in quality and breadth of direct PPI inferences, scalability, and ability to reconstruct hierarchical networks.

When we compared our direct PPI predictions in M. tuberculosis with those of AA Coev, we predicted significantly more interactions with significantly greater precision. This was a surprising result given that our approach considers lower-information features—OGGs versus amino acid sequence-level information. One explanation may be that our predictions were derived by explicitly using correlations across three different spectral windows that provided information about three different scales of biology: phylogeny, indirect PPIs, and direct PPIs. In contrast, AA Coev discards signal unrelated to direct PPIs. It may be the case that phylogenetic and pathway-level information provides context clues for enhancing subtle statistical signals related to direct PPIs. In other words, the confidence in assigning a putative physical interaction is increased by additional signals suggesting that the interaction also contributes to a pathway-level function among phylogenetically related bacteria. The power of our method may lie in its ability to simultaneously deconvolute yet leverage information related to these different scales. Further work is needed to better understand the differences between our method and AA Coev and to determine if the observed difference in PPI prediction is consistent across additional organisms outside of M. tuberculosis.

A striking difference between the two approaches is related to computational scalability. AA Coev is computationally expensive and, as a result, has been applied to just two organisms thus far. In contrast, we note that our approach can compute proteome-wide PPIs in a matter of minutes, enabling us to provide predictions for all 7047 UniProt Reference Proteomes. As identifying and characterizing different strains of bacteria is becoming increasingly important, scalable computational tools becomes a necessity.

A final difference between the two methods is how a hierarchy can be constructed. Cong et al. predicted collective units of structure, using a ‘bottom-up’ approach by putting together their direct PPI predictions. However, compared to the networks produced by SCALES, these networks tended to be smaller and less representative of higher-order collective structure, reinforcing results reported by Kuzmin et al. illustrating that building collective organization from pairwise networks results in incomplete hierarchical descriptions of biology. We therefore pose that though SCALES does not consider as information-rich of a feature as AA Coev, it may prove to be a useful framework to extract hierarchical relationships for connecting bacterial genotype with phenotype.

Limitations

We observed two major limitations related to our use of OGGs as orthologous feature. First, many proteins have no annotated OGG—for example, 295 of the ~4000 E. coli K12 proteins (6.7%). These proteins cannot be assigned to interactions or units of function by our method. Second, many proteins share the same OGGs and appear to interact at all spectral depths. These putative interactions may or may not be related to biological function. We anticipate that ortholog annotations will continue to improve with additional bacterial genome sequences, helping to alleviate these limitations. Moreover, the use of phylogenomic methods that incorporate models of the phylogenetic history of proteins to improve ortholog definition might create superior inputs for the methods developed here (Nagy et al., 2020).

Another limitation of the methods developed here is that they are inherently ‘mechanism-free’: they leverage evolutionary constraint without knowledge of the specific pressures driving the selection of interactions. As a result, our methods identify functionally relevant interactors but cannot reveal their collective function or the detailed molecular basis of the interactions.

The potential of generalizing SCALES to other biological systems

To what degree are the approaches developed here applicable to other biological systems? Practically, we note that the spectral properties of any given dataset will be unique. As such, re-application of these methods to a new dataset will require following the steps outlined in this work: creating a diverse ensemble, identifying relevant benchmarks, using the benchmarks to find different scales of interactions in the SVD spectrum, and enforcing the necessary statistical constraints to define hierarchical relationships. Sufficient diversity and suitable benchmarks may not currently be available for other systems as they were for bacteria. Moreover, the parameters used in this work to enable analysis—width of spectral windows, bin width for MI calculations, threshold of spectral correlation value used to define spectral depth—likely need to be derived de novo for other datasets. However, aside from these practical considerations, SCALES represents a statistical way to describe emergence—the integration of individual components into layers of collective units of function. The property of emergence spans biological systems, from proteins to ecosystems. Thus, while it remains to be tested, it may be true that SCALES is a generally useful approach to learning the hierarchical architecture of biological systems.

Share this article

Cite this article

Shallow components of covariation measured across bacterial orthologs reflect broad phylogenetic relationships.

Figure 1—source data 1

Workflow for relating patterns of ortholog covariation with phylogeny and protein interactions.

Shallow to deep spectral components of ortholog covariation reflect global to local biological ‘scales’.

Figure 3—source data 1

Figure 3—source data 2

Workflow for computing the ‘spectral depth’ between pairs of proteins.

Pattern of spectral correlations with flagellar filament, FliC, in Escherichia coli K12.

A statistically derived hierarchical model of Escherichia coli K12 motility.

Prediction and validation of a novel effector of twitch motility in Pseudomonas aeruginosa.

Mutual information (MI) windowed spectral correlations (MIWSCs) enable accurate classification of indirect and direct protein-protein interactions (PPIs).

Figure 8—source data 1

Author details

Mark A Zaydman

Contribution

Contributed equally with

For correspondence

Competing interests

Alexander S Little

Contribution

Competing interests

Fidel Haro

Contribution

Competing interests

Valeryia Aksianiuk

Contribution

Competing interests

William J Buchser

Contribution

Competing interests

Aaron DiAntonio

Contribution

Competing interests

Jeffrey I Gordon

Contribution

Competing interests

Jeffrey Milbrandt

Contribution

Competing interests

Arjun S Raman

Contribution

Contributed equally with

For correspondence

Competing interests

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags

Further reading