Establishing comprehensive quaternary structural proteomes from genome sequence

  1. Department of Bioengineering, University of California, San Diego, La Jolla, CA 92101
  2. Omnicorp Inc. (Pilot AI), San Francisco, CA 94129
  3. The Novo Nordisk Foundation (NNF) Center for Biosustainability, The Technical University of Denmark, Kongens Lyngby 2800, Denmark

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.


Editors

  • Reviewing Editor
    Martin Graña
    Institut Pasteur de Montevideo, Montevideo, Uruguay
  • Senior Editor
    Qiang Cui
    Boston University, Boston, United States of America

Reviewer #1 (Public review):

Summary:

This work presents a computational platform that integrates currently available experimental or precomputed datasets and/or state-of-the-art modeling methods to assemble a proteome structure from a given list of genes (representing a whole proteome of an organism, or some specific subset of interest). The main advancement is that the proteome structure contains not only tertiary structure information (such as that provided by precomputed AlphaFold-predicted proteomes) but also information about quaternary structure. Adding quaternary structure information to whole proteomes is a challenging problem (and the manuscript would benefit from a more comprehensive introduction section presenting these challenges). Importantly, this addition of quaternary structure information is likely to significantly improve any downstream modelling or prediction. This is because most proteins form either stable or transient complexes, and a significant proportion of proteins interact with cellular structures such as the different biological membranes. These interactions provide important context for interpreting residue-level information, for example the fitness/functional effects of point mutations.

Strengths:

The main strength of this work is that it approaches the question of protein quaternary structure in a comprehensive way. Namely, in addition to oligomeric state, it also includes membrane and cellular localization. It also demonstrates how to combine the available experimental data and precomputed models to achieve the same for any set of genes.

Weaknesses:

The feasibility of obtaining a similar dataset (of useful/informative size) for a more complex organism is not clear.

Reviewer #2 (Public review):

In this study, a methodology called QSPACE is developed and presented. It integrates structural information for a specific organism, here E. coli. The process entails the gathering of individual structures, including oligomeric information/stoichiometry, the incorporation of data on transmembrane regions, and the utilization of the resulting dataset for the analysis of mutation effects and the allocation of proteomes.

This work aims high, setting an ambitious goal of modeling the quaternary structure of a proteome. The method could be applied to other organisms in the future and has value in that respect. At the same time, the work tries to cover (too?) much ground and some of the results/analyses don't measure up. There are indeed a number of shortcomings and/or inconsistencies in the results presented. The comments below will help improve the work and its usefulness.

(1) It is described that "QSPACE then finds the 3D coordinate file (i.e. "structure") that best reflects the user-defined (input #2) multi-subunit protein assembly". What is meant by "best reflects"? What if two different structures with the same stoichiometry are available? Which one is picked?

(2) There appears to be a significant under-estimation of oligomer formation: it is reported that "31% (1,334/4,309) of E. coli genes participate in 1,047 oligomeric complexes, 667 genes are annotated as monomers, and 2,308 genes are not included". However, it is generally observed that ~50% of E. coli genes form homo-oligomers (see PMID 10940245 or more recently 38325366), and adding hetero-oligomers on top of that should increase the fraction of oligomers further. In that respect, the estimate forming the basis of this work (31% of genes participating in oligomeric complexes) seems incorrect. It is unclear why the authors did not identify more proteins as adopting a quaternary structure. It is generally hard to grasp details of the dataset, for example, the simple statistic of how many genes participate in homo- versus hetero-oligomers. Such information is partially presented in panels 2c & 2d, but these panels are very small and hard to read (I would suggest removing the structures of the ABC transporters to make space to present this in more detail).

(3) There are a number of misleading statements/overstatements that I encourage the authors to revise. For example (not exhaustive):
"to our knowledge this result is the most advanced genome-scale structural representation of the E. coli proteome and de facto represents a major advancement in genome annotation."
"angstrom-level subcellular compartmentalization" - Can we really talk about sub-atomic precision when even side chains can move by several angstroms?
"we provide a global accounting of all functionally important regions" - "all" is not justified
"Incorporated into genome-scale models that compute protein expression" - what does that mean? There are gene expression & protein abundance datasets, why is the "compute" necessary?
"Likewise, sequence-based prediction software (e.g., DeepTMHMM49) and structure-based prediction software (e.g., OPM50) are agnostic to membrane orientation and can also generate erroneous results" - what does "erroneous results" mean in this context? Those tools are not supposed to predict orientation.

(4) What was the benchmark used to estimate the accuracy of orientation assignments?

(5) It is not clear why structural information is required to calculate the volume taken up by different proteins across the proteome. For each protein, the expression level (copy number) is expected to have a significant effect, but I'm unsure of why oligomerization is considered key here. It will modulate the volume exclusion associated with interface contact areas, but isn't this negligible compared to other factors, in particular expression?

(6) Models aiming at predicting deleterious effects of mutations typically use sequence conservation, but I do not see such information used in Figure 4. Assessing the added value of structural information should include such evolutionary information (residue-level sequence conservation) in the baseline.

(7) The "proteome allocation" analysis is presented as an important result, but I did not find the equations used to conduct this analysis. I assume that "proteome allocation" is based solely on expression, and that "cell volume" uses structural information on top of it. There is a significant difference between "proteome allocation" and "cell volume" as reflected in the proteomaps shown in panels 4e & 4f, but there is no explanation for it. Are the proteins' identities the same in these two panels? Were only proteins counted, or was RNA considered as well? Clarification is needed for RNA: for example, how were volumes calculated for structures containing RNA? The datasets used to derive these maps should also be provided so that they can be reproduced.

(8) I did not see that the structures generated are available - they should be deposited on a permanent repository with a DOI.
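
The gene counts quoted in comment (2) can be checked with a few lines of arithmetic. The sketch below is not taken from the manuscript or the reviews. It assumes that the three quoted categories (oligomeric, monomeric, not included) are mutually exclusive and together cover all 4,309 genes; the variable names are illustrative only. Under that assumption it reproduces the reported 31% and shows how the fraction changes when genes without an assigned state are left out of the denominator.

```python
# Gene counts as quoted in comment (2); the category labels are illustrative.
counts = {
    "oligomeric": 1334,    # genes participating in oligomeric complexes
    "monomeric": 667,      # genes annotated as monomers
    "not_included": 2308,  # genes with no assigned state
}

# Assumption: the three categories are disjoint and exhaustive.
total_genes = sum(counts.values())
annotated = counts["oligomeric"] + counts["monomeric"]

# Fraction of oligomeric genes over two different denominators.
frac_of_all = counts["oligomeric"] / total_genes
frac_of_annotated = counts["oligomeric"] / annotated

print(f"total genes: {total_genes}")
print(f"oligomeric / all genes: {frac_of_all:.1%}")
print(f"oligomeric / annotated genes: {frac_of_annotated:.1%}")
```

The two fractions differ only in whether the 2,308 genes that are not included count toward the denominator, so any comparison with the ~50% homo-oligomer estimate cited in comment (2) depends on which denominator is intended.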
