Widespread Horizontal Gene Transfer Among Animal Viruses

  1. National Cancer Institute, Bethesda, MD, USA
  2. Arizona State University, Tempe, AZ, USA
  3. University of Cape Town, South Africa

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.


Editors

  • Reviewing Editor
    Margaret Stanley
    University of Cambridge, Cambridge, United Kingdom
  • Senior Editor
    George Perry
    Pennsylvania State University, University Park, United States of America

Reviewer #1 (Public Review):

This paper discusses the identification of viral genes in publicly available DNA and RNA sequencing datasets. In many cases, these datasets have been assembled into contigs. Many viral genes were identified and contigs containing genes from more than one type of virus were more common than expected. The analysis appears to be sound and the results presented will be of great interest to the community.

The strengths of the paper are in the analysis itself, which is detailed, complex, and on a very large scale. To my knowledge, the identification of DNA viral proteins in sequencing datasets not deliberately infected with viruses has not previously been performed on this scale. Many proteins were identified which are at the limit of our current capacity to detect divergent proteins. I think the use of multiple methodologies strengthens the study, as it increases the depth of the results. The authors are also clear about the limitations of their study and give many caveats about their results, which is excellent.

I have two major concerns about the study. The first is the presentation, which in places makes it difficult to tell exactly how and why the analysis has been performed. I do not think it would be possible to reproduce this analysis based only on the information presented in the Materials and Methods section. This makes it difficult to assess the exact details of the method and whether they are appropriate. I would appreciate something like a flow chart to show, for each SRA dataset and each assembled contig, the exact steps taken for classification, the hierarchy of tools, and the threshold values applied to the results. An overview of the results at the beginning of the Results section would also be helpful: how many proteins were identified, what their host species were, how many contigs were assembled, how many of these were chimeric, etc.

My second concern is that it is not clear how each protein was determined to be either viral or non-viral, or how contigs were assigned as chimeric or non-chimeric. Positive and negative controls are not mentioned, and false positive and false negative rates are not calculated. Given that many of the identified proteins are highly divergent from known viral proteins, it would be good to see how likely it is that a random protein would be assigned as viral, or a viral protein as non-viral. Chimeric contigs could occur due to misassembly or endogenous viral elements. It seems that sequences in these categories may have been filtered using Cenote Taker, but no checks are described to confirm that the filtering was successful.
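To illustrate the kind of control analysis meant here, the following is a minimal sketch rather than a description of the authors' pipeline. It assumes a set of control proteins with known status (True for viral, False for non-viral) and the corresponding calls from the classification workflow. The function name `error_rates` and the input layout are hypothetical.

```python
def error_rates(known_viral, called_viral):
    """Compute false positive and false negative rates from control proteins.

    known_viral:  list of bool, True if the control protein is truly viral.
    called_viral: list of bool, True if the workflow called it viral.
    Both lists are hypothetical inputs in the same order.
    """
    if len(known_viral) != len(called_viral):
        raise ValueError("control labels and calls must have the same length")

    tp = fp = tn = fn = 0
    for truth, call in zip(known_viral, called_viral):
        if truth and call:
            tp += 1
        elif truth and not call:
            fn += 1
        elif not truth and call:
            fp += 1
        else:
            tn += 1

    # Rates are undefined when a control class is empty, so report NaN.
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return {"false_positive_rate": fpr, "false_negative_rate": fnr}
```

Negative controls could be, for example, shuffled or known cellular proteins, and positive controls could be known viral proteins withheld from the search databases. Both choices are suggestions, not steps reported in the manuscript.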

Overall, I think that the study is useful and of interest, but I think more clarity in the presentation of the results would increase the value of the paper for many readers.

Reviewer #2 (Public Review):

Summary:

A large-scale computational analysis of published sequences of various animal species provides evidence for extensive gene transfer amongst DNA viruses.

Strengths:

The study provides evidence for a large number of previously uncharacterized DNA viruses and supports a model whereby DNA viruses have evolved by combining distinct shared replication modules and some of these evolutionary oddities likely remain in the biosphere. The work provides a useful repository and potential framework for additional virus discovery efforts.

Weaknesses:

This is an entirely computational story, with very limited experimental validation. A large number of often confusing new acronyms are introduced that may be "cute" (such as the reference to the delicious half-smoke sausage) but are not particularly useful. This is not helped by the somewhat "telegraphic" presentation of the data, which is sometimes difficult to digest. Not all paragraphs deliver what they promise. For example, under the title "Polyomaviruses and papillomaviruses" there is no discussion of papillomaviruses. Overall, however, these weaknesses do not diminish my enthusiasm for this paper, which will be an important resource for computational and non-computational virus hunters.

Reviewer #3 (Public Review):

Summary:

Buck et al. set out to characterize small DNA tumor viruses through the generation and analysis of ~100,000 public sequencing datasets from the SRA and other databases. Using a variety of powerful bioinformatic methods, including alignment-based searches, statistical modelling, and structure-aware detection, the authors successfully classify novel protein sequences that support the occurrence of evolutionary gene transfer between DNA virus families. The authors propose a naming scheme to better capture viral diversity and uncover novel chimeric viruses, that is, viruses containing genes from multiple established virus families. Additional analysis using the generated dataset was performed to search for DNA and RNA viruses of interest, demonstrating the utility of the generated datasets for exploratory screens. The assembled sequencing datasets are publicly available, providing invaluable resources for current and future investigations within this subfield.

Strengths:

The scope of data analysis (100,000+ SRA records and additional libraries) is substantial, and the authors have contributed to further insight into the modularity of previously uncharacterized viral genomes, through computationally demanding advanced bioinformatics analyses in addition to extensive manual inspection.

The publicly available resources generated as a result of these analyses provide useful data for further experiments to inspect viral diversity and modularity. Other scanning experiments and further investigation of biologically relevant viruses using these contigs may uncover, for example, animal reservoirs or novel recombinant viruses of significance.

Novel instances of genomic modularity provide excellent starting points for understanding virus evolutionary pathways and gene transfer events.

Weaknesses:

Overall, the methods section of this paper requires more detail.

The inclusion criteria for which "SRA" datasets were or were not utilized within this study are poorly defined. This means the comprehensiveness of the study for a given search space of the SRA is not defined, and the results are ultimately not reproducible, or expandable. For example, are all vertebrate RNA-seq samples processed? Or just aquatic vertebrate RNA-seq? Were samples randomly sampled from a more comprehensive data set? What is the make-up of the search space and how much was DNA-seq or RNA-seq? This section should be expanded and explicit accounting provided for how dataset selection was performed. This would provide additional confidence in the results and conclusions, as well as allow for future analysis to be conducted.

Hallmark virus genes require further clarification, as it is unclear what genes are utilized as bait, or in the initial search process. The reported "Hallmark gene sets" are not described in a systematic way. What is the sensitivity and specificity of these gene sets? Was there a validation of the performance characteristics (ROC) for this gene set with different tools? How is this expected to be utilized? Which kinds of viruses are excluded/missed? Are viroids included?

For the Tailtomavirus, additional information is needed for sufficient confidence. Was this "chimeric" genomic arrangement detected in a single library? This raises a greater issue of how technical artifacts, which may appear as chimeric assemblies, are ruled out in the workflow. If two viral genomes share a k-mer of length greater than the assembly k, the graph may become merged. Are there read pairs that span all regions of the genome? Is there evidence for multiple homologous viruses with synteny between them that supports the combination of these genes as an evolving genome, or is this an anomalous observation? Read alignments and Bandage graph visualizations should be included for all cases of chimeric assemblies, along with active steps to disprove the baseline hypothesis that these are technical artifacts of genome assembly.
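One quick check related to the shared k-mer concern is sketched below. It is an illustration under stated assumptions, not a substitute for read-level evidence: given the two parental-like sequences as plain strings and the assembler's k, it lists the canonical k-mers they share, since any such k-mer is a point where a de Bruijn graph could join the two genomes. The function names are hypothetical, and the use of canonical (strand-independent) k-mers is an assumption about how the assembler treats strands.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(seq, k):
    """Return the set of canonical k-mers present in a nucleotide sequence."""
    seq = seq.upper()
    return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}


def shared_kmers(seq_a, seq_b, k):
    """Return canonical k-mers present in both sequences.

    A non-empty result means the two genomes share at least one node in a
    de Bruijn graph built with this k, so a merged (chimeric) path is possible.
    """
    return canonical_kmers(seq_a, k) & canonical_kmers(seq_b, k)
```

An empty result at the assembly k argues against this particular artifact but does not rule out other sources of chimerism, such as chimeric reads, so spanning read pairs would still be needed.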

Justification for exclusion of endogenized sequences is not included and must be described, as small DNA tumor viruses may endogenize into the host genome as part of their life cycle. How is such an integration resolved from an evolutionary "endogenization"? What's the biological justification for this step?

Additional supporting information, clear presentation, and context are needed to strengthen results and conclusions.

Basic reporting of global statistics, such as the total number of viruses found per family, should be included in the main text to better support the scope of the results. How many viruses (per family) were previously known, and therefore what is the magnitude of the expansion performed here?

Additional parameters and information should be included in bioinformatic tool outputs to provide greater clarity and interpretation of results. For example, reporting the "BLASTp E-val", as for the PolB homology (BLASTp 6E-12), is not informative, and does not tell the reader that this is (we assume) an expectation value. For each such case, please report the top database hit accession, percent identity, query coverage, and E-value. Otherwise, a judgment cannot be adequately made regarding the quality of evidence for homology. Similarly, for HHpred, what does the number represent: confidence, identity, or coverage?
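As a sketch of how the requested fields could be collected, the example below parses BLAST+ tabular output and keeps the best hit per query. It assumes the search was run with `-outfmt "6 qseqid sacc pident qcovs evalue"`, so that each line carries the query ID, subject accession, percent identity, query coverage, and E-value in that order. The function names are hypothetical, and nothing here reflects how the authors actually ran BLAST.

```python
import csv
import io

# Column order assumed from: -outfmt "6 qseqid sacc pident qcovs evalue"
FIELDS = ["qseqid", "sacc", "pident", "qcovs", "evalue"]


def top_hits(tabular_text):
    """Return the lowest E-value hit per query from BLAST tabular text."""
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        for numeric in ("pident", "qcovs", "evalue"):
            hit[numeric] = float(hit[numeric])
        query = hit["qseqid"]
        if query not in best or hit["evalue"] < best[query]["evalue"]:
            best[query] = hit
    return best


def describe(hit):
    """Format one hit as a sentence fragment suitable for the main text."""
    return (
        f"{hit['qseqid']}: top hit {hit['sacc']}, "
        f"{hit['pident']:.1f}% identity, "
        f"{hit['qcovs']:.0f}% query coverage, "
        f"E-value {hit['evalue']:.0e}"
    )
```

Reporting all four values together lets a reader distinguish a short, high-identity match from a long, low-identity one, which a bare E-value cannot do.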

Some findings described in the Results section may require revision. Several of the Nidoviruses (Nidovirus takifugu, Nidovirus hypomesus, Nidovirus ambystoma, etc.) have been previously described by three groups: first by Edgar et al. (https://www.nature.com/articles/s41586-021-04332-2), then Miller et al. (https://academic.oup.com/ve/article/7/2/veab050/6290018), and then Lauber et al. (https://journals.plos.org/plospathogens/article?id=10.1371/journal.ppat.1012163). This is now the fourth description of the same set of viruses. These sequences are in GenBank (https://www.ncbi.nlm.nih.gov/nuccore/OV442424.1), although it is unclear why they are not returned as BLAST hits. Miller et al. also described the Togavirus co-segment previously.

It is also unclear what is being described for the HelPol/maldviruses that was not previously described in distantly similar relatives. How many were described in the previous literature and how many are described by this work?

Co-phylogenies should be used to convey gene transfer and flow clearly to support the conclusions made in the text.

Statements such as, "The group encompasses a surprising degree of genomic diversity...", should be supported by additional information to strengthen conclusions (e.g., what the expected diversity is). What is the measurement for genomic diversity here, and why is this surprising? There is overall a lack of quantification to support the conclusions made throughout the paper.
