Comment on 'Single nucleus sequencing reveals evidence of inter-nucleus recombination in arbuscular mycorrhizal fungi'
Abstract
Chen et al. recently reported evidence for inter-nucleus recombination in arbuscular mycorrhizal fungi (Chen et al., 2018a). Here, we report a reanalysis of their data. After filtering the data by excluding heterozygous sites in haploid nuclei, duplicated regions of the genome, and low-coverage depths base calls, we find the evidence for recombination to be very sparse.
https://doi.org/10.7554/eLife.47301.001Introduction
For many years arbuscular mycorrhizal fungi (AMF) were presumed to be asexual as no one had witnessed sexual structures in these fungi. This was puzzling because AMF retain core meiosis genes (Halary et al., 2011), indicating that a meiosis-like process most likely occurs in this lineage. Previous evidence for recombination later turned out to be based on duplicated gene copies (Croll and Sanders, 2009), or ribosomal RNA sequences that were paralogs (Pawlowska and Taylor, 2004; Maeda et al., 2018). Recently, based on work comparing single nuclei whole-genome sequences to bulk sequencing data, new evidence for recombination in these fungi was reported (Chen et al., 2018a). The isolates were dikaryotic, containing nuclei of two classes defined by their mating type (MAT) locus (Ropars et al., 2016). For each sequenced nucleus, PCR-amplification was attempted to assign a mating type class (MAT-1 up to MAT-5). Recombination was then inferred based on: (i) base-pair calls classed as one mating type found in the alternate mating type; (ii) nuclei of the same mating type showing variation in consecutive blocks of single nucleotide polymorphisms (SNPs); (iii) SNPs from nucleus 7 (SL1 strain) being more similar to SNPs of the alternate mating type, consistent with a recombination event spanning the MAT locus.
Here, we ask how strong the signal of within-strain recombination was in the data from Chen et al. (2018a) if we excluded heterozygous sites in haploid nuclei, duplicated regions of the genome, and low-coverage depths base calls. By removing data that cannot confidently be distinguished from sequencing errors and repeated regions, we find that the evidence for recombination is very sparse. We also report specific examples of these possible errors to justify our more stringent filtering of the data.
Results
The effect of filtering positions based on reads
Our first analysis was of the dataset reported in Supplementary file 6 of Chen et al. (2018a), used for both Figures 2 and 3 of that manuscript. We filtered out: (1) positions where any single nucleus was heterozygous (defined as sites with read depth >10, and alternate allele >10%), (2) any individual site with less than five reads coverage, and (3) positions with more than one high-confidence BLAST hit using the settings specified in Chen et al. (2018a). We applied the filters individually or in combination, to see how many of the variable sites inferred as signals of recombination would be removed. Applying each filter individually removed between 19–77% of recombined positions. Filtering of low coverage sites had the strongest effect, a ~ 75% reduction. Applying all three filters together removed 91% of recombined sites (Figure 1). Notably, these filters had much less effect on the total number of analyzed sites, with the combined application of all three filters reducing the total number of sites by only ~22%.
Recombination, when involving crossing over, exchanges physical blocks between homologous chromosomes resulting in consecutive allelic differences. We calculated the number of recombined sites, as well as the number of recombined blocks, consecutive SNPs of the alternate haplotype, shown in Table 1 (details of identified blocks can be found in Table 1—source data 1). Based on the analysis presented in Chen et al., all isolates show recombined blocks, in some cases spanning over a thousand base pairs. However, applying our three filters reduced these blocks. After filtering no consecutive SNPs remained for strain A4. Filtering applied to strains A5 and SL1 reduced the number of recombined blocks to 2 and 4, respectively, and also reduced the length of the remaining blocks.
-
Table 1—source data 1
- https://doi.org/10.7554/eLife.47301.005
A specific example of repeated regions associated with biallelic sites in haploid nuclei
Chen et al. (2018a) interpreted consecutive SNPs differing between nuclei of the same mating type as a sign of recombination, as highlighted in Figure 3 of their article. The filtering used in Chen et al. (2018a) removed multi-copy sites ‘[i]f BLAST results returned more than two good hits’, but retained regions with two BLAST hits. This could lead to the inclusion of SNPs that are heterozygous due to duplicated regions of the genome.
To show how repeated regions may lead to a false signal of recombination, we focused on an example highlighted in Figure 3 of Chen et al. (2018a), and discussed in the main text of that article. We found that several positions on scaffold 70 from isolate A4 were heterozygous in several nuclei (Figure 2), although they are treated as homozygous in Chen et al. (2018a). High sequencing depth (>30) eliminates rare sequencing errors as the cause. To test if duplicated regions could be the cause of the heterozygosity, we performed a BLAST search against the A4 reference genome with sequences from scaffold 70:100354–100657. This search resulted in two BLAST matches: the self match on scaffold 70, as well as an additional match on scaffold 3570 (Figure 2B). When the short reads of the dikaryon (bulk sequencing of all nuclei) were aligned to the reference genome, both these SNPs on scaffold 70 and their match on scaffold 3570 were heterozygous, and the BLAST hit result for both showed 100% identity match. Thus, this repeated sequence seems to have been assembled as a chimera of the two variants in both scaffolds, and the short reads from either copy are mapped equally to both.
We feel this example illustrates the need to exclude repetitive regions from analyses of recombination.
Confidence in low coverage sites to infer recombination
The data presented in Chen et al. (2018a) was filtered with a minimum of two reads. This is a very low threshold, and insufficient even for a consensus in the event of disagreeing reads. To look at the effect of low coverage on the signal of recombination, we compared the distribution of read depths between random and recombined sites. We first needed to identify recombined sites, as the method was lacking from the original manuscript, so we applied a parsimony criterion as detailed in Figure 1—figure supplement 1. While imperfect, this method certainly underestimates recombination, as it cannot identify recombined sites when equal numbers of nuclei within a mating type have alternate genotypes. Our method identified 733 positions, sufficient for analysis. Looking at the distribution of read depths, overall SL1 nuclei had ~95% fewer high coverage sites (average of 97 sites > 10 read depth for nuclei from SL1 versus 2290 for A4 and 2441 for A5) compared to A4 and A5 (Figure 3). We note here, as described in Table 1 of Chen et al. (2018a), that SL1 nuclei cover much less of the genome (14%) compared to A4 and A5 (53% and 42%, respectively). Another fact visible from Figure 3 is that, for A4 and A5, recombined sites are overrepresented by sites with low depth compared to sub-sampled non-recombined sites (Wilcoxon ranked sum test A4; p=3.2×10−09, A5 p=2.9×10−6). We note that for nuclei from isolate SL1, fewer overall recombined sites can be identified since the decreased breadth of coverage reduces overlap between nuclei, making it difficult to say whether this pattern of excess low-coverage sites is also present (p=0.11).
Genome-wide pairwise SNP differences are reduced after filtering
We then assessed the evidence for genome-wide recombination based on pair-wise SNP differences. This is the analysis presented in Figure 3 of Chen et al. (2018a), showing overall more recombination in SL1 than A4 and A5, indicated by a ‘mosaic pattern’. We again note here that sequence from SL1 nuclei covers very little of the assembly (average of 14% from Table 1 of Chen et al., 2018a) means that very few positions will be covered between any two nuclei between nuclei of SL1 (14% in one nucleus x 14% in the other nucleus = 2% in both). After applying our filters, we find that in A4 and A5 almost all differences between nuclei of one mating type disappear (Figure 4). For nuclei from SL1, the filtering reduced the differences within a mating type, but since these nuclei cover so little of the genome, the overall dataset is reduced such that on average nuclei only share 9 SNPs that can be compared. Many nuclei have no overlapping SNPs and no comparison can be made (black squares in Figure 4). A few nuclei of opposite mating types, such as nuclei 17 and 25, show high similarity, but for these pairs the similarity is based on only one or two shared SNP positions.
Confirming the placement of nucleus 07 (SL1) could be strong evidence for recombination
In Figure 2 of Chen et al. (2018a), nucleus 07 of SL1 shows a strong similarity with nuclei of the alternate mating type, seen by the clustering of nucleus 07 with MAT-5. However, this nucleus was PCR-genotyped to be of mating type MAT-1. The incongruence between PCR-genotyped mating type and the SNP clustering would be evidence of recombination spanning the MAT locus. To confirm this, we looked in the mapped reads of each nucleus to find reads mapping to either alternate mating type locus (Figure 4—figure supplement 1]). We found that there were no Illumina sequencing reads of nucleus 07 mapped to either mating type locus, indicating the whole genome amplification step may not have amplified the mating locus. As such, we have no available evidence for the mating type of this nucleus. Without corresponding Illumina evidence, we consider the PCR product the sole remaining evidence of this recombination event. This PCR experiment that is not confirmed with Illumina data represents the only remaining evidence for recombination after read filtering. We find cross-contamination of the PCR to be a more likely scenario in the face of many billions of sequenced bases from an Illumina run.
Conclusion
Finding a balance between filtering poor data and losing informative data is a critical component of any analysis. For this dataset, we provide evidence for the necessity of stringent filtering to avoid inferences based on erroneous or misleading data. We do not consider our filters to be particularly strict, as removing low coverage sites, repeated regions, and heterozygous sites from haploid data is commonplace in genomic analyses. In SL1, three blocks of consecutive SNPs remained after our filtering, and two regions in A5. Some of these regions likely remain because our heterozygosity filter requires a minimum of ten reads, thus low coverage heterozygous sites are not excluded. This analysis used the first 100 contigs, covering approximately 10 Mb. As such, finding only a handful of putative recombined SNPs certainly cannot be confidently separated from amplification/sequencing noise. While small blocks of genetic exchange may be compatible with gene conversion, the limited number of markers involved greatly increases the difficulty in identifying high-confidence gene conversion events (Wijnker et al., 2013; Qi et al., 2014).
Mapping recombination inside repetitive regions would require longer reads than available from standard short-read technologies. This is a formidable task due to the input requirements of PacBio technologies, but it has been accomplished using linked short reads from individual pollen cells (Sun et al., 2019). Notably, the use of a more contiguous reference genome will actually include additional repetitive regions, and the exclusion of repetitive regions will become even more important. As the apparent recombined blocks are much smaller than the contigs, a more contiguous genome assembly would not change our analysis. Single nucleus genome amplification with multiple displacement amplification produces extremely variable genome coverage. Normalization of the data was shown to improve the quality of AMF genomic data when coverage is variable (Montoliu-Nerin et al., 2019). In addition to normalization, low coverage sites could still be used with a model-based approach, which incorporates the associated uncertainty (Hinch et al., 2019; Bloom et al., 2013). Finally, removing heterozygous sites from haploid single-nuclei seems like a self-evident requirement.
The conserved meiosis genes found in the genomes of Glomeromycotan species strongly suggests a meiosis-like process allowing recombination and re-shuffling of genetic material among genomes. Uncovering the details of this process would be a major scientific breakthrough. Given this importance, claims made regarding recombination in Glomeromycota require rigorous examination. While we acknowledge that the models presented in Chen et al. (2018a) are valuable framework to test hypotheses for meiosis-like mechanisms found in these fungi, the data presented are not robust enough to support or reject them. As such, we believe that one of the greatest remaining mysteries in mycology remains unknown.
Materials and methods
Raw data
Request a detailed protocolWe obtained the SPAdes assemblies for the three dikaryotic R. irregularis isolates from NCBI, as well as the paired-end read libraries from the dikaryons and the paired-end reads of the single nuclei. Details of the accession numbers used are found in Chen et al. (2018a).
Read processing
Request a detailed protocolShort reads were cleaned with FASTP (Chen et al., 2018b), then aligned using BWA mem as in Chen et al. (2018a). Reads per nucleus were analyzed using the python modules pysam (Heger and Jacobs, 2019, and BLAST searches were scripted using biopython (Cock et al., 2009). Scripts used are available on GitHub repository (https://github.com/BenAuxier/Chen.2018.Response; Auxier, 2019; copy archived at https://github.com/elifesciences-publications/Chen.2018.Response). The parsimony criterion shown in Figure 1—figure supplement 1A. for identification of recombined sites, was performed manually. Filtering of excel files and calculations of recombined sites was performed with the R statistical language, which was also used to prepare plots with ggplot2 and distance matrices using ape (Wickham, 2016; Paradis and Schliep, 2019).
We compared the coverage representation between non-recombined and recombined sites. We subsampled from non-recombined sites to match the number of reads in recombined sites and performed Wilcoxon ranked sum test in R between the subsampled set of non-recombined reads and the recombined reads.
Mating-type loci identification and mapping
Request a detailed protocolAs the location of the mating type loci was not specified in the results of Chen et al. (2018a), we identified them based on data from Ropars et al. (2016). Specifically, we used the primers sequences kary001, kary002, and kary003 as query sequences for BLAST searches against the A4, A5, and SL1 genomes. Each primer sequence had strong matches against two different scaffolds, consistent with divergent ideomorphs as found in Ropars et al. (2016). As these primers only target a small region, we extended the locus using the annotations found on NCBI to identify the boundaries of the pair of genes. These locations are reported in Figure 4—figure supplement 1—source data 1.. As no annotations are available for SL1, we used the entire sequence of the ideomorph on scaffold511 of A5 as a BLAST query to identify the homologous regions.
With the mating locus identified, we then counted the number of reads that mapped to each sequence using samtools (Li et al., 2009).
Calculation of overlapping SNPs
Request a detailed protocolTo calculate the expected number of overlapping SNPs found in Figure 4D, we used the following formula:
Data availability
No data was generated for this study. Analyses and scripts can be found on https://github.com/BenAuxier/Chen.2018.Response (copy archived at https://github.com/elifesciences-publications/Chen.2018.Response).
References
-
Fastp: an ultra-fast all-in-one FASTQ preprocessorBioinformatics 34:i884–i890.https://doi.org/10.1093/bioinformatics/bty560
-
Conserved meiotic machinery in Glomus spp., a putatively ancient asexual fungal lineageGenome Biology and Evolution 3:950–958.https://doi.org/10.1093/gbe/evr089
-
The sequence alignment/Map format and SAMtoolsBioinformatics 25:2078–2079.https://doi.org/10.1093/bioinformatics/btp352
-
BookGgplot2: Elegant Graphics for Data AnalysisNew York: Springer-Verlag.https://doi.org/10.1007/978-0-387-98141-3
Article and author information
Author details
Funding
The authors declare that there was no funding for this work.
Acknowledgements
We thank Dr. Duur Aanen, Dr. Anna Rosling, Dr. Marisol Sanchez-Garcia, Dr. Erik Wijnker, and Mathijs Nieuwenhuis for critical feedback. We also thank the three anonymous reviewers.
Copyright
© 2019, Auxier and Bazzicalupo
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 1,138
- views
-
- 76
- downloads
-
- 8
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Evolutionary Biology
The majority of highly polymorphic genes are related to immune functions and with over 100 alleles within a population, genes of the major histocompatibility complex (MHC) are the most polymorphic loci in vertebrates. How such extraordinary polymorphism arose and is maintained is controversial. One possibility is heterozygote advantage (HA), which can in principle maintain any number of alleles, but biologically explicit models based on this mechanism have so far failed to reliably predict the coexistence of significantly more than ten alleles. We here present an eco-evolutionary model showing that evolution can result in the emergence and maintenance of more than 100 alleles under HA if the following two assumptions are fulfilled: first, pathogens are lethal in the absence of an appropriate immune defence; second, the effect of pathogens depends on host condition, with hosts in poorer condition being affected more strongly. Thus, our results show that HA can be a more potent force in explaining the extraordinary polymorphism found at MHC loci than currently recognized.
-
- Cancer Biology
- Evolutionary Biology
In growing cell populations such as tumours, mutations can serve as markers that allow tracking the past evolution from current samples. The genomic analyses of bulk samples and samples from multiple regions have shed light on the evolutionary forces acting on tumours. However, little is known empirically on the spatio-temporal dynamics of tumour evolution. Here, we leverage published data from resected hepatocellular carcinomas, each with several hundred samples taken in two and three dimensions. Using spatial metrics of evolution, we find that tumour cells grow predominantly uniformly within the tumour volume instead of at the surface. We determine how mutations and cells are dispersed throughout the tumour and how cell death contributes to the overall tumour growth. Our methods shed light on the early evolution of tumours in vivo and can be applied to high-resolution data in the emerging field of spatial biology.