Pervasive translation in Mycobacterium tuberculosis

Abstract

Most bacterial ORFs are identified by automated prediction algorithms. However, these algorithms often fail to identify ORFs lacking canonical features such as a length of >50 codons or the presence of an upstream Shine-Dalgarno sequence. Here, we use ribosome profiling approaches to identify actively translated ORFs in Mycobacterium tuberculosis. Most of the ORFs we identify have not been previously described, indicating that the M. tuberculosis transcriptome is pervasively translated. The newly described ORFs are predominantly short, with many encoding proteins of ≤50 amino acids. Codon usage of the newly discovered ORFs suggests that most have not been subject to purifying selection, and hence are unlikely to contribute to cell fitness. Nevertheless, we identify 90 new ORFs (median length of 52 codons) that bear the hallmarks of purifying selection. Thus, our data suggest that pervasive translation of short ORFs in Mycobacterium tuberculosis serves as a rich source for the evolution of new functional proteins.

Editor's evaluation

The use of ribosome profiling in this study allowed for the identification of translated regions of the Mycobacterium tuberculosis genome, identifying new genomic regions that undergo active translation. A select set of these appears to have been the subject of purifying evolutionary selection, suggesting that this pervasive translation of short genetic regions serves as the basis for the evolution of new proteins/protein functions.

https://doi.org/10.7554/eLife.73980.sa0

eLife digest

How can you predict which proteins an organism can make? To answer this question, scientists often use computer programs that can scan the genetic information of a species for open reading frames – a type of DNA sequence that codes for a protein. However, very short genes and overlapping genes are often missed through these searches.

Mycobacteria are a group of bacteria that includes the species Mycobacterium tuberculosis, which causes tuberculosis. Previous work has predicted several thousand open reading frames for M. tuberculosis, but Smith et al. decided to use a different approach to determine whether there could be more. They focused on ribosomes, the cellular structures that assemble a specific protein by reading the instructions provided by the corresponding gene.

Examining the sections of genetic code that ribosomes were processing in M. tuberculosis uncovered hundreds of new open reading frames, most of which carried the instructions to make very short proteins. A closer look suggested that only 90 of these proteins were likely to have a useful role in the life of the bacteria, which could open new doors in tuberculosis research. The rest of the sequences showed no evidence of having evolved a useful job, yet they were still manufactured by the mycobacteria. This pervasive production could play a role in helping the bacteria adapt to quickly changing environments by evolving new, functional proteins.

Introduction

The canonical mode of bacterial translation initiation begins with the association of a 30S ribosomal subunit, initiator tRNA, and initiation factors, with the ribosome binding site of an mRNA (Laursen et al., 2005). Binding of the 30S initiation complex to the mRNA involves base-pairing interactions between the mRNA Shine-Dalgarno (S-D) sequence, located a short distance upstream of the start codon, and the anti-S-D sequence in the 16S ribosomal RNA (rRNA). Local mRNA secondary structure around the ribosome binding site can reduce interaction with the 30S initiation complex. Translation initiates at a start codon, typically an AUG; less frequently, translation initiation occurs at GUG or UUG, and in rare instances at AUC, AUU, and AUA start codons (Gvozdjak and Samanta, 2020; Hecht et al., 2017). Hence, the likelihood of translation initiation at a given sequence will depend on the sequence upstream of the start codon, the degree of secondary structure in the region surrounding the start codon, and start codon identity.

Due to the requirement for a 5’ untranslated region that includes the S-D sequence, mRNAs translated using the canonical mechanism are referred to as ‘leadered’. By contrast, ‘leaderless’ translation initiation occurs on mRNAs that lack a 5’ UTR, such that the transcription start site (TSS) and translation start codon coincide. The mechanism of leaderless translation initiation is poorly understood. Until recently, there were few known examples of leaderless mRNAs; leaderless translation in the model bacterium Escherichia coli was shown to be rare and inefficient (Moll et al., 2002; Romero et al., 2014; Shell et al., 2015). However, recent studies indicate that leaderless translation initiation is a prevalent and robust mechanism in many bacterial and archaeal species (Beck and Moll, 2018). We and others showed that ~25% of all mRNAs in Mycobacterium smegmatis and Mycobacterium tuberculosis (Mtb) are leaderless (Cortes et al., 2013; Shell et al., 2015). Moreover, our data suggested that any RNA with a 5’ AUG or GUG will be efficiently translated using the leaderless mechanism in M. smegmatis (Shell et al., 2015).

Bacterial open reading frames (ORFs) are typically identified from genome sequences using automated prediction algorithms (Besemer and Borodovsky, 2005; Delcher et al., 2007; Hyatt et al., 2010). Among the criteria used by these algorithms are ORF length and the presence of an S-D sequence. Hence, they often fail to identify non-canonical ORFs, including overlapping ORFs (Burge and Karlin, 1998), leaderless ORFs (Beck and Moll, 2018; Lomsadze et al., 2018), and short ORFs (sORFs; encoding small proteins of 50 or fewer amino acids; most algorithms have a lower size limit of 50 codons). Recent studies have revealed hundreds of sORFs in diverse bacterial species (Orr et al., 2020; Sberro et al., 2019; Storz et al., 2014; Stringer et al., 2021; VanOrsdel et al., 2018; Weaver et al., 2019). Some sORFs encode functional small proteins that contribute to cell fitness, whereas other sORFs function as cis-acting regulators. In eukaryotes, there have been reports of ‘pervasive translation’ of thousands of unannotated sORFs, likely due to the imperfect specificity of the translation machinery (Ingolia et al., 2014; Ruiz-Orera et al., 2018; Wacholder et al., 2021). The function, if any, of most of these sORFs and/or their encoded proteins is unclear, although they are rarely subject to purifying selection (Ruiz-Orera et al., 2018; Wacholder et al., 2021). Nonetheless, a high-throughput mutagenesis study of unannotated sORFs in human cells suggested that some contribute to cell fitness (Chen et al., 2020). Moreover, pervasively translated eukaryotic sORFs may function as ‘proto-genes’ that, over the course of evolution, can acquire a function promoting cell fitness, a process referred to as ‘de novo gene birth’ (Blevins et al., 2021; Carvunis et al., 2012; Ruiz-Orera et al., 2018; Vakirlis et al., 2018; Vakirlis et al., 2020).

Ribosome profiling (Ribo-seq) is a powerful experimental approach to identify the translated regions of mRNAs by mapping ribosome-protected RNA fragments (Ingolia et al., 2009). Ribo-RET is a modified form of Ribo-seq in which bacterial cells are treated with the antibiotic retapamulin before lysis; retapamulin traps bacterial ribosomes at sites of translation initiation, whereas elongating ribosomes are free to complete translation (Meydan et al., 2019). Thus, Ribo-RET facilitates the identification of overlapping ORFs by limiting the signal to the start codons (Meydan et al., 2018; Meydan et al., 2019). Ribo-RET was recently applied to E. coli, revealing start codons for many previously undescribed ORFs (Meydan et al., 2019; Stringer et al., 2021; Weaver et al., 2019), including sORFs, and ORFs positioned in frame with annotated ORFs, such that the translated protein is an isoform of the previously described protein. Here, we use a combination of Ribo-seq and Ribo-RET to map translated ORFs in Mtb. We detect thousands of robustly translated, previously undescribed sORFs from leaderless and leadered mRNAs. We also identify hundreds of ORFs that have start codons upstream or downstream of those for annotated genes, in the same reading frame. We conclude that the Mtb transcriptome is pervasively translated, with spurious translation initiation occurring at many sites. We also identify a subset of novel sORFs that appear to be under purifying selection, suggesting these ORFs, or the proteins they encode, contribute to cell fitness. Thus, our data suggest that pervasive translation of sORFs in Mtb serves as a rich source for the evolution of functional genes.

Results

Hundreds of actively translated sORFs from leaderless mRNAs

Mtb has a genome of 4,411,532 bp, with 3989 annotated protein-coding genes (RefSeq annotation). Two previous studies of Mtb identified 1285 transcription start sites (TSSs) for which the associated transcript begins with the sequence ‘RUG’ (R = A or G; Supplementary file 1A; Cortes et al., 2013; Shell et al., 2015), suggesting that these transcripts correspond to leaderless mRNAs (Shell et al., 2015). Of the 1285 TSSs associated with a 5’ RUG, 577 match the start codons of protein-coding genes included in the current genome annotation, as previously noted (Cortes et al., 2013; Shell et al., 2015). A further 338 of the RUG-associated TSSs correspond to putative ORFs whose start codons are unannotated, but whose stop codons match those of annotated genes; we refer to this architecture as ‘isoform’, since translation of these putative ORFs would generate N-terminally extended or truncated isoforms of annotated proteins. We note that some isoform ORFs likely reflect mis-annotations, as has been suggested previously (Cortes et al., 2013; Shell et al., 2015). Lastly, 370 of the 1285 RUG-associated TSSs correspond to putative ORFs whose start and stop codons do not match those of any annotated gene; we refer to these as putative ‘novel’ ORFs.
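The three-way classification described above (annotated, isoform, novel) can be illustrated with a minimal Python sketch. It assumes plus-strand coordinates only, 0-based indexing, DNA rather than RNA letters, and an annotation supplied as a set of (start, stop) pairs; the function names `find_stop` and `classify_orf` are hypothetical and are not taken from the analysis code used in the study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_stop(seq, start):
    """Return the 0-based index of the first in-frame stop codon, or None.

    Scans codon by codon from `start` (assumed to be the first base of a
    putative start codon) on the plus strand only.
    """
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def classify_orf(start, stop, annotated_orfs):
    """Classify a putative ORF against a set of annotated (start, stop) pairs.

    'annotated': start and stop both match an annotated gene.
    'isoform':   only the stop matches (N-terminal extension or truncation).
    'novel':     neither start nor stop matches.
    """
    if (start, stop) in annotated_orfs:
        return "annotated"
    annotated_stops = {annotated_stop for _, annotated_stop in annotated_orfs}
    if stop in annotated_stops:
        return "isoform"
    return "novel"
```

In this sketch, a TSS at position 0 of `GTGAAAATGCCCTAA` shares its stop codon with an annotated ORF starting at position 6, so it would be labeled an isoform.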

To determine whether the putative isoform and novel leaderless ORFs are actively translated, we performed Ribo-seq in Mtb. Note that all genome-scale data described in this manuscript can be viewed in our interactive genome browser (https://mtb.wadsworth.org/). We first assessed ribosome occupancy profiles for leadered ORFs that are present in the current genome annotation. Consistent with previous studies (Oh et al., 2011; Woolstenhulme et al., 2015), we observed enrichment of ribosome occupancy at start and stop codons of annotated, leadered ORFs; the 3’ ends of ribosome-protected RNA fragments are enriched 15 nt downstream of the start codons, and 12 nt downstream of stop codons (Figure 1A). We note that there are also smaller peaks and troughs of Ribo-seq signal precisely at start and stop codons, likely attributable to sequence biases associated with library preparation that are highlighted when groups of similar sequences (e.g. start/stop codons) are aligned (see Methods). We next assessed ribosome occupancy profiles for the 577 leaderless ORFs that are present in the current genome annotation. As expected, we observed an enrichment of ribosome-protected RNA fragments, with 3’ ends positioned 12 nt downstream of stop codons (Figure 1B), consistent with the profile observed for leadered ORFs. However, 3’ ends of ribosome-protected RNA fragments were not enriched 15 nt downstream of the start codons of the 577 annotated leaderless ORFs; rather, we observed enrichment spread across the region ~25–35 nt downstream of leaderless start codons (Figure 1B), suggesting either that ribosomes at leaderless ORF start codons behave differently from those at leadered ORF start codons, or that ribosome-protected fragments are too small to be represented in the RNA library; this observation is consistent with a previous study (Sawyer et al., 2021). Further confounding analysis of leaderless start codons, which are, by definition, aligned with TSSs, we consistently observed non-random Ribo-seq signals at TSSs of non-leaderless transcripts (Figure 1—figure supplement 1), albeit to a lesser extent than that observed for leaderless gene starts.
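The metagene analysis referred to here aggregates read 3’-end coverage across many ORFs after aligning them at a shared landmark such as the start codon. The sketch below is one plausible way to compute such a profile, not the procedure used in the study: it assumes a plus-strand coverage array, and it normalizes each window to its own total so that highly expressed ORFs do not dominate. The function name `metagene` and the per-window normalization are assumptions made for illustration.

```python
def metagene(coverage, anchors, upstream=50, downstream=100):
    """Average normalized 3'-end coverage around a set of anchor positions.

    coverage: list of per-nucleotide read 3'-end counts (plus strand only).
    anchors:  0-based positions to align on (e.g. first base of start codons).
    Returns a list of length upstream + downstream + 1; index `upstream`
    corresponds to the anchor itself.
    """
    width = upstream + downstream + 1
    totals = [0.0] * width
    used = 0
    for anchor in anchors:
        lo = anchor - upstream
        hi = anchor + downstream + 1
        if lo < 0 or hi > len(coverage):
            continue  # skip windows that run off the sequence
        window = coverage[lo:hi]
        window_sum = sum(window)
        if window_sum == 0:
            continue  # skip windows with no reads
        for i, value in enumerate(window):
            totals[i] += value / window_sum
        used += 1
    if used == 0:
        return totals
    return [total / used for total in totals]
```

With this indexing, an enrichment of 3’ ends 15 nt downstream of start codons would appear as a peak at index `upstream + 15` of the returned profile.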

Figure 1. Ribo-seq data support the translation of hundreds of isoform and novel ORFs from leaderless mRNAs.

(A) Metagene plot showing normalized Ribo-seq sequence read coverage for untreated cells in the regions around start (left graph) and stop codons (right graph) of previously annotated, leadered ORFs. Note that sequence read coverage is plotted only for the 3’ ends of reads, since these are consistently positioned relative to the ribosome P-site (Woolstenhulme et al., 2015). Data are shown for two biological replicate experiments. The schematics show the position of initiating/terminating ribosomes, highlighting the expected site of ribosome occupancy enrichment at the downstream edge of the ribosome. (B) Equivalent data to (A) but for putative annotated, leaderless ORFs. (C) Equivalent data to (A) but for putative novel, leaderless ORFs. (D) Equivalent data to (A) but for putative isoform, leaderless ORFs. Only data for start codons are shown because the same stop codon is used by both an annotated and isoform ORF.

We reasoned that if the putative leaderless isoform and novel ORFs are actively translated, they would exhibit similar ribosome occupancy profiles to the leaderless annotated ORFs. Indeed, this was the case, with similar relative occupancy of ribosomes undergoing translation initiation and termination at start/stop codons (Figure 1C–D; we did not analyze isoform ORF stop codons because they are shared with those of annotated ORFs). Thus, our data are consistent with active translation of the majority of the 370 putative novel ORFs as leaderless mRNAs. Strikingly, 268 of the leaderless novel ORFs are sORFs. We conclude that Mtb has hundreds of actively translated sORFs on leaderless mRNAs.

Ribo-RET identifies sites of translation initiation in Mtb

While there are likely >1000 leaderless mRNAs in Mtb, most mRNAs are leadered (Cortes et al., 2013; Sawyer et al., 2021; Shell et al., 2015). Given that our data support the existence of >300 novel ORFs translated from the 5’ ends of leaderless mRNAs, we speculated that there are many more unannotated ORFs translated from leadered initiation codons. While sites of leaderless translation initiation can be readily identified from TSS maps, identification of novel leadered ORFs is more challenging. Translated leadered ORFs generate signal in Ribo-seq datasets, but identification of novel ORFs from Ribo-seq data is confounded by (i) the potential for artifactual signal in 5’ UTRs due to the binding of RNA-binding proteins (Ji et al., 2016), and (ii) masking of signal by overlapping ORFs on the same strand. To circumvent these problems, we performed Ribo-RET with Mtb to specifically map sites of translation initiation. We aligned the ribosome-protected RNA fragment sequences to the Mtb genome to identify ‘Initiation-Enriched Ribosome Footprints’ (IERFs), sites of ribosome occupancy that exceed the local background (Supplementary file 1B). Specifically, IERFs correspond to genomic coordinates that have ribosome occupancy coverage that exceeds an arbitrarily defined threshold value (5.5 reads per million) and is at least 10-fold higher than the mean ribosome occupancy coverage in the region 50 nt upstream to 50 nt downstream. We hypothesized that most IERFs correspond to sites of translation initiation. In support of this idea, there is a strong enrichment of IERF 3’ ends 15 nt downstream of the start codons of annotated, leadered genes; this enrichment is substantially greater than that observed for Ribo-seq data from cells grown without retapamulin treatment (Figure 2A; Figure 2—figure supplement 1).
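The IERF definition above combines an absolute coverage threshold with a local fold-enrichment criterion. A minimal sketch of that rule is shown below, assuming a single-strand, linear array of coverage values already expressed in reads per million. Whether the local mean includes the position being tested is not stated in the text, so including it here is an assumption, as are the function and parameter names.

```python
def call_ierfs(coverage_rpm, min_rpm=5.5, fold=10.0, flank=50):
    """Return positions whose coverage passes both IERF criteria.

    A position is reported when its coverage (i) exceeds `min_rpm` and
    (ii) is at least `fold` times the mean coverage of the window from
    `flank` nt upstream to `flank` nt downstream. The window here includes
    the position itself (an assumption) and is clipped at the array ends.
    """
    hits = []
    n = len(coverage_rpm)
    for pos, value in enumerate(coverage_rpm):
        if value <= min_rpm:
            continue
        lo = max(0, pos - flank)
        hi = min(n, pos + flank + 1)
        window = coverage_rpm[lo:hi]
        local_mean = sum(window) / len(window)
        if value >= fold * local_mean:
            hits.append(pos)
    return hits
```

Under this rule, a sharp peak over a low background is called, whereas a broad plateau of uniformly high coverage is not, because it fails the 10-fold local enrichment test.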

Figure 2. Ribo-RET of M. tuberculosis identifies sites of translation initiation.

(A) Metagene plot showing normalized Ribo-seq and Ribo-RET sequence read coverage (single replicate for each; data indicate the position of ribosome footprint 3’ ends) in the region from –50 to +100 nt relative to the start codons of annotated, leadered ORFs. (B) Heatmap showing the enrichment of eight selected trinucleotide sequences, for regions upstream of IERFs, relative to control regions. Expected positions of start codons and S-D sequences are indicated below the heatmap.

We determined the abundance of all trinucleotide sequences in the 40 nt regions upstream of IERF 3’ ends; there is a >2-fold enrichment of ATG, GTG and TTG (likely start codons), but not CTG, ATT or ATC, 15 nt upstream of IERF 3’ ends, and an enrichment of AGG and GGA (components of a consensus AGGAGGU Shine-Dalgarno sequence) 22–31 nt upstream of IERF 3’ ends (Figure 2B). We also observed >1.5-fold enrichment of ATG and GTG 14, 16, 17, and 18 nt upstream of IERF 3’ ends. The enrichment and position of start codon and Shine-Dalgarno-like sequence features upstream of IERFs are consistent with IERFs marking sites of translation initiation. We observed a strong enrichment of A/T immediately 3’ of the IERFs, i.e. on the other side of the site cleaved by micrococcal nuclease (MNase) during the Ribo-RET procedure; ‘A’ was found most frequently (53% of IERFs), and ‘G’ found the least frequently (2% of IERFs; Figure 2—figure supplement 2). This sequence bias is likely not due to a biological phenomenon, but rather to the sequence preference of MNase, which is known to display sequence bias when cutting DNA (Dingwall et al., 1981) and RNA (Woolstenhulme et al., 2015). The sequence bias is apparent in the complete Ribo-RET libraries, with 74% of sequenced ribosome-protected fragments having an ‘A’ or ‘U’ 3’ of the upstream MNase site. Given that the genomic A/T content in Mtb is only 34%, it is likely that inefficient RNA processing by MNase led to an underrepresentation of some G/C-rich translation initiation sites in the Ribo-RET data, and may explain the extended footprints (>15 nt) in G/C-rich contexts (see Discussion). This sequence bias also likely favors cleavage precisely at exposed start codons, which are strongly enriched for A/T bases, creating more RNA library fragments that end in these sequences (e.g. enriched Ribo-seq signal precisely at start codons in Figure 2A).
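The positional trinucleotide enrichment described above can be expressed as a ratio of frequencies at IERFs versus control sites. The following sketch is illustrative only: it handles the plus strand, takes "N nt upstream" to mean that the first base of the trinucleotide lies N nt upstream of the IERF 3’ end, and leaves the choice of control sites to the caller. These conventions and the function names are assumptions, not details given in the text.

```python
def trinucleotide_frequency(genome, sites, offset, trinuc):
    """Fraction of sites with `trinuc` starting `offset` nt upstream."""
    count = 0
    total = 0
    for site in sites:
        start = site - offset
        if start < 0 or start + 3 > len(genome):
            continue  # trinucleotide would fall off the sequence
        total += 1
        if genome[start:start + 3] == trinuc:
            count += 1
    return count / total if total else 0.0


def enrichment(genome, ierf_sites, control_sites, offset, trinuc):
    """Ratio of trinucleotide frequency at IERFs to that at control sites.

    Returns None when the trinucleotide never occurs at the control sites,
    because the ratio is undefined in that case.
    """
    control = trinucleotide_frequency(genome, control_sites, offset, trinuc)
    if control == 0:
        return None
    observed = trinucleotide_frequency(genome, ierf_sites, offset, trinuc)
    return observed / control
```

Repeating the calculation for each offset from 1 to 40 and each trinucleotide of interest would give a matrix of the kind summarized in a heatmap.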

Identification of putative ORFs from Ribo-RET data

A total of 1994 IERFs were found in both replicate experiments (Supplementary file 1B). 71% (1406) of these IERFs were associated with a potential ATG or GTG start codon 14–18 nt upstream of their 3’ ends, or a potential TTG start codon 15 nt upstream of their 3’ ends (Supplementary file 1C), a far higher proportion than that expected by chance (17%). Thus, these 1406 IERFs correspond to the start codons of putative ORFs, with an overall estimated false discovery rate (FDR) of 9% (see Materials and methods for details). 34% (478; FDR of 0.3%) of the putative ORFs precisely match previously annotated ORFs; 27% (373; FDR of 9%) overlap, and are in frame with, previously annotated ORFs (i.e. isoform ORFs); 39% (555; FDR of 15%) are novel ORFs, with no match to a previously annotated stop codon. A total of 112 novel ORFs were found entirely in regions presently designated as intergenic; the remaining novel ORFs overlap partly or completely with annotated genes in sense and/or antisense orientations (Figure 3A; Supplementary file 1C). Strikingly, 77% (430) of the novel ORFs we identified are sORFs, with 48 novel ORFs consisting of only a start and stop codon (Supplementary file 1C), an architecture recently described in E. coli (Meydan et al., 2019).
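The rule for pairing an IERF with a start codon (ATG or GTG 14–18 nt upstream of the 3’ end, or TTG exactly 15 nt upstream) can be sketched as follows. The text does not say how ties are resolved when more than one candidate codon is present, so the priority order used here (15 nt first, then increasingly distant offsets) is an assumption, as are the plus-strand restriction and the function name.

```python
# Offsets are tried in this order; the ordering is an assumption.
OFFSET_PRIORITY = (15, 14, 16, 17, 18)


def assign_start_codon(genome, ierf_3prime):
    """Return (start_index, codon) for the start codon linked to an IERF.

    `ierf_3prime` is the 0-based plus-strand position of the IERF 3' end.
    ATG and GTG are accepted at any offset in OFFSET_PRIORITY; TTG is
    accepted only at an offset of 15. Returns None if no codon qualifies.
    """
    for offset in OFFSET_PRIORITY:
        start = ierf_3prime - offset
        if start < 0:
            continue
        codon = genome[start:start + 3]
        if codon in ("ATG", "GTG"):
            return start, codon
        if codon == "TTG" and offset == 15:
            return start, codon
    return None
```

IERFs for which this function returns None would correspond to the fraction of footprints not associated with a putative start codon.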

Figure 3. Features of higher-confidence ORFs identified by Ribo-RET.

(A) Distribution of different classes of ORFs identified by Ribo-RET. The pie-chart shows the proportion of identified ORFs in each class. Isoform ORFs are further classified based on whether they are longer (‘N-terminal extension’) or shorter (‘N-terminal truncation’) than the corresponding annotated ORF. Novel ORFs are further classified based on their overlap with annotated genes. ‘Sense’, ‘antisense’, and ‘mixed’ refer to whether the overlapping gene(s) is/are in the sense, antisense, or both (multiple overlapping genes) orientations with respect to the novel ORF. ‘Fully’ and ‘Partially’ indicate whether all or only some of the novel ORF overlaps annotated genes. (B) Strip plot showing the ΔG for the predicted minimum free energy structures for the regions from –40 to +20 nt relative to putative start codons for the different classes of ORF, and for a set of 500 random sequences. Median values are indicated by horizontal lines.

We reasoned that if the isoform ORFs and novel ORFs are genuine, they should have S-D sequences upstream, and their start codons should each be associated with a region of reduced RNA secondary structure, as has been described for ORFs in other bacterial species (Baez et al., 2019; Del Campo et al., 2015). As we had observed for the set of all IERFs, regions upstream of isoform ORFs and novel ORFs are associated with an enrichment of AGG and GGA sequences in the expected location of an S-D sequence (Figure 3—figure supplement 1). This enrichment is lower than for annotated genes, but it is important to note that an S-D sequence was likely a contributing criterion in computationally predicting the initiation codons of annotated genes. We also assessed the level of RNA secondary structure upstream of all the putative ORFs identified by Ribo-RET. The predicted secondary structure for a set of random genomic sequences was significantly higher than the predicted secondary structure around the start of the identified annotated, novel, or isoform ORFs (Mann-Whitney U Test P < 2.2 × 10^–16 in all cases; Figure 3B). Moreover, the predicted secondary structure around the start of the annotated ORFs was only modestly, albeit significantly, higher than that of novel ORFs (Mann-Whitney U Test P = 1 × 10^–3). Collectively, the ORFs identified from Ribo-RET data exhibit the expected features of genuine translation initiation sites.

ORFs identified by Ribo-RET are actively translated in untreated cells

To determine if isoform ORFs and novel ORFs are actively and fully translated in cells not treated with retapamulin, we analyzed Ribo-seq data generated from cells grown without drug treatment. We assessed ribosome occupancy for annotated, novel, and isoform ORFs identified by Ribo-RET. As for the predicted leaderless ORFs, we reasoned that expressed leadered ORFs would be associated with increased ribosome occupancy at start and stop codons, as exemplified by previously annotated, leadered ORFs (Figure 1A; Oh et al., 2011; Woolstenhulme et al., 2015). Accordingly, annotated ORFs identified by Ribo-RET were strongly enriched for Ribo-seq signal 15 nt downstream of their start codons and 12 nt downstream of their stop codons (Figure 4A–B). We observed similar Ribo-seq enrichment profiles at the start and stop codons of novel ORFs, and downstream of the start codons of isoform ORFs (Figure 4A and C–D), but we did not observe these enrichment profiles for a set of mock ORFs (Figure 4—figure supplement 1A). Moreover, we did not observe enrichment of RNA-seq signal at start/stop codons, ruling out biases associated with library construction (Figure 4—figure supplement 1B-D). Overall, our data are consistent with most Ribo-RET-predicted isoform and novel ORFs being actively translated from start to stop codon, independent of retapamulin treatment.

Figure 4. Ribo-seq data support the translation of hundreds of isoform and novel ORFs identified by Ribo-RET.

(A) Ribo-seq and Ribo-RET sequence read coverage (read 3’ ends) across two genomic regions, showing examples of putative ORFs in the annotated (blue arrow), novel (orange arrow), and isoform (green arrow) categories. ORFs identified by Ribo-RET are shown with a black outline. (B) Metagene plot showing normalized Ribo-seq sequence read coverage (data indicate the position of ribosome footprint 3’ ends) for untreated cells in the regions around start (left graph) and stop codons (right graph) of ORFs predicted from Ribo-RET profiles that correspond to previously annotated genes. (C) Equivalent data to (B) but for putative novel ORFs identified from Ribo-RET data. (D) Equivalent data to (B) but for putative isoform ORFs identified from Ribo-RET data. Only data for start codons are shown because the same stop codon is used by both an annotated and isoform ORF.

Identification of lower-confidence ORFs from Ribo-RET data

In addition to the 1994 IERFs present in both replicates of Ribo-RET data, 4216 IERFs were found in only the first replicate dataset, which was associated with a stronger enrichment of ribosome occupancy at start codons (compare Figure 2A and Figure 2—figure supplement 1). Strikingly, 2791 (66%) of IERFs found in only the first Ribo-RET dataset were associated with a potential start codon 14–18 nt upstream of their 3’ ends (Supplementary file 1C; see Materials and methods for details), a far higher proportion than that expected by chance (17%), and a similar proportion to that observed for IERFs found in both replicate Ribo-RET datasets (71%). We refer to ORFs identified from only the first Ribo-RET dataset as ‘lower-confidence’ ORFs, reflecting the marginally higher FDRs; we refer to ORFs identified from both Ribo-RET datasets as ‘higher-confidence’ ORFs. 22% (614; FDR of 0.6%) of the lower-confidence ORFs are annotated, 29% (801; FDR of 10%) are isoform, and 49% (1372; FDR of 16%) are novel. 77% (1061) of the novel lower-confidence ORFs are sORFs, with 120 consisting of only a start and stop codon (Figure 4—figure supplement 2A), mirroring the proportions observed in the higher-confidence dataset.

Regions upstream of lower-confidence annotated, novel, and isoform ORFs are associated with an enrichment of AGG and GGA sequences in the expected location of a Shine-Dalgarno sequence (Figure 4—figure supplement 2B). The predicted secondary structure for a set of random genomic sequences was significantly higher than the predicted secondary structure around the start of the lower-confidence annotated ORFs, novel ORFs, and isoform ORFs (Mann-Whitney U Test P < 2.2 × 10^–16 in all cases; Figure 4—figure supplement 2C). Moreover, the predicted secondary structure around the start of the lower-confidence annotated ORFs was not significantly higher than that of the lower-confidence novel ORFs (Mann-Whitney U Test P = 0.22). Lastly, we examined ribosome occupancy at the start and stop codons of the lower-confidence ORFs from our Ribo-seq data generated from cells grown without drug treatment. Lower-confidence annotated, novel, and isoform ORFs were strongly enriched for Ribo-seq signal 15 nt downstream of their start codons and 12 nt downstream of their stop codons (Figure 4—figure supplement 2D-F). Collectively, the lower-confidence ORFs exhibit the characteristics of actively translated regions.

Novel ORFs tend to be weakly transcribed but efficiently translated

To investigate how efficiently novel ORFs are expressed, we determined RNA levels from RNA-seq data, and ribosome occupancy levels from Ribo-seq data, for all annotated and novel ORFs detected in this study (leaderless and leadered ORFs). We also determined RNA and ribosome occupancy levels for putatively untranslated regions of 1854 control transcripts (see Materials and methods for details). For novel ORFs, we analyzed only the 871 ORFs for which ≥50 nt of the ORF is ≥30 nt from an annotated gene on the same strand, to avoid overlapping signal from other ORFs. As a group, novel ORFs have lower RNA levels and lower ribosome occupancy levels than the 1670 annotated ORFs (Figure 5A top panel; Figure 5—figure supplement 1A top panel; Figure 5—figure supplement 1B-C). By contrast, the non-coding control transcripts as a group have similar RNA levels to novel ORFs, but lower ribosome occupancy levels (Figure 5A, lower panels; Figure 5—figure supplement 1A lower panels; Figure 5—figure supplement 1B-C). To estimate the ribosome occupancy per transcript, we determined the ratio of Ribo-seq reads to RNA-seq reads for each region analyzed (Figure 5B; Supplementary file 1, tabs A + C). As a group, novel ORFs have only slightly lower ribosome occupancy per transcript than annotated ORFs, while both novel and annotated ORFs have markedly higher ribosome occupancy per transcript than the control non-coding transcripts. We conclude that the RNA level for novel ORFs tends to be lower than that for annotated ORFs, but novel ORFs are translated with similar efficiency to annotated ORFs, and are thus clearly distinct from non-coding transcripts. The overall lower expression of novel ORFs relative to annotated ORFs is also reflected by lower Ribo-RET occupancy at their start codons (Figure 5—figure supplement 2).
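Ribosome occupancy per transcript, as used above, is a ratio of Ribo-seq signal to RNA-seq signal over the same region. The sketch below shows one way to compute it from raw read counts, assuming both libraries are first normalized to reads per million (RPM) per nucleotide. The function names and the choice to return None for regions with no RNA-seq signal are assumptions made for illustration.

```python
def rpm_per_nt(read_count, total_reads, region_length):
    """Normalize a raw read count to reads per million per nucleotide."""
    if total_reads <= 0 or region_length <= 0:
        raise ValueError("total_reads and region_length must be positive")
    return read_count * 1e6 / total_reads / region_length


def ribosome_density(ribo_rpm_per_nt, rna_rpm_per_nt):
    """Ratio of Ribo-seq to RNA-seq coverage for one region.

    Returns None when there is no RNA-seq signal, since the ratio is
    undefined for transcripts that are not detectably expressed.
    """
    if rna_rpm_per_nt <= 0:
        return None
    return ribo_rpm_per_nt / rna_rpm_per_nt
```

Because the ratio divides out transcript abundance, a weakly transcribed ORF can have a density comparable to that of a highly expressed annotated gene.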

Figure 5. Novel ORFs are efficiently translated.

(A) Pairwise comparison of normalized RNA-seq and Ribo-seq coverage for annotated, novel and non-coding control transcripts. Reads are plotted as RPM per nucleotide using a single replicate of each dataset for reads aligned to the reference genome at their 3’ ends. The categories compared are: (i) annotated ORFs (higher-confidence and lower-confidence ORFs detected by Ribo-RET, and leaderless ORFs; blue datapoints), (ii) novel ORFs (higher-confidence and lower-confidence ORFs detected by Ribo-RET and leaderless ORFs, for regions at least 30 nt from an annotated gene; orange datapoints), and (iii) a set of 1854 control transcript regions that are expected to be non-coding (see Materials and methods; purple datapoints). ORF/transcript sets are plotted in pairs to aid visualization. (B) Normalized ribosome density per transcript (ratio of Ribo-seq coverage to RNA-seq coverage) for the same sets of ORFs/transcripts. The graph shows the frequency (%) of ORFs/transcripts within each group for bins of 0.05 density units.

Validation of novel ORFs using mass spectrometry

Mass spectrometry (MS) provides a rigorous methodology to define the Mtb proteome. However, we predict that many of the small proteins we describe here are likely to be missed by MS because (i) there are biases against retaining small proteins in standard sample preparation methods, and (ii) small proteins generate few tryptic peptides. We hypothesized that we could enrich for small proteins by processing the normally discarded fractions from each of two standard preparations (Wisniewski et al., 2009). In total, we analyzed five samples prepared in different ways designed to enrich for small proteins (see Materials and methods). We also analyzed a sample made by in-solution digestion, which does not discard small proteins during final preparative stages (see Materials and methods). Nano-UHPLC-MS/MS on these samples identified proteins encoded by 44 of the putative leaderless and leadered novel ORFs identified in this study, at an estimated overall FDR of 1% (Tang et al., 2008). Novel proteins detected by MS are indicated in Supplementary file 1A, C. Eight proteins were detected in more than one preparation, or with independent peptide matches. Direct analysis from the mixed-organic extraction (with and without demethylation), and analysis of a minimally treated in-solution digestion, yielded the majority of the protein identifications. Ten of the proteins we detected are <50 amino acids in length, with the shortest being 23 amino acids long. The methods aimed at enriching for small proteins detected proteins of a smaller average size: the mean predicted length of novel proteins identified with small protein enrichment strategies was 60 amino acids, versus 86 amino acids for proteins identified from in-solution digestion. We anticipate that additional modifications in the enrichment protocols for small proteins will further improve the sensitivity of detection of small proteins.

Since many small proteins were identified by MS only as single peptides, we sought a direct approach to validate their detection. Three MS-detected novel small proteins were commercially synthesized, and their MS/MS spectra were determined for empirical comparison with those of the native small proteins. The three proteins were selected to span high (local FDR < 1%) and medium (local FDR < 5%) search scores. Two of these proteins are translated from leaderless ORFs and one from a leadered ORF. For all three proteins, the fragment ions from the synthetic peptide matched those from the proteomic datasets, with conservation of ion intensities (Figure 6). We conclude that all three proteins are translated as stable products that match the sequence expected based on Ribo-RET data.

Figure 6 with 2 supplements

Mass spectrometry validation of selected ORFs.

MS/MS spectra from novel ORFs measured with a synthetic peptide compared to spectra measured from the Mtb proteome. The genome coordinate and strand of each selected novel ORF start codon is indicated. (A) Leaderless ORF 1272167 (-) was identified from amino acids 2–24. The y₄ and parent m/z ions are off-scale. (B) Leaderless ORF 1242703 (+) was observed from amino acids 46–61. (C) Leadered ORF 4071711 (+) was observed from amino acids 4–26. The b₃ ion is off-scale. Measured b-ions are in blue, and y-ions are in red. The nearly complete spectrum obtained for each peptide and the fragment-mass balance clearly indicate that these sORFs are identical to their synthetic cognates.

Validation of novel and isoform start codons using reporter gene fusions

We sought to validate selected novel and isoform ORFs. We hypothesized that the start codons identified by Ribo-RET would direct translation initiation in a reporter system that controls for extraneous contextual variables. We selected 18 novel predicted start codons that scored in the top quartile for ribosome occupancy in Ribo-RET datasets; in Ribo-seq profiles, the associated ribosome densities per transcript cover a broad range of values (median percentile rank of 37 for the eight ORFs that could be assessed). We tested these start codons by fusing them to a luciferase reporter, including 25 bp of upstream sequence for each ORF tested. We constructed equivalent reporter fusions with a single base substitution in the predicted start codon (RTG to RCG). For comparison, we included wild-type and start codon mutant luciferase reporter fusions for three annotated ORFs (icl1, sucC, and mmsA). The reporter plasmids were integrated into the chromosome of M. smegmatis. Luciferase expression from each of the 20 luciferase fusions, including those for five novel ORFs from our lower-confidence list, was significantly reduced by mutation of the start codon (Figure 6—figure supplement 1; p < 0.05 or 0.01, as indicated, one-way Student’s T-test). Mutation of the start codons reduced, but did not abolish, luciferase expression; this was true even for the three annotated ORFs. We speculate that translation can initiate at low levels from non-canonical start codons, as has been described for E. coli (Hecht et al., 2017). We note that our plasmid reporter system was designed to minimize extraneous variables between constructs that could confound initiation codon evaluation, which necessarily removed the candidate start codons from their larger native context. Overall, the luciferase reporter fusion data are consistent with active translation from the start codons identified by Ribo-RET.
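The start codon mutants described above differ from the wild-type fusions by a single base (RTG to RCG). As a minimal sketch of that design step, the Python function below takes a construct consisting of 25 bp of upstream sequence followed by the ORF and replaces the central T of the start codon with C. The function name and the example sequence are hypothetical; this is not the cloning procedure used in this study.

```python
def mutate_start_codon(construct, upstream_len=25):
    """Return the construct with the start codon's central T replaced by C (RTG -> RCG)."""
    seq = construct.upper()
    codon = seq[upstream_len:upstream_len + 3]
    if codon not in ("ATG", "GTG"):
        raise ValueError(
            f"expected an RTG start codon at offset {upstream_len}, found {codon!r}"
        )
    t_index = upstream_len + 1  # position of the central T within the start codon
    return seq[:t_index] + "C" + seq[t_index + 1:]

# Hypothetical construct: 25 bp upstream, a GTG start codon, then coding sequence.
wild_type = "A" * 25 + "GTG" + "AAACCC"
mutant = mutate_start_codon(wild_type)
print(mutant[25:28])
```

Raising an error when the expected RTG is absent guards against an off-by-one in the upstream length, which would otherwise silently mutate the wrong base.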

Validation of novel and isoform ORFs using western blotting

To directly assess translation of selected putative ORFs, we generated constructs for two complete novel ORFs with 3 x FLAG tags fused at the encoded C-terminus. We generated equivalent constructs with a single base substitution in the putative start codon. The tagged constructs were integrated into the chromosome of M. smegmatis. The two proteins were detected by western blot, and they were not detected from cells with mutant start codons (Figure 6—figure supplement 2A). We generated equivalent 3 x FLAG-tagged strains for two isoform ORFs. We detected the overlapping, full-length protein by western blot, and expression of these full-length proteins was unaffected by mutation of the isoform ORF start codon (Figure 6—figure supplement 2B). We also detected a protein of smaller size, corresponding to the expected size of the isoform protein; expression of these small isoform proteins was not detected in the start codon mutant constructs (Figure 6—figure supplement 2B). Notably, for the pairs of novel and isoform proteins we detected by western blot, the two more highly expressed proteins were from the lower-confidence set of ORFs. Overall, these data support the ORF predictions from the Ribo-RET data, and the existence of novel and isoform ORFs identified from only a single replicate of Ribo-RET data.

Limited G/C-Skew in the codons of non-overlapping novel ORFs

The Mtb genome has a high G/C content (65.6%). There is a G/C bias within codons of annotated genes: the second position of codons is particularly constrained to encode specific amino acids, and this constraint supersedes the G/C bias of the genome, whereas the third (wobble) position has few such constraints. Hence, functional ORFs under purifying selection exhibit G/C content below the genome average at the second codon position and above the genome average at the third codon position (Bibb et al., 1984). We refer to the difference between the G/C content at third positions and that at second positions of codons as ‘G/C-skew’, with positive G/C-skew expected for ORFs subject to purifying selection. We reasoned that we could exploit G/C-skew to assess the likelihood that novel ORFs identified by Ribo-RET have experienced purifying selection at the codon level. We assessed G/C-skew for all 2299 novel ORFs identified in this study (leadered and leaderless). We limited the analysis to regions that do not overlap previously annotated genes, since G/C-skew could be impacted by selective pressure on an overlapping gene; 62% of ORFs were discarded because they completely overlap an annotated gene, and 17% of ORFs had some portion excluded. The set of all tested novel ORFs has modest, but significant, positive G/C-skew (Fisher’s exact test P < 2.2e-16; n = 19,750 codons; Figure 7; Supplementary file 1A, C), consistent with a subset of codons in this class having been under purifying selection. However, the degree of positive G/C-skew for the novel ORFs is much smaller than that for the annotated ORFs we identified in our datasets (Figure 7), suggesting that the proportion of novel ORFs experiencing purifying selection, and/or the intensity of that selection, is much lower than that for the annotated ORF group.

To identify specific novel ORFs that have likely experienced purifying selection of their codons, and hence are likely to contribute to cell fitness, we determined G/C-skew for the non-overlapping regions of each novel ORF individually. We then ranked the ORFs by the significance of their G/C-skew (Fisher’s exact test; see Materials and methods for details). Of the 103 ORFs with the most significant G/C-skew, there is a strong enrichment for positive G/C-skew: 90 of the ORFs have positive G/C-skew and 13 have negative G/C-skew. If ORFs that are not under selection were equally likely to show skew in either direction, about 13 of the positive-skew ORFs would be expected by chance, which suggests that ~80 of the 90 ORFs with positive G/C-skew have been subject to purifying selection on their codons. It is important to note that the size of the ORF is a major consideration when determining the significance of G/C-skew; the small size of novel ORFs therefore limits this analysis. Moreover, the G/C-skew analysis provides no information on regions of novel ORFs that overlap annotated genes. Hence, the number of novel ORFs that we predict to be functional based on their G/C-skew is almost certainly a substantial underestimate. Nonetheless, the overall G/C-skew of novel ORFs relative to that of annotated ORFs provides strong evidence that the majority of novel ORFs are not functional.
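A per-ORF G/C-skew calculation of the kind described above can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the procedure given in Materials and methods: it assumes that skew is the G/C fraction at third positions minus that at second positions, and that significance is assessed with a two-sided Fisher’s exact test on a 2 × 2 table of G/C versus A/T counts at third versus second positions. All function names are hypothetical.

```python
from math import comb

def codon_gc_counts(orf_seq):
    """Return (G/C count at 2nd positions, G/C count at 3rd positions, number of codons)."""
    seq = orf_seq.upper()
    n_codons = len(seq) // 3  # any trailing partial codon is ignored
    gc2 = sum(seq[3 * i + 1] in "GC" for i in range(n_codons))
    gc3 = sum(seq[3 * i + 2] in "GC" for i in range(n_codons))
    return gc2, gc3, n_codons

def gc_skew(orf_seq):
    """G/C fraction at third codon positions minus G/C fraction at second positions."""
    gc2, gc3, n = codon_gc_counts(orf_seq)
    return (gc3 - gc2) / n if n else 0.0

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def gc_skew_p_value(orf_seq):
    """Assumed test: G/C vs A/T counts at third positions compared with second positions."""
    gc2, gc3, n = codon_gc_counts(orf_seq)
    return fisher_exact_two_sided(gc3, n - gc3, gc2, n - gc2)
```

Because the test operates on counts, an ORF of only a few codons cannot reach a small p-value however extreme its skew, which mirrors the size limitation noted above.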

Figure 7

G/C skew within codons of novel and annotated ORFs.

Histogram showing the frequency of G/C nucleotides at each of the three codon positions for annotated ORFs or novel ORFs. Note that only regions of novel ORFs that do not overlap a previously annotated ORF were analyzed.

Discussion

Ribo-seq identifies thousands of isoform and novel ORFs

We have identified thousands of actively translated novel and isoform ORFs with high confidence. This conclusion is strongly supported by the clear association of initiating and terminating ribosomes with the start and stop codons, respectively, in untreated cells. We note that the enrichment of terminating ribosomes at the stop codons of novel ORFs in Ribo-seq data (i.e. no retapamulin treatment) is independent of the methods used to identify the novel ORF start codons. The novel and isoform ORFs are also supported by validation of selected ORFs using multiple independent genetic and biochemical approaches. Overall, our data reveal a far greater number of ORFs than previously appreciated, with annotated ORFs outnumbered by isoform and novel ORFs. Many genomic regions encode overlapping ORFs on opposite strands or on the same strand in different frames, contrary to the textbook view of genome organization.

There are 3898 annotated Mtb ORFs, but the ORF discovery approaches applied here under-sampled these, identifying 1669. Failure to identify more annotated ORFs is likely due to the following biological and technical reasons: (i) Many genes are likely to be expressed at levels too low to be detected. In support of this idea, the median Ribo-seq read coverage for leadered, annotated ORFs identified by Ribo-RET was significantly higher than that for equivalent ORFs not identified by Ribo-RET (3.8-fold; Mann-Whitney U test P < 2.2e-16). (ii) Many ORF start codons are likely to be misannotated, so the corresponding ORFs would be classified as isoforms. (iii) The A/T sequence preference of MNase (Figure 2—figure supplement 2) likely led to exclusion of some ORFs from the Ribo-RET libraries. In support of this idea, the base at position +17 relative to the start codon (i.e. immediately downstream of the preferred MNase cleavage site) is 1.7-fold more likely to be ‘A’, and 1.6-fold less likely to be ‘G’, for annotated ORFs we identified than for those we did not. Given the clear underrepresentation of annotated ORFs in our datasets, we conclude that there are many more isoform and novel ORFs to be discovered.

The abundance of novel start codons likely reflects pervasive translation

Evidence from other bacterial species suggests that the primary determinants of leadered translation initiation in Mtb are likely to be (i) a suitable start codon, (ii) an upstream sequence that can act as an S-D, and (iii) low local secondary structure around the ribosome-binding site. We detected enrichment of three different start codons in Ribo-RET data (Figure 2B), while S-D sequences can be located at a range of distances upstream of the start codon (Vellanoweth and Rabinowitz, 1992). Hence, there is limited sequence specificity associated with translation initiation. Moreover, a recent report showed that in E. coli, an S-D sequence is not an essential requirement for translation initiation (Saito et al., 2020). Leaderless translation initiation has even fewer sequence requirements; our data suggest that a 5’ AUG or GUG is sufficient for robust leaderless translation (Shell et al., 2015). While AUG and GUG represent only ~3% of all possible trinucleotide sequences, there is likely to be a strong bias towards 5’ AUG or GUG from the process of transcription initiation; the majority of TSSs in Mtb are purines, and the majority of +2 nucleotides are pyrimidines (Cortes et al., 2013; Shell et al., 2015). We propose that many Mtb transcripts are subject to spurious translation either by the leaderless or leadered mechanism, simply because the nominal sequence requirements for these processes commonly occur by chance. Thus, there is pervasive translation of the Mtb transcriptome, similar to the pervasive translation described in eukaryotes (Ingolia et al., 2014; Ruiz-Orera et al., 2018; Wacholder et al., 2021). Pervasive translation has been proposed as an explanation for some of the novel ORFs detected in E. coli by Ribo-RET (Meydan et al., 2019).
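The ~3% figure, and the direction in which the transcription start site bias shifts it, can be checked with a short enumeration. The Python sketch below counts AUG and GUG among all 64 trinucleotides, and then among the 16 trinucleotides that begin purine-pyrimidine. The second calculation is an idealization introduced here for illustration: it assumes that every transcript starts with a purine followed by a pyrimidine and that bases are otherwise uniformly distributed, which overstates the real bias.

```python
from itertools import product

START_CODONS = {"AUG", "GUG"}

# All 64 possible RNA trinucleotides.
all_trinucleotides = ["".join(p) for p in product("ACGU", repeat=3)]
uniform_fraction = (
    sum(t in START_CODONS for t in all_trinucleotides) / len(all_trinucleotides)
)

# Idealized 5' ends: purine at +1 and pyrimidine at +2 (R-Y-N).
ry_trinucleotides = [t for t in all_trinucleotides if t[0] in "AG" and t[1] in "CU"]
biased_fraction = (
    sum(t in START_CODONS for t in ry_trinucleotides) / len(ry_trinucleotides)
)

print(f"uniform: {uniform_fraction:.3%}, purine-pyrimidine start: {biased_fraction:.1%}")
```

Under this idealization the share of 5’ ends that read AUG or GUG rises fourfold, from 2 in 64 to 2 in 16.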

The process of pervasive translation is analogous to pervasive transcription, whereby many DNA sequences function as promoters, often from within genes, to drive transcription of spurious RNAs (Lybecker et al., 2014; Wade and Grainger, 2014). Indeed, there are many intragenic promoters in Mtb (Cortes et al., 2013; Shell et al., 2015), providing an additional source of potential spurious translation. We speculate that like spurious transcripts, which are rapidly degraded by RNases, the protein products of pervasive translation are rapidly degraded, as has been proposed for pervasively translated ORFs in E. coli (Stringer et al., 2021). Since Ribo-seq and Ribo-RET detect translation, not the protein product, the stability of the encoded proteins would not impact our ability to detect the corresponding ORFs.

Pervasive translation, by definition, means that ribosomes will spend some fraction of the time translating spurious ORFs. Although we detected many more novel ORFs than annotated ORFs, the total number of codons in all detected novel ORFs is ~20% that of annotated ORFs because of the smaller size of novel proteins. Moreover, novel ORFs as a group are expressed at substantially lower levels than annotated ORFs (Figure 5; Figure 5—figure supplements 1 and 2). Thus, it is likely that <10% of translation in Mtb at any given time is of spurious ORFs, so pervasive translation is unlikely to be overly detrimental to the cell.
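The reasoning behind the <10% estimate can be made explicit with a simple proportional model. The sketch below assumes that the translation devoted to a class of ORFs scales with its total codon count multiplied by its mean ribosome density per codon. The codon ratio of 0.2 comes from the text above, whereas the relative density values are hypothetical placeholders and not measurements from this study.

```python
def novel_translation_share(codon_ratio, relative_density):
    """Share of total translation spent on novel ORFs under a proportional model.

    codon_ratio: total novel-ORF codons divided by total annotated-ORF codons.
    relative_density: mean per-codon ribosome density of novel ORFs relative
        to annotated ORFs (hypothetical input).
    """
    novel = codon_ratio * relative_density
    return novel / (1.0 + novel)

# Codon ratio of ~0.2 from the text; relative densities are illustrative only.
for density in (1.0, 0.5, 0.25):
    print(density, round(novel_translation_share(0.2, density), 3))
```

In this model, with a codon ratio of 0.2, the novel ORF share falls below 10% whenever the relative density is below about 0.56, which is consistent with the lower expression of novel ORFs as a group.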

Proto-genes and the evolution of new functional genes

Studies of eukaryotes indicate the existence of proto-genes, targets of pervasive translation of either intergenic sequences or sequences antisense to annotated genes (Ingolia et al., 2014; Ruiz-Orera et al., 2018; Wacholder et al., 2021). Proto-genes have the potential to evolve into functional ORFs that contribute to cell fitness (Blevins et al., 2021; Carvunis et al., 2012; Lu et al., 2017; Ruiz-Orera et al., 2018; Vakirlis et al., 2018; Vakirlis et al., 2020; Van Oss and Carvunis, 2019). There is also evidence that some bacterial protein-coding genes evolved from intergenic sequence (Yomtovian et al., 2010). Our data suggest that Mtb has a rich source of proto-genes. As described for proto-genes in yeast, the novel ORFs we identified in Mtb tend to be less well expressed, have less adapted codon usage, and are shorter than annotated genes (Blevins et al., 2021; Carvunis et al., 2012). Pervasive translation likely facilitates the evolution of new gene function in Mtb. Since pervasive translation represents a low proportion of all translation, the fitness cost of pervasive translation may be balanced by the benefits of having a large pool of proto-genes.

New functional ORFs/proteins in Mtb

The question of whether an ORF is functional first requires a definition of function (Keeling et al., 2019). Here, we define function as the ability to improve cell fitness. While functional ORFs need not be under purifying selection, ORFs undergoing purifying selection are presumably functional. One metric of purifying selection available in the G/C-rich genomes of mycobacteria is G/C-skew. Analysis of G/C-skew in the codons of novel ORFs identified 90 ORFs that are likely to be functional (positive G/C-skew, p < 0.1 in Supplementary file 1A, C). Of these 90 novel ORFs, 54 are leadered, and the Ribo-RET signal associated with these 54 ORFs was significantly higher than that for the set of all other novel ORFs (Mann-Whitney U test P = 1.8e-5), consistent with the idea that functional ORFs are likely to be more highly expressed than non-functional ORFs (Carvunis et al., 2012; Vakirlis et al., 2020). Of the 90 ORFs that are likely functional based on their G/C-skew, 44 are ≤51 codons long. Thus, this single indicator of purifying selection has greatly expanded the set of likely functional small ORFs/proteins described for Mtb. There may be other constraints that additionally limit codon selection, especially for regulatory sORFs, such that some functional sORFs lack positive G/C-skew. Indeed, this is the case for a phylogenetically conserved set of cysteine-rich regulatory sORFs; cysteine codons that are likely to be essential for sORF regulatory function (Canestrari et al., 2020) also reduce the G/C-skew (Supplementary file 1D).

Analysis of codon usage for isoform ORFs is not informative due to their overlap with annotated ORFs. Some isoform ORFs are likely to represent mis-annotations of annotated ORFs. Multiple lines of evidence support this idea: (i) 19% (288) of isoform start codons are ≤10 codons from the corresponding annotated start codon (Supplementary file 1E); this was 3.4-fold more likely for leaderless isoform ORFs, presumably because they lack an S-D, which likely reduces the accuracy of start codon prediction by annotation pipelines. (ii) Leadered isoform ORFs that initiate within 10 codons of an annotated ORF have significantly higher Ribo-RET occupancy than other leadered isoform ORFs (Mann-Whitney U test P = 6.3e-13; Ribo-RET occupancy from a single replicate), and are significantly less likely to overlap an annotated gene whose start codon was identified by Ribo-RET (Fisher’s exact test P = 3e-4). Nonetheless, since most isoform ORFs start far from an annotated ORF start, we presume that most do not represent mis-annotations; indeed, for 43% (644) of the isoform ORFs, we also detected the start codon of the overlapping annotated ORF by Ribo-RET. While we expect many isoform ORFs to be a manifestation of pervasive translation, we speculate that some encode proteins with functions related to the function of the protein encoded by the overlapping, annotated gene, as has been proposed for isoform ORFs in E. coli (Meydan et al., 2019).

Conclusions

Our data suggest that the Mtb transcriptome is pervasively translated. The unprecedented extent of translation we observe suggests that much of the translation is biological ‘noise’, and that most of the translated ORFs are unlikely to be functional. As ribosome-profiling studies are extended to more diverse species, we anticipate a massive increase in the discovery of bacterial sORFs/small proteins. Future studies aimed at functional characterization of sORFs/small proteins will need to prioritize candidates with clear supporting evidence for function from codon usage patterns, phylogenetic conservation (Sberro et al., 2019), or genetic data.

Materials and methods

Key resources table

Reagent type (species) or resource	Designation	Source or reference	Identifiers	Additional information
Strain, strain background (Mycobacterium tuberculosis)	mc²7000	DOI: 10.1016/j.vaccine.2006.05.097	ΔpanCD ΔRD1
Strain, strain background (Mycobacterium smegmatis)	mc²155	DOI: 10.1111/j.1365-2958.1990.tb02040.x
Antibody	Monoclonal anti-FLAG M2 antibody (Mouse monoclonal)	SIGMA	Catalog # F1804	Used at (1:1,000) dilution for western blot
Recombinant DNA reagent	pRV1133C (plasmid)	This study	pRV1133C	Integrates at attP site; includes the metE promoter region
Recombinant DNA reagent	pGE450 (plasmid)	This study	pGE450	Derivative of pRV1133C containing 3 x FLAG
Recombinant DNA reagent	pGE190 (plasmid)	This study	pGE190	Derivative of pRV1133C containing the nLuc gene from pNL1.1 (Promega, cat no 1001)
Other	Micrococcal nuclease (S7)	SIGMA	Catalog # 10107921001
Other	Nano-Glo Luciferase Assay Reagent	Promega	Catalog # N1110
Chemical compound, drug	Retapamulin	SIGMA	Catalog # CDS023386
Software, algorithm	CLC Genomics Workbench	Qiagen	v8.5.1	Alignment of sequence reads from .fastq files
Software, algorithm	RNAfold	DOI:10.1186/1748-7188-6-26	v2.4.14	ViennaRNA Package https://www.tbi.univie.ac.at/RNA/

Author details

Carol Smith
Jill G Canestrari
Archer J Wang
Matthew M Champion
Keith M Derbyshire
Todd A Gray
Joseph T Wade