Long-read sequencing for the identification of insertion sites in large libraries of transposon mutants

DNA was prepared for nucleotide sequencing using both the new LoRTIS method and the previously described TraDIS-Xpress method. Both protocols were performed in duplicate, from DNA extraction through to the generation of nucleotide sequence reads. This allowed reproducibility to be assessed within each method and LoRTIS data to be compared with those generated using TraDIS-Xpress. For the first LoRTIS replicate, 8.7 million nucleotide sequence reads were generated, of which 4.2 million (48%) included transposon-specific sequences, while for the second replicate, 7.6 million of the 14.2 million reads (54%) had transposon-specific sequences. These data demonstrate that the new method successfully enriched for transposon-genome junctions. Read lengths ranged from 300 bp to over 13,000 bp, with an average of over 1,200 bp (Supplementary Table 1).
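
The enrichment percentages above are the share of reads that carry a transposon-specific sequence. A minimal sketch of that tally is shown below. It assumes exact string matching, and the tag sequence is a made-up placeholder rather than the transposon end used in the study; real nanopore reads contain errors, so an error-tolerant search would normally be needed.

```python
def fraction_with_tag(reads, tag):
    """Count reads containing `tag` or its reverse complement.

    Returns (hits, fraction). Exact matching only: an assumption made for
    brevity, not the matching strategy used in the study.
    """
    complement = str.maketrans("ACGT", "TGCA")
    reverse_complement = tag.translate(complement)[::-1]
    hits = sum(1 for read in reads if tag in read or reverse_complement in read)
    fraction = hits / len(reads) if reads else 0.0
    return hits, fraction


# Hypothetical tag and reads, for illustration only.
TAG = "GATTACAGG"
example_reads = ["AAGATTACAGGTT", "CCCCCCCC", "AACCTGTAATCAA"]
hits, fraction = fraction_with_tag(example_reads, TAG)
print(f"{hits} of {len(example_reads)} reads ({fraction:.0%}) contain the tag")
```

Applied to the counts reported above, the same ratio gives 4.2 / 8.7 (48%) for replicate 1 and 7.6 / 14.2 (54%) for replicate 2.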

Reproducibility of LoRTIS and comparison with TraDIS-Xpress

The number of nucleotide sequence reads mapped to each gene was determined, and a scatter plot comparing these values for replicates 1 and 2 demonstrated the reproducibility of the LoRTIS method (Fig. 2). Comparison of reads per gene generated by LoRTIS with data from TraDIS-Xpress highlighted the similarity between the two methods (Fig. 2). Spearman’s correlation coefficient between the LoRTIS and TraDIS-Xpress data sets was 0.93. The distribution of mapped sequence reads was also similar in position and number between the two methods, indicating accurate calling of transposon insertion sites by LoRTIS (Fig. 3).
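
For readers unfamiliar with the statistic, Spearman's correlation is the Pearson correlation of the ranks of the two sets of per-gene read counts, with tied values given their average rank. The sketch below is a plain-Python illustration of that definition; the function names and the example counts are hypothetical and it is not the analysis code used for the study.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical reads-per-gene counts for the same five genes in two data sets.
method_a = [0, 120, 45, 300, 45]
method_b = [2, 150, 30, 280, 60]
print(round(spearman(method_a, method_b), 3))
```

Because only ranks are used, the coefficient is insensitive to the difference in sequencing depth between the two methods.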

Figure 2

Scatter plot of the number of sequence reads per gene using TraDIS-Xpress and LoRTIS. Each point on the graph represents a gene. For each gene, the number of sequence reads mapped in replicate 1 is plotted against the number mapped in replicate 2. Red dots show LoRTIS results and blue dots show TraDIS-Xpress results. The strong correlation indicates the reproducibility of each method.

Figure 3

Comparison of transposon insertion sites identified using TraDIS-Xpress and LoRTIS. A genetic map of the relative positions of the genes is displayed at the bottom of the panel. The white arrow boxes represent the positions of the genes and the blue arrow boxes represent the protein-coding sequences. Above this, each row of red or blue vertical lines indicates the positions of reads mapped on the forward or reverse strand, respectively. The height of each bar represents the relative number of reads mapped to that position. The top row shows TraDIS-Xpress data and the bottom row shows LoRTIS data, demonstrating the similarity between the data generated by the two methods.

Identification of candidate essential genes

During transposon mutagenesis for TIS experiments, mutants with transposon insertions in essential genes do not survive. Therefore, provided that enough transposon mutants and nucleotide sequence reads are generated to avoid stochastic regions of low coverage, TIS data should include relatively few sequence reads that map to essential genes. However, if too few sequence reads are generated, some genes will lack reads by chance and will appear essential even though they are not; precise calling of essential genes therefore requires enough data to rule this out. Thus, an ideal quality control for TIS data is a clear demonstration that mapped reads are distributed across the genome, and that enough data have been generated to reveal that very few or no reads map to known essential genes.

The LoRTIS data presented here not only yielded sequence reads that mapped across the genome, but also showed an absence of mapped reads in many putative essential genes identified using TraDIS-Xpress. As an example, the similar distributions of LoRTIS and TraDIS-Xpress mapped reads over a short section of the genome are shown in Fig. 3. No sequence reads mapped to the candidate essential genes groS and groL, while there was an abundance of reads mapping to the dcuA, fxsA, yjeH and yjeJ genes using both LoRTIS and TraDIS-Xpress, confirming that LoRTIS was at least equal to TraDIS-Xpress in this regard.
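
The reasoning above can be illustrated by flagging genes with very few insertions relative to their length. The sketch below is only an illustration under stated assumptions: the text does not describe how the cutoff was chosen, so the fixed `cutoff` value, the function name and the gene entries are all hypothetical, and a cutoff is usually derived from the data rather than fixed in advance.

```python
def candidate_essential(genes, cutoff=0.001):
    """Flag genes whose insertion index falls below `cutoff`.

    genes: dict mapping gene name -> (length_bp, n_insertion_sites).
    The insertion index is insertion sites per base pair. The default
    cutoff is an arbitrary placeholder, not a value from the study.
    """
    flagged = []
    for name, (length_bp, n_sites) in genes.items():
        insertion_index = n_sites / length_bp
        if insertion_index < cutoff:
            flagged.append(name)
    return sorted(flagged)


# Made-up genes: one with no insertions, one densely covered.
example = {"geneA": (1500, 0), "geneB": (1200, 90)}
print(candidate_essential(example))
```

A check like this is only meaningful when overall coverage is high; with sparse data, short non-essential genes can fall below any cutoff by chance, which is the failure mode described in the preceding section.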

A list of putative essential genes generated from our LoRTIS data was also compared with lists derived from TraDIS-Xpress data and from conventional TraDIS data from another group6,7. These reference datasets were selected for comparison because they were generated from the same strain of E. coli (BW25113). The TraDIS-Xpress and TraDIS data were produced using the Illumina platform for sequence generation, while LoRTIS used nanopore sequencing. Comparison of the putative essential genes showed that 311 were common to all three methods (Fig. 4; Supplementary Table 2). Figure 4 illustrates the putative essential genes identified using each method and their relative distribution. Of the 398 putative essential genes identified from our TraDIS-Xpress data, 340 (85%) were also identified by LoRTIS.
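
The comparison of gene lists amounts to set arithmetic. The following sketch shows one way the seven regions of a three-way Venn diagram and the percentage overlap could be computed; the function names and the toy gene lists are hypothetical, and the real lists are in Supplementary Table 2.

```python
def venn3_counts(a, b, c):
    """Return the size of each of the seven regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }


def percent_shared(reference, other):
    """Percentage of genes in `reference` that also appear in `other`."""
    reference, other = set(reference), set(other)
    return 100 * len(reference & other) / len(reference)


# Toy gene lists, for illustration only.
lortis = {"g1", "g2", "g3"}
xpress = {"g1", "g2", "g4"}
tradis = {"g1", "g5"}
print(venn3_counts(lortis, xpress, tradis))
print(round(percent_shared(xpress, lortis), 1))
```

With the reported counts, `percent_shared` corresponds to 340 / 398, which rounds to the 85% quoted above.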

Figure 4

Putative essential genes identified using three different transposon insertion sequencing methods. Venn diagram showing the number of putative essential genes identified from three different sources of transposon insertion sequencing data: LoRTIS, TraDIS-Xpress and TraDIS7. Of these, 311 putative essential genes were common to all three datasets.

Advantages of long sequence reads in mapping transposon insertion sites in regions of repeated nucleotide sequences

Long reads are particularly useful for mapping to unique sites in the genome when the genome is large or contains repeated elements. LoRTIS can produce long reads that span repeated elements and extend into unique regions of the genome, allowing transposon insertions within repeats to be identified. In E. coli BW25113, there are seven ribosomal RNA operons; each is over 5 kb in size and contains two highly conserved ribosomal RNA genes. Reads generated by TraDIS-Xpress could not be uniquely mapped to these operons, whereas reads generated by LoRTIS could. Although most of the reads generated in this study were between 0.3 and 2 kb in length, they were still uniquely mapped, because they spanned either polymorphic regions within the repeat elements or unique flanking nucleotide sequences (Fig. 5).

Figure 5

Long reads mapped to only a single copy of a repeated ribosomal RNA gene cluster larger than 5 kb. In the E. coli genome, the longest repetitive nucleotide sequences are the ribosomal RNA operons, of which there are seven copies, each over 5 kb in length. They are therefore a good test of the ability of LoRTIS to uniquely identify insertion sites in long repetitive sequences. A genetic map of the relative positions of the rRNA genes is displayed at the bottom of the panel as blue arrow boxes. Above that, each panel shows the long reads, each represented by a thin horizontal line, that mapped only to that repeat element.

Another set of repeated elements in E. coli are the ins loci (insA, insB, insH), which are present as multiple gene copies spread across the genome16. Transposon insertions have been reported in these ins loci, but again it has not been possible to assign them to any given copy with certainty using short reads. In our LoRTIS data, over 47,000 sequence reads matched ins loci, of which ~22,000 (47%) mapped uniquely, whereas in the TraDIS-Xpress data generated on the Illumina short-read platform, of over 28,000 reads that mapped to ins loci, only ~6,500 (17%) mapped uniquely (Supplementary Table 3). These data demonstrate that long LoRTIS reads can be mapped uniquely to repeated elements more efficiently than TraDIS-Xpress reads.
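
One common way to count uniquely mapped reads is to filter alignments by mapping quality. The sketch below reads SAM-format records and treats primary alignments with MAPQ of at least `min_mapq` as unique. This is an assumption made for illustration: the text does not state how uniqueness was defined, the meaning of MAPQ 0 depends on the aligner, and the function name and example records are hypothetical.

```python
def unique_fraction(sam_lines, min_mapq=1):
    """Return (unique, mapped, fraction) for primary alignments in SAM text.

    Skips header lines, unmapped reads (FLAG 0x4), and secondary (0x100)
    or supplementary (0x800) alignments. A read is counted as unique when
    its MAPQ is at least `min_mapq`, which is an assumed criterion.
    """
    mapped = unique = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        mapped += 1
        if mapq >= min_mapq:
            unique += 1
    return unique, mapped, (unique / mapped if mapped else 0.0)


# Minimal made-up SAM records: header, unique hit, multi-mapper, unmapped, secondary.
example_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t16\tchr\t200\t0\t50M\t*\t0\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r4\t256\tchr\t300\t0\t50M\t*\t0\t0\t*\t*",
]
print(unique_fraction(example_sam))
```

In practice the records would be restricted to the ins loci before counting, so that the fraction reflects only reads overlapping those repeated elements.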

Multiplexing of LoRTIS experiments

A unique sequence identifier (barcode) can be added to the DNA fragments of a sample during sequencing library preparation, allowing different samples to be combined and sequenced on a single flow cell (multiplexed). After sequencing, reads from each sample can be separated from the pool on the basis of the barcode (demultiplexed). Oxford Nanopore uses 24 bp sequences, called native barcodes (NBD), to assign a unique identifier to each sample, and 96 NBDs are available. We used four of these NBDs to multiplex our LoRTIS DNA fragment preparations. Of the sequence reads from our LoRTIS experiment, 94% and 84% were demultiplexed into these NBDs in replicates 1 and 2, respectively. Although each NBD produced a different number of reads, no bias towards any particular NBD was observed (Fig. 6). This confirms that LoRTIS is compatible with multiplexing different experimental samples on a single MinION flow cell.
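
The principle of demultiplexing can be sketched as a search for each barcode near the start of every read. The example below is a deliberately simplified illustration: it uses exact matching, whereas real demultiplexing tools tolerate sequencing errors, and the short barcode sequences are invented placeholders rather than the 24 bp Oxford Nanopore native barcodes.

```python
from collections import Counter


def demultiplex(reads, barcodes, window=40):
    """Assign each read to a barcode found within its first `window` bases.

    barcodes: dict mapping barcode name -> sequence. Reads matching no
    barcode, or more than one, are counted as "unclassified".
    """
    counts = Counter()
    for read in reads:
        head = read[:window]
        matches = [name for name, seq in barcodes.items() if seq in head]
        label = matches[0] if len(matches) == 1 else "unclassified"
        counts[label] += 1
    return counts


# Placeholder barcode sequences, NOT the real native barcodes.
example_barcodes = {"NBD1": "ACGTACGT", "NBD4": "TTGGCCAA"}
example_reads = [
    "GGACGTACGTCCCC",    # NBD1
    "TTGGCCAAGGGG",      # NBD4
    "CCCCCCCC",          # no barcode
    "ACGTACGTTTGGCCAA",  # ambiguous: both barcodes present
]
print(dict(demultiplex(example_reads, example_barcodes)))
```

The proportion of reads per barcode, as summarised for Fig. 6, then follows by dividing each count by the total number of classified reads.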

Figure 6

Native barcode reads demultiplexed from LoRTIS data. The pie charts represent the total number of reads containing transposon-specific sequences that were demultiplexed from the LoRTIS data. Replicate 1 had reads from all four native barcodes (NBD): 32% of the reads were from NBD1 (blue), 30% from NBD4 (brown), 20% from NBD5 (grey) and 18% from NBD8 (yellow). Replicate 2 also had reads from all four NBDs: 20% from NBD1 (blue), 32% from NBD4 (brown), 26% from NBD5 (grey) and 22% from NBD8 (yellow).
