Illumina and Nanopore methods for whole genome sequencing of hepatitis B virus (HBV)

McNaughton, Anna L.; Roberts, Hannah E.; Bonsall, David; de Cesare, Mariateresa; Mokaya, Jolynne; Lumley, Sheila F.; Golubchik, Tanya; Piazza, Paolo; Martin, Jacqueline B.; de Lara, Catherine; Brown, Anthony; Ansari, M. Azim; Bowden, Rory; Barnes, Eleanor; Matthews, Philippa C.

doi:10.1038/s41598-019-43524-9

Article
Open access
Published: 08 May 2019

Volume 9, article number 7081, (2019)

Scientific Reports

Abstract

Advancing interventions to tackle the huge global burden of hepatitis B virus (HBV) infection depends on improved insights into virus epidemiology, transmission, within-host diversity, drug resistance and pathogenesis, all of which can be advanced through the large-scale generation of full-length virus genome data. Here we describe advances to a protocol that exploits the circular HBV genome structure, using isothermal rolling-circle amplification to enrich HBV DNA, generating concatemeric amplicons containing multiple successive copies of the same genome. We show that this product is suitable for Nanopore sequencing as single reads, as well as for generating short-read Illumina sequences. Nanopore reads can be used to implement a straightforward method for error correction that reduces the per-read error rate, by comparing multiple genome copies combined into a single concatemer and by analysing reads generated from plus and minus strands. With this approach, we can achieve an improved consensus sequencing accuracy of 99.7% and resolve intra-sample sequence variants to form whole-genome haplotypes. Thus while Illumina sequencing may still be the most accurate way to capture within-sample diversity, Nanopore data can contribute to an understanding of linkage between polymorphisms within individual virions. The combination of isothermal amplification and Nanopore sequencing also offers appealing potential to develop point-of-care tests for HBV, and for other viruses.

Introduction

Chronic hepatitis B virus (HBV) infection affects an estimated 250–290 million individuals worldwide, resulting in around 800,000 deaths from chronic liver disease and hepatocellular carcinoma each year^1,2. The status of HBV infection as a globally important public health problem is highlighted by United Nations Sustainable Development Goals, which set a target for HBV elimination by the year 2030³. An improved understanding of the molecular biology, epidemiology, infection dynamics and pathophysiology of HBV is a crucial step towards reducing the global burden of HBV disease. Despite the availability of a robust prophylactic vaccine and safe suppressive antiviral therapy, HBV has remained endemic - and neglected - in many populations⁴. Large-scale virus genome sequencing to provide more complete genetic information at the population and individual level can shed light on the limitations of current interventions⁵, and inform new strategies for elimination. New sequencing initiatives are required with improved methodologies that are efficient, accurate, sensitive and cost-effective⁶.

In the context of clinical and public health settings, HBV sequencing can provide information that is useful in characterizing virus genotype, potential transmission networks, drug and vaccine resistance, and aspects of the dynamics of infection^5,7,8. Traditional Sanger sequencing can derive consensus sequences (usually of sub-genomic fragments), and next-generation technologies such as Illumina can interrogate within-sample diversity at the whole-genome level. Sequencing complete virus genomes at depth, while also preserving mutation-linkage information (i.e. complete haplotypes), remains an important goal. Such data will inform more accurate phylogenetic characterisation of viral quasispecies within infected hosts, which can in turn be interpreted to study virus transmission and the evolutionary dynamics of drug and immune escape⁶.

‘Third generation’ (i.e. single-molecule) sequencing approaches, including those based on nanopores (Oxford Nanopore Technologies, ONT)^9,10, have the potential to revolutionise virus genome sequencing by producing genome-length reads that encompass all of the mutations within a single virus particle. In addition, Nanopore technology is portable and provides sequence data in real time, potentially enabling sequencing as a point-of-care test. However, Nanopore sequencing has been adopted with caution because of its high raw error rates¹¹. While error-corrected Nanopore consensus sequences may be sufficiently accurate for many uses, raw-read accuracy remains a concern if it is to be used for the assessment of within-sample (between-molecule) diversity. One strategy to reduce error rates from single source molecules is to create concatemeric (chain-like) successive copies of each template, so that a single concatemer contains several reads of each base from the original molecule. This approach has been demonstrated in the circularization of 16S bacterial DNA sequences followed by ‘rolling circle amplification’ (RCA) using a high-fidelity DNA polymerase¹².

HBV has an unusual, circular, partially double-stranded (ds) DNA genome of approximately 3.2 kb (Fig. 1A(i))⁶. The combination of double- and single-stranded DNA in a single molecule can cause technical problems for sequencing, since library preparation methods are usually specific for either double- or single-stranded DNA templates. HBV isolates have previously been sequenced with Nanopore technology using full-length and sub-genomic PCR approaches to enrich for HBV sequences^13,14. Whilst these approaches worked well in the studies when applied to high viral load samples, in both publications correction was only possible at the consensus level, with one study having a raw read error rate of ~12%¹³, and the other unable to definitively confirm putative minority variants detected in the MinION reads¹⁴. In this study we build on a published method for HBV enrichment and amplification from plasma^15,16, which generates intermediates that are suitable for sequencing by Nanopore or Illumina. We implement novel analytical methods to exploit concatemeric reads in improving the accuracy of Nanopore sequencing of HBV for use in research and clinical applications.

Results

Completion ligation and rolling circle amplification prior to Illumina sequencing of full-length HBV genomes

We applied sequencing methods (as shown in Fig. 1) to plasma from three different adults with chronic HBV infection (Table 1). We first set out to convert the partially dsDNA viral genome (Fig. 1A(i)) to a complete dsDNA HBV molecule using a completion-ligation (CL) method (Fig. 1A(ii))¹⁶, so that sequencing libraries could be generated using kits that require dsDNA as input. Following CL, genomes were amplified by the use of primers (Fig. 1A(iii)) and rolling circle amplification (RCA; Fig. 1A(iv))^15,16. We confirmed an increase in HBV DNA after RCA by comparing extracted DNA to RCA products using qPCR (Suppl Methods 1). Using DNA products derived from CL followed by RCA (Fig. 1B(ii)) and from CL alone without an RCA step (Fig. 1B(iii)), we prepared sequencing libraries and sequenced them using an Illumina MiSeq instrument.

Table 1 Details of samples used for HBV sequencing.

Both the CL and CL + RCA methods generated Illumina sequencing data that covered the whole HBV genome for all three samples (Fig. 2A). The relative drop in coverage across the single-stranded region of the HBV genome disappeared after RCA, suggesting a preferential amplification of intact whole HBV genomes.

We observed a region of reduced coverage, corresponding approximately to nt 2500–2700, in all samples (Fig. 2A). Further examination of the sample with the sharpest drop in coverage across this region (sample 1348) revealed a drop in the density of insert ends in the region (Suppl Fig. 1) and resulting disruption to insert size (Fig. 2B), consistent with inefficient digestion by the Nextera transposase. Reasons for the reduced coverage are unclear; no nicks in the HBV genome have been described in this region, but there may be some secondary structure present. GC content may also be a contributing factor: GC bases in the region nt 2500–2700 account for 35–37.5% in the Illumina consensus sequences, in contrast to the rest of the genome, where GC content is 48–49.5%.
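The regional GC comparison described above amounts to a simple windowed calculation. The following is a minimal sketch only: the function name `gc_percent`, the toy sequence and its coordinates are illustrative assumptions and are not taken from the study's pipeline or data.

```python
def gc_percent(seq, start, end):
    """Return %GC for the 1-based, inclusive interval [start, end] of seq."""
    window = seq[start - 1:end].upper()
    if not window:
        raise ValueError("empty window")
    gc = sum(base in "GC" for base in window)
    return 100.0 * gc / len(window)


# Toy sequence (hypothetical): an AT-rich stretch flanked by GC-rich sequence.
toy = "GCGCGCGCGC" + "ATATATATGC" + "GCGCATGCGC"
print(gc_percent(toy, 11, 20))  # AT-rich window
print(gc_percent(toy, 1, 10))   # GC-rich window
```

For a real consensus sequence, the same function could be called with the window nt 2500–2700 and with the remainder of the genome to compare the two values.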

To investigate the possible effects of RCA on the representation of within-sample diversity, we compared variant frequencies between CL and CL + RCA. Only 2% of sites had variants at a frequency >0.01 and there appeared to be a consistent reduction in estimated frequency in RCA compared with CL alone (Fig. 2C), but overall this effect appears to be very minor for the samples we have studied.

Completion ligation and rolling circle amplification facilitates Nanopore sequencing of full-length HBV genomes

We used the material generated by RCA for Nanopore sequencing on the MinION (ONT) (Fig. 1B(i)). Reads mapping to HBV accounted for 0.6–1.3% of all sequences derived from individual patient samples (Table 1). The majority of the remaining reads mapped to the human genome (Suppl Fig. 2). The reads included concatemers of the full-length HBV genome (as illustrated in Fig. 1C) reaching up to 16 HBV genomes per concatemer sequence, with a median of 1–2 HBV genomes (Fig. 3A,B). The numbers of reads passing the quality criteria required for downstream analysis (described in the methods section) are shown in Table 1.

RCA followed by Nanopore sequencing does not produce chimeric sequences

In order to ascertain whether recombination occurred between different viral genomes during RCA or Nanopore sequencing¹², we sequenced a mixture of two plasma samples (1331 and 1332, genotypes C and E respectively), producing 3,795 HBV reads (of any length) with a primary mapping to genotype C and 9,358 HBV reads with a primary mapping to genotype E. Of these, 148 genotype C and 532 genotype E reads were in the form of complete concatemer sequences (defined as containing ≥3 full HBV genomes) and between them they contained 4,805 HBV full or partial genome reads (for definitions, see Fig. 1C). We scored the similarity of each HBV genome read to the 1331 and 1332 Illumina consensus sequences at each of 335 sites that differed between the two consensus sequences, classifying genome segments as genotype C or genotype E if they matched the respective consensus at ≥80% of sites (Suppl Fig. 3). No complete concatemer sequences contained a mixture of genotype C and genotype E HBV genome reads. Only 6/4,805 HBV genome reads (either full or partial length) could not be classified in this way, each of which constituted either a partial genome covering <8 marker sites, or a low-quality sequence matching variants from both genotypes (Suppl Fig. 3). Thus, we found no evidence that the RCA process generates recombined sequences.
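The marker-site classification described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name, the dictionary-based input format and the treatment of reads covering fewer than 8 marker sites as unclassified (`min_sites=8`) are assumptions made here; only the ≥80% matching rule comes from the text.

```python
def classify_genome_read(read_bases, cons_c, cons_e, min_sites=8, threshold=0.8):
    """Classify one HBV genome read from its bases at the marker sites.

    read_bases: dict {site: base} for the marker sites covered by the read.
    cons_c, cons_e: dicts {site: base} giving each consensus at marker sites.
    Returns "C", "E" or "unclassified".
    """
    covered = [s for s in read_bases if s in cons_c and s in cons_e]
    if len(covered) < min_sites:  # assumption: too few sites to classify
        return "unclassified"
    frac_c = sum(read_bases[s] == cons_c[s] for s in covered) / len(covered)
    frac_e = sum(read_bases[s] == cons_e[s] for s in covered) / len(covered)
    if frac_c >= threshold:
        return "C"
    if frac_e >= threshold:
        return "E"
    return "unclassified"


# Toy data (hypothetical): 10 marker sites where the two consensuses differ.
cons_c = {site: "A" for site in range(10)}
cons_e = {site: "G" for site in range(10)}
mostly_c = {site: ("G" if site == 0 else "A") for site in range(10)}  # 9/10 match C
mixed = {site: ("A" if site < 5 else "G") for site in range(10)}      # 5/10 each
short = {0: "A", 1: "A", 2: "A"}                                      # only 3 sites

print(classify_genome_read(mostly_c, cons_c, cons_e))
print(classify_genome_read(mixed, cons_c, cons_e))
print(classify_genome_read(short, cons_c, cons_e))
```

A concatemer would then be flagged as potentially chimeric if its genome reads received different genotype labels.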

Error correction in Nanopore data

Among all Nanopore complete concatemer sequences with ≥3 full genome reads (as defined in Fig. 1C), 11.5% of positions differed from the Illumina consensus sequence for that sample. Given Nanopore raw error rates and the observation that the Illumina data contained very few within-host variants, we considered that the majority of such differences were likely to be Nanopore sequencing errors. Correcting such errors would allow us to phase true variants into within-sample haplotypes, improving on the information available from Illumina sequencing alone.

As a first step in correcting Nanopore sequencing errors at the level of the complete concatemer sequence, we took the consensus of all HBV genome reads (both full and partial reads) in each concatemer. Such an approach involves a trade-off between increasing the minimum number of HBV genome reads per concatemer for inclusion to optimise error correction, versus increasing the number of complete concatemer sequences under consideration to maximise sensitivity for assessment of within-sample diversity.

To assess error rates, we compared corrected Nanopore sequences with the Illumina consensus, considering only those sites with <1% variation in the Illumina data. For sample 1331, analysis of all sequences containing ≥3 HBV full genome reads maximised the total number of distinct complete concatemer sequences available for analysis (n = 208), and resulted in 0.88% of positions with a consensus call different from Illumina. Changing the criteria to be more stringent, we analysed only concatemers containing ≥8 HBV full genome reads, giving us a smaller pool of concatemer sequences (n = 41) but reducing the mean proportion of sites that varied from the Illumina consensus to 0.51% (Suppl Table 1).
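The within-concatemer consensus step can be illustrated with a simple majority vote over aligned genome reads. This is a minimal sketch, assuming the genome reads have already been aligned to a common reference coordinate system; the function name, the `.` symbol for uncovered positions in partial reads, and the rule of calling `N` when no base has a strict majority are assumptions made for illustration.

```python
from collections import Counter


def concatemer_consensus(genome_reads, min_fraction=0.5):
    """Majority-vote consensus over aligned genome reads from one concatemer.

    genome_reads: equal-length strings in reference coordinates; '-' marks a
    gap and '.' marks positions not covered by a partial read.
    """
    length = len(genome_reads[0])
    if any(len(read) != length for read in genome_reads):
        raise ValueError("genome reads must be aligned to the same length")
    consensus = []
    for column in zip(*genome_reads):
        calls = [base for base in column if base != "."]
        if not calls:
            consensus.append("N")  # no read covers this position
            continue
        base, count = Counter(calls).most_common(1)[0]
        # Call the base only if it has a strict majority (assumed rule).
        consensus.append(base if count / len(calls) > min_fraction else "N")
    return "".join(consensus)


# Toy concatemer (hypothetical): three full genome reads and one partial read,
# each carrying a different isolated error.
reads = ["ACGTAC", "ACGTTC", "ACCTAC", "..GTAC"]
print(concatemer_consensus(reads))
```

Raising the minimum number of genome reads per concatemer gives each column more votes, which is the trade-off between accuracy and the number of usable concatemers discussed above.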

In order to reduce the error rate, while maximising the number of complete concatemer sequences, we adopted a refined error correction method based on two assumptions:

(i) Basecaller errors are randomly distributed across all complete concatemer sequences, whereas true genetic variants are consistently seen in HBV genome reads within a subset of concatemers;
(ii) Systematic sequencing errors tend to be associated with a particular sequence context, or k-mer (Suppl Fig. 4A). In many cases, the error rate associated with a particular k-mer differs from that associated with its reverse complement (with the exception of longer homopolymers). Thus, basecaller errors often appear to be strand-specific, whereas true genetic variants can be seen with equal probability in forward and reverse strand reads (Suppl Figs 4B and 5). Note that the RCA process is such that forward reads may have had either strand of the original circular HBV genomes as their original template, and similarly for reverse reads (Fig. 1A).

To identify sites of true genetic polymorphism, for the data generated from each sample we tested for an association between base and concatemer at each site, to determine whether some bases were consistently found in particular concatemers at any one site, as described in assumption (i) above. For this we analysed forward and reverse strand reads separately, requiring that an association was found in both read sets (forward and reverse) for the site to be considered truly polymorphic (Fig. 4(ii–iv)).

We additionally tested each site for an association between variant (presence/absence within a concatemer) and strand (forward/reverse); sites where the potential variant showed significant strand bias were not considered truly polymorphic (Fig. 4(v)). We corrected polymorphic sites using the within-concatemer consensus base, whereas sites that failed this test were corrected using the whole-sample consensus base for all concatemers (Fig. 4(vi)). The result was a single, corrected, HBV genome haplotype for each original complete concatemer sequence. Further details on this error correction procedure are provided in the methods.
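One way to test a single site for strand bias is a 2×2 Fisher's exact test on concatemer counts (variant present/absent by forward/reverse strand). The sketch below is an illustration only: this section does not state which test or significance threshold was used for the strand comparison, so `fisher_exact_2x2`, `has_strand_bias` and the cut-off `alpha=0.01` are hypothetical choices, and the counts are invented.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    p_value = sum(p for p in (prob(x) for x in range(lowest, highest + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


def has_strand_bias(fwd_present, fwd_absent, rev_present, rev_absent, alpha=0.01):
    """Flag a site whose variant is associated with read strand (assumed alpha)."""
    return fisher_exact_2x2(fwd_present, fwd_absent, rev_present, rev_absent) < alpha


# Toy counts of concatemers (hypothetical).
print(has_strand_bias(10, 0, 0, 10))  # variant seen only on forward concatemers
print(has_strand_bias(5, 5, 5, 5))    # variant equally common on both strands
```

Under this scheme a site flagged as strand-biased would be treated as a likely systematic error and corrected to the whole-sample consensus base.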

The final corrected Nanopore sequences differed from the Illumina-derived consensus at an average of <0.4% of sites for the three samples studied (Table 1). We noted that many of these differences were called as gaps (‘−’) or ambiguous sites (‘N’) in the Nanopore data, so the proportion of sites which had been called as an incorrect base was even lower (Fig. 5).

Detection of true genetic variants in Nanopore data

We then switched our attention to the sites which our Nanopore correction method had highlighted as genuine variants. All variants with >10% frequency in the Illumina RCA data were also detected by the Nanopore method, and frequencies from the two methods showed good concordance (Fig. 5A,B). When considering those variants that appeared at >10% frequency in corrected Nanopore concatemers, all were confirmed as genuine by their presence in the Illumina data (Suppl Table 3). Hence, the Nanopore approach shows good sensitivity and specificity for calling mid-low frequency variants.

We also used the set of complete concatemer sequences to derive a within-patient consensus sequence from the Nanopore data. For two out of three samples (1331 and 1348) we found this to be identical to the final consensus sequences for Illumina using CL +/− RCA (excluding 5 sites in each sample which were called as ‘N’s in the Nanopore consensus) (Fig. 5C). In the third case (1332), the Nanopore consensus differed at just two sites, located next to a homopolymer (GGGGG).

A primary advantage that Nanopore (long-read data) offers over Illumina (short-read data) is the ability to generate full-length haplotypes, providing insights into the epistatic interactions between polymorphisms at different loci. This is illustrated by quantifying the proportion of genomes derived from Nanopore data that represent a specific haplotype, characterised by combinations of multiple polymorphisms (Fig. 6). For example, we were able to identify linkage between two mutations in sample 1348, spaced 1,789 bp apart in 4/32 whole genome haplotypes (at sites nt 400 and nt 2189, Suppl Table 3). Comparing this to Illumina data, the same polymorphisms are detected at similar frequencies but cannot be assigned to a single haplotype in combination. Thus, accurate haplotyping with Nanopore facilitates improved insight into within-host population structure.

Sequence data generated from a plasmid by Nanopore sequencing

To further evaluate our methods, we applied our RCA amplification, library preparation, Nanopore sequencing and variant detection pipeline to an HBV plasmid¹⁷. No genetic variants were detected within this sample, as anticipated for clonal genetic material. The corrected consensus sequence differed from the published plasmid sequence¹⁷ at only 1/6820 positions (excluding 26 sites which were called as ‘N’s). This difference was the result of a homopolymer miscall, similar to the case in 1332. These results confirm the high fidelity of the RCA enrichment step and the accuracy of our bioinformatic approach for sequence data generated by Nanopore.

Sequence availability

Consensus sequences for our Illumina completion-ligation (MK720628, MK720629, MK720632), Illumina RCA (MK720627, MK720630, MK720631) and Nanopore sequences (MK321264, MK321265, MK321266) have been deposited in GenBank. HBV reads generated from the sequencing platforms have been made available via the European Nucleotide Archive with the study accession number PRJEB31886.

Discussion

Robust generation of full-length HBV sequence data is an important aspiration for improving approaches to clinical diagnosis (including point-of-care diagnostics and detection of co-infections), patient-stratified management, molecular epidemiology, and long-term development of cure strategies, following precedents set by work in HIV¹⁸. However, the unusual biology of the HBV genome has represented a significant challenge for whole-genome sequencing to date⁶.

We here demonstrate and compare the use of two different sequencing platforms to generate full-length HBV sequences from clinical samples. Illumina deep sequencing approaches allow determination of diversity and detection of minor variants, but have the disadvantage of short reads that do not permit the reconstruction of complete viral haplotypes. In contrast, our new Nanopore protocol may under-estimate the total diversity present within a sample, but allows us to gain confidence in the generation of whole HBV genome haplotypes. Existing approaches can already determine mixed or highly-diverse infections^18,19; however, additional insight into the linkage between polymorphisms, and developing methods to track divergent quasispecies, may yield important benefits in understanding the evolutionary biology and clinical outcomes of HBV infection. A comparison of the pros and cons of different sequencing approaches is summarised in Table 2.

Table 2 Comparison of three methods of deriving HBV sequence data.

Many users of Nanopore technology are primarily interested in obtaining an accurate full-length consensus sequence for diagnostic purposes. Error correction tools such as Nanopolish²⁰ are sufficient for such applications, but methodological adjustments are required for the analysis of intra-host diversity. Our analysis highlights that, aside from homopolymer errors, many errors in raw Nanopore sequence data are k-mer-specific. The approach used in this study, using both genome-length concatemers and strand specificity to distinguish k-mer-specific errors from genuine diversity, facilitates error correction at the per-read level. The approach did not introduce any unexpected diversity when applied to a ‘clonal’ population of plasmid HBV genomes, adding to our confidence that the polymorphisms we detect in the final corrected dataset reflect genuine genetic variants rather than Nanopore sequencing errors.

For a given number of genomes in a concatemer, there is a trade-off between the amount of data available for analysis, relative to the potential for accurate error correction (Suppl Table 1). Thus, using three genomes in a concatemer produces the largest data-set but a relatively higher error rate, while increasing the threshold to six genomes per concatemer reduces the available data-set for analysis, but also lowers the error rate. The approach taken by any individual study might therefore alter the threshold for the minimum number of concatenated genomes, according to the question being asked (a study seeking to quantify maximum possible diversity would benefit from analysing a smaller number of genomes per concatemer, while a study requiring highly robust error correction might raise the threshold for genome copy numbers in each concatemer). Future optimisation focused on increasing the number of long concatemers will improve the specificity and sensitivity of variant identification and thereby the resolution of low-frequency variants on haplotypes. Long concatemers also improve the confidence with which low frequency haplotypes can be called and linkage established (Suppl Methods 3 and Suppl Fig. 9).

As a new technology, Nanopore sequencing is currently still evolving rapidly, with updates to basecalling algorithms, kits and the flowcell chemistry being frequently released. Our bioinformatic methods are based on general principles of the technology, and hence have shown applicability across samples sequenced using different flowcell and basecaller versions (Table 1). At present, this assay is not quantitative, and in this study we observed considerable variability in total yields and proportion of mapped HBV reads between Nanopore sequencing runs. However, it is reasonable to expect that the generation of high quality HBV data will increase as further updates improve total yields and raw accuracy rates.

In chronic HBV infection, the hepatitis B e-antigen (HBeAg)-positive phase of infection is frequently characterised by high viral loads and low viral diversity, as in the samples described here. It has been hypothesised that reduced immune-mediated selection during the HBeAg-positive phase of infection allows the unconstrained replication of conserved viral populations^21,22, explaining the low diversity we observed in our samples. Marked increases in viral diversity have been described prior to and immediately after HBeAg seroconversion, coinciding with reductions in viral load²². Samples from the seroconversion phase are relatively unusual in clinical practice, and focused studies undertaken within large, diverse clinical cohorts will be needed to identify and study individuals in this stage of chronic infection. Further work with larger numbers of samples, including different disease contexts and phenotypes (e.g. acute infection, transmission networks, patients with a wide range of viral loads, HBeAg-negative status, chronic disease including cancer and cirrhosis), will be of interest in characterising the utility of these different methods for diversity analyses, including identification of specific sequence polymorphisms and determination of within- and between-host diversity. Optimisation for lower viral loads is particularly important for the approach to become widely applicable. Broadly speaking, sensitivity can be optimised through viral enrichment (for example using probe-based selection^19,23) and/or by using laboratory approaches that deplete human reads²⁴.

Our results demonstrate that our approach is successful for HBV genotypes C and E (from clinical samples) and D (plasmid sequence). Although we have not yet applied the method to other genotypes, we believe our methods are likely to be agnostic to genotype, as the primers were designed to be complementary to highly conserved regions of the HBV genome¹⁵. Sequencing of a mixed genotype-C/E sample demonstrates that the RCA approach is capable of identifying >1 genotype within a single sample without suggesting or introducing recombination events, illustrating the reliability of Nanopore long-read data for complete haplotype reconstruction. Further optimisation in sensitivity will be required before we can use the method to detect mixed infections in which one genotype is introduced as a minor variant. The methods developed in this study could potentially be applied to study other viruses with small, circular DNA genomes.

Methods

Patients and ethics

We used plasma samples from adults (aged ≥18 years) with chronic HBV infection attending outpatient clinics at Oxford University Hospitals NHS Foundation Trust, a large tertiary referral teaching hospital in the South-East of England. All participants provided signed informed consent for participation. Ethics permission was given by NHS Health Research Authority (Ref. 09/H0604/20). All methods and analysis were performed in accordance with the guidelines and regulations stipulated as part of the ethics approval. HBV DNA viral loads were obtained from the clinical microbiology laboratory (COBAS AmpliPrep/COBAS TaqMan, Roche²⁵; a standard automated platform for quantification of viral loads). We chose samples for sequencing based on their high viral load; all were HBeAg-positive. Blood samples were collected in EDTA. To separate plasma, we centrifuged whole blood at 1800 rpm for 10 minutes. We removed the supernatant and stored it in aliquots of 0.5–2 ml at −80 °C. We selected samples of minimum volume 0.5 ml and with a minimum HBV DNA viral load of 10⁷ IU/ml to optimize successful amplification and sequencing (Table 1).

HBV plasmid

In addition to sequencing autologous HBV from clinical samples, we also applied our sequencing methods to a plasmid, in order to investigate the performance of our approach using a template for which the full molecular sequence is already known, and in which diversity is anticipated to be minimal or absent. We used the HBV 1.3-mer P-null replicon plasmid, a 6820 bp fully dsDNA construct, with a replication-deficient 1.3 × HBV length clone encoded along with ampicillin resistance genes and promoter sequences¹⁷. The plasmid was supplied as purified DNA in nuclease-free water.

Nucleic acid extraction

For patient samples, we extracted total nucleic acid from 500 µl plasma using the NucliSENS magnetic extraction system (bioMérieux) and eluted into 35 µl of kit buffer as per the manufacturer’s instructions.

Completion/ligation and Phi 29 rolling circle amplification

For patient samples, we prepared CL reactions in triplicate using previously described methods¹⁶. We modified this protocol to maximise the amount of DNA added, by using 6.4 μl extracted DNA plus 3.6 μl reaction mix to obtain a total reaction volume of 10 μl. We retained one reaction for sequencing after undergoing only the CL step, and the other two underwent RCA, using the previously described Phi 29 protocol¹⁶. The completion-ligation step was not required for the plasmid, so it directly underwent RCA using the same primers and laboratory protocol that were used for patient samples¹⁶. Primer sites are shown in Suppl Fig. 6.

Library preparation and sequencing

For each sample, we used both the product of the CL reaction and the RCA reaction for library preparation using the Nextera DNA Library Preparation Kit (Illumina) with a modified protocol to account for lower input, based on a previously published method²⁶. We sequenced indexed libraries, consisting of short fragments of PCR-amplified template, on a MiSeq (Illumina) instrument with v3 chemistry for a read length up to 300 bp paired-end.

We used the remaining RCA reaction products, consisting of concatemers of the unfragmented template DNA, for Nanopore sequencing. First, we resolved potential branching generated by RCA by digesting with T7 endonuclease I (New England Biolabs). We carried out library preparation with a 1D Genomic DNA ligation protocol (SQK-LSK108, Oxford Nanopore Technologies, ONT), and sequenced the samples using R9.4 or R9.5.1 flowcells on a MinION Mk 1B sequencer (ONT).

Analysis of Illumina data

We demultiplexed paired-end Illumina reads and trimmed low quality bases and adapter sequences (QUASR²⁷ and Cutadapt²⁸ software), before removing human reads by mapping to the human reference genome (hg19) using bowtie2²⁹. We then used BWA-MEM³⁰ to map non-human reads to HBV genotype A-H majority consensus sequences, derived from 4,500 whole genomes stored on HBVdb³¹. We used conventional numbering systems for the HBV genome, starting at the EcoRI restriction site (G/AATTC, where the first T is nucleotide 1). We re-mapped the same reads using BWA-MEM to each within-sample majority consensus. In a test of accuracy, consensus genomes were locally aligned to contiguous elements (contigs) assembled ‘de novo’ from the trimmed reads (VICUNA software) and found to match perfectly.

Analysis of Nanopore sequence data: initial processing

We basecalled raw Nanopore reads of the RCA concatemers using ONT’s Albacore versions 2.0.2 (samples 1331 and 1332) and 2.1.10 (sample 1348 and 1331/1332 mix). We trimmed ‘pass’ reads (those with qscore >7) using Porechop v.0.2.3 (https://github.com/rrwick/Porechop) to remove adapter sequences. We used Kraken to classify reads³² against a custom database comprised of the human genome and all complete microbial genomes from RefSeq. We additionally mapped reads to a panel of reference sequences representing genotypes A-H (sequences available at https://github.com/hr283), in order to identify the genotype of the sample. These reference sequences had a repeat of the first 120 bp appended on the end, to ease the alignment of reads from circular genomes.
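Appending the start of a circular genome to its end, as described above for the genotype references, can be sketched in a few lines. The function names and the toy sequence are assumptions made for illustration; only the 120 bp overlap comes from the text.

```python
def pad_circular_reference(seq, overlap=120):
    """Append the first `overlap` bases to the end of a linearised circular genome."""
    return seq + seq[:overlap]


def to_circular_coordinate(pos, genome_length):
    """Map a 1-based position on the padded reference back onto the circle."""
    return (pos - 1) % genome_length + 1


# Toy reference (hypothetical), 320 bp long.
reference = "ACGTTGCA" * 40
padded = pad_circular_reference(reference)
print(len(reference), len(padded))
print(to_circular_coordinate(321, len(reference)))  # first padded base maps to nt 1
```

Reads spanning the linearisation point can then align contiguously across the junction, and positions falling in the appended region are mapped back to the start of the genome.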

Analysis of plasmid sequence

For the plasmid, raw Nanopore data was basecalled with Guppy 1.8.10 and then trimmed with Porechop as described above. We constructed a custom reference sequence for use in the following alignment steps (sequence available at https://github.com/hr283). This had the same structure as the plasmid construct but used the sequence of the genotype D reference in the HBV sections. We removed a site from the reference which was known to be deleted in the plasmid, since our methods are not designed to call insertions and deletions with respect to the genotype reference (see further details below).

Analysis of Nanopore sequence data: error correction

Our initial consensus error correction procedure was adapted from the method previously described by Li et al.¹². We started with complete concatemer sequences and chopped these into full or partial HBV genome reads (as illustrated in Fig. 1C). For this step, we identified repeat HBV genome reads in concatemeric sequences with the use of an anchor sequence comprising the first 100 bp of the relevant genotype reference. Reads were chopped every time the anchor sequence was found. Where individual anchor sequences were missed because of poor-quality data, we used the distance to the nearest anchor sequence as a guide to form individual genomes. Each HBV genome read was remapped with BWA-MEM³⁰ to the HBV genotype reference. Note that since minimap2³³ has recently replaced BWA-MEM for alignment of Nanopore data, future work would benefit from using minimap2 at the relevant steps in the pipeline.
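The chopping step can be sketched in Python as below. This is a simplified illustration, not the published implementation: it finds the anchor by mismatch counting (which does not tolerate the insertions and deletions common in Nanopore reads), it omits the distance-based recovery of missed anchors, and the names `find_anchor_starts`, `chop_concatemer` and `max_mismatches` are hypothetical.

```python
def find_anchor_starts(read: str, anchor: str, max_mismatches: int = 0) -> list:
    """Return start indices where the anchor matches with few enough mismatches."""
    starts = []
    i = 0
    m = len(anchor)
    while i <= len(read) - m:
        mismatches = sum(1 for a, b in zip(read[i:i + m], anchor) if a != b)
        if mismatches <= max_mismatches:
            starts.append(i)
            i += m  # jump past this hit so matches cannot overlap
        else:
            i += 1
    return starts


def chop_concatemer(read: str, anchor: str, max_mismatches: int = 0) -> list:
    """Split a concatemer into sections, cutting at every anchor start."""
    starts = find_anchor_starts(read, anchor, max_mismatches)
    bounds = [0] + starts + [len(read)]
    # Skip the empty leading section when the read begins exactly at an anchor.
    return [read[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]
```

The first and last sections returned are generally partial genomes, while the sections between two anchors are candidate full-length genome reads.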

Reads were assigned to either forward or reverse read sets, based on whether they mapped to the plus or minus strand of the genotype reference (Fig. 4(ii)). Concatemers containing reads in both sets were removed (representing a total of 13/1048 concatemers across all three patient samples). To select concatemers with n full genome reads for further analysis, we filtered for those containing ≥(n + 2) read-sections, since the first and last section of each concatemer are not guaranteed to be full length.
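The two filters in this paragraph can be expressed as a short Python sketch. The data layout is an assumption made for illustration: each concatemer is represented by the list of strand labels (`"+"` or `"-"`) of its mapped sections, and `select_concatemers` is a hypothetical name.

```python
def select_concatemers(concatemers: dict, n_full: int) -> dict:
    """Keep single-strand concatemers with at least n_full + 2 sections.

    concatemers maps a concatemer id to the strand label of each section.
    Returns a mapping of kept concatemer id -> strand.
    """
    kept = {}
    for cid, strands in concatemers.items():
        if len(set(strands)) > 1:
            continue  # sections map to both strands: discard the concatemer
        if len(strands) >= n_full + 2:
            # First and last sections may be partial, hence the +2.
            kept[cid] = strands[0]
    return kept
```

Requiring `n_full + 2` sections guarantees at least `n_full` interior sections, each bounded by anchors on both sides.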

We applied our refined error correction method to complete concatemer sequences with ≥3 full genome reads (Fig. 4(i)). To speed up the search for true genetic variants, we only considered sites where a non-consensus base appeared at >60% frequency within one or more concatemers. We scored and filtered each of these potential variant sites using the following approach:

(1)
We conducted a Fisher’s exact test (https://pypi.org/project/FisherExact) to determine significance of the association between base and concatemer on forward and then reverse read sets (Fig. 4(iv)). If either of the resulting p-values was >0.01, we removed the site from the list of variants. We used the two p-values, p1 and p2, to generate a phred-based QUAL score by setting QUAL = −10 * log10(p1*p2), as reported in Suppl Table 3.
(2)
We calculated a strand bias p-value by applying a chi-squared contingency test to the numbers of forward vs. reverse strand concatemers with vs. without observations of the variant base (defined as the most common non-consensus base). If this p-value was <0.01 then the potential variant was filtered out (Fig. 4(v)).
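The thresholds and the QUAL formula in steps (1) and (2) can be sketched in Python as follows. This is an illustration under stated assumptions, not the published code: the Fisher's exact p-values `p1` and `p2` are taken as inputs rather than computed, the chi-squared test is written out for a 2×2 table with one degree of freedom and without a continuity correction (the source does not say whether one was applied), and all function names are hypothetical.

```python
import math


def qual_score(p1: float, p2: float) -> float:
    """Phred-style score: QUAL = -10 * log10(p1 * p2)."""
    product = p1 * p2
    if product <= 0:
        return math.inf  # a p-value underflowed to zero
    return -10 * math.log10(product)


def passes_association(p1: float, p2: float, alpha: float = 0.01) -> bool:
    """Step (1): the site is removed if either p-value exceeds alpha."""
    return p1 <= alpha and p2 <= alpha


def chi2_2x2_pvalue(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: no evidence of association
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-squared with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_strand_bias(fwd_with: int, fwd_without: int,
                       rev_with: int, rev_without: int,
                       alpha: float = 0.01) -> bool:
    """Step (2): the site is filtered out if the strand bias p-value is < alpha."""
    return chi2_2x2_pvalue(fwd_with, fwd_without, rev_with, rev_without) >= alpha
```

In this sketch a site is retained as a variant only when `passes_association` and `passes_strand_bias` both return `True`.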

Sites failing either the concatemer-association or strand bias criteria were considered Nanopore errors, and were corrected using the consensus base across all concatemers. Note that to avoid false correction, if the most common base in the forward read set did not match the most common base in the reverse read set, then we defined the whole sample consensus base as ‘N’ (undetermined). Variant sites were corrected using the consensus base within each concatemer (Fig. 4(vi)). We additionally recorded the allele frequency, calculated as the proportion of base calls across all corrected concatemers that are equal to the most common non-consensus base. Further filtering based on allele frequency >10% was applied for consistency when comparing Nanopore variant calls with variants at >10% frequency in Illumina. These variants are shown in Suppl Table 3.
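The correction rules in this paragraph can be sketched for a single site as below. The data layout and names (`correct_site`, `allele_frequency`) are hypothetical, ties between equally common bases are broken arbitrarily, and the allele frequency is computed with one base call per corrected concatemer, which is an assumption about how the proportion is weighted.

```python
from collections import Counter


def most_common_base(bases):
    """Return the most frequent base, or None if there are no calls."""
    return Counter(bases).most_common(1)[0][0] if bases else None


def correct_site(calls: dict, strands: dict, is_variant: bool) -> dict:
    """Return one corrected base per concatemer at a single site.

    calls maps concatemer id -> base calls from each genome copy;
    strands maps concatemer id -> "+" or "-".
    """
    if is_variant:
        # True variant: use the consensus within each concatemer.
        return {cid: most_common_base(bases) for cid, bases in calls.items()}
    fwd = [b for cid, bases in calls.items() if strands[cid] == "+" for b in bases]
    rev = [b for cid, bases in calls.items() if strands[cid] == "-" for b in bases]
    fwd_base, rev_base = most_common_base(fwd), most_common_base(rev)
    # Disagreement between strands (or a missing strand) gives 'N'.
    agreed = fwd_base is not None and fwd_base == rev_base
    sample_base = fwd_base if agreed else "N"
    return {cid: sample_base for cid in calls}


def allele_frequency(corrected: dict) -> float:
    """Proportion of corrected calls equal to the most common non-consensus base."""
    ranked = Counter(corrected.values()).most_common()
    if len(ranked) < 2:
        return 0.0
    return ranked[1][1] / len(corrected)
```

When the forward and reverse sets agree on their most common base, that base is also the most common base across all concatemers, so the sketch does not compute the pooled consensus separately.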

Whole-sample consensus Nanopore sequences were derived by taking the most common base at each site, if it was at >40% frequency and was the most common base in both the forward and reverse read sets, or calling the site as an ‘N’ otherwise. Note that the method is not designed to call insertions or deletions relative to the genotype reference; sites are only called as a gap (-) if there are no bases covering the site in either the forward or reverse read sets. The code used for data processing, error correction and variant calling is available on GitHub: https://github.com/hr283/RCAcorrect.
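The consensus rule can be written as a short Python sketch. It is not taken from the RCAcorrect repository: `consensus_base` is a hypothetical name, the gap condition is read here as "no coverage in the forward set and none in the reverse set", and ties between equally common bases are broken arbitrarily.

```python
from collections import Counter


def consensus_base(fwd: list, rev: list, min_freq: float = 0.4) -> str:
    """Call one whole-sample consensus base from forward and reverse base calls."""
    if not fwd and not rev:
        return "-"  # no coverage on either strand: gap
    if not fwd or not rev:
        return "N"  # cannot be the most common base in both sets
    counts = Counter(fwd) + Counter(rev)
    base, n = counts.most_common(1)[0]
    if n / sum(counts.values()) <= min_freq:
        return "N"  # most common base does not exceed 40% frequency
    fwd_top = Counter(fwd).most_common(1)[0][0]
    rev_top = Counter(rev).most_common(1)[0][0]
    if fwd_top != base or rev_top != base:
        return "N"  # strands disagree on the most common base
    return base
```

For instance, a site with calls `AAAG` on the forward set and `AAGA` on the reverse set is called `A`, whereas a site that is all `A` on one strand and all `G` on the other is called `N`.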

Sanger sequencing

Sanger sequencing was performed on the patient samples, using a pan-genotypic approach to generate multiple overlapping amplicons spanning the HBV genome (Suppl methods 2). The amplicons generated were examined for evidence of polymorphisms identified in both the Nanopore and Illumina sequencing data (Suppl Table 3, Suppl Figs 7 and 8).

Phylogenetic trees

We generated maximum likelihood phylogenetic trees using RAxML³⁴ with a gamma model of rate heterogeneity and a general time-reversible (GTR) nucleotide substitution model, followed by visualisation in FigTree.

References

1. Polaris Observatory Collaborators. Global prevalence, treatment, and prevention of hepatitis B virus infection in 2016: a modelling study. Lancet Gastroenterol Hepatol https://doi.org/10.1016/S2468-1253(18)30056-6 (2018).
2. WHO. Hepatitis B Fact Sheet. Available at: http://www.who.int/mediacentre/factsheets/fs204/en/ (Accessed: May 2017) (2017).
3. Griggs, D. et al. Policy: Sustainable development goals for people and planet. Nature 495, 305–307 (2013).
4. O’Hara, G. A. et al. Hepatitis B virus infection as a neglected tropical disease. PLoS Negl. Trop. Dis. 11, e0005842 (2017).
5. McNaughton, A. L. et al. HBV vaccination and PMTCT as elimination tools in the presence of HIV: insights from a clinical cohort and dynamic model. BMC Med. 17, 43 (2019).
6. McNaughton, A. L. et al. Insights From Deep Sequencing of the HBV Genome-Unique, Tiny, and Misunderstood. Gastroenterology 156, 384–399 (2019).
7. Gonzalez, C. et al. Barcoding analysis of HIV drug resistance mutations using Oxford Nanopore MinION (ONT) sequencing. BioRxiv https://doi.org/10.1101/240077 (2017).
8. Quick, J. et al. Real-time, portable genome sequencing for Ebola surveillance. Nature 530, 228–232 (2016).
9. Pennisi, E. Genome sequencing. Search for pore-fection. Science 336, 534–537 (2012).
10. Reiner, J. E. et al. Disease detection and management via single nanopore-based sensors. Chem. Rev. 112, 6431–6451 (2012).
11. Lu, H., Giordano, F. & Ning, Z. Oxford Nanopore MinION Sequencing and Genome Assembly. Genomics Proteomics Bioinformatics 14, 265–279 (2016).
12. Li, C. et al. INC-Seq: accurate single molecule reads using nanopore sequencing. Gigascience 5, 34 (2016).
13. Sauvage, V. et al. Early MinION nanopore single-molecule sequencing technology enables the characterization of hepatitis B virus genetic complexity in clinical samples. PLoS One 13, e0194366 (2018).
14. Astbury, S. et al. Extraction-free direct PCR from dried serum spots permits HBV genotyping and RAS identification by Sanger and minION sequencing. bioRxiv 552539, https://doi.org/10.1101/552539 (2019).
15. Margeridon, S. et al. Rolling circle amplification, a powerful tool for genetic and functional studies of complete hepatitis B virus genomes from low-level infections and for directly probing covalently closed circular DNA. Antimicrob. Agents Chemother. 52, 3068–3073 (2008).
16. Martel, N., Gomes, S. A., Chemin, I., Trepo, C. & Kay, A. Improved rolling circle amplification (RCA) of hepatitis B virus (HBV) relaxed-circular serum DNA (RC-DNA). J. Virol. Methods 193, 653–659 (2013).
17. Addgene: HBV 1.3-mer P-null replicon. Available at: https://www.addgene.org/65462/ (Accessed: 25th March 2019)
18. Wymant, C. et al. PHYLOSCANNER: Inferring Transmission from Within- and Between-Host Pathogen Genetic Diversity. Mol. Biol. Evol. https://doi.org/10.1093/molbev/msx304 (2017).
19. Thomson, E. et al. Comparison of Next-Generation Sequencing Technologies for Comprehensive Assessment of Full-Length Hepatitis C Viral Genomes. J. Clin. Microbiol. 54, 2470–2484 (2016).
20. Loman, N. J., Quick, J. & Simpson, J. T. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat. Methods 12, 733–735 (2015).
21. Cheng, Y., Guindon, S., Rodrigo, A. & Lim, S. G. Increased viral quasispecies evolution in HBeAg seroconverter patients treated with oral nucleoside therapy. J. Hepatol. 58, 217–224 (2013).
22. Lim, S. G. et al. Viral quasi-species evolution during hepatitis Be antigen seroconversion. Gastroenterology 133, 951–958 (2007).
23. Greninger, A. L. et al. Rapid metagenomic identification of viral pathogens in clinical samples by real-time nanopore sequencing analysis. Genome Med. 7, 99 (2015).
24. Loose, M., Malla, S. & Stout, M. Real-time selective sequencing using nanopore technology. Nat. Methods 13, 751–754 (2016).
25. Allice, T. et al. COBAS AmpliPrep-COBAS TaqMan hepatitis B virus (HBV) test: a novel automated real-time PCR assay for quantification of HBV DNA in plasma. J. Clin. Microbiol. 45, 828–834 (2007).
26. Lamble, S. et al. Improved workflows for high throughput library preparation using the transposome-based Nextera system. BMC Biotechnol. 13, 104 (2013).
27. Watson, S. J. et al. Viral population analysis and minority-variant detection using short read next-generation sequencing. Philos. Trans. R. Soc. Lond. B Biol. Sci. 368, 20120205 (2013).
28. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10 (2011).
29. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
30. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint (2013).
31. Hayer, J. et al. HBVdb: a knowledge database for Hepatitis B Virus. Nucleic Acids Res. 41, D566–70 (2013).
32. Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).
33. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
34. Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
35. Pfeiffer, F. et al. Systematic evaluation of error rates and causes in short samples in next-generation sequencing. Sci. Rep. 8, 10950 (2018).
36. Schirmer, M., D’Amore, R., Ijaz, U. Z., Hall, N. & Quince, C. Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data. BMC Bioinformatics 17, 125 (2016).
37. Wick, R. R., Judd, L. M. & Holt, K. E. Comparison of Oxford Nanopore basecalling tools. Available at: https://github.com/rrwick/Basecalling-comparison. (Accessed: 5th February 2019).
38. Slatko, B. E., Gardner, A. F. & Ausubel, F. M. Overview of Next-Generation Sequencing Technologies. Curr. Protoc. Mol. Biol. 122, e59 (2018).

Acknowledgements

The work described here was funded by the Wellcome Trust (Intermediate Fellowship to PM, grant ref 110110). PP is funded by NIHR funding allocated to the Imperial Biomedical Research Centre. EB is funded by the Medical Research Council UK and the Oxford NIHR Biomedical Research Centre, and is an NIHR Senior Investigator. Core funding to the Wellcome Centre for Human Genetics was provided by the Wellcome Trust (award 203141/Z/16/Z). A synopsis of the work presented here was shown in poster format at the European Association for the Study of the Liver (EASL) International Liver Conference, Paris 2018, and at the Nanopore ‘London Calling’ Meeting, London 2018. The views expressed in this article are those of the authors and not necessarily those of the NHS, the NIHR, or the Department of Health. We would like to acknowledge the Hepatology clinic at Oxford University Hospitals NHS Foundation Trust for their support in recruitment of patients into research cohorts, and we are grateful to Senthil Chinnakannan for sharing the HBV plasmid which we sequenced.

Author information

Anna L. McNaughton, Hannah E. Roberts and David Bonsall contributed equally.

Authors and Affiliations

Nuffield Department of Medicine, Medawar Building, University of Oxford, South Parks Road, Oxford, OX1 3SY, UK
Anna L. McNaughton, David Bonsall, Jolynne Mokaya, Sheila F. Lumley, Catherine de Lara, Anthony Brown, M. Azim Ansari, Eleanor Barnes & Philippa C. Matthews
Wellcome Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK
Hannah E. Roberts, Mariateresa de Cesare, Tanya Golubchik & Rory Bowden
Department of Infectious Diseases and Microbiology, Oxford University Hospitals NHS Foundation Trust, John Radcliffe Hospital, Headley Way, Oxford, OX3 9DU, UK
David Bonsall, Sheila F. Lumley & Philippa C. Matthews
Big Data Institute, Old Road, Oxford, OX3 7FZ, UK
David Bonsall & Tanya Golubchik
Imperial BRC Genomics Facility, Imperial College, London, UK
Paolo Piazza
Gastroenterology and Hepatology Clinical Trials Facility, Oxford University Hospitals NHS Foundation Trust, John Radcliffe Hospital, Oxford, OX3 9DU, UK
Jacqueline B. Martin
Department of Hepatology, Oxford University Hospitals NHS Foundation Trust, John Radcliffe Hospital, Oxford, OX3 9DU, UK
Eleanor Barnes
NIHR Oxford Biomedical Research Centre, Oxford University Hospitals NHS Foundation Trust, John Radcliffe Hospital, Oxford, OX3 9DU, UK
Eleanor Barnes & Philippa C. Matthews

Contributions

A.L.M., D.B., M.d.C. and P.C.M. conceived and designed the project. P.C.M. and E.B. applied for ethical approval. J.B.M. recruited patients and obtained informed consent; clinical blood samples were processed by A.B. and C.d.L. A.L.M., D.B. and M.d.C. undertook the RCA, Nanopore and Illumina sequencing work with expert input from P.P. and R.B. J.M. and A.L.M. generated Sanger sequences. S.F.L. contributed to development of sequencing methods. H.E.R., D.B., M.A.A. and A.L.M. analysed the data with oversight from P.C.M. and R.B. A.L.M., H.E.R. and P.C.M. wrote the manuscript with input from D.B., R.B. and E.B. All authors provided editorial comments, and reviewed and approved the final manuscript.

Corresponding author

Correspondence to Philippa C. Matthews.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Data File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

McNaughton, A.L., Roberts, H.E., Bonsall, D. et al. Illumina and Nanopore methods for whole genome sequencing of hepatitis B virus (HBV). Sci Rep 9, 7081 (2019). https://doi.org/10.1038/s41598-019-43524-9

Received: 24 November 2018
Accepted: 26 April 2019
Published: 08 May 2019
DOI: https://doi.org/10.1038/s41598-019-43524-9
Springer Nature Limited
