Main

The transformative potential of bacterial whole-genome sequencing (WGS) for clinical diagnostics has been widely recognized in the scientific literature1,2,3,4,5,6,7,8,9. Molecular diagnostics (MDx) adopts tools from molecular biology for use in clinical diagnostics; it includes both the identification and the characterization of microorganisms on the basis of the detection and the characterization of nucleic acids. Beyond the research sector, the MDx market is a fast-growing segment in the in vitro diagnostic space and is expected to grow at a rate of >9% in the next five years10. Infectious disease testing in clinical microbiology is estimated to represent 70% of the global MDx market11. Bacterial WGS applications have been widely discussed theoretically as a future application of clinical MDx, but such applications have only been shown in practice in a few proof-of-concept studies, most of which were from the same established genome sequencing centres. Box 1 provides an overview of the range of MDx methods that are currently in use in clinical bacterial diagnostics.

The objective of this Perspective article is to provide an overview of the specific bioinformatic challenges that need to be addressed in the transition from proof of concept to widespread clinical implementation of bacterial WGS-based MDx tests in microbial diagnostics. We briefly describe the changes in the sequencing market that are defining the landscape for clinical adoption of WGS. We focus on bioinformatic problems that are associated with creating a technically and economically sound MDx product for clinical implementation — specifically, development of bioinformatic workflows and definition of standard operating procedures, management of computational resources and selection of computational support models, and data integration and storage. Here, we focus on clinical bacteriology, as several comprehensive recent reviews have highlighted the potential for immediate application of WGS-based MDx tools in this field of research1,8,9. Bacterial diagnostics are an attractive application for early adoption of WGS owing to their modest sequencing and bioinformatic analysis requirements, the existing clinical microbiology infrastructure and a well-established set of validated, clinically useful diagnostic parameters.

We describe how recent developments in the markets for both sequence data generation and bioinformatic support and, most importantly, how the introduction of both benchtop sequencing and remote cloud computing have laid the basis for widespread, decentralized adoption of WGS-based MDx tools in the clinic.

Genome sequencing in diagnostics

Although the potential of WGS to complement existing clinical microbiology practice for the typing and the characterization of bacterial isolates has long been emphasized, only recently have the first studies been published to clearly demonstrate examples of clinical use, most of which are for high-resolution epidemiological investigations of bacterial pathogens (reviewed in Refs 1,8,9). Table 1 provides an overview of seminal publications that highlight cases of the potential use of WGS in bacterial diagnostics. In the near future, applications of WGS in the clinic are predicted to include the drawing of more accurate epidemiological outbreak maps2,6,7,12, the deciphering of both the evolutionary history and the genetic make-up of particular outbreak isolates2,3, and forensic assignments of biological samples in the context of biodefense or criminal investigation4. Recently, the use of metagenomic sequencing has been proposed for infectious disease detection13,14.

Table 1 Recent seminal publications on bacterial diagnostic applications of WGS

In clinical practice, an exemplary diagnostic application of bacterial WGS would be culture-independent in silico antimicrobial susceptibility testing. This technique is based on bioinformatic analyses of the genomes of either bacterial pathogens (such as Salmonella enterica or other enteric pathogens under routine surveillance by the US National Antimicrobial Resistance Monitoring System15) or bacterial isolates that are responsible for drug-resistant tuberculosis infections, such as extensively drug-resistant (XDR) Mycobacterium tuberculosis strains16. In one study, methicillin-resistant Staphylococcus aureus (MRSA)12 isolates from patients and staff members at a hospital in the United Kingdom were sequenced, and single-nucleotide polymorphisms (SNPs) between isolates, differences that traditional sequence typing could not resolve, were compared. This study reported a suspected MRSA transmission within the hospital, which confirmed predictions that were previously made on the basis of conventional epidemiological analyses.

A recent contract was awarded by the US Food and Drug Administration to Illumina to place and use the MiSeq sequencing platform, a benchtop sequencer introduced by Illumina to make genome sequencing affordable even for smaller laboratories, in US state and federal laboratories to source-track foodborne enteric pathogens (that is, S. enterica and Shiga toxin-producing Escherichia coli (STEC)), which attests to the willingness of both national health services and public health laboratories to adopt WGS-based MDx tools17. This might prove to be a starting point for their clinical adoption. However, similar cases have not yet been seen in other countries. Julian Parkhill, a co-author of several recent benchmark studies on the use of WGS-based epidemiological tracking of hospital pathogens2,12,16,18,19,20,21, pointed out that “the average clinician in a hospital is not going to be able to do this [WGS analysis], so it has to be automated” (Ref. 22).

The economic feasibility of bacterial WGS-based MDx tests in the clinic will mostly depend on the cost and the ease of both sequence generation and bioinformatic sequence analyses. Extensive benchmarking will be required to determine expenses and personnel efforts that are required for sample processing, data analysis and data storage. To this end, recent studies have attempted to compare sequencing platforms23 and to calculate costs of bioinformatic analyses using 'canned' analysis pipelines in combination with commercial cloud computing services24,25. Bioinformatic challenges are possible reasons that WGS has not yet penetrated deeply into the clinical bacterial diagnostic market. These challenges result from the lack of both bioinformatic standards and infrastructure to adequately meet demands on data storage, as well as on the analyses of rapidly changing and increasing amounts of sequence data from multiple next-generation sequencing platforms26. Box 2 gives an overview of the specific bioinformatic challenges and the solutions that characterize the transition of bacterial WGS-based MDx from proof of concept to clinical implementation.

The changing sequencing market

Next-generation sequencing technologies continue to change the genomic field, which results in several options for sequence data generation. These have been reviewed in the context of bacterial WGS-based clinical application27,28. Available sequencing platforms vary with respect to required financial investments for initial installation (that is, infrastructure for sample processing, sequencing and sequence analyses) and subsequent usage (that is, the cost per sequencing run or per generated base); the time and effort that are associated with sample preparation and sequencing itself; the degree of automation and personnel training that is required for sample processing; and the specific characteristics of the generated data, such as sequence read length and accuracy, error profile, number of generated reads per run and multiplexing capacity. As a consequence, available sequencing platforms can be more or less suitable for specific genomic applications or for desired production scales, and thus for clinical application in MDx27,28. Most recent studies on the use of WGS for bacterial diagnostics applied high-throughput short-read sequencing (Table 1), as the accuracy that is afforded by high genome coverage proved to be more important for the applied tests, including the detection of SNPs or of gene presence or absence, than read length. With the clinical market in mind, both Illumina and Life Technologies — the two leading developers of sequencing platforms in this field — have made great advances to simplify the workflow of sequence library preparation and to reduce sequencing runtimes.

Generally, the sequencing market has been following two major trends over the past few years. First, sequencing platforms continue to be optimized for ever-increasing output and lower cost, which are measured in total numbers of base-pair output per sequencing run and in the associated cost per base pair, respectively. As a downside, installation of these increased-output platforms often requires substantial financial investments and offers limited flexibility to scale sequencing outputs down to the smaller read numbers that are required for bacterial genomes compared with human genomes. High-throughput platforms can thus require more samples to be sequenced in parallel in order to provide an economic advantage over smaller sequencing platforms. For example, a full run of the Illumina HiSeq platform can generate more than 200-fold sequencing coverage for more than 250 E. coli genomes. This sets a sample number threshold that may be difficult to amass on a daily or weekly basis by a single clinical laboratory that depends on the timely delivery of diagnostic results, which makes these platforms less suitable for local installation.
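The batch-size arithmetic behind this example can be sketched as follows. This is a rough illustration only: the run output of 250 gigabases and the genome size of 5 megabases are assumed round numbers chosen to be consistent with the figures quoted above, not platform specifications, and the function name is hypothetical.

```python
def max_genomes_per_run(run_output_bp: int, genome_size_bp: int, target_coverage: int) -> int:
    """Return how many genomes fit into one run at the requested fold coverage."""
    if min(run_output_bp, genome_size_bp, target_coverage) <= 0:
        raise ValueError("all arguments must be positive")
    # Each genome consumes genome_size_bp * target_coverage bases of run output.
    return run_output_bp // (genome_size_bp * target_coverage)


# Assumed values (not from the article): 250 Gb per run, 5 Mb per E. coli genome.
ASSUMED_RUN_OUTPUT_BP = 250_000_000_000
ASSUMED_GENOME_SIZE_BP = 5_000_000

batch = max_genomes_per_run(ASSUMED_RUN_OUTPUT_BP, ASSUMED_GENOME_SIZE_BP, 200)
print(f"Genomes needed to fill one run at 200-fold coverage: {batch}")
```

Under these assumptions, a laboratory that collects only a handful of isolates per day would either leave most of the run capacity unused or delay results until enough samples accumulate.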

Second, benchtop sequencers have recently been introduced to the market. Compared with larger sequencing platforms, these sequencers require smaller capital investments for both installation and infrastructure, and they promise fast and simple sequence generation in a standard laboratory environment. These platforms could be economically feasible for the small- to mid-size health care setting, in spite of higher operating costs per individual sequenced genome. Correspondingly, benchtop sequencing has been proposed for in situ implementation in the clinical setting, and the use of these platforms for WGS-based bacterial MDx applications has been successfully examined and validated7,27.

It is important to note that the next-generation-sequencing market could change markedly, as the field is expecting the introduction of nanopore technologies for fast and affordable long-read sequencing from low-input samples29. WGS-based diagnostic protocols will most probably have to be adapted, as the bioinformatic field is evolving to accommodate changing sequence data types. However, the required modifications to bioinformatic protocols for bacterial WGS-based MDx might be less extreme than expected, considering that several of the first bioinformatic sequence analysis tools, such as the basic local alignment search tool (BLAST)30, continue to be widely used today, more than 20 years after their publication. Moreover, in some aspects, the field is moving back to its roots, as bacterial genome sequencing started with the generation of fairly long (>900 bp) Sanger sequence reads.

In the academic setting, the bacterial genomics field is already experiencing increased decentralization owing to the introduction of benchtop sequencers, with even smaller laboratories successfully deploying next-generation platforms31. These decentralized sequencing operations provide a paradigm for future widespread application of bacterial WGS in the clinic. Today, small to mid-size hospitals often use central commercial services for routine bacterial diagnostic tests, whereas larger hospitals maintain microbiology laboratories on site. With this infrastructure in place, both the logistics and the financial requirements would be modest for commercial diagnostic services or for hospital-integrated microbiology laboratories to venture into the bacterial WGS-based MDx service space (Fig. 1). For the integration of bacterial WGS efforts into existing clinical microbiology practice, different models are conceivable, depending on the requirements for sample throughput, the control over both data and analysis parameters, and the integration with additional research activities (Box 3). Decentralized diagnostic networks that are in close association with health care providers offer general advantages, as they can stimulate research collaborations, shorten reaction times to outbreak scenarios and might ultimately lead to better quality of care. The specific bioinformatic challenges to developing and supporting market-ready bacterial WGS-based MDx products are outlined below.

Figure 1: Workflow for bacterial WGS-based MDx.

A summary of the proposed workflow is shown for bacterial whole-genome sequencing (WGS)-based molecular diagnostics (MDx), which uses benchtop sequencing for decentralized sequence generation, as well as the cloud for both central data storage and remote data processing. Double-headed arrows indicate that the central data repository and reference database are constantly updated on the basis of new analysis results.

Bioinformatic challenges

Standard operating procedures. In the academic research setting, the analysis of sequence data often involves an iterative process of testing, evaluating and optimizing specific steps in the analysis, which can include the application of multiple bioinformatic methods, tools and parameters. This optimization is driven less by economic factors such as simplicity, reproducibility and efficiency of the analysis, all of which are key concerns for clinical implementation, than by the completeness and the accuracy of results and by the conformity of the analysis with community-accepted standards. By contrast, clinical MDx applications require definition of a robust bioinformatic sequence analysis workflow in order to ensure standardization, validation and automation, which helps to reduce both the cost and the duration of such analyses.

Standardization of entire analysis protocols using a defined set of bioinformatic tools and analysis parameters guarantees reproducible diagnostic results. This reproducibility allows validation of diagnostic results as part of the developmental process of the clinical MDx product. Such validation should include large blinded clinical cohort studies. Parameters will need to be defined and validated for bacterial WGS-based MDx applications to associate specific genetic features with phenotypes. For example, standardization requires the definition of set thresholds to identify the presence of a genetic feature in the WGS data set; that is, how many individual sequence reads need to match a reference gene or locus, what should be the minimal required sequence identity between these reads and the reference, and how much of the reference locus needs to be covered by the matching sequence reads. The US Institute of Medicine formulated guidelines for sound scientific practice for the validation of so-called 'omics-based' tests, which included independent validation of the robustness of the test using a 'locked down', or frozen, computational model that cannot be changed during the validation process32.
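The three thresholds named above (number of matching reads, minimal sequence identity and covered fraction of the reference locus) could be encoded as a fixed parameter set along the following lines. This is a minimal sketch: the class names, the function names and the default values are illustrative assumptions, not validated clinical cut-offs. The frozen dataclass is one simple way to mimic a 'locked down' parameter set that cannot be modified once it has been created.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class PresenceThresholds:
    min_reads: int = 10          # assumed value, for illustration only
    min_identity: float = 0.95   # assumed value, for illustration only
    min_breadth: float = 0.90    # assumed fraction of the locus that must be covered


@dataclass(frozen=True)
class ReadHit:
    ref_start: int   # 0-based, inclusive start on the reference locus
    ref_end: int     # 0-based, exclusive end on the reference locus
    identity: float  # fraction of identical bases between read and reference


def covered_fraction(hits, ref_length):
    """Fraction of reference positions covered by at least one hit."""
    covered = set()
    for hit in hits:
        covered.update(range(max(hit.ref_start, 0), min(hit.ref_end, ref_length)))
    return len(covered) / ref_length


def feature_present(hits, ref_length, thresholds=PresenceThresholds()):
    """Call a locus present only if all three thresholds are met."""
    passing = [h for h in hits if h.identity >= thresholds.min_identity]
    if len(passing) < thresholds.min_reads:
        return False
    return covered_fraction(passing, ref_length) >= thresholds.min_breadth
```

For example, ten overlapping 20-base reads at 99% identity that tile a 100-base locus would satisfy all three default thresholds, whereas the same reads at 90% identity would not.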

The automation of defined complex bioinformatic workflows that involve multiple individual steps facilitates analysis optimization, affords high-throughput data processing and reduces user training requirements. Most bioinformatic support systems that are available for bacterial sequence analyses rely on 'workbench' models that allow users to choose from a variety of analysis procedures and tools, which provides maximum support for customized analyses. However, such flexibility is unnecessary in the clinical setting, in which a standardized MDx test that runs with minimal configurable options will be more desirable, as it supports widespread use by clinical personnel without bioinformatic training. Ideally, such an MDx product would directly link to the raw bacterial WGS data that come from the sequencer and, in a fully automated way, generate a clear, concise human-readable diagnostic report, as well as an electronic output in a format that can be integrated with existing hospital informatics systems.
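As a sketch of the final reporting step, the function below produces both outputs from the same analysis result: a short text report for the clinician and a machine-readable record. JSON is used here only as an assumed stand-in for whatever exchange format a given hospital informatics system expects, and the field names are hypothetical.

```python
import json


def build_report(sample_id, species, resistance_genes):
    """Return (human_readable_text, machine_readable_json) for one isolate."""
    record = {
        "sample_id": sample_id,
        "species": species,
        "resistance_genes": sorted(resistance_genes),
    }
    detected = ", ".join(record["resistance_genes"]) or "none"
    text = "\n".join([
        f"Sample: {sample_id}",
        f"Species: {species}",
        f"Resistance genes detected: {detected}",
    ])
    # sort_keys gives a stable field order, which keeps repeated runs comparable.
    return text, json.dumps(record, sort_keys=True)


text, payload = build_report("isolate-001", "Escherichia coli", ["blaTEM-1", "aadA1"])
print(text)
```

Deriving both outputs from a single record is a deliberate choice: the text report and the electronic record cannot then drift apart.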

Data processing and computational resource management. Reasonably fast and computationally inexpensive bioinformatic workflow protocols have been effectively used in recent bacterial WGS-based MDx studies to identify specific genetic elements of interest (such as antibiotic resistance or virulence genes) or to determine the evolutionary position of a genome in relation to a set of references (that is, to draw phylogenetic trees). Short-read mapping tools that use the Burrows–Wheeler transform can align an entire bacterial WGS data set to a reference genome in less than an hour even with modest hardware support; for example, ~2.5 million E. coli reads can be processed on the credit card-sized single-board Raspberry Pi computer33. Mapping results can be parsed to infer information about the presence or absence of genes and to identify SNPs. The bioinformatic challenge for this type of bacterial WGS-based analysis includes the ability to provide scalable and elastic computational support to simultaneously process hundreds or more genomes at any time.
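How mapping results might be parsed to identify SNPs can be illustrated with a deliberately simplified sketch. It assumes ungapped alignments that are already available as (start position, read sequence) pairs; it does not read real alignment files and ignores base qualities and insertions or deletions. The depth and allele-fraction cut-offs are illustrative assumptions.

```python
from collections import Counter


def call_snps(reference, aligned_reads, min_depth=5, min_fraction=0.9):
    """Return (position, reference_base, observed_base) for well-supported SNPs.

    aligned_reads: iterable of (start, sequence) pairs, 0-based and ungapped.
    """
    # One base counter per reference position (a minimal pileup).
    pileup = [Counter() for _ in reference]
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if 0 <= position < len(reference):
                pileup[position][base] += 1

    snps = []
    for position, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call at this position
        base, support = counts.most_common(1)[0]
        if base != reference[position] and support / depth >= min_fraction:
            snps.append((position, reference[position], base))
    return snps
```

The same pileup also yields per-position depth, so gene presence or absence can be inferred from it by checking how much of a locus is covered.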

In the academic field, computational support for bioinformatic sequence analysis is typically provided using either local hardware or online computational resources. Local support can be provided through individual desktop computers or through local server networks. For example, two commercial providers of workbench programs for bioinformatic sequence analyses — CLC bio and Geneious — offer software for local installation either on a local desktop or on a server for additional computational support. Apart from the advantages that automated analysis pipelines provide over workbench models that force users to select and configure analysis parameters, the use of locally installed systems for bacterial genome sequence analyses requires substantial hardware support to allow parallel processing of typical clinical sample loads.

Online resources for academic use have been provided either as fixed workbenches that are pre-installed on a remote server or as Infrastructure as a Service that requires users to install software on a cloud server before using its resources. Compared with locally installed computational support systems, central resource management provided by cloud services tends to afford economies of scale that result in greater data storage and processing capacities at lower prices. Non-commercial sequence analysis software that is provided as an online service includes the versatile Galaxy platform34 and the Rapid Annotation using Subsystem Technology (RAST) server for bacterial genome annotation35. Users typically access these platforms with a standard web browser and take advantage of both the hardware resources that are available to them and the software that is pre-installed on the server. These systems do not require users to install software locally and provide a lot of flexibility to customize analyses; they are therefore popular in the academic community. However, clinical MDx applications are likely to be better served by combining online computational resources with streamlined and automated 'canned' analysis software systems.

The flexible pay-as-you-go support model of the cloud could serve as a template for the decentralized implementation of highly scalable bioinformatic support for bacterial MDx. Computational resources on the cloud can accommodate even substantial analysis loads, including multiple tasks that are carried out in parallel. Because the cloud can, in theory, provide these resources on demand at any time, even bursts of extreme computational demand can be met on short notice.

For cloud-based bioinformatic applications, such as Galaxy Cloudman36 and the Cloud Virtual Resource (CloVR)37, to carry out analyses on the cloud, users upload and install software together with the input data. In CloVR, the software is assembled into automated bacterial sequence analysis pipelines that are pre-installed on a virtual computer (that is, a virtual machine)24,37, which greatly facilitates installation and affords portability across different computer operating systems, including local desktop computers and online cloud servers. As the cloud provides on-demand computational resources and storage space, both of which are paid for by the hour, data outputs should be downloaded after completion of the analysis and all remaining data and software removed from the cloud, which CloVR supports in a fully automated, seamless manner.
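The economics of hourly, on-demand billing can be sketched with a simple estimate. The sketch assumes that the workload splits evenly across instances and that each instance is billed in whole hours; the processing time per genome and the hourly rate in the example are placeholders, not quoted prices.

```python
import math


def burst_cost(n_genomes, hours_per_genome, n_instances, hourly_rate):
    """Return (wall_clock_hours, total_cost) for an evenly split workload."""
    total_compute_hours = n_genomes * hours_per_genome
    # Every instance runs, and is billed, for the same rounded-up number of hours.
    wall_clock_hours = math.ceil(total_compute_hours / n_instances)
    return wall_clock_hours, wall_clock_hours * n_instances * hourly_rate


# Placeholder inputs: 100 genomes, 0.5 h each, rate of 0.10 per instance-hour.
for instances in (1, 10):
    hours, cost = burst_cost(100, 0.5, instances, 0.10)
    print(f"{instances} instance(s): {hours} h wall-clock, cost {cost:.2f}")
```

Under these idealized assumptions the total cost is the same for one instance or ten, while the turnaround time shrinks tenfold, which is the practical appeal of elastic resources for time-critical diagnostics. Rounding up to whole billed hours means that very small jobs spread over many instances can cost more than the same jobs run on few.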

Data storage and integration. Bacterial WGS-based MDx relies heavily on the use of reference databases, for example, the use of reference strains to carry out isolate typing and phylogenetic analysis or to predict antimicrobial resistance phenotypes on the basis of comparison with known marker genes. In addition, if such an MDx system is widely implemented, a substantial amount of new bacterial WGS data will be generated on a continuous basis. These data need to be integrated back into the reference databases for subsequent iterations of the MDx application. For example, if a hospital identifies an MRSA isolate as part of its routine surveillance programmes, then the corresponding genotypic information will be important for the inclusion of this isolate in subsequent analyses in order to discover and track potential MRSA transmission events in the same hospital, as well as for long-term national and international surveillance. Beyond their direct use in situ, the generated WGS data will also be a valuable academic and/or commercial resource, for example, if genetic features can be associated with a specific outbreak after multiple bacterial isolates have been sequenced and analysed. These genetic loci can then be either diagnostic markers for commercial use, or targets for functional research or vaccine development38. Although commercial interests can foster and accelerate both the development and the implementation of new WGS-based MDx products, the immense value of the generated output for both the public health community and the scientific research community should be a strong argument to politically mandate open data sharing, as commercial interests might otherwise result in company-owned, proprietary genome sequence databases.

Central data storage provides economic benefits that are associated with unified data management. New data repositories could become integrated with established non-private databases, such as those maintained at the US National Center for Biotechnology Information (NCBI), which should guarantee open data sharing between commercial MDx test providers and the academic research community. Consistent data types and formats will have to be adapted to handle the enormous volume of data that is generated by routine clinical use of WGS. The current paradigm is to store raw sequence reads, which represents a great data burden and might not provide the best use for integration into future MDx applications. For example, the raw data from a single E. coli genome sequenced at ~250-fold coverage that are deposited at the NCBI Short Read Archive amount to ~500 megabytes, whereas the corresponding GenBank file (accession number: AIFA00000000), which contains all the information of the assembled and annotated genome that is required for phylogenetic or genotypic characterization, has a size of less than two megabytes. This 250-fold reduction in the data 'footprint' substantially decreases both the costs and the efforts that are associated with long-term bacterial WGS data management.
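The effect of this reduction at repository scale can be worked through with the per-genome sizes quoted above. The repository size of 10,000 isolates is a hypothetical number chosen purely for illustration, and the 2-megabyte figure is used as an upper bound for the assembled file.

```python
RAW_MB_PER_GENOME = 500       # ~500 MB of raw reads per genome (from the text)
ASSEMBLED_MB_PER_GENOME = 2   # upper bound for the assembled, annotated file


def footprint_gb(n_genomes, mb_per_genome):
    """Total storage in gigabytes, using 1 GB = 1,000 MB."""
    return n_genomes * mb_per_genome / 1000


n = 10_000  # hypothetical repository size
raw_gb = footprint_gb(n, RAW_MB_PER_GENOME)
assembled_gb = footprint_gb(n, ASSEMBLED_MB_PER_GENOME)
print(f"Raw reads: {raw_gb:.0f} GB; assembled genomes: {assembled_gb:.0f} GB")
print(f"Reduction: {raw_gb / assembled_gb:.0f}-fold")
```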

Regulations

As MDx enters the clinical paradigm, the control of information is becoming an important issue at the laboratory, provider and patient levels. Online data transfer, processing and storage will require safety precautions to protect sensitive patient information. Commercial online resource providers, such as Amazon Web Services, have recognized this problem and responded by obtaining appropriate security certifications. Additional regulatory requirements need to be defined by the legislative body. Although providers of clinical laboratory services in the United States are obligated to obtain Clinical Laboratory Improvement Amendments (CLIA) certification, similar standards are lacking for bioinformatic sequence analyses. A California bill sponsored by the consumer genomics firm 23andMe introduced the term 'post-CLIA bioinformatics services' to distinguish laboratory services that are regulated by the CLIA from post-laboratory bioinformatic analysis services39. According to this bill, which has been the subject of controversy40, post-CLIA bioinformatics services would not require approval or review of the algorithm by a government regulatory body but instead by a designated individual who is vaguely defined to possess a background in either bioinformatics or biostatistics. Although the main focus of current discussions is on human MDx, the legislative regulation of both the definitions and the validations that are applied to bioinformatic analysis parameters in bacterial MDx will probably also become more relevant in the near future.

Future directions

Clinical application of bacterial WGS-based MDx will be crucial for global efforts to identify, prevent and treat infectious diseases. In order to be truly successful, applications need to become widely available to clinical laboratories, including remote field settings and resource-poor hospitals in the developing world. For this scenario we envision a model that will require little more than a power supply, an Internet connection and an individual who has minimal laboratory skills to integrate bacterial genome sequencing as a diagnostic tool with essentially limitless applications in any health care setting. The first step towards integration of bacterial WGS-based MDx into the clinic could be its adoption by national health services and public health laboratories. These services operate on a defined set of clinical pathogens and diagnostic parameters, and at a scale that is large enough to allow timely validation and optimization of future MDx tests.