How important are those NGS metrics?

Of the many metrics used to evaluate target capture data for NGS applications, read about the ones our researchers consider most important for assessing the performance of target enrichment panels.

When talking with scientists who regularly use target capture in their next generation sequencing experiments about the performance of their current or desired enrichment panels, we have noticed wide variation in which metrics are considered important and how they are interpreted. To help reduce confusion from the vast number of metrics circulating in the NGS field, we present the metrics that matter most in our analysis of short-read sequencing data. To evaluate sequencing performance, it is important to understand what the following metrics indicate and how to use them together:

  • On-target percentage
  • Complexity and duplicate rate
  • Coverage depth
  • Uniformity of coverage
  • Consistency
  • Flexibility of workflow

On-target percentage

After reads (sequenced DNA fragments) are generated, they are filtered based on quality. The reads that pass the filter (the pass-filter percentage, % PF) are aligned to a reference genome. For targeted sequencing, the reads that align to the region of interest are on target (Method #1, Figure 1). The % on-target* metric additionally counts near-target bases: it is the ratio of on-target plus near-target bases to the total number of aligned bases, expressed as a percentage (Method #2). The default range for near-target bases in Picard HsMetrics is within 200 bp of the target region. We usually calculate these values after duplicate reads are removed. On-target percentage can be improved using blockers. Blockers are added during hybridization capture and anneal to adapters to prevent cross-hybridization, also called daisy chaining, between library fragments (Figure 2).

Method #1: % On-target bases = On-target bases/Total aligned bases

Method #2: % On-target* = (On-target bases + near-target bases)/Total aligned bases
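
As a minimal illustration of these two calculations, the short Python sketch below computes both values from base counts such as those reported by Picard HsMetrics. The counts used here are hypothetical, not from a real run.

```python
def pct_on_target(on_target_bases: int, total_aligned_bases: int) -> float:
    """Method #1: percentage of aligned bases that fall within target regions."""
    return 100.0 * on_target_bases / total_aligned_bases


def pct_on_target_flanked(on_target_bases: int, near_target_bases: int,
                          total_aligned_bases: int) -> float:
    """Method #2 (% on-target*): also counts bases near the target,
    e.g., within the near-distance/flank used by Picard HsMetrics."""
    return 100.0 * (on_target_bases + near_target_bases) / total_aligned_bases


# Hypothetical base counts after duplicate removal
on_target = 6_000_000_000       # aligned bases inside target regions
near_target = 1_500_000_000     # aligned bases within the flank of a target
total_aligned = 10_000_000_000  # all aligned, pass-filter bases

print(f"Method #1: {pct_on_target(on_target, total_aligned):.1f}% on target")
print(f"Method #2: {pct_on_target_flanked(on_target, near_target, total_aligned):.1f}% on target*")
```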


Figure 1. Defining on-target. A base within a read is considered on target if it aligns within a targeted region. The fraction of on-target bases can fluctuate when library fragment size differs from one library prep kit to another: larger fragments yield a lower fraction of on-target bases, and smaller fragments yield a higher fraction. For example, if a target is 150 bp and the insert is also 150 bp, then 100% of the bases are on target; a larger 300 bp insert aligned to the same 150 bp target would have only 50% of its bases on target. To simplify analysis and allow comparison across diverse samples, the target region is usually padded (150 bp in either direction) to calculate flanked on-target bases. IDT typically uses the flanked on-target metric because it is stable across different library fragment sizes.



Figure 2. Improved on-target performance using blockers. (A) Blocking oligos are synthetic oligonucleotide sequences that hybridize to NGS adapters in a sequencing library to prevent cross-hybridization between library fragments. Cross-hybridization is also known as daisy chaining. (B) DNA libraries were prepared from cell line NA12878 (Coriell) using a TruSeq® Exome Library Prep Kit (Illumina), and enriched in single or multiplex reactions using the xGen AML Cancer Panel 1.0 with xGen Universal Blockers—TS Mix. Sequencing was performed on a MiSeq® System (Illumina) to generate 2 x 150 bp, paired-end reads. On-target values (with 150 bp flank) were averaged across experiments.



Complexity and duplicate rate

For sequencing that uses hybridization capture, duplicate reads, especially in paired-end sequencing, are assumed to be the result of reading 2 or more PCR copies of the same original DNA fragment. When sequencing randomly fragmented, PCR-amplified DNA, some amount of duplication is unavoidable. The goal is to have sufficient diversity and complexity persisting within the library even after enrichment so that random sampling of reads will rarely detect the same fragment multiple times. Optimizing your library preparation and hybridization capture protocols can help lower duplicate reads to improve library complexity (Figure 3). Most sequencing analysis pipelines remove PCR duplicates; therefore, using protocols that maintain a low frequency of duplicate DNA fragments results in a greater amount of usable sequencing data at the end (Figure 4). However, for applications requiring higher sensitivity—such as rare allele detection—the unique versus duplicate reads metric is important for monitoring diversity and for more accurate measurement of copy number variation.
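
To make the link between duplicate rate and library complexity concrete, the sketch below applies the Lander-Waterman relation that tools such as Picard use when estimating library size (distinct molecules). The read counts are hypothetical, and the simple bisection solver here is only an approximation of how such tools solve the equation.

```python
import math

def estimate_library_size(total_read_pairs: int, unique_read_pairs: int) -> float:
    """Estimate the number of distinct molecules in a library from the
    Lander-Waterman relation C/X = 1 - exp(-N/X), where N = total read
    pairs, C = unique (non-duplicate) read pairs, and X = library size."""
    n, c = float(total_read_pairs), float(unique_read_pairs)
    if not 0 < c < n:
        raise ValueError("require 0 < unique pairs < total pairs")

    def f(x: float) -> float:
        return c / x - 1.0 + math.exp(-n / x)

    lo, hi = c, 2.0 * c
    while f(hi) > 0:              # expand the bracket until f changes sign
        hi *= 2.0
    for _ in range(100):          # bisection: f(lo) > 0 >= f(hi)
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical example: 10 million read pairs with a 15% duplicate rate
total_pairs = 10_000_000
unique_pairs = int(total_pairs * 0.85)
est = estimate_library_size(total_pairs, unique_pairs)
print(f"Estimated library size: ~{est:,.0f} distinct molecules")
```

With these hypothetical numbers the estimate comes out near 30 million molecules; a lower duplicate rate for the same read count implies a larger, more complex library.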


Figure 3. High conversion rates, complexity, and coverage. Libraries were generated according to the manufacturer’s instructions with 10 ng of cell-free DNA (cfDNA) reference standards, then captured with a custom 61 kb xGen Lockdown Panel. Libraries were pooled and sequenced on an Illumina NextSeq® 500 instrument. Reads were mapped using BWA (0.7.15). (A) Coverage and complexity (estimated unique molecules; HS library size) were calculated using Picard (2.18.9). Conversion rates were calculated from mean target coverage at very high duplication rates. Optimizing sample conversion during library prep increases overall library conversion, which in turn leads to higher complexity, greater coverage, and better sensitivity. (B) To illustrate this, libraries were subsampled to fewer reads and still maintained high coverage.



Figure 4. Protocol optimization led to a 3-fold reduction in “duplicate” reads. This optimization was primarily accomplished by controlling PCR parameters before and after targeted enrichment. The xGen AML Panel was used for duplicate optimization. Duplicate reads are used to calculate library complexity.

 

Coverage depth

Coverage depth is the number of reads that align to a given position in a genomic target. The deeper the coverage of a target region (i.e., the more times the region is sequenced), the greater the reliability and sensitivity of the sequencing assay. Achieving robust sequencing results requires that a certain percentage of the targeted regions reach a certain coverage depth (Figure 5).

Typically, the minimum depth of coverage required for genomic resequencing of diploid organisms, such as human, mouse, or rat, is 20–30X. However, different applications, labs, or bioinformatics groups may require lower or higher minimum coverage depth. We have met researchers who find coverage as low as 1–2X sufficient, while at the other end of the scale, some researchers require 500–1000X coverage of target regions; higher coverage depth allows for higher detection sensitivity of genomic sequence variations.

A good method for estimating the required depth of coverage for a particular application is to begin with 20X and divide by the expected allele frequency; e.g., for detecting mutations with 5% (0.05) allele frequency, you would need 400X coverage depth.
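
As a quick worked example of this rule of thumb, the snippet below divides the ~20X diploid baseline described above by the expected allele frequency; the frequencies shown are only illustrative.

```python
def required_depth(allele_frequency: float, baseline_depth: float = 20.0) -> float:
    """Rule-of-thumb estimate: divide the ~20X diploid baseline by the
    expected allele frequency of the variant to be detected."""
    if not 0 < allele_frequency <= 1:
        raise ValueError("allele frequency must be in (0, 1]")
    return baseline_depth / allele_frequency

# Illustrative allele frequencies: germline heterozygote, 5% and 1% variants
for af in (0.5, 0.05, 0.01):
    print(f"{af:>5.0%} allele frequency -> ~{required_depth(af):,.0f}X coverage")
```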


Figure 5. Complete, deep exon coverage at varying read depths.

Uniformity of coverage

Coverage uniformity is a metric that should be considered in combination with other measures. One measure of uniformity is the fold-80 base penalty: the amount of additional sequencing required for 80% of the target bases to reach the mean coverage depth. It is calculated by dividing the mean coverage by the coverage at the 20th percentile. Fold-80 can be a misleading measure of sequencing efficiency, since regions with low coverage are not included in the calculation; sequencing data with a significant lack of coverage could still achieve a fold-80 score close to 1.0. This caveat illustrates the importance of evaluating targeted NGS panel performance in light of multiple measurements.
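
For illustration, the short sketch below computes fold-80 from per-base coverage values. The coverage arrays are hypothetical, and the calculation follows the definition above (mean coverage divided by the 20th-percentile coverage) rather than any specific tool's implementation.

```python
import numpy as np

def fold_80(per_base_coverage) -> float:
    """Fold-80 base penalty: mean coverage divided by the coverage depth
    at the 20th percentile (the depth exceeded by ~80% of target bases)."""
    cov = np.asarray(per_base_coverage, dtype=float)
    return float(cov.mean() / np.percentile(cov, 20))

# Hypothetical per-base depths across a 1000 bp target region
uniform = np.full(1000, 100)                        # perfectly even coverage
uneven = np.concatenate([np.full(700, 120),         # well-covered bases
                         np.full(300, 40)])         # under-covered tail
print(f"uniform fold-80: {fold_80(uniform):.2f}")   # ~1.0 (ideal)
print(f"uneven  fold-80: {fold_80(uneven):.2f}")    # ~2.4 (needs ~2.4x more sequencing)
```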

Consistency

Consistency between panels is especially important if you will be doing multiple experiments at different times. To achieve consistency, variation between panels must be minimized. All xGen panels are generated using PCR-free synthesis, which removes the possibility of variation due to PCR bias. Large-scale synthesis and formulation of the xGen Exome Research Panel v2 generates many aliquots from a single lot over time; this service is an option for all IDT panels. IDT labels these aliquots as “lots” to track shipping, but they originate from a single synthesis run. Lots from other vendors are typically generated from multiple synthesis runs and use enzymatic methods. To test lot-to-lot consistency, two different users performed the IDT captures using different aliquots on different days at different sites (Figure 6). The coverage-to-coverage scatterplot for the xGen Exome Research Panel v2 shows a linear regression line that closely follows the predicted “perfect” correlation line (Figure 6A). Lot-to-lot variation in target coverage can otherwise only be overcome by performing expensive re-validations for applications like copy number variant (CNV) calling. The xGen Exome Research Panel v2 allows researchers to skip these re-validations, saving time and precious resources.


Figure 6. Coverage compared between lots/aliquots for the IDT xGen Exome Research Panel v2 (A) and xGen Exome Research Panel v1.0 (B). 

Flexibility of workflow

Data reproducibility is paramount when conducting experiments. Ideally, experiments produce similar data when variables such as hybridization time or multiplexing level are changed in the hybridization capture workflow (Figure 7). Reducing hybridization time to only 4 hours yields the same results as overnight hybridization, saving time overall. In addition, flexibility in timing allows users to plan experiments around shifts. The ability to multiplex provides similar flexibility, allowing users to adjust the workflow to the number of samples while also reducing cost per sample. Consistent data, combined with these cost and time savings, ensures reproducibility and optimal efficiency.


Figure 7. xGen Exome Research Panel v2 maintains performance with a flexible workflow. (A–C) Higher multiplexing and (D–F) shorter hybridization times maintain high on-target %, low duplicate %, and uniform coverage.

 

Summary and additional resources

Optimizing your NGS experiments using these metrics can save time and costs. Workflows that result in high on-target rates and adequate coverage can avoid the burden of re-sequencing. Deep coverage, low duplicate rates, and experimental consistency give you confidence that your data is accurate and reproducible. The ability to modify the workflow by changing multiplexing levels and hybridization times allows you to plan your experiments to fit your schedule. All these metrics must be taken into consideration when choosing the best NGS products for your study. For more information about NGS workflows, methods, and applications, download our Targeted sequencing guide. To learn how the xGen Exome Research Panel provides a superior on-target rate and the most complete coverage of the human exome, download our white paper, Consistent, comprehensive, efficient: An improved human exome sequencing solution.

Published Jun 12, 2014
Revised/updated Sep 17, 2020

TruSeq is a registered trademark of Illumina, Inc., used with permission. All rights reserved.