When we talk with scientists who regularly use target capture for next generation sequencing (NGS) about the performance of their current or desired enrichment panel, we see wide variation in both the number of metrics they consider important and how they interpret them. To help reduce the confusion caused by the many metrics circulating in the NGS field, we present the metrics that matter most in our analysis of short-read sequencing data. To evaluate sequencing performance, it is important to understand what the following metrics indicate and how to use them together:
- On-target percentage
- Complexity and duplicate rate
- Coverage depth
- Uniformity of coverage
- Flexibility of workflow
On-target percentage
After reads (sequenced DNA fragments) are generated, they are filtered based on quality. The reads that pass the filter (reported as % PF) are aligned to a reference genome. For targeted sequencing, the reads that align to the region of interest are on-target (Figure 1). The % on-target bases is the ratio of the number of bases within the target region to the total number of aligned pass-filter bases, expressed as a percentage (Method #1). A second measure, % on-target*, also counts near-target bases, which fall outside the target region but close to it (Method #2). The default range used for near bases is within 200 bp of the target region using Picard HsMetrics. Usually, we calculate these values after duplicate reads are removed. On-target percentage can be improved using blockers. Blockers are added during hybridization capture and anneal to adapters to prevent cross-hybridization, also called daisy chaining, between library fragments (Figure 2).
Method #1: % On-target bases = On-target bases/Total aligned bases
Method #2: % On-target* = (On-target bases + near-target bases)/Total aligned bases
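The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function name and the base counts are hypothetical, and in practice the counts would come from an alignment metrics tool rather than being typed in by hand.

```python
def on_target_percentages(on_target_bases, near_target_bases, total_aligned_bases):
    """Return (% on-target bases, % on-target*) following Method #1 and Method #2."""
    if total_aligned_bases <= 0:
        raise ValueError("total_aligned_bases must be positive")
    # Method #1: on-target bases only
    method_1 = 100.0 * on_target_bases / total_aligned_bases
    # Method #2: on-target plus near-target bases
    method_2 = 100.0 * (on_target_bases + near_target_bases) / total_aligned_bases
    return method_1, method_2


# Hypothetical counts, for illustration only.
m1, m2 = on_target_percentages(
    on_target_bases=7_500_000,
    near_target_bases=1_000_000,
    total_aligned_bases=10_000_000,
)
print(f"Method #1: {m1:.1f}%  Method #2: {m2:.1f}%")  # Method #1: 75.0%  Method #2: 85.0%
```

Because Method #2 adds near-target bases to the numerator, it is never lower than Method #1 for the same data, so the two values should not be compared with each other across reports.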
Complexity and duplicate rate
For sequencing that uses hybridization capture, duplicate reads, especially in paired-end sequencing, are assumed to be the result of reading 2 or more PCR copies of the same original DNA fragment. When sequencing randomly fragmented, PCR-amplified DNA, some amount of duplication is unavoidable. The goal is to have sufficient diversity and complexity persisting within the library even after enrichment so that random sampling of reads will rarely detect the same fragment multiple times. Optimizing your library preparation and hybridization capture protocols can help lower duplicate reads to improve library complexity (Figure 3). Most sequencing analysis pipelines remove PCR duplicates; therefore, using protocols that maintain a low frequency of duplicate DNA fragments results in a greater amount of usable sequencing data at the end (Figure 4). However, for applications requiring higher sensitivity—such as rare allele detection—the unique versus duplicate reads metric is important for monitoring diversity and for more accurate measurement of copy number variation.
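As a rough illustration of the idea, the sketch below treats fragments that share the same chromosome, start, end, and strand as copies of one original molecule and reports the fraction of fragments that are duplicates. The function name and the fragment records are hypothetical, and this matching rule is a simplifying assumption; real duplicate-marking tools apply more detailed criteria.

```python
def duplicate_rate(fragments):
    """Fraction of fragments that repeat an alignment already seen.

    Each fragment is a (chrom, start, end, strand) tuple. Fragments with
    identical tuples are assumed to be PCR copies of the same molecule.
    """
    if not fragments:
        raise ValueError("no fragments supplied")
    unique = len(set(fragments))          # one representative per molecule
    duplicates = len(fragments) - unique  # every extra copy counts as a duplicate
    return duplicates / len(fragments)


# Hypothetical aligned fragments, for illustration only.
fragments = [
    ("chr1", 100, 350, "+"),
    ("chr1", 100, 350, "+"),  # copy of the first fragment
    ("chr1", 120, 380, "+"),
    ("chr2", 500, 760, "-"),
    ("chr2", 500, 760, "-"),  # copy of the previous fragment
]
print(duplicate_rate(fragments))  # 0.4 (2 duplicates out of 5 fragments)
```

The unique fragments (3 of 5 here) are what remain as usable data once duplicates are removed.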
Coverage depth
Coverage depth is the number of reads that map to a given position in a genomic target. The deeper the coverage of a target region (i.e., the more times the region is sequenced), the greater the reliability and sensitivity of the sequencing assay. Achieving robust sequencing results requires that a certain percentage of the targeted regions reach a certain coverage depth (Figure 5).
Typically, the minimum depth of coverage required for genomic resequencing of diploid organisms, such as human, mouse, or rat, is 20–30X. However, different applications, labs, or bioinformatics groups may require lower or higher minimum coverage depth. We have met researchers who find coverage as low as 1–2X sufficient, while at the other end of the scale, some researchers require 500–1000X coverage of target regions. Higher coverage depth allows higher sensitivity for detecting genomic sequence variations.
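The requirement that a certain percentage of the target reaches a certain depth can be checked with a small calculation. The following is a minimal sketch, assuming per-base depths for the target region are already available as a list; the function name, the depth values, and the 20X threshold used in the call are illustrative assumptions.

```python
def coverage_summary(depths, min_depth):
    """Return (mean depth, % of target bases with depth >= min_depth)."""
    if not depths:
        raise ValueError("no depth values supplied")
    mean_depth = sum(depths) / len(depths)
    covered = sum(1 for d in depths if d >= min_depth)
    return mean_depth, 100.0 * covered / len(depths)


# Hypothetical per-base depths across a tiny target, for illustration only.
depths = [0, 12, 25, 30, 31, 40, 42, 45, 50, 55]
mean_depth, pct_at_20x = coverage_summary(depths, min_depth=20)
print(f"mean depth: {mean_depth:.1f}X, bases at >=20X: {pct_at_20x:.0f}%")
# mean depth: 33.0X, bases at >=20X: 80%
```

Here the mean depth is above the threshold, yet 20% of the bases fall short of it, which is why the percentage at a given depth is reported alongside the mean.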
A good method for estimating the required depth of coverage for a particular application is to begin with 20X and divide by the expected allele frequency; e.g., for detecting mutations with 5% (0.05) allele frequency, you would need 400X coverage depth.
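This rule of thumb is simple enough to write as a one-line calculation. The sketch below only restates the arithmetic from the text; the function name and the 20X default are taken as assumptions from the paragraph above, not from any specific tool.

```python
def required_depth(allele_frequency, baseline_depth=20):
    """Estimate required coverage depth as baseline depth / expected allele frequency."""
    if not 0 < allele_frequency <= 1:
        raise ValueError("allele_frequency must be in the range (0, 1]")
    return baseline_depth / allele_frequency


for af in (0.5, 0.05, 0.01):
    print(f"allele frequency {af:.2f}: about {required_depth(af):.0f}X")
# allele frequency 0.50: about 40X
# allele frequency 0.05: about 400X
# allele frequency 0.01: about 2000X
```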
Uniformity of coverage
Coverage uniformity is a metric that should be considered in combination with other measures. One measure of uniformity is the fold-80, the amount of extra coverage required for 80% of the target sequences to reach the mean coverage depth. It is calculated by dividing the mean coverage by the 20th percentile coverage. Fold-80 can be a misleading measure of sequencing efficiency, since regions with low coverage are not included in the calculation. Sequencing data with a significant lack of coverage could still achieve a fold-80 score close to 1.0. This caveat illustrates the importance of evaluating targeted NGS panel performance in light of multiple measurements.
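The caveat can be made concrete with a short sketch. It assumes that "not included" means positions with zero coverage are dropped before the calculation, and it uses a nearest-rank percentile; both choices, along with the function names and depth values, are assumptions made for illustration and may differ from how a given metrics tool computes fold-80.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct between 0 and 100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def fold_80(depths, exclude_uncovered=True):
    """Mean coverage divided by 20th percentile coverage."""
    if exclude_uncovered:
        values = sorted(d for d in depths if d > 0)  # assumption: drop zero-coverage positions
    else:
        values = sorted(depths)
    if not values:
        raise ValueError("no depth values to evaluate")
    mean_depth = sum(values) / len(values)
    p20 = percentile(values, 20)
    if p20 == 0:
        return math.inf  # no finite amount of extra sequencing reaches the mean
    return mean_depth / p20


# Hypothetical depths, for illustration only.
covered = [98, 99, 100, 100, 100, 100, 100, 100, 101, 102]
with_gaps = covered + [0] * 10  # half of the positions have no coverage at all

print(round(fold_80(covered), 3))                     # 1.01
print(round(fold_80(with_gaps), 3))                   # 1.01, the gaps are invisible
print(fold_80(with_gaps, exclude_uncovered=False))    # inf
```

The second data set is missing half of its target yet scores the same fold-80 as the first, which is why the percentage of target bases reaching a minimum depth should be checked as well.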
Consistency between panels is especially important if you will be doing multiple experiments at different times. To achieve consistency, variation between panels must be minimized. All xGen panels are generated using PCR-free synthesis, which removes the possibility of variation due to PCR bias. Large-scale synthesis and formulation of the xGen Exome Research Panel v2 generates many aliquots from the same single lot over time. This service is an option for all IDT panels. IDT labels these aliquots as “lots” to track shipping, but they originate from a single synthesis run. Lots from other vendors are typically generated from multiple synthesis runs and use enzymatic methods. To test the lot-to-lot consistency, two different users performed the IDT captures using different aliquots on different days at different sites (Figure 6). The coverage-to-coverage scatterplot for the xGen Exome Research Panel v2 shows a linear regression line that mimics the predicted “perfect” correlation line (Figure 6A). Lot-to-lot variation in target coverage can only be overcome by performing expensive re-validations for applications like copy number variant (CNV) calling. The xGen Exome Research Panel v2 allows researchers to skip expensive re-validations, saving time and precious resources.
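One way to quantify how closely a coverage-to-coverage scatterplot follows the "perfect" correlation line is to fit a least-squares line and compare its slope to 1 and its intercept to 0. The sketch below is a generic illustration under that assumption, not the analysis used for Figure 6; the function name and the per-target coverage values are hypothetical.

```python
def linear_fit(x, y):
    """Ordinary least-squares slope, intercept, and Pearson r for paired values."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("values must vary in both x and y")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Hypothetical mean coverage per target from two captures, for illustration only.
lot_a = [80, 95, 100, 110, 120]
lot_b = [82, 94, 101, 108, 121]
slope, intercept, r = linear_fit(lot_a, lot_b)
print(f"slope={slope:.3f} intercept={intercept:.2f} r={r:.4f}")
```

A slope near 1, an intercept near 0, and r near 1 together indicate that each target is covered to a similar depth in both captures.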
Flexibility of workflow
Data reproducibility is paramount when conducting experiments. Ideally, experiments produce similar data when variables like hybridization time or multiplex level are changed for the hybridization-capture workflow (Figure 7). Reducing hybridization time to only 4 hours yields the same results as hybridization overnight, saving time overall. In addition, flexibility in timing allows users to plan experiments around shifts. The ability to multiplex provides similar flexibility, allowing users to adjust the workflow to the number of samples while lowering the cost per sample. Consistent data, combined with cost and time savings, ensures reproducibility and optimal efficiency.
Summary and additional resources
Optimizing your NGS experiments using these metrics can save time and costs. Workflows that result in high on-target rates and adequate coverage can avoid the burden of re-sequencing. Deep coverage, low duplicate rates, and experimental consistency give you confidence that your data is accurate and reproducible. The ability to modify the workflow by changing multiplexing levels and hybridization times allows you to plan your experiments to fit your schedule. All these metrics must be taken into consideration when choosing the best NGS products for your study. For more information about NGS workflows, methods, and applications, download our Targeted sequencing guide. To learn how the xGen Exome Research Panel provides a superior on-target rate and the most complete coverage of the human exome, download our white paper, Consistent, comprehensive efficient: An improved human exome sequencing solution.