In talking with scientists who regularly use target capture for next generation sequencing (NGS) about the performance of their current or desired enrichment panels, I have noticed how widely the number and interpretation of the metrics they consider important can vary. To help reduce the confusion caused by the many metrics circulating in the NGS field, I discuss here those that scientists at IDT consider important for analyzing data from short-read sequencers. IDT uses the following measurements to evaluate the performance of an enrichment panel:
- Unique vs. duplicate reads
- Percentage of reads mapping on-target
- Coverage depth
- Uniformity of coverage
Figure 1. Protocol optimization led to a 3-fold reduction in “duplicate” reads. This optimization was primarily accomplished by controlling PCR parameters before and after targeted enrichment. The xGen® AML Panel was used for duplicate optimization.
Unique vs. duplicate reads
For sequencing that uses hybridization capture, duplicate reads (sequenced DNA fragments), especially in paired-end sequencing, are assumed to be the result of reading 2 or more PCR copies of the same original DNA fragment. When sequencing randomly fragmented, PCR-amplified DNA, some amount of duplication is unavoidable. The goal is to have sufficient diversity persisting within the library even after enrichment so that random sampling of reads will rarely detect the same fragment multiple times. Most sequencing analysis pipelines remove PCR duplicates; therefore, using protocols that maintain a low frequency of duplicate DNA fragments results in a greater amount of usable sequencing data at the end. For applications requiring higher sensitivity, such as rare allele detection, the unique vs. duplicate reads metric is also important for monitoring library diversity and for more accurate measurement of copy number variation.
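As a rough illustration of this metric, the sketch below counts duplicates by treating reads with identical alignment coordinates as copies of the same original fragment. The key layout `(chrom, start, end, strand)` and the function name `duplicate_fraction` are assumptions made for this example; production pipelines use dedicated duplicate-marking tools that apply more careful rules.

```python
from collections import Counter


def duplicate_fraction(fragments):
    """Return the fraction of reads that duplicate another read.

    `fragments` is an iterable of hashable keys, one per aligned read
    (or read pair). Here a key is (chrom, start, end, strand), which is
    an assumption for this sketch, not a fixed standard.
    """
    counts = Counter(fragments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    unique = len(counts)  # one representative is kept per distinct key
    return (total - unique) / total


reads = [
    ("chr1", 100, 350, "+"),
    ("chr1", 100, 350, "+"),  # same coordinates: counted as a duplicate
    ("chr1", 120, 380, "+"),
    ("chr2", 500, 760, "-"),
]
print(f"duplicate fraction: {duplicate_fraction(reads):.2f}")  # 0.25
```

In this toy input, 1 of 4 reads is a duplicate, so 75% of the sequencing data would remain usable after duplicate removal.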
Figure 2. Defining on-target. A base within a read is considered on-target if it aligns to a targeted region. A read is considered on-target if at least one of its bases aligns to a targeted region. In the example above, this result is 75% on-target when calculated by reads (reads 1, 2, and 4 are on-target; read 3 is not), but approximately 50% on-target when calculated by bases (only half of the bases within the reads align to a targeted region). IDT measures reads on-target because that more accurately reflects reliable pull-down of target fragments regardless of variables such as shear size.
Percentage of reads mapping on-target
The measurement of on-target bases or reads is typically represented as the ratio of number of bases within a target region to total number of bases output by the sequencer, expressed as a percentage. Usually we calculate these values after duplicate reads are removed from the read pool.
Method #1: % On-target = (On-target reads / Total aligned reads) × 100
Method #2: % On-target = (On-target bases / Total aligned bases) × 100
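The two methods can give quite different answers for the same data. The following minimal sketch computes both, assuming reads and targets are half-open `(start, end)` intervals on a single chromosome and that the targets do not overlap one another. The function names and the toy coordinates are hypothetical, chosen to give the same 75% by reads and 50% by bases as the Figure 2 example.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def on_target_rates(reads, targets):
    """Return (fraction on-target by reads, fraction on-target by bases).

    Assumes all intervals are on one chromosome and that `targets`
    are non-overlapping, so overlaps can simply be summed.
    """
    total_reads = len(reads)
    total_bases = sum(end - start for start, end in reads)
    if total_reads == 0 or total_bases == 0:
        return 0.0, 0.0
    on_reads = 0
    on_bases = 0
    for start, end in reads:
        covered = sum(overlap(start, end, t_start, t_end)
                      for t_start, t_end in targets)
        on_bases += covered
        if covered > 0:  # at least one base in a target: read is on-target
            on_reads += 1
    return on_reads / total_reads, on_bases / total_bases


targets = [(100, 200)]
reads = [(50, 150), (100, 200), (300, 400), (150, 250)]
by_reads, by_bases = on_target_rates(reads, targets)
print(f"by reads: {by_reads:.0%}, by bases: {by_bases:.0%}")
# by reads: 75%, by bases: 50%
```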
Figure 3. ~50% increase in on-target reads achieved through protocol optimization. This optimization was primarily accomplished by adjusting the temperatures used during hybridization and wash steps. On-target refers to “On-target reads” as defined in the earlier section. The flank includes the target region +100 bases. Note that the amount of a given read mapping in the region flanking a target will change with shear size. Larger shear sizes correspond to more reads mapping in the flank region.
Coverage depth
Coverage represents the number of times a sequenced DNA fragment (i.e., a read) maps to a genomic target. The deeper the coverage of a target region (i.e., the more times the region is sequenced), the greater the reliability and sensitivity of the sequencing assay.
Typically, the minimum depth of coverage required for genomic resequencing of diploid organisms, such as human, mouse, or rat, is 20–30X. However, different applications, labs, or bioinformatics groups may require lower or higher minimum coverage depth. We have met researchers who find coverage as low as 1–2X sufficient, while at the other end of the scale, some researchers require 500–1000X coverage of target regions; higher coverage depth allows for higher detection sensitivity of genomic sequence variations.
A good method for estimating the required depth of coverage for a particular application is to begin with 20X and divide by the expected allele frequency; e.g., for detecting mutations with 5% (0.05) allele frequency, you would need 400X coverage depth.
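This rule of thumb is simple enough to write as a one-line function. The sketch below is only an illustration of the arithmetic described above; the function name `required_depth` and the `base_depth` parameter are hypothetical.

```python
def required_depth(allele_frequency, base_depth=20.0):
    """Estimate the coverage depth needed for a given allele frequency.

    Implements the rule of thumb: start at `base_depth` (20X) and
    divide by the expected allele frequency (a fraction in (0, 1]).
    """
    if not 0 < allele_frequency <= 1:
        raise ValueError("allele_frequency must be in the interval (0, 1]")
    return base_depth / allele_frequency


for af in (0.5, 0.05, 0.01):
    print(f"allele frequency {af:.2f}: {required_depth(af):.0f}X")
# allele frequency 0.50: 40X
# allele frequency 0.05: 400X
# allele frequency 0.01: 2000X
```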
Figure 4. Assessing target coverage. To assess how well targets are covered, we plot % of Targets > X Coverage on the Y axis against coverage on the X axis. This data has been normalized to 1 million mapped reads, making it easier to calculate and compare the depth of coverage achieved for different platforms and levels of multiplexing. The Illumina MiSeq platform can support up to 30M reads. Protocol v2 clearly demonstrates deeper coverage across a larger range of targets, with ~93% of targets covered at 20X compared to ~86% with Protocol v1.
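A curve like the one in Figure 4 can be approximated from per-target mean depths. The sketch below assumes the normalization is a simple linear rescaling of depth to 1 million mapped reads; that scaling, the function names, and the toy depth values are assumptions for illustration rather than a description of the exact analysis used to produce the figure.

```python
def normalize_depth(depths, mapped_reads, per=1_000_000):
    """Rescale per-target depths to a common number of mapped reads.

    Assumes depth scales linearly with the number of mapped reads.
    """
    scale = per / mapped_reads
    return [d * scale for d in depths]


def pct_targets_above(depths, threshold):
    """Percentage of targets with depth strictly greater than `threshold`."""
    if not depths:
        return 0.0
    return 100.0 * sum(d > threshold for d in depths) / len(depths)


# Toy data: mean depth of four targets from a run with 2M mapped reads.
raw_depths = [10, 40, 60, 100]
normalized = normalize_depth(raw_depths, mapped_reads=2_000_000)

for x in (5, 20, 40):
    print(f"% of targets > {x}X: {pct_targets_above(normalized, x):.0f}")
```

Evaluating `pct_targets_above` over a range of thresholds gives the Y values of the plot, with the thresholds themselves on the X axis.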
Uniformity of coverage
Uniformity can be expressed in various ways, and IDT uses several methods to calculate it. The primary method, which is applicable to the widest range of applications, is to calculate the proportions of sequences that have greater than 0.2, 0.5, and 1.0 times the mean coverage. We find this method useful for helping researchers to understand the lower coverage limits; the drawbacks of under-sequencing are certainly greater than those of over-sequencing.
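The primary method can be sketched in a few lines. This example assumes a list of per-target (or per-base) depths as input; the function name `uniformity_fractions` and the toy values are hypothetical.

```python
def uniformity_fractions(depths, multiples=(0.2, 0.5, 1.0)):
    """Fraction of entries with depth greater than each multiple of the mean.

    `depths` may be per-target or per-base coverage; which one to use
    is left open here and is an assumption of this sketch.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    mean_depth = sum(depths) / len(depths)
    return {
        m: sum(d > m * mean_depth for d in depths) / len(depths)
        for m in multiples
    }


depths = [5, 30, 50, 60, 105]  # mean coverage is 50X
for multiple, fraction in uniformity_fractions(depths).items():
    print(f"> {multiple} x mean: {fraction:.0%}")
# > 0.2 x mean: 80%
# > 0.5 x mean: 80%
# > 1.0 x mean: 40%
```

A perfectly uniform panel would approach 100% at the 0.2 and 0.5 thresholds, so low values there point to under-covered regions.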
The other methods used at IDT for calculating uniformity of coverage are more useful for assessment of copy number variation (CNV). One method is to calculate the coefficient of variation (CV), which is the standard deviation divided by the mean. Lower numbers indicate better uniformity. This can be made more granular by calculating CV for targets grouped by GC content. We typically observe wider distributions at the extreme ends of the GC spectrum.
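The CV calculation, including the grouping by GC content, might look like the sketch below. It assumes each target is described by a `(gc_percent, depth)` pair, uses 10-percentage-point GC bins, and uses the population standard deviation; all three choices, along with the function names, are assumptions for this example.

```python
from statistics import mean, pstdev


def coefficient_of_variation(depths):
    """Standard deviation divided by the mean (population SD assumed)."""
    mu = mean(depths)
    if mu == 0:
        raise ValueError("mean depth is zero; CV is undefined")
    return pstdev(depths) / mu


def cv_by_gc(targets, bin_width=10):
    """Group (gc_percent, depth) pairs into GC bins and compute CV per bin.

    Bins are keyed by their lower edge, e.g. 20 for targets with
    20% <= GC < 30%. Bins with fewer than two targets are skipped.
    """
    bins = {}
    for gc_percent, depth in targets:
        lower_edge = int(gc_percent // bin_width) * bin_width
        bins.setdefault(lower_edge, []).append(depth)
    return {
        edge: coefficient_of_variation(values)
        for edge, values in sorted(bins.items())
        if len(values) >= 2
    }


targets = [(25, 90), (28, 110), (45, 100), (48, 100), (75, 50), (78, 150)]
print(f"overall CV: {coefficient_of_variation([d for _, d in targets]):.2f}")
for edge, cv in cv_by_gc(targets).items():
    print(f"GC {edge}-{edge + 10}%: CV = {cv:.2f}")
```

In this toy data, the middle GC bin is perfectly uniform while the high-GC bin shows the widest spread, which mirrors the pattern described above for the extreme ends of the GC spectrum.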
Figure 5. Uniformity statistics between first and second versions of the xGen® Rapid Capture Protocol. Although protocol v1 provides slightly higher uniformity (which may be more important for CNV applications), protocol v2 compensates by providing deeper overall coverage.