In talking with scientists who regularly use target capture for next generation sequencing (NGS) about the performance of their current or desired enrichment panel, I have come to realize how widely the number and interpretation of metrics considered important can vary. To help reduce confusion from the vast number of metrics circulating in the NGS field, I have decided to discuss those that scientists at IDT consider to be important. IDT uses the following measurements to evaluate the performance of an enrichment panel:
- Unique vs. Duplicate
- On Target
- Coverage Depth
- Uniformity of Coverage
Unique vs. Duplicate
Duplicate reads, especially in paired-end sequencing, are assumed to be the result of reading 2 or more PCR copies of the same original DNA fragment. When sequencing randomly fragmented DNA, some amount of duplication is unavoidable. The goal is to have sufficient diversity persisting within the library even after enrichment so that random sampling of reads will rarely pull out the same fragment multiple times. Most sequencing analysis pipelines remove PCR duplicates; therefore, limiting the number of duplicates results in a greater amount of usable sequencing data at the end. For more sensitive applications, such as rare allele detection, the proportion of unique vs. duplicate reads is also important for monitoring library diversity and for more accurate measurement of copy number variation (CNV).
The figure above demonstrates protocol optimization which led to a 3-fold reduction in “duplicate” reads. This optimization was primarily accomplished by controlling PCR parameters before and after targeted enrichment. The xGen® AML Panel was used for duplicate optimization.
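To make the unique vs. duplicate count concrete, here is a minimal Python sketch. It assumes that reads with identical fragment coordinates and strand are copies of the same original molecule, which is a common simplification rather than a description of IDT's pipeline. The function name `duplicate_stats` and the example coordinates are made up for illustration.

```python
from collections import Counter

def duplicate_stats(fragments):
    """Return (unique, duplicates, duplicate_rate) for a list of fragment keys.

    Each key is a hashable tuple, here (chrom, start, end, strand).
    Assumption: reads sharing a key are PCR copies of one original fragment;
    the first copy counts as unique and every further copy as a duplicate.
    """
    counts = Counter(fragments)
    total = sum(counts.values())
    unique = len(counts)
    duplicates = total - unique
    rate = duplicates / total if total else 0.0
    return unique, duplicates, rate

# Hypothetical paired-end fragments; the first two share coordinates.
reads = [
    ("chr1", 100, 350, "+"),
    ("chr1", 100, 350, "+"),
    ("chr1", 120, 360, "+"),
    ("chr2", 500, 740, "-"),
]
unique, duplicates, rate = duplicate_stats(reads)
print(f"unique={unique} duplicates={duplicates} rate={rate:.0%}")
```

With these four example reads, three are unique and one is a duplicate, so a quarter of the sequencing output would be discarded by a pipeline that removes duplicates.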
On Target
The measurement of on-target bases or reads is typically represented as the ratio of the number of bases (or reads) within a target region to the total number of bases (or reads) output by the sequencer, expressed as a percentage.
Method #1: % On Target = (On Target Reads / Total Aligned Reads) × 100
Method #2: % On Target = (On Target Bases / Total Aligned Bases) × 100
A base is considered on target if it is aligned with a targeted region. A read is considered on target if at least one base within the read aligns to a targeted region. In the example above, we would say that this particular result was 75% on target if we calculate by reads (reads 1, 2, and 4 are on target; read 3 is not), but approximately 50% on target if we calculate by bases (only half of the bases within the reads are aligned with a targeted region). IDT measures reads on target because that more accurately depicts reliable pull down of target fragments regardless of variables such as shear size.
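The difference between the two methods can be sketched in a few lines of Python. The interval coordinates below are invented so that the result matches the 75% and 50% figures in the text; they are not taken from the original figure. The sketch assumes all reads and targets lie on one chromosome, uses half-open `(start, end)` intervals, and assumes the targets do not overlap one another.

```python
def overlap(read, target):
    """Number of bases shared by two half-open (start, end) intervals."""
    return max(0, min(read[1], target[1]) - max(read[0], target[0]))

def on_target(reads, targets):
    """Return (% on target by reads, % on target by bases).

    A read counts as on target if at least one of its bases overlaps a target.
    Assumption: targets do not overlap each other, so bases are not double counted.
    """
    on_reads = 0
    on_bases = 0
    total_bases = 0
    for read in reads:
        covered = sum(overlap(read, t) for t in targets)
        total_bases += read[1] - read[0]
        on_bases += covered
        if covered > 0:
            on_reads += 1
    return 100 * on_reads / len(reads), 100 * on_bases / total_bases

targets = [(100, 200)]   # one hypothetical 100-base target
reads = [
    (50, 150),    # read 1: half inside the target
    (100, 200),   # read 2: fully inside
    (300, 400),   # read 3: off target
    (150, 250),   # read 4: half inside
]
by_reads, by_bases = on_target(reads, targets)
print(f"by reads: {by_reads:.0f}%  by bases: {by_bases:.0f}%")
```

Three of the four reads touch the target, while only 200 of the 400 sequenced bases fall inside it, which is why the same data can be reported as either 75% or 50% on target.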
This figure demonstrates a ~50% increase in on-target reads achieved through protocol optimization. This optimization was primarily accomplished by adjusting the temperatures used during hybridization and wash steps. On Target refers to “On-Target Reads” as defined in the earlier section. The flank includes the target region plus 100 bases. The ratio of target vs. flank can be modified by shear size and read length, among other things.
Coverage Depth
Coverage is the number of times a region is sequenced per sequencing run. The deeper the coverage of a target region (i.e., the more times the region is sequenced), the greater the reliability and sensitivity of the sequencing assay. Typically, the minimum depth of coverage required for genomic resequencing of diploid organisms, such as human, mouse, or rat, is 20–30X. However, different applications, labs, or bioinformatics groups may require lower or higher minimum coverage depth. I have met researchers who find coverage as low as 1–2X sufficient, while at the other end of the scale, some researchers require 500–1000X coverage of target regions because they need greater sensitivity. A good method for estimating the required depth of coverage for a particular application is to begin with 20X and divide by the expected allele frequency; e.g., for detecting mutations with 5% (0.05) allele frequency, you would need 20X / 0.05 = 400X coverage depth.
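The rule of thumb above is simple arithmetic, sketched here in Python. The function name `required_depth` and the extra allele frequencies in the loop are illustrative; only the 20X starting point and the 5% example come from the text.

```python
def required_depth(allele_frequency, base_depth=20):
    """Estimate the coverage depth needed to detect a variant.

    Rule of thumb from the text: start at 20X and divide by the
    expected allele frequency, given as a fraction between 0 and 1.
    """
    if not 0 < allele_frequency <= 1:
        raise ValueError("allele_frequency must be in the range (0, 1]")
    return base_depth / allele_frequency

# 0.05 is the example from the text; the other values are illustrative.
for freq in (0.5, 0.1, 0.05, 0.01):
    print(f"{freq:.0%} allele frequency -> about {required_depth(freq):.0f}X")
```

This is only an estimate for planning purposes; the depth actually needed also depends on the application and the analysis pipeline, as noted above.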
To assess how well targets are covered, we plot % of Targets > X Coverage on the Y axis against coverage on the X axis. This data has been normalized to 1 million mapped reads, making it easier to calculate and compare the depth of coverage achieved for different platforms and levels of multiplexing. The Illumina MiSeq platform can support up to 30M reads. Protocol v2 clearly demonstrates deeper coverage across a larger range of targets, with ~93% of targets covered at 20X compared to ~86% with Protocol v1.
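A coverage curve of this kind can be sketched as follows. The per-target depths and the 2 million mapped reads are made-up numbers, and the function names are hypothetical. The sketch also assumes that a target counts toward a threshold when its depth is at or above that value, which may differ from how the plotted data was computed.

```python
def normalize_depth(depths, mapped_reads, per=1_000_000):
    """Scale per-target depths to what they would be at `per` mapped reads."""
    scale = per / mapped_reads
    return [d * scale for d in depths]

def percent_targets_at_or_above(depths, thresholds):
    """For each threshold X, return the % of targets with depth >= X."""
    n = len(depths)
    return {x: 100 * sum(d >= x for d in depths) / n for x in thresholds}

# Hypothetical mean depth for four targets from a run with 2M mapped reads.
raw_depths = [10, 40, 60, 90]
normalized = normalize_depth(raw_depths, mapped_reads=2_000_000)
curve = percent_targets_at_or_above(normalized, thresholds=[1, 20, 40])

for x, pct in curve.items():
    print(f"{pct:.0f}% of targets at >= {x}X (per 1M mapped reads)")
```

Because the depths are scaled to a fixed number of mapped reads, curves from runs with different read counts or multiplexing levels can be compared on the same axes.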
Uniformity of Coverage
Uniformity can be expressed in various ways, and IDT uses several methods to calculate coverage uniformity. The primary method, which is applicable to the widest range of applications, is to calculate the proportions of sequences that have greater than 0.2, 0.5, and 1.0 times the mean coverage. We find this method useful for helping researchers understand the lower coverage limits, since the drawbacks of under-sequencing are greater than those of over-sequencing.
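As a rough illustration of the primary method, the Python sketch below computes the fraction of targets whose depth exceeds 0.2, 0.5, and 1.0 times the mean. The depth values and the function name `fold_mean_fractions` are invented for the example, and the sketch works per target; the same calculation could be done per base.

```python
from statistics import mean

def fold_mean_fractions(depths, folds=(0.2, 0.5, 1.0)):
    """Return {fold: fraction of targets with depth > fold * mean depth}."""
    m = mean(depths)
    return {f: sum(d > f * m for d in depths) / len(depths) for f in folds}

# Hypothetical per-target depths with a mean of 100X.
depths = [2, 30, 80, 100, 120, 140, 160, 168]
for fold, fraction in fold_mean_fractions(depths).items():
    print(f"{fraction:.1%} of targets above {fold}x mean coverage")
```

The fraction above 0.2 times the mean is the one that flags poorly covered targets: in this example, one of the eight targets falls below that limit.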
The other methods used at IDT for calculating uniformity of coverage are more useful for assessment of CNV. One method is to calculate the coefficient of variation (CV), which is the standard deviation divided by the mean. Lower numbers indicate better uniformity. This can be made more granular by calculating CV for targets grouped by GC content. We typically observe wider distributions at the extreme ends of the GC spectrum.
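The CV calculation, and its breakdown by GC content, might look like the sketch below. The input format (a list of `(gc_percent, depth)` pairs), the 10% bin width, and the choice of population standard deviation are assumptions made for illustration, not details from IDT's analysis.

```python
from collections import defaultdict
from statistics import mean, pstdev

def cv(depths):
    """Coefficient of variation: standard deviation divided by the mean.

    Assumption: population standard deviation (pstdev) is used, and the
    mean depth is greater than zero.
    """
    return pstdev(depths) / mean(depths)

def cv_by_gc(targets, bin_width=10):
    """Group (gc_percent, depth) pairs into GC bins and return CV per bin.

    Keys are the lower edge of each bin, e.g. 20 for targets with 20-29% GC.
    """
    bins = defaultdict(list)
    for gc_percent, depth in targets:
        bins[int(gc_percent // bin_width) * bin_width].append(depth)
    return {edge: cv(d) for edge, d in sorted(bins.items())}

# Hypothetical targets: (GC content in percent, mean depth).
targets = [(25, 50), (28, 150), (45, 100), (48, 100)]
print("overall CV:", round(cv([depth for _, depth in targets]), 3))
for edge, value in cv_by_gc(targets).items():
    print(f"GC {edge}-{edge + 9}%: CV = {value:.2f}")
```

In this made-up data the low-GC bin has a much higher CV than the mid-GC bin, which is the kind of pattern the per-GC breakdown is meant to expose.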
This figure shows uniformity statistics between the first and second versions of the xGen® Rapid Capture Protocol. Although protocol v1 provides slightly higher uniformity (which may be more important for CNV applications), protocol v2 compensates by providing deeper overall coverage.
Author: Ibrahim Jivanjee is the Product Manager for NGS at IDT.