When I talk with scientists who regularly use target capture for next generation sequencing about the performance of their current or desired enrichment panel, I am struck by how much they differ in the number of metrics they consider important and in how they interpret them. To help reduce confusion from the vast number of metrics circulating in the NGS field, I have decided to discuss those that scientists at IDT consider important for our analysis of data from short-read sequencers. IDT uses the following measurements to evaluate the performance of an enrichment panel:
- Unique vs. duplicate reads
- Percentage of reads mapping on-target
- Coverage depth
- Uniformity of coverage
Unique vs. duplicate reads
For sequencing that uses hybridization capture, duplicate reads, especially in paired-end sequencing, are assumed to be the result of reading 2 or more PCR copies of the same original DNA fragment. When sequencing randomly fragmented, PCR-amplified DNA, some amount of duplication is unavoidable. The goal is to have sufficient diversity persisting within the library even after enrichment so that random sampling of reads will rarely detect the same fragment multiple times. Most sequencing analysis pipelines remove PCR duplicates; therefore, protocols that maintain a low frequency of duplicate DNA fragments yield more usable sequencing data at the end. For applications requiring higher sensitivity, such as rare allele detection, the unique vs. duplicate reads metric is also important for monitoring library diversity and for more accurate measurement of copy number variation.
The figure above demonstrates protocol optimization which led to a 3-fold reduction in “duplicate” reads. This optimization was primarily accomplished by controlling PCR parameters before and after targeted enrichment. The xGen® AML Panel was used for duplicate optimization.
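As a rough illustration of how a duplicate rate can be computed, the sketch below treats two paired-end reads as duplicates when they share the same chromosome, fragment start, fragment end, and strand. That rule and the function name `duplicate_stats` are assumptions made for this example; dedicated duplicate-marking tools handle more cases (for example soft-clipped alignments and optical duplicates), and this is not a description of the IDT analysis pipeline.

```python
from collections import Counter

def duplicate_stats(fragments):
    """Summarize unique vs. duplicate fragments.

    fragments: iterable of (chrom, start, end, strand) tuples, one per
    aligned read pair. Identical tuples are assumed to be PCR copies of
    the same original fragment (a simplifying assumption).
    """
    counts = Counter(fragments)
    total = sum(counts.values())
    unique = len(counts)          # one representative kept per fragment
    duplicates = total - unique   # every extra copy counts as a duplicate
    rate = duplicates / total if total else 0.0
    return {"total": total, "unique": unique,
            "duplicates": duplicates, "duplicate_rate": rate}

# Hypothetical data: the first two fragments have identical coordinates.
fragments = [
    ("chr1", 100, 350, "+"),
    ("chr1", 100, 350, "+"),
    ("chr1", 120, 360, "+"),
    ("chr2", 500, 740, "-"),
]
stats = duplicate_stats(fragments)
print(stats)
```

In this toy data set, one of the four fragments is a repeat of another, so the duplicate rate is 25% and three unique fragments remain for downstream analysis.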
Percentage of reads mapping on-target
The measurement of on-target bases or reads is typically represented as the ratio of the number of bases within a target region to the total number of bases output by the sequencer, expressed as a percentage. We usually calculate these values after duplicate reads have been removed from the read pool.
Method #1: % On-target (by reads) = On-target reads / Total aligned reads × 100
Method #2: % On-target (by bases) = On-target bases / Total aligned bases × 100
A base within a read is considered on target if it aligns to a targeted region. A read is considered on target if at least one base within the read aligns to a targeted region. In the example above, we would say that this particular result was 75% on-target if we calculate by reads (reads 1, 2, and 4 are on target; read 3 is not), but approximately 50% on-target if we calculate by bases (only half of the bases within the reads are aligned with a targeted region). IDT measures reads on-target because that more accurately depicts reliable pull down of target fragments regardless of variables such as shear size.
This figure demonstrates a ~50% increase in on-target reads achieved through protocol optimization. This optimization was primarily accomplished by adjusting the temperatures used during hybridization and wash steps. On-target refers to “On-target reads” as defined in the earlier section. The flank includes the target region plus 100 bases. Note that the amount of a given read mapping in the region flanking a target will change with shear size. Larger shear sizes correspond to more reads mapping in the flank region.
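The difference between the two on-target calculations can be sketched in a few lines. The coordinates below are hypothetical and were chosen to reproduce the 75% by reads and 50% by bases split described in this section. The function names, the single-chromosome simplification, the half-open intervals, and the assumption that targets do not overlap one another are all choices made for this example.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def on_target_metrics(reads, targets):
    """Return (fraction on-target by reads, fraction on-target by bases).

    reads, targets: lists of (start, end) intervals on one chromosome.
    Targets are assumed not to overlap each other, so bases are not
    counted twice.
    """
    total_reads = len(reads)
    total_bases = sum(end - start for start, end in reads)
    on_reads = 0
    on_bases = 0
    for start, end in reads:
        covered = sum(overlap(start, end, t0, t1) for t0, t1 in targets)
        on_bases += covered
        if covered > 0:           # at least one base hits a target
            on_reads += 1
    by_reads = on_reads / total_reads if total_reads else 0.0
    by_bases = on_bases / total_bases if total_bases else 0.0
    return by_reads, by_bases

# One 200-base target and four 100-base reads (hypothetical coordinates).
targets = [(1000, 1200)]
reads = [
    (950, 1050),    # read 1: half of the read is on target
    (1050, 1150),   # read 2: fully on target
    (2000, 2100),   # read 3: off target
    (1150, 1250),   # read 4: half of the read is on target
]
by_reads, by_bases = on_target_metrics(reads, targets)
print(f"on-target by reads: {by_reads:.0%}, by bases: {by_bases:.0%}")
```

Reads 1 and 4 only partly overlap the target, so they count fully toward the read-based metric but contribute only half of their bases to the base-based metric. Longer fragments hang further over the target edges, which is why the base-based value is more sensitive to shear size.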
Coverage depth
Coverage represents the number of times a sequenced DNA fragment (i.e., a read) maps to a genomic target. The deeper the coverage of a target region (i.e., the more times the region is sequenced), the greater the reliability and sensitivity of the sequencing assay. Typically, the minimum depth of coverage required for genomic resequencing of diploid organisms, such as human, mouse, or rat, is 20–30X. However, different applications, labs, or bioinformatics groups may require lower or higher minimum coverage depth. I have met researchers who find coverage as low as 1–2X sufficient, while at the other end of the scale, some researchers require 500–1000X coverage of target regions; higher coverage depth allows more sensitive detection of genomic sequence variations. A good method for estimating the required depth of coverage for a particular application is to begin with 20X and divide by the expected allele frequency; e.g., for detecting mutations with 5% (0.05) allele frequency, you would need 400X coverage depth.
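The rule of thumb above (20X divided by the expected allele frequency) is simple enough to write as a small helper. The function name `required_depth` and its `base_depth` parameter are hypothetical and used only for this illustration; the result is a starting estimate, not a guarantee of detection.

```python
def required_depth(allele_frequency, base_depth=20):
    """Estimate required coverage depth as base_depth / allele_frequency.

    allele_frequency: expected variant allele frequency, as a fraction
    greater than 0 and at most 1 (for example 0.05 for 5%).
    base_depth: the 20X starting point from the rule of thumb.
    """
    if not 0 < allele_frequency <= 1:
        raise ValueError("allele_frequency must be in the range (0, 1]")
    return base_depth / allele_frequency

for af in (0.5, 0.05, 0.01):
    print(f"allele frequency {af:.0%}: ~{round(required_depth(af))}X")
```

For a 5% allele frequency this gives the 400X figure quoted above, and for a 1% allele frequency the same rule suggests roughly 2000X.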
To assess how well targets are covered, we plot % of Targets > X Coverage on the Y axis against coverage on the X axis. This data has been normalized to 1 million mapped reads, making it easier to calculate and compare the depth of coverage achieved for different platforms and levels of multiplexing. The Illumina MiSeq platform can support up to 30M reads. Protocol v2 clearly demonstrates deeper coverage across a larger range of targets, with ~93% of targets covered at 20X compared to ~86% with Protocol v1.
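A curve of this kind can be sketched from per-target mean depths. The example below assumes that normalization is a simple linear rescaling of depth to 1 million mapped reads; the exact normalization used for the figure is not described here, so treat that as an assumption. The function names and the depth values are hypothetical.

```python
def normalize_depths(mean_depths, mapped_reads, per=1_000_000):
    """Linearly rescale per-target depths to a fixed number of mapped reads."""
    scale = per / mapped_reads
    return [depth * scale for depth in mean_depths]

def percent_targets_above(depths, thresholds):
    """Percentage of targets whose depth is greater than each threshold."""
    n = len(depths)
    return {x: 100.0 * sum(d > x for d in depths) / n for x in thresholds}

# Hypothetical per-target mean depths from a run with 2 million mapped reads.
raw_depths = [10, 40, 60, 100, 200]
normalized = normalize_depths(raw_depths, mapped_reads=2_000_000)
curve = percent_targets_above(normalized, thresholds=[1, 20, 50])
for x, pct in curve.items():
    print(f"{pct:.0f}% of targets > {x}X (per 1M mapped reads)")
```

Plotting the thresholds on the X axis against the percentages on the Y axis gives the type of curve described above, and rescaling to a common read count lets runs with different sequencing output be compared directly.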
Uniformity of coverage
Uniformity can be expressed in various ways, and IDT uses several methods to calculate it. The primary method, which is applicable to the widest range of applications, is to calculate the proportions of sequences that have greater than 0.2, 0.5, and 1.0 times the mean coverage. We find this method useful for helping researchers understand the lower coverage limits, because the drawbacks of under-sequencing are greater than those of over-sequencing.
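A minimal sketch of the primary uniformity calculation follows. It assumes the input is a list of per-target mean depths; the same idea applies per base. The function name `uniformity_fractions` and the depth values are hypothetical.

```python
from statistics import mean

def uniformity_fractions(depths, multiples=(0.2, 0.5, 1.0)):
    """Fraction of targets with depth greater than each multiple of the mean."""
    avg = mean(depths)
    n = len(depths)
    return {m: sum(d > m * avg for d in depths) / n for m in multiples}

# Hypothetical per-target mean depths; their mean is 100X.
depths = [10, 40, 60, 90, 300]
fractions = uniformity_fractions(depths)
for multiple, fraction in fractions.items():
    print(f"{fraction:.0%} of targets > {multiple}x mean coverage")
```

In this toy example a single very deep target pulls the mean up, so only 20% of targets exceed the mean while 80% still exceed 0.2 times the mean. The values at the lower multiples are the ones that reveal under-covered targets.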
The other methods used at IDT for calculating uniformity of coverage are more useful for assessment of copy number variation (CNV). One method is to calculate the coefficient of variation (CV), which is the standard deviation divided by the mean. Lower numbers indicate better uniformity. This can be made more granular by calculating CV for targets grouped by GC content. We typically observe wider distributions at the extreme ends of the GC spectrum.
This figure shows uniformity statistics between the first and second versions of the xGen® Rapid Capture Protocol. Although protocol v1 provides slightly higher uniformity (which may be more important for CNV applications), protocol v2 compensates by providing deeper overall coverage.
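The coefficient of variation, overall and per GC bin, can be sketched as follows. The example assumes the population standard deviation and GC bins 10 percentage points wide; the text does not specify either choice, so both are assumptions, as are the function names and the data.

```python
from collections import defaultdict
from statistics import mean, pstdev

def coefficient_of_variation(depths):
    """CV = standard deviation / mean (population standard deviation assumed)."""
    avg = mean(depths)
    return pstdev(depths) / avg if avg else float("nan")

def cv_by_gc_bin(targets, bin_width=10):
    """Group (gc_percent, depth) pairs into GC bins and compute CV per bin.

    Returns a dict keyed by the lower edge of each bin, in percent.
    """
    bins = defaultdict(list)
    for gc_percent, depth in targets:
        # Clamp so that 100% GC falls into the last bin.
        lower_edge = min(int(gc_percent // bin_width) * bin_width,
                         100 - bin_width)
        bins[lower_edge].append(depth)
    return {edge: coefficient_of_variation(d)
            for edge, d in sorted(bins.items())}

# Hypothetical (GC percent, mean depth) values for six targets.
targets = [(25, 80), (28, 120), (45, 95), (48, 105), (72, 40), (75, 160)]
overall_cv = coefficient_of_variation([depth for _, depth in targets])
per_bin_cv = cv_by_gc_bin(targets)
print(f"overall CV: {overall_cv:.2f}")
for edge, cv in per_bin_cv.items():
    print(f"GC {edge}-{edge + 10}%: CV = {cv:.2f}")
```

The made-up data are arranged so that the mid-GC bin has the lowest CV and the high-GC bin the highest, mirroring the wider distributions typically seen at the extremes of the GC spectrum.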
Target capture reagents from IDT
xGen® Lockdown® Probes
xGen Lockdown Probes are individually synthesized, quality controlled, and normalized hybridization probes that offer:
- Sensitive detection of SNPs, indels, CNV, LOH, and translocations
- Availability for clinical and diagnostics research
- The ability to augment existing panels or create completely custom panels
- Quick delivery
Discover more about xGen Lockdown Probes.
xGen® Lockdown® Panels
xGen Lockdown Panels are preconfigured, validated, and stocked pools of xGen Lockdown Probes for targeted next generation sequencing of defined gene families:
- xGen Exome Research Panel
- xGen Acute Myeloid Leukemia Panel
- xGen Pan-Cancer Panel
- xGen Inherited Diseases Panel
- xGen Human ID Research Panel
- xGen Human mtDNA Research Panel
Discover more about xGen Lockdown Panels.
xGen® Blocking Oligos
xGen Universal Blocking Oligos for single- or dual-index adapters used with common sequencing platforms improve on-target performance for multiplexed samples by reducing adapter participation in hybridization enrichment. Custom adapters can be manufactured for other barcodes or to meet the needs of customers who require specific modifications or services to improve performance in unique applications.
Discover more about xGen Blocking Oligos.
Author: Ibrahim Jivanjee is the Product Manager for NGS at IDT.
© 2014, 2015 Integrated DNA Technologies. All rights reserved. Trademarks contained herein are the property of Integrated DNA Technologies, Inc. or their respective owners. For specific trademark and licensing information, see www.idtdna.com/trademarks.