Minimizing duplicates and obtaining uniform coverage in multiplexed target enrichment sequencing

Recommendations for pooling NGS libraries for hybridization capture to increase sample throughput and reduce cost and time

Did you know that using the right amount of starting material of pooled samples can decrease cost and increase data quality of your multiplexed NGS experiment? Follow these recommendations from IDT scientists to minimize duplicates and obtain uniform coverage in your multiplexed target enrichment sequencing experiments.

Advantages of multiplexed sequencing

The capacity of next-generation sequencing (NGS) platforms has increased at an astonishing rate. As a result, libraries are commonly pooled and sequenced together in a single run, a process known as multiplexing. To distinguish individual libraries throughout this process, sample-specific sequences, called sample indexes or sample barcodes, are added to each fragment in the library during construction. After sequencing, the barcode information is used to computationally assign the sequence reads back to the individual libraries. Multiplexing reduces the cost of sequencing substantially and facilitates experimental scalability. Libraries may also be multiplexed during hybridization and target enrichment, further driving down sample processing costs.
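
The read-assignment step can be illustrated with a minimal Python sketch. It assumes each read arrives as an (index sequence, read sequence) pair and uses exact index matching only. The index sequences and library names are hypothetical, and production demultiplexing software typically also tolerates sequencing errors in the index.

```python
from collections import defaultdict


def demultiplex(reads, index_to_library):
    """Group reads by library using each read's sample index (exact match only)."""
    by_library = defaultdict(list)
    for index_seq, read_seq in reads:
        # Reads whose index is not in the sample sheet are set aside.
        library = index_to_library.get(index_seq, "undetermined")
        by_library[library].append(read_seq)
    return dict(by_library)


# Hypothetical sample indexes and reads, for illustration only.
index_to_library = {"ACGTACGT": "library_1", "TGCATGCA": "library_2"}
reads = [
    ("ACGTACGT", "GATTACA"),
    ("TGCATGCA", "CCGGTTAA"),
    ("ACGTACGT", "TTAGGC"),
    ("NNNNNNNN", "AAAA"),
]

assigned = demultiplex(reads, index_to_library)
for library, seqs in sorted(assigned.items()):
    print(library, len(seqs))
```

Reads with an unrecognized index are kept in a separate "undetermined" group rather than discarded, so that index problems remain visible.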

Despite these advantages, multiplexed NGS also poses some challenges to end users. Sequencing experiments require multiple steps from sample preparation to final data acquisition, and each of these steps can impact final data quality. Here, we discuss 2 key metrics that are important indicators of successful multiplexed target enrichment: duplication rate and uniformity. Using what we have learned from our research, we also provide recommendations for successful execution of your multiplexed NGS experiments.

Minimizing duplicates

The duplication rate is the fraction of mapped reads that share the same 5′ and 3′ coordinates as another read. Duplicates mostly arise during PCR-based library construction. They may also arise as artifacts on the sequencing instrument, when the same template binds to multiple clusters on a flow cell and is thus amplified independently, multiple times. Both types of duplicates are an important source of error, because the resulting reads may contain mutations introduced during PCR. Duplicates can also skew allele frequencies by inflating the proportion of the allele present in the duplicated reads relative to the alternate allele. Many analysis pipelines remove PCR duplicate reads before downstream analysis to mitigate these undesired consequences and minimize potential variant calling biases. Picard (MarkDuplicates; [1]) and SAMTools (rmdup; [2]) are the 2 main software programs used for this purpose. Removing duplicates, however, excludes some of the generated sequence data, which can impact cost and data quality. Thus, minimizing duplicates in NGS experiments is critically important.
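
As a rough illustration of the metric, the following Python sketch computes a duplication rate from fragment coordinates. It is a simplified sketch, not the algorithm used by Picard or SAMTools: it assumes each mapped fragment is described by a hypothetical (chromosome, start, end, strand) tuple, and it counts every read beyond the first at a given coordinate set as a duplicate.

```python
from collections import Counter


def duplication_rate(fragments):
    """Fraction of fragments that repeat coordinates already seen.

    Each fragment is a (chromosome, start, end, strand) tuple. The first
    fragment at a coordinate set is kept; every further copy is a duplicate.
    """
    counts = Counter(fragments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / total


# Made-up coordinates: three copies of one fragment plus two unique fragments.
fragments = [
    ("chr1", 100, 250, "+"),
    ("chr1", 100, 250, "+"),
    ("chr1", 100, 250, "+"),
    ("chr1", 120, 270, "+"),
    ("chr2", 500, 640, "-"),
]

rate = duplication_rate(fragments)
print(f"duplication rate: {rate:.1%}")  # 2 of the 5 fragments are duplicates
```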

The amount of starting material that is pooled plays an important role in determining the rate of duplication in multiplexed NGS experiments. To determine the amount of barcoded library needed to minimize duplicates in multiplexed capture, we prepared 16 different libraries from Coriell DNA (NA12878) using custom, dual-matched sample index adapters (IDT) and the KAPA Hyper Prep Kit (Kapa Biosystems). We then captured 1-, 4-, 8-, and 16-plex pools with either 500 ng of total input or 500 ng of each library (Figure 1A). For example, the 16-plex captures contained either 31.25 ng of each library, totaling 500 ng per capture, or 500 ng of each library, totaling 8 µg per capture. Importantly, we did not make any other modifications to the experiments, and used the same amount of hybrid capture probes, blockers, and buffers for multiplexed captures.
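
The arithmetic behind the two pooling strategies can be sketched in a few lines of Python. The function name and the strategy labels "total" and "each" are illustrative choices, not part of any IDT protocol or software.

```python
def pool_masses(plex, strategy, amount_ng=500.0):
    """Return (ng per library, total ng per capture) for a pooling strategy.

    strategy "total": amount_ng is split evenly across all libraries.
    strategy "each":  every library contributes amount_ng.
    """
    if plex < 1:
        raise ValueError("plex must be at least 1")
    if strategy == "total":
        per_library = amount_ng / plex
    elif strategy == "each":
        per_library = amount_ng
    else:
        raise ValueError("strategy must be 'total' or 'each'")
    return per_library, per_library * plex


for plex in (1, 4, 8, 16):
    for strategy in ("total", "each"):
        per_library, total = pool_masses(plex, strategy)
        print(f"{plex:>2}-plex, 500 ng {strategy}: "
              f"{per_library:g} ng per library, {total:g} ng per capture")
```

For the 16-plex case this reproduces the figures given above: 31.25 ng per library (500 ng per capture) or 500 ng per library (8000 ng, that is 8 µg, per capture).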

As shown in Figure 1B, the duplication rate was consistent when the libraries were sequenced individually (2.4%). However, there was an increase in the duplication rate in the "500 ng total input" groups (orange circles) when the libraries were sequenced in 4-plex (4.5%) instead of 1-plex (2.0%). The rate of duplication increased substantially in the same groups when the libraries were sequenced in 8-plex (7.1% vs. 2.4%). We observed the biggest increase in duplication rate when sequencing was performed in 16-plex with the "500 ng total input" groups (13.5% vs. 2.5%). Importantly, throughout the experiments, the duplication rate remained almost constant in the "500 ng each library" groups (blue circles), whether they were sequenced individually or in multiplex (4-plex, 8-plex, or 16-plex). Based on these data, we recommend using 500 ng of each barcoded library in your multiplexing experiments to reduce PCR duplicates.

Stable duplication rate with 500 ng of each library in pool

Figure 1. Duplication rates are stable when 500 ng of each library is used for target enrichment. (A) Libraries were prepared, and 1-, 4-, 8-, or 16-plex captures were performed with the IDT xGen® AML Cancer Panel using either 500 ng of total input or 500 ng of each library and sequenced on the NextSeq® System (Illumina). (B) The duplication rate of each library was determined for each multiplexing scenario. Libraries were sequenced in separate NextSeq runs and analyzed with Picard’s HsMetrics [1].

High coverage uniformity with multiplexed captures

Sequencing coverage, or coverage depth, represents the number of times sequencing reads “map to” or “cover” a genomic target region. Coverage level determines the sensitivity of the assay. A higher level of coverage increases the possibility of and the confidence in variant discovery. The required coverage level usually depends on factors such as application type (SNPs, mutations, genomic rearrangements) and expression level of target genes (low or high expression genes for RNA-Seq). For example, low-frequency variant detection in cancer research may require anywhere from 80X to several thousand-fold coverage.

Successful targeted sequencing also requires uniform coverage across the areas of interest within the genome. In a perfect scenario, every target site would be covered at the same level, which would keep the required number of sequencing reads for every target site at the minimum. However, sequencing reads are often not distributed evenly over the target areas, necessitating extra reads to “rescue” the poorly covered regions. Thus, multiplexed capture is most effective when a high level of uniformity is obtained for variant calling.
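
The cost of uneven coverage can be made concrete with a small worked example in Python. This sketch rests on a strong simplifying assumption that is not taken from the experiments described here: that depth at every region scales linearly with the total number of reads, ignoring duplicates and library saturation. The function name, region depths, and read counts are made up for illustration.

```python
import math


def reads_to_rescue(current_reads, region_depths, target_depth):
    """Estimate reads needed so the worst-covered region reaches target_depth.

    Assumes depth at every region scales linearly with the total read count.
    """
    lowest = min(region_depths)
    if lowest <= 0:
        raise ValueError("a region with zero coverage cannot be rescued by scaling")
    scale = max(1.0, target_depth / lowest)
    return math.ceil(current_reads * scale)


# Two made-up panels with the same mean depth (100X) but different uniformity.
uniform = reads_to_rescue(1_000_000, [100, 100, 100, 100], target_depth=100)
uneven = reads_to_rescue(1_000_000, [25, 100, 125, 150], target_depth=100)

print("uniform panel:", uniform, "reads")
print("uneven panel: ", uneven, "reads")
```

Under this assumption, the uneven panel needs four times as many reads as the uniform one to bring its weakest region up to the same target depth, even though both have the same mean coverage.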

To determine whether using 500 ng of each library in multiplexing experiments provides uniform coverage, we examined the per-base target coverage in the previously described experiment. As seen in Figure 2, target coverage was highly uniform, regardless of the number of samples multiplexed. In all 4 experimental groups, 98.2% of bases were covered at least 20X. An average of 94.8% of the bases were covered at least 100X. Coverage reached nearly 200X for 61.8% of bases and 300X for 23.6% of bases.
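
The kind of summary reported here, the fraction of target bases covered at or above a given depth, can be sketched in Python as follows. The depth values are made up for illustration and are not data from this experiment. In practice these figures come from a tool such as Picard rather than hand-written code.

```python
def fraction_covered(depths, thresholds):
    """Fraction of target bases with depth at or above each threshold."""
    total = len(depths)
    if total == 0:
        raise ValueError("no target bases supplied")
    return {t: sum(1 for d in depths if d >= t) / total for t in thresholds}


# Hypothetical per-base depths across a tiny 10-base target.
depths = [15, 120, 210, 310, 95, 180, 250, 400, 22, 130]

result = fraction_covered(depths, thresholds=(20, 100, 200, 300))
for threshold, fraction in result.items():
    print(f">= {threshold}X: {fraction:.0%} of target bases")
```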

High coverage uniformity with 500 ng of each library in pool

Figure 2. Multiplexed libraries yield high coverage uniformity when 500 ng of each library is pooled for target enrichment. Libraries were prepared as described in Figure 1, and per-base target coverage [bases covered at >X(%)] was calculated for the "500 ng of each library" group using Picard’s HsMetrics [1]. High coverage uniformity from multiplexed libraries provides high target coverage for variant calling with minimal sequencing, when using an input of 500 ng per library.

These results suggest that pooling of 500 ng per library, captured using IDT xGen Lockdown® Probes, provides high coverage uniformity and high target coverage for variant calling with minimal sequencing in multiplexed NGS experiments.

Other considerations

Several other factors, including sample quality, PCR conditions, panel size, and the number of samples multiplexed, should also be evaluated carefully in your experiments.

Our scientific application specialists are available to answer further questions or provide guidance on sample multiplexing for your NGS experiments. Contact them at applicationsupport@idtdna.com.

References

  1. Picard’s HsMetrics, https://broadinstitute.github.io/picard/picard-metric-definitions.html [Accessed 8 Mar, 2018].
  2. SAMTools, http://samtools.sourceforge.net/ [Accessed 8 Mar, 2018].

Published Mar 20, 2018
