Use of Targeted Solution Hybrid Capture, Hamming Barcoding, and NGS
Genomic Components of Fetal Growth Restriction
The laboratory of Dr Jorge Piedrahita (Genomics & Molecular Biomedical Sciences, North Carolina State University) studies several dangerous pregnancy-related conditions, including fetal growth restriction (FGR). FGR is commonly defined as a fetal weight at or below the 10th percentile for gestational age, and it is the second leading cause of perinatal death. To better understand the biological pathway(s) causing this condition and the genomic elements responsible, the group used gene expression profiling studies to identify a gene associated with FGR, as well as several single nucleotide polymorphisms (SNPs) within that gene.
Exhaustive Sequencing of a 500 kb FGR-Associated Genomic Region
Dr Shengdar Tsai, a graduate student in the Piedrahita laboratory at the time, characterized the genetic architecture of this region and the polymorphisms within it by exhaustively sequencing a 500 kb area surrounding the gene using high throughput (HT) next generation sequencing (NGS) technology. This study, and his addition of Hamming error-correcting barcodes, which had previously been applied to the 454 sequencing platform, are described here.
Targeted Solution Hybrid Capture, Barcoding, and NGS
Tsai used 55 human placenta samples selected from a cohort of 89 live births that had undergone gene expression analysis for FGR based on previously identified biomarkers for this condition. The 55 patient samples used in this preliminary study represented a mixture of controls and women with FGR. The experimental workflow is outlined in Figure 1.
Figure 1. Preparation and Sequencing of Genomic Region Associated with Fetal Growth Restriction. Workflow for processing genomic DNA samples for high throughput massively parallel sequencing. Genomic DNA from each of 55 patients was randomly sheared using a Covaris S instrument, PCR adaptors were ligated onto the ends, and the fragments were amplified. The subset of fragments containing a 500 kb region of a gene associated with fetal growth restriction was isolated from the pool of sheared genomic DNA by solution hybridization using biotinylated RNA oligonucleotides (Agilent). These ultra-long 200mers consisted of a target-specific 170mer sequence flanked by 15 bases of universal primer sequence for subsequent PCR amplification. After the initial PCR, a T7 promoter was added in a second round of PCR, and in vitro transcription was performed in the presence of biotin-UTP to generate the single-stranded RNA hybridization probes. Excess single-stranded non-probe complementary RNA drives the hybridization. The genomic fragment–bound biotinylated RNA complexes were pulled down with streptavidin-coated magnetic beads, PCR amplified with universal primers incorporating the Hamming error-correcting barcodes (IDT), and analyzed on a next generation sequencing instrument. The sequencing targets are shown in red, with the asymmetrically added universal adaptor sequences in white. The Hamming barcode (index) is shown as a black bar over the adaptor.
Genomic DNA was sheared using a Covaris S instrument. Long biotinylated oligonucleotides (Agilent) complementary to the target region were then used to capture and enrich for sequences in the 500 kb target region of the FGR-associated gene. The resulting trace amounts of fragments from each sample were amplified, and each set was tagged at the end with primer/adapter oligonucleotides containing barcode sequences (IDT).
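The probe layout given in the Figure 1 legend (a 170-base target-specific sequence flanked by 15 bases of universal primer on each side, 200 bases in total) can be sketched in Python as follows. This is an illustration only: the end-to-end tiling with no overlap, the placeholder primer strings, and the function names `tile_region` and `build_oligo` are assumptions made for the example, since the article does not state the tiling density or the primer sequences that were used.

```python
TARGET_LEN = 170   # target-specific portion of each oligo (from the figure legend)
PRIMER_LEN = 15    # universal primer sequence on each side (from the figure legend)


def tile_region(region_len, target_len=TARGET_LEN, step=TARGET_LEN):
    """Return 0-based start positions of target windows covering the region.

    Assumption: windows are laid end to end (step == target_len); a final
    window is added flush with the region end if a remainder is left over.
    """
    starts = list(range(0, region_len - target_len + 1, step))
    if starts and starts[-1] + target_len < region_len:
        starts.append(region_len - target_len)
    return starts


def build_oligo(region_seq, start, left_primer, right_primer,
                target_len=TARGET_LEN):
    """Assemble primer + target-specific sequence + primer for one window."""
    target = region_seq[start:start + target_len]
    return left_primer + target + right_primer


# Hypothetical placeholders, NOT the universal primers used in the study.
LEFT_PRIMER = "A" * PRIMER_LEN
RIGHT_PRIMER = "T" * PRIMER_LEN

if __name__ == "__main__":
    n_windows = len(tile_region(500_000))
    print(f"{n_windows} end-to-end 170-base windows cover a 500 kb region")
```

Under these assumptions each assembled oligo is 15 + 170 + 15 = 200 bases long, matching the 200mer length stated in the legend. A real design would typically also handle repeats and strand choice, which this sketch ignores.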
Barcodes: Unique short DNA identification sequences added to the DNA library fragments of a particular sample so that multiple samples, each tagged with a different barcode, can be mixed (in a multiplex reaction), sequenced simultaneously, and then separated out for analysis. These sequences are usually 6−12 bases long.
Hamming Error-Correcting Barcodes: Barcodes are usually added to the beginning of DNA fragments to be sequenced. These barcodes are chosen so that they are difficult to confuse with one another, allowing correct sample assignments to be made even when there are a few errors. Hamming error-correcting barcodes are able to correct sample assignments with single errors. More complex types of error-correcting barcodes, e.g., those based on Levenshtein distance, can correct up to 3 errors.
Next Generation Sequencing (NGS): The simultaneous sequencing of all DNA molecules within one or multiple samples (“massively parallel” or “parallelized” sequencing). NGS can have a throughput of >1 billion bases/day (compared to the ~1,000 bases/day of Sanger sequencing). Among its many applications, this new sequencing paradigm makes possible the analysis of transcriptomes as they change with development, metabolism, and disease.
Nonsynonymous coding mutations: SNPs or other mutations that change the amino acid sequence of the encoded polypeptide.
SNPs (Single Nucleotide Polymorphisms): Single nucleotide variations in the coding or noncoding region of a genome, seen when comparing paired chromosomes of an individual or sequences from members of the same species.
Hamming Barcodes Used for Multiplex Sequencing and Error Correction
Including the barcodes was critical because the samples were to be mixed for multiplex sequencing, and the resulting reads would later have to be deconvoluted and assigned back to their specific starting samples. The IDT primer/adapter tags incorporated unique Hamming error-correcting barcodes. These sequences are designed to correct for sequencing errors, which occur with higher frequency in high throughput next generation sequencing. While this technology had previously been applied to 454 sequencing platforms, this was one of the first adaptations of Hamming barcode tags to create libraries for sequencing on an Illumina platform.
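The sample assignment step described here can be sketched in Python as a nearest-barcode lookup. This is a minimal illustration, not the pipeline used in the study: the three 6-base barcodes and the sample names are invented, and the sketch assumes a barcode set whose members differ from one another at 3 or more positions, so that a read barcode carrying a single substitution still has exactly one closest match.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read_barcode, barcode_to_sample, max_errors=1):
    """Return (sample, n_errors) for the closest known barcode.

    Returns (None, None) when no barcode is within max_errors substitutions,
    or when two barcodes tie for closest (the read cannot be assigned).
    """
    best_sample = None
    best_dist = max_errors + 1
    tie = False
    for barcode, sample in barcode_to_sample.items():
        d = hamming_distance(read_barcode, barcode)
        if d < best_dist:
            best_sample, best_dist, tie = sample, d, False
        elif d == best_dist and d <= max_errors:
            tie = True
    if best_sample is None or tie:
        return None, None
    return best_sample, best_dist


# Hypothetical barcodes and sample names, for illustration only.
EXAMPLE_BARCODES = {
    "ACGTAC": "placenta_01",
    "TGCATG": "placenta_02",
    "GATCCA": "placenta_03",
}

if __name__ == "__main__":
    for observed in ("ACGTAC", "ACGTAA", "AAGTAA"):
        print(observed, assign_sample(observed, EXAMPLE_BARCODES))
```

In this toy set, an exact match and a read with one miscalled base are both assigned to the same sample, while a read with two miscalled bases is left unassigned rather than guessed.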
Dr Tsai notes, “We were really happy with the performance of our barcoded selections. Some barcodes used previously for multiplex NGS have performed fairly unevenly with variability between 5 and 24X, whereas the barcodes that we picked performed better with no more than a 2X difference between samples. By constraining our barcode choices to have at least one of each base and restricting the number of GC repeats, our set of 55 barcodes performed much more evenly. This means that we didn’t need to oversample as much to overcome errors introduced by the barcodes.” A list of the 55 Hamming barcodes used in this study is available as a spreadsheet (xls).
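The two composition constraints mentioned in the quote (at least one of each base, and a limit on GC repeats) can be combined with a distance requirement in a short Python sketch. Several things here are assumptions rather than details from the study: the 8-base length, the reading of "GC repeats" as the longest run of consecutive G or C bases, the limit of 2 on that run, and the greedy selection itself. The greedy filter only enforces a pairwise Hamming distance of at least 3; it is not the Hamming codeword construction described by Hamady et al., and it will not reproduce the 55 barcodes actually used.

```python
from itertools import product

BASES = "ACGT"


def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def longest_gc_run(seq):
    """Length of the longest stretch of consecutive G or C bases."""
    longest = run = 0
    for base in seq:
        run = run + 1 if base in "GC" else 0
        longest = max(longest, run)
    return longest


def passes_composition(seq, max_gc_run=2):
    """Require all four bases and a bounded run of G/C (assumed rule)."""
    return set(seq) == set(BASES) and longest_gc_run(seq) <= max_gc_run


def pick_barcodes(length=8, n_wanted=55, min_distance=3, max_gc_run=2):
    """Greedily collect barcodes that pass the filters and stay far apart."""
    chosen = []
    for candidate in map("".join, product(BASES, repeat=length)):
        if not passes_composition(candidate, max_gc_run):
            continue
        if all(hamming_distance(candidate, c) >= min_distance for c in chosen):
            chosen.append(candidate)
            if len(chosen) == n_wanted:
                break
    return chosen


if __name__ == "__main__":
    barcodes = pick_barcodes()
    print(len(barcodes), "barcodes, first five:", barcodes[:5])
```

A minimum pairwise distance of 3 is what allows a single substitution to be corrected: a barcode with one error is still at distance 1 from its true source and at distance 2 or more from every other member of the set.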
“Barcodes are fundamental to multiplex NGS. However, they also introduce error, both at the primer synthesis level and during the sequence reads. That’s where the Hamming error correction barcodes come in. If 4% of the barcodes have some sort of error, we can correct for 2%, leaving just a 2% barcoding error. Thus, we can read 98% of the barcodes, and that’s very reasonable…” — Dr Tsai
The Sequence Data
The primer/adapter-tagged libraries were then pooled in equimolar amounts and sequenced on an Illumina Genome Analyzer IIx (GAIIx). After obtaining sequence data, the group was able to deconvolute the sequences, assign them to their original samples, and correct for any errors, of which there were very few (only 1−2%). For each sample, the researchers obtained 500- to 1500-fold coverage of the entire targeted 500 kb region, much more than is needed to confidently identify any SNPs within the sequenced area.
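The coverage figures translate into sequencing yield through the usual relation coverage = on-target bases / target size. The Python sketch below works through that arithmetic for the 500 kb target. The read length and on-target fraction in the usage example are hypothetical values chosen for illustration, as the article does not report them, and the 3 Gb genome size is a round-number assumption.

```python
import math

TARGET_BP = 500_000   # size of the captured region (from the article)
GENOME_BP = 3.0e9     # approximate human genome size (round-number assumption)


def on_target_bases(coverage, target_bp=TARGET_BP):
    """Bases that must map to the target to reach a given mean coverage."""
    return coverage * target_bp


def reads_needed(coverage, read_length, on_target_fraction,
                 target_bp=TARGET_BP):
    """Reads required per sample, given read length and capture efficiency.

    read_length and on_target_fraction are inputs the article does not
    report; any values passed in are the caller's assumptions.
    """
    return math.ceil(coverage * target_bp / (read_length * on_target_fraction))


def target_percent_of_genome(target_bp=TARGET_BP, genome_bp=GENOME_BP):
    """Size of the target as a percentage of the whole genome."""
    return 100 * target_bp / genome_bp


if __name__ == "__main__":
    for cov in (500, 1500):
        print(f"{cov}x coverage = {on_target_bases(cov) / 1e6:.0f} Mb on target")
    # Hypothetical run parameters: 75-base reads, half of reads on target.
    print(reads_needed(500, read_length=75, on_target_fraction=0.5), "reads")
    print(f"target is {target_percent_of_genome():.3f}% of the genome")
```

By this arithmetic, 500- to 1500-fold coverage of a 500 kb region corresponds to 250 to 750 Mb of on-target sequence per sample.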
During sequence data analysis, Dr Tsai noted a trend of multiple, novel, nonsynonymous coding mutations in this gene putatively associated with FGR. Some could potentially have deleterious effects, as they would change the amino acid sequence of the encoded protein. Given the small sample number for this preliminary experiment, Dr Tsai hopes to extend this study with more patient data in the future to see if this trend can be validated. However, this preliminary work has been critical for development of the technical workflow: it demonstrated successful solution hybrid capture and the application of Hamming barcodes to targeted sequencing of a region comprising 0.017% of the human genome.
As part of his doctoral research at North Carolina State University, Shengdar Tsai focused on FGR and preeclampsia as phenotypes, using genomic approaches to try to unravel the etiology of those pregnancy-associated disorders. Techniques included gene expression profiling, genetic association, and, as discussed here, NGS approaches to take a deeper look into the genetic architecture of this FGR-associated gene region. Having completed his PhD, Dr Tsai has since moved on to Massachusetts General Hospital/Harvard University, where he works on genome engineering with engineered zinc finger and TAL nucleases as a postdoctoral research fellow.
- Bernstein I, Gabbe SG. Intrauterine growth restriction. In: Gabbe SG, Niebyl JR, Simpson JL, Annas GJ, et al., eds. Obstetrics: normal and problem pregnancies. 3d ed. New York: Churchill Livingstone, 1996:863−886.
- Tsai S, Hardison NE, James AH, Motsinger-Reif AA, Bischoff SR, Thames BH, Piedrahita JA (2011). Transcriptional profiling of human placentas from pregnancies complicated by preeclampsia reveals disregulation of sialic acid acetylesterase and immune signalling pathways. Placenta, 32(2):175−182.
- Gnirke A, Melnikov A, et al. (2009). Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nature Biotech, 27(2):182−189.
- Hamady M, Walker JJ, et al. (2008). Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nature Meth, 5(3): 235−237.