A Case Study: Barcode Crosstalk Can Severely Affect Next Generation Sequencing Results
The GenePool is a next generation sequencing (NGS) and bioinformatics facility within the University of Edinburgh. As such, its scientists both produce and analyze data. The Facility works mostly with academic labs in Scotland and the rest of the UK, but also has a number of collaborations overseas. We interviewed Dr Karim Gharbi and Dr Pablo Fuentes-Utrilla to gain insight into research at The GenePool and discuss how IDT oligonucleotides with TruGrade processing improve their sequencing data.
When researchers contact The GenePool with a research proposal, scientists at the Facility suggest experimental designs to address the research question, determine which sequencing platform is best suited to the project, and recommend the most appropriate data analysis plan. In most cases, The GenePool helps researchers from conception to publication; i.e., Facility members may help the researcher set up their project, secure funding, produce and analyze the data, and write up the results for publication.
Barcode Cross Contamination
With the increasing capacity of NGS instruments, the Facility is performing multiplexed sequencing runs with an ever-increasing number of samples. A technique they are employing to facilitate this is restriction site–associated DNA (RAD) sequencing (Figure 1), which subsamples a portion of the genome for analysis. Dr Gharbi describes this as reduced representation genome sequencing, which allows them to screen or sample many individuals at relatively low cost. However, it is important to realize that the technique is only effective if the scientists can accurately barcode the libraries that they create so that sample-specific information can be retrieved after the samples have been pooled for sequencing. When performing multiplexed RAD sequencing, scientists at The GenePool currently use 8 nt barcodes that differ by a minimum of 4 bases.
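The minimum-distance requirement on a barcode set can be checked with a few lines of code. The following is a minimal sketch, not The GenePool's own tooling: the four barcodes are invented for illustration, and the function names are hypothetical. It computes the smallest pairwise Hamming distance (number of differing positions) across a set of equal-length barcodes.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


# Hypothetical 8 nt barcodes, invented for illustration only.
barcodes = ["ACGTACGT", "ACGTTGCA", "TGCAACGT", "TGCATGCA"]

min_dist = min_pairwise_distance(barcodes)
# The set is acceptable under the rule in the text if min_dist >= 4.
set_is_acceptable = min_dist >= 4
```

A set that passes this check tolerates sequencing errors in the barcode, but it offers no protection against an adapter stock that physically contains the wrong barcode sequence.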
Figure 1. RAD Sequencing Library Preparation Workflow. (1) DNA from different samples (S1 and S2) is digested with restriction enzymes that leave a 4-base overhang. (2) P1 adapters with a molecular identifier (MID) unique to each sample (i.e., barcode) and complementarity to the overhang are ligated to the digested DNA. (3) The P1-barcoded DNA samples are pooled. (4) The pooled library is sheared to generate target sizes of 300–700 bp. (5) Library DNA is size-selected to retain fragments of 300–700 bp and to eliminate adapter dimers. (6) Y-shaped P2 adapters are ligated onto the library DNA. (7) The library is enriched by PCR; the combination of P2 adapter design and correct PCR primers ensures that only fragments that have both P1 and P2 adapters will be amplified. (8) The enriched RAD library is sequenced on the Illumina platform. Each 100 bp read will begin with the sample MID, followed by the restriction site overhang (r.s.), and then the associated DNA sequence. Bioinformatics tools are used to separate the DNA fragments based on the sample MID. It is, therefore, extremely important that there is no cross-contamination in the P1 adapters.
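The read layout described in Figure 1 (MID, then restriction site overhang, then genomic sequence) is what makes demultiplexing possible. The sketch below illustrates the idea under stated assumptions: the barcodes, the overhang sequence "TGCA", the reads, and the function name are all hypothetical, and only exact barcode matches are accepted. It is not the software used at The GenePool.

```python
BARCODE_LEN = 8
OVERHANG = "TGCA"  # hypothetical overhang; the real one depends on the enzyme used


def demultiplex(reads, barcode_to_sample, overhang=OVERHANG):
    """Assign each read to a sample by its leading MID (exact match only).

    Returns (bins, unassigned): bins maps sample name to the list of
    genomic sequences with MID and overhang removed; unassigned holds
    reads whose MID is unknown or whose overhang is missing.
    """
    bins = {sample: [] for sample in barcode_to_sample.values()}
    unassigned = []
    for read in reads:
        mid = read[:BARCODE_LEN]
        rest = read[BARCODE_LEN:]
        sample = barcode_to_sample.get(mid)
        if sample is not None and rest.startswith(overhang):
            bins[sample].append(rest[len(overhang):])
        else:
            unassigned.append(read)
    return bins, unassigned


# Hypothetical barcodes and reads, invented for illustration only.
barcode_to_sample = {"ACGTACGT": "S1", "TGCATGCA": "S2"}
reads = [
    "ACGTACGT" + "TGCA" + "GGATTC",  # valid S1 read
    "TGCATGCA" + "TGCA" + "CCTAAG",  # valid S2 read
    "ACGTACGA" + "TGCA" + "GGATTC",  # one mismatch in the MID: left unassigned
]

bins, unassigned = demultiplex(reads, barcode_to_sample)
```

Note that a read carrying a perfectly valid but wrong MID, as would result from a contaminated P1 adapter, passes this step without complaint and lands in the wrong sample's bin. Software cannot repair that after the fact, which is why adapter purity matters.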
During recent data analysis, bioinformatics scientists at The GenePool Facility realized that some of the data produced by this method exhibited alarming levels of cross-contamination (crosstalk) between the barcodes; e.g., for Sample X labeled with Barcode #1 and Sample Y labeled with Barcode #2, they observed substantial levels of Barcode #2 in Sample X and Barcode #1 in Sample Y. This was unacceptable, because the data must be robustly separated by sample to make accurate genotype calls. Dr Gharbi explains that when they examine differences between populations, for example, poor separation of genomes will bias allele frequencies and make populations appear more similar than they really are. If they were investigating segregation of genotypes from parents to offspring, such as when identifying the genetic basis for a particular trait, they would obtain incorrect genotypes, which can lead to unreliable results. Crosstalk in such experiments can have multiple origins, but Drs Gharbi and Fuentes-Utrilla quickly suspected cross-contamination of the oligonucleotides used for barcoding.
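The effect on allele frequencies can be illustrated with a simple mixing model. This is a sketch under assumptions that do not come from The GenePool's data: it supposes that a fixed fraction of the reads assigned to each population actually originates from the other, and the frequencies and the 5% crosstalk rate are invented for illustration.

```python
def observed_frequency(p_own: float, p_other: float, crosstalk: float) -> float:
    """Allele frequency seen in a sample when a fraction `crosstalk`
    of its assigned reads really comes from the other sample."""
    return (1 - crosstalk) * p_own + crosstalk * p_other


# Illustrative values only: true allele frequencies in populations X and Y,
# and an assumed symmetric crosstalk rate.
p_x, p_y = 0.9, 0.1
c = 0.05

obs_x = observed_frequency(p_x, p_y, c)
obs_y = observed_frequency(p_y, p_x, c)

true_difference = p_x - p_y
observed_difference = obs_x - obs_y  # equals (1 - 2c) * true_difference
```

With these numbers the observed frequencies move toward each other (to about 0.86 and 0.14), so the apparent difference between the populations shrinks by a factor of (1 - 2c). This is the sense in which crosstalk makes populations look more similar than they are.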
Drs Gharbi and Fuentes-Utrilla discovered their barcoding contamination rather serendipitously. It so happened that in placing an order with their previous supplier for oligonucleotide barcodes for their NGS experiment, the plate setup the researchers used made the cross-contamination obvious. With this discovery, The GenePool scientists determined that neither HPLC- nor PAGE-purified oligos from that supplier were of sufficient purity for the sensitive multiplex NGS that the Facility performs.
Drs Gharbi and Fuentes-Utrilla are concerned that the problem of barcode crosstalk may be pervasive in NGS. However, in their opinion, this issue has not received sufficient attention. In RAD sequencing, barcode crosstalk is a particular problem due to the method of library preparation, as wrongly allocated sequencing reads can significantly mislead subsequent analyses. RAD sequencing libraries are amplified from a subset of the genome (Figure 2); therefore, cross-contamination in the adapters poses more of a problem.
Figure 2. Amplified RAD Libraries. The gel picture shows RAD libraries before (lanes 1 and 3) and after (lanes 2 and 4) PCR enrichment. Note that the average size of the sequences in the enriched libraries is approximately 100 bp greater than in the pre-PCR libraries. These additional bases are clustering sequences for the Illumina flowcell that are incorporated during PCR amplification. Lane L, ladder.
Introduction to TruGrade™ Processing
Although IDT is relatively new in Europe, Dr Gharbi and colleagues knew that large biotechnology companies such as Roche recommend IDT oligonucleotides for their next generation sequencing applications. As a result, researchers at The GenePool decided that IDT oligos were worth a try. The scientists placed an order for IDT oligos processed using the TruGrade™ service (see sidebar, TruGrade Processing Service), using exactly the same criteria that they had used for their order with their previous supplier. With the IDT reagents in hand, they then carried out an identical experiment. The resulting data were very clean, with no apparent crosstalk between barcodes.
Concern about Oligonucleotide Purity for NGS
Dr Fuentes-Utrilla is keen to emphasize the importance of oligonucleotide purity for NGS. He thinks that whereas the issue is often irrelevant for Sanger sequencing, multiplex NGS requires pure, high quality oligonucleotides. Most researchers take into account n+/-1 nucleotides when they consider oligonucleotide purity, and assume that there will be no cross-contamination. However, due to the high throughput methods used to manufacture oligos and the sensitivity of NGS data analysis, small amounts of crosstalk that are innocuous in other applications can be detected here. For example, a small amount of crosstalk is not a problem in most polymerase chain reaction (PCR) applications because the correct, more abundant oligonucleotide is favored during amplification. But in NGS the incorrect amplification products are likely to be sequenced as well.
According to Dr Fuentes-Utrilla, “When we order a specific oligonucleotide, we expect this sequence, not a mixture of sequences. You assume that when you order one sequence you will get that sequence and nothing else. It is important to emphasize that for NGS this can be a huge issue, and a mixture is what we observed with our barcodes from our previous supplier. The presence of crosstalk is very worrying and I don’t think many scientists using barcoded adapters are aware of or testing for this, which would be very expensive. You have to run lanes on a sequencer at several thousands of pounds (GB£) per run. And that is just for checking the quality of an oligo, which you would expect the oligo company to have done for you because that is what you pay for.”
“It may look like a technical detail where we source our oligos from, but it makes a huge difference to the quality of the data at the other end. It basically makes the difference between high quality data and a dataset that we can’t publish because we don’t trust it.”
– Dr Karim Gharbi
Drs Gharbi and Fuentes-Utrilla have now switched to the TruGrade Processing Service for their NGS barcoding oligos.
Author: Nicola Brookman-Amissah, PhD, is a Scientific Writer at IDT.
TruGrade™ Processing Service
The TruGrade™ Processing Service is a proprietary production process that reduces the risk of oligonucleotide crosstalk during multiplex next generation sequencing applications. Oligo crosstalk is a potential cause of barcode misalignment, which can lead to inaccurate conclusions by associating sequencing data with the wrong sample.
Oligonucleotides manufactured using the TruGrade service are suitable for:
- Sample preparation using barcoded adapters for multiplex sequencing
- PCR using barcoded fusion primers for multiplex amplicon sequencing
About The GenePool
The GenePool is a leading next generation genomics facility based in the Institute of Evolutionary Biology in the School of Biological Sciences of the University of Edinburgh. Using high throughput sequencing instrumentation and high-end computing facilities, they deliver collaborative access to cutting edge genomics tools to the academic community. The GenePool is a recognised high throughput sequencing facility for two UK research councils, the Medical Research Council (MRC) and the Natural Environment Research Council (NERC), and forms part of a collaboration between three genomics facilities in Edinburgh. The GenePool supports a wide range of projects and applications, from medical resequencing to de novo sequencing of ecologically important organisms, such as worms and butterflies. See http://genepool.bio.ed.ac.uk for more information on the Facility.
Dr Karim Gharbi (top) is the Scientific Manager of The GenePool. His research interests include the evolution and transmission of genomes, big and small.
Dr Pablo Fuentes-Utrilla (bottom) is a postdoc in Prof Mark Blaxter’s lab at the University of Edinburgh and a collaborator of The GenePool. He is currently working on population genetics and the implementation of genotyping-by-sequencing technologies in plant breeding programs.