Insertion Site Detection and Targeted RNA Capture Using Next Generation Sequencing

Cofactor Genomics uses xGen® Lockdown® Probes for Targeted Sequencing

The exponentially decreasing cost of next generation sequencing (NGS) has made this technology readily available to research labs around the world. However, experimental design and data analysis continue to be a challenge for many researchers, especially those with little or no NGS expertise. The need for more robust analysis pipelines has become evident, resulting in the emergence of more NGS services companies. Cofactor Genomics (St Louis, MO, USA), an NGS-focused contract research organization (CRO), provides comparative SNP analysis, RNA-seq, and de novo genome assembly services, among others. Several of these services require the use of target enrichment techniques to direct sequencing capacity to smaller sections of the genome. In this article, Cofactor Genomics Chief Operating Officer, Jon Armstrong, discusses two such applications: a) insertion site detection and b) targeted RNA capture and subassembly of transcripts.

Insertion Site Detection

Insertion site detection allows researchers to test assumptions and answer questions like:

  • “Is there really one insert per genome?” and “If not, what is the distribution?”
  • “Where in the host genome did the insert go?”
  • “Is the insertion site randomly selected?”
  • “Did my insert disrupt any functions within my host genome?”
  • “Is the insert in an intergenic region or within an intron or exon?”

Scientists at Cofactor use xGen® Lockdown® Probes (see Product Focus box) almost exclusively for insertion site detection projects because they run many samples, which amortizes the cost of the probe set.

The Process: Insertion site detection is a powerful tool for identifying where exogenous DNA has been incorporated into a host genome. To detect insertion sites, scientists at Cofactor first design Lockdown Probes targeting the exogenous DNA. They then fragment the host DNA containing the insert, create a library of the DNA fragments by ligating adapters, and hybridize the xGen Lockdown Probes to the library. The insert will be contained in that DNA mixture; however, the sequence of the insert itself is unimportant in most cases. It serves only as a hook for designing the capture probes. The scientists are interested in the regions of host sequence abutting the insert sequence. Among the randomly fragmented genomic DNA are fragments that are insert:genomic DNA chimeras, and these are the sequences that the scientists look for to determine the insertion sites (Figure 1).

Figure 1. Workflow for Insertion Site Detection.

Researchers often assume that this method can be directly applied to translocation detection. However, Mr Armstrong distinguishes insertion site detection from identifying translocations by noting, “Insertion site detection is easy to work on because you know the sequence of the insert. With translocation detection you might have to capture and sequence large regions of DNA before you perform the alignment to determine where the translocation has occurred.”

Data Analysis: Cofactor has developed a dedicated insertion site analysis pipeline that analyzes all of the sequencing data to determine the most likely areas where an insert is incorporated into the host genome (Figure 2). Once samples are received from a client, Cofactor can perform the complete service in 8–10 weeks and deliver a ranked list of the most likely areas in the genome for the insertion, as well as FASTA sequences for those regions. The site of insertion can then be confirmed, by Cofactor or by the client, using PCR to amplify across the insert:genomic DNA junction.

Figure 2. Insertion Site Detection Using xGen® Lockdown® Probes. This example shows an alignment output produced by Cofactor Genomics’ insertion site detection analysis pipeline, visualized in the UCSC Genome Browser. Paired-end, 100 nt sequences were generated from genomic and insert fragments enriched using xGen Lockdown Probes, and the reads were aligned to the host genome. A read pair is classified as “orphaned” when one read of the pair aligns to the host genome and the other aligns to the insert sequence. Orphan reads are represented by overlapping horizontal lines: forward orphan reads are shown in red and reverse orphan reads are shown in blue. Areas of the genome with coverage greater than the calculated mean depth of coverage and with orphan reads in the correct orientation are considered for insertion site confirmation by PCR. The black line represents the proposed insertion site location for this sample.
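The orphan-pair criterion described in the Figure 2 caption can be illustrated with a short Python sketch. This is not Cofactor's pipeline, whose internals are not described in this article. The `Alignment` record, the 1 kb binning, and the rule that a candidate bin must contain host reads on both strands (a crude stand-in for "correct orientation") are all assumptions made for illustration.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    """Hypothetical, simplified alignment record for one read of a pair."""
    ref: Optional[str]    # "host", "insert", or None if the read did not align
    chrom: Optional[str]  # host chromosome name (None for insert hits)
    pos: Optional[int]    # 0-based leftmost aligned position
    forward: bool         # True if the read aligned to the forward strand


def host_mate_of_orphan(r1, r2):
    """Return the host-aligned read if one mate hits host and the other hits insert."""
    if {r1.ref, r2.ref} != {"host", "insert"}:
        return None
    return r1 if r1.ref == "host" else r2


def candidate_sites(pairs, bin_size=1000):
    """Rank host bins by orphan-read support, keeping bins seen on both strands."""
    fwd, rev = Counter(), Counter()
    for r1, r2 in pairs:
        host = host_mate_of_orphan(r1, r2)
        if host is None:
            continue  # ordinary pair, insert-only pair, or unaligned mate
        key = (host.chrom, host.pos // bin_size)
        (fwd if host.forward else rev)[key] += 1
    # Assumed orientation filter: require forward and reverse orphan reads.
    both = [(key, fwd[key] + rev[key]) for key in fwd if key in rev]
    return sorted(both, key=lambda item: item[1], reverse=True)


pairs = [
    (Alignment("host", "chr1", 1200, True), Alignment("insert", None, 10, False)),
    (Alignment("insert", None, 50, True), Alignment("host", "chr1", 1800, False)),
    (Alignment("host", "chr2", 500, True), Alignment("host", "chr2", 800, False)),
    (Alignment("host", "chr3", 100, True), Alignment("insert", None, 5, False)),
]
print(candidate_sites(pairs))
```

In this toy input only the chr1 bin is reported, because it is the only region supported by orphan reads on both strands. A real analysis would also compare coverage against the mean depth, as the caption describes, before selecting sites for PCR confirmation.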

Advantage of Using Target Capture: Because the insert is introduced randomly into the host genome, it is impossible to design primers targeted to the genome upstream and downstream of the insert; therefore, standard PCR cannot be used to detect the insertion site. Use of target capture and NGS provides an easier workflow that also allows higher throughput. Additionally, the hybridization procedure enables capture of regions that are unique to the insert, while allowing the scientists to separate random noise from non-random signal by sequencing to coverage above the noise threshold. Using xGen Lockdown Probes also helps Cofactor anticipate the cost of a project, because they know the sequence coverage required and exactly how many probes will be needed.

Targeted RNA Capture

NGS technologies are able to determine the nucleic acid sequence of both DNA and RNA. However, unlike DNA, which is a somewhat static molecule, RNA transcription is dynamic, tissue specific, and often contains a temporal component. Thus, Cofactor considers RNA-seq applications akin to snapshots taken at a given moment in time as opposed to whole exome or genome sequencing, which exhibits more stability. Recently, considerable attention has been focused on the role of transcript isoform function. Isoforms from the same gene can have significantly different structures from one another and, therefore, significantly different functions. Cofactor has developed a molecular approach, using xGen Lockdown Probes, and a specific analysis pipeline, to help clients elucidate specific isoform structures of interest.

The Process: Scientists at Cofactor perform RNA isoform assembly of specific genes using xGen Lockdown Probes and the workflow shown in Figure 3. The sequencing reads containing the targeted transcripts are assembled using Cofactor’s transcript assembly pipeline. After the structures of the isoforms are determined from the assembled transcripts, they can be added to the gene reference for comparative expression and further interrogation. Aligning RNA-seq data to a gene (transcriptome) reference is similar to a microarray experiment, but provides much greater dynamic range and is hypothesis neutral. Data can also be aligned to a genome reference to identify novel or unannotated transcripts. Cofactor aligns RNA-seq data to both a genome reference and a gene reference, which is unusual in the field; most researchers align to either a genome reference or a gene reference. Aligning to both types of reference allows Cofactor’s clients to interrogate their data with 2 lenses—a discovery lens and a comparative expression lens.

Figure 3. Workflow for Targeted RNA Capture.

Data Analysis: Mr Armstrong notes that many clients approach Cofactor to request RNA-seq for comparative expression. During further discussion, however, Cofactor scientists learn that more than 50% of clients are also very interested in splice isoform discovery. Mr Armstrong stresses that comparative expression from RNA-seq data and splice isoform discovery are not the same experiment. Very good comparative expression information can be obtained from a single-end 1 x 50 read, with minimal reduction in alignment specificity, at 1X cost. However, splice isoform detection should be approached differently [1]. At a minimum, paired-end 2 x 100 reads are required (i.e., the longest read length with the highest number of reads), and the cost becomes 2–3X. Researchers at Cofactor state that while identifying exon–exon junctions from 50 bp reads is possible, it is not the most effective strategy. Ultimately, if isoform detection is as important as comparative expression, Cofactor usually defines a better means of allocating funds to obtain both types of information as efficiently and cost effectively as possible.

Project Tracking and Data Delivery

ActiveSite™ is a web interface that serves as a dashboard for an entire project. Using ActiveSite, clients can interrogate 200 million lines of data from an RNA-seq experiment across 24 samples, for example, and within a minute, reduce that data down to the most interesting candidates based on statistical measures and input thresholds. The data analysis is performed by Cofactor bioinformaticians, according to clients’ needs and budget, and characterized in ActiveSite for client access. Their clients can then access the interface and manipulate the data as required.

Other Services

Other specialties at Cofactor Genomics include development and testing of extraction protocols for different tissue types (e.g., biofluids, plants) to improve NGS performance. They also perform molecular protocol development, such as devising the means for porting samples that have been assayed by another method onto an NGS sequencer. They can provide qPCR validation of SNP and RNA-seq candidates identified by NGS and, through partnerships with other companies, are able to tie together proteomics information with NGS data. Although the company started out performing library generation and sequencing analysis, as they continue to grow they are adding bolt-on applications to the front- and back-ends of their processes.

Glossary of Terms Commonly Used in Next Generation Sequencing

Paired- and Single-End Reads

2 x 100: Paired-end 100 base reads where 100 nucleotides from each end of each genomic fragment are sequenced; i.e., a pair of reads generated from 2 ends of the same fragment and each read is 100 nt.

1 x 50: Single-end 50 base reads where 50 nucleotides from one end of each genomic fragment are sequenced; i.e., a single 50 nt read generated from one end of a fragment.

Depth of Coverage

The number of times a nucleotide is read during the sequencing process. Coverage is the number of reads representing a given nucleotide in the reconstructed sequence. Average depth of coverage (D) can be calculated from the length of the original genome (G), the number of reads (N), and the average read length (L) using the equation: D = N x L/G. For example, a genome of 2000 bp reconstructed from 8 reads with an average length of 500 nucleotides will have a 2X redundancy (depth of coverage).
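The coverage equation above is simple enough to check with a few lines of Python. This is a minimal sketch; the function names are illustrative, and `reads_needed` is just the same equation rearranged to N = D x G/L, which is not stated in the glossary itself.

```python
import math


def mean_depth(num_reads, mean_read_length, genome_length):
    """Average depth of coverage, D = N x L / G."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * mean_read_length / genome_length


def reads_needed(target_depth, mean_read_length, genome_length):
    """Same equation solved for N, rounded up to a whole number of reads."""
    if mean_read_length <= 0:
        raise ValueError("mean_read_length must be positive")
    return math.ceil(target_depth * genome_length / mean_read_length)


# Glossary example: 8 reads of 500 nt over a 2000 bp genome.
print(mean_depth(8, 500, 2000))    # 2.0, i.e., 2X coverage
print(reads_needed(2, 500, 2000))  # 8
```

Note that D is an average: individual positions can be covered more or less deeply than the mean, which is why insertion site candidates are compared against the calculated mean depth rather than a fixed number.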

Quality Score

A measure of the probability that a base has been called incorrectly. Quality scores are logarithmically linked to error probabilities by the equation Q = –10 log10(P), where Q is the quality score and P is the probability of a base-calling error. For example, a quality score of Q30 for a given base indicates that the chances of that base being incorrectly called are 1 in 1000. In general, a base call is considered high quality if the score is greater than Q20 (at Q20, the probability of an incorrect base call is 1 in 100). Quality scores are used for:

  • Assessment of sequence quality
  • Recognition and removal of low quality sequence; i.e., end clipping
  • Determination of accurate consensus sequences
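The relationship between quality score and error probability, and the idea of end clipping, can be sketched in Python as follows. The helper names are illustrative. The decoder assumes quality strings use the Phred+33 ASCII offset, which is an assumption about the input files rather than something stated in this article, and the clipping rule (trim trailing bases scoring below a threshold) is a deliberately naive example, not a description of any particular tool.

```python
import math


def phred_to_error_prob(q):
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Q = -10 log10(P)."""
    return -10 * math.log10(p)


def decode_phred33(quality_string):
    """Convert a quality string to scores, assuming the Phred+33 offset."""
    return [ord(char) - 33 for char in quality_string]


def clip_3prime(seq, quals, min_q=20):
    """Naive end clipping: drop trailing bases whose score is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


print(phred_to_error_prob(30))                 # about 0.001 (1 in 1000)
print(round(error_prob_to_phred(0.01)))        # 20
print(clip_3prime("ACGT", [40, 30, 10, 2]))    # ('AC', [40, 30])
```

Because the scale is logarithmic, each 10-point increase in Q corresponds to a tenfold drop in the expected error rate.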

SIPE (Short-Insert Paired-End) Reads

Reads for which the genomic fragments have a short distance between the ends; typically 300 or 500 nucleotides.

LIMP (Long-Insert Mate-Pair) Reads

Reads for which the genomic fragments have a longer distance between ends than for simple fragments; typically 1000, 3000, 5000, and 9000 nucleotides on Illumina instruments.

Note: Some researchers use the terms “paired-end” and “mate-pair” interchangeably; however, the scientists at Cofactor do not. At Cofactor, paired-end reads are defined as those generated from the ends of a single fragment, while mate-pairs are derived from the ends of a longer fragment that has been circularized to bring the ends into close proximity of one another.

Contig

A consensus region of genomic sequence that is formed by overlapping sequence reads.

Scaffold

A consensus region of genomic sequence that is formed by using paired-end information to combine, order, and orient shorter contigs.


About Cofactor Genomics

Cofactor Genomics is a privately held biotechnology company that employs experimental design, next generation sequencing, and proprietary analysis technology and pipelines to drive the discovery and design of new products and processes for the life sciences. Cofactor’s D&A (Design and Analysis) Solution is constructed to specifically assist those researchers who require more than just sequencing at a reasonable cost. It also provides expert design and analysis capabilities, customized to specific requirements, and bioinformatics to make the most of the data generated. Cofactor’s expertise in molecular biology and bioinformatics accelerates its partners’ biological research, discovery, and product development in a number of scientific areas worldwide. For more information, visit www.cofactorgenomics.com.

Jon Armstrong, Chief Operating Officer, was a research scientist with the Technology Development Group at The Genome Institute (Washington University in St. Louis) from 2001 to 2009. During that time, he was instrumental in the design of molecular tools (e.g., multi-locus sequence typing) for the characterization of single nucleotide polymorphisms in organisms such as S. aureus and uropathogenic E. coli, and diseases such as acute myeloid leukemia.


Product Focus: xGen® Lockdown® Probes

xGen® Lockdown® Probes are individually synthesized probes for target enrichment by hybrid capture. They have been specifically developed for next generation sequencing. xGen Lockdown Probes can be used alone to create custom panels that can be optimized, expanded, and combined with other panels as necessary. They can also be used to supplement existing capture panels to rescue poorly represented regions, such as areas of high GC content.


References

  1. http://www.cofactorgenomics.com/blog/2013/its-all-about-isoforms (accessed 10/30/2013).

Author: Nicola Brookman-Amissah, PhD, is a Scientific Writer at IDT.
