The exponentially decreasing cost of next-generation sequencing (NGS) has made this technology readily available to research labs around the world. However, experimental design and data analysis remain a challenge for many researchers, especially those with little or no NGS expertise. The need for more robust analysis pipelines has become evident, and more NGS service companies have emerged to meet it.
Cofactor Genomics (St Louis, MO, USA), an NGS-focused contract research organization (CRO), provides comparative SNP analysis, RNA-seq, and de novo genome assembly services, among others. Several of these services require the use of target enrichment techniques to direct sequencing capacity to smaller sections of the genome. In this article, Cofactor Genomics Chief Operating Officer, Jon Armstrong, discusses two such applications: a) insertion site detection and b) targeted RNA capture and subassembly of transcripts.
Insertion site detection
Insertion site detection allows researchers to resolve assumptions and answer questions like:
- “Is there really one insert per genome?” and “If not, what is the distribution?”
- “Where in the host genome did the insert go?”
- “Is the insertion site randomly selected?”
- “Did my insert disrupt any functions within my host genome?”
- “Is the insert in an intergenic region or within an intron or exon?”
Scientists at Cofactor use IDT xGen Lockdown Probes almost exclusively for insertion site detection projects: they run many samples, so the cost of a probe set is amortized across them.
Insertion site detection is a powerful tool for identifying where exogenous DNA has been incorporated into a host genome. To detect insertion sites, scientists at Cofactor first design xGen Lockdown Probes targeting the exogenous DNA. They then fragment the host DNA containing the insert, create a library of the DNA fragments by ligating adapters, and hybridize the xGen Lockdown Probes to the library. The insert is contained in that DNA mixture; however, its sequence is unimportant in most cases and serves only as a hook for designing the capture probes. The scientists are interested in the regions of host sequence abutting the insert sequence. Among the randomly fragmented genomic DNA are fragments that are insert:genomic DNA chimeras, and these are the sequences the scientists look for to determine the insertion sites (Figure 1).
Figure 1. Workflow for insertion site detection.
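The idea of an insert:genomic DNA chimera can be illustrated with a minimal Python sketch. It assumes error-free reads and a known insert sequence, and it uses exact string matching in place of a real aligner. The function name `find_junction`, the `min_anchor` parameter, and the toy sequences are hypothetical and are not part of Cofactor's pipeline.

```python
def find_junction(read, insert, min_anchor=8):
    """Split a read into host and insert parts if it is a host:insert chimera.

    Returns (host_part, insert_part, side), where side is "left" when host
    sequence precedes the start of the insert and "right" when host sequence
    follows the end of the insert. Returns None when no junction is found
    with at least min_anchor bases on each side (min_anchor must be >= 1).
    """
    n = len(read)
    longest = min(n - min_anchor, len(insert))
    # Case 1: host sequence followed by a prefix of the insert.
    for k in range(longest, min_anchor - 1, -1):
        if read.endswith(insert[:k]):
            return read[:n - k], read[n - k:], "left"
    # Case 2: a suffix of the insert followed by host sequence.
    for k in range(longest, min_anchor - 1, -1):
        if read.startswith(insert[-k:]):
            return read[k:], read[:k], "right"
    return None


# Toy data (made up for illustration only).
INSERT_SEQ = "ACGTTGCAAGGCTTAACCGG"
left_chimera = "TTTTTCCCCCGGGGG" + INSERT_SEQ[:12]
right_chimera = INSERT_SEQ[-10:] + "TTTTTAAAAA"
host_only = "TTTTTCCCCCGGGGGTTTTT"

left_hit = find_junction(left_chimera, INSERT_SEQ)
right_hit = find_junction(right_chimera, INSERT_SEQ)
no_hit = find_junction(host_only, INSERT_SEQ)
```

The host portion of each chimeric read is what would then be aligned to the host genome to locate the insertion. Real data would require an aligner that tolerates sequencing errors and reports soft-clipped or split alignments.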
Researchers often assume that this method can be directly applied to translocation detection. However, Mr Armstrong distinguishes insertion site detection from identifying translocations by noting, “Insertion site detection is easy to work on because you know the sequence of the insert. With translocation detection you might have to capture and sequence large regions of DNA before you perform the alignment to determine where the translocation has occurred.”
Cofactor has developed a dedicated insertion site analysis pipeline that analyzes all of the sequencing data to determine the most likely areas where an insert is incorporated into the host genome (Figure 2). Once samples are received from a client, Cofactor can complete the service in 8–10 weeks, delivering a ranked list of the most likely areas in the genome for the insertion as well as FASTA sequences for those regions. The site of insertion can be confirmed, by Cofactor or by the client, simply by using PCR to amplify across the insert:genomic DNA junction.
Figure 2. Insertion site detection using xGen Lockdown Probes. This example shows an alignment output produced by Cofactor Genomics’ insertion site detection analysis pipeline, visualized in the UCSC Genome Browser. Paired-end, 100 nt sequences were generated from genomic and insert fragments enriched using xGen Lockdown Probes. Reads were aligned to the host genome. "Orphan" paired reads are represented by overlapping horizontal red and blue lines: forward orphan sequencing reads are shown in red and reverse orphan sequencing reads in blue. A read pair is classed as orphaned when one read of the pair aligns to the host genome and the other aligns to the insert sequence. Areas of the genome that have coverage greater than the calculated mean depth of coverage, and orphan reads in the correct orientation, are considered for insertion site confirmation by PCR. The black line represents the proposed insertion site location for this sample.
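The selection criteria in the Figure 2 caption can be sketched in Python as follows. This is a simplified illustration, not Cofactor's pipeline. It assumes that read pairs have already been aligned and are supplied as two `(reference, position, strand)` tuples, and that the insert is a reference named `"insert"`. It bins orphan reads into fixed windows, and it approximates the mean depth of coverage by the mean orphan count per occupied window. "Correct orientation" is interpreted here as forward reads lying upstream of reverse reads, so that both point toward the insert. All names and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

INSERT = "insert"  # hypothetical reference name for the exogenous sequence


def is_orphan(pair):
    """A pair is an orphan when exactly one mate aligns to the insert."""
    (ref1, _, _), (ref2, _, _) = pair
    return (ref1 == INSERT) != (ref2 == INSERT)


def rank_sites(pairs, window=1000):
    """Return candidate sites as (orphan_count, reference, position), best first."""
    bins = defaultdict(lambda: {"+": [], "-": []})
    for pair in pairs:
        if not is_orphan(pair):
            continue
        # Keep the mate that aligned to the host genome.
        host = pair[0] if pair[1][0] == INSERT else pair[1]
        ref, pos, strand = host
        bins[(ref, pos // window)][strand].append(pos)
    if not bins:
        return []
    depth = {key: len(r["+"]) + len(r["-"]) for key, r in bins.items()}
    threshold = mean(depth.values())  # stand-in for mean depth of coverage
    sites = []
    for key, reads in bins.items():
        fwd, rev = reads["+"], reads["-"]
        if depth[key] <= threshold or not fwd or not rev:
            continue
        if max(fwd) > min(rev):  # reads do not point toward each other
            continue
        # Propose the midpoint between the innermost forward and reverse reads.
        sites.append((depth[key], key[0], (max(fwd) + min(rev)) // 2))
    return sorted(sites, reverse=True)


# Toy alignments (made up): four orphans around chr1:10,500 plus background.
pairs = [
    (("chr1", 10400, "+"), (INSERT, 50, "-")),
    (("chr1", 10450, "+"), (INSERT, 80, "-")),
    ((INSERT, 900, "+"), ("chr1", 10600, "-")),
    ((INSERT, 950, "+"), ("chr1", 10650, "-")),
    (("chr2", 500, "+"), (INSERT, 10, "-")),    # isolated orphan (noise)
    (("chr3", 7200, "-"), (INSERT, 20, "+")),   # isolated orphan (noise)
    (("chr1", 100, "+"), ("chr1", 300, "-")),   # ordinary host pair
]
candidates = rank_sites(pairs)
```

A production pipeline would compute depth from all aligned reads, account for read length, and handle sites that straddle a window boundary; the fixed-window binning here is only for brevity.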
Advantage of using target capture
It is important to note that because the insert is introduced randomly into the host genome, it is impossible to design primers targeted to the genome upstream and downstream of the insert; therefore, standard PCR cannot be used to detect the insertion site. Use of target capture and NGS provides an easier workflow that also allows higher throughput. Additionally, the hybridization procedure enables capture of regions that are unique to the insert, while allowing the scientists to separate random noise from non-random signal by sequencing to coverage above the noise threshold. Using xGen Lockdown Probes helps Cofactor to easily anticipate the cost of the project as they know the sequence coverage required and exactly how many probes will be needed.
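The cost estimate mentioned above comes down to simple arithmetic once the target size and the required coverage are fixed. The sketch below is illustrative only: the 120 nt probe length, the tiling density, the on-target fraction, and the example insert size are all assumptions chosen for the example, not figures from Cofactor or IDT.

```python
import math


def probes_needed(target_len, probe_len=120, tiling=1.0):
    """Probes required to tile target_len bases.

    tiling=1.0 places probes end to end; tiling=2.0 starts a new probe
    every half probe length. Both defaults are assumptions.
    """
    if target_len <= 0:
        return 0
    if target_len <= probe_len:
        return 1
    step = probe_len / tiling
    return math.ceil((target_len - probe_len) / step) + 1


def reads_needed(target_len, depth, read_len, on_target=1.0):
    """Reads required to reach a mean depth over the target region."""
    return math.ceil(target_len * depth / (read_len * on_target))


# Hypothetical 5 kb insert, 100x depth, 100 nt reads, 50% of reads on target.
n_probes = probes_needed(5000)
n_probes_2x = probes_needed(5000, tiling=2.0)
n_reads = reads_needed(5000, depth=100, read_len=100, on_target=0.5)
```

Because the insert sequence is known in advance, both numbers can be fixed before any sample is processed, which is what makes the project cost predictable.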
Targeted RNA capture
NGS technologies are able to determine the nucleic acid sequence of both DNA and RNA. However, unlike DNA, which is a somewhat static molecule, RNA transcription is dynamic, tissue specific, and often contains a temporal component. Thus, Cofactor considers RNA-seq applications akin to snapshots taken at a given moment in time as opposed to whole exome or genome sequencing, which exhibits more stability.
Recently, considerable attention has been focused on the role of transcript isoform function. Isoforms from the same gene can have significantly different structures from one another and, therefore, significantly different functions. Cofactor has developed a molecular approach, using xGen Lockdown Probes, and a specific analysis pipeline, to help clients elucidate specific isoform structures of interest.
Scientists at Cofactor perform RNA isoform assembly of specific genes using xGen Lockdown Probes and the workflow shown in Figure 3. The sequencing reads containing the targeted transcripts are assembled using Cofactor’s transcript assembly pipeline. After the structures of the isoforms are determined from the assembled transcripts, they can be added to the gene reference for comparative expression and further interrogation. Aligning RNA-seq data to a gene (transcriptome) reference is similar to a microarray experiment, but provides much greater dynamic range and is hypothesis neutral. Data can also be aligned to a genome reference to identify novel or unannotated transcripts.
Cofactor aligns RNA-seq data to both a genome reference and a gene reference, which is unusual in the field; most researchers align to one or the other. Aligning to both types of reference allows Cofactor’s clients to interrogate their data through two lenses: a discovery lens and a comparative expression lens.
Figure 3. Workflow for targeted RNA capture.
Mr Armstrong notes that many clients approach them to request RNA-seq for comparative expression; during further discussion, however, Cofactor scientists learn that >50% of clients are also very interested in splice isoform discovery. Mr Armstrong stresses that comparative expression from RNA-seq data and splice isoform discovery are not the same experiment. Very good comparative expression information can be obtained from single-end 1 x 50 reads, with minimal reduction in alignment specificity, at 1X cost. Splice isoform detection should be approached differently: at a minimum, paired-end 2 x 100 reads are required (i.e., the longest read length with the highest number of reads), and the cost becomes 2–3X.
Researchers at Cofactor state that while identifying exon–exon junctions from 50 bp reads is possible, it is not the most effective strategy. Ultimately, if isoform detection is as important as comparative expression, Cofactor usually works out a better way of allocating funds to obtain both types of information as efficiently and cost-effectively as possible.
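One way to see why longer reads help with exon–exon junctions is to count how many read placements can span a given junction. The sketch below is a back-of-the-envelope model, not Cofactor's analysis: it assumes reads start uniformly along a transcript, considers a single junction far from the transcript ends, and requires a minimum overhang (`min_anchor`, set to a hypothetical 8 bases) on each side of the junction for the read to count as evidence.

```python
def junction_spanning_starts(read_len, min_anchor=8):
    """Read start positions that cross one junction with at least
    min_anchor bases on each side of it."""
    return max(0, read_len - 2 * min_anchor + 1)


def spanning_fraction(read_len, transcript_len, min_anchor=8):
    """Fraction of possible read placements on the transcript that span
    the junction, under the uniform-sampling assumption."""
    placements = transcript_len - read_len + 1
    if placements <= 0:
        return 0.0
    return junction_spanning_starts(read_len, min_anchor) / placements


starts_50 = junction_spanning_starts(50)
starts_100 = junction_spanning_starts(100)

# Hypothetical 2,000 nt transcript with one junction of interest.
frac_50 = spanning_fraction(50, 2000)
frac_100 = spanning_fraction(100, 2000)
```

Under these assumptions a 100-base read has well over twice as many informative placements per junction as a 50-base read, and a second mate in a paired-end run adds further chances to cross a junction or link distant exons.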
Project tracking and data delivery
ActiveSite™ is a web interface that serves as a dashboard for an entire project. Using ActiveSite, clients can, for example, interrogate 200 million lines of data from an RNA-seq experiment across 24 samples and, within a minute, reduce that data to the most interesting candidates based on statistical measures and input thresholds. The data analysis is performed by Cofactor bioinformaticians, according to clients’ needs and budget, and characterized in ActiveSite for client access. Clients can then access the interface and manipulate the data as required.
Other specialties at Cofactor Genomics include development and testing of extraction protocols for different tissue types (e.g., biofluids, plants) to improve NGS performance. They also perform molecular protocol development, such as devising the means for porting samples that have been assayed by another method onto an NGS sequencer. They can provide qPCR validation downstream of NGS that identifies SNP and RNA-seq candidates and, through partnerships with other companies, are able to tie together proteomics information with NGS data. Although the company started out performing library generation and sequencing analysis, as they continue to grow they are adding bolt-on applications to the front- and back-ends of their processes.