Advancing species discovery and identification
Imagine a unified, publicly available database that provides rapid access to the names and biological attributes of every species on Earth, linking genetic, morphological, and ecological data, including the exact collection location of each specimen. Not only would it provide a resource to assess species diversity and disentangle links within ecosystem food webs, but it could also impact human health, safety, and economic well-being (see the sidebar below, Barcoding News Stories).
Creating a digital identification system for life
The International Barcode of Life Project (iBOL; https://ibol.org/) attempts to provide just such a resource through the construction of a comprehensive library of DNA sequence tags, or “barcodes”, for eukaryotic life (Figure 1). It rests on the observation that, for a suitably chosen DNA fragment, sequence variation between species exceeds sequence variation among members of the same species. Thus, a very short, standard DNA segment, in this case a fragment of the CO1 gene for animals, and of rbcL and matK for plants, allows researchers to infer species identity faster and with more accuracy and precision than traditional taxonomy or morphological approaches alone.
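The principle that between-species variation exceeds within-species variation can be illustrated with a short calculation. The following is a minimal sketch, not the iBOL or CCDB analysis pipeline: it assumes sequences that are already aligned to equal length, uses a simple proportion-of-differences distance, and runs on invented toy sequences with hypothetical species labels.

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only sites where both sequences have an unambiguous base.
    sites = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)

def barcode_gap(records):
    """Return (largest within-species distance, smallest between-species distance).

    records is a list of (species_label, aligned_sequence) pairs.
    """
    within, between = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (within if sp1 == sp2 else between).append(p_distance(s1, s2))
    return max(within, default=0.0), min(between, default=float("nan"))

# Invented 10-base toy "barcodes"; real CO1 barcodes are far longer.
toy_records = [
    ("species_A", "ACGTACGTAC"),
    ("species_A", "ACGTACGTAT"),
    ("species_B", "ACGAACCTTC"),
    ("species_B", "ACGAACCTTC"),
]

max_within, min_between = barcode_gap(toy_records)
print(f"max within-species distance:  {max_within:.2f}")
print(f"min between-species distance: {min_between:.2f}")
print("barcode gap present:", min_between > max_within)
```

When the smallest between-species distance is larger than the largest within-species distance, an unknown sequence can be assigned to the species whose reference barcodes it most closely matches. Real analyses typically use model-based distances and many more specimens per species.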
The five-year Phase I project, running from 2010 to 2015 and jointly administered by a consortium of funders and research groups, aims to acquire DNA barcode records for 5 million individual specimens, representing over 500,000 species (Figure 1). The Canadian Centre for DNA Barcoding (CCDB), part of the University of Guelph in Ontario, Canada, is the main DNA barcoding facility for iBOL. Run by Dr Paul Hebert, who is also the Scientific Director of iBOL and is often referred to as the father of DNA barcoding, CCDB has already generated over 1.2 million barcodes, and its high-throughput operation is currently generating barcodes at a rate of 350,000 per year and accelerating.
Life at a DNA barcoding core facility
CCDB employs 60 staff divided among bioinformatics, specimen and data submission management, and the sequencing laboratories. We spoke with Dr Evgeny V Zakharov, the Director of Laboratory Operations for the facility, about the high-throughput workflow they use in the labs. While the steps are simple (DNA extraction, PCR amplification, and Sanger cycle sequencing), the focus is on highly standardized protocols that can handle minute amounts of the broadest diversity of sample types, that can be run at high throughput, and that can be automated where possible (Figure 2).
DNA extraction
Several hundred collaborators (from universities, museums, etc.) have active projects with the CCDB at any given time, and the CCDB laboratories typically process over 2000 samples per day. Samples are usually fragments of the original specimen, “...the smaller, the better,” notes Dr Zakharov. “Our workflow is tailored to maximize DNA recovery often from minute samples and these smaller samples tend to be more homogeneous, that is, less likely to contain contaminating organisms” (e.g., parasites or species from a meal). The lab uses membrane-based DNA extraction protocols that involve binding DNA to a glass fiber membrane in the presence of chaotropic salts. Distinct chemistry is used for different sample categories; for example, to break plant cell walls or the tough cell membranes of echinoderms. DNA extraction is fully automated in 96-well plate format using a Biomek® FXP Liquid Handling System.
PCR amplification
The group uses standard PCR protocols with cocktails of IDT primers that work efficiently across a variety of taxa to amplify the target DNA fragment. Usually a 658 bp fragment of the CO1 gene is amplified in a 5–10 μL reaction on one of the facility’s thirty-four 96-well and four 384-well Eppendorf thermal cyclers. However, shorter amplicons are sometimes required for processing degraded DNA from either recently collected but poorly preserved samples or older museum specimens. Amplification reactions are optimized so that no sample purification is required prior to Sanger sequencing, though the CCDB researchers do visually validate the amplified product by agarose gel electrophoresis.
“We use IDT primers for our PCR work. Since our protocols are mostly focused on high throughput batch processing, quality and consistency of all components are critical. Even though we validate the primers prior to any large scale analysis, if a primer starts degrading, it will compromise a lot of data—that’s a lot of rework, resulting in additional time and money wasted. We’ve been very happy so far, and while we always keep a list of back up suppliers as standard practice, the IDT primers have shown great consistency in meeting our expectations.”—Dr Evgeny Zakharov
Sanger sequencing is the method of choice to ensure an indisputable link between the sample source and the derived barcode sequence. An aliquot of diluted, non-purified PCR product is used as template in a cycle sequencing reaction in 96- or 384-well plates, followed by sequencing product cleanup and DNA sequence detection on 3730xl DNA sequencers (Applied Biosystems), the “workhorse” of the first whole-genome sequencing projects. A team of sequence data finishers assembles the collected trace files of raw sequence information into contigs to produce an alignment of consensus sequences, validates the results, and uploads both unedited data and final results to the database for further analysis by researchers in BIO (Biodiversity Institute of Ontario, home of CCDB) and project collaborators at other institutions. Typically, several individual samples per species are sequenced to capture sequence variation within a species and to cross-validate sequencing results.
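The idea of reducing several reads to one consensus sequence can be sketched in a few lines. This is a simplified illustration only, not the software the CCDB finishers use: it assumes the reads are already base-called, oriented, and aligned to equal length, and it applies a plain majority vote per position, whereas real trace assembly also weighs base-quality scores. The function name and the `min_fraction` parameter are hypothetical.

```python
from collections import Counter

def consensus(aligned_reads, min_fraction=0.5):
    """Majority-vote consensus of equal-length, pre-aligned reads.

    A position becomes 'N' when no read has an unambiguous base there,
    or when the most common base does not exceed min_fraction of the calls.
    """
    if not aligned_reads:
        raise ValueError("at least one read is required")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")

    result = []
    for column in zip(*(read.upper() for read in aligned_reads)):
        calls = [base for base in column if base in "ACGT"]
        if not calls:
            result.append("N")
            continue
        base, count = Counter(calls).most_common(1)[0]
        result.append(base if count / len(calls) > min_fraction else "N")
    return "".join(result)

# Invented toy reads: one ambiguous call and one disagreement.
reads = ["ACGTNACGT", "ACGTTACGA", "ACGTTACGT"]
print(consensus(reads))
```

With only two reads that disagree at a position, this sketch reports `N` rather than guessing, which mirrors the practice of flagging such sites for manual review.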
The sequence information is integrated with metadata on the Barcode of Life Data (BOLD) Systems, the online workbench that aids collection, management, analysis, and use of DNA barcodes. Each sample record includes information on where the sample was collected, when, by whom, and in which museum or institutional collection the voucher specimen has been deposited for long-term storage and curation. This creates a unified resource that engages the scientific community in collaborative projects that describe new species, uncover cryptic diversity not seen with prior standard methods, and even identify errors in curated collections.
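A record that ties a barcode sequence to its collection metadata might be modeled as follows. This is a hypothetical sketch for illustration: the class, field names, and checks are assumptions and do not reflect the actual BOLD data model, and the example values are invented placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass(frozen=True)
class BarcodeRecord:
    """Hypothetical record linking a barcode to where, when, and by whom it was collected."""
    sample_id: str
    marker: str               # e.g. "CO1", "rbcL", or "matK"
    sequence: str
    latitude: float           # decimal degrees
    longitude: float          # decimal degrees
    collection_date: date
    collector: str
    voucher_institution: str  # collection holding the voucher specimen

def validation_problems(record: BarcodeRecord) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if not record.sequence or set(record.sequence.upper()) - set("ACGTN"):
        problems.append("sequence is empty or has unexpected characters")
    if not -90.0 <= record.latitude <= 90.0:
        problems.append("latitude out of range")
    if not -180.0 <= record.longitude <= 180.0:
        problems.append("longitude out of range")
    if not record.collector.strip():
        problems.append("collector missing")
    if not record.voucher_institution.strip():
        problems.append("voucher institution missing")
    return problems

# Placeholder values, invented for this example.
example = BarcodeRecord(
    sample_id="EXAMPLE-0001",
    marker="CO1",
    sequence="ACGTACGTAC",
    latitude=43.53,
    longitude=-80.23,
    collection_date=date(2012, 6, 1),
    collector="Example Collector",
    voucher_institution="Example Museum",
)
print(validation_problems(example))
```

Keeping the sequence and its provenance in one record is what allows a questionable identification to be traced back to a physical voucher specimen.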
Dr Zakharov notes, “This project comes at a very critical time, when we are in a race against massive anthropogenic extinction of species on the planet. It has taken us biologists over 250 years to describe the known 1.7 million species. However, the true number of species on Earth is believed to be closer to 10–100 million. At the current pace of species discovery, it is certain that over the next few decades many will become extinct before even having been described. Like the birth of DNA forensics in the early 1990s, the rise of DNA barcoding represents another milestone that will have far-reaching implications on the world in which we live.”