Open access whole exome sequencing data lowers the barrier for research linking genes to phenotypes

Van Hout CV, Tachmazidou I, et al. (2019) Whole exome sequencing and characterization of coding variation in 49,960 individuals in the UK Biobank. bioRxiv 572347, DOI 10.1101/572347

Citation summary: The UK Biobank has performed exome sequencing on almost 50,000 human samples and made the data open access. Van Hout, et al., found many known loss-of-function variants associated with disease, as well as some novel disease-associated variants. Read about the utility of exome sequencing and the xGen Exome Research Panel, which was used to reveal these disease risk variants.


The Human Genome Project was the first large sequencing endeavor to successfully characterize the human genome. It revealed much about the genome, including the vast extent of variation, even within closely defined populations, and made clear that many more genomes would need to be sequenced. As next generation sequencing technology and the ability to store and analyze large amounts of data developed, other “-omics” projects began to emerge. Whole exomes can be sequenced in less time and at lower cost than whole genomes, which lowers the barrier to collecting more data. More data makes it possible to link exome sequences to diseases, paving the way for personalized medicine.


The UK Biobank recruited 500,000 people to launch a prospective health study. These participants provided detailed information about themselves and agreed to continue providing updated health data. Of those 500,000 people, 49,960 had their exomes sequenced, and the sequencing data is available to approved researchers. The study by Van Hout, et al., outlines the sequence variation found and evaluates loss-of-function (LOF) mutations and their association with 1741 common and rare phenotypes.

Whole exome sequencing (WES) was performed using a customized version of the xGen Exome Research Panel v1.0 (IDT). The standard panel captures 19,396 human genes (39 megabases). Van Hout, et al., supplemented the standard panel with additional probes to boost coverage, achieving sequencing coverage of over 20X for 94.6% of targeted sites on a NovaSeq™ 6000 S2 flow cell (Illumina). The researchers summarized the sequence variation, reviewed predicted disease risk variants, and completed comprehensive LOF burden testing with 1741 phenotypes.
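
The coverage figure quoted above is a breadth-of-coverage statistic: the share of targeted sites whose read depth reaches a threshold. The following is a minimal sketch of that calculation, not the pipeline used in the study. The function name, the toy depth values, and the choice to count a depth of exactly 20 as covered are all assumptions made for illustration.

```python
def fraction_covered(depths, min_depth=20):
    """Return the fraction of targeted sites with read depth >= min_depth.

    `depths` is a list of per-site read depths. Treating a depth equal to
    the threshold as covered is an assumption made for this sketch.
    """
    if not depths:
        raise ValueError("depths must be non-empty")
    covered = sum(1 for depth in depths if depth >= min_depth)
    return covered / len(depths)


# Toy per-site depths (made-up numbers, not data from the study).
toy_depths = [35, 22, 18, 40, 27, 5, 31, 20, 64, 29]
print(f"{fraction_covered(toy_depths):.1%} of toy sites reach 20X")
```

In practice, per-site depths would come from an alignment file rather than a hand-written list, but the summary statistic is computed the same way.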

Results and discussion

The researchers focused primarily on LOF variants because of the information these variants contribute to human genetics and medical sequencing studies. LOF variants disrupt gene function and play a causal role in many Mendelian disorders. These types of variants have been used extensively to identify novel drug targets [1–3].

WES revealed more variants of all types, including LOF variants, than imputed sequence data, in which unobserved genotypes are statistically inferred. Not surprisingly, WES had better concordance, or agreement, with array genotyped variants than it did with imputed sequences, since both WES and arrays measure genotypes experimentally rather than predicting them.
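
Concordance between two genotyping methods can be summarized as the fraction of shared sites at which both methods report the same genotype. The sketch below illustrates one simple way to compute it and is not the method used by Van Hout, et al. The 0/1/2 alternate-allele coding, the use of None for missing calls, and the toy call sets are assumptions.

```python
def genotype_concordance(calls_a, calls_b):
    """Fraction of sites where two call sets report the same genotype.

    Genotypes are coded as alternate-allele counts (0, 1, or 2), and None
    marks a missing call. Sites missing in either set are skipped.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("call sets must cover the same sites")
    compared = [
        (a, b) for a, b in zip(calls_a, calls_b)
        if a is not None and b is not None
    ]
    if not compared:
        raise ValueError("no sites with calls in both sets")
    matches = sum(1 for a, b in compared if a == b)
    return matches / len(compared)


# Toy genotypes for one sample at eight sites (made-up values).
wes_calls = [0, 1, 2, 0, 1, None, 0, 2]
array_calls = [0, 1, 2, 0, 0, 1, 0, 2]
print(f"concordance: {genotype_concordance(wes_calls, array_calls):.3f}")
```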

Thousands of LOF variants have been associated with disease. Treatments and preventive measures chosen on the basis of a patient's variants can significantly improve survival rates of patients with genetic risk. When LOF variants impart disease risk, they are categorized as pathogenic or likely pathogenic. Variants whose contribution to risk is unknown are called variants of unknown significance (VUS). Assessing genetic data for disease risk gives physicians the opportunity to apply precision medicine and improve patient outcomes. The researchers analyzed variant data using the current American College of Medical Genetics and Genomics (ACMG) 59-gene set [4]. Over 2% of study participants carried “a potentially actionable, rare pathogenic variant.” Cancer-associated variants were the most prevalent, followed closely by variants associated with cardiac dysfunction. Taking a closer look at pathogenic cancer risk genes, Van Hout, et al., found that carriers of LOF variants in BRCA1 or BRCA2, both breast cancer risk genes, had increased risk for 5 other cancers.

A burden test combines the risk of multiple variants within a gene to evaluate the association of that gene with a trait. Van Hout, et al., used burden testing to find 25 unique gene-burden trait associations that achieved higher significance than any single LOF variant alone. Some of the gene-trait associations were well known, such as PKD1 with polycystic kidney disease. Others were less widely recognized but supported by the literature, such as HBB with red blood cell phenotypes. Several gene-trait associations were novel, such as PIEZO1 LOF variants associated with increased risk for varicose veins. To ensure that no single variant accounted for an association, Van Hout, et al., performed “leave-one-out” (LOO) analyses. Step-wise regression analysis indicated that 11 variants contributed to the PIEZO1 gene risk.
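
To make the burden and leave-one-out ideas concrete, here is a minimal sketch under stated assumptions. It collapses all LOF variants in a gene into a single carrier/non-carrier indicator and tests for enrichment of carriers among people with the trait using a one-sided Fisher's exact test. This simple collapsing test is a stand-in chosen for illustration and is not the statistical model used in the study. All function names, variant labels, and sample data are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for enrichment.

    2x2 table: a = carriers with trait, b = carriers without trait,
    c = non-carriers with trait, d = non-carriers without trait.
    """
    n_carriers, n_cases, total = a + b, a + c, a + b + c + d
    upper = min(n_carriers, n_cases)
    # Sum hypergeometric probabilities of tables at least as extreme.
    tail = sum(
        comb(n_cases, k) * comb(total - n_cases, n_carriers - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, n_carriers)


def burden_test(carriers_by_variant, cases, samples, skip=None):
    """Collapse a gene's variants into one carrier set, then test it.

    `samples` is assumed to contain every carrier and every case.
    """
    carriers = set()
    for variant, ids in carriers_by_variant.items():
        if variant != skip:
            carriers |= ids
    a = len(carriers & cases)
    b = len(carriers - cases)
    c = len(cases - carriers)
    d = len(samples) - a - b - c
    return fisher_exact_greater(a, b, c, d)


def leave_one_out(carriers_by_variant, cases, samples):
    """Repeat the burden test once per variant, omitting that variant."""
    return {
        variant: burden_test(carriers_by_variant, cases, samples, skip=variant)
        for variant in carriers_by_variant
    }


# Toy cohort of 20 people (made-up data, hypothetical variant labels).
samples = set(range(20))
cases = {0, 1, 2, 3, 4, 5}
gene_variants = {"v1": {0, 1}, "v2": {2, 6}, "v3": {3}}

full_p = burden_test(gene_variants, cases, samples)
loo_p = leave_one_out(gene_variants, cases, samples)
print(f"gene-level p-value: {full_p:.4f}")
for variant, p in loo_p.items():
    print(f"without {variant}: p = {p:.4f}")
```

If removing one variant makes the gene-level signal disappear, that variant is driving the association. If the signal persists across all leave-one-out runs, several variants are contributing, which is the pattern the leave-one-out analysis is designed to check.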

The main purpose of this study is to provide open access to exome sequencing data. Large-scale studies of this kind are invaluable to human health research. The secondary goal of the study is to demonstrate the utility of this type of data. The methodology increased sensitivity for identifying LOF variants and novel gene-trait associations. The large number of sequenced samples greatly increases the power of the study, and the data set will be a valuable resource for future studies that seek to evaluate pathogenic and likely pathogenic variants and their carriers. Exome sequencing, combined with other relevant health data from the prospective study, holds promise for leading to many discoveries and improvements in human health and well-being.

Published Nov 20, 2019