Genotyping: Terms to know

SNPs, indels, alleles, haplotypes, hemizygous, nonsynonymous variants

Genotyping determines differences in a specific DNA sequence across a population. These sequence variations can be used as markers in linkage and association studies to determine genes relevant to specific traits. Review the vocabulary commonly encountered in genotyping experiments.

Researchers often look at genetic variations between individuals in a population to better understand phenotypic traits, such as fruit production, or human disease origin and incidence rate. One individual's genome can differ from that of the general population in numerous ways, including single base changes (single nucleotide polymorphisms, or SNPs), insertions, deletions, or even the number of gene copies. These unique differences can be used as markers in linkage and association studies that attempt to determine genes responsible for disease, plant drought tolerance, etc.

Genotyping is the process of determining the DNA sequence—the genotype—at specific positions within a gene of an individual. Genotyping can be performed by end-point or real-time PCR, sequencing, bead or microarray analysis, or even mass spectrometry.

Like many technical fields, genotyping has its own vocabulary. Here we introduce some of the most commonly used terms and use them in context to draw out the distinctions you will encounter in genotyping experiments.

Allele, locus, and haplotype

Allele The DNA sequence at a specific chromosomal location, which presents as a variant, or alternative form, of a gene. Any given gene can have multiple different alleles. Humans have 2 copies of each chromosome, so an individual can carry at most 2 alleles at any given locus, one inherited from each parent. Some genes have only one allele, such as those on the human male's Y chromosome, and any deviation from that allele can be harmful, or even fatal, to the organism.
Polyallelic The existence of multiple alleles at a specific genetic locus.
Biallelic/Triallelic/Quatra-allelic The number of distinct nucleotides (2/3/4) known to exist at a particular base position of an allele. For example, the occurrence of only A or G is a biallelic position, A or C or T is a triallelic position, and A or C or G or T is a quatra-allelic position (Figure 1).
Locus A specific chromosomal location. Can refer to a gene location on a chromosome or to a specific sequence element.
Haplotype A set of DNA variations (polymorphisms such as SNPs and indels) adjacent to one another at the same locus that tend to be inherited together (Figure 1). This set of alleles is often referred to as linked polymorphisms.

 
Multiple adjacent SNPs in an individual define a haplotype.

Figure 1. Haplotypes made up of 3 biallelic positions. The 3 distinct haplotypes (AGT, GTA, AGA) contain biallelic SNPs (A or G, G or T, and A or T) at the 3 variant positions in this locus.
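The relationship between haplotypes and the biallelic/triallelic terminology can be sketched in a few lines of Python. This is a minimal illustration that assumes the 3 haplotypes named in the Figure 1 caption are the only ones observed at this locus. The function names and the "monomorphic" label for an invariant position are our own and do not come from any genotyping library.

```python
# Terms for the number of distinct nucleotides seen at one position.
# "monomorphic" (no variation) is added here for completeness; it is an assumption.
ALLELIC_TERMS = {1: "monomorphic", 2: "biallelic", 3: "triallelic", 4: "quatra-allelic"}


def alleles_per_position(haplotypes):
    """Return the sorted distinct nucleotides observed at each position."""
    if len({len(h) for h in haplotypes}) != 1:
        raise ValueError("all haplotypes must have the same length")
    # zip(*haplotypes) walks the haplotypes column by column.
    return [sorted(set(column)) for column in zip(*haplotypes)]


def describe_position(nucleotides):
    """Name a position by how many distinct nucleotides occur there."""
    return ALLELIC_TERMS[len(set(nucleotides))]


haplotypes = ["AGT", "GTA", "AGA"]  # haplotypes from the Figure 1 caption
for index, alleles in enumerate(alleles_per_position(haplotypes), start=1):
    print(index, "/".join(alleles), describe_position(alleles))
```

Each of the 3 positions is reported as biallelic (A/G, G/T, and A/T), in agreement with the caption.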

Zygosity

Zygosity Describes whether an individual's alleles at a given locus are the same or different. Since most eukaryotes have 2 matching sets of chromosomes, zygosity terminology describes whether both copies of a gene carry the same allele.
Dominant allele (B) A dominant allele, designated by an uppercase letter (such as "B"), always displays the phenotype it encodes. It does this either through its presence in both gene copies (BB) or by masking the expression of a second, distinct recessive allele at the same locus (Bb) (Figures 2 and 3). Note: There are occasions when a recessive allele can contribute to a phenotype through co-dominance or incomplete dominance.
Recessive allele (b) A recessive allele, designated by a lowercase letter (such as "b"), expresses its associated phenotype only when paired with another recessive allele (Figures 2 and 3; see note under Dominant allele).
Homozygous (BB, bb) An individual with 2 copies of the same allele, whether dominant (designated by 2 uppercase letters, such as "BB") or recessive (designated by 2 lowercase letters, such as "bb").
Heterozygous (Bb) An individual with 2 different alleles of the same gene, typically one dominant and one recessive.
Hemizygous (B, b) An individual possessing only a single copy of a gene instead of the customary 2 copies, therefore having only 1 allele. For example, most genes on the single X and Y chromosomes in human males are hemizygous.
Zygosity is defined by whether a genotype is homozygous (dominant or recessive) or heterozygous.

Figure 2. Zygosity. The genotypes at 3 different loci show examples of homozygosity for both a dominant and recessive allele, as well as heterozygosity.
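The zygosity terms above map directly onto the letter notation. The sketch below assumes a genotype is written as 1 or 2 letters for a single gene, uppercase for the dominant allele and lowercase for the recessive one, as in this article. The function name is illustrative only.

```python
def zygosity(genotype):
    """Classify a genotype written in the B/b letter notation.

    Assumes 1 letter (hemizygous) or 2 letters of the same gene,
    uppercase = dominant allele, lowercase = recessive allele.
    """
    if len(genotype) == 1 and genotype.isalpha():
        return "hemizygous"
    if len(genotype) != 2 or not genotype.isalpha():
        raise ValueError("expected 1 or 2 allele letters, e.g. 'B' or 'Bb'")
    first, second = genotype
    if first.lower() != second.lower():
        raise ValueError("both letters must refer to the same gene")
    if first == second:
        return "homozygous dominant" if first.isupper() else "homozygous recessive"
    return "heterozygous"


for genotype in ["BB", "Bb", "bb", "b"]:
    print(genotype, "->", zygosity(genotype))
```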

 

Genotype vs phenotype

Genotype Refers broadly to the genetic makeup of an organism, that is, its complete set of genes. In the narrower sense used in this article, genotype refers to the specific alleles found on each chromosome at a given locus.
Phenotype The physical/observed traits determined or "expressed" by a given genotype; for example, the purple or white petals of a pea flower seen in Figure 3.
Different genotypes can result in distinct phenotypes.

Figure 3. Different genotypes give rise to distinct phenotypes.
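How a genotype translates into a phenotype can be sketched under the simplest model, complete dominance, in which a single dominant allele is enough to display the dominant trait. The example assumes purple is the dominant petal color and white the recessive one, and uses "P"/"p" as hypothetical allele symbols. It does not model co-dominance or incomplete dominance.

```python
def phenotype(genotype, dominant_trait, recessive_trait):
    """Return the expressed trait for a genotype, assuming complete dominance."""
    if not genotype or not genotype.isalpha():
        raise ValueError("genotype must be allele letters, e.g. 'Pp'")
    # One uppercase (dominant) allele masks any recessive allele.
    if any(allele.isupper() for allele in genotype):
        return dominant_trait
    return recessive_trait


# Assumption for illustration: purple (P) is dominant over white (p).
for genotype in ["PP", "Pp", "pp"]:
    print(genotype, "->", phenotype(genotype, "purple", "white"))
```

Under this model, 2 different genotypes (PP and Pp) produce the same phenotype, which is why a phenotype alone does not always reveal the underlying genotype.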

 

SNPs, polymorphisms, mutations, and CNVs

In human beings, 99.9% of all bases in the genome are the same from individual to individual. The remaining 0.1% make a person unique. Each of us differs from the human genome reference sequence by about 10,000 non-synonymous variants. Of these, each of us carries around 340–400 variations that result in loss of function of certain genes [1].

An individual's genome may differ from others in numerous ways, including base differences known as single nucleotide polymorphisms (SNPs), insertions or deletions (INDELs), or differences in the number of copies of a sequence or gene [copy number variations (CNVs)] (Table 1).

These variants can be:

  • Harmless—Variations that cause no change in phenotype; this is true of most SNPs.
  • Harmful—Variations that cause diseases, such as diabetes, cancer, heart disease, or hemophilia.
  • Latent—Variations found in coding and regulatory regions of the genome that are not harmful on their own. Their effect only becomes apparent under certain conditions, for example, as increased susceptibility to cancers or an altered response to drugs.
Reference sequence (RefSeq) The standard sequence for a given organism's genome, cataloged in the RefSeq database curated by the NCBI.
Polymorphism Variation at a genomic locus carried by a percentage of individuals within a population (generally >1%), thus creating different genotypes across that population.
Single-nucleotide polymorphism (SNP) Variation in a single nucleotide that occurs at a specific position in the genome. To be considered a SNP, the variation must be present in >1% of the population. Less than this, and it would be considered a rare mutation (abnormal change).
Single nucleotide variation (SNV) A base variation, distinct from the reference sequence, without information regarding how often this variation occurs.
Multiple nucleotide polymorphism (MNP) When 2 or more SNPs occur right next to each other.
INDEL (INsertion/DELetion) Sequence that has been inserted or deleted in one genome relative to another. A deletion in one genome corresponds to an insertion in the other.
Mutation A change in DNA sequence relative to an individual's inherited genetic sequence (the reference sequence for that individual). Each of the above types of polymorphisms (SNPs, SNVs, MNPs, INDELs) is considered a mutation. However, while polymorphisms are defined as being present within an appreciable subset of the general population, mutations also include alterations in DNA sequence that are rare or have been identified in just a single individual.
Copy number variation (CNV) When the number of copies of a particular genetic sequence differs between individuals. It is caused by repeats in the genome, the number of which can vary dramatically across a population.
Type             Reference sequence   Alternate sequence
SNP (single)     T                    G
MNP (multiple)   TA                   GC
Insertion        AGT                  ACGT
Insertion        ATCGGG               ATCTGAGGG
Deletion         ACGT                 AGT
Deletion         ATCTGAGGG            ATCGGG

Table 1. Distinction between SNPs, MNPs, and INDELs.
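The distinctions in Table 1 can be expressed as a simple comparison of the reference and alternate sequences. The following sketch is a simplification: it compares sequence lengths and counts mismatched bases only, and it ignores population frequency, so what it labels a "SNP" is strictly a single nucleotide variation until its frequency is known. The function name is our own.

```python
def classify_variant(reference, alternate):
    """Label a reference/alternate pair as SNP, MNP, insertion, or deletion."""
    if reference == alternate:
        raise ValueError("reference and alternate sequences are identical")
    if len(alternate) > len(reference):
        return "insertion"
    if len(alternate) < len(reference):
        return "deletion"
    # Same length: count the positions at which the two sequences differ.
    mismatches = sum(r != a for r, a in zip(reference, alternate))
    return "SNP" if mismatches == 1 else "MNP"


# Rows of Table 1 as (reference, alternate) pairs.
table_1 = [
    ("T", "G"),
    ("TA", "GC"),
    ("AGT", "ACGT"),
    ("ATCGGG", "ATCTGAGGG"),
    ("ACGT", "AGT"),
    ("ATCTGAGGG", "ATCGGG"),
]
for reference, alternate in table_1:
    print(reference, "->", alternate, ":", classify_variant(reference, alternate))
```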

Types of mutations

Germline mutation A mutation present in one's gametes (egg or sperm), and thus, can be inherited. Germline mutations are responsible for familial inherited diseases, such as retinoblastoma, Huntington's disease, and cystic fibrosis. They can be either dominant or recessive mutations, requiring only 1 or both alleles, respectively, to be mutated for expression of the inherited trait.
Somatic mutation A mutation that occurs in non-germline tissues and cannot be inherited. Thus, such mutations are only present in some of the cells of the body (e.g., in a tumor), giving rise to the presence of multiple genotypes within a single individual.
Silent mutation A mutation that does not have a visible/detectable effect on the phenotype of an organism.
Non-synonymous variant A SNP that changes the codon it resides in, resulting in an altered amino acid sequence for the encoded protein (missense mutation) or a truncated protein (nonsense mutation).
Missense mutation A mutation at a single base that results in the encoding of a distinct amino acid in the resulting protein. The amino acid substitution may render the protein fully functional, partially functional, or nonfunctional.
Nonsense mutation A mutation that results in a codon change to a chain-terminating codon, thus generating a truncated protein. Such proteins are often nonfunctional.
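The difference between missense and nonsense mutations comes down to what the altered codon encodes. The sketch below builds the standard genetic code (DNA codons, with "*" marking a stop codon) and compares the amino acid encoded before and after a change. The labels "synonymous" (the codon changes but the amino acid does not) and "stop-lost" are added here for completeness and are not defined in this article. The function name is illustrative.

```python
# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def classify_codon_change(reference_codon, alternate_codon):
    """Classify the protein-level effect of changing one codon into another."""
    reference_aa = CODON_TABLE[reference_codon.upper()]
    alternate_aa = CODON_TABLE[alternate_codon.upper()]
    if reference_aa == alternate_aa:
        return "synonymous"   # same amino acid, protein sequence unchanged
    if alternate_aa == "*":
        return "nonsense"     # new stop codon, truncated protein
    if reference_aa == "*":
        return "stop-lost"    # original stop codon removed
    return "missense"         # different amino acid


print(classify_codon_change("GAG", "GTG"))  # Glu -> Val
print(classify_codon_change("TGG", "TGA"))  # Trp -> stop
print(classify_codon_change("GAA", "GAG"))  # Glu -> Glu
```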

Allele frequency

Minor allele frequency (MAF) The frequency (percent or fraction) of the second most common allele for a given locus in a population.
MAF/MinorAlleleCount (Figure 4) A notation that reports the MAF together with the number of times the minor allele was observed in the population used in a specific study. For example:
C = 0.1506/754 (1000 Genomes)

where
C = the minor allele for that particular locus
0.1506 = the frequency of the C allele (MAF); in this case, about 15% within the 1000 Genomes database
754 = the number of times the C allele was observed in the study population
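The arithmetic behind this notation can be sketched as follows. The first function computes the MAF from allele counts using the definition above (the frequency of the second most common allele). The second splits the notation into its 3 parts. The count of 4254 for the T allele is an assumption chosen only to be consistent with the example; the function names are our own.

```python
from collections import Counter


def minor_allele_frequency(allele_counts):
    """Return (allele, frequency) for the second most common allele."""
    total = sum(allele_counts.values())
    ranked = Counter(allele_counts).most_common()
    if len(ranked) < 2 or total == 0:
        return None, 0.0  # no variation observed at this locus
    allele, count = ranked[1]
    return allele, count / total


def parse_maf_notation(text):
    """Split a string such as 'C=0.1506/754(1000 Genomes)' into its parts."""
    allele, rest = text.split("=", 1)
    frequency_text, count_text = rest.split("/", 1)
    count_text = count_text.split("(", 1)[0]  # drop the study name, if present
    return allele.strip(), float(frequency_text), int(count_text)


# Hypothetical allele counts, chosen to match the example above.
allele, frequency = minor_allele_frequency({"T": 4254, "C": 754})
print(allele, round(frequency, 4))

allele, maf, count = parse_maf_notation("C=0.1506/754(1000 Genomes)")
# Dividing the count by the frequency estimates how many chromosomes were sampled.
print(allele, maf, count, round(count / maf))
```

Because the published MAF is rounded to 4 decimal places, the chromosome total recovered from it is only approximate.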

The SNP database, dbSNP, is the NCBI-curated database of verified SNPs, each with its own RefSNP entry. Figure 4 shows a RefSNP entry for a pathogenic SNV, noting the alleles detected and the MAF/MinorAlleleCount of these alleles, based on reference sequences from 4 different sources.

The RefSNP cluster report for a specific SNP.

Figure 4. RefSNP entry for a pathogenic SNV. Accessed from www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=429358 on 20 Feb 2019.

 

Your genotyping resource

Reagents for genotyping. IDT offers a complete PCR-based SNP genotyping solution, the rhAmp SNP Genotyping System, as well as PrimeTime LNA qPCR Probes and MGB Eclipse® Probes that allow detection of small sequence alterations.


A complete PCR-based SNP genotyping solution. The rhAmp SNP Genotyping System includes a predesigned assay collection addressing >10 million human SNPs, including a broad selection of functionally validated absorption, distribution, metabolism, and excretion (ADME) SNP assays. A custom assay design pipeline is also available for newly discovered human SNPs or assay designs of other species. Over 90% of assays tested have returned greater than 99.5% call accuracy. And the design of rhAmp SNP assays makes it possible to detect SNPs in difficult sequence regions with amplicon lengths as short as 40 bp.


Technical support. In addition to the comprehensive set of tools, reagents, and educational resources for PCR-based SNP genotyping, IDT also provides world-class technical support. These scientists are available to answer all types of genotyping questions ranging from experimental design to interpreting data. Contact us with your questions about genotyping assay design at applicationsupport@idtdna.com.

References

  1. Durbin RM, Altshuler D, et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467(7319):1061–1073.

Published May 16, 2019
