FamLink home page




Genotype likelihoods

This page explains and discusses our model for genotype likelihoods. The model was introduced in Mostad et al. (2023) and is summarized below. An implementation in R is also available at https://familias.name/lcNGS/. The typical input to our model is low-coverage sequence data for one or more samples and high-quality genotype data (from any platform) for zero or more samples. One scenario is the comparison of low-quality sequence data generated from low-quality material (e.g. old bones) with high-quality genotype data from one or more alleged living relatives. Comparisons are also possible when all samples are of high quality. Our R-script implementation accepts simple allele read counts for a SNP marker as input. In addition, the R script accepts a standard VCF file with PL values as input. The PL data are used in the genotype likelihood model, which allows external software, e.g. ANGSD or GATK, to generate these likelihoods directly from BAM files.

An introduction is also provided at slides 50-60 in https://familias.name/ISFG2023/ISFG2023_day2_part1.pdf.
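As an illustration of the PL input mentioned above, the following minimal Python sketch reads the PL entry of one sample column in a VCF record and converts it to relative genotype likelihoods. It relies only on the VCF convention that PL values are Phred-scaled likelihoods, normalized so that the most likely genotype has PL 0. The function names `parse_pl` and `pl_to_likelihoods` are hypothetical and are not taken from the R script, which may handle PL data differently.

```python
def parse_pl(format_field, sample_field):
    """Extract the PL values from one sample column of a VCF record.

    format_field: the FORMAT column, e.g. "GT:AD:DP:GQ:PL"
    sample_field: the matching sample column, e.g. "0/1:3,2:5:45:45,0,60"
    Missing values (".") are not handled in this sketch.
    """
    keys = format_field.split(":")
    values = sample_field.split(":")
    pl = values[keys.index("PL")]
    return [int(x) for x in pl.split(",")]


def pl_to_likelihoods(pl_values):
    """Convert Phred-scaled PL values to relative likelihoods: L = 10^(-PL/10)."""
    return [10 ** (-pl / 10) for pl in pl_values]


# For a biallelic SNP the PL order is: hom-ref, het, hom-alt.
pl = parse_pl("GT:AD:DP:GQ:PL", "0/1:3,2:5:45:45,0,60")
relative_likelihoods = pl_to_likelihoods(pl)
```

The resulting values are only defined up to a common factor, which is sufficient when likelihoods are compared across genotypes at the same locus.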

Model

The model describes, independently at each locus, a probability distribution for low coverage NGS (lcNGS) data given a specific genotype $(a_1, a_2)$ of a tested person, where each $a_i$ is one of A, C, G, or T. We describe this model by describing how to simulate from it in several steps. First, for each of $m$ DNA templates that end up founding PCR amplicons, where $m$ is a model parameter, it is randomly chosen whether the template is based on $a_1$ or $a_2$. The identity of each recorded read in the lcNGS data is then determined as follows: with probability $e$, where $e$ is a model parameter, it is chosen randomly among A, C, G, T. Otherwise it is chosen equal to $a_1$ with probability $k/m$ and $a_2$ with probability $1 - k/m$, where $k$ is the number of DNA templates based on $a_1$.
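The simulation steps described above can be sketched in Python as follows. This is a minimal illustration and not the R implementation. Here `m` is the number of DNA templates, `e` is the probability that a read is drawn at random among A, C, G, T, and `k` is the number of templates based on the first allele of the genotype. The sketch assumes that each template is assigned to one of the two alleles with probability 1/2; the function name `simulate_reads` is hypothetical.

```python
import random

BASES = "ACGT"


def simulate_reads(genotype, n_reads, m, e, rng=None):
    """Simulate lcNGS reads at one locus for a genotype (a1, a2).

    Returns (k, reads), where k is the number of the m templates
    that are based on a1.
    """
    rng = rng or random.Random()
    a1, a2 = genotype

    # Step 1: each template is based on a1 or a2 (assumed: fair coin flip).
    k = sum(rng.random() < 0.5 for _ in range(m))

    # Step 2: determine the identity of each recorded read.
    reads = []
    for _ in range(n_reads):
        if rng.random() < e:
            # Noise: the read is chosen randomly among A, C, G, T.
            reads.append(rng.choice(BASES))
        elif rng.random() < k / m:
            reads.append(a1)
        else:
            reads.append(a2)
    return k, reads


# Example: a heterozygous A/G genotype, 10 reads, m = 20 templates, e = 0.01.
k, reads = simulate_reads(("A", "G"), n_reads=10, m=20, e=0.01,
                          rng=random.Random(2023))
```

Passing a seeded `random.Random` instance makes a simulation reproducible; the parameter values in the example are arbitrary.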

Note how the parameter $e$ is broadly related to the drop-in rate, so that, if $e = 0$, no drop-ins will originate as noise in the lcNGS data. However, even if $e = 0$, data could still contain alleles not observed in the frequency database used, but such data would then indicate a true, but until now unobserved, genotype. We address such unobserved alleles with a parameter gamma (denoted unobserved allele freq. in FamLink2). Similarly, the parameter $m$ is related to the drop-out rate. When $m$ is large, the fraction $k/m$ of templates based on $a_1$ is close to $1/2$, and a drop-out can only occur, in our observation model, when the total number of reads is so low that they can all happen to be based on the same one of the two alleles $a_1$ or $a_2$. When $m$ is smaller, $k/m$ can be further away from $1/2$, increasing the imbalance in the sampling and thus the chance of drop-outs. If $m$ is very small, there is a considerable chance that $k = 0$ or $k = m$, in which case there will be a drop-out (for a heterozygous genotype) no matter how many reads are recorded for the single allele. Finally, we note that our model for genotype likelihoods is defined using two parameters ($e$ and $m$), whereas in reality other parameters could also be relevant.
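To make the roles of the two parameters concrete, the sketch below computes the likelihood implied by the simulation description: the number `k` of templates based on the first allele is summed out, and each read contributes `e/4 + (1 - e) * (k/m or 1 - k/m)` depending on which allele it matches. It assumes that `k` follows a Binomial(`m`, 1/2) distribution and that reads are independent given `k`; the multinomial coefficient of the read counts is omitted since it is the same for all genotypes. The function names are hypothetical and the code is not the R implementation.

```python
from math import comb

BASES = "ACGT"


def read_prob(base, genotype, k, m, e):
    """Probability that a single read shows `base`, given k of m templates on a1."""
    a1, a2 = genotype
    # For a homozygote both terms apply and add up to 1 for the matching base.
    signal = (k / m) * (base == a1) + (1 - k / m) * (base == a2)
    return e / 4 + (1 - e) * signal


def genotype_likelihood(counts, genotype, m, e):
    """Likelihood of allele read counts (dict base -> count) given a genotype."""
    total = 0.0
    for k in range(m + 1):
        weight = comb(m, k) * 0.5 ** m  # assumed: k ~ Binomial(m, 1/2)
        prod = 1.0
        for base in BASES:
            prod *= read_prob(base, genotype, k, m, e) ** counts.get(base, 0)
        total += weight * prod
    return total


# Five reads, all showing A, evaluated for a heterozygous A/G genotype.
counts = {"A": 5}
lik_small_m = genotype_likelihood(counts, ("A", "G"), m=1, e=0.0)
lik_large_m = genotype_likelihood(counts, ("A", "G"), m=100, e=0.0)
```

With `m = 1` and `e = 0` the heterozygote likelihood is 0.5, since the single template is based on A half of the time. With `m = 100` it is much smaller, reflecting that a drop-out then requires all five reads to be sampled from the same allele by chance.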

We note that our model bears resemblance to other published models, see for instance GATK.

 

Examples

To be added

Sensitivity study

We are

 

You may send comments to daniel.kling@rmv.se