This page is dedicated to explaining and discussing
our model for genotype likelihoods. The model was introduced in Mostad et
al. (2023) and is summarized below. An implementation in R is also
available at https://familias.name/lcNGS/.
The typical input in our model is low coverage sequence
data for one or more samples and high quality genotype data (from whatever
platform) for zero or more samples. One scenario can be the comparison of low
quality sequence data generated from low quality material (e.g. old bones) with
high quality genotype data from one (or more) alleged living relatives.
Comparisons are also possible when all samples are of high quality.
Our R-script implementation accepts simple allele read counts for a SNP marker as input. In addition, the R-script accepts a standard VCF file with PL values as input. The PL data are then used directly in the genotype likelihood model, which allows external software, e.g. ANGSD or GATK, to generate these likelihoods from BAM files.
An introduction is also provided on slides 50-60 in https://familias.name/ISFG2023/ISFG2023_day2_part1.pdf.
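Since the PL field of a VCF file stores phred-scaled genotype likelihoods, turning them back into relative likelihoods is a short computation. The Python sketch below shows that conversion only; it is not taken from the R script, and the function name and the example PL values are hypothetical.

```python
def pl_to_likelihoods(pl):
    """Convert phred-scaled genotype likelihoods (PL) to relative likelihoods.

    PL = -10 * log10(likelihood), scaled so that the most likely genotype
    has PL 0, hence likelihood = 10 ** (-PL / 10).
    """
    return [10 ** (-value / 10) for value in pl]


# Hypothetical PL entry for a biallelic SNP, ordered 0/0, 0/1, 1/1.
example_pl = [0, 30, 200]
print(pl_to_likelihoods(example_pl))
```

The returned values are relative, not normalized: only their ratios between genotypes carry information.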
The model describes, at each locus independently, a probability distribution for low coverage
NGS (lcNGS) data given a specific genotype (a, b) of a tested person, where each of a and b is one of A, C, G, or T. We
describe this model by describing how to simulate from it in several steps:
First, for each of m DNA templates that end up founding PCR amplicons, where m is a model parameter, it is
randomly chosen whether the template is based on a or b. The identity of each recorded read in the lcNGS data is then
determined as follows: With probability e, where e is a model parameter, it is
chosen randomly among A, C, G, T. Otherwise it is chosen equal to a with probability k/m and to b with probability 1 - k/m, where k is the number of DNA templates
based on a.
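The simulation steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of the R implementation: the genotype is written (a, b), m is the number of DNA templates, e is the probability of a random read, k is the number of templates based on the first allele, and each template is assumed to be based on either allele with equal probability. The function name `simulate_reads` is hypothetical.

```python
import random

BASES = "ACGT"


def simulate_reads(genotype, n_reads, m, e, rng=random):
    """Simulate lcNGS reads at one locus for a genotype (a, b)."""
    a, b = genotype
    # Step 1: each of the m templates is based on a or b
    # (assumption: the two alleles are equally likely).
    k = sum(rng.random() < 0.5 for _ in range(m))
    reads = []
    for _ in range(n_reads):
        if rng.random() < e:
            # Noise: the read is drawn uniformly among A, C, G, T.
            reads.append(rng.choice(BASES))
        elif rng.random() < k / m:
            # The read copies a template based on a (probability k/m).
            reads.append(a)
        else:
            # The read copies a template based on b (probability 1 - k/m).
            reads.append(b)
    return k, reads


rng = random.Random(2023)
k, reads = simulate_reads(("A", "G"), n_reads=8, m=5, e=0.01, rng=rng)
print(k, "".join(reads))
```

Passing a seeded `random.Random` instance keeps a run reproducible; with e set to 0 every simulated read is one of the two genotype alleles.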
The noise parameter e is broadly related to the drop-in
rate, so that, if e = 0, no drop-ins will originate as noise in the lcNGS data. However, even if e = 0, data could still contain alleles not observed in the frequency
database used, but such data would then indicate a true, but until now
unobserved, genotype. We address such unobserved alleles with a parameter gamma
(denoted unobserved allele freq. in FamLink2).
Similarly, the parameter m, the number of DNA templates, is related to the drop-out
rate. When m is large, the proportion k/m of templates based on the first allele (k being the number of such templates) is close to 1/2, and a drop-out can only occur, in our observation model, when the
total number of reads is so low that they all can happen to be based on the
same one of the two alleles of the genotype. When m is smaller,
k/m can be further away from 1/2, increasing the imbalance in the sampling and thus the chance of drop-outs.
If m is very small, there is a considerable chance that all templates are based on the same allele (k = 0 or k = m), in which case there will be a drop-out (for a heterozygous genotype)
no matter how many reads are recorded for the single allele. Finally, we note
that our model for genotype likelihoods is defined
using two parameters (e and m), whereas in reality there are other
parameters that could be relevant.
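The effect of e and m on drop-in and drop-out can be checked numerically. The Python sketch below computes a genotype likelihood from allele read counts by summing over the possible numbers k of templates based on the first allele. It follows from the simulation description on this page, assuming each template is based on either allele with equal probability; it is not the R implementation, the function name is hypothetical, and the multinomial coefficient of the read counts is omitted because it is the same for every genotype.

```python
from math import comb

BASES = "ACGT"


def genotype_likelihood(counts, genotype, m, e):
    """Relative likelihood of allele read counts given a genotype (a, b).

    counts maps each observed base to its number of reads.
    """
    a, b = genotype
    total = 0.0
    for k in range(m + 1):
        # Probability that k of the m templates are based on a
        # (assumption: each template picks a or b with probability 1/2).
        weight = comb(m, k) * 0.5 ** m
        frac = k / m
        prob = 1.0
        for base in BASES:
            # A read shows `base` either as noise (e / 4) or by copying a template.
            p = e / 4 + (1 - e) * (frac * (base == a) + (1 - frac) * (base == b))
            prob *= p ** counts.get(base, 0)
        total += weight * prob
    return total


# Five reads, all showing A, evaluated for a heterozygous A/G genotype.
only_a = {"A": 5}
print(genotype_likelihood(only_a, ("A", "G"), m=1, e=0.0))
print(genotype_likelihood(only_a, ("A", "G"), m=100, e=0.0))
```

With m = 1 and e = 0 the heterozygote explains the all-A data with likelihood 0.5 however many reads are recorded, which is the drop-out behaviour described above; with m = 100 the same data are far less likely under the heterozygous genotype.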
We note that our model bears resemblance to other published models, see for instance
To be added
You may send comments to firstname.lastname@example.org