Genotype likelihoods
This page explains and discusses our model for genotype likelihoods. The model was introduced in Mostad et al. (2023) and is summarized below. An implementation in R is also available at https://familias.name/lcNGS/.
The typical input to our model is low-coverage sequence data for one or more samples and high-quality genotype data (from any platform) for zero or more samples. One scenario is the comparison of low-quality sequence data generated from low-quality material (e.g. old bones) with high-quality genotype data from one or more alleged living relatives. Comparisons are also possible when all samples are of high quality. Our R-script implementation accepts simple allele read counts for a SNP marker as input. In addition, the R script accepts a standard VCF file with PL values as input. The PL data are used in the genotype likelihood model, which allows external software, e.g. ANGSD or GATK, to generate these likelihoods directly from BAM files.
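As a rough illustration of what the PL values in a VCF file encode, the sketch below converts Phred-scaled genotype likelihoods to normalized likelihoods using the standard relation L = 10^(-PL/10). It is written in Python rather than R and is not taken from our R script; the function name `pl_to_likelihoods` and the example PL triple are hypothetical.

```python
def pl_to_likelihoods(pl):
    """Convert Phred-scaled genotype likelihoods (VCF PL field) to
    likelihoods that sum to 1 across the listed genotypes."""
    raw = [10 ** (-value / 10) for value in pl]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical PL triple for a biallelic SNP, in VCF order:
# ref/ref, ref/alt, alt/alt. These numbers are made up for illustration.
example_pl = [0, 3, 30]
likelihoods = pl_to_likelihoods(example_pl)
print(likelihoods)
```

A PL of 0 marks the most likely genotype; each additional 10 units corresponds to a tenfold lower likelihood relative to it.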
An introduction is also provided in slides 50-60 of https://familias.name/ISFG2023/ISFG2023_day2_part1.pdf.
The model describes, at each locus independently, a probability distribution for low coverage NGS (lcNGS) data given a specific genotype $a/b$ of a tested person, where each of $a$ and $b$ is one of A, C, G, or T. We describe this model by describing how to simulate from it in several steps. First, for each of $m$ DNA templates that end up founding PCR amplicons, where $m$ is a model parameter, it is randomly chosen whether the template is based on $a$ or $b$. The identity of each recorded read in the lcNGS data is then determined as follows: with probability $e$, where $e$ is a model parameter, it is chosen randomly among A, C, G, T. Otherwise it is chosen equal to $a$ with probability $k/m$ and $b$ with probability $1 - k/m$, where $k$ is the number of DNA templates based on $a$.
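The simulation steps above can be sketched in a few lines of Python. This is an illustrative sketch, not our R implementation: the function names are hypothetical, and the code assumes that each template is based on either allele with equal probability 1/2 and that a noise read is uniform over A, C, G, T. Under the same assumptions, `likelihood` computes the probability of an observed sequence of reads given a genotype by summing over the unobserved number of templates based on the first allele.

```python
import random
from math import comb

BASES = "ACGT"


def simulate_reads(a, b, n_reads, m, e, rng=random):
    """Simulate n_reads reads for genotype a/b with parameters m and e."""
    # Step 1: each of the m founding templates is based on a or b
    # (assumed here to be a fair coin flip); k counts those based on a.
    k = sum(rng.random() < 0.5 for _ in range(m))
    reads = []
    for _ in range(n_reads):
        if rng.random() < e:
            # Noise: the read is drawn uniformly among A, C, G, T.
            reads.append(rng.choice(BASES))
        elif rng.random() < k / m:
            reads.append(a)
        else:
            reads.append(b)
    return reads


def read_probs(a, b, k, m, e):
    """Probability of each base for a single read, given k."""
    probs = {}
    for base in BASES:
        p = e / 4
        p += (1 - e) * (k / m) * (base == a)
        p += (1 - e) * (1 - k / m) * (base == b)
        probs[base] = p
    return probs


def likelihood(reads, a, b, m, e):
    """P(reads | genotype a/b), marginalizing over k ~ Binomial(m, 1/2)."""
    total = 0.0
    for k in range(m + 1):
        probs = read_probs(a, b, k, m, e)
        p_reads = 1.0
        for r in reads:
            p_reads *= probs[r]
        total += comb(m, k) * 0.5 ** m * p_reads
    return total


rng = random.Random(1)  # fixed seed so the sketch is reproducible
sim = simulate_reads("A", "G", n_reads=20, m=10, e=0.0, rng=rng)
print(sim, likelihood(sim, "A", "G", m=10, e=0.0))
```

Because all reads at a locus share the same $k$, they are not independent given the genotype alone, which is why the sum over $k$ is taken outside the product over reads.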
Note how the parameter $e$ is broadly related to the drop-in rate, so that, if $e = 0$, no drop-ins will originate as noise in the lcNGS data. However, even if $e = 0$, data could still contain alleles not observed in the frequency database used, but such data would then indicate a true, but until now unobserved, genotype. We address such unobserved alleles with a parameter gamma (denoted "unobserved allele freq." in FamLink2). Similarly, the parameter $m$ is related to the drop-out rate. When $m$ is large, the fraction $k/m$ of templates based on the first allele is close to $1/2$, and a drop-out can only occur, in our observation model, when the total number of reads is so low that they all can happen to be based on the same one of the two alleles. When $m$ is smaller, $k/m$ can be further away from $1/2$, increasing the imbalance in the sampling and thus the chance of drop-outs. If $m$ is very small, there is a considerable chance that $k = 0$ or $k = m$, in which cases there will be a drop-out (for a heterozygote genotype) no matter how many reads are recorded for the single allele. Finally, we note that our model for genotype likelihoods is defined using two parameters ($e$ and $m$), whereas in reality other parameters could also be relevant.
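To make the relation between $m$ and drop-out concrete, the following Python sketch computes two quantities for a heterozygote with no noise reads ($e = 0$): the chance that all templates are based on the same allele, and the chance that all recorded reads show the same allele. It assumes each template is based on either allele with probability 1/2; the function names and the values of $m$ and the read count in the loop are arbitrary illustrations, not recommended settings.

```python
from math import comb


def prob_single_allele_templates(m):
    """P(k = 0 or k = m) when each of m templates picks an allele 50/50."""
    return 2 * 0.5 ** m


def prob_dropout(n_reads, m):
    """P(all n_reads >= 1 reads show one allele | heterozygote, e = 0)."""
    total = 0.0
    for k in range(m + 1):
        f = k / m  # fraction of templates based on the first allele
        weight = comb(m, k) * 0.5 ** m  # Binomial(m, 1/2) probability of k
        total += weight * (f ** n_reads + (1 - f) ** n_reads)
    return total


# Illustrative values only: small, moderate, and large m with 20 reads.
for m in (2, 10, 100):
    print(m, prob_single_allele_templates(m), prob_dropout(20, m))
```

For small $m$ the drop-out probability is bounded below by the chance that $k = 0$ or $k = m$, so adding more reads cannot push it below that floor.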
We note that our model bears resemblance to other published models, see for instance GATK.
To be added
We are
You may send comments to daniel.kling@rmv.se