libsequence  1.9.5

Molecular population genetic analysis. More...

#include <Sequence/PolySNP.hpp>

Public Member Functions

 PolySNP (const Sequence::PolyTable *data, bool haveOutgroup=false, unsigned outgroup=0, bool totMuts=true)
 
 PolySNP (const PolySNP &)=delete
 
PolySNPoperator= (const PolySNP &)=delete
 
virtual double ThetaPi (void) const
 
virtual double ThetaW (void) const
 
virtual double ThetaH (void) const
 
virtual double ThetaL (void) const
 
double VarPi (void) const
 
double StochasticVarPi (void) const
 
double SamplingVarPi (void) const
 
double VarThetaW (void) const
 
unsigned NumPoly (void) const
 
virtual unsigned NumMutations (void) const
 
virtual unsigned NumSingletons (void) const
 
virtual unsigned NumExternalMutations (void) const
 
virtual double TajimasD (void) const
 
virtual double Hprime (const bool &likeThorntonAndolfatto=false) const
 
virtual double Dnominator (void) const
 
virtual double FuLiD (void) const
 
virtual double FuLiF (void) const
 
virtual double FuLiDStar (void) const
 
virtual double FuLiFStar (void) const
 
double DandVH (void) const
 
unsigned DandVK (void) const
 
virtual double WallsB (void) const
 
virtual unsigned WallsBprime (void) const
 
virtual double WallsQ (void) const
 
double HudsonsC (void) const
 
virtual unsigned Minrec (void) const
 
std::vector< PairwiseLDstatsDisequilibrium (const unsigned &mincount=1, const double &max_marker_distance=std::numeric_limits< double >::max()) const
 

Protected Member Functions

void DepaulisVeuilleStatistics (void) const
 
virtual void WallStats (void) const
 
double a_sub_n (void) const
 
double a_sub_n_plus1 (void) const
 
double b_sub_n (void) const
 
double b_sub_n_plus1 (void) const
 
double c_sub_n (void) const
 
double d_sub_n (void) const
 

Protected Attributes

std::unique_ptr< _PolySNPImplrep
 

Detailed Description

Molecular population genetic analysis.

Example Use:

#include <iostream>
#include <vector>
using namespace std;
using namespace Sequence;
int main
{
vector<Fasta> data;
Alignment::GetData(data,"popdata.fasta");
assert(Alignment::IsAlignment(data))
if (Alignment::Gapped(data))
{
Alignment::RemoveTerminalGaps(data);
}
PolySites *polytable = new PolySites(data);
PolySNP *analyze = new PolySNP(data,false,0);
cout << "Tajima's D is " << analyze->TajimasD() << endl;
delete polytable;
delete analyze;
exit(1);
}
Warning
The routines to calculate nucleotide diversity, numbers of singletons, etc. all handle missing data (the N character). However, all summary statistics involved in "tests of neutrality" are, strictly speaking, undefined if missing data are present. The reason for this is because the denominators of the statistics are functions of the sample sizes, and no explicit formulae exist when the sample size varies from site to site (which is the case when there are missing data). In short, if you want to be rigorous, you can only really count up nucleotide diversity and a few other statistics if your data contain untyped SNPs. However, the routines present in libsequence will happily go and calculate the summary statistics for you, and it is up to you to be aware that you are writing a program that may give biased results. To date, the magnitude and direction of the bias remains unknown. Functions (and hence the statistics) that are affected have warnings in their documentation.
Note
As of libsequence 1.4.1, the routines in this class explicity check for gaps in the polymorphism table. This provides an additional safeguard for cases where Sequence::PolyTable objects are created and sites with gaps are left in.
Deprecated:
Will be removed in libsequence 2.0
Examples:
slidingWindow.cc, and slidingWindow2.cc.

Definition at line 86 of file PolySNP.hpp.

Constructor & Destructor Documentation

◆ PolySNP()

Sequence::PolySNP::PolySNP ( const Sequence::PolyTable data,
bool  haveOutgroup = false,
unsigned  outgroup = 0,
bool  totMuts = true 
)
explicit
Parameters
dataa valid object of type Sequence::PolyTable
haveOutgrouptrue if an outgroup is present, false otherwise
outgroupif haveOutgroup is true, outgroup is the index of that sequence in data
totMutsif true (the default) use the total number of inferred mutations, otherwise use the total number of polymorphic sites in calculations
Note
this constructor allocates a pointer to an implementation class, which automatically pre-processes the SNP data

Definition at line 171 of file PolySNP.cc.

Member Function Documentation

◆ a_sub_n()

double Sequence::PolySNP::a_sub_n ( void  ) const
protected

\[a_n=\sum_{i=1}^{i=n-1}\frac{1}{i}.\ \]

This is the denominator of Watterson's Theta (see PolySNP::ThetaW)

Warning
statistic undefined if there are untyped SNPs

Definition at line 1504 of file PolySNP.cc.

◆ a_sub_n_plus1()

double Sequence::PolySNP::a_sub_n_plus1 ( void  ) const
protected

\[a_{n+1}=\sum_{i=1}^{i=n}\frac{1}{i}\ \]

Warning
statistic undefined if there are untyped SNPs

Definition at line 1520 of file PolySNP.cc.

◆ b_sub_n()

double Sequence::PolySNP::b_sub_n ( void  ) const
protected

\[b_n=\sum_{i=1}^{i=n-1}\frac{1}{i^2}\ \]

Warning
statistic undefined if there are untyped SNPs

Definition at line 1537 of file PolySNP.cc.

◆ b_sub_n_plus1()

double Sequence::PolySNP::b_sub_n_plus1 ( void  ) const
protected

\[b_n=\sum_{i=1}^{i=n}\frac{1}{i^2}\ \]

Warning
statistic undefined if there are untyped SNPs
Author
Joshua Shapiro

Definition at line 1552 of file PolySNP.cc.

◆ c_sub_n()

double Sequence::PolySNP::c_sub_n ( void  ) const
protected

\[ c_n=\left\{\begin{array}{cl} 1 , & when\ n = 2 \\ \frac{2 \times (n \times a_n - 2 \times (n-1))}{(n-1) \times (n-2)}, & when\ n > 2 \\ \end{array}\right.\ \]

Warning
statistic undefined if there are untyped SNPs

Definition at line 1568 of file PolySNP.cc.

◆ d_sub_n()

double Sequence::PolySNP::d_sub_n ( void  ) const
protected

\[\ d_n=\frac{2}{n-1} \times (1.5 - \frac{2 \times a_{n+1}}{n-2} - \frac{1}{n})\ \]

Warning
statistic undefined if there are untyped SNPs

Definition at line 1595 of file PolySNP.cc.

◆ DandVH()

double Sequence::PolySNP::DandVH ( void  ) const

To check if two sequences are unique, Sequence::Comparisons::Different is used, which does not allow missing data to result in 2 sequences being considered different (as they would be if you simply used thestd::string comparison operators == or !=)

Returns
the haplotype diversity of the data.

Definition at line 1614 of file PolySNP.cc.

◆ DandVK()

unsigned Sequence::PolySNP::DandVK ( void  ) const

To check if two sequences are unique, Sequence::Comparisons::Different is used, which does not allow missing data to result in 2 sequences being considered different (as they would be if you simply used the std::string comparison operators == or !=)

Returns
number of haplotypes in the sample

Definition at line 1631 of file PolySNP.cc.

◆ DepaulisVeuilleStatistics()

void Sequence::PolySNP::DepaulisVeuilleStatistics ( void  ) const
protected

Calculate the number of haplotypes in the sample, and haplotype diversity. Unlike Depaulis and Veuille's original paper, this routine uses an unbiased calculation of haplotype diversity (i.e. divide by n choose 2).
To check if two sequences are unique, Sequence::Comparisons::Different is used, which does not allow missing data to result in 2 sequences being considered different (as they would be if you simply used the std::string comparison operators == or !=)

Definition at line 1050 of file PolySNP.cc.

◆ Disequilibrium()

std::vector< PairwiseLDstats > Sequence::PolySNP::Disequilibrium ( const unsigned &  mincount = 1,
const double &  max_marker_distance = std::numeric_limits<double>::max() 
) const
Returns
A vector of statistics related to LD and distance in the sample. An empty vector is returned if there are < 2 polymorphic sites in the sample. See the documentation for Recombination::Disequilibrium for a description of the return vector.
Parameters
mincounta frequency filter. A polymorphism must be present at least mincount times in the data
Note
For D and D', the 11 gamete is defined as follows: If no outgroup is present, it refers to the genotype of minor alleles at both sites. If there is an outgroup, it is based on the genotype of derived alleles at both sites.

Definition at line 1837 of file PolySNP.cc.

◆ Dnominator()

double Sequence::PolySNP::Dnominator ( void  ) const
virtual
Warning
statistic undefined if there are untyped SNPs
Returns
Denominator of Tajima's D, or nan if there are no polymorphic sites

Definition at line 1014 of file PolySNP.cc.

◆ FuLiD()

double Sequence::PolySNP::FuLiD ( void  ) const
virtual
Returns
The Fu and Li (1993) D statistic, or nan if there are no polymorphic sites.
Note
For sequence data, an outgroup is required. This requirement is checked by assert()
Warning
statistic undefined if there are untyped SNPs

Definition at line 1359 of file PolySNP.cc.

◆ FuLiDStar()

double Sequence::PolySNP::FuLiDStar ( void  ) const
virtual
Warning
statistic undefined if there are untyped SNPs
Returns
Fu and Li (1993) D*, or nan if there are no polymorphic sites

Definition at line 1428 of file PolySNP.cc.

◆ FuLiF()

double Sequence::PolySNP::FuLiF ( void  ) const
virtual
Returns
Fu and Li (1993) F statistic, or nan if there are no polymorphic sites
Note
For sequence data, an outgroup is required, else undefined
Warning
statistic undefined if there are untyped SNPs

Definition at line 1388 of file PolySNP.cc.

◆ FuLiFStar()

double Sequence::PolySNP::FuLiFStar ( void  ) const
virtual

Fu and Li (1993) F* statistic. Incorporates correction from Simonsen et al. (1995) Genetics 141: 413, eqn A5.

Warning
statistic undefined if there are untyped SNPs
Returns
Fu and Li (1993) F* statistic, or nan if there are no polymorphic sites

Definition at line 1462 of file PolySNP.cc.

◆ Hprime()

double Sequence::PolySNP::Hprime ( const bool &  likeThorntonAndolfatto = false) const
virtual
Returns
ThetaPi-ThetaH/(~Var(ThetaPi-ThetaH)). This corresponds to Equation 5 in Thornton and Andolfatto (Genetics) "Approximate Bayesian Inference reveals evidence for a recent, severe, bottleneck in a Netherlands population of Drosophila melanogaster" and Equation 13 of Zeng et al. (2006) Genetics 1431-1439
Parameters
likeThorntonAndolfattoThe calculation of H' requires calculation of $\theta^2$. In Thornton and Andolfatto, we simply used $\widehat\theta_W^2$, which is slightly biased. By default, this function calculates $\theta^2=\frac{S(S-1)}{a_n^2+b_n}$, unless this bool is set to false, in which case $\widehat\theta_W^2$ is used.
Note
returns nan if there are 0 polymorphic sites
Author
Joshua Shapiro

Definition at line 952 of file PolySNP.cc.

◆ HudsonsC()

double Sequence::PolySNP::HudsonsC ( void  ) const
Returns
Hudson's (1987) estimator of $\rho=4Nc$, an estimator of the population recombination rate that depends on the variance of the site frequencies. The calculation is made by a call to Recombination::HudsonsC
Note
Will return nan if there are no polymorphic sites

Definition at line 1648 of file PolySNP.cc.

◆ Minrec()

unsigned Sequence::PolySNP::Minrec ( void  ) const
virtual
Returns
The minimum number of recombination events observed in the sample (Hudson and Kaplan 1985). Will return SEQMAXUNSIGNED if there are < 2 segregating sites.
Note
Code is a modification of that provided by Jeff Wall

Definition at line 1665 of file PolySNP.cc.

◆ NumExternalMutations()

unsigned Sequence::PolySNP::NumExternalMutations ( void  ) const
virtual
Returns
the number of derived singletons.
Note
For sequence data, an outgroup is required. Will return SEQMAXUNSIGNED if that is not the case.

Definition at line 885 of file PolySNP.cc.

◆ NumMutations()

unsigned Sequence::PolySNP::NumMutations ( void  ) const
virtual
Returns
the total number of mutations in the data. The number of mutations per site = number of states per site - 1

Definition at line 823 of file PolySNP.cc.

◆ NumPoly()

unsigned Sequence::PolySNP::NumPoly ( void  ) const
Returns
the number of polymorphic (segregating) sites in data

Definition at line 807 of file PolySNP.cc.

◆ NumSingletons()

unsigned Sequence::PolySNP::NumSingletons ( void  ) const
virtual
Returns
number of polymorphisms that appear once in the data, without respect to ancestral/derived

Definition at line 844 of file PolySNP.cc.

◆ SamplingVarPi()

double Sequence::PolySNP::SamplingVarPi ( void  ) const

Component of variance of mean pairwise differences from sampling. Tajima in Takahata/Clark book, (15)

Warning
statistic undefined if there are untyped SNPs

Definition at line 1324 of file PolySNP.cc.

◆ StochasticVarPi()

double Sequence::PolySNP::StochasticVarPi ( void  ) const

Stochastic variance of mean pairwise differences. Tajima in Takahata/Clark book, (14).

Warning
statistic undefined if there are untyped SNPs

Definition at line 1306 of file PolySNP.cc.

◆ TajimasD()

double Sequence::PolySNP::TajimasD ( void  ) const
virtual

A common summary of the site frequency spectrum. Proportional to $\widehat\theta_\pi-\widehat\theta_W$. This routine does calculate the denominator of the test statistic.

Warning
statistic undefined if there are untyped SNPs
Returns
Tajima's D, or nan if there are no polymorphic sites
Examples:
slidingWindow.cc, and slidingWindow2.cc.

Definition at line 929 of file PolySNP.cc.

◆ ThetaH()

double Sequence::PolySNP::ThetaH ( void  ) const
virtual

Calculate Theta ( = 4Nu) from site homozygosity, a la Fay and Wu (2000). This statistic is problematic in general to calculate when there are multiple hits. The test requires that the ancestral state (inferred from the outgroup) still be segregating in the ingroup. If that is not true, the site is skipped.
If there are >= 2 derived states inferred, a "missing data" approach is taken.
For example:
Outgroup :
A
Ingroup :
A
A
A
G
G
T
Gets treated as two sites:
A A
A A
A A
G N
G N
N T

This keeps the expectation of the statistic equal to $\theta$, and uses the correct number of derived mutations ovserved in the data.

Note
When using PolySNP, an outgroup is required. When using PolySIM, which is constructed with a SimData *, an outgroup is not required (as the 0,1 coding of the data refers to ancestral and derived, respectively).

Definition at line 366 of file PolySNP.cc.

◆ ThetaL()

double Sequence::PolySNP::ThetaL ( void  ) const
virtual

Calculate Theta ( = 4Nu) from site homozygosity, corresponding to equation 1 in Thornton and Andolfatto (Genetics) "Approximate Bayesian Inference reveals evidence for a recent, severe, bottleneck in a Netherlands population of Drosophila melanogaster," (although we labelled in $theta_\eta$ in that paper) The test requires that the ancestral state (inferred from the outgroup) still be segregating in the ingroup. If that is not true, the site is skipped.
If there are >= 2 derived states inferred, a "missing data" approach is taken.
For example:
Outgroup :
A
Ingroup :
A
A
A
G
G
T
Gets treated as two sites:
A A
A A
A A
G N
G N
N T

This keeps the expectation of the statistic equal to $\theta$, and uses the correct number of derived mutations ovserved in the data.

Note
For sequence data, an outgroup is required. This requirement is checked by assert()
Author
Joshua Shapiro

Definition at line 594 of file PolySNP.cc.

◆ ThetaPi()

double Sequence::PolySNP::ThetaPi ( void  ) const
virtual

Calculated here as the sum of 1.0 - sum of site homozygosity accross sites.

\[ \widehat\theta_\pi=\sum_{i=1}^{i=S}(1-\sum_{j=1}^{j=4}\frac{k_{j,i} \times (k_{j,i}-1)}{n_i \times (n_i - 1)});k_{j,i}>0 \]

Where $S$ is the number of segregating sites, $k_{j,i}$ is the number of occurences of the $j_{th}$ character state at site $i$, and $n_i$ is the sample size at site $i$. Calculating the statistic in this manner makes it easy to generalize to an arbitrary number of character states per polymorphic site
Also equivalent to sum of site heterozygosities:

\[ \widehat\theta_\pi=\sum_{i=1}^{i=S}2{p_i}{q_i} \]


Also equivalent to mean pairwise differences, but that's slow to calculate.

If there is missing data (indicated by 'N' characters), the sample size is reduced for that site. For example, if the data for the $i_{th}$ site is:
A
A
A
N
N
G
Then ThetaPi is calculated for that site as if the sample size were 4 (not 6), and the polymorphic site frequencies are 3/4 for A and 1/4 for G

Definition at line 193 of file PolySNP.cc.

◆ ThetaW()

double Sequence::PolySNP::ThetaW ( void  ) const
virtual

The classic "Watterson's Theta" statistic, generalized to missing data and multiple mutations per site:

\[ \widehat\theta_w=\sum_{i=1}^{i=S}\frac{S}{\sum_{j=1}^{j=n_i-1}\frac{1}{j}} \]


For this statistic, $S$ is either the number of segregating sites, or the number of mutations on the genealogy and $n_i$ is the sample size at site i. If totMuts == 1, the number of mutations is used, else the number ofsegregating sites is used.

Warning
Statistic undefined if there are untyped SNPs. In the presence of missing data, ThetaW is calculated as the sum (over all segregating sites) of 1/a_sub_n, where a_sub_n is the denominator of ThetaW, using the number of alleles without missing data as the sample size. More formally, the routine returns a calculation base on an unweighted sample size adjustment accross sites.

Definition at line 314 of file PolySNP.cc.

◆ VarPi()

double Sequence::PolySNP::VarPi ( void  ) const

Total variance of mean pairwise differences. Tajima in Takahata/Clark book, (13).

Warning
statistic undefined if there are untyped SNPs

Definition at line 1288 of file PolySNP.cc.

◆ VarThetaW()

double Sequence::PolySNP::VarThetaW ( void  ) const
Returns
Variance of Watterson's Theta (ThetaW()).
Warning
statistic undefined if there are untyped SNPs

Definition at line 1341 of file PolySNP.cc.

◆ WallsB()

double Sequence::PolySNP::WallsB ( void  ) const
virtual
Returns
Wall's B Statistic. Wall, J. (1999) Genetical Research 74, pp 65-79
Author
Kevin Thornton

Definition at line 1144 of file PolySNP.cc.

◆ WallsBprime()

unsigned Sequence::PolySNP::WallsBprime ( void  ) const
virtual
Returns
Wall's B' Statistic. Wall, J. (1999) Genetical Research 74, pp 65-79
Author
Kevin Thornton

Definition at line 1262 of file PolySNP.cc.

◆ WallsQ()

double Sequence::PolySNP::WallsQ ( void  ) const
virtual
Returns
Wall's Q Statistic. Wall, J. (1999) Genetical Research 74, pp 65-79
Author
Kevin Thornton

Definition at line 1275 of file PolySNP.cc.


The documentation for this class was generated from the following files: