Molecular population genetic analysis. More...

#include <Sequence/PolySNP.hpp>

Public Member Functions
	PolySNP (const Sequence::PolyTable *data, bool haveOutgroup=false, unsigned outgroup=0, bool totMuts=true)

	PolySNP (const PolySNP &)=delete

PolySNP &	operator= (const PolySNP &)=delete

virtual double	ThetaPi (void) const

virtual double	ThetaW (void) const

virtual double	ThetaH (void) const

virtual double	ThetaL (void) const

double	VarPi (void) const

double	StochasticVarPi (void) const

double	SamplingVarPi (void) const

double	VarThetaW (void) const

unsigned	NumPoly (void) const

virtual unsigned	NumMutations (void) const

virtual unsigned	NumSingletons (void) const

virtual unsigned	NumExternalMutations (void) const

virtual double	TajimasD (void) const

virtual double	Hprime (const bool &likeThorntonAndolfatto=false) const

virtual double	Dnominator (void) const

virtual double	FuLiD (void) const

virtual double	FuLiF (void) const

virtual double	FuLiDStar (void) const

virtual double	FuLiFStar (void) const

double	DandVH (void) const

unsigned	DandVK (void) const

virtual double	WallsB (void) const

virtual unsigned	WallsBprime (void) const

virtual double	WallsQ (void) const

double	HudsonsC (void) const

virtual unsigned	Minrec (void) const

std::vector< PairwiseLDstats >	Disequilibrium (const unsigned &mincount=1, const double &max_marker_distance=std::numeric_limits< double >::max()) const

Protected Member Functions
void	DepaulisVeuilleStatistics (void) const

virtual void	WallStats (void) const

double	a_sub_n (void) const

double	a_sub_n_plus1 (void) const

double	b_sub_n (void) const

double	b_sub_n_plus1 (void) const

double	c_sub_n (void) const

double	d_sub_n (void) const

Protected Attributes
std::unique_ptr< _PolySNPImpl >	rep

Detailed Description

Molecular population genetic analysis.

Example Use:

#include <iostream>
#include <vector>
#include <Sequence/PolySNP.hpp>
#include <Sequence/Fasta.hpp>
#include <Sequence/PolySites.hpp>
using namespace std;
using namespace Sequence;
int main
{
  vector<Fasta> data;
  Alignment::GetData(data,"popdata.fasta");
  assert(Alignment::IsAlignment(data))
  if (Alignment::Gapped(data))
  {
    Alignment::RemoveTerminalGaps(data);
  }
  PolySites *polytable = new PolySites(data);
  PolySNP *analyze = new PolySNP(data,false,0);
  cout << "Tajima's D is " << analyze->TajimasD() << endl;
  delete polytable;
  delete analyze;
  exit(1);
}

Warning: The routines to calculate nucleotide diversity, numbers of singletons, etc. all handle missing data (the N character). However, all summary statistics involved in "tests of neutrality" are, strictly speaking, undefined if missing data are present. The reason for this is because the denominators of the statistics are functions of the sample sizes, and no explicit formulae exist when the sample size varies from site to site (which is the case when there are missing data). In short, if you want to be rigorous, you can only really count up nucleotide diversity and a few other statistics if your data contain untyped SNPs. However, the routines present in libsequence will happily go and calculate the summary statistics for you, and it is up to you to be aware that you are writing a program that may give biased results. To date, the magnitude and direction of the bias remains unknown. Functions (and hence the statistics) that are affected have warnings in their documentation.

Note: As of libsequence 1.4.1, the routines in this class explicity check for gaps in the polymorphism table. This provides an additional safeguard for cases where Sequence::PolyTable objects are created and sites with gaps are left in.

Deprecated:: Will be removed in libsequence 2.0

Examples:: slidingWindow.cc, and slidingWindow2.cc.

Definition at line 86 of file PolySNP.hpp.

Constructor & Destructor Documentation

◆ PolySNP()

Sequence::PolySNP::PolySNP	(	const Sequence::PolyTable *	data,
		bool	haveOutgroup = `false`,
		unsigned	outgroup = `0`,
		bool	totMuts = `true`
	)

explicit

Parameters

data	a valid object of type Sequence::PolyTable
haveOutgroup	`true` if an outgroup is present, `false` otherwise
outgroup	if haveOutgroup is `true`, outgroup is the index of that sequence in data
totMuts	if true (the default) use the total number of inferred mutations, otherwise use the total number of polymorphic sites in calculations

Note: this constructor allocates a pointer to an implementation class, which automatically pre-processes the SNP data

Definition at line 171 of file PolySNP.cc.

Member Function Documentation

◆ a_sub_n()

double Sequence::PolySNP::a_sub_n ( void ) const

protected

$a_n=\sum_{i=1}^{i=n-1}\frac{1}{i}.\$

This is the denominator of Watterson's Theta (see PolySNP::ThetaW)

Warning: statistic undefined if there are untyped SNPs

Definition at line 1504 of file PolySNP.cc.

◆ a_sub_n_plus1()

double Sequence::PolySNP::a_sub_n_plus1 ( void ) const

protected

$a_{n+1}=\sum_{i=1}^{i=n}\frac{1}{i}\$

Warning: statistic undefined if there are untyped SNPs

Definition at line 1520 of file PolySNP.cc.

◆ b_sub_n()

double Sequence::PolySNP::b_sub_n ( void ) const

protected

$b_n=\sum_{i=1}^{i=n-1}\frac{1}{i^2}\$

Warning: statistic undefined if there are untyped SNPs

Definition at line 1537 of file PolySNP.cc.

◆ b_sub_n_plus1()

double Sequence::PolySNP::b_sub_n_plus1 ( void ) const

protected

$b_n=\sum_{i=1}^{i=n}\frac{1}{i^2}\$

Warning: statistic undefined if there are untyped SNPs

Author: Joshua Shapiro

Definition at line 1552 of file PolySNP.cc.

◆ c_sub_n()

double Sequence::PolySNP::c_sub_n ( void ) const

protected

$c_n=\left\{\begin{array}{cl} 1 , & when\ n = 2 \\ \frac{2 \times (n \times a_n - 2 \times (n-1))}{(n-1) \times (n-2)}, & when\ n > 2 \\ \end{array}\right.\$

Warning: statistic undefined if there are untyped SNPs

Definition at line 1568 of file PolySNP.cc.

◆ d_sub_n()

double Sequence::PolySNP::d_sub_n ( void ) const

protected

$\ d_n=\frac{2}{n-1} \times (1.5 - \frac{2 \times a_{n+1}}{n-2} - \frac{1}{n})\$

Warning: statistic undefined if there are untyped SNPs

Definition at line 1595 of file PolySNP.cc.

◆ DandVH()

double Sequence::PolySNP::DandVH ( void ) const

To check if two sequences are unique, Sequence::Comparisons::Different is used, which does not allow missing data to result in 2 sequences being considered different (as they would be if you simply used thestd::string comparison operators == or !=)

Returns: the haplotype diversity of the data.

Definition at line 1614 of file PolySNP.cc.

◆ DandVK()

unsigned Sequence::PolySNP::DandVK ( void ) const

To check if two sequences are unique, Sequence::Comparisons::Different is used, which does not allow missing data to result in 2 sequences being considered different (as they would be if you simply used the std::string comparison operators == or !=)

Returns: number of haplotypes in the sample

Definition at line 1631 of file PolySNP.cc.

◆ DepaulisVeuilleStatistics()

void Sequence::PolySNP::DepaulisVeuilleStatistics ( void ) const

protected

Calculate the number of haplotypes in the sample, and haplotype diversity. Unlike Depaulis and Veuille's original paper, this routine uses an unbiased calculation of haplotype diversity (i.e. divide by n choose 2).
To check if two sequences are unique, Sequence::Comparisons::Different is used, which does not allow missing data to result in 2 sequences being considered different (as they would be if you simply used the std::string comparison operators == or !=)

Definition at line 1050 of file PolySNP.cc.

◆ Disequilibrium()

std::vector< PairwiseLDstats > Sequence::PolySNP::Disequilibrium	(	const unsigned &	mincount = `1`,
		const double &	max_marker_distance = `std::numeric_limits<double>::max()`
	)		const

Returns: A vector of statistics related to LD and distance in the sample. An empty vector is returned if there are < 2 polymorphic sites in the sample. See the documentation for Recombination::Disequilibrium for a description of the return vector.

Parameters

mincount a frequency filter. A polymorphism must be present at least mincount times in the data

Note: For D and D', the 11 gamete is defined as follows: If no outgroup is present, it refers to the genotype of minor alleles at both sites. If there is an outgroup, it is based on the genotype of derived alleles at both sites.

Definition at line 1837 of file PolySNP.cc.

◆ Dnominator()

double Sequence::PolySNP::Dnominator ( void ) const

virtual

Warning: statistic undefined if there are untyped SNPs

Returns: Denominator of Tajima's D, or nan if there are no polymorphic sites

Definition at line 1014 of file PolySNP.cc.

◆ FuLiD()

double Sequence::PolySNP::FuLiD ( void ) const

virtual

Returns: The Fu and Li (1993) D statistic, or nan if there are no polymorphic sites.

Note: For sequence data, an outgroup is required. This requirement is checked by assert()

Warning: statistic undefined if there are untyped SNPs

Definition at line 1359 of file PolySNP.cc.

◆ FuLiDStar()

double Sequence::PolySNP::FuLiDStar ( void ) const

virtual

Warning: statistic undefined if there are untyped SNPs

Returns: Fu and Li (1993) D*, or nan if there are no polymorphic sites

Definition at line 1428 of file PolySNP.cc.

◆ FuLiF()

double Sequence::PolySNP::FuLiF ( void ) const

virtual

Returns: Fu and Li (1993) F statistic, or nan if there are no polymorphic sites

Note: For sequence data, an outgroup is required, else undefined

Warning: statistic undefined if there are untyped SNPs

Definition at line 1388 of file PolySNP.cc.

◆ FuLiFStar()

double Sequence::PolySNP::FuLiFStar ( void ) const

virtual

Fu and Li (1993) F* statistic. Incorporates correction from Simonsen et al. (1995) Genetics 141: 413, eqn A5.

Warning: statistic undefined if there are untyped SNPs

Returns: Fu and Li (1993) F* statistic, or nan if there are no polymorphic sites

Definition at line 1462 of file PolySNP.cc.

◆ Hprime()

double Sequence::PolySNP::Hprime ( const bool & likeThorntonAndolfatto = false ) const

virtual

Returns: ThetaPi-ThetaH/(~Var(ThetaPi-ThetaH)). This corresponds to Equation 5 in Thornton and Andolfatto (Genetics) "Approximate Bayesian Inference reveals evidence for a recent, severe, bottleneck in a Netherlands population of Drosophila melanogaster" and Equation 13 of Zeng et al. (2006) Genetics 1431-1439

Parameters

likeThorntonAndolfatto The calculation of H' requires calculation of $\theta^2$ . In Thornton and Andolfatto, we simply used $\widehat\theta_W^2$ , which is slightly biased. By default, this function calculates $\theta^2=\frac{S(S-1)}{a_n^2+b_n}$ , unless this bool is set to false, in which case $\widehat\theta_W^2$ is used.

Note: returns nan if there are 0 polymorphic sites

Author: Joshua Shapiro

Definition at line 952 of file PolySNP.cc.

◆ HudsonsC()

double Sequence::PolySNP::HudsonsC ( void ) const

Returns: Hudson's (1987) estimator of $\rho=4Nc$ , an estimator of the population recombination rate that depends on the variance of the site frequencies. The calculation is made by a call to Recombination::HudsonsC

Note: Will return nan if there are no polymorphic sites

Definition at line 1648 of file PolySNP.cc.

◆ Minrec()

unsigned Sequence::PolySNP::Minrec ( void ) const

virtual

Returns: The minimum number of recombination events observed in the sample (Hudson and Kaplan 1985). Will return SEQMAXUNSIGNED if there are < 2 segregating sites.

Note: Code is a modification of that provided by Jeff Wall

Definition at line 1665 of file PolySNP.cc.

◆ NumExternalMutations()

unsigned Sequence::PolySNP::NumExternalMutations ( void ) const

virtual

Returns: the number of derived singletons.

Note: For sequence data, an outgroup is required. Will return SEQMAXUNSIGNED if that is not the case.

Definition at line 885 of file PolySNP.cc.

◆ NumMutations()

unsigned Sequence::PolySNP::NumMutations ( void ) const

virtual

Returns: the total number of mutations in the data. The number of mutations per site = number of states per site - 1

Definition at line 823 of file PolySNP.cc.

◆ NumPoly()

unsigned Sequence::PolySNP::NumPoly ( void ) const

Returns: the number of polymorphic (segregating) sites in data

Definition at line 807 of file PolySNP.cc.

◆ NumSingletons()

unsigned Sequence::PolySNP::NumSingletons ( void ) const

virtual

Returns: number of polymorphisms that appear once in the data, without respect to ancestral/derived

Definition at line 844 of file PolySNP.cc.

◆ SamplingVarPi()

double Sequence::PolySNP::SamplingVarPi ( void ) const

Component of variance of mean pairwise differences from sampling. Tajima in Takahata/Clark book, (15)

Warning: statistic undefined if there are untyped SNPs

Definition at line 1324 of file PolySNP.cc.

◆ StochasticVarPi()

double Sequence::PolySNP::StochasticVarPi ( void ) const

Stochastic variance of mean pairwise differences. Tajima in Takahata/Clark book, (14).

Warning: statistic undefined if there are untyped SNPs

Definition at line 1306 of file PolySNP.cc.

◆ TajimasD()

double Sequence::PolySNP::TajimasD ( void ) const

virtual

A common summary of the site frequency spectrum. Proportional to $\widehat\theta_\pi-\widehat\theta_W$ . This routine does calculate the denominator of the test statistic.

Warning: statistic undefined if there are untyped SNPs

Returns: Tajima's D, or nan if there are no polymorphic sites

Examples:: slidingWindow.cc, and slidingWindow2.cc.

Definition at line 929 of file PolySNP.cc.

◆ ThetaH()

double Sequence::PolySNP::ThetaH ( void ) const

virtual

Calculate Theta ( = 4Nu) from site homozygosity, a la Fay and Wu (2000). This statistic is problematic in general to calculate when there are multiple hits. The test requires that the ancestral state (inferred from the outgroup) still be segregating in the ingroup. If that is not true, the site is skipped.
If there are >= 2 derived states inferred, a "missing data" approach is taken.
For example:
Outgroup :
A
Ingroup :
A
A
A
G
G
T
Gets treated as two sites:
A A
A A
A A
G N
G N
N T

This keeps the expectation of the statistic equal to $\theta$ , and uses the correct number of derived mutations ovserved in the data.

Note: When using PolySNP, an outgroup is required. When using PolySIM, which is constructed with a SimData *, an outgroup is not required (as the 0,1 coding of the data refers to ancestral and derived, respectively).

Definition at line 366 of file PolySNP.cc.

◆ ThetaL()

double Sequence::PolySNP::ThetaL ( void ) const

virtual

Calculate Theta ( = 4Nu) from site homozygosity, corresponding to equation 1 in Thornton and Andolfatto (Genetics) "Approximate Bayesian Inference reveals evidence for a recent, severe, bottleneck in a Netherlands population of Drosophila melanogaster," (although we labelled in $theta_\eta$ in that paper) The test requires that the ancestral state (inferred from the outgroup) still be segregating in the ingroup. If that is not true, the site is skipped.
If there are >= 2 derived states inferred, a "missing data" approach is taken.
For example:
Outgroup :
A
Ingroup :
A
A
A
G
G
T
Gets treated as two sites:
A A
A A
A A
G N
G N
N T

This keeps the expectation of the statistic equal to $\theta$ , and uses the correct number of derived mutations ovserved in the data.

Note: For sequence data, an outgroup is required. This requirement is checked by assert()

Author: Joshua Shapiro

Definition at line 594 of file PolySNP.cc.

◆ ThetaPi()

double Sequence::PolySNP::ThetaPi ( void ) const

virtual

Calculated here as the sum of 1.0 - sum of site homozygosity accross sites.

$\widehat\theta_\pi=\sum_{i=1}^{i=S}(1-\sum_{j=1}^{j=4}\frac{k_{j,i} \times (k_{j,i}-1)}{n_i \times (n_i - 1)});k_{j,i}>0$

Where $S$ is the number of segregating sites, $k_{j,i}$ is the number of occurences of the $j_{th}$ character state at site $i$ , and $n_i$ is the sample size at site $i$ . Calculating the statistic in this manner makes it easy to generalize to an arbitrary number of character states per polymorphic site
Also equivalent to sum of site heterozygosities:

$\widehat\theta_\pi=\sum_{i=1}^{i=S}2{p_i}{q_i}$

Also equivalent to mean pairwise differences, but that's slow to calculate.

If there is missing data (indicated by 'N' characters), the sample size is reduced for that site. For example, if the data for the $i_{th}$ site is:
A
A
A
N
N
G
Then ThetaPi is calculated for that site as if the sample size were 4 (not 6), and the polymorphic site frequencies are 3/4 for A and 1/4 for G

Definition at line 193 of file PolySNP.cc.

◆ ThetaW()

double Sequence::PolySNP::ThetaW ( void ) const

virtual

The classic "Watterson's Theta" statistic, generalized to missing data and multiple mutations per site:

$\widehat\theta_w=\sum_{i=1}^{i=S}\frac{S}{\sum_{j=1}^{j=n_i-1}\frac{1}{j}}$

For this statistic, $S$ is either the number of segregating sites, or the number of mutations on the genealogy and $n_i$ is the sample size at site i. If totMuts == 1, the number of mutations is used, else the number ofsegregating sites is used.

Warning: Statistic undefined if there are untyped SNPs. In the presence of missing data, ThetaW is calculated as the sum (over all segregating sites) of 1/a_sub_n, where a_sub_n is the denominator of ThetaW, using the number of alleles without missing data as the sample size. More formally, the routine returns a calculation base on an unweighted sample size adjustment accross sites.

Definition at line 314 of file PolySNP.cc.

◆ VarPi()

double Sequence::PolySNP::VarPi ( void ) const

Total variance of mean pairwise differences. Tajima in Takahata/Clark book, (13).

Warning: statistic undefined if there are untyped SNPs

Definition at line 1288 of file PolySNP.cc.

◆ VarThetaW()

double Sequence::PolySNP::VarThetaW ( void ) const

Returns: Variance of Watterson's Theta (ThetaW()).

Warning: statistic undefined if there are untyped SNPs

Definition at line 1341 of file PolySNP.cc.

◆ WallsB()

double Sequence::PolySNP::WallsB ( void ) const

virtual

Returns: Wall's B Statistic. Wall, J. (1999) Genetical Research 74, pp 65-79

Author: Kevin Thornton

Definition at line 1144 of file PolySNP.cc.

◆ WallsBprime()

unsigned Sequence::PolySNP::WallsBprime ( void ) const

virtual

Returns: Wall's B' Statistic. Wall, J. (1999) Genetical Research 74, pp 65-79

Author: Kevin Thornton

Definition at line 1262 of file PolySNP.cc.

◆ WallsQ()

double Sequence::PolySNP::WallsQ ( void ) const

virtual

Returns: Wall's Q Statistic. Wall, J. (1999) Genetical Research 74, pp 65-79

Author: Kevin Thornton

Definition at line 1275 of file PolySNP.cc.

The documentation for this class was generated from the following files:

Public Member Functions

Protected Member Functions

Protected Attributes

Detailed Description

Constructor & Destructor Documentation

◆ PolySNP()

Member Function Documentation

◆ a_sub_n()

◆ a_sub_n_plus1()

◆ b_sub_n()

◆ b_sub_n_plus1()

◆ c_sub_n()

◆ d_sub_n()

◆ DandVH()

◆ DandVK()

◆ DepaulisVeuilleStatistics()

◆ Disequilibrium()

◆ Dnominator()

◆ FuLiD()

◆ FuLiDStar()

◆ FuLiF()

◆ FuLiFStar()

◆ Hprime()

◆ HudsonsC()

◆ Minrec()

◆ NumExternalMutations()

◆ NumMutations()

◆ NumPoly()

◆ NumSingletons()

◆ SamplingVarPi()

◆ StochasticVarPi()

◆ TajimasD()

◆ ThetaH()

◆ ThetaL()

◆ ThetaPi()

◆ ThetaW()

◆ VarPi()

◆ VarThetaW()

◆ WallsB()

◆ WallsBprime()

◆ WallsQ()