Calculating summary statistics¶
This page is a tutorial on calculating summary statistics from a libsequence.VariantMatrix
. We
also show how to interchange data between msprime [5] and pylibseq.
Simple overview of concepts¶
In [1]: import msprime
In [2]: import libsequence
In [3]: import numpy as np
# Get a TreeSequence from msprime
In [4]: ts = msprime.simulate(100, mutation_rate=250., recombination_rate=250., random_seed=42)
# TreeSequence -> VariantMatrix
In [5]: vm = libsequence.VariantMatrix.from_TreeSequence(ts)
# VariantMatrix -> AlleleCountMatrix
In [6]: ac = vm.count_alleles()
Certain summary statistics can be calculated simply from allele counts via
libsequence.AlleleCountMatrix
. Calculations invovling linkage disequilibrium
will typically require the entire libsequence.VariantMatrix
object.
The following examples rely on msprime [5] to generate input data, but the concepts hold more generally.
List of “classic”/standard summary statistics¶
Nucleotide diversity, or \(\pi\) [7]:
-
libsequence.
thetapi
(ac: libsequence._libsequence.AlleleCountMatrix) → float¶ Mean number of pairwise differences.
Parameters: ac – A libsequence.AlleleCountMatrix
Note
Implemented as sum of site heterozygosity.
In [7]: print(libsequence.thetapi(ac))
980.3789898989783
Watterson’s estimator of \(\theta\), [10]:
-
libsequence.
thetaw
(ac: libsequence._libsequence.AlleleCountMatrix) → float¶ Watterson’s theta.
Parameters: m – A libsequence.AlleleCountMatrix
Note
Calculated from the total number of mutations.
In [8]: print(libsequence.thetaw(ac))
982.5437651915445
Tajima’s \(D\), [8]:
-
libsequence.
tajd
(ac: libsequence._libsequence.AlleleCountMatrix) → float¶ Tajima’s D.
Parameters: m – A libsequence.AlleleCountMatrix
In [9]: print(libsequence.tajd(ac))
-0.007527225314052745
Fay and Wu’s test statistic, \(\pi - \hat\theta_H\) [2] has two overloads. The first takes a single ancestral state as an argument. The latter takes a list of ancestral states (one per site in the variant matrix):
-
libsequence.
faywuh
(*args, **kwargs)¶ Overloaded function.
- faywuh(ac: libsequence._libsequence.AlleleCountMatrix, ancestral_state: int) -> float
- faywuh(ac: libsequence._libsequence.AlleleCountMatrix, ancestral_states: List[int]) -> float
In [10]: print(libsequence.faywuh(ac, 0))
-74.02989898988403
The \(H'\) statistic [11] should be preferred to the Fay and Wu test statistic:
-
libsequence.
hprime
(*args, **kwargs)¶ Overloaded function.
- hprime(ac: libsequence._libsequence.AlleleCountMatrix, ancestral_state: int) -> float
- hprime(ac: libsequence._libsequence.AlleleCountMatrix, ancestral_state: List[int]) -> float
In [11]: print(libsequence.hprime(ac, 0))
-0.12962968409194914
Summaries of haplotype variation [1]:
-
libsequence.
number_of_haplotypes
(arg0: libsequence._libsequence.VariantMatrix) → int¶
-
libsequence.
haplotype_diversity
(arg0: libsequence._libsequence.VariantMatrix) → float¶
In [12]: print(libsequence.number_of_haplotypes(vm))
99
In [13]: print(libsequence.haplotype_diversity(vm))