Example workflow

We’ll work with pylibseq’s wrapper to libsequence’s SimData, which is used to process bi-allele data encoded as 0/1 = ancestral/derived, respectively

In [1]: from __future__ import print_function

In [2]: import libsequence

Assigning to an object

You may assign from a list of tuples.

Each tuple is a site: (pos:genotypes)

Here, there are 2 sites and a sample size of \(n=4\)

In [3]: rawData1 = [(0.1,'0101'),(0.2,'1010')]

#We can construct objects straight from these tuples
In [4]: sd = libsequence.SimData(rawData1)

In [5]: sd.size()
Out[5]: 4

In [6]: sd.pos()
Out[6]: [0.1, 0.2]

In [7]: sd.data()
Out[7]: ['01', '10', '01', '10']

, you can assign from separate list of positions and haplotypes

In [8]: rawDataPos = [0.1,0.2]

In [9]: rawDataGenos = ['01','10','01','10']

In [10]: sd.assign(rawDataPos,rawDataGenos)

In [11]: sd.numsites()
Out[11]: 2

In [12]: sd.size()
Out[12]: 4

In [13]: sd.pos()
Out[13]: [0.1, 0.2]

In [14]: sd.data()
Out[14]: ['01', '10', '01', '10']

Summary statistics

Let’s calculate some basic summary statistics

See libsequence.summstats.PolySIM for more documentation

#ms 10 1 -s 10 -I 2 5 5 0.05
In [15]: rawDataPos=[0.0997, 0.2551, 0.3600, 0.4831, 0.5205, 0.5668, 0.5824, 0.6213, 0.7499, 0.9669]

In [16]: rawDataGenos=['0000001010',
   ....:               '1000000011',
   ....:               '1000001010',
   ....:               '1000000010',
   ....:               '1000001010',
   ....:               '1111010100',
   ....:               '1111010100',
   ....:               '1111110100',
   ....:               '1111010100',
   ....:               '1111010100']
   ....: 

In [17]: sd.assign(rawDataPos,rawDataGenos)

In [18]: ps = libsequence.PolySIM(sd)

In [19]: ps.thetapi()
Out[19]: 4.4

In [20]: ps.thetaw()
Out[20]: 3.5348576237901534

In [21]: ps.tajimasd()
Out[21]: 1.084815820054003

Filtering sites

You may also filter sites from your variation tables using libsequence.removeColumns(), which operates on objects of type libsequence.StateCounter.

Our data look like this right now:

In [22]: print(sd)
//
segsites: 10
positions: 0.0997 0.2551 0.36 0.4831 0.5205 0.5668 0.5824 0.6213 0.7499 0.9669
0000001010
1000000011
1000001010
1000000010
1000001010
1111010100
1111010100
1111110100
1111010100
1111010100

Let’s remove derived singletons:

In [23]: sd2 = libsequence.removeColumns(sd,lambda x : x.one != 1)

In [24]: print(sd2)
//
segsites: 8
positions: 0.0997 0.2551 0.36 0.4831 0.5668 0.5824 0.6213 0.7499
00000101
10000001
10000101
10000001
10000101
11111010
11111010
11111010
11111010
11111010

Let’s remove all singletons:

In [25]: sd3 = libsequence.removeColumns(sd,lambda x: x.one !=1 and x.zero != 1)

In [26]: print(sd3)
//
segsites: 7
positions: 0.2551 0.36 0.4831 0.5668 0.5824 0.6213 0.7499
0000101
0000001
0000101
0000001
0000101
1111010
1111010
1111010
1111010
1111010

Note

Yes, it is odd that the column removal function removes sites for which the lambda returns False. I’ll fix that in a future release, which requires an upstream change to libsequence.

Sliding windows

In [27]: w = libsequence.Windows(sd,window_size=0.1,step_len=0.05,starting_pos=0.,ending_pos=1.0)

In [28]: len(w)
Out[28]: 20

In [29]: for i in range(len(w)):
   ....:     wi = w[i]
   ....:     pswi = libsequence.PolySIM(wi)
   ....:     print(pswi.thetaw())
   ....: 
0.3534857623790153
0.3534857623790153
0.0
0.0
0.3534857623790153
0.3534857623790153
0.3534857623790153
0.3534857623790153
0.3534857623790153
0.7069715247580306
1.060457287137046
1.060457287137046
0.3534857623790153
0.3534857623790153
0.3534857623790153
0.0
0.0
0.0
0.3534857623790153
0.3534857623790153

Linkage disequilibrium

The function libsequence.ld returns pairwise LD stats as a list of dicts. The return value is easily coerced into a pandas.DataFrame.

\(F_{ST}\)

Let’s pretend that our data are from two demes of sizes n/2 each.

Note that most flavors of \(F_{ST}\) are very similar to one another. See Charlesworth, B. (1998) Mol. Biol. Evol. 15(5): 538-543 for a great overview.

In [30]: import libsequence

In [31]: sd.size()
Out[31]: 10

In [32]: f = libsequence.Fst(sd,[5,5])

#Hudson, Slatkin, and Maddison's FST:
In [33]: f.hsm()
Out[33]: 0.8750000000000001

#Slatkin's
In [34]: f.slatkin()
Out[34]: 0.7777777777777778

#Hudson, Boos, and Kaplan, which is also Nei's Gst:
In [35]: f.hbk()
Out[35]: 0.7777777777777778

#Positions of snps shared b/w demes 0 and 1
In [36]: f.shared(0,1)
Out[36]: set()

#Positions of private mutations in deme 0 and 1:
In [37]: f.priv(0,1)
Out[37]: ({0.0997, 0.5824, 0.9669}, {0.5205})

#Positions of fixed differences between demes 0 and 1:
In [38]: f.fixed(0,1)
Out[38]: {0.2551, 0.36, 0.4831, 0.5668, 0.6213, 0.7499}