The VariantMatrix

Variation data are stored in instances of libsequence.VariantMatrix. This class stores state data as signed 8-bit integers and position data as floats (C/C++ doubles). Missing data are encoded as negative numbers. You can use any negative number for missing data except the minimum possible value of an 8-bit integer, which is reserved for internal use as a “mask” value when filtering data out of variant matrices.

In [1]: import libsequence

# You can construct variant matrices from lists
In [2]: states = [0, 1, 1, 0, 0, 0, 0, 1]

In [3]: pos = [0.1, 0.2]

In [4]: m = libsequence.VariantMatrix(states, pos)

In [5]: print(m.nsites)
2

In [6]: print(m.nsam)
4

libsequence.VariantMatrix.data allows numpy-based views of the genotype data:

In [7]: import numpy as np

In [8]: ma = m.data

In [9]: print(ma)
[[0 1 1 0]
 [0 0 0 1]]

Our ma object is a thin read-only wrapper to the underlying memory allocated by C++, so we cannot change the data:

In [10]: try:
   ....:    ma[0,2] = -1
   ....: except ValueError:
   ....:    print("exception raised")
   ....: 
exception raised

You can construct from numpy arrays without making a copy! Internally, the VariantMatrix will hold a reference to the original data:

In [11]: import sys

In [12]: states_array = np.array(states, dtype=np.int8).reshape((2,4))

In [13]: pos_array = np.array(pos)

In [14]: print(sys.getrefcount(states_array), sys.getrefcount(pos_array))
2 2

In [15]: m = libsequence.VariantMatrix(states_array, pos_array)

In [16]: print(sys.getrefcount(states_array), sys.getrefcount(pos_array))
3 3

In [17]: print(m.nsites)
2

In [18]: print(m.nsam)
4

The ability to get data from numpy arrays copy-free means that we can get data from tools like msprime and fwdpy11 into pylibseq for “free”.

Note

When msprime generates genotype data as a numpy array, the dtype is np.uint8, but the VariantMatrix dtype is np.int8. Thus, a cast must take place, which slows down the conversion somewhat.

You can access the site and sample data via loops:

In [19]: for i in range(m.nsites):
   ....:     s = m.site(i)
   ....:     print([i for i in s])
   ....: 
[0, 1, 1, 0]
[0, 0, 0, 1]

In [20]: for i in range(m.nsam):
   ....:     s = m.sample(i)
   ....:     print([i for i in s])
   ....: 
[0, 0]
[1, 0]
[1, 0]
[0, 1]

The types involved in the last example are:

In [21]: print(type(m.site(0)))
<class 'libsequence._libsequence.ConstRowView'>

In [22]: print(type(m.sample(0)))
<class 'libsequence._libsequence.ConstColView'>

Counting the states at a site

The most straightforward way to get the allele counts at all sites is via libsequence.AlleleCountMatrix:

In [23]: import msprime

In [24]: ts = msprime.simulate(10, mutation_rate=25, random_seed=666)

In [25]: m = libsequence.VariantMatrix.from_TreeSequence(ts)

In [26]: ac = m.count_alleles()

In [27]: print(np.array(ac)[:5])
[[9 1]
 [9 1]
 [9 1]
 [9 1]
 [4 6]]

# Confirm that the counts are the same as
# what msprime thinks:
In [28]: vi = ts.variants()

In [29]: for i in range(5):
   ....:     v = next(vi)
   ....:     ones = np.count_nonzero(v.genotypes)
   ....:     print(len(v.genotypes)-ones, ones)
   ....: 
9 1
9 1
9 1
9 1
4 6

# These count objects are sliceable...
In [30]: print(np.array(ac[1:ac.nrow:25]))
[[9 1]
 [4 6]
 [7 3]
 [9 1]
 [4 6]
 [9 1]
 [9 1]
 [9 1]]

# ...and indexable via lists
In [31]: print(np.array(ac[[0,1,2,3,4]]))
[[9 1]
 [9 1]
 [9 1]
 [9 1]
 [4 6]]

The allele count data are stored in order of allele label, starting with zero. The sum of allele counts at a site is the sample size at that site.

libsequence.StateCounts provides a means to generate allele counts on-demand for a site:

In [32]: c = libsequence.StateCounts()

# These objects are callable classes:
In [33]: c(m.site(0))

In [34]: print(c.counts[:3])
[9, 1, 0]

# The sample size at this site
In [35]: print(c.n)
10

# c is iterable...
In [36]: for i in c:
   ....:     if i > 0:
   ....:         print(i)
   ....: 
9
1

#...and indexable...
In [37]: for i in range(len(c)):
   ....:     if c[i] > 0:
   ....:         print(i,c[i])
   ....: 
0 9
1 1

#...and supports the buffer protocol
In [38]: ca = np.array(c)

In [39]: nonzero_states = np.where(ca > 0)

In [40]: print(nonzero_states[0])
[0 1]

In [41]: ca[nonzero_states[0]]
Out[41]: array([9, 1], dtype=int32)

By convention, missing data affects the sample size at a site:

In [42]: ts = msprime.simulate(10, mutation_rate=10.)

# msprime's genotype matrix have dtype np.uint8,
# so we must cast to signed int in order to
# assign missing data:
In [43]: g = ts.genotype_matrix().astype(np.int8)

In [44]: g[0,0] = -1

In [45]: m = libsequence.VariantMatrix(g, ts.tables.sites.position)

In [46]: print(vm.data[0,0])
0

In [47]: c(m.site(0))

# Sample size reduced by 1 due to missing data
In [48]: print(c.n)
9

You may specify a reference state when counting. Depending on the analysis, that may mean a literal reference genome state, an ancestral state, a minor allele state, etc.

# Above, no reference state was specified,
# so it is considered missing:
In [49]: print(c.refstate)
-1

# Let's let 0 be the reference state:
In [50]: c = libsequence.StateCounts(refstate = 0)

In [51]: c(m.site(0))

In [52]: print(c.counts[:3])
[8, 1, 0]

In [53]: print(c.refstate)
0

You may get all of the counts at all sites in three different ways:

# Without respect to reference state
In [54]: lc = libsequence.process_variable_sites(m)

In [55]: for i in lc[:5]:
   ....:     print(i.counts[:2], i.refstate)
   ....: 
[8, 1] -1
[9, 1] -1
[4, 6] -1
[6, 4] -1
[8, 2] -1

# With a single reference state for all sites
In [56]: lc = libsequence.process_variable_sites(m, 0)

In [57]: for i in lc[:5]:
   ....:     print(i.counts[:2], i.refstate)
   ....: 
[8, 1] 0
[9, 1] 0
[4, 6] 0
[6, 4] 0
[8, 2] 0

# With a reference specified state for each site
In [58]: rstats = [0 for i in range(m.nsites)]

In [59]: rstats[0:len(rstats):2] = [1 for i in range(0,len(rstats),2)]

In [60]: lc = libsequence.process_variable_sites(m, rstats)

In [61]: for i in lc[:5]:
   ....:     print(i.counts[:2], i.refstate)
   ....: 
[8, 1] 1
[9, 1] 0
[4, 6] 1
[6, 4] 0
[8, 2] 1

Encoding missing data

# This is the value of the reserved state:
In [62]: print(libsequence.VariantMatrix.mask)
-128

# Attempting to construct an object with this
# value is allowed, but is an error.
# Downstream analyses will see this and raise exceptions.
In [63]: x = libsequence.VariantMatrix([0, 1, libsequence.VariantMatrix.mask, 2], [0.2, 0.5])

In [64]: print(x.data)
[[   0    1]
 [-128    2]]

# For example:
In [65]: c(x.site(1))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-65-5f73ed7c191c> in <module>()
----> 1 c(x.site(1))

ValueError: reserved value encountered

Filtering VariantMatrix data

You may remove sites and/or samples via the application of functions written in Python. To filter sites, a function must take the return value of libsequence.VariantMatrix.site() as an argument:

In [66]: class RemoveNonRefSingletons(object):
   ....:     def __init__(self):
   ....:         self.__c = libsequence.StateCounts(0)
   ....:     def __call__(self, x):
   ....:         self.__c(x)
   ....:         n=np.array(self.__c, copy=False)
   ....:         singletons = np.where(n == 1)
   ....:         if len(singletons[0])>0:
   ....:             return True
   ....:         return False
   ....: 

# Copy our data
In [67]: m2 = libsequence.VariantMatrix(m.data, m.positions)

In [68]: rv = libsequence.filter_sites(m2, RemoveNonRefSingletons())

In [69]: print(m.data.shape)
(110, 10)

In [70]: print(m2.data.shape)
(74, 10)

# This is the number of sites removed:
In [71]: print(rv)
29

Performance tip: I wrote the callable as a class so that a StateCounts could be stored as member data. The reason is that libsequence.StateCounts.counts is a buffer whose memory is re-used for each call. Thus, storing an instance saves repeated memory allocation/deallocation events for each site.

Similarly, we can remove samples:

# Treat 0 as the reference state
In [72]: def remove_all_ref_samples(x):
   ....:     if all([i==0 for i in x]):
   ....:         return True
   ....:     return False
   ....: 

In [73]: m2 = libsequence.VariantMatrix(m.data, m.positions)

In [74]: rv = libsequence.filter_haplotypes(m2, remove_all_ref_samples)

In [75]: print(rv)
0

In [76]: print(m.data.shape)
(110, 10)

In [77]: print(m2.data.shape)
(110, 10)