Tree Sequences¶
pylibseq is able to generate instances of libsequence.VariantMatrix
from a Python object
representing a “tree sequence” [6]. Currently, the primary method of interacting with these data
structures is via msprime.
There are two methods to get data into a VariantMatrix. The first is via numpy arrays as described in The VariantMatrix:
In [1]: import msprime
In [2]: import libsequence
In [3]: import numpy as np
# Get a TreeSequence from msprime:
In [4]: ts = msprime.simulate(100, mutation_rate = 100)
# Use numpy arrays to make VariantMatrix
In [5]: m = libsequence.VariantMatrix(ts.genotype_matrix(), ts.tables.sites.position)
The second method is to call a class method of VariantMatrix:
In [6]: m2 = libsequence.VariantMatrix.from_TreeSequence(ts)
In [7]: assert np.array_equal(m.data, m2.data)
In [8]: assert np.array_equal(m.positions, m2.positions)
The second method is a touch slower than the first, but it uses about half as much memory. The reason is that getting the genotype matrix from the TreeSequence requires allocating the entire matrix. The second method only asks msprime to generate a 1d numpy array of length ts.num_samples, but does so once for each of ts.num_variants.