Tree Sequences

pylibseq is able to generate instances of libsequence.VariantMatrix from a Python object representing a “tree sequence” [6]. Currently, the primary method of interacting with these data structures is via msprime.

There are two methods to get data into a VariantMatrix. The first is via numpy arrays as described in The VariantMatrix:

In [1]: import msprime

In [2]: import libsequence

In [3]: import numpy as np

# Get a TreeSequence from msprime:
In [4]: ts = msprime.simulate(100, mutation_rate = 100)

# Use numpy arrays to make VariantMatrix
In [5]: m = libsequence.VariantMatrix(ts.genotype_matrix(), ts.tables.sites.position)

The second method is to call a class method of VariantMatrix:

In [6]: m2 = libsequence.VariantMatrix.from_TreeSequence(ts)

In [7]: assert np.array_equal(m.data, m2.data)

In [8]: assert np.array_equal(m.positions, m2.positions)

The second method is a touch slower than the first, but it uses about half as much memory. The reason is that getting the genotype matrix from the TreeSequence requires allocating the entire matrix. The second method only asks msprime to generate a 1d numpy array of length ts.num_samples, but does so once for each of ts.num_variants.