Data structures related to tree sequences#
Low-level data types#
fwdpy11.Node: defines nodes
fwdpy11.Edge: defines edges
fwdpy11.Site: defines a site (genomic location) where a mutation is present
fwdpy11.MutationRecord: defines mutations associated with fwdpy11.Site objects on trees and in fwdpy11.DiploidPopulation objects
fwdpy11.NodeTable: represents a node table
fwdpy11.EdgeTable: represents an edge table
fwdpy11.MutationTable: represents a mutation table
fwdpy11.SiteTable: represents a site table
These types are all analogous to the corresponding tskit types.
Here, fwdpy11.MutationRecord plays the role of tskit.Mutation, but with additional data fields that are useful for forward-time simulations.
Table collections#
The above data types are encapsulated in the Python class fwdpy11.TableCollection. Instances of this class are data fields of populations, available via fwdpy11.DiploidPopulation.tables.
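As a quick orientation, here is a minimal sketch of reaching those tables from a population. It assumes fwdpy11 is installed and that the table collection exposes its node and edge tables as the nodes and edges attributes (and that these support len()); the printed values are only illustrative.

import fwdpy11

# A diploid population of N = 100 individuals on a genome of length 1.0.
pop = fwdpy11.DiploidPopulation(100, 1.0)

# The population's table collection (fwdpy11.DiploidPopulation.tables).
tables = pop.tables

# Assumption: the node/edge tables are the .nodes/.edges attributes and
# support len(). A fresh population has nodes for the founder generation
# but no edges recorded yet.
print(len(tables.nodes), len(tables.edges))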
Representing trees#
An edge table and a node table contain all the data required to generate what we may call “marginal trees”, which are the trees corresponding to genomic intervals \([left, right)\). In fwdpy11, fwdpy11.TreeIterator allows left-to-right traversal across marginal trees as well as access to their data.
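To make “marginal trees” concrete, here is a small illustration using msprime and tskit rather than fwdpy11 (the parameter values and variable names are arbitrary): simulating with recombination produces several trees, each attached to a half-open genomic interval.

import msprime

# Simulate with recombination so that the tree sequence will typically
# contain several marginal trees (parameter values are arbitrary).
ts_demo = msprime.sim_ancestry(
    5, population_size=100, sequence_length=1e4, recombination_rate=1e-6, random_seed=42
)

# Each marginal tree is attached to a half-open genomic interval [left, right).
for tree in ts_demo.trees():
    print(tree.interval)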
The intent of this section is to describe the structure of the marginal trees and get some terminology out of the way. I will use Jerome Kelleher’s tskit package to illustrate the concepts. fwdpy11 uses the same basic data model, and it is more convenient to use trees generated via a coalescent simulation than to run forward simulations. Plus, we can use the great “draw” methods from tskit!
Note
Do not mistake the tskit operations used below for the operations needed in fwdpy11. They differ in some important ways. Namely, the tskit API is currently richer and the fwdpy11 API is currently lower-level. The main goal here is to define terms and visualize them.
We are going to look at a single tree, but the concepts apply to any of the trees generated by a simulation with nonzero recombination rates. First, let’s get a tree sequence using the linear-time algorithm of Hudson (1990), which involves repeated swapping of the IDs of potential ancestors until the MRCA is reached.
import tskit
import numpy as np
# The following makes use of the Kirk Lohmueller seed.
np.random.seed(101 * 405 * 10 * 110)
tc = tskit.TableCollection(1)
nsam = 10
for i in range(nsam):
    tc.nodes.add_row(time=0, flags=1)
nodes = np.arange(2 * nsam - 1, dtype=np.int32)
time = 0.0
# The citation for this algorithm is
# Hudson, Richard R. 1990.
# “Gene Genealogies and the Coalescent Process.”
# Oxford Surveys in Evolutionary Biology 7 (1): 44.
n = nsam
while n > 1:
    rcoal = (n * (n - 1)) / 2.0
    tcoal = np.random.exponential(1.0 / rcoal)
    time += tcoal
    tc.nodes.add_row(time=time)
    ancestor = 2 * nsam - n
    p = np.random.choice(n, 1)[0]
    tc.edges.add_row(left=0.0, right=1.0, parent=ancestor, child=nodes[p])
    nodes[p] = nodes[n - 1]
    p = np.random.choice(n - 1, 1)[0]
    tc.edges.add_row(left=0.0, right=1.0, parent=ancestor, child=nodes[p])
    nodes[p] = nodes[2 * nsam - n]
    n -= 1
tc.sort()
ts = tc.tree_sequence()
print(ts.first().draw(format="unicode"))
18
┏━━━━━┻━━━━━┓
┃ 17
┃ ┏━━━┻━━┓
16 ┃ ┃
┏┻━┓ ┃ ┃
┃ ┃ 15 ┃
┃ ┃ ┏━┻━━┓ ┃
┃ ┃ ┃ 14 ┃
┃ ┃ ┃ ┏┻┓ ┃
┃ ┃ ┃ ┃ ┃ 13
┃ ┃ ┃ ┃ ┃ ┏┻┓
12 ┃ ┃ ┃ ┃ ┃ ┃
┏┻┓ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ 11 ┃ ┃ ┃ ┃
┃ ┃ ┃ ┏━┻┓ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ 10 ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┃
0 4 1 2 6 7 3 5 8 9
The above tree corresponds to a sample size of 10 haplotypes. The tree shows the node labels. The nodes labelled 0 through 9 correspond to the present time point; these are the “alive nodes” or the “current generation” if we are thinking about a forward simulation. Further, we can describe the branches leading to these sample nodes as the “tips” or “leaves” of a tree.
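We can confirm which nodes are the samples directly from the node table; nodes 0 through 9 were added with flags=1 and time 0.0 (this uses only standard tskit attributes):

# Sample (alive) nodes were added with flags=1 and time=0.
print(ts.samples())
print(ts.tables.nodes.time[:10])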
“Leaves” of a tree#
One thing that we often want to know is “how many samples descend from node \(i\)?” To answer that, we may count the leaves below each node of a marginal tree:
# Let's store our
# tree in a variable now
t = ts.first()
def get_leaf_counts(tree, i):
    # Count the number of leaves descending from node i
    return len([j for j in tree.leaves(i)])
# Map node ids to their leaf counts using a dict
lcmap = {i: "{}".format(get_leaf_counts(t, i)) for i in range(len(ts.tables.nodes))}
print(t.draw(format="unicode", node_labels=lcmap))
┏━━┻━┓
┃ ┃
┃ ┏┻━┓
┃ ┃ ┃
┏┻┓ ┃ ┃
┃ ┃ ┃ ┃
┃ ┃ ┏┻┓ ┃
┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┏┓┃
┃ ┃ ┃ ┃┃┃
┃ ┃ ┃ ┃┃┏
┃ ┃ ┃ ┃┃┃
┏┓┃ ┃ ┃┃┃
┃┃┃ ┃ ┃┃┃
┃┃┃┏┓ ┃┃┃
┃┃┃┃┃ ┃┃┃
┃┃┃┃┏┓┃┃┃
┃┃┃┃┃┃┃┃┃
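As an aside, tskit can produce these counts directly: Tree.num_samples reports the number of sample nodes descending from a node, which for this tree equals the leaf count because every leaf is a sample.

# Same counts, using tskit's built-in method.
print([t.num_samples(u) for u in range(len(ts.tables.nodes))])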
Children and siblings#
Let’s take another look at our tree, labelled with node ids:
print(t.draw(format="unicode"))
18
┏━━━━━┻━━━━━┓
┃ 17
┃ ┏━━━┻━━┓
16 ┃ ┃
┏┻━┓ ┃ ┃
┃ ┃ 15 ┃
┃ ┃ ┏━┻━━┓ ┃
┃ ┃ ┃ 14 ┃
┃ ┃ ┃ ┏┻┓ ┃
┃ ┃ ┃ ┃ ┃ 13
┃ ┃ ┃ ┃ ┃ ┏┻┓
12 ┃ ┃ ┃ ┃ ┃ ┃
┏┻┓ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ 11 ┃ ┃ ┃ ┃
┃ ┃ ┃ ┏━┻┓ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ 10 ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┃
0 4 1 2 6 7 3 5 8 9
Let’s ponder node 15 for a moment. It has two immediate descendants, nodes 11 and 14. We may consider these the left and right children, respectively, of node 15.
def get_children(tree, i):
    # Label each node with its (left child, right child) pair
    lc = tree.left_child(i)
    rc = tree.right_child(i)
    if lc == tskit.NULL and rc == tskit.NULL:
        return "->NULL"
    return "->" + str((lc, rc))
cmap = {i: get_children(t, i) for i in range(len(ts.tables.nodes))}
print(t.draw(format="unicode", node_labels=cmap))
->(16, 17)
┏━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┓
┃ ->(13, 15)
┃ ┏━━━━━━━━━━━┻━━━━━━━━━━━┓
->(1, 12) ┃ ┃
┏━━━━━┻━━━━┓ ┃ ┃
┃ ┃ ->(11, 14) ┃
┃ ┃ ┏━━━━━━━━┻━━━━━━━━━┓ ┃
┃ ┃ ┃ ->(3, 5) ┃
┃ ┃ ┃ ┏━━┻━━━┓ ┃
┃ ┃ ┃ ┃ ┃ ->(8, 9)
┃ ┃ ┃ ┃ ┃ ┏━━┻━━━┓
->(0, 4) ┃ ┃ ┃ ┃ ┃ ┃
┏━━┻━━━┓ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ->(2, 10) ┃ ┃ ┃ ┃
┃ ┃ ┃ ┏━━━━┻━━━━┓ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ->(6, 7) ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┏━━┻━━━┓ ┃ ┃ ┃ ┃
->NULL ->NULL ->NULL ->NULL ->NULL ->NULL ->NULL ->NULL ->NULL ->NULL
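The left/right distinction mirrors how the underlying arrays are stored. When working with tskit interactively, note that Tree.children returns all of a node’s immediate descendants at once; for node 15 it should list nodes 11 and 14:

# tskit can list every immediate descendant of a node directly.
print(t.children(15))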
Likewise, we may look at the sibling relationships amongst nodes:
def get_sibs(tree, i):
    # Label each node with its (left sib, right sib) pair
    ls = tree.left_sib(i)
    rs = tree.right_sib(i)
    if ls == tskit.NULL and rs == tskit.NULL:
        return "->NULL"
    return "->" + str((ls, rs))
smap = {i: get_sibs(t, i) for i in range(len(ts.tables.nodes))}
print(t.draw(format="unicode", node_labels=smap))
->NULL
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ->(16, -1)
┃ ┏━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━┓
->(-1, 17) ┃ ┃
┏━━━━━━━┻━━━━━━━┓ ┃ ┃
┃ ┃ ->(13, -1) ┃
┃ ┃ ┏━━━━━━━━━━━━━┻━━━━━━━━━━━━┓ ┃
┃ ┃ ┃ ->(11, -1) ┃
┃ ┃ ┃ ┏━━━━┻━━━━┓ ┃
┃ ┃ ┃ ┃ ┃ ->(-1, 15)
┃ ┃ ┃ ┃ ┃ ┏━━━━┻━━━━┓
->(1, -1) ┃ ┃ ┃ ┃ ┃ ┃
┏━━━━┻━━━━┓ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ->(-1, 14) ┃ ┃ ┃ ┃
┃ ┃ ┃ ┏━━━━━━━┻━━━━━━┓ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ->(2, -1) ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┏━━━━┻━━━━┓ ┃ ┃ ┃ ┃
->(-1, 4) ->(0, -1) ->(-1, 12) ->(-1, 10) ->(-1, 7) ->(6, -1) ->(-1, 5) ->(3, -1) ->(-1, 9) ->(8, -1)
Multiply-linked lists#
Under the hood, the data structures representing marginal trees consist of several arrays that form a multiply-linked list, allowing traversal up/down/left/right along a marginal tree. By convention, the value -1 is taken as a NULL value, signifying that there are no more nodes in “that” direction along the tree.
We can look directly at these arrays for our tree:
nnodes = len(ts.tables.nodes)
# First, let's write down our node ids:
print([i for i in range(nnodes)])
# Now, get the parents of each node, moving "up" the tree
print([t.parent(i) for i in range(nnodes)])
# The left child list allows moving "down left" along a tree
print([t.left_child(i) for i in range(nnodes)])
# The right child list allows moving "down right" along a tree
print([t.right_child(i) for i in range(nnodes)])
# The left sib list allows moving "left" along a tree
print([t.left_sib(i) for i in range(nnodes)])
# The right sib list allows moving "right" along a tree
print([t.right_sib(i) for i in range(nnodes)])
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
[12, 16, 11, 14, 12, 14, 10, 10, 13, 13, 11, 15, 16, 17, 15, 17, 18, 18, -1]
[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 6, 2, 0, 8, 3, 11, 1, 13, 16]
[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 7, 10, 4, 9, 5, 14, 12, 15, 17]
[-1, -1, -1, -1, 0, 3, -1, 6, -1, 8, 2, -1, 1, -1, 11, 13, -1, 16, -1]
[4, 12, 10, 5, -1, -1, 7, -1, 9, -1, -1, 14, -1, 15, -1, -1, 17, -1, -1]
These lists are interpreted as lists of nodes referring to other nodes. For example, the value 12 in position 0 of the parents list means, “The parent of the node with index 0 has index 12”, where the indexes refer to rows of the node table.
Given the above lists, you may start at any valid node id (e.g. a value not equal to -1 in the first array), and then “walk” in any direction you choose along the tree until you hit a value of -1, meaning that you cannot proceed any further.
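For example, here is a short walk from sample node 0 up to the root, following parent values until -1 (tskit.NULL) is reached; for this tree the path is 0, 12, 16, 18, which you can check against the parents list above.

# Walk from node 0 to the root via the parent pointers.
u = 0
path = [u]
while t.parent(u) != tskit.NULL:
    u = t.parent(u)
    path.append(u)
print(path)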
Note
The trees generated by msprime.sim_ancestry
are bifurcating, which is a consequence of simulating from the Kingman
coalescent. In forward-time simulations, it is not uncommon to have more than two descendants of a node. When that
happens, left_child and right_child refer to the left-most and right-most children, respectively. Thus, to “walk”
along the descendants of a node, you proceed to left_child, and then march along right_sib until a value of -1 is
seen. This “walking” method is the same as what you would do for a bifurcating tree, but note that counting the number of immediate descendants of a node amounts to counting the steps in that walk, which may be more than two.
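The following sketch implements that walk with tskit calls; for the bifurcating tree above, leaves yield a count of zero and every internal node yields a count of two.

def count_children(tree, u):
    # Start at the left-most child and follow right_sib until NULL (-1).
    n = 0
    c = tree.left_child(u)
    while c != tskit.NULL:
        n += 1
        c = tree.right_sib(c)
    return n


print([count_children(t, u) for u in range(len(ts.tables.nodes))])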
The five arrays of parents, left children, right children, left sibs, and right sibs shown above correspond to attributes of fwdpy11.TreeIterator.
The key to efficiency is how these linked lists are updated as you move from tree \(i\) to tree \(i+1\). Given an appropriate set of indexes, these lists are only updated at the positions that differ between the two trees. In practice, adjacent trees are highly correlated, meaning that very few values need updating. The construction of these indexes is described immediately above the description of “Algorithm T” in the Kelleher et al. (2016) paper describing msprime.
The building of those indexes is a key determinant of performance for algorithms on tree sequences. Fortunately, you don’t need to worry about that, as that machinery is hidden in the internals of fwdpp (and tskit).
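To get a feel for how little changes from one marginal tree to the next, here is an illustration using tskit’s edge_diffs method on a recombining msprime simulation (parameter values and variable names are arbitrary); each step reports how many edges leave and enter the tree.

import msprime

# Simulate with recombination (arbitrary parameters) so that there are
# several adjacent marginal trees to compare.
ts_rec = msprime.sim_ancestry(
    5, population_size=100, sequence_length=1e5, recombination_rate=1e-7, random_seed=54321
)

# edge_diffs() reports, for each tree, which edges leave and which enter.
# The first step inserts the entire first tree; after that, only a few
# edges typically change between adjacent trees.
for interval, edges_out, edges_in in ts_rec.edge_diffs():
    print(interval, len(edges_out), len(edges_in))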