Skip to main content
Ctrl+K

fwdpy11 manual

  • Introduction

Installation and setup

  • Setting up user environments for installation
  • Installation
  • Deployment tools
  • Invoking Python
  • Citation

Short vignettes - operations on populations

  • Initializing a population
  • Properties of populations

Short vignettes - setting up the genetics

  • Introduction
  • Genetic maps
  • Distributions of effect sizes
  • Mutations with variable dominance
  • Neutral mutations during a simulation
  • Genetic values - mutations with direct effects on fitness
  • Genetic values - simulating phenotypes/traits.
  • Working example - parameters for simulating direct effects on fitness
  • Working example - parameters for the simulation of a trait

Short vignettes - simulation with tree sequence recording

  • Evolving with tree sequence recording
  • Recording data during a simulation
  • Finishing a simulation with msprime
  • Adding neutral mutations to a population with existing ancestry
  • Importing mutations from tskit

Short vignettes - demographic modeling

  • Introduction
  • Using demes to specify demographic models.

Short vignettes - varying fitness conditions across demes

  • Introduction
  • The multivariate Gaussian distribution
  • The multivariate lognormal distribution
  • “Custom” multivariate distributions
  • Special considerations for demes models
  • Advanced topics

Short vignettes - conditional models

  • Tracking user-specified new mutations
  • Selective sweeps
  • Incomplete selective sweeps

Longer vignettes - demographic models

  • Timing of events in demographic models

Longer vignettes - exporting data to tskit

  • Converting data to tskit format
  • Working with data in tskit format

Longer vignettes - complete examples

  • Background selection
  • Gaussian stabilizing selection with an optimum shift
  • Adaptation of a trait to different optima in different demes
  • Tracking mutations over time
  • Tracking mutations over time using sqlite3

Concepts

  • Definitions

Tutorials

  • Advanced topics

Objects and concepts

  • Different effect sizes of mutations in different demes
  • Conceptual overview of tree sequence recording
  • Data structures related to tree sequences
  • Time series analysis

Technical details

  • Details of demes integration
  • Technical overview of genetic value calculations
  • Writing “plugins” with C++

Data types and fuctions

  • Genetic values
  • DataMatrix and StateMatrix
  • Function documentation
  • Random number distributions
  • Adding random noise to genetic values
  • Mapping genetic values to fitness
  • Model parameters
  • Setting up genomic intervals
  • Module fwdpy11.tskit_tools
  • Module fwdpy11.conditional_models
  • Data types related to simulated populations

Examples

  • Adaptation to different optima with multivariate distribution of effect sizes
  • Snowdrift model in python
  • Example of time series analysis implemented in C++

Miscellany

  • Changelog
  • Developer’s guide
  • Publications using fwdpy11
  • Upgrade path
  • Literature cited
  • Repository
  • Open issue
  • .md

Data structures related to tree sequences

Contents

  • Low-level data types
  • Table collections
  • Representing trees
    • “Leaves” of a tree
    • Children and siblings
    • Multiply-linked lists

Data structures related to tree sequences#

Low-level data types#

  • fwdpy11.Node defines nodes

  • fwdpy11.Edge defines edges

  • fwdpy11.Site defines a site (genomic location) where a mutation is present.

  • fwdpy11.MutationRecord defines mutations associated with fwdpy11.Site objects on trees and in fwdpy11.DiploidPopulation objects.

  • fwdpy11.NodeTable represents a node table

  • fwdpy11.EdgeTable represents an edge table

  • fwdpy11.MutationTable represents a mutation table

  • fwdpy11.SiteTable represents a site table

These types are all analogous to the corresponding tskit types. Here, fwdpy11.MutationRecord plays the role of tskit.Mutation, but with additional data fields that are useful for the forward-time simulations.

Table collections#

The above data types are encapsulated into the Python class fwdpy11.TableCollection. Instances of this class are data fields of populations, via fwdpy11.DiploidPopulation.tables.

Representing trees#

An edge table and a node table contain all the data required to generate what we may call “marginal trees”, which are the trees corresponding to genomic intervals \([left, right)\). In fwdpy11, fwdpy11.TreeIterator allows left-to-tright traversal across marginal trees as well as access to their data.

The intent of this section is to describe the structure of the marginal trees and get some terminology out of the way. I will use Jerome Kelleher’s tskit package to illustrate the concepts. fwdpy11 uses the same basic data model, and it is more convenient to use trees generated via a coalescent simulation that to run forward simulations. Plus, we can use the great “draw” methods from tskit!

Note

Do not mistake the tskit operations used below for the operations needed in fwdpy11. They differ in some important ways. Namely, the tskit API is currently richer and the fwdpy11 API is currently more low-level. The main goal here is to define terms and visualize them.

We are going to look at a single tree, but the concepts apply to any of the trees generated by a simulation with nonzero recombination rates. First, let’s get a tree sequence using the linear-time algorithm of Hudson (1990), which involves repeated swapping of the IDs of potential ancestors until the MRCA is reached.

import tskit
import numpy as np

# The following makes use of the Kirk Lohmueller seed.
np.random.seed(101 * 405 * 10 * 110)

tc = tskit.TableCollection(1)

nsam = 10

for i in range(nsam):
    tc.nodes.add_row(time=0, flags=1)

    nodes = np.arange(2 * nsam - 1, dtype=np.int32)
    time = 0.0
# The citation for this algorithm is
# Hudson, Richard R. 1990.
# “Gene Genealogies and the Coalescent Process.”
# Oxford Surveys in Evolutionary Biology 7 (1): 44.
n = nsam
while n > 1:
    rcoal = (n * (n - 1)) / 2.0
    tcoal = np.random.exponential(1.0 / rcoal)
    time += tcoal
    tc.nodes.add_row(time=time)
    ancestor = 2 * nsam - n
    p = np.random.choice(n, 1)[0]
    tc.edges.add_row(left=0.0, right=1.0, parent=ancestor, child=nodes[p])
    nodes[p] = nodes[n - 1]
    p = np.random.choice(n - 1, 1)[0]
    tc.edges.add_row(left=0.0, right=1.0, parent=ancestor, child=nodes[p])
    nodes[p] = nodes[2 * nsam - n]
    n -= 1
tc.sort()
ts = tc.tree_sequence()
print(ts.first().draw(format="unicode"))
       18          
  ┏━━━━━┻━━━━━┓    
  ┃          17    
  ┃       ┏━━━┻━━┓ 
 16       ┃      ┃ 
 ┏┻━┓     ┃      ┃ 
 ┃  ┃    15      ┃ 
 ┃  ┃   ┏━┻━━┓   ┃ 
 ┃  ┃   ┃   14   ┃ 
 ┃  ┃   ┃   ┏┻┓  ┃ 
 ┃  ┃   ┃   ┃ ┃ 13 
 ┃  ┃   ┃   ┃ ┃ ┏┻┓
12  ┃   ┃   ┃ ┃ ┃ ┃
┏┻┓ ┃   ┃   ┃ ┃ ┃ ┃
┃ ┃ ┃  11   ┃ ┃ ┃ ┃
┃ ┃ ┃ ┏━┻┓  ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ 10  ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┃
0 4 1 2 6 7 3 5 8 9

The above tree corresponds to a sample size of 10 haplotypes. The tree shows the node labels. The nodes labelled 0 through 9 correspond to the present time point–these are “alive nodes” or the “current generation” if we are thinking about a forward simulation. Further, we can describe the branches leading to these sample nodes as the “tips” or “leaves” of a tree.

“Leaves” of a tree#

One thing that we often want to know is “how many samples descend from node i?” To do that, we may look at the leaf counts attribute of a marginal tree:

# Let's store our
# tree in a variable now
t = ts.first()
def get_leaf_counts(tree, i):
    return len([j for j in tree.leaves(i)])
# Map node ids to their leaf counts using a dict
lcmap = {i: "".format(get_leaf_counts(t, i)) for i in range(len(ts.tables.nodes))}
print(t.draw(format="unicode", node_labels=lcmap))
         
 ┏━━┻━┓  
 ┃    ┃  
 ┃   ┏┻━┓
 ┃   ┃  ┃
┏┻┓  ┃  ┃
┃ ┃  ┃  ┃
┃ ┃ ┏┻┓ ┃
┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┏┓┃
┃ ┃ ┃ ┃┃┃
┃ ┃ ┃ ┃┃┏
┃ ┃ ┃ ┃┃┃
┏┓┃ ┃ ┃┃┃
┃┃┃ ┃ ┃┃┃
┃┃┃┏┓ ┃┃┃
┃┃┃┃┃ ┃┃┃
┃┃┃┃┏┓┃┃┃
┃┃┃┃┃┃┃┃┃

Children and siblings#

Let’s take another look at our tree, labelled with node ids:

print(t.draw(format="unicode"))
       18          
  ┏━━━━━┻━━━━━┓    
  ┃          17    
  ┃       ┏━━━┻━━┓ 
 16       ┃      ┃ 
 ┏┻━┓     ┃      ┃ 
 ┃  ┃    15      ┃ 
 ┃  ┃   ┏━┻━━┓   ┃ 
 ┃  ┃   ┃   14   ┃ 
 ┃  ┃   ┃   ┏┻┓  ┃ 
 ┃  ┃   ┃   ┃ ┃ 13 
 ┃  ┃   ┃   ┃ ┃ ┏┻┓
12  ┃   ┃   ┃ ┃ ┃ ┃
┏┻┓ ┃   ┃   ┃ ┃ ┃ ┃
┃ ┃ ┃  11   ┃ ┃ ┃ ┃
┃ ┃ ┃ ┏━┻┓  ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ 10  ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┃
0 4 1 2 6 7 3 5 8 9

Let’s ponder node 15 for a moment. It has two immediate descendants, nodes 11 and 14. We may consider these the left and right children, respectively, of node 15.

def get_children(tree, i):
    lc = tree.left_child(i)
    rc = tree.right_child(i)
    if lc == tskit.NULL and rc == tskit.NULL:
        return "->NULL".format(i)
    return "->".format(i) + str((lc, rc))


cmap = {i: get_children(t, i) for i in range(len(ts.tables.nodes))}
print(t.draw(format="unicode", node_labels=cmap))
                          ->(16, 17)                                 
            ┏━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┓                  
            ┃                                ->(13, 15)              
            ┃                         ┏━━━━━━━━━━━┻━━━━━━━━━━━┓      
        ->(1, 12)                     ┃                       ┃      
      ┏━━━━━┻━━━━┓                    ┃                       ┃      
      ┃          ┃               ->(11, 14)                   ┃      
      ┃          ┃           ┏━━━━━━━━┻━━━━━━━━━┓             ┃      
      ┃          ┃           ┃              ->(3, 5)          ┃      
      ┃          ┃           ┃               ┏━━┻━━━┓         ┃      
      ┃          ┃           ┃               ┃      ┃     ->(8, 9)   
      ┃          ┃           ┃               ┃      ┃      ┏━━┻━━━┓  
  ->(0, 4)       ┃           ┃               ┃      ┃      ┃      ┃  
   ┏━━┻━━━┓      ┃           ┃               ┃      ┃      ┃      ┃  
   ┃      ┃      ┃       ->(2, 10)           ┃      ┃      ┃      ┃  
   ┃      ┃      ┃      ┏━━━━┻━━━━┓          ┃      ┃      ┃      ┃  
   ┃      ┃      ┃      ┃     ->(6, 7)       ┃      ┃      ┃      ┃  
   ┃      ┃      ┃      ┃      ┏━━┻━━━┓      ┃      ┃      ┃      ┃  
->NULL ->NULL ->NULL ->NULL ->NULL ->NULL ->NULL ->NULL ->NULL ->NULL

Likewise, we may look at the sibling relationships amongst nodes:

def get_sibs(tree, i):
    ls = tree.left_sib(i)
    rs = tree.right_sib(i)
    if ls == tskit.NULL and rs == tskit.NULL:
        return "->NULL".format(i)
    return "->".format(i) + str((ls, rs))
smap = {i: get_sibs(t, i) for i in range(len(ts.tables.nodes))}
print(t.draw(format="unicode", node_labels=smap))
                                           ->NULL                                                    
                 ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━┓                          
                 ┃                                                   ->(16, -1)                      
                 ┃                                        ┏━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━┓         
            ->(-1, 17)                                    ┃                                ┃         
         ┏━━━━━━━┻━━━━━━━┓                                ┃                                ┃         
         ┃               ┃                           ->(13, -1)                            ┃         
         ┃               ┃                  ┏━━━━━━━━━━━━━┻━━━━━━━━━━━━┓                   ┃         
         ┃               ┃                  ┃                     ->(11, -1)               ┃         
         ┃               ┃                  ┃                     ┏━━━━┻━━━━┓              ┃         
         ┃               ┃                  ┃                     ┃         ┃         ->(-1, 15)     
         ┃               ┃                  ┃                     ┃         ┃         ┏━━━━┻━━━━┓    
     ->(1, -1)           ┃                  ┃                     ┃         ┃         ┃         ┃    
    ┏━━━━┻━━━━┓          ┃                  ┃                     ┃         ┃         ┃         ┃    
    ┃         ┃          ┃             ->(-1, 14)                 ┃         ┃         ┃         ┃    
    ┃         ┃          ┃          ┏━━━━━━━┻━━━━━━┓              ┃         ┃         ┃         ┃    
    ┃         ┃          ┃          ┃          ->(2, -1)          ┃         ┃         ┃         ┃    
    ┃         ┃          ┃          ┃         ┏━━━━┻━━━━┓         ┃         ┃         ┃         ┃    
->(-1, 4) ->(0, -1) ->(-1, 12) ->(-1, 10) ->(-1, 7) ->(6, -1) ->(-1, 5) ->(3, -1) ->(-1, 9) ->(8, -1)

Multiply-linked lists#

Under the hood, the data structures representing marginal trees consist of several arrays represening a multiply-linked list allowing traversal up/down/left/right along a marginal tree. By convention, the value -1 is taken as a NULL value, signifying that there are no more nodes in “that” direction along the tree.

We can look directly at what theses arrays look like in our tree:

nnodes = len(ts.tables.nodes)
# First, let's write down our node ids:
print([i for i in range(nnodes)])
# Now, get the parents of each node, moving "up" the tree
print([t.parent(i) for i in range(nnodes)])
# The left child list allows moving "down left" along a tree
print([t.left_child(i) for i in range(nnodes)])
# The right child list allows moving "down right" along a tree
print([t.right_child(i) for i in range(nnodes)])
# The left sib list allows moving "left" along a tree
print([t.left_sib(i) for i in range(nnodes)])
# The right sib list allows moving "right" along a tree
print([t.right_sib(i) for i in range(nnodes)])
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
[12, 16, 11, 14, 12, 14, 10, 10, 13, 13, 11, 15, 16, 17, 15, 17, 18, 18, -1]
[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 6, 2, 0, 8, 3, 11, 1, 13, 16]
[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 7, 10, 4, 9, 5, 14, 12, 15, 17]
[-1, -1, -1, -1, 0, 3, -1, 6, -1, 8, 2, -1, 1, -1, 11, 13, -1, 16, -1]
[4, 12, 10, 5, -1, -1, 7, -1, 9, -1, -1, 14, -1, 15, -1, -1, 17, -1, -1]

These lists are interpreted a lists of nodes referring to other nodes. For example, the value 10 in position 0 of the parents list means, “The parent of the node with index 0 has index 10”, where the indexes refer to the node table.

Given the above lists, you may start at any valid node id (e.g. a value not equal to -1 in the first array), and then “walk” in any direction you choose along the tree until you hit a value of -1, meaning that you cannot proceed any further.

Note

The trees generated by msprime.sim_ancestry are bifurcating, which is a consequence of simulating from the Kingman coalescent. In forward-time simulations, it is not uncommon to have more than two descendants of a node. When that happens, left_child and right_child refer to the left-most and right-most children, respectively. Thus, to “walk” along the descendants of a node, you proceed to left_child, and then march along right_sib until a value of -1 is seen. This “walking” method is the same as what you would do for a bifurcating tree, but I want to point out that counting the number of immediate descendants of a node requires counting the number of steps that the walk requires, and that it may be longer than two steps.

The last five arrays show above correspond to the following attributes of fwdpy11.TreeIterator:

  • fwdpy11.TreeIterator.parent

  • fwdpy11.TreeIterator.left_child

  • fwdpy11.TreeIterator.right_child

  • fwdpy11.TreeIterator.left_sib

  • fwdpy11.TreeIterator.right_sib

The key to efficiency is how these linked lists are updated as you move from tree \(i\) to tree \(i+1\). Given an approprate set of indexes, these lists are only updated at the positions that differ between the two trees. In practice, adjacent trees are highly-correlated, meaning that very few values need updating. The construction of these indexes is described immediatlely above the description of “Algorithm T” in the Kelleher et al. (2016) paper describing msprime.

The building of those indexes is a key determinant of performance for algorithms on tree sequences. Fortunately, you don’t need to worry about that, as that machinery is hidden in the internals of fwdpp (and tskit).

previous

Conceptual overview of tree sequence recording

next

Time series analysis

Contents
  • Low-level data types
  • Table collections
  • Representing trees
    • “Leaves” of a tree
    • Children and siblings
    • Multiply-linked lists

By Kevin Thornton

© Copyright 2023.