Working with data in tskit format

Working with data in tskit format#

This vignette discusses how to process simulations stored in a tskit.TreeSequence. A main goal of this vignette is to describe how to access data from the forward simulation stored as metadata in the tree sequence.

This example will be fairly rich in terms of features:

We simulate a model of one deme splitting into two.
We will record ancient samples from both demes after the split.
We initialize our population with a simulation from msprime.
We will have selected mutations from fwdpy11 and neutral mutations from msprime in our tree sequence.

Setting up a model#

In order to have data to analyze, we need to simulate some. We will simulate the following demographic model:

yaml="""
description: Two deme model with migration and size changes.
time_units: generations
demes:
- name: ancestral
  description: ancestral deme, two epochs
  epochs:
  - {end_time: 50, start_size: 100}
- name: deme1
  description: child 1
  epochs:
  - {start_size: 25, end_size: 25, end_time: 0}
  ancestors: [ancestral]
- name: deme2
  description: child 2
  epochs:
  - {start_size: 75, end_size: 75, end_time: 0}
  ancestors: [ancestral]
migrations:
- {demes: [deme1, deme2], rate: 1e-3}
"""

import demes
import demesdraw

graph = demes.loads(yaml)
demesdraw.tubes(graph);

../_images/8ddeeff7584c54f9c393da82fb1c11d664caea2d173ae7eb7f9e368b79c70883.png

(graph is a demes.Graph.)

The forward simulation parameters#

import fwdpy11
import msprime
import numpy as np

We will record all individuals in the entire (meta) population after a specific time:

class Recorder(object):
    def __init__(self, when):
        self.when = when

    def __call__(self, pop, sampler):
        if pop.generation >= self.when:
            sampler.assign(np.arange(pop.N))

Set up some general parameters about the “genome”:

RHO = 1000. # 4 * ancestral N * r
L = 1000 # Genome length
Nref = graph.demes[0].epochs[0].start_size # Ancestral population size
RECRATE = RHO / 4 / Nref

Will will use a multivariate lognormal distribution of effect sizes. We have to consider the effect sizes in three different demes:

The ancestor (deme ID 0)
Derived deme 1 (deme ID 1)
Derived deme 2 (deme ID 2)

The goal here is to model a population split such that derived deme 1 represent the continuation of the ancestral population and derived deme 2 is a new deme. Thus, we want the effect sizes of mutations to be the same in ancestral deme and derived deme 1. To do this, we need to set our covariance matrix to have a perfect correlation between IDs (indexes) 0 and 1. However, we cannot use an off-diagonal value exactly equal to 1, else the Cholesky decomposition that is needed to generate deviates will fail. Our solution is to use off-diagonal values very very close to 1:

vcov = np.identity(3)
vcov[0,1] = 1-np.finfo(np.float64).eps
vcov[1,0] = 1-np.finfo(np.float64).eps
mvdes = fwdpy11.mvDES(
        fwdpy11.LogNormalS.mv(0, L, 1, scaling=-2*Nref),
        np.zeros(3),
        vcov,
    )

The rest is standard. We will generate the model from a demes.Graph:

demog = fwdpy11.ForwardDemesGraph.from_demes(graph, burnin=1)

Set up the parameters dictionary:

pdict = {
    'nregions': [],
    'sregions': [mvdes],
    'recregions': [fwdpy11.PoissonInterval(0, L, RECRATE, discrete=True)],
    'rates': (0., 1e-3, None),
    'gvalue': fwdpy11.Multiplicative(ndemes=3, scaling=2.0),
    'demography': demog,
    'simlen': demog.model_duration,
    }

params = fwdpy11.ModelParams(**pdict)

Finally, we initialize a population using output from msprime, evolve it, and convert the data to tskit format:

initial_history = msprime.sim_ancestry(
    samples = Nref,
    population_size = Nref/2,
    random_seed = 12345,
    recombination_rate = RECRATE / L,
    discrete_genome = True,
    sequence_length = L,
    model = [
        msprime.DiscreteTimeWrightFisher(duration=int(0.1 * Nref)),
        msprime.StandardCoalescent(),
    ],
)

pop = fwdpy11.DiploidPopulation.create_from_tskit(initial_history)

rng = fwdpy11.GSLrng(54321)

fwdpy11.evolvets(rng, pop, params,
                 recorder=Recorder(when=demog.burnin_duration),
                 simplification_interval=100,
                 suppress_table_indexing=True)

ts = pop.dump_tables_to_tskit()

Now that we have some data, let’s look at how the fwdpy11 mutation and individual information got encoded as tskit metadata!

Exploring the mutation metadata#

By default, the metadata decode as a dict:

for m in ts.mutations():
    print(m.metadata)

{'esizes': [-0.012623238903808174, -0.012623239205499553, -0.0016703468456438187], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 10, 'label': 0, 'neutral': False, 'origin': 143, 's': 0.0}
{'esizes': [-0.0003653137107332991, -0.0003653137049199985, -0.008385608495355528], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 6, 'label': 0, 'neutral': False, 'origin': 131, 's': 0.0}
{'esizes': [-0.005145173903927555, -0.005145173951569319, -0.01313208676636952], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 5, 'label': 0, 'neutral': False, 'origin': 128, 's': 0.0}
{'esizes': [-0.010095013649331552, -0.010095013774628793, -0.004373345644889459], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 9, 'label': 0, 'neutral': False, 'origin': 141, 's': 0.0}
{'esizes': [-0.0005352057637457395, -0.0005352057524495173, -0.020502069417358642], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 3, 'label': 0, 'neutral': False, 'origin': 122, 's': 0.0}
{'esizes': [-0.018070984822376193, -0.018070984634143302, -0.019988302717412986], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 4, 'label': 0, 'neutral': False, 'origin': 127, 's': 0.0}
{'esizes': [-0.0010802891553584244, -0.0010802891761623927, -0.002925798477131282], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 0, 'label': 0, 'neutral': False, 'origin': 17, 's': 0.0}
{'esizes': [-0.036046827270880044, -0.036046826375927724, -0.006416225519968411], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 8, 'label': 0, 'neutral': False, 'origin': 138, 's': 0.0}
{'esizes': [-0.04793076710246424, -0.04793076734249477, -0.002349944631050108], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 11, 'label': 0, 'neutral': False, 'origin': 149, 's': 0.0}
{'esizes': [-0.003918899722778209, -0.003918899653371586, -0.0008939461234158104], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 7, 'label': 0, 'neutral': False, 'origin': 134, 's': 0.0}
{'esizes': [-0.01052387781383134, -0.010523877991434082, -0.003613386303846068], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 1, 'label': 0, 'neutral': False, 'origin': 95, 's': 0.0}
{'esizes': [-0.004738711258115506, -0.004738711417968354, -0.010637415336509182], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 2, 'label': 0, 'neutral': False, 'origin': 118, 's': 0.0}

We can call fwdpy11.tskit_tools.decode_mutation_metadata() to convert the dict to a fwdpy11.Mutation. Thus function returns a list because it can be accessed using a slice:

m = fwdpy11.tskit_tools.decode_mutation_metadata(ts, 0)
type(m[0])

fwdpy11._fwdpy11.Mutation

With no arguments, all metadata are converted to the Mutation type:

for m in fwdpy11.tskit_tools.decode_mutation_metadata(ts):
    print(m.esizes)

[-0.01262324 -0.01262324 -0.00167035]
[-0.00036531 -0.00036531 -0.00838561]
[-0.00514517 -0.00514517 -0.01313209]
[-0.01009501 -0.01009501 -0.00437335]
[-0.00053521 -0.00053521 -0.02050207]
[-0.01807098 -0.01807098 -0.0199883 ]
[-0.00108029 -0.00108029 -0.0029258 ]
[-0.03604683 -0.03604683 -0.00641623]
[-0.04793077 -0.04793077 -0.00234994]
[-0.0039189  -0.0039189  -0.00089395]
[-0.01052388 -0.01052388 -0.00361339]
[-0.00473871 -0.00473871 -0.01063742]

Note

For simulations without multivariate mutation effects, the metadata will not have the heffects and esizes fields. When converted to a fwdpy11.Mutation, the object will contain empty vectors.

Adding mutation with `msprime`.#

Mutations generated by fwdpy11 contain metadata, allowing you to get their effect sizes, etc.. When we add mutations to the tree sequence with msprime.sim_mutations(), the new mutation table rows have no metadata.

Let’s mutate a copy of our tree sequence. We will apply a binary, infinite-site model to add neutral mutations:

tscopy = msprime.sim_mutations(ts.tables.tree_sequence(), rate=RECRATE / L, model=msprime.BinaryMutationModel(), discrete_genome=False, random_seed=615243)
print("Our original number of mutations =", ts.num_mutations)
print("Our new number of mutations =", tscopy.num_mutations)

Our original number of mutations = 12
Our new number of mutations = 29495

metadata_is_none = 0
for i in tscopy.mutations():
    if i.metadata is None:
        metadata_is_none += 1

assert metadata_is_none == tscopy.num_mutations - ts.num_mutations
# mut be true because we asked for an "infinite-sites" mutation scheme
assert metadata_is_none == tscopy.num_sites - ts.num_mutations

Exploring the individual metadata#

As with mutations, individual metadata automatically decode to dict:

print(ts.individual(0).metadata)
print(type(ts.individual(0).metadata))

{'deme': 1, 'e': 0.0, 'g': 1.0, 'geography': [0.0, 0.0, 0.0], 'label': 0, 'nodes': [0, 1], 'parents': [8, 5], 'sex': 0, 'w': 1.0}
<class 'dict'>

It is often more efficient to decode the data into fwdpy11.tskit_tools.DiploidMetadata (which is an attrs-based analog to fwdpy11.DiploidMetadata). As with mutation metadata, fwdpy11.tskit_tools.decode_individual_metadata() returns a list:

fwdpy11.tskit_tools.decode_individual_metadata(ts, 0)

[DiploidMetadata(g=1.0, e=0.0, w=1.0, sex=0, deme=1, label=0, alive=True, preserved=False, first_generation=False, parents=[8, 5], geography=[0.0, 0.0, 0.0], nodes=[0, 1])]

print(type(fwdpy11.tskit_tools.decode_individual_metadata(ts, 0)[0]))

<class 'fwdpy11.tskit_tools.metadata.DiploidMetadata'>

The main difference between this Python class and its C++ analog is that the former contains several fields that decode the flags column of the individual table. See here for details.

Traversing all time points for which individuals exist#

The example simulation preserves individuals at many different time points. Use fwdpy11.tskit_tools.iterate_timepoints_with_individuals() to automate iterating over each time point:

import pandas as pd
pd.set_option("display.max_rows", 11)

times = []
num_nodes = []
len_metadata = []
for time, nodes, metadata in fwdpy11.tskit_tools.iterate_timepoints_with_individuals(ts, decode_metadata=True):
    times.append(time)
    num_nodes.append(len(nodes))
    len_metadata.append(len(metadata))

df = pd.DataFrame({"time": times, "num_nodes": num_nodes, "num_metadata_objects": len_metadata})
df

	time	num_nodes	num_metadata_objects
0	49.0	200	100
1	48.0	200	100
2	47.0	200	100
3	46.0	200	100
4	45.0	200	100
...	...	...	...
45	4.0	200	100
46	3.0	200	100
47	2.0	200	100
48	1.0	200	100
49	0.0	200	100

50 rows × 3 columns