Working with data in tskit format#
This vignette discusses how to process simulations stored in a tskit.TreeSequence
.
A main goal of this vignette is to describe how to access data from the forward simulation stored as metadata in the tree sequence.
This example will be fairly rich in terms of features:
We simulate a model of one deme splitting into two.
We will record ancient samples from both demes after the split.
We initialize our population with a simulation from
msprime
.We will have selected mutations from
fwdpy11
and neutral mutations frommsprime
in our tree sequence.
Setting up a model#
In order to have data to analyze, we need to simulate some. We will simulate the following demographic model:
yaml="""
description: Two deme model with migration and size changes.
time_units: generations
demes:
- name: ancestral
description: ancestral deme, two epochs
epochs:
- {end_time: 50, start_size: 100}
- name: deme1
description: child 1
epochs:
- {start_size: 25, end_size: 25, end_time: 0}
ancestors: [ancestral]
- name: deme2
description: child 2
epochs:
- {start_size: 75, end_size: 75, end_time: 0}
ancestors: [ancestral]
migrations:
- {demes: [deme1, deme2], rate: 1e-3}
"""
import demes
import demesdraw
graph = demes.loads(yaml)
demesdraw.tubes(graph);
(graph
is a demes.Graph
.)
The forward simulation parameters#
import fwdpy11
import msprime
import numpy as np
We will record all individuals in the entire (meta) population after a specific time:
class Recorder(object):
def __init__(self, when):
self.when = when
def __call__(self, pop, sampler):
if pop.generation >= self.when:
sampler.assign(np.arange(pop.N))
Set up some general parameters about the “genome”:
RHO = 1000. # 4 * ancestral N * r
L = 1000 # Genome length
Nref = graph.demes[0].epochs[0].start_size # Ancestral population size
RECRATE = RHO / 4 / Nref
Will will use a multivariate lognormal distribution of effect sizes. We have to consider the effect sizes in three different demes:
The ancestor (deme ID 0)
Derived deme 1 (deme ID 1)
Derived deme 2 (deme ID 2)
The goal here is to model a population split such that derived deme 1 represent the continuation of the ancestral population and derived deme 2 is a new deme. Thus, we want the effect sizes of mutations to be the same in ancestral deme and derived deme 1. To do this, we need to set our covariance matrix to have a perfect correlation between IDs (indexes) 0 and 1. However, we cannot use an off-diagonal value exactly equal to 1, else the Cholesky decomposition that is needed to generate deviates will fail. Our solution is to use off-diagonal values very very close to 1:
vcov = np.identity(3)
vcov[0,1] = 1-np.finfo(np.float64).eps
vcov[1,0] = 1-np.finfo(np.float64).eps
mvdes = fwdpy11.mvDES(
fwdpy11.LogNormalS.mv(0, L, 1, scaling=-2*Nref),
np.zeros(3),
vcov,
)
The rest is standard.
We will generate the model from a demes.Graph
:
demog = fwdpy11.ForwardDemesGraph.from_demes(graph, burnin=1)
Set up the parameters dictionary:
pdict = {
'nregions': [],
'sregions': [mvdes],
'recregions': [fwdpy11.PoissonInterval(0, L, RECRATE, discrete=True)],
'rates': (0., 1e-3, None),
'gvalue': fwdpy11.Multiplicative(ndemes=3, scaling=2.0),
'demography': demog,
'simlen': demog.model_duration,
}
params = fwdpy11.ModelParams(**pdict)
Finally, we initialize a population using output from msprime
, evolve it, and convert the data to tskit
format:
initial_history = msprime.sim_ancestry(
samples = Nref,
population_size = Nref/2,
random_seed = 12345,
recombination_rate = RECRATE / L,
discrete_genome = True,
sequence_length = L,
model = [
msprime.DiscreteTimeWrightFisher(duration=int(0.1 * Nref)),
msprime.StandardCoalescent(),
],
)
pop = fwdpy11.DiploidPopulation.create_from_tskit(initial_history)
rng = fwdpy11.GSLrng(54321)
fwdpy11.evolvets(rng, pop, params,
recorder=Recorder(when=demog.burnin_duration),
simplification_interval=100,
suppress_table_indexing=True)
ts = pop.dump_tables_to_tskit()
Now that we have some data, let’s look at how the fwdpy11
mutation and individual information got encoded as tskit
metadata!
Exploring the mutation metadata#
By default, the metadata decode as a dict
:
for m in ts.mutations():
print(m.metadata)
{'esizes': [-0.012623238903808174, -0.012623239205499553, -0.0016703468456438187], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 10, 'label': 0, 'neutral': False, 'origin': 143, 's': 0.0}
{'esizes': [-0.0003653137107332991, -0.0003653137049199985, -0.008385608495355528], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 6, 'label': 0, 'neutral': False, 'origin': 131, 's': 0.0}
{'esizes': [-0.005145173903927555, -0.005145173951569319, -0.01313208676636952], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 5, 'label': 0, 'neutral': False, 'origin': 128, 's': 0.0}
{'esizes': [-0.010095013649331552, -0.010095013774628793, -0.004373345644889459], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 9, 'label': 0, 'neutral': False, 'origin': 141, 's': 0.0}
{'esizes': [-0.0005352057637457395, -0.0005352057524495173, -0.020502069417358642], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 3, 'label': 0, 'neutral': False, 'origin': 122, 's': 0.0}
{'esizes': [-0.018070984822376193, -0.018070984634143302, -0.019988302717412986], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 4, 'label': 0, 'neutral': False, 'origin': 127, 's': 0.0}
{'esizes': [-0.0010802891553584244, -0.0010802891761623927, -0.002925798477131282], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 0, 'label': 0, 'neutral': False, 'origin': 17, 's': 0.0}
{'esizes': [-0.036046827270880044, -0.036046826375927724, -0.006416225519968411], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 8, 'label': 0, 'neutral': False, 'origin': 138, 's': 0.0}
{'esizes': [-0.04793076710246424, -0.04793076734249477, -0.002349944631050108], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 11, 'label': 0, 'neutral': False, 'origin': 149, 's': 0.0}
{'esizes': [-0.003918899722778209, -0.003918899653371586, -0.0008939461234158104], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 7, 'label': 0, 'neutral': False, 'origin': 134, 's': 0.0}
{'esizes': [-0.01052387781383134, -0.010523877991434082, -0.003613386303846068], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 1, 'label': 0, 'neutral': False, 'origin': 95, 's': 0.0}
{'esizes': [-0.004738711258115506, -0.004738711417968354, -0.010637415336509182], 'h': 1.0, 'heffects': [1.0, 1.0, 1.0], 'key': 2, 'label': 0, 'neutral': False, 'origin': 118, 's': 0.0}
We can call fwdpy11.tskit_tools.decode_mutation_metadata()
to convert the dict
to a fwdpy11.Mutation
. Thus function returns a list
because it can be accessed using a slice
:
m = fwdpy11.tskit_tools.decode_mutation_metadata(ts, 0)
type(m[0])
fwdpy11._fwdpy11.Mutation
With no arguments, all metadata are converted to the Mutation
type:
for m in fwdpy11.tskit_tools.decode_mutation_metadata(ts):
print(m.esizes)
[-0.01262324 -0.01262324 -0.00167035]
[-0.00036531 -0.00036531 -0.00838561]
[-0.00514517 -0.00514517 -0.01313209]
[-0.01009501 -0.01009501 -0.00437335]
[-0.00053521 -0.00053521 -0.02050207]
[-0.01807098 -0.01807098 -0.0199883 ]
[-0.00108029 -0.00108029 -0.0029258 ]
[-0.03604683 -0.03604683 -0.00641623]
[-0.04793077 -0.04793077 -0.00234994]
[-0.0039189 -0.0039189 -0.00089395]
[-0.01052388 -0.01052388 -0.00361339]
[-0.00473871 -0.00473871 -0.01063742]
Note
For simulations without multivariate mutation effects, the metadata will not have the heffects
and esizes
fields.
When converted to a fwdpy11.Mutation
, the object will contain empty vectors.
Adding mutation with msprime
.#
Mutations generated by fwdpy11
contain metadata, allowing you to get their effect sizes, etc..
When we add mutations to the tree sequence with msprime.sim_mutations()
, the new mutation table rows have no metadata.
Let’s mutate a copy of our tree sequence. We will apply a binary, infinite-site model to add neutral mutations:
tscopy = msprime.sim_mutations(ts.tables.tree_sequence(), rate=RECRATE / L, model=msprime.BinaryMutationModel(), discrete_genome=False, random_seed=615243)
print("Our original number of mutations =", ts.num_mutations)
print("Our new number of mutations =", tscopy.num_mutations)
Our original number of mutations = 12
Our new number of mutations = 29495
metadata_is_none = 0
for i in tscopy.mutations():
if i.metadata is None:
metadata_is_none += 1
assert metadata_is_none == tscopy.num_mutations - ts.num_mutations
# mut be true because we asked for an "infinite-sites" mutation scheme
assert metadata_is_none == tscopy.num_sites - ts.num_mutations
Exploring the individual metadata#
As with mutations, individual metadata automatically decode to dict
:
print(ts.individual(0).metadata)
print(type(ts.individual(0).metadata))
{'deme': 1, 'e': 0.0, 'g': 1.0, 'geography': [0.0, 0.0, 0.0], 'label': 0, 'nodes': [0, 1], 'parents': [8, 5], 'sex': 0, 'w': 1.0}
<class 'dict'>
It is often more efficient to decode the data into fwdpy11.tskit_tools.DiploidMetadata
(which is an attrs
-based analog to fwdpy11.DiploidMetadata
).
As with mutation metadata, fwdpy11.tskit_tools.decode_individual_metadata()
returns a list:
fwdpy11.tskit_tools.decode_individual_metadata(ts, 0)
[DiploidMetadata(g=1.0, e=0.0, w=1.0, sex=0, deme=1, label=0, alive=True, preserved=False, first_generation=False, parents=[8, 5], geography=[0.0, 0.0, 0.0], nodes=[0, 1])]
print(type(fwdpy11.tskit_tools.decode_individual_metadata(ts, 0)[0]))
<class 'fwdpy11.tskit_tools.metadata.DiploidMetadata'>
The main difference between this Python class and its C++ analog is that the former contains several fields that decode the flags
column of the individual table.
See here for details.
Traversing all time points for which individuals exist#
The example simulation preserves individuals at many different time points.
Use fwdpy11.tskit_tools.iterate_timepoints_with_individuals()
to automate iterating over each time point:
import pandas as pd
pd.set_option("display.max_rows", 11)
times = []
num_nodes = []
len_metadata = []
for time, nodes, metadata in fwdpy11.tskit_tools.iterate_timepoints_with_individuals(ts, decode_metadata=True):
times.append(time)
num_nodes.append(len(nodes))
len_metadata.append(len(metadata))
df = pd.DataFrame({"time": times, "num_nodes": num_nodes, "num_metadata_objects": len_metadata})
df
time | num_nodes | num_metadata_objects | |
---|---|---|---|
0 | 49.0 | 200 | 100 |
1 | 48.0 | 200 | 100 |
2 | 47.0 | 200 | 100 |
3 | 46.0 | 200 | 100 |
4 | 45.0 | 200 | 100 |
... | ... | ... | ... |
45 | 4.0 | 200 | 100 |
46 | 3.0 | 200 | 100 |
47 | 2.0 | 200 | 100 |
48 | 1.0 | 200 | 100 |
49 | 0.0 | 200 | 100 |
50 rows × 3 columns