Converting data to tskit format#
Note
The examples in this vignette involve population objects with no ancestry. This is intentional, as it keeps the execution times to a minimum while still illustrating the concepts.
Show code cell source
import demes
import fwdpy11
At the end of a simulation, fwdpy11.DiploidPopulation.dump_tables_to_tskit()
will generate a tskit.TreeSequence
object.
This object can be saved to a file using tskit.TreeSequence.dump()
, resulting in a “trees file”.
Saving data in tskit
’s format gives you access to a huge array of methods for downstream analysis of your simulated data.
See the tskit docs for much more on what you can do.
Much of what you need is found in the docs for tskit.TreeSequence
.
What happens when we call fwdpy11.DiploidPopulation.dump_tables_to_tskit()
?
All tables are converted from
fwdpy11
format totskit
format in atskit.TableCollection
.For individuals, populations, and mutations, row-level metadata are encoded. The next vignette covers how to interact with this information.
Optionally, top-level metadata are encoded, which is controlled by keyword arguments to
fwdpy11.DiploidPopulation.dump_tables_to_tskit()
.
Obtaining a tskit
tree sequence#
pop = fwdpy11.DiploidPopulation(1000, 1.0)
By default, export to tskit
returns a tskit.TreeSequence
:
ts = pop.dump_tables_to_tskit()
type(ts)
tskit.trees.TreeSequence
In general, we want this type in order to dump straight to a file:
# not executed
ts.dump("treefile.trees")
We can also do a lot of analysis using this type.
However, certain tasks would be easier if this type knew about the metadata generated by fwdpy11
.
Metadata#
The tskit
data model includes metadata.
Metadata means anything not strictly needed for operations on trees/tree sequences.
Examples may include population names, properties of mutations such as selection coefficients, geographic locations of individuals, etc..
These sorts of metadata are included as optional columns in tables.
See the tskit
data model documentation for more detail.
Metadata can also include “top-level” information that is stored as metadata for the tree sequence (or tskit.TableCollection
) itself.
Much of this vignette concerns top-level metadata.
Metadata schema#
The tskit
metadata require a specification, or schema, that is used to validate the metadata contents.
The top-level metadata schema is:
print(fwdpy11.tskit_tools.metadata_schema.TopLevelMetadata)
tskit.MetadataSchema(
{'codec': 'json',
'properties': {'data': {'description': 'This field is reserved for the user '
'to fill.',
'type': ['object', 'string']},
'demes_graph': {'description': 'A demographic model specified '
'using demes.This information '
'will be redundant with that '
'stored in model_params,but it '
'may be useful as it allows '
'reconstruction of the YAML '
'filefrom the tree sequence.',
'type': 'object'},
'generation': {'description': 'The value of pop.generation at '
'the time datawere exported to '
'tskit',
'type': 'integer'},
'model_params': {'description': 'One or more '
'fwdpy11.ModelParams '
'instances.',
'type': ['string', 'object']},
'seed': {'description': 'Random number seed.This is optional '
'because a random number generatormay '
'be called prior to simulation, thus '
'making theinitial seed not capable of '
'reproducing the simulation',
'type': 'integer'}},
'required': ['generation'],
'title': 'Top-level metadata for table collection/tree sequence.',
'type': 'object'}
)
We see that the metadata are encoded as a JSON schema.
Only one field is required, generation
.
This required field is automatically populated by fwdpy11.DiploidPopulation.dump_tables_to_tskit()
.
The remaining fields are filled in by keyword arguments passed to that function.
Much of what follows concerns filling in this top-level metadata. To do this, we use incomplete data (populations with no ancestry) to illustrate the concepts.
Populating top-level metadata#
Top-level metadata are an attempt to improve reproducibility.
However, all of these features are optional.
Because a simulation project using Python may be very complex, with many local and non-local imports, etc., fully documenting the top-level metadata and/or “provenance” may be impractical.
In some cases, reproducibility may instead by best managed by a git
repository where script and Makefile
/Snakefile
work flows coexist.
Adding random number seeds#
You may add a random number seed to the top-level metadata.
Doing so is optional because we cannot guarantee that you used this seed to construct a fwdpy11.GSLrng
and then immediately run the simulation.
For example, if you built the rng
object and then simulated two replicates, the seed is only valid to reproduce the first.
You do not know the state of the rng
when the second simulation started.
ts = pop.dump_tables_to_tskit()
assert "seed" not in ts.metadata
ts = pop.dump_tables_to_tskit(seed=54321)
assert ts.metadata["seed"] == 54321
ts = pop.dump_tables_to_tskit(seed=-4616)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[7], line 1
----> 1 ts = pop.dump_tables_to_tskit(seed=-4616)
File ~/work/fwdpy11/fwdpy11/fwdpy11/_types/diploid_population.py:385, in DiploidPopulation.dump_tables_to_tskit(self, model_params, population_metadata, data, seed, parameters, destructive)
298 def dump_tables_to_tskit(
299 self,
300 *,
(...)
306 destructive=False,
307 ) -> tskit.TreeSequence:
308 """
309 Dump the population's TableCollection into
310 an tskit TreeSequence
(...)
383
384 """
--> 385 return fwdpy11.tskit_tools._dump_tables_to_tskit._dump_tables_to_tskit(
386 self,
387 model_params=model_params,
388 population_metadata=population_metadata,
389 data=data,
390 seed=seed,
391 parameters=parameters,
392 destructive=destructive,
393 )
File ~/work/fwdpy11/fwdpy11/fwdpy11/tskit_tools/_dump_tables_to_tskit.py:183, in _dump_tables_to_tskit(self, model_params, population_metadata, data, seed, parameters, destructive)
181 if seed is not None:
182 if seed < 0:
--> 183 raise ValueError(f"seed must be >=0, got {seed}")
184 top_level_metadata["seed"] = seed
186 tc.metadata = top_level_metadata
ValueError: seed must be >=0, got -4616
Adding ModelParams
objects#
Another optional top-level metadata item are instances of fwdpy11.ModelParams
.
Including this is optional for several reasons, but a big one is that this class can be arbitrarily hard to serialize.
The examples listed below will work with types provided by fwdpy11
(although we haven’t tested them all!), but user-defined types may be more challenging.
ts = pop.dump_tables_to_tskit()
assert "model_params" not in ts.metadata
Here is an example of Gaussian stabilizing selection around an optimum trait value of 0.0
:
optimum = fwdpy11.Optimum(optimum=0.0, VS=10.0, when=0)
gss = fwdpy11.GaussianStabilizingSelection.single_trait([optimum])
pdict = {
'nregions': [],
'sregions': [fwdpy11.GaussianS(0, 1, 1, 0.10)],
'recregions': [fwdpy11.PoissonInterval(0, 1, 1e-3)],
'gvalue': fwdpy11.Additive(2., gss),
'rates': (0.0, 1e-3, None),
'simlen': 10 * pop.N,
'prune_selected': False,
'demography': fwdpy11.ForwardDemesGraph.tubes([pop.N], 1)
}
params = fwdpy11.ModelParams(**pdict)
We can pass the params
object on when exporting the data to tskit
:
ts = pop.dump_tables_to_tskit(model_params=params)
recovered_params = fwdpy11.ModelParams(**eval(ts.metadata["model_params"]))
assert recovered_params == params
Note that we can recover our demographic model from the restored parameters:
print(recovered_params.demography)
fwdpy11.ForwardDemesGraph(yaml='time_units: generations\ngeneration_time: 1\ndemes:\n- name: deme0\n epochs:\n - {end_time: 0, start_size: 1000}\n', burnin=1, burnin_is_exact=False, round_non_integer_sizes=False)
print(recovered_params.demography.demes_graph)
time_units: generations
generation_time: 1.0
demes:
- name: deme0
epochs:
- {end_time: 0.0, start_size: 1000.0}
User-defined metadata#
It may be useful to store arbitrary data in the output.
For example, this could be a dict
containing information obtained from a time series analysis of a simulation.
The simplest situation is to use a dict
containing simple objects:
ts = pop.dump_tables_to_tskit(data={'x': 3})
ts.metadata["data"]
{'x': 3}
Any use of user-defined types will require conversion to string, however:
class MyData(object):
def __init__(self, x):
self.x = x
def __repr__(self):
return f"MyData(x={self.x})"
# Send in the str representation:
ts = pop.dump_tables_to_tskit(data=str(MyData(x=77)))
Getting the data back out requires an eval
:
assert eval(ts.metadata["data"]).x == 77
Warning
Recording data like this can become arbitrary difficult. The Python import system may make it more difficult to recover data than the examples here would indicate. Further, if the data are very large, then other output formats are likely more appropriate.
Setting population table metadata#
This example involves simulating a multi-deme model.
Show code cell source
# Example 07 from the demes tutorial
yaml = """
time_units: generations
demes:
- name: X
epochs:
- end_time: 1000
start_size: 2000
- name: A
ancestors:
- X
epochs:
- start_size: 2000
- name: B
ancestors:
- X
epochs:
- start_size: 2000
"""
demography = fwdpy11.ForwardDemesGraph.from_demes(
yaml,
burnin=100,
burnin_is_exact=True,
)
Opt = fwdpy11.Optimum
GSSmo_ancestor = fwdpy11.GaussianStabilizingSelection.single_trait(
[Opt(when=0, optimum=0.0, VS=1.0), Opt(when=100, optimum=1.0, VS=1.0)]
)
gvalue_ancestor = fwdpy11.Additive(2.0, GSSmo_ancestor)
GSSmo_daughter_1 = fwdpy11.GaussianStabilizingSelection.single_trait(
[Opt(when=100, optimum=1.0, VS=1.0)]
)
gvalue_daughter_1 = fwdpy11.Additive(2.0, GSSmo_daughter_1)
GSSmo_daughter_2 = fwdpy11.GaussianStabilizingSelection.single_trait(
[Opt(when=100, optimum=1.0, VS=1.0)]
)
gvalue_daughter_2 = fwdpy11.Additive(2.0, GSSmo_daughter_2)
p = {
"nregions": [],
"sregions": [fwdpy11.GaussianS(0, 1, 1, 0.25)],
"recregions": [fwdpy11.PoissonInterval(0, 1, 1e-3)],
"rates": (0.0, 0.025, None),
"gvalue": [gvalue_ancestor, gvalue_daughter_1, gvalue_daughter_2],
"prune_selected": False,
"demography": demography,
"simlen": demography.final_generation,
}
rng = fwdpy11.GSLrng(12351235)
params = fwdpy11.ModelParams(**p)
pop = fwdpy11.DiploidPopulation(demography.initial_sizes, 1)
fwdpy11.evolvets(rng, pop, params, simplification_interval=100)
By default, deme names are placed in the metadata:
ts = pop.dump_tables_to_tskit()
for p in ts.populations():
print(p.metadata)
{'name': 'deme0'}
{'name': 'deme1'}
{'name': 'deme2'}
We can override this behavior by providing a dictionary mapping a deme’s integer id (index) to data.
md = {}
for i, deme in enumerate(demography.demes_graph.demes):
md[i] = {'name': deme.name, 'end_time': deme.end_time}
ts = pop.dump_tables_to_tskit(population_metadata=md)
for p in ts.populations():
print(p.metadata)
{'end_time': 1000.0, 'name': 'X'}
{'end_time': 0.0, 'name': 'A'}
{'end_time': 0.0, 'name': 'B'}