Converting data to tskit format

Converting data to tskit format#

Note

The examples in this vignette involve population objects with no ancestry. This is intentional, as it keeps the execution times to a minimum while still illustrating the concepts.

At the end of a simulation, fwdpy11.DiploidPopulation.dump_tables_to_tskit() will generate a tskit.TreeSequence object. This object can be saved to a file using tskit.TreeSequence.dump(), resulting in a “trees file”.

Saving data in tskit’s format gives you access to a huge array of methods for downstream analysis of your simulated data. See the tskit docs for much more on what you can do. Much of what you need is found in the docs for tskit.TreeSequence.

What happens when we call fwdpy11.DiploidPopulation.dump_tables_to_tskit()?

All tables are converted from fwdpy11 format to tskit format in a tskit.TableCollection.
For individuals, populations, and mutations, row-level metadata are encoded. The next vignette covers how to interact with this information.
Optionally, top-level metadata are encoded, which is controlled by keyword arguments to fwdpy11.DiploidPopulation.dump_tables_to_tskit().

Obtaining a `tskit` tree sequence#

pop = fwdpy11.DiploidPopulation(1000, 1.0)

By default, export to tskit returns a tskit.TreeSequence:

ts = pop.dump_tables_to_tskit()
type(ts)

tskit.trees.TreeSequence

In general, we want this type in order to dump straight to a file:

# not executed
ts.dump("treefile.trees")

We can also do a lot of analysis using this type. However, certain tasks would be easier if this type knew about the metadata generated by fwdpy11.

Metadata#

The tskit data model includes metadata. Metadata means anything not strictly needed for operations on trees/tree sequences. Examples may include population names, properties of mutations such as selection coefficients, geographic locations of individuals, etc.. These sorts of metadata are included as optional columns in tables. See the tskit data model documentation for more detail.

Metadata can also include “top-level” information that is stored as metadata for the tree sequence (or tskit.TableCollection) itself. Much of this vignette concerns top-level metadata.

Metadata schema#

The tskit metadata require a specification, or schema, that is used to validate the metadata contents.

The top-level metadata schema is:

print(fwdpy11.tskit_tools.metadata_schema.TopLevelMetadata)

tskit.MetadataSchema(
{'codec': 'json',
 'properties': {'data': {'description': 'This field is reserved for the user '
                                        'to fill.',
                         'type': ['object', 'string']},
                'demes_graph': {'description': 'A demographic model specified '
                                               'using demes.This information '
                                               'will be redundant with that '
                                               'stored in model_params,but it '
                                               'may be useful as it allows '
                                               'reconstruction of the YAML '
                                               'filefrom the tree sequence.',
                                'type': 'object'},
                'generation': {'description': 'The value of pop.generation at '
                                              'the time datawere exported to '
                                              'tskit',
                               'type': 'integer'},
                'model_params': {'description': 'One or more '
                                                'fwdpy11.ModelParams '
                                                'instances.',
                                 'type': ['string', 'object']},
                'seed': {'description': 'Random number seed.This is optional '
                                        'because a random number generatormay '
                                        'be called prior to simulation, thus '
                                        'making theinitial seed not capable of '
                                        'reproducing the simulation',
                         'type': 'integer'}},
 'required': ['generation'],
 'title': 'Top-level metadata for table collection/tree sequence.',
 'type': 'object'}
)

We see that the metadata are encoded as a JSON schema. Only one field is required, generation. This required field is automatically populated by fwdpy11.DiploidPopulation.dump_tables_to_tskit(). The remaining fields are filled in by keyword arguments passed to that function.

Much of what follows concerns filling in this top-level metadata. To do this, we use incomplete data (populations with no ancestry) to illustrate the concepts.

Populating top-level metadata#

Top-level metadata are an attempt to improve reproducibility. However, all of these features are optional. Because a simulation project using Python may be very complex, with many local and non-local imports, etc., fully documenting the top-level metadata and/or “provenance” may be impractical. In some cases, reproducibility may instead by best managed by a git repository where script and Makefile/Snakefile work flows coexist.

Adding random number seeds#

You may add a random number seed to the top-level metadata. Doing so is optional because we cannot guarantee that you used this seed to construct a fwdpy11.GSLrng and then immediately run the simulation. For example, if you built the rng object and then simulated two replicates, the seed is only valid to reproduce the first. You do not know the state of the rng when the second simulation started.

ts = pop.dump_tables_to_tskit()
assert "seed" not in ts.metadata

ts = pop.dump_tables_to_tskit(seed=54321)
assert ts.metadata["seed"] == 54321

ts = pop.dump_tables_to_tskit(seed=-4616)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 ts = pop.dump_tables_to_tskit(seed=-4616)

File ~/work/fwdpy11/fwdpy11/fwdpy11/_types/diploid_population.py:385, in DiploidPopulation.dump_tables_to_tskit(self, model_params, population_metadata, data, seed, parameters, destructive)
    298 def dump_tables_to_tskit(
    299     self,
    300     *,
   (...)
    306     destructive=False,
    307 ) -> tskit.TreeSequence:
    308     """
    309     Dump the population's TableCollection into
    310     an tskit TreeSequence
   (...)
    383 
    384     """
--> 385     return fwdpy11.tskit_tools._dump_tables_to_tskit._dump_tables_to_tskit(
    386         self,
    387         model_params=model_params,
    388         population_metadata=population_metadata,
    389         data=data,
    390         seed=seed,
    391         parameters=parameters,
    392         destructive=destructive,
    393     )

File ~/work/fwdpy11/fwdpy11/fwdpy11/tskit_tools/_dump_tables_to_tskit.py:183, in _dump_tables_to_tskit(self, model_params, population_metadata, data, seed, parameters, destructive)
    181 if seed is not None:
    182     if seed < 0:
--> 183         raise ValueError(f"seed must be >=0, got {seed}")
    184     top_level_metadata["seed"] = seed
    186 tc.metadata = top_level_metadata

ValueError: seed must be >=0, got -4616

Adding `ModelParams` objects#

Another optional top-level metadata item are instances of fwdpy11.ModelParams. Including this is optional for several reasons, but a big one is that this class can be arbitrarily hard to serialize. The examples listed below will work with types provided by fwdpy11 (although we haven’t tested them all!), but user-defined types may be more challenging.

ts = pop.dump_tables_to_tskit()
assert "model_params" not in ts.metadata

Here is an example of Gaussian stabilizing selection around an optimum trait value of 0.0:

optimum = fwdpy11.Optimum(optimum=0.0, VS=10.0, when=0)
gss = fwdpy11.GaussianStabilizingSelection.single_trait([optimum])
pdict = {
    'nregions': [],
    'sregions': [fwdpy11.GaussianS(0, 1, 1, 0.10)],
    'recregions': [fwdpy11.PoissonInterval(0, 1, 1e-3)],
    'gvalue': fwdpy11.Additive(2., gss),
    'rates': (0.0, 1e-3, None),
    'simlen': 10 * pop.N,
    'prune_selected': False,
    'demography': fwdpy11.ForwardDemesGraph.tubes([pop.N], 1)
}

params = fwdpy11.ModelParams(**pdict)

We can pass the params object on when exporting the data to tskit:

ts = pop.dump_tables_to_tskit(model_params=params)
recovered_params = fwdpy11.ModelParams(**eval(ts.metadata["model_params"])) 
assert recovered_params == params

Note that we can recover our demographic model from the restored parameters:

print(recovered_params.demography)

fwdpy11.ForwardDemesGraph(yaml='time_units: generations\ngeneration_time: 1\ndemes:\n- name: deme0\n  epochs:\n  - {end_time: 0, start_size: 1000}\n', burnin=1, burnin_is_exact=False, round_non_integer_sizes=False)

print(recovered_params.demography.demes_graph)

time_units: generations
generation_time: 1.0
demes:
- name: deme0
  epochs:
  - {end_time: 0.0, start_size: 1000.0}

User-defined metadata#

It may be useful to store arbitrary data in the output. For example, this could be a dict containing information obtained from a time series analysis of a simulation.

The simplest situation is to use a dict containing simple objects:

ts = pop.dump_tables_to_tskit(data={'x': 3})
ts.metadata["data"]

{'x': 3}

Any use of user-defined types will require conversion to string, however:

class MyData(object):
    def __init__(self, x):
        self.x = x

    def __repr__(self):
        return f"MyData(x={self.x})"

# Send in the str representation:
ts = pop.dump_tables_to_tskit(data=str(MyData(x=77)))

Getting the data back out requires an eval:

assert eval(ts.metadata["data"]).x == 77

Warning

Recording data like this can become arbitrary difficult. The Python import system may make it more difficult to recover data than the examples here would indicate. Further, if the data are very large, then other output formats are likely more appropriate.

Setting population table metadata#

This example involves simulating a multi-deme model.

By default, deme names are placed in the metadata:

ts = pop.dump_tables_to_tskit()
for p in ts.populations():
    print(p.metadata)

{'name': 'deme0'}
{'name': 'deme1'}
{'name': 'deme2'}

We can override this behavior by providing a dictionary mapping a deme’s integer id (index) to data.

md = {}
for i, deme in enumerate(demography.demes_graph.demes):
    md[i] = {'name': deme.name, 'end_time': deme.end_time}
ts = pop.dump_tables_to_tskit(population_metadata=md)
for p in ts.populations():
    print(p.metadata)

{'end_time': 1000.0, 'name': 'X'}
{'end_time': 0.0, 'name': 'A'}
{'end_time': 0.0, 'name': 'B'}