Converting data to tskit format#

Note

The examples in this vignette involve population objects with no ancestry. This is intentional, as it keeps the execution times to a minimum while still illustrating the concepts.

Hide code cell source
import demes
import fwdpy11

At the end of a simulation, fwdpy11.DiploidPopulation.dump_tables_to_tskit() will generate a tskit.TreeSequence object. This object can be saved to a file using tskit.TreeSequence.dump(), resulting in a “trees file”.

Saving data in tskit’s format gives you access to a huge array of methods for downstream analysis of your simulated data. See the tskit docs for much more on what you can do. Much of what you need is found in the docs for tskit.TreeSequence.

What happens when we call fwdpy11.DiploidPopulation.dump_tables_to_tskit()?

  1. All tables are converted from fwdpy11 format to tskit format in a tskit.TableCollection.

  2. For individuals, populations, and mutations, row-level metadata are encoded. The next vignette covers how to interact with this information.

  3. Optionally, top-level metadata are encoded, which is controlled by keyword arguments to fwdpy11.DiploidPopulation.dump_tables_to_tskit().

Obtaining a tskit tree sequence#

pop = fwdpy11.DiploidPopulation(1000, 1.0)

By default, export to tskit returns a tskit.TreeSequence:

ts = pop.dump_tables_to_tskit()
type(ts)
tskit.trees.TreeSequence

In general, we want this type in order to dump straight to a file:

# not executed
ts.dump("treefile.trees")

We can also do a lot of analysis using this type. However, certain tasks would be easier if this type knew about the metadata generated by fwdpy11.

Metadata#

The tskit data model includes metadata. Metadata means anything not strictly needed for operations on trees/tree sequences. Examples may include population names, properties of mutations such as selection coefficients, geographic locations of individuals, etc.. These sorts of metadata are included as optional columns in tables. See the tskit data model documentation for more detail.

Metadata can also include “top-level” information that is stored as metadata for the tree sequence (or tskit.TableCollection) itself. Much of this vignette concerns top-level metadata.

Metadata schema#

The tskit metadata require a specification, or schema, that is used to validate the metadata contents.

The top-level metadata schema is:

print(fwdpy11.tskit_tools.metadata_schema.TopLevelMetadata)
{'codec': 'json',
 'properties': {'data': {'description': 'This field is reserved for the user '
                                        'to fill.',
                         'type': ['object', 'string']},
                'demes_graph': {'description': 'A demographic model specified '
                                               'using demes.This information '
                                               'will be redundant with that '
                                               'stored in model_params,but it '
                                               'may be useful as it allows '
                                               'reconstruction of the YAML '
                                               'filefrom the tree sequence.',
                                'type': 'object'},
                'generation': {'description': 'The value of pop.generation at '
                                              'the time datawere exported to '
                                              'tskit',
                               'type': 'integer'},
                'model_params': {'description': 'One or more '
                                                'fwdpy11.ModelParams '
                                                'instances.',
                                 'type': ['string', 'object']},
                'seed': {'description': 'Random number seed.This is optional '
                                        'because a random number generatormay '
                                        'be called prior to simulation, thus '
                                        'making theinitial seed not capable of '
                                        'reproducing the simulation',
                         'type': 'integer'}},
 'required': ['generation'],
 'title': 'Top-level metadata for table collection/tree sequence.',
 'type': 'object'}

We see that the metadata are encoded as a JSON schema. Only one field is required, generation. This required field is automatically populated by fwdpy11.DiploidPopulation.dump_tables_to_tskit(). The remaining fields are filled in by keyword arguments passed to that function.

Much of what follows concerns filling in this top-level metadata. To do this, we use incomplete data (populations with no ancestry) to illustrate the concepts.

Populating top-level metadata#

Top-level metadata are an attempt to improve reproducibility. However, all of these features are optional. Because a simulation project using Python may be very complex, with many local and non-local imports, etc., fully documenting the top-level metadata and/or “provenance” may be impractical. In some cases, reproducibility may instead by best managed by a git repository where script and Makefile/Snakefile work flows coexist.

Adding random number seeds#

You may add a random number seed to the top-level metadata. Doing so is optional because we cannot guarantee that you used this seed to construct a fwdpy11.GSLrng and then immediately run the simulation. For example, if you built the rng object and then simulated two replicates, the seed is only valid to reproduce the first. You do not know the state of the rng when the second simulation started.

ts = pop.dump_tables_to_tskit()
assert "seed" not in ts.metadata
ts = pop.dump_tables_to_tskit(seed=54321)
assert ts.metadata["seed"] == 54321
ts = pop.dump_tables_to_tskit(seed=-4616)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 ts = pop.dump_tables_to_tskit(seed=-4616)

File ~/work/fwdpy11/fwdpy11/fwdpy11/_types/diploid_population.py:385, in DiploidPopulation.dump_tables_to_tskit(self, model_params, demes_graph, population_metadata, data, seed, parameters, destructive)
    298 def dump_tables_to_tskit(
    299     self,
    300     *,
   (...)
    307     destructive=False,
    308 ) -> tskit.TreeSequence:
    309     """
    310     Dump the population's TableCollection into
    311     an tskit TreeSequence
   (...)
    383 
    384     """
--> 385     return fwdpy11.tskit_tools._dump_tables_to_tskit._dump_tables_to_tskit(
    386         self,
    387         model_params=model_params,
    388         demes_graph=demes_graph,
    389         population_metadata=population_metadata,
    390         data=data,
    391         seed=seed,
    392         parameters=parameters,
    393         destructive=destructive,
    394     )

File ~/work/fwdpy11/fwdpy11/fwdpy11/tskit_tools/_dump_tables_to_tskit.py:198, in _dump_tables_to_tskit(self, model_params, demes_graph, population_metadata, data, seed, parameters, destructive)
    196 if seed is not None:
    197     if seed < 0:
--> 198         raise ValueError(f"seed must be >=0, got {seed}")
    199     top_level_metadata["seed"] = seed
    201 tc.metadata = top_level_metadata

ValueError: seed must be >=0, got -4616

Adding ModelParams objects#

Another optional top-level metadata item are instances of fwdpy11.ModelParams. Including this is optional for several reasons, but a big one is that this class can be arbitrarily hard to serialize. The examples listed below will work with types provided by fwdpy11 (although we haven’t tested them all!), but user-defined types may be more challenging.

ts = pop.dump_tables_to_tskit()
assert "model_params" not in ts.metadata

Here is an example of Gaussian stabilizing selection around an optimum trait value of 0.0:

optimum = fwdpy11.Optimum(optimum=0.0, VS=10.0, when=0)
gss = fwdpy11.GaussianStabilizingSelection.single_trait([optimum])
pdict = {
    'nregions': [],
    'sregions': [fwdpy11.GaussianS(0, 1, 1, 0.10)],
    'recregions': [fwdpy11.PoissonInterval(0, 1, 1e-3)],
    'gvalue': fwdpy11.Additive(2., gss),
    'rates': (0.0, 1e-3, None),
    'simlen': 10 * pop.N,
    'prune_selected': False,
    'demography': fwdpy11.ForwardDemesGraph.tubes([pop.N], 1)
}

params = fwdpy11.ModelParams(**pdict)

We can pass the params object on when exporting the data to tskit:

ts = pop.dump_tables_to_tskit(model_params=params)
assert fwdpy11.ModelParams(**eval(ts.metadata["model_params"])) == params

Including a demes graph#

You may include a demes.Graph in the top-level metadata. If you also include a fwdpy11.ModelParams (see above), including the demes graph gives redundant information. However, including the graph may be useful if downstream analysis will involve other tools compatible with the demes specification. With the graph as metadata, you can extract it and reconstruct the original YAML file, or send it to another Python package that understands it.

The following hidden code block defines a function to return a demes.Graph from YAML input stored in a string literal.

Hide code cell source
def gutenkunst():
    import demes
    yaml = """
description: The Gutenkunst et al. (2009) OOA model.
doi:
- https://doi.org/10.1371/journal.pgen.1000695
time_units: years
generation_time: 25

demes:
- name: ancestral
  description: Equilibrium/root population
  epochs:
  - {end_time: 220e3, start_size: 7300}
- name: AMH
  description: Anatomically modern humans
  ancestors: [ancestral]
  epochs:
  - {end_time: 140e3, start_size: 12300}
- name: OOA
  description: Bottleneck out-of-Africa population
  ancestors: [AMH]
  epochs:
  - {end_time: 21.2e3, start_size: 2100}
- name: YRI
  description: Yoruba in Ibadan, Nigeria
  ancestors: [AMH]
  epochs:
  - start_size: 12300
- name: CEU
  description: Utah Residents (CEPH) with Northern and Western European Ancestry
  ancestors: [OOA]
  epochs:
  - {start_size: 1000, end_size: 29725}
- name: CHB
  description: Han Chinese in Beijing, China
  ancestors: [OOA]
  epochs:
  - {start_size: 510, end_size: 54090}

migrations:
- {demes: [YRI, OOA], rate: 25e-5}
- {demes: [YRI, CEU], rate: 3e-5}
- {demes: [YRI, CHB], rate: 1.9e-5}
- {demes: [CEU, CHB], rate: 9.6e-5}
"""
    return demes.loads(yaml)
graph = gutenkunst()
ts = pop.dump_tables_to_tskit(demes_graph=graph)
assert demes.Graph.fromdict(ts.metadata["demes_graph"]) == graph

Since this is an optional metadata field, accessing it will return None if no graph was provided:

ts = pop.dump_tables_to_tskit()
assert "demes_graph" not in ts.metadata

User-defined metadata#

It may be useful to store arbitrary data in the output. For example, this could be a dict containing information obtained from a time series analysis of a simulation.

The simplest situation is to use a dict containing simple objects:

ts = pop.dump_tables_to_tskit(data={'x': 3})
ts.metadata["data"]
{'x': 3}

Any use of user-defined types will require conversion to string, however:

class MyData(object):
    def __init__(self, x):
        self.x = x

    def __repr__(self):
        return f"MyData(x={self.x})"

# Send in the str representation:
ts = pop.dump_tables_to_tskit(data=str(MyData(x=77)))

Getting the data back out requires an eval:

assert eval(ts.metadata["data"]).x == 77

Warning

Recording data like this can become arbitrary difficult. The Python import system may make it more difficult to recover data than the examples here would indicate. Further, if the data are very large, then other output formats are likely more appropriate.

Setting population table metadata#

You can also set the population table metadata when exporting data. Doing so requires a dict mapping the integer label for each population to another dict.

For example, let’s create an demographic model from the demes graph that used above:

model = fwdpy11.discrete_demography.from_demes(gutenkunst())
type(model)
fwdpy11._types.demographic_model_details.DemographicModelDetails

This object contains a mapping from integer labels to the deme names:

model.metadata['deme_labels']
{0: 'ancestral', 1: 'AMH', 2: 'OOA', 3: 'YRI', 4: 'CEU', 5: 'CHB'}

We can make the required dict like this:

pop_md = {}
for key, value in model.metadata['deme_labels'].items():
    pop_md[key] = {'name': value}

To actually illustrate the use of this metadata, we need to make sure that our tskit output actually has a population table:

# initialize a population with the right number of demes...
multideme_pop = fwdpy11.DiploidPopulation([100]*len(pop_md), 1.)
ts = multideme_pop.dump_tables_to_tskit(population_metadata=pop_md)
for pop in ts.populations():
    print(pop.metadata)
{'name': 'ancestral'}
{'name': 'AMH'}
{'name': 'OOA'}
{'name': 'YRI'}
{'name': 'CEU'}
{'name': 'CHB'}

We could also easily add the deme description to the metadata from the demes graph:

graph = gutenkunst()
graph.demes
[Deme(name='ancestral', description='Equilibrium/root population', start_time=inf, ancestors=[], proportions=[], epochs=[Epoch(start_time=inf, end_time=220000.0, start_size=7300, end_size=7300, size_function='constant', selfing_rate=0, cloning_rate=0)]),
 Deme(name='AMH', description='Anatomically modern humans', start_time=220000.0, ancestors=['ancestral'], proportions=[1.0], epochs=[Epoch(start_time=220000.0, end_time=140000.0, start_size=12300, end_size=12300, size_function='constant', selfing_rate=0, cloning_rate=0)]),
 Deme(name='OOA', description='Bottleneck out-of-Africa population', start_time=140000.0, ancestors=['AMH'], proportions=[1.0], epochs=[Epoch(start_time=140000.0, end_time=21200.0, start_size=2100, end_size=2100, size_function='constant', selfing_rate=0, cloning_rate=0)]),
 Deme(name='YRI', description='Yoruba in Ibadan, Nigeria', start_time=140000.0, ancestors=['AMH'], proportions=[1.0], epochs=[Epoch(start_time=140000.0, end_time=0, start_size=12300, end_size=12300, size_function='constant', selfing_rate=0, cloning_rate=0)]),
 Deme(name='CEU', description='Utah Residents (CEPH) with Northern and Western European Ancestry', start_time=21200.0, ancestors=['OOA'], proportions=[1.0], epochs=[Epoch(start_time=21200.0, end_time=0, start_size=1000, end_size=29725, size_function='exponential', selfing_rate=0, cloning_rate=0)]),
 Deme(name='CHB', description='Han Chinese in Beijing, China', start_time=21200.0, ancestors=['OOA'], proportions=[1.0], epochs=[Epoch(start_time=21200.0, end_time=0, start_size=510, end_size=54090, size_function='exponential', selfing_rate=0, cloning_rate=0)])]
pop_md = {}
for i,deme in enumerate(graph.demes):
    pop_md[i] = {'name': deme.name, "description": deme.description}

pop_md
{0: {'name': 'ancestral', 'description': 'Equilibrium/root population'},
 1: {'name': 'AMH', 'description': 'Anatomically modern humans'},
 2: {'name': 'OOA', 'description': 'Bottleneck out-of-Africa population'},
 3: {'name': 'YRI', 'description': 'Yoruba in Ibadan, Nigeria'},
 4: {'name': 'CEU',
  'description': 'Utah Residents (CEPH) with Northern and Western European Ancestry'},
 5: {'name': 'CHB', 'description': 'Han Chinese in Beijing, China'}}