Calculating frequency spectra from tree sequences#

Note

In version 0.18.0, the sparse matrix type is now given by the scipy.sparse module rather than the sparse module. This change removes supports for frequency spectra with more than two sample groups.

The mutation frequency spectrum is an array-like object representing the number of mutation events occurring at different frequencies. Generating frequency spectra, or fs, from simulations is handled by fwdpy11.TableCollection.fs().

We have to define some conventions:

  • A sample list is a list of node indexes. The node indexes are stored in a numpy.ndarray with numpy.dtype numpy.int32.

  • We support fs calculated from multiple lists of samples.

  • An fs always contains the zero and fixed bins. Thus, for a sample of n nodes, there are n + 1 entries in the fs. The first and last values are for frequencies zero and n, respectively. Singletons start in bin 1, etc..

  • An fs from a single sample list is represented by a numpy.ma.MaskedArray, object with the zero and fixed bins masked.

  • An fs from more than one sample list is stored in a scipy.sparse.coo_matrix sparse matrix.

For our examples, we will initialize a fwdpy11.DiploidPopulation from a tskit.TreeSequence generated by msprime.sim_ancestry().

Note

These examples do not show how to get fs separately for neutral and non-neutral mutations. See fwdpy11.TableCollection.fs() for details.

import fwdpy11
import demes
import msprime
import numpy as np

yaml = """
time_units: generations
demes:
 - name: deme0
   epochs:
   - start_size: 250
 - name: deme1
   epochs:
   - start_size: 250
migrations:
 - {demes: [deme0, deme1], rate: 0.1}
"""
graph = demes.loads(yaml)
demography = msprime.Demography.from_demes(graph)


rng = fwdpy11.GSLrng(4321678)
config = [msprime.PopulationConfiguration(500), msprime.PopulationConfiguration(500)]
ts = msprime.sim_ancestry(
    samples = {0: 250, 1: 250},
    random_seed=777,
    demography = demography,
)

pop = fwdpy11.DiploidPopulation.create_from_tskit(ts)
md = np.array(pop.diploid_metadata, copy=False)
np.unique(md["deme"], return_counts=True)

nmuts = fwdpy11.infinite_sites(rng, pop, 0.1)
nmuts
1532

The following blocks show several methods for obtaining the fs from lists of nodes. First, let’s get the lists of nodes from the two demes in our population:

nodes = np.array(pop.tables.nodes, copy=False)
alive_nodes = pop.alive_nodes
deme0_nodes = alive_nodes[np.where(nodes["deme"][alive_nodes] == 0)[0]]
deme1_nodes = alive_nodes[np.where(nodes["deme"][alive_nodes] == 1)[0]]

Get an fs from nodes found only in deme 0:

pop.tables.fs([deme0_nodes[:10]])
masked_array(data=[--, 137, 326, 6, 47, 0, 40, 0, 137, 0, --],
             mask=[ True, False, False, False, False, False, False, False,
                   False, False,  True],
       fill_value=999999,
            dtype=int32)

Get a joint fs from nodes from each deme:

fs = pop.tables.fs([deme0_nodes[:10], deme1_nodes[50:55]])
fs
<11x6 sparse matrix of type '<class 'numpy.int32'>'
	with 11 stored elements in COOrdinate format>

Obtain the full numpy.ndarray for the joint fs:

np.asarray(fs.todense())
array([[  0,  61,   0,   0,   0,   0],
       [ 88,   5,  44,   0,   0,   0],
       [ 29, 291,   6,   0,   0,   0],
       [  6,   0,   0,   0,   0,   0],
       [  0,  47,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0],
       [  0,   0,   0,  40,   0,   0],
       [  0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0, 137,   0],
       [  0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0]], dtype=int32)

Warning

The joint fs can take a lot of memory!

Note

Without the call to np.asarray, you would get a np.matrix object back, which is a different thing!

We can use standard array operations to get the marginal fs from our joint fs as a numpy.ndarray:

np.asarray(fs.sum(axis=1).flatten())[0]
np.asarray(fs.sum(axis=0).flatten())[0]
array([123, 404,  50,  40, 137,   0])

The marginalization can be tedious for many samples, so you can have it happen automatically, in which case a dict is returned, keyed by sample list index:

fs = pop.tables.fs([deme0_nodes[:10], deme1_nodes[50:55]], marginalize=True)
for key, value in fs.items():
    print(key)
    print(value)
    print(value.data)
0
[-- 137 326 6 47 0 40 0 137 0 --]
[ 61 137 326   6  47   0  40   0 137   0   0]
1
[-- 404 50 40 137 --]
[123 404  50  40 137   0]

Note

Marginalizing in this way preserves the convention that the 1-d fs objects are instances of numpy.ma.MaskedArray.

To see how the dict keying works, let’s flip the sample lists:

fs = pop.tables.fs([deme1_nodes[50:55], deme0_nodes[:10]], marginalize=True)
for key, value in fs.items():
    print(key)
    print(value)
    print(value.data)
0
[-- 404 50 40 137 --]
[123 404  50  40 137   0]
1
[-- 137 326 6 47 0 40 0 137 0 --]
[ 61 137 326   6  47   0  40   0 137   0   0]

If you only want the fs from particular regions of the genome. By default, the fs is the sum across windows:

pop.tables.fs([deme0_nodes[:10]], windows=[(0.1, 0.2), (0.8, 0.9)])
masked_array(data=[--, 28, 65, 0, 11, 0, 4, 0, 35, 0, --],
             mask=[ True, False, False, False, False, False, False, False,
                   False, False,  True],
       fill_value=999999,
            dtype=int32)

You can get the fs separately by window, too:

pop.tables.fs(
    [deme0_nodes[:10]], windows=[(0.1, 0.2), (0.8, 0.9)], separate_windows=True
)
[masked_array(data=[--, 10, 38, 0, 8, 0, 2, 0, 16, 0, --],
              mask=[ True, False, False, False, False, False, False, False,
                    False, False,  True],
        fill_value=999999,
             dtype=int32),
 masked_array(data=[--, 18, 27, 0, 3, 0, 2, 0, 19, 0, --],
              mask=[ True, False, False, False, False, False, False, False,
                    False, False,  True],
        fill_value=999999,
             dtype=int32)]

You can also get a joint fs marginalized by sample list and separated by window. In this case, the return value is a list containing the dict for each window:

pop.tables.fs(
    [deme0_nodes[:10], deme1_nodes[:20]],
    windows=[(0.1, 0.2), (0.8, 0.9)],
    marginalize=True,
    separate_windows=True,
)
[{0: masked_array(data=[--, 10, 38, 0, 8, 0, 2, 0, 16, 0, --],
               mask=[ True, False, False, False, False, False, False, False,
                     False, False,  True],
         fill_value=999999),
  1: masked_array(data=[--, 19, 24, 9, 3, 3, 2, 0, 2, 0, 0, 0, 0, 0, 1, 2, 0,
                     0, 16, 0, --],
               mask=[ True, False, False, False, False, False, False, False,
                     False, False, False, False, False, False, False, False,
                     False, False, False, False,  True],
         fill_value=999999)},
 {0: masked_array(data=[--, 18, 27, 0, 3, 0, 2, 0, 19, 0, --],
               mask=[ True, False, False, False, False, False, False, False,
                     False, False,  True],
         fill_value=999999),
  1: masked_array(data=[--, 20, 19, 8, 8, 2, 1, 0, 7, 0, 0, 0, 0, 0, 0, 2, 0,
                     0, 19, 0, --],
               mask=[ True, False, False, False, False, False, False, False,
                     False, False, False, False, False, False, False, False,
                     False, False, False, False,  True],
         fill_value=999999)}]

Simplifying to the samples#

Finally, it is sometimes more efficient to simplify the tree sequences with respect to the sample nodes. For example, if there are a vast number of ancient samples and you are processing each time point separately (see fwdpy11.DiploidPopulation.sample_timepoints()), then not simplifying means iterating over trees that are redundant/irrelevant to the history of the current time point. In order to get the fs from a simplified tree sequence, pass simplify=True when calling fwdpy11.TableCollection.fs().