Calculating frequency spectra from tree sequences#
Note
In version 0.18.0, the sparse matrix type is now given by the scipy.sparse
module
rather than the sparse
module.
This change removes supports for frequency spectra with more than
two sample groups.
The mutation frequency spectrum is an array-like object representing the
number of mutation events occurring at different frequencies. Generating
frequency spectra, or fs
, from simulations is handled by fwdpy11.TableCollection.fs()
.
We have to define some conventions:
A sample list is a list of node indexes. The node indexes are stored in a
numpy.ndarray
withnumpy.dtype
numpy.int32
.We support
fs
calculated from multiple lists of samples.An
fs
always contains the zero and fixed bins. Thus, for a sample ofn
nodes, there aren + 1
entries in thefs
. The first and last values are for frequencies zero andn
, respectively. Singletons start in bin1
, etc..An
fs
from a single sample list is represented by anumpy.ma.MaskedArray
, object with the zero and fixed bins masked.An
fs
from more than one sample list is stored in ascipy.sparse.coo_matrix
sparse matrix.
For our examples, we will initialize a fwdpy11.DiploidPopulation
from
a tskit.TreeSequence
generated by msprime.sim_ancestry()
.
Note
These examples do not show how to get fs
separately
for neutral and non-neutral mutations. See
fwdpy11.TableCollection.fs()
for details.
import fwdpy11
import demes
import msprime
import numpy as np
yaml = """
time_units: generations
demes:
- name: deme0
epochs:
- start_size: 250
- name: deme1
epochs:
- start_size: 250
migrations:
- {demes: [deme0, deme1], rate: 0.1}
"""
graph = demes.loads(yaml)
demography = msprime.Demography.from_demes(graph)
rng = fwdpy11.GSLrng(4321678)
config = [msprime.PopulationConfiguration(500), msprime.PopulationConfiguration(500)]
ts = msprime.sim_ancestry(
samples = {0: 250, 1: 250},
random_seed=777,
demography = demography,
)
pop = fwdpy11.DiploidPopulation.create_from_tskit(ts)
md = np.array(pop.diploid_metadata, copy=False)
np.unique(md["deme"], return_counts=True)
nmuts = fwdpy11.infinite_sites(rng, pop, 0.1)
nmuts
1532
The following blocks show several methods for obtaining the fs
from lists of nodes.
First, let’s get the lists of nodes from the two demes in our population:
nodes = np.array(pop.tables.nodes, copy=False)
alive_nodes = pop.alive_nodes
deme0_nodes = alive_nodes[np.where(nodes["deme"][alive_nodes] == 0)[0]]
deme1_nodes = alive_nodes[np.where(nodes["deme"][alive_nodes] == 1)[0]]
Get an fs
from nodes found only in deme 0:
pop.tables.fs([deme0_nodes[:10]])
masked_array(data=[--, 137, 326, 6, 47, 0, 40, 0, 137, 0, --],
mask=[ True, False, False, False, False, False, False, False,
False, False, True],
fill_value=999999,
dtype=int32)
Get a joint fs
from nodes from each deme:
fs = pop.tables.fs([deme0_nodes[:10], deme1_nodes[50:55]])
fs
<11x6 sparse matrix of type '<class 'numpy.int32'>'
with 11 stored elements in COOrdinate format>
Obtain the full numpy.ndarray
for the joint fs
:
np.asarray(fs.todense())
array([[ 0, 61, 0, 0, 0, 0],
[ 88, 5, 44, 0, 0, 0],
[ 29, 291, 6, 0, 0, 0],
[ 6, 0, 0, 0, 0, 0],
[ 0, 47, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 40, 0, 0],
[ 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 137, 0],
[ 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0]], dtype=int32)
Warning
The joint fs
can take a lot of memory!
Note
Without the call to np.asarray
, you would get a np.matrix
object back, which is a different thing!
We can use standard array operations to get the marginal fs
from our joint fs
as a numpy.ndarray
:
np.asarray(fs.sum(axis=1).flatten())[0]
np.asarray(fs.sum(axis=0).flatten())[0]
array([123, 404, 50, 40, 137, 0])
The marginalization can be tedious for many samples, so you can have it happen automatically,
in which case a dict
is returned, keyed by sample list index:
fs = pop.tables.fs([deme0_nodes[:10], deme1_nodes[50:55]], marginalize=True)
for key, value in fs.items():
print(key)
print(value)
print(value.data)
0
[-- 137 326 6 47 0 40 0 137 0 --]
[ 61 137 326 6 47 0 40 0 137 0 0]
1
[-- 404 50 40 137 --]
[123 404 50 40 137 0]
Note
Marginalizing in this way preserves the convention that the 1-d fs
objects are instances of numpy.ma.MaskedArray
.
To see how the dict
keying works, let’s flip the sample lists:
fs = pop.tables.fs([deme1_nodes[50:55], deme0_nodes[:10]], marginalize=True)
for key, value in fs.items():
print(key)
print(value)
print(value.data)
0
[-- 404 50 40 137 --]
[123 404 50 40 137 0]
1
[-- 137 326 6 47 0 40 0 137 0 --]
[ 61 137 326 6 47 0 40 0 137 0 0]
If you only want the fs
from particular regions of the genome. By default,
the fs
is the sum across windows:
pop.tables.fs([deme0_nodes[:10]], windows=[(0.1, 0.2), (0.8, 0.9)])
masked_array(data=[--, 28, 65, 0, 11, 0, 4, 0, 35, 0, --],
mask=[ True, False, False, False, False, False, False, False,
False, False, True],
fill_value=999999,
dtype=int32)
You can get the fs
separately by window, too:
pop.tables.fs(
[deme0_nodes[:10]], windows=[(0.1, 0.2), (0.8, 0.9)], separate_windows=True
)
[masked_array(data=[--, 10, 38, 0, 8, 0, 2, 0, 16, 0, --],
mask=[ True, False, False, False, False, False, False, False,
False, False, True],
fill_value=999999,
dtype=int32),
masked_array(data=[--, 18, 27, 0, 3, 0, 2, 0, 19, 0, --],
mask=[ True, False, False, False, False, False, False, False,
False, False, True],
fill_value=999999,
dtype=int32)]
You can also get a joint fs
marginalized by sample list and separated
by window. In this case, the return value is a list
containing
the dict
for each window:
pop.tables.fs(
[deme0_nodes[:10], deme1_nodes[:20]],
windows=[(0.1, 0.2), (0.8, 0.9)],
marginalize=True,
separate_windows=True,
)
[{0: masked_array(data=[--, 10, 38, 0, 8, 0, 2, 0, 16, 0, --],
mask=[ True, False, False, False, False, False, False, False,
False, False, True],
fill_value=999999),
1: masked_array(data=[--, 19, 24, 9, 3, 3, 2, 0, 2, 0, 0, 0, 0, 0, 1, 2, 0,
0, 16, 0, --],
mask=[ True, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, True],
fill_value=999999)},
{0: masked_array(data=[--, 18, 27, 0, 3, 0, 2, 0, 19, 0, --],
mask=[ True, False, False, False, False, False, False, False,
False, False, True],
fill_value=999999),
1: masked_array(data=[--, 20, 19, 8, 8, 2, 1, 0, 7, 0, 0, 0, 0, 0, 0, 2, 0,
0, 19, 0, --],
mask=[ True, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, True],
fill_value=999999)}]
Simplifying to the samples#
Finally, it is sometimes more efficient to simplify the tree sequences with
respect to the sample nodes. For example, if there are a vast number of
ancient samples and you are processing each time point separately
(see fwdpy11.DiploidPopulation.sample_timepoints()
), then not simplifying
means iterating over trees that are redundant/irrelevant to the history of
the current time point. In order to get the fs
from a simplified
tree sequence, pass simplify=True
when calling fwdpy11.TableCollection.fs()
.