Forward-time population genetic simulation in Python
#Use Python 3's print as a function.
#This future-proofs the code in the notebook
from __future__ import print_function
#Import fwdpy. Give it a shorter name
import fwdpy as fp
##Other libs we need
import numpy as np
import pandas as pd
import math
# Where neutral mutations occur:
nregions = [fp.Region(beg=0,end=1,weight=1)]
# Where selected mutations occur:
sregions = [fp.ConstantS(beg=-1,end=0,weight=1,s=-0.05,h=1),
            fp.ConstantS(beg=1,end=2,weight=1,s=-0.05,h=1)]
# Recombination:
recregions = [fp.Region(beg=-1,end=2,weight=1)]
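As I understand the region API, the weight parameters set relative proportions only; the total mutation and recombination rates are passed to evolve_regions below. A quick sketch of that assumption, checking how the deleterious rate of 0.01 splits across the two selected regions:
#A sketch (assumption: each total rate is partitioned among regions in
#proportion to their weights). With two equally-weighted selected regions,
#each receives half of the total deleterious rate passed to evolve_regions:
sel_weights = [1.0,1.0]  #the 'weight' values given to ConstantS above
per_region_rate = [0.01*w/sum(sel_weights) for w in sel_weights]
print(per_region_rate)   #[0.005, 0.005]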
#Population size
N=1000
#We'll evolve for 10N generations.
#nlist is a list of population sizes over time.
#len(nlist) is the length of the simulation
#We use numpy arrays for speed and optimised RAM
#use. Note the dtype=np.uint32, which means 32-bit
#unsigned integer. Failure to use this type will
#cause a run-time error.
nlist = np.array([N]*10*N,dtype=np.uint32)
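Because nlist is simply an array of per-generation sizes, demographic events can be encoded by concatenating arrays. A hypothetical variant, not used in this example:
#Hypothetical sketch: burn in at size N, crash to N/10 for 100 generations,
#then recover to N. The dtype must remain np.uint32 throughout.
nlist_bottleneck = np.concatenate([
    np.array([N]*(10*N),dtype=np.uint32),   #burn-in
    np.array([N//10]*100,dtype=np.uint32),  #bottleneck
    np.array([N]*(5*N),dtype=np.uint32),    #recovery
])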
#Initialize a random number generator with seed value of 101
rng = fp.GSLrng(101)
#Simulate 40 replicate populations. This uses C++11 threads behind the scenes:
pops = fp.evolve_regions(rng,        #The random number generator
                         40,         #The number of pops to simulate = number of threads to use.
                         N,          #Initial population size for each of the 40 demes
                         nlist[0:],  #List of population sizes over time.
                         0.005,      #Neutral mutation rate (per gamete, per generation)
                         0.01,       #Deleterious mutation rate (per gamete, per generation)
                         0.005,      #Recombination rate (per diploid, per generation)
                         nregions,   #Defined above
                         sregions,   #Defined above
                         recregions) #Defined above
#Now, pops is a Python list with len(pops) = 40
#Each element's type is fwdpy.Spop
print(len(pops))
print(type(pops[0]))
40
<type 'fwdpy.fwdpy.Spop'>
#Use a list comprehension to get a random sample of size
#n = 20 from each replicate
samples = [fp.get_samples(rng,i,20) for i in pops]
#Samples is now a list of tuples of two lists.
#Each list contains tuples of mutation positions and genotypes.
#The first list represents neutral variants.
#The second list represents variants affecting fitness ('selected' variants)
#We will manipulate/analyze these genotypes, etc.,
#in a later example
for i in samples[:4]:
    print("A sample from a population is a ",type(i))
print(len(samples))
A sample from a population is a <type 'tuple'>
A sample from a population is a <type 'tuple'>
A sample from a population is a <type 'tuple'>
A sample from a population is a <type 'tuple'>
40
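To make the structure concrete, here is a minimal sketch of unpacking one sample, relying only on the (position, genotype-string) layout described above (and assuming at least one segregating neutral site):
#A minimal sketch of the layout described above; nothing new is computed:
neutral,selected = samples[0]  #each sample is a (neutral, selected) tuple
pos,genotypes = neutral[0]     #each site is a (position, genotype string)
#position, derived-allele count, and sample size (should be 20):
print(pos,genotypes.count('1'),len(genotypes))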
#Again, use list comprehension to get the 'details' of each sample
#Given that each object in samples is a tuple, and that the second
#item in each tuple represents selected mutations, i[1] in the line
#below means that we are getting the mutation information only for
#selected variants
details = [pd.DataFrame(fp.get_sample_details(i[1],j)) for i,j in zip(samples,pops)]
#details is now a list of pandas DataFrame objects
#Each DataFrame has the following columns:
# dcount: number of copies of the derived allele in the sample
# ftime: generation when the mutation fixed (4294967295 = 2^32-1 is a sentinel meaning "not fixed")
# generation: generation when the sample was taken
# h: dominance of the mutation
# label: a label applied to mutations for each region. Here, 0 is used for all regions
# locus: index of the locus where the mutation occurs
# origin: generation when the mutation first arose
# p: frequency of the mutation in the population
# s: selection coefficient of the mutation
for i in details[:4]:
    print(i)
dcount ftime generation h label locus origin p s
0 1 4294967295 10000 1.0 0 0 9974 0.004 -0.05
dcount ftime generation h label locus origin p s
0 1 4294967295 10000 1.0 0 0 9976 0.0090 -0.05
1 1 4294967295 10000 1.0 0 0 9939 0.0090 -0.05
2 1 4294967295 10000 1.0 0 0 9957 0.0030 -0.05
3 1 4294967295 10000 1.0 0 0 9991 0.0005 -0.05
4 1 4294967295 10000 1.0 0 0 9946 0.0040 -0.05
5 1 4294967295 10000 1.0 0 0 9996 0.0035 -0.05
dcount ftime generation h label locus origin p s
0 1 4294967295 10000 1.0 0 0 9984 0.0035 -0.05
1 1 4294967295 10000 1.0 0 0 9975 0.0015 -0.05
2 1 4294967295 10000 1.0 0 0 9983 0.0020 -0.05
3 1 4294967295 10000 1.0 0 0 9998 0.0010 -0.05
4 1 4294967295 10000 1.0 0 0 9940 0.0045 -0.05
dcount ftime generation h label locus origin p s
0 1 4294967295 10000 1.0 0 0 9990 0.0045 -0.05
1 2 4294967295 10000 1.0 0 0 9953 0.0300 -0.05
2 1 4294967295 10000 1.0 0 0 9968 0.0035 -0.05
3 1 4294967295 10000 1.0 0 0 9987 0.0005 -0.05
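There is no explicit "mutation age" column, but under my reading of the columns above it can be recovered as generation minus origin:
#A sketch (assumption: age in generations = generation - origin):
for d in details[:1]:
    print((d['generation']-d['origin']).tolist())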
#The order of the rows in each DataFrame is the
#same as the order of the objects in 'samples':
for i in range(4):
    print("Number of sites in samples[",i,"] = ",
          len(samples[i][1]),". Number of rows in DataFrame ",i,
          " = ",len(details[i].index),sep="")
Number of sites in samples[0] = 1. Number of rows in DataFrame 0 = 1
Number of sites in samples[1] = 6. Number of rows in DataFrame 1 = 6
Number of sites in samples[2] = 5. Number of rows in DataFrame 2 = 5
Number of sites in samples[3] = 4. Number of rows in DataFrame 3 = 4
#Pandas DataFrames are cool.
#Let's add a column to each DataFrame
#specifying the mutation position,
#count of derived state,
#and a "replicate ID"
for i in range(len(details)):
    ##samples[i][1] again is the selected mutations in the sample taken
    ##from the i-th replicate
    details[i]['pos']=[x[0] for x in samples[i][1]]               #Mutation position
    details[i]['count']=[x[1].count('1') for x in samples[i][1]]  #No. occurrences of derived state in sample
    details[i]['id']=[i]*len(details[i].index)                    #Replicate id
##Merge into 1 big DataFrame:
BigTable = pd.concat(details)
print("This is the merged table:")
print(BigTable)
This is the merged table:
dcount ftime generation h label locus origin p s \
0 1 4294967295 10000 1.0 0 0 9974 0.0040 -0.05
0 1 4294967295 10000 1.0 0 0 9976 0.0090 -0.05
1 1 4294967295 10000 1.0 0 0 9939 0.0090 -0.05
2 1 4294967295 10000 1.0 0 0 9957 0.0030 -0.05
3 1 4294967295 10000 1.0 0 0 9991 0.0005 -0.05
4 1 4294967295 10000 1.0 0 0 9946 0.0040 -0.05
5 1 4294967295 10000 1.0 0 0 9996 0.0035 -0.05
0 1 4294967295 10000 1.0 0 0 9984 0.0035 -0.05
1 1 4294967295 10000 1.0 0 0 9975 0.0015 -0.05
2 1 4294967295 10000 1.0 0 0 9983 0.0020 -0.05
3 1 4294967295 10000 1.0 0 0 9998 0.0010 -0.05
4 1 4294967295 10000 1.0 0 0 9940 0.0045 -0.05
0 1 4294967295 10000 1.0 0 0 9990 0.0045 -0.05
1 2 4294967295 10000 1.0 0 0 9953 0.0300 -0.05
2 1 4294967295 10000 1.0 0 0 9968 0.0035 -0.05
3 1 4294967295 10000 1.0 0 0 9987 0.0005 -0.05
0 1 4294967295 10000 1.0 0 0 9992 0.0010 -0.05
1 1 4294967295 10000 1.0 0 0 9908 0.0125 -0.05
2 1 4294967295 10000 1.0 0 0 9997 0.0010 -0.05
3 1 4294967295 10000 1.0 0 0 9999 0.0005 -0.05
4 1 4294967295 10000 1.0 0 0 9981 0.0065 -0.05
5 1 4294967295 10000 1.0 0 0 9997 0.0010 -0.05
0 1 4294967295 10000 1.0 0 0 9987 0.0025 -0.05
1 1 4294967295 10000 1.0 0 0 9864 0.0130 -0.05
2 1 4294967295 10000 1.0 0 0 9938 0.0035 -0.05
3 1 4294967295 10000 1.0 0 0 9990 0.0015 -0.05
4 1 4294967295 10000 1.0 0 0 9977 0.0075 -0.05
0 1 4294967295 10000 1.0 0 0 9983 0.0025 -0.05
1 1 4294967295 10000 1.0 0 0 9981 0.0120 -0.05
2 1 4294967295 10000 1.0 0 0 9993 0.0050 -0.05
.. ... ... ... ... ... ... ... ... ...
0 1 4294967295 10000 1.0 0 0 9959 0.0060 -0.05
1 1 4294967295 10000 1.0 0 0 9980 0.0080 -0.05
2 1 4294967295 10000 1.0 0 0 9975 0.0040 -0.05
3 1 4294967295 10000 1.0 0 0 9976 0.0095 -0.05
4 1 4294967295 10000 1.0 0 0 9990 0.0005 -0.05
0 1 4294967295 10000 1.0 0 0 9998 0.0010 -0.05
1 1 4294967295 10000 1.0 0 0 9993 0.0030 -0.05
2 1 4294967295 10000 1.0 0 0 9958 0.0075 -0.05
0 1 4294967295 10000 1.0 0 0 9978 0.0035 -0.05
1 1 4294967295 10000 1.0 0 0 9959 0.0125 -0.05
2 1 4294967295 10000 1.0 0 0 9984 0.0065 -0.05
3 1 4294967295 10000 1.0 0 0 9985 0.0020 -0.05
4 1 4294967295 10000 1.0 0 0 9982 0.0065 -0.05
0 1 4294967295 10000 1.0 0 0 9988 0.0020 -0.05
1 1 4294967295 10000 1.0 0 0 9992 0.0005 -0.05
2 1 4294967295 10000 1.0 0 0 9988 0.0070 -0.05
3 1 4294967295 10000 1.0 0 0 9982 0.0020 -0.05
4 1 4294967295 10000 1.0 0 0 9975 0.0020 -0.05
5 1 4294967295 10000 1.0 0 0 9976 0.0075 -0.05
0 1 4294967295 10000 1.0 0 0 9991 0.0110 -0.05
1 1 4294967295 10000 1.0 0 0 9985 0.0040 -0.05
0 1 4294967295 10000 1.0 0 0 9996 0.0035 -0.05
1 1 4294967295 10000 1.0 0 0 9992 0.0060 -0.05
2 1 4294967295 10000 1.0 0 0 9998 0.0010 -0.05
3 1 4294967295 10000 1.0 0 0 9982 0.0165 -0.05
0 1 4294967295 10000 1.0 0 0 9978 0.0030 -0.05
1 1 4294967295 10000 1.0 0 0 9981 0.0035 -0.05
2 1 4294967295 10000 1.0 0 0 9993 0.0020 -0.05
3 2 4294967295 10000 1.0 0 0 9986 0.0095 -0.05
4 1 4294967295 10000 1.0 0 0 9998 0.0015 -0.05
pos count id
0 1.283749 1 0
0 -0.320125 1 1
1 -0.119514 1 1
2 -0.102664 1 1
3 1.042025 1 1
4 1.443235 1 1
5 1.804796 1 1
0 -0.632548 1 2
1 -0.460367 1 2
2 -0.119099 1 2
3 -0.055915 1 2
4 1.587909 1 2
0 -0.480663 1 3
1 -0.354161 2 3
2 1.351797 1 3
3 1.381058 1 3
0 -0.897565 1 4
1 -0.751551 1 4
2 -0.123021 1 4
3 -0.088541 1 4
4 1.242531 1 4
5 1.813220 1 4
0 -0.438263 1 5
1 -0.243305 1 5
2 -0.196773 1 5
3 1.483931 1 5
4 1.723836 1 5
0 -0.818220 1 6
1 -0.323753 1 6
2 -0.047837 1 6
.. ... ... ..
0 -0.923395 1 33
1 -0.880657 1 33
2 -0.501588 1 33
3 1.434086 1 33
4 1.897750 1 33
0 -0.521129 1 34
1 -0.083328 1 34
2 1.777795 1 34
0 -0.703952 1 35
1 -0.385990 1 35
2 1.091824 1 35
3 1.257222 1 35
4 1.772582 1 35
0 -0.595274 1 36
1 -0.499114 1 36
2 -0.416441 1 36
3 -0.320007 1 36
4 1.178538 1 36
5 1.530433 1 36
0 -0.941568 1 37
1 -0.498442 1 37
0 -0.221834 1 38
1 1.537788 1 38
2 1.738061 1 38
3 1.994364 1 38
0 -0.983060 1 39
1 -0.475971 1 39
2 -0.105493 1 39
3 1.197116 2 39
4 1.257867 1 39
[143 rows x 12 columns]
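With everything merged into one DataFrame, standard pandas operations apply. As a usage sketch, here is a per-replicate summary of the selected sites:
#A usage sketch: per-replicate number of selected sites and their mean
#population frequency, via standard pandas grouping:
print(BigTable.groupby('id')['p'].agg(['count','mean']).head())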
We will use the pylibseq package to calculate summary statistics. pylibseq is a Python wrapper around libsequence.
import libsequence.polytable as polyt
import libsequence.summstats as sstats
#Convert neutral mutations into libsequence "SimData" objects,
#which are intended to handle binary (0/1) data like
#what comes out of these simulations
n = [polyt.SimData(i[0]) for i in samples]
#Create "factories" for calculating the summary stats
an = [sstats.PolySIM(i) for i in n]
##Collect a bunch of summary stats into a pandas.DataFrame:
NeutralMutStats = pd.DataFrame([ {'thetapi':i.thetapi(),'npoly':i.numpoly(),'thetaw':i.thetaw()} for i in an ])
NeutralMutStats
    npoly    thetapi     thetaw
0      48  12.636842  13.529741
1      49  12.563158  13.811611
2      97  35.968421  27.341352
3      50  12.921053  14.093481
4      68  17.015789  19.167134
5      32   8.584211   9.019828
6      65  18.905263  18.321525
7      63  20.036842  17.757786
8      90  28.968421  25.368265
9      45  12.452632  12.684133
10     47  12.242105  13.247872
11     71  20.752632  20.012742
12     69  19.900000  19.449003
13     59  10.900000  16.630307
14     52  14.494737  14.657220
15     51  16.563158  14.375350
16     58  14.500000  16.348437
17     83  29.478947  23.395178
18     70  25.689474  19.730873
19     49  15.473684  13.811611
20     76  23.847368  21.422090
21     53   7.784211  14.939089
22     46  14.315789  12.966002
23     83  27.100000  23.395178
24     59  17.157895  16.630307
25     51  14.778947  14.375350
26     85  20.984211  23.958917
27     48  12.915789  13.529741
28     43   9.763158  12.120393
29     50  12.063158  14.093481
30     53   8.421053  14.939089
31     86  26.073684  24.240787
32     38  10.889474  10.711045
33     53  13.236842  14.939089
34     57   9.842105  16.066568
35     51  11.831579  14.375350
36     91  29.989474  25.650135
37     68  18.752632  19.167134
38     79  23.105263  22.267699
39     92  27.031579  25.932004
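PolySIM objects expose more statistics than the three collected here, such as Tajima's D. A hedged sketch (the method name tajimasd is my recollection of the pylibseq API; check your version's documentation):
#A sketch: Tajima's D for each replicate, reusing the PolySIM objects in 'an'.
#The method name 'tajimasd' is an assumption about the pylibseq API:
tajd = pd.DataFrame([{'tajd':i.tajimasd()} for i in an])
print(tajd.head())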
Under the BGS model, the expectation of $\pi$ is $E[\pi] = \pi_0 e^{-\frac{U}{2sh+r}}$, where $U$ is the mutation rate to strongly-deleterious variants, $\pi_0$ is the value expected in the absence of BGS (i.e., $\pi_0 = \theta = 4N_e\mu$), $s$ and $h$ are the selection and dominance coefficients, and $r$ is the recombination rate.
Note that $U$ is defined per diploid, i.e., twice the per-gamete rate (see Hudson and Kaplan (1995), PMC1206891, for details).
For our parameters ($\pi_0 = 4 \times 1000 \times 0.005 = 20$, $U = 2 \times 0.01 = 0.02$, $2sh = 2 \times 0.05 \times 1 = 0.1$, $r = 0.005$), we have $E[\pi] = 20e^{-\frac{0.02}{0.1+0.005}}$, which equals:
print(20*math.exp(-0.02/(0.1+0.005)))
16.5313087525
Now, let’s get the average $\pi$ from 1000 simulated replicates. We already have the 40 replicates from above, so we’ll run another 24 batches of 40 populations each.
We will use standard Python to grow our collection of summary statistics.
for rep in range(24):  #'rep' avoids shadowing the comprehension variables below
    pops = fp.evolve_regions(rng,
                             40,
                             N,
                             nlist[0:],
                             0.005,
                             0.01,
                             0.005,
                             nregions,
                             sregions,
                             recregions)
    samples = [fp.get_samples(rng,i,20) for i in pops]
    simdatasNeut = [polyt.SimData(i[0]) for i in samples]
    polySIMn = [sstats.PolySIM(i) for i in simdatasNeut]
    ##Append stats onto our growing DataFrame:
    NeutralMutStats=pd.concat([NeutralMutStats,
                               pd.DataFrame([ {'thetapi':i.thetapi(),
                                               'npoly':i.numpoly(),
                                               'thetaw':i.thetaw()} for i in polySIMn ])])
We’ve collected everything into a big pandas DataFrame, so we can easily get the means using the built-in mean function.
If you’re happier in R, you could write this DataFrame to a text file and process it with R’s dplyr package, which is an excellent tool for this sort of thing.
#Get means for each column:
NeutralMutStats.mean(0)
npoly 58.372000
thetapi 16.251858
thetaw 16.453293
dtype: float64
The ‘thetapi’ record is our mean $\pi$ from all of the simulations, and it is quite close to the theoretical value.
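As a final sanity check, the relative difference between the two can be computed directly:
#Compare the simulated mean pi to the theoretical BGS expectation above:
expected = 20*math.exp(-0.02/(0.1+0.005))
observed = NeutralMutStats['thetapi'].mean()
print("relative difference:",(observed-expected)/expected)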