A Brief Introduction to msprime

Coalescent simulators are commonly used in evolutionary biology for a variety of reasons. They nicely complement forward-time simulators, which evaluate evolution forward in time, keep better track of individuals within populations, and allow users to implement more “biological” scenarios. Coalescent simulators, on the other hand, are more model-based, with simulations coming from draws of distributions of parameters, and they also simulate backward in time.

I sat through a brief introductory session on newer coalescent simulation software called msprime. This session was kindly provided by Mia Miyagi the Wakeley lab Harvard. msprime is published by Kelleher et al. (2016) and has detailed documentation. Some major advantages to msprime are that it is fast, accounts for numerous population genetic processes, and uses a new “tree sequence” data type to store geneologies, which more efficiently stores trees across genomes by taking advantage of the fact that the relationships between trees are often not random. While this leads to a more specialized data structure, it works well and common data formats (e.g., Newick) can be exported. msprime can also do forward-time simulations.

The tutorial I went through a couple relatively simple examples. msprime has a nice Python interface, which makes it quite easy to put together arbitrary simulation models that can be run effectively. The first scenario is a simple 4 population model with fixed population sizes, mutation rates, and recombination rates. The output that results are geneologies for each locus in Newick format and a binary genotype matrix.

import msprime

for tree in treelist:

This script can be saved as the file msprime_simple.py and run simply using the command python msprime_simple.py (assuming msprime is properly installed).

We also ran through a second, more complex example with a 4-tip tree with a migration event between tips 1 and 3. Here is an image of the tree model we were considering. The image should be rotated counterclockwise 90 degrees but GitHub is not cooperating properly to do that, so please pardon that issue.

msprime model

The script to create and run such a model, which includes more details on parameters, etc. is as follows. I have added some annotation comments to help make sense of different parts and of the output being produced.

import msprime
import numpy as np


def introgression_simulation(length, mutation_rate, recombination_rate, sample_size, pop_size, intro_prob, mig_rate):
	## initialize populations with equal initial sizes
	population_configurations = [
		msprime.PopulationConfiguration(sample_size=sample_size, initial_size = N_1),
       		msprime.PopulationConfiguration(sample_size=sample_size, initial_size = N_2),
        	msprime.PopulationConfiguration(sample_size=sample_size, initial_size = N_3),
		msprime.PopulationConfiguration(sample_size=sample_size, initial_size = N_4)]
	## number of demes (populations)
	## migration rates
	## migration matrix
	migration_matrix = [
        	[0, m, m, m],
        	[m, 0, m, m],
        	[m, m, 0, m],
			[m, m, m, 0]]

	## time at which there are 3 populations
	T_3 = 6*pop_size
	## time at which there are 4 populations
	T_4 = 4*pop_size
	## time at which there is introgression
	T_intro = 2*pop_size
	## time at which there are 2 populations
	T_2 = 20*pop_size

	demographic_events = [
		## migration between 1 and 3 before populations split (backward in time)
			time = T_intro, source = 0, destination = 2, proportion = intro_prob),
		## migration from 1 to 2 all the time (represents the node between tips 1 and 2)
			time = T_4, source = 0, destination = 1, proportion = 1.0),
		## migration from 2 (inclusive) to 3 all the time (represents the node between tips [1,2] and 3)
			time = T_3, source = 1, destination = 2, proportion = 1.0),
		## migration from 3 (inclusive) to 4 all the time (represents the node between tips [][1,2],3] and 4)
			time = T_2, source = 2, destination = 3, proportion = 1.0)
	dd = msprime.DemographyDebugger(

	dd.print_history() #can comment out when we are happy demography is correct

	output = msprime.simulate(
		population_configurations = population_configurations,
		migration_matrix = migration_matrix,
		demographic_events = demographic_events,
		mutation_rate = mutation_rate,
		length = length,
		recombination_rate = recombination_rate)
	return output

def treePrinter(treeseq):
	for index,tr in enumerate(treeseq.trees()):
		if lengthQ>0:

tree_sequence=introgression_simulation(length, mutation_rate, recombination_rate, sample_per_pop, pop_size, intro_prob, mig_rate)

# for tree in tree_sequence.trees():
# 	print(tree.newick())
# print(tree_sequence.genotype_matrix())

##Looking at mutations
## output = position and tree location of the mutation - [2] = tip 2; [0, 1, 2] = ancestor to 0, 1, & 2
# tree=tree_sequence.first()
# for site in tree.sites():
# 	for mutation in site.mutations:
# 		print("Mutation at position {:.2f} over node {}".format(site.position,[n for n in tree.leaves(mutation.node)]))

## output = length of geneology (in bp) and newick of the geneology
## can be used to simulate sequences in seqgen

This script can be saved as the file msprime_migration.py and run simply using the command python msprime_migration.py (assuming msprime is properly installed).