Introductory RAD-seq Activity
Introduction
Today, let’s go through a brief experimental design activity to help demonstrate some of the features of RAD-seq and how they can be harnessed to explore numerous questions in ecology, evolutionary biology, and genetics. To perform this activity, we will use Harvard’s Odyssey compute cluster.
Background & Resources
To introduce the topic of RAD-seq, we are relying on the excellent review article by Andrews et al. 2016. Nature Reviews Genetics. We are also using Catchen et al. 2013. Molecular Ecology as an empirical example of RAD-seq for discussion. Slides of key figures from these two resources are available here.
Setting up Our Analysis Environment
Given we cannot run jobs on the login nodes, we are going to start an interactive job that will allow us to go through this lesson. You can execute the following command to do this. It may take a few minutes to clear the queue, but eventually your prompt should return.
srun -p test --pty -n 2 --mem 4000 -t 0-06:00 --account=oeb275r /bin/bash
We are going to use the very useful environments feature of Anaconda Python to manage the software dependencies we have for this activity. Anaconda is already installed on Odyssey and can be loaded as follows:
module load Anaconda3/5.0.1-fasrc02
Now we can create an analysis environment that will always be available for us to use for this activity, which will run using Python version 3.7.
conda create -n radseq python=3.7
This will take a few minutes to run and along the way, we will answer Yes (y) to allow installation. When it is done, we can activate our newly created environment.
source activate radseq
# on some systems:
# conda activate radseq
We must also install some other important software in our environment, namely Biopython. For whatever reason, installing biopython
through conda
is not allowing later software to run properly for me, so I have commands for installing via conda
(preferred) and using pip
instead, which is Python’s default package manager.
# preferred
conda install -c conda-forge biopython
# only if conda does not work:
# pip install --user biopython
We must also install bedtools
, which is required for some of our software. This is available on Odyssey already, but let’s install it into our radseq
environment to demonstrate how to install using conda
.
conda install -c bioconda bedtools
Now let’s create a directory for our activity called radseq_activity
and move into it so we can do our work there.
mkdir radseq_activity
cd radseq_activity
We are also going to be using some software that I wrote for performing in silico RAD-seq called RADis. I have, unfortunately, not packaged this into easily-installable software, but we can still make use of it fairly easily. Let’s download the GitHub repository for this purpose.
git clone https://github.com/darencard/RADis
One more thing we will need to do is add the software in this repository to our software $PATH
, which is where the computer looks for software to run by default. Note that this is a temporary addition that will not persist beyond this session or if this directory is moved. There are ways to permanently add these software to your $PATH
that I can show if anyone is interested.
cd RADis
export PATH=$PATH:`pwd`
cd ..
In Silico RAD-seq Activity
In order to perform our in silico RAD-seq, we need a sequence for our digital restriction enzyme digestion. For time and ease, we will use chromosome 22 from the human genome, which is about 50 Mb in length. We can download this from the UCSC Genome Browser database.
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz
gunzip chr22.fa.gz
We need to create a restriction enzymes file to run the in silico restriction digest. This software performs a double digest, so two enzymes need to be used. We are going to work in 4 groups, each with a different combination of enzymes as follows.
# group 1
sbfI CCTGCAGG 6
sau3AI GATC 0
# group 2
pstI CTGCAG 1
sau3AI GATC 0
# group 3
sbfI CCTGCAGG 6
mspI CCGG 1
# group 4
pstI CTGCAG 1
mspI CCGG 1
Each group should take their restriction enzymes and paste them into a new text file called enzymes.txt
. This can be done using the terminal-based text editor nano
. Note that there are TAB
characters between each of the columns.
Once we have our reference genome (human chromosome 22) and our enzymes, we can run an in silico digestion with each enzyme using the following command. Sit back and be patient, as I’m not a professional software engineer and my program is slower than I would like (should take about 10 minutes at most).
restriction_digest_insilico.py --input chr22.fa --output chr22 --enzymes enzymes.txt
Once the command ends, you will see a couple new files in the working directory when we run ls
.
ls
# group 1 output
# chr22_sbfI_CCTGCAGG_8.bed
# chr22_sau3AI_GATC_4.bed
Now that we have the individual restriction digest files for single enzymes, we can run the double digest. As part of this command, we need to also set the lower and upper limit on the fragment lengths we will select for. Given there are 4-5 people in a group, let’s have each person select a different range of fragment sizes.
# person 1
200 - 400
# person 2
300 - 500
# person 3
400 - 600
# person 4
500 - 700
# person 5
600 - 800
Now we can run our double digest as follows. Be sure to use the output from the longer enzyme as the -r
input. This will run much faster than the other command.
# example for group 1, person 1
extract_fragments.sh -g chr22.fa -r chr22_sbfI_CCTGCAGG_8.bed -c chr22_sau3AI_GATC_4.bed -l 200 -u 400 -o chr22_sbfI_sau3AI_200_400_output
You will see two outputs of this command.
ls
# example for group 1, person 1
# chr22_sbfI_sau3AI_200-400_output.bed.gz
# chr22_sbfI_sau3AI_200-400_output.fasta.gz
Both are gzipped, but we can easily unzip them.
gunzip *.gz
And look at their contents. One is a BED file.
head chr22_sbfI_sau3AI_200_400_output.bed
# chr22 10520281 10520562 sbfI-sau3AI 281 +
# chr22 10527904 10528214 sbfI-sau3AI 310 -
# chr22 10528218 10528432 sbfI-sau3AI 214 +
# chr22 11054755 11054977 sbfI-sau3AI 222 -
# chr22 11265186 11265460 sbfI-sau3AI 274 +
# chr22 11438712 11439070 sbfI-sau3AI 358 +
# chr22 11553161 11553482 sbfI-sau3AI 321 +
# chr22 11716993 11717223 sbfI-sau3AI 230 +
# chr22 11953435 11953797 sbfI-sau3AI 362 -
# chr22 11956673 11957032 sbfI-sau3AI 359 -
The other is a FASTA file.
head chr22_sbfI_sau3AI_200_400_output.fasta
# >chr22:10520281-10520562(+)
# ggtggggcttcactggggacccatcaccttccacccaggagcctgtctgcctcccacaaccatccatggcacccaggctgctggcatcaaggggcacttgcaggccagtgccaagccaccctcgtaccccctcatcttcccctcccatgctcctgctcctcagtgtccaaagtccagaaggggctgaggtggcaggggactgacatgtcagcactgcttccaatgtgtgcacacctggctgggcagtgacagcaccctgctgggtcccaaccccactctga
# >chr22:10527904-10528214(-)
# GGAAACCCCTTGAGGTCAAGACCCCACAATCAGACAAGGATGGAGTGGCTCACCCTCAGTAAACAGGCCAGACTCAAGGTGGTATAATGTCTTAACCAAGAGTGTGGGCCTCCAGGTCTGACTCCCATCTCAGTTCTCCTTTAATAACCACACTTTGATAATTCTCCTTAACAGGGGTTCCTggcaagtcagttcttcctcaggccttcggtttcctcacctacaagatgagagggctggaccagatgGAAATTCAGGGTGGAAGGGGATGTCCGCGCACAGCCCACCACCCCCCCCCAACAGGACCCTG
# >chr22:10528218-10528432(+)
# GGAAGAAGCAAATGGGTTTGCTTTCCTAGCTCTGTCCAGTATCTTAGGGACCCTGAGGACTGAAGAGATTCTTGTAGAGCCATCTGGTGTATGTCATGGGTGGGCCTTTTTTGAATGTCAGTCTGCCCAGTGAGCTGGCTCAGCCTGAATGAACTGTCTTGAATCTTTGGAGTTGTCTGTGTACTTTTAAGGGTTTCTCATCCTTGCACCAAAA
# >chr22:11054755-11054977(-)
# GGAGACCTTCTCCTGCCCGGACCAGAGAGGCTGCAGGGGTGCTGTGGGGAGAGGGCCAGATGGCCACAACACAGCTGGGAATGGCTCCAGAGGGGAGGGTGGAGAGTGCGCTGGCCCATGGGAGGACTCAGAGAGGAGTTGCGTGTGTTTGGCGCTTGTGGACATGGTGTGTGGCAGTGTGAGCCAACAGGCCAACGGGCAGCAAGCATAGGAGGCTTTATG
# >chr22:11265186-11265460(+)
# GGGCGGAGGTTCCCTAGGCAACGAGGGAGAGAGGGAGGGGTCTCCAGAAGGGAGAGACAGAAACCGGTTGCCCCAGGTTCGGTGAAGTCGGCCAGACCTCTCCCCGTGTCACCTCGACTTTCAATAACAGTGGCTGCTAGGTGATGCCCAAAGACAACCGATGCCTGCAAGTGTCAGTCAGCAGGGAAAAGAATGCatttatttatttatttatttatttatttagagacagagtctcactcaatctacagcccaggctgtagtgcagtggcgc
Now the critical piece of information we need for experimental design considerations is the number of fragments, as marker number/density enables different types of questions to be addressed. We can easily determine this by counting the number of lines in our BED file.
wc -l chr22_sbfI_sau3AI_200_400_output.bed
# 952
Or the number of sequences in our FASTA file.
grep -c ">" chr22_sbfI_sau3AI_200_400_output.fasta
# 952
Activity Questions
Within your group, work together to determine how fragment number varies across different fragment size selections. What pattern do you see and why might this be important?
As a class, what do you notice about the number of fragments sampled based on the restriction enzymes and the fragment size windows used? What aspect of the restriction enzyme appears to impact fragment number?
Based on your research interests, think of an interesting question (or questions) you might want to ask and how genome-wide data could be useful. Assuming you are restricted to using RAD-seq only, what experimental design considerations must you consider to answer your research question?