Due to my ongoing work with the snake Boa constrictor, I’ve read and reread Bradnam et al.’s Assemblathon 2 paper and have been using their best-rated snake assembly (7C – SGA team) successfully for a lot of my work. I’ve been gearing up to build on this work, providing the community with what I hope will be an even better genomic resource for this species.
As part of this process, there were a couple of loose ends that I needed some clarification on. Keith Bradnam promptly replied to my email for assistance, though he suggested I contact individuals with more knowledge of the matters in question. Below, I describe the two clarifications I needed and the ultimate answers I received (and verified), so that others may leverage this information for any related work they are doing.
A couple of years ago I led a paper in PLoS ONE outlining a reference-guided genome assembly approach that, more or less, simply maps reads from the target species to a high-quality guide genome and exports a consensus, producing a genome sequence for the target species of interest. As the paper pointed out, this has been applied between strains of a single species, but hasn’t really been tried across larger evolutionary distances. We found that with the 3-5x genome coverage we had, this approach produced a better genome assembly, based on several metrics, than simply performing de novo genome assembly, despite mapping to a guide genome that was quite distantly related in both cases. This was helped along by the fact that we were working with bird genomes, which are the best-case scenario for this type of approach in vertebrates. Obviously, we were assuming that enough synteny exists to prevent our method from producing spurious assemblies, and this assumption may or may not hold up as evidence emerges (so far there is none across broader taxonomic scales). Despite this and other shortcomings, I hold out some hope that this type of assembly approach will become more common as more and more high-quality reference genomes from across the tree of life become available. Moreover, I think that even in cases of high sequencing coverage, where de novo assembly is the norm, this type of approach could complement existing assembly methods to produce a better overall assembly. If you haven’t seen this, you can find it here.
Not many have taken notice of this work (at least based on citations), though I recently had an inquiry from someone who was trying to implement it. I’m also coming full circle and playing around with this approach again in some of the work I am currently doing. Admittedly, at the time I was quite new to graduate school and to handling and analyzing whole-genome sequencing data; being more seasoned now, I thought I would update my approach to one that makes more sense and uses some different tools. The basic pipeline from before (i.e., from the publication) was to use CLC Genomics Workbench to do the following:
- Quality trim raw Illumina reads
- Map quality-trimmed reads to a guide genome
- Export the consensus sequence for regions where reads mapped based on the level of read coverage
When someone asked for guidance with their own data, I was quick to provide what I think is a more sound (and certainly more “open”) pipeline, which was facilitated by all I’ve learned since publishing the original paper. It goes as follows:
- Quality trim raw Illumina reads: You can use just about anything, but I’m currently partial to Trimmomatic.
- Map quality-trimmed reads to a guide genome: I like BWA the most for this, but any mapper could work. This step can obviously vary a lot depending on the distance between the target and guide, and one has to balance high-quality mapping and sufficient coverage. I’m hoping to do a follow-up post about this.
- Call any SNPs or INDELs: The two leaders here are SAMtools and GATK, which produce an output VCF file of any variants. The former is more streamlined and quick to implement while the “Best Practices” approach of the latter is more refined (e.g., INDEL realignment), which might help in this type of application. I’m planning to write another post outlining the GATK approach in the context of reference-guided assembly.
- Filter genotype calls by quality: Again, one could use SAMtools or GATK. One can filter based on many things, and the filtering we performed before was based on coverage, but I would now recommend that one uses things like mapping quality and genotype probability/quality scores to perform filtering, since this essentially encompasses the sequencing depth. This produces a refined VCF file with the most confident variant calls.
- Export a consensus sequence: SAMtools or GATK will take the filtered VCF and the guide genome and produce a consensus sequence where all positions without a genotype call in the VCF file are output as ‘N’ ambiguities. This preserves some level of scaffolding, which is useful, though it is probably prudent to break longer stretches of Ns, as they aren’t founded on any empirical data derived from the target species and may incorrectly link contigs.
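Concretely, the five steps might look something like the sketch below. All file names, thresholds, and Trimmomatic settings are illustrative assumptions rather than recommendations, and I show the SAMtools/BCFtools route for calling, filtering, and consensus (GATK would be the alternative at those steps):

```shell
#!/bin/bash
# Sketch of the reference-guided pipeline; file names, thresholds, and
# trimming settings are placeholders, not tested recommendations.

# 1. Quality trim raw Illumina reads (Trimmomatic, paired-end mode)
java -jar trimmomatic.jar PE raw_R1.fastq.gz raw_R2.fastq.gz \
    trim_R1.fastq.gz unpaired_R1.fastq.gz \
    trim_R2.fastq.gz unpaired_R2.fastq.gz \
    SLIDINGWINDOW:4:20 MINLEN:36

# 2. Map quality-trimmed reads to the guide genome with BWA, then sort/index
bwa index guide.fa
bwa mem guide.fa trim_R1.fastq.gz trim_R2.fastq.gz \
    | samtools sort -o mapped.sorted.bam -
samtools index mapped.sorted.bam

# 3. Call SNPs and INDELs, producing a compressed VCF
bcftools mpileup -f guide.fa mapped.sorted.bam \
    | bcftools call -mv -Oz -o variants.vcf.gz

# 4. Filter on site quality and mapping quality rather than raw coverage
bcftools filter -i 'QUAL>=30 && MQ>=30' -Oz -o filtered.vcf.gz variants.vcf.gz
bcftools index filtered.vcf.gz

# 5. Export a consensus from the guide plus the filtered variants.
#    Masking positions without a confident call as 'N' requires a BED of
#    those sites passed via -m; building that BED is omitted here.
bcftools consensus -f guide.fa -m uncalled_sites.bed filtered.vcf.gz > consensus.fa
```

This is only a skeleton; in practice the mapping parameters in step 2 need tuning to the divergence between target and guide, as discussed above.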
Overall, I think this approach does a much better job: it filters genotype calls using probabilities rather than simply coverage, which is a more arbitrary criterion that does not hold up well as genome sequencing coverage varies. I’m aiming to have some follow-up posts discussing mapping parameter optimization and the details of mapping, variant calling, and variant filtering. I’m also hoping to implement this more refined approach in some upcoming work that will make its way into a manuscript.
It has been some time since I’ve posted anything, and I’m trying to start blogging regularly again. I’ve already described how to do a batch submission of data to the NCBI Sequence Read Archive, but today I was trying to do a batch download of a set of SRA sequence data for a project. Turns out, it can be a bit difficult to set up and use the SRA Toolkit, at least in my opinion, but it is certainly easier than uploading data. Fortunately, I found a nice Biostars thread that solved this issue for me, so I figured I’d reiterate it here for my own reference, and that of others.
You can find the thread by visiting https://www.biostars.org/p/111040/.
The command is a long pipe, as follows:
$ esearch -db sra -query <accession> | efetch -format runinfo | cut -d ',' -f 1 | grep SRR | xargs fastq-dump --split-files --bzip2
I’ll break this all down so it is apparent what this command is doing.
1. esearch, part of the NCBI Entrez Direct utilities, queries the chosen database (here, the SRA) with an accession (in the case of a batch, a BioProject accession is most appropriate).
2. The STDOUT from #1 is piped into efetch, which uses this metadata to format a report in the ‘runinfo’ format, a comma-separated table of information about the accession.
3. The STDOUT from #2 is then subset by cut, which splits the columns on commas and keeps only the first, corresponding to the SRR accession numbers.
4. The STDOUT from #3, a list of SRR accessions plus a header line you don’t want, is passed through grep so that only SRR accession numbers move along.
5. Finally, xargs takes the STDOUT from #4 and runs fastq-dump, from the NCBI SRA Toolkit, on each accession. The --split-files argument splits the paired reads into two files, instead of interleaving them, and the --bzip2 flag compresses the output fastq files (you could use --gzip instead).
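To see what the middle of the pipe is doing, here is a toy runinfo table (the SRR accessions are made up) pushed through the cut and grep steps:

```shell
# A fake two-run 'runinfo' table; note the header line that grep removes
printf 'Run,ReleaseDate,spots\nSRR0000001,2015-01-01,1000\nSRR0000002,2015-01-02,2000\n' \
    | cut -d ',' -f 1 \
    | grep SRR
# Prints:
# SRR0000001
# SRR0000002
```

With a real BioProject, xargs then hands each of those accessions to fastq-dump.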
It is worth pointing out that if you have correctly set up your SRA Toolkit configuration, you’ll notice some intermediate files being written to your /path/to/ncbi/sra directory, where /path/to usually equals $HOME. The final fastq(.gz/.bz2) files are written to the directory you called this command from, or to an output directory you can set using fastq-dump (see documentation). This will take some time to actually run on a larger set of files.
Recently, I began work on a project that required downloading the genomes of currently sequenced vertebrates. Most of these genomes are available through Ensembl, an online repository of genomes and a website where you can browse genome annotations. In the past, I’ve only had to download a genome or two and always used an ftp connection through the terminal. Here are the commands you can use to do this:
$ cd /path/to/store/data
$ ftp ftp.ensembl.org
# Username: anonymous
# Password: <email>
# can then move around hierarchy using cd and ls commands
# to download your target file, once you navigate to it, type:
$ mget <file>
# answer y(es) when the server prompts.
If instead you want to download many files at once, you need to take a different approach using a program called rsync (which should be installed on Linux/OSX). Here are the commands you can adapt for your purpose:
$ cd /path/to/store/data
$ rsync -av rsync://ftp.ensembl.org/ensembl/pub/path/to/folder/targetfile.ext ./
# downloads targetfile.ext to your current directory (which is the
# path where you want to store data)
$ rsync -av rsync://ftp.ensembl.org/ensembl/pub/path/to/*/*.ext ./
# downloads all files with a .ext extension in all directories
# within the path/to/ directory
The command I used to download the most current releases of all complete Ensembl genomes was:
$ rsync -av rsync://ftp.ensembl.org/ensembl/pub/current_fasta/*/dna/*.dna.toplevel.fa.gz ./
You can modify this command to download all soft-masked assemblies, all cDNA assemblies, and many other file types.
One should also note the root directory structure for the Ensembl Genomes database, which includes Metazoans, Protozoans, etc. For Metazoans, it is ‘ftp.ensemblgenomes.org/all/pub/metazoa/current/’ and this can be modified accordingly for the target genome files the user is seeking.
In all cases, these batch rsync downloads are much better than performing the above ftp steps 30+ times and will work in the background. Download rates will vary based on connection and data type, but I would expect to wait between 5 and 10 minutes per assembly file. Hopefully this helps save you some time as well.
Occasionally, I have come across sequencing read files (fastq) that are a bit screwy near the end and have what appears to be truncated text. This causes issues with programs used to analyze the data. I’m not sure if this was a problem with data transfer or what, but it is an easy fix to just trim off these troublesome reads, and then the file is usable again (each record is 4 lines for fastq, 2 lines for fasta). This isn’t trivial with giant sequencing files, which are impossible to just open in a text editor. Below I am giving a simple Unix command that will allow you to trim a set number of lines off the end of a large text file. This is mostly so I can reference it easily later, but maybe it will help others as well.
Command to trim x number of lines from the end of a file:
$ head -n -<#lines> <inputfile> > <outputfile>
On a related note, you can also trim the first x lines from a file. Note that tail -n +K outputs starting at line K, so to skip the first x lines you pass x+1:
$ tail -n +<#lines+1> <inputfile> > <outputfile>
Adapted from stackoverflow.com/questions/10460919/how-to-delete-first-two-lines-and-last-four-lines-from-a-test-file-with-bash
Edit: Just discovered that the above head command does not work on Apple OSX. The workaround for Mac users is as follows:
$ tail -r <inputfile> | tail -n +<#lines+1> | tail -r > <outputfile>
# command essentially reverses the file, then trims the first x lines (hence the +1), then reverses it back to the original order, which equates to trimming the last x lines.
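To sanity check both commands, here is a quick run on a small made-up file (file names are arbitrary; the negative head count requires GNU head, so Mac users need the workaround):

```shell
# Build a 10-line test file
printf 'line%d\n' 1 2 3 4 5 6 7 8 9 10 > example.txt

# Trim the last 2 lines
head -n -2 example.txt > trimmed.txt
wc -l < trimmed.txt                   # prints 8

# Trim the first 2 lines: tail -n +K starts output AT line K
tail -n +3 example.txt | head -n 1    # prints line3
```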
I’m excited to report the submission of my first first-author paper. After a few months of work, we submitted the manuscript to PLoS ONE today. The paper describes two new bird genomes – the Gunnison Sage-Grouse and the Clark’s Nutcracker – and empirically compares reference-guided and de novo assembly techniques. I won’t spill any more of the beans on the paper until it is formally accepted, but I wanted to brag a little bit about the accomplishment. Stay tuned!
Recently, I’ve submitted several sets of sequencing reads to NCBI’s Sequence Read Archive (SRA) in anticipation of papers we are publishing. I previously found that NCBI does a pretty bad job of explaining how to actually upload the files for submission and wrote this guide, which I subsequently edited to be more thorough and complete. The overall description of actually creating a submission isn’t great, as currently you not only create an SRA, but also a BioProject and a BioSample. I thought I would explain the basics of what I did for creating my submission online and then describe the process of actually uploading the files to the SRA so they can be associated with your submission (it is not a simple upload, like you may be accustomed to). I should note that the instructions I have here are based on Mac OS X, though they should be very similar (or identical) on Linux. Windows will likely be pretty similar too (except for determining the MD5 checksum), but you may have to use Google for a little help. Here we go:
- Log in to or sign up for an NCBI account.
- Go to the BioProject page and click ‘New Submission’.
- You will then be asked to fill in various fields over a range of pages, some required and some not required. The BioProject Help booklet explains what each of these fields means and I’ll leave you to fill them in for your submission. At this point you can leave the BioSample field clear, since you haven’t created a BioSample yet (we will do this next).
- If you are submitting a small number of BioSamples, you can do them individually through the web browser. First go to the BioSample page and click ‘New Submission’.
- On the ‘General info’ tab you will specify that you are submitting a single sample (you will repeat the necessary steps for the rest of your submissions). Otherwise, you will fill in various fields and the SRA Handbook will be your guide to their meaning.
- If you are submitting a large number of BioSamples, you should use the batch submission option to make your life a thousand times easier. Next to the ‘New Submission’ button there is a link called “Download batch template”. You will be prompted to specify the type of sample you are uploading, and upon doing so a template will be downloaded. This will be tab-delimited text, which can be easily edited and saved using Excel (but be sure to keep it as tab-delimited text). Fill in the required fields and whatever optional fields apply. Now you can click ‘New Submission’ and fill everything in as above (see the ‘General info’ step), but you will instead specify that you are submitting a batch of multiple BioSamples. NCBI will ask you to upload your completed template, will check that everything is filled in correctly, and will proceed in a logical manner.
- You should pause here and wait for NCBI to actually accept these submissions without error. It will be obvious when that occurs (you may have to refresh a few times). Note the accession numbers that are assigned to each. It only took a couple of minutes for each of the submissions I made today.
- Your actual SRA submission will also differ depending on whether you are submitting one or a few runs versus a batch of runs. To do the submission for one or a few samples, go to the SRA page, click ‘Submit to SRA’ and then ‘Create new submission’. Here, you just give the sample alias and perhaps a description.
- Next, you click on the ‘New Experiment’ button to proceed and enter data about the experiment. Again, I will leave it to you to fill in the relevant fields using the SRA Handbook. Most importantly, here is where you will associate the SRA entry with both the BioProject and BioSample using the accession numbers created for those entries.
- Once the experiment is complete you can then fill in the run details by clicking ‘New Run’. Fill in an appropriate alias and then you can finally designate the reads that you plan to upload.
- For each read file (paired-end data comes in two files), you should fill in a sample file type and sample name (including extension; e.g., fastq). NCBI will accept gzipped or bzipped files so make sure that designation is included where applicable. The sample name should exactly match the file you will upload.
- For each file, you also need to input an MD5 checksum that matches the file. To obtain this in Mac OS X you simply open the terminal and type:
$ md5 /path/to/file/filename
- For each file, take the outputted MD5 checksum and copy it exactly into the field on NCBI. Once you have done this for all read files, you can click save.
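A note for Linux users: there is typically no md5 command, but md5sum prints the same digest (shown here on an empty scratch file, whose MD5 is a well-known constant):

```shell
# Create an empty file and hash it; md5sum's first field is the digest
: > demo.bin
md5sum demo.bin | cut -d ' ' -f 1
# prints d41d8cd98f00b204e9800998ecf8427e
```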
- Alternatively, if you are doing a batch submission you will need to properly fill out an Excel template workbook. I’ve provided a version I received from NCBI here, as it doesn’t appear to be posted anywhere on NCBI. There is some basic information about the submission on the first sheet of the workbook, along with some basic instructions. The second sheet contains most of the fields that need to be completed while the third sheet has definitions for everything. Be sure that you enter the BioProject accession and individual BioSample accessions for each run (note that some BioSamples may have more than one run). NCBI will need these to properly link the SRA accessions it creates with the BioSample/BioProject accessions you already have. The other fields should be self-explanatory and see the steps above for determining MD5 checksums. Save a completed version of the template (as an Excel document) and email it to the SRA folks at NCBI using the email address firstname.lastname@example.org. You can briefly describe your data in the body of the message, but otherwise they shouldn’t need anything more than the attached template. You should be notified that they received your email soon after sending it and you can expect your request to be processed within a couple of days. You will not be able to proceed without confirmation that the accessions were created by an SRA curator.
- Once you have entered all of the run details individually on your browser or have sent the completed SRA template to NCBI and received confirmation that it was accepted, you are ready to proceed and actually upload your read data.
- If you did a small submission you may have noticed directions for using the SRA ftp on the run details page. If you submitted a batch template you can view these instructions by clicking the sample name link next to a run on the SRA submission summary page. Note the ftp domain, the username, and the password, as you will need these.
- You can now use the ftp program in the terminal to complete the process. First, I would make sure all read files for a given SRA submission are in the same directory, and then I would change directories (cd) into that directory.
- You can then initiate the upload using the following commands:
$ ftp -i
ftp> open <hostname>  # likely <host>.ncbi.nih.gov; some text should appear
                      # and you will be prompted for a username
> <username>          # probably 'sra'
ftp> mput *           # uploads each read file in the directory one-by-one
                      # until they are complete; be sure to modify the
                      # wildcard if you have more than just read files in
                      # your working directory
ftp> bye              # closes ftp
- Once the files are uploaded, NCBI will associate them with your SRA submission and the status should eventually indicate that the process is complete. NCBI states that this could take some time, so I would give it a day before emailing NCBI if the status does not show as complete.
That should take care of everything. One thing to note is that you can control the release date, in case you don’t want your data made public quite yet. Much of this information can probably be changed manually later, though I don’t have experience on how to do that. Hopefully this is useful to people, as the directions aren’t explicit on NCBI and you are left to fill in some of the blanks. There are other guides out there that I also came across, but I figured it might serve everyone well to make a current one, since things seem to have changed recently. And now I have a cheat sheet for when I need to do this again!