Tag Archives: bioinformatics

A (still in progress) best practices workflow for reference-guided genome assembly

A couple years ago I led a paper in PLoS ONE outlining a reference-guided genome assembly approach that, more-or-less, simply maps reads from the target species to a high-quality guide genome and exports a consensus, producing a genome sequence for the target species one is interested in. As the paper pointed out, this has been applied between strains of single species, but hasn’t really been tried across larger evolutionary distances. We found that with the 3-5x genome coverage we had, this approach produced a better genome assembly, based on several metrics, than simply performing de novo genome assembly, despite mapping to a guide genome that was quite distantly related in both cases. This was helped along by the fact that we were working with bird genomes, which are the best-case scenario for this type of approach in vertebrates. Obviously, we were making the assumption that enough synteny exists to prevent our method from producing spurious assemblies, and this assumption may or may not hold up once some evidence starts emerging (it currently hasn’t across broader taxonomic scales). Despite this and other shortcomings, I hold out some hope that this type of assembly approach will become more common as more and more high-quality reference genomes from across the tree of life become available. Moreover, I think that even in cases of high sequencing coverage, where a de novo assembly is the norm, this type of approach could complement existing assembly methods to produce a better overall assembly. If you haven’t seen this, you can find it here.

Not many have taken notice of this work (at least based on citations), though I recently had an inquiry from someone who was trying to implement it. I’m also coming full circle and playing around with this approach again with some of the work I am currently doing. Admittedly, at the time I was quite new to graduate school and to handling and analyzing whole-genome sequencing data, and being more seasoned now, I thought I would update my approach to one that makes more sense and uses some different tools. The basic pipeline from the before (i.e., the publication) was to use CLC Genomics Workbench to do the following:

  1. Quality trim raw Illumina reads
  2. Map quality-trimmed reads to a guide genome
  3. Export the consensus sequence for regions where reads mapped based on the level of read coverage

When someone asked for guidance with their own data, I was quick to provide what I think is a more sound (and certainly more “open”) pipeline, which was facilitated by all I’ve learned since publishing the original paper. It goes as follows:

  1. Quality trim raw Illumina reads: Can use anything but I’m currently partial to Trimmomatic.
  2. Map quality-trimmed reads to a guide genome: I like BWA the most for this, but any mapper could work. This step can obviously vary a lot depending on the distance between the target and guide, and one has to balance high-quality mapping and sufficient coverage. I’m hoping to do a follow up post about this.
  3. Call any SNPs or INDELs: The two leaders here are SAMtools and GATK, which produce an output VCF file of any variants. The former is more streamlined and quick to implement while the “Best Practices” approach of the latter is more refined (e.g., INDEL realignment), which might help in this type of application. I’m planning to write another post outlining the GATK approach in the context of reference-guided assembly.
  4. Filter genotype calls by quality: Again, one could use SAMtools or GATK. One can filter based on many things, and the filtering we performed before was based on coverage, but I would now recommend that one uses things like mapping quality and genotype probability/quality scores to perform filtering, since this essentially encompasses the sequencing depth. This produces a refined VCF file with the most confident variant calls.
  5. Export a consensus sequence: SAMtools or GATK will take the filtered VCF and the guide genome and produce a consensus sequence where all positions without a genotype call in the VCF file are outputted as an ‘N’ ambiguity. This preserves some level of scaffolding, which is useful, though it is probably prudent to break longer stretches of Ns, as they aren’t founded on any empirical data derived from the target species, and may incorrectly link contigs.

Overall, I think this approach would do a much better job and uses probabilities for filtering genotype calls rather than simply coverage, a much more arbitrary filtering scheme that scales well with genome sequencing coverage. I’m aiming to have some follow-up posts discussing mapping parameter optimization and the details of mapping, variant calling, and variant filtering. I’m also hoping to implement this more refined approach in some upcoming work that will make its way into a manuscript.


Batch download from NCBI SRA

It has been some time since I’ve posted anything, and I’m trying to start blogging regularly again. I’ve already described how to do a batch submission of data to the NCBI Sequence Read Archive, but today I was trying to do a batch download of a set of SRA sequence data for a project. Turns out, it can be a bit difficult to setup and use the SRA Toolkit, at least in my opinion, but it is certainly easier than uploading data. Fortunately, I found a nice Biostars thread that solved this issue for me, so I figured I’d reiterate it here for my own reference, and that of others.

You can find the thread by visiting https://www.biostars.org/p/111040/.

The command is a long pipe, as follows:

$ esearch -db sra -query <accession> | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | xargs fastq-dump --split-files --bzip2

I’ll break this all down so it is apparent what this command is doing.

  1. First, esearch, part of the NCBI Entrez Direct utilities, queries the database chosen (here, the SRA) with an accession (in the case of a batch, a Bioproject is most appropriate).
  2. The STDOUT from #1 is directed into efetch, which uses this metadata to format a report in the ‘runinfo’ format, which is a comma-separated table of information about the accession.
  3. The STDOUT from #2 is then subsetted, such it splits columns by commas and takes only the first column, which corresponds to the SRR accession numbers.
  4. The STOUT list of SRR accession (#3), plus the header you don’t want, are then sent through grep so that only SRR accession numbers are passed along.
  5. Finally, xargs is used to take the STDOUT from #4 and run fastq-dump, from the NCBI SRA Toolkit, on it. The --split-files argument splits the paired reads into two files, instead of interleaving them, and the --bzip2 flag allows you to compress the output fastq files (you could use –gzip instead).

It is worth pointing out that by correctly setting up your SRA Toolkit configuation, you’ll notice some intermediate files being written to you /path/to/ncbi/sra directory, where the /path/to usually equals $HOME. The ultimate fastq(.gz/bz2) files are written to the directory you called this command from, or to an output directory you can set using fastq-dump (see documentation). This will take some time to actually run on a larger set of files.

Automated batch download from Ensembl

Recently, I began work on a project that required downloading the genomes of currently sequenced vertebrates. Most of these genomes are available through Ensembl, an online repository of genomes and a website where you can browse genome annotations. In the past, I’ve only had to download a genome or two and always used an ftp connection through the terminal. Here are the commands you can use to do this:
$ cd /path/to/store/data
$ ftp
$ open ftp.ensembl.org
# Username: anonymous
# Password: <email>
# can then move around hierarchy using cd and ls commands
# to download your target file, once you navigate to it, type:
$ mget <file>
# answer y(es) when the server prompts.

If instead you want to download many files at once, you need to take a different approach using a program called rsync (which should be installed on Linux/OSX). Here are the commands you can adapt for your purpose:
$ cd /path/to/store/data
$ rsync -av rsync://ftp.ensembl.org/ensembl/pub/path/to/folder/targetfile.ext ./
# downloads targetfile.ext to your current directory (which is the \\
# path where you want to store data)
$ rsync -av rsync://ftp.ensembl.org/ensembl/pub/path/to/*/*.ext ./
# downloads all files with a .ext extension in all directories \\
# within the path/to/ directory

The command I used to download the most current releases of all complete Ensembl genomes was:
$ rsync -av rsync://ftp.ensembl.org/ensembl/pub/current_fasta/*/dna/*.dna.toplevel.fa.gz ./
You can modify this command to download all soft-masked assemblies, all cDNA assemblies, and many other file types.

One should also note the root directory structure for the Ensembl Genomes database, which includes Metazoans, Protozoans, etc. For Metazoans, it is ‘ftp.ensemblgenomes.org/all/pub/metazoa/current/’ and this can be modified accordingly for the target genome files the user is seeking.

In all cases, these batch rsync downloads are much better than performing the above ftp steps 30+ times and will work in the background. Download rates will vary based on connection and data type, but I would expect to wait between 5 and 10 minutes per assembly file. Hopefully this helps save you some time as well.

Rosalind: A great resource for learning bioinformatics programming

Most well-versed bioinformaticians have likely heard, or perhaps used, Rosalind, but I decided to write about it for any visitors who haven’t. Rosalind teaches users bioinformatics programming using exercises. Basically, the student is given a bit of background that ties the programming to the real-world application, is given a clear set of instructions, and is able to download data to use to test his or her program. The student then uploads their answer and the website grades it as correct or incorrect. Each dataset generated is unique, so you can’t just copy a friends answer. More importantly, with the way the exercises are set up, any programming language can be used to solve the problems, which provides a huge amount of flexibility. Last, and certainly not least, Rosalind is free to use!

Overall, for anyone who wants to learn how to program for bioinformatics with real-life challenges, I encourage you to give Rosalind a try. Likewise, if you are teaching students or forming a graduate student learning group, this is a great resource to use. I am currently using it for a class and am getting a pretty good handle on Python. I plan to attempt most or all of the challenges (beyond just those assigned in class) and may even go through a subset with a new language (such as Perl) to build my skills set. Seeing as most of the postdoc and faculty position listings I see in evolutionary biology and genomics explicitly demand the ability to use programming languages, it makes perfect sense to start learning!