Tag Archives: fastq

Batch download from NCBI SRA

It has been some time since I’ve posted anything, and I’m trying to start blogging regularly again. I’ve already described how to do a batch submission of data to the NCBI Sequence Read Archive, but today I was trying to do a batch download of a set of SRA sequence data for a project. It turns out that the SRA Toolkit can be a bit difficult to set up and use, at least in my opinion, though it is certainly easier than uploading data. Fortunately, I found a nice Biostars thread that solved this issue for me, so I figured I’d reiterate it here for my own reference, and that of others.

You can find the thread by visiting https://www.biostars.org/p/111040/.

The command is a long pipe, as follows:

$ esearch -db sra -query <accession> | efetch -format runinfo | cut -d ',' -f 1 | grep SRR | xargs fastq-dump --split-files --bzip2

I’ll break this all down so it is apparent what this command is doing.

  1. First, esearch, part of the NCBI Entrez Direct utilities, queries the database chosen (here, the SRA) with an accession (in the case of a batch, a Bioproject is most appropriate).
  2. The STDOUT from #1 is directed into efetch, which uses this metadata to format a report in the ‘runinfo’ format, which is a comma-separated table of information about the accession.
  3. The STDOUT from #2 is then subsetted with cut, which splits columns on commas and keeps only the first column, corresponding to the SRR accession numbers.
  4. The STDOUT from #3, a list of SRR accessions plus the header line you don’t want, is then sent through grep so that only SRR accession numbers are passed along.
  5. Finally, xargs is used to take the STDOUT from #4 and run fastq-dump, from the NCBI SRA Toolkit, on it. The --split-files argument splits the paired reads into two files, instead of interleaving them, and the --bzip2 flag allows you to compress the output fastq files (you could use --gzip instead).
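To see what steps 3 and 4 do without touching the network, you can feed a mock runinfo table through the same cut and grep stages (the header fields and SRR numbers below are invented for illustration; real runinfo output has many more columns):

```shell
# Simulate the runinfo CSV that efetch would emit: a header line plus one
# row per sequencing run, with the Run accession in column 1.
printf 'Run,ReleaseDate,spots\nSRR000001,2020-01-01,100\nSRR000002,2020-01-02,200\n' \
  | cut -d ',' -f 1 \
  | grep SRR
```

The header value “Run” is dropped by grep, leaving only the accessions that xargs would hand to fastq-dump, one per line.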

It is worth pointing out that if you have correctly set up your SRA Toolkit configuration, you’ll notice some intermediate files being written to your /path/to/ncbi/sra directory, where /path/to usually equals $HOME. The final fastq(.gz/.bz2) files are written to the directory you called this command from, or to an output directory you can set with fastq-dump (see its documentation). This will take some time to actually run on a larger set of files.


Trimming lines from the end of a sequence file

Occasionally, I have come across sequencing read files (fastq) that are a bit screwy near the end and have what appears to be truncated text. This results in issues with programs used to analyze the data. I’m not sure if this was a problem with data transfer or what, but it is an easy fix just to trim off these troublesome reads, and then the file is usable again (4 lines per record for fastq, 2 per record for fasta). This isn’t trivial with giant sequencing files, which are impossible to just open in a text editor. Below I am giving a simple Unix command that will allow you to trim a set number of lines off the end of a large text file. This is mostly so I can reference it easily later, but maybe it will help others as well.

Command to trim x number of lines from the end of a file:
$ head -n -<#lines> <inputfile> > <outputfile>
On a related note, you can also trim the first x number of lines from a file with tail. Note that tail -n +K starts printing *at* line K, so to drop the first x lines you pass x+1:
$ tail -n +<#lines + 1> <inputfile> > <outputfile>
Adapted from stackoverflow.com/questions/10460919/how-to-delete-first-two-lines-and-last-four-lines-from-a-test-file-with-bash
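A quick way to convince yourself of the line arithmetic is to run both commands on a small numbered file (generated here with seq; the file names are just for this demo):

```shell
# Build a 10-line test file containing the lines "1" through "10".
seq 10 > numbered.txt

# Trim the last 4 lines (e.g. one truncated fastq record): keeps lines 1-6.
# Note: the negative count works with GNU head, not the BSD head on macOS.
head -n -4 numbered.txt > trimmed_end.txt

# Trim the first 4 lines: note the +5, since `tail -n +K` starts AT line K.
tail -n +5 numbered.txt > trimmed_start.txt
```

After running this, trimmed_end.txt holds lines 1–6 and trimmed_start.txt holds lines 5–10.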

Edit: Just discovered that the above head command does not work on Apple OSX. The workaround for Mac users is as follows:
$ tail -r <inputfile> | tail -n +<#lines + 1> | tail -r > <outputfile>
#command essentially reverses the file, then trims the first x lines (hence the +1, since tail -n +K starts at line K), then reverses it back to the original order, which equates to trimming the last x lines.