Trimming lines from the end of a sequence file

Occasionally I come across sequencing read files (fastq) that are corrupted near the end and contain what appears to be truncated text, which causes problems for the programs used to analyze the data. I'm not sure whether this comes from data transfer or something else, but the fix is easy: trim off the troublesome reads (4 lines per read for fastq, 2 lines for fasta) and the file is usable again. Doing this by hand isn't practical with giant sequencing files, which are impossible to open in a text editor. Below is a simple Unix command that trims a set number of lines from the end of a large text file. This is mostly so I can reference it easily later, but maybe it will help others as well.
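
Before trimming, it can help to check how the file actually ends and how many stray lines there are. The following is only a rough sketch: it assumes an uncompressed fastq with exactly four lines per read, and reads.fastq with its contents is a made-up demo file created so the snippet runs on its own.

```shell
# Build a tiny demo fastq (hypothetical data): two complete reads plus a truncated third.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nAC\n' > reads.fastq

# Look at the last few lines without opening the whole file.
tail -n 6 reads.fastq

# Count lines. wc -l counts newline characters, so a final line that
# lacks a trailing newline is not included in the count.
total=$(( $(wc -l < reads.fastq) ))
leftover=$(( total % 4 ))
echo "total lines: $total, lines past the last complete read: $leftover"
```

A nonzero remainder suggests a partial read at the end. A remainder of zero does not prove the last read is intact, so looking at the tail output is still worthwhile.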

Command to trim x number of lines from the end of a file:
$ head -n -<#lines> <inputfile> > <outputfile>
On a related note, you can also trim the first x lines from a file using tail. Note that tail -n +K starts its output at line K, so to drop x lines the number to pass is x + 1:
$ tail -n +<#lines+1> <inputfile> > <outputfile>
Adapted from stackoverflow.com/questions/10460919/how-to-delete-first-two-lines-and-last-four-lines-from-a-test-file-with-bash
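
To see what the two commands do, here is a small worked sketch on a throwaway file. It assumes GNU head (as found on most Linux systems); demo.txt, no_tail.txt and no_head.txt are hypothetical names used only for illustration.

```shell
# Hypothetical demo file: lines numbered 1 to 12, standing in for three 4-line fastq reads.
seq 1 12 > demo.txt

# Drop the last 4 lines (relies on GNU head accepting a negative count).
head -n -4 demo.txt > no_tail.txt

# Drop the first 4 lines: tail -n +K starts printing AT line K, so K = 4 + 1.
tail -n +5 demo.txt > no_head.txt

# Both outputs should now hold 8 lines each.
wc -l no_tail.txt no_head.txt
```

Running this on a small file first is a cheap way to confirm the line counts before pointing the same commands at a multi-gigabyte fastq.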

Edit: Just discovered that the above head command does not work on Apple OSX, because the head shipped there does not accept a negative line count. The workaround for Mac users is as follows:
$ tail -r <inputfile> | tail -n +<#lines+1> | tail -r > <outputfile>
# This command reverses the file, skips the first x lines of the reversed copy (tail -n +K starts output at line K, hence the +1), then reverses it back to the original order, which equates to trimming the last x lines.
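
Another option that avoids both the negative count and tail -r is to count the lines first and keep only the leading part of the file. This is a minimal sketch using wc and head with a positive count, which should behave the same with GNU and BSD tools. The file names and the trim, total and keep variables are made up for illustration.

```shell
# Hypothetical demo file with 12 numbered lines.
seq 1 12 > demo.txt

trim=4                                # number of lines to remove from the end
total=$(( $(wc -l < demo.txt) ))      # arithmetic expansion drops any padding wc adds
keep=$(( total - trim ))

# Keep only the first (total - trim) lines; guard against trimming more than the file holds.
if [ "$keep" -gt 0 ]; then
    head -n "$keep" demo.txt > trimmed.txt
else
    : > trimmed.txt                   # nothing left to keep, so write an empty file
fi
```

The trade-off is that the file is read twice (once by wc, once by head), which may be noticeable on very large files.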
