Performing the Bonferroni and Benjamini–Hochberg Procedures on a Large Dataset
Introduction
When performing many statistical tests, it is common practice to control for false positives using one of the many procedures devised for doing so. I commonly use the Bonferroni or Benjamini–Hochberg procedures, depending on the situation. The easiest way to take advantage of these approaches is to use R, which has several options for “correcting” p-values, i.e., transforming each raw p-value into a “corrected” form that controls for false positives. One then considers tests with a corrected p-value below some threshold (usually 0.05) to be statistically significant.
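In R, that usually amounts to a single call to the built-in p.adjust function. Here is a minimal sketch, run from the shell and assuming R / Rscript are installed (the p-values are made up for illustration):
Rscript -e 'p <- c(0.001, 0.008, 0.02, 0.04, 0.9); print(p.adjust(p, method = "bonferroni")); print(p.adjust(p, method = "BH"))'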
While this procedure is straightforward, it relies on being able to load your dataset of raw p-values into R. Sometimes, however, datasets are very large (i.e., tens or hundreds of megabytes, or even gigabytes), and R does not play nicely with very large files because it holds everything in memory, which your computer may not be able to accommodate. So, how does one determine which statistical tests are significant without using R? Below, I outline how to perform either the Bonferroni or Benjamini-Hochberg procedure to control for false positives using the Unix shell.
Dependencies: Following these procedures should not require much in the way of software installation, as they rely on standard Unix utilities that should already be available on your system. The one tool that may not be available is datamash, which should be easy to install. You will also need to install a tool for handling the genomics file formats we are working with, which is described when needed below.
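For example, datamash is packaged by the common package managers (a suggestion, not a requirement; use whatever your system prefers):
# Debian / Ubuntu
sudo apt-get install datamash
# macOS with Homebrew
brew install datamash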
Background Information
I found a great overview of both the Bonferroni and Benjamini-Hochberg procedures, and of controlling for false positives in general, in a section of the free, online textbook Handbook of Biological Statistics by John H. McDonald (University of Delaware).
In reading through the background information, I found that one does not strictly need to “correct” the p-values themselves. Rather, one can perform a procedure that identifies statistically significant tests based on the raw p-values. Moreover, the operations involved are relatively straightforward, which makes Unix a great candidate for performing this task on a dataset that is too large for R.
Preparing the Dataset
Let’s start by gathering a dataset large enough to justify this blog post. We will go with this very large dataset of PhyloP scores computed from a 241-way alignment of mammalian genomes generated as part of the Zoonomia Project. PhyloP scores measure the evolutionary conservation or acceleration of genomic sites based on an alignment of genomic regions and a phylogenetic tree. The file we will be using was produced using the procedure outlined in Example 1 here. Because this test is performed at every site of a 241-way alignment of mammals, a giant file is produced in Wiggle format, which was then converted to bigWig format to compress the data for a lower disk footprint. We can retrieve this file, with positions based on the human reference genome, from the Zoonomia data repository.
# big file! this may take a while to download
wget https://cgl.gi.ucsc.edu/data/cactus/241-mammalian-2020v2-hub/Homo_sapiens/241-mammalian-2020v2.bigWig
This bigWig file is ~21 GB. However, because it is a compressed, binary format, we cannot actually look at it or operate on it directly. We first need to convert it from bigWig to a human-readable format like Wiggle. Let’s make this conversion using the tool bigWigToWig (warning: link directly downloads the program) from UCSC, distributed as part of the Kent utilities, and look at the beginning of the file to get a sense of the format.
bigWigToWig 241-mammalian-2020v2.bigWig /dev/stdout | head
Here is what the output looks like.
#bedGraph section chr1:10074-11098
chr1 10074 10075 0.053
chr1 10075 10076 0.064
chr1 10076 10077 0.064
chr1 10077 10078 0.064
chr1 10078 10079 -2.109
chr1 10079 10080 0.053
chr1 10080 10081 0.053
chr1 10081 10082 0.064
chr1 10082 10083 0.064
Notice that there are lines beginning with #, which are comment lines we can ignore. The remaining lines follow a fairly standard genomic coordinate format, indicating the chromosome / scaffold and the start and end positions of each site in the human genome where a PhyloP score was calculated. The 4th column is the PhyloP score, which requires some interpretation. First, it can be either positive or negative: positive scores indicate conserved sites, while negative scores indicate accelerated sites evolving faster than expected under neutral evolution. Scores of 0 indicate sites evolving neutrally. Beyond that, the numeric score is not a p-value but rather a -log10(p-value) from the likelihood ratio test performed by PhyloP. We can use the formula 10^(-|PhyloP|) to convert it back to a raw p-value. The following command will classify sites as conserved, accelerated, or neutral and will perform the conversion to a p-value for all sites in the bigWig file, writing the result to a tabular, plain-text file.
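# convert the bigWig, classify each site, and compute raw p-values; this will take a while to run!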
bigWigToWig 241-mammalian-2020v2.bigWig /dev/stdout | grep -v "^#" | awk -v OFS="\t" '{ if ($4 > 0) print $0, "conserved", 10^-$4; else if ($4 < 0) print $0, "accelerated", 10^$4; else print $0, "neutral", 10^$4 }' > 241-mammalian-2020v2.parse.txt
We only added a couple of additional columns of information, but between that and decompressing the data from the bigWig format, we now have a file that is ~134 GB. Good luck loading that into R!
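As a quick sanity check of the score-to-p-value conversion (a throwaway one-liner on a single made-up value, not part of the pipeline), we can confirm that a PhyloP score of -2.109, like the accelerated site in the excerpt above, maps back to a raw p-value of roughly 0.0078:
echo "-2.109" | awk '{ s = ($1 < 0) ? -$1 : $1; print 10^-s }'
# ~0.0078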
Bonferroni Correction in Unix
We will use Unix to identify sites whose p-values remain statistically significant after controlling for false positives. First, let’s perform the Bonferroni correction. The Bonferroni correction is a simple procedure: the alpha value used for the statistical comparisons (usually 0.05) is divided by the number of tests performed, and this value becomes the new threshold for statistical significance. It is a quite conservative correction.
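Before committing to the long run below, here is a tiny dry run of the same logic on a made-up five-site file (hypothetical values, but the same six-column layout as our parsed file) so the thresholding is easy to eyeball:
# build a toy file: chrom, start, end, PhyloP score, class, raw p-value (all made up)
printf 'chrT\t%s\t%s\t%s\tconserved\t%s\n' 1 2 3.00 0.001 2 3 2.10 0.008 3 4 1.70 0.02 4 5 1.40 0.04 5 6 0.05 0.9 > toy.txt
# with 5 tests the Bonferroni threshold is 0.05 / 5 = 0.01, so only the first two sites should come back TRUE
total=`cat toy.txt | wc -l` && cat toy.txt | awk -v OFS="\t" -v total="${total}" '{ if ($6 <= (0.05 / total)) print $0, "TRUE"; else print $0, "FALSE" }'
Now the same logic, applied to the full dataset: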
# this will take a while to run!
total=`cat 241-mammalian-2020v2.parse.txt | wc -l` && cat 241-mammalian-2020v2.parse.txt | awk -v OFS="\t" -v total="${total}" '{ if ($6 <= (0.05 / total)) print $0, "TRUE"; else print $0, "FALSE" }' > 241-mammalian-2020v2.parse.bonferroni.txt
The output is the same as the input with an additional binary column: sites that remain significant after Bonferroni correction are encoded as TRUE, and non-significant sites are encoded as FALSE. It is also possible to view the new Bonferroni threshold for statistical significance.
awk -v total="${total}" -v OFS="\t" 'BEGIN { print total, 0.05 / total }'
# 2852623265 1.75277e-11
Pretty simple! That’s a ton of statistical tests / sites, and thus the Bonferroni threshold is quite low. Here is the number of tests with commas so it is more readable: 2,852,623,265. That’s 2.85 billion tests!
Benjamini-Hochberg Procedure in Unix
Now let’s turn to the Benjamini-Hochberg (BH) procedure, which is more complex than the simple Bonferroni correction but less conservative, and so may detect statistically significant results that the Bonferroni correction does not. You can read more about the procedure in the Background Information above, but let’s walk through the BH procedure as it is laid out there.
- In the BH procedure, raw p-values are sorted from lowest to highest (a step that is not necessary for the Bonferroni correction).
- Based on the sorted p-values, a rank is assigned to each test (i.e., site) from 1 to N, where N is the number of tests.
- A BH critical value is then calculated for each rank using the formula (i/m)Q, where i is the rank, m is the total number of tests performed (the same number used in the Bonferroni correction), and Q is the false discovery rate you choose (usually 0.05).
- Finally, each BH critical value is compared to the corresponding raw p-value. Specifically, we are looking for cases where the raw p-value is below its BH critical value. Many tests may meet this criterion, but we are most interested in the largest p-value that is below its BH critical value. Tests with a raw p-value at or below this largest p-value are all considered statistically significant.
Again, if you need more information and a simple worked example with a far smaller dataset, visit the Background Information linked above; a small toy sketch of the procedure also follows below.
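To make those steps concrete before running them at scale, here is that toy sketch: the same made-up five-site file from the Bonferroni dry run, pushed through the BH steps (sort, rank, critical value, compare):
# same made-up five-site file as in the Bonferroni dry run above
printf 'chrT\t%s\t%s\t%s\tconserved\t%s\n' 1 2 3.00 0.001 2 3 2.10 0.008 3 4 1.70 0.02 4 5 1.40 0.04 5 6 0.05 0.9 > toy.txt
# sort by p-value, append the rank (column 7), the BH critical value (i/m)*0.05 (column 8), and a TRUE/FALSE flag (column 9)
total=`cat toy.txt | wc -l` && cat toy.txt | sort -k6,6g | awk -v OFS="\t" '{ print $0, NR }' |
awk -v OFS="\t" -v total="${total}" '{ print $0, ($7 / total)*0.05 }' |
awk -v OFS="\t" '{ if ($6 < $8) print $0, "TRUE"; else print $0, "FALSE" }'
# the critical values are 0.01, 0.02, 0.03, 0.04, 0.05; p-values 0.001, 0.008, and 0.02 fall below theirs,
# so the cutoff is 0.02 and three sites are significant (one more than the Bonferroni dry run flagged)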
We can apply this procedure to our giant dataset with the following code. We tabulate the total number of tests (i.e., sites) in the original file, sort by raw p-value using 4 cores (--parallel=4) and the current working directory for temporary files (-T $(pwd)), calculate the rank and BH critical value for each test (i.e., site), and flag every site whose raw p-value is below its BH critical value.
# this will take a while to run!
total=`cat 241-mammalian-2020v2.parse.txt | wc -l` && cat 241-mammalian-2020v2.parse.txt | sort -k6,6g --parallel=4 -T $(pwd) | awk -v OFS="\t" '{ print $0, NR }' |
awk -v OFS="\t" -v total="${total}" '{ print $0, ($7 / total)*0.05 }' |
awk -v OFS="\t" '{ if ($6 < $8) print $0, "TRUE"; else print $0, "FALSE" }' > 241-mammalian-2020v2.parse.fdr.txt
Unlike the Bonferroni correction, this procedure requires multiple steps / commands. The first command produced a useful intermediate file with all the information we need. Now we need to determine the largest raw p-value that is below its BH critical value.
# this will take a while to run!
cat 241-mammalian-2020v2.parse.fdr.txt | awk '{ if ($9 == "TRUE") print $0 }' | tail -n 1 | awk '{ print $6 }'
# 0.00229615
Now we can gather those tests (i.e., sites) where the raw p-value is less than or equal to 0.00229615. We will also re-sort the data by chromosome / scaffold and coordinate, which is a more logical ordering (using the same tricks as before to sort with 4 cores).
cat 241-mammalian-2020v2.parse.fdr.txt | awk '{ if ($6 <= 0.00229615) print $0 }' |
sort -k1,1 -k2,2n --parallel=4 -T $(pwd) > 241-mammalian-2020v2.parse.fdr.significant.txt
Summarizing the Results
Remember, the Bonferroni correction is far more conservative than the BH FDR procedure, so we expect it to identify far fewer statistically significant sites. Let’s see how that shakes out.
# number of statistically significant sites after performing the BH procedure
wc -l 241-mammalian-2020v2.parse.fdr.significant.txt
# 131124451 241-mammalian-2020v2.parse.fdr.significant.txt
# number of statistically significant sites after Bonferroni correction
cat 241-mammalian-2020v2.parse.bonferroni.txt | awk '{ if ($7 == "TRUE") print $0 }' > 241-mammalian-2020v2.parse.bonferroni.significant.txt
cat 241-mammalian-2020v2.parse.bonferroni.significant.txt | wc -l
# 175862
Just as we expected: 131 million sites vs. only 176 thousand sites. A very large difference! Let’s dig further into the set of sites that are statistically significant after running the Benjamini-Hochberg procedure. How many sites are conserved versus accelerated? We can use datamash to investigate this in Unix without needing R, which would be difficult anyway (241-mammalian-2020v2.parse.fdr.significant.txt is 9.6 GB).
cat 241-mammalian-2020v2.parse.fdr.significant.txt | datamash --sort groupby 5 count 5
# accelerated 50558458
# conserved 80565993
We can even go further and summarize based on chromosome as well.
cat 241-mammalian-2020v2.parse.fdr.significant.txt | datamash --sort groupby 1,5 count 1
# chr1 accelerated 4215264
# chr1 conserved 7427904
# chr10 accelerated 2466657
# chr10 conserved 3769684
# chr11 accelerated 2243972
# chr11 conserved 4151872
# chr12 accelerated 2211997
# chr12 conserved 3893533
# chr13 accelerated 1560650
# chr13 conserved 2161264
# chr14 accelerated 1569982
# chr14 conserved 2936207
# chr15 accelerated 1464326
# chr15 conserved 2706121
# chr16 accelerated 1929270
# chr16 conserved 2821175
# chr17 accelerated 1791503
# chr17 conserved 3377866
# chr18 accelerated 1334192
# chr18 conserved 1944436
# chr19 accelerated 1393935
# chr19 conserved 1924627
# chr2 accelerated 3946703
# chr2 conserved 7261033
# chr20 accelerated 1385740
# chr20 conserved 1902183
# chr21 accelerated 921060
# chr21 conserved 703765
# chr22 accelerated 1145298
# chr22 conserved 1114539
# chr3 accelerated 2798629
# chr3 conserved 5486505
# chr4 accelerated 2850552
# chr4 conserved 3928253
# chr5 accelerated 2804820
# chr5 conserved 4893091
# chr6 accelerated 2753938
# chr6 conserved 4439519
# chr7 accelerated 2744352
# chr7 conserved 4098364
# chr8 accelerated 2454247
# chr8 conserved 3322716
# chr9 accelerated 2027966
# chr9 conserved 3732695
# chrX accelerated 2253501
# chrX conserved 2322307
# chrY accelerated 289904
# chrY conserved 246334
Pretty easy and cool! And that’s essentially it! These commands could easily be strung together into a little pipeline for files like this, but I will not do that here. The amazing thing is that this analysis can be done in an hour or two of interactive work at the terminal, even though we are dealing with files that are hundreds of gigabytes in size! The Unix shell barely blinks, whereas we would be helpless in R. I hope this tutorial is useful to others out there working with genomics data or other extremely large statistical datasets!