GWAS Catalog
Table of Contents
Using the GWAS Catalog
Finding Bipolar-Related SNPs
Today we will be using the GWAS Catalog to identify SNPs (Single Nucleotide Polymorphisms) that are associated with Bipolar Disorder. Sure, there is an interactive search on the page, but let’s do this locally.
Navigate to the downloadable files by first clicking the ‘Download’ link on the main page, and then the ‘Files’ link; you should end up here: https://www.ebi.ac.uk/gwas/docs/file-downloads The main file to get is ‘All associations v1.0.2 - with added ontology annotations, GWAS Catalog study accession numbers and genotyping technology’. It can also be grabbed via
wget https://www.ebi.ac.uk/gwas/api/search/downloads/alternative
Now that we have the file, let’s quickly see what we have that might qualify:
grep bipolar gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv
Note that the filename will likely be different for you; it is versioned with the date the catalog was updated. Well, that’s a lot of text. Let’s see how many candidates we may be dealing with by piping this output to wc.
grep bipolar gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | wc -l
Hmm, 1045. Take a look through the results a bit. Some of these mentions of bipolar are not necessarily what we are looking for.
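One way to survey what those mentions actually are is to tally the distinct trait labels. Here is a sketch of the idea on a tiny mock of the associations file; in this release the DISEASE/TRAIT label is column 8, but verify that against your file’s header line first.

```shell
# Build a 3-line mock of the associations file (column 8 = DISEASE/TRAIT).
printf 'x\tx\tx\tx\tx\tx\tx\tBipolar disorder\n'                  >  mock.tsv
printf 'x\tx\tx\tx\tx\tx\tx\tBipolar disorder\n'                  >> mock.tsv
printf 'x\tx\tx\tx\tx\tx\tx\tBinge eating in bipolar disorder\n'  >> mock.tsv
# Count how often each distinct trait label appears among the matches.
grep -i bipolar mock.tsv | cut -f8 | sort | uniq -c | sort -rn
# On the real file the same pipeline would be:
#   grep bipolar gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | cut -f8 | sort | uniq -c | sort -rn
```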
Some may be giving p-values associated with things other than just bipolar disorder, e.g. binge eating and bipolar disorder. Well, let’s see: there is another file, the GWAS to EFO mappings file. It is useful for finding which entries in the GWAS Catalog correspond to particular EFO terms. Let’s grab that file:
wget https://www.ebi.ac.uk/gwas/api/search/downloads/trait_mappings
and see what it has for things related to bipolar disorder:
grep bipolar gwas_catalog_trait-mappings_r2019-04-06.tsv
Again, the filename may be different for you. Looking through this file, we can see that the entries that are most exclusively about bipolar disorder correspond to EFO_0000289, and if we go to the listed site https://www.ebi.ac.uk/gwas/efotraits/EFO_0000289 we see that this is indeed the term for bipolar disorder.
Ok, so let’s search the first file just for EFO_0000289 and see how many hits we get:
grep EFO_0000289 gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | wc -l
610 hits; this is the same number of associations that we see if we do the search on the GWAS Catalog site: https://www.ebi.ac.uk/gwas/efotraits/EFO_0000289 We could also download them from the GWAS Catalog site, via the link in the top right of the presented table. Since we already have all the data locally, though, we can just extract it ourselves:
grep EFO_0000289 gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv > bipolar_disorder_associated_SNPs.tsv
If the EFO_* term had appeared in other columns or in other contexts, we might have wanted an alternate approach: for instance, extracting just the columns we want to check with cut, piping that to grep -n to get the matching line numbers, and then printing only those lines with sed or awk.
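Sketched on a toy file, that alternate approach could look like the following (the column number here is hypothetical; pick whichever column you actually want to restrict the match to):

```shell
# Toy input: the second line mentions the term, but only in the wrong column.
printf 'a\tEFO_0000289\tkeep\nb\tother\tEFO_0000289 elsewhere\n' > toy.tsv
# Match only within column 2, recording the line numbers of the hits.
cut -f2 toy.tsv | grep -n EFO_0000289 | cut -d ':' -f1 > line_nums
# Pull the full original lines back out by line number.
awk 'NR==FNR { keep[$1]; next } FNR in keep' line_nums toy.tsv
# prints only the first line, where the term is in column 2
```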
Why?
What’s the point of doing this locally? Not much in this case. The GWAS Catalog site is pretty nice and makes it easy to search and download just the data for the disease we’re interested in investigating. It also makes it easy to visualize the p-values across all the included studies.
In this case Bipolar Disorder is a trait with a pretty well-defined mapping to EFO; some other traits may not have that, or you may not agree with the mapping that the GWAS Catalog folks used. In cases like these it may be useful to delve into the full data set locally and make more nuanced decisions about which studies to include or exclude from your analysis. This is also a bit of a warmup, as the processing that follows, lining things up with the data from our own study, can’t be done on the GWAS Catalog site. In our case, we want to make a table with each previously discovered SNP associated with Bipolar Disorder alongside the p-value for the same SNP in the data from the study we’re currently analyzing. Let’s look more closely at the values we’re interested in:
grep EFO_0000289 gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | cut -f8,12,13,24,28
We know which columns to select with cut by checking the header values first. The file is tab-separated, so we don’t need to pass any delimiter to cut; if it were instead comma-separated we could do
cut -d ',' -f8,12,13,24,28
Note, though, that we should be careful that the delimiter character does not also appear within the entries; this can happen with .csv files, and then we must use something a bit more sophisticated than cut to parse them. Since we’re satisfied with this as the start of our table, let’s separately grab the list of only the SNP rs IDs; this is the column headed SNP_ID_CURRENT.
grep EFO_0000289 gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | cut -f24 | sort | uniq > SNP_IDs
Let’s check how many there are:
wc -l SNP_IDs
We’ve found 538 unique SNPs associated with Bipolar Disorder. We will check what p-values we have for these SNPs in the study we’re analyzing; hopefully we replicate some of these, but we would also like to discover newly associated SNPs. Let’s get the essential info:
grep EFO_0000289 ../GWAS\ Catalog/gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | cut -f12,13,28,31 | sort -h > chr_bp_pval
And let’s take a look at this file. (The header below is not in the file, purely illustrative; also, I’m not showing column 31, as I decided to add the odds ratio column retroactively, see the cut command above.)
CHR    BP         P-VAL
1      101472921  6E-6
1      101475100  6E-6
1      109654510  3E-6
1      115918397  9E-6
1;1;1  61366266;61366218;61358717  2E-7
1      116140919  7E-6
1      150351808  7E-6
1      150444437  6E-8
1      159184522  9E-6
1;1    61356147;61354400  4E-7
1;1    61356147;61356611  4E-7
1;1    61366266;61366218  2E-7
1      161493797  3E-6
1      165408315  7E-6
1      165411386  2E-7
1      17669971   3E-7
1      178131084  7E-7
...
22     29820911   6E-7
22     30858049   6E-6
22     34102046   9E-7
22     44855931   3E-8
22     44858015   9E-7
22     45566024   2E-8
22     48527647   5E-6
                  8E-7
So, we have a few problems here. Some rows pack several semicolon-separated SNPs into their CHR and BP columns, and one row has no CHR and BP at all.
There are actually only a few issues though, so we can just fix them in a text editor this time around.
Here is the cleaned up version (header again illustrative, not in the file):
CHR  BP         P-VAL  Odds Ratio/Beta
1    101472921  6E-6   1.15
1    101475100  6E-6
1    109654510  3E-6   0.244
1    115918397  9E-6   39.0894
1    61366266   2E-7   2.5
1    61366218   2E-7   2.5
1    61358717   2E-7   2.5
1    116140919  7E-6   35.6552
1    150351808  7E-6   1.086
1    150444437  6E-8
1    159184522  9E-6   2.0
1    61356147   4E-7   2.26
1    61354400   4E-7   2.26
1    61356147   4E-7   2.26
1    61356611   4E-7   2.26
1    61366266   2E-7   2.34
1    61366218   2E-7   2.34
1    161493797  3E-6
1    165408315  7E-6
1    165411386  2E-7
1    17669971   3E-7
1    178131084  7E-7   1.8491876
...
22   29820911   6E-7
22   30858049   6E-6   22.0315
22   34102046   9E-7   1.19373
22   44855931   3E-8   1.923077
22   44858015   9E-7   1.99
22   45566024   2E-8   1.06
22   48527647   5E-6
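If there had been many of these overloaded rows, hand-editing would not scale. A small awk sketch can split the semicolon-delimited entries into one row each, assuming (as in our data) that the chr and bp lists stay in matching order:

```shell
# Mock overloaded row: "1;1;1<TAB>pos1;pos2;pos3<TAB>pval"
printf '1;1;1\t61366266;61366218;61358717\t2E-7\n' > overloaded
# Split columns 1 and 2 on ';' and emit one full row per SNP.
awk -F '\t' -v OFS='\t' '{
    n = split($1, chrs, ";"); split($2, bps, ";")
    for (i = 1; i <= n; i++) print chrs[i], bps[i], $3
}' overloaded
# 1	61366266	2E-7
# 1	61366218	2E-7
# 1	61358717	2E-7
```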
Now, let’s get a proper final sort:
sort -V chr_bp_pval > chr_bp_pval_clean_sorted
Ironically, neither numeric sort (-n) nor human-numeric sort (-h) does what one would intuitively expect here; that is reserved for version sort (-V).
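A quick demonstration of the difference on toy chromosome/position pairs (assuming GNU sort):

```shell
# Two columns: chromosome<TAB>position.
printf '1\t9\n1\t10\n2\t5\n10\t1\n' > demo
# -n compares only the leading number; ties fall back to a byte-wise
# comparison of the whole line, so "1 10" sorts before "1 9".
sort -n demo
# -V compares every run of digits numerically, so "1 9" comes before "1 10"
# and chromosome 2 before chromosome 10.
sort -V demo
```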
After checking that it is doing what we want, let’s scratch that and also remove duplicates:
sort -V chr_bp_pval | uniq > chr_bp_pval_clean_sorted
Well, we still have another problem: the site notes that “Data in the GWAS Catalog is currently mapped to genome assembly GRCh38.p12 and dbSNP Build 151,” but our data is based on hg19 (aka GRCh37). Thankfully, there are tools for converting between different builds of the genome; one of these is LiftOver, which can be used online or downloaded. The tool expects the format Chr:BPstart-BPend, i.e. a range of base pairs, e.g. 1:12345-12346 for a single SNP at 12345 or 1:12345-12348 for a range covering 3 SNPs. Let’s get our data into that format by isolating just the locations (we’ll set aside dealing with the p-values) and getting that ‘:’ between them. The format is a little obnoxious, so this will take some Bash magic.
cat chr_bp_pval_clean_sorted | cut -f1,2 | tr '\t' ':' > locations
cat locations | cut -d ':' -f2 | xargs printf "%d+1\n" | bc > ends
for i in $(seq 1 $(wc -l locations | cut -d ' ' -f1)); do echo chr >> prefix; done
paste -d "" prefix locations | paste - ends | tr '\t' '-' > liftover
We may need to cut some lines off the end of the file because of a few too many prefix lines; echo adds a newline, but I’m not sure where the 2nd extra line comes from. So now everything in our file ‘liftover’ should look something like this:
chr1:8362616-8362617
chr1:17669971-17669972
chr1:24105466-24105467
chr1:29964421-29964422
chr1:30052867-30052868
chr1:34138966-34138967
chr1:40710794-40710795
chr1:41374150-41374151
chr1:43222879-43222880
...
chr22:34102046-34102047
chr22:44855931-44855932
chr22:44858015-44858016
chr22:45566024-45566025
chr22:48527647-48527648
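Incidentally, the whole prefix/paste construction above can be collapsed into a single awk invocation with no stray-line issues; a sketch over the same two-column chr/bp layout (any extra columns are simply ignored):

```shell
# Mock of chr_bp_pval_clean_sorted: chromosome<TAB>position.
printf '1\t8362616\n22\t48527647\n' > positions
# Prefix "chr", join with ':', and compute the end coordinate in one pass.
awk -F '\t' '{ print "chr" $1 ":" $2 "-" ($2 + 1) }' positions
# chr1:8362616-8362617
# chr22:48527647-48527648
```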
Alright, that was obnoxious (LiftOver doesn’t have the best error handling…), but anyway: we can pass the file we created to the web server’s LiftOver (or download the tool and run it locally), and what we get out is as follows, which should now be hg19 positions:
chr1:8422676-8422677
chr1:17996466-17996467
chr1:24431956-24431957
chr1:30437268-30437269
chr1:30525714-30525715
chr1:34604567-34604568
chr1:41176466-41176467
chr1:41839822-41839823
chr1:43688550-43688551
...
chr22:34498035-34498036
chr22:45251811-45251812
chr22:45253895-45253896
chr22:45961904-45961905
chr22:48923459-48923460
Sigh… now let’s strip the format back down so we have something we can search for.
We want only the chromosome number and the position. First we cut using the ‘r’ character as the delimiter and keep the 2nd column (everything to the right of the ‘r’). Then we cut again, this time at the ‘-’ character, keeping the first column (everything to the left of the ‘-’).
cat hglft_genome_28c5_ce3e20.bed | cut -d 'r' -f2 | cut -d '-' -f1 > final_selected_SNPs
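Equivalently, sed can strip both ends in a single pass; shown here on a mock of the LiftOver output (hglft_genome_28c5_ce3e20.bed is just whatever name the server gave the download):

```shell
# Mock of the lifted-over BED-style positions.
printf 'chr1:8422676-8422677\nchr22:48923459-48923460\n' > mock.bed
# Drop the leading "chr" and the "-end" suffix, keeping chromosome:position.
sed -E 's/^chr//; s/-[0-9]+$//' mock.bed
# 1:8422676
# 22:48923459
```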
Now we need to locate all these SNPs in the data from our own GWAS, which we have in the file ALL_RES_PC. We will do this by grepping for the unique id formed by the chr:position pair that we now have after doing the liftover and that we also used in our data. Some of these won’t exist in our data for various reasons (e.g. the SNP was excluded in quality control, or was monomorphic in our data).
for i in $(cat final_selected_SNPs); do grep -E "\b${i}\b" ALL_RES_PC >> our_SNPs; done
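With hundreds of SNPs, that loop launches grep once per ID; grep can instead take the whole list of fixed strings and make a single pass over the study file. A sketch on mock data (mock_results stands in for ALL_RES_PC; -w approximates the \b word boundaries used above):

```shell
# Mock study results: chr:position id plus a p-value column.
printf '1:100\t0.04\n1:200\t0.5\n2:300\t0.01\n' > mock_results
printf '1:100\n2:300\n' > mock_snp_list
# -F: fixed strings, -w: whole-word matches only, -f: read patterns from a file.
grep -Fwf mock_snp_list mock_results > our_SNPs_fast
# our_SNPs_fast now holds only the rows for 1:100 and 2:300.
```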
cut -f3,4 chr_bp_pval | paste final_selected_SNPs - > gwas_catalog_version
Let’s merge these on the chr:bp column
sort -k1,1 gwas_catalog_version > gwas_catalog_version_sorted_for_join
sort -k2,2 our_SNPs > our_sorted_SNPs_sorted_for_join
nl gwas_catalog_version_sorted_for_join > gwas_catalog_version_sorted_for_join2
join -1 2 -2 2 our_sorted_SNPs_sorted_for_join gwas_catalog_version_sorted_for_join2 | sort -gb -k2,2 -k3,3 | uniq -f9 -w 4 | cut -d ' ' -f10 --complement | sort -gb -k2,2 -k3,3 > figured_out
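For reference, the join at the heart of that pipeline pairs lines whose second columns match, and both inputs must already be sorted on the join field. A toy example:

```shell
# Two files sharing a chr:bp key in their second column.
printf 'a 1:100\nb 1:200\n' > left
printf 'x 1:100\ny 1:300\n' > right
sort -k2,2 left  > left.sorted
sort -k2,2 right > right.sorted
# -1 2 -2 2: join on column 2 of each file; output is the join field
# followed by the remaining fields of file 1, then of file 2.
join -1 2 -2 2 left.sorted right.sorted
# 1:100 a x
```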
awk '{if(a[$1]){a[$1]=a[$1]","$11} else { a[$1]=$11}} END {for (i in a) {print i" "a[i]}}' figured_out | sort -gb -k1,1 > odds_beta
awk '{if(a[$1]){a[$1]=a[$1]","$10} else { a[$1]=$10}} END {for (i in a) {print i" "a[i]}}' figured_out | sort -gb -k1,1 > p_vals
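The two awk commands above share one idiom: for each key in column 1, accumulate a comma-separated list of the values in some other column (11 for the odds ratios, 10 for the p-values). On toy data:

```shell
# Toy input: repeated keys in column 1, values in column 2.
printf 'snpA 0.01\nsnpA 0.02\nsnpB 0.05\n' > toy
# Append each value to a growing comma-separated list per key;
# the END block prints one line per key (order is unspecified, hence sort).
awk '{ if (a[$1]) { a[$1] = a[$1] "," $2 } else { a[$1] = $2 } }
     END { for (i in a) print i, a[i] }' toy | sort
# snpA 0.01,0.02
# snpB 0.05
```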
cut -d ' ' -f1-9 figured_out | sort -gb -k1,1 | join - p_vals | join - odds_beta | uniq | sort -gb -k2,2 -k3,3 > stone_table
Sorry, the Bash got a bit more complicated at the end there, and I’ve been busy; I will add some more explanation of what is going on later.