GWAS Catalog
Table of Contents
Using the GWAS Catalog
Finding Bipolar-Related SNPs
Today we will be using the GWAS Catalog to identify SNPs (Single Nucleotide Polymorphisms) that are associated with Bipolar Disorder. Sure, there is an interactive search on the page, but let’s do this locally.
Navigate to the downloadable files by first clicking the ‘Download’ link on the main page, and then the ‘Files’ link; you should end up here: https://www.ebi.ac.uk/gwas/docs/file-downloads The main file to get is ‘All associations v1.0.2 - with added ontology annotations, GWAS Catalog study accession numbers and genotyping technology’. It can also be grabbed via
wget https://www.ebi.ac.uk/gwas/api/search/downloads/alternative
Now that we have the file, let’s quickly see what we have that might qualify:
grep bipolar gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv
Note that the filename will likely be different for you; it is versioned with the date the catalog was updated. Well, that’s a lot of text. Let’s see how many candidates we may be dealing with by piping this output to wc.
grep bipolar gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | wc -l
Hmm, 1045. Take a look through the results a bit. Some of these mentions of bipolar are not necessarily what we are looking for.
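One way to survey what those mentions actually are is to tally the distinct trait labels. Here is a sketch of the idea on a tiny mock of the associations file; in this release the DISEASE/TRAIT label is column 8, but verify that against your file’s header line first.

```shell
# Build a 3-line mock of the associations file (column 8 = DISEASE/TRAIT).
printf 'x\tx\tx\tx\tx\tx\tx\tBipolar disorder\n'                  >  mock.tsv
printf 'x\tx\tx\tx\tx\tx\tx\tBipolar disorder\n'                  >> mock.tsv
printf 'x\tx\tx\tx\tx\tx\tx\tBinge eating in bipolar disorder\n'  >> mock.tsv
# Count how often each distinct trait label appears among the matches.
grep -i bipolar mock.tsv | cut -f8 | sort | uniq -c | sort -rn
# On the real file the same pipeline would be:
#   grep bipolar gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | cut -f8 | sort | uniq -c | sort -rn
```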
Some may be giving p-values associated with things other than just bipolar disorder, e.g. binge eating and bipolar disorder. Well, let’s see: there is another file, the GWAS to EFO mappings file. It is useful for finding which entries in the GWAS Catalog correspond to particular EFO terms. Let’s grab that file:
wget https://www.ebi.ac.uk/gwas/api/search/downloads/trait_mappings
and see what it has for things related to bipolar disorder:
grep bipolar gwas_catalog_trait-mappings_r2019-04-06.tsv
Again, the filename may be different for you. Looking through this file, we can see that the entries that are most exclusively about bipolar disorder correspond to EFO_0000289, and if we go to the listed site https://www.ebi.ac.uk/gwas/efotraits/EFO_0000289 we see that this is indeed the term for bipolar disorder.
Ok, so let’s search the first file just for EFO_0000289 and see how many hits we get:
grep EFO_0000289 gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | wc -l
610 hits; this is the same number of associations that we see if we do the search on the GWAS Catalog site: https://www.ebi.ac.uk/gwas/efotraits/EFO_0000289 We could also download them from the GWAS Catalog site, via the link in the top right of the presented table. Since we already have all the data locally, though, we can just extract it ourselves:
grep EFO_0000289 gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv > bipolar_disorder_associated_SNPs.tsv
If the EFO_* term had appeared in other columns or in other contexts, we might have wanted an alternate approach: for instance, extracting just the columns we want to check with cut, piping that to grep -n to get the matching line numbers, and then printing only those lines with sed or awk.
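Sketched on a toy file, that alternate approach could look like the following (the column number here is hypothetical; pick whichever column you actually want to restrict the match to):

```shell
# Toy input: the second line mentions the term, but only in the wrong column.
printf 'a\tEFO_0000289\tkeep\nb\tother\tEFO_0000289 elsewhere\n' > toy.tsv
# Match only within column 2, recording the line numbers of the hits.
cut -f2 toy.tsv | grep -n EFO_0000289 | cut -d ':' -f1 > line_nums
# Pull the full original lines back out by line number.
awk 'NR==FNR { keep[$1]; next } FNR in keep' line_nums toy.tsv
# prints only the first line, where the term is in column 2
```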
Why?
What’s the point of doing this locally? Not much in this case. The GWAS Catalog site is pretty nice and makes it easy to search and download just the data for the disease we’re interested in investigating. It also makes it easy to visualize the p-values across all the included studies.
In this case Bipolar Disorder is a trait with a pretty well-defined mapping to EFO; some other traits may not have that, or you may not agree with the mapping that the GWAS Catalog folks used. In cases like these it may be useful to delve into the full data set locally and make more nuanced decisions about which studies to include or exclude from your analysis. This is also a bit of a warmup, as the processing that follows, lining things up with the data from our own study, can’t be done on the GWAS Catalog site. In our case, we want to make a table with each previously discovered SNP associated with Bipolar Disorder alongside the p-value for the same SNP in the data from the study we’re currently analyzing. Let’s look more closely at the values we’re interested in:
grep EFO_0000289 gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | cut -f8,12,13,24,28
We know which columns to select with cut by checking the header values first. The file is tab-separated, so we don’t need to pass any delimiter to cut; if it were instead comma-separated we could do
cut -d ',' -f8,12,13,24,28
Note, though, that we should be careful that the delimiter character does not also appear within the entries; this can happen with .csv files, and then we must use something a bit more sophisticated than cut to parse them. Since we’re satisfied with this as the start of our table, let’s separately grab the list of only the SNP rs IDs; this is the column headed SNP_ID_CURRENT.
grep EFO_0000289 gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | cut -f24 | sort | uniq > SNP_IDs
Let’s check how many there are:
wc -l SNP_IDs
We’ve found 538 unique SNPs associated with Bipolar Disorder. We will check what p-values we have for these SNPs in the study we’re analyzing; hopefully we replicate some of these, but we would also like to discover newly associated SNPs. Let’s get the essential info:
grep EFO_0000289 ../GWAS\ Catalog/gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | cut -f12,13,28,31 | sort -h > chr_bp_pval
And let’s take a look at this file. (The header below is not in the file, purely illustrative; also, I’m not showing column 31, as I decided to add the odds ratio column retroactively, see the cut command above.)
CHR    BP         P-VAL
1      101472921  6E-6
1      101475100  6E-6
1      109654510  3E-6
1      115918397  9E-6
1;1;1  61366266;61366218;61358717  2E-7
1      116140919  7E-6
1      150351808  7E-6
1      150444437  6E-8
1      159184522  9E-6
1;1    61356147;61354400  4E-7
1;1    61356147;61356611  4E-7
1;1    61366266;61366218  2E-7
1      161493797  3E-6
1      165408315  7E-6
1      165411386  2E-7
1      17669971   3E-7
1      178131084  7E-7
...
22     29820911   6E-7
22     30858049   6E-6
22     34102046   9E-7
22     44855931   3E-8
22     44858015   9E-7
22     45566024   2E-8
22     48527647   5E-6
                  8E-7
So, we have a few problems here. Some rows pack several semicolon-separated SNPs into their CHR and BP columns, and one row has no CHR and BP at all.
There are actually only a few issues though, so we can just fix them in a text editor this time around.
Here is the cleaned up version (header again illustrative, not in the file):
CHR  BP         P-VAL  Odds Ratio/Beta
1    101472921  6E-6   1.15
1    101475100  6E-6
1    109654510  3E-6   0.244
1    115918397  9E-6   39.0894
1    61366266   2E-7   2.5
1    61366218   2E-7   2.5
1    61358717   2E-7   2.5
1    116140919  7E-6   35.6552
1    150351808  7E-6   1.086
1    150444437  6E-8
1    159184522  9E-6   2.0
1    61356147   4E-7   2.26
1    61354400   4E-7   2.26
1    61356147   4E-7   2.26
1    61356611   4E-7   2.26
1    61366266   2E-7   2.34
1    61366218   2E-7   2.34
1    161493797  3E-6
1    165408315  7E-6
1    165411386  2E-7
1    17669971   3E-7
1    178131084  7E-7   1.8491876
...
22   29820911   6E-7
22   30858049   6E-6   22.0315
22   34102046   9E-7   1.19373
22   44855931   3E-8   1.923077
22   44858015   9E-7   1.99
22   45566024   2E-8   1.06
22   48527647   5E-6
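If there had been many of these overloaded rows, hand-editing would not scale. A small awk sketch can split the semicolon-delimited entries into one row each, assuming (as in our data) that the chr and bp lists stay in matching order:

```shell
# Mock overloaded row: "1;1;1<TAB>pos1;pos2;pos3<TAB>pval"
printf '1;1;1\t61366266;61366218;61358717\t2E-7\n' > overloaded
# Split columns 1 and 2 on ';' and emit one full row per SNP.
awk -F '\t' -v OFS='\t' '{
    n = split($1, chrs, ";"); split($2, bps, ";")
    for (i = 1; i <= n; i++) print chrs[i], bps[i], $3
}' overloaded
# 1	61366266	2E-7
# 1	61366218	2E-7
# 1	61358717	2E-7
```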
Now, let’s get a proper final sort:
sort -V chr_bp_pval > chr_bp_pval_clean_sorted
Ironically, neither numeric sort (-n) nor human-numeric sort (-h) does what one would intuitively expect here; that is reserved for version sort (-V).
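A quick demonstration of the difference on toy chromosome/position pairs (assuming GNU sort):

```shell
# Two columns: chromosome<TAB>position.
printf '1\t9\n1\t10\n2\t5\n10\t1\n' > demo
# -n compares only the leading number; ties fall back to a byte-wise
# comparison of the whole line, so "1 10" sorts before "1 9".
sort -n demo
# -V compares every run of digits numerically, so "1 9" comes before "1 10"
# and chromosome 2 before chromosome 10.
sort -V demo
```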
After checking that it is doing what we want, let’s scratch that and also remove duplicates:
sort -V chr_bp_pval | uniq > chr_bp_pval_clean_sorted
Well, we still have another problem: the site notes that “Data in the GWAS Catalog is currently mapped to genome assembly GRCh38.p12 and dbSNP Build 151,” but our data is based on hg19 (aka GRCh37). Thankfully, there are tools for converting between different builds of the genome; one of these is LiftOver, which can be used online or downloaded. The tool expects the format Chr:BPstart-BPend, i.e. a range of base pairs, e.g. 1:12345-12346 for a single SNP at 12345 or 1:12345-12348 for a range covering 3 SNPs. Let’s get our data into that format by isolating just the locations (we’ll set aside dealing with the p-values) and getting that ‘:’ between them. The format is a little obnoxious, so this will take some Bash magic.
cat chr_bp_pval_clean_sorted | cut -f1,2 | tr '\t' ':' > locations
cat locations | cut -d ':' -f2 | xargs printf "%d+1\n" | bc > ends
for i in $(seq 1 $(wc -l locations | cut -d ' ' -f1)); do echo chr >> prefix; done
paste -d "" prefix locations | paste - ends | tr '\t' '-' > liftover
We may need to cut some lines off the end of the file because of a few too many prefix lines; echo adds a newline, but I’m not sure where the 2nd extra line comes from. So now everything in our file ‘liftover’ should look something like this:
chr1:8362616-8362617
chr1:17669971-17669972
chr1:24105466-24105467
chr1:29964421-29964422
chr1:30052867-30052868
chr1:34138966-34138967
chr1:40710794-40710795
chr1:41374150-41374151
chr1:43222879-43222880
...
chr22:34102046-34102047
chr22:44855931-44855932
chr22:44858015-44858016
chr22:45566024-45566025
chr22:48527647-48527648
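Incidentally, the whole prefix/paste construction above can be collapsed into a single awk invocation with no stray-line issues; a sketch over the same two-column chr/bp layout (any extra columns are simply ignored):

```shell
# Mock of chr_bp_pval_clean_sorted: chromosome<TAB>position.
printf '1\t8362616\n22\t48527647\n' > positions
# Prefix "chr", join with ':', and compute the end coordinate in one pass.
awk -F '\t' '{ print "chr" $1 ":" $2 "-" ($2 + 1) }' positions
# chr1:8362616-8362617
# chr22:48527647-48527648
```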
Alright, that was obnoxious (LiftOver doesn’t have the best error handling…), but anyway: we can pass the file we created to the web server’s LiftOver (or download the tool and run it locally), and what we get out is as follows, which should now be hg19 positions:
chr1:8422676-8422677
chr1:17996466-17996467
chr1:24431956-24431957
chr1:30437268-30437269
chr1:30525714-30525715
chr1:34604567-34604568
chr1:41176466-41176467
chr1:41839822-41839823
chr1:43688550-43688551
...
chr22:34498035-34498036
chr22:45251811-45251812
chr22:45253895-45253896
chr22:45961904-45961905
chr22:48923459-48923460
Sigh… now let’s strip the format back down so we have something we can search for.
We want only the chromosome number and the position. First we cut using the ‘r’ character as the delimiter and keep the 2nd column (everything to the right of the ‘r’). Then we cut again, this time at the ‘-’ character, keeping the first column (everything to the left of the ‘-’).
cat hglft_genome_28c5_ce3e20.bed | cut -d 'r' -f2 | cut -d '-' -f1 > final_selected_SNPs
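Equivalently, sed can strip both ends in a single pass; shown here on a mock of the LiftOver output (hglft_genome_28c5_ce3e20.bed is just whatever name the server gave the download):

```shell
# Mock of the lifted-over BED-style positions.
printf 'chr1:8422676-8422677\nchr22:48923459-48923460\n' > mock.bed
# Drop the leading "chr" and the "-end" suffix, keeping chromosome:position.
sed -E 's/^chr//; s/-[0-9]+$//' mock.bed
# 1:8422676
# 22:48923459
```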
Now we need to locate all these SNPs in the data from our own GWAS, which we have in the file ALL_RES_PC. We will do this by grepping for the unique id formed by the chr:position pair that we now have after doing the liftover and that we also used in our data. Some of these won’t exist in our data for various reasons (e.g. the SNP was excluded in quality control, or was monomorphic in our data).
for i in $(cat final_selected_SNPs); do grep -E "\b${i}\b" ALL_RES_PC >> our_SNPs; done
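With hundreds of SNPs, that loop launches grep once per ID; grep can instead take the whole list of fixed strings and make a single pass over the study file. A sketch on mock data (mock_results stands in for ALL_RES_PC; -w approximates the \b word boundaries used above):

```shell
# Mock study results: chr:position id plus a p-value column.
printf '1:100\t0.04\n1:200\t0.5\n2:300\t0.01\n' > mock_results
printf '1:100\n2:300\n' > mock_snp_list
# -F: fixed strings, -w: whole-word matches only, -f: read patterns from a file.
grep -Fwf mock_snp_list mock_results > our_SNPs_fast
# our_SNPs_fast now holds only the rows for 1:100 and 2:300.
```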
cut -f3,4 chr_bp_pval | paste final_selected_SNPs - > gwas_catalog_version
Let’s merge these on the chr:bp column
sort -k1,1 gwas_catalog_version > gwas_catalog_version_sorted_for_join
sort -k2,2 our_SNPs > our_sorted_SNPs_sorted_for_join
nl gwas_catalog_version_sorted_for_join > gwas_catalog_version_sorted_for_join2
join -1 2 -2 2 our_sorted_SNPs_sorted_for_join gwas_catalog_version_sorted_for_join2 | sort -gb -k2,2 -k3,3 | uniq -f9 -w 4 | cut -d ' ' -f10 --complement | sort -gb -k2,2 -k3,3 > figured_out
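For reference, the join at the heart of that pipeline pairs lines whose second columns match, and both inputs must already be sorted on the join field. A toy example:

```shell
# Two files sharing a chr:bp key in their second column.
printf 'a 1:100\nb 1:200\n' > left
printf 'x 1:100\ny 1:300\n' > right
sort -k2,2 left  > left.sorted
sort -k2,2 right > right.sorted
# -1 2 -2 2: join on column 2 of each file; output is the join field
# followed by the remaining fields of file 1, then of file 2.
join -1 2 -2 2 left.sorted right.sorted
# 1:100 a x
```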
awk '{if(a[$1]){a[$1]=a[$1]","$11} else { a[$1]=$11}} END {for (i in a) {print i" "a[i]}}' figured_out | sort -gb -k1,1 > odds_beta
awk '{if(a[$1]){a[$1]=a[$1]","$10} else { a[$1]=$10}} END {for (i in a) {print i" "a[i]}}' figured_out | sort -gb -k1,1 > p_vals
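The two awk commands above share one idiom: for each key in column 1, accumulate a comma-separated list of the values in some other column (11 for the odds ratios, 10 for the p-values). On toy data:

```shell
# Toy input: repeated keys in column 1, values in column 2.
printf 'snpA 0.01\nsnpA 0.02\nsnpB 0.05\n' > toy
# Append each value to a growing comma-separated list per key;
# the END block prints one line per key (order is unspecified, hence sort).
awk '{ if (a[$1]) { a[$1] = a[$1] "," $2 } else { a[$1] = $2 } }
     END { for (i in a) print i, a[i] }' toy | sort
# snpA 0.01,0.02
# snpB 0.05
```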
cut -d ' ' -f1-9 figured_out | sort -gb -k1,1 | join - p_vals | join - odds_beta | uniq | sort -gb -k2,2 -k3,3 > stone_table
Sorry, the Bash got a bit more complicated at the end there, and I’ve been busy; I will add some more explanation of what is going on later.