GWAS Catalog


Using the GWAS Catalog

Today we will be using the GWAS Catalog to identify SNPs (Single Nucleotide Polymorphisms) that are associated with Bipolar Disorder. Sure, there is an interactive search on the page, but let’s do this locally.

  1. Navigate to the downloadable files by first clicking the ‘Download’ link on the main page and then the ‘Files’ link; you should end up here: https://www.ebi.ac.uk/gwas/docs/file-downloads
    The main file to get is ‘All associations v1.0.2 - with added ontology annotations, GWAS Catalog study accession numbers and genotyping technology’. It can also be grabbed directly:
    wget https://www.ebi.ac.uk/gwas/api/search/downloads/alternative

  2. Now that we have the file, let’s quickly see what we have that might qualify:
    grep bipolar gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv
    Note that the filename will likely be different for you; it is versioned with the date the catalog was updated.
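
    Note that grep is case-sensitive by default; if you also want to catch capitalized mentions like ‘Bipolar’ in study titles, -i does that (and -c prints just the count of matching lines):
    grep -ic bipolar gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv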

  3. Well, that’s a lot of text. Let’s see how many candidates we may be dealing with by piping this output to wc:
    grep bipolar gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | wc
    Hmm, 1045. Take a look through the results a bit. Some of these mentions of bipolar are not necessarily what we are looking for:
    some give p-values for traits other than bipolar disorder alone, e.g. ‘binge eating and bipolar disorder’.

  4. Well, let’s see, there is another file, the GWAS to EFO mappings file. It is useful for finding which entries in the GWAS Catalog correspond to particular EFO terms. Let’s grab that file from https://www.ebi.ac.uk/gwas/api/search/downloads/trait_mappings and see what it has for things related to bipolar disorder:
    grep bipolar gwas_catalog_trait-mappings_r2019-04-06.tsv
    Again, the filename may be different for you.

  5. Looking through this file, the entries that are most exclusively about bipolar disorder correspond to EFO_0000289, and if we visit the page listed, https://www.ebi.ac.uk/gwas/efotraits/EFO_0000289, we see that this is indeed the term for bipolar disorder.

  6. Ok, so let’s search the first file just for EFO_0000289 and see how many hits we get:
    grep EFO_0000289 gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | wc
    610 hits. This is the same number of associations we see if we do the search on the GWAS Catalog site: https://www.ebi.ac.uk/gwas/efotraits/EFO_0000289

  7. We could download them from the GWAS Catalog site (there is a link in the top right of the presented table), but since we already have all the data, we can just extract it locally:
    grep EFO_0000289 gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv > bipolar_disorder_associated_SNPs.tsv

  8. If the EFO_* term had appeared in other columns, or in other contexts, we might have wished to take an alternate approach: for instance, extracting just the columns we want to check using cut, piping that to grep to get the matching line numbers, and then extracting only those lines using sed or awk. A sketch of that follows.
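
    For illustration only, here is how that could look; the column number is an assumption, so check the header for whichever field holds the mapped EFO URI in your copy of the file:
    # pull out one column (36 is a stand-in), record the matching line numbers,
    # then print exactly those lines from the full file with awk
    cut -f36 gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | grep -n EFO_0000289 | cut -d ':' -f1 > matching_lines
    awk 'NR==FNR { keep[$1]; next } FNR in keep' matching_lines gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv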

    Why?

    What’s the point of doing this locally? Not much in this case. The GWAS Catalog site is pretty nice and makes it easy to search and download just the data for the disease we’re interested in investigating. It also makes it easy to visualize the p-values across all the included studies.
    In this case Bipolar Disorder is a trait with a pretty well-defined mapping to EFO; other traits may not have that, or you may not agree with the mapping the GWAS Catalog folks used. In cases like those it can be useful to delve into the full data set locally and make more nuanced decisions about which studies to include or exclude from your analysis. It is also a bit of a warmup, since the processing that follows (lining things up with the data from our own study) can’t be done on the GWAS Catalog site.

  9. In our case, we want to make a table pairing each previously discovered SNP associated with Bipolar Disorder with the p-value for the same SNP in the data from the study we’re currently analyzing. Let’s look more closely at the values we’re interested in:
    grep EFO_0000289 gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | cut -f8,12,13,24,28
    We know which columns to select with cut by checking the header values first (a quick way to do that is shown below). The file is tab-separated, so we don’t need to pass any delimiter to cut; if it were instead comma-separated we could do cut -d ',' -f8,12,13,24,28. Note, though, that we should be careful that the delimiter character does not also appear within the entries… this can happen with .csv files, and then we must use something a bit more sophisticated than cut to parse them.
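
    A quick way to check which field number each header occupies: print the header’s fields one per line and number them.
    # tr turns tabs into newlines; nl numbers the resulting lines
    head -1 gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | tr '\t' '\n' | nl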

  10. Since we’re satisfied with this as the start of our table, let’s separately grab the list of only the SNP rs IDs; this is the column headed SNP_ID_CURRENT.
    grep EFO_0000289 gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | cut -f24 | sort | uniq > SNP_IDs
    Let’s check how many there are with wc SNP_IDs: we’ve found 538 unique SNPs associated with Bipolar Disorder. We will check what p-values we have for these SNPs in the study we’re analyzing; hopefully we replicate some of them, but we would also like to discover newly associated SNPs.
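
    As an aside, sort | uniq can be collapsed into sort -u, which does the same deduplication in one process:
    grep EFO_0000289 gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | cut -f24 | sort -u > SNP_IDs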

  11. Let’s get the essential info:
    grep EFO_0000289 ../GWAS\ Catalog/gwas_catalog_v1.0.2-associations_e96_r2019-04-06.tsv | cut -f12,13,28,31 | sort -h > chr_bp_pval
    And let’s take a look at this file:

    CHR     BP              P-VAL    <- Header is not in the file, purely illustrative. Column 31 is also not shown,
                                        as I decided to add the odds ratio column retroactively (see the cut command above).
    1       101472921       6E-6
    1       101475100       6E-6
    1       109654510       3E-6
    1       115918397       9E-6
    1;1;1   61366266;61366218;61358717      2E-7
    1       116140919       7E-6
    1       150351808       7E-6
    1       150444437       6E-8
    1       159184522       9E-6
    1;1     61356147;61354400       4E-7
    1;1     61356147;61356611       4E-7
    1;1     61366266;61366218       2E-7
    1       161493797       3E-6
    1       165408315       7E-6
    1       165411386       2E-7
    1       17669971        3E-7
    1       178131084       7E-7
    ...
    22      29820911        6E-7
    22      30858049        6E-6
    22      34102046        9E-7
    22      44855931        3E-8
    22      44858015        9E-7
    22      45566024        2E-8
    22      48527647        5E-6
                            8E-7
    

    So, we have a few problems here. Some rows pack several SNPs into one field (semicolon-separated), and one row has no CHR and BP at all.
    There are actually only a few such rows, so we can just fix them in a text editor this time around.
    Here is the cleaned up version:

    CHR     BP              P-VAL   Odds Ratio/Beta  <- Header is not in the file, purely illustrative.
    1       101472921       6E-6    1.15
    1       101475100       6E-6    
    1       109654510       3E-6    0.244
    1       115918397       9E-6    39.0894
    1       61366266        2E-7    2.5
    1       61366218        2E-7    2.5
    1       61358717        2E-7    2.5
    1       116140919       7E-6    35.6552
    1       150351808       7E-6    1.086
    1       150444437       6E-8    
    1       159184522       9E-6    2.0
    1       61356147        4E-7    2.26
    1       61354400        4E-7    2.26
    1       61356147        4E-7    2.26
    1       61356611        4E-7    2.26
    1       61366266        2E-7    2.34
    1       61366218        2E-7    2.34
    1       161493797       3E-6    
    1       165408315       7E-6    
    1       165411386       2E-7    
    1       17669971        3E-7    
    1       178131084       7E-7    1.8491876
    ...
    22      29820911        6E-7    
    22      30858049        6E-6    22.0315
    22      34102046        9E-7    1.19373
    22      44855931        3E-8    1.923077
    22      44858015        9E-7    1.99
    22      45566024        2E-8    1.06
    22      48527647        5E-6    
    

    Now, let’s get a proper final sort:
    sort -V chr_bp_pval > chr_bp_pval_clean_sorted
    Ironically, neither numeric sort (-n) nor human sort (-h) does what one would intuitively expect a sort to do here; that is reserved for version sort (-V).
    After checking that it does what we want, let’s scratch that and also remove duplicates:
    sort -V chr_bp_pval | uniq > chr_bp_pval_clean_sorted
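
    A tiny demonstration of why -V wins here (assuming GNU sort): -n compares only the leading number of each line and breaks ties with a plain byte-wise comparison of the whole line, while -V compares each numeric run in turn.
    printf '1\t9\n1\t10\n' | sort -n   # gives 1 10 before 1 9 (lexicographic tie-break)
    printf '1\t9\n1\t10\n' | sort -V   # gives 1 9 before 1 10, as intended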

  12. Well, we have another problem still: the locations are given in GRCh38 coordinates (“Data in the GWAS Catalog is currently mapped to genome assembly GRCh38.p12 and dbSNP Build 151”), but our data is based on hg19 (aka GRCh37). Thankfully, there are tools for converting between different builds of the genome; one of these is UCSC’s LiftOver, which can be used online or downloaded and run locally. The tool expects the format Chr:BPstart-BPend, e.g. 1:12345-12346 for a single SNP at 12345 or 1:12345-12348 for a range covering 3 SNPs. Let’s get our data into that format by isolating just the locations (we’ll set aside dealing with the P-Values) and getting that ‘:’ between them. The format is a little obnoxious, so this will take some Bash magic.
    cut -f1,2 chr_bp_pval_clean_sorted | tr '\t' ':' > locations
    cut -d ':' -f2 locations | xargs printf "%d+1\n" | bc > ends
    for i in $(seq 1 $(wc -l < locations)); do echo chr; done > prefix
    paste -d "" prefix locations | paste - ends | tr '\t' '-' > liftover
    Two small fixes over my first attempt here: wc -l < locations prints a bare count (no filename to trim off), and redirecting the loop’s output once after done, rather than appending with >> inside it, avoids accumulating stray ‘chr’ lines across reruns, which is most likely where the mysterious extra lines were coming from.
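
    A quick sanity check before feeding anything to LiftOver: all the intermediate files should have exactly the same number of lines.
    wc -l locations ends prefix liftover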

    So now everything in our file ‘liftover’ should look something like this:

    chr1:8362616-8362617
    chr1:17669971-17669972
    chr1:24105466-24105467
    chr1:29964421-29964422
    chr1:30052867-30052868
    chr1:34138966-34138967
    chr1:40710794-40710795
    chr1:41374150-41374151
    chr1:43222879-43222880
    ...
    chr22:34102046-34102047
    chr22:44855931-44855932
    chr22:44858015-44858016
    chr22:45566024-45566025
    chr22:48527647-48527648
    

    Alright, that was obnoxious (LiftOver doesn’t have the best error handling…), but we can now pass the file we created to the web server’s LiftOver (or download the tool and run it locally), and what we get out is as follows, which should now be hg19 positions:

    chr1:8422676-8422677
    chr1:17996466-17996467
    chr1:24431956-24431957
    chr1:30437268-30437269
    chr1:30525714-30525715
    chr1:34604567-34604568
    chr1:41176466-41176467
    chr1:41839822-41839823
    chr1:43688550-43688551
    ...
    chr22:34498035-34498036
    chr22:45251811-45251812
    chr22:45253895-45253896
    chr22:45961904-45961905
    chr22:48923459-48923460
    
  13. Sigh… now let’s strip the format back down so we have something we can search for.
    We want only the chromosome number and the position. First we cut using the ‘r’ character as the delimiter and keep the 2nd column (everything to the right of the ‘r’). We then cut again, this time at the ‘-’ character, and keep the first column (everything to the left of the ‘-’).
    cut -d 'r' -f2 hglft_genome_28c5_ce3e20.bed | cut -d '-' -f1 > final_selected_SNPs
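
    If you prefer, the same stripping can be done in a single sed call:
    # drop the leading "chr" and everything from the '-' onwards
    sed -e 's/^chr//' -e 's/-.*$//' hglft_genome_28c5_ce3e20.bed > final_selected_SNPs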

  14. Now we need to locate all these SNPs in the data from our own GWAS, which we have in the file ALL_RES_PC. We will do this by grepping for the unique id formed by the chr:position pair that we now have after the liftover and that we also used in our data. Some of these won’t exist in our data for various reasons (e.g. the SNP was excluded in quality control, or was monomorphic in our data).
    for i in $(cat final_selected_SNPs); do grep -E "\b${i}\b" ALL_RES_PC >> our_SNPs; done
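
    Looping grep over hundreds of ids rereads ALL_RES_PC once per id. A faster equivalent, assuming the chr:position ids appear as whole words in that file, is to hand grep the entire pattern list at once:
    # -F: fixed strings, -w: whole-word matches, -f: read patterns from a file
    grep -wFf final_selected_SNPs ALL_RES_PC > our_SNPs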

  15. Next, let’s pair the lifted-over locations back up with the catalog’s p-value and odds ratio columns. For the rows to line up, we need to cut from the same cleaned, sorted, de-duplicated file the liftover input was built from:
    cut -f3,4 chr_bp_pval_clean_sorted | paste final_selected_SNPs - > gwas_catalog_version

  16. Let’s merge these on the chr:bp column (a toy join example follows this step if the mechanics are unclear):
    # join needs both inputs sorted lexically on the join field (chr:bp):
    sort -k1,1 gwas_catalog_version > gwas_catalog_version_sorted_for_join
    sort -k2,2 our_SNPs > our_sorted_SNPs_sorted_for_join
    # number the catalog rows so duplicates can be spotted after the join:
    nl gwas_catalog_version_sorted_for_join > gwas_catalog_version_sorted_for_join2
    # join on field 2 of each file; uniq -f9 -w4 compares only where the nl number lands given
    # our column layout, dropping rows that duplicate a catalog entry; cut removes that line-number field:
    join -1 2 -2 2 our_sorted_SNPs_sorted_for_join gwas_catalog_version_sorted_for_join2 | sort -gb -k2,2 -k3,3 | uniq -f9 -w 4 | cut -d ' ' -f10 --complement | sort -gb -k2,2 -k3,3 > figured_out
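
    If the join mechanics are opaque, here is a toy example (field 2 of each input is the key, and both inputs are already sorted on it):
    printf 'a 1:100 x\nb 1:200 y\n' > A
    printf '1 1:100 p q\n2 1:300 r s\n' > B
    join -1 2 -2 2 A B
    # prints: 1:100 a x 1 p q
    # i.e. the key first, then the remaining fields of A, then the remaining fields of B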

  17. Finally, collapse the duplicated measurements per SNP and assemble the final table (a toy illustration of the awk collapse follows this step):
    # comma-join the odds ratio/beta values (field 11), and separately the p-values (field 10), of rows sharing a chr:bp key (field 1):
    awk '{if(a[$1]){a[$1]=a[$1]","$11} else { a[$1]=$11}} END {for (i in a) {print i" "a[i]}}' figured_out | sort -gb -k1,1 > odds_beta
    awk '{if(a[$1]){a[$1]=a[$1]","$10} else { a[$1]=$10}} END {for (i in a) {print i" "a[i]}}' figured_out | sort -gb -k1,1 > p_vals
    # keep one copy of the first nine columns per key, join the collapsed lists back on, and sort by chromosome and position:
    cut -d ' ' -f1-9 figured_out | sort -gb -k1,1 | join - p_vals | join - odds_beta | uniq | sort -gb -k2,2 -k3,3 > stone_table
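
    Toy illustration of the awk collapse: rows sharing a key get their values comma-joined, and since for (i in a) returns keys in arbitrary order, the result is sorted afterwards.
    printf 'k1 0.5\nk1 0.7\nk2 0.9\n' | awk '{if(a[$1]){a[$1]=a[$1]","$2} else {a[$1]=$2}} END {for (i in a) print i" "a[i]}'
    # prints k1 0.5,0.7 and k2 0.9, in some order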

  18. Sorry, the Bash got a bit more complicated at the end there, and I’ve been busy; I will add some more explanation of what is going on later.

Kunji
Biophysics, Mathematics, Statistical Genetics, Computer Science.