- Perform single-sample germline variant calling with GATK HaplotypeCaller and the GATK GVCF workflow on exome data
- Perform joint genotype calling on exome data, including additional exomes from 1000 Genomes Project
- Manipulate and filter VCFs to remove artifacts and identify variants of interest
In this module we will use the GATK HaplotypeCaller to call germline variants from "normal" bams. For a more in-depth look, see this excellent GATK tutorial, provided by the Broad Institute.
Note: We are using GATK v4.2.1.0 for this tutorial.
Grab some bam files from the HCC1395 "normal" cell line and save them:
mkdir -p ~/workspace/germline
cd ~/workspace/germline
wget https://storage.googleapis.com/bfx_workshop_tmp/align.tar
tar -xvf align.tar
First, let's run GATK HaplotypeCaller to call germline SNPs and indels. Whenever HaplotypeCaller finds signs of variation it performs a local de novo re-assembly of reads. This improves the accuracy of variant calling, especially in challenging regions. We'll also include an option to generate bam output from HaplotypeCaller so that local reassembly/realignment around called variants can be visualized.
We'll run GATK HaplotypeCaller with the following options:
- `--java-options '-Xmx12g'` tells GATK to use 12GB of memory
- `HaplotypeCaller` specifies the GATK command to run
- `-R` specifies the path to the reference genome
- `-I` specifies the path to the input bam file for which to call variants
- `-O` specifies the path to the output vcf file to write variants to
- `--bam-output` specifies the path to an optional bam file which will store local realignments around variants that HaplotypeCaller used to make calls
- `-L` is used to limit calling to specific regions (e.g. just chr17). This can get unwieldy when the list of regions grows long, so we often use an environment variable to handle it. In this case, we've pre-loaded one called `$GATK_REGIONS`. To verify its contents, you can run `echo $GATK_REGIONS`
All right, let's run this tool:
#set an environment variable containing the regions of the genome we want to call on
export GATK_REGIONS='-L chr17'
#Call variants for exome data
gatk --java-options '-Xmx12g' HaplotypeCaller -R /workspace/inputs/references/genome/ref_genome.fa -I align/Exome_Norm_sorted_mrkdup_bqsr.bam -O /workspace/germline/Exome_Norm_HC_calls.vcf --bam-output /workspace/germline/Exome_Norm_HC_out.bam $GATK_REGIONS
Calling variants for the WGS data is going to take a while. Let's run it in the background and pipe the results to files so that we can keep working at the terminal:
gatk --java-options '-Xmx7g' HaplotypeCaller -R /workspace/inputs/references/genome/ref_genome.fa -I align/WGS_Norm_merged_sorted_mrkdup_bqsr.bam -O /workspace/germline/WGS_Norm_HC_calls.vcf --bam-output /workspace/germline/WGS_Norm_HC_out.bam $GATK_REGIONS 2>hc_wgs.err >hc_wgs.out &
We're also going to want to create a GVCF (more on that later on). Let's go ahead and get that chugging along in the background too:
gatk --java-options '-Xmx7g' HaplotypeCaller -ERC GVCF -R /workspace/inputs/references/genome/ref_genome.fa -I align/Exome_Norm_sorted_mrkdup_bqsr.bam -O /workspace/germline/Exome_Norm_HC_calls.g.vcf --bam-output /workspace/germline/Exome_Norm_HC_GVCF_out.bam $GATK_REGIONS 2>hc_gvcf.err >hc_gvcf.out &
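With both of those runs now chugging along in the background, a few generic shell commands are handy for keeping an eye on them. A minimal sketch, using a `sleep` as a stand-in for the long-running HaplotypeCaller processes so the example is self-contained:

```shell
# toy background job standing in for the long-running HaplotypeCaller processes
sleep 2 &

# list the background jobs owned by this shell
jobs

# block until every background job has finished
wait
echo "all background jobs done"
```

For the real runs, `tail -f hc_wgs.err` is a convenient way to follow a log file as it grows, and `jobs` will show when each run finishes.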
Let's start by looking through `Exome_Norm_HC_calls.vcf`:
- How many variants were called?
- For the first variant in the VCF file, how many reads are listed as supporting the reference allele and how many support the variant allele?
- Start a new IGV session and load the following bam files: `align/Exome_Norm_sorted_mrkdup_bqsr.bam` and `Exome_Norm_HC_out.bam`
- Navigate to the location of the first variant in the file. Why doesn't the number of reads match the allelic depth from the VCF?
- Try right-clicking and choosing `Shade Alignments > mapping quality low`. Does this help explain the discrepancy?
- Why is there a difference between the two tracks?
- Set your alignment shading back to "none"
- Navigate to `chr17:76286974-76287203`.
- Some variant callers might report `chr17:76287089:C>T` as an event. Would that be correct?
- What information do the soft-clipped bases give you?
- What happens if you zoom out to a 500bp window or so? Why does the bwa-mem alignment have more reads?
- View the alignments in a "squished" format to see more reads at a time. What's your interpretation of this event?
- Find this event in your VCF file. Does the genotype field (GT) match your expectation?
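The first two questions above can be answered straight from the command line. Here's a minimal sketch using standard Unix tools on an inline, fabricated VCF fragment (the positions, depths, and genotypes below are invented for illustration; on the real file, point the same commands at `Exome_Norm_HC_calls.vcf`):

```shell
# two fabricated VCF data lines with a GT:AD:DP FORMAT, similar in shape to HaplotypeCaller output
vcf=$(printf 'chr17\t1234\t.\tA\tG\t50.0\t.\tDP=30\tGT:AD:DP\t0/1:18,12:30\nchr17\t5678\t.\tC\tT\t99.0\t.\tDP=25\tGT:AD:DP\t1/1:0,25:25')

# count variant records by excluding header lines (real file: grep -vc '^#' Exome_Norm_HC_calls.vcf)
printf '%s\n' "$vcf" | grep -vc '^#'
# → 2

# AD in the sample column holds comma-separated ref,alt read depths; split FORMAT subfields on ':'
printf '%s\n' "$vcf" | head -1 | awk -F'\t' '{split($10, f, ":"); print "ref,alt depths:", f[2]}'
# → ref,alt depths: 18,12
```

The same `split($10, ...)` pattern generalizes to any FORMAT subfield, since the FORMAT column (field 9) declares the order of the colon-separated values in each sample column.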
An alternate (and GATK-recommended) method is to use the GVCF workflow. In this mode, HaplotypeCaller runs per-sample to generate an intermediate GVCF. These intermediate GVCFs can then be used with the GenotypeGVCFs command for joint genotyping of multiple samples in a very efficient way. Joint genotyping has several advantages:
- In joint genotyping, variants are analyzed across all samples simultaneously. This ensures that if any sample in the study population has variation at a specific site then the genotype for all samples will be determined at that site.
- It allows more sensitive detection of genotype calls at sites where one sample might have lower coverage but other samples have a confident variant at that position.
- Joint calling is also necessary to produce data needed for VQSR filtering.
While joint genotyping can be performed directly with HaplotypeCaller (by specifying multiple -I parameters), it is less efficient and less scalable: each time a new sample is added, the entire (progressively slower) variant calling analysis must be repeated. With the GVCF workflow, new samples can be run individually and then quickly joint-genotyped using the GenotypeGVCFs command.
GATK HaplotypeCaller is run with all of the same options as above, except for one addition: `-ERC GVCF` specifies use of the GVCF workflow, producing a GVCF instead of a VCF.
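Before digging into the output, it helps to know what to expect inside a GVCF. Unlike a plain VCF, a GVCF has records covering every position: reference-confidence blocks carry a symbolic `<NON_REF>` ALT allele with an `END=` tag spanning the block, while variant records list real alternate alleles plus `<NON_REF>`. The two data lines below are simplified, fabricated illustrations of those two record types:

```shell
# fabricated GVCF fragment: one reference-confidence block, one variant record
gvcf=$(printf 'chr17\t100\t.\tA\t<NON_REF>\t.\t.\tEND=199\tGT:DP:GQ\t0/0:32:99\nchr17\t200\t.\tC\tT,<NON_REF>\t88.6\t.\tDP=30\tGT:AD:DP\t0/1:14,16,0:30')

# reference blocks are the records whose ALT column is only <NON_REF>
printf '%s\n' "$gvcf" | awk -F'\t' '$5 == "<NON_REF>"' | wc -l
# → 1
```

Once your run finishes, comparing `grep -vc '^#'` counts between `Exome_Norm_HC_calls.vcf` and `Exome_Norm_HC_calls.g.vcf` makes the difference obvious: the GVCF is far larger because of all those reference blocks.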
You already ran this command above, so check in on your log file and make sure that it has completed:
less hc_gvcf.err
Including additional samples and doing joint genotyping can improve the overall accuracy of variant calling and filtering. In this example, we only have one "normal" sample, our HCC1395BL cell line.
To beef up our sample numbers, let's find some public exome sequencing data to use. The 1000 Genomes Project is an international effort to create a set of publicly available sequence data for a large cohort of ethnically diverse individuals. Recall that the cell line being used in this tutorial (HCC1395) was derived from a 43-year-old Caucasian female (by a research group in Dallas, Texas). Let's see if we can find some suitable "matching" samples in the 1000 Genomes data.
As you go looking for this data, keep in mind the FAIR principles that we talked about earlier in the course (Findable, Accessible, Interoperable, Reproducible)
- Hop onto the 1000 Genomes Samples page and scroll through the list of samples available.
- In a real study, you'd spend a bunch of time thinking about which samples would be appropriate, to make sure that you're finding disease-related variants and not just population markers. In this case, we'll just assume that a few samples from the British in England and Scotland population will be a reasonable match.
- Filter by population -> "British in England and Scotland"
- Filter by technology -> Exome
- Filter by data collection -> "1000 Genomes on GRCh38"
- Download the list to get sample details (e.g., save as `igsr_samples_GBR.tsv`)
- Open up this file and take a look. Does this have enough data to get at the variant calls? (Findable!)
- Click on "Data Collections" at the top
- Choose "1000 Genomes on GRCh38"
- In the Populations box, choose "British in England and Scotland"
- Scroll down to the bottom and filter the file list to Exome and Alignments
- 'Download the list' (e.g., save as `igsr_GBR_GRCh38_exome_alignment.tsv`).
With this list, we could download the already aligned exome data for a bunch of 1KGP individuals in cram format. (Accessible!) How can we tell whether these data files were generated in a way compatible with the data we've generated? Spend a minute or two looking before you check the answer below.
Answer
This takes a little digging. On the 1000 Genomes site, go to the top, then click through to Help -> FAQ -> "What methods were used for generating alignments?". That directs you back to the [data collection page](https://www.internationalgenome.org/data-portal/data-collection), and for this "1000 Genomes on GRCh38" study, to the [latest 2019 publication here](https://wellcomeopenresearch.org/articles/4-50).
That was quite a journey, but if you scroll down, you'll notice that they do have excellent documentation of the tools, input files, and parameters used. (Reproducible!)
With this info, we can determine that these cram files were created in a generally compatible way with the analysis done so far in this tutorial. (Interoperable!)
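Besides the published documentation, alignment files record their own provenance: every BAM/CRAM header contains `@PG` lines listing each program, its version, and the exact command line used, viewable with `samtools view -H`. A self-contained sketch using a fabricated SAM header as a stand-in (the tool versions and commands shown are invented for illustration):

```shell
# fabricated SAM header fragment; a real one comes from: samtools view -H sample.cram
header=$(printf '@HD\tVN:1.6\tSO:coordinate\n@PG\tID:bwa\tPN:bwa\tVN:0.7.17\tCL:bwa mem ref.fa reads.fq\n@PG\tID:markdup\tPN:samtools\tVN:1.13\tCL:samtools markdup -')

# the @PG lines form the provenance trail: program name (PN), version (VN), command line (CL)
printf '%s\n' "$header" | grep '^@PG'
```

Checking the `@PG` lines of a downloaded cram against your own pipeline's tools and versions is a quick, file-level sanity check to go along with the published methods.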
We still need to consider that there could be batch effects related to potentially different sample preparation, library construction, exome capture reagent and protocol, sequencing pipeline, etc. These are complex topics that need to be built into the statistical models, and are a bit beyond the scope of what we're going to talk about in this course!
For expediency, we downloaded exome alignments for 5 females of GBR descent, subset them to the same regions as our cell line data, and then performed germline variant calling using the GATK HaplotypeCaller GVCF workflow commands, as above. The details for this can be found on the Developer's Notes page.
Go ahead and download these VCFs and their index files.
cd /workspace/germline/
wget http://genomedata.org/pmbio-workshop/references/1KGP/gvcfs/chr17/HG00099_HC_calls.g.vcf
wget http://genomedata.org/pmbio-workshop/references/1KGP/gvcfs/chr17/HG00099_HC_calls.g.vcf.idx
wget http://genomedata.org/pmbio-workshop/references/1KGP/gvcfs/chr17/HG00102_HC_calls.g.vcf
wget http://genomedata.org/pmbio-workshop/references/1KGP/gvcfs/chr17/HG00102_HC_calls.g.vcf.idx
wget http://genomedata.org/pmbio-workshop/references/1KGP/gvcfs/chr17/HG00104_HC_calls.g.vcf
wget http://genomedata.org/pmbio-workshop/references/1KGP/gvcfs/chr17/HG00104_HC_calls.g.vcf.idx
wget http://genomedata.org/pmbio-workshop/references/1KGP/gvcfs/chr17/HG00150_HC_calls.g.vcf
wget http://genomedata.org/pmbio-workshop/references/1KGP/gvcfs/chr17/HG00150_HC_calls.g.vcf.idx
wget http://genomedata.org/pmbio-workshop/references/1KGP/gvcfs/chr17/HG00158_HC_calls.g.vcf
wget http://genomedata.org/pmbio-workshop/references/1KGP/gvcfs/chr17/HG00158_HC_calls.g.vcf.idx
To prepare for joint genotyping, we'll combine our HCC1395 exome normal GVCF together with these five using the `CombineGVCFs` command.
GATK CombineGVCFs has the following new options:
- `-V` [multiple] specifies the path to each of the input g.vcf files
- `-O` specifies the path to the output combined g.vcf file
gatk --java-options '-Xmx8g' CombineGVCFs -R /workspace/inputs/references/genome/ref_genome.fa -V /workspace/germline/Exome_Norm_HC_calls.g.vcf -V /workspace/germline/HG00099_HC_calls.g.vcf -V /workspace/germline/HG00102_HC_calls.g.vcf -V /workspace/germline/HG00104_HC_calls.g.vcf -V /workspace/germline/HG00150_HC_calls.g.vcf -V /workspace/germline/HG00158_HC_calls.g.vcf -O /workspace/germline/Exome_Norm_1KGP_HC_calls_combined.g.vcf $GATK_REGIONS
This'll take about two minutes or so.
Next, perform joint genotyping with the `GenotypeGVCFs` command. This command is run with the following new options:

- `-V` specifies the path to the combined g.vcf file
- `-O` specifies the path to the output joint-genotyped vcf file
gatk --java-options '-Xmx12g' GenotypeGVCFs -R /workspace/inputs/references/genome/ref_genome.fa -V /workspace/germline/Exome_Norm_1KGP_HC_calls_combined.g.vcf -O /workspace/germline/Exome_GGVCFs_jointcalls.vcf $GATK_REGIONS
- Look at this joint genotyping VCF (`Exome_GGVCFs_jointcalls.vcf`) and compare it to the "standard" VCF you created above (`Exome_Norm_HC_calls.vcf`). What are the big differences?
- Load the VCF in IGV and take a look.
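One structural difference is easy to quantify: in a VCF, every column after FORMAT on the `#CHROM` header line is a genotype column for one sample, so the joint VCF carries six of them instead of one. A sketch using fabricated header lines (the first sample name is invented; the rest are the 1KGP IDs downloaded above):

```shell
# fabricated #CHROM header lines for the single-sample and joint VCFs
single=$(printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHCC1395_Norm')
joint=$(printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHCC1395_Norm\tHG00099\tHG00102\tHG00104\tHG00150\tHG00158')

# the first 9 columns are fixed by the VCF spec; everything after is one sample each
printf '%s\n' "$single" | awk -F'\t' '{print NF - 9, "sample column(s)"}'
printf '%s\n' "$joint"  | awk -F'\t' '{print NF - 9, "sample column(s)"}'
```

On the real files, `grep -m1 '^#CHROM' Exome_GGVCFs_jointcalls.vcf` shows the actual sample names.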
Let's clean up this directory a bit to make our life easier:
mkdir 1kgenomes
mv HG0* 1kgenomes
Check in on the WGS data - look at the log/err file and see if it's done. How long did it take? How would this scale if we were running it on all chromosomes?
Let's clean up the WGS data as well to keep our workspace tidy.
mkdir wgs
mv WGS* hc_wgs* wgs/
The raw output of GATK HaplotypeCaller will include many variants with varying degrees of quality. Doing additional filtering is generally a good idea. The details of VQSR and hard-filtering strategies are covered in some detail on the PMBio variant filtering lesson and the GATK site: How to Filter variants either with VQSR or by hard-filtering
In the interest of time, we're going to set up a bash script to breeze through these steps. On your own data, you probably would be following a pre-designed workflow anyway! Take a look through the steps of this script before running it:
#this line provides several safeguards that instruct the script to fail if things go wrong (if any command fails, if any variables are undefined, etc.)
set -euo pipefail
#split the data into SNPs and INDELs to be filtered separately:
cd /workspace/germline/
gatk --java-options '-Xmx12g' SelectVariants -R /workspace/inputs/references/genome/ref_genome.fa -V /workspace/germline/Exome_Norm_HC_calls.vcf -select-type SNP -O /workspace/germline/Exome_Norm_HC_calls.snps.vcf
gatk --java-options '-Xmx12g' SelectVariants -R /workspace/inputs/references/genome/ref_genome.fa -V /workspace/germline/Exome_Norm_HC_calls.vcf -select-type INDEL -O /workspace/germline/Exome_Norm_HC_calls.indels.vcf
#use hard filtering (specific cut-offs) to label variants
gatk --java-options '-Xmx12g' VariantFiltration -R /workspace/inputs/references/genome/ref_genome.fa -V /workspace/germline/Exome_Norm_HC_calls.snps.vcf --filter-expression "QD < 2.0" --filter-name "QD_lt_2" --filter-expression "FS > 60.0" --filter-name "FS_gt_60" --filter-expression "MQ < 40.0" --filter-name "MQ_lt_40" --filter-expression "SOR > 3.0" --filter-name "SOR_gt_3" -O /workspace/germline/Exome_Norm_HC_calls.snps.filtered.vcf
gatk --java-options '-Xmx12g' VariantFiltration -R /workspace/inputs/references/genome/ref_genome.fa -V /workspace/germline/Exome_Norm_HC_calls.indels.vcf --filter-expression "QD < 2.0" --filter-name "QD_lt_2" --filter-expression "FS > 200.0" --filter-name "FS_gt_200" --filter-expression "SOR > 10.0" --filter-name "SOR_gt_10" -O /workspace/germline/Exome_Norm_HC_calls.indels.filtered.vcf
# Merge the filtered SNP and INDEL vcfs back together
gatk --java-options '-Xmx12g' MergeVcfs -I /workspace/germline/Exome_Norm_HC_calls.snps.filtered.vcf -I /workspace/germline/Exome_Norm_HC_calls.indels.filtered.vcf -O /workspace/germline/Exome_Norm_HC_calls.filtered.vcf
# Extract PASSing variants only into a separate file
gatk --java-options '-Xmx12g' SelectVariants -R /workspace/inputs/references/genome/ref_genome.fa -V /workspace/germline/Exome_Norm_HC_calls.filtered.vcf -O /workspace/germline/Exome_Norm_HC_calls.filtered.PASS.vcf --exclude-filtered
Copy and paste these into a script called `filter.sh`, then run it with `bash filter.sh`.
NOTE: There are a number of MIXED type variants (multi-allelic with both SNP and INDEL alleles) that are currently dropped by the above workflow (which selects only SNPs and INDELs). In the future we could consider converting these with `gatk LeftAlignAndTrimVariants --split-multi-allelics`.
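As an aside, the safeguards provided by `set -euo pipefail` at the top of the script are easy to demonstrate with a toy example:

```shell
# without safeguards, a failing command is silently ignored and the script continues
bash -c 'false; echo "kept going"'

# set -e aborts at the first failing command, so "kept going" never prints
bash -c 'set -e; false; echo "kept going"' || echo "aborted early"

# set -u turns any use of an undefined variable into a hard error
bash -c 'set -u; echo "$NO_SUCH_VAR"' 2>/dev/null || echo "caught undefined variable"

# set -o pipefail makes a pipeline fail if ANY stage fails, not just the last one
bash -c 'set -o pipefail; false | true' || echo "caught pipeline failure"
```

For a multi-step pipeline like `filter.sh`, this matters: without these options, a failed `SelectVariants` step would be silently skipped and the later steps would operate on stale or missing files.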
- Open IGV and load your filtered single-sample VCF file (`/workspace/germline/Exome_Norm_HC_calls.filtered.PASS.vcf`). Go ahead and remove your non-filtered VCF as well.
- Navigate to the TP53 gene, in which four variants have been called. Inspect each of them, and answer whether the site looks like a solid call to you, and why or why not.
- Navigate to `chr17:7,579,249-7,579,790`. Do you think that GATK got the call right here?
- What can we say about the phase of these three variants? `chr17:5,381,336-5,381,470`
- How about these two? `chr17:82,870,385-82,870,452`
- Or these two? `chr17:7,726,842-7,727,112`
The script above uses some pretty heavy-duty GATK commands to filter VCFs. In theory, if we wanted to do this in a more lightweight way, we could write a one-liner with grep or use a complicated regular expression to get the fields we care about, but we strongly encourage you not to do that - it's too easy to screw up, and there are better ways!
One additional method to be aware of is `bcftools`, a powerful tool for manipulating VCF files.
As a quick example, let's subset one of our VCFs to keep only calls with a mapping quality (MQ) above 55:
bcftools view --include 'MQ>55' Exome_Norm_HC_calls.filtered.PASS.vcf >high_mq.vcf
Now what if we wanted to combine that with accessing an attribute inside the INFO field?
bcftools view --include 'INFO/DP>=100 & MQ>55' Exome_Norm_HC_calls.filtered.PASS.vcf >high_mq_high_dp.vcf
How many variants did those produce compared to the original?
for file in Exome_Norm_HC_calls.filtered.PASS.vcf high_mq.vcf high_mq_high_dp.vcf;do grep -v "^#" $file | wc -l;done
How about our multi-sample VCF? What if we wanted to find a list of sites that were homozygous for the variant allele in all samples? One way might be to convert the VCF to a matrix using bcftools:
bcftools query -Hf '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%GT]\n' Exome_GGVCFs_jointcalls.vcf > genotype_matrix.tsv
Using that file as input, can you search through and find the sites that are homozygous variant in all samples?
Answer
There are many possible ways to do this, but one simple way might just be to iteratively construct some searches:

```
grep "1/1" genotype_matrix.tsv | less
```
Okay, that gets us sites where at least one sample is homozygous for the variant allele. Let's filter out genotypes we don't want:
grep "1/1" genotype_matrix.tsv | grep -v "0/0" | grep -v "0/1" | grep -v "\./\." | less
That's close, but also includes some sites with second and third alleles. Those sites are the only lines that contain a comma, so we can filter them out too:
grep "1/1" genotype_matrix.tsv | grep -v "0/0" | grep -v "0/1" | grep -v "\./\." | grep -v "," | less
That gets us there, but honestly, this query is complicated enough that we should probably start writing some actual code to parse the VCF - more on that later!
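As a first step toward that "actual code", the whole query fits in one awk command that checks every genotype column explicitly, which is much harder to get subtly wrong than a chain of greps. A sketch on a fabricated three-sample matrix (on the real file, feed in `genotype_matrix.tsv` instead, and skip the `-H` header line, e.g. with a leading `/^#/ {next}` rule):

```shell
# fabricated genotype matrix in the same shape as the bcftools query output:
# CHROM POS ID REF ALT, followed by one GT column per sample
matrix=$(printf 'chr17\t100\t.\tA\tG\t1/1\t1/1\t1/1\nchr17\t200\t.\tC\tT\t0/1\t1/1\t1/1\nchr17\t300\t.\tG\tA,C\t1/1\t1/1\t2/2')

# print a line only if every column after the fifth is exactly "1/1"
printf '%s\n' "$matrix" | awk -F'\t' '{
    for (i = 6; i <= NF; i++) if ($i != "1/1") next
    print
}'
```

Only the first fabricated site survives here. Note that the exact-match test also excludes the `2/2` multi-allelic genotype automatically, with no need for the separate comma filter.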
These bcftools commands are nice because they adhere closely to the Unix philosophy - small tools that write to standard output and can be piped into additional downstream steps. That's convenient, especially for ad-hoc work and exploring data.
Adapted from the PMBio Germline calling exercises - see full sources and citations there.