Annotation of genetic variants

4 minute read


Tools such as ANNOVAR, Variant Effect Predictor (VEP) or SnpEff annotate genetic variants (SNPs, INDELS, CNVs etc) present in VCF file. These tools integrate the annotations within the INFO column of the original VCF file.

Variant Call Format or VCF is a text file with multiple lines of meta-information, a header line, and data rows each containing information about a loci or sequence variant in the genome. Aside from the chromosomal location and observed alleles, these loci are basically anonymous. VCF is the primary (and only well-supported) format used by the GATK for variant calls. A GVCF or Genomic VCF is a kind of VCF, so the basic format specification is the same as for a regular VCF, but a Genomic VCF contains extra information.

The Ensembl Variant Effect Predictor (VEP) annotates
• location of the variants on genome
• known variants based on database matching
• gene and transcript names affected by the variants
• consequence of variation on the protein sequence (stop gain/lost, missense, frameshift)

STEP 1: Install Docker Engine

“Docker Engine is available on a variety of Linux platforms, macOS and Windows 10 through Docker Desktop, and as a static binary installation”.

STEP 2: Clone or Pull the VEP docker image

Create a directory on your machine and make sure that the created directory on your machine has read and write access granted so that the docker container can write in the directory (VEP output).

mkdir $HOME/vep_data
chmod a+rwx $HOME/vep_data    
docker pull ensemblorg/ensembl-vep   

Pre-configure the volume vep_data for the container; this is required if you wish to download data (e.g. cache files) that persists across sessions. The following command mounts the volume $HOME/vep_data into /opt/vep/.vep in the container. If you copy your data to $HOME/vep_data directory, you will see that data in /opt/vep/.vep directory of the container.

-t: Allocate a pseudo-TTY
-i: Interactive / Keep STDIN open even if not attached
-v: Bind mount a volume
docker_image is ensemblorg/ensembl-vep

docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep  


The above command docker run, allows you to enter into the container using an interactive terminal. Once you are inside do: ls -lsa to list the files

vep@56d4ccaf3e34:~/src/ensembl-vep$ ls -lsa
total 188
 4 drwxr-xr-x  1 vep  vep   4096 Apr  7 15:30 .
 4 drwxr-xr-x  1 root root  4096 Apr  7 15:26 ..
 4 drwxr-xr-x 45 vep  vep   4096 Apr  7 15:27 Bio
20 -rwxr-xr-x  1 vep  vep  16698 Apr  7 15:19
 4 drwxr-xr-x  2 vep  vep   4096 Apr  7 15:19 examples
20 -rwxr-xr-x  1 vep  vep  16620 Apr  7 15:19 filter_vep
 8 -rwxr-xr-x  1 vep  vep   4880 Apr  7 15:19 haplo
56 -rwxr-xr-x  1 vep  vep  56607 Apr  7 15:19
12 -rw-r--r--  1 vep  vep  11478 Apr  7 15:19 LICENSE
 4 drwxr-xr-x  3 vep  vep   4096 Apr  7 15:19 modules
16 -rw-r--r--  1 vep  vep  12906 Apr  7 15:19
 4 drwxr-xr-x  2 vep  vep   4096 Apr  7 15:19 validator
 4 -rw-r--r--  1 vep  vep    128 Apr  7 15:29 variant_effect_output.txt_warnings.txt
 8 -rwxr-xr-x  1 vep  vep   5079 Apr  7 15:19 variant_recoder
16 -rwxr-xr-x  1 vep  vep  14462 Apr  7 15:19 vep
 4 drwxr-xr-x  2 vep  vep   4096 Apr  7 15:27 .version

Depending on Plugin, you need to download and prepare data for a plugin.
For example dbNSFP plugin needs the following preprocessing of downloaded file,
Programs like tabix, unzip, wget and zcat are available in container at /opt/vep/.vep/.

zcat dbNSFP4.1a_variant.chr1.gz | head -n1 > h
zgrep -h -v ^#chr dbNSFP4.1a_variant.chr* | sort -T /path/to/tmp_folder -k1,1 -k2,2n - | cat h - | bgzip -c > dbNSFP4.1a_grch38.gz
tabix -s 1 -b 2 -e 2 dbNSFP4.1a_grch38.gz

In a separate terminal, run docker ps command to list all running containers. To see all the containers both stopped and running use -a flag.

docker ps -a

STEP 3: Cache and Plugins installation

Use the following commands to set up the cache and corresponding FASTA for human GRCh38 genome.

docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep perl -a cf -s homo_sapiens -y GRCh38

To install all the available plugins

docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep perl -a cfp -s homo_sapiens -y GRCh38 -g all

STEP 4: Create file

# Example of VEP command line:
vep --cache --offline --format vcf --vcf --force_overwrite \
	--dir_cache /opt/vep/.vep/ \
	--dir_plugins /opt/vep/.vep/Plugins/ \
	--input_file /opt/vep/.vep/input/Dystonia.gatk4.vqsr_variants.g.vcf.gz \
	--output_file /opt/vep/.vep/output/Dystonia.gatk4.vqsr_vep.g.vcf \
	--plugin dbNSFP,/opt/vep/.vep/Plugins/dbNSFP4.1a_grch38.gz,ALL

STEP 5: Copy or transfer the input data i.e vcf file

mkdir -p /Users/adinasa/vep_data/{input,output,script}
cp Dystonia.gatk4.vqsr_variants.g.vcf.gz ~/vep_data/input/
cp /Users/adinasa/vep_data/script
cp dbNSFP4.1a_grch38.gz /Users/adinasa/vep_data/Plugins 

STEP 6: Run the script

Finally, run the script inside container.

docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep

Starts a bash session

vep@88d2b95cfd6a:~/src/ensembl-vep$ cd /opt/vep/.vep/script/