Annotation of genetic variants
Tools such as ANNOVAR, Variant Effect Predictor (VEP) and SnpEff annotate genetic variants (SNPs, indels, CNVs, etc.) present in a VCF file. These tools write the annotations into the INFO column of the original VCF file.
Variant Call Format (VCF) is a text file format with multiple lines of meta-information, a header line, and data rows, each containing information about a locus or sequence variant in the genome. Aside from the chromosomal location and observed alleles, these loci are essentially anonymous. VCF is the primary (and only well-supported) format used by GATK for variant calls. A GVCF, or Genomic VCF, is a kind of VCF, so the basic format specification is the same as for a regular VCF, but a Genomic VCF contains extra information: it also has records for sites where no variant was called.
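To make the three parts of a VCF concrete, the sketch below writes a tiny VCF to a temporary file and counts its meta-information lines, header line and data rows. The two records are invented for illustration and are not real variant calls.

```shell
#!/bin/sh
# Build a tiny illustrative VCF; the records are invented for this example.
tmp_vcf=$(mktemp)
printf '%s\n' \
  '##fileformat=VCFv4.2' \
  '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">' \
  > "$tmp_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$tmp_vcf"
printf 'chr1\t12345\t.\tA\tG\t50\tPASS\tDP=20\n' >> "$tmp_vcf"
printf 'chr2\t67890\t.\tCT\tC\t99\tPASS\tDP=35\n' >> "$tmp_vcf"

# Meta-information lines start with "##", the header line with "#CHROM",
# and every remaining line is a data row.
n_meta=$(grep -c '^##' "$tmp_vcf")
n_header=$(grep -c '^#CHROM' "$tmp_vcf")
n_data=$(grep -vc '^#' "$tmp_vcf")
echo "meta=$n_meta header=$n_header data=$n_data"

# Print location and alleles (CHROM, POS, REF, ALT) for each data row.
alleles=$(awk -F'\t' '!/^#/ {print $1 ":" $2 " " $4 ">" $5}' "$tmp_vcf")
echo "$alleles"
rm -f "$tmp_vcf"
```

The location and alleles printed at the end are all that a raw data row says about a variant, which is why an annotation step is needed.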
The Ensembl Variant Effect Predictor (VEP) annotates:
• location of the variants on the genome
• known variants based on database matching
• gene and transcript names affected by the variants
• consequence of the variation on the protein sequence (stop gained/lost, missense, frameshift)
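When VEP writes VCF output, these annotations go into the INFO column under the CSQ key: one comma-separated entry per affected transcript or feature, with pipe-separated fields inside each entry. The sketch below pulls the gene symbol and consequence out of such a value. The INFO string, the gene names and the four-field layout are simplified assumptions; real output has many more fields, and their order is listed in the "Format:" text of the ##INFO=<ID=CSQ,...> header line.

```shell
#!/bin/sh
# Illustrative INFO value only: real VEP output has many more CSQ fields, and
# the field order must be read from the ##INFO=<ID=CSQ,...> header line.
info='DP=20;CSQ=G|missense_variant|MODERATE|GENE1,G|upstream_gene_variant|MODIFIER|GENE2'

# Extract the CSQ value from the semicolon-separated INFO column.
csq=$(printf '%s\n' "$info" | tr ';' '\n' | sed -n 's/^CSQ=//p')

# Split the entries on commas, then print gene symbol and consequence
# (columns 4 and 2 in this simplified layout).
summary=$(printf '%s\n' "$csq" | tr ',' '\n' | awk -F'|' '{print $4 "\t" $2}')
printf '%s\n' "$summary"
```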
STEP 1: Install Docker Engine
“Docker Engine is available on a variety of Linux platforms, macOS and Windows 10 through Docker Desktop, and as a static binary installation”.
STEP 2: Clone or Pull the VEP docker image
Create a directory on your machine and give it read and write access for all users, so that the Docker container can write the VEP output into it.
mkdir $HOME/vep_data
chmod a+rwx $HOME/vep_data
docker pull ensemblorg/ensembl-vep
Pre-configure the volume vep_data for the container; this is required if you wish to download data (e.g. cache files) that persists across sessions. The following command mounts the volume $HOME/vep_data into /opt/vep/.vep in the container. If you copy your data to the $HOME/vep_data directory, you will see that data in the /opt/vep/.vep directory of the container.
• -t : Allocate a pseudo-TTY
• -i : Interactive / Keep STDIN open even if not attached
• -v : Bind mount a volume
• docker_image : ensemblorg/ensembl-vep
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep
exit
The docker run command above opens an interactive terminal inside the container. Once you are inside, run ls -lsa to list the files:
vep@56d4ccaf3e34:~/src/ensembl-vep$ ls -lsa
total 188
4 drwxr-xr-x 1 vep vep 4096 Apr 7 15:30 .
4 drwxr-xr-x 1 root root 4096 Apr 7 15:26 ..
4 drwxr-xr-x 45 vep vep 4096 Apr 7 15:27 Bio
20 -rwxr-xr-x 1 vep vep 16698 Apr 7 15:19 convert_cache.pl
4 drwxr-xr-x 2 vep vep 4096 Apr 7 15:19 examples
20 -rwxr-xr-x 1 vep vep 16620 Apr 7 15:19 filter_vep
8 -rwxr-xr-x 1 vep vep 4880 Apr 7 15:19 haplo
56 -rwxr-xr-x 1 vep vep 56607 Apr 7 15:19 INSTALL.pl
12 -rw-r--r-- 1 vep vep 11478 Apr 7 15:19 LICENSE
4 drwxr-xr-x 3 vep vep 4096 Apr 7 15:19 modules
16 -rw-r--r-- 1 vep vep 12906 Apr 7 15:19 README.md
4 drwxr-xr-x 2 vep vep 4096 Apr 7 15:19 validator
4 -rw-r--r-- 1 vep vep 128 Apr 7 15:29 variant_effect_output.txt_warnings.txt
8 -rwxr-xr-x 1 vep vep 5079 Apr 7 15:19 variant_recoder
16 -rwxr-xr-x 1 vep vep 14462 Apr 7 15:19 vep
4 drwxr-xr-x 2 vep vep 4096 Apr 7 15:27 .version
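Because of the bind mount, every file under $HOME/vep_data on the host has a counterpart under /opt/vep/.vep in the container, and the container path is the one VEP options must use. A minimal sketch of that mapping follows; to_container_path is a hypothetical helper written for this post, not part of Docker or VEP, and sample.vcf.gz is a placeholder file name.

```shell
#!/bin/sh
# Host directory and mount point as used in the docker run command.
HOST_DIR="${HOME}/vep_data"
CONTAINER_DIR="/opt/vep/.vep"

# to_container_path (hypothetical helper): rewrite a host path under HOST_DIR
# to the path the same file has inside the container.
to_container_path() {
  case "$1" in
    "$HOST_DIR"/*)
      rel=${1#"$HOST_DIR"/}            # strip the host prefix
      printf '%s\n' "$CONTAINER_DIR/$rel"
      ;;
    "$HOST_DIR")
      printf '%s\n' "$CONTAINER_DIR"
      ;;
    *)
      echo "not under $HOST_DIR: $1" >&2
      return 1
      ;;
  esac
}

# Placeholder file name, used only to show the translation.
to_container_path "$HOST_DIR/input/sample.vcf.gz"
```

Paths outside the mounted directory are rejected because the container cannot see them at all.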
Depending on the plugin, you may need to download and prepare data for it. For example, the dbNSFP plugin needs the following preprocessing of the downloaded file. Programs such as tabix, unzip, wget and zcat are available in the container at /opt/vep/.vep/.
wget ftp://dbnsfp:dbnsfp@dbnsfp.softgenetics.com/dbNSFP4.1a.zip
unzip dbNSFP4.1a.zip
zcat dbNSFP4.1a_variant.chr1.gz | head -n1 > h
zgrep -h -v ^#chr dbNSFP4.1a_variant.chr* | sort -T /path/to/tmp_folder -k1,1 -k2,2n - | cat h - | bgzip -c > dbNSFP4.1a_grch38.gz
tabix -s 1 -b 2 -e 2 dbNSFP4.1a_grch38.gz
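The preprocessing above saves one header line, strips the header from every per-chromosome file, sorts the rows by chromosome and position, and puts the header back on top. The sketch below runs the same steps on two invented toy files so the effect of the sort keys is visible. As assumptions made only so it runs anywhere, gzip stands in for bgzip, gzip -dc piped into grep stands in for zgrep, and the tabix step is left out because tabix needs real bgzip output.

```shell
#!/bin/sh
# Toy stand-ins for the per-chromosome dbNSFP files; the contents are invented.
work=$(mktemp -d)
printf '#chr\tpos(1-based)\tref\talt\n2\t300\tA\tG\n2\t50\tC\tT\n' \
  | gzip -c > "$work/toy_variant.chr2.gz"
printf '#chr\tpos(1-based)\tref\talt\n1\t900\tG\tA\n1\t100\tT\tC\n' \
  | gzip -c > "$work/toy_variant.chr1.gz"

# Keep one copy of the header line.
gzip -dc "$work/toy_variant.chr1.gz" | head -n1 > "$work/h"

# Drop the header from every chunk, sort by chromosome (as text) and then by
# position (numerically), and put the saved header back on top.
gzip -dc "$work"/toy_variant.chr*.gz | grep -v '^#chr' \
  | sort -T "$work" -k1,1 -k2,2n \
  | cat "$work/h" - | gzip -c > "$work/toy_merged.gz"

merged=$(gzip -dc "$work/toy_merged.gz")
printf '%s\n' "$merged"
rm -rf "$work"
```

The numeric key on the second column is what places position 50 before 300; a plain text sort would order them the other way round.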
In a separate terminal, run the docker ps command to list all running containers. To see all containers, both stopped and running, use the -a flag.
docker ps -a
STEP 3: Cache and Plugins installation
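Use the following command to set up the cache and the corresponding FASTA file for the human GRCh38 genome. Here -a cf selects the cache (c) and FASTA (f) installation steps, -s sets the species and -y the assembly.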
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep perl INSTALL.pl -a cf -s homo_sapiens -y GRCh38
To also install all the available plugins (the p step in -a, with -g all selecting every plugin):
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep perl INSTALL.pl -a cfp -s homo_sapiens -y GRCh38 -g all
STEP 4: Create vep.sh file
#!/bin/sh
# Example of VEP command line:
vep --cache --offline --format vcf --vcf --force_overwrite \
--dir_cache /opt/vep/.vep/ \
--dir_plugins /opt/vep/.vep/Plugins/ \
--input_file /opt/vep/.vep/input/Dystonia.gatk4.vqsr_variants.g.vcf.gz \
--output_file /opt/vep/.vep/output/Dystonia.gatk4.vqsr_vep.g.vcf \
--plugin dbNSFP,/opt/vep/.vep/Plugins/dbNSFP4.1a_grch38.gz,ALL
STEP 5: Copy or transfer the input data, i.e. the VCF file
mkdir -p $HOME/vep_data/{input,output,script}
cp Dystonia.gatk4.vqsr_variants.g.vcf.gz $HOME/vep_data/input/
cp vep.sh $HOME/vep_data/script/
# Copy the tabix index (.tbi) together with the dbNSFP data file
cp dbNSFP4.1a_grch38.gz dbNSFP4.1a_grch38.gz.tbi $HOME/vep_data/Plugins/
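Before starting the container, it can help to confirm that everything vep.sh refers to is in place on the host side of the mount. The following is a minimal sketch; check_vep_layout is a hypothetical helper, and it assumes that the tabix index sits next to the dbNSFP file and that the cache was installed under a homo_sapiens directory.

```shell
#!/bin/sh
# check_vep_layout (hypothetical helper): takes the host directory mounted at
# /opt/vep/.vep and reports anything vep.sh expects but cannot find there.
check_vep_layout() {
  root=$1
  missing=0
  for f in \
    "input/Dystonia.gatk4.vqsr_variants.g.vcf.gz" \
    "script/vep.sh" \
    "Plugins/dbNSFP4.1a_grch38.gz" \
    "Plugins/dbNSFP4.1a_grch38.gz.tbi"
  do
    if [ ! -f "$root/$f" ]; then
      echo "missing file: $root/$f"
      missing=$((missing + 1))
    fi
  done
  # output/ receives the annotated VCF; homo_sapiens/ is the assumed cache location.
  for d in "output" "homo_sapiens"; do
    if [ ! -d "$root/$d" ]; then
      echo "missing directory: $root/$d"
      missing=$((missing + 1))
    fi
  done
  # Succeed only when nothing is missing.
  [ "$missing" -eq 0 ]
}

check_vep_layout "$HOME/vep_data" || echo "fix the items listed above before running vep.sh"
```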
STEP 6: Run the script
Finally, run the script inside the container.
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep
This starts a bash session:
vep@88d2b95cfd6a:~/src/ensembl-vep$ cd /opt/vep/.vep/script/
sh vep.sh
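Once the run finishes, a quick sanity check is to count how many data rows in the output VCF carry a CSQ entry in the INFO column. The sketch below does this on an invented toy file; count_annotated is a hypothetical helper, and the records are not real results.

```shell
#!/bin/sh
# count_annotated (hypothetical helper): print "annotated/total" data rows,
# where a row counts as annotated if its INFO column (column 8) has a CSQ key.
count_annotated() {
  awk -F'\t' '!/^#/ { total++; if ($8 ~ /(^|;)CSQ=/) hit++ }
              END { printf "%d/%d\n", hit, total }' "$1"
}

# Toy file standing in for the VEP output; the records are invented.
out=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$out"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$out"
printf 'chr1\t12345\t.\tA\tG\t50\tPASS\tDP=20;CSQ=G|missense_variant|MODERATE|GENE1\n' >> "$out"
printf 'chr2\t67890\t.\tCT\tC\t99\tPASS\tDP=35\n' >> "$out"

result=$(count_annotated "$out")
echo "annotated rows: $result"
rm -f "$out"
```

On the real data, the same function can be pointed at $HOME/vep_data/output/Dystonia.gatk4.vqsr_vep.g.vcf on the host.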