Genomic variants from RNA-Seq data

RNA-Seq allows the detection and quantification of known and rare RNA transcripts within a sample. Beyond differential expression analysis and the detection of novel transcripts, RNA-Seq also supports the detection of genomic variation in expressed regions.

A few workflows currently exist for detecting SNPs in RNA-Seq data, including eSNV-detect, SNPiR and Opossum. Here, I have used the GATK workflow for SNP and indel calling on RNA-Seq data, which consists of the following steps:

  • Reference-based (hg38) read mapping with the STAR aligner, using the 2-pass approach with the suggested parameters. In the 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment. Reads that have been mapped across splice junctions must then be split to remove the intronic parts.
  • Adding read group information, sorting, marking duplicates and indexing with Picard (picard.jar).
  • Splitting the reads into exon segments with GATK’s SplitNCigarReads (removing Ns but maintaining grouping information) and reassigning mapping qualities.
  • Indel realignment and recalibration of base qualities.
  • Variant calling with GATK’s HaplotypeCaller and, finally, filtering the variants with GATK’s VariantFiltration.

My qsub-based pipeline is available at bitbucket.org