Genomic variants from RNA-Seq data
RNA-Seq allows the detection and quantification of known and rare RNA transcripts within a sample. In addition to differential expression and detection of novel transcripts, RNA-seq also supports the detection of genomic variation in expressed regions.
Currently, a few workflows exist for detecting SNPs in RNA-seq data, including eSNV-detect, SNPiR and Opossum. Here, I have employed the GATK workflow for SNP and indel calling on RNA-seq data, which consists of the following steps:
- Reference-based (hg38) read mapping using the STAR aligner. This is a 2-pass approach with the suggested parameters: splice junctions detected in a first alignment run are used to guide the final alignment. Reads that have been mapped across splice junctions must then be split to remove the intronic parts.
- Add read group information, sort, mark duplicates, and index the alignments.
- SplitNCigarReads: split the reads into exon segments (removing Ns but maintaining grouping information) and reassign mapping qualities.
- Indel realignment and recalibration of base qualities.
- Variant calling with GATK's HaplotypeCaller, and finally filtering the variants, also with GATK.
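The steps above can be sketched as an ordered list of command lines. This is a minimal illustration, not the actual pipeline: it assumes GATK 3.x syntax (indel realignment is not part of GATK 4), Picard for read groups and duplicate marking, STAR's built-in `--twopassMode Basic` rather than two manual runs, and the filter thresholds from GATK's published RNA-seq recommendations. The function name `build_commands`, the jar paths, the thread count, and all file names are hypothetical.

```python
import shlex
from typing import List


def build_commands(sample: str, fastq1: str, fastq2: str, genome_dir: str,
                   ref_fasta: str, known_sites: str) -> List[List[str]]:
    """Return the pipeline steps as argument lists, in execution order."""
    gatk = ["java", "-jar", "GenomeAnalysisTK.jar"]  # assumed jar location
    picard = ["java", "-jar", "picard.jar"]          # assumed jar location
    star_bam = f"{sample}.Aligned.out.bam"  # STAR appends this to the prefix
    return [
        # 1. STAR 2-pass alignment (first-pass junctions guide the second pass)
        ["STAR", "--runThreadN", "8", "--genomeDir", genome_dir,
         "--readFilesIn", fastq1, fastq2, "--twopassMode", "Basic",
         "--outSAMtype", "BAM", "Unsorted",
         "--outFileNamePrefix", f"{sample}."],
        # 2. Add read groups and coordinate-sort
        picard + ["AddOrReplaceReadGroups", f"I={star_bam}",
                  f"O={sample}.rg.bam", "SO=coordinate", f"RGID={sample}",
                  "RGLB=lib1", "RGPL=illumina", "RGPU=unit1",
                  f"RGSM={sample}"],
        # 3. Mark duplicates and build the BAM index
        picard + ["MarkDuplicates", f"I={sample}.rg.bam",
                  f"O={sample}.dedup.bam", f"M={sample}.metrics.txt",
                  "CREATE_INDEX=true", "VALIDATION_STRINGENCY=SILENT"],
        # 4. Split reads at N CIGAR operations; remap STAR's MAPQ 255 to 60
        gatk + ["-T", "SplitNCigarReads", "-R", ref_fasta,
                "-I", f"{sample}.dedup.bam", "-o", f"{sample}.split.bam",
                "-rf", "ReassignOneMappingQuality", "-RMQF", "255",
                "-RMQT", "60", "-U", "ALLOW_N_CIGAR_READS"],
        # 5. Indel realignment: find target intervals, then realign
        gatk + ["-T", "RealignerTargetCreator", "-R", ref_fasta,
                "-I", f"{sample}.split.bam", "-o", f"{sample}.intervals"],
        gatk + ["-T", "IndelRealigner", "-R", ref_fasta,
                "-I", f"{sample}.split.bam",
                "-targetIntervals", f"{sample}.intervals",
                "-o", f"{sample}.realn.bam"],
        # 6. Base quality recalibration: build the model, then apply it
        gatk + ["-T", "BaseRecalibrator", "-R", ref_fasta,
                "-I", f"{sample}.realn.bam", "-knownSites", known_sites,
                "-o", f"{sample}.recal.table"],
        gatk + ["-T", "PrintReads", "-R", ref_fasta,
                "-I", f"{sample}.realn.bam",
                "-BQSR", f"{sample}.recal.table",
                "-o", f"{sample}.recal.bam"],
        # 7. Variant calling, ignoring soft-clipped bases
        gatk + ["-T", "HaplotypeCaller", "-R", ref_fasta,
                "-I", f"{sample}.recal.bam", "-dontUseSoftClippedBases",
                "-stand_call_conf", "20.0", "-o", f"{sample}.vcf"],
        # 8. Hard filtering on SNP clusters, strand bias and quality by depth
        gatk + ["-T", "VariantFiltration", "-R", ref_fasta,
                "-V", f"{sample}.vcf", "-window", "35", "-cluster", "3",
                "-filterName", "FS", "-filter", "FS > 30.0",
                "-filterName", "QD", "-filter", "QD < 2.0",
                "-o", f"{sample}.filtered.vcf"],
    ]


if __name__ == "__main__":
    # Print the commands only; nothing is executed here.
    for cmd in build_commands("sample1", "sample1_R1.fastq",
                              "sample1_R2.fastq", "star_index_hg38",
                              "hg38.fa", "dbsnp.vcf"):
        print(shlex.join(cmd))
```

Printing the commands instead of running them makes it easy to paste each line into a qsub job script, or to replace the print with `subprocess.run(cmd, check=True)` for local execution.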
My qsub-based pipeline is available at bitbucket.org