Genomic variants from RNA-Seq data
RNA-Seq allows the detection and quantification of both known and rare RNA transcripts within a sample. In addition to differential expression analysis and the detection of novel transcripts, RNA-Seq also supports the detection of genomic variation in expressed regions.
Currently, only a few workflows exist for detecting SNPs in RNA-Seq data, including eSNV-detect, SNPiR and Opossum. Here, I have employed the GATK workflow for SNP and indel calling on RNA-Seq data, which is based on the following steps:
- Reference-based (hg38) read mapping using the STAR aligner. This is a 2-pass approach with the suggested parameters: splice junctions detected in a first alignment run are used to guide the final alignment (reads which have been mapped across splice junctions must be split to remove intronic parts).
- Add read group information, sort, mark duplicates and index with picard.jar.
- GATK's SplitNCigarReads splits the reads into exon segments (removing Ns but maintaining grouping information) and reassigns mapping qualities.
- Indel realignment and recalibration of base qualities.
- Variant calling with GATK's HaplotypeCaller, and finally filtering the variants with GATK's VariantFiltration.
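The steps above can be sketched as a list of command lines. This is only a minimal illustration, not my actual pipeline: it assumes GATK 3.x syntax (GATK 4 uses a different command-line style and no longer includes indel realignment), and the jar locations, file names, read group values and filter thresholds are placeholders that I recall from the GATK RNA-Seq recommendations and that you should check against the documentation for your versions. The helper names (star_cmd, gatk, build_pipeline) are made up for this example.

```python
import shlex

# Assumed jar locations; adjust to your installation.
GATK_JAR = "GenomeAnalysisTK.jar"
PICARD_JAR = "picard.jar"


def star_cmd(genome_dir, fq1, fq2, prefix, junctions=None, threads=8):
    """Build one STAR alignment command.

    For the second pass, `junctions` is the SJ.out.tab file from the first pass.
    """
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", fq1, fq2,
        "--outSAMtype", "BAM", "Unsorted",
        "--outFileNamePrefix", prefix,
    ]
    if junctions is not None:
        cmd += ["--sjdbFileChrStartEnd", junctions]
    return cmd


def gatk(tool, ref, *args):
    """Build a GATK 3.x style command for the given tool and reference."""
    return ["java", "-jar", GATK_JAR, "-T", tool, "-R", ref, *args]


def build_pipeline(sample, fq1, fq2, genome_dir, ref, known_sites):
    """Return the commands for one sample, in the order they should run."""
    p1, p2 = f"{sample}.pass1.", f"{sample}.pass2."
    rg, dedup = f"{sample}.rg.bam", f"{sample}.dedup.bam"
    split, realn = f"{sample}.split.bam", f"{sample}.realn.bam"
    recal = f"{sample}.recal.bam"
    intervals, table = f"{sample}.intervals", f"{sample}.recal.table"
    raw_vcf, filt_vcf = f"{sample}.raw.vcf", f"{sample}.filtered.vcf"
    return [
        # 1. STAR 2-pass: junctions from pass 1 guide pass 2.
        star_cmd(genome_dir, fq1, fq2, p1),
        star_cmd(genome_dir, fq1, fq2, p2, junctions=p1 + "SJ.out.tab"),
        # 2. Read groups + coordinate sort, then mark duplicates and index.
        ["java", "-jar", PICARD_JAR, "AddOrReplaceReadGroups",
         f"I={p2}Aligned.out.bam", f"O={rg}", "SO=coordinate",
         f"RGID={sample}", "RGLB=lib1", "RGPL=illumina",
         "RGPU=unit1", f"RGSM={sample}"],
        ["java", "-jar", PICARD_JAR, "MarkDuplicates",
         f"I={rg}", f"O={dedup}", f"M={sample}.dup_metrics.txt",
         "CREATE_INDEX=true", "VALIDATION_STRINGENCY=SILENT"],
        # 3. Split reads at N CIGAR operations and reassign STAR's MAPQ 255.
        gatk("SplitNCigarReads", ref, "-I", dedup, "-o", split,
             "-rf", "ReassignOneMappingQuality", "-RMQF", "255", "-RMQT", "60",
             "-U", "ALLOW_N_CIGAR_READS"),
        # 4. Indel realignment and base quality recalibration.
        gatk("RealignerTargetCreator", ref, "-I", split, "-o", intervals),
        gatk("IndelRealigner", ref, "-I", split,
             "-targetIntervals", intervals, "-o", realn),
        gatk("BaseRecalibrator", ref, "-I", realn,
             "-knownSites", known_sites, "-o", table),
        gatk("PrintReads", ref, "-I", realn, "-BQSR", table, "-o", recal),
        # 5. Variant calling and filtering (thresholds are placeholders).
        gatk("HaplotypeCaller", ref, "-I", recal, "-dontUseSoftClippedBases",
             "-stand_call_conf", "20.0", "-o", raw_vcf),
        gatk("VariantFiltration", ref, "-V", raw_vcf,
             "-window", "35", "-cluster", "3",
             "-filterName", "FS", "-filter", "FS > 30.0",
             "-filterName", "QD", "-filter", "QD < 2.0", "-o", filt_vcf),
    ]


if __name__ == "__main__":
    # Hypothetical inputs; nothing is executed, the commands are only printed.
    for cmd in build_pipeline("sample1", "sample1_R1.fastq", "sample1_R2.fastq",
                              "star_hg38_index", "hg38.fa", "dbsnp.vcf"):
        print(shlex.join(cmd))
```

Printing the commands instead of running them makes it easy to paste each one into a qsub job script, one job per step, with each job depending on the previous one.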
My qsub-based pipeline is available at bitbucket.org