Bioinformatic processing of microbiome sequencing data

3 minute read


The 16S ribosomal RNA (rRNA) gene of Bacteria codes for the RNA component of the 30S subunit. Different bacterial species have one to multiple copies of the 16S rRNA gene, and each with 9 hypervariable regions, V1-V9. High-throughput sequencing of 16S rRNA gene (a “marker gene”) amplicons has become a widely used method to study bacterial phylogeny and species classification.

Quantitative Insights Into Microbial Ecology “QIIME” 2 (release 2018.2)1 is a widely used package to identity abundance of microbes using 16s rRNA. Briefly, feature table containing counts of each unique sequence in the samples will be constructed using qiime dada2 denoise-paired method. A feature is essentially any unit of observation, e.g., an OTU (Operational Taxonomic Unit), a sequence variant, a gene or a metabolite. In QIIME2 (currently), most features will be OTUs or sequence variants (alternatively, for OTUs, use QIIME2 plugin q2-vsearch).

Data produced by QIIME 2 exist as QIIME 2 artifacts. A QIIME 2 artifact typically has the .qza file extension when output data stored in a file. Visualizations are another type of data (.qzv file extension) generated by QIIME 2, which can be viewed using a web interface (at Firefox web browser) without requiring a QIIME installation. Since QIIME 2 works with artifacts instead of data files (e.g. FASTA files), we must create a QIIME 2 artifact by importing our fastq.gz data files.

Here are some bash scripts that submit QIIME jobs to the cluster. qsub is a job submission command to Sun Grid Engine (SGE) cluster.

  1. To find FASTQ file’s PHRED offset2 (i.e. PHRED scores with an ASCII offset), use script.
  2. To import FASTQ files as QIIME 2 artifact and plot sequence positional quality scores, use script.
  3. To denoise imported paired sequence data and filter chimeras, QIIME 2 uses DADA2, which is an open-source software package that denoises and removes sequencing errors from Illumina amplicon sequence data, download script.
  4. For denoising and merging multiple sequencing runs, first import and denoise independently using qiime dada2, then merge the two feature tables and two sets of representative sequences created by dada2, downlaod script.
  5. For feature filtering (i.e., removal of samples and features from a feature table), download script.
  6. For diversity (alpha and beta), taxonomic and comratative analyses, download script.

When submitting the following interactive job, first qlogin and run the script.


To submit the following batch jobs to the queues, use the qsub command followed by the name of your SGE batch script file.


The above analyses also produce a key summary table or BIOM (Biological Observation Matrix) file containing feature (Operational Taxonomic Units, OTUs) abundance information across samples, along with various annotations and sample metadata. Alternatively, upload BIOM file to MicrobiomeAnalyst3, a web-based tool for comprehensive statistical, visual and meta-analysis of microbiome data, for taxonomic profiling - to characterize community compositions based on methods developed in ecology such as alpha-diversity (within-sample diversity) or beta-diversity (between-sample diversity) and comparative analysis - to identify features that are significantly different among conditions under study.

After analyzing your data, it’s finally time to interpret your results!


  2. PHRED offset, The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. 

  3. MicrobiomeAnalyst