Single cell RNA-seq data analysis using CellRanger and Seurat on Cluster

This post describes how to run Cell Ranger on the HGCC (Human Genetics Computing Cluster), which uses the Sun Grid Engine (SGE) queuing system for batch scheduling. Cluster mode lets highly parallelizable pipeline stages use multiple cores concurrently, which can substantially reduce the time to solution.

Analyzing Chromium Single Cell data involves four steps [1].

Step 1: cellranger mkfastq demultiplexes raw base call (BCL) files generated by Illumina sequencers into FASTQ files.
Step 2: cellranger count takes FASTQ files from cellranger mkfastq and performs alignment, filtering, barcode counting, and UMI counting. For large studies involving multiple GEM wells, run cellranger count on the FASTQ data from each GEM well individually, then pool the results using cellranger aggr (Step 3).
Step 3: cellranger aggr aggregates outputs from multiple runs of cellranger count.
Step 4: Downstream (secondary) analysis using the R package Seurat v3.0 [2].
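
For Step 3, cellranger aggr reads a CSV that lists the molecule_info.h5 file written by each cellranger count run. The sketch below writes such a file. The function name write_aggr_csv and the run IDs well1 and well2 are illustrative, and the header library_id,molecule_h5 is the one I understand Cell Ranger 3.0 to expect; other versions may use a different header, so check the documentation for your version.

```shell
# Write an aggregation CSV for `cellranger aggr` (hypothetical helper).
# Usage: write_aggr_csv <out.csv> <runs_dir> <run_id>...
write_aggr_csv() {
  local out="$1" runs_dir="$2"
  shift 2
  echo "library_id,molecule_h5" > "$out"
  local id
  for id in "$@"; do
    # Each `cellranger count --id=<id>` run writes <id>/outs/molecule_info.h5
    echo "${id},${runs_dir}/${id}/outs/molecule_info.h5" >> "$out"
  done
}

# Example with two hypothetical GEM-well runs:
# write_aggr_csv aggr.csv "$HOME" well1 well2
# cellranger aggr --id=wells_aggr --csv=aggr.csv --normalize=mapped
```

Generating the CSV from the run IDs keeps the paths consistent with the --id values passed to cellranger count.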

Running the pipelines on our HGCC cluster requires the following:

1. Load the Cell Ranger module (cellranger-3.0.1) [1], or install Cell Ranger in your $HOME directory and add it to PATH in ~/.bashrc.

2. Update the threads and memory settings in the job config file (cellranger-3.0.1/martian-cs/v3.1.0/jobmanagers/config.json).

"threads_per_job": 9,
"memGB_per_job": 72,

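The two settings above can also be changed from the command line. The following is a minimal sketch, assuming the keys appear in config.json as "threads_per_job": <number> and "memGB_per_job": <number>; the helper name set_job_resources is hypothetical, and the original file is kept as a .bak copy.

```shell
# Set threads_per_job and memGB_per_job in a Martian config.json (hypothetical helper).
# Usage: set_job_resources <config.json> <threads> <memGB>
set_job_resources() {
  local cfg="$1" threads="$2" mem_gb="$3"
  cp "$cfg" "$cfg.bak"   # keep a backup of the original file
  # Replace only the numeric value that follows each key
  sed -E \
    -e "s/(\"threads_per_job\"[[:space:]]*:[[:space:]]*)[0-9]+/\1${threads}/" \
    -e "s/(\"memGB_per_job\"[[:space:]]*:[[:space:]]*)[0-9]+/\1${mem_gb}/" \
    "$cfg.bak" > "$cfg"
}

# Example, using the values from this post:
# set_job_resources "$HOME/cellranger-3.0.1/martian-cs/v3.1.0/jobmanagers/config.json" 9 72
```

Note that 72 GB is 9 threads times 8 GB, which lines up with the --mempercore=8 option used in the cellranger count commands later in this post.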
3. Update the template file (cellranger-3.0.1/martian-cs/v3.1.0/jobmanagers/sge.template).

#!/bin/bash
#$ -pe smp __MRO_THREADS__
#$ -q b.q
#$ -S /bin/bash
#$ -m abe
#$ -M <e-mail>

cd __MRO_JOB_WORKDIR__
source $HOME/cellranger-3.0.1/sourceme.bash
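
The __MRO_*__ placeholders in the template are filled in by the pipeline when it submits each job. To see roughly what a submitted script would look like, the sketch below substitutes the two placeholders used above with sample values. The helper name preview_template is hypothetical, and this only imitates the substitution for inspection; it does not submit anything.

```shell
# Preview an SGE template with its placeholders filled in (hypothetical helper).
# Usage: preview_template <template> <threads> <workdir>
preview_template() {
  local template="$1" threads="$2" workdir="$3"
  # "|" is used as the sed delimiter because workdir contains slashes
  sed -e "s|__MRO_THREADS__|${threads}|g" \
      -e "s|__MRO_JOB_WORKDIR__|${workdir}|g" \
      "$template"
}

# Example:
# preview_template sge.template 9 "$HOME/PBMC_10k_v3"
```

Checking the rendered "#$ -pe" and "#$ -q" lines against the parallel environments and queues configured on your cluster can catch a typo before any job is queued.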

4. Download the single cell gene expression and reference genome datasets from 10x Genomics.

5. Create an sge.sh file

TR="$HOME/refdata-cellranger-GRCh38-3.0.0"

For Single Cell 3′ data:

FASTQS="$HOME/pbmc_10k_v3_fastqs"

cellranger count --disable-ui \
--id=PBMC_10k_v3 \
--transcriptome=${TR} \
--fastqs=${FASTQS} \
--sample=pbmc_10k_v3 \
--expect-cells=10000 \
--jobmode=sge \
--mempercore=8 \
--jobinterval=5000 \
--maxjobs=5
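
Since a cluster run can sit in the queue for a while before failing, it may be worth checking the input directories first. The sketch below is one way to do that at the top of sge.sh; check_dirs is a hypothetical helper, not part of Cell Ranger.

```shell
# Return 0 if every given path is an existing directory; otherwise
# report each missing one on stderr and return 1 (hypothetical helper).
check_dirs() {
  local missing=0 d
  for d in "$@"; do
    if [ ! -d "$d" ]; then
      echo "missing directory: $d" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example, using the variables defined in sge.sh:
# check_dirs "$TR" "$FASTQS" || exit 1
```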

For Single Cell 5′ (Feature Barcoding/Antibody Capture assay) data:

LIBRARY=$HOME/vdj_v1_hs_pbmc2_5gex_protein_fastqs/vdj_v1_hs_pbmc2_5gex_protein_library.csv
FEATURE_REF=$HOME/vdj_v1_hs_pbmc2_5gex_protein_fastqs/vdj_v1_hs_pbmc2_5gex_protein_feature_ref.csv

cellranger count \
--libraries=${LIBRARY} \
--feature-ref=${FEATURE_REF} \
--id=PBMC_5GEX \
--transcriptome=${TR} \
--expect-cells=9000 \
--jobmode=sge \
--mempercore=8 \
--maxjobs=5 \
--jobinterval=5000
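
The file passed to --libraries is a small CSV with one row per library. The downloaded dataset already ships one, but if you need to write your own, the sketch below shows the layout as I understand it: a fastqs,sample,library_type header, with "Gene Expression" and "Antibody Capture" as library types. The helper name and the paths and sample names in the example are hypothetical placeholders.

```shell
# Write a libraries CSV for `cellranger count --libraries` (hypothetical helper).
# Usage: write_libraries_csv <out.csv> <gex_fastqs> <gex_sample> <ab_fastqs> <ab_sample>
write_libraries_csv() {
  local out="$1"
  {
    echo "fastqs,sample,library_type"
    echo "$2,$3,Gene Expression"
    echo "$4,$5,Antibody Capture"
  } > "$out"
}

# Example with placeholder paths and sample names:
# write_libraries_csv library.csv /path/to/gex_fastqs my_gex /path/to/ab_fastqs my_antibody
```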

6. Run the script in screen, then detach and reconnect

Use the screen command to log out of and back into the system while keeping the processes running.

screen -S screen_name

bash sge.sh

To leave the session without killing the running process, press Ctrl+A, then D.

To reconnect to the screen: screen -R screen_name
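
As an alternative to attaching and detaching by hand, screen can start a session that is already detached. The sketch below wraps that in a small function; run_detached is a hypothetical name, and the session name is whatever you choose.

```shell
# Start a script inside a detached screen session (hypothetical helper).
# Usage: run_detached <session_name> <script>
run_detached() {
  local name="$1" script="$2"
  if ! command -v screen >/dev/null 2>&1; then
    echo "screen is not installed" >&2
    return 127
  fi
  # -d -m starts the session detached; -S gives it a name
  screen -dmS "$name" bash "$script"
}

# Example:
# run_detached screen_name sge.sh
# screen -ls              # list sessions
# screen -r screen_name   # reattach later
```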

7. Monitor work progress through a web browser

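Open the _log file in the output folder (here, PBMC_5GEX). The UI is only served when cellranger runs without --disable-ui.

If the log reports that the UI is being served at a URL such as http://cluster.university.edu:3600?auth=rlSdT_QLzQ9O7fxEo-INTj1nQManinD21RzTAzkDVJ8, run the following from your laptop: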

ssh -NT -L 9000:cluster.university.edu:3600 user@cluster.university.edu

user@cluster.university.edu's password:

Then open http://localhost:9000/ in your web browser to access the UI. If the page is refused, try appending the ?auth=... query string from the logged URL.
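
The host and port in the tunnel command come straight from the URL in the log, so they can be extracted rather than copied by hand. The sketch below does this with two hypothetical helpers, ui_url_from_log and tunnel_hint. It assumes the log contains a URL of the form shown above, and that the local URL should carry the same auth token; the functions only print the commands, they do not open a connection.

```shell
# Print the first UI URL (one containing "auth=") found in a _log file.
ui_url_from_log() {
  grep -oE 'https?://[^[:space:]]+auth=[^[:space:]]+' "$1" | head -n 1
}

# Print the ssh tunnel command and the matching local URL.
# Usage: tunnel_hint <ui_url> <ssh_user> [local_port]
tunnel_hint() {
  local url="$1" user="$2" local_port="${3:-9000}"
  local hostport host port token
  hostport="${url#*://}"          # drop the scheme
  hostport="${hostport%%[/?]*}"   # drop any path or query string
  host="${hostport%%:*}"
  port="${hostport##*:}"
  token="${url#*auth=}"
  echo "ssh -NT -L ${local_port}:${host}:${port} ${user}@${host}"
  echo "http://localhost:${local_port}/?auth=${token}"
}

# Example, run on the cluster:
# tunnel_hint "$(ui_url_from_log PBMC_5GEX/_log)" user
```

The first printed line is the command to run on your laptop; the second is the address to open in the browser once the tunnel is up.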

8. Single Cell Integration in Seurat v3.0

Seurat is an R package designed for QC, analysis, and exploration of single cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and to integrate diverse types of single cell data. The analysis starts by reading the cellranger output files (barcodes.tsv.gz, features.tsv.gz and matrix.mtx.gz):

library(Seurat)
pbmc.data <- Read10X(data.dir = "~/PBMC_5GEX/outs/filtered_feature_bc_matrix/")
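
Before opening R, it can help to confirm that the run actually produced the three files named above. The shell sketch below checks for them; check_matrix_dir is a hypothetical helper, and the directory in the example is the one used in the Read10X call.

```shell
# Return 0 if <dir> contains barcodes.tsv.gz, features.tsv.gz and matrix.mtx.gz;
# otherwise report each missing file on stderr and return 1 (hypothetical helper).
check_matrix_dir() {
  local dir="$1" f rc=0
  for f in barcodes.tsv.gz features.tsv.gz matrix.mtx.gz; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example:
# check_matrix_dir "$HOME/PBMC_5GEX/outs/filtered_feature_bc_matrix"
```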