Table 1. Overview of common proteomic platforms (Jiang W et al, Cancers, 2022).
Analytical Technique | Category | Sample Input Required | Accepted Biospecimen Types |
---|---|---|---|
Proximity Extension Assay (Olink) | Antibody | 1 µL | Plasma, tissue/cell, synovial fluid, CSF, plaque extract, and saliva |
Reverse Phase Protein Arrays | Antibody | 5 µg (1.0- 1.5 mg/mL protein) | Tissue/cell, plasma, serum, biopsies, body fluids |
Bio-Plex | Antibody (bead) | 12.5 µL (serum/plasma); 50 µL (cell culture) | Plasma, serum, tissue/cell |
Simoa | Antibody (bead) | 25 µL | Plasma, serum, urine, tissue/cell, CSF, saliva |
Aptamer Group (Optmer) | Aptamer | 38 µL | Plasma (diagnostics and therapeutics), urine, tissue/cell, liquid matrices |
Base Pair Technologies | Aptamer | 5–100 µL | Plasma, serum, tissue/cell |
SOMAscan | Aptamer | 55–100 µL | Plasma, serum, CSF, urine, cell/tissue, synovial fluid, exosomes |
Electrochemiluminescence Immunoassay | ECLIA | 50 µL | Plasma, serum, tissue/cell, CSF, urine, blood spots, tears, synovial fluid, tissue extracts |
Multiplex ELISA | ELISA | 25–50 µL | Plasma, serum, tissue/cell, urine, saliva, CSF |
Singleplex ELISA | ELISA | 100 µL | Plasma, serum, tissue/cell, urine, saliva, CSF |
2D-PAGE | Gel electrophoresis | ~100 µg (15–50 µL) | Plasma, serum, tissue/cell, urine |
DDA-MS | MS | 10 µL | Plasma, serum, tissue/cell |
SWATH-MS | MS (DIA) | 5–10 µg | Plasma, serum, tissue/cell, platelets, monocytes/neutrophils |
iTRAQ | MS (labeling in LC–MS/MS) | 12 µg | Plasma, serum, tissue/cells, saliva |
SRM/MRM | MS (LC–MS/MS) | 15 µL | Plasma, tissue/cell, dried blood spots |
The SomaScan Assay v4.1 simultaneously measures ~6,600 unique human proteins in a single sample (see Table 2). These protein targets are measured by ~7,300 aptamers called SOMAmers (Slow Off-rate Modified Aptamers). SOMAmers are short, chemically modified single-stranded DNA molecules that bind specifically to their protein targets. The SomaScan assay measures native proteins in complex matrices by transforming each individual protein concentration into a corresponding SOMAmer reagent concentration, which is then quantified using DNA microarrays. Because SOMAmer reagents are selected against proteins in their native folded conformations, they generally require an intact tertiary protein structure for binding.
Table 2. The 7k SomaScan Assay v4.1 panel (7,596 SOMAmers in total, of which 7,335 human SOMAmers map to 6,414 unique human UniProt IDs).
Organism | SOMAmers | UniProt IDs (all) | UniProt IDs (Unique) | Protein Targets | Gene IDs | Gene Symbols |
---|---|---|---|---|---|---|
Human | 7335 | 7301 | 6414 | 6610 | 6408 | 6398 |
Mouse | 236 | 236 | 4 | 4 | 3 | 3 |
African clawed frog | 3 | 3 | 1 | 2 | 1 | 1 |
Gila monster | 3 | 3 | 1 | 1 | 0 | 0 |
Hornet | 3 | 3 | 1 | 1 | 0 | 1 |
Jellyfish | 3 | 3 | 1 | 1 | 0 | 1 |
Thermus thermophilus | 3 | 3 | 1 | 1 | 0 | 1 |
Common eastern firefly | 2 | 2 | 1 | 1 | 0 | 0 |
Bacillus stearothermophilus | 1 | 1 | 1 | 1 | 0 | 1 |
Ensifer meliloti | 1 | 1 | 1 | 1 | 0 | 1 |
European elder | 2 | 2 | 1 | 1 | 0 | 0 |
HIV-1 | 1 | 1 | 1 | 1 | 0 | 1 |
HIV-2 | 1 | 1 | 1 | 1 | 1 | 1 |
Red alga | 1 | 1 | 1 | 1 | 1 | 1 |
strain K12 | 1 | 1 | 1 | 1 | 1 | 1 |
Total | 7596 | 7562 | 6431 | 6628 | 6415 | 6411 |
ADAT file
ADAT is a tab-delimited text file format. The contents include SOMAmer reagent intensities, sample data, sequence data and experimental metadata. For each SOMAmer reagent sequence, the ADAT file typically contains corresponding protein name, UniProt ID, Entrez Gene ID and Entrez Gene symbol.
SomaDataIO
SomaDataIO v5.3.1 is an R package for working with the SomaLogic ADAT file format.
library(SomaDataIO)
library(purrr)
library(tidyr)
library(dplyr)
library(ggplot2)
The `read_adat()` function imports data from ADAT files.
base.dir = "/Users/adinasa/Documents/"
adat_file <- "example.adat"
my_adat <- read_adat(paste0(base.dir,adat_file))
Update the ADAT file with sample group information (adding sample group details from external file). Save the updated ADAT file (optional).
meta_file = paste(base.dir, "MAPPING.csv", sep="/")
meta <- read.csv(meta_file, header = T, stringsAsFactors = FALSE)
meta$SampleId <- as.character(meta$SampleId)
my_adat <- dplyr::left_join(my_adat,meta, by="SampleId", keep=FALSE)
write_adat(my_adat, file = paste(base.dir, "example_updated.adat", sep="/"))
Utility functions.
regex for analytes
is_seq <- function(.x) grepl("^seq\\.[0-9]{4}", .x)
center/scale vector (z-scores).
cs <- function(.x) {
  out <- .x - mean(.x)
  out / sd(out)
}
Data were log2-transformed within each sample. Control and Disease are the two groups here (group labels may vary in your data).
cleanData <- my_adat %>%
filter(SampleType == "Sample") %>% drop_na(Group) %>%
log2() %>%
mutate(SampleGroup = as.numeric(factor(Group, levels = c("Control", "Disease"))) - 1) %>%
modify_if(is_seq(names(.)), cs)
Select human proteins with a UniProt ID and QC = PASS.
t_tests <- getAnalyteInfo(cleanData) %>%
filter(ColCheck == "PASS") %>%
filter(Organism == "Human") %>%
filter(UniProt != "") %>%
select(AptName, SeqId, Target = TargetFullName, Organism, EntrezGeneID, EntrezGeneSymbol, UniProt, ColCheck)
Perform a Student's t-test for each analyte.
t_tests <- t_tests %>%
mutate(
formula = map(AptName, ~ as.formula(paste(.x, "~ SampleGroup"))),
t_test = map(formula, ~ stats::t.test(.x, data = cleanData, var.equal = TRUE)),
t_stat = map_dbl(t_test, "statistic"),
p.value = map_dbl(t_test, "p.value"),
fdr = p.adjust(p.value, method = "BH")
) %>% arrange(p.value)
The results are used to identify proteins significantly associated with disease using a Benjamini-Hochberg false discovery rate (FDR) threshold of 1%.
t_tests[t_tests$fdr <= 0.01, ]
Further reading …
Tandem Mass Tag (TMT)-based quantitation of proteins
Label-Free Quantitation (LFQ) of proteins
KM Analysis using R
The packages used for the analysis are survival and survminer. Use install.packages() to install them if they are not already available in your R workspace.
Load Required libraries
library(survival)
library(survminer)
library(dplyr)
Read the vital dataset
base.dir = "/Users/adinasa/Documents/Nabil/survival_analysis"
data = read.csv(file=paste0(base.dir,"/KM_Test_data.csv"),header=T)
Examine the dataset (Vital status: 1 for dead; 0 for alive)
head(data)
ID p16 Status Days
GHN-11 unknown 1 803
GHN-15 unknown 1 775
GHN-20 unknown 1 150
GHN-21 unknown 0 2036
GHN-24 unknown 1 718
GHN-25 negative 1 598
HN-39 positive 0 1232
Create a survival object, usually used as a response variable in a model formula.
surv_obj = survival::Surv(time=data$Days, event = data$Status)
Wrapper around the standard survfit() function to create survival curves
fit = survminer::surv_fit(surv_obj ~ p16, data = data)
You can replace the above two steps with
fit = surv_fit(Surv(Days, Status) ~ p16, data = data)
Plot the KM curve. With pval = TRUE argument, it plots the p-value of a log rank test, which will help us to get an idea if the groups are significantly different or not.
png(filename = paste(base.dir, "KM_Plot.png", sep="/"),width = 1300, height = 1300, res=200)
ggsurvplot(fit, pval=TRUE, risk.table=TRUE)
dev.off()
KM plot The lines represent the survival curves of the three groups (HPV status: positive [N=23], negative [N=5] and unknown [N=7]). A vertical drop in a curve indicates an event (e.g. death). Events: HPV-positive 5 (21.7%); HPV-negative 4 (80%); HPV-unknown 5 (71.4%).
The lengths of the horizontal lines along the X-axis of serial times represent the survival duration for that interval. The vertical tick mark on the curves means that a patient was censored at this time; a patient has not (yet) experienced the event of interest, such as death, within the study time period. If many patients were censored in a given group(s), one must question how the study was carried out or how the type of treatment affected the patients. This stresses the importance of showing censored patients as tick marks in survival curves.
Risk table At time zero, the survival probability is 1.0 (100% of the participants are alive), so all 35 participants are at risk; after 1000 days, 21 participants remain alive or at risk; after 3000 days, 3 participants remain alive or at risk.
Variant Call Format (VCF) is a text file with multiple lines of meta-information, a header line, and data rows, each containing information about a locus or sequence variant in the genome. Aside from the chromosomal location and observed alleles, these loci are essentially anonymous. VCF is the primary (and only well-supported) format used by GATK for variant calls. A GVCF (Genomic VCF) is a kind of VCF, so the basic format specification is the same as for a regular VCF, but a Genomic VCF contains extra information.
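A minimal sketch of the format (illustrative values only; a real VCF typically carries many more meta-information lines and INFO/FORMAT annotations):

```
##fileformat=VCFv4.2
##source=exampleOnly
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    12345   .       A       G       60      PASS    DP=30
```

Each data row records one variant: chromosome, 1-based position, identifier, reference and alternate alleles, quality, filter status, and a semicolon-separated INFO field of key=value annotations.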
The Ensembl Variant Effect Predictor (VEP) annotates
• the location of the variants on the genome
• known variants based on database matching
• gene and transcript names affected by the variants
• the consequence of variation for the protein sequence (stop gained/lost, missense, frameshift)
STEP 1: Install Docker Engine
“Docker Engine is available on a variety of Linux platforms, macOS and Windows 10 through Docker Desktop, and as a static binary installation”.
STEP 2: Clone or Pull the VEP docker image
Create a directory on your machine and make sure it has read and write access granted, so that the Docker container can write VEP output into it.
mkdir $HOME/vep_data
chmod a+rwx $HOME/vep_data
docker pull ensemblorg/ensembl-vep
Pre-configure the volume `vep_data` for the container; this is required if you wish to download data (e.g. cache files) that persists across sessions. The following command mounts the volume `$HOME/vep_data` into `/opt/vep/.vep` in the container. If you copy your data to the `$HOME/vep_data` directory, you will see that data in the `/opt/vep/.vep` directory of the container.
`-t`: allocate a pseudo-TTY
`-i`: interactive; keep STDIN open even if not attached
`-v`: bind mount a volume
`docker_image` is ensemblorg/ensembl-vep
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep
exit
The above `docker run` command lets you enter the container in an interactive terminal. Once you are inside, run `ls -lsa` to list the files:
vep@56d4ccaf3e34:~/src/ensembl-vep$ ls -lsa
total 188
4 drwxr-xr-x 1 vep vep 4096 Apr 7 15:30 .
4 drwxr-xr-x 1 root root 4096 Apr 7 15:26 ..
4 drwxr-xr-x 45 vep vep 4096 Apr 7 15:27 Bio
20 -rwxr-xr-x 1 vep vep 16698 Apr 7 15:19 convert_cache.pl
4 drwxr-xr-x 2 vep vep 4096 Apr 7 15:19 examples
20 -rwxr-xr-x 1 vep vep 16620 Apr 7 15:19 filter_vep
8 -rwxr-xr-x 1 vep vep 4880 Apr 7 15:19 haplo
56 -rwxr-xr-x 1 vep vep 56607 Apr 7 15:19 INSTALL.pl
12 -rw-r--r-- 1 vep vep 11478 Apr 7 15:19 LICENSE
4 drwxr-xr-x 3 vep vep 4096 Apr 7 15:19 modules
16 -rw-r--r-- 1 vep vep 12906 Apr 7 15:19 README.md
4 drwxr-xr-x 2 vep vep 4096 Apr 7 15:19 validator
4 -rw-r--r-- 1 vep vep 128 Apr 7 15:29 variant_effect_output.txt_warnings.txt
8 -rwxr-xr-x 1 vep vep 5079 Apr 7 15:19 variant_recoder
16 -rwxr-xr-x 1 vep vep 14462 Apr 7 15:19 vep
4 drwxr-xr-x 2 vep vep 4096 Apr 7 15:27 .version
Depending on the plugin, you may need to download and prepare data for it. For example, the `dbNSFP` plugin needs the following preprocessing of the downloaded file. Programs such as `tabix`, `unzip`, `wget` and `zcat` are available in the container at /opt/vep/.vep/.
wget ftp://dbnsfp:dbnsfp@dbnsfp.softgenetics.com/dbNSFP4.1a.zip
unzip dbNSFP4.1a.zip
zcat dbNSFP4.1a_variant.chr1.gz | head -n1 > h
zgrep -h -v ^#chr dbNSFP4.1a_variant.chr* | sort -T /path/to/tmp_folder -k1,1 -k2,2n - | cat h - | bgzip -c > dbNSFP4.1a_grch38.gz
tabix -s 1 -b 2 -e 2 dbNSFP4.1a_grch38.gz
In a separate terminal, run the `docker ps` command to list all running containers. To see all containers, both stopped and running, use the `-a` flag.
docker ps -a
STEP 3: Cache and Plugins installation
Use the following commands to set up the cache and corresponding FASTA for human GRCh38 genome.
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep perl INSTALL.pl -a cf -s homo_sapiens -y GRCh38
To install all the available plugins
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep perl INSTALL.pl -a cfp -s homo_sapiens -y GRCh38 -g all
STEP 4: Create a `vep.sh` file
#!/bin/sh
# Example of VEP command line:
vep --cache --offline --format vcf --vcf --force_overwrite \
--dir_cache /opt/vep/.vep/ \
--dir_plugins /opt/vep/.vep/Plugins/ \
--input_file /opt/vep/.vep/input/Dystonia.gatk4.vqsr_variants.g.vcf.gz \
--output_file /opt/vep/.vep/output/Dystonia.gatk4.vqsr_vep.g.vcf \
--plugin dbNSFP,/opt/vep/.vep/Plugins/dbNSFP4.1a_grch38.gz,ALL
STEP 5: Copy or transfer the input data, i.e. the VCF file
mkdir -p /Users/adinasa/vep_data/{input,output,script}
cp Dystonia.gatk4.vqsr_variants.g.vcf.gz ~/vep_data/input/
cp vep.sh /Users/adinasa/vep_data/script
cp dbNSFP4.1a_grch38.gz /Users/adinasa/vep_data/Plugins
STEP 6: Run the script
Finally, run the script inside container.
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep
Starts a `bash` session:
vep@88d2b95cfd6a:~/src/ensembl-vep$ cd /opt/vep/.vep/script/
sh vep.sh
Liquid chromatography (LC) coupled with mass spectrometry (MS) has been widely used for protein expression quantification. Protein quantification by tandem-MS (MS/MS) uses integrated peak intensity from the parent-ion mass (MS1) or features from fragment-ions (MS2).
Label-free quantification (LFQ) may be based on precursor ion intensity (peak areas or peak heights) or on spectral counting. Here, the MaxLFQ algorithm is applied, which relies on chromatographic ion intensities.
iBAQ and LFQ intensity
MS1 methods use the iBAQ (intensity Based Absolute Quantification) algorithm; iBAQ intensity is a protein's total non-normalised intensity (all peptides) divided by the number of measurable tryptic peptides.
iBAQ = Σ intensity / # theoretical peptides
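As a toy illustration of this arithmetic (the intensity and peptide counts below are hypothetical, not taken from any real search output):

```shell
# Toy iBAQ calculation: summed peptide intensity divided by the number
# of theoretically observable tryptic peptides (hypothetical values).
total_intensity=3000000        # sum of all peptide intensities for one protein
n_theoretical_peptides=12      # in-silico tryptic peptides in the measurable mass range
ibaq=$((total_intensity / n_theoretical_peptides))
echo "iBAQ = $ibaq"            # prints: iBAQ = 250000
```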
LFQ (label-free quantitation) intensity is very similar to the iBAQ intensity, but the protein intensities are normalized to exclude outliers so that the ratio changes between samples are best represented.
Untargeted label-free quantitation (LFQ) of proteins aims to determine the relative amount of proteins in two or more biological samples. Mass spectrometer-generated raw files are used for label-free quantitation of proteins. Base peak chromatograms can be inspected visually using RawMeat 1, a data quality assessment tool designed for Thermo instruments.
1. MaxQuant Search
All raw files are processed together in a single run by MaxQuant v1.6.15.0 2 with default parameters except the following:
Raw data pane
Set experiment to assign a unique ID to each biological sample. If you don't assign a unique ID to each sample, MaxQuant will pool them together in the output. Set fractions to assign fraction values; if you have no fractionation, set 1.
Group-specific parameters pane
Global parameters pane
Match between runs: peptides which are present in several samples, but not identified via MS/MS in all of them, can still be identified via matching between runs. Setting it to TRUE boosts the number of identifications.
Database searches are performed using the Andromeda search engine (a peptide search engine based on probabilistic scoring) with the UniProt-SwissProt human canonical database as a reference and a database of common laboratory contaminants. MaxQuant reports the summed intensity for each protein, as well as its iBAQ and LFQ values.
Proteins that share all identified peptides are combined into a single protein group. Peptides that match multiple protein groups (“razor” peptides) are assigned to the protein group with the most unique peptides. MaxQuant employs the MaxLFQ algorithm for label-free quantitation (LFQ). Quantification will be performed using razor and unique peptides, including those modified by acetylation (protein N-terminal), oxidation (Met) and deamidation (NQ).
2. Quality Control of MaxQuant Search (optional)
PTXQC 3 is an R package for general quality control of proteomics data, which takes MaxQuant result files as input.
library("devtools")
install_github("cbielow/PTXQC", build_vignettes=TRUE, dependencies=TRUE)
library(PTXQC)
PTXQC::createReport("path_to_txt_directory")
Open final report file report_v1.0.5_combined.pdf
3. Post processing of MaxQuant Search results
Further data processing is performed using Perseus v1.6.14.0 4. In brief, protein group LFQ intensities are log2-transformed to reduce the effect of outliers. To overcome the obstacle of missing LFQ values, missing values are imputed before fitting the models. Hierarchical clustering is performed on Z-score-normalized, log2-transformed LFQ intensities. Log ratios are calculated as the difference in average log2 LFQ intensity between experimental and control groups. Two-tailed Student's t-test calculations are used for statistical testing. A protein is considered statistically significant if its fold change is ≥ 2 and FDR ≤ 0.01. Please refer to the Perseus documentation for more details.
Export Perseus processed LFQ data as a text file Perseus_filtered_transformed_valid_values_imputed_ttest.txt
4. Perseus exported file processing
PerseusR enables interoperability with the Perseus platform for omics data analysis (Tyanova et al. 2016). If you selected "Write quality and imputed matrices" when saving the Perseus-processed data as a text file, include `additionalMatrices=TRUE`.
library(PerseusR)
setwd("/Users/Documents/Proteomics/Perseus")
inFile <- "Perseus_filtered_transformed_valid_values_imputed_ttest.txt"
mdata <- PerseusR::read.perseus.as.matrixData(inFile,additionalMatrices=TRUE,check = FALSE)
# 1. log(LFQ) data with imputed values
data <- main(mdata) # head(data)
# 2. Gene/Protein names, p-value, FDR and fold change
annotations <- annotCols(mdata) # colnames(annotations)
# Select first annotation in Protein.ID "sp|P31943|HNRH1_HUMAN;sp|Q9NQA5|TRPV5_HUMAN"
annotations$Protein.IDs <- sub(";.*","", annotations$Protein.IDs)
# Select Protein ID as HNRH1_HUMAN from sp|P31943|HNRH1_HUMAN;
annotations$Protein.IDs <- as.character(lapply(strsplit(as.character(annotations$Protein.IDs), split="\\|"),tail, n=1))
# Remove "_HUMAN" part HNRH1_HUMAN
annotations$Protein.IDs <- sub("_HUMAN","",annotations$Protein.IDs)
# Select Columns; write complete column name.
annotations <- annotations[,c("Protein.IDs","Student.s.T.test.p.value....", "Student.s.T.test.Difference...")]
# Save the data
write.table(annotations, file = "Volcano_plot_data.txt", col.names = TRUE, row.names = FALSE, quote = FALSE)
5. Data visualization
Volcano plot illustrates significantly differentially abundant proteins. The following plot is generated using GraphPad Prism.
In addition to the above analytical considerations, good experimental design helps effectively identify true differences in the presence of variability from various sources and also avoids bias during data acquisition.
Further reading…
MaxQuant – Information and Tutorial
How to use Cloud for Proteomics Data Analysis
Data dependent vs Data independent proteomics
RawMeat is a nice Thermo raw file diagnostic tool developed by the now defunct Vast Scientific ↩
MaxQuant is a quantitative proteomics software package designed for analyzing large mass-spectrometric data sets ↩
PTXQC, an R-based quality control pipeline called Proteomics Quality Control ↩
Perseus is software package for shotgun proteomics data analyses ↩
Alkylation with iodoacetamide (IAA) after cystine reduction results in the covalent addition of a carbamidomethyl group that prevents the re-formation of disulfide bonds. The proteins are then digested overnight using trypsin or a trypsin/LysC mixture, followed by Tandem Mass Tag (TMT) labeling of lysine residues, pooling of all samples and fractionation of the peptide mixture. Finally, LC-MS/MS data are acquired and searched against a database for protein identification and quantification.
Fractionation prior to LC-MS/MS analysis helps detect low-abundance proteins (<100 ng/mL), which is the concentration range of most clinical biomarkers.
A TMT-based approach allows multiplexing of samples. TMT 6-plex reagents produce a series of reporter ions with nominal masses from 126 to 131 Da at 1 Da intervals. Several TMT reagents are commercially available, including TMTzero, TMT duplex, TMT 6-plex, and TMT 10-plex 1. They have the same chemical structure but contain different numbers and combinations of 13C and 15N isotopes in the reporter group, so the overall mass of each reagent is the same.
Isobaric labeling reagent (TMT) structure:
a. An amine-specific reactive group – an N-hydroxysuccinimide ester, which reacts with primary amines, i.e. unblocked N-termini and lysine side chains.
b. A mass reporter group for quantification – the reporter groups are partially fragmented from the peptide during precursor fragmentation in the mass spectrometer.
c. A mass normalizer group to link the reactive and reporter groups – mass normalizer group ensures that the peptide complexity in the MS1 spectra does not increase with multiplexing.
Unless otherwise noted, every analysis uses an MS3-based, TMT-centric mass spectrometry method. MS2-based TMT yields the highest precision but lowest accuracy due to ratio compression, which MS3-based TMT can partly rescue 2. The top 10 precursors are selected for MS2/MS3 analysis.
What are the correction factors used for TMT?
TMT reporter ion signals need to be adjusted to account for isotopic impurities in each TMT variant. For TMT 10-plex labeling, different batches of TMT reagents have slightly different isotope impurities, which need to be supplied to the database search to correct the reporter ion ratios. The isotope impurity information can be found in the reagent kit. However, I do not normally correct for isotopic impurities of TMT reagents, as this is typically a <1.2% shift.
Database search for protein identification and quantification:
The resulting TMT-MS3 data (.raw files) are processed using MaxQuant with its integrated Andromeda search engine (v1.6.7.0). Tandem mass spectra are searched against the UniProt database.
Download and install MaxQuant
Specification of Various Parameters in MaxQuant (All the other parameters in MaxQuant are set to the default values for processing orbitrap-type data)3:
Raw data pane
Using Load, select all raw files from all batches. Update the Experiment (a unique value for each sample) and Fraction (if available) column values using the Set experiment and Set fractions buttons.
Group-specific parameters pane
Change Type to Reporter ion MS3 and then select "10plex TMT". Update isobaric impurity values, if available. For Modifications, select Oxidation (M), Acetyl (Protein N-term) and Deamidation (NQ) as Variable modifications, and Carbamidomethyl (C), TMT-labeled N-terminus and TMT-labeled lysine as Fixed modifications. Specify Trypsin/P as the cleavage enzyme under Digestion and allow up to 2 missed cleavages. Set Label-free quantification to None.
Global parameters pane
For Sequences, select a reference proteome database (UniProt FASTA file) and update Max. peptide mass [Da] to 6000. For Protein quantification, select Use only unmodified peptides and a list of modifications such as Oxidation (M), Acetyl (Protein N-term) and Deamidation. Select Match between runs in the Identification tab.
Update the following two parameters for MS/MS analyzer:
a. FTMS MS/MS match tolerance: 0.05 Da
b. ITMS MS/MS match tolerance: 0.6 Da
For Folder location select tmp directory (optional).
Further data processing is performed using Perseus v1.6.12.0 4. The search results in proteinGroups.txt generated by MaxQuant are directly processed by Perseus. MaxQuant reports the TMT-MS3 quantitative relative abundance metrics in the columns titled "Reporter intensity corrected".
Further reading:
Data normalization and analysis in multiple TMT experimental designs 5
Multiplexed Protein Quantification Using the Isobaric TMT … 6
Isobaric matching between runs and novel PSM-level normalization in MaxQuant … 7
AWS Windows instance for MaxQuant/Perseus
Bąchor, R.; Waliczek, M.; Stefanowicz, P.; Szewczuk, Z. Trends in the Design of New Isobaric Labeling Reagents for Quantitative Proteomics. Molecules 2019, 24, 701. ↩
A. Hogrebe, L. von Stechow, D.B. Bekker-Jensen, B.T. Weinert, C.D. Kelstrup, J.V. Olsen Benchmarking common quantification strategies for large-scale phosphoproteomics Nat. Commun., 9 (2018), p. 1045. ↩
Isobaric matching between runs and novel PSM-level normalization ↩
There are 2 steps to analyze Spatial RNA-seq data1.
Step 1: `spaceranger mkfastq` demultiplexes raw base call (BCL) files generated by Illumina sequencers into FASTQ files.
Step 2: `spaceranger count` takes FASTQ files from `spaceranger mkfastq` and performs alignment, filtering, barcode counting, and UMI counting.
Running pipelines on cluster requires the following:
1. Load the Space Ranger module (`spaceranger-1.0.0`)1 or download and uncompress Space Ranger in your `$HOME` directory and add it to PATH in `~/.bashrc`.
2. Update the job config file (`spaceranger-1.0.0/external/martian/jobmanagers/config.json`) for threads and memory. For example
"threads_per_job": 8,
"memGB_per_job": 64,
3. Update the template file (`spaceranger-1.0.0/external/martian/jobmanagers/sge.template`).
#!/bin/bash
#$ -pe smp __MRO_THREADS__
##$ -l mem_free=__MRO_MEM_GB__G
(comment this line if your cluster does not support it!)
#$ -q b.q
#$ -S /bin/bash
#$ -m abe
#$ -M <e-mail>
cd __MRO_JOB_WORKDIR__
source ../spaceranger-1.0.0/sourceme.bash
(update with the complete path)
For clusters whose job managers do not support memory requests, it is possible to request memory in the form of cores via the `--mempercore` command-line option. This option scales up the number of threads requested via the `__MRO_THREADS__` variable according to how much memory a stage requires. Read more at Cluster Mode.
4. Download spatial gene expression, image file and reference genome datasets from 10XGenomics.
5. Create an `sge.sh` file
TR="$HOME/refdata-cellranger-mm10-3.0.0"
FASTQS="$HOME/V1_Adult_Mouse_Brain_fastqs"
cd $HOME/10xgenomics/out
Output files will appear in the out/ subdirectory within the pipeline output directory, which is named by the --id argument (i.e. Adult_Mouse_Brain).
spaceranger count --disable-ui \
--id=Adult_Mouse_Brain \
--transcriptome=${TR} \
--fastqs=${FASTQS} \
--sample=V1_Adult_Mouse_Brain \
--image=$DATA_DIR/V1_Adult_Mouse_Brain_image.tif \
--slide=V19L01-041 \
--area=C1 \
--jobmode=sge \
--mempercore=8 \
--jobinterval=5000 \
--maxjobs=3
6. Execute a command in screen, then detach and reconnect
Use the `screen` command to get in and out of the system while keeping your processes running.
screen -S screen_name
bash sge.sh
If you want to exit the terminal without killing the running process, press Ctrl+A followed by D.
To reconnect to the screen: screen -R screen_name
7. Monitor work progress through a web browser
Open the `_log` file present in the output folder Adult_Mouse_Brain. If you see a line such as serving UI as http://cluster.university.edu:3600?auth=rlSdT_QLzQ9O7fxEo-INTj1nQManinD21RzTAzkDVJ8, then type the following from your laptop
ssh -NT -L 9000:cluster.university.edu:3600 user@cluster.university.edu
user@cluster.university.edu's password:
Then access the UI using the following URL in your web browser
http://localhost:9000/
ATAC-seq profiles open chromatin by simultaneously fragmenting and tagging genomic DNA with sequencing adapters using the hyperactive Tn5 transposase enzyme 1. Other global chromatin accessibility methods include FAIRE-seq and DNase-seq. This document outlines a typical ATAC-seq data analysis workflow.
Pre-processing of raw sequencing reads – before mapping the raw reads to the genome, trim the adapter sequences. Poor read quality or sequencing errors often lead to low mapping rate.
Mapping/alignment of sequencing reads to a reference genome – use Burrows-Wheeler Aligner (BWA) for mapping of sequencing reads. The output alignment file will be saved as a sequence alignment/map (SAM) format or binary version of SAM called BAM. Mark the duplicate reads using Picard 2 and exclude reads mapping to mitochondrial DNA and other chromosomes from analysis together with low quality reads (MAPQ<10 and reads in Encode black list regions) using SAMtools 3.
Filtering and shifting of the mapped reads – shift the read positions by +4 and -5 bp in the BAM file before peak calling to adjust the read alignments. When the Tn5 transposase cuts open chromatin regions, it introduces two cuts that are separated by 9 bp. Therefore, ATAC-seq reads aligning to the positive and negative strands need to be adjusted by +4 bp and -5 bp, respectively, to represent the center of the transposase binding site. Picard CollectInsertSizeMetrics is used to compute the fragment sizes on the shifted BAM files.
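The strand-dependent shift can be sketched on SAM-formatted text with awk. This is a simplified illustration that adjusts only the leftmost mapping position (POS, column 4); production pipelines should use a CIGAR-aware tool such as deepTools alignmentSieve with --ATACshift:

```shell
# Simplified Tn5 +4/-5 shift on SAM text (demo only).
# FLAG bit 16 set   => reverse-strand read => shift -5 bp
# FLAG bit 16 unset => forward-strand read => shift +4 bp
printf 'read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*\n' |
awk 'BEGIN { OFS = "\t" }
  /^@/ { print; next }                    # pass SAM header lines through
  {
    if (int($2 / 16) % 2 == 1) $4 -= 5    # reverse strand
    else                       $4 += 4    # forward strand
    print
  }'
# the forward-strand read above is reported at POS 104
```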
Identification and visualization of the ATAC-seq peaks – use MACS2 for peak calling with the parameters nomodel or BAMPE 4, and identify differentially enriched peaks using the MACS2 `bdgdiff` module. Individual peaks separated by <100 bp are joined together. For peak annotation and functional analysis use the R package ChIPpeakAnno or HOMER 5,6. First, ATAC-seq peaks are categorized into different groups based on the nearest RefSeq gene feature, i.e. promoter, untranslated regions (UTRs), intron and exon. Second, peaks that are within 5 kb upstream and 3 kb downstream of the Transcription Start Site (TSS) are associated with the nearest genes. Finally, these genes are analyzed for over-represented gene ontology (GO) terms and KEGG pathways using ChIPpeakAnno. Visualize all sequencing tracks using the Integrative Genomics Viewer (IGV) 7.
Scripts are available for HPC Cluster.
For further reading: ATAC-seq-data-analysis-from-FASTQ-to-peaks
Cell Ranger can be run in cluster mode, using job schedulers like Sun Grid Engine (SGE) or Load Sharing Facility (LSF) as the queuing system, which allows highly parallelizable jobs.
There are 4 steps to analyze Chromium Single Cell data1.
Step 1: `cellranger mkfastq` demultiplexes raw base call (BCL) files generated by Illumina sequencers into FASTQ files.
Step 2: `cellranger count` takes FASTQ files from `cellranger mkfastq` and performs alignment, filtering, barcode counting, and unique molecular identifier (UMI) counting.
When doing large studies involving multiple GEM wells, first run `cellranger count` on FASTQ data from each of the GEM wells individually, and then pool the results using `cellranger aggr`, as described here.
Step 3: `cellranger aggr` aggregates outputs from multiple runs of `cellranger count`.
Step 4: Use R package Seurat2 for downstream analysis.
Running pipelines on cluster requires the following:
1. Download and uncompress `cellranger-7.2.0`1 in your `$HOME` directory and add it to PATH in `~/.bashrc`.
2. Update the job config file (`external/martian/jobmanagers/config.json`) for threads and memory. For example
"threads_per_job": 4,
"memGB_per_job": 32,
"name":"SGE_ROOT",
"description":"/opt/sge"
3. Update the template file `sge.template` (`external/martian/jobmanagers/sge.template`).
#!/bin/bash
#$ -N __MRO_JOB_NAME__
#$ -V
#$ -pe smp __MRO_THREADS__
#$ -q b.q
#$ -cwd
#$ -l mem_free=__MRO_MEM_GB__G
#$ -o __MRO_STDOUT__
#$ -e __MRO_STDERR__
#$ -m abe
#$ -M <email>
#$ -S /bin/bash
cd __MRO_JOB_WORKDIR__
source $HOME/10xgenomics/cellranger-7.2.0/sourceme.bash
For clusters whose job managers do not support memory requests, it is possible to request memory in the form of cores via the `--mempercore` command-line option. This option scales up the number of threads requested via the `__MRO_THREADS__` variable according to how much memory a stage requires. See more at Cluster Mode.
4. Download single cell gene expression and reference genome datasets from 10XGenomics.
5. Create an `sge.sh` file for Single Cell 3′ gene expression.
Output files will appear in the out/ subdirectory within the pipeline output directory, which is named by the --id argument (i.e. 10XGTX_v3).
#!/bin/bash
cd $HOME/10xgenomics/out
FASTQS="$HOME/pbmc_10k_v3_fastqs"
TR="$HOME/refdata-gex-GRCh38-2020-A"
cellranger count --disable-ui \
--id=10XGTX_v3 \
--transcriptome=${TR} \
--fastqs=${FASTQS} \
--sample=pbmc_10k_v3 \
--expect-cells=10000 \
--jobmode=sge \
--mempercore=8 \
--jobinterval=5000 \
--maxjobs=3
Execute the command in screen, then detach and reconnect:
Start a screen session: screen -S some_name.
Run the above script: sh sge.sh.
To detach the screen session from the terminal, press Ctrl + a followed by d.
To reconnect to the screen: screen -R some_name.
If the job is in the Eqw (job waiting in error) state, inspect the reason with
qstat -j jobid | grep error
Once you understand the reason and have fixed it, clear the error state with
qmod -cj jobid
for Single Cell 5′ gene expression
Use either --force-cells or --expect-cells.
#!/bin/bash
cd $HOME/10xgenomics/out
TR="$HOME/refdata-cellranger-GRCh38-3.0.0"
FASTQS="$HOME/vdj_v1_hs_nsclc_5gex_fastqs"
cellranger count \
--id=10XGTX_v5 \
--fastqs=${FASTQS} \
--transcriptome=${TR} \
--sample=vdj_v1_hs_nsclc_5gex \
--force-cells=7802 \
--jobmode=sge \
--mempercore=8 \
--maxjobs=3 \
--jobinterval=2000
Run the script inside a screen session, detaching and reconnecting as described above.
for Feature Barcode Analysis
Tested on Single Cell 5′ gene expression and cell surface protein (Feature Barcoding/Antibody Capture Assay) data.
For more information, please visit the Single Cell Gene Expression with Feature Barcoding page (Single Cell 3’) or the Single Cell Immune Profiling with Feature Barcoding page (Single Cell 5’).
Currently available Feature Barcode kits for Single Cell Gene Expression Feature Barcode Technology
10x Solution | Gene Expression | Cell Surface Protein | CRISPR Screening |
---|---|---|---|
Single Cell Gene Expression v2 | ✓ | - | - |
Single Cell Gene Expression v3 | ✓ | ✓ | ✓ |
Single Cell Gene Expression v3.1 | ✓ | ✓ | ✓ |
Single Cell Gene Expression v3.1 (Dual Index) | ✓ | ✓ | ✓ |
Currently available Feature Barcoding kits for Single Cell Immune Profiling Feature Barcoding Technology
10x Solution | TCR/Ig | Gene Expression | Cell Surface Marker | TCR-Antigen Specificity |
---|---|---|---|---|
Single Cell Immune Profiling | ✓ | ✓ | - | - |
Single Cell Immune Profiling with Feature Barcoding technology | ✓ | ✓ | ✓ | ✓ |
To enable Feature Barcode analysis, cellranger count needs two inputs:
First, a Libraries CSV file declaring the input library data sources: one entry for the normal single-cell gene expression reads and one for the Feature Barcode reads (the FASTQ directory, sample name, and library type of each input dataset).
LIBRARY=$HOME/vdj_v1_hs_pbmc2_5gex_protein_fastqs/vdj_v1_hs_pbmc2_5gex_protein_library.csv
fastqs | sample | library_type |
---|---|---|
/path/to/antibody_fastqs | vdj_v1_hs_pbmc2_antibody | Antibody Capture |
/path/to/gene_expression_fastqs | vdj_v1_hs_pbmc2_5gex | Gene Expression |
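The table above can be written out as a plain CSV; a sketch with the same columns (the /path/to/... entries are the table's placeholders, not real directories, and the /tmp path is illustrative):

```shell
# Libraries CSV matching the table above (fastqs paths are placeholders).
cat > /tmp/vdj_v1_hs_pbmc2_5gex_protein_library.csv <<'EOF'
fastqs,sample,library_type
/path/to/antibody_fastqs,vdj_v1_hs_pbmc2_antibody,Antibody Capture
/path/to/gene_expression_fastqs,vdj_v1_hs_pbmc2_5gex,Gene Expression
EOF
head -n 1 /tmp/vdj_v1_hs_pbmc2_5gex_protein_library.csv
```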
Second, a Feature Reference CSV file declaring the feature-barcode constructs and their associated barcodes. The pattern column is used to extract the Feature Barcode sequence from the read sequence.
FEATURE_REF=$HOME/vdj_v1_hs_pbmc2_5gex_protein_fastqs/vdj_v1_hs_pbmc2_5gex_protein_feature_ref.csv
id | name | read | pattern | sequence | feature_type |
---|---|---|---|---|---|
CD3 | CD3_TotalSeqC | R2 | 5PNNNNNNNNNN(BC)NNNNNNNNN | CTCATTGTAACTCCT | Antibody Capture |
CD19 | CD19_TotalSeqC | R2 | 5PNNNNNNNNNN(BC)NNNNNNNNN | CTGGGCAATTACTCG | Antibody Capture |
CD45RA | CD45RA_TotalSeqC | R2 | 5PNNNNNNNNNN(BC)NNNNNNNNN | TCAATCCTTCCGCTT | Antibody Capture |
… | … | … | … | … | … |
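Similarly, the feature reference rows above translate to a CSV like the following (only the three rows shown in the table are included; the /tmp path is illustrative):

```shell
# Feature Reference CSV matching the table above.
cat > /tmp/feature_ref.csv <<'EOF'
id,name,read,pattern,sequence,feature_type
CD3,CD3_TotalSeqC,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,CTCATTGTAACTCCT,Antibody Capture
CD19,CD19_TotalSeqC,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,CTGGGCAATTACTCG,Antibody Capture
CD45RA,CD45RA_TotalSeqC,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,TCAATCCTTCCGCTT,Antibody Capture
EOF
wc -l < /tmp/feature_ref.csv
```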
Feature and Library Types - When inputting Feature Barcode data to Cell Ranger via the Libraries CSV file, you must declare the library_type of each library. Examples include Antibody Capture, CRISPR Guide Capture, or Custom. If your assay scheme creates a library containing multiple library_types, for example if you are using both CRISPR Guide Capture and Antibody Capture features, you will need to run Cell Ranger multiple times, passing different library_type values in the Libraries CSV file. If Targeted Gene Expression data is analyzed in conjunction with CRISPR-based Feature Barcode data, additional requirements are imposed on the Feature Reference CSV file.
TotalSeq™ Reagents for Single-Cell Proteogenomics
TotalSeq™-B is a line of antibody-oligonucleotide conjugates supplied by BioLegend that are compatible with the Single Cell 3’ v3 assay.
TotalSeq™-C is a line of antibody-oligonucleotide conjugates supplied by BioLegend that are compatible with the Single Cell 5’ assay.
TotalSeq™-A is a line of antibody-oligonucleotide conjugates supplied by BioLegend that are compatible with the Single Cell 3’ v2 and Single Cell 3’ v3 kits. Although TotalSeq™-A can be used with the CITE-Seq assay, CITE-Seq is not a 10x supported assay.
CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by Sequencing) allows simultaneous analysis of transcriptome and cell surface protein information at the level of a single cell.3
“The pipeline first extracts and corrects the cell barcode and UMI from the feature library using the same methods as gene expression read processing. It then matches the Feature Barcode read against the list of features declared in the above Feature Barcode Reference. The counts for each feature are available in the feature-barcode matrix output files.”
#!/bin/bash
cd $HOME/10xgenomics/out
TR="$HOME/refdata-cellranger-GRCh38-3.0.0"
LIBRARY=$HOME/vdj_v1_hs_pbmc2_5gex_protein_fastqs/vdj_v1_hs_pbmc2_5gex_protein_library.csv
FEATURE_REF=$HOME/vdj_v1_hs_pbmc2_5gex_protein_fastqs/vdj_v1_hs_pbmc2_5gex_protein_feature_ref.csv
cellranger count \
--libraries=${LIBRARY} \
--feature-ref=${FEATURE_REF} \
--id=PBMC_5GEX \
--transcriptome=${TR} \
--expect-cells=9000 \
--jobmode=sge \
--mempercore=8 \
--maxjobs=3 \
--jobinterval=5000
Run the script inside a screen session, detaching and reconnecting as described above.
6. Monitor work progress through a web browser.
Open the _log file in the pipeline output folder PBMC_5GEX.
If you see a line such as serving UI as http://cluster.university.edu:3600?auth=rlSdT_QLzQ9O7fxEo-INTj1nQManinD21RzTAzkDVJ8, forward the port by typing the following from your laptop:
ssh -NT -L 9000:cluster.university.edu:3600 user@cluster.university.edu
user@cluster.university.edu's password:
Then access the UI using the following URL in your web browser
http://localhost:9000/
7. Single Cell Integration in Seurat
Seurat is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single-cell transcriptomic measurements, and to integrate diverse types of single-cell data. Seurat starts by reading the Cell Ranger output (barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz):
pbmc.data <- Read10X(data.dir = "~/PBMC_5GEX/outs/filtered_feature_bc_matrix/")
Quantitative Insights Into Microbial Ecology “QIIME” 2 (release 2018.11)1 is a widely used package to identify the abundance of microbes using 16S rRNA. Briefly, a feature table containing the counts of each unique sequence in the samples is constructed using the qiime dada2 denoise-paired method. A feature is essentially any unit of observation, e.g., an OTU (Operational Taxonomic Unit), a sequence variant, a gene, or a metabolite. In QIIME 2 (currently), most features will be OTUs or sequence variants (alternatively, for OTUs, use the QIIME 2 plugin q2-vsearch).
Data produced by QIIME 2 exist as QIIME 2 artifacts. A QIIME 2 artifact typically has the .qza file extension when its output data are stored in a file. Visualizations are another type of data (.qzv file extension) generated by QIIME 2; they can be viewed through the web interface https://view.qiime2.org without requiring a QIIME installation. Since QIIME 2 works with artifacts instead of data files (e.g., FASTA files), we must create a QIIME 2 artifact by importing our fastq.gz data files.
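Importing fastq.gz files typically goes through a manifest file. For the 2018.11-era releases the CSV manifest format (sample-id, absolute-filepath, direction) was standard; a sketch with hypothetical sample names and paths (the qiime command is shown but not run here, and its exact flags should be checked against your installed release):

```shell
# Hypothetical CSV manifest for paired-end reads (paths are illustrative).
cat > /tmp/manifest.csv <<'EOF'
sample-id,absolute-filepath,direction
sample1,/data/sample1_R1.fastq.gz,forward
sample1,/data/sample1_R2.fastq.gz,reverse
EOF
# Import into a .qza artifact (requires a QIIME 2 installation):
#   qiime tools import \
#     --type 'SampleData[PairedEndSequencesWithQuality]' \
#     --input-path /tmp/manifest.csv \
#     --input-format PairedEndFastqManifestPhred33 \
#     --output-path demux.qza
head -n 1 /tmp/manifest.csv
```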
Scripts are available for AWS Cloud and HPC Cluster.
MetaPhlAn2 provides microbial (bacterial, archaeal, viral, and eukaryotic) taxonomic profiling, allowing the quantification of individual species across metagenomic samples. MetaPhlAn2 relies on ~1M unique clade-specific marker genes identified from ~17,000 reference genomes. Microbial reads aligned by MetaPhlAn2 that belong to clades with no sequenced genomes available are reported as an “unclassified” subclade of the closest ancestor with available sequence data. HUMAnN2 utilizes the MetaCyc database as well as the UniRef gene-family catalog to characterize the microbial pathways present in samples. HUMAnN2 relies on programs such as Bowtie2 (for accelerated nucleotide-level searches) and DIAMOND (for accelerated translated searches) to compute the abundance of gene families and metabolic pathways present. HUMAnN2 generates three outputs: 1) gene families based on UniRef proteins and their abundances reported in reads per kilobase, 2) MetaCyc pathways and their coverage, and 3) MetaCyc pathways and their abundances reported in reads per kilobase.
Scripts are available at shotgun_metagenomics