Table 1. Overview of common proteomic platforms (Jiang W et al, Cancers, 2022).
Analytical Technique | Category | Sample Input Required | Accepted Biospecimen Types |
---|---|---|---|
Proximity Extension Assay (Olink) | Antibody | 1 µL | Plasma, tissue/cell, synovial fluid, CSF, plaque extract, and saliva |
Reverse Phase Protein Arrays | Antibody | 5 µg (1.0- 1.5 mg/mL protein) | Tissue/cell, plasma, serum, biopsies, body fluids |
Bio-Plex | Antibody (bead) | 12.5 µL (serum/plasma); 50 µL (cell culture) | Plasma, serum, tissue/cell |
Simoa | Antibody (bead) | 25 µL | Plasma, serum, urine, tissue/cell, CSF, saliva |
Aptamer Group (Optmer) | Aptamer | 38 µL | Plasma (diagnostics and therapeutics), urine, tissue/cell, liquid matrices |
Base Pair Technologies | Aptamer | 5–100 µL | Plasma, serum, tissue/cell |
SOMAscan | Aptamer | 55–100 µL | Plasma, serum, CSF, urine, cell/tissue, synovial fluid, exosomes |
Electrochemiluminescence Immunoassay | ECLIA | 50 µL | Plasma, serum, tissue/cell, CSF, urine, blood spots, tears, synovial fluid, tissue extracts |
Multiplex ELISA | ELISA | 25–50 µL | Plasma, serum, tissue/cell, urine, saliva, CSF |
Singleplex ELISA | ELISA | 100 µL | Plasma, serum, tissue/cell, urine, saliva, CSF |
2D-PAGE | Gel electrophoresis | ~100 µg (15–50 µL) | Plasma, serum, tissue/cell, urine |
DDA-MS | MS | 10 µL | Plasma, serum, tissue/cell |
SWATH-MS | MS (DIA) | 5–10 µg | Plasma, serum, tissue/cell, platelets, monocytes/neutrophils |
iTRAQ | MS (labeling in LC–MS/MS) | 12 µg | Plasma, serum, tissue/cells, saliva |
SRM/MRM | MS (LC–MS/MS) | 15 µL | Plasma, tissue/cell, dried blood spots |
The SomaScan Assay v4.1 simultaneously measures ~6,600 unique human proteins in a single sample (see Table 2). These protein targets are measured by ~7,300 aptamers called SOMAmers (Slow Off-rate Modified Aptamers). SOMAmers are short, chemically modified single-stranded DNA molecules that bind specifically to their protein targets. The SomaScan assay measures native proteins in complex matrices by transforming each individual protein concentration into a corresponding SOMAmer reagent concentration, which is then quantified using DNA microarrays. Because SOMAmer reagents are selected against proteins in their native folded conformations, they generally require an intact tertiary protein structure for binding.
Table 2. The 7k SomaScan Assay v4.1 panel (7,596 SOMAmers in total, of which 7,335 human SOMAmers map to 6,414 unique human UniProt IDs).
Organism | SOMAmers | UniProt IDs (all) | UniProt IDs (Unique) | Protein Targets | Gene IDs | Gene Symbols |
---|---|---|---|---|---|---|
Human | 7335 | 7301 | 6414 | 6610 | 6408 | 6398 |
Mouse | 236 | 236 | 4 | 4 | 3 | 3 |
African clawed frog | 3 | 3 | 1 | 2 | 1 | 1 |
Gila monster | 3 | 3 | 1 | 1 | 0 | 0 |
Hornet | 3 | 3 | 1 | 1 | 0 | 1 |
Jellyfish | 3 | 3 | 1 | 1 | 0 | 1 |
Thermus thermophilus | 3 | 3 | 1 | 1 | 0 | 1 |
Common eastern firefly | 2 | 2 | 1 | 1 | 0 | 0 |
Bacillus stearothermophilus | 1 | 1 | 1 | 1 | 0 | 1 |
Ensifer meliloti | 1 | 1 | 1 | 1 | 0 | 1 |
European elder | 2 | 2 | 1 | 1 | 0 | 0 |
HIV-1 | 1 | 1 | 1 | 1 | 0 | 1 |
HIV-2 | 1 | 1 | 1 | 1 | 1 | 1 |
Red alga | 1 | 1 | 1 | 1 | 1 | 1 |
strain K12 | 1 | 1 | 1 | 1 | 1 | 1 |
Total | 7596 | 7562 | 6431 | 6628 | 6415 | 6411 |
ADAT file
ADAT is a tab-delimited text file format. The contents include SOMAmer reagent intensities, sample data, sequence data and experimental metadata. For each SOMAmer reagent sequence, the ADAT file typically contains corresponding protein name, UniProt ID, Entrez Gene ID and Entrez Gene symbol.
SomaDataIO
SomaDataIO v5.3.1 is an R package for working with the SomaLogic ADAT file format.
library(SomaDataIO)
library(purrr)
library(tidyr)
library(dplyr)
library(ggplot2)
The `read_adat()` function imports data from ADAT files.
base.dir = "/Users/adinasa/Documents/"
adat_file <- "example.adat"
my_adat <- read_adat(paste0(base.dir,adat_file))
Update the ADAT file with sample group information (adding sample group details from external file). Save the updated ADAT file (optional).
meta_file = paste(base.dir, "MAPPING.csv", sep="/")
meta <- read.csv(meta_file, header = T, stringsAsFactors = FALSE)
meta$SampleId <- as.character(meta$SampleId)
my_adat <- dplyr::left_join(my_adat,meta, by="SampleId", keep=FALSE)
write_adat(my_adat, file = paste(base.dir, "example_updated.adat", sep="/"))
Utility functions.
regex for analytes
is_seq <- function(.x) grepl("^seq\\.[0-9]{4}", .x)
center/scale vector (z-scores).
cs <- function(.x) {
  out <- .x - mean(.x)
  out / sd(out)
}
Data were log2-transformed within each sample. Control and Disease are the two groups here (group labels may vary in your data).
cleanData <- my_adat %>%
filter(SampleType == "Sample") %>% drop_na(Group) %>%
log2() %>%
mutate(SampleGroup = as.numeric(factor(Group, levels = c("Control", "Disease"))) - 1) %>%
modify_if(is_seq(names(.)), cs)
Select human proteins with a UniProt ID and QC = PASS.
t_tests <- getAnalyteInfo(cleanData) %>%
filter(ColCheck == "PASS") %>%
filter(Organism == "Human") %>%
filter(UniProt != "") %>%
select(AptName, SeqId, Target = TargetFullName, Organism, EntrezGeneID, EntrezGeneSymbol, UniProt, ColCheck)
Perform a Student's t-test for each analyte.
t_tests <- t_tests %>%
mutate(
formula = map(AptName, ~ as.formula(paste(.x, "~ SampleGroup"))),
t_test = map(formula, ~ stats::t.test(.x, data = cleanData, var.equal = TRUE)),
t_stat = map_dbl(t_test, "statistic"),
p.value = map_dbl(t_test, "p.value"),
fdr = p.adjust(p.value, method = "BH")
) %>% arrange(p.value)
The results are used to identify proteins significantly associated with disease using a Benjamini-Hochberg false discovery rate (FDR) threshold of 1%.
t_tests[t_tests$fdr <= 0.01, ]
Further reading …
Tandem Mass Tag (TMT)-based quantitation of proteins
Label-Free Quantitation (LFQ) of proteins
KM Analysis using R
The packages used for the analysis are survival and survminer. Use install.packages() to install them if they are not already available in your R workspace.
Load Required libraries
library(survival)
library(survminer)
library(dplyr)
Read the vital dataset
base.dir = "/Users/adinasa/Documents/Nabil/survival_analysis"
data = read.csv(file=paste0(base.dir,"/KM_Test_data.csv"),header=T)
Examine the dataset (Vital status: 1 for dead; 0 for alive)
head(data)
ID p16 Status Days
GHN-11 unknown 1 803
GHN-15 unknown 1 775
GHN-20 unknown 1 150
GHN-21 unknown 0 2036
GHN-24 unknown 1 718
GHN-25 negative 1 598
HN-39 positive 0 1232
Create a survival object, usually used as a response variable in a model formula.
surv_obj = survival::Surv(time=data$Days, event = data$Status)
Wrapper around the standard survfit() function to create survival curves
fit = survminer::surv_fit(surv_obj ~ p16, data = data)
You can replace the above two steps with
fit = surv_fit(Surv(Days, Status) ~ p16, data = data)
Plot the KM curve. With pval = TRUE argument, it plots the p-value of a log rank test, which will help us to get an idea if the groups are significantly different or not.
png(filename = paste(base.dir, "KM_Plot.png", sep="/"),width = 1300, height = 1300, res=200)
ggsurvplot(fit, pval=TRUE, risk.table=TRUE)
dev.off()
KM plot The lines represent the survival curves of the three groups (HPV status: positive [N=23], negative [N=5] and unknown [N=7]). A vertical drop in a curve indicates an event (e.g. death). Events: HPV-positive 5 (21.7%); HPV-negative 4 (80%); HPV-unknown 5 (71.4%).
The lengths of the horizontal lines along the X-axis of serial times represent the survival duration for that interval. The vertical tick mark on the curves means that a patient was censored at this time; a patient has not (yet) experienced the event of interest, such as death, within the study time period. If many patients were censored in a given group(s), one must question how the study was carried out or how the type of treatment affected the patients. This stresses the importance of showing censored patients as tick marks in survival curves.
Risk table At time zero, the survival probability is 1.0 (100% of the participants are alive), so all 35 participants are at risk; after 1000 days, 21 participants remain alive or at risk; after 3000 days, 3 participants remain alive or at risk.
Variant Call Format (VCF) is a text file with multiple lines of meta-information, a header line, and data rows, each containing information about a locus or sequence variant in the genome. Aside from the chromosomal location and observed alleles, these loci are essentially anonymous. VCF is the primary (and only well-supported) format used by GATK for variant calls. A GVCF (Genomic VCF) is a kind of VCF, so the basic format specification is the same as for a regular VCF, but a Genomic VCF contains extra information.
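A minimal sketch of the format (illustrative values only; a real VCF typically carries many more meta-information lines and INFO/FORMAT annotations):

```
##fileformat=VCFv4.2
##source=exampleOnly
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    12345   .       A       G       60      PASS    DP=30
```

Each data row records one variant: chromosome, 1-based position, identifier, reference and alternate alleles, quality, filter status, and a semicolon-separated INFO field of key=value annotations.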
The Ensembl Variant Effect Predictor (VEP) annotates
• the location of the variants on the genome
• known variants based on database matching
• gene and transcript names affected by the variants
• the consequence of variation for the protein sequence (stop gained/lost, missense, frameshift)
STEP 1: Install Docker Engine
“Docker Engine is available on a variety of Linux platforms, macOS and Windows 10 through Docker Desktop, and as a static binary installation”.
STEP 2: Clone or Pull the VEP docker image
Create a directory on your machine and make sure it has read and write access granted, so that the Docker container can write VEP output into it.
mkdir $HOME/vep_data
chmod a+rwx $HOME/vep_data
docker pull ensemblorg/ensembl-vep
Pre-configure the volume `vep_data` for the container; this is required if you wish to download data (e.g. cache files) that persists across sessions. The following command mounts the volume `$HOME/vep_data` into `/opt/vep/.vep` in the container. If you copy your data to the `$HOME/vep_data` directory, you will see that data in the `/opt/vep/.vep` directory of the container.
`-t`: allocate a pseudo-TTY
`-i`: interactive; keep STDIN open even if not attached
`-v`: bind mount a volume
`docker_image` is ensemblorg/ensembl-vep
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep
exit
The above `docker run` command lets you enter the container in an interactive terminal. Once you are inside, run `ls -lsa` to list the files:
vep@56d4ccaf3e34:~/src/ensembl-vep$ ls -lsa
total 188
4 drwxr-xr-x 1 vep vep 4096 Apr 7 15:30 .
4 drwxr-xr-x 1 root root 4096 Apr 7 15:26 ..
4 drwxr-xr-x 45 vep vep 4096 Apr 7 15:27 Bio
20 -rwxr-xr-x 1 vep vep 16698 Apr 7 15:19 convert_cache.pl
4 drwxr-xr-x 2 vep vep 4096 Apr 7 15:19 examples
20 -rwxr-xr-x 1 vep vep 16620 Apr 7 15:19 filter_vep
8 -rwxr-xr-x 1 vep vep 4880 Apr 7 15:19 haplo
56 -rwxr-xr-x 1 vep vep 56607 Apr 7 15:19 INSTALL.pl
12 -rw-r--r-- 1 vep vep 11478 Apr 7 15:19 LICENSE
4 drwxr-xr-x 3 vep vep 4096 Apr 7 15:19 modules
16 -rw-r--r-- 1 vep vep 12906 Apr 7 15:19 README.md
4 drwxr-xr-x 2 vep vep 4096 Apr 7 15:19 validator
4 -rw-r--r-- 1 vep vep 128 Apr 7 15:29 variant_effect_output.txt_warnings.txt
8 -rwxr-xr-x 1 vep vep 5079 Apr 7 15:19 variant_recoder
16 -rwxr-xr-x 1 vep vep 14462 Apr 7 15:19 vep
4 drwxr-xr-x 2 vep vep 4096 Apr 7 15:27 .version
Depending on the plugin, you may need to download and prepare data for it. For example, the `dbNSFP` plugin needs the following preprocessing of the downloaded file. Programs such as `tabix`, `unzip`, `wget` and `zcat` are available in the container at /opt/vep/.vep/.
wget ftp://dbnsfp:dbnsfp@dbnsfp.softgenetics.com/dbNSFP4.1a.zip
unzip dbNSFP4.1a.zip
zcat dbNSFP4.1a_variant.chr1.gz | head -n1 > h
zgrep -h -v ^#chr dbNSFP4.1a_variant.chr* | sort -T /path/to/tmp_folder -k1,1 -k2,2n - | cat h - | bgzip -c > dbNSFP4.1a_grch38.gz
tabix -s 1 -b 2 -e 2 dbNSFP4.1a_grch38.gz
In a separate terminal, run the `docker ps` command to list all running containers. To see all containers, both stopped and running, use the `-a` flag.
docker ps -a
STEP 3: Cache and Plugins installation
Use the following commands to set up the cache and corresponding FASTA for human GRCh38 genome.
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep perl INSTALL.pl -a cf -s homo_sapiens -y GRCh38
To install all the available plugins
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep perl INSTALL.pl -a cfp -s homo_sapiens -y GRCh38 -g all
STEP 4: Create a `vep.sh` file
#!/bin/sh
# Example of VEP command line:
vep --cache --offline --format vcf --vcf --force_overwrite \
--dir_cache /opt/vep/.vep/ \
--dir_plugins /opt/vep/.vep/Plugins/ \
--input_file /opt/vep/.vep/input/Dystonia.gatk4.vqsr_variants.g.vcf.gz \
--output_file /opt/vep/.vep/output/Dystonia.gatk4.vqsr_vep.g.vcf \
--plugin dbNSFP,/opt/vep/.vep/Plugins/dbNSFP4.1a_grch38.gz,ALL
STEP 5: Copy or transfer the input data, i.e. the VCF file
mkdir -p /Users/adinasa/vep_data/{input,output,script}
cp Dystonia.gatk4.vqsr_variants.g.vcf.gz ~/vep_data/input/
cp vep.sh /Users/adinasa/vep_data/script
cp dbNSFP4.1a_grch38.gz /Users/adinasa/vep_data/Plugins
STEP 6: Run the script
Finally, run the script inside container.
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep
Starts a `bash` session:
vep@88d2b95cfd6a:~/src/ensembl-vep$ cd /opt/vep/.vep/script/
sh vep.sh
Liquid chromatography (LC) coupled with mass spectrometry (MS) has been widely used for protein expression quantification. Protein quantification by tandem-MS (MS/MS) uses integrated peak intensity from the parent-ion mass (MS1) or features from fragment-ions (MS2).
Label-free quantification (LFQ) may be based on precursor ion intensity (peak areas or peak heights) or on spectral counting. Here, the MaxLFQ algorithm is applied, which relies on chromatographic ion intensities.
iBAQ and LFQ intensity
MS1 methods use the iBAQ (intensity Based Absolute Quantification) algorithm; iBAQ intensity is a protein's total non-normalised intensity (all peptides) divided by the number of measurable tryptic peptides.
iBAQ = Σ intensity / # theoretical peptides
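As a toy illustration of this arithmetic (the intensity and peptide counts below are hypothetical, not taken from any real search output):

```shell
# Toy iBAQ calculation: summed peptide intensity divided by the number
# of theoretically observable tryptic peptides (hypothetical values).
total_intensity=3000000        # sum of all peptide intensities for one protein
n_theoretical_peptides=12      # in-silico tryptic peptides in the measurable mass range
ibaq=$((total_intensity / n_theoretical_peptides))
echo "iBAQ = $ibaq"            # prints: iBAQ = 250000
```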
LFQ (label-free quantitation) intensity is very similar to the iBAQ intensity, but the protein intensities are normalized to exclude outliers so that the ratio changes between samples are best represented.
Untargeted label-free quantitation (LFQ) of proteins aims to determine the relative amount of proteins in two or more biological samples. Mass spectrometer-generated raw files are used for label-free quantitation of proteins. Base peak chromatograms can be inspected visually using RawMeat 1, a data quality assessment tool designed for Thermo instruments.
1. MaxQuant Search
All raw files are processed together in a single run by MaxQuant v1.6.15.0 2 with default parameters except the following:
Raw data pane
Set experiment to assign a unique ID to each biological sample. If you don't assign a unique ID to each sample, MaxQuant will pool them together in the output. Set fractions to assign fraction values; if you have no fractionation, set 1.
Group-specific parameters pane
Global parameters pane
Match between runs: peptides which are present in several samples, but not identified via MS/MS in all of them, can still be identified via matching between runs. Setting it to TRUE boosts the number of identifications.
Database searches are performed using the Andromeda search engine (a peptide search engine based on probabilistic scoring) with the UniProt-SwissProt human canonical database as a reference and a database of common laboratory contaminants. MaxQuant reports the summed intensity for each protein, as well as its iBAQ and LFQ values.
Proteins that share all identified peptides are combined into a single protein group. Peptides that match multiple protein groups (“razor” peptides) are assigned to the protein group with the most unique peptides. MaxQuant employs the MaxLFQ algorithm for label-free quantitation (LFQ). Quantification will be performed using razor and unique peptides, including those modified by acetylation (protein N-terminal), oxidation (Met) and deamidation (NQ).
2. Quality Control of MaxQuant Search (optional)
PTXQC 3 is an R package for general quality control of proteomics data, which takes MaxQuant result files as input.
library("devtools")
install_github("cbielow/PTXQC", build_vignettes=TRUE, dependencies=TRUE)
library(PTXQC)
PTXQC::createReport("path_to_txt_directory")
Open final report file report_v1.0.5_combined.pdf
3. Post processing of MaxQuant Search results
Further data processing is performed using Perseus v1.6.14.0 4. In brief, protein group LFQ intensities are log2-transformed to reduce the effect of outliers. To overcome the obstacle of missing LFQ values, missing values are imputed before fitting the models. Hierarchical clustering is performed on Z-score-normalized, log2-transformed LFQ intensities. Log ratios are calculated as the difference in average log2 LFQ intensity between experimental and control groups. Two-tailed Student's t-test calculations are used for statistical testing. A protein is considered statistically significant if its fold change is ≥ 2 and FDR ≤ 0.01. Please refer to the Perseus documentation for more details.
Export Perseus processed LFQ data as a text file Perseus_filtered_transformed_valid_values_imputed_ttest.txt
4. Perseus exported file processing
PerseusR enables interoperability with the Perseus platform for omics data analysis (Tyanova et al. 2016). If you selected "Write quality and imputed matrices" when saving the Perseus-processed data as a text file, include `additionalMatrices=TRUE`.
library(PerseusR)
setwd("/Users/Documents/Proteomics/Perseus")
inFile <- "Perseus_filtered_transformed_valid_values_imputed_ttest.txt"
mdata <- PerseusR::read.perseus.as.matrixData(inFile,additionalMatrices=TRUE,check = FALSE)
# 1. log(LFQ) data with imputed values
data <- main(mdata) # head(data)
# 2. Gene/Protein names, p-value, FDR and fold change
annotations <- annotCols(mdata) # colnames(annotations)
# Select first annotation in Protein.ID "sp|P31943|HNRH1_HUMAN;sp|Q9NQA5|TRPV5_HUMAN"
annotations$Protein.IDs <- sub(";.*","", annotations$Protein.IDs)
# Select Protein ID as HNRH1_HUMAN from sp|P31943|HNRH1_HUMAN;
annotations$Protein.IDs <- as.character(lapply(strsplit(as.character(annotations$Protein.IDs), split="\\|"),tail, n=1))
# Remove "_HUMAN" part HNRH1_HUMAN
annotations$Protein.IDs <- sub("_HUMAN","",annotations$Protein.IDs)
# Select Columns; write complete column name.
annotations <- annotations[,c("Protein.IDs","Student.s.T.test.p.value....", "Student.s.T.test.Difference...")]
# Save the data
write.table(annotations, file = "Volcano_plot_data.txt", col.names = TRUE, row.names = FALSE, quote = FALSE)
5. Data visualization
Volcano plot illustrates significantly differentially abundant proteins. The following plot is generated using GraphPad Prism.
In addition to the above analytical considerations, good experimental design helps effectively identify true differences in the presence of variability from various sources and also avoids bias during data acquisition.
Further reading…
MaxQuant – Information and Tutorial
How to use Cloud for Proteomics Data Analysis
Data dependent vs Data independent proteomics
RawMeat is a nice Thermo raw file diagnostic tool developed by the now defunct Vast Scientific ↩
MaxQuant is a quantitative proteomics software package designed for analyzing large mass-spectrometric data sets ↩
PTXQC, an R-based quality control pipeline called Proteomics Quality Control ↩
Perseus is software package for shotgun proteomics data analyses ↩
Alkylation with iodoacetamide (IAA) after cystine reduction results in the covalent addition of a carbamidomethyl group that prevents the re-formation of disulfide bonds. The proteins are then digested overnight using trypsin or a trypsin/LysC mixture, followed by Tandem Mass Tag (TMT) labeling of lysine residues, pooling of all samples and fractionation of the peptide mixture. Finally, LC-MS/MS data are acquired and searched against a database for protein identification and quantification.
Fractionation prior to LC-MS/MS analysis helps detect low-abundance proteins (<100 ng/mL), which is the concentration range of most clinical biomarkers.
A TMT-based approach allows multiplexing of samples. TMT 6-plex reagents produce a series of reporter ions with nominal masses from 126 to 131 Da at 1 Da intervals. Several TMT reagents are commercially available, including TMTzero, TMT duplex, TMT 6-plex, and TMT 10-plex 1. They have the same chemical structure but contain different numbers and combinations of 13C and 15N isotopes in the reporter group, so the overall mass of each reagent is the same.
Isobaric labeling reagent (TMT) structure:
a. An amine-specific reactive group – an N-hydroxysuccinimide ester, which reacts with primary amines, i.e. unblocked N-termini and lysine side chains.
b. A mass reporter group for quantification – the reporter groups are partially fragmented from the peptide during precursor fragmentation in the mass spectrometer.
c. A mass normalizer group to link the reactive and reporter groups – mass normalizer group ensures that the peptide complexity in the MS1 spectra does not increase with multiplexing.
Unless otherwise noted, every analysis uses an MS3-based, TMT-centric mass spectrometry method. MS2-based TMT yields the highest precision but lowest accuracy due to ratio compression, which MS3-based TMT can partly rescue 2. The top 10 precursors are selected for MS2/MS3 analysis.
What are the correction factors used for TMT?
TMT reporter ion signals need to be adjusted to account for isotopic impurities in each TMT variant. For TMT 10-plex labeling, different batches of TMT reagents have slightly different isotope impurities, which need to be supplied to the database search to correct the reporter ion ratios. The isotope impurity information can be found in the reagent kit. However, I do not normally correct for isotopic impurities of TMT reagents, as this is typically a <1.2% shift.
Database search for protein identification and quantification:
The resulting TMT-MS3 data (.raw files) are processed using MaxQuant with its integrated Andromeda search engine (v1.6.7.0). Tandem mass spectra are searched against the UniProt database.
Download and install MaxQuant
Specification of Various Parameters in MaxQuant (All the other parameters in MaxQuant are set to the default values for processing orbitrap-type data)3:
Raw data pane
Using Load, select all raw files from all batches. Update the Experiment (a unique value for each sample) and Fraction (if available) column values using the Set experiment and Set fractions buttons.
Group-specific parameters pane
Change Type to Reporter ion MS3 and then select "10plex TMT". Update isobaric impurity values, if available. For Modifications, select Oxidation (M), Acetyl (Protein N-term) and Deamidation (NQ) as Variable modifications, and Carbamidomethyl (C), TMT-labeled N-terminus and TMT-labeled lysine as Fixed modifications. Specify Trypsin/P as the cleavage enzyme under Digestion and allow up to 2 missed cleavages. Set Label-free quantification to None.
Global parameters pane
For Sequences, select a reference proteome database (UniProt FASTA file) and update Max. peptide mass [Da] to 6000. For Protein quantification, select Use only unmodified peptides and a list of modifications such as Oxidation (M), Acetyl (Protein N-term) and Deamidation. Select Match between runs in the Identification tab.
Update the following two parameters for MS/MS analyzer:
a. FTMS MS/MS match tolerance: 0.05 Da
b. ITMS MS/MS match tolerance: 0.6 Da
For Folder location select tmp directory (optional).
Further data processing is performed using Perseus v1.6.12.0 4. The search results in proteinGroups.txt generated by MaxQuant are directly processed by Perseus. MaxQuant reports the TMT-MS3 quantitative relative abundance metrics in the columns titled "Reporter intensity corrected".
Further reading:
Data normalization and analysis in multiple TMT experimental designs 5
Multiplexed Protein Quantification Using the Isobaric TMT … 6
Isobaric matching between runs and novel PSM-level normalization in MaxQuant … 7
AWS Windows instance for MaxQuant/Perseus
Bąchor, R.; Waliczek, M.; Stefanowicz, P.; Szewczuk, Z. Trends in the Design of New Isobaric Labeling Reagents for Quantitative Proteomics. Molecules 2019, 24, 701. ↩
A. Hogrebe, L. von Stechow, D.B. Bekker-Jensen, B.T. Weinert, C.D. Kelstrup, J.V. Olsen Benchmarking common quantification strategies for large-scale phosphoproteomics Nat. Commun., 9 (2018), p. 1045. ↩
Isobaric matching between runs and novel PSM-level normalization ↩
There are 2 steps to analyze Spatial RNA-seq data1.
Step 1: `spaceranger mkfastq` demultiplexes raw base call (BCL) files generated by Illumina sequencers into FASTQ files.
Step 2: `spaceranger count` takes FASTQ files from `spaceranger mkfastq` and performs alignment, filtering, barcode counting, and UMI counting.
Running pipelines on cluster requires the following:
1. Load the Space Ranger module (`spaceranger-1.0.0`)1 or download and uncompress Space Ranger in your `$HOME` directory and add it to PATH in `~/.bashrc`.
2. Update the job config file (`spaceranger-1.0.0/external/martian/jobmanagers/config.json`) for threads and memory. For example
"threads_per_job": 8,
"memGB_per_job": 64,
3. Update the template file (`spaceranger-1.0.0/external/martian/jobmanagers/sge.template`).
#!/bin/bash
#$ -pe smp __MRO_THREADS__
##$ -l mem_free=__MRO_MEM_GB__G
(comment this line if your cluster does not support it!)
#$ -q b.q
#$ -S /bin/bash
#$ -m abe
#$ -M <e-mail>
cd __MRO_JOB_WORKDIR__
source ../spaceranger-1.0.0/sourceme.bash
(update with the complete path)
For clusters whose job managers do not support memory requests, it is possible to request memory in the form of cores via the `--mempercore` command-line option. This option scales up the number of threads requested via the `__MRO_THREADS__` variable according to how much memory a stage requires. Read more at Cluster Mode.
4. Download spatial gene expression, image file and reference genome datasets from 10XGenomics.
5. Create an `sge.sh` file
TR="$HOME/refdata-cellranger-mm10-3.0.0"
FASTQS="$HOME/V1_Adult_Mouse_Brain_fastqs"
cd $HOME/10xgenomics/out
Output files will appear in the out/ subdirectory within the pipeline output directory, which is named by the --id argument (i.e. Adult_Mouse_Brain).
spaceranger count --disable-ui \
--id=Adult_Mouse_Brain \
--transcriptome=${TR} \
--fastqs=${FASTQS} \
--sample=V1_Adult_Mouse_Brain \
--image=$DATA_DIR/V1_Adult_Mouse_Brain_image.tif \
--slide=V19L01-041 \
--area=C1 \
--jobmode=sge \
--mempercore=8 \
--jobinterval=5000 \
--maxjobs=3
6. Execute a command in screen, then detach and reconnect
Use the `screen` command to get in and out of the system while keeping your processes running.
screen -S screen_name
bash sge.sh
If you want to exit the terminal without killing the running process, press Ctrl+A followed by D.
To reconnect to the screen: screen -R screen_name
7. Monitor work progress through a web browser
Open the `_log` file present in the output folder Adult_Mouse_Brain. If you see a line such as serving UI as http://cluster.university.edu:3600?auth=rlSdT_QLzQ9O7fxEo-INTj1nQManinD21RzTAzkDVJ8, then type the following from your laptop
ssh -NT -L 9000:cluster.university.edu:3600 user@cluster.university.edu
user@cluster.university.edu's password:
Then access the UI using the following URL in your web browser
http://localhost:9000/
ATAC-seq profiles open chromatin by simultaneously fragmenting and tagging genomic DNA with sequencing adapters using the hyperactive Tn5 transposase enzyme 1. Other global chromatin accessibility methods include FAIRE-seq and DNase-seq. This document outlines a typical ATAC-seq data analysis workflow.
Pre-processing of raw sequencing reads – before mapping the raw reads to the genome, trim the adapter sequences. Poor read quality or sequencing errors often lead to low mapping rate.
Mapping/alignment of sequencing reads to a reference genome – use Burrows-Wheeler Aligner (BWA) for mapping of sequencing reads. The output alignment file will be saved as a sequence alignment/map (SAM) format or binary version of SAM called BAM. Mark the duplicate reads using Picard 2 and exclude reads mapping to mitochondrial DNA and other chromosomes from analysis together with low quality reads (MAPQ<10 and reads in Encode black list regions) using SAMtools 3.
Filtering and shifting of the mapped reads – shift the read positions by +4 and -5 bp in the BAM file before peak calling to adjust the read alignments. When the Tn5 transposase cuts open chromatin regions, it introduces two cuts that are separated by 9 bp. Therefore, ATAC-seq reads aligning to the positive and negative strands need to be adjusted by +4 bp and -5 bp, respectively, to represent the center of the transposase binding site. Picard CollectInsertSizeMetrics is used to compute the fragment sizes on the shifted BAM files.
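The strand-dependent shift can be sketched on SAM-formatted text with awk. This is a simplified illustration that adjusts only the leftmost mapping position (POS, column 4); production pipelines should use a CIGAR-aware tool such as deepTools alignmentSieve with --ATACshift:

```shell
# Simplified Tn5 +4/-5 shift on SAM text (demo only).
# FLAG bit 16 set   => reverse-strand read => shift -5 bp
# FLAG bit 16 unset => forward-strand read => shift +4 bp
printf 'read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*\n' |
awk 'BEGIN { OFS = "\t" }
  /^@/ { print; next }                    # pass SAM header lines through
  {
    if (int($2 / 16) % 2 == 1) $4 -= 5    # reverse strand
    else                       $4 += 4    # forward strand
    print
  }'
# the forward-strand read above is reported at POS 104
```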
Identification and visualization of the ATAC-seq peaks – use MACS2 for peak calling with the parameters nomodel or BAMPE 4, and identify differentially enriched peaks using the MACS2 `bdgdiff` module. Individual peaks separated by <100 bp are joined together. For peak annotation and functional analysis use the R package ChIPpeakAnno or HOMER 5,6. First, ATAC-seq peaks are categorized into different groups based on the nearest RefSeq gene feature, i.e. promoter, untranslated regions (UTRs), intron and exon. Second, peaks that are within 5 kb upstream and 3 kb downstream of the Transcription Start Site (TSS) are associated with the nearest genes. Finally, these genes are analyzed for over-represented gene ontology (GO) terms and KEGG pathways using ChIPpeakAnno. Visualize all sequencing tracks using the Integrative Genomics Viewer (IGV) 7.
Scripts are available for HPC Cluster.
For further reading: ATAC-seq-data-analysis-from-FASTQ-to-peaks
Cell Ranger can be run in cluster mode, using job schedulers like Sun Grid Engine (SGE) or Load Sharing Facility (LSF) as the queuing system, which allows highly parallelizable jobs.
There are 4 steps to analyze Chromium Single Cell data1.
Step 1: `cellranger mkfastq` demultiplexes raw base call (BCL) files generated by Illumina sequencers into FASTQ files.
Step 2: `cellranger count` takes FASTQ files from `cellranger mkfastq` and performs alignment, filtering, barcode counting, and unique molecular identifier (UMI) counting.
When doing large studies involving multiple GEM wells, first run `cellranger count` on FASTQ data from each of the GEM wells individually, and then pool the results using `cellranger aggr`, as described here.
Step 3: `cellranger aggr` aggregates outputs from multiple runs of `cellranger count`.
Step 4: Use R package Seurat2 for downstream analysis.
Running pipelines on cluster requires the following:
1. Download and uncompress `cellranger-7.2.0`1 in your `$HOME` directory and add it to PATH in `~/.bashrc`.
2. Update the job config file (`external/martian/jobmanagers/config.json`) for threads and memory. For example
"threads_per_job": 4,
"memGB_per_job": 32,
"name":"SGE_ROOT",
"description":"/opt/sge"
3. Update the template file `sge.template` (`external/martian/jobmanagers/sge.template`).
#!/bin/bash
#$ -N __MRO_JOB_NAME__
#$ -V
#$ -pe smp __MRO_THREADS__
#$ -q b.q
#$ -cwd
#$ -l mem_free=__MRO_MEM_GB__G
#$ -o __MRO_STDOUT__
#$ -e __MRO_STDERR__
#$ -m abe
#$ -M <email>
#$ -S /bin/bash
cd __MRO_JOB_WORKDIR__
source $HOME/10xgenomics/cellranger-7.2.0/sourceme.bash
For clusters whose job managers do not support memory requests, it is possible to request memory in the form of cores via the `--mempercore` command-line option. This option scales up the number of threads requested via the `__MRO_THREADS__` variable according to how much memory a stage requires. See more at Cluster Mode.
4. Download single cell gene expression and reference genome datasets from 10XGenomics.
5. Create an `sge.sh` file for Single Cell 3′ gene expression.
Output files will appear in the out/ subdirectory within the pipeline output directory, which is named by the --id argument (i.e. 10XGTX_v3).
#!/bin/bash
cd $HOME/10xgenomics/out
FASTQS="$HOME/pbmc_10k_v3_fastqs"
TR="$HOME/refdata-gex-GRCh38-2020-A"
cellranger count --disable-ui \
--id=10XGTX_v3 \
--transcriptome=${TR} \
--fastqs=${FASTQS} \
--sample=pbmc_10k_v3 \
--expect-cells=10000 \
--jobmode=sge \
--mempercore=8 \
--jobinterval=5000 \
--maxjobs=3
Execute the command in screen, then detach and reconnect:
Start a screen session: screen -S some_name.
Run the above script: sh sge.sh.
To detach the screen session from the terminal, press Ctrl + a followed by d.
To reconnect to the screen: screen -R some_name.
If the job is in the Eqw (job waiting in error) state, inspect the reason with
qstat -j jobid | grep error
Once you understand the reason and have fixed it, clear the error state with
qmod -cj jobid
for Single Cell 5′ gene expression
Use either --force-cells or --expect-cells.
#!/bin/bash
cd $HOME/10xgenomics/out
TR="$HOME/refdata-cellranger-GRCh38-3.0.0"
FASTQS="$HOME/vdj_v1_hs_nsclc_5gex_fastqs"
cellranger count \
--id=10XGTX_v5 \
--fastqs=${FASTQS} \
--transcriptome=${TR} \
--sample=vdj_v1_hs_nsclc_5gex \
--force-cells=7802 \
--jobmode=sge \
--mempercore=8 \
--maxjobs=3 \
--jobinterval=2000
Run the script inside a screen session, detaching and reconnecting as described above.
for Feature Barcode Analysis
Tested on Single Cell 5′ gene expression and cell surface protein (Feature Barcoding/Antibody Capture Assay) data.
For more information, please visit the Single Cell Gene Expression with Feature Barcoding page (Single Cell 3’) or the Single Cell Immune Profiling with Feature Barcoding page (Single Cell 5’).
Currently available Feature Barcode kits for Single Cell Gene Expression Feature Barcode Technology
10x Solution | Gene Expression | Cell Surface Protein | CRISPR Screening |
---|---|---|---|
Single Cell Gene Expression v2 | ✓ | - | - |
Single Cell Gene Expression v3 | ✓ | ✓ | ✓ |
Single Cell Gene Expression v3.1 | ✓ | ✓ | ✓ |
Single Cell Gene Expression v3.1 (Dual Index) | ✓ | ✓ | ✓ |
Currently available Feature Barcoding kits for Single Cell Immune Profiling Feature Barcoding Technology
10x Solution | TCR/Ig | Gene Expression | Cell Surface Marker | TCR-Antigen Specificity |
---|---|---|---|---|
Single Cell Immune Profiling | ✓ | ✓ | - | - |
Single Cell Immune Profiling with Feature Barcoding technology | ✓ | ✓ | ✓ | ✓ |
To enable Feature Barcode analysis, cellranger count needs two inputs:
First, a Libraries CSV file declaring the input library data sources: one entry for the normal single-cell gene expression reads and one for the Feature Barcode reads (the FASTQ directory, sample name, and library type of each input dataset).
LIBRARY=$HOME/vdj_v1_hs_pbmc2_5gex_protein_fastqs/vdj_v1_hs_pbmc2_5gex_protein_library.csv
fastqs | sample | library_type |
---|---|---|
/path/to/antibody_fastqs | vdj_v1_hs_pbmc2_antibody | Antibody Capture |
/path/to/gene_expression_fastqs | vdj_v1_hs_pbmc2_5gex | Gene Expression |
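The table above can be written out as a plain CSV; a sketch with the same columns (the /path/to/... entries are the table's placeholders, not real directories, and the /tmp path is illustrative):

```shell
# Libraries CSV matching the table above (fastqs paths are placeholders).
cat > /tmp/vdj_v1_hs_pbmc2_5gex_protein_library.csv <<'EOF'
fastqs,sample,library_type
/path/to/antibody_fastqs,vdj_v1_hs_pbmc2_antibody,Antibody Capture
/path/to/gene_expression_fastqs,vdj_v1_hs_pbmc2_5gex,Gene Expression
EOF
head -n 1 /tmp/vdj_v1_hs_pbmc2_5gex_protein_library.csv
```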
Second, a Feature Reference CSV file declaring the feature-barcode constructs and their associated barcodes. The pattern column is used to extract the Feature Barcode sequence from the read sequence.
FEATURE_REF=$HOME/vdj_v1_hs_pbmc2_5gex_protein_fastqs/vdj_v1_hs_pbmc2_5gex_protein_feature_ref.csv
id | name | read | pattern | sequence | feature_type |
---|---|---|---|---|---|
CD3 | CD3_TotalSeqC | R2 | 5PNNNNNNNNNN(BC)NNNNNNNNN | CTCATTGTAACTCCT | Antibody Capture |
CD19 | CD19_TotalSeqC | R2 | 5PNNNNNNNNNN(BC)NNNNNNNNN | CTGGGCAATTACTCG | Antibody Capture |
CD45RA | CD45RA_TotalSeqC | R2 | 5PNNNNNNNNNN(BC)NNNNNNNNN | TCAATCCTTCCGCTT | Antibody Capture |
… | … | … | … | … | … |
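Similarly, the feature reference rows above translate to a CSV like the following (only the three rows shown in the table are included; the /tmp path is illustrative):

```shell
# Feature Reference CSV matching the table above.
cat > /tmp/feature_ref.csv <<'EOF'
id,name,read,pattern,sequence,feature_type
CD3,CD3_TotalSeqC,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,CTCATTGTAACTCCT,Antibody Capture
CD19,CD19_TotalSeqC,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,CTGGGCAATTACTCG,Antibody Capture
CD45RA,CD45RA_TotalSeqC,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,TCAATCCTTCCGCTT,Antibody Capture
EOF
wc -l < /tmp/feature_ref.csv
```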
Feature and Library Types - When inputting Feature Barcode data to Cell Ranger via the Libraries CSV file, you must declare the library_type of each library. Examples include Antibody Capture, CRISPR Guide Capture, or Custom. If your assay scheme creates a library containing multiple library_types, for example if you are using both CRISPR Guide Capture and Antibody Capture features, you will need to run Cell Ranger multiple times, passing different library_type values in the Libraries CSV file. If Targeted Gene Expression data is analyzed in conjunction with CRISPR-based Feature Barcode data, additional requirements are imposed on the Feature Reference CSV file.
TotalSeq™ Reagents for Single-Cell Proteogenomics
TotalSeq™-B is a line of antibody-oligonucleotide conjugates supplied by BioLegend that are compatible with the Single Cell 3’ v3 assay.
TotalSeq™-C is a line of antibody-oligonucleotide conjugates supplied by BioLegend that are compatible with the Single Cell 5’ assay.
TotalSeq™-A is a line of antibody-oligonucleotide conjugates supplied by BioLegend that are compatible with the Single Cell 3’ v2 and Single Cell 3’ v3 kits. Although TotalSeq™-A can be used with the CITE-Seq assay, CITE-Seq is not a 10x supported assay.
CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by Sequencing) allows simultaneous analysis of transcriptome and cell surface protein information at the level of a single cell.3
“The pipeline first extracts and corrects the cell barcode and UMI from the feature library using the same methods as gene expression read processing. It then matches the Feature Barcode read against the list of features declared in the above Feature Barcode Reference. The counts for each feature are available in the feature-barcode matrix output files.”
#!/bin/bash
cd $HOME/10xgenomics/out
TR="$HOME/refdata-cellranger-GRCh38-3.0.0"
LIBRARY=$HOME/vdj_v1_hs_pbmc2_5gex_protein_fastqs/vdj_v1_hs_pbmc2_5gex_protein_library.csv
FEATURE_REF=$HOME/vdj_v1_hs_pbmc2_5gex_protein_fastqs/vdj_v1_hs_pbmc2_5gex_protein_feature_ref.csv
cellranger count \
--libraries=${LIBRARY} \
--feature-ref=${FEATURE_REF} \
--id=PBMC_5GEX \
--transcriptome=${TR} \
--expect-cells=9000 \
--jobmode=sge \
--mempercore=8 \
--maxjobs=3 \
--jobinterval=5000
Run the script inside a screen session, detaching and reconnecting as described above.
6. Monitor work progress through a web browser.
Open the _log file in the pipeline output folder PBMC_5GEX.
If you see a line such as serving UI as http://cluster.university.edu:3600?auth=rlSdT_QLzQ9O7fxEo-INTj1nQManinD21RzTAzkDVJ8, forward the port by typing the following from your laptop:
ssh -NT -L 9000:cluster.university.edu:3600 user@cluster.university.edu
user@cluster.university.edu's password:
Then access the UI using the following URL in your web browser
http://localhost:9000/
7. Single Cell Integration in Seurat
Seurat is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single-cell transcriptomic measurements, and to integrate diverse types of single-cell data. Seurat starts by reading the Cell Ranger output (barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz):
pbmc.data <- Read10X(data.dir = "~/PBMC_5GEX/outs/filtered_feature_bc_matrix/")
Quantitative Insights Into Microbial Ecology “QIIME” 2 (release 2018.11)1 is a widely used package to identify the abundance of microbes using 16S rRNA. Briefly, a feature table containing the counts of each unique sequence in the samples is constructed using the qiime dada2 denoise-paired method. A feature is essentially any unit of observation, e.g., an OTU (Operational Taxonomic Unit), a sequence variant, a gene, or a metabolite. In QIIME 2 (currently), most features will be OTUs or sequence variants (alternatively, for OTUs, use the QIIME 2 plugin q2-vsearch).
Data produced by QIIME 2 exist as QIIME 2 artifacts. A QIIME 2 artifact typically has the .qza file extension when its output data are stored in a file. Visualizations are another type of data (.qzv file extension) generated by QIIME 2; they can be viewed through the web interface https://view.qiime2.org without requiring a QIIME installation. Since QIIME 2 works with artifacts instead of data files (e.g., FASTA files), we must create a QIIME 2 artifact by importing our fastq.gz data files.
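Importing fastq.gz files typically goes through a manifest file. For the 2018.11-era releases the CSV manifest format (sample-id, absolute-filepath, direction) was standard; a sketch with hypothetical sample names and paths (the qiime command is shown but not run here, and its exact flags should be checked against your installed release):

```shell
# Hypothetical CSV manifest for paired-end reads (paths are illustrative).
cat > /tmp/manifest.csv <<'EOF'
sample-id,absolute-filepath,direction
sample1,/data/sample1_R1.fastq.gz,forward
sample1,/data/sample1_R2.fastq.gz,reverse
EOF
# Import into a .qza artifact (requires a QIIME 2 installation):
#   qiime tools import \
#     --type 'SampleData[PairedEndSequencesWithQuality]' \
#     --input-path /tmp/manifest.csv \
#     --input-format PairedEndFastqManifestPhred33 \
#     --output-path demux.qza
head -n 1 /tmp/manifest.csv
```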
Scripts are available for AWS Cloud and HPC Cluster.
MetaPhlAn2 provides microbial (bacterial, archaeal, viral, and eukaryotic) taxonomic profiling, allowing the quantification of individual species across metagenomic samples. MetaPhlAn2 relies on ~1M unique clade-specific marker genes identified from ~17,000 reference genomes. Microbial reads aligned by MetaPhlAn2 that belong to clades with no sequenced genomes available are reported as an “unclassified” subclade of the closest ancestor with available sequence data. HUMAnN2 utilizes the MetaCyc database as well as the UniRef gene-family catalog to characterize the microbial pathways present in samples. HUMAnN2 relies on programs such as Bowtie2 (for accelerated nucleotide-level searches) and DIAMOND (for accelerated translated searches) to compute the abundance of gene families and metabolic pathways present. HUMAnN2 generates three outputs: 1) gene families based on UniRef proteins and their abundances reported in reads per kilobase, 2) MetaCyc pathways and their coverage, and 3) MetaCyc pathways and their abundances reported in reads per kilobase.
Scripts are available at shotgun_metagenomics