Abstract

protrusionproteome is a package that provides an analytical workflow of shotgun mass spectrometry-based proteomics experiments with tandem mass tag (TMT) labelingof protrusion profiling experiments (Maria Dermit 2020). Protrusion are purified from microporous transwell filters and their proteomes are compare to the cell bodies.

protrusionproteome requires tabular input e.g. proteinGroups.txt, peptides.txt and evidence.txt output of quantitative analysis software like MaxQuant. Functions are provided for preparation, filtering as well as log transformation,calculating TMT ratios and median substation as well as generating SummarizedExperiment objects. It also includes tools to visualize protrusion purification efficiency, protease miscleavages and TMT incorporation efficiency with visualization such as scatterplot and barplot representations. Finally, it includes statistical testing of significantly enrich categories in cell protrusions.

Installation and loading package

Start R and install the protrusionproteome package:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("protrusionproteome")
library("protrusionproteome")

Once you have the package installed, load protrusionproteome and dplyr for data transformation into R.

Data loading, QC and processing

We analyze the time-course protrusion profiles proteome dataset from (Maria Dermit 2020) (PRIDE PXD021239), which is provided within this package. The raw mass spectrometry data were first analyzed using MaxQuant (Jürgen Cox 2014). We initially perform a number of data quality checks including trypsin efficiency and TMT labeling incorporation.

Calculating protease efficiency

Tryspsin was the protease used to produce tryptic peptides for this dataset. Note that the maximum missed cleavages allowed on the MaxQuant search was 2. The numbers of trypsin miscleavages can be used as a proxy of trypsin efficiency. This information is contained in the peptides.txt file, and it is provided with this package:

# Peptides data is provided with the package
data("peptides.raw")
pepdata <- peptides.raw

To visualize trypsin efficiency, the plot_miscleavagerate function can be used.

# Stacked barplot of trypsin miscleavages
plot_miscleavagerate(pepdata)

Calculating TMT labeling efficiency

TMT label efficiency can be measured as a proxy of peptides modified by TMT. This information is contained in the evidence.txt file, and is provided with this package:

#Evidence data is provided with the package
data("evidence.raw")
evidencedata <- evidence.raw

To visualize TMT label efficiency, we can use plot_labelingefficiency function.

Loading protein data.

Protein intensities are obtained from aggregated peptides over protein groups. This information is contained in the proteinGroups.txt file, is provided with this package and is used as input for the downstream analysis.

# The data is provided with the package
data("prot.raw")
data <- prot.raw

This dataset has the following dimensions:

dim(data)
#> [1] 5905  107

Filtering the data

We filter for decoy database hits, contaminant proteins and hits only identified by site, which are indicated by “+” in the columns “Reverse,”“Potential.contaminants” and “Only.identified.by.site,” respectively using filter_MaxQuant function.

proteins_filtered <- filter_MaxQuant(data,
    tofilter= c("Reverse" , "Potential.contaminant" ,"Only.identified.by.site"))
#> Filtering based on 'Reverse', 'Potential.contaminant', 'Only.identified.by.site' column(s)

We can see that the filtered data has following dimensions:

dim(proteins_filtered)
#> [1] 5541  107

Making unique identifiers

Uniprot names are protein unique identifiers but are not immediately informative. The associated Gene names are informative, however these are not always unique (i.e Gene.names is not primary key to the proteins_filtered table).

# Are Gene.names primary key to this table?
proteins_filtered %>% group_by(Gene.names) %>% summarize(count= n()) %>% 
  arrange(desc( count)) %>% filter( count > 1)
#> # A tibble: 94 x 2
#>    Gene.names count
#>    <chr>      <int>
#>  1 ""            18
#>  2 "PLEC"         6
#>  3 "CAST"         4
#>  4 "ANXA2"        3
#>  5 "HLA-C"        3
#>  6 "LSM14B"       3
#>  7 "NAP1L1"       3
#>  8 "PCBP2"        3
#>  9 "ABCE1"        2
#> 10 "ABCF1"        2
#> # … with 84 more rows

Even more critical, some proteins do not have an annotated Gene name. Similarly to (Xiaofei Zhang 2018) approach, for those proteins missing a Gene name identifier we will use the Uniprot ID.

data_unique <- make_unique(proteins_filtered, "Gene.names", "Protein.IDs", delim = ";")

We can check that name variable uniquely identifies proteins.

data_unique %>% group_by(name) %>% tally() %>% filter(n > 1) %>% nrow()
#> [1] 0

SummarizedExperiment proteomics

Experimental design

The columns that will be used in the summarised experiment are Reporter.intensity.corrected columns. Below is the table of how the samples were labeled in this experiment.

columns_positions<-str_which(colnames(data_unique),"Reporter.intensity.corrected.(\\d)+.(\\d)")
intensities <- colnames(data_unique)[str_which(colnames(data_unique),"Reporter.intensity.corrected.(\\d)+.(\\d)")]
time_unit=30
time_span=c(1,2,4,8,16)
experiment <- str_c(minute(rep(minutes(x = time_unit) *time_span,each=2)),c("body","prot"),sep="_")
knitr::kable(tibble(intensities,experiment))
intensities experiment
Reporter.intensity.corrected.1.1 30_body
Reporter.intensity.corrected.2.1 30_prot
Reporter.intensity.corrected.3.1 60_body
Reporter.intensity.corrected.4.1 60_prot
Reporter.intensity.corrected.5.1 120_body
Reporter.intensity.corrected.6.1 120_prot
Reporter.intensity.corrected.7.1 240_body
Reporter.intensity.corrected.8.1 240_prot
Reporter.intensity.corrected.9.1 480_body
Reporter.intensity.corrected.10.1 480_prot

SummarizedExperiment of TMT data

SummarizedExperiment objects (Martin Morgan 2020) are widely used across Bioconductor packages as data containers. This class of object contains the actual data (assays), information on the samples (colData) and additional feature annotation. We generate the SummarizedExperiment object from our data extracting information directly from the column names of rectangular data using the make_TMT_se function. The actual assay data is log2-transformed of median-subtracted Prot/Cell-bodies for each condition.

# Generate a SummarizedExperiment object using file information and user input
columns_positions<-str_which(colnames(data_unique),"Reporter.intensity.corrected.(\\d)+.(\\d)")
se <- make_TMT_se(data_unique,columns_positions,intensities,time_unit=30,
time_span=c(1,2,4,8,16), numerator= "prot", denominator= "body", sep = "_")

Let’s have a look at the SummarizedExperiment object

se
#> class: SummarizedExperiment 
#> dim: 5541 5 
#> metadata(0):
#> assays(1): ''
#> rownames(5541): RBM47 TTC26 ... MAU2 ENPP4
#> rowData names(99): Protein.IDs Majority.protein.IDs ... name ID
#> colnames(5): 30_1 60_1 120_1 240_1 480_1
#> colData names(4): label ID condition replicate

As we see the number of columns has been reduced to half (from 10 column sample data to 5 column sample data containing log2 Prot/Cell-bodies ratios)

Exploring the description of each sample in the SummarizedExperiment object

SummarizedExperiment::colData(se)
#> DataFrame with 5 rows and 4 columns
#>             label          ID   condition replicate
#>       <character> <character> <character> <numeric>
#> 30_1      30_prot        30_1          30         1
#> 60_1      60_prot        60_1          60         1
#> 120_1    120_prot       120_1         120         1
#> 240_1    240_prot       240_1         240         1
#> 480_1    480_prot       480_1         480         1

Vizualing the results

We can visualise the Prot/Cell-bodies distribution across timepoints.

plot_scatter(se, 1, 2, "HIST", 'orange', 4, 4)

Histone proteins are depleted from cell body.

1D enrichment analysis

We can perform a protein enrichment analysis for a given timepoint

enrichment_table <- enrich_1D(se,timepoint= 5,
                               dbs = "GO_Molecular_Function_2018",
                              number_dbs=1)

Session information

#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] SummarizedExperiment_1.20.0 Biobase_2.50.0             
#>  [3] GenomicRanges_1.42.0        GenomeInfoDb_1.26.2        
#>  [5] IRanges_2.24.1              S4Vectors_0.28.1           
#>  [7] BiocGenerics_0.36.0         MatrixGenerics_1.2.1       
#>  [9] matrixStats_0.58.0          forcats_0.5.1              
#> [11] stringr_1.4.0               purrr_0.3.4                
#> [13] readr_1.4.0                 tidyr_1.1.2                
#> [15] tibble_3.0.6                tidyverse_1.3.0            
#> [17] lubridate_1.7.9.2           dplyr_1.0.4                
#> [19] knitr_1.31                  protrusionproteome_0.99.0  
#> [21] ggplot2_3.3.3               BiocStyle_2.18.1           
#> 
#> loaded via a namespace (and not attached):
#>  [1] bitops_1.0-6           fs_1.5.0               httr_1.4.2            
#>  [4] rprojroot_2.0.2        tools_4.0.3            backports_1.2.1       
#>  [7] utf8_1.1.4             R6_2.5.0               DBI_1.1.1             
#> [10] colorspace_2.0-0       withr_2.4.1            tidyselect_1.1.0      
#> [13] compiler_4.0.3         textshaping_0.3.0      cli_2.3.0             
#> [16] rvest_0.3.6            enrichR_3.0            xml2_1.3.2            
#> [19] desc_1.2.0             DelayedArray_0.16.1    labeling_0.4.2        
#> [22] bookdown_0.21          scales_1.1.1           pkgdown_1.6.1         
#> [25] systemfonts_1.0.1      digest_0.6.27          rmarkdown_2.6         
#> [28] XVector_0.30.0         pkgconfig_2.0.3        htmltools_0.5.1.1     
#> [31] highr_0.8              dbplyr_2.1.0           fastmap_1.1.0         
#> [34] rlang_0.4.10           readxl_1.3.1           rstudioapi_0.13       
#> [37] farver_2.0.3           generics_0.1.0         jsonlite_1.7.2        
#> [40] RCurl_1.98-1.2         magrittr_2.0.1         GenomeInfoDbData_1.2.4
#> [43] Matrix_1.3-2           fansi_0.4.2            Rcpp_1.0.6            
#> [46] munsell_0.5.0          lifecycle_1.0.0        stringi_1.5.3         
#> [49] yaml_2.2.1             snakecase_0.11.0       zlibbioc_1.36.0       
#> [52] grid_4.0.3             crayon_1.4.1           lattice_0.20-41       
#> [55] haven_2.3.1            hms_1.0.0              pillar_1.4.7          
#> [58] rjson_0.2.20           reprex_1.0.0           glue_1.4.2            
#> [61] evaluate_0.14          BiocManager_1.30.10    modelr_0.1.8          
#> [64] vctrs_0.3.6            cellranger_1.1.0       gtable_0.3.0          
#> [67] assertthat_0.2.1       cachem_1.0.4           xfun_0.21             
#> [70] janitor_2.1.0.9000     broom_0.7.5            ragg_1.1.0            
#> [73] memoise_2.0.0          ellipsis_0.3.1

References

Jürgen Cox, Christian A Luber et al, Marco Hein. 2014. “Accurate Proteome-Wide Label-Free Quantification by Delayed Normalization and Maximal Peptide Ratio Extraction, Termed MaxLFQ.” Mol Cell Proteomics. https://doi.org/10.1074/mcp.M113.031591.
Maria Dermit, Flora Lee et al, Martin Dodel. 2020. “Subcellular mRNA Localization Regulates Ribosome Biogenesis in Migrating Cells.” Developmental Cell 55: 298–313.
Martin Morgan, Jim Hester et al, Valerie Obenchain. 2020. SummarizedExperiment: SummarizedExperiment Container. https://bioconductor.org/packages/SummarizedExperiment.
Xiaofei Zhang, Gabrielle BA et al, Arne H Smits. 2018. “Proteome-Wide Identification of Ubiquitin Interactions Using UbIA-MS.” Nature Protocols 13: 530–50.