Introduction_to_TMBleR.Rmd
Tumor mutational burden (TMB), the total amount of somatic coding mutations in a tumor, represents a promising bio-marker to predict immunotherapy response in cancer patients. Initial studies investigating TMB used whole exome sequencing (WES) for quantification. Recently, it was demonstrated that TMB can be accurately measured also from targeted sequencing gene panels, which consistently reduces sequencing costs and paves the way for TMB routine adoption in clinical care. Currently, different sets of mutations are used to perform TMB quantification (e.g. only non-synonymous mutations, both synonymous and non-synonymous mutations, all but known cancer mutations as they are over-represented in gene panels) and no common standard exists. This hampers fair comparison of results from different studies and different gene panels. TMBleR allows to perform parallel panel-based TMB quantification by different combinations of variants filtering (e.g. synonymous, known cancer mutations, by VAF) and to easily plot estimated TMB values and visualize differences between methods and/or gene panels.
library(TMBleR)
This package finds to following important applications:
In the example here provided, we want to perform TMB quantification of 2 different Horizon cell lines sequenced by a targeted gene panel2. Different filters can be applied on the variants reported in the corresponding vcf files, prior to TMB quantification. Indeed, panel-based TMB quantification requires a method to extrapolate the number of mutations observed in the narrow sequencing space of gene panels to the whole exome. For this reason, synonymous mutations can be counted for panel-based TMB quantification: although not immunogenic, they increase the statistical sampling size and reduce the noise. Besides panel size, gene composition also affects the accuracy of panel-based TMB quantification. In many gene panels, the over-representation of genes known to be highly mutated in cancers may lead to TMB overestimation. Therefore, some studies try to avoid this bias by removing somatic COSMIC variants or truncating mutations in tumor suppressors from variants used for TMB quantification. Other filters can be applied to variants for TMB quantification, such as minimum VAF thresholds. No common standard currently exists and we apply here different filters and compare the obtained TMB.
There are 3 inputs required to perform the analysis:
.bed
file or .txt
file with list of gene IDs.Finally, in order for the package to work at his full potential, optionally you might want to provide the latest COMISC somatic pathogenic SNVs. This is optional and allows to use the remove.cancer
filter option of applyFilter(). This will be discussed further below:
Specify human assembly of reference, either “hg19” or “hg38”.
genome_assembly <- "hg19"
Read vcf files containing somatic variants identified by panel sequencing.
# Provide vcf_files as a list ---------------------------------
# Note that even when reading only one file you
# need to provide it as a list with element name corresponding to the name to
# give to the sample and value corresponding to the path to the file
# (i.e. vcf_file=list(LungSample="inst/extdata/LungSample.vcf"))
vcf_files <- list(Horizon5="Horizon5_ExamplePanel.vcf",
HorizonFFPEmild="HorizonFFPEmild_ExamplePanel.vcf")
# For each vcf file, get the absolute path. ------------------
# Here we are creating absolute paths by using .vcf files that were
# bundled with the package
vcf_files <- lapply(vcf_files, function(x)
system.file("extdata", x, package = "TMBleR", mustWork = TRUE))
# read in the files -------------------------------------------
vcfs <- readVcfFiles(vcfFiles = vcf_files, assembly = genome_assembly)
#> Validate VCF input:
#> $Horizon5
#> $Horizon5$warnings
#> [1] "line 141: The VCF file contains 2 samples. Only the first (1_20170317_01_i01_0_horizon_5) will be used"
#> [2] "line 142: AF is described both in FORMAT and in INFO of this and possibly other variants. FORMAT's AF will be used"
#>
#> $Horizon5$errors
#> NULL
#>
#>
#> $HorizonFFPEmild
#> $HorizonFFPEmild$warnings
#> [1] "line 141: The VCF file contains 2 samples. Only the first (2_20170317_01_i01_0_horizon_ffpe_mild) will be used"
#> [2] "line 142: AF is described both in FORMAT and in INFO of this and possibly other variants. FORMAT's AF will be used"
#>
#> $HorizonFFPEmild$errors
#> NULL
#> Warnings and Errors exported to 'validator_check' variable
Read in input the panel sequencing design (either a .bed
file with genomic coordinates of the targeted genes or .txt
file with targeted gene names).
# Get absolute path
design_file_path = system.file( "extdata"
, "ExamplePanel_GeneIDs.txt"
, package = "TMBleR"
, mustWork = TRUE)
# Read bed file with the panel sequencing design
design <- readDesign(
filename = design_file_path
, assembly = genome_assembly
, ids = "entrezgene_id")
#> Warning: `select_()` is deprecated as of dplyr 0.7.0.
#> Please use `select()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> Warning: `filter_()` is deprecated as of dplyr 0.7.0.
#> Please use `filter()` instead.
#> See vignette('programming') for more help
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> Cache found
# Preview
head(design)
#> GRanges object with 6 ranges and 1 metadata column:
#> seqnames ranges strand | ensembl_exon_id
#> <Rle> <IRanges> <Rle> | <character>
#> [1] chr1 244006427-244006538 - | ENSE00002329149
#> [2] chr1 243858893-243859018 - | ENSE00003573177
#> [3] chr1 243828074-243828185 - | ENSE00003478804
#> [4] chr1 243809195-243809339 - | ENSE00000815947
#> [5] chr1 243800913-243801044 - | ENSE00001069794
#> [6] chr1 243778398-243778463 - | ENSE00001358887
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
External dataset are required for one of the mutation filters that removes known “cancer” mutation. This is achieved by setting remove.cancer = TRUE
when calling applyFilter()
(see section paragraph). Since these files are quite large, TMBleR comes with just a very small version that can be used to demo the package. If you are planning to use the remove.cancer
filter, please see the vignette HOWTO_Import_external_data which contains the instructions on how to retrieve the COSMIC data and import it into TMBleR.
Apply different filters on the identified variants. Typically, known cancer mutations are removed when performing panel-based TMB quantification, because the enrichment in cancer genes of these panels may lead to TMB overestimation. Synonymous mutations instead are removed when using WES datasets (in principle they do not generate tumor neoepitopes), whereas they are kept when using panels to increase statistical sampling size and reduce noise. Here we showcase how you can set up 7 different filtering combinations that will later on be used to estimate the TMB:
remove.cancer = TRUE
;variantType = c("synonymous", "frameshift)
)vaf.cutoff
argument) will be filtered out;When setting remove.cancer = T
, TMBleR will filter out cancer mutation from the vcfs
object. If you did not provide a cancer dataset, a demo dataset will be loaded instead and a warning message will be print out.
Note the results here shown use a under-sampled vcf from the COSMIC file. This data are only for illustration purposes.
# Filter 1 ------------------------------------------------------------
# Remove known cancer mutations using COSMIC data
vcfs_NoCancer <- applyFilters( vcfs = vcfs
, assembly = genome_assembly
, design = design
, remove.cancer = T)
#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR
#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR
# Filter 2 ------------------------------------------------------------
# Remove synonymous mutations
vcfs_NoSynonymous <- applyFilters(vcfs = vcfs,
assembly = genome_assembly,
design = design,
variantType = c("synonymous"))
# Filter 3 ------------------------------------------------------------
# Remove known cancer mutations and synonymous mutations
vcfs_NoCancer_NoSynonymous <- applyFilters(vcfs = vcfs,
assembly = genome_assembly,
design = design,
remove.cancer = T,
variantType = c("synonymous"))
#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR
#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR
# Filter 4 ------------------------------------------------------------
# Remove synonymous mutations and mutations with variant
# allele frequency < 0.05
vcfs_NoSynonymous_VAFFilter <- applyFilters(vcfs = vcfs,
assembly = genome_assembly,
design = design,
vaf.cutoff = 0.05,
variantType = c("synonymous"))
# Filter 5 ------------------------------------------------------------
# Remove mutations with variant allele frequency < 0.05
vcfs_VAFFilter <- applyFilters(
vcfs = vcfs,
assembly = genome_assembly,
design = design,
vaf.cutoff = 0.05,
tsList = NULL,
)
# Filter 6 ------------------------------------------------------------
# Remove known cancer mutations (e.g. coding mutations described in COSMIC and
# truncating mutations in tumor suppressor genes) and mutations with variant
# allele frequency < 0.05
vcfs_NoCancer_VAFFilter <- applyFilters(
vcfs = vcfs,
assembly = genome_assembly,
design = design,
vaf.cutoff = 0.05,
remove.cancer = T,
)
#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR
#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR
# Filter 7 ------------------------------------------------------------
# Remove known cancer mutations (e.g. coding mutations described in COSMIC and
# truncating mutations in tumor suppressor genes), synonymous mutations and
# mutations with variant allele frequency < 0.05
vcfs_NoCancer_VAFFilter_NoSynonymous <- applyFilters(
vcfs = vcfs,
assembly = genome_assembly,
design = design,
vaf.cutoff = 0.05,
remove.cancer = T,
variantType = c("synonymous")
)
#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR
#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR
# We also want to perform TMB quantification on a non filtered vcf. Therefore
# we need to format it as required by the applyTMB function. We can
# do so by simply providing no Filtering options
vcfs_nonfiltered <- applyFilters(
vcfs = vcfs,
assembly = genome_assembly,
design = design,
)
The following command allows to estimate the TMB from the different filtering criteria chosen.
# Concatenate all filtered and unfiltered vcfs
vcfs_all <- c(vcfs_NoCancer,
vcfs_NoCancer_NoSynonymous,
vcfs_NoSynonymous,
vcfs_NoSynonymous_VAFFilter,
vcfs_VAFFilter,
vcfs_NoCancer_VAFFilter,
vcfs_NoCancer_VAFFilter_NoSynonymous,
vcfs_nonfiltered )
# Perform TMB quantification
TMB_res=applyTMB(inputForTMB = vcfs_all, assembly = genome_assembly)
# Print results in table format
DT::datatable(TMB_res)
The TMB estimate results can be preview in different ways:
where the samples are grouped together on the x-axis, while the TMB value per Mb is shown on the y-axis. Each filter is displayed with a different color.
# Plot results using barplots
plotTMB(TMB_results = TMB_res, type="barplot")
When processing a large number of samples, the previous plot might be a little bit cluttered. Here we present an alternative visualization that groups together the different samples showing their TMB value density for each filter.
# Plot results using density plots
plotTMB(TMB_results = TMB_res, type="densityplot")
Finally, a boxplot can also provide additional information on the distribution
# Plot results with boxplot
plotTMB(TMB_results = TMB_res, type="boxplot")
In case we are investigating a targeted gene panel design and we are interest in its performance to evaluate TMB but no sequencing data is available, then we use WES dataset to simulate it. This section covers an example on how to just do that; we will subset a WES dataset to only include the genes targeted by the panel in examination. Then we compare the accuracy of the (simulated) panel-based TMB with the reference value of WES-based TMB. This is a common strategy for in-silico validation of accuracy and predictive power of panel-based TMB.
There are 4 inputs required to perform the analysis:
.bed
file or .txt
file with list of gene IDs;Specify human assembly of reference, either “hg19” or “hg38”
# 1 . =========================================================================
genome_assembly <- "hg19"
The mutational profile is to be provided as a .vcf
file. For this example we have already prepared a showcase dataset.
# Load built-in list of 4 vcfs generated by WES
# 2. ==========================================================================
# Read file names
vcf_files <- list( Pat02="Pat02_vs_NORMAL.vcf"
, Pat03="Pat03_vs_NORMAL.vcf"
, Pat04="Pat04_vs_NORMAL.vcf"
, Pat06="Pat06_vs_NORMAL.vcf")
# Find the absolute path of the vcf files bundled with the TMBleR package
vcf_files <- lapply(
vcf_files
, function(x) system.file("extdata", x, package = "TMBleR", mustWork = TRUE)
)
# Read vcf files and put in a list
ExampleWESvcfs <- readVcfFiles(vcfFiles=vcf_files, genome_assembly)
#> Validate VCF input:
#> $Pat02
#> $Pat02$warnings
#> [1] "line 90: The VCF file contains 2 samples. Only the first (pat02) will be used"
#> [2] "line 91: AF is neither described in FORMAT nor in INFO of this and possibly other variants"
#>
#> $Pat02$errors
#> NULL
#>
#>
#> $Pat03
#> $Pat03$warnings
#> [1] "line 90: The VCF file contains 2 samples. Only the first (pat03) will be used"
#> [2] "line 91: AF is neither described in FORMAT nor in INFO of this and possibly other variants"
#>
#> $Pat03$errors
#> NULL
#>
#>
#> $Pat04
#> $Pat04$warnings
#> [1] "line 90: The VCF file contains 2 samples. Only the first (pat04) will be used"
#> [2] "line 91: AF is neither described in FORMAT nor in INFO of this and possibly other variants"
#>
#> $Pat04$errors
#> NULL
#>
#>
#> $Pat06
#> $Pat06$warnings
#> [1] "line 90: The VCF file contains 2 samples. Only the first (pat06) will be used"
#> [2] "line 91: AF is neither described in FORMAT nor in INFO of this and possibly other variants"
#>
#> $Pat06$errors
#> NULL
#> Warnings and Errors exported to 'validator_check' variable
#> Pat02
#> Some variants in the .vcf file were discarded because did not have a chromosome ID maching the expected format:
#> Found:
#> [1] "1" "2" "3" "4" "5"
#> [6] "6" "7" "8" "9" "10"
#> [11] "11" "12" "13" "14" "15"
#> [16] "16" "17" "18" "19" "20"
#> [21] "21" "22" "GL000191.1" "GL000192.1" "GL000193.1"
#> [26] "GL000194.1" "GL000195.1" "GL000196.1" "GL000197.1" "GL000198.1"
#> [31] "GL000199.1" "GL000200.1" "GL000201.1" "GL000202.1" "GL000203.1"
#> [36] "GL000204.1" "GL000205.1" "GL000206.1" "GL000207.1" "GL000208.1"
#> [41] "GL000209.1" "GL000210.1" "GL000211.1" "GL000212.1" "GL000213.1"
#> [46] "GL000214.1" "GL000215.1" "GL000216.1" "GL000217.1" "GL000218.1"
#> [51] "GL000219.1" "GL000220.1" "GL000221.1" "GL000222.1" "GL000223.1"
#> [56] "GL000224.1" "GL000225.1" "GL000226.1" "GL000227.1" "GL000228.1"
#> [61] "GL000229.1" "GL000230.1" "GL000231.1" "GL000232.1" "GL000233.1"
#> [66] "GL000234.1" "GL000235.1" "GL000236.1" "GL000237.1" "GL000238.1"
#> [71] "GL000239.1" "GL000240.1" "GL000241.1" "GL000242.1" "GL000243.1"
#> [76] "GL000244.1" "GL000245.1" "GL000246.1" "GL000247.1" "GL000248.1"
#> [81] "GL000249.1" "MT" "X" "Y"
#> Expected:
#> [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
#> [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
#> [19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY"
#> Variants discarded: 5 out of 378
Import both the WES genomic coordinates as well as the panel design targeted regions. In this example we will be using 2 gene panel designs :
ExamplePaneldesign.bed
MskImpactPanel_468GenesMutsAndCNA.txt
Finally the WES genomic coordinates can be found here:
SureSelect_Human_All_Exon_V2_Regions.bed
(note: use the following command to preview the file system.file("extdata","SureSelect_Human_All_Exon_V2_Regions.bed",package = "TMBleR",mustWork = TRUE)
).
# Read files with WES and gene panel design
# 3. ==========================================================================
# First panel ------------------------------------------------
# Load the design of the first gene panels that you want to simulate
design_file_path <- system.file("extdata",
"ExamplePaneldesign.bed",
package = "TMBleR",
mustWork = TRUE)
ExamplePaneldesign <- readDesign(filename = design_file_path
, assembly = "hg19"
, name = "SimulatedPanel")
# Second Panel ------------------------------------------------
design_file_path <- system.file("extdata",
"MskImpactPanel_468GenesMutsAndCNA.txt",
package = "TMBleR",
mustWork = TRUE)
MSKdesign <- readDesign(filename = design_file_path
, assembly = "hg19"
, ids = "hgnc_symbol"
, name = "MSKPanel")
#> Cache found
# Load the design of the WES ---------------------------------
design_file_path <- system.file("extdata",
"SureSelect_Human_All_Exon_V2_Regions.bed",
package = "TMBleR",
mustWork = TRUE)
ExampleWESdesign <- readDesign(filename = design_file_path
, assembly = "hg19"
, name = "WES")
In the TMB quantification of genomic panel, cancer mutations are usually removed to avoid overestimation of TMB due to panel enrichment in cancer genes. This can be achieved by filtering out known cancer mutations or truncating mutations in tumor suppressors ( e.g. remove.cancer=T
). In the following example, we will apply 3 different TMB filters to the WES somatic dataset:
# Filters for gene panel simulation ===================================
# Filter 1 ------------------------------------------------------------
# Remove known cancer mutations (e.g. coding mutations described in COSMIC
# and truncating mutations in tumor suppressor genes)
vcfs_NoCancer <- applyFilters(
vcfs = ExampleWESvcfs
, assembly = genome_assembly
, design = ExampleWESdesign
, remove.cancer = T
)
# Filter 2 ------------------------------------------------------------
# Remove synonymous, indels and nonsense mutations
vcfs_NoSynonymousNoFrameshiftNoNonsense <- applyFilters(
vcfs = ExampleWESvcfs,
assembly = genome_assembly,
design = ExampleWESdesign,
variantType = c("synonymous", "frameshift", "nonsense"))
# Filter 3 ----------------------------------------------------------
# Filter out synonymous mutations, as usually done for WES-based TMB
# quantification. This filtered WES dataset will be used for WES-based TMB
# quantification
vcfs_NoSynonymous_WES <- applyFilters(
vcfs = ExampleWESvcfs,
assembly = genome_assembly,
design = ExampleWESdesign,
variantType = c("synonymous"))
Now that we have applied different mutational filters on the somatic mutation in our WES dataset, we will try to simulate the panel by removing from the dataset the mutational profile that was “mapped” outside of the genomic panel of interest.
We will performed this process on the two panels we are currently investigating.
# PANEL 1 - generic example panel
# =====================================================================
# Subset WES with Filter 1
SimulatedExamplePanel_NoCancer <- applySimulatePanel(
WES = vcfs_NoCancer, # <--- results from Filter 1
WES.design = ExampleWESdesign, # <--- This is the WES design
panel.design = ExamplePaneldesign, # <--- this is design of the first panels
assembly = genome_assembly
)
# Subset WES with Filter 2
SimulatedExamplePanel_NoSynonymousNoFrameshiftNoNonsense <- applySimulatePanel(
WES = vcfs_NoSynonymousNoFrameshiftNoNonsense, # <--- results from Filter 2
WES.design = ExampleWESdesign,
panel.design = ExamplePaneldesign,
assembly = genome_assembly
)
# Concatenate the 2 resulting panels with different vcf filters
vcfs_SimulatedExamplePanel <- c(
SimulatedExamplePanel_NoCancer
, SimulatedExamplePanel_NoSynonymousNoFrameshiftNoNonsense
)
# Panel 2 - MSK panel
# =====================================================================
# Subset WES with Filter 1
SimulatedMSKPanel_NoCancer <- applySimulatePanel(
WES = vcfs_NoCancer, # <--- results from Filter 1
WES.design = ExampleWESdesign, # <--- this is the WES design
panel.design = MSKdesign, # <--- this is the design of the second panel
assembly = genome_assembly
)
# Subset WES with Filter 2
SimulatedMSKPanel_NoSynonymousNoFrameshiftNoNonsense <- applySimulatePanel(
WES = vcfs_NoSynonymousNoFrameshiftNoNonsense, # <--- results from Filter 2
WES.design = ExampleWESdesign,
panel.design = MSKdesign,
assembly = genome_assembly
)
vcfs_SimulatedMSKPanel <- c(
SimulatedMSKPanel_NoCancer,
SimulatedMSKPanel_NoSynonymousNoFrameshiftNoNonsense
)
After the previous step, the subset dataset will only contain variants in the regions targeted by the panel you want to simulate. We can now finally estimate the TMB. We will estimate the TMB for the following 3 scenarios:
We have included also the “original WES” dataset so that we can perform a comparison.
# TMB quantification from the simulated panels
# ====================================================================
# Panel 1 ----------------------------------------------------------
TMBs_SimulatedExamplePanel <- applyTMB(
inputForTMB = vcfs_SimulatedExamplePanel
, assembly = genome_assembly
)
# generate plot
plotTMB(TMB_results = TMBs_SimulatedExamplePanel, type="barplot")
# Panel 2 ----------------------------------------------------------
TMBs_SimulatedMSKPanel <- applyTMB(
inputForTMB = vcfs_SimulatedMSKPanel
, assembly = genome_assembly
)
# generate plot
plotTMB(TMB_results = TMBs_SimulatedMSKPanel, type="barplot")
# Original WES --------------------------------------------
TMBs_WES <- applyTMB(inputForTMB = vcfs_NoSynonymous_WES
, assembly = genome_assembly)
# generate plot
plotTMB(TMB_results = TMBs_WES, type="barplot")
The following code is just to merge together all the results from the 3 previous experiments into a single dataframe. We will have a new colum “Panel” that encodes the experiment
library(purrr)
#>
#> Attaching package: 'purrr'
#> The following object is masked from 'package:TMBleR':
#>
#> set_names
#> The following object is masked from 'package:testthat':
#>
#> is_null
# Create a dataframe with the TMB results from different
# panels together all together
TMB_res_df <- list(SimulatedExamplePanel=TMBs_SimulatedExamplePanel
, SimulatedMSKPanel=TMBs_SimulatedMSKPanel
, WES=TMBs_WES) %>%
# convert Factors into strings in the process.
purrr::map( ~ dplyr::mutate_at(., c("Sample", "Filter", "Design"), as.character)) %>%
# make list into a dataframe
dplyr::bind_rows()
# Preview
DT::datatable(TMB_res_df)
The same bar plot as before can now be printed all together into one single plot calling the very same fuction.
plotTMB(TMB_res_df)
To simplify interpretation, the results can be grouped in
plotTMB(TMB_res_df, type ="densityplot")
For a more compact plot, we can use boxplots
plotTMB(TMB_res_df, type="boxplot")
Here we showcase how you can use the function correlateTMBvalues()
to run a quick correlation analysis on two variable: the TMB generated by one of the filters and a comparison TMB from the original WES.
In the following example we will use Spearman correlation coefficient between the ExamplePanel and the original WES dataset both filtered out by removing the pathogenic mutations only (i.e. noCancerMuts
, the other filters are not considered in this analysis).
# Spearman correlation -------------------------------------------------------
# Plot Spearman correlation
#par(cex.axis=1.2, cex.lab=1.2)
TMBs_SimulatedExamplePanel %>%
dplyr::filter(Filter == "vaf=0NoCancerMuts") %>%
.$TMB_per_Mb %>%
correlateTMBvalues(panel.TMB = .,
WES.TMB = TMBs_WES$TMB_per_Mb,
corr.coeff = "spearman",
title.plot="Panel-WES correlation")
In alternative, it is also possible to use the Pearson correlation coefficient by changing the corr.coeff
argument in the function:
# Pearson correlation --------------------------------------------------------
# Plot Pearson correlation
#par(cex.axis=1.2, cex.lab=1.2)
TMBs_SimulatedExamplePanel %>%
dplyr::filter(Filter == "vaf=0NoCancerMuts") %>%
.$TMB_per_Mb %>%
correlateTMBvalues(panel.TMB = ,
WES.TMB = as.numeric(as.vector(TMBs_WES$TMB_per_Mb)),
corr.coeff = "pearson",
title.plot="Panel-WES correlation")
If both TMB and clinical data are available, the TMBleR package allows to build ROC curves and evaluate TMB performance in predicting immunotherapy responders. Here we use a dataset from Hellman et al., Cancer Cell 2018 (PMID: 29731394), in a study on immunotherapy response on small cell lung cancer patients. The original dataset provided as supplementary table contains tumor mutational burden, as total number of non-synonymous mutations identified by WES, and clinical response, which was transformed in a binary vector of values ‘responder’ and ‘nonresponder’ to be suitable to generateBoxplots and generateROC functions.
There are 3 inputs required to perform the analysis:
# 1. Specify human assembly of reference, either "hg19" or "hg38"
# ==============================================================================
genome_assembly <- "hg19"
Use in input TMB values generated above on simulated panel-based TMB already computed TMB_res_df
.
In this section we will import the clinical dataset published by VanAllen EM. et al. in Science 2015. This dataset contains lots of information but we only interest into knowning if the patient was considered as either a responder or nonresponder.
# 2. =========================================================================
# Load dataset with clinical response to immunotherapy
data("VanAllen_Clinical")
TMB_clinical = annotateTMB(TMB_df = TMB_res_df, ClinicalData = VanAllen_Clinical)
generateBoxplot(TMB_clinical)
The generateROC() function separates data in training (random 75% of input data) and test (remaining 25% of input data) using the caret package and provides metrics of classification performance using functions from the OptimalCutpoints package. It provides in output a list of objects containing all analyses results.
First, it identifies the best TMB cutoff, based on responders vs non-responders classification in the training set and provides in output: - best TMB cutoff, AUC, specificity, sensitivity, positive and negative predictive values, in the optimalCutpoints object “train_res” - the ROC curve, the predictive ROC curve and the variation of the optimal criterion value (i.e. Youden) for different cutoffs in the training set, in the optimalCutpoints object “train_res”
Then, it uses the identified TMB cutoff to perform classification of samples from the test set and provides in output: - the confusion matrix of test samples classification based on the best TMB cutoff value, in the confusionMatrix object “confMatrix_test” - the ROC curve for the test set, in the optimalCutpoints object “test_res”
# Set the seed to create reproducible train and test sets in generateROC
set.seed(949)
data(Hellman_SimulatedFM1Panel_WES)
# Fix dataset format
Hellman_Df <- Hellman_SimulatedFM1Panel_WES %>%
dplyr::rename("TMB_per_Mb" = "WES.NumMuts") %>%
dplyr::mutate("ClinicalResponse" = as.factor(.data$ClinicalResponse)) %>%
dplyr::filter(!is.na(.data$TMB_per_Mb)) %>%
dplyr::filter(!is.na(.data$ClinicalResponse))
# Analysis of WES-based TMB performance in classification -------------------
WES_ROCanalysis <- generateROC(dataset = Hellman_Df,
method = "Youden")
# Print AUC, sensitivity and specificity, calculated on training set
print(summary(WES_ROCanalysis$train_res))
#>
#> Call:
#> optimal.cutpoints.default(X = "TMB_per_Mb", status = "ClinicalResponse",
#> tag.healthy = "nonresponder", methods = method, data = training,
#> pop.prev = NULL, control = OptimalCutpoints::control.cutpoints(),
#> ci.fit = TRUE, conf.level = 0.95, trace = FALSE)
#>
#> Area under the ROC curve (AUC): 0.691 (0.525, 0.857)
#>
#> CRITERION: Youden
#> Number of optimal cutoffs: 1
#>
#> Estimate 95% CI lower limit 95% CI upper limit
#> cutoff 565.0000000 - -
#> Se 0.6666667 0.4099252 0.8665726
#> Sp 0.7777778 0.6084824 0.8988495
#> PPV 0.6000000 0.3997851 0.8296723
#> NPV 0.8235294 0.6184616 0.9221689
#> DLR.Positive 3.0000000 1.5002929 5.9988285
#> DLR.Negative 0.4285714 0.2179369 0.8427828
#> FP 8.0000000 - -
#> FN 6.0000000 - -
#> Optimal criterion 0.4444444 - -
# Print confusion matrix describing the performance of classification on the
# test set using the best TMB cutoff identified on the training set
print(WES_ROCanalysis$confMatrix_test)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction responder nonresponder
#> responder 3 3
#> nonresponder 2 9
#>
#> Accuracy : 0.7059
#> 95% CI : (0.4404, 0.8969)
#> No Information Rate : 0.7059
#> P-Value [Acc > NIR] : 0.6177
#>
#> Kappa : 0.3307
#>
#> Mcnemar's Test P-Value : 1.0000
#>
#> Sensitivity : 0.6000
#> Specificity : 0.7500
#> Pos Pred Value : 0.5000
#> Neg Pred Value : 0.8182
#> Prevalence : 0.2941
#> Detection Rate : 0.1765
#> Detection Prevalence : 0.3529
#> Balanced Accuracy : 0.6750
#>
#> 'Positive' Class : responder
#>
# Plot ROC curve on test set
par(cex.axis=1.5, cex.lab=1.5)
plot(WES_ROCanalysis$test_res, which=1)
# Analysis of panel-based TMB performance in classification -------------------
panel_ROCanalysis <- generateROC(dataset = Hellman_Df,
method = "Youden")
# Print AUC, sensitivity and specificity, calculated on training set
print(summary(panel_ROCanalysis$train_res))
#>
#> Call:
#> optimal.cutpoints.default(X = "TMB_per_Mb", status = "ClinicalResponse",
#> tag.healthy = "nonresponder", methods = method, data = training,
#> pop.prev = NULL, control = OptimalCutpoints::control.cutpoints(),
#> ci.fit = TRUE, conf.level = 0.95, trace = FALSE)
#>
#> Area under the ROC curve (AUC): 0.663 (0.481, 0.845)
#>
#> CRITERION: Youden
#> Number of optimal cutoffs: 1
#>
#> Estimate 95% CI lower limit 95% CI upper limit
#> cutoff 565.0000000 - -
#> Se 0.6666667 0.4099252 0.8665726
#> Sp 0.8055556 0.6397520 0.9180564
#> PPV 0.6315789 0.4235782 0.8477211
#> NPV 0.8285714 0.6267068 0.9289306
#> DLR.Positive 3.4285714 1.6345163 7.1917924
#> DLR.Negative 0.4137931 0.2111611 0.8108726
#> FP 7.0000000 - -
#> FN 6.0000000 - -
#> Optimal criterion 0.4722222 - -
# Plot ROC curve, PROC curve and plot of optimal criterion values depending on
# TMB cutoff, calculated on the training set
par(mfrow=c(2,2), cex.axis=1.5, cex.lab=1.5)
plot(panel_ROCanalysis$train_res, which=1)
plot(panel_ROCanalysis$train_res, which=2)
plot(panel_ROCanalysis$train_res, which=3, ylim = c(0, 1))
# Print confusion matrix describing the performance of classification on the
# test set using the best TMB cutoff identified on the training set
print(panel_ROCanalysis$confMatrix_test)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction responder nonresponder
#> responder 3 4
#> nonresponder 2 8
#>
#> Accuracy : 0.6471
#> 95% CI : (0.3833, 0.8579)
#> No Information Rate : 0.7059
#> P-Value [Acc > NIR] : 0.7914
#>
#> Kappa : 0.2388
#>
#> Mcnemar's Test P-Value : 0.6831
#>
#> Sensitivity : 0.6000
#> Specificity : 0.6667
#> Pos Pred Value : 0.4286
#> Neg Pred Value : 0.8000
#> Prevalence : 0.2941
#> Detection Rate : 0.1765
#> Detection Prevalence : 0.4118
#> Balanced Accuracy : 0.6333
#>
#> 'Positive' Class : responder
#>
# Plot ROC curve on test set
par(cex.axis=1.5, cex.lab=1.5)
plot(panel_ROCanalysis$test_res, which=1)
par(cex.axis=1.5, cex.lab=1.5)
compareROC(dataset = Hellman_SimulatedFM1Panel_WES,
TMB1 = "Panel.NumMuts",
TMB2 = "WES.NumMuts",
method = "d",
paired = TRUE)
#> Setting levels: control = nonresponder, case = responder
#> Setting direction: controls < cases
#> Setting levels: control = nonresponder, case = responder
#> Setting direction: controls < cases
#>
#> DeLong's test for two correlated ROC curves
#>
#> data: TMB1 and TMB2 by as.factor(dataset$ClinicalResponse) (nonresponder, responder)
#> Z = 1.05, p-value = 0.2937
#> alternative hypothesis: true difference in AUC is not equal to 0
#> sample estimates:
#> AUC of roc1 AUC of roc2
#> 0.7368659 0.6644022
#> Setting direction: controls < cases
#> Setting direction: controls < cases
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] purrr_0.3.4 TMBleR_0.1.1 testthat_2.3.2 devtools_2.3.0
#> [5] usethis_1.6.1 BiocStyle_2.12.0
#>
#> loaded via a namespace (and not attached):
#> [1] readxl_1.3.1
#> [2] backports_1.1.7
#> [3] BiocFileCache_1.10.2
#> [4] systemfonts_1.0.1
#> [5] plyr_1.8.6
#> [6] splines_3.6.3
#> [7] crosstalk_1.1.0.1
#> [8] BiocParallel_1.20.1
#> [9] GenomeInfoDb_1.22.1
#> [10] ggplot2_3.3.1
#> [11] digest_0.6.25
#> [12] foreach_1.5.0
#> [13] htmltools_0.5.0
#> [14] magick_2.3
#> [15] fansi_0.4.1
#> [16] magrittr_1.5
#> [17] memoise_1.1.0
#> [18] BSgenome_1.54.0
#> [19] openxlsx_4.1.5
#> [20] remotes_2.1.1
#> [21] recipes_0.1.12
#> [22] Biostrings_2.54.0
#> [23] gower_0.2.1
#> [24] matrixStats_0.56.0
#> [25] BSgenome.Hsapiens.UCSC.hg38_1.4.1
#> [26] askpass_1.1
#> [27] pkgdown_1.6.1
#> [28] prettyunits_1.1.1
#> [29] colorspace_1.4-1
#> [30] blob_1.2.1
#> [31] rappdirs_0.3.1
#> [32] textshaping_0.3.4
#> [33] haven_2.3.1
#> [34] xfun_0.18
#> [35] dplyr_1.0.0
#> [36] jsonlite_1.7.1
#> [37] callr_3.4.3
#> [38] crayon_1.3.4
#> [39] RCurl_1.98-1.2
#> [40] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
#> [41] survival_3.1-12
#> [42] VariantAnnotation_1.30.1
#> [43] iterators_1.0.12
#> [44] glue_1.4.2
#> [45] gtable_0.3.0
#> [46] ipred_0.9-9
#> [47] zlibbioc_1.32.0
#> [48] XVector_0.26.0
#> [49] DelayedArray_0.12.3
#> [50] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0
#> [51] car_3.0-8
#> [52] pkgbuild_1.0.8
#> [53] OptimalCutpoints_1.1-4
#> [54] BiocGenerics_0.32.0
#> [55] abind_1.4-5
#> [56] scales_1.1.1
#> [57] DBI_1.1.0
#> [58] rstatix_0.5.0
#> [59] Rcpp_1.0.5
#> [60] progress_1.2.2
#> [61] foreign_0.8-75
#> [62] bit_1.1-15.2
#> [63] lava_1.6.7
#> [64] prodlim_2019.11.13
#> [65] stats4_3.6.3
#> [66] DT_0.13
#> [67] htmlwidgets_1.5.3
#> [68] httr_1.4.2
#> [69] RColorBrewer_1.1-2
#> [70] ellipsis_0.3.1
#> [71] farver_2.0.3
#> [72] pkgconfig_2.0.3
#> [73] XML_3.99-0.3
#> [74] nnet_7.3-14
#> [75] dbplyr_1.4.4
#> [76] caret_6.0-86
#> [77] labeling_0.3
#> [78] reshape2_1.4.4
#> [79] tidyselect_1.1.0
#> [80] rlang_0.4.7
#> [81] AnnotationDbi_1.48.0
#> [82] munsell_0.5.0
#> [83] cellranger_1.1.0
#> [84] tools_3.6.3
#> [85] cli_2.0.2
#> [86] generics_0.0.2
#> [87] RSQLite_2.2.0
#> [88] broom_0.7.2
#> [89] evaluate_0.14
#> [90] stringr_1.4.0
#> [91] yaml_2.2.1
#> [92] ragg_1.1.2
#> [93] ModelMetrics_1.2.2.2
#> [94] processx_3.4.2
#> [95] knitr_1.30
#> [96] bit64_0.9-7
#> [97] fs_1.4.1
#> [98] zip_2.0.4
#> [99] nlme_3.1-148
#> [100] biomaRt_2.42.1
#> [101] compiler_3.6.3
#> [102] rstudioapi_0.11
#> [103] curl_4.3
#> [104] e1071_1.7-3
#> [105] ggsignif_0.6.0
#> [106] tibble_3.0.1
#> [107] stringi_1.4.6
#> [108] ps_1.3.3
#> [109] GenomicFeatures_1.38.2
#> [110] desc_1.2.0
#> [111] forcats_0.5.0
#> [112] lattice_0.20-41
#> [113] Matrix_1.2-18
#> [114] vctrs_0.3.4
#> [115] pillar_1.4.4
#> [116] lifecycle_0.2.0
#> [117] BiocManager_1.30.10
#> [118] data.table_1.12.8
#> [119] bitops_1.0-6
#> [120] rtracklayer_1.46.0
#> [121] GenomicRanges_1.38.0
#> [122] R6_2.4.1
#> [123] rio_0.5.16
#> [124] IRanges_2.20.2
#> [125] sessioninfo_1.1.1
#> [126] codetools_0.2-16
#> [127] MASS_7.3-51.6
#> [128] assertthat_0.2.1
#> [129] pkgload_1.1.0
#> [130] cutpointr_1.0.2
#> [131] SummarizedExperiment_1.16.1
#> [132] openssl_1.4.1
#> [133] rprojroot_1.3-2
#> [134] withr_2.3.0
#> [135] GenomicAlignments_1.22.1
#> [136] Rsamtools_2.2.3
#> [137] S4Vectors_0.24.4
#> [138] GenomeInfoDbData_1.2.2
#> [139] parallel_3.6.3
#> [140] hms_0.5.3
#> [141] rpart_4.1-15
#> [142] grid_3.6.3
#> [143] timeDate_3043.102
#> [144] tidyr_1.1.0
#> [145] class_7.3-17
#> [146] rmarkdown_2.6
#> [147] BSgenome.Hsapiens.UCSC.hg19_1.4.0
#> [148] carData_3.0-4
#> [149] ggpubr_0.3.0
#> [150] pROC_1.16.2
#> [151] lubridate_1.7.9
#> [152] Biobase_2.46.0
In the example showcased we are using targeted sequencing (specific group of genes), but the application of this package can be extended also to whole exome sequencing (WES).↩︎
(Horizon, provides cell lines with a known mutational provide as well as the variant allele frequency (VAF). The samples in this example, are both characterized by mutations at 5% VAF; one of them was exposed to mild Formalin-fixed and paraffin-embedded (FFPE) treatment to reproduce the effects of DNA degradation typical of FFPE).↩︎