Introduction

Tumor mutational burden (TMB), the total amount of somatic coding mutations in a tumor, represents a promising bio-marker to predict immunotherapy response in cancer patients. Initial studies investigating TMB used whole exome sequencing (WES) for quantification. Recently, it was demonstrated that TMB can be accurately measured also from targeted sequencing gene panels, which consistently reduces sequencing costs and paves the way for TMB routine adoption in clinical care. Currently, different sets of mutations are used to perform TMB quantification (e.g. only non-synonymous mutations, both synonymous and non-synonymous mutations, all but known cancer mutations as they are over-represented in gene panels) and no common standard exists. This hampers fair comparison of results from different studies and different gene panels. TMBleR allows to perform parallel panel-based TMB quantification by different combinations of variants filtering (e.g. synonymous, known cancer mutations, by VAF) and to easily plot estimated TMB values and visualize differences between methods and/or gene panels.

library(TMBleR)

This package finds to following important applications:

Calculate the TMB for cancer patients undergoing genomic sequencing;¹
Investigate how well the estimated TMB value from a genomic panel correlates to a whole exome sequencing (WES);
Guide the design of a clinical genomics targeted panel by evaluate TMB performance to predict immunotherapy responders;

TMB quantification using different filters

In the example here provided, we want to perform TMB quantification of 2 different Horizon cell lines sequenced by a targeted gene panel². Different filters can be applied on the variants reported in the corresponding vcf files, prior to TMB quantification. Indeed, panel-based TMB quantification requires a method to extrapolate the number of mutations observed in the narrow sequencing space of gene panels to the whole exome. For this reason, synonymous mutations can be counted for panel-based TMB quantification: although not immunogenic, they increase the statistical sampling size and reduce the noise. Besides panel size, gene composition also affects the accuracy of panel-based TMB quantification. In many gene panels, the over-representation of genes known to be highly mutated in cancers may lead to TMB overestimation. Therefore, some studies try to avoid this bias by removing somatic COSMIC variants or truncating mutations in tumor suppressors from variants used for TMB quantification. Other filters can be applied to variants for TMB quantification, such as minimum VAF thresholds. No common standard currently exists and we apply here different filters and compare the obtained TMB.

Figure: TMBleR worflow schematic for simple TMB calculation

Dataset preparation

There are 3 inputs required to perform the analysis:

Genome assemby of reference: either “hg19” or “hg38”;
Somatic mutations identified by the gene panel, provided in vcf file format;
Panel design, provided as .bed file or .txt file with list of gene IDs.

Finally, in order for the package to work at his full potential, optionally you might want to provide the latest COMISC somatic pathogenic SNVs. This is optional and allows to use the remove.cancer filter option of applyFilter(). This will be discussed further below:

External dataset (Optional)

Genome assembly

Specify human assembly of reference, either “hg19” or “hg38”.

genome_assembly <- "hg19"

Somatic mutations identified by the gene panel

Read vcf files containing somatic variants identified by panel sequencing.

# Provide vcf_files as a list ---------------------------------
# Note that even when reading only one file you 
# need to provide it as a list with element name corresponding to the name to 
# give to the sample and value corresponding to the path to the file 
# (i.e. vcf_file=list(LungSample="inst/extdata/LungSample.vcf"))
vcf_files <- list(Horizon5="Horizon5_ExamplePanel.vcf", 
                  HorizonFFPEmild="HorizonFFPEmild_ExamplePanel.vcf")
# For each vcf file, get the absolute path. ------------------
# Here we are creating absolute paths by using .vcf files that were
# bundled with the package 
vcf_files <- lapply(vcf_files, function(x) 
  system.file("extdata", x, package = "TMBleR", mustWork = TRUE))
# read in the files -------------------------------------------
vcfs <- readVcfFiles(vcfFiles = vcf_files, assembly = genome_assembly)
#> Validate VCF input:
#> $Horizon5
#> $Horizon5$warnings
#> [1] "line 141: The VCF file contains 2 samples. Only the first (1_20170317_01_i01_0_horizon_5) will be used"            
#> [2] "line 142: AF is described both in FORMAT and in INFO of this and possibly other variants. FORMAT's AF will be used"
#> 
#> $Horizon5$errors
#> NULL
#> 
#> 
#> $HorizonFFPEmild
#> $HorizonFFPEmild$warnings
#> [1] "line 141: The VCF file contains 2 samples. Only the first (2_20170317_01_i01_0_horizon_ffpe_mild) will be used"    
#> [2] "line 142: AF is described both in FORMAT and in INFO of this and possibly other variants. FORMAT's AF will be used"
#> 
#> $HorizonFFPEmild$errors
#> NULL
#> Warnings and Errors exported to 'validator_check' variable

Panel Design

Read in input the panel sequencing design (either a .bed file with genomic coordinates of the targeted genes or .txt file with targeted gene names).

# Get absolute path
design_file_path = system.file( "extdata"
                        , "ExamplePanel_GeneIDs.txt"
                        , package = "TMBleR"
                        , mustWork = TRUE)

# Read bed file with the panel sequencing design 
design <- readDesign(
  filename = design_file_path
  , assembly = genome_assembly
  , ids = "entrezgene_id")
#> Warning: `select_()` is deprecated as of dplyr 0.7.0.
#> Please use `select()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> Warning: `filter_()` is deprecated as of dplyr 0.7.0.
#> Please use `filter()` instead.
#> See vignette('programming') for more help
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#> Cache found

# Preview
head(design)
#> GRanges object with 6 ranges and 1 metadata column:
#>       seqnames              ranges strand | ensembl_exon_id
#>          <Rle>           <IRanges>  <Rle> |     <character>
#>   [1]     chr1 244006427-244006538      - | ENSE00002329149
#>   [2]     chr1 243858893-243859018      - | ENSE00003573177
#>   [3]     chr1 243828074-243828185      - | ENSE00003478804
#>   [4]     chr1 243809195-243809339      - | ENSE00000815947
#>   [5]     chr1 243800913-243801044      - | ENSE00001069794
#>   [6]     chr1 243778398-243778463      - | ENSE00001358887
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths

External dataset (Optional)

External dataset are required for one of the mutation filters that removes known “cancer” mutation. This is achieved by setting remove.cancer = TRUE when calling applyFilter() (see section paragraph). Since these files are quite large, TMBleR comes with just a very small version that can be used to demo the package. If you are planning to use the remove.cancer filter, please see the vignette HOWTO_Import_external_data which contains the instructions on how to retrieve the COSMIC data and import it into TMBleR.

Mutation filtering

Apply different filters on the identified variants. Typically, known cancer mutations are removed when performing panel-based TMB quantification, because the enrichment in cancer genes of these panels may lead to TMB overestimation. Synonymous mutations instead are removed when using WES datasets (in principle they do not generate tumor neoepitopes), whereas they are kept when using panels to increase statistical sampling size and reduce noise. Here we showcase how you can set up 7 different filtering combinations that will later on be used to estimate the TMB:

Using an external dataset (downloaded from the Catalogue of Somatic Mutations in Cancer - COSMIC https://cancer.sanger.ac.uk/cosmic) to filter for pathogenic SNV mutations (e.g coding mutations described in COSMIC and truncating mutations in tumor suppressor genes). This is achieved by setting remove.cancer = TRUE;
filter one or a combination of mutations classes:
- “synonymous”
- “non-synonymous”
- “frameshift”
- “nonsense” (Usage example: variantType = c("synonymous", "frameshift))
Filer by Variant Allele Frequency (VAF), SNV with VAF lower than a specified threshold (defined by the vaf.cutoff argument) will be filtered out;
a combination of all the above.

When setting remove.cancer = T, TMBleR will filter out cancer mutation from the vcfs object. If you did not provide a cancer dataset, a demo dataset will be loaded instead and a warning message will be print out.

Note the results here shown use a under-sampled vcf from the COSMIC file. This data are only for illustration purposes.

# Filter 1 ------------------------------------------------------------
# Remove known cancer mutations using COSMIC data
vcfs_NoCancer <- applyFilters(  vcfs = vcfs
                              , assembly = genome_assembly
                              , design = design
                              , remove.cancer = T)
#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR

#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR

# Filter 2 ------------------------------------------------------------
# Remove synonymous mutations
vcfs_NoSynonymous <- applyFilters(vcfs = vcfs, 
                                  assembly = genome_assembly, 
                                  design = design, 
                                  variantType = c("synonymous"))

# Filter 3 ------------------------------------------------------------
# Remove known cancer mutations and synonymous mutations
vcfs_NoCancer_NoSynonymous <- applyFilters(vcfs = vcfs, 
                                           assembly = genome_assembly,
                                           design = design, 
                                           remove.cancer = T, 
                                           variantType = c("synonymous"))
#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR

#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR

# Filter 4 ------------------------------------------------------------
# Remove synonymous mutations and mutations with variant 
# allele frequency < 0.05
vcfs_NoSynonymous_VAFFilter <- applyFilters(vcfs = vcfs, 
                                            assembly = genome_assembly, 
                                            design = design, 
                                            vaf.cutoff = 0.05, 
                                            variantType = c("synonymous"))

# Filter 5 ------------------------------------------------------------
# Remove mutations with variant allele frequency < 0.05
vcfs_VAFFilter <- applyFilters(
  vcfs = vcfs, 
  assembly = genome_assembly, 
  design = design, 
  vaf.cutoff = 0.05,
  tsList = NULL, 
)

# Filter 6 ------------------------------------------------------------
# Remove known cancer mutations (e.g. coding mutations described in COSMIC and 
# truncating mutations in tumor suppressor genes) and mutations with variant 
# allele frequency < 0.05
vcfs_NoCancer_VAFFilter <- applyFilters(
  vcfs = vcfs, 
  assembly = genome_assembly, 
  design = design, 
  vaf.cutoff = 0.05, 
  remove.cancer = T, 
)
#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR

#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR

# Filter 7 ------------------------------------------------------------
# Remove known cancer mutations (e.g. coding mutations described in COSMIC and 
# truncating mutations in tumor suppressor genes), synonymous mutations and 
# mutations with variant allele frequency < 0.05
vcfs_NoCancer_VAFFilter_NoSynonymous <- applyFilters(
  vcfs = vcfs, 
  assembly = genome_assembly, 
  design = design, 
  vaf.cutoff = 0.05, 
  remove.cancer = T, 
  variantType = c("synonymous")
)
#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR

#> Warning in filterCOSMIC(vcf_NoTruncatingInTS, assembly): data object COSMIC_hg19
#> was not found. Loading a COSMIC demo dataset. Please do not use this for TMB
#> analysis but only for demoing TMBleR

# We also want to perform TMB quantification on a non filtered vcf. Therefore 
# we need to format it as required by the applyTMB function. We can 
# do so by simply providing no Filtering options
vcfs_nonfiltered <- applyFilters(
  vcfs = vcfs,
  assembly = genome_assembly, 
  design = design, 
)

TMB quantification

The following command allows to estimate the TMB from the different filtering criteria chosen.

# Concatenate all filtered and unfiltered vcfs
vcfs_all <- c(vcfs_NoCancer, 
              vcfs_NoCancer_NoSynonymous, 
              vcfs_NoSynonymous, 
              vcfs_NoSynonymous_VAFFilter, 
              vcfs_VAFFilter, 
              vcfs_NoCancer_VAFFilter, 
              vcfs_NoCancer_VAFFilter_NoSynonymous, 
              vcfs_nonfiltered )

# Perform TMB quantification
TMB_res=applyTMB(inputForTMB = vcfs_all, assembly = genome_assembly)

# Print results in table format
DT::datatable(TMB_res)

The TMB estimate results can be preview in different ways:

Classical barplot

where the samples are grouped together on the x-axis, while the TMB value per Mb is shown on the y-axis. Each filter is displayed with a different color.

# Plot results using barplots
plotTMB(TMB_results = TMB_res, type="barplot")

When processing a large number of samples, the previous plot might be a little bit cluttered. Here we present an alternative visualization that groups together the different samples showing their TMB value density for each filter.

# Plot results using density plots
plotTMB(TMB_results = TMB_res, type="densityplot")

Finally, a boxplot can also provide additional information on the distribution

# Plot results with boxplot
plotTMB(TMB_results = TMB_res, type="boxplot")

Simulate TMB quantification from a targeted gene panel

In case we are investigating a targeted gene panel design and we are interest in its performance to evaluate TMB but no sequencing data is available, then we use WES dataset to simulate it. This section covers an example on how to just do that; we will subset a WES dataset to only include the genes targeted by the panel in examination. Then we compare the accuracy of the (simulated) panel-based TMB with the reference value of WES-based TMB. This is a common strategy for in-silico validation of accuracy and predictive power of panel-based TMB.

Dataset preparation

There are 4 inputs required to perform the analysis:

Genome assemby of reference: either “hg19” or “hg38”;
Somatic mutations identified by WES;
Design of WES, provided as .bed file or .txt file with list of gene IDs;
Design of the panel to simulate, provided as BED file or TXT file with list of gene IDs.

Genome assembly

Specify human assembly of reference, either “hg19” or “hg38”

# 1 . =========================================================================
genome_assembly <- "hg19"

Somatic mutations identified by WES

The mutational profile is to be provided as a .vcf file. For this example we have already prepared a showcase dataset.

# Load built-in list of 4 vcfs generated by WES
# 2. ==========================================================================
# Read file names
vcf_files <- list(  Pat02="Pat02_vs_NORMAL.vcf"
                  , Pat03="Pat03_vs_NORMAL.vcf"
                  , Pat04="Pat04_vs_NORMAL.vcf"
                  , Pat06="Pat06_vs_NORMAL.vcf")

# Find the absolute path of the vcf files bundled with the TMBleR package
vcf_files <- lapply(
  vcf_files
  , function(x) system.file("extdata", x, package = "TMBleR", mustWork = TRUE)
)

# Read vcf files and put in a list
ExampleWESvcfs <- readVcfFiles(vcfFiles=vcf_files, genome_assembly)
#> Validate VCF input:
#> $Pat02
#> $Pat02$warnings
#> [1] "line 90: The VCF file contains 2 samples. Only the first (pat02) will be used"             
#> [2] "line 91: AF is neither described in FORMAT nor in INFO of this and possibly other variants"
#> 
#> $Pat02$errors
#> NULL
#> 
#> 
#> $Pat03
#> $Pat03$warnings
#> [1] "line 90: The VCF file contains 2 samples. Only the first (pat03) will be used"             
#> [2] "line 91: AF is neither described in FORMAT nor in INFO of this and possibly other variants"
#> 
#> $Pat03$errors
#> NULL
#> 
#> 
#> $Pat04
#> $Pat04$warnings
#> [1] "line 90: The VCF file contains 2 samples. Only the first (pat04) will be used"             
#> [2] "line 91: AF is neither described in FORMAT nor in INFO of this and possibly other variants"
#> 
#> $Pat04$errors
#> NULL
#> 
#> 
#> $Pat06
#> $Pat06$warnings
#> [1] "line 90: The VCF file contains 2 samples. Only the first (pat06) will be used"             
#> [2] "line 91: AF is neither described in FORMAT nor in INFO of this and possibly other variants"
#> 
#> $Pat06$errors
#> NULL
#> Warnings and Errors exported to 'validator_check' variable
#> Pat02
#> Some variants in the .vcf file were discarded because did not have a chromosome ID maching the expected format:
#> Found:
#>  [1] "1"          "2"          "3"          "4"          "5"         
#>  [6] "6"          "7"          "8"          "9"          "10"        
#> [11] "11"         "12"         "13"         "14"         "15"        
#> [16] "16"         "17"         "18"         "19"         "20"        
#> [21] "21"         "22"         "GL000191.1" "GL000192.1" "GL000193.1"
#> [26] "GL000194.1" "GL000195.1" "GL000196.1" "GL000197.1" "GL000198.1"
#> [31] "GL000199.1" "GL000200.1" "GL000201.1" "GL000202.1" "GL000203.1"
#> [36] "GL000204.1" "GL000205.1" "GL000206.1" "GL000207.1" "GL000208.1"
#> [41] "GL000209.1" "GL000210.1" "GL000211.1" "GL000212.1" "GL000213.1"
#> [46] "GL000214.1" "GL000215.1" "GL000216.1" "GL000217.1" "GL000218.1"
#> [51] "GL000219.1" "GL000220.1" "GL000221.1" "GL000222.1" "GL000223.1"
#> [56] "GL000224.1" "GL000225.1" "GL000226.1" "GL000227.1" "GL000228.1"
#> [61] "GL000229.1" "GL000230.1" "GL000231.1" "GL000232.1" "GL000233.1"
#> [66] "GL000234.1" "GL000235.1" "GL000236.1" "GL000237.1" "GL000238.1"
#> [71] "GL000239.1" "GL000240.1" "GL000241.1" "GL000242.1" "GL000243.1"
#> [76] "GL000244.1" "GL000245.1" "GL000246.1" "GL000247.1" "GL000248.1"
#> [81] "GL000249.1" "MT"         "X"          "Y"
#> Expected:
#>  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
#> [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
#> [19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"
#> Variants discarded: 5 out of 378

WES and gene panel design

Import both the WES genomic coordinates as well as the panel design targeted regions. In this example we will be using 2 gene panel designs :

ExamplePaneldesign.bed
MskImpactPanel_468GenesMutsAndCNA.txt

Finally the WES genomic coordinates can be found here:

SureSelect_Human_All_Exon_V2_Regions.bed

(note: use the following command to preview the file system.file("extdata","SureSelect_Human_All_Exon_V2_Regions.bed",package = "TMBleR",mustWork = TRUE)).

# Read files with WES and gene panel design
# 3. ==========================================================================

# First panel ------------------------------------------------
# Load the design of the first gene panels that you want to simulate
design_file_path <- system.file("extdata",
                          "ExamplePaneldesign.bed",
                          package = "TMBleR",
                          mustWork = TRUE)
ExamplePaneldesign <- readDesign(filename = design_file_path
                                 , assembly = "hg19"
                                 , name = "SimulatedPanel")


# Second Panel ------------------------------------------------
design_file_path <- system.file("extdata",
                         "MskImpactPanel_468GenesMutsAndCNA.txt",
                          package = "TMBleR",
                          mustWork = TRUE)
MSKdesign <- readDesign(filename = design_file_path
                        , assembly = "hg19"
                        , ids = "hgnc_symbol"
                        , name = "MSKPanel")
#> Cache found


# Load the design of the WES ---------------------------------
design_file_path <- system.file("extdata",
                          "SureSelect_Human_All_Exon_V2_Regions.bed",
                          package = "TMBleR",
                          mustWork = TRUE)
ExampleWESdesign <- readDesign(filename = design_file_path
                               , assembly = "hg19"
                               , name = "WES")

Mutation filtering

In the TMB quantification of genomic panel, cancer mutations are usually removed to avoid overestimation of TMB due to panel enrichment in cancer genes. This can be achieved by filtering out known cancer mutations or truncating mutations in tumor suppressors ( e.g. remove.cancer=T). In the following example, we will apply 3 different TMB filters to the WES somatic dataset:

Filter 1: remove cancer mutation (as in F1CDx and TSO500);
Filter 2: remove known cancer mutation AND any variant classified as either “synonymous”,“frameshift”, “nonsense” (as in MSK-IMPACT panel);
Filter 3: as usually done in WES studies to estimate TMB, filter out only synonymous mutation.

# Filters for gene panel simulation ===================================
# Filter 1 ------------------------------------------------------------
# Remove known cancer mutations (e.g. coding mutations described in COSMIC 
# and truncating mutations in tumor suppressor genes)
vcfs_NoCancer <- applyFilters(
  vcfs = ExampleWESvcfs
  , assembly = genome_assembly
  , design = ExampleWESdesign
  , remove.cancer = T
)

# Filter 2 ------------------------------------------------------------
# Remove synonymous, indels and nonsense mutations 
vcfs_NoSynonymousNoFrameshiftNoNonsense <- applyFilters(
  vcfs = ExampleWESvcfs, 
  assembly = genome_assembly, 
  design = ExampleWESdesign, 
  variantType = c("synonymous", "frameshift", "nonsense"))

# Filter 3  ----------------------------------------------------------
# Filter out synonymous mutations, as usually done for WES-based TMB 
# quantification. This filtered WES dataset will be used for WES-based TMB 
# quantification
vcfs_NoSynonymous_WES <- applyFilters(
  vcfs = ExampleWESvcfs, 
  assembly = genome_assembly, 
  design = ExampleWESdesign, 
  variantType = c("synonymous"))

Subset the WES

Now that we have applied different mutational filters on the somatic mutation in our WES dataset, we will try to simulate the panel by removing from the dataset the mutational profile that was “mapped” outside of the genomic panel of interest.

We will performed this process on the two panels we are currently investigating.

# PANEL 1 - generic example panel
# =====================================================================
# Subset WES with Filter 1
SimulatedExamplePanel_NoCancer <- applySimulatePanel(
  WES = vcfs_NoCancer,               # <--- results from Filter 1
  WES.design = ExampleWESdesign,     # <--- This is the WES design 
  panel.design = ExamplePaneldesign, # <--- this is design of the first panels
  assembly = genome_assembly
)
# Subset WES with Filter 2
SimulatedExamplePanel_NoSynonymousNoFrameshiftNoNonsense <- applySimulatePanel(
  WES = vcfs_NoSynonymousNoFrameshiftNoNonsense,  # <--- results from Filter 2
  WES.design = ExampleWESdesign,
  panel.design = ExamplePaneldesign, 
  assembly = genome_assembly
)
# Concatenate the 2 resulting panels with different vcf filters
vcfs_SimulatedExamplePanel <- c(
  SimulatedExamplePanel_NoCancer
  , SimulatedExamplePanel_NoSynonymousNoFrameshiftNoNonsense
)


# Panel 2 - MSK panel
# =====================================================================
# Subset WES with Filter 1
SimulatedMSKPanel_NoCancer <- applySimulatePanel(
  WES = vcfs_NoCancer,            # <--- results from Filter 1
  WES.design = ExampleWESdesign,  # <--- this is the WES design
  panel.design = MSKdesign,       # <--- this is the design of the second panel
  assembly = genome_assembly
)
# Subset WES with Filter 2
SimulatedMSKPanel_NoSynonymousNoFrameshiftNoNonsense <- applySimulatePanel(
  WES = vcfs_NoSynonymousNoFrameshiftNoNonsense,  # <--- results from Filter 2
  WES.design = ExampleWESdesign,
  panel.design = MSKdesign, 
  assembly = genome_assembly
)
vcfs_SimulatedMSKPanel <- c(
  SimulatedMSKPanel_NoCancer,
  SimulatedMSKPanel_NoSynonymousNoFrameshiftNoNonsense
)

TMB quantification

After the previous step, the subset dataset will only contain variants in the regions targeted by the panel you want to simulate. We can now finally estimate the TMB. We will estimate the TMB for the following 3 scenarios:

Simulated Example panel
Simulated MSK Panel
Original WES dataset

We have included also the “original WES” dataset so that we can perform a comparison.

# TMB quantification from the simulated panels 
# ====================================================================

#  Panel 1  ----------------------------------------------------------
TMBs_SimulatedExamplePanel <- applyTMB(
  inputForTMB = vcfs_SimulatedExamplePanel
  , assembly = genome_assembly
)
# generate plot
plotTMB(TMB_results = TMBs_SimulatedExamplePanel, type="barplot")



#  Panel 2  ----------------------------------------------------------
TMBs_SimulatedMSKPanel <- applyTMB(
  inputForTMB = vcfs_SimulatedMSKPanel
  , assembly = genome_assembly
)
# generate plot
plotTMB(TMB_results = TMBs_SimulatedMSKPanel, type="barplot")


# Original WES --------------------------------------------
TMBs_WES <- applyTMB(inputForTMB = vcfs_NoSynonymous_WES
                     , assembly = genome_assembly)
# generate plot
plotTMB(TMB_results = TMBs_WES, type="barplot")

The following code is just to merge together all the results from the 3 previous experiments into a single dataframe. We will have a new colum “Panel” that encodes the experiment

library(purrr)
#> 
#> Attaching package: 'purrr'
#> The following object is masked from 'package:TMBleR':
#> 
#>     set_names
#> The following object is masked from 'package:testthat':
#> 
#>     is_null
# Create a dataframe with the TMB results from different 
# panels  together all together
TMB_res_df <- list(SimulatedExamplePanel=TMBs_SimulatedExamplePanel
                  , SimulatedMSKPanel=TMBs_SimulatedMSKPanel
                  , WES=TMBs_WES)  %>%
  # convert Factors into strings in the process. 
  purrr::map( ~ dplyr::mutate_at(., c("Sample", "Filter", "Design"), as.character)) %>% 
  # make list into a dataframe
  dplyr::bind_rows()

# Preview
DT::datatable(TMB_res_df)

The same bar plot as before can now be printed all together into one single plot calling the very same fuction.

plotTMB(TMB_res_df)

Plot TMB distribution by groups

To simplify interpretation, the results can be grouped in

by filters
by design (e.g. panel or WES)

plotTMB(TMB_res_df, type ="densityplot")

For a more compact plot, we can use boxplots

plotTMB(TMB_res_df, type="boxplot")

Compare simulated panel-based TMB to the reference WES-based TMB values

Here we showcase how you can use the function correlateTMBvalues() to run a quick correlation analysis on two variable: the TMB generated by one of the filters and a comparison TMB from the original WES.

In the following example we will use Spearman correlation coefficient between the ExamplePanel and the original WES dataset both filtered out by removing the pathogenic mutations only (i.e. noCancerMuts, the other filters are not considered in this analysis).

# Spearman correlation -------------------------------------------------------
# Plot Spearman correlation
#par(cex.axis=1.2, cex.lab=1.2)
TMBs_SimulatedExamplePanel %>%
  dplyr::filter(Filter == "vaf=0NoCancerMuts") %>%
  .$TMB_per_Mb %>%
  correlateTMBvalues(panel.TMB = .,
                     WES.TMB = TMBs_WES$TMB_per_Mb,
                     corr.coeff = "spearman",
                     title.plot="Panel-WES correlation")

In alternative, it is also possible to use the Pearson correlation coefficient by changing the corr.coeff argument in the function:

# Pearson correlation --------------------------------------------------------
# Plot Pearson correlation
#par(cex.axis=1.2, cex.lab=1.2)
TMBs_SimulatedExamplePanel %>%
  dplyr::filter(Filter == "vaf=0NoCancerMuts") %>%
  .$TMB_per_Mb %>%
correlateTMBvalues(panel.TMB = ,
                   WES.TMB = as.numeric(as.vector(TMBs_WES$TMB_per_Mb)),
                   corr.coeff = "pearson",
                   title.plot="Panel-WES correlation")

Evaluate TMB performance to predict immunotherapy responders

If both TMB and clinical data are available, the TMBleR package allows to build ROC curves and evaluate TMB performance in predicting immunotherapy responders. Here we use a dataset from Hellman et al., Cancer Cell 2018 (PMID: 29731394), in a study on immunotherapy response on small cell lung cancer patients. The original dataset provided as supplementary table contains tumor mutational burden, as total number of non-synonymous mutations identified by WES, and clinical response, which was transformed in a binary vector of values ‘responder’ and ‘nonresponder’ to be suitable to generateBoxplots and generateROC functions.

Dataset preparation

There are 3 inputs required to perform the analysis:

Genome assembly of reference: either “hg19” or “hg38”
TMB values
response to immunotherapy

Genome assembly

# 1. Specify human assembly of reference, either "hg19" or "hg38"
# ==============================================================================
genome_assembly <- "hg19"

TMB values

Use in input TMB values generated above on simulated panel-based TMB already computed TMB_res_df.

Clinical data

In this section we will import the clinical dataset published by VanAllen EM. et al. in Science 2015. This dataset contains lots of information but we only interest into knowning if the patient was considered as either a responder or nonresponder.

# 2. =========================================================================
# Load dataset with clinical response to immunotherapy 
data("VanAllen_Clinical")

Annotate TMB values with clinical data

TMB_clinical = annotateTMB(TMB_df = TMB_res_df, ClinicalData = VanAllen_Clinical)

Compare TMB distribution between responders and non-responders

generateBoxplot(TMB_clinical)

Perform ROC analysis to evaluate sensitivity and specificity of TMB to classify responders versus non-responders using WES-based TMB values

The generateROC() function separates data in training (random 75% of input data) and test (remaining 25% of input data) using the caret package and provides metrics of classification performance using functions from the OptimalCutpoints package. It provides in output a list of objects containing all analyses results.

First, it identifies the best TMB cutoff, based on responders vs non-responders classification in the training set and provides in output: - best TMB cutoff, AUC, specificity, sensitivity, positive and negative predictive values, in the optimalCutpoints object “train_res” - the ROC curve, the predictive ROC curve and the variation of the optimal criterion value (i.e. Youden) for different cutoffs in the training set, in the optimalCutpoints object “train_res”

Then, it uses the identified TMB cutoff to perform classification of samples from the test set and provides in output: - the confusion matrix of test samples classification based on the best TMB cutoff value, in the confusionMatrix object “confMatrix_test” - the ROC curve for the test set, in the optimalCutpoints object “test_res”

# Set the seed to create reproducible train and test sets in generateROC
set.seed(949)


data(Hellman_SimulatedFM1Panel_WES)

# Fix dataset format
Hellman_Df <- Hellman_SimulatedFM1Panel_WES %>%
  dplyr::rename("TMB_per_Mb" = "WES.NumMuts") %>%
  dplyr::mutate("ClinicalResponse" = as.factor(.data$ClinicalResponse)) %>%
  dplyr::filter(!is.na(.data$TMB_per_Mb)) %>%
  dplyr::filter(!is.na(.data$ClinicalResponse))

    
# Analysis of  WES-based TMB performance in classification -------------------
WES_ROCanalysis <- generateROC(dataset = Hellman_Df,
                               method = "Youden")

# Print AUC, sensitivity and specificity, calculated on training set
print(summary(WES_ROCanalysis$train_res))
#> 
#> Call:
#> optimal.cutpoints.default(X = "TMB_per_Mb", status = "ClinicalResponse", 
#>     tag.healthy = "nonresponder", methods = method, data = training, 
#>     pop.prev = NULL, control = OptimalCutpoints::control.cutpoints(), 
#>     ci.fit = TRUE, conf.level = 0.95, trace = FALSE)
#> 
#> Area under the ROC curve (AUC):  0.691 (0.525, 0.857) 
#> 
#> CRITERION: Youden
#> Number of optimal cutoffs: 1
#> 
#>                      Estimate 95% CI lower limit 95% CI upper limit
#> cutoff            565.0000000                  -                  -
#> Se                  0.6666667          0.4099252          0.8665726
#> Sp                  0.7777778          0.6084824          0.8988495
#> PPV                 0.6000000          0.3997851          0.8296723
#> NPV                 0.8235294          0.6184616          0.9221689
#> DLR.Positive        3.0000000          1.5002929          5.9988285
#> DLR.Negative        0.4285714          0.2179369          0.8427828
#> FP                  8.0000000                  -                  -
#> FN                  6.0000000                  -                  -
#> Optimal criterion   0.4444444                  -                  -

# Print confusion matrix describing the performance of classification on the
# test set using the best TMB cutoff identified on the training set
print(WES_ROCanalysis$confMatrix_test)
#> Confusion Matrix and Statistics
#> 
#>               Reference
#> Prediction     responder nonresponder
#>   responder            3            3
#>   nonresponder         2            9
#>                                           
#>                Accuracy : 0.7059          
#>                  95% CI : (0.4404, 0.8969)
#>     No Information Rate : 0.7059          
#>     P-Value [Acc > NIR] : 0.6177          
#>                                           
#>                   Kappa : 0.3307          
#>                                           
#>  Mcnemar's Test P-Value : 1.0000          
#>                                           
#>             Sensitivity : 0.6000          
#>             Specificity : 0.7500          
#>          Pos Pred Value : 0.5000          
#>          Neg Pred Value : 0.8182          
#>              Prevalence : 0.2941          
#>          Detection Rate : 0.1765          
#>    Detection Prevalence : 0.3529          
#>       Balanced Accuracy : 0.6750          
#>                                           
#>        'Positive' Class : responder       
#> 

# Plot ROC curve on test set
par(cex.axis=1.5, cex.lab=1.5)
plot(WES_ROCanalysis$test_res, which=1)


# Analysis of  panel-based TMB performance in classification -------------------
panel_ROCanalysis <- generateROC(dataset = Hellman_Df,
                                 method = "Youden")

# Print AUC, sensitivity and specificity, calculated on training set
print(summary(panel_ROCanalysis$train_res))
#> 
#> Call:
#> optimal.cutpoints.default(X = "TMB_per_Mb", status = "ClinicalResponse", 
#>     tag.healthy = "nonresponder", methods = method, data = training, 
#>     pop.prev = NULL, control = OptimalCutpoints::control.cutpoints(), 
#>     ci.fit = TRUE, conf.level = 0.95, trace = FALSE)
#> 
#> Area under the ROC curve (AUC):  0.663 (0.481, 0.845) 
#> 
#> CRITERION: Youden
#> Number of optimal cutoffs: 1
#> 
#>                      Estimate 95% CI lower limit 95% CI upper limit
#> cutoff            565.0000000                  -                  -
#> Se                  0.6666667          0.4099252          0.8665726
#> Sp                  0.8055556          0.6397520          0.9180564
#> PPV                 0.6315789          0.4235782          0.8477211
#> NPV                 0.8285714          0.6267068          0.9289306
#> DLR.Positive        3.4285714          1.6345163          7.1917924
#> DLR.Negative        0.4137931          0.2111611          0.8108726
#> FP                  7.0000000                  -                  -
#> FN                  6.0000000                  -                  -
#> Optimal criterion   0.4722222                  -                  -

# Plot ROC curve, PROC curve and plot of optimal criterion values depending on
# TMB cutoff, calculated on the training set
par(mfrow=c(2,2), cex.axis=1.5, cex.lab=1.5)
plot(panel_ROCanalysis$train_res, which=1)
plot(panel_ROCanalysis$train_res, which=2)
plot(panel_ROCanalysis$train_res, which=3, ylim = c(0, 1))

# Print confusion matrix describing the performance of classification on the
# test set using the best TMB cutoff identified on the training set
print(panel_ROCanalysis$confMatrix_test)
#> Confusion Matrix and Statistics
#> 
#>               Reference
#> Prediction     responder nonresponder
#>   responder            3            4
#>   nonresponder         2            8
#>                                           
#>                Accuracy : 0.6471          
#>                  95% CI : (0.3833, 0.8579)
#>     No Information Rate : 0.7059          
#>     P-Value [Acc > NIR] : 0.7914          
#>                                           
#>                   Kappa : 0.2388          
#>                                           
#>  Mcnemar's Test P-Value : 0.6831          
#>                                           
#>             Sensitivity : 0.6000          
#>             Specificity : 0.6667          
#>          Pos Pred Value : 0.4286          
#>          Neg Pred Value : 0.8000          
#>              Prevalence : 0.2941          
#>          Detection Rate : 0.1765          
#>    Detection Prevalence : 0.4118          
#>       Balanced Accuracy : 0.6333          
#>                                           
#>        'Positive' Class : responder       
#> 

# Plot ROC curve on test set
par(cex.axis=1.5, cex.lab=1.5)
plot(panel_ROCanalysis$test_res, which=1)

Compare the ROC curves representing the performance of WES-based and panel-based TMB to predict clinical response.

par(cex.axis=1.5, cex.lab=1.5)
compareROC(dataset = Hellman_SimulatedFM1Panel_WES,
           TMB1 = "Panel.NumMuts",
           TMB2 = "WES.NumMuts",
           method = "d",
           paired = TRUE)
#> Setting levels: control = nonresponder, case = responder
#> Setting direction: controls < cases
#> Setting levels: control = nonresponder, case = responder
#> Setting direction: controls < cases
#> 
#>  DeLong's test for two correlated ROC curves
#> 
#> data:  TMB1 and TMB2 by as.factor(dataset$ClinicalResponse) (nonresponder, responder)
#> Z = 1.05, p-value = 0.2937
#> alternative hypothesis: true difference in AUC is not equal to 0
#> sample estimates:
#> AUC of roc1 AUC of roc2 
#>   0.7368659   0.6644022
#> Setting direction: controls < cases
#> Setting direction: controls < cases

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS  10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] purrr_0.3.4      TMBleR_0.1.1     testthat_2.3.2   devtools_2.3.0  
#> [5] usethis_1.6.1    BiocStyle_2.12.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] readxl_1.3.1                            
#>   [2] backports_1.1.7                         
#>   [3] BiocFileCache_1.10.2                    
#>   [4] systemfonts_1.0.1                       
#>   [5] plyr_1.8.6                              
#>   [6] splines_3.6.3                           
#>   [7] crosstalk_1.1.0.1                       
#>   [8] BiocParallel_1.20.1                     
#>   [9] GenomeInfoDb_1.22.1                     
#>  [10] ggplot2_3.3.1                           
#>  [11] digest_0.6.25                           
#>  [12] foreach_1.5.0                           
#>  [13] htmltools_0.5.0                         
#>  [14] magick_2.3                              
#>  [15] fansi_0.4.1                             
#>  [16] magrittr_1.5                            
#>  [17] memoise_1.1.0                           
#>  [18] BSgenome_1.54.0                         
#>  [19] openxlsx_4.1.5                          
#>  [20] remotes_2.1.1                           
#>  [21] recipes_0.1.12                          
#>  [22] Biostrings_2.54.0                       
#>  [23] gower_0.2.1                             
#>  [24] matrixStats_0.56.0                      
#>  [25] BSgenome.Hsapiens.UCSC.hg38_1.4.1       
#>  [26] askpass_1.1                             
#>  [27] pkgdown_1.6.1                           
#>  [28] prettyunits_1.1.1                       
#>  [29] colorspace_1.4-1                        
#>  [30] blob_1.2.1                              
#>  [31] rappdirs_0.3.1                          
#>  [32] textshaping_0.3.4                       
#>  [33] haven_2.3.1                             
#>  [34] xfun_0.18                               
#>  [35] dplyr_1.0.0                             
#>  [36] jsonlite_1.7.1                          
#>  [37] callr_3.4.3                             
#>  [38] crayon_1.3.4                            
#>  [39] RCurl_1.98-1.2                          
#>  [40] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 
#>  [41] survival_3.1-12                         
#>  [42] VariantAnnotation_1.30.1                
#>  [43] iterators_1.0.12                        
#>  [44] glue_1.4.2                              
#>  [45] gtable_0.3.0                            
#>  [46] ipred_0.9-9                             
#>  [47] zlibbioc_1.32.0                         
#>  [48] XVector_0.26.0                          
#>  [49] DelayedArray_0.12.3                     
#>  [50] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0
#>  [51] car_3.0-8                               
#>  [52] pkgbuild_1.0.8                          
#>  [53] OptimalCutpoints_1.1-4                  
#>  [54] BiocGenerics_0.32.0                     
#>  [55] abind_1.4-5                             
#>  [56] scales_1.1.1                            
#>  [57] DBI_1.1.0                               
#>  [58] rstatix_0.5.0                           
#>  [59] Rcpp_1.0.5                              
#>  [60] progress_1.2.2                          
#>  [61] foreign_0.8-75                          
#>  [62] bit_1.1-15.2                            
#>  [63] lava_1.6.7                              
#>  [64] prodlim_2019.11.13                      
#>  [65] stats4_3.6.3                            
#>  [66] DT_0.13                                 
#>  [67] htmlwidgets_1.5.3                       
#>  [68] httr_1.4.2                              
#>  [69] RColorBrewer_1.1-2                      
#>  [70] ellipsis_0.3.1                          
#>  [71] farver_2.0.3                            
#>  [72] pkgconfig_2.0.3                         
#>  [73] XML_3.99-0.3                            
#>  [74] nnet_7.3-14                             
#>  [75] dbplyr_1.4.4                            
#>  [76] caret_6.0-86                            
#>  [77] labeling_0.3                            
#>  [78] reshape2_1.4.4                          
#>  [79] tidyselect_1.1.0                        
#>  [80] rlang_0.4.7                             
#>  [81] AnnotationDbi_1.48.0                    
#>  [82] munsell_0.5.0                           
#>  [83] cellranger_1.1.0                        
#>  [84] tools_3.6.3                             
#>  [85] cli_2.0.2                               
#>  [86] generics_0.0.2                          
#>  [87] RSQLite_2.2.0                           
#>  [88] broom_0.7.2                             
#>  [89] evaluate_0.14                           
#>  [90] stringr_1.4.0                           
#>  [91] yaml_2.2.1                              
#>  [92] ragg_1.1.2                              
#>  [93] ModelMetrics_1.2.2.2                    
#>  [94] processx_3.4.2                          
#>  [95] knitr_1.30                              
#>  [96] bit64_0.9-7                             
#>  [97] fs_1.4.1                                
#>  [98] zip_2.0.4                               
#>  [99] nlme_3.1-148                            
#> [100] biomaRt_2.42.1                          
#> [101] compiler_3.6.3                          
#> [102] rstudioapi_0.11                         
#> [103] curl_4.3                                
#> [104] e1071_1.7-3                             
#> [105] ggsignif_0.6.0                          
#> [106] tibble_3.0.1                            
#> [107] stringi_1.4.6                           
#> [108] ps_1.3.3                                
#> [109] GenomicFeatures_1.38.2                  
#> [110] desc_1.2.0                              
#> [111] forcats_0.5.0                           
#> [112] lattice_0.20-41                         
#> [113] Matrix_1.2-18                           
#> [114] vctrs_0.3.4                             
#> [115] pillar_1.4.4                            
#> [116] lifecycle_0.2.0                         
#> [117] BiocManager_1.30.10                     
#> [118] data.table_1.12.8                       
#> [119] bitops_1.0-6                            
#> [120] rtracklayer_1.46.0                      
#> [121] GenomicRanges_1.38.0                    
#> [122] R6_2.4.1                                
#> [123] rio_0.5.16                              
#> [124] IRanges_2.20.2                          
#> [125] sessioninfo_1.1.1                       
#> [126] codetools_0.2-16                        
#> [127] MASS_7.3-51.6                           
#> [128] assertthat_0.2.1                        
#> [129] pkgload_1.1.0                           
#> [130] cutpointr_1.0.2                         
#> [131] SummarizedExperiment_1.16.1             
#> [132] openssl_1.4.1                           
#> [133] rprojroot_1.3-2                         
#> [134] withr_2.3.0                             
#> [135] GenomicAlignments_1.22.1                
#> [136] Rsamtools_2.2.3                         
#> [137] S4Vectors_0.24.4                        
#> [138] GenomeInfoDbData_1.2.2                  
#> [139] parallel_3.6.3                          
#> [140] hms_0.5.3                               
#> [141] rpart_4.1-15                            
#> [142] grid_3.6.3                              
#> [143] timeDate_3043.102                       
#> [144] tidyr_1.1.0                             
#> [145] class_7.3-17                            
#> [146] rmarkdown_2.6                           
#> [147] BSgenome.Hsapiens.UCSC.hg19_1.4.0       
#> [148] carData_3.0-4                           
#> [149] ggpubr_0.3.0                            
#> [150] pROC_1.16.2                             
#> [151] lubridate_1.7.9                         
#> [152] Biobase_2.46.0

In the example showcased we are using targeted sequencing (specific group of genes), but the application of this package can be extended also to whole exome sequencing (WES).↩︎
(Horizon, provides cell lines with a known mutational provide as well as the variant allele frequency (VAF). The samples in this example, are both characterized by mutations at 5% VAF; one of them was exposed to mild Formalin-fixed and paraffin-embedded (FFPE) treatment to reproduce the effects of DNA degradation typical of FFPE).↩︎

An introduction to TMBleR

2021-07-14

Introduction

TMB quantification using different filters

Dataset preparation

Genome assembly

Somatic mutations identified by the gene panel

Panel Design

External dataset (Optional)

Mutation filtering

TMB quantification

Simulate TMB quantification from a targeted gene panel

Dataset preparation

Genome assembly

Somatic mutations identified by WES

WES and gene panel design

Mutation filtering

Subset the WES

TMB quantification

Plot TMB distribution by groups

Compare simulated panel-based TMB to the reference WES-based TMB values

Evaluate TMB performance to predict immunotherapy responders

Dataset preparation

Genome assembly

TMB values

Clinical data

Annotate TMB values with clinical data

Compare TMB distribution between responders and non-responders

Perform ROC analysis to evaluate sensitivity and specificity of TMB to classify responders versus non-responders using WES-based TMB values

Compare the ROC curves representing the performance of WES-based and panel-based TMB to predict clinical response.