In this tutorial, we are going to introduce splicing analysis using GEDI in the ratio mode, focused on performing exon-skipping event analysis. We will be using the data derived from two studies: Tasic 2016 and Tasic 2018. These two studies profiled neocortex tissue in adult mice using full-length single-cell sequencing protocols (SMARTer and SMART-seq).

01 - Pre-processing

Load libraries

library(HDF5Array)
library(ggplot2)
library(Matrix)
library(RColorBrewer)
library(uwot)
library(GEDI)
set.seed(43)

Loading data

To run GEDI in this mode, we need two types of observations for each exon skipping event:

1. The number of reads that support an exon inclusion event.
1. The number of reads that support an exon exclusion event.

In this case, the data has been processed using the Quantas pipeline, which generated two types of observations for each exon skipping event. They are stored as M1(inclusion) and M2(exclusion) in our SingleCell Experiment Object.

We can load the data.

sce<- loadHDF5SummarizedExperiment(dir="./my_h5_se")
sce

## class: SingleCellExperiment 
## dim: 14267 25352 
## metadata(0):
## assays(5): Y raw_M1 raw_M2 norm_M1 norm_M2
## rownames(14267):
##   CA-100036521-14294-117618-117680-170683[INC][40/1][DNT]
##   CA-100036521-117680-144789-144867-170683[INC][3/42][DNT] ...
##   CA-17380-8143-33842-34004-34913[INC][36/1]
##   CA-17387-3370-5354-5521-7141[INC][1/115][UPT]
## rowData names(0):
## colnames(25352): SRR2138604 SRR2138605 ... SRR8322945 SRR8322946
## colData names(4): Sample SRA_Run major_cell_type cell_cluster
## reducedDimNames(0):
## altExpNames(0):

We access the exon inclusion and exclusion events.

meta<- data.frame(colData(sce)) # Metadata
M1<- assay(sce, "raw_M1") # inclusion counts
M2<- assay(sce, "raw_M2") # exclusion counts
head(M1[,1:10])

## <6 x 10> sparse matrix of class DelayedMatrix and type "double":
##                                                          SRR2138604 ...
## CA-100036521-14294-117618-117680-170683[INC][40/1][DNT]          66   .
## CA-100036521-117680-144789-144867-170683[INC][3/42][DNT]          0   .
## CA-100037278-10390-10961-11073-12015[INC][5/2]                    0   .
## CA-100037283-3239-5230-5291-8262[INC][105/1][UPT]                 0   .
## CA-100039707-3310-7049-7311-33813[INC][1/3][UPT][DNT]             0   .
## CA-100039795-19480-40277-40334-43452[INC][6/2]                    0   .
##                                                          SRR2138614
## CA-100036521-14294-117618-117680-170683[INC][40/1][DNT]           8
## CA-100036521-117680-144789-144867-170683[INC][3/42][DNT]          0
## CA-100037278-10390-10961-11073-12015[INC][5/2]                    0
## CA-100037283-3239-5230-5291-8262[INC][105/1][UPT]                 0
## CA-100039707-3310-7049-7311-33813[INC][1/3][UPT][DNT]             0
## CA-100039795-19480-40277-40334-43452[INC][6/2]                    0

head(M2[,1:10])

## <6 x 10> sparse matrix of class DelayedMatrix and type "double":
##                                                          SRR2138604 ...
## CA-100036521-14294-117618-117680-170683[INC][40/1][DNT]           0   .
## CA-100036521-117680-144789-144867-170683[INC][3/42][DNT]         45   .
## CA-100037278-10390-10961-11073-12015[INC][5/2]                    0   .
## CA-100037283-3239-5230-5291-8262[INC][105/1][UPT]                 0   .
## CA-100039707-3310-7049-7311-33813[INC][1/3][UPT][DNT]             0   .
## CA-100039795-19480-40277-40334-43452[INC][6/2]                    0   .
##                                                          SRR2138614
## CA-100036521-14294-117618-117680-170683[INC][40/1][DNT]           0
## CA-100036521-117680-144789-144867-170683[INC][3/42][DNT]          7
## CA-100037278-10390-10961-11073-12015[INC][5/2]                    0
## CA-100037283-3239-5230-5291-8262[INC][105/1][UPT]                 0
## CA-100039707-3310-7049-7311-33813[INC][1/3][UPT][DNT]             0
## CA-100039795-19480-40277-40334-43452[INC][6/2]                    0

M1 (inclusion) and M2 (exclusion) are matrices that have the exon skipping events in the rows and cells in the columns.

Filter cells and genes

We filter cells that have low number of inclusion and exclusion counts

# Filter cels
filter<- which(log10(colSums(M1)) > 4.5 & log10(colSums(M2)) > 4)
length(filter)

## [1] 25352

M1<- M1[,filter]
M2<- M2[,filter]

Now, we filter low-expressed events. We keep events that have more than 50 exon inclusion counts and more than 50 exon exclusion counts across the entire dataset.

# Filter events
filter <- which( rowSums(M1) > 50 & rowSums(M2) > 50 ) # Filter events 
length(filter)

## [1] 14267

M1<- M1[filter,]
M2<- M2[filter,]

We can plot the distribution of the total number of events per cell after filtering

hist(log10(colSums(M1)), main="Total number of exon inclusion counts", xlab="")

hist(log10(colSums(M2)), main="Total number of exon exclusion counts", xlab="")

02 - Run GEDI

After pre-processing, we are ready to run GEDI. First, we need to indicate a vector of samples. In this case, each sample represents all the cells that come from one study.

meta<- meta[colnames(M1),]
table(meta$Sample)

## 
## Tasic2016 Tasic2018 
##      1413     23939

IMPORTANT: The vector of samples (and potentially the metadata) needs to be in the same order as the gene expression matrices M1 and M2.

We will run GEDI using both inclusion and exclusion count matrices. They need to be passed as sparse Matrix objects.

# Convert to a sparse Matrix format
M1<- as(as.matrix(M1), "dgCMatrix")
M2<- as(as.matrix(M2), "dgCMatrix")

Now, we are ready to run GEDI. These are the main arguments:

Samples: It should be a character vector indicating which sample belongs to each cell.
Expression matrix: Could be one of the following:
- Y: The log-transformed (and possibly normalized) gene expression matrix.
- M: The raw read count matrix, which needs to be in the sparse format. It can also be a list of two matrices, in which case they are considered as paired observations whose log-ratio must be modelled.
K: The number of latent variables.
mode: Two values are allowed:
- Bsphere: L2 norms of B columns are fixed. Interpretation is that we’re projecting the data on a hyperellipsoid of dimension K.
- Bl2: L2 norm of the entire B matrix is fixed. Interpretation is that we’re projecting the data on a lower-dimensional hyperplane with dimension K.
itelim: Number of iterations for optimization.

Optional arguments:

C: The gene-level biological prior. If NULL, it means that there is no prior information for Z.
H: Sample-level prior for sources of variation. If NULL, there will be no prior for Qi.
oi_shrinkage: Shrinkage multiplier for oi (offset vector per sample i).

For this example, we are going to use GEDI with the inclusion and exclusion raw counts and the Bsphere mode.

NOTE: With this complete dataset(~15K genes and ~25K cells), this step might take several hours, so it is recommended to run this using a high-performance computing cluster.

## Set up GEDI model
model <- new("GEDI") # Initialize GEDI object
model$setup(Samples = meta$Sample, # Vector indicating which sample belongs to each cell
            colData=meta, # Metadata (optional)
            M = list(M1,M2), # Expression data as a list of two matrices, in which case they are considered as paired observations whose log-ratio must be modelled.
            K = 20, # Number of latent variables to use
            mode = "Bsphere", # Modes to use: Either Bsphere (hyperellipsoid) or Bl2 (hyperplane)
            oi_shrinkage = 0.001 # Shrinkage multiplier for oi. In here we use 0.001, to better accommodated the mean abundance differences that exist between multiple scRNA-seq technologies.
            ) 
model$initialize.LVs(randomSeed = 1) # initialize LVs
model$optimize(iterations=150) # run model with 150 iterations
saveRDS( model, file="Tasic_gedi_model.rds") # Saving output model

The output GEDI model for this tutorial can also be accessed in here.

We check convergence of the GEDI model.

model$plotTracking()

## $ZAo

## 
## $Bi

## 
## $Qi

## 
## $oi

## 
## $si

## 
## $Rk
## NULL
## 
## $sigma2

These plots show the convergence of the different components of the model (o, Z, Bi, Qi, oi and si). The last plot shows the convergence of the mean squared error of the model.

IMPORTANT: GEDI reorders the cells by sample, so we need to reorder the original metadata.

# Reorder meta
meta<- meta[model$aux$cellIDs,]

03 - Visualize integration resuts

After running the model, we can recover the integrated reference space.

# Generating svd
svd_res <- svd.gedi( model )
embedding_res_svd<- svd_res$v %*% diag(svd_res$d)
colnames(embedding_res_svd)<- paste0("embedding", 1:ncol(embedding_res_svd))

# Generating umap of 2 dimensions
umap_2_res <- umap(embedding_res_svd, min_dist=0.01, metric="euclidean")
colnames(umap_2_res)<- paste0("umap", 1:2)
rownames(umap_2_res)<- model$aux$cellIDs
umap_2_res<- data.frame(umap_2_res)

## Plot embeddings
GEDI::plot_embedding( umap_2_res, meta$Sample) + labs(x="umap1", y="umap2", title="UMAP (integration with GEDI)", color="Sample")

GEDI::plot_embedding( umap_2_res, meta$major_cell_type) + labs(x="umap1", y="umap2", title="UMAP (integration with GEDI)", color="Cell Type")

Also, we can recover the imputed logit(PSI) from the GEDI model.

imputedY<- GEDI::getY.gedi(model)
dim(imputedY)

## [1] 14267 25352

We can visualize the imputed logit(PSI) of some target genes.

# Exon skipping event for Nrxn1
# CA-18189-887480-931921-932011-933369[INC][12/3]
event_name<- "CA-18189-887480-931921-932011-933369[INC][12/3]" 
plot_name<- "Nrxn1 chr17:90162113-90208494"
GEDI::plot_embedding(umap_2_res, imputedY[event_name,]) +
    labs(title=plot_name, color="Imputed logit PSI")

# Exon skipping event for Nptn
# CA-20320-3287-39423-39771-44447[INC][8/120][UPT]
event_name<- "CA-20320-3287-39423-39771-44447[INC][8/120][UPT]"
plot_name<- "Nptn chr9:58582406-58623855"
GEDI::plot_embedding(umap_2_res, imputedY[event_name,]) +
    labs(title=plot_name, color="Imputed logit PSI")

sessionInfo()

## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/imkl/2020.1.217/compilers_and_libraries_2020.1.217/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
## 
## locale:
##  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
##  [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
##  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
##  [3] Biobase_2.50.0              GenomicRanges_1.42.0       
##  [5] GenomeInfoDb_1.26.7         GEDI_0.0.0.9000            
##  [7] uwot_0.1.10                 RColorBrewer_1.1-2         
##  [9] ggplot2_3.4.2               HDF5Array_1.18.1           
## [11] rhdf5_2.34.0                DelayedArray_0.16.3        
## [13] IRanges_2.24.1              S4Vectors_0.28.1           
## [15] MatrixGenerics_1.2.1        matrixStats_0.58.0         
## [17] BiocGenerics_0.36.0         Matrix_1.3-3               
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.8.3           lattice_0.20-41        digest_0.6.33         
##  [4] utf8_1.2.3             RSpectra_0.16-0        R6_2.5.0              
##  [7] evaluate_0.22          highr_0.8              pillar_1.9.0          
## [10] zlibbioc_1.36.0        rlang_1.1.1            jquerylib_0.1.3       
## [13] rmarkdown_2.7          labeling_0.4.2         stringr_1.5.0         
## [16] RcppEigen_0.3.3.9.1    RCurl_1.98-1.3         munsell_0.5.0         
## [19] compiler_4.0.0         xfun_0.22              pkgconfig_2.0.3       
## [22] htmltools_0.5.6        tidyselect_1.2.0       tibble_3.2.1          
## [25] GenomeInfoDbData_1.2.4 codetools_0.2-16       fansi_1.0.4           
## [28] dplyr_1.1.1            withr_2.5.0            rhdf5filters_1.2.0    
## [31] bitops_1.0-6           grid_4.0.0             jsonlite_1.8.7        
## [34] gtable_0.3.0           lifecycle_1.0.3        magrittr_2.0.3        
## [37] scales_1.2.1           cli_3.6.1              stringi_1.5.3         
## [40] cachem_1.0.4           farver_2.1.0           XVector_0.30.0        
## [43] bslib_0.5.1            generics_0.1.3         vctrs_0.6.1           
## [46] RcppAnnoy_0.0.18       Rhdf5lib_1.12.1        tools_4.0.0           
## [49] glue_1.6.2             fastmap_1.1.1          yaml_2.2.1            
## [52] colorspace_2.0-0       knitr_1.31             sass_0.4.7