In this tutorial, we are going to estimate the activity of various transcription factors (TFs) using the PBMC data from this publication. This is a collection of immune cell types profiled from peripherial blood from two human donors across six single-cell RNA-seq technologies (10x Chromimum v2 and v3, CEL-seq2, Drop-seq, inDrops, Seq-Well and Smart-seq2).

01 - Pre-processing

Load libraries

library(scuttle)
library(scran)
library(scater)
library(uwot)
library(SingleCellExperiment)
library(ggplot2)
library(HDF5Array)
library(pheatmap)
library(GEDI)
set.seed(43)

Downloading data

The raw count matrices and metadata are available through the SeuratData package:

library(SeuratData)
InstallData("pbmcsca") # Install pbmcsca object
data("pbmcsca") # Load the Seurat Object

We will generate a Single Cell Experiment (SCE) Object.

meta<- pbmcsca[[]] # Obtaining metadata
meta$Barcode<- rownames(meta) # Naming Barcode
sce<- SingleCellExperiment(list(counts=Seurat::GetAssayData(object = pbmcsca, slot = "counts")), colData=meta) # create SCE
sce

Quality control steps

We will perform basic QC steps using the scuttle library.

sce<- addPerCellQC(sce, subsets=list(Mito=grep("MT-", rownames(sce) )) )
sce$log10_total_counts<- log10(sce$sum)
sce$log10_total_features_by_counts<- log10(sce$detected)

In our case, we already have a SingleCellExperiment created, which we will upload. Instructions on how to download the data can be found in here.

sce<- loadHDF5SummarizedExperiment(dir="./my_h5_se")
sce$subsets_Mito_percent<- sce$pct_counts_mito # Renaming so it's concordant with addPerCellQC

Visualize the quality of the data

ggplot(data.frame(colData(sce)), aes(log10_total_counts, log10_total_features_by_counts,color=subsets_Mito_percent ) )+
    geom_point(size=0.5) +    
    viridis::scale_color_viridis() +
    facet_wrap(~sce$Method)

Filtering low-quality cells

filter<- sce$log10_total_counts > 2.5 &
    sce$log10_total_features_by_counts > 2 &
    sce$subsets_Mito_percent <= 20
sce<- sce[,filter]

Filtering low-expressed genes

# We kept genes that had more than 5 counts in more than 3 cells across the entire dataset
sum_genes<- rowSums(assay(sce, "counts")>5)
genes_use<- sum_genes>3
table(genes_use)

## genes_use
## FALSE  TRUE 
## 19288 14406

genes_use<- names(which(genes_use))

# Filtering SCE
sce<- sce[genes_use,]
sce

## class: SingleCellExperiment 
## dim: 14406 28035 
## metadata(0):
## assays(2): counts logcounts
## rownames(14406): DPM1 SCYL3 ... RP11-107E5.4 RP11-299P2.2
## rowData names(8): is_feature_control is_feature_control_mito ...
##   total_counts log10_total_counts
## colnames(28035): pbmc1_SM2_Cell_108 pbmc1_SM2_Cell_115 ...
##   pbmc2_inDrops_1_TGAATCCT.GAGCCTTA.CCCAAGCA
##   pbmc2_inDrops_1_TGAATCCT.TTATGCGA.CATCTCCC
## colData names(51): orig.ident nCount_RNA ... sizeFactor
##   subsets_Mito_percent
## reducedDimNames(0):
## altExpNames(0):

Loading prior biological information

For the analysis of transcription factor (TF) networks, we downloaded the human DoRothEA gene regulatory network and restricted the TF-gene interactions to the high-confidence sets A, B, and C. We then refined this TF-gene interaction matrix, using the whole blood data from GTEx and generated a weighted regulatory network. You can download this matrix from here.

# Read C matrix
C<- readRDS("C_dorothea_tf_human.refined.rds")
dim(C)

## [1] 1395  100

head(C[,15:25])

##        CUX1 E2F1 E2F4 E2F6 EBF1       EGR1 ELF1 ELF3 ESR1 ESRRA      ETS1
## ABCA1     0    0    0    0    0 0.00000000    0    0    0     0 0.0000000
## ABCA7     0    0    0    0    0 0.00000000    0    0    0     0 0.0000000
## ABCB1     0    0    0    0    0 0.02751281    0    0    0     0 0.7843831
## ABCB9     0    0    0    0    0 0.00000000    0    0    0     0 0.0000000
## ABCC1     0    0    0    0    0 0.00000000    0    0    0     0 0.3412032
## ABLIM1    0    0    0    0    0 0.00000000    0    0    0     0 0.0000000

C is a weigthed regulatory network with regulated genes in the rows and TFs in the columns.

02 - Run GEDI

Setting up input matrices

First, we need to indicate a vector of samples. We will create a vector that indicates each sample, which in this case is a combination of the donor and single-cell technology used.

sce$donor_sample<- paste0(sce$orig.ident, ".", sce$Method)
meta<- data.frame(colData(sce))

IMPORTANT: The vector of samples (and potentially the metadata) needs to be in the same order as the gene expression matrix.

We will run GEDI using the raw counts, which need to be passed as a sparse Matrix.

# Accesing the raw counts in the SCE object and convert to a sparse Matrix format
raw_counts<- as(as.matrix(assay(sce, "counts")), "dgCMatrix")

IMPORTANT: Now, we need to prepare our matrix of prior information. First, it needs to include the same genes of the gene expression matrix (and be in the same order). Also, we need to remove any columns from the C matrix that result in singularity.

commongenes<- intersect(rownames(C), rownames(raw_counts) )
C<- C[commongenes,]
raw_counts<- raw_counts[commongenes,]
dim(raw_counts)

## [1]  1121 28035

We now remove any columns that result in singularity from the C matrix.

QR <- qr(crossprod(C))
C <- C[ , QR$pivot[seq_len(QR$rank)] ] # remove any columns that results in singularity
dim(C)

## [1] 1121  100

head(C[, 15:25])

##        CUX1 E2F1 E2F4 E2F6 EBF1       EGR1 ELF1 ELF3 ESR1 ESRRA      ETS1
## ABCA1     0    0    0    0    0 0.00000000    0    0    0     0 0.0000000
## ABCA7     0    0    0    0    0 0.00000000    0    0    0     0 0.0000000
## ABCB1     0    0    0    0    0 0.02751281    0    0    0     0 0.7843831
## ABCB9     0    0    0    0    0 0.00000000    0    0    0     0 0.0000000
## ABCC1     0    0    0    0    0 0.00000000    0    0    0     0 0.3412032
## ABLIM1    0    0    0    0    0 0.00000000    0    0    0     0 0.0000000

Set up GEDI object

Now, we are ready to run GEDI. These are the main arguments:

Samples: Batch variable to use. It should be a character vector indicating which sample belongs to each cell.
Expression matrix: Could be one of the following:
- Y: The log-transformed (and possibly normalized) gene expression matrix.
- M: The raw read count matrix, which needs to be in the sparse format. It can also be a list of two matrices, in which case they are considered as paired observations whose log-ratio must be modelled.
K: The number of latent variables.
mode: Two values are allowed:
- Bsphere: L2 norms of B columns are fixed. Interpretation is that we’re projecting the data on a hyperellipsoid of dimension K.
- Bl2: L2 norm of the entire B matrix is fixed. Interpretation is that we’re projecting the data on a lower-dimensional hyperplane with dimension K.
itelim: Number of iterations for optimization.

Optional arguments:

C: The gene-level biological prior. If NULL, it means that there is no prior information for Z.
H: Sample-level prior for sources of variation. If NULL, there will be no prior for Qi.
oi_shrinkage: Shrinkage multiplier for oi (offset vector per sample i).

For this example, we are going to use GEDI with the raw counts and the Bsphere mode.

NOTE: With this dataset(~1K genes and ~28K cells), this step might take ~1 hour.

## Set up GEDI model
model <- new("GEDI") # Initialize GEDI object
model$setup(Samples = sce$donor_sample, # Vector indicating which sample belongs to each cell
            colData=meta, # Metadata (optional)
            M = raw_counts, # Expression data
            C = C, # Gene level prior information
            K = 40, # Number of latent variables to use
            mode = "Bsphere", # Modes to use: Either Bsphere (hyperellipsoid) or Bl2 (hyperplane)
            oi_shrinkage = 0.001 # Shrinkage multiplier for oi. In here we use 0.001, to better accommodated the mean abundance differences that exist between multiple scRNA-seq technologies.
            ) 
model$initialize.LVs(randomSeed = 1) # initialize LVs
model$optimize(iterations=500) # run model with 500 iterations
saveRDS( model, file="pbmc_gedi_model_bothDonors.rds") # Saving output model

The output GEDI model for this tutorial can also be accessed in here.

We check convergence of the GEDI model.

model$plotTracking()

## $ZAo

## 
## $Bi

## 
## $Qi

## 
## $oi

## 
## $si

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis

## 
## $Rk
## NULL
## 
## $sigma2

These plots show the convergence of the different components of the model (o, Z, A, Bi, Qi, oi and si). The last plot shows the convergence of the mean squared error of the model.

IMPORTANT: GEDI reorders the cells by sample, so we need to reorder the original metadata.

# Reorder meta
meta<- meta[model$aux$cellIDs,]
sce<- sce[,rownames(meta)]

03 - Visualize integration results

After running the model, we can recover the integrated reference space.

# Generating svd
svd_res <- svd.gedi( model )
embedding_res_svd<- svd_res$v %*% diag(svd_res$d)
colnames(embedding_res_svd)<- paste0("embedding", 1:ncol(embedding_res_svd))

# Generating umap of 2 dimensions
umap_2_res <- umap(embedding_res_svd, min_dist=0.01, metric="euclidean")
colnames(umap_2_res)<- paste0("umap", 1:2)
rownames(umap_2_res)<- model$aux$cellIDs
umap_2_res<- data.frame(umap_2_res)

## Plot embeddings
GEDI::plot_embedding( umap_2_res, meta$donor_sample) + labs(x="umap1", y="umap2", title="UMAP (integration with GEDI)", color="Sample")

GEDI::plot_embedding( umap_2_res, meta$CellType) + labs(x="umap1", y="umap2", title="UMAP (integration with GEDI)", color="Cell Type")

We observe clear separation between the cell types without any obvious separation by the single-cell technology.

04 - Retrieve TF activities

We can retrieve the activities from the GEDI model.

A<- GEDI::getA.gedi(model) # Activity x LVs
dim(A)

## [1] 100  40

head(A[,1:5])

##              LV:1       LV:2       LV:3        LV:4        LV:5
## AHR    -1.2894447  0.9928540 -1.6033879  0.92992032  0.03720394
## AR      1.5179697 -5.6074397  7.3481481  8.45830085 -0.18448963
## ARID2  -0.3574705  0.1816064  0.7649295 -2.03131675  0.64930862
## ARID3A -0.1629303 -0.7359476  0.3739424 -0.03835742 -0.92378268
## ARNT    1.4239740  1.3032202 -0.4937760 -2.32518488 -1.05299262
## BACH1  -1.0265562 -0.6659908  0.7240469 -0.30899539 -1.06143289

ADB<- GEDI::getADB.gedi(model) # Activity x cell
dim(ADB)

## [1]   100 28035

head(ADB[,1:5])

##        pbmc1_SM2_Cell_108 pbmc1_SM2_Cell_115 pbmc1_SM2_Cell_133
## AHR           -0.01168929        -0.01164058       -0.132472856
## AR            -0.45435347         0.47167275        0.156151402
## ARID2          0.07819523        -0.02252334       -0.002561416
## ARID3A        -0.02038519         0.01306837       -0.043909744
## ARNT          -0.10569826         0.06566846        0.110040745
## BACH1         -0.09023755        -0.02493510       -0.026140590
##        pbmc1_SM2_Cell_142 pbmc1_SM2_Cell_143
## AHR           0.090675173        0.030735040
## AR           -0.103701106        0.143024807
## ARID2        -0.030121667       -0.082159380
## ARID3A       -0.006812148        0.007579779
## ARNT          0.113334131       -0.006798111
## BACH1        -0.046906844       -0.086500705

We can project the activities of different TFs that are cell type specific.

GEDI::plot_embedding( umap_2_res, ADB["PAX5",]) + labs(x="umap1", y="umap2", title="Activity of PAX5", color="Activity")

GEDI::plot_embedding( umap_2_res, ADB["TCF7",]) + labs(x="umap1", y="umap2", title="Activity of TCF7", color="Activity")

sessionInfo()

## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/imkl/2020.1.217/compilers_and_libraries_2020.1.217/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
## 
## locale:
##  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
##  [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
##  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] GEDI_0.0.0.9000             pheatmap_1.0.12            
##  [3] HDF5Array_1.18.1            rhdf5_2.34.0               
##  [5] DelayedArray_0.16.3         uwot_0.1.10                
##  [7] Matrix_1.3-3                scater_1.18.6              
##  [9] ggplot2_3.4.2               scran_1.18.6               
## [11] scuttle_1.0.4               SingleCellExperiment_1.12.0
## [13] SummarizedExperiment_1.20.0 Biobase_2.50.0             
## [15] GenomicRanges_1.42.0        GenomeInfoDb_1.26.7        
## [17] IRanges_2.24.1              S4Vectors_0.28.1           
## [19] BiocGenerics_0.36.0         MatrixGenerics_1.2.1       
## [21] matrixStats_0.58.0         
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6              RcppAnnoy_0.0.18         
##  [3] RColorBrewer_1.1-2        tools_4.0.0              
##  [5] bslib_0.5.1               utf8_1.2.3               
##  [7] R6_2.5.0                  irlba_2.3.3              
##  [9] vipor_0.4.5               colorspace_2.0-0         
## [11] rhdf5filters_1.2.0        withr_2.5.0              
## [13] tidyselect_1.2.0          gridExtra_2.3            
## [15] compiler_4.0.0            cli_3.6.1                
## [17] BiocNeighbors_1.8.2       labeling_0.4.2           
## [19] sass_0.4.7                scales_1.2.1             
## [21] stringr_1.5.0             digest_0.6.33            
## [23] rmarkdown_2.7             XVector_0.30.0           
## [25] pkgconfig_2.0.3           htmltools_0.5.6          
## [27] sparseMatrixStats_1.2.1   highr_0.8                
## [29] fastmap_1.1.1             limma_3.46.0             
## [31] rlang_1.1.1               DelayedMatrixStats_1.12.3
## [33] jquerylib_0.1.3           generics_0.1.3           
## [35] farver_2.1.0              jsonlite_1.8.7           
## [37] BiocParallel_1.24.1       dplyr_1.1.1              
## [39] RCurl_1.98-1.3            magrittr_2.0.3           
## [41] BiocSingular_1.6.0        GenomeInfoDbData_1.2.4   
## [43] Rcpp_1.0.8.3              ggbeeswarm_0.6.0         
## [45] munsell_0.5.0             Rhdf5lib_1.12.1          
## [47] fansi_1.0.4               viridis_0.6.4            
## [49] lifecycle_1.0.3           stringi_1.5.3            
## [51] yaml_2.2.1                edgeR_3.32.1             
## [53] zlibbioc_1.36.0           grid_4.0.0               
## [55] dqrng_0.2.1               lattice_0.20-41          
## [57] beachmat_2.6.4            locfit_1.5-9.4           
## [59] knitr_1.31                pillar_1.9.0             
## [61] igraph_1.3.4              codetools_0.2-16         
## [63] glue_1.6.2                evaluate_0.22            
## [65] vctrs_0.6.1               gtable_0.3.0             
## [67] cachem_1.0.4              xfun_0.22                
## [69] rsvd_1.0.5                RcppEigen_0.3.3.9.1      
## [71] RSpectra_0.16-0           viridisLite_0.4.2        
## [73] tibble_3.2.1              beeswarm_0.3.1           
## [75] bluster_1.0.0             statmod_1.4.35

GEDI: TF analysis