In this tutorial, we are going to perform an analysis of sample-to-sample variability using the PBMC data from this publication. This is a collection of immune cell types profiled from peripherial blood from two human donors across six single-cell RNA-seq technologies (10x Chromimum v2 and v3, CEL-seq2, Drop-seq, inDrops, Seq-Well and Smart-seq2).
library(scuttle)
library(scran)
library(scater)
library(uwot)
library(SingleCellExperiment)
library(ggplot2)
library(HDF5Array)
library(GEDI)
set.seed(43)
The raw count matrices and metadata are available through the SeuratData package:
library(SeuratData)
InstallData("pbmcsca") # Install pbmcsca object
data("pbmcsca") # Load the Seurat Object
We will generate a Single Cell Experiment (SCE) Object.
meta<- pbmcsca[[]] # Obtaining metadata
meta$Barcode<- rownames(meta) # Naming Barcode
sce<- SingleCellExperiment(list(counts=Seurat::GetAssayData(object = pbmcsca, slot = "counts")), colData=meta) # create SCE
sce
We will perform basic QC steps using the scuttle library.
sce<- addPerCellQC(sce, subsets=list(Mito=grep("MT-", rownames(sce) )) )
sce$log10_total_counts<- log10(sce$sum)
sce$log10_total_features_by_counts<- log10(sce$detected)
In our case, we already have a SingleCellExperiment created, which we will upload.Instructions on how to download the data can be found in here.
sce<- loadHDF5SummarizedExperiment(dir="./my_h5_se")
sce$subsets_Mito_percent<- sce$pct_counts_mito # Renaming so it's concordant with addPerCellQC
Visualize the quality of the data
ggplot(data.frame(colData(sce)), aes(log10_total_counts, log10_total_features_by_counts,color=subsets_Mito_percent ) )+
geom_point(size=0.5) +
viridis::scale_color_viridis() +
facet_wrap(~sce$Method)
Filtering low-quality cells
filter<- sce$log10_total_counts > 2.5 &
sce$log10_total_features_by_counts > 2 &
sce$subsets_Mito_percent <= 20
sce<- sce[,filter]
Filtering low-expressed genes
# We kept genes that had more than 5 counts in more than 3 cells across the entire dataset
sum_genes<- rowSums(assay(sce, "counts")>5)
genes_use<- sum_genes>3
table(genes_use)
## genes_use
## FALSE TRUE
## 19288 14406
genes_use<- names(which(genes_use))
# Filtering SCE
sce<- sce[genes_use,]
sce
## class: SingleCellExperiment
## dim: 14406 28035
## metadata(0):
## assays(2): counts logcounts
## rownames(14406): DPM1 SCYL3 ... RP11-107E5.4 RP11-299P2.2
## rowData names(8): is_feature_control is_feature_control_mito ...
## total_counts log10_total_counts
## colnames(28035): pbmc1_SM2_Cell_108 pbmc1_SM2_Cell_115 ...
## pbmc2_inDrops_1_TGAATCCT.GAGCCTTA.CCCAAGCA
## pbmc2_inDrops_1_TGAATCCT.TTATGCGA.CATCTCCC
## colData names(51): orig.ident nCount_RNA ... sizeFactor
## subsets_Mito_percent
## reducedDimNames(0):
## altExpNames(0):
After pre-processing, we are ready to run GEDI. First, we need to indicate a vector of samples. We will create a vector that indicates each sample, which in this case is a combination of the donor and single-cell technology used.
sce$donor_sample<- paste0(sce$orig.ident, ".", sce$Method)
meta<- data.frame(colData(sce))
IMPORTANT: The vector of samples (and potentially the metadata) needs to be in the same order as the gene expression matrix.
We will run GEDI using the raw counts, which need to be passed as a sparse Matrix.
# Accesing the raw counts in the SCE object and convert to a sparse Matrix format
raw_counts<- as(as.matrix(assay(sce, "counts")), "dgCMatrix")
Now, we are ready to run GEDI. These are the main arguments:
Optional arguments:
For this example, we are going to use GEDI with the raw counts and the Bsphere mode.
NOTE: With this complete dataset(~15K genes and ~28K cells), this step might take several hours, so it is recommended to run this using a high-performance computing cluster.
## Set up GEDI model
model <- new("GEDI") # Initialize GEDI object
model$setup(Samples = sce$donor_sample, # Vector indicating which sample belongs to each cell
colData=meta, # Metadata (optional)
M = raw_counts, # Expression data
K = 40, # Number of latent variables to use
mode = "Bsphere", # Modes to use: Either Bsphere (hyperellipsoid) or Bl2 (hyperplane)
oi_shrinkage = 0.001 # Shrinkage multiplier for oi. In here we use 0.001, to better accommodated the mean abundance differences that exist between multiple scRNA-seq technologies.
)
model$initialize.LVs(randomSeed = 1) # initialize LVs
model$optimize(iterations=500) # run model with 500 iterations
saveRDS( model, file="pbmc_gedi_model_bothDonors.rds") # Saving output model
The output GEDI model for this tutorial can also be accessed in here.
We check convergence of the GEDI model.
model$plotTracking()
## $ZAo
##
## $Bi
##
## $Qi
##
## $oi
##
## $si
##
## $Rk
## NULL
##
## $sigma2
These plots show the convergence of the different components of the model (o, Z, Bi, Qi, oi and si). The last plot shows the convergence of the mean squared error of the model.
IMPORTANT: GEDI reorders the cells by sample, so we need to reorder the original metadata.
# Reorder meta
meta<- meta[model$aux$cellIDs,]
After running the model, we can recover the integrated reference space.
# Generating svd
svd_res <- svd.gedi( model )
embedding_res_svd<- svd_res$v %*% diag(svd_res$d)
colnames(embedding_res_svd)<- paste0("embedding", 1:ncol(embedding_res_svd))
# Generating umap of 2 dimensions
umap_2_res <- umap(embedding_res_svd, min_dist=0.01, metric="euclidean")
colnames(umap_2_res)<- paste0("umap", 1:2)
rownames(umap_2_res)<- model$aux$cellIDs
umap_2_res<- data.frame(umap_2_res)
## Plot embeddings
GEDI::plot_embedding( umap_2_res, meta$donor_sample) + labs(x="umap1", y="umap2", title="UMAP (integration with GEDI)", color="Sample")
GEDI::plot_embedding( umap_2_res, meta$CellType) + labs(x="umap1", y="umap2", title="UMAP (integration with GEDI)", color="Cell Type")
We observe clear separation between the cell types without any obvious separation by the single-cell technology.
We first create a sample-level metadata matrix.
# Create a sample-level metadata matrix
H <- unique(
cbind(
meta$donor_sample, # Vector of samples
as.character(meta$orig.ident), # Donor
sub("[[:space:]]A$|[[:space:]]B$", "", meta$Method), # Technology (removing A and B 10x versions
meta$Method # Technology
) )
rownames(H) <- H[,1]
head(H)
## [,1] [,2]
## pbmc1.Smart-seq2 "pbmc1.Smart-seq2" "pbmc1"
## pbmc1.CEL-Seq2 "pbmc1.CEL-Seq2" "pbmc1"
## pbmc1.10x Chromium (v2) A "pbmc1.10x Chromium (v2) A" "pbmc1"
## pbmc1.10x Chromium (v2) B "pbmc1.10x Chromium (v2) B" "pbmc1"
## pbmc1.10x Chromium (v3) "pbmc1.10x Chromium (v3)" "pbmc1"
## pbmc1.Drop-seq "pbmc1.Drop-seq" "pbmc1"
## [,3] [,4]
## pbmc1.Smart-seq2 "Smart-seq2" "Smart-seq2"
## pbmc1.CEL-Seq2 "CEL-Seq2" "CEL-Seq2"
## pbmc1.10x Chromium (v2) A "10x Chromium (v2)" "10x Chromium (v2) A"
## pbmc1.10x Chromium (v2) B "10x Chromium (v2)" "10x Chromium (v2) B"
## pbmc1.10x Chromium (v3) "10x Chromium (v3)" "10x Chromium (v3)"
## pbmc1.Drop-seq "Drop-seq" "Drop-seq"
We then gather the set of sample-specific manifold parameters learned by GEDI (excluding sample-specific translation vectors).
Q <- NULL
for( i in 1:model$aux$numSamples ) {
Q <- cbind(Q,c(model$params$Qi[[i]])) # sample-specific manifold parameters are stored in model$params$Qi
}
colnames(Q) <- model$aux$Samples
rownames(Q) <- paste0( rep(model$aux$geneIDs,model$aux$K), ".Q", rep(1:model$aux$K,each=model$aux$J) )
dim(Q)
## [1] 576240 14
head(Q)
## pbmc1.Smart-seq2 pbmc1.CEL-Seq2 pbmc1.10x Chromium (v2) A
## DPM1.Q1 3.0753703 2.692895 0.21061525
## SCYL3.Q1 -4.2046855 -4.627479 0.05325036
## C1orf112.Q1 -0.7212641 3.042990 0.22357886
## FGR.Q1 -14.1739946 1.932431 1.21535994
## CFH.Q1 -0.7778644 2.618229 -0.15009692
## FUCA2.Q1 -3.5036795 -0.467825 0.30754981
## pbmc1.10x Chromium (v2) B pbmc1.10x Chromium (v3) pbmc1.Drop-seq
## DPM1.Q1 -0.5415665 0.19947351 0.104934041
## SCYL3.Q1 -0.2912763 -0.01943017 -0.474411299
## C1orf112.Q1 0.1631571 0.15508654 -0.007578783
## FGR.Q1 -0.9745976 1.95666129 0.532549804
## CFH.Q1 -0.3318041 -0.30000274 -0.038752417
## FUCA2.Q1 -0.6211875 0.72702544 0.007465043
## pbmc1.Seq-Well pbmc1.inDrops pbmc2.Smart-seq2 pbmc2.CEL-Seq2
## DPM1.Q1 0.345641165 0.160595259 1.333081 0.8746625
## SCYL3.Q1 -0.125962741 -0.262768503 13.112629 -1.1901022
## C1orf112.Q1 0.002837026 0.051449644 14.820585 0.4944724
## FGR.Q1 0.662727919 -0.790564370 -35.728184 -10.4607422
## CFH.Q1 -0.103552952 0.007647797 6.307930 0.4886349
## FUCA2.Q1 -0.473190989 -0.593790753 -1.184934 -1.9937301
## pbmc2.10x Chromium (v2) pbmc2.Drop-seq pbmc2.Seq-Well pbmc2.inDrops
## DPM1.Q1 -0.03246163 0.48979497 -0.8657543 0.32980807
## SCYL3.Q1 0.58696486 0.06990959 -0.4370313 -0.08295852
## C1orf112.Q1 -0.25416075 0.05541931 -0.8547718 0.13041050
## FGR.Q1 -5.06902935 -3.92230960 -2.3776228 -2.37889642
## CFH.Q1 -0.05641977 -0.35516364 0.1054431 -0.14496198
## FUCA2.Q1 -0.61754370 -0.48753373 -0.8113449 -0.06690240
The matrix of the sample-specific manifold parameters (Q) has dimensions:
We then regress out the effect of technology from the Q matrix.
tab<- table( H[,3], H[,2])
tab
##
## pbmc1 pbmc2
## 10x Chromium (v2) 2 1
## 10x Chromium (v3) 1 0
## CEL-Seq2 1 1
## Drop-seq 1 1
## inDrops 1 1
## Seq-Well 1 1
## Smart-seq2 1 1
techs_check<- rownames(tab)[rowSums(tab)> 1] # We only consider technologies with more than one sample
lis_H<- list()
lis_Q<- list()
# Regress out effect of technology
for( i in techs_check ){
print(i)
H_sample<- H[ H[,3]== i,]
Q_sample<- Q[ , H[,3]==i ]
Q_sample <- Q_sample - apply(Q_sample,1,mean) # remove mean from the Q matrix
lis_H[[i]]<- H_sample
lis_Q[[i]]<- Q_sample
}
## [1] "10x Chromium (v2)"
## [1] "CEL-Seq2"
## [1] "Drop-seq"
## [1] "inDrops"
## [1] "Seq-Well"
## [1] "Smart-seq2"
H_subset<- do.call(rbind, lis_H)
Q_subset<- do.call(cbind, lis_Q)
We select the top 20 most variable parameters learned by GEDI.
### top parameters
top_parameters <- order( apply(Q_subset,1,sd), decreasing=T )[1:20]
We then perform PCA and UMAP of the sample-specific manifold distortions.
# PCA
svd_Q <- svd(Q_subset[top_parameters,])
Q_rot <- svd_Q$v %*% diag(svd_Q$d)
rownames(Q_rot) <- unlist(lapply(lis_Q, colnames))
# umap
umap_res<- umap(Q_rot, min_dist=0.01, metric="euclidean", n_neighbors=3)
rownames(umap_res)<- colnames(Q_subset)
colnames(umap_res)<- paste0("umap", 1:2)
GEDI::plot_embedding( umap_res, H_subset[,2], size_point=3) + labs(x="umap1", y="umap2", title="Regress out effect of technology", color="Donor")
GEDI::plot_embedding( umap_res, H_subset[,3], size_point=3) + labs(x="umap1", y="umap2", title="Regress out effect of technology", color="Technology")
After regressing out the effect of technology on the sample-specific manifold parameters, we can observe that samples cluster by Donor.
We can also regress out the effect of Donor from the Q matrix.
tab<- table( H[,3], H[,2])
tab
##
## pbmc1 pbmc2
## 10x Chromium (v2) 2 1
## 10x Chromium (v3) 1 0
## CEL-Seq2 1 1
## Drop-seq 1 1
## inDrops 1 1
## Seq-Well 1 1
## Smart-seq2 1 1
techs_check<- rownames(tab)[rowSums(tab)> 1] # We only consider technologies with more than one sample
# Filter out H and Q matrices for technologies present in more than 1 sample
vec_filter<- H[,3] %in% techs_check
H<- H[vec_filter, ]
Q<- Q[, vec_filter ]
lis_H<- list()
lis_Q<- list()
# Regress out effect of Donor
for( i in colnames(tab) ){
print(i)
H_sample<- H[ H[,2]== i,]
Q_sample<- Q[ , H[,2]==i ]
Q_sample <- Q_sample - apply(Q_sample,1,mean) # remove mean from the Q matrix
lis_H[[i]]<- H_sample
lis_Q[[i]]<- Q_sample
}
## [1] "pbmc1"
## [1] "pbmc2"
H_subset<- do.call(rbind, lis_H)
Q_subset<- do.call(cbind, lis_Q)
We select the top 20 most variable parameters learned by GEDI.
### top parameters
top_parameters <- order( apply(Q_subset,1,sd), decreasing=T )[1:20]
We then perform PCA and UMAP of the sample-specific manifold distortions.
# PCA
svd_Q <- svd(Q_subset[top_parameters,])
Q_rot <- svd_Q$v %*% diag(svd_Q$d)
rownames(Q_rot) <- unlist(lapply(lis_Q, colnames))
# umap
umap_res<- umap(Q_rot, min_dist=0.01, metric="euclidean", n_neighbors=3)
rownames(umap_res)<- colnames(Q_subset)
colnames(umap_res)<- paste0("umap", 1:2)
GEDI::plot_embedding( umap_res, H_subset[,2], size_point=3) + labs(x="umap1", y="umap2", title="Regress out effect of Donor", color="Donor")
GEDI::plot_embedding( umap_res, H_subset[,3], size_point=3) + labs(x="umap1", y="umap2", title="Regress out effect of Donor", color="Technology")
After regressing out the effect of Donor on the sample-specific manifold parameters, we can observe that samples cluster by the Technology.
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS/LAPACK: /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/imkl/2020.1.217/compilers_and_libraries_2020.1.217/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
##
## locale:
## [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
## [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
## [7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] GEDI_0.0.0.9000 HDF5Array_1.18.1
## [3] rhdf5_2.34.0 DelayedArray_0.16.3
## [5] uwot_0.1.10 Matrix_1.3-3
## [7] scater_1.18.6 ggplot2_3.4.2
## [9] scran_1.18.6 scuttle_1.0.4
## [11] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
## [13] Biobase_2.50.0 GenomicRanges_1.42.0
## [15] GenomeInfoDb_1.26.7 IRanges_2.24.1
## [17] S4Vectors_0.28.1 BiocGenerics_0.36.0
## [19] MatrixGenerics_1.2.1 matrixStats_0.58.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 RcppAnnoy_0.0.18
## [3] tools_4.0.0 bslib_0.5.1
## [5] utf8_1.2.3 R6_2.5.0
## [7] irlba_2.3.3 vipor_0.4.5
## [9] colorspace_2.0-0 rhdf5filters_1.2.0
## [11] withr_2.5.0 tidyselect_1.2.0
## [13] gridExtra_2.3 compiler_4.0.0
## [15] cli_3.6.1 BiocNeighbors_1.8.2
## [17] labeling_0.4.2 sass_0.4.7
## [19] scales_1.2.1 stringr_1.5.0
## [21] digest_0.6.33 rmarkdown_2.7
## [23] XVector_0.30.0 pkgconfig_2.0.3
## [25] htmltools_0.5.6 sparseMatrixStats_1.2.1
## [27] fastmap_1.1.1 limma_3.46.0
## [29] highr_0.8 rlang_1.1.1
## [31] FNN_1.1.3 DelayedMatrixStats_1.12.3
## [33] jquerylib_0.1.3 generics_0.1.3
## [35] farver_2.1.0 jsonlite_1.8.7
## [37] BiocParallel_1.24.1 dplyr_1.1.1
## [39] RCurl_1.98-1.3 magrittr_2.0.3
## [41] BiocSingular_1.6.0 GenomeInfoDbData_1.2.4
## [43] Rcpp_1.0.8.3 ggbeeswarm_0.6.0
## [45] munsell_0.5.0 Rhdf5lib_1.12.1
## [47] fansi_1.0.4 viridis_0.6.4
## [49] lifecycle_1.0.3 stringi_1.5.3
## [51] yaml_2.2.1 edgeR_3.32.1
## [53] zlibbioc_1.36.0 grid_4.0.0
## [55] dqrng_0.2.1 lattice_0.20-41
## [57] beachmat_2.6.4 locfit_1.5-9.4
## [59] knitr_1.31 pillar_1.9.0
## [61] igraph_1.3.4 codetools_0.2-16
## [63] glue_1.6.2 evaluate_0.22
## [65] vctrs_0.6.1 gtable_0.3.0
## [67] cachem_1.0.4 xfun_0.22
## [69] rsvd_1.0.5 RcppEigen_0.3.3.9.1
## [71] RSpectra_0.16-0 viridisLite_0.4.2
## [73] tibble_3.2.1 beeswarm_0.3.1
## [75] bluster_1.0.0 statmod_1.4.35