library(PhantomPurgeR)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
Registered S3 method overwritten by 'data.table':
method from
print.data.table
The data directory input_dir is the filepath of the folder that contains all the molecule_info.h5 files of all the samples that were multiplexed on the same lane. The files can be renamed by sample names (e.g. sample_name.h5) before being loaded or after as shown below.
input_dir <- "./data/novaseq_l1/input"
samples_filepaths <- get_h5_filenames(input_dir)
samples_filepaths
P7_0 P7_1
"/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_0.h5" "/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_1.h5"
P7_10 P7_11
"/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_10.h5" "/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_11.h5"
P7_12 P7_13
"/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_12.h5" "/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_13.h5"
P7_14 P7_15
"/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_14.h5" "/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_15.h5"
P7_2 P7_3
"/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_2.h5" "/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_3.h5"
P7_4 P7_5
"/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_4.h5" "/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_5.h5"
P7_6 P7_7
"/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_6.h5" "/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_7.h5"
P7_8 P7_9
"/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_8.h5" "/project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_9.h5"
Change the names of the samples if the filenames are too long.
#names(samples_filepaths) <- paste0("s", 1:length(samples_filepaths))
The workflow can be run in four stages or alternatively all steps can all be run at once using the phantom_purger function.
#out <- phantom_purger(samples, torc=3, return_readcounts=FALSE, return_discarded=FALSE)
out <- read10xMolInfoSamples(samples_filepaths)
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_0.h5
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_1.h5
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_10.h5
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_11.h5
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_12.h5
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_13.h5
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_14.h5
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_15.h5
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_2.h5
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_3.h5
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_4.h5
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_5.h5
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_6.h5
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_7.h5
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_8.h5
Loading sample: /project/6007998/rfarouni/index_hopping/data/novaseq_l1/input/P7_9.h5
Loading samples data from molecule_info.h5 files produced by 10X Genomics CellRanger software: 78.001 sec elapsed
out <- join_data(out)
Joining data. Step 1: creating read counts datatables from lists: 0.034 sec elapsed
NAs produced by integer overflow
Joining data. Step 2: joining and merging datatables for all samples keyed by cell, umi, and gene: 659.432 sec elapsed
Joining data. Step 3: creating an outcome variable for each row: 2979.376 sec elapsed
out$read_counts
out <- estimate_hopping_rate(out)
Estimating SIHR. Step 1: creating outcome counts datatable: 58.106 sec elapsed
Estimating SIHR. Step 2: creating chimera counts datatable: 0.098 sec elapsed
Estimating SIHR. Step 3: fitting GLM: 0.091 sec elapsed
Estimating SIHR. Step 4: computing summary statistics: 0.471 sec elapsed
out$glm_estimates
out$summary_stats$summary_estimates
p_chimeras is the proportion CUGs that are chimeric. g is the estimated proportion of fugue molecules and u is the molecule inflation factor such that n_cugs x u would give the number of non-fugue phantom molecules. The estimated total number of phantom molecules present in the dataset is given by n_pm=n_cugs x (u+g).
out$summary_stats$marginal
Call garbage collection first
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 6713953 358.6 11060670 590.8 11060670 590.8
Vcells 5146972186 39268.3 16100491482 122837.1 16100491482 122837.1
Set the trade-off ratio cost cutoff (torc). The parameter torc represents the number of real molecules one is willing to incorrectly discard in order to correctly purge one phantom molecule. Since discarding a large proportion of the data is undesirable, reasonable values of torc are expected to be within the range of 1-5.
out <- purge_phantoms(out,
torc=3,
return_readcounts=FALSE,
return_discarded=TRUE)
Purging phantom molecules. Step 1: reassigning hopped reads: 9.441 sec elapsed
Purging phantom molecules. Step 2: getting observed tor threshold below user-provided cutoff: 0.06 sec elapsed
Purging phantom molecules. Step 3: marking retained observations in read counts table: 52.064 sec elapsed
Purging phantom molecules. Step 4: creating sparse count matrices of cleaned and discarded data: 255.639 sec elapsed
out$summary_stats$cutoff_dt
The row discard_torc shows the outcome whose qr value is the maximum allowed. Reads corresponding to outcomes with greater qr values are discarded. no_discarding corresponds to retaining all reassigned reads and no_purging corresponds to keeping the data as it is.
out$summary_stats$purge_summary
str(out$umi_counts$retained$P7_0)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:7536936] 25223 529 835 1382 1529 1658 1747 2217 2394 2659 ...
..@ p : int [1:309166] 0 1 86 87 186 193 194 195 197 198 ...
..@ Dim : int [1:2] 33332 309165
..@ Dimnames:List of 2
.. ..$ : chr [1:33332(1d)] "ENSMUSG00000051951" "ENSMUSG00000089699" "ENSMUSG00000102343" "ENSMUSG00000025900" ...
.. ..$ : chr [1:309165] "AAACCTGAGAAACCTA" "AAACCTGAGAAACGAG" "AAACCTGAGAAACGCC" "AAACCTGAGAACAATC" ...
..@ x : num [1:7536936] 1 1 2 1 5 1 2 1 2 1 ...
..@ factors : list()
The datatable is ordered in descending order of qr, the posterior probability of incorrectly assigning s as the inferred sample of origin. n is the number of CUGs with the corresponding outcome and p_outcome is the observed marginal probability of that outcome.
out$outcome_counts
saveRDS(out, "./data/novaseq_l1/output/out_novaseq_l1.rds")
dataset_name <- "Novaseq_l1"
plots <- make_plots(out, dataset_name, x_lim = 120, legend_rel_width=0.2)
plots
$p_read
$p_fit
$p_tor
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/imkl/2018.3.222/compilers_and_libraries_2018.3.222/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] purrr_0.3.3 PhantomPurgeR_0.99.0
loaded via a namespace (and not attached):
[1] Biobase_2.45.1 edgeR_3.27.14 tidyr_1.0.0 jsonlite_1.6 viridisLite_0.3.0
[6] R.utils_2.9.2 assertthat_0.2.1 stats4_3.6.0 dqrng_0.2.1 GenomeInfoDbData_1.2.1
[11] globals_0.12.5 pillar_1.4.3 backports_1.1.5 lattice_0.20-38 glue_1.3.1
[16] limma_3.41.18 digest_0.6.23 GenomicRanges_1.37.17 XVector_0.25.0 colorspace_1.4-1
[21] cowplot_1.0.0 htmltools_0.4.0 Matrix_1.2-17 R.oo_1.23.0 pkgconfig_2.0.3
[26] broom_0.5.3 listenv_0.8.0 zlibbioc_1.31.0 scales_1.1.0 HDF5Array_1.13.11
[31] BiocParallel_1.19.5 tibble_2.1.3 farver_2.0.2 generics_0.0.2 IRanges_2.19.17
[36] tictoc_1.0 ggplot2_3.2.1 ellipsis_0.3.0 SummarizedExperiment_1.15.9 furrr_0.1.0
[41] BiocGenerics_0.31.6 lazyeval_0.2.2 magrittr_1.5 crayon_1.3.4 evaluate_0.14
[46] R.methodsS3_1.7.1 future_1.15.1 nlme_3.1-139 MASS_7.3-51.4 tools_3.6.0
[51] data.table_1.12.8 lifecycle_0.1.0 matrixStats_0.55.0 stringr_1.4.0 Rhdf5lib_1.7.6
[56] DropletUtils_1.6.1 S4Vectors_0.23.25 munsell_0.5.0 locfit_1.5-9.1 DelayedArray_0.11.8
[61] compiler_3.6.0 GenomeInfoDb_1.21.2 rlang_0.4.2 rhdf5_2.29.6 grid_3.6.0
[66] RCurl_1.95-4.12 rstudioapi_0.10 SingleCellExperiment_1.7.11 bitops_1.0-6 base64enc_0.1-3
[71] labeling_0.3 rmarkdown_2.0 gtable_0.3.0 codetools_0.2-16 R6_2.4.1
[76] knitr_1.25 dplyr_0.8.3 zeallot_0.1.0 stringi_1.4.4 parallel_3.6.0
[81] Rcpp_1.0.3 vctrs_0.2.1 tidyselect_0.2.5 xfun_0.10