第 4/8 章

五、monocle3

1、简介 视频教程:请点击monocle3拟时序分析视频链接。测试文件:请在单细胞测序数据进阶分析—《拟时序分析》4.初识monocle3推送中获取。2、monocle3与monocle2的主要区别 2.1 monocle3的主要功能 Clustering, classifying, and counting cells:主要任务是帮助大家找到一些具有特殊功能的亚群。

Constructing single-cell trajectories:这个功能大家需要monocle的最根本原因,通过拟时序分析帮助大家解析生物体发育、疾病等过程中细胞发生的变化。Differential expression analysis:差异计算monocle2也有,但我们实测的感觉是不如Seurat的,这里作者表明monocle3在这一块进行了优化升级,我们后面具体看看用起来效果如何。

2.2 先谈缺点 目前monocle3已经更新到β版本了,作者在官网也承认了缺点,monocle3 α已经是不推荐使用的,可能会存在bug,但是monocle3 β仍然处于搭建中的状态,也就是说monocle3仍然是可能存在bug的,并且我们之前讲绪论的时候说到monocle1、2都发表在了Nature系列之上,但是monocle3迟迟没有发表,并且目前发表的文章还是使用monocle2的比较多,monocle3的不稳定性可能是重要原因。

不过我相信新发表的论文很快就会由monocle2转向monocle3。官方表述如下 2.3 再看看优点 咱们抛开可能存在的bug不谈,monocle3相较于monocle2具有以下几点优势:(1)最大的优点就是计算量变大了,可以处理百万级别的单细胞数据集,也就是说整个器官、甚至整个胚胎的矩阵交给monocle3处理完全没压力。

(2)代码结构性优化:这点我要吐槽一下,monocle系列的语法我一直觉得很奇怪,默认参数也很不人性化 (3)支持UMAP算法的降维,这个也非常Nice,速度比tSNE快的不是一星半点。

(4)支持多谱系的拓扑结构:换句话说拟时序的轨迹可以做的很复杂 (5)相较于原来的RGE算法,新的approximate graph abstraction能够计算不连续的、平行的拓扑结构 (6)新的基因表达量计算及差异分析方法被引入,也就是说原来的differentialGeneTest和BEAM可以被替代。

(7)可以像Seurat的多样本整合一样对拟时序对象进行整合:这个功能可以说是刚需了,换句话说,如果你有合适的、已构建好拟时序的参考数据集,可以直接把自己的数据跟参考数据集进行投影、比对。(8)数据整合时也可同时加上注释:这有点类似于Seurat中的TransferData,可以利用已经注释好的参考数据给现行数据添加注释。

(9)对monocle对象的读取、加载、转换做了一定的优化,我们后面可以看看效果如何 (10)优化了负二项分布模型:也就是说对处理count的优化 (11)可视化提供了3D展示功能 3、安装Monocle3 保证自己的环境符合这些版本要求,不然报错很麻烦 R >= 4.1.0 , Bioconductor = 3.14,这样装上的monocle3 >= 1.2.7 3.1 R语言中的安装 # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # package.version('BiocManager') # BiocManager::install(version = "3.14") # if(!require(monocle3))BiocManager::install(c('BiocGenerics', 'DelayedArray', 'DelayedMatrixStats', # 'limma', 'lme4', 'S4Vectors', 'SingleCellExperiment', # 'SummarizedExperiment', 'batchelor', 'Matrix.utils', # 'HDF5Array', 'terra', 'ggrastr')) # if(!require(devtools))install.packages("devtools") # if(!require(monocle3))devtools::install_github('cole-trapnell-lab/monocle3') # if(!require(monocle3))devtools::install_github('cole-trapnell-lab/monocle3', ref="develop") 3.2 通过conda安装 即使在R4.1.0中还是很容易有很多包装不上 error: gdal-config not found or not executable. apt-get -y update && apt-get install -y libudunits2-dev libgdal-dev libgeos-dev libproj-dev activate monocle3 insyall r-base==4.1.0 install gcc install seurat install r-seuratobject install r-terra==1.5.12 install r-units install r-raster # conda install r-spdata # conda install r-sf # conda install r-spdep # conda install r-ragg # conda install r-ggrastr # conda install r-devtools # if(!require(monocle3))devtools::install_github('cole-trapnell-lab/monocle3') search monocle3 install r-monocle3 library (monocle3) package.version ( "monocle3" ) # [1] "1.3.7" library (Seurat) library (SeuratObject) library (tidyselect) library (dplyr) library (BiocGenerics) 4、初探monocle3 4.1 monocle3对象的构建 Sys.time # [1] "2025-04-28 13:09:36 CST" # 设置工作目录 setwd ( "/data/06_scRNA-seq_psedutime/" ) # data <- as(as.matrix(data4mono@assays$RNA@counts), 'sparseMatrix') # pd <- new('AnnotatedDataFrame', data = [email protected]) # fData <- data.frame(gene_short_name = row.names(data4mono), row.names = row.names(data4mono)) # fd <- new('AnnotatedDataFrame', data = fData) # mycds <- newCellDataSet(data, # phenoData = pd, # featureData = fd, # expressionFamily = negbinomial.size) # expression_matrix <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/cao_l2_expression.rds")) # cell_metadata <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/cao_l2_colData.rds")) # gene_annotation <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/cao_l2_rowData.rds")) expression_matrix <- readRDS ( 'data/expression_matrix.rds' ) cell_metadata <- readRDS ( 'data/cell_metadata.rds' ) gene_annotation <- readRDS ( 'data/gene_annotation.rds' ) class (cell_metadata) # [1] "data.frame" cds <- new_cell_data_set (expression_matrix, cell_metadata = cell_metadata, gene_metadata = gene_annotation) function not found 了还不知道哪里出错了 mysubset <- c for (runplate in unique (cds @ colData $ plate)){ mytemp <- cds @ colData %>% as.data.frame %>% filter (plate == runplate) %>% rownames %>% sample ( 50 ) mysubset <- cbind (mysubset,mytemp) } cds <- cds[,mysubset] table (cds @ colData $ plate) # # 001 002 003 004 005 006 007 008 009 010 # 50 saveRDS (cds, 'data/simple.rds' ) pbmc <- Read10X ( 'data/filtered_gene_bc_matrices/hg19/' ) pbmc <- CreateSeuratObject ( counts = pbmc) count <- GetAssayData (pbmc, slot= "counts" ) seurat2cds <- new_cell_data_set (count, cell_metadata = as.data.frame (pbmc @ meta.data), gene_metadata = data.frame ( gene_short_name = row.names (pbmc), row.names = row.names (pbmc)) ) class (seurat2cds) # [1] "cell_data_set" # attr(,"package") # [1] "monocle3" 4.2 预处理、质控 cds <- preprocess_cds (cds, num_dim = 100 ) cds <- align_cds (cds, alignment_group = "plate" ) names (cds @ colData) # 降维,默认是"Aligned"方式 cds <- reduce_dimension (cds, reduction_method= "UMAP" , umap.fast_sgd = FALSE , cores = 1 ) = FALSE' and 'cores = 1'可以保证结果相同 <- reduce_dimension(cds,reduction_method = 'tSNE',cores=5) # 聚类分群 cds <- cluster_cells (cds) # 拟时序 cds <- learn_graph (cds) graph embedding(RGE)算法 graph abstraction:https://pubmed.ncbi.nlm.nih.gov/30890159/ 这一步整活了,可以交互式地选择起点和终点,看了一下这个功能应该是依赖shiny写的,那就意味着在终端操作这一步就不可能了 cds <- order_cells (cds) 选择过程如下:plot_cells (cds, label_principal_points = T) cds <- order_cells (cds, root_pr_nodes= c ( "Y_5" )) plot_cells (cds, color_cells_by = 'pseudotime' ) names (cds @ colData) # [1] "plate" "cao_cluster" "cao_cell_type" "cao_tissue" # [5] "Size_Factor" gene_fits <- fit_models (cds, model_formula_str = "~plate" , cores= 1 ) 上面这函数看起来有点bug,如果不指定核心数默认会调用CPU所有的核心来计算,从而让简单的计算变得很慢,原理大家可以参考这个核心/线程数调用的越多就算得越快嘛? fit_coefs <- coefficient_table (gene_fits) emb_time_terms <- fit_coefs %>% filter (term == "plate" ) emb_time_terms <- emb_time_terms %>% mutate ( q_value = p.adjust (p_value)) sig_genes <- emb_time_terms %>% filter (q_value < 0.05 ) %>% pull (gene_short_name) suppressMessages (pr_test_res <- graph_test (cds, neighbor_graph= "principal_graph" , cores= 1 , verbose = F)) pr_deg_ids <- row.names ( subset (pr_test_res, q_value < 0.05 )) head (pr_deg_ids) # [1] "WBGene00000001" "WBGene00000004" "WBGene00000006" "WBGene00000010" # [5] "WBGene00000016" "WBGene00000018" 最后,取子集与合并 cds1 <- cds[ 1 : 100 ,] cds2 <- cds[ 101 : 200 ,] big_cds <- combine_cds ( list (cds1, cds2)) dim (cds) # [1] 20271 500 cds1 <- cds[, 1 : 300 ] class (cds1) # [1] "cell_data_set" # attr(,"package") # [1] "monocle3" dim (cds1) # [1] 20271 300 cds2 <- cds[ 1 : 300 ,] class (cds2) # [1] "cell_data_set" # attr(,"package") # [1] "monocle3" dim (cds2) # [1] 300 500 try ({big_cds <- combine_cds ( list (cds1, cds2), keep_all_genes = F, cell_names_unique = TRUE )}) # Error in combine_cds(list(cds1, cds2), keep_all_genes = F, cell_names_unique = TRUE) : # Cell names are not unique across CDSs - cell_names_unique must be FALSE. table ( colnames (cds1) %in% colnames (cds2)) # # TRUE # 300 4.3 细胞降维、分群、鉴定 视频教程:请点击monocle3中的聚类与分群视频链接。

测试文件:请在单细胞测序数据进阶分析—《拟时序分析》5.monocle3的降维、分群、聚类推送中获取。

4.3.1 降维 rm ( list = ls ); gc # used (Mb) gc trigger (Mb) max used (Mb) # Ncells 12126339 647.7 31347466 1674.2 39184332 2092.7 # Vcells 43548463 332.3 282321537 2154.0 427809837 3264.0 library (monocle3) library (dplyr) cds <- readRDS ( 'data/simple.rds' ) cds <- preprocess_cds (cds, num_dim = 100 ) plot_pc_variance_explained (cds) cds <- reduce_dimension (cds) # 这里我们没有指定reduction_method,但是默认是"UMAP",有"UMAP", "tSNE", "PCA", "LSI", "Aligned"可供选择 umap.fast_sgd=TRUE # umap.fast_sgd参数或者设置cores可以加快运算速度,但是可以能会让降维图最终显得有些不同,所以要慎重选择 red .1 <- plot_cells (cds, color_cells_by= "cao_cell_type" , cell_size = 2 , graph_label_size = 5 , group_label_size = 3 ) cds <- reduce_dimension (cds, umap.fast_sgd= T) red .2 <- plot_cells (cds, color_cells_by= "cao_cell_type" , cell_size = 2 , graph_label_size = 5 , group_label_size = 3 ) cds <- reduce_dimension (cds, cores = 5 ) # No preprocess_method specified, using preprocess_method = 'PCA' # Note: reduce_dimension will produce slightly different output each time you run it unless you set 'umap.fast_sgd = FALSE' and 'cores = 1' red .3 <- plot_cells (cds, color_cells_by= "cao_cell_type" , cell_size = 2 , graph_label_size = 5 , group_label_size = 3 ) suppressMessages ( library (patchwork)) red .1 | red .2 | red .3 # 试试tSNE cds <- reduce_dimension (cds, reduction_method= "tSNE" ) plot_cells (cds, color_cells_by= "cao_cell_type" , reduction_method= "tSNE" ) red .1 | plot_cells (cds, reduction_method= "tSNE" , color_cells_by= "cao_cell_type" , cell_size = 2 , group_label_size = 3 ) 4.3.2 分群 cds <- cluster_cells (cds) detection,类似于Seurat::FindClusters,但原理有所不同 cds <- cluster_cells (cds, resolution= 0.1 ) unique ( partitions (cds)) # [1] 1 2 # Levels: 1 2 plot_cells (cds) plot_cells (cds, genes= c ( "cpna-2" , "egl-21" , "ram-2" , "inos-1" )) 4.3.3 去批次 names (cds @ colData) # [1] "plate" "cao_cluster" "cao_cell_type" "cao_tissue" # [5] "Size_Factor" bfalig <- plot_cells (cds, color_cells_by= "plate" , label_cell_groups= FALSE ) cds <- align_cds (cds, alignment_group = "plate" ) aftalig <- plot_cells (cds, color_cells_by= "plate" , label_cell_groups= FALSE ) bfalig | aftalig 4.3.4 计算并展示marker基因 suppressMessages (marker_test_res <- top_markers (cds, group_cells_by= "cao_cell_type" , reference_cells= 1000 , cores= 2 )) marker_test_res[ 1 : 4 , 1 : 4 ] # gene_id gene_short_name cell_group marker_score # 1 WBGene00000031 abu-8 Pharyngeal epithelia 0.6092094 # 2 WBGene00000033 abu-10 Pharyngeal epithelia 0.4163906 # 3 WBGene00000069 acy-2 Canal associated neurons 0.8874499 # 4 WBGene00000100 ajm-1 Pharyngeal epithelia 0.4129481 top_specific_markers <- marker_test_res %>% filter (fraction_expressing >= 0.10 ) %>% group_by (cell_group) %>% top_n ( 1 , pseudo_R2) top_specific_markers[ 1 : 4 , 1 : 4 ] # # A tibble: 4 × 4 # # Groups: cell_group [3] # gene_id gene_short_name cell_group marker_score # <chr> <fct> <chr> <dbl> # 1 WBGene00000145 apc-11 Sex myoblasts 0.923 # 2 WBGene00000444 ceh-21 Sex myoblasts 0.946 # 3 WBGene00000501 cho-1 Cholinergic neurons 0.389 # 4 WBGene00000753 col-180 Non-seam hypodermis 0.566 top_specific_marker_ids <- unique (top_specific_markers %>% pull (gene_id)) plot_genes_by_group (cds, top_specific_marker_ids[ 1 : 10 ], group_cells_by= "cao_cell_type" , max.size= 3 ) plot_genes_by_group (cds, top_specific_marker_ids[ 1 : 10 ], max.size= 3 , group_cells_by= "partition" ) 4.3.5 注释细胞 colData (cds) $ assigned_cell_type <- as.character ( partitions (cds)) colData (cds) $ assigned_cell_type <- dplyr :: recode ( colData (cds) $ assigned_cell_type, "1" = "Bio_1" , "2" = "Bio_2" , "3" = "Bio_3" ) plot_cells (cds, group_cells_by= "partition" , color_cells_by= "assigned_cell_type" , cell_size = 2 , graph_label_size = 5 , group_label_size = 3 ) 4.4 取子集以及亚群分析 try (cds_subset <- choose_cells (cds)) # Error : choose_cells only works interactive mode. cds_subset <- cds class (cds_subset) # [1] "cell_data_set" # attr(,"package") # [1] "monocle3" suppressMessages (pr_graph_test_res <- graph_test (cds_subset, neighbor_graph= "knn" , cores= 2 )) 通过marker基因集可视化帮助判断 gene pr_deg_ids <- row.names ( subset (pr_graph_test_res, morans_I > 0.01 & q_value < 0.05 )) gene_module_df <- find_gene_modules (cds_subset[pr_deg_ids,], resolution= 1e-3 ) gene_module_df[ 1 : 5 , 1 : 5 ] # # A tibble: 5 × 5 # id module supermodule dim_1 dim_2 # <chr> <fct> <fct> <dbl> <dbl> # 1 WBGene00000004 6 2 3.02 -4.80 # 2 WBGene00000006 1 1 8.08 0.789 # 3 WBGene00000010 8 2 -5.91 1.45 # 4 WBGene00000016 21 9 -1.51 -4.65 # 5 WBGene00000018 9 2 1.20 -2.65 plot_cells (cds_subset, genes= gene_module_df, show_trajectory_graph= FALSE , label_cell_groups= FALSE , cell_size = 0.5 , graph_label_size = 5 , group_label_size = 3 ) 5、monocle3实战 视频教程:请点击monocle3拟时序分析视频链接。

测试文件:请在单细胞测序数据进阶分析—《拟时序分析》5.monocle3的降维、分群、聚类推送中获取。

monocle3与monocle2最大的特点有以下两点:1、可以交互式地选择拟时序的起点 2、可以采取3D的形式展示轨迹图 5.1 预处理数据 rm ( list = ls ); gc # used (Mb) gc trigger (Mb) max used (Mb) # Ncells 12027102 642.4 31347466 1674.2 39184332 2092.7 # Vcells 44101024 336.5 144548628 1102.9 427809837 3264.0 # 读入并创建cds对象 expression_matrix <- readRDS ( 'data/expression_matrix.rds' ) cell_metadata <- readRDS ( 'data/cell_metadata.rds' ) gene_annotation <- readRDS ( 'data/gene_annotation.rds' ) cds <- new_cell_data_set (expression_matrix, cell_metadata = cell_metadata, gene_metadata = gene_annotation) cds <- preprocess_cds (cds, num_dim = 50 ) cds <- align_cds (cds, alignment_group = "plate" , residual_model_formula_str = '~Size_Factor' ) residual_model_formula_str可以指定连续变量进行去批次,例如某一基因/基因集的表达值,否则这些变量以分类变量的形式参与去批次会将每一个值视为一个批次从而产生极大的冗余计算量 降维、聚类 cds <- reduce_dimension (cds) plot_cells (cds, label_groups_by_cluster= FALSE , color_cells_by = "cao_cell_type" ) ciliated_genes <- c ( "che-1" , "hlh-17" , "nhr-6" , "dmd-6" , "ceh-36" , "ham-1" ) plot_cells (cds, genes= ciliated_genes, label_cell_groups= FALSE , show_trajectory_graph= FALSE ) 我们monocle2那里谈过,虽然从计算学的角度来说细胞可以连续地从一种状态过渡到下一种状态,它们之间没有离散的边界,但从生物学上来说并不是所有细胞类型之间都能发生转化,所以Monocle3并不假设数据集中的所有细胞都来自一个共同的转录“祖先”,大家实际拿到的数据中可能有多个不同的轨迹。

例如,在应对感染的组织中,组织常驻免疫细胞和基质细胞会有非常不同的初始转录组,对感染的反应也会非常不同,所以它们应该分属于不用的轨迹。因此,monocle3在进行拟时序分析时并不采取单起点的方式。

cds <- cluster_cells (cds) plot_cells (cds, color_cells_by = "partition" ) # 这个图里黑色的线就是拟时序走向的背景,带有数字的灰色圈圈代表的是拟时序分支的leaf,他们被黑色的branch_points所隔开 5.2 拟时序分析 cds <- learn_graph (cds) plot_cells (cds, color_cells_by = "cao_cell_type" , label_groups_by_cluster= FALSE , label_leaves= FALSE , label_branch_points= FALSE ) plot_cells (cds, color_cells_by = "cao_cell_type" , label_cell_groups= FALSE , label_leaves= TRUE , label_branch_points= TRUE , graph_label_size= 1.5 ) cds <- order_cells (cds) plot_cells (cds, color_cells_by = "pseudotime" , label_cell_groups= FALSE , label_leaves= FALSE , label_branch_points= FALSE , graph_label_size= 1.5 ) 我们可以看到这个图中有许多灰色的点,这是因为这些单独的轨迹没有被选择root,所以产生了无效的拟时间值,也就是说,如果手动选择,那么每个拟时序的轨迹都需要选择一次root 说实话这个自动弹出交互式窗口的功能让我在写Rmarkdown的时候很抓狂,这个带有交互功能函数目前没法写在html文件中,所以,在我研读了一下源码之后,通过以下这种方式可以编程性选择拟时序的root,下面这个函数会选择最接近你选择细胞的节点作为拟时序的root names ( colData (cds)) # [1] "plate" "cao_cluster" "cao_cell_type" "cao_tissue" # [5] "Size_Factor" colData (cds)[, 'cao_cell_type' ] %>% unique # [1] "Unclassified neurons" "Germline" # [3] "Intestinal/rectal muscle" "Vulval precursors" # [5] "Coelomocytes" NA # [7] "Ciliated sensory neurons" "Failed QC" # [9] "Seam cells" "Non-seam hypodermis" # [11] "Pharyngeal epithelia" "Touch receptor neurons" # [13] "Body wall muscle" "Cholinergic neurons" # [15] "Distal tip cells" "Other interneurons" # [17] "GABAergic neurons" "Am/PH sheath cells" # [19] "Pharyngeal muscle" "Pharyngeal neurons" # [21] "Oxygen sensory neurons" "Somatic gonad precursors" # [23] "flp-1(+) interneurons" "Canal associated neurons" # [25] "Unclassified glia" "Pharyngeal gland" # [27] "Sex myoblasts" "Excretory cells" # [29] "Dopaminergic neurons" "Socket cells" # [31] "Rectum" colData (cds)[, 'cao_cell_type' ] %>% table # . # Am/PH sheath cells Body wall muscle Canal associated neurons # 421 10508 239 # Cholinergic neurons Ciliated sensory neurons Coelomocytes # 1015 842 1358 # Distal tip cells Dopaminergic neurons Excretory cells # 129 70 155 # Failed QC flp-1(+) interneurons GABAergic neurons # 3483 224 400 # Germline Intestinal/rectal muscle Non-seam hypodermis # 5144 338 1268 # Other interneurons Oxygen sensory neurons Pharyngeal epithelia # 443 305 747 # Pharyngeal gland Pharyngeal muscle Pharyngeal neurons # 271 332 314 # Rectum Seam cells Sex myoblasts # 121 3523 302 # Socket cells Somatic gonad precursors Touch receptor neurons # 358 355 334 # Unclassified glia Unclassified neurons Vulval precursors # 208 2639 488 get_earliest_principal_node <- function (cds, my_select= "Am/PH sheath cells" ){ cell_ids <- which ( colData (cds)[, "cao_cell_type" ] == my_select) closest_vertex <- cds @ principal_graph_aux[[ "UMAP" ]] $ pr_graph_cell_proj_closest_vertex closest_vertex <- as.matrix (closest_vertex[ colnames (cds), ]) root_pr_nodes <- igraph :: V ( principal_graph (cds)[[ "UMAP" ]]) $ name[ as.numeric (names ( which.max ( table (closest_vertex[cell_ids,]))))] root_pr_nodes } cds <- order_cells (cds, root_pr_nodes= get_earliest_principal_node (cds)) myselect <- function (cds,select.classify,my_select){ cell_ids <- which ( colData (cds)[,select.classify] == my_select) closest_vertex <- cds @ principal_graph_aux[[ "UMAP" ]] $ pr_graph_cell_proj_closest_vertex closest_vertex <- as.matrix (closest_vertex[ colnames (cds), ]) root_pr_nodes <- igraph :: V ( principal_graph (cds)[[ "UMAP" ]]) $ name[ as.numeric (names ( which.max ( table (closest_vertex[cell_ids,]))))] root_pr_nodes } cds <- order_cells (cds, root_pr_nodes= myselect (cds, select.classify = 'cao_cell_type' , my_select = "Body wall muscle" ) ) 下图可以看的出来,我们制定了“Body wall muscle”为起点后,这群细胞便是紫色的”零值” plot_cells (cds, color_cells_by = "pseudotime" , label_cell_groups= FALSE , label_leaves= FALSE , label_branch_points= FALSE , graph_label_size= 1.5 ) | plot_cells (cds, color_cells_by = "cao_cell_type" , label_cell_groups= FALSE , label_leaves= FALSE , label_branch_points= FALSE , graph_label_size= 1.5 ) 拟时序中的基因展示 sce.obj <- CreateSeuratObject ( counts = expression_matrix, meta.data = cell_metadata, assay = "RNA" ) sce.obj <- NormalizeData (sce.obj) Idents (sce.obj) <- "cao_cell_type" sce.obj.diff <- FindMarkers (sce.obj, ident.1= "Body wall muscle" , ident.2= "Seam cells" , min.pct = 0.1 , logfc.threshold = 0.25 ) cells <- rownames (sce.obj @ meta.data[ which (sce.obj @ meta.data $ cao_cell_type %in% c ( "Body wall muscle" , "Seam cells" )),]) plot_genes_in_pseudotime (cds[ rownames (sce.obj.diff)[ 1 : 5 ],cells], color_cells_by= "cao_cell_type" , min_expr= 0.5 ) 还有3D版的拟时序,来试试吧 cds_3d <- reduce_dimension (cds, max_components = 3 ) cds_3d <- cluster_cells (cds_3d) cds_3d <- learn_graph (cds_3d) cds <- order_cells (cds, root_pr_nodes= myselect (cds, select.classify = 'cao_cell_type' , my_select = "Body wall muscle" ) ) cds_3d_plot_obj <- plot_cells_3d (cds_3d, color_cells_by= "cao_cell_type" ) cds_3d_plot_obj 6、下游探索 6.1 DimPlot和FeaturePlot # 创建 Seurat 对象 sce.obj <- CreateSeuratObject ( counts = expression_matrix, meta.data = cell_metadata, assay = "RNA" ) # 添加降维坐标 # 创建一个 DimReduc 对象 umap_reduction <- CreateDimReducObject ( embeddings = reducedDims (cds) $ UMAP, key = "UMAP_" , assay = "RNA" ) sce.obj[[ "umap" ]] <- umap_reduction Idents (sce.obj) <- "cao_cell_type" sce.obj <- NormalizeData (sce.obj) sce.obj <- ScaleData (sce.obj, features= rownames (sce.obj)) # 提取monocle3拟时间 mono.info <- as.data.frame ( pseudotime (cds, reduction_method = "UMAP" )) head (mono.info) # pseudotime(cds, reduction_method = "UMAP") # cele-001-001.CATGACTCAA Inf # cele-001-001.AAGACGGCCA Inf # cele-001-001.GCCAACGCCA Inf # cele-001-001.ATAGGAGTAC 17.6783 # cele-001-001.CTCGTCTAGG Inf # cele-001-001.AAGTTGCCAT Inf # Inf替换0.01避免影响后续作图 mono.info[, 1 ][ is.infinite (mono.info[, 1 ])] <- 0.01 # 可以看出这是一个行名为细胞名,列名为各个变量信息的数据框,可以通过以下方式将这个数据框作为注释列表添加回Seurat colnames (mono.info) # [1] "pseudotime(cds, reduction_method = \"UMAP\")" # 为了避免与Seurat原有的注释重合,我们给这些变量名加一个后缀 colnames (mono.info) <- "monocle3_pseudotime" colnames (mono.info) # [1] "monocle3_pseudotime" sce.obj <- AddMetaData (sce.obj, metadata = mono.info) # 可以利用Seurat原有的降维图观察细胞类型、拟时序中state、拟时间分布:library (ggsci) DimPlot (sce.obj, group.by = 'cao_cell_type' ) | FeaturePlot (sce.obj, features = 'monocle3_pseudotime' , cols = c ( "lightgrey" , 'darkred' )) # 注意这里颜色不可以为白色 6.2 VlnPlot library (ggplot2) # VlnPlot可以用于绘制Pseudotime这样的数值型变量 color <- c ( pal_npg ( 10 ), pal_jco ( 10 ), pal_lancet ( 9 ), pal_aaas ( 10 )) VlnPlot (sce.obj, features = 'monocle3_pseudotime' , group.by = 'cao_cell_type' , pt.size = 0 , idents = c ( "Failed QC" , "Body wall muscle" , "Non-seam hypodermis" , "Somatic gonad precursors" , "Unclassified glia" ), cols= color) + theme ( axis.text.x = element_text ( angle = 60 , hjust = 1 )) | VlnPlot (sce.obj, features = 'monocle3_pseudotime' , group.by = 'cao_tissue' , pt.size = 0 , cols= color) + theme ( axis.text.x = element_text ( angle = 60 , hjust = 1 )) 6.3 ggplot2绘图 cellname <- rownames (mono.info) data4plot <- cbind (sce.obj @ meta.data, as.data.frame (sce.obj @ reductions $ umap @ cell.embeddings)[cellname,]) head (data4plot) # orig.ident nCount_RNA nFeature_RNA plate cao_cluster # cele-001-001.CATGACTCAA SeuratProject 144 89 001 20 # cele-001-001.AAGACGGCCA SeuratProject 790 419 001 6 # cele-001-001.GCCAACGCCA SeuratProject 832 338 001 13 # cele-001-001.ATAGGAGTAC SeuratProject 197 138 001 27 # cele-001-001.CTCGTCTAGG SeuratProject 475 208 001 2 # cele-001-001.AAGTTGCCAT SeuratProject 1399 679 001 6 # cao_cell_type cao_tissue # cele-001-001.CATGACTCAA Unclassified neurons Neurons # cele-001-001.AAGACGGCCA Germline Gonad # cele-001-001.GCCAACGCCA Intestinal/rectal muscle Intestinal/rectal muscle # cele-001-001.ATAGGAGTAC Vulval precursors <NA> # cele-001-001.CTCGTCTAGG Coelomocytes # cele-001-001.AAGTTGCCAT Germline Gonad # monocle3_pseudotime UMAP_1 UMAP_2 # cele-001-001.CATGACTCAA 0.0100 8.314502 -8.9886265 # cele-001-001.AAGACGGCCA 0.0100 -9.276903 3.2137904 # cele-001-001.GCCAACGCCA 0.0100 8.918166 5.5197773 # cele-001-001.ATAGGAGTAC 17.6783 -2.600855 -0.8929377 # cele-001-001.CTCGTCTAGG 0.0100 1.998563 -14.4926939 # cele-001-001.AAGTTGCCAT 0.0100 -10.160144 2.9547246 library ( "ggplot2" ) ggplot (data4plot, aes ( x= UMAP_1, y= UMAP_2)) + geom_point ( aes ( x= UMAP_1, y= UMAP_2, size= monocle3_pseudotime, colour = cao_cell_type), data = data4plot) + scale_color_manual ( values = color) + scale_size_continuous ( range = c ( 0.3 , 1.5 )) + theme_classic + theme ( axis.line = element_line ( size= 0.6 , colour = "black" , arrow = ggplot2 :: arrow ( length = ggplot2 :: unit ( 0.1 , "inches" ), ends = "last" , type = 'open' ))) 6.4 细胞比例探索 这部分内容其实我们以前出过专门的教程,详情见沉浸式统计细胞比例。

在这里也一样,我们可以去统计拟时序的state和celltype之间的关系。

6.4.1 饼图 # 统计细胞类型 mono.info.sorted <- mono.info[ order (mono.info $ monocle3_pseudotime), , drop = FALSE ] # 获取行名(细胞名) cell_names <- rownames (mono.info.sorted) # 按长度划分成三等份 n <- length (cell_names) group1 <- cell_names[ 1 : floor (n / 3 )] group2 <- cell_names[( floor (n / 3 ) + 1 ) : ( floor ( 2 * n / 3 ))] group3 <- cell_names[( floor ( 2 * n / 3 ) + 1 ) : n] mono.info.sorted[group1, "new" ] <- "group1" mono.info.sorted[group2, "new" ] <- "group2" mono.info.sorted[group3, "new" ] <- "group3" mono.info.sorted[, c ( "cell_type" , "group" )] <- sce.obj @ meta.data[ rownames (mono.info.sorted), c ( "cao_cell_type" , "cao_tissue" )] # 统计细胞类型 # 选取排名前五的细胞类型 mynames <- sort ( table (mono.info.sorted $ cell_type), decreasing = T)[ 1 : 5 ] %>% names mynames # [1] "Body wall muscle" "Germline" "Seam cells" # [4] "Failed QC" "Unclassified neurons" # 统计细胞数目 myratio <- sort ( table (mono.info.sorted $ cell_type), decreasing = T)[ 1 : 5 ] %>% as.numeric myratio # [1] 10508 5144 3523 3483 2639 # 整理细胞标签 pielabel <- paste0 (mynames, " (" , round (myratio / sum (myratio) * 100 , 2 ), "%)" ) pielabel # [1] "Body wall muscle (41.54%)" "Germline (20.33%)" # [3] "Seam cells (13.93%)" "Failed QC (13.77%)" # [5] "Unclassified neurons (10.43%)" cols <- pal_npg ( "nrc" )( 10 ) cols # [1] " " " " " " # [7] " " " " pie (myratio, labels= pielabel, radius = 1.0 , clockwise= F, main = "Celltype" , col = cols) mynames <- table (mono.info.sorted $ new) %>% names myratio <- table (mono.info.sorted $ new) %>% as.numeric pielabel <- paste0 (mynames, " (" , round (myratio / sum (myratio) * 100 , 2 ), "%)" ) pielabel # [1] "group1 (33.33%)" "group2 (33.33%)" "group3 (33.33%)" pie (myratio, labels= pielabel, radius = 1.0 , clockwise= F, main = "State" , col = cols) 6.4.2 堆积柱形图 cellnum <- table (mono.info.sorted $ cell_type,mono.info.sorted $ new)[ 1 : 10 ,] cellnum <- cellnum[ ! apply (cellnum, 1 , sum) == 0 ,] # 按列统计细胞比例 cell.prop <- as.data.frame ( prop.table (cellnum, margin= 2 )) colnames (cell.prop) <- c ( "Celltype" , "group" , "Proportion" ) head (cell.prop) # Celltype group Proportion # 1 Am/PH sheath cells group1 0.08415716 # 2 Body wall muscle group1 0.10190114 # 3 Canal associated neurons group1 0.04816223 # 4 Cholinergic neurons group1 0.18073511 # 5 Ciliated sensory neurons group1 0.16780735 # 6 Coelomocytes group1 0.26539924 p.bar <- ggplot (cell.prop, aes (group,Proportion, fill= Celltype)) + geom_bar ( stat= "identity" , position= "fill" ) + scale_fill_manual ( values= cols) + ggtitle ( "cell proportion" ) + theme_bw + theme ( axis.ticks.length= unit ( 0.5 , 'cm' )) + guides ( fill= guide_legend ( title= NULL )) p.bar 6.5 基因表达量的探索 6.5.1 在轨迹图上绘制表达量 plot_cells (cds, genes= "dmd-6" , label_cell_groups= FALSE , show_trajectory_graph= FALSE ) library (circlize) mycolor <- colorRampPalette ( colors = c ( ' , ' ))( 100 ) plot_cells (cds, genes= "dmd-6" , label_cell_groups= FALSE , show_trajectory_graph= FALSE ) + scale_colour_gradientn ( colours = mycolor, guide = "colorbar" ) 6.5.2 热图 其实热图的教程我们做过很多:热图绘制天花板:ComplexHeatmap|

一、初探 热图绘制天花板| 二、两万字长文探索注释功能 ComplexHeatmap|

三、用富集分析结果做成词云注释热图 在理解了monocle对象的数据结构和绘制热图的语法后,大家完全可以自行拼凑出下面的图,流程有些长,有点耐心哦:首先整理一下注释信息和表达矩阵 library (ComplexHeatmap) # 注意这里绘制热图如果出现报错 "Image must have at least 1 frame to write a bitmap",因为完整的矩阵是有42035个细胞,当绘制矩阵过大时候就会出现这样的情况 sce.obj_sub <- subset (sce.obj, idents= c ( "Unclassified glia" , "Somatic gonad precursors" , "Non-seam hypodermis" , "Vulval precursors" )) table ( Idents (sce.obj_sub)) # # Vulval precursors Non-seam hypodermis Somatic gonad precursors # 488 1268 355 # Unclassified glia # 208 expr4plot <- GetAssayData (sce.obj_sub, slot = "scale.data" ) mono.info <- mono.info[ rownames (mono.info) %in% colnames (expr4plot),,drop = F] mono.info <- mono.info[ order (mono.info $ monocle3_pseudotime),,drop = F] expr4plot <- expr4plot[, rownames (mono.info)] dim (mono.info) # [1] 2319 1 dim (expr4plot) # [1] 20271 2319 # 设置顶部的psedotime、celltype的注释 celltype_colors <- c ( "Non-seam hypodermis" = " , "Somatic gonad precursors" = " , "Unclassified glia" = " , "Vulval precursors" = " ) pseudotime_colors <- colorRamp2 ( range (mono.info $ monocle3_pseudotime, na.rm = TRUE ), c ( " , "green" ) # 从浅黄到红 ) top_anno <- HeatmapAnnotation ( Pseudotime= mono.info $ monocle3_pseudotime, Celltype= sce.obj_sub @ meta.data[ rownames (mono.info), "cao_cell_type" ], col = list ( Pseudotime = pseudotime_colors, Celltype = celltype_colors)) # 可以先看一下注释的效果 draw (top_anno) # 找一些感兴趣的基因集,比如各种细胞的marker,这里以每种细胞的Top20为例,实际过程中大家可以选择更多的基因,便于富集到有意义的通路:# 先展示基因来源 # sub.marker <- FindMarkers(sce.obj_sub, ident.1 = "Body wall muscle", ident.2 = "Seam cells", only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) sub.marker <- FindAllMarkers (sce.obj_sub, only.pos = TRUE , min.pct = 0.25 , logfc.threshold = 0.25 ) sub.marker.new <- sub.marker %>% group_by (cluster) %>% top_n ( n = 20 , wt = avg_log2FC) top20gene <- sub.marker.new[ which (sub.marker.new $ cluster %in% c ( "Unclassified glia" , "Somatic gonad precursors" , "Non-seam hypodermis" , "Vulval precursors" )),] left_ann <- rowAnnotation ( cellmarker= top20gene $ cluster) draw (left_ann) 来做个富集分析用于词云注释,富集分析的教程可以看这里:手把手教你做单细胞测序数据分析

(七)——基因集富集分析 library (clusterProfiler) library (org.Ce.eg.db) pathway4wc <- data.frame for (runcell in unique (top20gene $ cluster)) { try ({pathway4wc}) gene4er <- filter (top20gene,cluster == runcell) %>% pull (gene) # 转换基因格式 gene.df <- gene_annotation[gene4er, "gene_short_name" ] gores <- enrichGO (gene.df, org.Ce.eg.db, keyType = "SYMBOL" , ont = "BP" , pvalueCutoff = 0.05 , pAdjustMethod = "BH" , qvalueCutoff = 0.2 , minGSSize = 10 , maxGSSize = 500 , readable = FALSE , pool = FALSE ) go.all.res <- as.data.frame (gores @ result) pathway.dt <- top_n (go.all.res, - 10 , p.adjust) pathway.dt <- cbind (pathway.dt, data.frame ( cluster= rep (runcell, nrow (pathway.dt)))) pathway4wc <- rbind (pathway4wc,pathway.dt) } head (pathway4wc) # ID Description GeneRatio BgRatio # GO:0048513 GO:0048513 animal organ development 8/13 414/9802 # GO:0040025 GO:0040025 vulval development 6/13 175/9802 # GO:0040028 GO:0040028 regulation of vulval development 4/13 110/9802 # GO:0007399 GO:0007399 nervous system development 6/13 457/9802 # GO:0002119 GO:0002119 nematode larval development 6/13 466/9802 # GO:0002164 GO:0002164 larval development 6/13 475/9802 # pvalue p.adjust qvalue # GO:0048513 1.012260e-08 2.348443e-06 1.267989e-06 # GO:0040025 4.598766e-08 5.334569e-06 2.880280e-06 # GO:0040028 9.931212e-06 6.222359e-04 3.359622e-04 # GO:0007399 1.289353e-05 6.222359e-04 3.359622e-04 # GO:0002119 1.442073e-05 6.222359e-04 3.359622e-04 # GO:0002164 1.609231e-05 6.222359e-04 3.359622e-04 # geneID Count # GO:0048513 osm-11/cab-1/let-23/lin-12/lin-31/dex-1/sel-2/lin-39 8 # GO:0040025 osm-11/let-23/lin-12/lin-31/sel-2/lin-39 6 # GO:0040028 let-23/lin-12/sel-2/lin-39 4 # GO:0007399 cab-1/let-23/zig-4/dex-1/dlg-1/sax-3 6 # GO:0002119 osm-11/let-23/lin-12/lin-31/sel-2/lin-39 6 # GO:0002164 osm-11/let-23/lin-12/lin-31/sel-2/lin-39 6 # cluster # GO:0048513 Vulval precursors # GO:0040025 Vulval precursors # GO:0040028 Vulval precursors # GO:0007399 Vulval precursors # GO:0002119 Vulval precursors # GO:0002164 Vulval precursors # 构建词云的text # 遍历每个唯一的聚类 mytext <- list for (i in 1 : length ( unique (pathway4wc $ cluster))) { # 选择p.adjust最小的前5个通路 temp.text <- filter (pathway4wc,cluster == unique (pathway4wc $ cluster)[i]) %>% top_n ( - 1 ,p.adjust) %>% pull (Description) # 将当前聚类的通路描述信息添加到列表中 mytext[[i]] <- temp.text } # 为列表元素命名,使用聚类的唯一值 names (mytext) <- unique (pathway4wc $ cluster) # 减少部分通路 mytext[[ 4 ]] <- mytext[[ 4 ]][ 1 : 2 ] # 依据显著性设置词云大小,越显著的通路字体越大 # 创建一个空向量用于存储每个聚类的p.adjust值 myfontsize <- c # 遍历每个唯一的聚类 for (i in 1 : length ( unique (pathway4wc $ cluster))) { # 筛选出当前聚类且p.adjust小于0.05的数据 temfs <- filter (pathway4wc,cluster == unique (pathway4wc $ cluster)[i]) %>% filter (p.adjust < 0.05 ) %>% # 选择p.adjust最小的前5个通路 top_n ( - 1 ,p.adjust) %>% pull (p.adjust) # 将当前聚类的p.adjust值添加到向量中 myfontsize <- c (myfontsize,temfs) } # 加载scales包,用于数据缩放 library (scales) # 对p.adjust取负对数,然后进行缩放,使字体大小范围在0到10之间 myfontsize <- - log2 (myfontsize) %>% scales :: rescale (., to= c ( 0 : 1 )) * 10 # 创建行注释对象 # 使用anno_textbox函数绘制文本框注释 max_line_length <- 30 # 定义换行函数,保留 list 结构 wrap_list_text <- function (lst, width = max_line_length) { lapply (lst, function (vec) { sapply (vec, function (txt) { paste ( strwrap (txt, width = width), collapse = " \n " ) }) }) } # 应用函数 mytext_wrapped <- wrap_list_text (mytext) right_ann <- rowAnnotation ( textbox = anno_textbox (top20gene $ cluster, mytext_wrapped, # 允许文本换行 add_new_line = T, by = "anno_block" , # 设置文本框的样式 gp = gpar ( col = c ( pal_npg ( 10 ), pal_nejm ( 8 ), pal_aaas ( 4 )) , fontsize= myfontsize), # 设置文本框的圆角 round_corners = TRUE , # 设置文本框的最大宽度 max_width= unit ( 3 , 'mm' ))) col_fun <- colorRamp2 ( quantile (expr4plot) %>% as.numeric , c ( ' , ' , 'white' , ' , ' )) finalhmp <- Heatmap (expr4plot[top20gene $ gene,], show_column_names = F, show_row_names = F, col = col_fun, top_annotation = top_anno, cluster_columns = F, row_split = top20gene $ cluster, row_title = NULL , right_annotation = right_ann, border = T, left_annotation = left_ann ) draw (finalhmp) save.image ( "result_m3/m3.Rdata" ) Sys.time # [1] "2025-04-28 14:12:56 CST"

← 上一章 下一章 →