本教程基于Linux中的Rstudio环境演示,计算资源不足的同学可参考:生信分析为什么要使用服务器?足够支持你完成硕博生涯的生信环境 配置一个心仪的工作站(硬件+环境配置) 独享服务器,生信分析不求人 为实验室准备一份生物信息学不动产
二、可视化集锦 三、绪论 1、定义 拟时序分析(Pseudotime Analysis) 是一种在单细胞转录组学中常用的方法,它通过对细胞在特定生物过程中的转录组数据进行排序,推断其发展或变化的时间顺序。尽管单细胞数据通常是静态的(在一个时间点采集),拟时序分析通过基因表达的变化模式来重建细胞的发育轨迹或动态过程,从而模拟细胞随时间推移的变化,揭示细胞在不同状态下的演变路径和关键转折点。
2、拟时序分析作用 推断细胞的发育阶段:通过分析基因表达变化,将处于不同分化阶段的细胞排序,模拟出一个拟时间轴。识别关键的基因调控事件:通过比较不同拟时间阶段的细胞,识别在特定转折点上调或下调的基因,找出与细胞分化或功能转换相关的关键分子。推断细胞状态转换:分析细胞从一种状态向另一种状态过渡的动态变化,尤其在干细胞分化或疾病模型中应用广泛。
3、不同拟时序软件的比较 轨迹分析的流程如下图所示,一般输入文件为原始的基因表达矩阵或者是Normalized之后的矩阵。进行一系列的分析之后得到的分析结果将会用于下游的图形可视化。输出的轨迹类型可能会包括了环形轨迹、线形轨迹、分支型轨迹和多分支型轨迹等多种类型。其中较为常见的为分支型的轨迹分布。主要研究成果如下:3.1 总体评价结果 下图展示了每一种拟时序方法可以实现的拓扑结构。
图中彩色为对应方法可以实现的拓扑结构,灰色的为不能实现的拓扑结构。从中可以发现测试的大部分软件都可以实现线性拟时序的拓扑结构的预测和分析。而分支型、多分支型和树状的拓扑结构从上往下对应的拟时序软件可以实现的拓扑结构依次减少。另外对于部分相连的拓扑结构和不相连的拓扑只有以PAGA为代表的三种软件可以实现。整体来看得分最高的为slingshot、PAGA。
另外我们今天要给大家介绍的以monocle为代表的拟时序分析软件也可以较好地满足大家绝大多数场景下拟时序分析的需求。热图展示了各个拟时序软件应用于不同的数据集、不同类型的拓扑轨迹时的准确性、多次运行时结果的稳定性、数据量增加时所需计算时间的变化的综合评分。不同的指标可能会彼此不一致,Monocle和PAGA在拓扑分数上得分更高,检测到的拓扑倾向于更复杂的拓扑,在具有树状或更复杂轨迹的数据集上得分更高。
而其他方法例如Slingshot,通常在包含更简单拓扑的数据集上表现更好。3.2 准确性 测试数据由110个真实数据集和229个合成数据集两部分构成。110个真实的数据集来自各种单细胞技术,各种生物体和动态过程,并包含几种类型的拓扑轨迹。
作者把做测试用的真实数据集做了两个分类:Gold standard(参考轨迹是通过细胞分选或细胞混合而来,不是从表达数据本身中提取)和Silver standard(gold standard之外的数据集)。通过评估已知轨迹和预测轨迹之间的相似性得到软件的总体评分,准确性排在前三的为slingshot、PAGA和SCORPIUS。另外作者也发现不同方法性能在各个数据集之间的表现变化很大,这表明没有一种万金油的方法适用于每个数据集。
3.3 可扩展性 随着高通量单细胞技术的普及,我们在进行拟时序分析时经常需要处理几万,也许在未来有处理几十万细胞的需求,所以作者评估了目前的拟时序方法在处理细胞数、特征数(gene)性能的扩展。随着细胞数目的增加,大多数方法无法在一小时内在具有10k个细胞和几千个特征(gene)的数据集上完成。运行时间进一步增加,只有少数几个方法(如PAGA、Monocle)可以在1天内处理完数万细胞的分析。
3.4 稳定性 拟时序方法不仅要能够在合理的时间范围内推断出准确的模型,而且要在给定非常相似的输入数据时生成相似的模型。为了测试每种方法的稳定性,作者对10个数据集的子集(95%细胞,95%特征)测试了每种方法,并评估每对模型之间的平均相似性和轨迹的准确性。Slingshot产生的模型稳定性最高,Monocle等方法的稳定性次之,排名最后的四个方法稳定性较差。
3.5 软件、文档和手稿方面的可用性 最后,作者对每种方法的软件包装、文档、自动代码测试以及发布的期刊做了评估。作者发现大多数方法都满足基本标准,例如教程的可用性和基本代码质量标准。同时,新方法的质量得分比旧方法也会更好一些,更加推荐大家使用。
四、monocle2 1、简介 视频教程:请点击monocle2轨迹分析视频链接。测试文件:请在《拟时序分析》3.monocle2实操:完整版推送中获取。
2、安装monocle2 # 安装指定版本的R和monocle2 # conda search monocle # conda create -n monocle.2.18.0 # conda activate monocle.2.18.0 # conda install r-base==4.0.5 # conda install # conda install bioconductor-monocle==2.18.0 library (monocle) library (Seurat) if ( ! require (dplyr)) install.packages ( 'dplyr' ) 3、monocle2对象的格式 celldataset的创建、转换 Sys.time # [1] "2025-04-28 12:25:56 CST" # 设置工作目录 setwd ( "/data/06_scRNA-seq_psedutime/" ) # 官方给出的cds对象构建过程 # 根据自己数据实际情况修改表达矩阵、细胞文件、基因名称文件 # HSMM_expr_matrix <- read.table("fpkm_matrix.txt") # HSMM_sample_sheet <- read.delim("cell_sample_sheet.txt") # HSMM_gene_annotation <- read.delim("gene_annotations.txt") # pd <- new("AnnotatedDataFrame", data = HSMM_sample_sheet) # fd <- new("AnnotatedDataFrame", data = HSMM_gene_annotation) # HSMM <- newCellDataSet(as.matrix(HSMM_expr_matrix), # phenoData = pd, featureData = fd) # seurat对象转换为cds对象方法 sce.obj <- readRDS ( 'data/sce.obj.rds' ) class (sce.obj) # [1] "Seurat" # attr(,"package") # [1] "SeuratObject" # 准备细胞*基因表达矩阵 data <- GetAssayData (sce.obj, slot = "counts" ) data[ 1 : 4 , 1 : 4 ] # 4 x 4 sparse Matrix of class "dgCMatrix" # AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 # AL627309.1 . . . # AP006222.2 . . . # RP11-206L10.2 . . . # RP11-206L10.9 . . . # AAACCGTGCTTCCG-1 # AL627309.1 . # AP006222.2 . # RP11-206L10.2 . # RP11-206L10.9 . # 准备phenoData,描述细胞的各种元属性,相当于[email protected],行-细胞,列-属性 pd <- new ( 'AnnotatedDataFrame' , data = sce.obj @ meta.data) # 准备featureData,基因名,行-基因名 fData <- data.frame ( gene_short_name = row.names (data), row.names = row.names (data)) fd <- new ( 'AnnotatedDataFrame' , data = fData) # 创建monocle2的celldataset(缩写为cds)对象 mycds <- newCellDataSet (data, phenoData = pd, featureData = fd, expressionFamily = negbinomial.size ) 需要根据输入数据做出相应的选择 count的输入,而前者计算速度更快,后者运行的更加准确,适用于截断正态分布(truncated normal distributions)的数据,如FPKM、TPM,monocle处理这类数据会先取log class (mycds) # [1] "CellDataSet" # attr(,"package") # [1] "monocle" cds => seurat.obj as.data.frame (mycds @ phenoData @ data)[ 1 : 4 , 1 : 4 ] # orig.ident nCount_RNA nFeature_RNA percent.mt # AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759 # AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958 # AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363 # AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845 names (mycds @ phenoData @ data) # [1] "orig.ident" "nCount_RNA" "nFeature_RNA" "percent.mt" # [5] "RNA_snn_res.0.5" "seurat_clusters" "celltype" "Size_Factor" class (mycds @ assayData $ exprs) # [1] "dgCMatrix" # attr(,"package") # [1] "Matrix" m2seurat <- CreateSeuratObject ( counts = mycds @ assayData $ exprs, meta.data = mycds @ phenoData @ data) class (m2seurat) # [1] "Seurat" # attr(,"package") # [1] "SeuratObject" 4、初探monocle2 4.1 预处理、质控 mycds <- estimateSizeFactors (mycds) mycds <- estimateDispersions (mycds) # 质控(可选) 去除死细胞、空液滴、碎细胞液滴、双液滴,提高计算准确度、降低运算时间 mycds <- detectGenes (mycds, min_expr = 0.1 ) head (Biobase :: fData (mycds)) # gene_short_name num_cells_expressed # AL627309.1 AL627309.1 9 # AP006222.2 AP006222.2 3 # RP11-206L10.2 RP11-206L10.2 5 # RP11-206L10.9 RP11-206L10.9 3 # LINC00115 18 # NOC2L 254 # head(fData(mycds))如果提示Error: unable to find an inherited method for function ‘fData’ for signature ‘x = "CellDataSet"’,可能是monocle2和monocle3的冲突,请调整命令为Biobase::fData(mycds),参考https://github.com/cole-trapnell-lab/garnett/issues/20 mycds_expressed_genes <- row.names ( subset (Biobase :: fData (mycds), num_cells_expressed >= 10 )) mycds_expressed_genes[ 1 : 5 ] # [1] "LINC00115" "NOC2L" "HES4" "ISG15" "C1orf159" mycds <- mycds[mycds_expressed_genes,] length ( row.names (sce.obj)) # [1] 13714 length (mycds_expressed_genes) # [1] 11095 4.2 细胞降维、分群、鉴定 # 我们说过monocle是可以用来鉴定细胞类型的 # 如果你拥有一些先验知识,你可以通过设定一定的阈值来鉴定细胞 # 比如我们之前命名好的Seurat.obj # 计算差异基因 seu.marker <- FindAllMarkers (sce.obj) # 提取每个亚群top10差异基因 top10gene <- seu.marker %>% group_by (cluster) %>% top_n ( n = 10 , wt = avg_log2FC) # 提取指定亚群差异基因 top10gene %>% filter (cluster == 'B' ) # # A tibble: 10 × 7 # # Groups: cluster [1] # p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene # <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr> # 1 8.79e-62 7.77 0.125 0.001 1.21e-57 B AC079767.4 # 2 2.03e-42 7.75 0.087 0.001 2.78e-38 B TNFRSF13C # 3 4.03e-29 7.93 0.061 0.001 5.52e-25 B KCNG1 # 4 8.47e-28 8.11 0.055 0 1.16e-23 B MACROD2 # 5 3.82e-25 9.12 0.047 0 5.25e-21 B FAM177B # 6 7.88e-25 7.75 0.049 0 1.08e-20 B AL928768.3 # 7 1.15e-23 8.70 0.044 0 1.58e-19 B RP11-16E12.2 # 8 7.94e-12 8.15 0.02 0 1.09e- 7 B MARC2 # 9 7.94e-12 7.97 0.02 0 1.09e- 7 B PKHD1L1 # 10 2.44e-10 7.83 0.017 0 3.34e- 6 B HRK top10gene %>% filter (cluster == 'NK' ) # # A tibble: 10 × 7 # # Groups: cluster [1] # p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene # <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr> # 1 2.63e-33 9.47 0.058 0 3.60e-29 NK LIM2 # 2 1.69e-24 7.01 0.058 0.001 2.32e-20 NK RHOBTB3 # 3 2.46e-19 7.76 0.039 0 3.37e-15 NK YES1 # 4 6.57e-16 7.41 0.032 0 9.01e-12 NK NME8 # 5 1.16e-15 8.48 0.026 0 1.59e-11 NK MIR181A2HG # 6 1.16e-15 8.05 0.026 0 1.59e-11 NK GRIK4 # 7 1.72e-12 7.20 0.026 0 2.36e- 8 NK CCNJL # 8 4.09e-12 8.03 0.019 0 5.61e- 8 NK BOK # 9 4.09e-12 7.97 0.019 0 5.61e- 8 NK BNC2 # 10 4.09e-12 8.12 0.019 0 5.61e- 8 NK KRT86 # 从 mycds 对象里找出基因名称为 "CD79A" 和 "GZMB" 的基因在 fData 数据框中的行名 CD79A_id <- row.names ( subset (Biobase :: fData (mycds), gene_short_name == "CD79A" )) GZMB_id <- row.names ( subset (Biobase :: fData (mycds), gene_short_name == "GZMB" )) # newCellTypeHierarchy 是一个函数,用于创建一个新的细胞类型层次结构对象。
cth 就是这个新创建的对象,后续会在这个基础上添加不同的细胞类型。# classify_func 是一个自定义的分类函数,它接收一个矩阵 x 作为输入。这里的矩阵 x 通常代表细胞的基因表达矩阵,x[CD79A_id,] 表示提取基因 CD79A 在所有细胞中的表达值。
该函数的判断条件是当基因 CD79A 的表达值大于等于 1 时,将细胞分类为 B 细胞 # 当基因 CD79A 的表达值小于 1 且基因 GZMB 的表达值大于 1 时,将细胞分类为 NK 细胞 cth <- newCellTypeHierarchy cth <- addCellType (cth, "B" , classify_func = function (x) { x[CD79A_id,] >= 1 }) cth <- addCellType (cth, "NK" , classify_func = function (x) { x[CD79A_id,] < 1 & x[GZMB_id,] > 1 }) class (cth) # [1] "CellTypeHierarchy" # attr(,"package") # [1] "monocle" mycds <- classifyCells (mycds, cth, 0.1 ) table (Biobase :: pData (mycds) $ CellType) # # B NK Unknown # 417 198 2023 # 如果出现以下报错提示 # `nei` was deprecated in igraph 2.1.0 and is now defunct;
# 定位错误发生位置:rlang::last_trace # 解决方法:修改源码 # trace("cth_classifier_cell",edit = T, where = asNamespace("monocle")) 后 修改nei为.nei # pData(mycds)报错提示Error: unable to find an inherited method for function ‘pData’ for signature ‘x = "CellDataSet"’ # 将命令修改为 Biobase::pData(mycds) pm <- ggplot (Biobase :: pData (mycds), aes ( x = factor ( 1 ), fill = factor (CellType))) + geom_bar ( width = 1 ) + coord_polar ( theta = "y" ) + theme ( axis.title.x = element_blank , axis.title.y = element_blank ) ps <- ggplot (Biobase :: pData (mycds), aes ( x = factor ( 1 ), fill = factor (celltype))) + geom_bar ( width = 1 ) + coord_polar ( theta = "y" ) + theme ( axis.title.x = element_blank , axis.title.y = element_blank ) pm | ps # 当然,monocle也可以用非监督式的方法进行分群:# 找出一些表达量高得高变基因来,这些基因可能是会参与细胞命运得决定得 # 这步其实有点像 Seurat::FindVariableFeatures disp_table <- dispersionTable (mycds) unsup_clustering_genes <- subset (disp_table, mean_expression >= 0.1 ) # 大概两千多个“离散基因” length (unsup_clustering_genes $ gene_id) # [1] 2252 # setOrderingFilter该函数的作用是为 Monocle 的 cell ordering 设置一个基因的子集,作为排序的依据 mycds <- setOrderingFilter (mycds, unsup_clustering_genes $ gene_id) plot_ordering_genes (mycds) # 平均表达值与离散度(可变性较高的基因被标为黑点,而灰色的点会被下游分析弃用) # 这其实就是Seurat里的肘图 monocle :: plot_pc_variance_explained (mycds, return_all = F) # 如果提示plot_pc_variance_explained(mycds, return_all = F):Error in plot_pc_variance_explained(mycds, return_all = F),将命令调整为monocle::plot_pc_variance_explained(mycds, return_all = F) # 降维 mycds <- reduceDimension (mycds, max_components = 2 , num_dim = 6 , reduction_method = 'tSNE' , verbose = T) # monocle分群可以指定分群的数量 cluster.num <- length ( unique (sce.obj $ celltype)) cluster.num # [1] 9 mycds <- clusterCells (mycds, num_clusters = cluster.num) mono.redu <- plot_cell_clusters (mycds, 1 , 2 , color = "celltype" , show_cell_names = T, cell_name_size = 5 ) mono.redu library (patchwork) plot_cell_clusters (mycds, 1 , 2 , color = "celltype" , markers = c ( "CD79A" ), show_cell_names = T, cell_name_size = 5 ) | mono.redu plot_cell_clusters (mycds, 1 , 2 , color = "celltype" , markers = c ( "GZMB" ), show_cell_names = T, cell_name_size = 5 ) | mono.redu # 看一下与Seurat的差别,基本形状还是差不多的,但是Seurat的label是可以展示的 seu.tsne <- DimPlot (sce.obj, label = T, reduction = 'tsne' ) plot_cell_clusters (mycds, 1 , 2 , color = "celltype" , show_cell_names = T, cell_name_size = 10 ) | seu.tsne # 这里相当于Seurat regression,排除一些不想要的因素 names (pd @ data) # [1] "orig.ident" "nCount_RNA" "nFeature_RNA" "percent.mt" # [5] "RNA_snn_res.0.5" "seurat_clusters" "celltype" # 排除单个因素影响 mycds <- reduceDimension (mycds, max_components = 2 , num_dim = 2 , reduction_method = 'tSNE' , residualModelFormulaStr = "~percent.mt" , verbose = T) mycds <- clusterCells (mycds, num_clusters = 9 ) reg.tsne <- plot_cell_clusters (mycds, 1 , 2 , color = "celltype" ) mono.redu | reg.tsne # 两个变量的影响一起排除 mycds <- reduceDimension (mycds, max_components = 2 , num_dim = 2 , reduction_method = 'tSNE' , residualModelFormulaStr = "~nFeature_RNA + percent.mt" , verbose = T) mycds <- clusterCells (mycds, num_clusters = 9 ) reg2.tsne <- plot_cell_clusters (mycds, 1 , 2 , color = "celltype" ) mono.redu | reg.tsne | reg2.tsne plot_cell_clusters (mycds, 1 , 2 , color = "celltype" ) + facet_wrap ( ~ celltype) # 用monocle计算一下marker,看看降维效果如何 diff_test_res <- differentialGeneTest (mycds[mycds_expressed_genes,], fullModelFormulaStr = "~celltype" ) # 这一步就相当于Seurat里的findallmarker,约3.5min,不能多线程,狠麻烦,会算很久 diff_test_res[ 1 : 4 , 1 : 4 ] # status family pval qval # LINC00115 OK negbinomial.size 9.547089e-01 9.716102e-01 # NOC2L OK negbinomial.size 4.047651e-01 5.783723e-01 # HES4 OK negbinomial.size 1.118682e-45 6.430970e-44 # ISG15 OK negbinomial.size 2.504851e-22 6.148523e-21 ordering_genes <- row.names ( subset (diff_test_res, qval < 0.01 )) mycds <- setOrderingFilter (mycds, ordering_genes) plot_ordering_genes (mycds) mycds <- reduceDimension (mycds, max_components = 2 , num_dim = 3 , norm_method = 'log' , reduction_method = 'tSNE' , # residualModelFormulaStr = "~Media + num_genes_expressed", verbose = T) mycds <- clusterCells (mycds, num_clusters = 9 ) mono.gene.tsne <- plot_cell_clusters (mycds, 1 , 2 , color = "celltype" ) mono.gene.tsne | mono.redu | seu.tsne seu.var.gene <- Seurat :: VariableFeatures (sce.obj) length (seu.var.gene) # [1] 2000 mycds <- setOrderingFilter (mycds, seu.var.gene) mycds <- reduceDimension (mycds, max_components = 2 , num_dim = 3 , norm_method = 'log' , reduction_method = 'tSNE' , # residualModelFormulaStr = "~Media + num_genes_expressed", verbose = T) mycds <- clusterCells (mycds, num_clusters = 9 ) seu.var.plot <- plot_ordering_genes (mycds) mono.gene.tsne | seu.var.plot 5、 monocle2实战 有选择先验(半监督式)/高变基因(非监督式)两种机器学习供大家选择 5.1 半监督式拟时序 # 正式开始拟时序 # 第一步:,# 这一步主要作用是去噪、提取真正对拟时序有用的基因矩阵,这会影响拟时序的形状 # 半监督式:寻找自己感兴趣的,这里我们先用各celltype之间的marker gene尝试一下 diff_test_res <- differentialGeneTest (mycds[mycds_expressed_genes,], fullModelFormulaStr = "~celltype" ) diff_test_res[ 1 : 4 , 1 : 4 ] # status family pval qval # LINC00115 OK negbinomial.size 9.547089e-01 9.716102e-01 # NOC2L OK negbinomial.size 4.047651e-01 5.783723e-01 # HES4 OK negbinomial.size 1.118682e-45 6.430970e-44 # ISG15 OK negbinomial.size 2.504851e-22 6.148523e-21 ordering_genes <- row.names ( subset (diff_test_res, qval < 0.01 )) ordering_genes[ 1 : 5 ] # [1] "HES4" "ISG15" "TNFRSF18" "TNFRSF4" "SDF4" mycds <- setOrderingFilter (mycds, ordering_genes) plot_ordering_genes (mycds) Graph Embedding降维 mycds <- reduceDimension (mycds, max_components = 2 , method = 'DDRTree' ) mycds <- orderCells (mycds) # 如果报错出现的提示:# Error in if (class(projection) != "matrix") projection <- as.matrix(projection):the condition has length > 1 # 解决方法:修改源码 # trace("project2MST",edit = T, where = asNamespace("monocle")) 删除 “ if (class(projection) != "matrix") ”这部分的代码 plot_cell_trajectory (mycds, color_by = "celltype" ) | plot_cell_trajectory (mycds, color_by = "State" ) | plot_cell_trajectory (mycds, color_by = "Pseudotime" ) # 注意:拟时间可能会与实际的生物学方向相反,如果你具有一定的先验知识,那么你可以反转这一进程 ori.pse <- plot_cell_trajectory (mycds, color_by = "Pseudotime" ) mycds <- orderCells (mycds, reverse = T) revserse.pse <- plot_cell_trajectory (mycds, color_by = "Pseudotime" ) ori.pse | revserse.pse # 如果你觉得混在一起看不清楚 plot_cell_trajectory (mycds, color_by = "State" ) + facet_wrap ( ~ State, nrow = 1 ) plot_cell_trajectory (mycds, color_by = "State" ) + facet_wrap ( ~ celltype, nrow = 1 ) blast_genes <- diff_test_res $ gene_short_name[ 1 : 10 ] blast_genes # [1] "LINC00115" "NOC2L" "HES4" "ISG15" "C1orf159" "TNFRSF18" # [7] "TNFRSF4" "SDF4" "B3GALT6" "UBE2J2" plot_genes_in_pseudotime (mycds[ c ( "HES4" , "ISG15" , "SDF4" ),], color_by = "State" ) mycds_expressed_genes <- row.names ( subset (Biobase :: fData (mycds), num_cells_expressed >= 10 )) mycds_filtered <- mycds[mycds_expressed_genes,] top10gene # # A tibble: 90 × 7 # # Groups: cluster [9] # p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene # <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr> # 1 4.03e-12 4.79 0.034 0.003 0.0000000552 Naive CD4 T NOG # 2 8.87e-10 5.85 0.022 0.001 0.0000122 Naive CD4 T REG4 # 3 2.94e- 8 7.12 0.016 0 0.000403 Naive CD4 T GTSCR1 # 4 1.54e- 7 3.81 0.02 0.002 0.00212 Naive CD4 T C2orf40 # 5 2.83e- 7 4.33 0.016 0.001 0.00388 Naive CD4 T MMP28 # 6 6.11e- 7 3.58 0.019 0.002 0.00837 Naive CD4 T FAM153A # 7 1.87e- 6 3.51 0.016 0.001 0.0256 Naive CD4 T MYB # 8 8.67e- 5 4.42 0.01 0.001 1 Naive CD4 T AXIN2 # 9 8.71e- 5 4.43 0.01 0.001 1 Naive CD4 T ST6GALNAC1 # 10 1.17e- 4 3.86 0.011 0.001 1 Naive CD4 T RP11-119D9.1 # # ℹ 80 more rows cds_subset <- mycds_filtered[ c ( 'LDHB' , 'CCR7' , 'CD3D' ),] plot_genes_in_pseudotime (cds_subset, color_by = "celltype" ) 5.2 非监督式的 # 刚才是半监督式的,现在来个非监督式的 # 重来 # 1、按照cluster基因做拟时序 mycds <- newCellDataSet (data, phenoData = pd, featureData = fd, expressionFamily = negbinomial.size ) mycds <- estimateSizeFactors (mycds) mycds <- estimateDispersions (mycds, cores= 10 , relative_expr = TRUE ) mycds <- detectGenes (mycds, min_expr = 0.1 ) Biobase :: fData (mycds) $ use_for_ordering <- Biobase :: fData (mycds) $ num_cells_expressed > 0.05 * ncol (mycds) expr.genes <- Biobase :: fData (mycds) $ gene_short_name[Biobase :: fData (mycds) $ use_for_ordering] mycds <- setOrderingFilter (mycds, ordering_genes = expr.genes ) mycds <- reduceDimension (mycds, max_components = 2 , method = 'DDRTree' ) plot_ordering_genes (mycds) plot_pc_variance_explained (mycds, return_all = F) mycds <- reduceDimension (mycds, max_components = 2 , norm_method = 'log' , num_dim = 3 , reduction_method = 'tSNE' , verbose = T) mycds <- clusterCells (mycds, rho_threshold = 2 , delta_threshold = 4 , skip_rho_sigma = T, verbose = F) plot_cell_clusters (mycds, color_by = 'as.factor(Cluster)' ) | plot_cell_clusters (mycds, color_by = 'as.factor(celltype)' ) clustering_DEG_genes <- differentialGeneTest (mycds[mycds_expressed_genes,], fullModelFormulaStr = '~Cluster' , cores = 10 ) mycds_ordering_genes <- row.names (clustering_DEG_genes)[ order (clustering_DEG_genes $ qval)][ 1 : 1000 ] mycds <- setOrderingFilter (mycds, ordering_genes = mycds_ordering_genes) mycds <- reduceDimension (mycds, method = 'DDRTree' ) suppressWarnings (mycds <- orderCells (mycds)) plot_cell_trajectory (mycds, color_by = "celltype" ) | plot_cell_trajectory (mycds, color_by = "Pseudotime" ) suppressWarnings (mycds <- orderCells (mycds, reverse = T)) cluster.pse <- plot_cell_trajectory (mycds, color_by = "Pseudotime" ) ori.pse | cluster.pse data <- data <- as ( as.matrix (sce.obj @ assays $ RNA $ counts), 'sparseMatrix' ) mycds <- newCellDataSet (data, phenoData = pd, featureData = fd, expressionFamily = negbinomial.size ) mycds <- estimateSizeFactors (mycds) mycds <- estimateDispersions (mycds, cores= 10 , relative_expr = TRUE ) disp_table <- dispersionTable (mycds) disp.genes <- subset (disp_table, mean_expression >= 0.1 & dispersion_empirical >= 1 * dispersion_fit) $ gene_id mycds <- setOrderingFilter (mycds, disp.genes) plot_ordering_genes (mycds) mycds <- reduceDimension (mycds, max_components = 2 , method = 'DDRTree' ) mycds <- orderCells (mycds) disper.pse <- plot_cell_trajectory (mycds, color_by = "Pseudotime" ) ori.pse | cluster.pse | disper.pse cds_subset <- mycds[ row.names (clustering_DEG_genes)[ order (clustering_DEG_genes $ qval)][ 1 : 1000 ],] diff_test_res <- differentialGeneTest (cds_subset, fullModelFormulaStr = "~sm.ns(Pseudotime)" ) library (dplyr) diff_test_res[, c ( "gene_short_name" , "pval" , "qval" )] %>% head # gene_short_name pval qval # S100A9 1.316208e-213 1.316208e-210 # S100A8 7.284744e-209 3.642372e-206 # NKG7 1.552122e-173 3.104243e-171 # GNLY 4.006125e-191 1.335375e-188 # GZMB 2.736250e-186 6.840626e-184 # PPBP 1.324811e-01 1.503758e-01 sig.gene <- row.names ( subset (diff_test_res, qval < 0.05 )) # plot_genes_in_pseudotime(cds_subset[sig.gene], # color_by = 'celltype',ncol = 5) plot_pseudotime_heatmap (cds_subset[sig.gene[ 1 : 30 ],], num_clusters = 3 , cores = 1 , show_rownames = T) 6、下游探索 6.1 DimPlot和FeaturePlot mono.info <- pData (mycds) head (mono.info) # orig.ident nCount_RNA nFeature_RNA percent.mt RNA_snn_res.0.5 # AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759 1 # AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958 3 # AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363 1 # AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845 2 # AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898 6 # AAACGCACTGGTAC-1 pbmc3k 2163 781 1.6643551 1 # seurat_clusters celltype Size_Factor Pseudotime State # AAACATACAACCAC-1 1 CD14+ Mono 1.1076350 4.1751808 1 # AAACATTGAGCTAC-1 3 B 2.2450329 5.4834034 3 # AAACATTGATCAGC-1 1 CD14+ Mono 1.4409787 2.5338982 1 # AAACCGTGCTTCCG-1 2 Memory CD4 T 1.2083708 8.6265722 3 # AAACCGTGTATGCG-1 6 NK 0.4487318 0.5655187 1 # AAACGCACTGGTAC-1 1 CD14+ Mono 0.9904153 4.8913800 1 # 可以看出这是一个行名为细胞名,列名为各个变量信息的数据框,可以通过以下方式将这个数据框作为注释列表添加回Seurat colnames (mono.info) # [1] "orig.ident" "nCount_RNA" "nFeature_RNA" "percent.mt" # [5] "RNA_snn_res.0.5" "seurat_clusters" "celltype" "Size_Factor" # [9] "Pseudotime" "State" # 为了避免与Seurat原有的注释重合,我们给这些变量名加一个后缀 new.var <- paste0 ( colnames (mono.info), '_mono.info' ) colnames (mono.info) <- new.var colnames (mono.info) # [1] "orig.ident_mono.info" "nCount_RNA_mono.info" # [3] "nFeature_RNA_mono.info" "percent.mt_mono.info" # [5] "RNA_snn_res.0.5_mono.info" "seurat_clusters_mono.info" # [7] "celltype_mono.info" "Size_Factor_mono.info" # [9] "Pseudotime_mono.info" "State_mono.info" sce.obj <- AddMetaData (sce.obj, metadata = mono.info) # 可以利用Seurat原有的降维图观察细胞类型、拟时序中state、拟时间分布:library (ggsci) DimPlot (sce.obj, group.by = 'celltype' , cols = pal_npg ( 10 )) | DimPlot (sce.obj, group.by = 'State_mono.info' , cols = pal_nejm ( 10 )) | FeaturePlot (sce.obj, features = 'Pseudotime_mono.info' , cols = c ( 'white' , 'darkred' )) 6.2 VlnPlot # VlnPlot可以用于绘制Pseudotime这样的数值型变量 VlnPlot (sce.obj, features = 'Pseudotime_mono.info' , group.by = 'celltype' , cols = pal_npg ( 10 )) | VlnPlot (sce.obj, features = 'Pseudotime_mono.info' , group.by = 'State_mono.info' , cols = pal_nejm ( 10 )) 6.3 ggplot2绘图 cellname <- rownames (mono.info) data4plot <- cbind (sce.obj @ meta.data, as.data.frame (sce.obj @ reductions $ umap @ cell.embeddings)[cellname,]) head (data4plot) # orig.ident nCount_RNA nFeature_RNA percent.mt RNA_snn_res.0.5 # AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759 1 # AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958 3 # AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363 1 # AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845 2 # AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898 6 # AAACGCACTGGTAC-1 pbmc3k 2163 781 1.6643551 1 # seurat_clusters celltype orig.ident_mono.info # AAACATACAACCAC-1 1 CD14+ Mono pbmc3k # AAACATTGAGCTAC-1 3 B pbmc3k # AAACATTGATCAGC-1 1 CD14+ Mono pbmc3k # AAACCGTGCTTCCG-1 2 Memory CD4 T pbmc3k # AAACCGTGTATGCG-1 6 NK pbmc3k # AAACGCACTGGTAC-1 1 CD14+ Mono pbmc3k # nCount_RNA_mono.info nFeature_RNA_mono.info # AAACATACAACCAC-1 2419 779 # AAACATTGAGCTAC-1 4903 1352 # AAACATTGATCAGC-1 3147 1129 # AAACCGTGCTTCCG-1 2639 960 # AAACCGTGTATGCG-1 980 521 # AAACGCACTGGTAC-1 2163 781 # percent.mt_mono.info RNA_snn_res.0.5_mono.info # AAACATACAACCAC-1 3.0177759 1 # AAACATTGAGCTAC-1 3.7935958 3 # AAACATTGATCAGC-1 0.8897363 1 # AAACCGTGCTTCCG-1 1.7430845 2 # AAACCGTGTATGCG-1 1.2244898 6 # AAACGCACTGGTAC-1 1.6643551 1 # seurat_clusters_mono.info celltype_mono.info # AAACATACAACCAC-1 1 CD14+ Mono # AAACATTGAGCTAC-1 3 B # AAACATTGATCAGC-1 1 CD14+ Mono # AAACCGTGCTTCCG-1 2 Memory CD4 T # AAACCGTGTATGCG-1 6 NK # AAACGCACTGGTAC-1 1 CD14+ Mono # Size_Factor_mono.info Pseudotime_mono.info State_mono.info # AAACATACAACCAC-1 1.1076350 4.1751808 1 # AAACATTGAGCTAC-1 2.2450329 5.4834034 3 # AAACATTGATCAGC-1 1.4409787 2.5338982 1 # AAACCGTGCTTCCG-1 1.2083708 8.6265722 3 # AAACCGTGTATGCG-1 0.4487318 0.5655187 1 # AAACGCACTGGTAC-1 0.9904153 4.8913800 1 # UMAP_1 UMAP_2 # AAACATACAACCAC-1 3.001694 3.751682 # AAACATTGAGCTAC-1 5.673855 -12.254552 # AAACATTGATCAGC-1 5.428529 5.915791 # AAACCGTGCTTCCG-1 -8.592008 -3.628471 # AAACCGTGTATGCG-1 2.763551 -2.193586 # AAACGCACTGGTAC-1 2.877154 5.927556 library ( "ggplot2" ) ggplot (data4plot, aes ( x= UMAP_1, y= UMAP_2)) + geom_point ( aes ( x= UMAP_1, y= UMAP_2, shape= State_mono.info, size= Pseudotime_mono.info, colour = celltype), data = data4plot) + scale_color_manual ( values = pal_nejm ( 10 )) + scale_size_continuous ( range = c ( 0.3 , 1.5 )) + theme_classic + theme ( axis.line = element_line ( size= 0.6 , colour = "black" , arrow = ggplot2 :: arrow ( length = ggplot2 :: unit ( 0.1 , "inches" ), ends = "last" , type = 'open' ))) 6.4 细胞比例探索 这部分内容其实我们以前出过专门的教程,详情见沉浸式统计细胞比例。
在这里也一样,我们可以去统计拟时序的state和celltype之间的关系。
6.4.1 饼图 # 统计细胞类型 unique (mono.info $ celltype_mono.info) # [1] CD14+ Mono B Memory CD4 T NK CD8 T # [6] Naive CD4 T FCGR3A+ Mono DC Platelet # 9 Levels: Naive CD4 T CD14+ Mono Memory CD4 T B CD8 T FCGR3A+ Mono NK ... Platelet mynames <- table (mono.info $ celltype_mono.info) %>% names mynames # [1] "Naive CD4 T" "CD14+ Mono" "Memory CD4 T" "B" "CD8 T" # [6] "FCGR3A+ Mono" "NK" "DC" "Platelet" # 统计细胞数目 myratio <- table (mono.info $ celltype_mono.info) %>% as.numeric myratio # [1] 697 483 480 344 271 162 155 32 14 # 整理细胞标签 pielabel <- paste0 (mynames, " (" , round (myratio / sum (myratio) * 100 , 2 ), "%)" ) pielabel # [1] "Naive CD4 T (26.42%)" "CD14+ Mono (18.31%)" "Memory CD4 T (18.2%)" # [4] "B (13.04%)" "CD8 T (10.27%)" "FCGR3A+ Mono (6.14%)" # [7] "NK (5.88%)" "DC (1.21%)" "Platelet (0.53%)" cols <- pal_npg ( "nrc" )( 10 ) cols # [1] " " " " " " # [7] " " " " pie (myratio, labels= pielabel, radius = 1.0 , clockwise= F, main = "Celltype" , col = cols) mynames <- table (mono.info $ State_mono.info) %>% names myratio <- table (mono.info $ State_mono.info) %>% as.numeric pielabel <- paste0 (mynames, " (" , round (myratio / sum (myratio) * 100 , 2 ), "%)" ) pielabel # [1] "1 (44.28%)" "2 (27.1%)" "3 (28.62%)" pie (myratio, labels= pielabel, radius = 1.0 , clockwise= F, main = "State" , col = cols) 6.4.2 堆积柱形图 cellnum <- table (mono.info $ celltype_mono.info,mono.info $ State_mono.info) cellnum <- cellnum[ ! apply (cellnum, 1 , sum) == 0 ,] # 按列统计细胞比例 cell.prop <- as.data.frame ( prop.table (cellnum, margin= 2 )) colnames (cell.prop) <- c ( "Celltype" , "State" , "Proportion" ) head (cell.prop) # Celltype State Proportion # 1 Naive CD4 T 1 0.25256849 # 2 CD14+ Mono 1 0.36044521 # 3 Memory CD4 T 1 0.00000000 # 4 B 1 0.01626712 # 5 CD8 T 1 0.23116438 # 6 FCGR3A+ Mono 1 0.00000000 p.bar <- ggplot (cell.prop, aes (State,Proportion, fill= Celltype)) + geom_bar ( stat= "identity" , position= "fill" ) + scale_fill_manual ( values= cols) + ggtitle ( "cell proportion" ) + theme_bw + theme ( axis.ticks.length= unit ( 0.5 , 'cm' )) + guides ( fill= guide_legend ( title= NULL )) p.bar 6.4.3 累计密度分布图 densyplot <- ggplot (mono.info, aes ( x= Pseudotime_mono.info, group= celltype_mono.info, fill= celltype_mono.info)) + geom_density ( alpha= 0.3 , adjust= 1.5 ) + scale_fill_manual ( values = c ( pal_npg ( "nrc" )( 10 )) ) + theme_bw densyplot 6.5 基因表达量的探索 6.5.1 在轨迹图上绘制表达量 相信大家一定也很感兴趣如何把Seurat中的基因拿到monocle来绘制。
比如我想通过plot_cell_trajectory函数,像绘制psedotime一样把基因的表达量在拟时序轨迹图中绘制出来,如果直接在color_by中输入基因名是会报错的。
disp.genes[ 1 ] # [1] "CPSF3L" # try(plot_cell_trajectory(mycds, color_by = "CPSF3L")) # 这是因为plot_cell_trajectory函数会在pData(cds)中寻找color_by的变量,所以这里要做的就是把Seurat中基因的表达量加到pData中去,这个操作一行就可以搞定 pData (mycds) $ CPSF3L <- sce.obj @ assays $ RNA @ data[ "CPSF3L" ,][ rownames ( pData (mycds))] try ( plot_cell_trajectory (mycds, color_by = "CPSF3L" )) # 可能你还是觉得上面这张图有点丑,想要改一下颜色,plot_cell_trajectory中无法指定color的值,我阅读了一下源码,发现这个图片本质上还是用ggplot2生成的,那事情就变得简单了 library (circlize) mycolor <- colorRampPalette ( colors = c ( ' , 'white' , ' ))( 100 ) plot_cell_trajectory (mycds, color_by = "CPSF3L" , cell_size = 0.5 ) + scale_colour_gradientn ( colours = mycolor, guide = "colorbar" ) 6.5.2 热图 其实热图的教程我们做过很多:热图绘制天花板:ComplexHeatmap|
一、初探 热图绘制天花板| 二、两万字长文探索注释功能 ComplexHeatmap|
三、用富集分析结果做成词云注释热图 在理解了monocle对象的数据结构和绘制热图的语法后,大家完全可以自行拼凑出下面的图,流程有些长,有点耐心哦:首先整理一下注释信息和表达矩阵 library (ComplexHeatmap) expr4plot <- sce.obj @ assays $ RNA @ scale.data mono.info <- mono.info[ rownames (mono.info) %in% colnames (expr4plot),] mono.info <- mono.info[ order (mono.info $ Pseudotime_mono.info),] expr4plot <- expr4plot[, rownames (mono.info)] dim (mono.info) # [1] 2638 10 dim (expr4plot) # [1] 13714 2638 # 设置顶部的psedotime、celltype、State的注释 top_anno <- HeatmapAnnotation ( Pseudotime= mono.info $ Pseudotime, Celltype= mono.info $ celltype_mono.info, State= mono.info $ State_mono.info) # 可以先看一下注释的效果 draw (top_anno) # 找一些感兴趣的基因集,比如各种细胞的marker,这里以每种细胞的Top20为例,实际过程中大家可以选择更多的基因,便于富集到有意义的通路:# 先展示基因来源 t.sub.marker <- FindAllMarkers (sce.obj, only.pos = TRUE , min.pct = 0.25 , logfc.threshold = 0.25 ) top20gene <- t.sub.marker %>% group_by (cluster) %>% top_n ( n = 20 , wt = avg_log2FC) left_ann <- rowAnnotation ( cellmarker= top20gene $ cluster) draw (left_ann) 来做个富集分析用于词云注释,富集分析的教程可以看这里:手把手教你做单细胞测序数据分析
(七)——基因集富集分析 library (clusterProfiler) library (org.Hs.eg.db) pathway4wc <- data.frame for (runcell in unique (top20gene $ cluster)) { try ({pathway4wc}) gene4er <- filter (top20gene,cluster == runcell) %>% pull (gene) gene.df <- bitr (gene4er, fromType= "SYMBOL" , toType= c ( "ENTREZID" , "ENSEMBL" ), OrgDb = org.Hs.eg.db) gores <- enrichGO (gene.df $ SYMBOL, org.Hs.eg.db, keyType = "SYMBOL" , ont = "BP" , pvalueCutoff = 0.05 , pAdjustMethod = "BH" , qvalueCutoff = 0.2 , minGSSize = 10 , maxGSSize = 500 , readable = FALSE , pool = FALSE ) go.all.res <- as.data.frame (gores @ result) pathway.dt <- top_n (go.all.res, - 10 , p.adjust) pathway.dt <- cbind (pathway.dt, data.frame ( cluster= rep (runcell, nrow (pathway.dt)))) pathway4wc <- rbind (pathway4wc,pathway.dt) } head (pathway4wc) # ID Description GeneRatio BgRatio # GO:0030098 GO:0030098 lymphocyte differentiation 8/19 419/18903 # GO:1903131 GO:1903131 mononuclear cell differentiation 8/19 473/18903 # GO:0030217 GO:0030217 T cell differentiation 7/19 296/18903 # GO:0050863 GO:0050863 regulation of T cell activation 7/19 376/18903 # GO:0022407 GO:0022407 regulation of cell-cell adhesion 7/19 490/18903 # GO:0045061 GO:0045061 thymic T cell selection 3/19 22/18903 # pvalue p.adjust qvalue # GO:0030098 3.327775e-09 2.370731e-06 1.453285e-06 # GO:1903131 8.596498e-09 2.370731e-06 1.453285e-06 # GO:0030217 9.224635e-09 2.370731e-06 1.453285e-06 # GO:0050863 4.778696e-08 9.210937e-06 5.646407e-06 # GO:0022407 2.898002e-07 4.468719e-05 2.739375e-05 # GO:0045061 1.309881e-06 1.450935e-04 8.894393e-05 # geneID Count cluster # GO:0030098 CCR7/CD3D/CD3E/LEF1/TCF7/CD27/NDFIP1/RHOH 8 Naive CD4 T # GO:1903131 CCR7/CD3D/CD3E/LEF1/TCF7/CD27/NDFIP1/RHOH 8 Naive CD4 T # GO:0030217 CCR7/CD3D/CD3E/LEF1/TCF7/CD27/RHOH 7 Naive CD4 T # GO:0050863 CCR7/CD3E/LEF1/TCF7/CD27/NDFIP1/RHOH 7 Naive CD4 T # GO:0022407 CCR7/CD3E/LEF1/CD27/RGCC/NDFIP1/RHOH 7 Naive CD4 T # GO:0045061 CCR7/CD3D/CD3E 3 Naive CD4 T # 构建词云的text # 遍历每个唯一的聚类 mytext <- list for (i in 1 : length ( unique (pathway4wc $ cluster))) { # 选择p.adjust最小的前5个通路 temp.text <- filter (pathway4wc,cluster == unique (pathway4wc $ cluster)[i]) %>% top_n ( - 1 ,p.adjust) %>% pull (Description) # 将当前聚类的通路描述信息添加到列表中 mytext[[i]] <- temp.text } # 为列表元素命名,使用聚类的唯一值 names (mytext) <- unique (pathway4wc $ cluster) # 依据显著性设置词云大小,越显著的通路字体越大 # 创建一个空向量用于存储每个聚类的p.adjust值 myfontsize <- c # 遍历每个唯一的聚类 for (i in 1 : length ( unique (pathway4wc $ cluster))) { # 筛选出当前聚类且p.adjust小于0.05的数据 temfs <- filter (pathway4wc,cluster == unique (pathway4wc $ cluster)[i]) %>% filter (p.adjust < 0.05 ) %>% # 选择p.adjust最小的前5个通路 top_n ( - 1 ,p.adjust) %>% pull (p.adjust) # 将当前聚类的p.adjust值添加到向量中 myfontsize <- c (myfontsize,temfs) } # 加载scales包,用于数据缩放 library (scales) # 对p.adjust取负对数,然后进行缩放,使字体大小范围在0到10之间 myfontsize <- - log2 (myfontsize) %>% scales :: rescale (., to= c ( 0 : 1 )) * 10 # 创建行注释对象 # 使用anno_textbox函数绘制文本框注释 max_line_length <- 30 # 定义换行函数,保留 list 结构 wrap_list_text <- function (lst, width = max_line_length) { lapply (lst, function (vec) { sapply (vec, function (txt) { paste ( strwrap (txt, width = width), collapse = " \n " ) }) }) } # 应用函数 mytext_wrapped <- wrap_list_text (mytext) right_ann <- rowAnnotation ( textbox = anno_textbox (top20gene $ cluster, mytext_wrapped, # 允许文本换行 add_new_line = T, by = "anno_block" , # 设置文本框的样式 gp = gpar ( col = c ( pal_npg ( 10 ), pal_nejm ( 8 ), pal_aaas ( 4 )) , fontsize= myfontsize), # 设置文本框的圆角 round_corners = TRUE , # 设置文本框的最大宽度 max_width= unit ( 3 , 'mm' ))) col_fun <- colorRamp2 ( quantile (expr4plot) %>% as.numeric , c ( ' , ' , 'white' , ' , ' )) finalhmp <- Heatmap (expr4plot[top20gene $ gene,], show_column_names = F, show_row_names = F, col = col_fun, top_annotation = top_anno, cluster_columns = F, row_split = top20gene $ cluster, row_title = NULL , right_annotation = right_ann, border = T, left_annotation = left_ann ) draw (finalhmp) save.image ( "result_m2/m2.Rdata" ) Sys.time # [1] "2025-04-28 13:09:31 CST"
五、monocle3 1、简介 视频教程:请点击monocle3拟时序分析视频链接。测试文件:请在单细胞测序数据进阶分析—《拟时序分析》4.初识monocle3推送中获取。2、monocle3与monocle2的主要区别 2.1 monocle3的主要功能 Clustering, classifying, and counting cells:主要任务是帮助大家找到一些具有特殊功能的亚群。
Constructing single-cell trajectories:这个功能大家需要monocle的最根本原因,通过拟时序分析帮助大家解析生物体发育、疾病等过程中细胞发生的变化。Differential expression analysis:差异计算monocle2也有,但我们实测的感觉是不如Seurat的,这里作者表明monocle3在这一块进行了优化升级,我们后面具体看看用起来效果如何。
2.2 先谈缺点 目前monocle3已经更新到β版本了,作者在官网也承认了缺点,monocle3 α已经是不推荐使用的,可能会存在bug,但是monocle3 β仍然处于搭建中的状态,也就是说monocle3仍然是可能存在bug的,并且我们之前讲绪论的时候说到monocle1、2都发表在了Nature系列之上,但是monocle3迟迟没有发表,并且目前发表的文章还是使用monocle2的比较多,monocle3的不稳定性可能是重要原因。
不过我相信新发表的论文很快就会由monocle2转向monocle3。官方表述如下 2.3 再看看优点 咱们抛开可能存在的bug不谈,monocle3相较于monocle2具有以下几点优势:(1)最大的优点就是计算量变大了,可以处理百万级别的单细胞数据集,也就是说整个器官、甚至整个胚胎的矩阵交给monocle3处理完全没压力。
(2)代码结构性优化:这点我要吐槽一下,monocle系列的语法我一直觉得很奇怪,默认参数也很不人性化 (3)支持UMAP算法的降维,这个也非常Nice,速度比tSNE快的不是一星半点。
(4)支持多谱系的拓扑结构:换句话说拟时序的轨迹可以做的很复杂 (5)相较于原来的RGE算法,新的approximate graph abstraction能够计算不连续的、平行的拓扑结构 (6)新的基因表达量计算及差异分析方法被引入,也就是说原来的differentialGeneTest和BEAM可以被替代。
(7)可以像Seurat的多样本整合一样对拟时序对象进行整合:这个功能可以说是刚需了,换句话说,如果你有合适的、已构建好拟时序的参考数据集,可以直接把自己的数据跟参考数据集进行投影、比对。(8)数据整合时也可同时加上注释:这有点类似于Seurat中的TransferData,可以利用已经注释好的参考数据给现行数据添加注释。
(9)对monocle对象的读取、加载、转换做了一定的优化,我们后面可以看看效果如何 (10)优化了负二项分布模型:也就是说对处理count的优化 (11)可视化提供了3D展示功能 3、安装Monocle3 保证自己的环境符合这些版本要求,不然报错很麻烦 R >= 4.1.0 , Bioconductor = 3.14,这样装上的monocle3 >= 1.2.7 3.1 R语言中的安装 # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # package.version('BiocManager') # BiocManager::install(version = "3.14") # if(!require(monocle3))BiocManager::install(c('BiocGenerics', 'DelayedArray', 'DelayedMatrixStats', # 'limma', 'lme4', 'S4Vectors', 'SingleCellExperiment', # 'SummarizedExperiment', 'batchelor', 'Matrix.utils', # 'HDF5Array', 'terra', 'ggrastr')) # if(!require(devtools))install.packages("devtools") # if(!require(monocle3))devtools::install_github('cole-trapnell-lab/monocle3') # if(!require(monocle3))devtools::install_github('cole-trapnell-lab/monocle3', ref="develop") 3.2 通过conda安装 即使在R4.1.0中还是很容易有很多包装不上 error: gdal-config not found or not executable. apt-get -y update && apt-get install -y libudunits2-dev libgdal-dev libgeos-dev libproj-dev activate monocle3 insyall r-base==4.1.0 install gcc install seurat install r-seuratobject install r-terra==1.5.12 install r-units install r-raster # conda install r-spdata # conda install r-sf # conda install r-spdep # conda install r-ragg # conda install r-ggrastr # conda install r-devtools # if(!require(monocle3))devtools::install_github('cole-trapnell-lab/monocle3') search monocle3 install r-monocle3 library (monocle3) package.version ( "monocle3" ) # [1] "1.3.7" library (Seurat) library (SeuratObject) library (tidyselect) library (dplyr) library (BiocGenerics) 4、初探monocle3 4.1 monocle3对象的构建 Sys.time # [1] "2025-04-28 13:09:36 CST" # 设置工作目录 setwd ( "/data/06_scRNA-seq_psedutime/" ) # data <- as(as.matrix(data4mono@assays$RNA@counts), 'sparseMatrix') # pd <- new('AnnotatedDataFrame', data = [email protected]) # fData <- data.frame(gene_short_name = row.names(data4mono), row.names = row.names(data4mono)) # fd <- new('AnnotatedDataFrame', data = fData) # mycds <- newCellDataSet(data, # phenoData = pd, # featureData = fd, # expressionFamily = negbinomial.size) # expression_matrix <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/cao_l2_expression.rds")) # cell_metadata <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/cao_l2_colData.rds")) # gene_annotation <- readRDS(url("https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/cao_l2_rowData.rds")) expression_matrix <- readRDS ( 'data/expression_matrix.rds' ) cell_metadata <- readRDS ( 'data/cell_metadata.rds' ) gene_annotation <- readRDS ( 'data/gene_annotation.rds' ) class (cell_metadata) # [1] "data.frame" cds <- new_cell_data_set (expression_matrix, cell_metadata = cell_metadata, gene_metadata = gene_annotation) function not found 了还不知道哪里出错了 mysubset <- c for (runplate in unique (cds @ colData $ plate)){ mytemp <- cds @ colData %>% as.data.frame %>% filter (plate == runplate) %>% rownames %>% sample ( 50 ) mysubset <- cbind (mysubset,mytemp) } cds <- cds[,mysubset] table (cds @ colData $ plate) # # 001 002 003 004 005 006 007 008 009 010 # 50 saveRDS (cds, 'data/simple.rds' ) pbmc <- Read10X ( 'data/filtered_gene_bc_matrices/hg19/' ) pbmc <- CreateSeuratObject ( counts = pbmc) count <- GetAssayData (pbmc, slot= "counts" ) seurat2cds <- new_cell_data_set (count, cell_metadata = as.data.frame (pbmc @ meta.data), gene_metadata = data.frame ( gene_short_name = row.names (pbmc), row.names = row.names (pbmc)) ) class (seurat2cds) # [1] "cell_data_set" # attr(,"package") # [1] "monocle3" 4.2 预处理、质控 cds <- preprocess_cds (cds, num_dim = 100 ) cds <- align_cds (cds, alignment_group = "plate" ) names (cds @ colData) # 降维,默认是"Aligned"方式 cds <- reduce_dimension (cds, reduction_method= "UMAP" , umap.fast_sgd = FALSE , cores = 1 ) = FALSE' and 'cores = 1'可以保证结果相同 <- reduce_dimension(cds,reduction_method = 'tSNE',cores=5) # 聚类分群 cds <- cluster_cells (cds) # 拟时序 cds <- learn_graph (cds) graph embedding(RGE)算法 graph abstraction:https://pubmed.ncbi.nlm.nih.gov/30890159/ 这一步整活了,可以交互式地选择起点和终点,看了一下这个功能应该是依赖shiny写的,那就意味着在终端操作这一步就不可能了 cds <- order_cells (cds) 选择过程如下:plot_cells (cds, label_principal_points = T) cds <- order_cells (cds, root_pr_nodes= c ( "Y_5" )) plot_cells (cds, color_cells_by = 'pseudotime' ) names (cds @ colData) # [1] "plate" "cao_cluster" "cao_cell_type" "cao_tissue" # [5] "Size_Factor" gene_fits <- fit_models (cds, model_formula_str = "~plate" , cores= 1 ) 上面这函数看起来有点bug,如果不指定核心数默认会调用CPU所有的核心来计算,从而让简单的计算变得很慢,原理大家可以参考这个核心/线程数调用的越多就算得越快嘛? fit_coefs <- coefficient_table (gene_fits) emb_time_terms <- fit_coefs %>% filter (term == "plate" ) emb_time_terms <- emb_time_terms %>% mutate ( q_value = p.adjust (p_value)) sig_genes <- emb_time_terms %>% filter (q_value < 0.05 ) %>% pull (gene_short_name) suppressMessages (pr_test_res <- graph_test (cds, neighbor_graph= "principal_graph" , cores= 1 , verbose = F)) pr_deg_ids <- row.names ( subset (pr_test_res, q_value < 0.05 )) head (pr_deg_ids) # [1] "WBGene00000001" "WBGene00000004" "WBGene00000006" "WBGene00000010" # [5] "WBGene00000016" "WBGene00000018" 最后,取子集与合并 cds1 <- cds[ 1 : 100 ,] cds2 <- cds[ 101 : 200 ,] big_cds <- combine_cds ( list (cds1, cds2)) dim (cds) # [1] 20271 500 cds1 <- cds[, 1 : 300 ] class (cds1) # [1] "cell_data_set" # attr(,"package") # [1] "monocle3" dim (cds1) # [1] 20271 300 cds2 <- cds[ 1 : 300 ,] class (cds2) # [1] "cell_data_set" # attr(,"package") # [1] "monocle3" dim (cds2) # [1] 300 500 try ({big_cds <- combine_cds ( list (cds1, cds2), keep_all_genes = F, cell_names_unique = TRUE )}) # Error in combine_cds(list(cds1, cds2), keep_all_genes = F, cell_names_unique = TRUE) : # Cell names are not unique across CDSs - cell_names_unique must be FALSE. table ( colnames (cds1) %in% colnames (cds2)) # # TRUE # 300 4.3 细胞降维、分群、鉴定 视频教程:请点击monocle3中的聚类与分群视频链接。
测试文件:请在单细胞测序数据进阶分析—《拟时序分析》5.monocle3的降维、分群、聚类推送中获取。
4.3.1 降维 rm ( list = ls ); gc # used (Mb) gc trigger (Mb) max used (Mb) # Ncells 12126339 647.7 31347466 1674.2 39184332 2092.7 # Vcells 43548463 332.3 282321537 2154.0 427809837 3264.0 library (monocle3) library (dplyr) cds <- readRDS ( 'data/simple.rds' ) cds <- preprocess_cds (cds, num_dim = 100 ) plot_pc_variance_explained (cds) cds <- reduce_dimension (cds) # 这里我们没有指定reduction_method,但是默认是"UMAP",有"UMAP", "tSNE", "PCA", "LSI", "Aligned"可供选择 umap.fast_sgd=TRUE # umap.fast_sgd参数或者设置cores可以加快运算速度,但是可以能会让降维图最终显得有些不同,所以要慎重选择 red .1 <- plot_cells (cds, color_cells_by= "cao_cell_type" , cell_size = 2 , graph_label_size = 5 , group_label_size = 3 ) cds <- reduce_dimension (cds, umap.fast_sgd= T) red .2 <- plot_cells (cds, color_cells_by= "cao_cell_type" , cell_size = 2 , graph_label_size = 5 , group_label_size = 3 ) cds <- reduce_dimension (cds, cores = 5 ) # No preprocess_method specified, using preprocess_method = 'PCA' # Note: reduce_dimension will produce slightly different output each time you run it unless you set 'umap.fast_sgd = FALSE' and 'cores = 1' red .3 <- plot_cells (cds, color_cells_by= "cao_cell_type" , cell_size = 2 , graph_label_size = 5 , group_label_size = 3 ) suppressMessages ( library (patchwork)) red .1 | red .2 | red .3 # 试试tSNE cds <- reduce_dimension (cds, reduction_method= "tSNE" ) plot_cells (cds, color_cells_by= "cao_cell_type" , reduction_method= "tSNE" ) red .1 | plot_cells (cds, reduction_method= "tSNE" , color_cells_by= "cao_cell_type" , cell_size = 2 , group_label_size = 3 ) 4.3.2 分群 cds <- cluster_cells (cds) detection,类似于Seurat::FindClusters,但原理有所不同 cds <- cluster_cells (cds, resolution= 0.1 ) unique ( partitions (cds)) # [1] 1 2 # Levels: 1 2 plot_cells (cds) plot_cells (cds, genes= c ( "cpna-2" , "egl-21" , "ram-2" , "inos-1" )) 4.3.3 去批次 names (cds @ colData) # [1] "plate" "cao_cluster" "cao_cell_type" "cao_tissue" # [5] "Size_Factor" bfalig <- plot_cells (cds, color_cells_by= "plate" , label_cell_groups= FALSE ) cds <- align_cds (cds, alignment_group = "plate" ) aftalig <- plot_cells (cds, color_cells_by= "plate" , label_cell_groups= FALSE ) bfalig | aftalig 4.3.4 计算并展示marker基因 suppressMessages (marker_test_res <- top_markers (cds, group_cells_by= "cao_cell_type" , reference_cells= 1000 , cores= 2 )) marker_test_res[ 1 : 4 , 1 : 4 ] # gene_id gene_short_name cell_group marker_score # 1 WBGene00000031 abu-8 Pharyngeal epithelia 0.6092094 # 2 WBGene00000033 abu-10 Pharyngeal epithelia 0.4163906 # 3 WBGene00000069 acy-2 Canal associated neurons 0.8874499 # 4 WBGene00000100 ajm-1 Pharyngeal epithelia 0.4129481 top_specific_markers <- marker_test_res %>% filter (fraction_expressing >= 0.10 ) %>% group_by (cell_group) %>% top_n ( 1 , pseudo_R2) top_specific_markers[ 1 : 4 , 1 : 4 ] # # A tibble: 4 × 4 # # Groups: cell_group [3] # gene_id gene_short_name cell_group marker_score # <chr> <fct> <chr> <dbl> # 1 WBGene00000145 apc-11 Sex myoblasts 0.923 # 2 WBGene00000444 ceh-21 Sex myoblasts 0.946 # 3 WBGene00000501 cho-1 Cholinergic neurons 0.389 # 4 WBGene00000753 col-180 Non-seam hypodermis 0.566 top_specific_marker_ids <- unique (top_specific_markers %>% pull (gene_id)) plot_genes_by_group (cds, top_specific_marker_ids[ 1 : 10 ], group_cells_by= "cao_cell_type" , max.size= 3 ) plot_genes_by_group (cds, top_specific_marker_ids[ 1 : 10 ], max.size= 3 , group_cells_by= "partition" ) 4.3.5 注释细胞 colData (cds) $ assigned_cell_type <- as.character ( partitions (cds)) colData (cds) $ assigned_cell_type <- dplyr :: recode ( colData (cds) $ assigned_cell_type, "1" = "Bio_1" , "2" = "Bio_2" , "3" = "Bio_3" ) plot_cells (cds, group_cells_by= "partition" , color_cells_by= "assigned_cell_type" , cell_size = 2 , graph_label_size = 5 , group_label_size = 3 ) 4.4 取子集以及亚群分析 try (cds_subset <- choose_cells (cds)) # Error : choose_cells only works interactive mode. cds_subset <- cds class (cds_subset) # [1] "cell_data_set" # attr(,"package") # [1] "monocle3" suppressMessages (pr_graph_test_res <- graph_test (cds_subset, neighbor_graph= "knn" , cores= 2 )) 通过marker基因集可视化帮助判断 gene pr_deg_ids <- row.names ( subset (pr_graph_test_res, morans_I > 0.01 & q_value < 0.05 )) gene_module_df <- find_gene_modules (cds_subset[pr_deg_ids,], resolution= 1e-3 ) gene_module_df[ 1 : 5 , 1 : 5 ] # # A tibble: 5 × 5 # id module supermodule dim_1 dim_2 # <chr> <fct> <fct> <dbl> <dbl> # 1 WBGene00000004 6 2 3.02 -4.80 # 2 WBGene00000006 1 1 8.08 0.789 # 3 WBGene00000010 8 2 -5.91 1.45 # 4 WBGene00000016 21 9 -1.51 -4.65 # 5 WBGene00000018 9 2 1.20 -2.65 plot_cells (cds_subset, genes= gene_module_df, show_trajectory_graph= FALSE , label_cell_groups= FALSE , cell_size = 0.5 , graph_label_size = 5 , group_label_size = 3 ) 5、monocle3实战 视频教程:请点击monocle3拟时序分析视频链接。
测试文件:请在单细胞测序数据进阶分析—《拟时序分析》5.monocle3的降维、分群、聚类推送中获取。
monocle3与monocle2最大的特点有以下两点:1、可以交互式地选择拟时序的起点 2、可以采取3D的形式展示轨迹图 5.1 预处理数据 rm ( list = ls ); gc # used (Mb) gc trigger (Mb) max used (Mb) # Ncells 12027102 642.4 31347466 1674.2 39184332 2092.7 # Vcells 44101024 336.5 144548628 1102.9 427809837 3264.0 # 读入并创建cds对象 expression_matrix <- readRDS ( 'data/expression_matrix.rds' ) cell_metadata <- readRDS ( 'data/cell_metadata.rds' ) gene_annotation <- readRDS ( 'data/gene_annotation.rds' ) cds <- new_cell_data_set (expression_matrix, cell_metadata = cell_metadata, gene_metadata = gene_annotation) cds <- preprocess_cds (cds, num_dim = 50 ) cds <- align_cds (cds, alignment_group = "plate" , residual_model_formula_str = '~Size_Factor' ) residual_model_formula_str可以指定连续变量进行去批次,例如某一基因/基因集的表达值,否则这些变量以分类变量的形式参与去批次会将每一个值视为一个批次从而产生极大的冗余计算量 降维、聚类 cds <- reduce_dimension (cds) plot_cells (cds, label_groups_by_cluster= FALSE , color_cells_by = "cao_cell_type" ) ciliated_genes <- c ( "che-1" , "hlh-17" , "nhr-6" , "dmd-6" , "ceh-36" , "ham-1" ) plot_cells (cds, genes= ciliated_genes, label_cell_groups= FALSE , show_trajectory_graph= FALSE ) 我们monocle2那里谈过,虽然从计算学的角度来说细胞可以连续地从一种状态过渡到下一种状态,它们之间没有离散的边界,但从生物学上来说并不是所有细胞类型之间都能发生转化,所以Monocle3并不假设数据集中的所有细胞都来自一个共同的转录“祖先”,大家实际拿到的数据中可能有多个不同的轨迹。
例如,在应对感染的组织中,组织常驻免疫细胞和基质细胞会有非常不同的初始转录组,对感染的反应也会非常不同,所以它们应该分属于不用的轨迹。因此,monocle3在进行拟时序分析时并不采取单起点的方式。
cds <- cluster_cells (cds) plot_cells (cds, color_cells_by = "partition" ) # 这个图里黑色的线就是拟时序走向的背景,带有数字的灰色圈圈代表的是拟时序分支的leaf,他们被黑色的branch_points所隔开 5.2 拟时序分析 cds <- learn_graph (cds) plot_cells (cds, color_cells_by = "cao_cell_type" , label_groups_by_cluster= FALSE , label_leaves= FALSE , label_branch_points= FALSE ) plot_cells (cds, color_cells_by = "cao_cell_type" , label_cell_groups= FALSE , label_leaves= TRUE , label_branch_points= TRUE , graph_label_size= 1.5 ) cds <- order_cells (cds) plot_cells (cds, color_cells_by = "pseudotime" , label_cell_groups= FALSE , label_leaves= FALSE , label_branch_points= FALSE , graph_label_size= 1.5 ) 我们可以看到这个图中有许多灰色的点,这是因为这些单独的轨迹没有被选择root,所以产生了无效的拟时间值,也就是说,如果手动选择,那么每个拟时序的轨迹都需要选择一次root 说实话这个自动弹出交互式窗口的功能让我在写Rmarkdown的时候很抓狂,这个带有交互功能函数目前没法写在html文件中,所以,在我研读了一下源码之后,通过以下这种方式可以编程性选择拟时序的root,下面这个函数会选择最接近你选择细胞的节点作为拟时序的root names ( colData (cds)) # [1] "plate" "cao_cluster" "cao_cell_type" "cao_tissue" # [5] "Size_Factor" colData (cds)[, 'cao_cell_type' ] %>% unique # [1] "Unclassified neurons" "Germline" # [3] "Intestinal/rectal muscle" "Vulval precursors" # [5] "Coelomocytes" NA # [7] "Ciliated sensory neurons" "Failed QC" # [9] "Seam cells" "Non-seam hypodermis" # [11] "Pharyngeal epithelia" "Touch receptor neurons" # [13] "Body wall muscle" "Cholinergic neurons" # [15] "Distal tip cells" "Other interneurons" # [17] "GABAergic neurons" "Am/PH sheath cells" # [19] "Pharyngeal muscle" "Pharyngeal neurons" # [21] "Oxygen sensory neurons" "Somatic gonad precursors" # [23] "flp-1(+) interneurons" "Canal associated neurons" # [25] "Unclassified glia" "Pharyngeal gland" # [27] "Sex myoblasts" "Excretory cells" # [29] "Dopaminergic neurons" "Socket cells" # [31] "Rectum" colData (cds)[, 'cao_cell_type' ] %>% table # . # Am/PH sheath cells Body wall muscle Canal associated neurons # 421 10508 239 # Cholinergic neurons Ciliated sensory neurons Coelomocytes # 1015 842 1358 # Distal tip cells Dopaminergic neurons Excretory cells # 129 70 155 # Failed QC flp-1(+) interneurons GABAergic neurons # 3483 224 400 # Germline Intestinal/rectal muscle Non-seam hypodermis # 5144 338 1268 # Other interneurons Oxygen sensory neurons Pharyngeal epithelia # 443 305 747 # Pharyngeal gland Pharyngeal muscle Pharyngeal neurons # 271 332 314 # Rectum Seam cells Sex myoblasts # 121 3523 302 # Socket cells Somatic gonad precursors Touch receptor neurons # 358 355 334 # Unclassified glia Unclassified neurons Vulval precursors # 208 2639 488 get_earliest_principal_node <- function (cds, my_select= "Am/PH sheath cells" ){ cell_ids <- which ( colData (cds)[, "cao_cell_type" ] == my_select) closest_vertex <- cds @ principal_graph_aux[[ "UMAP" ]] $ pr_graph_cell_proj_closest_vertex closest_vertex <- as.matrix (closest_vertex[ colnames (cds), ]) root_pr_nodes <- igraph :: V ( principal_graph (cds)[[ "UMAP" ]]) $ name[ as.numeric (names ( which.max ( table (closest_vertex[cell_ids,]))))] root_pr_nodes } cds <- order_cells (cds, root_pr_nodes= get_earliest_principal_node (cds)) myselect <- function (cds,select.classify,my_select){ cell_ids <- which ( colData (cds)[,select.classify] == my_select) closest_vertex <- cds @ principal_graph_aux[[ "UMAP" ]] $ pr_graph_cell_proj_closest_vertex closest_vertex <- as.matrix (closest_vertex[ colnames (cds), ]) root_pr_nodes <- igraph :: V ( principal_graph (cds)[[ "UMAP" ]]) $ name[ as.numeric (names ( which.max ( table (closest_vertex[cell_ids,]))))] root_pr_nodes } cds <- order_cells (cds, root_pr_nodes= myselect (cds, select.classify = 'cao_cell_type' , my_select = "Body wall muscle" ) ) 下图可以看的出来,我们制定了“Body wall muscle”为起点后,这群细胞便是紫色的”零值” plot_cells (cds, color_cells_by = "pseudotime" , label_cell_groups= FALSE , label_leaves= FALSE , label_branch_points= FALSE , graph_label_size= 1.5 ) | plot_cells (cds, color_cells_by = "cao_cell_type" , label_cell_groups= FALSE , label_leaves= FALSE , label_branch_points= FALSE , graph_label_size= 1.5 ) 拟时序中的基因展示 sce.obj <- CreateSeuratObject ( counts = expression_matrix, meta.data = cell_metadata, assay = "RNA" ) sce.obj <- NormalizeData (sce.obj) Idents (sce.obj) <- "cao_cell_type" sce.obj.diff <- FindMarkers (sce.obj, ident.1= "Body wall muscle" , ident.2= "Seam cells" , min.pct = 0.1 , logfc.threshold = 0.25 ) cells <- rownames (sce.obj @ meta.data[ which (sce.obj @ meta.data $ cao_cell_type %in% c ( "Body wall muscle" , "Seam cells" )),]) plot_genes_in_pseudotime (cds[ rownames (sce.obj.diff)[ 1 : 5 ],cells], color_cells_by= "cao_cell_type" , min_expr= 0.5 ) 还有3D版的拟时序,来试试吧 cds_3d <- reduce_dimension (cds, max_components = 3 ) cds_3d <- cluster_cells (cds_3d) cds_3d <- learn_graph (cds_3d) cds <- order_cells (cds, root_pr_nodes= myselect (cds, select.classify = 'cao_cell_type' , my_select = "Body wall muscle" ) ) cds_3d_plot_obj <- plot_cells_3d (cds_3d, color_cells_by= "cao_cell_type" ) cds_3d_plot_obj 6、下游探索 6.1 DimPlot和FeaturePlot # 创建 Seurat 对象 sce.obj <- CreateSeuratObject ( counts = expression_matrix, meta.data = cell_metadata, assay = "RNA" ) # 添加降维坐标 # 创建一个 DimReduc 对象 umap_reduction <- CreateDimReducObject ( embeddings = reducedDims (cds) $ UMAP, key = "UMAP_" , assay = "RNA" ) sce.obj[[ "umap" ]] <- umap_reduction Idents (sce.obj) <- "cao_cell_type" sce.obj <- NormalizeData (sce.obj) sce.obj <- ScaleData (sce.obj, features= rownames (sce.obj)) # 提取monocle3拟时间 mono.info <- as.data.frame ( pseudotime (cds, reduction_method = "UMAP" )) head (mono.info) # pseudotime(cds, reduction_method = "UMAP") # cele-001-001.CATGACTCAA Inf # cele-001-001.AAGACGGCCA Inf # cele-001-001.GCCAACGCCA Inf # cele-001-001.ATAGGAGTAC 17.6783 # cele-001-001.CTCGTCTAGG Inf # cele-001-001.AAGTTGCCAT Inf # Inf替换0.01避免影响后续作图 mono.info[, 1 ][ is.infinite (mono.info[, 1 ])] <- 0.01 # 可以看出这是一个行名为细胞名,列名为各个变量信息的数据框,可以通过以下方式将这个数据框作为注释列表添加回Seurat colnames (mono.info) # [1] "pseudotime(cds, reduction_method = \"UMAP\")" # 为了避免与Seurat原有的注释重合,我们给这些变量名加一个后缀 colnames (mono.info) <- "monocle3_pseudotime" colnames (mono.info) # [1] "monocle3_pseudotime" sce.obj <- AddMetaData (sce.obj, metadata = mono.info) # 可以利用Seurat原有的降维图观察细胞类型、拟时序中state、拟时间分布:library (ggsci) DimPlot (sce.obj, group.by = 'cao_cell_type' ) | FeaturePlot (sce.obj, features = 'monocle3_pseudotime' , cols = c ( "lightgrey" , 'darkred' )) # 注意这里颜色不可以为白色 6.2 VlnPlot library (ggplot2) # VlnPlot可以用于绘制Pseudotime这样的数值型变量 color <- c ( pal_npg ( 10 ), pal_jco ( 10 ), pal_lancet ( 9 ), pal_aaas ( 10 )) VlnPlot (sce.obj, features = 'monocle3_pseudotime' , group.by = 'cao_cell_type' , pt.size = 0 , idents = c ( "Failed QC" , "Body wall muscle" , "Non-seam hypodermis" , "Somatic gonad precursors" , "Unclassified glia" ), cols= color) + theme ( axis.text.x = element_text ( angle = 60 , hjust = 1 )) | VlnPlot (sce.obj, features = 'monocle3_pseudotime' , group.by = 'cao_tissue' , pt.size = 0 , cols= color) + theme ( axis.text.x = element_text ( angle = 60 , hjust = 1 )) 6.3 ggplot2绘图 cellname <- rownames (mono.info) data4plot <- cbind (sce.obj @ meta.data, as.data.frame (sce.obj @ reductions $ umap @ cell.embeddings)[cellname,]) head (data4plot) # orig.ident nCount_RNA nFeature_RNA plate cao_cluster # cele-001-001.CATGACTCAA SeuratProject 144 89 001 20 # cele-001-001.AAGACGGCCA SeuratProject 790 419 001 6 # cele-001-001.GCCAACGCCA SeuratProject 832 338 001 13 # cele-001-001.ATAGGAGTAC SeuratProject 197 138 001 27 # cele-001-001.CTCGTCTAGG SeuratProject 475 208 001 2 # cele-001-001.AAGTTGCCAT SeuratProject 1399 679 001 6 # cao_cell_type cao_tissue # cele-001-001.CATGACTCAA Unclassified neurons Neurons # cele-001-001.AAGACGGCCA Germline Gonad # cele-001-001.GCCAACGCCA Intestinal/rectal muscle Intestinal/rectal muscle # cele-001-001.ATAGGAGTAC Vulval precursors <NA> # cele-001-001.CTCGTCTAGG Coelomocytes # cele-001-001.AAGTTGCCAT Germline Gonad # monocle3_pseudotime UMAP_1 UMAP_2 # cele-001-001.CATGACTCAA 0.0100 8.314502 -8.9886265 # cele-001-001.AAGACGGCCA 0.0100 -9.276903 3.2137904 # cele-001-001.GCCAACGCCA 0.0100 8.918166 5.5197773 # cele-001-001.ATAGGAGTAC 17.6783 -2.600855 -0.8929377 # cele-001-001.CTCGTCTAGG 0.0100 1.998563 -14.4926939 # cele-001-001.AAGTTGCCAT 0.0100 -10.160144 2.9547246 library ( "ggplot2" ) ggplot (data4plot, aes ( x= UMAP_1, y= UMAP_2)) + geom_point ( aes ( x= UMAP_1, y= UMAP_2, size= monocle3_pseudotime, colour = cao_cell_type), data = data4plot) + scale_color_manual ( values = color) + scale_size_continuous ( range = c ( 0.3 , 1.5 )) + theme_classic + theme ( axis.line = element_line ( size= 0.6 , colour = "black" , arrow = ggplot2 :: arrow ( length = ggplot2 :: unit ( 0.1 , "inches" ), ends = "last" , type = 'open' ))) 6.4 细胞比例探索 这部分内容其实我们以前出过专门的教程,详情见沉浸式统计细胞比例。
在这里也一样,我们可以去统计拟时序的state和celltype之间的关系。
6.4.1 饼图 # 统计细胞类型 mono.info.sorted <- mono.info[ order (mono.info $ monocle3_pseudotime), , drop = FALSE ] # 获取行名(细胞名) cell_names <- rownames (mono.info.sorted) # 按长度划分成三等份 n <- length (cell_names) group1 <- cell_names[ 1 : floor (n / 3 )] group2 <- cell_names[( floor (n / 3 ) + 1 ) : ( floor ( 2 * n / 3 ))] group3 <- cell_names[( floor ( 2 * n / 3 ) + 1 ) : n] mono.info.sorted[group1, "new" ] <- "group1" mono.info.sorted[group2, "new" ] <- "group2" mono.info.sorted[group3, "new" ] <- "group3" mono.info.sorted[, c ( "cell_type" , "group" )] <- sce.obj @ meta.data[ rownames (mono.info.sorted), c ( "cao_cell_type" , "cao_tissue" )] # 统计细胞类型 # 选取排名前五的细胞类型 mynames <- sort ( table (mono.info.sorted $ cell_type), decreasing = T)[ 1 : 5 ] %>% names mynames # [1] "Body wall muscle" "Germline" "Seam cells" # [4] "Failed QC" "Unclassified neurons" # 统计细胞数目 myratio <- sort ( table (mono.info.sorted $ cell_type), decreasing = T)[ 1 : 5 ] %>% as.numeric myratio # [1] 10508 5144 3523 3483 2639 # 整理细胞标签 pielabel <- paste0 (mynames, " (" , round (myratio / sum (myratio) * 100 , 2 ), "%)" ) pielabel # [1] "Body wall muscle (41.54%)" "Germline (20.33%)" # [3] "Seam cells (13.93%)" "Failed QC (13.77%)" # [5] "Unclassified neurons (10.43%)" cols <- pal_npg ( "nrc" )( 10 ) cols # [1] " " " " " " # [7] " " " " pie (myratio, labels= pielabel, radius = 1.0 , clockwise= F, main = "Celltype" , col = cols) mynames <- table (mono.info.sorted $ new) %>% names myratio <- table (mono.info.sorted $ new) %>% as.numeric pielabel <- paste0 (mynames, " (" , round (myratio / sum (myratio) * 100 , 2 ), "%)" ) pielabel # [1] "group1 (33.33%)" "group2 (33.33%)" "group3 (33.33%)" pie (myratio, labels= pielabel, radius = 1.0 , clockwise= F, main = "State" , col = cols) 6.4.2 堆积柱形图 cellnum <- table (mono.info.sorted $ cell_type,mono.info.sorted $ new)[ 1 : 10 ,] cellnum <- cellnum[ ! apply (cellnum, 1 , sum) == 0 ,] # 按列统计细胞比例 cell.prop <- as.data.frame ( prop.table (cellnum, margin= 2 )) colnames (cell.prop) <- c ( "Celltype" , "group" , "Proportion" ) head (cell.prop) # Celltype group Proportion # 1 Am/PH sheath cells group1 0.08415716 # 2 Body wall muscle group1 0.10190114 # 3 Canal associated neurons group1 0.04816223 # 4 Cholinergic neurons group1 0.18073511 # 5 Ciliated sensory neurons group1 0.16780735 # 6 Coelomocytes group1 0.26539924 p.bar <- ggplot (cell.prop, aes (group,Proportion, fill= Celltype)) + geom_bar ( stat= "identity" , position= "fill" ) + scale_fill_manual ( values= cols) + ggtitle ( "cell proportion" ) + theme_bw + theme ( axis.ticks.length= unit ( 0.5 , 'cm' )) + guides ( fill= guide_legend ( title= NULL )) p.bar 6.5 基因表达量的探索 6.5.1 在轨迹图上绘制表达量 plot_cells (cds, genes= "dmd-6" , label_cell_groups= FALSE , show_trajectory_graph= FALSE ) library (circlize) mycolor <- colorRampPalette ( colors = c ( ' , ' ))( 100 ) plot_cells (cds, genes= "dmd-6" , label_cell_groups= FALSE , show_trajectory_graph= FALSE ) + scale_colour_gradientn ( colours = mycolor, guide = "colorbar" ) 6.5.2 热图 其实热图的教程我们做过很多:热图绘制天花板:ComplexHeatmap|
一、初探 热图绘制天花板| 二、两万字长文探索注释功能 ComplexHeatmap|
三、用富集分析结果做成词云注释热图 在理解了monocle对象的数据结构和绘制热图的语法后,大家完全可以自行拼凑出下面的图,流程有些长,有点耐心哦:首先整理一下注释信息和表达矩阵 library (ComplexHeatmap) # 注意这里绘制热图如果出现报错 "Image must have at least 1 frame to write a bitmap",因为完整的矩阵是有42035个细胞,当绘制矩阵过大时候就会出现这样的情况 sce.obj_sub <- subset (sce.obj, idents= c ( "Unclassified glia" , "Somatic gonad precursors" , "Non-seam hypodermis" , "Vulval precursors" )) table ( Idents (sce.obj_sub)) # # Vulval precursors Non-seam hypodermis Somatic gonad precursors # 488 1268 355 # Unclassified glia # 208 expr4plot <- GetAssayData (sce.obj_sub, slot = "scale.data" ) mono.info <- mono.info[ rownames (mono.info) %in% colnames (expr4plot),,drop = F] mono.info <- mono.info[ order (mono.info $ monocle3_pseudotime),,drop = F] expr4plot <- expr4plot[, rownames (mono.info)] dim (mono.info) # [1] 2319 1 dim (expr4plot) # [1] 20271 2319 # 设置顶部的psedotime、celltype的注释 celltype_colors <- c ( "Non-seam hypodermis" = " , "Somatic gonad precursors" = " , "Unclassified glia" = " , "Vulval precursors" = " ) pseudotime_colors <- colorRamp2 ( range (mono.info $ monocle3_pseudotime, na.rm = TRUE ), c ( " , "green" ) # 从浅黄到红 ) top_anno <- HeatmapAnnotation ( Pseudotime= mono.info $ monocle3_pseudotime, Celltype= sce.obj_sub @ meta.data[ rownames (mono.info), "cao_cell_type" ], col = list ( Pseudotime = pseudotime_colors, Celltype = celltype_colors)) # 可以先看一下注释的效果 draw (top_anno) # 找一些感兴趣的基因集,比如各种细胞的marker,这里以每种细胞的Top20为例,实际过程中大家可以选择更多的基因,便于富集到有意义的通路:# 先展示基因来源 # sub.marker <- FindMarkers(sce.obj_sub, ident.1 = "Body wall muscle", ident.2 = "Seam cells", only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) sub.marker <- FindAllMarkers (sce.obj_sub, only.pos = TRUE , min.pct = 0.25 , logfc.threshold = 0.25 ) sub.marker.new <- sub.marker %>% group_by (cluster) %>% top_n ( n = 20 , wt = avg_log2FC) top20gene <- sub.marker.new[ which (sub.marker.new $ cluster %in% c ( "Unclassified glia" , "Somatic gonad precursors" , "Non-seam hypodermis" , "Vulval precursors" )),] left_ann <- rowAnnotation ( cellmarker= top20gene $ cluster) draw (left_ann) 来做个富集分析用于词云注释,富集分析的教程可以看这里:手把手教你做单细胞测序数据分析
(七)——基因集富集分析 library (clusterProfiler) library (org.Ce.eg.db) pathway4wc <- data.frame for (runcell in unique (top20gene $ cluster)) { try ({pathway4wc}) gene4er <- filter (top20gene,cluster == runcell) %>% pull (gene) # 转换基因格式 gene.df <- gene_annotation[gene4er, "gene_short_name" ] gores <- enrichGO (gene.df, org.Ce.eg.db, keyType = "SYMBOL" , ont = "BP" , pvalueCutoff = 0.05 , pAdjustMethod = "BH" , qvalueCutoff = 0.2 , minGSSize = 10 , maxGSSize = 500 , readable = FALSE , pool = FALSE ) go.all.res <- as.data.frame (gores @ result) pathway.dt <- top_n (go.all.res, - 10 , p.adjust) pathway.dt <- cbind (pathway.dt, data.frame ( cluster= rep (runcell, nrow (pathway.dt)))) pathway4wc <- rbind (pathway4wc,pathway.dt) } head (pathway4wc) # ID Description GeneRatio BgRatio # GO:0048513 GO:0048513 animal organ development 8/13 414/9802 # GO:0040025 GO:0040025 vulval development 6/13 175/9802 # GO:0040028 GO:0040028 regulation of vulval development 4/13 110/9802 # GO:0007399 GO:0007399 nervous system development 6/13 457/9802 # GO:0002119 GO:0002119 nematode larval development 6/13 466/9802 # GO:0002164 GO:0002164 larval development 6/13 475/9802 # pvalue p.adjust qvalue # GO:0048513 1.012260e-08 2.348443e-06 1.267989e-06 # GO:0040025 4.598766e-08 5.334569e-06 2.880280e-06 # GO:0040028 9.931212e-06 6.222359e-04 3.359622e-04 # GO:0007399 1.289353e-05 6.222359e-04 3.359622e-04 # GO:0002119 1.442073e-05 6.222359e-04 3.359622e-04 # GO:0002164 1.609231e-05 6.222359e-04 3.359622e-04 # geneID Count # GO:0048513 osm-11/cab-1/let-23/lin-12/lin-31/dex-1/sel-2/lin-39 8 # GO:0040025 osm-11/let-23/lin-12/lin-31/sel-2/lin-39 6 # GO:0040028 let-23/lin-12/sel-2/lin-39 4 # GO:0007399 cab-1/let-23/zig-4/dex-1/dlg-1/sax-3 6 # GO:0002119 osm-11/let-23/lin-12/lin-31/sel-2/lin-39 6 # GO:0002164 osm-11/let-23/lin-12/lin-31/sel-2/lin-39 6 # cluster # GO:0048513 Vulval precursors # GO:0040025 Vulval precursors # GO:0040028 Vulval precursors # GO:0007399 Vulval precursors # GO:0002119 Vulval precursors # GO:0002164 Vulval precursors # 构建词云的text # 遍历每个唯一的聚类 mytext <- list for (i in 1 : length ( unique (pathway4wc $ cluster))) { # 选择p.adjust最小的前5个通路 temp.text <- filter (pathway4wc,cluster == unique (pathway4wc $ cluster)[i]) %>% top_n ( - 1 ,p.adjust) %>% pull (Description) # 将当前聚类的通路描述信息添加到列表中 mytext[[i]] <- temp.text } # 为列表元素命名,使用聚类的唯一值 names (mytext) <- unique (pathway4wc $ cluster) # 减少部分通路 mytext[[ 4 ]] <- mytext[[ 4 ]][ 1 : 2 ] # 依据显著性设置词云大小,越显著的通路字体越大 # 创建一个空向量用于存储每个聚类的p.adjust值 myfontsize <- c # 遍历每个唯一的聚类 for (i in 1 : length ( unique (pathway4wc $ cluster))) { # 筛选出当前聚类且p.adjust小于0.05的数据 temfs <- filter (pathway4wc,cluster == unique (pathway4wc $ cluster)[i]) %>% filter (p.adjust < 0.05 ) %>% # 选择p.adjust最小的前5个通路 top_n ( - 1 ,p.adjust) %>% pull (p.adjust) # 将当前聚类的p.adjust值添加到向量中 myfontsize <- c (myfontsize,temfs) } # 加载scales包,用于数据缩放 library (scales) # 对p.adjust取负对数,然后进行缩放,使字体大小范围在0到10之间 myfontsize <- - log2 (myfontsize) %>% scales :: rescale (., to= c ( 0 : 1 )) * 10 # 创建行注释对象 # 使用anno_textbox函数绘制文本框注释 max_line_length <- 30 # 定义换行函数,保留 list 结构 wrap_list_text <- function (lst, width = max_line_length) { lapply (lst, function (vec) { sapply (vec, function (txt) { paste ( strwrap (txt, width = width), collapse = " \n " ) }) }) } # 应用函数 mytext_wrapped <- wrap_list_text (mytext) right_ann <- rowAnnotation ( textbox = anno_textbox (top20gene $ cluster, mytext_wrapped, # 允许文本换行 add_new_line = T, by = "anno_block" , # 设置文本框的样式 gp = gpar ( col = c ( pal_npg ( 10 ), pal_nejm ( 8 ), pal_aaas ( 4 )) , fontsize= myfontsize), # 设置文本框的圆角 round_corners = TRUE , # 设置文本框的最大宽度 max_width= unit ( 3 , 'mm' ))) col_fun <- colorRamp2 ( quantile (expr4plot) %>% as.numeric , c ( ' , ' , 'white' , ' , ' )) finalhmp <- Heatmap (expr4plot[top20gene $ gene,], show_column_names = F, show_row_names = F, col = col_fun, top_annotation = top_anno, cluster_columns = F, row_split = top20gene $ cluster, row_title = NULL , right_annotation = right_ann, border = T, left_annotation = left_ann ) draw (finalhmp) save.image ( "result_m3/m3.Rdata" ) Sys.time # [1] "2025-04-28 14:12:56 CST"
六、slingshot 1、简介 测试文件:请在Slingshot拟时序分析学习手册推送中获取。
2、官方教学解读 URL:https://www.bioconductor.org/packages/release/bioc/vignettes/slingshot/inst/doc/vignette.html Slingshot:单细胞数据的轨迹推断 2.1 介绍:该部分将展示完整的单细胞谱系分析工作流程,特别强调谱系重建和拟时序推断的过程。利用(Street al. 2017 )slingshot中提出的软件包。
slingshot 的目标是利用细胞簇来揭示全局结构,并将这种结构转换为由一维变量(称为 “ 伪时间 ”)表示。该部分内容提供了一些工具,用于以无监督或半监督方式学习细胞簇关系,并构建代表每条脉络的平滑曲线,以及每一步的可视化方法。2.1.1 概括 Slingshot 的输入是一个表示降维空间中单元格的矩阵和一个群组标签向量。
有了这两个输入,就可以:使用 getLineages 函数在集群上构建最小生成树(MST),从而确定全局线序结构。使用 getCurves 函数拟合同步主曲线,构建平滑的线状结构并推断伪时间变量。使用内置可视化工具评估每个步骤的输出结果。2.1.2 数据载入 接下来,我们将教学中的两个模拟数据跑一遍流程:data1:第一个数据集(称为“单轨迹”数据集)在下面生成,旨在表示单个谱系,其中三分之一的基因与转变有关。
该数据集将包含在一个SingleCellExperiment 对象(Lun and Risso 2017)中,并将用于演示完整的工作流程。
Sys.time # [1] "2025-04-28 14:12:56 CST" library (Seurat) library (SingleCellExperiment) library (tidyverse) library (RColorBrewer) # 设置工作目录 setwd ( "/data/06_scRNA-seq_psedutime/" ) # generate synthetic count data representing a single lineage means <- rbind ( # non-DE genes matrix ( rep ( rep ( c ( 0.1 , 0.5 , 1 , 2 , 3 ), each = 300 ), 100 ), ncol = 300 , byrow = TRUE ), # early deactivation matrix ( rep ( exp ( atan ( (( 300 : 1 ) - 200 ) / 50 )), 50 ), ncol = 300 , byrow = TRUE ), # late deactivation matrix ( rep ( exp ( atan ( (( 300 : 1 ) - 100 ) / 50 )), 50 ), ncol = 300 , byrow = TRUE ), # early activation matrix ( rep ( exp ( atan ( (( 1 : 300 ) - 100 ) / 50 )), 50 ), ncol = 300 , byrow = TRUE ), # late activation matrix ( rep ( exp ( atan ( (( 1 : 300 ) - 200 ) / 50 )), 50 ), ncol = 300 , byrow = TRUE ), # transient matrix ( rep ( exp ( atan ( c (( 1 : 100 ) / 33 , rep ( 3 , 100 ), ( 100 : 1 ) / 33 ) )), 50 ), ncol = 300 , byrow = TRUE ) ) counts <- apply (means, 2 , function (cell_means){ total <- rnbinom ( 1 , mu = 7500 , size = 4 ) rmultinom ( 1 , total, cell_means) }) rownames (counts) <- paste0 ( 'G' , 1 : 750 ) colnames (counts) <- paste0 ( 'c' , 1 : 300 ) sce <- SingleCellExperiment ( assays = List ( counts = counts)) head (counts)[ 1 : 5 , 1 : 10 ] # c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 # G1 0 1 1 0 1 0 1 0 0 1 # G2 3 2 0 2 6 3 1 0 4 6 # G3 3 8 1 4 10 3 1 7 6 2 # G4 7 15 1 16 14 5 3 5 17 10 # G5 12 34 5 12 21 11 10 20 25 sce # class: SingleCellExperiment # dim: 750 300 # metadata(0): # assays(1): counts # rownames(750): G1 G2 ... G749 G750 # rowData names(0): # colnames(300): c1 c2 ... c299 c300 # colData names(0): # reducedDimNames(0): # mainExpName: NULL # altExpNames(0): data2:第二个数据集(“分叉”数据集)由一个坐标矩阵(通过 PCA、ICA、扩散图等获得)以及由以下函数生成的聚类标签组成。
该数据集代表了一条分叉轨迹,它将使我们能够展示所提供的一些附加功能。
library (slingshot, quietly = FALSE ) data ( "slingshotExample" ) rd <- slingshotExample $ rd cl <- slingshotExample $ cl dim (rd) # data representing cells in a reduced dimensional space # [1] 140 2 length (cl) # [1] 140 2.2 上游分析:2.2.1 基因过滤 在单细胞系谱数据集分析时,我们首先需要降低数据的维度,而筛选掉不提供有用信息的基因是常见的第一步。
这将显著提升后续分析的速度,同时将信息损失降至最低。在基因筛选阶段,我们保留了在足够多的细胞中稳健表达的基因,这些基因至少足以构成一个细胞簇,使它们成为潜在的有趣的细胞类型标记基因。我们将最小簇大小设定为10个细胞,并定义一个基因至少有3个模拟读数为“稳健表达”。
geneFilter <- apply ( assays (sce) $ counts, 1 , function (x){ sum (x >= 3 ) >= 10 }) table (geneFilter) # geneFilter # FALSE TRUE # 21 729 sce <- sce[geneFilter, ] 2.2.2 标准化 在实践中,数据标准化可以帮助我们从数据中去除不需要的技术或生物学误差,例如批次、测序深度、细胞周期效应等。
此处,推荐使用软件包scone( Cole and Risso 2018 ),ZINB-WaVE (Risso et al. 2018 ) 在考虑技术变量的同时进行降维,而 MNN (Haghverdi et al. 2018 )在降维后校正批次效应。本教学使用的是模拟数据,因此我们无需担心批次效应或其他潜在混杂因素。因此,将继续进行全分位数标准化,可强制每个细胞具有相同的表达值分布。
FQnorm <- function (counts){ rk <- apply (counts, 2 ,rank, ties.method= 'min' ) counts.sort <- apply (counts, 2 ,sort) refdist <- apply (counts.sort, 1 ,median) norm <- apply (rk, 2 , function (r){ refdist[r] }) rownames (norm) <- rownames (counts) return (norm) } assays (sce) $ norm <- FQnorm ( assays (sce) $ counts) 2.2.3 降维 Slingshot的核心假设:转录相似的细胞在降维空间中位置接近。
为了构建谱系和测量伪时间,数据的低维表示非常关键。在实际操作中,通常采用两种降维方法:主成分分析(PCA)和均匀流形逼近与投影(UMAP)。在进行PCA时,选择不按基因的方差进行缩放,因为认为并非所有基因的信息量都是相同的,更关注那些表达强烈且变异度高的基因,而不是通过强制基因间方差相等来削弱这些信号。在绘制图表时,会设置合适的长宽比,以避免感知上的距离扭曲。
UMAP是一种较新的降维技术,它在保持全局数据结构的同时,还能揭示局部结构特征。通过uwot软件包,我们可以应用UMAP来进行降维处理。
pca <- prcomp ( t ( log1p ( assays (sce) $ norm)), scale. = FALSE ) rd1 <- pca $ x[, 1 : 2 ] plot (rd1, col = rgb ( 0 , 0 , 0 ,. 5 ), pch= 16 , asp = 1 ) library (uwot) rd2 <- uwot :: umap ( t ( log1p ( assays (sce) $ norm))) colnames (rd2) <- c ( 'UMAP1' , 'UMAP2' ) plot (rd2, col = rgb ( 0 , 0 , 0 ,. 5 ), pch= 16 , asp = 1 ) 接下来,将为对象添加两种降维操作SingleCellExperiment,但继续重点分析 PCA 结果。
reducedDims (sce) <- SimpleList ( PCA = rd1, UMAP = rd2) 2.2.4 细胞聚类 Slingshot分析的最终输入是细胞的聚类标签向量。如果未提供此信息,Slingshot将把数据视为单一聚类,并拟合一个标准的主曲线。然而,即使在预期只有单一谱系的数据集中,程序也推荐对细胞进行聚类,有助于发现新的分支事件。
这一步骤中识别的聚类将用于确定潜在谱系的全局结构(即它们的数量、它们彼此分支的时间点,以及这些分支事件的大致位置)。这与单细胞数据聚类的典型目标不同,后者旨在识别数据集中所有生物学上相关的细胞类型。在本分析中,将采用了两种聚类方法,它们都假设低维空间中的欧几里得距离反映了细胞之间的生物学差异:高斯混合模型和k均值聚类。
前者在mclust包中实现(Scrucca et al. 2016),并具有基于贝叶斯信息准则(BIC)自动确定聚类数量的方法。
library (mclust, quietly = TRUE ) cl1 <- Mclust (rd1) $ classification colData (sce) $ GMM <- cl1 library (RColorBrewer) plot (rd1, col = brewer.pal ( 9 , "Set1" )[cl1], pch= 16 , asp = 1 ) cl2 <- kmeans (rd1, centers = 4 ) $ cluster colData (sce) $ kmeans <- cl2 plot (rd1, col = brewer.pal ( 9 , "Set1" )[cl2], pch= 16 , asp = 1 ) 至此,所有的准备数据已经完成,接下来对模拟数据集运行Slingshot分析。
这个过程包括两个步骤:首先是使用基于聚类的最小生成树(MST)来识别全局谱系结构,其次是拟合同时主曲线来描述每个谱系。这两个步骤可以分别使用getLineages和getCurves函数运行,也可以通过包装函数slingshot一起运行(推荐),接下来先使用包装函数展示轨迹推断,随后展示在分叉数据集(data2)上展示各个独立函数的使用。Slingshot包装函数在一次调用中执行轨迹推断的两个步骤。
所需的输入包括一个降维后的坐标矩阵和一组聚类标签。这些可以是独立的对象,或者在单细胞测序数据的情况下,是包含在SingleCellExperiment对象中的元素。
要使用PCA降维和高斯混合模型识别的聚类标签运行Slingshot,我们将执行以下操作:sce <- slingshot (sce, clusterLabels = 'GMM' , reducedDim = 'PCA' ) 如上所述,如果没有提供聚类结果,将假定所有细胞都属于同一个聚类,并将构建一个单一曲线。如果没有提供降维结果,Slingshot将使用reducedDims返回列表的第一个元素。
输出是一个包含Slingshot结果的SingleCellExperiment对象。所有结果都存储在一个PseudotimeOrdering对象中,该对象被添加到原始对象的colData中,可以通过colData(sce)$slingshot访问。此外,所有推断出的伪时间变量(每个谱系一个)也被单独添加到colData中。
要将所有Slingshot结果提取为一个对象,我们可以使用as.PseudotimeOrdering或as.SlingshotDataSet函数,具体取决于我们想要的形式。PseudotimeOrdering对象是SummarizedExperiment对象的扩展,这是一个灵活的存储对象。SlingshotDataSet对象主要用于可视化。如下,我们用伪时间着色的点来可视化单细胞数据的推断谱系。
summary (sce $ slingPseudotime_1) # Min. 1st Qu. Median Mean 3rd Qu. Max. # 0.000 9.083 21.616 21.766 34.805 44.369 library (grDevices) colors <- colorRampPalette ( brewer.pal ( 11 , 'Spectral' )[ - 6 ])( 100 ) plotcol <- colors[ cut (sce $ slingPseudotime_1, breaks= 100 )] plot ( reducedDims (sce) $ PCA, col = plotcol, pch= 16 , asp = 1 ) lines ( SlingshotDataSet (sce), lwd= 2 , col= 'black' ) 还可以看到如何使用type参数通过基于聚类的最小生成树初步估计谱系结构。
plot ( reducedDims (sce) $ PCA, col = brewer.pal ( 9 , 'Set1' )[sce $ GMM], pch= 16 , asp = 1 ) lines ( SlingshotDataSet (sce), lwd= 2 , type = 'lineages' , col = 'black' ) 2.3 下游分析 2.3.1 识别时间动态基因 运行后slingshot,通常使用tradeSeq包对在发育过程中改变其表达的基因进行探索。
对于每个基因,使用负二项噪声分布拟合一般加性模型 (GAM),以模拟基因表达和伪时间之间的(潜在非线性)关系。然后,计算基因表达和伪时间之间的显著关联 associationTest。进一步可以根据 p 值选出最显著的基因,并用热图可视化它们在发育过程中的表达情况。这里我们使用了表达最活跃的前 250 个基因。
# BiocManager::install("tradeSeq") library (tradeSeq) sce <- fitGAM (sce) ATres <- associationTest (sce) topgenes <- rownames (ATres[ order (ATres $ pvalue), ])[ 1 : 250 ] pst.ord <- order (sce $ slingPseudotime_1, na.last = NA ) heatdata <- assays (sce) $ counts[topgenes, pst.ord] heatclus <- sce $ GMM[pst.ord] heatmap ( log1p (heatdata), Colv = NA , ColSideColors = brewer.pal ( 9 , "Set1" )[heatclus], col = colorRampPalette ( c ( " , " , " , " , " ))( 100 )) 2.4 功能详细解说 以下是对Slingshot包的详细功能说明和一些附加功能的强调:Slingshot将使用包含的slingshotExample数据集作为示例。
该数据集旨在表示降维的细胞基因表达数据,并附带了由k均值聚类生成的一组聚类标签。2.4.1 确定全局谱系结构 getLineages函数接受一个n×p的矩阵和一个长度为n的聚类结果向量作为输入。它使用最小生成树(MST)映射相邻聚类之间的连接,并识别通过这些连接的路径,这些路径代表谱系。该函数的输出是一个PseudotimeOrdering对象,其中包含输入数据以及推断出的MST(由igraph对象表示)和谱系(聚类名称的有序向量)。
这种分析通过完全无监督地进行,也可以通过指定已知的起始和终止点聚类以半监督的方式进行。如果不指定起始点,Slingshot会基于简洁性选择一个,最大化谱系分裂前共享的聚类数量。如果没有分裂或多个聚类产生相同的简洁性得分,起始聚类将被任意选择。在模拟数据的情况下,Slingshot选择聚类1作为起始聚类。然而,通常建议根据先验知识(无论是样本收集时间还是已建立的基因标记)指定一个初始聚类。
lin1 <- getLineages (rd, cl, start.clus = '1' ) lin1 # class: PseudotimeOrdering # dim: 140 2 # metadata(3): lineages mst slingParams # pathStats(2): pseudotime weights # cellnames(140): cell-1 cell-2 ... cell-139 cell-140 # cellData names(2): reducedDim clusterLabels # pathnames(2): Lineage1 Lineage2 # pathData names(0): plot (rd, col = brewer.pal ( 9 , "Set1" )[cl], asp = 1 , pch = 16 ) lines ( SlingshotDataSet (lin1), lwd = 3 , col = 'black' ) 在此步骤中,slingshot还允许指定已知端点。
在构建 MST 时,指定为终端单元状态的簇将被限制为只有一个连接。如下一个示例所示,我们将簇 3 指定为端点。
lin2 <- getLineages (rd, cl, start.clus= '1' , end.clus = '3' ) plot (rd, col = brewer.pal ( 9 , "Set1" )[cl], asp = 1 , pch = 16 ) lines ( SlingshotDataSet (lin2), lwd = 3 , col = 'black' , show.constraints = TRUE ) 这种监督方式有助于确保结果与生物学知识一致。
我们可以设置getLineages函数中的参数,实现更好的分析展示:dist.method是一个字符参数,用于指定如何计算聚类之间的距离。默认值是“slingshot”,它使用聚类中心之间的距离,并由它们的完整联合协方差矩阵进行归一化。在小聚类(细胞数量少于数据维度)存在的情况下,这将自动切换到使用对角联合协方差矩阵。
其他选项包括simple(欧几里得)距离、scaled.full(完整协方差矩阵的缩放版本)、scaled.diag(对角协方差矩阵的缩放版本)和mnn(基于相互最近邻的距离)。omega是一个粒度参数,允许用户设置连接距离的上限。它代表每个真实聚类与一个人工的.OMEGA聚类之间的距离,该聚类在拟合最小生成树(MST)后会被移除。这对于识别不属于任何谱系的离群聚类或分离不同的轨迹非常有用。这个参数可以是一个数值参数或逻辑值。
值为TRUE表示应使用启发式方法,即1.5倍的无监督MST中位数边长,这意味着允许的最大距离将是该距离的3倍。在构建MST之后,getLineages会识别通过树的路径,将其指定为谱系。在这个阶段,一个谱系将由一个有序的聚类名称集合组成,从根聚类开始,以叶聚类结束。getLineages的输出是一个PseudotimeOrdering对象,它保存了所有输入以及一系列谱系和一些关于它们如何构建的额外信息。
然后,这个对象被用作getCurves函数的输入。2.4.2 构建平滑曲线并排序细胞 为了更好地模拟细胞沿着不同谱系的发展路径,我们将使用一个名为getCurves的函数来绘制平滑的曲线 getCurves函数通过一个迭代过程来构建平滑的谱系曲线,这个过程与Hastie和Stuetzle在1989年提出的主曲线方法类似。
如果只有一个谱系,那么构建的曲线将穿过数据的中心点,类似于主曲线,但有所改进:初始曲线是基于聚类中心之间的直线连接来构建的,而不是基于数据的第一主成分。这种改进使得算法更稳定,并且通常可以加快算法的收敛速度。当存在多个谱系时,算法会增加一个步骤:在共同细胞附近对曲线进行平均处理。由于这些共同细胞尚未分化,不同的谱系在这些细胞上应该有很好的一致性。
因此,在每次迭代中,我们会对这些细胞附近的曲线进行平均,这样可以提高算法的稳定性,并生成平滑的分支谱系曲线。
crv1 <- getCurves (lin1) crv1 # class: PseudotimeOrdering # dim: 140 2 # metadata(4): lineages mst slingParams curves # pathStats(2): pseudotime weights # cellnames(140): cell-1 cell-2 ... cell-139 cell-140 # cellData names(2): reducedDim clusterLabels # pathnames(2): Lineage1 Lineage2 # pathData names(0): plot (rd, col = brewer.pal ( 9 , "Set1" )[cl], asp = 1 , pch = 16 ) lines ( SlingshotDataSet (crv1), lwd = 3 , col = 'black' ) getCurves函数的输出是一个更新后的PseudotimeOrdering对象,它包含了同时拟合的主曲线以及关于它们是如何拟合的额外信息。
slingPseudotime函数提取了一个细胞-谱系矩阵,其中包含了每个细胞沿着每个谱系的伪时间值,对于未被分配到特定谱系的细胞则标记为NA值。slingCurveWeights函数提取了一个类似的矩阵,它为每个细胞分配到一个或多个谱系的权重。我们可以使用slingCurves函数来得到曲线对象,该函数将返回一个主曲线对象列表。这些对象包含以下几个部分:s:构成曲线的点的矩阵。这些点对应于数据点的正交投影。
ord:可以用来根据细胞在曲线上的投影顺序排列细胞的索引。lambda:从曲线的起点到每个细胞投影的弧长。slingPseudotime函数返回这些值的矩阵。dist_ind:数据点与其在曲线上的投影之间的平方距离。dist:平方投影距离的总和。w:沿着这条谱系的权重向量。被明确分配到这条谱系的细胞将有一个权重为1,而被分配到其他谱系的细胞将有一个权重为0。如果细胞位于分支事件之前,它们可能对多个谱系有权重为1(或非常接近1)。
slingCurveWeights函数返回这些值的矩阵。2.4.3 在大型数据集上运行 Slingshot 对于大型的数据集,我们推荐在使用slingshot或getCurves时加入approx_points参数,以便控制曲线的精细程度,即曲线上有多少个独特的点。尽管最小生成树是基于聚类构建的,但是随着数据集变大,将所有数据点投影到曲线上的过程可能会变得计算成本很高。
因此,我们将approx_points的默认值设为150或者数据集中细胞数量的较小值,这样可以在对结果影响不大的情况下,显著降低计算成本。如果你想要得到密集的曲线,可以将approx_points设置为FALSE,这样曲线上的点数将与数据集中的细胞数相同。但请注意,这样做会使得每次迭代拟合过程中的计算复杂度增加到与细胞数量的平方成正比。此外,如果数据中存在分支谱系,这种密集的曲线还会增加曲线平均和收缩过程的复杂度。
我们建议将approx_points的值设定在100到200之间,这也是我们选择150作为默认值的原因。限制曲线上独特点的数量并不会影响到伪时间值的独特数量。即使将approx_points的值设得很低,比如5,我们仍然可以从伪时间值中观察到完整的颜色变化。
sce5 <- slingshot (sce, clusterLabels = 'GMM' , reducedDim = 'PCA' , approx_points = 5 ) colors <- colorRampPalette ( brewer.pal ( 11 , 'Spectral' )[ - 6 ])( 100 ) plotcol <- colors[ cut (sce5 $ slingPseudotime_1, breaks= 100 )] plot ( reducedDims (sce5) $ PCA, col = plotcol, pch= 16 , asp = 1 ) lines ( SlingshotDataSet (sce5), lwd= 2 , col= 'black' ) 2.4.4 多条轨迹 在某些情况下,可能需要识别多个感兴趣的轨迹。
Slingshot通过在最初的最小生成树(MST)构建中引入一个人造聚类,称为omega——来处理这个问题。这个人造聚类与每个真实聚类之间相隔一个固定的长度,这意味着任何两个真实聚类之间的最大距离是这个长度的两倍。实际上,这为MST中允许的最大边长设定了一个限制。设置omega = TRUE将采用一种经验规则,即允许的最大边长等于没有人造聚类时构建的MST的中位数边长的3倍(即:omega_scale的默认值为1.5)。
rd2 <- rbind (rd, cbind (rd[, 2 ] - 12 , rd[, 1 ] - 6 )) cl2 <- c (cl, cl + 10 ) pto2 <- slingshot (rd2, cl2, omega = TRUE , start.clus = c ( 1 , 11 )) plot (rd2, pch= 16 , asp = 1 , col = c ( brewer.pal ( 9 , "Set1" ), brewer.pal ( 8 , "Set2" ))[cl2]) lines ( SlingshotDataSet (pto2), type = 'l' , lwd= 2 , col= 'black' ) 在拟合MST后,slingshot对每条轨迹分别处理。
plot (rd2, pch= 16 , asp = 1 , col = c ( brewer.pal ( 9 , "Set1" ), brewer.pal ( 8 , "Set2" ))[cl2]) lines ( SlingshotDataSet (pto2), lwd= 2 , col= 'black' ) 2.4.5 将细胞投射到现有轨迹上 有时,可能只想使用细胞的子集来确定轨迹,或者可能会获得想要投射到现有轨迹上的新数据。
无论哪种情况,都需要一种方法来确定新细胞沿先前构建的轨迹的位置。为此,我们可以使用函数predict。
pto <- sce $ slingshot # simulate new cells in PCA space newPCA <- reducedDim (sce, 'PCA' ) + rnorm ( 2 * ncol (sce), sd = 2 ) # project onto trajectory newPTO <- slingshot :: predict (pto, newPCA) 这将产生一个新的组合对象,其中包含来自原始数据的轨迹(曲线),但包含新细胞的伪时间值和权重。
作为参考,原始细胞在下方显示为灰色。
newplotcol <- colors[ cut ( slingPseudotime (newPTO)[, 1 ], breaks= 100 )] plot ( reducedDims (sce) $ PCA, col = 'grey' , bg = 'grey' , pch= 21 , asp = 1 , xlim = range (newPCA[, 1 ]), ylim = range (newPCA[, 2 ])) lines ( SlingshotDataSet (sce), lwd= 2 , col = 'black' ) points ( slingReducedDim (newPTO), col = newplotcol, pch = 16 ) 3、外部数据实操 接下来,我们从GEO数据库下载真实单细胞转录组测序数据,并使用R语言Slingshot包进行单细胞转录组测序 拟时序分析;
全流程包括:原始数据下载———测序数据预处理———降维聚类———细胞注释———利用关键细胞进行拟时序分析。3.1 单细胞数据下载:GSE163973 该数据集主要探讨了皮肤纤维化疾病中的成纤维细胞异质性,特别是以瘢痕疙瘩(KD)为研究模型。该研究使用真皮进行scRNA-seq分析,因为瘢痕疙瘩代表皮肤真皮纤维化疾病。
经过严格的质量控制,我们获得了 40655 个细胞的转录组(瘢痕疙瘩(nsample=3):21488;正常疤痕(nsample=3):19167)。我们下载该数据集中每个样本(其中包含3个KD样本,3个NC样本)的原始文件,即barcodes.tsv、genes.tsv和matrix.mtx文件。
3.2 测序数据预处理:接下来,我们使用R语言加载数据并构建疾病组与正常组样本的Seurat对象:3.2.1 载入相关的包 if ( ! require (Seurat)) install.packages ( "Seurat" ) if ( ! require (dplyr)) install.packages ( "dplyr" ) if ( ! require (ggplot2)) install.packages ( "ggplot2" ) if ( ! require (scran)) install.packages ( "scran" ) if ( ! require (patchwork)) install.packages ( "patchwork" ) if ( ! require (tidyr)) install.packages ( "tidyr" ) if ( ! require (scater)) install.packages ( "scater" ) if ( ! require (stringr)) install.packages ( "stringr" ) if ( ! require (tibble)) install.packages ( "tibble" ) if ( ! require (SingleCellExperiment)) install.packages ( "SingleCellExperiment" ) library (hdf5r) library (Matrix) library (openxlsx) library (clustree) library (paletteer) 3.2.2 处理各组样本 # 正常样本 rm ( list= ls ) normal_scar <- Read10X ( data.dir = "data/NS1/" ) class (normal_scar) normal_scar_obj <- CreateSeuratObject ( counts = normal_scar, project = "NS1" ) normal_scar_obj[[ "percent.mt" ]] <- PercentageFeatureSet (normal_scar_obj, pattern = "^MT-" ) normal_scar_obj <- subset (normal_scar_obj, subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent.mt < 10 ) VlnPlot1 <- VlnPlot (normal_scar_obj, features = c ( "nFeature_RNA" , "nCount_RNA" , "percent.mt" ), ncol = 3 ) for (sample in c ( "NS2" , "NS3" )) { normal_scar1 <- Read10X ( data.dir = paste0 ( "data/" ,sample, "/" )) class (normal_scar1) normal_scar_obj1 <- CreateSeuratObject ( counts = normal_scar1, project = sample) normal_scar_obj1[[ "percent.mt" ]] <- PercentageFeatureSet (normal_scar_obj1, pattern = "^MT-" ) normal_scar_obj1 <- subset (normal_scar_obj1, subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent.mt < 10 ) normal_scar_obj <- merge (normal_scar_obj,normal_scar_obj1) } # saveRDS(normal_scar_obj,file = "result_slingshot/normal_Object.rds") # 疾病样本 rm ( list= ls ) keloid_data <- Read10X ( data.dir = "data/KL1/" ) keloid_obj <- CreateSeuratObject ( counts = keloid_data, project = "KL1" ) keloid_obj[[ "percent.mt" ]] <- PercentageFeatureSet (keloid_obj, pattern = "^MT-" ) keloid_obj <- subset (keloid_obj, subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent.mt < 10 ) VlnPlot1 <- VlnPlot (keloid_obj, features = c ( "nFeature_RNA" , "nCount_RNA" , "percent.mt" ), ncol = 3 ) for (sample in c ( "KL2" , "KL3" )) { keloid_data1 <- Read10X ( data.dir = paste0 ( "data/" ,sample, "/" )) class (keloid_data1) keloid_data_obj1 <- CreateSeuratObject ( counts = keloid_data1, project = sample) keloid_data_obj1[[ "percent.mt" ]] <- PercentageFeatureSet (keloid_data_obj1, pattern = "^MT-" ) keloid_data_obj1 <- subset (keloid_data_obj1, subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent.mt < 10 ) keloid_obj <- merge (keloid_obj,keloid_data_obj1) } # saveRDS(keloid_obj,file = "result_slingshot/keloid_Object.rds") 3.2.3 合并两组数据,构建seurat_object rm ( list= ls ) normal_object <- readRDS ( "result_slingshot/normal_Object.rds" ) keloid_object <- readRDS ( "result_slingshot/keloid_Object.rds" ) seurat_object <- merge (keloid_object, y = normal_object) seurat_object[[ "percent.mt" ]] <- PercentageFeatureSet (seurat_object, pattern = "^MT-" ) VlnPlot1 <- VlnPlot (seurat_object, features = c ( "nFeature_RNA" , "nCount_RNA" , "percent.mt" ), ncol = 3 ) 3.2.4 数据标准化 # 数据标准化 seurat_object <- NormalizeData (seurat_object, normalization.method = "LogNormalize" , scale.factor = 10000 ) seurat_object <- FindVariableFeatures (seurat_object, selection.method = "vst" , nfeatures = 2000 ) top10 <- head ( VariableFeatures (seurat_object), 10 ) plot1 <- VariableFeaturePlot (seurat_object) plot2 <- LabelPoints ( plot = plot1, points = top10, repel = TRUE ) plot1 + plot2 # 数据缩放,对高变基因进行标准化 seurat_object <- ScaleData (seurat_object, features = VariableFeatures (seurat_object)) # 降维 seurat_object <- RunPCA (seurat_object, features = VariableFeatures ( object = seurat_object)) dims = 1:15, cells = 1000, balanced = TRUE) DimPlot (seurat_object, reduction = 'pca' , group.by = 'orig.ident' ) seurat_object <- JackStraw (seurat_object, num.replicate = 100 ) seurat_object <- ScoreJackStraw (seurat_object, dims = 1 : 20 ) PCA_Score <- JackStrawPlot (seurat_object, dims = 1 : 20 ) 3.3 细胞聚类 seurat_object <- FindNeighbors (seurat_object, reduction = "pca" , dims = 1 : 20 ) seurat_object <- FindClusters (seurat_object, resolution = 0.3 ) # Modularity Optimizer version 1.3.0 by Ludo Waltmand Nees Jan van Eck # # Number of nodes: 41167 # Number of edges: 1367889 # # Running Louvain algorithm... # Maximum modularity in 10 random starts: 0.9481 # Number of communities: 18 # Elapsed time: 25 seconds seurat_object <- RunUMAP (seurat_object, reduction = "pca" , dims = 1 : 20 ) head ( Idents (seurat_object), 5 ) # AAACCCAAGAACAGGA-1_1_1 AAACCCAAGATCCTAC-1_1_1 AAACCCAAGCCTGACC-1_1_1 # 0 4 0 # AAACCCAAGCGCAATG-1_1_1 AAACCCACAAACTCGT-1_1_1 # 0 0 # Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 DimPlot (seurat_object, reduction = "umap" , label = TRUE , group.by = "RNA_snn_res.0.3" ) 3.3.1 细胞簇注释 seurat_object <- JoinLayers (seurat_object) seurat_object.markers <- FindAllMarkers (seurat_object, group.by = "RNA_snn_res.0.3" , min.pct = 0.5 , logfc.threshold = 0.5 , only.pos = T) top10 <- seurat_object.markers %>% group_by (cluster) %>% dplyr :: filter (avg_log2FC > 1 ) %>% slice_head ( n = 5 ) %>% ungroup mesenchymal_stem_cell <- c ( "THY1" , "TAGLN" , "COL1A2" , "DCN" , "PECAM1" , "CDH5" , "VWF" , "KRT5" , "KRT14" , "KRT1" , "CD68" , "CSF1R" , "CYBB" , "CPA3" , "HPGDS" , "TPSB2" , "MLANA" , "CD3D" , "CD3E" , "KRT19" , "KRT7" , "KRT8" ) 3.3.2 定义细胞类型 new.cluster.ids <- c ( "Endothelial cells" , "Fibroblast cells" , "Fibroblast cells" , "Epidermal stem cells" , "Fibroblast cells" , "Fibroblast cells" , "Endothelial cells" , "Epidermal stem cells" , "Macrophages" , "Fibroblast cells" , "Endothelial cells" , "Mast cells" , "T cells" , "Melanocytes" , "Endothelial cells" , "Epidermal stem cells" , "Endothelial cells" , "Keratinocyte" ) names (new.cluster.ids) <- levels (seurat_object) new.cluster.ids # 0 1 2 # "Endothelial cells" "Fibroblast cells" "Fibroblast cells" # 3 4 5 # "Epidermal stem cells" "Fibroblast cells" "Fibroblast cells" # 6 7 8 # "Endothelial cells" "Epidermal stem cells" "Macrophages" # 9 10 11 # "Fibroblast cells" "Endothelial cells" "Mast cells" # 12 13 14 # "T cells" "Melanocytes" "Endothelial cells" # 15 16 17 # "Epidermal stem cells" "Endothelial cells" "Keratinocyte" seurat_object <- RenameIdents (seurat_object,new.cluster.ids) seurat_obj_Annotated <- seurat_object # saveRDS(seurat_obj_Annotated,file = "result_slingshot/seurat_obj_Annotated.rds") 3.3.3 可视化注释结果 rm ( list= ls ) seurat_object <- readRDS ( "result_slingshot/seurat_obj_Annotated.rds" ) seurat_object <- JoinLayers (seurat_object) seurat_object.markers <- FindAllMarkers (seurat_object, min.pct = 0.5 , logfc.threshold = 1 , only.pos = T) DimPlot (seurat_object, reduction = "umap" , label = TRUE ) 3.3.4 显著表达基因可视化 mesenchymal_stem_cell <- c ( "SELE" , "DCN" , "LGALS7" , "CD86" , "TPSAB1" , "PTPRC" , "SOX10" , "DCD" ) VlnPlot (seurat_object, features = mesenchymal_stem_cell, ncol = 2 ) 3.4 关键细胞 在下游分析中,我们可以选择一些关键细胞群进行拟时序分析,关键细胞可以通过生物标志物鉴定,也可以根据现有研究来定义,在这里我们选择”Fibroblast cells”,“Endothelial cells”,“Epidermal stem cells”做进一步的研究分析:seurat_object @ meta.data $ cell_type <- seurat_object @ active.ident seurat_object $ group <- ifelse ( grepl ( "^KL" , seurat_object $ orig.ident), "KD" , ifelse ( grepl ( "^NS" , seurat_object $ orig.ident), "NC" , "unknown" )) metadata <- data.frame (seurat_object @ meta.data) unique (seurat_object $ cell_type) subset_clusters <- subset (seurat_object, idents = c ( "Fibroblast cells" , "Endothelial cells" , "Epidermal stem cells" )) # saveRDS(subset_clusters,file = "result_slingshot/subset_clusters.rds") 3.5 拟时序分析 在这里,我们基于上面筛选的关键细胞簇进行后续拟时序分析:3.5.1 载入相关包 rm ( list= ls ) library (Seurat) library (SingleCellExperiment) library (tidyverse) library (RColorBrewer) 3.5.2 可视化细胞簇 subset_clusters <- readRDS ( "result_slingshot/subset_clusters.rds" ) DimPlot (subset_clusters, reduction = "umap" , label = TRUE ) mesenchymal_stem_cell <- c ( "PECAM1" , "CDH5" , "VWF" , "THY1" , "TAGLN" , "COL1A2" , "DCN" , "KRT5" , "KRT14" , "KRT1" ) DotPlot (subset_clusters, features = mesenchymal_stem_cell) + RotatedAxis + scale_color_gradientn ( colours = c ( ' , ' , ' , ' )) seurat_object <- subset_clusters 3.5.3 安装slingshot # if (!require("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install("slingshot") 接下来,我们从现有的Seurat对象中,根据cluster_list指定的簇标识符,提取对应的细胞,并将这部分数据转换成SingleCellExperiment对象,以便进行进一步的分析或操作。
library (slingshot) cluster_list <- unique (seurat_object $ cell_type) select_cluster = as.SingleCellExperiment ( subset (seurat_object, idents = cluster_list), assay = 'RNA' ) select_cluster # class: SingleCellExperiment # dim: 33538382 # metadata(0): # assays(2): counts logcounts # rownames(33538): MIR1302-2HG FAM138A ... AC213203.1 FAM231C # rowData names(0): # colnames(38382): AAACCCAAGAACAGGA-1_1_1 AAACCCAAGATCCTAC-1_1_1 ... # TTTGTTGTCGAACCAT-1_2 TTTGTTGTCTTCTAAC-1_2 # colData names(9): orig.ident nCount_RNA ... group ident # reducedDimNames(2): PCA UMAP # mainExpName: RNA # altExpNames(0): colnames ( colData (select_cluster)) # [1] "orig.ident" "nCount_RNA" "nFeature_RNA" "percent.mt" # [5] "RNA_snn_res.0.3" "seurat_clusters" "cell_type" "group" # [9] "ident" 3.5.4 轨迹推断 sim <- slingshot (select_cluster, clusterLabels = 'cell_type' , reducedDim = 'UMAP' , start.clus= "Epidermal stem cells" , # 选择起点 end.clus = NULL # 这里我们不指定终点 ) 这里我们定义”Epidermal stem cells”为细胞发育起点,并且不规定终点,让算法进行模拟计算。
slingshot函数,用于单细胞数据分析中的谱系推断,帮助从高维数据中推测细胞发展轨迹。下面是各个参数的含义和作用:1. data 类型:数据对象,包含用于谱系推断的坐标矩阵。说明:这可以是 matrix、SingleCellExperiment、SlingshotDataSet 或 PseudotimeOrdering 类型的数据对象,通常是高维数据的降维结果(如PCA、t-SNE、UMAP等)。
2. clusterLabels 类型:向量或矩阵 说明:每个细胞的簇分配标签。如果是向量,则每个细胞对应一个标签;如果是矩阵,则为每个细胞与簇的加权关系,其中“-1”表示未分配的细胞。3. reducedDim 类型:字符或矩阵(可选) 说明:指定要使用的降维数据。如果提供了该参数,则可以指定哪一个降维结果(例如PCA、UMAP或t-SNE)用于谱系推断。
4. start.clus 类型:字符(可选) 说明:指定谱系推断的起始簇,通常是谱系树的根簇。5. end.clus 类型:字符(可选) 说明:指定强制作为叶节点的簇,通常是谱系树的终点簇。6. dist.method 类型:字符(可选) 说明:指定计算簇之间距离的方法,默认值为“slingshot”。可以使用不同的距离度量方法。
3.5.5 可视化计算结果 colnames ( colData (sim)) # [1] "orig.ident" "nCount_RNA" "nFeature_RNA" # [4] "percent.mt" "RNA_snn_res.0.3" "seurat_clusters" # [7] "cell_type" "group" "ident" # [10] "slingshot" "slingPseudotime_1" SlingshotDataSet (sim) # class: SlingshotDataSet # # Samples Dimensions # 38382 2 # # lineages: 1 # Lineage1: Epidermal stem cells Endothelial cells Fibroblast cells # # curves: 1 # Curve1: Length: 47.865 Samples: 38382 plot ( reducedDims (sim) $ UMAP, pch= 16 , asp = 1 ) lines ( SlingshotDataSet (sim), lwd= 2 , col= brewer.pal ( 9 , "Set1" )) legend ( "right" , legend = paste0 ( "lineage" , 1 ), col = unique ( brewer.pal ( 6 , "Set1" )), inset= 0.8 , pch = 16 ) 美化图形,添加渐变色。
colors <- colorRampPalette ( brewer.pal ( 11 , 'Spectral' )[ - 6 ])( 100 ) plotcol <- colors[ cut (sim $ slingPseudotime_1, breaks= 100 )] plotcol[ is.na (plotcol)] <- "lightgrey" plot ( reducedDims (sim) $ UMAP, col = plotcol, pch= 16 , asp = 1 ) lines ( SlingshotDataSet (sim), lwd= 2 , col= brewer.pal ( 9 , "Set1" )) legend ( "right" , legend = paste0 ( "lineage" , 1 ), col = unique ( brewer.pal ( 6 , "Set1" )), inset= 0.8 , pch = 16 ) 从模拟计算结果我们可以看到,3种关键细胞,由”Epidermal stem cells”向”Endothelial cells”,再向”Fibroblast cells”分化。
3.6 探究分化过程中的基因表达 3.6.1 tradeSeq 提供了一种灵活的回归模型拟合方法,可用于查找沿轨迹中的一个或多个谱系差异表达的基因。基于拟合模型,发现表达与伪时间相关的基因,或沿轨迹差异表达(在特定区域)的基因。它为每个基因拟合一个负二项式广义加性模型 (GAM),并对 GAM 的参数进行推理。
# if (!require("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install("tradeSeq") 3.6.2 数据准备 counts <- sim @ assays @ data $ counts crv <- SlingshotDataSet (sim) counts:从模拟数据对象sim中提取基因表达矩阵(计数数据)。
sim@assays @data$counts提取计数矩阵,每行代表一个基因,每列代表一个细胞。crv:将sim转换为SlingshotDataSet对象。Slingshot是一种分析单细胞发育轨迹的方法,SlingshotDataSet对象描述了细胞间的发育轨迹及其分支结构。
3.6.3 确定最佳模型的参数 k set.seed ( 1234 ) icMat <- evaluateK ( counts = counts, sds = crv, k = 3 : 5 , nGenes = 20 , # 这里只做演示,实际计算请设置200-500 verbose = T) evaluateK:评估不同模型中参数k的最佳值,k表示基因表达模型中的平滑参数数量。参数说明:counts:输入的基因表达矩阵,用于拟合模型。
sds:发育轨迹数据(SlingshotDataSet对象),定义了细胞的轨迹和分支信息。k = 3:10:尝试从3到10个不同的平滑参数数量。nGenes = 200:随机选择200个基因进行评估,加快分析速度。verbose = T:显示详细的日志信息,方便跟踪运行状态。输出:icMat:存储不同k值下模型的评估结果,通常基于信息准则(如AIC、BIC)选择最优k值。
这一计算时间较长,结果展示如下:下图中,我们需要找到快速下降曲线的转折点;如果这个点小于6,则选择对应的数字;如果这个点大于等于6,则选择6。
这里我们选择 nknots = 6,进行下一步计算:内容解读可参考:https://statomics.github.io/tradeSeq/articles/tradeSeq.html 3.6.4 计算伪时间 pseudotime <- slingPseudotime (crv, na = FALSE ) slingPseudotime(crv, na = FALSE):计算每个细胞的伪时间值。
crv 是一个 SlingshotDataSet 对象,包含了细胞的发育轨迹和分支结构。伪时间(Pseudotime) 是一种用于量化细胞在发育过程中所处“时间”点的概念。尽管我们没有真实的时间轴,伪时间可以帮助揭示细胞的发育或分化状态。该函数计算每个细胞沿轨迹的相对位置(伪时间),并将这些值存储在 pseudotime 中。na = FALSE:表示如果在计算过程中有缺失值,是否应跳过这些细胞。
设为FALSE时,缺失值将被处理(默认行为)。3.6.5 计算细胞权重 cellWeights <- slingCurveWeights (crv) slingCurveWeights(crv):计算每个细胞在各个轨迹曲线上的权重。在发育轨迹分析中,细胞的表达模式可以影响它们在特定轨迹上的位置。该函数计算每个细胞在各个发育轨迹(或曲线)上的“权重”,即它们在不同轨迹分支的概率分配。
3.6.6 拟合广义加性模型(GAM) top_genes <- head ( order ( rowVars (counts), decreasing = TRUE ), 200 ) counts_subset <- counts[top_genes, ] # 约7min system.time ({ sce <- fitGAM ( counts = counts_subset, pseudotime = pseudotime, cellWeights = cellWeights, nknots = 6 , verbose = FALSE ) }) # user system elapsed # 393.259 66.290 473.043 fitGAM:拟合一个广义加性模型(GAM)来描述基因在伪时间中的表达变化。
counts:基因表达矩阵,其中行是基因,列是细胞。pseudotime:之前计算的每个细胞的伪时间值。它提供了每个细胞在发育轨迹中的相对时间点,作为GAM拟合的一个自变量。cellWeights:每个细胞在不同轨迹曲线上的权重,作为GAM拟合中的额外信息,用于提升模型的准确性。nknots = 6:指定在拟合GAM时使用的“结点”数量。结点数量影响模型的灵活性,通常越多结点,模型越灵活,但也可能导致过拟合。这里指定了6个结点。
3.6.7 基因表达与拟时序的相关性 assoRes <- associationTest (sce) head (assoRes) # waldStat df pvalue meanLogFC # COL1A1 99151.88 5 0 2.1863965 # MALAT1 10867.17 5 0 0.8055341 # COL1A2 107704.27 5 0 2.1889041 # COL3A1 84147.77 5 0 2.1775600 # RPLP1 50220.89 5 0 0.7993864 # RPS18 27798.16 5 0 0.6785392 3.6.8 起止点相关性最高的基因 startRes <- startVsEndTest (sce) head (startRes) # waldStat df pvalue logFClineage1 # COL1A1 29795.940 1 0 8.5624909 # MALAT1 4889.731 1 0 1.5013682 # COL1A2 30976.256 1 0 8.1200847 # COL3A1 24290.809 1 0 8.0383031 # RPLP1 5866.138 1 0 -0.6587992 # RPS18 8062.045 1 0 -0.7984236 startVsEndTest 是一个用于评估单细胞转录组数据中发育轨迹(lineage)起始点(start)和终止点(end)之间差异表达的函数。
该函数的主要目的是通过比较发育轨迹起始点与终止点的基因表达差异,来识别在发育过程中可能有重要作用的基因。startVsEndTest 的结果是一个包含差异表达基因(DEGs)分析的输出,通常包括统计量(如Wald统计量)、自由度、p值、对数变化值(logFC)等。你提供的输出是一个数据框,其中包含以下列:waldStat:Wald统计量用于检验基因在起始点和终止点之间是否存在显著的表达变化。
它是一个用于假设检验的统计量,通常用于广义线性模型中。数值越大,表示该基因在轨迹的起始和终止点之间的差异越显著。df:这是自由度(degrees of freedom),通常用于检验统计显著性的分布。由于每个基因的表达模型是基于伪时间拟合的,因此它的自由度是1,表示每个基因有一个独立的估计参数(即线性模型中的斜率)。pvalue:p值用于检验基因表达的差异是否显著。p值越小,表示该基因在起始点和终止点之间的表达差异越显著。
logFClineage1:logFClineage1 是基因在发育轨迹的第1条线路(lineage 1)中起始和终止点之间的对数变化(log2 fold change)。这是基因在起始点和终止点之间表达变化的对数变动。
3.6.9 按照相关性排序 oStart <- order (startRes $ waldStat, decreasing = TRUE ) # 挑选相关性最强的基因,并可视化 sigGeneStart <- names (sce)[oStart[ 1 ]] # > sigGeneStart # [1] "DCN" plotSmoothers (sce, counts, gene = sigGeneStart) 在UMAP图中展示基因的差异表达:plotGeneCount (crv, counts, gene = sigGeneStart) 3.6.10 使用ggplots可视化基因表达趋势 coldata <- data.frame ( celltype = sim @ colData $ cell_type, plotcol = sim @ colData $ group) rownames (coldata) = colnames (sim) # sce中对应信息取出 filter_coldata <- coldata[ colnames (sce),] # 添加拟时序信息 filter_coldata $ Pseudotime = sce $ crv $ pseudotime.Lineage1 # top6 genes top6 <- names (sce)[oStart[ 1 : 6 ]] top6_exp = sce @ assays @ data $ counts[top6,] top6_exp = log2 (top6_exp + 1 ) %>% t # 获得最终的清洁数据 plt_data = cbind (filter_coldata, top6_exp) colnames (plt_data) # [1] "celltype" "plotcol" "Pseudotime" "CD63" "SPARC" # [6] "S100A6" "VIM" "COL1A2" "LGALS1" 3.6.11 绘图 library (ggplot2) library (ggsci) library (ggpubr) library (RColorBrewer) getPalette = colorRampPalette ( brewer.pal ( 12 , "Paired" )) mycolors = getPalette ( length ( unique (plt_data $ celltype))) plt_list = list for (gene in top6) { print (gene) p = ggscatter ( data = plt_data, x = 'Pseudotime' , y = gene, color = 'celltype' , size = 0.6 ) + geom_smooth ( se = F, color = 'orange' ) + theme_bw + scale_color_manual ( values = mycolors) + theme ( legend.position = 'none' ) plt_list[[gene]] = p } # [1] "CD63" # [1] "SPARC" # [1] "S100A6" # [1] "VIM" # [1] "COL1A2" # [1] "LGALS1" library (patchwork) wrap_plots (plt_list) ggsave ( filename = 'result_slingshot/top6_genes.pdf' , width = 9 , height = 5 ) 4、下游探索 4.1 DimPlot和FeaturePlot # 提取拟时间 sl.info <- as.data.frame ( slingPseudotime (sim)) head (sl.info) # Lineage1 # AAACCCAAGAACAGGA-1_1_1 17.469736 # AAACCCAAGATCCTAC-1_1_1 43.886954 # AAACCCAAGCCTGACC-1_1_1 17.944789 # AAACCCAAGCGCAATG-1_1_1 21.178838 # AAACCCACAAACTCGT-1_1_1 20.885255 # AAACCCACAAGACGGT-1_1_1 1.142355 # 为了避免与Seurat原有的注释重合,我们给这些变量名加一个后缀 colnames (sl.info) <- "slingshot_pseudotime" colnames (sl.info) # [1] "slingshot_pseudotime" seurat_object <- AddMetaData (seurat_object, metadata = sl.info) # 可以利用Seurat原有的降维图观察细胞类型、拟时序中state、拟时间分布:library (ggsci) color <- c ( pal_npg ( 10 ), pal_jco ( 10 ), pal_lancet ( 9 ), pal_aaas ( 10 )) DimPlot (seurat_object, group.by = 'cell_type' , cols = color) | FeaturePlot (seurat_object, features = 'slingshot_pseudotime' , cols = c ( "lightgrey" , 'darkred' )) 4.2 VlnPlot library (ggplot2) # VlnPlot可以用于绘制Pseudotime这样的数值型变量 VlnPlot (seurat_object, features = 'slingshot_pseudotime' , group.by = 'cell_type' , pt.size = 0 ,, cols= color) + theme ( axis.text.x = element_text ( angle = 40 , hjust = 1 )) | VlnPlot (seurat_object, features = 'slingshot_pseudotime' , group.by = 'group' , pt.size = 0 , cols= color) + theme ( axis.text.x = element_text ( angle = 40 , hjust = 1 )) 4.3 ggplot2绘图 cellname <- rownames (sl.info) data4plot <- cbind (seurat_object @ meta.data, as.data.frame (seurat_object @ reductions $ umap @ cell.embeddings)[cellname,]) head (data4plot) # orig.ident nCount_RNA nFeature_RNA percent.mt # AAACCCAAGAACAGGA-1_1_1 KL1 8791 2335 3.810716 # AAACCCAAGATCCTAC-1_1_1 KL1 15880 3589 3.425693 # AAACCCAAGCCTGACC-1_1_1 KL1 12648 2989 3.913662 # AAACCCAAGCGCAATG-1_1_1 KL1 7076 1932 7.405314 # AAACCCACAAACTCGT-1_1_1 KL1 5235 1994 5.386819 # AAACCCACAAGACGGT-1_1_1 KL1 15944 2853 3.712995 # RNA_snn_res.0.3 seurat_clusters cell_type # AAACCCAAGAACAGGA-1_1_1 0 0 Endothelial cells # AAACCCAAGATCCTAC-1_1_1 4 4 Fibroblast cells # AAACCCAAGCCTGACC-1_1_1 0 0 Endothelial cells # AAACCCAAGCGCAATG-1_1_1 0 0 Endothelial cells # AAACCCACAAACTCGT-1_1_1 0 0 Endothelial cells # AAACCCACAAGACGGT-1_1_1 3 3 Epidermal stem cells # group slingshot_pseudotime umap_1 umap_2 # AAACCCAAGAACAGGA-1_1_1 KD 17.469736 2.496052 -9.570583 # AAACCCAAGATCCTAC-1_1_1 KD 43.886954 -2.775920 9.123234 # AAACCCAAGCCTGACC-1_1_1 KD 17.944789 2.970328 -8.379124 # AAACCCAAGCGCAATG-1_1_1 KD 21.178838 7.411248 -8.299144 # AAACCCACAAACTCGT-1_1_1 KD 20.885255 6.304723 -7.642037 # AAACCCACAAGACGGT-1_1_1 KD 1.142355 -13.288835 -3.727686 library ( "ggplot2" ) ggplot (data4plot, aes ( x= umap_1, y= umap_2)) + geom_point ( aes ( x= umap_1, y= umap_2, size= slingshot_pseudotime, colour = cell_type), data = data4plot) + scale_color_manual ( values = color) + scale_size_continuous ( range = c ( 0.3 , 1.5 )) + theme_classic + theme ( axis.line = element_line ( size= 0.6 , colour = "black" , arrow = ggplot2 :: arrow ( length = ggplot2 :: unit ( 0.1 , "inches" ), ends = "last" , type = 'open' ))) 4.4 细胞比例探索 这部分内容其实我们以前出过专门的教程,详情见沉浸式统计细胞比例。
在这里也一样,我们可以去统计拟时序的state和celltype之间的关系。
4.4.1 饼图 # 统计细胞类型 sl.info.sorted <- sl.info[ order (sl.info $ slingshot_pseudotime), , drop = FALSE ] # 获取行名(细胞名) cell_names <- rownames (sl.info.sorted) # 按长度划分成三等份 n <- length (cell_names) group1 <- cell_names[ 1 : floor (n / 3 )] group2 <- cell_names[( floor (n / 3 ) + 1 ) : ( floor ( 2 * n / 3 ))] group3 <- cell_names[( floor ( 2 * n / 3 ) + 1 ) : n] sl.info.sorted[group1, "new" ] <- "group1" sl.info.sorted[group2, "new" ] <- "group2" sl.info.sorted[group3, "new" ] <- "group3" sl.info.sorted[, c ( "cell_type" , "group" )] <- seurat_object @ meta.data[ rownames (sl.info.sorted), c ( "cell_type" , "group" )] sl.info.sorted $ cell_type <- factor (sl.info.sorted $ cell_type, levels = c ( "Epidermal stem cells" , "Endothelial cells" , "Fibroblast cells" )) mynames <- table (sl.info.sorted $ cell_type) %>% names mynames # [1] "Epidermal stem cells" "Endothelial cells" "Fibroblast cells" # 统计细胞数目 myratio <- table (sl.info.sorted $ cell_type) %>% as.numeric myratio # [1] 6017 13975 18390 # 整理细胞标签 pielabel <- paste0 (mynames, " (" , round (myratio / sum (myratio) * 100 , 2 ), "%)" ) pielabel # [1] "Epidermal stem cells (15.68%)" "Endothelial cells (36.41%)" # [3] "Fibroblast cells (47.91%)" cols <- pal_npg ( "nrc" )( 10 ) cols # [1] " " " " " " # [7] " " " " pie (myratio, labels= pielabel, radius = 1.0 , clockwise= F, main = "Celltype" , col = cols) mynames <- table (sl.info.sorted $ new) %>% names myratio <- table (sl.info.sorted $ new) %>% as.numeric pielabel <- paste0 (mynames, " (" , round (myratio / sum (myratio) * 100 , 2 ), "%)" ) pielabel # [1] "group1 (33.33%)" "group2 (33.33%)" "group3 (33.33%)" pie (myratio, labels= pielabel, radius = 1.0 , clockwise= F, main = "State" , col = cols) 4.4.2 堆积柱形图 cellnum <- table (sl.info.sorted $ cell_type,sl.info.sorted $ new) cellnum <- cellnum[ ! apply (cellnum, 1 , sum) == 0 ,] # 按列统计细胞比例 cell.prop <- as.data.frame ( prop.table (cellnum, margin= 2 )) colnames (cell.prop) <- c ( "Celltype" , "group" , "Proportion" ) head (cell.prop) # Celltype group Proportion # 1 Epidermal stem cells group1 0.4702204158 # 2 Endothelial cells group1 0.5292324527 # 3 Fibroblast cells group1 0.0005471315 # 4 Epidermal stem cells group2 0.0000000000 # 5 Endothelial cells group2 0.5399405972 # 6 Fibroblast cells group2 0.4600594028 p.bar <- ggplot (cell.prop, aes (group,Proportion, fill= Celltype)) + geom_bar ( stat= "identity" , position= "fill" ) + scale_fill_manual ( values= cols) + ggtitle ( "cell proportion" ) + theme_bw + theme ( axis.ticks.length= unit ( 0.5 , 'cm' )) + guides ( fill= guide_legend ( title= NULL )) p.bar 4.4.3 累计密度分布图 densyplot <- ggplot (sl.info.sorted, aes ( x= slingshot_pseudotime, group= cell_type, fill= cell_type)) + geom_density ( alpha= 0.3 , adjust= 1.5 ) + scale_fill_manual ( values = c ( pal_npg ( "nrc" )( 10 )) ) + theme_bw densyplot 4.5 基因表达量的探索 4.5.1 在轨迹图上绘制表达量 # 用所有的基因Scale,便于后续提取矩阵基因都在 seurat_object <- ScaleData (seurat_object, features = rownames (seurat_object)) plotGeneCount (sim, counts, gene = "CD63" ) library (circlize) mycolor <- colorRampPalette ( colors = c ( ' , 'white' , ' ))( 100 ) plotGeneCount (sim, counts, gene = "CD63" ) + scale_colour_gradientn ( colours = mycolor, guide = "colorbar" ) 4.5.2 热图 其实热图的教程我们做过很多:热图绘制天花板:ComplexHeatmap|
一、初探 热图绘制天花板| 二、两万字长文探索注释功能 ComplexHeatmap|
三、用富集分析结果做成词云注释热图 在理解了monocle对象的数据结构和绘制热图的语法后,大家完全可以自行拼凑出下面的图,流程有些长,有点耐心哦:首先整理一下注释信息和表达矩阵 library (ComplexHeatmap) # 注意这里绘制热图如果出现报错 "Image must have at least 1 frame to write a bitmap",因为完整的矩阵是有3w多个细胞,当绘制矩阵过大时候就会出现这样的情况 sce.obj_sub <- subset (seurat_object, downsample= 500 ) expr4plot <- GetAssayData (sce.obj_sub, slot = "scale.data" ) sl.info <- sl.info[ rownames (sl.info) %in% colnames (expr4plot),,drop = F] sl.info <- sl.info[ order (sl.info $ slingshot_pseudotime),,drop = F] expr4plot <- expr4plot[, rownames (sl.info)] dim (sl.info) # [1] 1500 1 dim (expr4plot) # [1] 33538 1500 # 设置顶部的psedotime、celltype的注释 celltype_colors <- c ( "Endothelial cells" = " , "Fibroblast cells" = " , "Epidermal stem cells" = " ) pseudotime_colors <- colorRamp2 ( range (sl.info $ slingshot_pseudotime, na.rm = TRUE ), c ( " , "green" ) # 从浅黄到红 ) top_anno <- HeatmapAnnotation ( Pseudotime= sl.info $ slingshot_pseudotime, Celltype= sce.obj_sub @ meta.data[ rownames (sl.info), "cell_type" ], col = list ( Pseudotime = pseudotime_colors, Celltype = celltype_colors)) # 可以先看一下注释的效果 draw (top_anno) # 找一些感兴趣的基因集,比如各种细胞的marker,这里以每种细胞的Top20为例,实际过程中大家可以选择更多的基因,便于富集到有意义的通路:# 先展示基因来源 sub.marker <- FindAllMarkers (sce.obj_sub, only.pos = TRUE , min.pct = 0.25 , logfc.threshold = 0.25 ) top20gene <- sub.marker %>% group_by (cluster) %>% top_n ( n = 20 , wt = avg_log2FC) left_ann <- rowAnnotation ( cellmarker= top20gene $ cluster) draw (left_ann) 来做个富集分析用于词云注释,富集分析的教程可以看这里:手把手教你做单细胞测序数据分析
(七)——基因集富集分析 library (clusterProfiler) library (org.Hs.eg.db) pathway4wc <- data.frame for (runcell in unique (top20gene $ cluster)) { try ({pathway4wc}) gene.df <- filter (top20gene,cluster == runcell) %>% pull (gene) gores <- enrichGO (gene.df, org.Hs.eg.db, keyType = "SYMBOL" , ont = "BP" , pvalueCutoff = 0.05 , pAdjustMethod = "BH" , qvalueCutoff = 0.2 , minGSSize = 10 , maxGSSize = 500 , readable = FALSE , pool = FALSE ) go.all.res <- as.data.frame (gores @ result) pathway.dt <- top_n (go.all.res, - 10 , p.adjust) pathway.dt <- cbind (pathway.dt, data.frame ( cluster= rep (runcell, nrow (pathway.dt)))) pathway4wc <- rbind (pathway4wc,pathway.dt) } head (pathway4wc) # ID Description GeneRatio BgRatio # GO:0045446 GO:0045446 endothelial cell differentiation 5/15 121/18903 # GO:0003158 GO:0003158 endothelium development 5/15 139/18903 # GO:0061028 GO:0061028 establishment of endothelial barrier 3/15 48/18903 # GO:0043114 GO:0043114 regulation of vascular permeability 3/15 49/18903 # GO:0001885 GO:0001885 endothelial cell development 3/15 65/18903 # GO:0002064 GO:0002064 epithelial cell development 4/15 213/18903 # pvalue p.adjust qvalue geneID # GO:0045446 2.821312e-08 1.704073e-05 9.503368e-06 PECAM1/CLDN5/SOX17/KDR/ROBO4 # GO:0003158 5.661023e-08 1.709629e-05 9.534354e-06 PECAM1/CLDN5/SOX17/KDR/ROBO4 # GO:0061028 6.843439e-06 1.100228e-03 6.135812e-04 PECAM1/CLDN5/ROBO4 # GO:0043114 7.286277e-06 1.100228e-03 6.135812e-04 CLDN5/PLVAP/TACR1 # GO:0001885 1.714324e-05 1.953748e-03 1.089577e-03 PECAM1/CLDN5/ROBO4 # GO:0002064 1.940809e-05 1.953748e-03 1.089577e-03 PECAM1/CLDN5/KDR/ROBO4 # Count cluster # GO:0045446 5 Endothelial cells # GO:0003158 5 Endothelial cells # GO:0061028 3 Endothelial cells # GO:0043114 3 Endothelial cells # GO:0001885 3 Endothelial cells # GO:0002064 4 Endothelial cells # 构建词云的text # 遍历每个唯一的聚类 mytext <- list for (i in 1 : length ( unique (pathway4wc $ cluster))) { # 选择p.adjust最小的前5个通路 temp.text <- filter (pathway4wc,cluster == unique (pathway4wc $ cluster)[i]) %>% top_n ( - 2 ,p.adjust) %>% pull (Description) # 将当前聚类的通路描述信息添加到列表中 mytext[[i]] <- temp.text } # 为列表元素命名,使用聚类的唯一值 names (mytext) <- unique (pathway4wc $ cluster) # 依据显著性设置词云大小,越显著的通路字体越大 # 创建一个空向量用于存储每个聚类的p.adjust值 myfontsize <- c # 遍历每个唯一的聚类 for (i in 1 : length ( unique (pathway4wc $ cluster))) { # 筛选出当前聚类且p.adjust小于0.05的数据 temfs <- filter (pathway4wc,cluster == unique (pathway4wc $ cluster)[i]) %>% filter (p.adjust < 0.05 ) %>% # 选择p.adjust最小的前5个通路 top_n ( - 2 ,p.adjust) %>% pull (p.adjust) # 将当前聚类的p.adjust值添加到向量中 myfontsize <- c (myfontsize,temfs) } # 加载scales包,用于数据缩放 library (scales) # 对p.adjust取负对数,然后进行缩放,使字体大小范围在0到10之间 myfontsize <- - log2 (myfontsize) %>% scales :: rescale (., to= c ( 0 : 1 )) * 10 # 创建行注释对象 # 使用anno_textbox函数绘制文本框注释 max_line_length <- 30 # 定义换行函数,保留 list 结构 wrap_list_text <- function (lst, width = max_line_length) { lapply (lst, function (vec) { sapply (vec, function (txt) { paste ( strwrap (txt, width = width), collapse = " \n " ) }) }) } # 应用函数 mytext_wrapped <- wrap_list_text (mytext) right_ann <- rowAnnotation ( textbox = anno_textbox (top20gene $ cluster, mytext_wrapped, # 允许文本换行 add_new_line = T, by = "anno_block" , # 设置文本框的样式 gp = gpar ( col = c ( pal_npg ( 10 ), pal_nejm ( 8 ), pal_aaas ( 4 )) , fontsize= myfontsize), # 设置文本框的圆角 round_corners = TRUE , # 设置文本框的最大宽度 max_width= unit ( 3 , 'mm' ))) col_fun <- colorRamp2 ( quantile (expr4plot) %>% as.numeric , c ( ' , ' , 'white' , ' , ' )) finalhmp <- Heatmap (expr4plot[top20gene $ gene,], show_column_names = F, show_row_names = F, col = col_fun, top_annotation = top_anno, cluster_columns = F, row_split = top20gene $ cluster, row_title = NULL , right_annotation = right_ann, border = T, left_annotation = left_ann ) draw (finalhmp) save.image ( "result_slingshot/slingshot.Rdata" ) Sys.time # [1] "2025-04-28 15:15:31 CST"
七、velocyto 1、RNAvelocity原文解读 本期文献来源于2018发表在Nature上的《RNA velocity of single cells》。1.1 摘要 RNA丰度是单个细胞状态的一个强有力的指标。目前,单细胞RNA测序能够以高定量精度、灵敏度和通量揭示RNA丰度。然而,目前这种方法只能捕获到某一时间点的静态快照,这给诸如胚胎发生或组织再生等时间动态变化研究带来了挑战。
因此,作者提出了RNA速率的概念(基因表达状态的时间导数),可以通过区分常见单细胞RNA测序中的未剪接和已剪接mRNA直接估算。RNA速率是一个高维向量,能够在数小时的时间尺度上预测单个细胞的未来状态。作者在神经嵴谱系中验证了其准确性,在多个已发表的数据集和技术平台上展示了其用途。
1.2 结果 1.2.1 提出转录动力学模型 1.2.1.1 主要内容 在发育过程中,分化的时间跨度为数小时至数天,这与信使核糖核酸(mRNA)的典型半衰期相当。新生成(未剪接)和成熟(已剪接)mRNA的相对丰度可用于估算基因剪接和降解的速率。同时,在单细胞RNA测序数据中也可能检测到类似的信号,并且这些信号能够揭示动态过程中整个转录组的变化速率和方向。
所有常见的单细胞RNA测序方案都依赖于寡聚脱氧胸苷酸(oligo-dT)引物来富集多聚腺苷酸化的mRNA分子。然而,对基于SMART-seq2、STRT/C1、inDrop和10x Chromium的单细胞RNA测序数据集进行分析,作者发现15%-25%的读段包含未剪接的内含子序列(图1a),这与之前在bulk(14.6%)和单细胞(约20%)RNA测序中的观察结果一致。大多数此类读段源自内含子区域内的次级引物结合位点。
大量内含子分子及其与外显子计数的相关性表明,这些分子代表未剪接的前体mRNA。这一点通过新转录RNA的代谢标记,随后使用寡聚脱氧胸苷酸引物的STRT技术进行RNA测序得到了证实(附录图2);83%的所有基因显示的表达时间进程与简单的一级动力学一致,正如未剪接读段代表新生mRNA时所预期的那样。
为了量化前体mRNA与成熟mRNA丰度之间随时间变化的关系,作者假设了一个简单的转录动力学模型,其中剪接mRNA丰度的一阶时间导数(RNA速率)由未剪接mRNA转化为剪接mRNA的生成与mRNA降解之间的平衡所决定(图1b)。在这样的模型下,当转录速率α恒定时,系统会渐近地趋向于稳态。此时剪接(s)和未剪接(u)分子的稳态丰度由α决定,并受固定斜率关系的约束,即 u=γs。
平衡斜率γ结合了降解和剪接速率,反映了基因特异性的调控特性、内含子和外显子长度的比例以及内部启动位点的数量。对最近发表的一份小鼠组织进行研究,发现大多数基因在广泛细胞类型中的稳态行为与单一固定斜率γ一致(附录图3a-c)。然而,11%的基因在不同组织亚组中表现出不同的斜率(附录图3d-e),这表明存在组织特异性的可变剪接(附录图3f)或降解速率。
在动态过程中,转录速率α的增加会导致未剪接的mRNA迅速增加,随后剪接的mRNA也会随之增加(图1c),直至达到新的稳态。相反,转录速率的下降首先会导致未剪接的mRNA迅速减少,随后剪接的mRNA也会减少。在基因表达诱导期间,未剪接的mRNA量会超出基于平衡速率γ的预期,而在基因表达抑制期间则情况相反(图1d)。因此,未剪接和剪接的mRNA丰度之间的平衡是成熟mRNA丰度未来状态的指标,从而也是细胞未来状态的指标。
为了说明这样一个简单的模型能够用于预测未来成熟mRNA的丰度,作者研究了小鼠肝脏昼夜节律周期的总RNA测序的时间进程数据。在每个时间点,未剪接的mRNA水平始终与下一个时间点的剪接mRNA更为相似(图1e),并且许多与昼夜节律相关的基因在上调期间未剪接mRNA相对于斜率γ的含量高于预期,在下调期间则低于预期(图1f-g)。
通过求解每个基因的所提出的微分方程,作者能够将每个测量值外推到整个昼夜节律周期,准确地捕捉到预期的昼夜节律周期的进展方向(图1h)。图1: 附录图2: 附录图3: 1.2.2 小鼠嗜铬细胞验证 1.2.2.1 主要内容 为了展示在单细胞测序中预测转录动态的能力,作者分析了基于SMART-seq2技术测序得到的小鼠嗜铬细胞(chromaffin)单细胞mRNA测序数据。
在发育过程中,肾上腺髓质的嗜铬细胞(一种神经内分泌细胞)中,存在相当一部分是由Schwann细胞前体衍生而来。这便提供了一个便捷的测试案例,可以通过谱系追踪来验证分化的方向。基因的相位图展示了与预测稳态关系的预期偏差(图2b-c)。单个细胞的RNA速率估计值准确地概括了该数据集中的转录动力学,包括分化细胞向嗜铬细胞的整体移动,以及向中间分化状态的移动和远离中间分化状态的变化(图2d)。
为了适应那些处于稳态之外很远的基因,作者还开发了一种基于基因结构的替代拟合方法(附录图4)。有多种技术可用于在低维空间中可视化速率估计值。观察到的和外推的细胞状态可以共同嵌入到一个共同的低维空间中(例如图2d中的主成分分析)。或者,速率也可以根据外推状态与局部邻域中其他细胞的相似性投影到现有的低维嵌入中如tSNE(图2h)。在大型数据集中,使用局部平均向量场更容易可视化细胞速率的普遍模式(图2i)。
由于细胞可以同时在多个独立的成分上具有RNA速率,例如分化、成熟和增殖,所以在解读低维表示时必须谨慎,因为在一个特定嵌入中缺乏明显速率的细胞,在未可视化的一些子空间中仍可能具有相当大的速率。细胞特异性RNA速率估计为定量建模细胞命运提供了自然基础。代谢标记显示,对于大多数基因而言,在10-100分钟后,剪接/未剪接比率的变化即可被检测到(附录图2)。然而,外推的有效时间尺度取决于所分析的生物学过程。
基于对嗜铬细胞祖细胞进行的EdU脉冲标记作者估计能够对未来2.5-3.8小时的情况进行外推(图2f,g),这也与能够分辨细胞周期事件的能力相一致。不过,鉴于外推的线性性质,这种预测时间尺度将取决于基因表达轨迹的形状(即表达流形的曲率)。为了展示作者方法的通用性,作者分析了使用其他单细胞RNA测序流程生成的数据。
作者观察了小鼠骨髓中中性粒细胞成熟的转录动态,以及使用inDrop技术测序的小鼠皮质中光诱导的神经元激活(附录图5),以及使用10x Genomics Chromium测量的小鼠肠道上皮(附录图6)、少突胶质细胞分化(附录图7)和海马体发育(见下文)。RNA速率的估计值对模型参数的变化、基因和细胞的抽样都很稳健,其中最敏感的参数是在预定义嵌入中可视化速率时所用邻域的大小。
大多数基因的速率估计值与实验观察到的表达导数呈正相关(附录图8),这证实了速率向量是有信息量的。在特定情况下出现的失败是由于几个明显的原因,包括仅在远离平衡态时观察到的基因、非编码转录本的贡献不均匀以及可变剪接导致在所测群体中存在多个γ率。1.2.2.2 知识点 EdU(5-ethynyl-2′-deoxyuridine)脉冲标记是一种用于检测细胞增殖和DNA合成的技术,常用于细胞周期研究。
它是一种胸苷类似物,在细胞复制DNA时可以被掺入新合成的DNA中。图2: 附录图4: 附录图5: 附录图6: 附录图7: 附录图8: 1.2.3 小鼠海马体验证 1.2.3.1 主要内容 接下来,作者将RNA速率分析应用于发育中的小鼠海马体分支谱系。去除血管细胞、免疫细胞、GABAergic和CajalRetzius神经元(这些细胞起源于海马体之外)后,t-SNE降维图显示了一个具有多个分支的复杂流形(图3a)。
作者利用已知的标记物将分支末端分别识别为星形胶质细胞(astrocytes)、少突胶质前体细胞(OPCs)、齿状回颗粒神经元细胞(dentate gyrus granule neurons)以及海马五个区域(subiculum、CA1、CA2、CA3 和hilus)的锥体神经元细胞(附录图9)。单个基因的相位图显示了沿流形基因表达的特定诱导和抑制(图3b,附录图10)。
例如,Pdgfra(OPCs的标记物)在pre-OPCs细胞中被诱导,并在OPCs中得以维持;其在pre-OPCs细胞状态中表现出正速率,但在OPCs中则为零速率。同样,Igfbpl1特异性表达于神经前体细胞(neuroblasts),从放射状胶质细胞(radial glia)到神经前体细胞表现出正速率,但从神经前体细胞到两个主要神经元分支则表现出负速率。RNA速率显示出朝向每个主要分支的强烈定向流动(图3c,附录图10)。
作者根据包括Notch信号通路的靶基因Hes1和同源框转录因子Hopx在内的标志物的表达,将这些细胞鉴定为放射状胶质细胞(附录图9)。事实上,先前的命运图谱研究已表明放射状胶质细胞是海马体谱系树的真正起源。通过在速率场中使用马尔可夫随机游走模型,可以自动识别终端状态和根状态(图3c),这证明了RNA速率具有在无需先验发育过程知识的情况下对谱系树进行进化方向推断的能力。
一方面,速率指向星形胶质细胞(表达Aqp4)而无需中间细胞分裂,或者通过狭窄通道从pre-OPC 指向OPC细胞状态。作者推测这个狭窄通道代表了向少突胶质细胞谱系的决定时刻。在这一微态水平上,细胞命运的选择可能是一个非确定性过程,涉及基因表达向一种或另一种命运倾斜,随后一旦转录因子反馈回路建立,最终命运便被锁定。
将起始于前少突胶质前体细胞(pre-OPCs)的细胞与起始于通向少突胶质前体细胞(OPCs)狭窄通道的细胞未来状态的概率分布进行比较,发现存在明显差异,后者最终成为完全形成的少突胶质前体细胞的可能性极大,而前者则极有可能保持在前少突胶质前体细胞状态(图3d)。部分祖细胞(附录图9b)表达神经源性转录因子(如Neurod2、Neurod4、Eomes),这些细胞表现出向未成熟神经母细胞状态发展的趋势,从而指向流形上部的三个主要神经元分支。
齿状回颗粒神经元首先从海马体中分离出来,随后海马细胞再次分裂为subiculum/CA1和CA2-4,分别对应于海马体的主要功能和解剖分区(附录图9、10)。这种对分支谱系的单细胞详细观察使作者能够探究细胞命运的选择问题。在CA命运和颗粒细胞命运分支点入口处的两个相邻神经干细胞中(图3e),尽管它们当前的状态在基因表达空间中是相邻的,但它们的未来命运已倾向于不同方向,这从Prox1的激活情况中可以区分出来(图3c)。
与这些发现一致的是,已有研究表明Prox1对颗粒细胞的形成是必需的,当Prox1被删除时,神经干细胞则会转而采取锥体神经元的命运。1.2.3.2 知识点 Notch信号通路维持Radial Glia细胞干性,通过激活Hes1(Notch信号的主要下游效应因子)抑制其分化;Hes1高表达时,抑制Hopx,使Radial Glia保持未分化状态;
当Notch信号减弱时,Hes1下降,Hopx上调,Radial Glia逐步向神经元或星形胶质细胞分化; Hopx在Radial Glia退出干性、转变为 NIPC(神经前体细胞)或星形胶质细胞中起重要作用。
图3: 附录图9: 附录图10: 1.2.4 人类前脑验证 1.2.4.1 主要内容 为了证明RNA速率在人类胚胎中是可检测的,作者在受孕后十周对发育中的人类前脑进行了基于液滴的单细胞mRNA测序,重点关注谷氨酸能神经元谱系(图4a)。
作者发现了一个强烈的RNA速率模式,其起源于增殖的祖细胞状态(放射状胶质细胞),并依次经过一系列中间神经母细胞阶段,最终形成表达SL17A7(前脑兴奋性神经元中使用的囊泡谷氨酸转运蛋白)的更成熟的谷氨酸能神经元。
作者通过多重原位杂交验证了已知和新发现的皮质神经元发育标志物的表达(图4b-c),证实了CLU和FBXO32在室管膜下区(SOX2标记的放射状胶质细胞)、UNC5D在中间区(EOMES标记的神经母细胞)以及SEZ6和RBFOX1在皮质板(SLC17A7标记的神经元,也称为VGLUT1)中的预测表达。这些基因在组织中的分层表达(图4c)与单细胞RNA测序数据中它们表达的伪时间分布(图4b)高度一致。
作者采用主曲线分析法,依据分化假时间对细胞进行排序,并研究了人类原代细胞中转录的时序进程。作者证实,在上调和下调过程中,未剪接的mRNA始终先于剪接的mRNA出现(图4d)。作者观察到了快速和缓慢的动力学过程。例如,RNASEH2B显示出快速的动力学特征,未剪接RNA和剪接RNA之间几乎没有差异。
相反,DCX、ELAVL4和STMN2等基因则表现出初始快速转录爆发,随后转录水平持续降低(从未剪接RNA曲线的形状可以看出,图4d),而剪接转录本则明显延迟。这种具有超调的动态诱导已被认为有助于快速诱导降解动力学缓慢的基因,但在人类胚胎中尚无法进行研究。RNA速率基于真实的转录动力学,这一事实有望为以后理解细胞在基因表达空间中的动态过程(尤其是在分化过程中)提供更加坚实的定量基础。
RNA速率已经使我们能够在整个生物体中详细研究动态过程,并且将极大促进人类胚胎中的谱系分析。图4: 1.3 小结 综合以上研究内容,我们不难发现基于RNA速率推断可以在没有先验发育过程知识的基础上,实现细胞发育轨迹的推断。同时在不同的物种和组织都得到了有效的认证。其实基于目前的研究,已经开发可用于轨迹推断的软件不止一个。例如我们先前就已经推送过的拟时序分析| 3.monocle2实操:完整版和Slingshot拟时序分析学习手册。
两款分析软件都可以用于轨迹推断。但是RNA速率推断以其本身的算法特点,目前仍是单细胞数据轨迹推断的重要软件之一。因此,今天我们在本教程中,基于R进行RNA速率推断演示。2、数据下载 RNA速率分析输入文件之一就是loom文件(包含细胞与基因表达矩阵信息),而loom文件的生成首先第一步需要用到BAM文件。BAM文件中包含了单细胞转录组数据和参考基因组的比对结果,主要用于映射reads到参考基因组。
今天我们的实例数据的bam文件可以从10x genomics官网直接下载得到。
对于BAM文件前期处理感兴趣的可以参考我们之前推送的一文了解并掌握samtools 2.1 bam文件下载地址 : https : // www .10 xgenomics.com / resources / datasets / 3 - k - pbm - cs - from - a - healthy - donor -1 - standard -1-1-0 2.2 repeatmask文件下载地址:https : // genome.ucsc.edu / cgi - bin / hgTables?hgsid = 611454127 _NtvlaW6xBSIRYJEBI0iRDEWisITa & clade = mammal & org = & db = hg19 & hgta_group = allTracks & hgta_track = rmsk & hgta_table = rmsk & hgta_regionType = genome & position = & hgta_outputType = gff & hgta_outFileName = hg19_rmsk.gtf 2.3 gtf文件下载地址:https : // www.gencodegenes.org / human / release_47lift37.html 3、软件安装 # 创建新的conda环境 conda create - n velocyto # 激活环境 conda activate velocyto # 在该环境安装所需的两个包及其附属包 sudo apt install libboost - filesystem - dev sudo apt install libboost - system - dev conda install numpy scipy cython numba matplotlib scikit - learn h5py click pip install velocyto pip install scvelo # 检查是否安装成功 velocyto -- help 4、生成loom文件生成的过程时间较长,耗时约8.5h # 解压gtf文件 gzip - d gencode.v47.annotation.gtf.gz # 查看velocyto run 使用方法 velocyto run -- help # 生成loom velocyto run \ pbmc3k_possorted_genome_bam.bam \ gencode.v47lift37.annotation.gtf \ - o . / result \ - m hg19_rmsk.gtf 5、RNA速率分析实战 5.1 安装R包 library (devtools) install_github ( "velocyto-team/velocyto.R" ) 5.2 加载R包 # 加载R包 suppressMessages ({ library (Seurat) library (velocyto.R) library (tidyverse) }) 5.3 处理测试数据 本次使用的测试数据集为pbmc3k数据集,预处理过程可以参考单细胞测序数据分析
(三)——单样本分析 Sys.time # [1] "2025-04-28 15:15:32 CST" # 设置工作目录 setwd ( "/data/06_scRNA-seq_psedutime/" ) # 导入10x三个标准输入文件 sce = Read10X ( "data/filtered_gene_bc_matrices/hg19" ) # 创建seurat对象 sce_final <- sce %>% CreateSeuratObject ( min.cells = 3 , min.features = 200 , project= "pbmc3k" ) %>% PercentageFeatureSet ( pattern = '^MT' , col.name = 'percent.mt' ) # 质控,过滤低质量的细胞 select_c <- WhichCells (sce_final, expression = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5 ) sce_final = subset (sce_final, cells = select_c) dim (sce_final) # [1] 13714 2626 # 降维分群 sce_final <- sce_final %>% NormalizeData %>% FindVariableFeatures %>% ScaleData %>% RunPCA %>% FindNeighbors ( dims = 1 : 10 ) %>% FindClusters ( resolution = 0.5 ) %>% RunTSNE ( dims = 1 : 10 ) %>% RunUMAP ( dims = 1 : 10 ) # Modularity Optimizer version 1.3.0 by Ludo Waltmand Nees Jan van Eck # # Number of nodes: 2626 # Number of edges: 95233 # # Running Louvain algorithm... # Maximum modularity in 10 random starts: 0.8722 # Number of communities: 9 # Elapsed time: 0 seconds # 参考seurat官网添加细胞标签 new.cluster.ids <- c ( "TCells" , "Myeloid" , "TCells" , "BCells" , "TCells" , "Myeloid" , "TCells" , "Myeloid" , "Platelet" ) names (new.cluster.ids) <- levels (sce_final) sce_final <- RenameIdents (sce_final, new.cluster.ids) sce_final @ meta.data $ celltype <- Idents (sce_final) # 筛选出T细胞亚群用于后续RNA速率分析 TCells <- subset (sce_final, celltype == "TCells" ) # 保存RDS文件 # saveRDS(TCells,"result_velocity/pbmc3k_TCells.rds") 5.4 统计spliced和unspliced区域 # 读入前面保存的rds文件 TCells <- readRDS ( "./result_velocity/pbmc3k_TCells.rds" ) # 展示T细胞的降维分群 DimPlot (TCells, group.by= "seurat_clusters" , label = T) # 读入loom ldat <- read.loom.matrices ( "data/rnavelocity/pbmc3k_possorted_genome_bam_3SE0I.loom" ) # reading loom file via hdf5r... setdiff ( colnames (ldat $ spliced), colnames (ldat $ unspliced)) # character(0) setdiff ( colnames (ldat $ spliced), colnames (ldat $ ambiguous)) # character(0) # 修改loom细胞barcode,和seurat对象的保持一致 col <- as.data.frame ( colnames (ldat $ spliced)) col <- col %>% separate ( "colnames(ldat$spliced)" , into = c ( "col1" , "col2" ), sep = ":" ) col $ col3 <- paste0 (col $ col2, "-1" ) # 检查修改后的列名称 head (col) # col1 col2 col3 # 1 pbmc3k_possorted_genome_bam_3SE0I AAACATACAACCAC AAACATACAACCAC-1 # 2 pbmc3k_possorted_genome_bam_3SE0I AAACATTGAGCTAC AAACATTGAGCTAC-1 # 3 pbmc3k_possorted_genome_bam_3SE0I AAACATTGATCAGC AAACATTGATCAGC-1 # 4 pbmc3k_possorted_genome_bam_3SE0I AAACCGTGTATGCG AAACCGTGTATGCG-1 # 5 pbmc3k_possorted_genome_bam_3SE0I AAACCGTGCTTCCG AAACCGTGCTTCCG-1 # 6 pbmc3k_possorted_genome_bam_3SE0I AAACGCACTGGTAC AAACGCACTGGTAC-1 dim (col) # [1] 2959 3 # 将原先的barcode名称替换成修改后的 colnames (ldat $ spliced) <- col $ col3 colnames (ldat $ unspliced) <- col $ col3 colnames (ldat $ ambiguous) <- col $ col3 # loom和seurat对象细胞数保持一致 cells <- Cells (TCells) TCells @ assays $ spliced <- ldat $ spliced[,cells] TCells @ assays $ unspliced <- ldat $ unspliced[,cells] TCells @ assays $ ambiguous <- ldat $ ambiguous[,cells] 5.5 RNA速率推断 # 速率推断,耗时约0.5h rvel <- gene.relative.velocity.estimates (ldat $ spliced[,cells], ldat $ unspliced[,cells]) # 提取细胞信息 meta <- TCells @ meta.data # 在meta文件中添加新的一列,将每个分群定义一个新的颜色,用户后续作图展示 meta <- meta %>% mutate ( color= case_when ( seurat_clusters == "0" ~ " , seurat_clusters == "2" ~ " , seurat_clusters == "4" ~ " , seurat_clusters == "6" ~ " )) cell.colors <- meta $ color names (cell.colors) <- rownames (meta) head (cell.colors) # 展示T细胞亚群RNA速率图,分化轨迹按照箭头指向发展 show.velocity.on.embedding.cor ( emb = TCells @ reductions $ umap @ cell.embeddings, rvel, cell.colors = cell.colors) Sys.time # [1] "2025-04-28 15:18:01 CST" 环境版本信息 sessionInfo # R version 4.4.2 (2024-10-31) # Platform: x86_64-pc-linux-gnu # Running under: Ubuntu 20.04.6 LTS # # Matrix products: default # BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 # LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0 # # locale: # [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C # [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 # [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 # [7] LC_PAPER=en_US.UTF-8 LC_NAME=C # [9] LC_ADDRESS=C LC_TELEPHONE=C # [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C # # time zone: Asia/Shanghai # tzcode source: system (glibc) # # attached base packages: # [1] grid splines stats4 stats graphics grDevices utils # [8] datasets methods base # # other attached packages: # [1] velocyto.R_0.6 ggpubr_0.5.0 # [3] paletteer_1.6.0 clustree_0.5.1 # [5] ggraph_2.1.0 openxlsx_4.2.7.1 # [7] hdf5r_1.3.11 scater_1.26.1 # [9] scran_1.26.0 scuttle_1.8.0 # [11] tradeSeq_1.20.0 mclust_6.1.1 # [13] uwot_0.1.14 slingshot_2.6.0 # [15] TrajectoryUtils_1.6.0 princurve_2.1.6 # [17] RColorBrewer_1.1-3 forcats_1.0.0 # [19] stringr_1.5.1 purrr_1.0.2 # [21] readr_2.1.5 tidyr_1.3.1 # [23] tibble_3.2.1 tidyverse_1.3.2 # [25] org.Ce.eg.db_3.20.0 tidyselect_1.2.1 # [27] monocle3_1.3.7 SingleCellExperiment_1.20.0 # [29] SummarizedExperiment_1.28.0 GenomicRanges_1.50.1 # [31] GenomeInfoDb_1.34.3 MatrixGenerics_1.18.1 # [33] matrixStats_1.5.0 scales_1.3.0 # [35] org.Hs.eg.db_3.16.0 AnnotationDbi_1.68.0 # [37] IRanges_2.32.0 S4Vectors_0.36.0 # [39] clusterProfiler_4.6.0 ComplexHeatmap_2.14.0 # [41] circlize_0.4.16 ggsci_3.2.0 # [43] patchwork_1.3.0 dplyr_1.1.4 # [45] Seurat_5.0.3 SeuratObject_5.0.1 # [47] sp_2.1-4 monocle_2.34.0 # [49] DDRTree_0.1.5 irlba_2.3.5.1 # [51] VGAM_1.1-12 ggplot2_3.5.1 # [53] Biobase_2.66.0 BiocGenerics_0.52.0 # [55] Matrix_1.7-1 # # loaded via a namespace (and not attached): # [1] igraph_2.0.1.1 ica_1.0-3 # [3] plotly_4.10.1 rematch2_2.1.2 # [5] zlibbioc_1.52.0 rvest_1.0.3 # [7] bit_4.5.0.1 doParallel_1.0.17 # [9] clue_0.3-66 lattice_0.22-6 # [11] rjson_0.2.23 blob_1.2.4 # [13] parallel_4.4.2 png_0.1-8 # [15] ResidualMatrix_1.8.0 cli_3.6.3 # [17] ggplotify_0.1.0 usdata_0.3.1 # [19] goftest_1.2-3 gargle_1.2.1 # [21] textshaping_0.3.6 pbmcapply_1.5.1 # [23] bluster_1.8.0 grr_0.9.5 # [25] BiocNeighbors_2.0.1 shadowtext_0.1.4 # [27] mime_0.12 evaluate_1.0.3 # [29] tidytree_0.4.1 leiden_0.4.3 # [31] openintro_2.4.0 stringi_1.8.4 # [33] backports_1.5.0 lubridate_1.9.0 # [35] httpuv_1.6.6 magrittr_2.0.3 # [37] wk_0.9.4 sctransform_0.4.1 # [39] ggbeeswarm_0.7.2 DBI_1.2.3 # [41] jquerylib_0.1.4 withr_3.0.2 # [43] systemfonts_1.1.0 class_7.3-20 # [45] enrichplot_1.18.1 lmtest_0.9-40 # [47] tidygraph_1.2.2 htmlwidgets_1.5.4 # [49] fs_1.6.5 ggrepel_0.9.6 # [51] cherryblossom_0.1.0 labeling_0.4.3 # [53] cellranger_1.1.0 reticulate_1.26 # [55] zoo_1.8-12 XVector_0.38.0 # [57] knitr_1.49 airports_0.1.0 # [59] RhpcBLASctl_0.23-42 timechange_0.3.0 # [61] speedglm_0.3-4 foreach_1.5.2 # [63] data.table_1.16.4 ggtree_3.6.2 # [65] R.oo_1.27.0 RSpectra_0.16-2 # [67] ggrastr_1.0.1 fastDummies_1.7.4 # [69] gridGraphics_0.5-1 lazyeval_0.2.2 # [71] yaml_2.3.10 survival_3.4-0 # [73] scattermore_1.2 crayon_1.5.3 # [75] RcppAnnoy_0.0.22 progressr_0.15.1 # [77] tweenr_2.0.3 later_1.4.1 # [79] ggridges_0.5.6 codetools_0.2-20 # [81] GlobalOptions_0.1.2 HSMMSingleCell_1.26.0 # [83] KEGGREST_1.38.0 Rtsne_0.17 # [85] shape_1.4.6.1 fastICA_1.2-7 # [87] limma_3.62.2 pkgconfig_2.0.3 # [89] xml2_1.3.6 aplot_0.1.8 # [91] spatstat.sparse_3.1-0 ape_5.8-1 # [93] viridisLite_0.4.2 xtable_1.8-4 # [95] car_3.1-1 plyr_1.8.9 # [97] httr_1.4.4 tools_4.4.2 # [99] globals_0.16.3 beeswarm_0.4.0 # [101] broom_1.0.7 nlme_3.1-160 # [103] HDO.db_0.99.1 dbplyr_2.5.0 # [105] crosstalk_1.2.1 assertthat_0.2.1 # [107] lme4_1.1-31 digest_0.6.37 # [109] farver_2.1.2 tzdb_0.4.0 # [111] reshape2_1.4.4 yulab.utils_0.1.9 # [113] viridis_0.6.5 glue_1.8.0 # [115] cachem_1.1.0 polyclip_1.10-7 # [117] generics_0.1.3 Biostrings_2.66.0 # [119] classInt_0.4-11 googledrive_2.0.0 # [121] presto_1.0.0 spdep_1.3-6 # [123] parallelly_1.41.0 statmod_1.5.0 # [125] ragg_1.2.4 RcppHNSW_0.6.0 # [127] ScaledMatrix_1.6.0 carData_3.0-5 # [129] minqa_1.2.8 pbapply_1.7-2 # [131] spam_2.11-0 gson_0.1.0 # [133] dqrng_0.4.1 utf8_1.2.4 # [135] graphlayouts_0.8.3 readxl_1.4.3 # [137] ggsignif_0.6.4 gridExtra_2.3 # [139] shiny_1.10.0 GenomeInfoDbData_1.2.13 # [141] R.utils_2.12.3 RCurl_1.98-1.16 # [143] memoise_2.0.1 rmarkdown_2.29 # [145] pheatmap_1.0.12 downloader_0.4 # [147] R.methodsS3_1.8.2 googlesheets4_1.0.1 # [149] future_1.34.0 RANN_2.6.2 # [151] spatstat.data_3.1-4 rstudioapi_0.17.1 # [153] cluster_2.1.8 spatstat.utils_3.1-2 # [155] hms_1.1.3 fitdistrplus_1.2-2 # [157] munsell_0.5.1 cowplot_1.1.3 # [159] colorspace_2.1-1 FNN_1.1.4.1 # [161] rlang_1.1.4 s2_1.1.7 # [163] DelayedMatrixStats_1.20.0 sparseMatrixStats_1.10.0 # [165] dotCall64_1.2 ggforce_0.4.2 # [167] mgcv_1.8-41 xfun_0.50 # [169] e1071_1.7-16 modelr_0.1.10 # [171] iterators_1.0.14 abind_1.4-8 # [173] GOSemSim_2.24.0 treeio_1.22.0 # [175] bitops_1.0-9 promises_1.3.2 # [177] scatterpie_0.1.8 RSQLite_2.3.9 # [179] leidenbase_0.1.31 qvalue_2.38.0 # [181] reprex_2.0.2 fgsea_1.24.0 # [183] DelayedArray_0.24.0 proxy_0.4-27 # [185] GO.db_3.16.0 compiler_4.4.2 # [187] boot_1.3-31 beachmat_2.14.0 # [189] listenv_0.9.1 Rcpp_1.0.14 # [191] edgeR_4.4.1 BiocSingular_1.14.0 # [193] tensor_1.5 units_0.8-5 # [195] MASS_7.3-64 BiocParallel_1.40.0 # [197] spatstat.random_3.0-1 R6_2.5.1 # [199] rstatix_0.7.1 fastmap_1.2.0 # [201] fastmatch_1.1-6 vipor_0.4.7 # [203] ROCR_1.0-11 rsvd_1.0.5 # [205] gtable_0.3.6 KernSmooth_2.23-26 # [207] miniUI_0.1.1.1 deldir_2.0-4 # [209] htmltools_0.5.8.1 bit64_4.5.2 # [211] spatstat.explore_3.0-5 lifecycle_1.0.4 # [213] sf_1.0-18 zip_2.3.1 # [215] spData_2.3.4 nloptr_2.1.1 # [217] sass_0.4.9 vctrs_0.6.5 # [219] slam_0.1-55 spatstat.geom_3.0-3 # [221] DOSE_3.24.1 haven_2.5.4 # [223] ggfun_0.0.8 future.apply_1.11.3 # [225] bslib_0.9.0 pillar_1.10.1 # [227] batchelor_1.14.0 pcaMethods_1.90.0 # [229] metapod_1.14.0 locfit_1.5-9.10 # [231] combinat_0.0-8 jsonlite_1.8.9 # [233] GetoptLong_1.0.5 参考:https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4772-0 https://cole-trapnell-lab.github.io/monocle-release/ https://www.nature.com/articles/s41586-018-0414-6