本教程基于Linux中的Rstudio环境演示,计算资源不足的同学可参考:生信分析为什么要使用服务器?
足够支持你完成硕博生涯的生信环境 配置一个心仪的工作站(硬件+环境配置) 独享服务器,生信分析不求人 为实验室准备一份生物信息学不动产 可视化集锦 第一讲:单细胞绪论 1. 课程安排 本系列课程会以 视频 及 代码 的形式,陆续的上线,欢迎大家进行转发,那么做这个系列的课程的目的主要有两点,一则是帮助大家解决单细胞数据挖掘过程当中遇到的问题,那另一方面呢,受限于本人的知识,在制作期间难免有疏漏,欠妥的地方也希望大家及时给我指出来,大家互相学习,共同进步。
课程安排:我们初步将单细胞测序的数据挖掘拆分为 数个模块,课程结束之后从大家可以从最早的数据下机一直到论文当中的图表的绘制,所有问题都会迎刃而解。课程的安排主要考虑数据分析的先后顺序:第一讲:单细胞测序的简介。第二讲:介绍一些单细胞数据常见输入文件的读取方式。第三讲:本系列课程的精髓,对单样本分析进行初步的质控,初步的可视化,这个过程中大家会掌握许多 R语言及可视化的技巧。第四讲:我们主讲实际分析过程中如何对 多样本 的数据进行 整合。
第五讲,在cluster初步确定之后,利用各类的数据库,比如说cell marker等对细胞的类型进行注释,或者利用 singleR 进行一些自动的利用富集原理的注释。第六讲:教会大家掌握对Seurat分类变量处理技巧。第七讲:这一讲大家在这个研究项目当中的核心目的:寻找差异基因差异 进行可视化。第八讲:通过富集分析了解差异基因背后蕴藏的 生物学意义。第九讲:沉浸式统计细胞比例。第十讲:改造单细胞图形-DimPlot。
第十一讲:改造单细胞图形-VlnPlot。第十二讲:改造单细胞图形-FeaturePlot。第十三讲:改造单细胞图形-DotPlot。第十四讲:改造单细胞图形-DoHeatmap。R语言基础学习手册 你至少需要准备一个 4 线程、 32 GB运行内存的电脑来运行以下程序。实战的数据分析可能需要上百GB的内存,如果你的设备不能满足,可以考虑下列服务器的租赁。
最便宜的版本即可帮助你们完成90%以上的单细胞分析需求(土豪随意):足够支持你完成硕博生涯的生信环境 如何拥有一台服务器,且不用自己搭建维护 为实验室准备一份生物信息学不动产 2. 视频及测试文件 2.1 视频教程 2.2 测试文件 第一讲为绪论,无测试文件 第二讲:各类型数据读取 1. 简介:本次课程为该系列的第二次课程,在大家进行了一段时间的 R语言 与 Linux 学习后即可开始跟随课程正式开始系列课程的学习。
本次课程我们首先会教大家如何 安装 、加载分析所需要的 全套R包,然后简单的介绍一下可用于Seurat对象创建的 各类数据结构及读取方式 ( Cellranger 输出文件、表达矩阵、工程文件等)。2. 视频及测试文件 2.1 视频教程 2.2 测试文件 请在推送中获取:手把手教你做单细胞测序数据分析
(二):各种输入文件的读取方式 3. 环境准备 考虑到大家遇到最多的Bug可能是版本冲突,因此需要保证您的 R 与 Packages 的版本都与我一致 版本环境点击跳转 4. 代码: 注意,本教程以 Seurat 的 V5 版本进行演示,Seurat V4和V5的不同之处可参考快速上手Seurat V5(订阅用户请在龙年订阅资料下载连接中找,无需再次付费)。
# 可以参考下面流程安装Seurat:# if(!require(Signac))install.packages('Signac') # remotes::install_github("satijalab/seurat-data", quiet = TRUE) # remotes::install_github("satijalab/azimuth", quiet = TRUE) # remotes::install_github("satijalab/seurat-wrappers", quiet = TRUE) # # if(!require(Seuratdata))devtools::install_github('satijalab/seurat-data') # # setRepositories(ind = 1:3, addURLs = c('https://satijalab.r-universe.dev', 'https://bnprks.r-universe.dev/')) # install.packages(c("BPCells", "presto", "glmGamPoi")) # # 安装Seurat V5,本教程编译的2024年 # install.packages('Seurat') # 安装conda # wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh # bash Miniconda3-latest-Linux-x86_64.sh # source ~/.bashrc 4.1 加载R包 # 服务器账号已经安装好了Seurat,可以直接加载 library (Seurat) library (magrittr) library (mindr) # 确认一下自己的安装版本,是V5后再进行下面的教程 packageVersion ( 'Seurat' ) # [1] '5.1.0' # 设置工作目录 setwd ( "/path/to/project" ) # 删除当前可能存在的所有环境变量 rm ( list = ls ) 4.2 读取数据并创建Seurat对象 # 判断依赖包是否存在:if ( ! require (multtest))BiocManager :: install ( "multtest" ) if ( ! require (Seurat)) install.packages ( "Seurat" ) if ( ! require (dplyr)) install.packages ( "dplyr" ) if ( ! require (tidyverse)) install.packages ( "tidyverse" ) # 读取单样本数据 # 方法一:10x标准三个输出文件 obj1 <- Read10X ( data.dir = "./filtered_gene_bc_matrices/hg19/" ) # 创建Seurat对象 obj1 <- CreateSeuratObject ( counts = obj1, project = "教程作者" ) obj1 # An object of class Seurat # 32738 features across 2700 samples within 1 assay # Active assay: RNA (32738 features, 0 variable features) # 1 layer present: counts # 方法二:仅有一个稀疏矩阵 # 读取时间受到文件大小影响,文件较大,读取时间相应较长 obj2 <- read.table ( "./single_cell_datamatrix.txt" , sep= " \t " , header= T, row.names= 1 ) # 查看数据维度,即细胞数和基因数 dim (obj2) # [1] 13714 2700 # 创建Seurat对象 obj2 <- CreateSeuratObject ( counts = obj2) obj2 # An object of class Seurat # 13714 features across 2700 samples within 1 assay # Active assay: RNA (13714 features, 0 variable features) # 1 layer present: counts # 方法三:RDS文件 obj3 <- readRDS ( "./panc8.rds" ) obj3 # An object of class Seurat # 13714 features across 2638 samples within 1 assay # Active assay: RNA (13714 features, 2000 variable features) # 3 layers present: counts, data, scale.data # 3 dimensional reductions calculated: pca, umap, tsne 4.3 Seurat对象网络图查看 # 前期准备 # 安装pdftools需要的库libpoppler-cpp-dev # sudo apt install libpoppler-cpp-dev # install.packages("pdftools") # 安装完pdftools再安装mindr # devtools::install_github("pzhaonet/mindr") # 这一部分由于版本原因我们做了较大的修改 out2 <- str (obj2) %>% capture.output (.) %>% gsub ( pattern = " \\ . \\ . " , replacement = " , x = . ) %>% gsub ( pattern = " \\ . \\ .@" , replacement = "# " , x = . ) %>% gsub ( pattern= "^ \\ s+ , replace= " ) mindr :: mm ( from = out2, input_type = "markdown" , output_type = "widget" , root = "Seurat" ) # $widget 网络图的绘制报错可参考:答读者问
(十三)查看Seurat对象时的ERROR:type=‘text’ 粉丝来稿|2. ERROR:type=’text’的另一种解法 第三讲:单样本分析 这一部分课程 正式开始学习 单细胞转录组测序的数据分析,大家无需担心自己的 R语言 方面的编程基础。本次课程将从分析工具包的 安装 、 加载 开始,对脚本的参数进行 全方位 的讲解。
具体分析内容方面,此次课程将详细的给大家介绍 Seurat对象 的创建、过滤、数据结构特征及理解、存储、线粒体含量计算、分群、降维、Marker计算、各类可视化方法。并同时提醒大家不同物种处理时需要进行一些改动。此次课程为整个《 单细胞转录组测序数据分析 》课程的 核心内容,其中代码的运行方式、数据结构的理解与处理均十分重要,后续的进阶分析都需以此课程作为基础。关于此讲内容,Seurat V4及V5之间并无明显差别。
代码框中内容可以直接复制至 Rstudio 中运行 1. 视频及测试文件 1.1 视频教程 1.2 测试文件 代码中包含测试文件下载方式,如果网络不佳可在推送中获取。下载本手册的同时本地也包含配套测试文件~ 请在推送中获取:手把手教你做单细胞测序数据分析
(三):单样本分析 2. 代码 2.1 R包安装与加载 if ( ! require (multtest))BiocManager :: install ( "multtest" ) if ( ! require (Seurat)) install.packages ( "Seurat" ) if ( ! require (dplyr)) install.packages ( "dplyr" ) if ( ! require (patchwork)) install.packages ( "patchwork" ) if ( ! require (R.utils)) install.packages ( "R.utils" ) # gc 函数用于触发垃圾回收(Garbage Collection, GC),回收不再使用的内存,并以矩阵形式返回当前的内存使用情况,其中包含两种类型的内存信息:# Ncells(表示 R 语言中的对象,如列表、函数等) # Vcells(表示向量数据,如数值向量、字符向量等) gc # 设置并行计算,提升下面流程的计算速度 makecore <- function (workcore,memory){ if ( ! require (Seurat)) install.packages ( 'Seurat' ) if ( ! require (future)) install.packages ( 'future' ) plan ( "multicore" , workers = workcore) options ( future.globals.maxSize= memory * 1024 * 1024 ** 2 ) } 核心,允许future任务占用 最多2GB内存为例 makecore ( 4 , 2 ) 2.2 数据读入并创建 Seurat对象 2.2.1 下载并解压数据 # 下载速度受到网络稳定性影响,如果下载失败可以多尝试几次 download.file ( 'https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz' , 'my.pbmc.tar.gz' ) # 解压".tar.gz"结尾的文件完成之后在当前目录即data目录下面会出现一个filtered_gene_bc_matrices文件夹 untar ( "my.pbmc.tar.gz" ) 2.2.2 读入数据并创建Seurat分析对象 pbmc.data <- Read10X ( data.dir = "./filtered_gene_bc_matrices/hg19/" ) 如果你无法正常读入,请尝试理解数据的格式,以下几个教程可供参考:为什么Read10X也会报错? 如何一次性读取一个目录下的cellranger输出文件?
没有barcode文件的单细胞数据要怎么读取 四个单细胞样本只给了一套文件怎么读 pbmc <- CreateSeuratObject ( counts = pbmc.data, project = "pbmc3k" , min.cells = 3 , min.features = 200 ) # 初步查看Seurat对象:pbmc # An object of class Seurat # 13714 features across 2700 samples within 1 assay # Active assay: RNA (13714 features, 0 variable features) # 1 layer present: counts # 查看对象中包含多少个细胞:ncol (pbmc) # [1] 2700 ncol (pbmc.data) # [1] 2700 这里需要注意,Seurat V5中引入了 layer 的概念:所以下面这样无法取出 counts 的稀疏矩阵:# 以下代码会报错:try ({ lalala <- as.data.frame (pbmc[[ "RNA" ]] @ counts) }) # Error in h(simpleError(msg, call)) : # error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': no slot of name "counts" for this object of class "Assay5" 现在的做法应该是:# 这是一种方法:lalala <- pbmc @ assays[[ "RNA" ]] @ layers[[ "counts" ]] # 或者试用GetAssayData函数也是可以的:lalala <- GetAssayData (pbmc, assay= "RNA" , slot= 'counts' ) # 查看一下矩阵的部分数据:lalala[ 1 : 4 , 1 : 4 ] # 4 x 4 sparse Matrix of class "dgCMatrix" # AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 # AL627309.1 . . . # AP006222.2 . . . # RP11-206L10.2 . . . # RP11-206L10.9 . . . # AAACCGTGCTTCCG-1 # AL627309.1 . # AP006222.2 . # RP11-206L10.2 . # RP11-206L10.9 . write.table (lalala, './mycount.txt' , sep = ' \t ' ) # 删除对象,减少内存占用 rm (lalala) 2.3 质控数据及可视化 # if(!require(patchwork))install.packages("patchwork") # 线粒体基因表达比例统计 # 人源的数据为MT,鼠源的需要换成mt pbmc[[ "percent.mt" ]] <- PercentageFeatureSet (pbmc, pattern = "^MT-" ) # 新增percent.mt列 head (pbmc @ meta.data, 5 ) # orig.ident nCount_RNA nFeature_RNA percent.mt # AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759 # AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958 # AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363 # AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845 # AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898 VlnPlot (pbmc, features = c ( "nFeature_RNA" , "nCount_RNA" , "percent.mt" ), ncol = 3 ) # FeatureScatter 是 Seurat 包中的一个函数,它用于绘制散点图(Scatter Plot),可视化两个指标之间的关系 plot1 <- FeatureScatter (pbmc, feature1 = "nCount_RNA" , feature2 = "percent.mt" ) plot2 <- FeatureScatter (pbmc, feature1 = "nCount_RNA" , feature2 = "nFeature_RNA" ) # CombinePlots这步需要你的绘图窗口足够大 CombinePlots ( plots = list (plot1, plot2)) # percent.mt 随 nCount_RNA 增加而增加,可能有一些高 percent.mt 的异常点(低质量细胞) # nCount_RNA 和 nFeature_RNA 成正比,如果 nCount_RNA 很高但 nFeature_RNA 低,可能是低质量细胞或技术噪声 # 注意:percent.mt不是越低越好,部分组织本身代谢高,例如心脏percent.mt相对其他组织数值更高,属于正常现象 # 过滤低质量细胞 pbmc # An object of class Seurat # 13714 features across 2700 samples within 1 assay # Active assay: RNA (13714 features, 0 variable features) # 1 layer present: counts pbmc <- subset (pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5 ) # 看看现在的细胞数:pbmc # An object of class Seurat # 13714 features across 2638 samples within 1 assay # Active assay: RNA (13714 features, 0 variable features) # 1 layer present: counts 2.4 细胞分群 pbmc <- NormalizeData (pbmc, normalization.method = "LogNormalize" , scale.factor = 10000 ) # 寻找高变基因,默认nfeatures=2000 pbmc <- FindVariableFeatures (pbmc, selection.method = "vst" , nfeatures = 2000 ) # 查看十个高变基因 top10 <- head ( VariableFeatures (pbmc), 10 ) top10 # [1] "PPBP" "LYZ" "S100A9" "IGLL5" "GNLY" "FTL" "PF4" "FTH1" # [9] "GNG11" "S100A8" # 可视化单细胞 RNA-seq 数据中的高变基因,x轴基因表达均值;
y轴基因标准变异度。大部分基因点为灰色(背景基因)。高变基因(HVGs)被标记红色。这些高变基因在后续降维(PCA、UMAP)和聚类分析中至关重要。
plot1 <- VariableFeaturePlot (pbmc) # 在 VariableFeaturePlot 上 标注前 10 个高变基因,repel = TRUE避免标签重复,plot2 <- LabelPoints ( plot = plot1, points = top10, repel = TRUE ) plot1 + plot2 pbmc <- ScaleData (pbmc, features = rownames (pbmc)) dim (pbmc @ assays $ RNA @ layers $ scale.data) # [1] 13714 2638 <- ScaleData(pbmc) 默认用高变基因来scale a <- ScaleData (pbmc) dim (a @ assays $ RNA @ layers $ scale.data) # [1] 2000 2638 rm (a) # 提取高变基因 hvg <- VariableFeatures ( object = pbmc) length (hvg) # [1] 2000 # PCA(Principal Component Analysis)主成分分析,主成分分析是一种无监督的降维技术,它能将原始的高维数据转化为一组新的、互不相关的综合变量,即主成分。
这些主成分按照方差大小排序,方差越大,包含的原始数据信息就越多。通过保留前面几个方差较大的主成分,就能在尽量保留原始数据信息的前提下,显著降低数据的维度。单细胞转录组数据通常具有高维度的特点,每个细胞可能有数千个基因的表达值。高维度数据不仅会增加计算的复杂度,还可能导致过拟合等问题。PCA 可以将高维的基因表达数据投影到低维的主成分空间,从而减少数据的维度,降低计算成本。
pbmc <- RunPCA (pbmc, features = hvg) # 输出每个前五个主成分top5基因 print (pbmc[[ "pca" ]], dims = 1 : 5 , nfeatures = 5 ) # PC_ 1 # Positive: CST3, TYROBP, LST1, AIF1, FTL # Negative: MALAT1, LTB, IL32, IL7R, CD2 # PC_ 2 # Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 # Negative: NKG7, PRF1, CST7, GZMB, GZMA # PC_ 3 # Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 # Negative: PPBP, PF4, SDPR, SPARC, GNG11 # PC_ 4 # Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 # Negative: VIM, IL7R, S100A6, IL32, S100A8 # PC_ 5 # Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY # Negative: LTB, IL7R, CKB, VIM, MS4A7 # 展示 PCA 降维过程中,每个基因对指定主成分(PC1和PC2)的贡献程度,即基因的载荷值。
基因载荷反映了基因表达值与主成分之间的相关性,通过可视化载荷值,可以帮助我们理解每个主成分所代表的生物学意义。x轴:表示基因在对应主成分上的载荷值。载荷值的绝对值越大,说明该基因对该主成分的贡献越大;
正载荷值表示基因表达与主成分呈正相关,负载荷值表示基因表达与主成分呈负相关 VizDimLoadings (pbmc, dims = 1 : 2 , reduction = "pca" ) # 基于PCA计算结果展示细胞降维分布 DimPlot (pbmc, reduction = "pca" ) # 热图展示指定主成分(PC)与细胞之间的关系。
热图可以帮助我们直观地了解每个主成分在不同细胞中的表达模式,进而识别出对特定主成分贡献较大的细胞 # dims指定要展示的主成分范围,这里表示展示第 1 到第 15 个主成分 # 指定要在热图中展示的细胞数量,这里随机选取 500 个细胞进行展示 # 表示会平衡选取不同簇的细胞,以确保每个簇的细胞在热图中都有适当的代表 DimHeatmap (pbmc, dims = 1 : 15 , cells = 500 , balanced = TRUE ) # JackStraw分析评估主成分显著性,通过对数据进行多次重采样"这里是100次",模拟出随机数据的主成分分布,然后将真实数据的主成分与随机数据的主成分进行比较,从而判断每个主成分是否具有统计学意义 # 这一步计算时间较长,约4min pbmc <- JackStraw (pbmc, num.replicate = 100 ) # ScoreJackStraw 函数用于对JackStraw 分析的结果进行打分。
它会计算每个主成分的p值,用于评估该主成分的显著性 # dims指定要评估的主成分范围,这里表示评估第 1 到第 20 个主成分 pbmc <- ScoreJackStraw (pbmc, dims = 1 : 20 ) # 该函数用于绘制JackStraw分析的结果图。图形会展示每个主成分的p值分布,帮助我们直观地判断哪些主成分是显著的,哪些是随机噪声。x轴经验值,y轴理论值黑色斜线表示零假设下的均匀p值分布。
其作用是作为判断主成分(PC)显著性的基准线,具体含义如下:零假设的参考JackStraw分析通过随机置换数据生成“零分布”(即假设主成分中基因的表达模式是随机的),黑色虚线即代表这种均匀分布的p值预期。如果主成分的实际p值分布(实线)显著偏离虚线(尤其是位于虚线之上),说明该主成分包含的基因表达模式并非随机,具有统计学意义。实线高于虚线:表示该主成分中低p值的基因富集,说明其包含真实的生物学信号,应保留用于后续分析(如聚类、可视化)。
实线接近或低于虚线:说明主成分的 p 值分布与随机情况无显著差异,可能主要反映噪声,可忽略。JackStrawPlot (pbmc, dims = 1 : 20 ) # ElbowPlot 函数用于绘制肘部图。肘部图展示了每个主成分的方差解释率随主成分数量的变化情况。
通常,我们会在肘部图中找到一个“肘部”位置,即方差解释率下降趋势变缓的点,该点对应的主成分数量可以作为后续分析中保留的主成分数量 # 例如这里可以选择前10个主成分 ElbowPlot (pbmc) # FindNeighbors 函数用于在低维空间(这里是 PCA 空间)中计算细胞之间的邻接关系。
它会根据指定的主成分(这里是第 1 到第 10 个主成分)计算细胞之间的距离,并构建一个 K 近邻图(K-nearest neighbor graph),用于后续的聚类分析。pbmc <- FindNeighbors (pbmc, dims = 1 : 10 ) # FindClusters 函数用于对细胞进行聚类分析。
它基于 FindNeighbors 函数构建的 K 近邻图,使用图聚类算法(如 Louvain 算法)将相似的细胞聚为不同的簇。# resolution = 0.5:指定聚类的分辨率。分辨率越高,聚类的粒度越细,会得到更多的小簇;分辨率越低,聚类的粒度越粗,会得到较少的大簇。
pbmc <- FindClusters (pbmc, resolution = 0.5 ) # Modularity Optimizer version 1.3.0 by Ludo Waltmand Nees Jan van Eck # # Number of nodes: 2638 # Number of edges: 95965 # # Running Louvain algorithm... # Maximum modularity in 10 random starts: 0.8723 # Number of communities: 9 # Elapsed time: 0 seconds 如果你在上面的FindNeighbors中发生了 object 'CsparseMatrix_validate' not found 报错,可以参考下面的方法解决:答读者问| 22.object ‘CsparseMatrix_validate’ not found。
如果你用的是 Seurat V5 那么你大概率不会遇到上面的问题。2.5 分群后的可视化 2.5.1 tsne+umap # 原理:# UMAP:基于流形学习的思想,假设数据分布在一个低维流形上。它通过构建数据点之间的局部连接图来近似流形结构,然后优化低维空间中的点的位置,使得低维空间中的点之间的连接关系尽可能与高维空间中的一致。其核心是使用模糊拓扑来保留数据的全局和局部结构。# t-SNE:通过概率分布来表示数据点之间的相似性。
在高维空间中,使用高斯分布来计算数据点之间的相似度;在低维空间中,使用t分布来计算相似度。t-SNE的目标是最小化高维空间和低维空间中数据点对之间的概率分布差异,通常使用KL散度来衡量这种差异。# 计算效率:# UMAP:通常比t-SNE 更快,尤其是在处理大规模数据集时。UMAP的优化过程相对简单,并且可以使用近似算法来加速计算,能够在较短的时间内完成大规模数据的降维任务。
# t-SNE:计算复杂度较高,尤其是对于大规模数据集,计算时间会显著增加。它的优化过程是迭代的,并且需要计算所有数据点对之间的相似度,因此在处理大数据集时效率较低。
pbmc <- RunUMAP (pbmc, dims = 1 : 10 ) DimPlot (pbmc, reduction = "umap" ) pbmc <- RunTSNE (pbmc, dims = 1 : 10 ) DimPlot (pbmc, reduction = "tsne" ) 2.5.2 寻找marker基因并对cluster进行重命名 # 计算cluster5与cluster(0+3)群之间的差异基因 cluster5.markers <- FindMarkers (pbmc, ident.1 = 5 , ident.2 = c ( 0 , 3 ), min.pct = 0.25 ) head (cluster5.markers) # p_val avg_log2FC pct.1 pct.2 p_val_adj # FCGR3A 2.150929e-209 6.832372 0.975 0.039 2.949784e-205 # IFITM3 6.103366e-199 6.181000 0.975 0.048 8.370156e-195 # CFD 8.891428e-198 6.052575 0.938 0.037 1.219370e-193 # CD68 2.374425e-194 5.493138 0.926 0.035 3.256286e-190 # RP11-290F20.3 9.308287e-191 6.335402 0.840 0.016 1.276538e-186 # IFI30 1.575100e-181 6.382751 0.796 0.014 2.160093e-177 # 依次计算每一个亚群与剩余所有亚群的差异基因 pbmc.markers <- FindAllMarkers (pbmc, only.pos = TRUE , min.pct = 0.25 , logfc.threshold = 0.25 ) if ( ! require (dplyr)) install.packages ( "dplyr" ) # 每个亚群按照avg_log2FC排序得到的top2差异基因 # wt 参数代表权重变量(weight variable),它指定了用于排序的变量 pbmc.markers %>% group_by (cluster) %>% top_n ( n = 2 , wt = avg_log2FC) # # A tibble: 18 × 7 # # Groups: cluster [9] # p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene # <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr> # 1 1.17e- 83 2.37 0.435 0.108 1.60e- 79 0 CCR7 # 2 3.28e- 49 2.10 0.333 0.103 4.50e- 45 0 LEF1 # 3 3.10e-139 7.28 0.3 0.004 4.25e-135 1 FOLR3 # 4 1.63e-121 6.74 0.277 0.006 2.23e-117 1 S100A12 # 5 2.61e- 59 2.11 0.424 0.111 3.58e- 55 2 AQP3 # 6 1.94e- 35 1.90 0.267 0.069 2.66e- 31 2 CD40LG # 7 2.40e-272 7.38 0.564 0.009 3.29e-268 3 LINC00926 # 8 2.75e-237 7.14 0.488 0.007 3.76e-233 3 VPREB3 # 9 4.93e-169 4.37 0.595 0.056 6.76e-165 4 GZMK # 10 3.51e- 93 3.59 0.437 0.06 4.82e- 89 4 GZMH # 11 1.69e-212 5.43 0.506 0.01 2.32e-208 5 CDKN1C # 12 8.23e-168 5.88 0.37 0.005 1.13e-163 5 CKB # 13 1.05e-265 6.01 0.986 0.071 1.44e-261 6 GZMB # 14 7.57e-183 6.21 0.493 0.014 1.04e-178 6 AKR1C3 # 15 1.48e-220 7.63 0.812 0.011 2.03e-216 7 FCER1A # 16 1.46e-207 8.03 0.5 0.002 2.00e-203 7 SERPINF1 # 17 0 14.3 0.571 0 0 8 LY6G6F # 18 4.36e-206 13.8 0.357 0 5.98e-202 8 RP11-879F14.2 top10 <- pbmc.markers %>% group_by (cluster) %>% top_n ( n = 10 , wt = avg_log2FC) DoHeatmap (pbmc, features = top10 $ gene) + NoLegend # 细胞类型marker基因FeaturePlot p1 <- FeaturePlot (pbmc, features = c ( "CCR7" , "CD14" , "LYZ" , Mono "MS4A1" , "CD8A" , "FCGR3A" , Mono "NKG7" , "FCER1A" , "PF4" )) p2 <- DimPlot (pbmc) library (gridExtra) p1 | p2 可以在这个 CellMarker 中查看细胞类型对应的marker # 定义细胞名称 new.cluster.ids <- c ( "Naive CD4 T" , "CD14+ Mono" , "Memory CD4 T" , "B" , "CD8 T" , "FCGR3A+ Mono" , "NK" , "DC" , "Platelet" ) new.cluster.ids # [1] "Naive CD4 T" "CD14+ Mono" "Memory CD4 T" "B" "CD8 T" # [6] "FCGR3A+ Mono" "NK" "DC" "Platelet" levels (pbmc) # [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" # 将细胞名和分群信息一一对应 names (new.cluster.ids) <- levels (pbmc) new.cluster.ids # 0 1 2 3 4 # "Naive CD4 T" "CD14+ Mono" "Memory CD4 T" "B" "CD8 T" # 5 6 7 8 # "FCGR3A+ Mono" "NK" "DC" "Platelet" # 将new.cluster.ids作为新的细胞名称 pbmc <- RenameIdents (pbmc, new.cluster.ids) pbmc $ celltype <- Idents (pbmc) DimPlot (pbmc, reduction = "umap" , label = TRUE , pt.size = 0.5 ) + NoLegend # 保存为rds文件,便于后续使用 # saveRDS(pbmc,'../result_seuratV5/pbmc.rds') pbmc <- readRDS ( '../result_seuratV5/pbmc.rds' ) DimPlot (pbmc, reduction = "umap" , label = TRUE , pt.size = 0.5 ) + NoLegend 第四讲:多样本整合 在我们上一个视频中进行了单细胞转录组测序数据分析中 最重要 的单样本分析后,本次课程将进入 多样本数据整合 模块的学习。
在各位 实际 数据分析的过程中,实验设计往往是多样本,多组别的,因此,因此多样本数据的 科学整合与管理 是十分有必要。这里我们会针对不同的 数据特征,考虑到 批次效应 、 计算机资源利用 特征教大家如何进行多样本 Seurat 对象或矩阵的整合与拆分。不同实验批次、不同样本、不同实验条件、不同技术平台的单细胞数据整合是单细胞数据分析长期以来的重难点。
让不同数据间相同的细胞类型进行高质量、具有置信度的匹配能够让后续的基因差异分析更加的科学。因此,Seurat v5 致力于移除数据间的条件差异而非生物学差异,较V4而言更是优化了 内存 使用与 计算速度 等方面。具体流程与原来的多样本整合中演示步骤具有一定的差别,详细差别可参考:快速上手Seurat V5(订阅用户请在龙年订阅资料下载连接中找,无需再次付费)。
1. 视频及测试文件 1.1 视频教程 1.2 测试文件 本手册下载时包含测试数据,若无测试数据,可在推送中获取。
推送链接 2. 代码 # 加载R包:library (Seurat) library (multtest) library (dplyr) library (ggplot2) library (patchwork) if ( ! require (SeuratData))devtools :: install_github ( 'satijalab/seurat-data' ) 2.1. 单纯的merge 2.1.1 准备用于拆分的数据集 你如果对拆分的过程或 ident 的管理不理解,例如你不理解为什么下文中有 group 变量,可参考以下两讲内容:Seurat中分类变量处理技巧 答读者问
(六):Seurat中如何让细胞听你指挥 单细胞的多样本整合十分耗费计算资源,当各位实战时遇到内存不足的问题,可以考虑入一个具有性价比的服务器:足够支持你完成硕博生涯的生信环境: # 如果你的内存不是很充裕,可以进行抽样,降低计算量 # pbmc <- subset(pbmc, downsample = 50) ifnb <- readRDS ( './pbmcrenamed.rds' ) # 这其实是一个已经整合好的Seurat对象 class (ifnb) # [1] "Seurat" # attr(,"package") # [1] "Seurat" # 这是一个很老的数据,以Seurat V3版本生成的,V4、V5可以向下兼容 SeuratObject :: Version (ifnb) # [1] '3.2.2' # 其中包含了三个分组:unique (ifnb $ group) # [1] "C57" "AS1" "P3" # 那我们就按照分组将这个Seurat对象拆开,split的方式在Seurat V5中有所变化:# ifnb.list <- SplitObject(ifnb, split.by = "group") # 原来的方式 ifnb[[ "RNA" ]] <- split (ifnb[[ "RNA" ]], f = ifnb $ group) class (ifnb) # [1] "Seurat" # attr(,"package") # [1] "Seurat" # 分别取出这三个对象 C57 <- ifnb[,ifnb $ group == 'C57' ] C57 # An object of class Seurat # 22254 features across 309 samples within 2 assays # Active assay: RNA (20254 features, 0 variable features) # 2 layers present: counts.C57, data.C57 # 1 other assay present: integrated # 3 dimensional reductions calculated: pca, umap, tsne AS1 <- ifnb[,ifnb $ group == 'AS1' ] AS1 # An object of class Seurat # 22254 features across 287 samples within 2 assays # Active assay: RNA (20254 features, 0 variable features) # 2 layers present: counts.AS1, data.AS1 # 1 other assay present: integrated # 3 dimensional reductions calculated: pca, umap, tsne P3 <- ifnb[,ifnb $ group == 'P3' ] P3 # An object of class Seurat # 22254 features across 304 samples within 2 assays # Active assay: RNA (20254 features, 0 variable features) # 2 layers present: counts.P3, data.P3 # 1 other assay present: integrated # 3 dimensional reductions calculated: pca, umap, tsne # 注意,这里就等同于我们之前读入单个文件。
每一个对象相当于读取了一次一个样本的单细胞数据后生成的单样本矩阵。单样本矩阵读取后,参考我们前面介绍的单样本读取流程,接下来创建每个单样本seurta对象,即可继续接下来的merge合并对象流程。
2.1.2 merge # 简单利用merge整合这两个数据 pbmc <- merge (C57, y = c (AS1), add.cell.ids = c ( "C57" , "AS1" ), project = "ALL" ) # 这时我们就获得了一个新的Seurat对象 pbmc # An object of class Seurat # 22254 features across 596 samples within 2 assays # Active assay: RNA (20254 features, 0 variable features) # 4 layers present: counts.C57.1, data.C57.1, counts.AS1.2, data.AS1.2 # 1 other assay present: integrated # 可以发现这是barcode的名称发生了变化:head ( colnames (pbmc)) # [1] "C57_AAAGGGCTCGCAGAGA-1_1" "C57_AACACACAGCATCCCG-1_1" # [3] "C57_AACAGGGTCAACACCA-1_1" "C57_AACCCAAAGGCCTTGC-1_1" # [5] "C57_AACCCAAAGTGGACGT-1_1" "C57_AACCCAAGTCCGAAGA-1_1" unique ( sapply ( X = strsplit ( colnames (pbmc), split = "_" ), FUN = "[" , 1 )) # [1] "C57" "AS1" table (pbmc $ orig.ident) # # AS1 C57 # 287 309 # 以此类推,合并多个对象:merged_obj <- merge ( x = C57, y = c (AS1,P3), add.cell.ids = c ( "C57" , "AS1" , "P3" ), project = "MergedProject" ) merged_obj # An object of class Seurat # 22254 features across 900 samples within 2 assays # Active assay: RNA (20254 features, 0 variable features) # 6 layers present: counts.C57.1, data.C57.1, counts.AS1.2, data.AS1.2, counts.P3.3, data.P3.3 # 1 other assay present: integrated rm (merged_obj) # 接下来的分析流程可参考上一讲单样本分析 Seurat V5 中与 Seurat V4 不同的是,现在 list 可以直接参与 Seurat 中的计算函数,无需先合并各样本,也无需写循环采用遍历的形式。
# 标准分析流程:ifnb <- NormalizeData (ifnb) # 标准化 ifnb <- FindVariableFeatures (ifnb) # 计算高变基因 ifnb <- ScaleData (ifnb) # 缩放高变基因 ifnb <- RunPCA (ifnb) # PCA分析 ifnb <- FindNeighbors (ifnb, dims = 1 : 30 , reduction = "pca" ) # 计算最近临近图 ifnb <- FindClusters (ifnb, resolution = 0.4 , cluster.name = "unintegrated_clusters" ) # 分群 # Modularity Optimizer version 1.3.0 by Ludo Waltmand Nees Jan van Eck # # Number of nodes: 900 # Number of edges: 23290 # # Running Louvain algorithm... # Maximum modularity in 10 random starts: 0.9462 # Number of communities: 10 # Elapsed time: 0 seconds ifnb <- RunUMAP (ifnb, dims = 1 : 30 , reduction = "pca" , reduction.name = "umap.unintegrated" ) # 利用不去批次的数据进行降维 DimPlot (ifnb, reduction = "umap.unintegrated" , group.by = c ( "group" , "seurat_clusters" )) # 降维图展示批次效应 table ( Idents (ifnb),ifnb @ meta.data $ group) # # AS1 C57 P3 # 0 13 71 44 # 1 56 48 5 # 2 47 40 15 # 3 46 33 23 # 4 28 0 69 # 5 49 17 28 # 6 18 52 22 # 7 7 46 26 # 8 19 2 42 # 9 4 0 30 2.1.3 接下来我们尝试使用一个新的数据做多样本merge 数据下载地址:https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE139107 library (Matrix) # read.delim直接读取压缩包,而且相较于read.table更为简洁,默认分隔符就是"\t" # 两个文件细胞数较多,读取时间约3min test1 <- read.delim ( "./GSE139107/GSE139107_MouseIRI_4hours.dge.txt.gz" ) test2 <- read.delim ( "./GSE139107/GSE139107_MouseIRI_12hours.dge.txt.gz" ) # 创建对象 obj4 <- CreateSeuratObject ( counts = test1) obj5 <- CreateSeuratObject ( counts = test2) obj <- merge (obj4, y = obj5, project = "ALL" ) head (obj @ meta.data) table (obj @ meta.data $ orig.ident) head ( colnames (obj4)) head ( colnames (obj5)) # 标准化 obj <- NormalizeData (obj) # 计算高变基因 obj <- FindVariableFeatures (obj) # 缩放高变基因 obj <- ScaleData (obj) # PCA分析 obj <- RunPCA (obj) # 计算最近临近图 obj <- FindNeighbors (obj, dims = 1 : 30 , reduction = "pca" ) # 分群 obj <- FindClusters (obj, resolution = 0.4 , cluster.name = "unintegrated_clusters" ) # 利用不去批次的数据进行降维 obj <- RunUMAP (obj, dims = 1 : 30 , reduction = "pca" , reduction.name = "umap.unintegrated" ) # saveRDS(obj,"iri.intergrate.havebatch.rds") # saveRDS(obj4,"iri.obj4.rds") # saveRDS(obj5,"iri.obj5.rds") obj4 <- readRDS ( "iri.obj4.rds" ) obj5 <- readRDS ( "iri.obj5.rds" ) obj <- readRDS ( "iri.intergrate.havebatch.rds" ) cols <- c ( " , " , " , " , " , " , " , " , " , " , " , " , " , " , " , " , " , " , " , " , " , " , " , " , " , " , " , " , " , " ) # 保存色板,后续会使用到这个色板 # save(cols,file = "color.Rdata") p1 <- DimPlot (obj, reduction = "umap.unintegrated" , group.by = "orig.ident" , cols = cols) # 降维图展示批次效应 # 指定颜色 p2 <- DimPlot (obj, reduction = "umap.unintegrated" , group.by = "seurat_clusters" , label = T, cols = cols ) p1 | p2 table ( Idents (obj),obj @ meta.data $ orig.ident) # # IRI12h1b1 IRI12h1b2 IRI12h2 IRI12h3 IRI4h1 IRI4h2 IRI4h3 # 0 16 7 42 2 1775 29 1758 # 1 576 45 469 442 873 201 711 # 2 634 586 425 492 205 65 431 # 3 573 560 503 484 18 211 50 # 4 1203 82 52 996 11 18 15 # 5 1147 29 32 1049 6 7 6 # 6 7 35 962 0 8 5 982 # 7 339 327 197 282 207 50 211 # 8 2 1560 7 1 0 0 4 # 9 0 1405 13 0 0 1 0 # 10 7 1 2 5 1239 9 54 # 11 248 287 239 177 163 62 101 # 12 209 184 175 178 230 63 236 # 13 7 5 1 4 1186 36 18 # 141 122 126 88 527 36 192 # 15 4 3 1173 0 1 0 8 # 16 229 169 167 164 133 49 252 # 17 6 9 11 2 414 11 377 # 18 4 0 1 0 0 792 0 # 19 200 196 65 129 5 3 15 # 20 1 0 0 2 0 297 0 # 21 40 27 37 32 52 25 42 可以看到的是,数据存在一定的批次效应,这也就是为什么我们需要在多样本整合的过程中使用下面几节的去批次算法。
2.2. anchor(CCA) 此方法为最耗费内存以及计算时间的方法,但去批次效果最为强劲,且能返回 integrated 去批次整合后获得的表达矩阵。
Canonical Correlation Analysis (CCA)去批次比较耗时且占内存,我们可以用以下方式开启并行计算:# 先设置并行计算,加速计算过程:makecore <- function (workcore,memory){ if ( ! require (Seurat)) install.packages ( 'Seurat' ) if ( ! require (future)) install.packages ( 'future' ) plan ( "multisession" , workers = workcore) options ( future.globals.maxSize= memory * 1024 * 1024 ** 2 ) } makecore ( 4 , 4 ) # 以下是Seurat V4的运行方式,这里不做运行 library (Seurat) library (tidyverse) # 定义一个函数帮助预处理数据 myfunction1 <- function (testA.seu){ testA.seu <- NormalizeData (testA.seu, normalization.method = "LogNormalize" , scale.factor = 10000 ) testA.seu <- FindVariableFeatures (testA.seu, selection.method = "vst" , nfeatures = 2000 ) return (testA.seu) } # 执行对每个数据的NormalizeData(标准化表达矩阵)与FindVariableFeatures(高变基因计算),这两部操作的过程可参考第三讲:相关链接见专题页 obj4_test <- myfunction1 (obj4) obj5_test <- myfunction1 (obj5) # Integration步骤,真正耗时耗电的步骤:testAB.anchors <- FindIntegrationAnchors ( object.list = list (obj4_test,obj5_test), dims = 1 : 20 ) # 寻找数据锚定点,约25min testAB.integrated <- IntegrateData ( anchorset = testAB.anchors, dims = 1 : 20 ) # 根据锚定点整合数据 # 需要注意的是:上面的整合步骤相对于**harmony**整合方法更耗费计算资源,尤其是较大的数据集(几万个细胞)非常消耗内存和时间,大约**9G**的数据在**32G**的内存下就已经无法运行;
当存在某一个`Seurat`对象细胞数很少(印象中200以下这样子),会报错,这时建议用第二种整合方法。
# 下游计算 # # 将后续计算用默认矩阵由"RNA"改为"integrated" DefaultAssay (testAB.integrated) <- "integrated" # 整合后的数据从scale开始运行,后续基本与单样本分析部分无异 testAB.integrated <- ScaleData (testAB.integrated, features = rownames (testAB.integrated)) testAB.integrated <- RunPCA (testAB.integrated, npcs = 50 , verbose = FALSE ) testAB.integrated <- FindNeighbors (testAB.integrated, dims = 1 : 30 ) testAB.integrated <- FindClusters (testAB.integrated, resolution = 0.4 ) testAB.integrated <- RunUMAP (testAB.integrated, dims = 1 : 30 ) testAB.integrated <- RunTSNE (testAB.integrated, dims = 1 : 30 ) DimPlot (testAB.integrated, reduction= "umap" ) DimPlot (testAB.integrated, reduction= "tsne" ) # saveRDS(testAB.integrated,"iri.intergrate.havebatch.v4CCA.rds") Seurat V5 中的运行则会简单许多:# 多个layer之间直接去批次整合:obj_test1 <- IntegrateLayers ( object = obj, method = CCAIntegration, # 以CCA的方式整合 orig.reduction = "pca" , new.reduction = "integrated.cca" , # 整合后的降维命名,收录在'obj@reductions$integrated.cca'中 verbose = FALSE ) # 可以看到过程非常简单,省去了原来寻找anchor的步骤 # 把两个Seurat对象的layer合并:obj_test1[[ "RNA" ]] <- JoinLayers (obj_test1[[ "RNA" ]]) obj_test1 <- FindNeighbors (obj_test1, reduction = "integrated.cca" , dims = 1 : 30 ) # 利用去批次后的数据计算最近临近图 obj_test1 <- FindClusters (obj_test1, resolution = 0.4 ) # 利用去批次后的数据分群 obj_test1 <- RunUMAP (obj_test1, dims = 1 : 30 , reduction = "integrated.cca" ) # 利用去批次后的数据降维 # 保存Seurat对象 # saveRDS(obj_test1,"iri.intergrate.havebatch.v5CCA.rds") obj_test1 <- readRDS ( "iri.intergrate.havebatch.v5CCA.rds" ) # 可视化:p3 <- DimPlot (obj_test1, reduction = "umap" , group.by = "orig.ident" , cols = cols) # 查看降维结果 p4 <- DimPlot (obj_test1, reduction = "umap" , group.by = "seurat_clusters" , cols = cols) p3 | p4 table ( Idents (obj_test1),obj_test1 @ meta.data $ orig.ident) # # IRI12h1b1 IRI12h1b2 IRI12h2 IRI12h3 IRI4h1 IRI4h2 IRI4h3 # 0 1050 136 63 973 2867 640 128 # 1 12 1038 1081 5 53 68 1647 # 2 1207 73 27 998 1215 299 63 # 3 7 1126 966 2 7 27 972 # 4 580 560 503 485 459 214 240 # 5 297 565 490 204 364 74 696 # 6 560 487 393 439 199 76 420 # 7 334 504 207 222 414 165 322 # 8 327 305 185 265 219 51 283 # 9 212 188 175 180 230 63 236 # 10 248 288 238 177 165 66 102 # 11 282 7 6 239 487 116 8 # 12 220 168 173 161 123 48 193 # 13 142 121 129 91 173 36 84 # 14 75 46 26 56 26 2 28 # 15 40 27 37 32 52 25 41 # 现在也不会生成新的'integrated'矩阵来干扰大家,assay中仍是只有'RNA'矩阵 names (obj_test1 @ assays) # [1] "RNA" 2.3. harmony 速度快、内存少 2.3.1 预处理 if ( ! require (harmony))devtools :: install_github ( "immunogenomics/harmony" ) # harmony前需要完成标准化、高变基因计算、scale、PCA等分析 obj_test2 <- merge (obj4, y = obj5, project = "ALL" ) obj_test2 <- obj_test2 %>% NormalizeData %>% FindVariableFeatures ( selection.method = "vst" , nfeatures = 2000 ) %>% ScaleData obj_test2 <- RunPCA (obj_test2, npcs = 50 , verbose = FALSE ) 2.3.2 harmony run 到 PCA 再进行 harmony,harmony 实际上相当于一种降维方法 obj_test2 <- obj_test2 %>% RunHarmony ( "orig.ident" , plot_convergence = TRUE ) # 以Harmony的结果进行后续的降维与分群 # UMAP及分群:obj_test2 <- obj_test2 %>% RunUMAP ( reduction = "harmony" , dims = 1 : 30 ) %>% FindNeighbors ( reduction = "harmony" , dims = 1 : 30 ) %>% FindClusters ( resolution = 0.4 ) %>% identity # TSNE, 运行时间比umap长 obj_test2 <- obj_test2 %>% RunTSNE ( reduction = "harmony" , dims = 1 : 30 ) # 保存seurat对象 # saveRDS(obj_test2,"iri.intergrate.harmony.rds") # 创建图片对象 obj_test2 <- readRDS ( "iri.intergrate.harmony.rds" ) p5 <- DimPlot (obj_test2, reduction = "umap" , group.by = "orig.ident" , pt.size= 0.5 , cols = cols) + theme ( axis.line = element_blank , axis.ticks = element_blank , axis.text = element_blank ) p6 <- DimPlot (obj_test2, reduction = "umap" , group.by = "ident" , pt.size= 0.5 , label = TRUE , repel = TRUE , cols = cols) + theme ( axis.line = element_blank , axis.ticks = element_blank , axis.text = element_blank ) # 看一下降维结果展示 p5 | p6 第五讲:细胞类型注释 这一讲内容关于细胞类型注释,之前的课程中我们已对单细胞转录组测序的基本分析流程做了基本的梳理,相信大家已经可以完成一些基础的分析。
单细胞转录组测序的 pipeline 设计较为复杂、也不能像 bulk RNA-Seq 一样设计通用的 自动化 分析流程,最大的原因便是不同数据集之间的细胞群注释 迥然不同,分析数据时需要做到足够的 个性化。因此本次课程我们给大家提供了 数据库注释 、 singleR 内置数据集与自定义注释 、 Seurat内置注释 等方法供大家选择。相信大家完成本次课程的学习,可以解决单细胞类群注释的所有问题。
这部分内容Seurat v4与Seurat V5也没有明显差别。1. 视频及测试文件 1.1 视频教程 1.2 测试文件 请在推送中获取:手把手教你做单细胞测序数据分析
(五):细胞类型注释 2. 代码 2.1 准备工作 进行预处理,作到细胞注释前的步骤 第一个代码块中 没有注释,看不懂的同学去看一下我们的第三讲 单样本分析 # 预处理到细胞注释前的步骤:if ( ! require ( "Seurat" )) install.packages ( "Seurat" ) if ( ! require ( "BiocManager" , quietly = TRUE )) install.packages ( "BiocManager" ) if ( ! require ( "multtest" , quietly = TRUE )) install.packages ( "multtest" ) if ( ! require ( "dplyr" , quietly = TRUE )) install.packages ( "dplyr" ) pbmc.data <- Read10X ( './filtered_gene_bc_matrices/hg19/' ) pbmc <- CreateSeuratObject ( counts = pbmc.data, project = "pbmc3k" , min.cells = 3 , min.features = 200 ) pbmc[[ "percent.mt" ]] <- PercentageFeatureSet (pbmc, pattern = "^MT-" ) pbmc <- subset (pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5 ) pbmc <- NormalizeData (pbmc, normalization.method = "LogNormalize" , scale.factor = 10000 ) pbmc <- FindVariableFeatures (pbmc, selection.method = "vst" , nfeatures = 2000 ) top10 <- head ( VariableFeatures (pbmc), 10 ) pbmc <- ScaleData (pbmc, features = rownames (pbmc)) pbmc <- ScaleData (pbmc, vars.to.regress = "percent.mt" ) pbmc <- RunPCA (pbmc, features = VariableFeatures ( object = pbmc)) print (pbmc[[ "pca" ]], dims = 1 : 5 , nfeatures = 5 ) # PC_ 1 # Positive: CST3, TYROBP, LST1, AIF1, FTL # Negative: MALAT1, LTB, IL32, IL7R, CD2 # PC_ 2 # Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 # Negative: NKG7, PRF1, CST7, GZMA, GZMB # PC_ 3 # Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPA1 # Negative: PPBP, PF4, SDPR, SPARC, GNG11 # PC_ 4 # Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 # Negative: VIM, IL7R, S100A6, S100A8, IL32 # PC_ 5 # Positive: GZMB, FGFBP2, S100A8, NKG7, GNLY # Negative: LTB, IL7R, CKB, MS4A7, RP11-290F20.3 VizDimLoadings (pbmc, dims = 1 : 2 , reduction = "pca" ) DimHeatmap (pbmc, dims = 1 , cells = 500 , balanced = TRUE ) DimHeatmap (pbmc, dims = 1 : 15 , cells = 500 , balanced = TRUE ) pbmc <- JackStraw (pbmc, num.replicate = 100 ) pbmc <- ScoreJackStraw (pbmc, dims = 1 : 20 ) JackStrawPlot (pbmc, dims = 1 : 15 ) pbmc <- FindNeighbors (pbmc, dims = 1 : 10 ) pbmc <- FindClusters (pbmc, resolution = 0.5 ) # Modularity Optimizer version 1.3.0 by Ludo Waltmand Nees Jan van Eck # # Number of nodes: 2638 # Number of edges: 95893 # # Running Louvain algorithm... # Maximum modularity in 10 random starts: 0.8735 # Number of communities: 9 # Elapsed time: 0 seconds head ( Idents (pbmc), 5 ) # AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 # 0 3 2 1 # AAACCGTGTATGCG-1 # 6 # Levels: 0 1 2 3 4 5 6 7 8 pbmc <- RunUMAP (pbmc, dims = 1 : 10 ) pbmc <- RunTSNE (pbmc, dims = 1 : 10 ) DimPlot (pbmc, reduction = "umap" , label = TRUE ) pbmc.markers <- FindAllMarkers (pbmc, only.pos = TRUE , min.pct = 0.25 , logfc.threshold = 0.25 ) library (dplyr) pbmc.markers %>% group_by (cluster) %>% top_n ( n = 2 , wt = avg_log2FC) # # A tibble: 18 × 7 # # Groups: cluster [9] # p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene # <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr> # 1 5.01e- 85 2.39 0.437 0.108 6.88e- 81 0 CCR7 # 2 2.05e- 49 2.09 0.334 0.103 2.80e- 45 0 LEF1 # 3 3.10e-139 7.28 0.3 0.004 4.25e-135 1 FOLR3 # 4 1.63e-121 6.74 0.277 0.006 2.23e-117 1 S100A12 # 5 2.91e- 58 1.68 0.667 0.251 3.98e- 54 2 CD2 # 6 8.70e- 38 1.99 0.282 0.07 1.19e- 33 2 CD40LG # 7 6.59e-271 7.23 0.564 0.01 9.03e-267 3 LINC00926 # 8 1.45e-235 6.95 0.488 0.007 2.00e-231 3 VPREB3 # 9 4.27e-176 4.52 0.573 0.05 5.85e-172 4 GZMK # 10 3.36e- 81 3.61 0.392 0.06 4.60e- 77 4 GZMH # 11 1.69e-212 5.43 0.506 0.01 2.32e-208 5 CDKN1C # 12 8.23e-168 5.88 0.37 0.005 1.13e-163 5 CKB # 13 8.19e-175 6.16 0.468 0.013 1.12e-170 6 AKR1C3 # 14 8.58e-113 6.08 0.292 0.007 1.18e-108 6 SH2D1B # 15 1.48e-220 7.63 0.812 0.011 2.03e-216 7 FCER1A # 16 1.46e-207 8.03 0.5 0.002 2.00e-203 7 SERPINF1 # 17 0 14.3 0.571 0 0 8 LY6G6F # 18 4.36e-206 13.8 0.357 0 5.98e-202 8 RP11-879F14.2 2.2 方法一:查数据库 library (dplyr) # 按照log2FC筛选top10基因 top10 <- pbmc.markers %>% group_by (cluster) %>% top_n ( n = 10 , wt = avg_log2FC) DoHeatmap (pbmc, features = top10 $ gene) + NoLegend VlnPlot (pbmc, features = top10 $ gene[ 1 : 20 ], pt.size= 0 ) DimPlot (pbmc, label = T) 通过标记基因及文献,可以人工确定各分类群的细胞类型,则可以如下手动添加细胞群名称:帮助单细胞测序进行注释的数据库中这两个是我比较心水的:cellmarker MCA bfreaname.pbmc <- pbmc # 新的细胞名称,与cluster一一对应 new.cluster.ids <- c ( "Naive CD4 T" , "Memory CD4 T" , "CD14+ Mono" , "B" , "CD8 T" , "FCGR3A+ Mono" , "NK" , "DC" , "Platelet" ) # 重命名 names (new.cluster.ids) <- levels (pbmc) pbmc <- RenameIdents (pbmc, new.cluster.ids) DimPlot (pbmc, reduction = "umap" , label = TRUE , pt.size = 0.5 ) + NoLegend 2.3 方法二:通过 singleR 进行注释 这一方法有一定的局限性。
主要是由于很难找到合适的 参考数据,但是对于免疫细胞的注释还是有可观效果的。
if ( ! require (SingleR)) install.packages ( "/path/to/project" , repos = NULL , type = "source" ) # remove.packages('matrixStats') # if(!require(matrixStats))install.packages("matrixStats", repos = "第五讲.细胞类型注释/matrixStats_1.0.0.tar.gz", type = "source") # remove.packages(c('dplyr','ellipsis')) # install.packages(c('dplyr','ellipsis')) # remove.packages(c('vctrs')) # install.packages(c('vctrs')) if ( ! require (celldex))devtools :: install_github ( "ArtifactDB/gypsum-R" ) if ( ! require (alabaster.schemas))devtools :: install_github ( 'ArtifactDB/alabaster.schemas' ) if ( ! require (celldex))BiocManager :: install ( "rhdf5" , force = T) if ( ! require (celldex))devtools :: install_github ( 'ArtifactDB/alabaster.base' ) if ( ! require (celldex))devtools :: install_github ( 'LTLA/celldex' ) # 如果需要选择更新某些包,推荐选1更新所有包 # cg=BlueprintEncodeData 人类血液、免疫细胞 # cg=DatabaseImmuneCellExpressionData 人类高质量免疫细胞 # cg=NovershternHematopoieticData 人类造血系统细胞 # cg=MonacoImmuneData 人类外周血单核细胞(PBMC) # cg=ImmGenData 小鼠免疫细胞 # cg=MouseRNAseqData 从GEO下载的小鼠多组织RNA-seq数据集 # cg=HumanPrimaryCellAtlasData 人类多种组织的原代细胞 # 例如我们要使用的免疫细胞参考数据集,由于网络等原因这条函数你可能运行不了 cg <- ImmGenData ref <- MonacoImmuneData # saveRDS(cg,'./cg.rds')# 我已经帮大家保存在了本地 # saveRDS(ref,'./MonacoImmuneData.rds') ref <- readRDS ( './MonacoImmuneData.rds' ) assay_for_SingleR <- GetAssayData (bfreaname.pbmc, slot= "data" ) predictions <- SingleR ( test= assay_for_SingleR, ref= ref, labels= ref $ label.main) table (predictions $ labels) # # B cells CD4+ T cells CD8+ T cells Dendritic cells Monocytes # 346 898 310 45 628 # NK cells Progenitors T cells # 163 15 233 # 取出celltype对应数据框 cellType = data.frame ( seurat= bfreaname.pbmc @ meta.data $ seurat_clusters, predict= predictions $ labels) sort ( table (cellType[, 1 ])) # # 8 7 6 5 4 3 2 1 0 # 14 32 154 162 316 3429 480 709 table (cellType[, 1 : 2 ]) # predict # seurat B cells CD4+ T cells CD8+ T cells Dendritic cells Monocytes NK cells # 0 1 508 161 0 0 0 # 1 3 0 0 14 463 0 # 2 0 377 33 0 0 0 # 3 341 1 0 0 0 0 # 4 0 12 113 0 0 15 # 5 0 0 0 0 162 0 # 6 0 0 3 0 0 148 # 7 0 0 0 31 1 0 # 8 1 0 0 0 2 0 # predict # seurat Progenitors T cells # 0 4 35 # 1 0 0 # 2 0 19 # 3 0 0 # 4 0 176 # 5 0 0 # 6 0 3 # 7 0 0 # 8 11 0 当你有合适的参考数据集时可以用 方法三方法四 进行注释 2.4 方法三:自定义 singleR 的注释 if ( ! require (SingleR)) install.packages ( "第五讲.细胞类型注释/SingleR_1.10.0.tar.gz" , repos = NULL , type = "source" ) library (Seurat) library (ggplot2) if ( ! require (textshape)) install.packages ( 'textshape' ) if ( ! require (scater))BiocManager :: install ( 'scater' ) if ( ! require (SingleCellExperiment))BiocManager :: install ( 'SingleCellExperiment' ) library (dplyr) # 读入scRNA数据 myref <- pbmc # 将当前标签存入celltype中 myref $ celltype <- Idents (myref) table ( Idents (myref)) # # Naive CD4 T Memory CD4 T CD14+ Mono B CD8 T FCGR3A+ Mono # 709 480 429 342 3162 # NK DC Platelet # 154 32 14 # 读入参考数据集 # 对myref对象中 RNA 数据的平均表达值进行log1p变换。
log1p 函数用于计算一个数加 1 之后的自然对数,即 log1p(x) 等价于 log(1 + x),这里的 log 通常指自然对数(以 e 为底。log1p变换通常用于处理表达数据,它可以将数据转换到对数尺度,有助于稳定方差、使数据分布更接近正态分布,同时避免因对零值取对数而产生的错误。
Refassay <- log1p ( AverageExpression (myref, verbose = FALSE ) $ RNA) head (Refassay) # 6 x 9 sparse Matrix of class "dgCMatrix" # Naive CD4 T Memory CD4 T CD14+ Mono B CD8 T # AL627309.1 0.006006857 0.04740195 0.006651185 . 0.01746659 # AP006222.2 . 0.01082590 0.009196592 . 0.01016628 # RP11-206L10.2 0.007300235 . . 0.02055829 . # RP11-206L10.9 . 0.01044641 . . . # LINC00115 0.014803993 0.03685000 0.033640559 0.03836728 0.01657028 # NOC2L 0.410974333 0.24101294 0.312227749 0.46371195 0.39676059 # FCGR3A+ Mono NK DC Platelet # AL627309.1 . . . . # AP006222.2 . . . . # RP11-206L10.2 . . 0.0812375 . # RP11-206L10.9 0.01192865 . . . # LINC00115 0.01458683 0.05726061 . . # NOC2L 0.40564359 0.53378022 0.2841343 . dim (Refassay) # [1] 13714 9 ref_sce <- SingleCellExperiment :: SingleCellExperiment ( assays= list ( counts= Refassay)) ref_sce = scater :: logNormCounts (ref_sce) # logcounts 主要用于获取或设置单细胞RNA-seq数据中的对数标准化表达矩阵 logcounts (ref_sce)[ 1 : 4 , 1 : 4 ] # 4 x 4 sparse Matrix of class "dgCMatrix" # Naive CD4 T Memory CD4 T CD14+ Mono B # AL627309.1 0.009250892 0.06711500 0.009538416 . # AP006222.2 . 0.01560549 0.013172144 . # RP11-206L10.2 0.011235027 . . 0.02967623 # RP11-206L10.9 . 0.01506130 . . colData (ref_sce) $ Type = colnames (Refassay) ref_sce # class: SingleCellExperiment # dim: 13714 9 # metadata(0): # assays(2): counts logcounts # rownames(13714): AL627309.1 AP006222.2 ... PNRC2.1 SRSF10.1 # rowData names(0): # colnames(9): Naive CD4 T Memory CD4 T ... DC Platelet # colData names(2): sizeFactor Type # reducedDimNames(0): # mainExpName: NULL # altExpNames(0): # 提取自己的单细胞矩阵 testdata <- GetAssayData (bfreaname.pbmc, slot= "data" ) pred <- SingleR ( test= testdata, ref= ref_sce, labels= ref_sce $ Type, = [email protected] ) # 查看各label对应的细胞数量 table (pred $ labels) # # B CD14+ Mono CD8 T DC FCGR3A+ Mono Memory CD4 T # 344 537 308 31 179 464 # Naive CD4 T NK Platelet # 601 160 14 head (pred) # DataFrame with 6 rows and 4 columns # scores labels delta.next # <matrix> <character> <numeric> # AAACATACAACCAC-1 0.336675:0.424807:0.430908:... CD8 T 0.0335526 # AAACATTGAGCTAC-1 0.494458:0.381603:0.391545:... B 0.0931824 # AAACATTGATCAGC-1 0.373678:0.519949:0.489834:... CD14+ Mono 0.0647984 # AAACCGTGCTTCCG-1 0.341852:0.362193:0.353068:... FCGR3A+ Mono 0.0327884 # AAACCGTGTATGCG-1 0.233921:0.279848:0.329152:... NK 0.3171146 # AAACGCACTGGTAC-1 0.391590:0.458438:0.426201:... CD14+ Mono 0.0678422 # pruned.labels # <character> # AAACATACAACCAC-1 CD8 T # AAACATTGAGCTAC-1 B # AAACATTGATCAGC-1 CD14+ Mono # AAACCGTGCTTCCG-1 FCGR3A+ Mono # AAACCGTGTATGCG-1 NK # AAACGCACTGGTAC-1 CD14+ Mono as.data.frame ( table (pred $ labels)) # Var1 Freq # 1 B 344 # 2 CD14+ Mono 537 # 3 CD8 T 308 # 4 DC 31 # 5 FCGR3A+ Mono 179 # 6 Memory CD4 T 464 # 7 Naive CD4 T 601 # 8 NK 160 # 9 Platelet 14 cellType = data.frame ( seurat= bfreaname.pbmc @ meta.data $ seurat_clusters, predict= pred $ labels) sort ( table (cellType[, 1 ])) # # 8 7 6 5 4 3 2 1 0 # 14 32 154 162 316 3429 480 709 table (cellType[, 1 : 2 ]) # predict # seurat B CD14+ Mono CD8 T DC FCGR3A+ Mono Memory CD4 T Naive CD4 T NK # 0 2 159 13 0 0 0 535 0 # 1 0 0 0 1 17 462 0 0 # 2 0 355 10 0 0 0 64 0 # 3 341 1 0 0 0 0 0 0 # 4 1 22 282 0 0 0 2 9 # 5 0 0 0 0 162 0 0 0 # 6 0 0 3 0 0 0 0 151 # 7 0 0 0 30 0 2 0 0 # 8 0 0 0 0 0 0 0 0 # predict # seurat Platelet # 0 0 # 1 0 # 2 0 # 3 0 # 4 0 # 5 0 # 6 0 # 7 0 # 8 14 lalala <- as.data.frame ( table (cellType[, 1 : 2 ])) finalmap <- lalala %>% group_by (seurat) %>% top_n ( n = 1 , wt = Freq) finalmap <- finalmap[ order (finalmap $ seurat),] $ predict print (finalmap) # [1] Naive CD4 T Memory CD4 T CD14+ Mono B CD8 T # [6] FCGR3A+ Mono NK DC Platelet # 9 Levels: B CD14+ Mono CD8 T DC FCGR3A+ Mono Memory CD4 T Naive CD4 T ... Platelet testname <- bfreaname.pbmc # 重命名 new.cluster.ids <- as.character (finalmap) names (new.cluster.ids) <- levels (testname) testname <- RenameIdents (testname, new.cluster.ids) p1 <- DimPlot (pbmc, label = T) p2 <- DimPlot (testname, label = T) p1 | p2 2.5 Seurat内置的TransferData 将参考数据与待注释数据进行映射处理 library (Seurat) pancreas.query <- bfreaname.pbmc # 与cca那里一样寻找Anchors pancreas.anchors <- FindTransferAnchors ( reference = pbmc, query = pancreas.query, dims = 1 : 30 ) pbmc $ celltype <- Idents (pbmc) predictions <- TransferData ( anchorset = pancreas.anchors, refdata = pbmc $ celltype, dims = 1 : 30 ) # 将预测结果添加到待测数据集中 pancreas.query <- AddMetaData (pancreas.query, metadata = predictions) pancreas.query $ prediction.match <- pancreas.query $ predicted.id table (pancreas.query $ prediction.match) # # B CD14+ Mono CD8 T DC FCGR3A+ Mono Memory CD4 T # 340 430 299 32 158 485 # Naive CD4 T NK Platelet # 730 151 13 Idents (pancreas.query) <- 'prediction.match' p1 <- DimPlot (pbmc, label = T) p2 <- DimPlot (pancreas.query, label = T) p1 | p2 3. 环境版本信息 # 最近发现代码运行的最大障碍是各个packages的版本冲突问题,这里列出本次分析环境中的所有信息,报错时可以参考。
检查一下环境是否与我的相同: sessionInfo 可以看出,各种注释方法都是比较科学的,但最大的挑战来源于 参考数据集 对于 待注释数据集 间的适用度 第六讲:组间差异分析 测试文件及代码地址:点击下载 1. 读入并检查数据 library (Seurat) library (dplyr) pbmc <- readRDS ( './pbmcrenamed.rds' ) pbmc # An object of class Seurat # 22254 features across 900 samples within 2 assays # Active assay: RNA (20254 features, 0 variable features) # 2 layers present: counts, data # 1 other assay present: integrated # 3 dimensional reductions calculated: pca, umap, tsne DimPlot (pbmc) names (pbmc @ meta.data) # [1] "orig.ident" "nCount_RNA" # [3] "nFeature_RNA" "percent.mt" # [5] "group" "integrated_snn_res.0.025" # [7] "seurat_clusters" "celltype.group" # [9] "celltype" unique (pbmc $ group) # [1] "C57" "AS1" "P3" DimPlot (pbmc, split.by = 'group' ) 2. 差异分析 pbmc $ celltype.group <- paste (pbmc $ celltype, pbmc $ group, sep = "_" ) pbmc $ celltype <- Idents (pbmc) Idents (pbmc) <- "celltype.group" table ( Idents (pbmc)) # # VSMC_C57 T cell_C57 Fibro_C57 EC_C57 # 63 32 48 59 # Myeloid cells_C57 B cell_C57 Mono_C57 Macro_C57 # 44 39 21 3 # Mono_AS1 T cell_AS1 Macro_AS1 Neut_AS1 # 49 45 25 28 # Fibro_AS1 Myeloid cells_AS1 B cell_AS1 EC_AS1 # 16 52 46 18 # VSMC_AS1 Macro_P3 Neut_P3 Mono_P3 # 8 72 30 # B cell_P3 T cell_P3 Fibro_P3 VSMC_P3 # 15 23 36 29 # EC_P3 Myeloid cells_P3 # 23 4 mydeg <- FindMarkers (pbmc, ident.1 = 'VSMC_AS1' , ident.2 = 'VSMC_C57' , verbose = FALSE , test.use = 'wilcox' , min.pct = 0.1 ) head (mydeg) # p_val avg_log2FC pct.1 pct.2 p_val_adj # Cd24a 6.327111e-07 4.931989 0.500 0.016 0.01281493 # Spta1 9.387127e-07 4.618852 0.375 0.000 0.01901269 # Lum 9.387127e-07 9.785188 0.375 0.000 0.01901269 # Gda 9.387127e-07 5.350293 0.375 0.000 0.01901269 # Isg20 6.651476e-06 4.669678 0.500 0.032 0.13471900 # Hbb-bt 7.937909e-06 4.454887 0.500 0.032 0.16077441 3. 解放生产力 通过循环自动计算差异基因 cellfordeg <- levels (pbmc $ celltype) cellfordeg # [1] "VSMC" "T cell" "Macro" "B cell" # [5] "Fibro" "Myeloid cells" "Mono" "Neut" # [9] "EC" for (i in 1 : length (cellfordeg)){ ident1 = paste0 (cellfordeg[i], "_P3" ) print (ident1) ident2 = paste0 (cellfordeg[i], "_AS1" ) print (ident2) CELLDEG <- FindMarkers (pbmc, ident.1 = ident1 , ident.2 = ident2, verbose = FALSE ) write.csv (CELLDEG, paste0 ( "../result_seuratV5/" ,cellfordeg[i], ".CSV" )) } # [1] "VSMC_P3" # [1] "VSMC_AS1" # [1] "T cell_P3" # [1] "T cell_AS1" # [1] "Macro_P3" # [1] "Macro_AS1" # [1] "B cell_P3" # [1] "B cell_AS1" # [1] "Fibro_P3" # [1] "Fibro_AS1" # [1] "Myeloid cells_P3" # [1] "Myeloid cells_AS1" # [1] "Mono_P3" # [1] "Mono_AS1" # [1] "Neut_P3" # [1] "Neut_AS1" # [1] "EC_P3" # [1] "EC_AS1" list.files ( "../result_seuratV5/" ) # [1] "B cell.CSV" "EC.CSV" "Fibro.CSV" # [4] "Macro.CSV" "Mono.CSV" "Myeloid cells.CSV" # [7] "Neut.CSV" "pbmc.rds" "T cell.CSV" # [10] "VSMC.CSV" 4. 可视化方法 library (dplyr) # CELLDEG为EC现在是"Neut_P3"和"Neut_AS1"的组间差异基因 CELLDEG <- read.table ( "../result_seuratV5/Neut.CSV" , header = T, sep = "," , row.names = 1 ) head (CELLDEG) # p_val avg_log2FC pct.1 pct.2 p_val_adj # Sptlc2 1.731937e-06 -1.6468877 0.222 0.750 0.03507865 # Ifnar1 2.754659e-06 -1.6050075 0.125 0.607 0.05579286 # Igkc 4.157411e-06 -0.5147022 0.042 0.429 0.08420420 # Slbp 4.318635e-05 -2.9794327 0.069 0.393 0.87469624 # Rabep2 5.839358e-05 -4.0548058 0.000 0.214 1.00000000 # Itsn1 8.990620e-05 -4.3010964 0.014 0.250 1.00000000 top10 <- CELLDEG %>% top_n ( n = 10 , wt = avg_log2FC) %>% row.names top10 # [1] "Trappc11" "Fmo2" "Nqo1" "Wdr7" "Clec4n" "Stfa3" # [7] "Cxcl1" "Csta2" "Tnfsf9" "Stfa1" pbmc <- ScaleData (pbmc, features = rownames (pbmc)) pbmc_sub <- subset (pbmc, idents = c ( "Neut_P3" , "Neut_AS1" )) DoHeatmap (pbmc_sub, features = top10, size= 3 ) VlnPlot (pbmc_sub, features = top10) FeaturePlot (pbmc_sub , features = top10) DotPlot (pbmc_sub , features = top10) 5. 提取表达量,用ggplot2 DIY一个箱线图 # mymatrix <- as.data.frame (pbmc @ assays $ RNA @ data) mymatrix2 <- t (mymatrix) %>% as.data.frame mymatrix2 $ celltype <- pbmc $ celltype mymatrix2[, ncol (mymatrix2) + 1 ] <- mymatrix2[, ncol (mymatrix2) + 1 ] <- pbmc $ group colnames (mymatrix2)[ ncol (mymatrix2)] <- "group" library (ggplot2) p1 <- ggplot2 :: ggplot (mymatrix2, aes ( x= celltype, y= Thbs1, fill= group)) + geom_boxplot ( alpha= 0.7 ) + scale_y_continuous ( name = "Expression" ) + scale_x_discrete ( name= "Celltype" ) + scale_fill_manual ( values = c ( 'deepskyblue' , 'orange' , 'pink' )) p1 # 另一种提取方法,借助log1p函数提取 Idents (pbmc) <- colnames (pbmc) mymatrix <- log1p ( AverageExpression (pbmc, verbose = FALSE ) $ RNA) mymatrix2 <- t (mymatrix) %>% as.data.frame mymatrix2 $ celltype <- pbmc $ celltype mymatrix2[, ncol (mymatrix2) + 1 ] <- pbmc $ group colnames (mymatrix2)[ ncol (mymatrix2)] <- "group" library (ggplot2) p2 <- ggplot2 :: ggplot (mymatrix2, aes ( x= celltype, y= Thbs1, fill= group)) + geom_boxplot ( alpha= 0.7 ) + scale_y_continuous ( name = "Expression" ) + scale_x_discrete ( name= "Celltype" ) + scale_fill_manual ( values = c ( 'deepskyblue' , 'orange' , 'pink' )) p2 library (patchwork) p1 | p2 第七讲:富集分析 上一讲中刚教会大家如何进行差异基因的计算及可视化,然而大家拿着成千上百的差异基因直接调研、撰文显然是不现实的,因此需要通过富集分析去揭示这些基因群背后蕴含的 生物学意义 1. 视频及测试文件 1.1 视频教程 1.2 测试文件 如果你获取了当前教学文档,那么随之下载的便是测试文件,此外还可在推送中获取:手把手教你做单细胞测序数据分析
(七):基因集富集分析 2. 代码 suppressWarnings ( suppressMessages ( if ( ! require (rvcheck))devtools :: install_version ( "rvcheck" , version = "0.1.8" , repos = "http://cran.us.r-project.org" ))) suppressWarnings ( suppressMessages ( if ( ! require (clusterProfiler))BiocManager :: install ( "clusterProfiler" ))) suppressWarnings ( suppressMessages ( if ( ! require (org.Mm.eg.db))BiocManager :: install ( "org.Mm.eg.db" ))) suppressWarnings ( suppressMessages ( if ( ! require (org.Hs.eg.db))BiocManager :: install ( "org.Hs.eg.db" ))) suppressWarnings ( suppressMessages ( library (dplyr))) 2.1 准备基因和绘图函数 # 读取marker列表 bcell.marker <- read.csv ( './bcell.marker.CSV' , row.names = 1 ) head (bcell.marker) # p_val avg_log2FC pct.1 pct.2 p_val_adj SYMBOL # GZMB 3.186146e-97 -5.507567 0.017 0.961 4.369480e-93 GZMB # NKG7 1.262672e-94 -6.262287 0.087 1.000 1.731628e-90 NKG7 # PRF1 1.835435e-93 -4.612813 0.029 0.948 2.517115e-89 PRF1 # GZMA 3.036581e-93 -4.466820 0.015 0.929 4.164367e-89 GZMA # CTSW 3.151498e-93 -4.203776 0.070 0.987 4.321965e-89 CTSW # CST7 9.767143e-92 -4.380031 0.038 0.948 1.339466e-87 CST7 # 筛选出B细胞marker中p_val_adj最显著(越小越显著)的100个基因用于下游分析 mygene <- bcell.marker %>% top_n ( n = - 100 , wt = p_val_adj) %>% rownames mygene[ 1 : 10 ] # [1] "GZMB" "NKG7" "PRF1" "GZMA" "CTSW" "CST7" "GNLY" "FGFBP2" # [9] "CD247" "FCGR3A" # 利用org.Hs.eg.db整合基因名、 ENTREZID、ENSEMBL gene.df <- bitr (mygene, fromType= "SYMBOL" , toType= c ( "ENTREZID" , "ENSEMBL" ), OrgDb = org.Hs.eg.db) 2.2 KEGG 与 GO 富集分析 # 使用在线数据,可能会受域名限制 # 设置clusterProfiler包的下载方法,保证在使用clusterProfiler进行数据库查询或注释数据下载时,可以自动选择合适的下载方式(如 curl、wget 或 internal) R.utils :: setOption ( "clusterProfiler.download.method" , "auto" ) # 进行kegg富集分析 ekegg <- enrichKEGG ( unique (gene.df $ ENTREZID), organism= 'hsa' , pvalueCutoff= 0.05 , pAdjustMethod= 'BH' , qvalueCutoff= 0.2 , minGSSize= 10 , maxGSSize= 500 , use_internal_data= F) # 把ENTREZID转换为SYMBOL ekegg <- setReadable (ekegg, 'org.Hs.eg.db' , 'ENTREZID' ) write.csv (ekegg @ result, '../result_enrich/kegg.result.csv' ) library ( "enrichplot" ) dotplot (ekegg, x = "GeneRatio" , color = "p.adjust" , size = "Count" , showCategory = 10 ) # 保存图片 png ( file = '../result_enrich/kegg.png' , width = 10 , height = 10 ) dotplot (ekegg, x = "GeneRatio" , color = "p.adjust" , size = "Count" , showCategory = 10 ) dev.off # png # 2 如果你在做KEGG富集分析时遇到这样的报错:那么你的 clusterProfiler 需要更新一下:# remove.packages('clusterProfiler') # devtools::install_github("YuLab-SMU/clusterProfiler") function、biological process、cell component三个类 # MF ggoMF <- enrichGO (gene.df $ SYMBOL, org.Hs.eg.db, keyType = "SYMBOL" , ont = "MF" , pvalueCutoff = 0.05 , pAdjustMethod = "BH" , qvalueCutoff = 0.2 , minGSSize = 10 , maxGSSize = 500 , readable = FALSE , pool = FALSE ) write.csv (ggoMF @ result, '../result_enrich/ggoMF.result.csv' ) dotplot (ggoMF, x = "GeneRatio" , color = "p.adjust" , size = "Count" , showCategory = 10 ) png ( file = '../result_enrich/ggoMF.png' , width = 10 , height = 10 ) dotplot (ggoMF, x = "GeneRatio" , color = "p.adjust" , size = "Count" , showCategory = 10 ) dev.off # png # 2 # CC ggoCC <- enrichGO (gene.df $ SYMBOL, org.Hs.eg.db, keyType = "SYMBOL" , ont = "CC" , pvalueCutoff = 0.05 , pAdjustMethod = "BH" , qvalueCutoff = 0.2 , minGSSize = 10 , maxGSSize = 500 , readable = FALSE , pool = FALSE ) write.csv (ggoCC @ result, '../result_enrich/GO.CC.result.csv' ) dotplot (ggoCC, x = "GeneRatio" , color = "p.adjust" , size = "Count" , showCategory = 10 ) png ( file = '../result_enrich/ggoCC.png' , width = 10 , height = 10 ) dotplot (ggoCC, x = "GeneRatio" , color = "p.adjust" , size = "Count" , showCategory = 10 ) dev.off # png # 2 ggoBP <- enrichGO (gene.df $ SYMBOL, org.Hs.eg.db, keyType = "SYMBOL" , ont = "BP" , pvalueCutoff = 0.05 , pAdjustMethod = "BH" , qvalueCutoff = 0.2 , minGSSize = 10 , maxGSSize = 500 , readable = FALSE , pool = FALSE ) write.csv (ggoBP @ result, '../result_enrich/GO.BP.result.csv' ) dotplot (ggoBP, x = "GeneRatio" , color = "p.adjust" , size = "Count" , showCategory = 10 ) png ( file = '../result_enrich/ggoBP.png' , width = 10 , height = 10 ) dotplot (ggoBP, x = "GeneRatio" , color = "p.adjust" , size = "Count" , showCategory = 10 ) dev.off # png # 2 2.3 GSEA分析 bcell.marker $ SYMBOL <- rownames (bcell.marker) genetable <- merge (gene.df,bcell.marker, by= 'SYMBOL' ) geneList <- genetable $ avg_log2FC names (geneList) <- genetable $ ENTREZID # 用于gsea分析的genelist本质上是一个名为ENTREZID的数值排序结果 geneList <- sort (geneList, decreasing = T) egseGO <- gseGO (geneList, ont = "MF" , org.Hs.eg.db, exponent = 1 , nPerm = 1000 , minGSSize = 10 , maxGSSize = 500 , pvalueCutoff = 1 , pAdjustMethod = "BH" , verbose = TRUE , seed = FALSE , by = "fgsea" ) BP kk2 <- gseKEGG ( geneList= geneList, organism = "hsa" , => human , mmu => mouse nPerm= 1000 , minGSSize = 120 , pvalueCutoff = 1 , verbose = FALSE ) library (patchwork) # 拼图神包,详细使用方法可以看:相关链接见专题页 for (i in 1 : length (egseGO $ Description)) { plot1 <- gseaplot (egseGO, geneSetID = i, title = egseGO $ Description[i]) ggsave ( filename = paste0 ( "../result_enrich/" ,egseGO $ Description[i], ".pdf" ), plot = plot1[[ 2 ]] / plot1[[ 1 ]], height = 10 , width = 10 ) } gseaplot (egseGO, geneSetID = 1 , title = egseGO $ Description[ 1 ]) 2.4 GSVA分析 一文了解GSVA原理 三行搞定GSVA分析 一文搞定GSVA及下游分析 单细胞中如何做GSVA 第八讲:Seurat分类变量处理技巧 # 首先读入原始文件 library (Seurat) pbmc <- readRDS ( './pbmcrenamed.rds' ) # 查看metadata里存储了哪些信息 head (pbmc @ meta.data) # orig.ident nCount_RNA nFeature_RNA percent.mt group # AAAGGGCTCGCAGAGA-1_1 C57 15155 3673 4.988453 C57 # AACACACAGCATCCCG-1_1 C57 6300 1995 7.174603 C57 # AACAGGGTCAACACCA-1_1 C57 21299 4453 2.427344 C57 # AACCCAAAGGCCTTGC-1_1 C57 1160 688 3.706897 C57 # AACCCAAAGTGGACGT-1_1 C57 1036 387 1.833977 C57 # AACCCAAGTCCGAAGA-1_1 C57 3480 1401 2.126437 C57 # integrated_snn_res.0.025 seurat_clusters celltype.group # AAAGGGCTCGCAGAGA-1_1 0 0 VSMC_C57 # AACACACAGCATCCCG-1_1 1 1 T cell_C57 # AACAGGGTCAACACCA-1_1 4 4 Fibro_C57 # AACCCAAAGGCCTTGC-1_1 8 8 EC_C57 # AACCCAAAGTGGACGT-1_1 5 5 Myeloid cells_C57 # AACCCAAGTCCGAAGA-1_1 1 1 T cell_C57 # celltype # AAAGGGCTCGCAGAGA-1_1 VSMC # AACACACAGCATCCCG-1_1 T cell # AACAGGGTCAACACCA-1_1 Fibro # AACCCAAAGGCCTTGC-1_1 EC # AACCCAAAGTGGACGT-1_1 Myeloid cells # AACCCAAGTCCGAAGA-1_1 T cell names (pbmc @ meta.data) # [1] "orig.ident" "nCount_RNA" # [3] "nFeature_RNA" "percent.mt" # [5] "group" "integrated_snn_res.0.025" # [7] "seurat_clusters" "celltype.group" # [9] "celltype" # 查看初始分类变量 head (pbmc @ active.ident) # AAAGGGCTCGCAGAGA-1_1 AACACACAGCATCCCG-1_1 AACAGGGTCAACACCA-1_1 # VSMC T cell Fibro # AACCCAAAGGCCTTGC-1_1 AACCCAAAGTGGACGT-1_1 AACCCAAGTCCGAAGA-1_1 # EC Myeloid cells T cell # Levels: VSMC T cell Macro B cell Fibro Myeloid cells Mono Neut EC table (pbmc @ active.ident) # # VSMC T cell Macro B cell Fibro # 100 # Myeloid cells Mono Neut EC # 100 table ( Idents (pbmc)) # # VSMC T cell Macro B cell Fibro # 100 # Myeloid cells Mono Neut EC # 100 # 修改初始分类变量为group Idents (pbmc) <- "group" table (pbmc @ active.ident) # # C57 AS1 P3 # 309 287 304 table ( Idents (pbmc)) # # C57 AS1 P3 # 309 287 304 unique (pbmc @ active.ident) # [1] C57 AS1 P3 # Levels: C57 AS1 P3 unique (pbmc $ group) # [1] "C57" "AS1" "P3" # 筛选部分分组 unique (pbmc $ group)[ 1 : 2 ] # [1] "C57" "AS1" pbmc_sub <- subset (pbmc, idents = unique (pbmc $ group)[ 1 : 2 ]) # 可以看看subset前后的变化,metadata里的变量都可以这样取子集 table ( Idents (pbmc_sub)) # # C57 AS1 # 309 287 # 以group列作为分类标签进行数据展示 # 计算线粒体比例,小鼠线粒体基因开头为mt- pbmc_sub[[ "percent.mt" ]] <- PercentageFeatureSet (pbmc_sub, pattern = "^mt-" ) # VlnPlot展示细胞质量 VlnPlot (pbmc_sub, features = c ( "nFeature_RNA" , "nCount_RNA" , "percent.mt" ), ncol = 3 ) # 过滤低质量的细胞 pbmc_sub <- subset (pbmc_sub, subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent.mt < 10 ) # 过滤后再次展示细胞质量 VlnPlot (pbmc_sub, features = c ( "nFeature_RNA" , "nCount_RNA" , "percent.mt" ), ncol = 3 ) # 数据标准化 suppressWarnings ( suppressMessages ({ pbmc_sub <- NormalizeData (pbmc_sub, normalization.method = "LogNormalize" , scale.factor = 10000 ) # 计算高变基因 pbmc_sub <- FindVariableFeatures (pbmc_sub, selection.method = "vst" , nfeatures = 2000 ) pbmc_sub <- ScaleData (pbmc_sub, features = rownames (pbmc_sub)) # 剔除不想要的变量(如线粒体的比例) pbmc_sub <- ScaleData (pbmc_sub, vars.to.regress = "percent.mt" ) # PCA降维分析 pbmc_sub <- RunPCA (pbmc_sub, features = VariableFeatures ( object = pbmc_sub)) # 计算细胞临近关系 pbmc_sub <- FindNeighbors (pbmc_sub, dims = 1 : 20 ) pbmc_sub <- RunUMAP (pbmc_sub, dims = 1 : 20 ) })) DimPlot (pbmc_sub, label = T) # 查看[email protected]包含的信息 names (pbmc_sub @ meta.data) # [1] "orig.ident" "nCount_RNA" # [3] "nFeature_RNA" "percent.mt" # [5] "group" "integrated_snn_res.0.025" # [7] "seurat_clusters" "celltype.group" # [9] "celltype" # 查看pbmc_sub每个细胞数目 table (pbmc_sub $ celltype) # # VSMC T cell Macro B cell Fibro # 60 76 22 85 56 # Myeloid cells Mono Neut EC # 95 64 28 63 # 查看pbmc_sub初始分类变量 table ( Idents (pbmc_sub)) # # C57 AS1 # 286 263 # 修改pbmc_sub的分类变量为celltype Idents (pbmc_sub) <- "celltype" table ( Idents (pbmc_sub)) # # VSMC T cell Macro B cell Fibro # 60 76 22 85 56 # Myeloid cells Mono Neut EC # 95 64 28 63 DimPlot (pbmc_sub, label = T) # 修改pbmc_sub的分类变量为celltype后,调整分类变量里面的元素名称 library (tidyverse) pbmc_sub $ new <- pbmc_sub $ group unique (pbmc_sub $ new) # [1] "C57" "AS1" pbmc_sub $ new <- recode (pbmc_sub $ new, "C57" = "sample1" , "AS1" = "sample2" ) Idents (pbmc_sub) <- "new" table ( Idents (pbmc_sub)) # # sample1 sample2 # 286 263 DimPlot (pbmc_sub) # 根据分类变量group拆分seurat对象 pbmc_sub.list <- SplitObject (pbmc_sub, split.by = 'group' ) C57 <- pbmc_sub.list[[ 1 ]] AS1 <- pbmc_sub.list[[ 2 ]] # 查看拆分后的对象类别 class (C57) # [1] "Seurat" # attr(,"package") # [1] "SeuratObject" class (AS1) # [1] "Seurat" # attr(,"package") # [1] "SeuratObject" # 拆分后的对象都添加一个新的类别 C57 $ group <- "group1" AS1 $ group <- "group2" # 重新合并两个seurat对象 pbmc_sub_new <- merge ( x= C57, y= AS1) table ( Idents (pbmc_sub_new)) # # sample1 sample2 # 286 263 unique (pbmc_sub_new $ group) # [1] "group1" "group2" # 调整分类变量为group Idents (pbmc_sub_new) <- "group" table ( Idents (pbmc_sub_new)) # # group1 group2 # 286 263 # 小提琴图展示基因表达量,同时分类变量是我们刚刚设置的group VlnPlot (pbmc_sub_new, features = "Mndal" , pt.size = 0 ) 第九讲:沉浸式统计细胞比例 本期测试的数据来源于数据集 GSE139107。
前期的数据处理可以参考我们之前推送的DimPlot优化内容改造单细胞降维图|1.DimPlot的探索 1. 准备工作 library (plotrix) library (dplyr) library (ggsci) library (Seurat) sce <- readRDS ( "iri.intergrate.harmony.rds" ) # 根据推文中所有样本的整合后的rds中已经包含了注释后的细胞名称,我们可以根据相同的barcode将新的rds中也修改成和原来一样的细胞名称 sce_origin <- readRDS ( "iri.intergrate.rds" ) sce @ meta.data $ Celltype <- sce_origin @ meta.data[ colnames (sce), "celltype" ] sce @ meta.data $ Celltype <- factor (sce @ meta.data $ Celltype, levels= levels (sce_origin)) Idents (sce) <- "Celltype" # 保存重新命名之后的rds # saveRDS(sce,"./iri.intergrate.harmony.rename.rds") library (plotrix) library (dplyr) library (ggsci) library (Seurat) # 读取添加了注释后细胞标签的rds sce <- readRDS ( "./iri.intergrate.harmony.rename.rds" ) # 选取部分细胞用于后续分析 sel <- c ( "PTS1" , "PTS2" , "NewPT1" , "MTAL" , "DCT" , "EC1" ) sce <- subset (sce, Celltype %in% sel) # 指定因子展示顺序 sce @ meta.data $ Celltype <- factor (sce @ meta.data $ Celltype, levels = sel) Idents (sce) <- factor ( Idents (sce), levels = sel) head (sce @ meta.data) # orig.ident nCount_RNA nFeature_RNA_snn_res.0.4 # IRI4h1_AAACCTGAGATTACCC IRI4h1 2490 564 6 # IRI4h1_AAACCTGAGTGTTAGA IRI4h1 514 320 3 # IRI4h1_AAACCTGCACCAACCG IRI4h1 1171 392 2 # IRI4h1_AAACCTGCAGCCTGTG IRI4h1 1712 530 1 # IRI4h1_AAACCTGGTGATGTGG IRI4h1 3062 696 6 # IRI4h1_AAACCTGGTTACGTCA IRI4h1 1996 556 0 # seurat_clusters Celltype # IRI4h1_AAACCTGAGATTACCC 6 NewPT1 # IRI4h1_AAACCTGAGTGTTAGA 3 EC1 # IRI4h1_AAACCTGCACCAACCG 2 DCT # IRI4h1_AAACCTGCAGCCTGTG 1 MTAL # IRI4h1_AAACCTGGTGATGTGG 6 NewPT1 # IRI4h1_AAACCTGGTTACGTCA 0 PTS2 unique (sce $ orig.ident) # [1] "IRI4h1" "IRI4h2" "IRI4h3" "IRI12h1b1" "IRI12h1b2" "IRI12h2" # [7] "IRI12h3" unique (sce $ Celltype) # [1] NewPT1 EC1 DCT MTAL PTS2 PTS1 # Levels: PTS1 PTS2 NewPT1 MTAL DCT EC1 table (sce $ orig.ident) # # IRI12h1b1 IRI12h1b2 IRI12h2 IRI12h3 IRI4h1 IRI4h2 IRI4h3 # 3286 3505 2780 2670 4804 143446 table (sce $ Celltype) # # PTS1 PTS2 NewPT1 MTAL DCT EC1 # 2646 4134 5305 4609 2322 2909 2. 饼图 # 提取细胞标签 mynames <- table (sce $ Celltype) %>% names mynames # [1] "PTS1" "PTS2" "NewPT1" "MTAL" "DCT" "EC1" # 提取细胞数目 myratio <- table (sce $ Celltype) %>% as.numeric myratio # [1] 2646 4134 5305 4609 2322 2909 # 计算细胞占比,每个细胞数目/总细胞数目 pielabel <- paste0 (mynames, " (" , round (myratio / sum (myratio) * 100 , 2 ), "%)" ) pielabel # [1] "PTS1 (12.07%)" "PTS2 (18.86%)" "NewPT1 (24.2%)" "MTAL (21.02%)" # [5] "DCT (10.59%)" "EC1 (13.27%)" cols <- pal_npg ( "nrc" )( 10 ) cols # [1] " " " " " " # [7] " " " " pie (myratio, labels= pielabel, radius = 1.0 , 参数用于控制饼图的半径大小,表示相对于图形区域的比例:默认值 radius = 1,即饼图占据完整的绘图区域(不会超出,但可能贴近边界)。
radius < 1 会缩小饼图,使其不占满整个绘图区域。
radius > 1 会扩大饼图,可能导致部分图形超出绘图区域并被截断 clockwise= F, 参数用于控制扇形区域的绘制方向,clockwise=F逆时针 main = "Celltype" , col = cols) # 绘制 3D 图,family 要设置你系统支持的中文字体库 pie3D (myratio, labels = pielabel, explode = 0.1 , 0 ~ 1 之间) main = "Cell Proption" , height = 0.3 , 3D 饼图的厚度,数值越大,立体效果越明显 labelcex = 1 ) 1,值越大字体越大 3. 甜甜圈图 # 并没有直接画甜甜圈图的R包,所以在饼图源代码的基础上改改 doughnut <- function (x, labels = names (x), edges = 200 , outer.radius = 0.8 , inner.radius= 0.6 , clockwise = FALSE , init.angle = if (clockwise) 90 else 0 , density = NULL , angle = 45 , col = NULL , border = FALSE , lty = NULL , main = NULL , ...) { if ( ! is.numeric (x) || any ( is.na (x) | x < 0 )) stop ( "'x' values must be positive." ) if ( is.null (labels)) labels <- as.character ( seq_along (x)) else labels <- as.graphicsAnnot (labels) x <- c ( 0 , cumsum (x) / sum (x)) dx <- diff (x) nx <- length (dx) plot.new pin <- par ( "pin" ) xlim <- ylim <- c ( - 1 , 1 ) if (pin[ 1 L ] > pin[ 2 L ]) xlim <- (pin[ 1 L ] / pin[ 2 L ]) * xlim else ylim <- (pin[ 2 L ] / pin[ 1 L ]) * ylim plot.window (xlim, ylim, "" , asp = 1 ) if ( is.null (col)) col <- if ( is.null (density)) palette else par ( "fg" ) col <- rep (col, length.out = nx) border <- rep (border, length.out = nx) lty <- rep (lty, length.out = nx) angle <- rep (angle, length.out = nx) density <- rep (density, length.out = nx) twopi <- if (clockwise) - 2 * pi else 2 * pi t2xy <- function (t, radius) { t2p <- twopi * t + init.angle * pi / 180 list ( x = radius * cos (t2p), y = radius * sin (t2p)) } for (i in 1 L : nx) { n <- max ( 2 , floor (edges * dx[i])) P <- t2xy ( seq.int (x[i], x[i + 1 ], length.out = n), outer.radius) polygon ( c (P $ x, 0 ), c (P $ y, 0 ), density = density[i], angle = angle[i], border = border[i], col = col[i], lty = lty[i]) Pout <- t2xy ( mean (x[i + 0 : 1 ]), outer.radius) lab <- as.character (labels[i]) if ( ! is.na (lab) && nzchar (lab)) { lines ( c ( 1 , 1.05 ) * Pout $ x, c ( 1 , 1.05 ) * Pout $ y) text ( 1.1 * Pout $ x, 1.1 * Pout $ y, labels[i], xpd = TRUE , adj = ifelse (Pout $ x < 0 , 1 , 0 ), ...) } Pin <- t2xy ( seq.int ( 0 , 1 , length.out = n * nx), inner.radius) polygon (Pin $ x, Pin $ y, density = density[i], angle = angle[i], border = border[i], col = "white" , lty = lty[i]) } title ( main = main, ...) invisible ( NULL ) } # 绘图 df <- table (sce $ Celltype) %>% as.data.frame df # Var1 Freq # 1 PTS1 2646 # 2 PTS2 4134 # 3 NewPT1 5305 # 4 MTAL 4609 # 5 DCT 2322 # 6 EC1 2909 labs <- paste0 (df $ Var1, " (" , round (df $ Freq / sum (df $ Freq) * 100 , 2 ), "%)" ) labs # [1] "PTS1 (12.07%)" "PTS2 (18.86%)" "NewPT1 (24.2%)" "MTAL (21.02%)" # [5] "DCT (10.59%)" "EC1 (13.27%)" doughnut ( df $ Freq, labels= labs, init.angle= 0 , # 设置初始角度 col = cols , # 设置颜色 border= "white" , # 边框颜色 inner.radius= 0.4 , # 内环大小 cex = 1 ) # 字体大小 4. 堆积柱状图 library (ggplot2) cellnum <- table (sce $ Celltype,sce $ orig.ident) cell.prop <- as.data.frame ( prop.table (cellnum)) colnames (cell.prop) <- c ( "Celltype" , "Group" , "Proportion" ) cell.prop # Celltype Group Proportion # 1 PTS1 IRI12h1b1 0.014139111 # 2 PTS2 IRI12h1b1 0.020250855 # 3 NewPT1 IRI12h1b1 0.035393387 # 4 MTAL IRI12h1b1 0.038677309 # 5 DCT IRI12h1b1 0.016009122 # 6 EC1 IRI12h1b1 0.025404789 # 7 PTS1 IRI12h1b2 0.016693273 # 8 PTS2 IRI12h1b2 0.021026226 # 9 NewPT1 IRI12h1b2 0.041778791 # 10 MTAL IRI12h1b2 0.039452680 # 11 DCT IRI12h1b2 0.016966933 # 12 EC1 IRI12h1b2 0.023945268 # 13 PTS1 IRI12h2 0.012862030 # 14 PTS2 IRI12h2 0.022576967 # 15 NewPT1 IRI12h2 0.025906499 # 16 MTAL IRI12h2 0.029509692 # 17 DCT IRI12h2 0.013500570 # 18 EC1 IRI12h2 0.022440137 # 19 PTS1 IRI12h3 0.014367161 # 20 PTS2 IRI12h3 0.020114025 # 21 NewPT1 IRI12h3 0.023854048 # 22 MTAL IRI12h3 0.030239453 # 23 DCT IRI12h3 0.011904219 # 24 EC1 IRI12h3 0.021299886 # 25 PTS1 IRI4h1 0.031790194 # 26 PTS2 IRI4h1 0.056647662 # 27 NewPT1 IRI4h1 0.054640821 # 28 MTAL IRI4h1 0.033112885 # 29 DCT IRI4h1 0.023397948 # 30 EC1 IRI4h1 0.019521095 # 31 PTS1 IRI4h2 0.003466363 # 32 PTS2 IRI4h2 0.007616876 # 33 NewPT1 IRI4h2 0.028734322 # 34 MTAL IRI4h2 0.009897377 # 35 DCT IRI4h2 0.006066135 # 36 EC1 IRI4h2 0.009623717 # 37 PTS1 IRI4h3 0.027366021 # 38 PTS2 IRI4h3 0.040319270 # 39 NewPT1 IRI4h3 0.031653364 # 40 MTAL IRI4h3 0.029327252 # 41 DCT IRI4h3 0.018061574 # 42 EC1 IRI4h3 0.010444698 p.bar <- ggplot (cell.prop, aes (Group,Proportion, fill= Celltype)) + geom_bar ( stat= "identity" , position= "fill" ) + scale_fill_manual ( values= cols) + ggtitle ( "cell portation" ) + theme_bw + theme ( axis.ticks.length= unit ( 0.5 , 'cm' )) + guides ( fill= guide_legend ( title= NULL )) p.bar 5. 箱线图 cellnum <- table (sce $ Celltype,sce $ orig.ident) cellnum # # IRI12h1b1 IRI12h1b2 IRI12h2 IRI12h3 IRI4h1 IRI4h2 IRI4h3 # PTS1 310 366 282 315 697 76 600 # PTS2 444 461 495 441 1242 167 884 # NewPT1 776 916 568 523 1198 630 694 # MTAL 848 865 647 663 726 217 643 # DCT 351 372 296 261 5133 396 # EC1 557 525 492 467 428 211 229 for (i in 1 : ncol (cellnum)) { cellnum[,i] <- cellnum[,i] / sum (cellnum[,i]) } cellnum # # IRI12h1b1 IRI12h1b2 IRI12h2 IRI12h3 IRI4h1 IRI4h2 # PTS1 0.09433962 0.10442225 0.10143885 0.11797753 0.14508743 0.05299861 # PTS2 0.13511869 0.13152639 0.17805755 0.16516854 0.25853455 0.11645746 # NewPT1 0.23615338 0.26134094 0.20431655 0.19588015 0.24937552 0.43933054 # MTAL 0.25806452 0.24679030 0.23273381 0.24831461 0.15112406 0.15132497 # DCT 0.10681680 0.10613409 0.10647482 0.09775281 0.10678601 0.09274756 # EC1 0.16950700 0.14978602 0.17697842 0.17490637 0.08909242 0.14714086 # # IRI4h3 # PTS1 0.17411492 # PTS2 0.25652931 # NewPT1 0.20139292 # MTAL 0.18659315 # DCT 0.11491584 # EC1 0.06645386 cellnum <- as.data.frame (cellnum) cellnum # Var1 Var2 Freq # 1 PTS1 IRI12h1b1 0.09433962 # 2 PTS2 IRI12h1b1 0.13511869 # 3 NewPT1 IRI12h1b1 0.23615338 # 4 MTAL IRI12h1b1 0.25806452 # 5 DCT IRI12h1b1 0.10681680 # 6 EC1 IRI12h1b1 0.16950700 # 7 PTS1 IRI12h1b2 0.10442225 # 8 PTS2 IRI12h1b2 0.13152639 # 9 NewPT1 IRI12h1b2 0.26134094 # 10 MTAL IRI12h1b2 0.24679030 # 11 DCT IRI12h1b2 0.10613409 # 12 EC1 IRI12h1b2 0.14978602 # 13 PTS1 IRI12h2 0.10143885 # 14 PTS2 IRI12h2 0.17805755 # 15 NewPT1 IRI12h2 0.20431655 # 16 MTAL IRI12h2 0.23273381 # 17 DCT IRI12h2 0.10647482 # 18 EC1 IRI12h2 0.17697842 # 19 PTS1 IRI12h3 0.11797753 # 20 PTS2 IRI12h3 0.16516854 # 21 NewPT1 IRI12h3 0.19588015 # 22 MTAL IRI12h3 0.24831461 # 23 DCT IRI12h3 0.09775281 # 24 EC1 IRI12h3 0.17490637 # 25 PTS1 IRI4h1 0.14508743 # 26 PTS2 IRI4h1 0.25853455 # 27 NewPT1 IRI4h1 0.24937552 # 28 MTAL IRI4h1 0.15112406 # 29 DCT IRI4h1 0.10678601 # 30 EC1 IRI4h1 0.08909242 # 31 PTS1 IRI4h2 0.05299861 # 32 PTS2 IRI4h2 0.11645746 # 33 NewPT1 IRI4h2 0.43933054 # 34 MTAL IRI4h2 0.15132497 # 35 DCT IRI4h2 0.09274756 # 36 EC1 IRI4h2 0.14714086 # 37 PTS1 IRI4h3 0.17411492 # 38 PTS2 IRI4h3 0.25652931 # 39 NewPT1 IRI4h3 0.20139292 # 40 MTAL IRI4h3 0.18659315 # 41 DCT IRI4h3 0.11491584 # 42 EC1 IRI4h3 0.06645386 # 计算关注的组间显著性 library (reshape2) library (ggplot2) library (ggsci) library (ggsignif) library (ggpubr) colnames (cellnum) <- c ( 'Celltype' , 'Sample' , 'Freq' ) cellnum $ Group <- c ( rep ( "IRI12h" , times= 24 ), rep ( "IRI4h" , times= 18 )) cellnum # Celltype Sample Freq Group # 1 PTS1 IRI12h1b1 0.09433962 IRI12h # 2 PTS2 IRI12h1b1 0.13511869 IRI12h # 3 NewPT1 IRI12h1b1 0.23615338 IRI12h # 4 MTAL IRI12h1b1 0.25806452 IRI12h # 5 DCT IRI12h1b1 0.10681680 IRI12h # 6 EC1 IRI12h1b1 0.16950700 IRI12h # 7 PTS1 IRI12h1b2 0.10442225 IRI12h # 8 PTS2 IRI12h1b2 0.13152639 IRI12h # 9 NewPT1 IRI12h1b2 0.26134094 IRI12h # 10 MTAL IRI12h1b2 0.24679030 IRI12h # 11 DCT IRI12h1b2 0.10613409 IRI12h # 12 EC1 IRI12h1b2 0.14978602 IRI12h # 13 PTS1 IRI12h2 0.10143885 IRI12h # 14 PTS2 IRI12h2 0.17805755 IRI12h # 15 NewPT1 IRI12h2 0.20431655 IRI12h # 16 MTAL IRI12h2 0.23273381 IRI12h # 17 DCT IRI12h2 0.10647482 IRI12h # 18 EC1 IRI12h2 0.17697842 IRI12h # 19 PTS1 IRI12h3 0.11797753 IRI12h # 20 PTS2 IRI12h3 0.16516854 IRI12h # 21 NewPT1 IRI12h3 0.19588015 IRI12h # 22 MTAL IRI12h3 0.24831461 IRI12h # 23 DCT IRI12h3 0.09775281 IRI12h # 24 EC1 IRI12h3 0.17490637 IRI12h # 25 PTS1 IRI4h1 0.14508743 IRI4h # 26 PTS2 IRI4h1 0.25853455 IRI4h # 27 NewPT1 IRI4h1 0.24937552 IRI4h # 28 MTAL IRI4h1 0.15112406 IRI4h # 29 DCT IRI4h1 0.10678601 IRI4h # 30 EC1 IRI4h1 0.08909242 IRI4h # 31 PTS1 IRI4h2 0.05299861 IRI4h # 32 PTS2 IRI4h2 0.11645746 IRI4h # 33 NewPT1 IRI4h2 0.43933054 IRI4h # 34 MTAL IRI4h2 0.15132497 IRI4h # 35 DCT IRI4h2 0.09274756 IRI4h # 36 EC1 IRI4h2 0.14714086 IRI4h # 37 PTS1 IRI4h3 0.17411492 IRI4h # 38 PTS2 IRI4h3 0.25652931 IRI4h # 39 NewPT1 IRI4h3 0.20139292 IRI4h # 40 MTAL IRI4h3 0.18659315 IRI4h # 41 DCT IRI4h3 0.11491584 IRI4h # 42 EC1 IRI4h3 0.06645386 IRI4h unique (cellnum $ Group) # [1] "IRI12h" "IRI4h" ggplot (cellnum, aes ( x= Celltype, y= Freq, fill= Group)) + geom_boxplot + # 箱线图 theme_bw + # 设置主题 theme ( panel.grid = element_blank , axis.text.x.bottom = element_text ( angle = 45 , hjust = 1 , vjust = 1 )) + # 调整x轴倾斜角度 scale_fill_manual ( values = pal_npg ( "nrc" )( 10 )) + # 调整箱子颜色 stat_compare_means ( aes ( group = Group), method = "t.test" , label = "p.signif" ) # 统计检验方法,可改为 "wilcox.test" ggplot (cellnum, aes ( x= Group, y= Freq, fill= Group)) + geom_boxplot + # 绘制箱线图 theme_bw + # 主题设置 theme ( panel.grid = element_blank , axis.text.x = element_text ( angle = 45 , hjust = 1 , vjust = 1 )) + # x轴标签角度调整 scale_fill_manual ( values = pal_npg ( "nrc" )( 10 )) + facet_wrap ( ~ Celltype) + # 颜色填充 geom_signif ( comparisons = list ( c ( "IRI12h" , "IRI4h" )), # 需要比较的组 map_signif_level = TRUE , # 自动映射显著性水平 test = "t.test" , # 统计检验方法,可改为 "wilcox.test" step_increase = 0.2 , # 调整 p 值标记的高度,防止与箱线图重叠 tip_length = 0.03 , # 调整误差线长度 textsize = 3 ) 第十讲:改造单细胞图形-DimPlot 1. 准备工作 library (Seurat) library (ggsci) sce <- readRDS ( "./iri.intergrate.harmony.rename.rds" ) # 选取部分细胞用于后续分析 sel <- c ( "PTS1" , "PTS2" , "NewPT1" , "DTL-ATL" , "MTAL" , "DCT" , "CNT" , "PC2" , "ICA" , "EC1" , "Fib" , "Mø" ) sce <- subset (sce, Celltype %in% sel) # 指定因子展示顺序 sce @ meta.data $ Celltype <- factor (sce @ meta.data $ Celltype, levels = sel) Idents (sce) <- factor ( Idents (sce), levels = sel) # DimPlot的完整参数 ?DimPlot 2. 调整DimPlot参数 2.1 cols: 修改颜色 mycol <- c ( pal_d3 ( 7 ), pal_aaas ( 7 ), pal_uchicago ( 7 ), pal_jama ( 7 )) DimPlot (sce) DimPlot (sce, cols = mycol) 2.2 pt.size: 修改点的大小 DimPlot (sce, pt.size = 0.01 ) | DimPlot (sce, pt.size = 1 ) 2.3 reduction: 降维 DimPlot (sce, reduction = 'umap' ) | DimPlot (sce, reduction = 'tsne' ) | DimPlot (sce, reduction = 'pca' ) 2.4 label相关 DimPlot (sce, label = F) | DimPlot (sce, label = T) # 筛选数量最多的五种细胞类型做展示 suppressMessages ( library (dplyr)) mycelltype <- table ( Idents (sce)) %>% sort ( decreasing = T) %>% names (.) %>% .[ c ( 1 : 5 )] mycelltype # [1] "NewPT1" "MTAL" "PTS2" "EC1" "PTS1" sce_sub <- subset (sce, idents= mycelltype) DimPlot (sce_sub, label = F) | DimPlot (sce, label = T) # 修改标签大小 # 一样选的比较极端,大家可以自行调整 DimPlot (sce, label = T, label.size = 2.5 ) | DimPlot (sce, label = T, label.size = 5 ) # 增加标签边框 DimPlot (sce, label = T, label.box = T) | DimPlot (sce, label = T, label.box = F) # repel参数控制细胞聚类标签的排布方式的 DimPlot (sce, label = T, label.box = T, repel = T) | DimPlot (sce, label = T, label.box = T, repel = F) # 高亮部分细胞 DimPlot (sce, label = T) | DimPlot (sce, label = T, cells.highlight = c ( 'PTS1' , 'PTS2' )) | DimPlot (sce, label = T, cells.highlight = list ( PTS1_and_PTS2= colnames (sce)[sce $ Celltype %in% c ( 'PTS1' , 'PTS2' )])) # 高亮部分细胞,同时修改点的大小 DimPlot (sce, label = T) | DimPlot (sce, label = T, cells.highlight = list ( PTS1_and_PTS2= colnames (sce)[sce $ Celltype %in% c ( 'PTS1' , 'PTS2' )]), sizes.highlight= 0.2 ) | DimPlot (sce, label = T, cells.highlight = list ( PTS1_and_PTS2= colnames (sce)[sce $ Celltype %in% c ( 'PTS1' , 'PTS2' )]), sizes.highlight= 2 ) 2.5 raster: 是否栅格化 # 即把矢量图像素化 DimPlot (sce, raster = F) | DimPlot (sce, raster = T) pdf ( file = "../result_dimplot/p1.pdf" ) DimPlot (sce, raster = F) dev.off # png # 2 pdf ( file = "../result_dimplot/p2.pdf" ) DimPlot (sce, raster = T) dev.off # png # 2 file.size ( "../result_dimplot/p1.pdf" ) # [1] 1585135 file.size ( "../result_dimplot/p2.pdf" ) # [1] 35107 # reaster.dpi:设定栅格化的分辨率 DimPlot (sce, raster = T, raster.dpi = c ( 512 , 512 )) | DimPlot (sce, raster = T, raster.dpi = c ( 200 , 200 )) 3. ggplot2直接中的DIY 3.1 数据整理 class ( DimPlot (sce)) # [1] "patchwork" "gg" "ggplot" # 这也就意味着,我们可以通过ggplot完成对DimPlot降维图的改造,大家可以先读一下DimPlot的源码了解一下:print (DimPlot) # function (object, dims = c(1, 2), cells = NULL, cols = NULL, # pt.size = NULL, reduction = NULL, group.by = NULL, split.by = NULL, # shape.by = NULL, order = NULL, shuffle = FALSE, seed = 1, # label = FALSE, label.size = 4, label.color = "black", label.box = FALSE, # repel = FALSE, alpha = 1, cells.highlight = NULL, cols.highlight = " # sizes.highlight = 1, na.value = "grey50", ncol = NULL, combine = TRUE, # raster = NULL, raster.dpi = c(512, 512)) # { # if (!is_integerish(x = dims, n = 2L, finite = TRUE) || !all(dims > # 0L)) { # abort(message = "'dims' must be a two-length integer vector") # } # reduction <- reduction %||% DefaultDimReduc(object = object) # cells <- cells %||% Cells(x = object, assay = DefaultAssay(object = object[[reduction]])) # dims <- paste0(Key(object = object[[reduction]]), dims) # orig.groups <- group.by # group.by <- group.by %||% "ident" # data <- FetchData(object = object, vars = c(dims, group.by), # cells = cells, clean = "project") # group.by <- colnames(x = data)[3:ncol(x = data)] # for (group in group.by) { # if (!is.factor(x = data[, group])) { # data[, group] <- factor(x = data[, group]) # } # } # if (!is.null(x = shape.by)) { # data[, shape.by] <- object[[shape.by, drop = TRUE]] # } # if (!is.null(x = split.by)) { # split <- FetchData(object = object, vars = split.by, # clean = TRUE)[split.by] # data <- data[rownames(split), ] # data[, split.by] <- split # } # if (isTRUE(x = shuffle)) { # set.seed(seed = seed) # data <- data[sample(x = 1:nrow(x = data)), ] # } # plots <- lapply(X = group.by, FUN = function(x) { # plot <- SingleDimPlot(data = data[, c(dims, x, split.by, # shape.by)], dims = dims, col.by = x, cols = cols, # pt.size = pt.size, shape.by = shape.by, order = order, # alpha = alpha, label = FALSE, cells.highlight = cells.highlight, # cols.highlight = cols.highlight, sizes.highlight = sizes.highlight, # na.value = na.value, raster = raster, raster.dpi = raster.dpi) # if (label) { # plot <- LabelClusters(plot = plot, id = x, repel = repel, # size = label.size, split.by = split.by, box = label.box, # color = label.color) # } # if (!is.null(x = split.by)) { # plot <- plot + FacetTheme + facet_wrap(facets = vars(!!sym(x = split.by)), # ncol = if (length(x = group.by) > 1 || is.null(x = ncol)) { # length(x = unique(x = data[, split.by])) # } # else { # ncol # }) # } # plot <- if (is.null(x = orig.groups)) { # plot + labs(title = NULL) # } # else { # plot + CenterTitle # } # }) # if (!is.null(x = split.by)) { # ncol <- 1 # } # if (combine) { # plots <- wrap_plots(plots, ncol = orig.groups %iff% ncol) # } # return(plots) # } # <bytecode: 0x557cc27befc8> # <environment: namespace:Seurat> print (SingleDimPlot) # function (data, dims, col.by = NULL, cols = NULL, pt.size = NULL, # shape.by = NULL, alpha = 1, alpha.by = NULL, order = NULL, # label = FALSE, repel = FALSE, label.size = 4, cells.highlight = NULL, # cols.highlight = " sizes.highlight = 1, na.value = "grey50", # raster = NULL, raster.dpi = NULL) # { # if ((nrow(x = data) > 1e+05) & is.null(x = raster)) { # message("Rasterizing points since number of points exceeds 100,000.", # "\nTo disable this behavior set `raster=FALSE`") # } # raster <- raster %||% (nrow(x = data) > 1e+05) # pt.size <- pt.size %||% AutoPointSize(data = data, raster = raster) # if (!is.null(x = cells.highlight) && pt.size != AutoPointSize(data = data, # raster = raster) && sizes.highlight != pt.size && isTRUE(x = raster)) { # warning("When `raster = TRUE` highlighted and non-highlighted cells must be the same size. Plot will use the value provided to 'sizes.highlight'.") # } # if (!is.null(x = raster.dpi)) { # if (!is.numeric(x = raster.dpi) || length(x = raster.dpi) != # 2) # stop("'raster.dpi' must be a two-length numeric vector") # } # if (length(x = dims) != 2) { # stop("'dims' must be a two-length vector") # } # if (!is.data.frame(x = data)) { # data <- as.data.frame(x = data) # } # if (is.character(x = dims) && !all(dims %in% colnames(x = data))) { # stop("Cannot find dimensions to plot in data") # } # else if (is.numeric(x = dims)) { # dims <- colnames(x = data)[dims] # } # if (!is.null(x = cells.highlight)) { # if (inherits(x = cells.highlight, what = "data.frame")) { # stop("cells.highlight cannot be a dataframe. ", "Please supply a vector list") # } # highlight.info <- SetHighlight(cells.highlight = cells.highlight, # cells.all = rownames(x = data), sizes.highlight = sizes.highlight %||% # pt.size, cols.highlight = cols.highlight, col.base = cols[1] %||% # " pt.size = pt.size, raster = raster) # order <- highlight.info$plot.order # data$highlight <- highlight.info$highlight # col.by <- "highlight" # pt.size <- highlight.info$size # cols <- highlight.info$color # } # if (!is.null(x = order) && !is.null(x = col.by)) { # if (typeof(x = order) == "logical") { # if (order) { # data <- data[order(!is.na(x = data[, col.by]), # data[, col.by]), ] # } # } # else { # order <- rev(x = c(order, setdiff(x = unique(x = data[, # col.by]), y = order))) # data[, col.by] <- factor(x = data[, col.by], levels = order) # new.order <- order(x = data[, col.by]) # data <- data[new.order, ] # if (length(x = pt.size) == length(x = new.order)) { # pt.size <- pt.size[new.order] # } # } # } # if (!is.null(x = col.by) && !col.by %in% colnames(x = data)) { # warning("Cannot find ", col.by, " in plotting data, not coloring plot") # col.by <- NULL # } # else { # col.index <- match(x = col.by, table = colnames(x = data)) # if (grepl(pattern = "^\\d", x = col.by)) { # col.by <- paste0("x", col.by) # } # else if (grepl(pattern = "-", x = col.by)) { # col.by <- gsub(pattern = "-", replacement = ".", # x = col.by) # } # colnames(x = data)[col.index] <- col.by # } # if (!is.null(x = shape.by) && !shape.by %in% colnames(x = data)) { # warning("Cannot find ", shape.by, " in plotting data, not shaping plot") # } # if (!is.null(x = alpha.by) && !alpha.by %in% colnames(x = data)) { # warning("Cannot find alpha variable ", alpha.by, " in data, setting to NULL", # call. = FALSE, immediate. = TRUE) # alpha.by <- NULL # } # plot <- ggplot(data = data) # plot <- if (isTRUE(x = raster)) { # plot + geom_scattermore(mapping = aes_string(x = dims[1], # y = dims[2], color = paste0("`", col.by, "`"), shape = shape.by, # alpha = alpha.by), pointsize = pt.size, alpha = alpha, # pixels = raster.dpi) # } # else { # plot + geom_point(mapping = aes_string(x = dims[1], y = dims[2], # color = paste0("`", col.by, "`"), shape = shape.by, # alpha = alpha.by), size = pt.size, alpha = alpha) # } # plot <- plot + guides(color = guide_legend(override.aes = list(size = 3, # alpha = 1))) + labs(color = NULL, title = col.by) + CenterTitle # if (label && !is.null(x = col.by)) { # plot <- LabelClusters(plot = plot, id = col.by, repel = repel, # size = label.size) # } # if (!is.null(x = cols)) { # if (length(x = cols) == 1 && (is.numeric(x = cols) || # cols %in% rownames(x = brewer.pal.info))) { # scale <- scale_color_brewer(palette = cols, na.value = na.value) # } # else if (length(x = cols) == 1 && (cols %in% c("alphabet", # "alphabet2", "glasbey", "polychrome", "stepped"))) { # colors <- DiscretePalette(length(unique(data[[col.by]])), # palette = cols) # scale <- scale_color_manual(values = colors, na.value = na.value) # } # else { # scale <- scale_color_manual(values = cols, na.value = na.value) # } # plot <- plot + scale # } # plot <- plot + theme_cowplot # return(plot) # } # <bytecode: 0x557cc334eea8> # <environment: namespace:Seurat> mydata <- sce @ reductions $ umap @ cell.embeddings # mydata[,3] <- as.data.frame(pbmc$celltype) mymeta <- as.data.frame (sce @ meta.data) mydata <- cbind (mydata,mymeta) head (mydata) # umap_1 umap_2 orig.ident nCount_RNA nFeature_RNA # IRI4h1_AAACCTGAGATTACCC 7.387328 -4.849914 IRI4h1 2490 564 # IRI4h1_AAACCTGAGTGTTAGA -2.336923 -12.271916 IRI4h1 514 320 # IRI4h1_AAACCTGCACCAACCG -3.402302 6.163878 IRI4h1 1171 392 # IRI4h1_AAACCTGCAGCCTGTG -3.669633 2.410121 IRI4h1 1712 530 # IRI4h1_AAACCTGCAGGGAGAG 1.870005 5.571418 IRI4h1 891 503 # IRI4h1_AAACCTGGTAGCGCTC -7.298200 -10.822622 IRI4h1 1121 521 # RNA_snn_res.0.4 seurat_clusters Celltype # IRI4h1_AAACCTGAGATTACCC 6 6 NewPT1 # IRI4h1_AAACCTGAGTGTTAGA 3 3 EC1 # IRI4h1_AAACCTGCACCAACCG 2 2 DCT # IRI4h1_AAACCTGCAGCCTGTG 1 1 MTAL # IRI4h1_AAACCTGCAGGGAGAG 10 ICA # IRI4h1_AAACCTGGTAGCGCTC 7 7 Fib 3.2 重现dimplot library (ggplot2) p <- ggplot (mydata, aes ( x= umap_1, y= umap_2, col= Celltype)) + geom_point ( size = 0.15 ) p # 去掉灰色背景 p + theme_bw # 去掉网格线 library (ggthemes) p + theme_bw + theme_few # 修改坐标轴,把边框去掉一半,这下坐标轴就一样了 p + theme_bw + theme_few + theme_classic # 修改点的大小,这样主图就基本就一模一样啦:ggplot (mydata, aes ( x= umap_1, y= umap_2, col= Celltype)) + geom_point ( size= 0.1 ) + theme_bw + theme_few + theme_classic | DimPlot (sce, pt.size = 0.1 ) 3.3 开始改造 3.3.1 添加置信区间 # 加一个光秃秃的置信区间 p <- ggplot (mydata, aes ( x= umap_1, y= umap_2, col= Celltype)) + geom_point ( size= 0.1 ) + theme_bw + theme_few + theme_classic p p + stat_ellipse p + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , alpha= 1 / 5 ) library (ggsci) mycol <- c ( pal_d3 ( 7 ), pal_aaas ( 7 ), pal_uchicago ( 7 ), pal_jama ( 7 )) p <- ggplot (mydata, aes ( x= umap_1, y= umap_2, col= Celltype)) + scale_color_manual ( values = mycol) + geom_point ( size= 0.1 ) + theme_bw + theme_few + theme_classic p p + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 2 , size= 1 , alpha= 0.2 ) + scale_fill_manual ( values = mycol) p + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 2 , size= 1 , alpha= 0.2 ) + scale_fill_manual ( values = mycol) | p + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 2 , size= 1 , alpha= 1 # 置信区间的透明度,=1时会将点覆盖 ) + scale_fill_manual ( values = mycol) # 修改置信区间线条类型 (p + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 1 , size= 1 , alpha= 0.2 ) + scale_fill_manual ( values = mycol) | p + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 3 , size= 1 , alpha= 0.2 # 置信区间的透明度,=1时会将点覆盖 ) + scale_fill_manual ( values = mycol) ) / ( p + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 4 , size= 1 , alpha= 0.2 ) + scale_fill_manual ( values = mycol) | p + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 5 , size= 1 , alpha= 0.2 # 置信区间的透明度,=1时会将点覆盖 ) + scale_fill_manual ( values = mycol) ) # 置信区间椭圆计算方式 # type 参数决定了椭圆的计算方式,即如何估计数据的分布形状。
可选的参数值如下:# "t"(默认值):计算基于t分布的置信椭圆,适用于小样本(n<30)。使用Mahalanobis距离计算椭圆边界点。计算的置信区间较宽,适合数据点较少的情况 # "norm"(正态分布):假设数据服从多元正态分布,使用协方差矩阵计算椭圆。计算方式基于 MASS::cov.trob 计算协方差,适用于大样本数据。# "euclid"(欧几里得距离):生成基于标准偏差的等距椭圆,而非置信区间。
计算方式为:以数据均值为中心,沿x和y轴分别按照个标准差绘制椭圆。适用于标准化数据(如 PCA 结果)。
p + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 2 , size= 1 , alpha= 0.2 , # 置信区间的透明度,=1时会将点覆盖 type= 't' ) + scale_fill_manual ( values = mycol) | p + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 2 , size= 1 , alpha= 0.2 , # 置信区间的透明度,=1时会将点覆盖 type= 'norm' ) + normal distribution 分布 scale_fill_manual ( values = mycol) | p + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 2 , size= 1 , alpha= 0.2 , # 置信区间的透明度,=1时会将点覆盖 type= 'euclid' ) + scale_fill_manual ( values = mycol) # 调整置信度(置信区间包含真实总体参数的概率,通常用百分比表示(如 95%、99%)) p + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 2 , size= 1 , alpha= 0.2 , # 置信区间的透明度,=1时会将点覆盖 type= 't' , level = 0.95 ) + scale_fill_manual ( values = mycol) | p + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 2 , size= 1 , alpha= 0.2 , # 置信区间的透明度,=1时会将点覆盖 type= 't' , level = 0.5 ) + scale_fill_manual ( values = mycol) # type=euclid时,level的值就是置信区间的半径,看看效果:p + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 2 , size= 1 , alpha= 0.2 , # 置信区间的透明度 type= 'euclid' , level = 1 ) + scale_fill_manual ( values = mycol) | p + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 2 , size= 1 , alpha= 0.2 , # 置信区间的透明度 type= 'euclid' , level = 3 ) + scale_fill_manual ( values = mycol) 3.3.2 坐标轴调整 # 原图 p <- ggplot (mydata, aes ( x= umap_1, y= umap_2, col= Celltype)) + scale_color_manual ( values = mycol) + geom_point ( size= 0.1 ) + theme_bw + theme_few + theme_classic + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 2 , size= 1 , alpha= 0.2 ) + # 置信区间的透明度,=1时会将点覆盖 scale_fill_manual ( values = mycol) p # 增加坐标轴两端箭头 p | (p + theme ( axis.line = element_line ( arrow = arrow ( length = unit ( 0.5 , 'cm' )))) ) # 隐藏坐标轴刻度和坐标轴文本 p | p + theme ( axis.ticks = element_blank , axis.text.y = element_blank ) + theme ( axis.text.x = element_blank , axis.text.y = element_blank ) # 箭头大小调整 p + theme ( axis.ticks = element_blank , axis.text.y = element_blank ) + theme ( axis.text.x = element_blank , axis.text.y = element_blank ) + theme ( axis.line = element_line ( arrow = arrow ( length = unit ( 0.1 , 'cm' )))) | p + theme ( axis.ticks = element_blank , axis.text.y = element_blank ) + theme ( axis.text.x = element_blank , axis.text.y = element_blank ) + theme ( axis.line = element_line ( arrow = arrow ( length = unit ( 1 , 'cm' )))) # 修改箭头的形状 p + theme ( axis.ticks = element_blank , axis.text.y = element_blank ) + theme ( axis.text.x = element_blank , axis.text.y = element_blank ) + theme ( axis.line = element_line ( arrow = arrow ( length = unit ( 0.5 , 'cm' ), type = 'open' ))) | p + theme ( axis.ticks = element_blank , axis.text.y = element_blank ) + theme ( axis.text.x = element_blank , axis.text.y = element_blank ) + theme ( axis.line = element_line ( arrow = arrow ( length = unit ( 0.5 , 'cm' ), type = 'closed' ))) # 缩短坐标轴长度 line.x.data <- data.frame ( x= c ( - 15 , - 10 ), y= c ( - 15 , - 15 )) line.y.data <- data.frame ( x= c ( - 15 , - 15 ), y= c ( - 15 , - 10 )) ggplot (mydata, aes ( x= umap_1, y= umap_2)) + scale_color_manual ( values = mycol) + geom_point ( data = mydata, size= 0.1 , aes ( col= Celltype)) + geom_line ( data = line.x.data, aes ( x = x, y = y), arrow = ggplot2 :: arrow ( length = unit ( 0.2 , "cm" ), type = 'closed' )) + geom_line ( data = line.y.data, aes ( x = x, y = y), arrow = ggplot2 :: arrow ( length = unit ( 0.2 , "cm" ), type = 'closed' )) + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 2 , size= 1 , alpha= 0.2 ) + scale_fill_manual ( values = mycol) + theme ( panel.border = element_blank , axis.title = element_blank , axis.text = element_blank , axis.ticks = element_blank , panel.background = element_rect ( fill = 'white' ), plot.background= element_rect ( fill= "white" )) 3.3.3 加上文本标签 # 生成标签的数据 my.label <- data.frame for (runcell in unique (mydata $ Celltype)) { umap_1 <- dplyr :: filter (mydata,Celltype == runcell) %>% pull (umap_1) %>% mean umap_2 <- dplyr :: filter (mydata,Celltype == runcell) %>% pull (umap_2) %>% mean my.label <- rbind (my.label, cbind (umap_1,umap_2)) } rownames (my.label) <- unique (mydata $ Celltype) my.label $ Celltype <- rownames (my.label) my.label # umap_1 umap_2 Celltype # NewPT1 6.827872 -1.2620147 NewPT1 # EC1 -2.075786 -11.0531261 EC1 # DCT -4.130859 5.7925009 DCT # MTAL -4.155011 1.1480176 MTAL # ICA 1.866516 5.4740464 ICA # Fib -6.432889 -9.7453412 Fib # PTS2 7.218748 0.8346768 PTS2 # PTS1 8.553653 2.6910735 PTS1 # Mø -2.993166 -7.0425605 Mø # PC2 -10.724453 5.2327600 PC2 # DTL-ATL -1.891181 -2.5475451 DTL-ATL # CNT -7.711035 7.0706413 CNT # 跟celltype一个颜色看不清楚 p <- ggplot (mydata, aes ( x= umap_1, y= umap_2, col= Celltype)) + geom_point ( size= 0.1 ) + theme_bw + theme_few + theme_classic + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 2 , size= 1 , alpha= 0.2 ) p + geom_text ( aes ( x = umap_1, y = umap_2, label= Celltype), fontface= "bold" , data = my.label) # 貌似图例有点问题,我们来修改一下:p + geom_text ( aes ( x = umap_1, y = umap_2, label= Celltype), fontface= "bold" , data = my.label, show.legend = F) p + geom_text ( aes ( x = umap_1 + 1 , y = umap_2 + 1 , label= Celltype), fontface= "bold" , data = my.label, show.legend = F) # 修改geom_text字体颜色,使其更加明显 p + geom_text ( aes ( x = umap_1, y = umap_2, label= Celltype), color= 'black' , fontface= "bold" , data = my.label, show.legend = F) # 添加带有边框的文本标签 p + geom_label ( aes ( x = umap_1, y = umap_2, label= Celltype), fontface= "bold" , data = my.label, show.legend = F) # 填充距离为 0.5 行高,即标签与数据点之间会保留 0.5 行高的空白区域,避免标签与数据点过于接近 library (ggrepel) p + geom_label_repel ( aes ( x = umap_1, y = umap_2, label= Celltype), fontface= "bold" , data = my.label, point.padding= unit ( 0.5 , "lines" ), show.legend = F) my.label[, 1 : 2 ] <- my.label[, 1 : 2 ] + 1 p + geom_label_repel ( aes ( x = umap_1, y = umap_2, label= Celltype), fontface= "bold" , data = my.label, point.padding= unit ( 0.5 , "lines" ), show.legend = F) 3.3.4 分面 DimPlot (sce, split.by = 'orig.ident' ) p <- ggplot (mydata, aes ( x= umap_1, y= umap_2, col= Celltype)) + scale_color_manual ( values = mycol) + geom_point ( size= 0.1 ) + theme_bw + theme_few + theme_classic + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 2 , size= 1 , alpha= 0.2 ) + scale_fill_manual ( values = mycol) + theme ( axis.ticks = element_blank , axis.text.y = element_blank ) + theme ( axis.text.x = element_blank , axis.text.y = element_blank ) + theme ( axis.line = element_line ( arrow = arrow ( length = unit ( 0.1 , 'cm' )))) p p + facet_grid ( ~ orig.ident) p + facet_grid ( orig.ident ~ . ) p + facet_grid (orig.ident ~ Celltype) p + facet_grid (. ~ orig.ident + Celltype) # 如果想要更改split后的顺序:p + facet_grid (. ~ orig.ident) levels (mydata $ orig.ident) # NULL unique (mydata $ orig.ident) # [1] "IRI4h1" "IRI4h2" "IRI4h3" "IRI12h1b1" "IRI12h1b2" "IRI12h2" # [7] "IRI12h3" mydata $ orig.ident <- factor (mydata $ orig.ident, levels = c ( "IRI4h3" , "IRI4h2" , "IRI4h1" , "IRI12h3" , "IRI12h2" , "IRI12h1b2" , "IRI12h1b1" )) levels (mydata $ orig.ident) # [1] "IRI4h3" "IRI4h2" "IRI4h1" "IRI12h3" "IRI12h2" "IRI12h1b2" # [7] "IRI12h1b1" p2 <- ggplot (mydata, aes ( x= umap_1, y= umap_2, col= Celltype)) + scale_color_manual ( values = mycol) + geom_point ( size= 0.1 ) + theme_bw + theme_few + theme_classic + stat_ellipse ( aes ( fill= Celltype), geom = "polygon" , linetype = 2 , size= 1 , alpha= 0.2 ) + scale_fill_manual ( values = mycol) + theme ( axis.ticks = element_blank , axis.text.y = element_blank ) + theme ( axis.text.x = element_blank , axis.text.y = element_blank ) + theme ( axis.line = element_line ( arrow = arrow ( length = unit ( 0.1 , 'cm' )))) (p + facet_grid (. ~ orig.ident )) / (p2 + facet_grid (. ~ orig.ident )) 4. 3D降维图与动图 4.1 数据整理 library (Seurat) set.seed ( 20230105 ) dim (sce @ reductions $ umap) dim (sce @ reductions $ tsne) DimPlot (sce) # 重新计算umap和tsne sce <- RunTSNE (sce, reduction = "pca" , dims = 1 : 20 , dim.embed= 3 ) sce <- RunUMAP (sce, reduction = "pca" , dims = 1 : 20 , n.components = 3 ) # 保存rds # saveRDS(sce,"./iri.intergrate.harmony.rename.3D.rds") 4.2 ggplot2的3d降维图 if ( ! require (gg3D))devtools :: install_github ( "AckerDWM/gg3D" ) sce <- readRDS ( "./iri.intergrate.harmony.rename.3D.rds" ) dim (sce @ reductions $ umap) # [1] 27653 3 dim (sce @ reductions $ tsne) # [1] 27653 3 tsne <- sce @ reductions $ tsne <- Embeddings ( object = sce[[ "tsne" ]]) umap <- sce @ reductions $ umap <- Embeddings ( object = sce[[ "umap" ]]) umap43d <- cbind (umap,sce @ meta.data) head (umap43d) # umap_1 umap_2 umap_3 orig.ident nCount_RNA # IRI4h1_AAACCTGAGATTACCC 3.559075 -1.740733 2.9102795 IRI4h1 2490 # IRI4h1_AAACCTGAGTGTTAGA -4.556370 -5.701739 -0.4349645 IRI4h1 514 # IRI4h1_AAACCTGCACCAACCG -3.213806 4.902372 3.5176237 IRI4h1 1171 # IRI4h1_AAACCTGCAGCCTGTG -5.149614 6.129024 -2.5434456 IRI4h1 1712 # IRI4h1_AAACCTGCAGGGAGAG -6.157923 -1.452615 -1.0119777 IRI4h1 891 # IRI4h1_AAACCTGGTAGCGCTC -3.686494 -5.441414 -4.7313709 IRI4h1 1121 # nFeature_RNA_snn_res.0.4 seurat_clusters Celltype # IRI4h1_AAACCTGAGATTACCC 564 6 6 NewPT1 # IRI4h1_AAACCTGAGTGTTAGA 320 3 3 EC1 # IRI4h1_AAACCTGCACCAACCG 392 2 2 DCT # IRI4h1_AAACCTGCAGCCTGTG 530 1 1 MTAL # IRI4h1_AAACCTGCAGGGAGAG 503 10 ICA # IRI4h1_AAACCTGGTAGCGCTC 521 7 7 Fib ggplot (umap43d, aes ( x= umap_1, y= umap_2, z= umap_3, color= Celltype)) + theme ( axis.ticks = element_blank , axis.text = element_blank ) + axes_3D ( theta= 0 , phi= 90 ) + 3D 图形中添加坐标轴 stat_3D ( theta= 0 , phi= 90 , size = 0.1 ) + 3D 散点(或者统计变换的 3D 点) labs_3D ( theta= 0 , phi= 90 , 3D 图形的坐标轴添加标签 hjust= c ( 1 , 0 , 0 ), vjust= c ( 1.5 , 1 , - . 2 ), labs= c ( "UMAP_1" , "UMAP_2" , "UMAP_3" )) + theme_void # 修改配色 ggplot (umap43d, aes ( x= umap_1, y= umap_2, z= umap_3, color= Celltype)) + theme ( axis.ticks = element_blank , axis.text = element_blank ) + scale_color_manual ( values = mycol) + axes_3D ( theta= 0 , phi= 90 ) + stat_3D ( theta= 0 , phi= 90 , size = 0.1 ) + labs_3D ( theta= 0 , phi= 90 , hjust= c ( 1 , 0 , 0 ), vjust= c ( 1.5 , 1 , - . 2 ), labs= c ( "UMAP_1" , "UMAP_2" , "UMAP_3" )) + theme_void 4.3 scatterplot3d library (scatterplot3d) # col 控制点的颜色 umap43d $ celltype.col <- umap43d $ Celltype head (umap43d) # umap_1 umap_2 umap_3 orig.ident nCount_RNA # IRI4h1_AAACCTGAGATTACCC 3.559075 -1.740733 2.9102795 IRI4h1 2490 # IRI4h1_AAACCTGAGTGTTAGA -4.556370 -5.701739 -0.4349645 IRI4h1 514 # IRI4h1_AAACCTGCACCAACCG -3.213806 4.902372 3.5176237 IRI4h1 1171 # IRI4h1_AAACCTGCAGCCTGTG -5.149614 6.129024 -2.5434456 IRI4h1 1712 # IRI4h1_AAACCTGCAGGGAGAG -6.157923 -1.452615 -1.0119777 IRI4h1 891 # IRI4h1_AAACCTGGTAGCGCTC -3.686494 -5.441414 -4.7313709 IRI4h1 1121 # nFeature_RNA_snn_res.0.4 seurat_clusters Celltype # IRI4h1_AAACCTGAGATTACCC 564 6 6 NewPT1 # IRI4h1_AAACCTGAGTGTTAGA 320 3 3 EC1 # IRI4h1_AAACCTGCACCAACCG 392 2 2 DCT # IRI4h1_AAACCTGCAGCCTGTG 530 1 1 MTAL # IRI4h1_AAACCTGCAGGGAGAG 503 10 ICA # IRI4h1_AAACCTGGTAGCGCTC 521 7 7 Fib # celltype.col # IRI4h1_AAACCTGAGATTACCC NewPT1 # IRI4h1_AAACCTGAGTGTTAGA EC1 # IRI4h1_AAACCTGCACCAACCG DCT # IRI4h1_AAACCTGCAGCCTGTG MTAL # IRI4h1_AAACCTGCAGGGAGAG ICA # IRI4h1_AAACCTGGTAGCGCTC Fib for (i in 1 : 12 ) { umap43d $ celltype.col <- gsub ( unique (umap43d $ celltype.col)[i],mycol[i],umap43d $ celltype.col) } head (umap43d) # umap_1 umap_2 umap_3 orig.ident nCount_RNA # IRI4h1_AAACCTGAGATTACCC 3.559075 -1.740733 2.9102795 IRI4h1 2490 # IRI4h1_AAACCTGAGTGTTAGA -4.556370 -5.701739 -0.4349645 IRI4h1 514 # IRI4h1_AAACCTGCACCAACCG -3.213806 4.902372 3.5176237 IRI4h1 1171 # IRI4h1_AAACCTGCAGCCTGTG -5.149614 6.129024 -2.5434456 IRI4h1 1712 # IRI4h1_AAACCTGCAGGGAGAG -6.157923 -1.452615 -1.0119777 IRI4h1 891 # IRI4h1_AAACCTGGTAGCGCTC -3.686494 -5.441414 -4.7313709 IRI4h1 1121 # nFeature_RNA_snn_res.0.4 seurat_clusters Celltype # IRI4h1_AAACCTGAGATTACCC 564 6 6 NewPT1 # IRI4h1_AAACCTGAGTGTTAGA 320 3 3 EC1 # IRI4h1_AAACCTGCACCAACCG 392 2 2 DCT # IRI4h1_AAACCTGCAGCCTGTG 530 1 1 MTAL # IRI4h1_AAACCTGCAGGGAGAG 503 10 ICA # IRI4h1_AAACCTGGTAGCGCTC 521 7 7 Fib # celltype.col # IRI4h1_AAACCTGAGATTACCC # IRI4h1_AAACCTGAGTGTTAGA # IRI4h1_AAACCTGCACCAACCG # IRI4h1_AAACCTGCAGCCTGTG # IRI4h1_AAACCTGCAGGGAGAG # IRI4h1_AAACCTGGTAGCGCTC p1 <- scatterplot3d (umap43d[, 1 : 3 ], color = umap43d $ celltype.col, main= "scatterplot3d.umap" , pch = 16 , cex.symbols = 0.2 ) # 形状为实心圆,点的大小是默认的0.2倍 legend (p1 $ xyz.convert ( 8.5 , 2.5 , 5 ), legend = unique (umap43d $ Celltype), col = unique (umap43d $ celltype.col), pch = 16 ) # legend形状为实心圆 # pch 控制点的形状 umap43d $ celltype.pch <- umap43d $ Celltype for (i in 1 : 12 ) { umap43d $ celltype.pch <- gsub ( unique (umap43d $ celltype.pch)[i],( 1 : 12 )[i],umap43d $ celltype.pch) } head (umap43d) # umap_1 umap_2 umap_3 orig.ident nCount_RNA # IRI4h1_AAACCTGAGATTACCC 3.559075 -1.740733 2.9102795 IRI4h1 2490 # IRI4h1_AAACCTGAGTGTTAGA -4.556370 -5.701739 -0.4349645 IRI4h1 514 # IRI4h1_AAACCTGCACCAACCG -3.213806 4.902372 3.5176237 IRI4h1 1171 # IRI4h1_AAACCTGCAGCCTGTG -5.149614 6.129024 -2.5434456 IRI4h1 1712 # IRI4h1_AAACCTGCAGGGAGAG -6.157923 -1.452615 -1.0119777 IRI4h1 891 # IRI4h1_AAACCTGGTAGCGCTC -3.686494 -5.441414 -4.7313709 IRI4h1 1121 # nFeature_RNA_snn_res.0.4 seurat_clusters Celltype # IRI4h1_AAACCTGAGATTACCC 564 6 6 NewPT1 # IRI4h1_AAACCTGAGTGTTAGA 320 3 3 EC1 # IRI4h1_AAACCTGCACCAACCG 392 2 2 DCT # IRI4h1_AAACCTGCAGCCTGTG 530 1 1 MTAL # IRI4h1_AAACCTGCAGGGAGAG 503 10 ICA # IRI4h1_AAACCTGGTAGCGCTC 521 7 7 Fib # celltype.col celltype.pch # IRI4h1_AAACCTGAGATTACCC 1 # IRI4h1_AAACCTGAGTGTTAGA 2 # IRI4h1_AAACCTGCACCAACCG 3 # IRI4h1_AAACCTGCAGCCTGTG 4 # IRI4h1_AAACCTGCAGGGAGAG 5 # IRI4h1_AAACCTGGTAGCGCTC 6 p2 <- scatterplot3d (umap43d[, 1 : 3 ], color = umap43d $ celltype.col, main= "scatterplot3d.umap" , pch = as.numeric (umap43d $ celltype.pch), cex.symbols = 0.4 ) legend (p2 $ xyz.convert ( 8.5 , 2.5 , 5 ), legend = unique (umap43d $ Celltype), col = unique (umap43d $ celltype.col), pch = as.numeric (umap43d $ celltype.pch)) 4.4 plotly的3d降维图 if ( ! require (plotly)) install.packages (plotly) tsne <- as.data.frame (tsne) tsne .3 d <- plot_ly (tsne, x = ~ tSNE_1, y = ~ tSNE_2, z = ~ tSNE_3, color = sce $ Celltype, colors = mycol, marker = list ( size = 1 )) %>% # 这里的 size 控制散点的大小 layout ( legend = list ( itemsizing = "constant" )) tsne .3 d umap <- as.data.frame (umap) umap .3 d <- plot_ly (umap, x = ~ umap_1, y = ~ umap_2, z = ~ umap_3, color = sce $ Celltype, colors = mycol, size= 2 , marker = list ( size = 1 )) %>% # 这里的 size 控制散点的大小 layout ( legend = list ( itemsizing = "constant" )) umap .3 d 5. 一些造好的轮子 # 1 plot1cell # 更多功能可以看这里:https://github.com/HaojiaWu/plot1cell if ( ! require (devtools)) install.packages ( 'devtools' ) if ( ! require (biomaRt)) install.packages ( 'biomaRt' ) if ( ! require (XML)) install.packages ( 'XML' ) if ( ! require (biomaRt))devtools :: install_github ( 'alyst/biomaRt' , dependencies = T) if ( ! require (GenomeInfoDb)) install.packages ( 'GenomeInfoDb' ) if ( ! require ( "BiocManager" , quietly = TRUE )) install.packages ( "BiocManager" ) if ( ! require (EnsDb.Hsapiens.v86))BiocManager :: install ( 'EnsDb.Hsapiens.v86' ) if ( ! require (GEOquery))BiocManager :: install ( 'GEOquery' ) if ( ! require (simplifyEnrichment))BiocManager :: install ( 'simplifyEnrichment' ) if ( ! require (ComplexHeatmap))BiocManager :: install ( 'ComplexHeatmap' ) if ( ! require (DoubletFinder))devtools :: install_github ( "chris-mcginnis-ucsf/DoubletFinder" ) if ( ! require (loomR))devtools :: install_github ( "mojaveazure/loomR" ) if ( ! require (hdf5r))devtools :: install_github ( "Novartis/hdf5r" ) if ( ! require (plot1cell)) devtools :: install_github ( "TheHumphreysLab/plot1cell" ) sce <- readRDS ( "./iri.intergrate.harmony.rename.rds" ) # 选取部分细胞用于后续分析 sel <- c ( "PTS1" , "PTS2" , "NewPT1" , "DTL-ATL" , "MTAL" , "DCT" , "CNT" , "PC2" , "ICA" , "EC1" , "Fib" , "Mø" ) sce <- subset (sce, Celltype %in% sel) # 指定因子展示顺序 sce @ meta.data $ Celltype <- factor (sce @ meta.data $ Celltype, levels = sel) Idents (sce) <- factor ( Idents (sce), levels = sel) circ_data <- prepare_circlize_data (sce, scale = 0.8 ) set.seed ( 1234 ) cluster_colors <- rand_color ( length ( levels (sce))) seurat_clusters_colors <- rand_color ( length ( names ( table (sce $ seurat_clusters)))) rep_colors <- rand_color ( length ( names ( table (sce $ orig.ident)))) plot_circlize (circ_data, do.label = T, pt.size = 0.01 , col.use = cluster_colors , bg.color = 'white' , kde2d.n = 200 , repel = T, label.cex = 0.6 ) add_track (circ_data, group = "seurat_clusters" , colors = seurat_clusters_colors, track_num = 2 ) # 由内向外数 add_track (circ_data, group = "orig.ident" , colors = rep_colors, track_num = 3 ) if ( ! require (scRNAtoolVis))devtools :: install_github ( "junjunlab/scRNAtoolVis" ) test <- system.file ( "extdata" , "seuratTest.RDS" , package = "scRNAtoolVis" ) # tmp <- readRDS(test) # 最简单的绘图 clusterCornerAxes ( object = sce, reduction = 'umap' , pSize= 0.1 , noSplit = T, clusterCol= "Celltype" ) # 改变箭头形状 clusterCornerAxes ( object = sce, reduction = 'umap' , pSize= 0.1 , noSplit = T, arrowType = 'open' , clusterCol= "Celltype" ) clusterCornerAxes ( object = sce, reduction = 'umap' , pSize= 0.1 , noSplit = F, groupFacet = "Celltype" , aspect.ratio = 1 , relLength = 0.5 , clusterCol= "Celltype" , show.legend= F) clusterCornerAxes ( object = sce, reduction = 'umap' , pSize= 0.1 , noSplit = F, groupFacet = "Celltype" , aspect.ratio = 1 , relLength = 0.5 , clusterCol= "Celltype" , show.legend= F, axes = "one" ) clusterCornerAxes ( object = sce, reduction = 'umap' , pSize= 0.1 , noSplit = F, groupFacet = "Celltype" , aspect.ratio = 1 , relLength = 0.5 , lineTextcol = 'grey50' , show.legend = F, clusterCol= "Celltype" ) clusterCornerAxes ( object = sce, reduction = 'umap' , pSize= 0.1 , noSplit = T, keySize = 8 , clusterCol= "Celltype" ) clusterCornerAxes ( object = sce, reduction = 'umap' , pSize= 0.1 , noSplit = T, cellLabel = T, cellLabelSize = 4 , clusterCol= "Celltype" ) clusterCornerAxes ( object = sce, reduction = 'umap' , pSize= 0.1 , groupFacet = 'orig.ident' , noSplit = F, cellLabel = T, cellLabelSize = 3 , show.legend = F, aspect.ratio = 1 , themebg = 'bwCorner' , clusterCol= "Celltype" ) clusterCornerAxes ( object = sce, reduction = 'umap' , pSize= 0.1 , noSplit = T, cornerTextSize = 3.5 , themebg = 'bwCorner' , addCircle = TRUE , cicAlpha = 0.2 , nbin = 200 , cicDelta= 0.1 , clusterCol= "Celltype" ) 第十一讲:改造单细胞图形-VlnPlot 1. 准备工作 scRNA <- readRDS ( 'pbmcrenamed.rds' ) # 查看一下数据细胞类型及组别:DimPlot (scRNA, split.by = 'group' ) # 共有三个组:unique (scRNA $ group) # [1] "C57" "AS1" "P3" # 共包含九种细胞类型:unique (scRNA $ celltype) # [1] VSMC T cell Fibro EC Myeloid cells # [6] B cell Mono Macro Neut # Levels: VSMC T cell Macro B cell Fibro Myeloid cells Mono Neut EC # 查看高表达基因:library (dplyr) # 按照表达量给基因排序:myorder <- apply ( as.data.frame (scRNA @ assays $ RNA @ data), 1 , function (x){ sum (x)}) %>% order (., decreasing = T) # 看一下总表达量最高的5个基因:rownames (scRNA)[myorder][ 1 : 5 ] # [1] "Gm42418" "Tmsb4x" "Malat1" "Actb" "mt-Atp6" # 浅画一下VlnPlot:VlnPlot (scRNA, features = 'Gm42418' ) 2. 调整VlnPlot参数 library (patchwork) # 调整点的大小: VlnPlot (scRNA, features = 'Gm42418' , pt.size = 0.5 ) | # 默认为1 VlnPlot (scRNA, features = 'Gm42418' , pt.size = 0 ) # 点的大小为0时即关闭点的展示 # 调整颜色 library (ggsci) ori_p <- VlnPlot (scRNA, features = 'Gm42418' , pt.size = 0 ) # 离散型颜色 ori_p | VlnPlot (scRNA, features = 'Gm42418' , pt.size = 0 , cols = ggsci :: pal_npg ( 10 ) ) # 渐变颜色 ori_p | VlnPlot (scRNA, features = 'Gm42418' , pt.size = 0 , cols = ggsci :: pal_gsea ( 10 ) ) # 降序,按照表达均值排序 ori_p | VlnPlot (scRNA, features = 'Gm42418' , pt.size = 0 , sort = T) # 展示不同分组变量 ori_p | VlnPlot (scRNA, features = 'Gm42418' , pt.size = 0 , group.by = 'group' ) # 把每个celltype分成不同组别展示 ori_p | VlnPlot (scRNA, features = 'Gm42418' , pt.size = 0 , split.by = 'group' ) # 绘制多个基因 he_gene <- rownames (scRNA)[myorder][ 1 : 10 ] VlnPlot (scRNA, features = he_gene, pt.size = 0 ) # 多个基因更适合堆积小提琴图:VlnPlot (scRNA, features = he_gene, pt.size = 0 , stack = T ) 3. ggplot2直接中的DIY 3.1 VlnPlot源码 # Vlnplot源码(注意下面的函数名不要加括号):VlnPlot # function (object, features, cols = NULL, pt.size = NULL, alpha = 1, # idents = NULL, sort = FALSE, assay = NULL, group.by = NULL, # split.by = NULL, adjust = 1, y.max = NULL, same.y.lims = FALSE, # log = FALSE, ncol = NULL, slot = deprecated, layer = NULL, # split.plot = FALSE, stack = FALSE, combine = TRUE, fill.by = "feature", # flip = FALSE, add.noise = TRUE, raster = NULL) # { # if (is_present(arg = slot)) { # deprecate_soft(when = "5.0.0", what = "VlnPlot(slot = )", # with = "VlnPlot(layer = )") # layer <- slot %||% layer # } # layer.set <- suppressWarnings(Layers(object = object, search = layer %||% # "data")) # if (is.null(layer) && length(layer.set) == 1 && layer.set == # "scale.data") { # warning("Default search for \"data\" layer yielded no results; utilizing \"scale.data\" layer instead.") # } # assay.name <- DefaultAssay(object) # if (is.null(layer.set) & is.null(layer)) { # warning("Default search for \"data\" layer in \"", assay.name, # "\" assay yielded no results; utilizing \"counts\" layer instead.", # call. = FALSE, immediate. = TRUE) # layer.set <- Layers(object = object, search = "counts") # } # if (is.null(layer.set)) { # stop("layer \"", layer, "\" is not found in assay: \"", # assay.name, "\"") # } # else { # layer <- layer.set # } # if (!is.null(x = split.by) & getOption(x = "Seurat.warn.vlnplot.split", # default = TRUE)) { # message("The default behaviour of split.by has changed.\n", # "Separate violin plots are now plotted side-by-side.\n", # "To restore the old behaviour of a single split violin,\n", # "set split.plot = TRUE.\n \nThis message will be shown once per session.") # options(Seurat.warn.vlnplot.split = FALSE) # } # return(ExIPlot(object = object, type = ifelse(test = split.plot, # yes = "splitViolin", no = "violin"), features = features, # idents = idents, ncol = ncol, sort = sort, assay = assay, # y.max = y.max, same.y.lims = same.y.lims, adjust = adjust, # pt.size = pt.size, alpha = alpha, cols = cols, group.by = group.by, # split.by = split.by, log = log, layer = layer, stack = stack, # combine = combine, fill.by = fill.by, flip = flip, add.noise = add.noise, # raster = raster)) # } # <bytecode: 0x557c8422ee38> # <environment: namespace:Seurat> # 这个代码框不要运行!
# 最终的绘图函数为"ExIPlot"下的"SingleExIPlot"函数 ExIPlot <- function ( object, features, type = 'violin' , idents = NULL , ncol = NULL , sort = FALSE , assay = NULL , y.max = NULL , same.y.lims = FALSE , adjust = 1 , cols = NULL , pt.size = 0 , alpha = 1 , group.by = NULL , split.by = NULL , log = FALSE , slot = deprecated , layer = 'data' , stack = FALSE , combine = TRUE , fill.by = NULL , flip = FALSE , add.noise = TRUE , raster = NULL ) { if ( is_present ( arg = slot)) { layer <- layer %||% slot } assay <- assay %||% DefaultAssay ( object = object) DefaultAssay ( object = object) <- assay cells <- Cells ( x = object, assay = NULL ) if ( isTRUE ( x = stack)) { if ( ! is.null ( x = ncol)) { warning ( "'ncol' is ignored with 'stack' is TRUE" , call. = FALSE , immediate. = TRUE ) } if ( ! is.null ( x = y.max)) { warning ( "'y.max' is ignored when 'stack' is TRUE" , call. = FALSE , immediate. = TRUE ) } } else { ncol <- ncol %||% ifelse ( test = length ( x = features) > 9 , yes = 4 , no = min ( length ( x = features), 3 ) ) } if ( ! is.null ( x = idents)) { cells <- intersect ( x = names ( x = Idents ( object = object)[ Idents ( object = object) %in% idents]), y = cells ) } data <- FetchData ( object = object, vars = features, slot = layer, cells = cells) pt.size <- pt.size %||% AutoPointSize ( data = object) features <- colnames ( x = data) data <- data[cells, , drop = FALSE ] idents <- if ( is.null ( x = group.by)) { Idents ( object = object)[cells] } else { object[[group.by, drop = TRUE ]][cells] } if ( ! is.factor ( x = idents)) { idents <- factor ( x = idents) } if ( is.null ( x = split.by)) { split <- NULL } else { split <- FetchData (object,split.by)[cells,split.by] if ( ! is.factor ( x = split)) { split <- factor ( x = split) } if ( is.null ( x = cols)) { cols <- hue_pal ( length ( x = levels ( x = idents))) cols <- Interleave (cols, InvertHex ( hexadecimal = cols)) } else if ( length ( x = cols) == 1 && cols == 'interaction' ) { split <- interaction (idents, split) cols <- hue_pal ( length ( x = levels ( x = idents))) } else { cols <- Col2Hex (cols) } if ( length ( x = cols) < length ( x = levels ( x = split))) { cols <- Interleave (cols, InvertHex ( hexadecimal = cols)) } cols <- rep_len ( x = cols, length.out = length ( x = levels ( x = split))) names ( x = cols) <- levels ( x = split) if (( length ( x = cols) > 2 ) & (type == "splitViolin" )) { warning ( "Split violin is only supported for <3 groups, using multi-violin." ) type <- "violin" } } if (same.y.lims && is.null ( x = y.max)) { y.max <- max (data) } if ( isTRUE ( x = stack)) { return ( MultiExIPlot ( type = type, data = data, idents = idents, split = split, sort = sort, same.y.lims = same.y.lims, adjust = adjust, cols = cols, pt.size = pt.size, log = log, fill.by = fill.by, add.noise = add.noise, flip = flip )) } plots <- lapply ( X = features, FUN = function (x) { return ( SingleExIPlot ( type = type, data = data[, x, drop = FALSE ], idents = idents, split = split, sort = sort, y.max = y.max, adjust = adjust, cols = cols, pt.size = pt.size, alpha = alpha, log = log, add.noise = add.noise, raster = raster )) } ) label.fxn <- switch ( EXPR = type, 'violin' = if (stack) { xlab } else { ylab }, "splitViolin" = if (stack) { xlab } else { ylab }, 'ridge' = xlab, stop ( "Unknown ExIPlot type " , type, call. = FALSE ) ) for (i in 1 : length ( x = plots)) { key <- paste0 ( unlist ( x = strsplit ( x = features[i], split = '_' ))[ 1 ], '_' ) obj <- names ( x = which ( x = Key ( object = object) == key)) if ( length ( x = obj) == 1 ) { if ( inherits ( x = object[[obj]], what = 'DimReduc' )) { plots[[i]] <- plots[[i]] + label.fxn ( label = 'Embeddings Value' ) } else if ( inherits ( x = object[[obj]], what = 'Assay' ) || inherits ( x = object[[obj]], what = 'Assay5' )) { next } else { warning ( "Unknown object type " , class ( x = object), immediate. = TRUE , call. = FALSE ) plots[[i]] <- plots[[i]] + label.fxn ( label = NULL ) } } else if ( ! features[i] %in% rownames ( x = object)) { plots[[i]] <- plots[[i]] + label.fxn ( label = NULL ) } } if (combine) { plots <- wrap_plots (plots, ncol = ncol) if ( length ( x = features) > 1 ) { plots <- plots & NoLegend } } return (plots) } # 这个代码框不要运行!
# SingleExIPlot函数源码 SingleExIPlot <- function ( data, idents, split = NULL , type = 'violin' , sort = FALSE , y.max = NULL , adjust = 1 , pt.size = 0 , alpha = 1 , cols = NULL , seed.use = 42 , log = FALSE , add.noise = TRUE , raster = NULL ) { if ( ! is.null ( x = raster) && isTRUE ( x = raster)){ if ( ! PackageCheck ( 'ggrastr' , error = FALSE )) { stop ( "Please install ggrastr from CRAN to enable rasterization." ) } } if ( PackageCheck ( 'ggrastr' , error = FALSE )) { # Set rasterization to true if ggrastr is installed and # number of points exceeds 100,000 if (( nrow ( x = data) > 1e5 ) & is.null ( x = raster)){ message ( "Rasterizing points since number of points exceeds 100,000." , " \n To disable this behavior set `raster=FALSE`" ) # change raster to TRUE raster <- TRUE } } if ( ! is.null ( x = seed.use)) { set.seed ( seed = seed.use) } if ( ! is.data.frame ( x = data) || ncol ( x = data) != 1 ) { stop ( "'SingleExIPlot requires a data frame with 1 column" ) } feature <- colnames ( x = data) data $ ident <- idents if (( is.character ( x = sort) && nchar ( x = sort) > 0 ) || sort) { data $ ident <- factor ( x = data $ ident, levels = names ( x = rev ( x = sort ( x = tapply ( X = data[, feature], INDEX = data $ ident, FUN = mean ), decreasing = grepl ( pattern = paste0 ( '^' , tolower ( x = sort)), x = 'decreasing' ) ))) ) } if (log) { noise <- rnorm ( n = length ( x = data[, feature])) / 200 data[, feature] <- data[, feature] + 1 } else { noise <- rnorm ( n = length ( x = data[, feature])) / 100000 } if ( ! add.noise) { noise <- noise * 0 } if ( all (data[, feature] == data[, feature][ 1 ])) { warning ( paste0 ( "All cells have the same value of " , feature, "." )) } else { data[, feature] <- data[, feature] + noise } axis.label <- 'Expression Level' y.max <- y.max %||% max (data[, feature][ is.finite ( x = data[, feature])]) if (type == 'violin' && ! is.null ( x = split)) { data $ split <- split vln.geom <- geom_violin fill <- 'split' } else if (type == 'splitViolin' && ! is.null ( x = split )) { data $ split <- split vln.geom <- geom_split_violin fill <- 'split' type <- 'violin' } else { vln.geom <- geom_violin fill <- 'ident' } switch ( EXPR = type, 'violin' = { x <- 'ident' y <- paste0 ( "`" , feature, "`" ) xlab <- 'Identity' ylab <- axis.label geom <- list ( vln.geom ( scale = 'width' , adjust = adjust, trim = TRUE ), theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 )) ) if ( is.null ( x = split)) { if ( isTRUE ( x = raster)) { jitter <- ggrastr :: rasterize ( geom_jitter ( height = 0 , size = pt.size, alpha = alpha, show.legend = FALSE )) } else { jitter <- geom_jitter ( height = 0 , size = pt.size, alpha = alpha, show.legend = FALSE ) } } else { if ( isTRUE ( x = raster)) { jitter <- ggrastr :: rasterize ( geom_jitter ( position = position_jitterdodge ( jitter.width = 0.4 , dodge.width = 0.9 ), size = pt.size, alpha = alpha, show.legend = FALSE )) } else { jitter <- geom_jitter ( position = position_jitterdodge ( jitter.width = 0.4 , dodge.width = 0.9 ), size = pt.size, alpha = alpha, show.legend = FALSE ) } } log.scale <- scale_y_log10 axis.scale <- ylim }, 'ridge' = { x <- paste0 ( "`" , feature, "`" ) y <- 'ident' xlab <- axis.label ylab <- 'Identity' geom <- list ( geom_density_ridges ( scale = 4 ), theme_ridges , scale_y_discrete ( expand = c ( 0.01 , 0 )), scale_x_continuous ( expand = c ( 0 , 0 )) ) jitter <- geom_jitter ( width = 0 , size = pt.size, alpha = alpha, show.legend = FALSE ) log.scale <- scale_x_log10 axis.scale <- function (...) { invisible ( x = NULL ) } }, stop ( "Unknown plot type: " , type) ) plot <- ggplot ( data = data, mapping = aes_string ( x = x, y = y, fill = fill)[ c ( 2 , 3 , 1 )] ) + labs ( x = xlab, y = ylab, title = feature, fill = NULL ) + theme_cowplot + theme ( plot.title = element_text ( hjust = 0.5 )) plot <- do.call ( what = '+' , args = list (plot, geom)) plot <- plot + if (log) { log.scale } else { axis.scale ( min (data[, feature]), y.max) } if (pt.size > 0 ) { plot <- plot + jitter } if ( ! is.null ( x = cols)) { if ( ! is.null ( x = split)) { idents <- unique ( x = as.vector ( x = data $ ident)) splits <- unique ( x = as.vector ( x = data $ split)) labels <- if ( length ( x = splits) == 2 ) { splits } else { unlist ( x = lapply ( X = idents, FUN = function (pattern, x) { x.mod <- gsub ( pattern = paste0 (pattern, '.' ), replacement = paste0 (pattern, ': ' ), x = x, fixed = TRUE ) x.keep <- grep ( pattern = ': ' , x = x.mod, fixed = TRUE ) x.return <- x.mod[x.keep] names ( x = x.return) <- x[x.keep] return (x.return) }, x = unique ( x = as.vector ( x = data $ split)) )) } if ( is.null ( x = names ( x = labels))) { names ( x = labels) <- labels } } else { labels <- levels ( x = droplevels (data $ ident)) } plot <- plot + scale_fill_manual ( values = cols, labels = labels) } return (plot) } # 这个代码框不要运行!
# 其中最为核心的绘图区域使用的R包为ggplot2, 所以接下来我们基于ggplot2进行后续的优化 plot <- ggplot ( data = data, mapping = aes_string ( x = x, y = y, fill = fill)[ c ( 2 , 3 , 1 )] ) 3.2 数据整理 # ggplot2的图片绘制需要一个长格式的数据,我们以上面的高表达基因为例,整理出一个长格式数据框 # 取出表达矩阵:data4plot <- scRNA @ assays $ RNA @ data[he_gene[ 1 : 5 ],] %>% as.data.frame %>% t head (data4plot) # Gm42418 Tmsb4x Malat1 Actb mt-Atp6 # AAAGGGCTCGCAGAGA-1_1 3.125663 5.191950 5.443660 4.880147 4.500693 # AACACACAGCATCCCG-1_1 5.717602 4.747414 4.719490 4.166542 5.093166 # AACAGGGTCAACACCA-1_1 4.716741 4.179333 5.849399 3.356926 3.870171 # AACCCAAAGGCCTTGC-1_1 5.104691 3.290715 6.404316 4.364196 3.569047 # AACCCAAAGTGGACGT-1_1 5.853611 7.403626 3.399781 5.316662 3.897164 # AACCCAAGTCCGAAGA-1_1 4.562551 6.145295 4.913077 5.337041 3.646919 # 添加注释信息:data4plot <- cbind (data4plot,scRNA @ meta.data[, c ( 'group' , 'celltype' )]) # 查看一下表头:head (data4plot) # Gm42418 Tmsb4x Malat1 Actb mt-Atp6 group # AAAGGGCTCGCAGAGA-1_1 3.125663 5.191950 5.443660 4.880147 4.500693 C57 # AACACACAGCATCCCG-1_1 5.717602 4.747414 4.719490 4.166542 5.093166 C57 # AACAGGGTCAACACCA-1_1 4.716741 4.179333 5.849399 3.356926 3.870171 C57 # AACCCAAAGGCCTTGC-1_1 5.104691 3.290715 6.404316 4.364196 3.569047 C57 # AACCCAAAGTGGACGT-1_1 5.853611 7.403626 3.399781 5.316662 3.897164 C57 # AACCCAAGTCCGAAGA-1_1 4.562551 6.145295 4.913077 5.337041 3.646919 C57 # celltype # AAAGGGCTCGCAGAGA-1_1 VSMC # AACACACAGCATCCCG-1_1 T cell # AACAGGGTCAACACCA-1_1 Fibro # AACCCAAAGGCCTTGC-1_1 EC # AACCCAAAGTGGACGT-1_1 Myeloid cells # AACCCAAGTCCGAAGA-1_1 T cell 3.3 重现VlnPlot # 待复刻的图:plot0 <- VlnPlot (scRNA, features = 'Gm42418' , pt.size = 0 ) plot0 library (ggplot2) library (ggthemes) # 最简单的复刻:plot1 <- ggplot ( data = data4plot, aes ( x = celltype, y = Gm42418, fill = celltype)) + # 确定坐标与颜色填充 geom_violin # 绘制小提琴图 plot1 # 显然这个还不是很像 # 稍作修改 plot2 <- plot1 + labs ( x = 'Identity' , y = 'Expression Level' , title = 'Gm42418' , fill = NULL ) + # 修改标题 geom_violin + theme_bw + # 去灰色背景 theme_few + # 去网格 theme_classic + theme ( plot.title = element_text ( hjust = 0.5 , face = 'bold' , size= 15 ) ) + # 加粗标题并调整大小 theme ( axis.text.x.bottom = element_text ( angle = 45 , hjust = 1 , vjust = 1 , size = 12 ), # 调整x轴坐标位置、字体、大小 axis.text.y.left = element_text ( size= 10 )) # y轴坐标字体调整 plot2 # 最后看一下,第一张图和第三张图基本一致 library (patchwork) plot0 / plot1 / plot2 # 需要横向排列则是plot0|plot1|plot2 3.4 重现stack vlnplot # 原本VlnPlot中的stack=True选项依赖MultiExIPlo进行可视化:# 再看一眼原图 ori_plot <- VlnPlot (scRNA, features = he_gene, pt.size = 0 , stack = T ) ori_plot # 整理数据 library (reshape2) data4stack <- as.data.frame (scRNA[[ "RNA" ]] @ data[he_gene,]) data4stack $ Gene <- factor (he_gene, levels = he_gene) data4stack[, 1 : 4 ] # AAAGGGCTCGCAGAGA-1_1 AACACACAGCATCCCG-1_1 AACAGGGTCAACACCA-1_1 # Gm42418 3.125663 5.717602 4.716741 # Tmsb4x 5.191950 4.747414 4.179333 # Malat1 5.443660 4.719490 5.849399 # Actb 4.880147 4.166542 3.356926 # mt-Atp6 4.500693 5.093166 3.870171 # mt-Co1 4.223976 4.516079 3.615016 # Fth1 4.598321 1.751268 3.450555 # mt-Co2 4.455747 4.661190 3.522394 # mt-Co3 3.796612 5.022522 3.435546 # Tpt1 3.334123 4.462619 4.313646 # AACCCAAAGGCCTTGC-1_1 # Gm42418 5.104691 # Tmsb4x 3.290715 # Malat1 6.404316 # Actb 4.364196 # mt-Atp6 3.569047 # mt-Co1 3.965073 # Fth1 2.903693 # mt-Co2 3.965073 # mt-Co3 3.786538 # Tpt1 2.903693 # 把宽格式的数据转换为长格式数据 data4stack <- melt (data4stack, id= "Gene" ) head (data4stack) # Gene variable value # 1 Gm42418 AAAGGGCTCGCAGAGA-1_1 3.125663 # 2 Tmsb4x AAAGGGCTCGCAGAGA-1_1 5.191950 # 3 Malat1 AAAGGGCTCGCAGAGA-1_1 5.443660 # 4 Actb AAAGGGCTCGCAGAGA-1_1 4.880147 # 5 mt-Atp6 AAAGGGCTCGCAGAGA-1_1 4.500693 # 6 mt-Co1 AAAGGGCTCGCAGAGA-1_1 4.223976 colnames (data4stack)[ c ( 2 , 3 )] = c ( "Cell" , "Expression" ) head (data4stack) # Gene Cell Expression # 1 Gm42418 AAAGGGCTCGCAGAGA-1_1 3.125663 # 2 Tmsb4x AAAGGGCTCGCAGAGA-1_1 5.191950 # 3 Malat1 AAAGGGCTCGCAGAGA-1_1 5.443660 # 4 Actb AAAGGGCTCGCAGAGA-1_1 4.880147 # 5 mt-Atp6 AAAGGGCTCGCAGAGA-1_1 4.500693 # 6 mt-Co1 AAAGGGCTCGCAGAGA-1_1 4.223976 # 提取细胞类型 my_meta = scRNA @ meta.data[, c ( 'group' , 'celltype' )] my_meta $ Cell <- rownames (my_meta) head (my_meta) # group celltype Cell # AAAGGGCTCGCAGAGA-1_1 C57 VSMC AAAGGGCTCGCAGAGA-1_1 # AACACACAGCATCCCG-1_1 C57 T cell AACACACAGCATCCCG-1_1 # AACAGGGTCAACACCA-1_1 C57 Fibro AACAGGGTCAACACCA-1_1 # AACCCAAAGGCCTTGC-1_1 C57 EC AACCCAAAGGCCTTGC-1_1 # AACCCAAAGTGGACGT-1_1 C57 Myeloid cells AACCCAAAGTGGACGT-1_1 # AACCCAAGTCCGAAGA-1_1 C57 T cell AACCCAAGTCCGAAGA-1_1 # 根据Cell列合并data4stack和my_meta,最终会在data4stack增加上细胞类型和分组的信息 data4stack = inner_join (data4stack,my_meta, by= "Cell" ) # 瞅一眼数据 head (data4stack) # Gene Cell Expression group celltype # 1 Gm42418 AAAGGGCTCGCAGAGA-1_1 3.125663 C57 VSMC # 2 Tmsb4x AAAGGGCTCGCAGAGA-1_1 5.191950 C57 VSMC # 3 Malat1 AAAGGGCTCGCAGAGA-1_1 5.443660 C57 VSMC # 4 Actb AAAGGGCTCGCAGAGA-1_1 4.880147 C57 VSMC # 5 mt-Atp6 AAAGGGCTCGCAGAGA-1_1 4.500693 C57 VSMC # 6 mt-Co1 AAAGGGCTCGCAGAGA-1_1 4.223976 C57 VSMC copy_stack <- ggplot (data4stack, aes ( x = Expression, y = celltype, fill = Gene)) + geom_violin + labs ( x = "Expression Level" , y = "Identity" ) + facet_grid (. ~ Gene, scales = "free_x" , space = 'free_x' ) + # 按照基因拆分做stack,并使每个小提琴图的坐标按照真实极值生成,而非全部统一 theme_bw + # 去灰色背景 theme_few + # 去网格 theme ( panel.spacing.x = unit ( 0 , "cm" )) + # 去间隙 theme ( strip.background = element_blank , # 去Gene标题的背景 strip.text.x = element_text ( angle = - 90 , hjust = 1 , size= 15 , face= 'bold' ) # 旋转Gene标题的角度 ) ori_plot / copy_stack 3.5 重现拆分的VlnPlot # VlnPlot中能够做到按照一定的分类变量拆分小提琴,我们同样可以在ggplot2中重现这一过程 VlnPlot (scRNA, features = he_gene[ 1 ], split.by = 'group' ) # 用ggplot拆一下:ggplot ( data = data4plot, aes ( x = celltype, y = Gm42418, fill = group)) + # 同样是更改fill来实现 geom_violin + # 绘制小提琴图 labs ( x = 'Identity' , y = 'Expression Level' , title = 'Gm42418' , fill = NULL ) + # 设置坐标与标题的字符串 geom_violin + # 添加小提琴图 theme_bw + # 去灰色背景 theme_few + # 去网格 theme_classic + theme ( plot.title = element_text ( hjust = 0.5 , face = 'bold' , size= 15 ) ) + # 加粗标题并调整大小 theme ( axis.text.x.bottom = element_text ( angle = 45 , hjust = 1 , vjust = 1 , size = 12 ), # 横坐标斜体45° axis.text.y.left = element_text ( size= 10 )) 4. ggplot2改造 4.1 VlnPlot颜色修改 plot1 <- ggplot ( data = data4plot, aes ( x = celltype, y = Gm42418, fill = celltype)) + geom_violin + # 绘制小提琴图 labs ( x = 'Identity' , y = 'Expression Level' , title = 'Gm42418' , fill = NULL ) + # 设置坐标与标题的字符串 geom_violin + # 添加小提琴图 theme_bw + # 去灰色背景 theme_few + # 去网格 theme_classic + theme ( plot.title = element_text ( hjust = 0.5 , face = 'bold' , size= 15 ) ) + # 加粗标题并调整大小 theme ( axis.text.x.bottom = element_text ( angle = 45 , hjust = 1 , vjust = 1 , size = 12 ), # 调整x轴坐标角度、位置、大小 axis.text.y.left = element_text ( size= 10 )) # 调整y轴提提大小 # 用ggsci加一个柳叶刀配色 plot2 <- plot1 + scale_fill_manual ( values = ggsci :: pal_nejm ( 9 )) # 颜色会比默认配色好看很多 plot1 / plot2 # 如果觉得图注多余,可以隐藏。
legend.position = "right"(默认):图例位于图的右侧。# 其他:"left"左侧;"top"顶部;
"bottom"底部 plot2 + theme ( legend.position = "none" ) # 或者自定义图注的位置 # x = 0 表示最左边,x = 1 表示最右边;y = 0 表示底部,y = 1 表示顶部 plot2 + theme ( legend.position = c ( - 0.05 , - 0.05 )) # 调整的舒服一些 plot2 + theme ( legend.position = c ( 0.3 , 1 ), legend.background = element_blank , # 让图注背景框透明 legend.direction = 'horizontal' , plot.title = element_text ( hjust = 0.8 , face = 'bold' , size= 15 )) # 调整标题位置(水平对齐,0 = 左对齐,1 = 右对齐,0.5 = 居中。
这里 0.8 表示标题稍微向右偏移)、字体、大小 4.2 stack vlnplot颜色修改 # 添加ggsci配色 p1 <- copy_stack + scale_fill_manual ( values = c (ggsci :: pal_nejm ( 8 ),ggsci :: pal_aaas ( 2 ))) # 把颜色填充由Gene改为Celltype:p2 <- ggplot (data4stack, aes ( x = Expression, y = celltype, fill = celltype)) + # 改动fill即可 geom_violin + labs ( x = "Expression Level" , y = "Identity" ) + facet_grid (. ~ Gene, scales = "free_x" , space = 'free_x' ) + # 按照基因拆分做stack,并使每个小提琴图的坐标按照真实极值生成,而非全部统一 theme_bw + # 去灰色背景 theme_few + # 去网格 theme ( panel.spacing.x = unit ( 0 , "cm" )) + # 去间隙 theme ( strip.background = element_blank , # 去Gene标题的背景 strip.text.x = element_text ( angle = - 90 , hjust = 1 , size= 15 , face= 'bold' ) # 旋转Gene标题的角度 ) # 把颜色填充由Gene改为Celltype后,重新对每个Celltype赋予新的颜色 p3 <- p2 + scale_fill_manual ( values = c (ggsci :: pal_nejm ( 8 ),ggsci :: pal_aaas ( 2 ))) # 对比一下四个图 copy_stack | p1 | p2 | p3 # 交换x/y轴 copy_stack <- ggplot (data4stack, aes ( x = celltype, y = Expression , fill = Gene)) + geom_violin + labs ( x = "Expression Level" , y = "Identity" ) + facet_grid (Gene ~ ., scales = "free_y" , space = 'free_y' ) + theme_bw + # 去灰色背景 theme_few + # 去网格 theme ( panel.spacing.x = unit ( 0 , "cm" )) + # 去间隙 theme ( strip.background = element_blank , # 去Gene标题的背景 strip.text.x = element_text ( angle = - 90 , hjust = 1 , size= 15 , face= 'bold' ) # 旋转Gene标题的角度 ) copy_stack 4.3 拆分的VlnPlot颜色修改 # 此时需要自定义颜色应该输入group对应的三个颜色 ggplot ( data = data4plot, aes ( x = celltype, y = Gm42418, fill = group)) + geom_violin + # 绘制小提琴图 labs ( x = 'Identity' , y = 'Expression Level' , title = 'Gm42418' , fill = NULL ) + # 设置坐标与标题的字符串 geom_violin + # 添加小提琴图 theme_bw + # 去灰色背景 theme_few + # 去网格 theme_classic + theme ( plot.title = element_text ( hjust = 0.5 , face = 'bold' , size= 15 ) ) + # 加粗标题并调整大小 theme ( axis.text.x.bottom = element_text ( angle = 45 , hjust = 1 , vjust = 1 , size = 12 ), axis.text.y.left = element_text ( size= 10 )) + # 横坐标斜体45° scale_fill_manual ( values = ggsci :: pal_aaas ( 3 )) 4.4 坐标轴箭头调整 # 基于前面plot2图形调整 plot2 # 空心箭头坐标轴:plot2 + theme ( axis.line = element_line ( arrow = arrow ( length = unit ( 0.5 , 'cm' )))) # 调整箭头大小:plot2 + theme ( axis.line = element_line ( arrow = arrow ( length = unit ( 2 , 'cm' )))) # 调整箭头形状:plot2 + theme ( axis.line = element_line ( arrow = arrow ( length = unit ( 0.5 , 'cm' ), type = 'closed' ))) # 很多时候,简单的小提琴无法帮助我们直接的观察到基因的表达差异,例如下面这张图:# 4.4 小提琴+箱线图 # 显然这样能够看出C57的中位数要低于AS1组 my_box <- ggplot (data4plot, aes (group,Gm42418, fill= group)) + geom_violin ( alpha= 0.4 ) + stat_boxplot ( geom= "errorbar" , position = position_dodge ( width = 0.1 ), width= 0.1 ) + # 加上误差棒 geom_boxplot ( alpha= 0.5 , outlier.size= 0 , size= 0.3 , width= 0.3 ) + # 加上箱线图 scale_fill_manual ( values = c ( ' , ' , ' )) + # 填充颜色 theme_bw my_box 4.5 海盗船图 # 海盗船图:if ( ! require (ggpirate))devtools :: install_github ( "mikabr/ggpirate" ) ggplot (data4plot, aes ( x = group, y = Gm42418, fill = group)) + geom_pirate ( aes ( x = group, y = Gm42418, fill = group), alpha = 0.4 ) + scale_fill_manual ( values = c ( ' , ' , ' )) + # 填充颜色 theme_bw # 加上显著性:library (ggsignif) # 设置比较组 compaired <- list ( c ( "C57" , "AS1" ), c ( "C57" , "P3" ), c ( "AS1" , "P3" )) my_box + geom_signif ( comparisons = compaired, step_increase = 0.1 , map_signif_level = F, test = 'wilcox.test' ) 4.6 点图与蜂群图 # 加上点图 my_box + geom_dotplot ( binaxis= 'y' , # y轴上对应了点的表达量 stackdir= 'center' , dotsize = . 5 , fill= "red" , alpha= 0.3 ) # 蜂群图 library (ggbeeswarm) ggplot (data4plot, aes (group, Gm42418, fill = group)) + geom_beeswarm ( alpha = 0.4 ) # 蜂群图+箱线图 ggplot (data4plot, aes (group, Gm42418, fill = group)) + geom_beeswarm ( alpha = 0.4 ) + # 蜂群图 stat_boxplot ( geom = "errorbar" , position = position_dodge ( width = 0.1 ), width = 0.1 ) + # 加上误差棒 geom_boxplot ( alpha = 0.5 , outlier.size = 0 , size = 0.3 , width = 0.3 ) + # 加上箱线图 scale_fill_manual ( values = c ( ' , ' , ' )) + # 填充颜色 theme_bw 4.7 直方图 ggplot (data4plot, aes ( x = Gm42418, fill = celltype)) + geom_histogram ( position = "identity" , alpha = 0.4 , bins = 30 ) + # 直方图是ggplot2最基础的功能 theme_bw + facet_grid (group ~ .) # 仅按照group拆分 # 分面进行对比:ggplot (data4plot, aes ( x = Gm42418, fill = celltype)) + geom_histogram ( position = "identity" , alpha = 0.4 , bins = 30 ) + theme_bw + facet_grid (group ~ celltype) # groupXcelltype式拆分 4.8 劈开小提琴 # 劈开小提琴图,插入箱线图 library (gghalves) my_box <- ggplot (data4plot, aes (group, Gm42418, fill = group)) + geom_half_violin ( aes ( x = group, y = Gm42418), scale = "width" , side = "r" , position = position_nudge ( x = 0.1 )) + # 右半边小提琴图,通过position_nudge调整位置 geom_half_violin ( aes ( x = group, y = Gm42418), scale = "width" , side = "l" , position = position_nudge ( x = - 0.1 )) + # 左半边小提琴图,通过position_nudge调整位置 geom_boxplot ( width = 0.1 , outlier.shape = NA ) + # 中间的箱线图 scale_fill_manual ( values = c ( ' , ' , ' )) + # 填充颜色 theme_bw my_box # 让两半边分别展示不同的分组:my_box <- ggplot (dplyr :: filter (data4plot,group %in% c ( 'C57' , 'AS1' )), aes (celltype, Gm42418, fill = group)) + geom_half_violin ( data = dplyr :: filter (data4plot,group == 'C57' ) , aes ( x = celltype, y = Gm42418), scale = "width" , side = "r" , position = position_nudge ( x = 0.1 )) + # 右半边小提琴图,通过position_nudge调整位置 geom_half_violin ( data = dplyr :: filter (data4plot,group == 'AS1' ), aes ( x = celltype, y = Gm42418), scale = "width" , side = "l" , position = position_nudge ( x = - 0.1 )) + # 左半边小提琴图,通过position_nudge调整位置 geom_boxplot ( width = 0.1 , outlier.shape = NA ) + # 中间的箱线图 scale_fill_manual ( values = c ( ' , ' )) + # 填充颜色 theme_bw my_box 第十二讲:改造单细胞图形-FeaturePlot 1. 准备工作 library (Seurat) library (dplyr) library (patchwork) # 读入第三讲-单样本分析保存的pbmc.rds并绘制FeaturePlot scRNA <- readRDS ( '../result_seuratV5/pbmc.rds' ) DimPlot (scRNA) scRNA.markers <- FindAllMarkers (scRNA, only.pos = TRUE , min.pct = 0.25 , logfc.threshold = 0.25 ) # 选取每个亚群top3基因 top3gene <- scRNA.markers %>% group_by (cluster) %>% top_n ( n = 3 , wt = avg_log2FC) top3gene # # A tibble: 27 × 7 # # Groups: cluster [9] # p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene # <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr> # 1 1.17e- 83 2.37 0.435 0.108 1.60e- 79 Naive CD4 T CCR7 # 2 3.28e- 49 2.10 0.333 0.103 4.50e- 45 Naive CD4 T LEF1 # 3 9.31e- 44 2.02 0.328 0.11 1.28e- 39 Naive CD4 T PRKCQ-AS1 # 4 0 6.64 0.975 0.121 0 CD14+ Mono S100A8 # 5 3.10e-139 7.28 0.3 0.004 4.25e-135 CD14+ Mono FOLR3 # 6 1.63e-121 6.74 0.277 0.006 2.23e-117 CD14+ Mono S100A12 # 7 5.53e- 61 1.65 0.657 0.245 7.58e- 57 Memory CD4 T CD2 # 8 2.61e- 59 2.11 0.424 0.111 3.58e- 55 Memory CD4 T AQP3 # 9 1.94e- 35 1.90 0.267 0.069 2.66e- 31 Memory CD4 T CD40LG # 10 0 6.91 0.936 0.041 0 B CD79A # # ℹ 17 more rows 2. 调整FeaturePlot参数 # 初始FeaturePlot,CD79A为B细胞标记基因 FeaturePlot (scRNA, features = 'CD79A' , label = T) # 和原先DimPLot结果对比 DimPlot (scRNA) | FeaturePlot (scRNA, features = 'CD79A' , label = T) # 绘制多个基因 FeaturePlot (scRNA, c ( 'CD79A' , 'CD8A' ), blend = F) # FeaturePlot绘制两个基因共表达 # blend = T启动颜色混合模式,这种情况只能绘制两个基因 FeaturePlot (scRNA, c ( 'CD79A' , 'CD8A' ), blend = T) # 绘制两个以上基因会出现以下提示 # try({FeaturePlot(scRNA,c('CD79A','CD8A','CCR7'),blend = T)}) # Error in FeaturePlot(scRNA, c("CD79A", "CD8A", "CCR7"), blend = T) : # Blending feature plots only works with two features 3. ggplot2直接中的DIY 3.1 数据整理 library (dplyr) library (ggplot2) library (ggthemes) library (ggnewscale) mydata <- scRNA @ reductions $ umap @ cell.embeddings head (mydata) # umap_1 umap_2 # AAACATACAACCAC-1 -3.304555 -3.914703 # AAACATTGAGCTAC-1 -5.620712 10.682330 # AAACATTGATCAGC-1 -5.630144 -6.400627 # AAACCGTGCTTCCG-1 9.145644 5.535673 # AAACCGTGTATGCG-1 -5.354362 2.053312 # AAACGCACTGGTAC-1 -3.218432 -6.085276 myexpr <- as.data.frame ( GetAssayData (scRNA, layer = "data" ))[ c ( 'CD79A' , 'CD8A' , 'CCR7' ),] %>% t head (myexpr) # CD79A CD8A CCR7 # AAACATACAACCAC-1 0.000000 1.635873 1.635873 # AAACATTGAGCTAC-1 1.962726 0.000000 0.000000 # AAACATTGATCAGC-1 0.000000 0.000000 0.000000 # AAACCGTGCTTCCG-1 0.000000 0.000000 0.000000 # AAACCGTGTATGCG-1 0.000000 0.000000 0.000000 # AAACGCACTGGTAC-1 0.000000 0.000000 0.000000 mydata <- cbind (mydata,myexpr) %>% as.data.frame mydata $ celltype <- Idents (scRNA) head (mydata) # umap_1 umap_2 CD79A CD8A CCR7 celltype # AAACATACAACCAC-1 -3.304555 -3.914703 0.000000 1.635873 1.635873 Memory CD4 T # AAACATTGAGCTAC-1 -5.620712 10.682330 1.962726 0.000000 0.000000 B # AAACATTGATCAGC-1 -5.630144 -6.400627 0.000000 0.000000 0.000000 Memory CD4 T # AAACCGTGCTTCCG-1 9.145644 5.535673 0.000000 0.000000 0.000000 CD14+ Mono # AAACCGTGTATGCG-1 -5.354362 2.053312 0.000000 0.000000 0.000000 NK # AAACGCACTGGTAC-1 -3.218432 -6.085276 0.000000 0.000000 0.000000 Memory CD4 T 3.2 重现FeaturePlot # 绘制一个基因 p <- ggplot (mydata, aes ( x= umap_1, y= umap_2)) + geom_point ( data = mydata, aes ( x= umap_1, y= umap_2, color= CD79A), size= 1 ) + scale_color_gradient ( 'CD8A' , low = alpha ( 'grey' , 0.1 ), high = alpha ( ' , 1 )) p # 添加置信区间 p + stat_density2d ( aes ( colour= CD8A)) # 同时绘制三个基因 # 一般情况下,ggplot2默认只能有一个颜色映射,如果后续 geom_* 图层再使用 aes(color = ...),它会复用前面的颜色映射。
# new_scale("color") 是ggnewscale包中的函数,用于在同一张ggplot2图中应用多个颜色映射。# 例如这里我们需要绘制三个基因,scale_color_gradient已经对CD79A进行了颜色映射,颜色从灰色到紫色。
后续还想对其他基因表达值提供新的颜色映射,new_scale("color") 允许你在后续图层中重新定义 aes(color = ...),并应用新的 scale_color_* 规则 p <- ggplot (mydata, aes ( x= umap_1, y= umap_2)) + geom_point ( data = mydata, aes ( x= umap_1, y= umap_2, color= CD79A), size= 1 ) + scale_color_gradient ( 'CD79A' , low = alpha ( 'grey' , 0.1 ), high = alpha ( ' , 1 )) + new_scale ( 'color' ) + geom_point ( data = mydata, aes ( x= umap_1, y= umap_2, color= CD8A), size= 1 ) + scale_color_gradient ( 'CD8A' , low = alpha ( 'grey' , 0.1 ), high = alpha ( 'red' , 1 )) + new_scale ( 'color' ) + geom_point ( data = mydata, aes ( x= umap_1, y= umap_2, color= CCR7), size= 1 ) + scale_color_gradient ( 'CCR7' , low = alpha ( 'grey' , 0.1 ), high = alpha ( ' , 1 )) p # 更改背景 p <- p + theme_bw + # 删除灰色背景 theme_few + # 删除网格线 theme_classic # 删除上方和右侧的边框线 p # 添加置信区间 p <- p + stat_ellipse ( aes ( fill= celltype)) p # 整理label文件 # 计算每种细胞类型(celltype)在UMAP 空间(umap_1, umap_2)上的平均坐标,并存入my.label这个数据框,以便后续用于UMAP细胞类型标注 my.label <- data.frame for (runcell in unique (mydata $ celltype)) { umap_1 <- dplyr :: filter (mydata,celltype == runcell) %>% pull (umap_1) %>% mean umap_2 <- dplyr :: filter (mydata,celltype == runcell) %>% pull (umap_2) %>% mean my.label <- rbind (my.label, cbind (umap_1,umap_2)) } rownames (my.label) <- unique (mydata $ celltype) my.label $ celltype <- rownames (my.label) my.label # umap_1 umap_2 celltype # Memory CD4 T -4.203053 -5.8672066 Memory CD4 T # B -4.050225 11.3856252 B # CD14+ Mono 9.963179 3.5813161 CD14+ Mono # NK -6.006065 1.7420971 NK # CD8 T -4.693666 -1.3874622 CD8 T # Naive CD4 T -1.245441 -5.6826215 Naive CD4 T # FCGR3A+ Mono 8.431078 7.3155198 FCGR3A+ Mono # DC 6.634143 3.7099839 DC # Platelet 5.470041 0.4560609 Platelet # 添加label p <- p + geom_text ( aes ( x = umap_1, y = umap_2, label= celltype), color= 'black' , fontface= "bold" , data = my.label, show.legend = F) p # 分面 p + facet_grid ( ~ celltype) 第十三讲:改造单细胞图形-DotPlot 1. 准备工作 suppressMessages ({ library (Seurat) library (ggplot2) library (dplyr) library (tidyr) library (aplot) library (ggtree) }) # 和第十二讲前期处理类似 # 同样读入第三讲-单样本分析保存的pbmc.rds sce_final <- readRDS ( '../result_seuratV5/pbmc.rds' ) Idents (sce_final) <- "seurat_clusters" DimPlot (sce_final) scRNA.markers <- FindAllMarkers (sce_final, only.pos = TRUE , min.pct = 0.25 , logfc.threshold = 0.25 ) # 选取每个亚群top3基因 marker <- scRNA.markers %>% group_by (cluster) %>% top_n ( n = 3 , wt = avg_log2FC) marker <- marker $ gene 2. DotPlot参数 DotPlot ( object, features, assay = NULL , 默认是active assay cols = c ( "lightgrey" , "blue" ), col.min = - 2.5 , col.max = 2.5 , dot.min = 0 , dot.scale = 6 , idents = NULL , group.by = NULL , split.by = NULL , cluster.idents = FALSE , scale = TRUE , scale.by = "radius" , scale.min = NA , scale.max = NA ) 3. DotPlot及其自带参数美化 # 默认参数效果 DotPlot (sce_final, features = marker) # 修改参数展示 # 一般比较常用的调整参数包括以下:idents / group.by / cols / features # (1) 挑选部分细胞类型展示 DotPlot (sce_final, features = marker, idents = c ( "0" , "2" , "4" , "6" )) # (2) 按照指定列作为分组信息展示,例如使用seurat_clusters分组 DotPlot (sce_final, features = marker, group.by = "celltype" ) # (3) 修改渐变色 DotPlot (sce_final, features = marker, cols = c ( "white" , "red" )) # (4) 不同类别的features,分类展示 marker_split <- list ( TCells= c ( "IL32" , "CD3D" , "CD3E" , "MALAT1" , "LDHB" ), Myeloid= c ( "LST1" , "TYROBP" , "FCER1G" , "CST3" , "AIF1" ), BCells= c ( "CD79A" , "MS4A1" , "CD79B" , "LINC00926" , "TCL1A" ), Platelet= c ( "GP9" , "AP001189.4" , "ITGA2B" , "TMEM40" , "LY6G6F" )) DotPlot (sce_final, features = marker_split ) 4、DotPlot联合ggplot2美化 # 在绘图过程中我们也可能遇到几十个基因同时绘制的情况,这时候避免出现字符重叠的情况,我们可以利用Seurat自带修饰主题或者ggplot2这个好帮手 # (1)Seurat自带修饰主题,调整坐标轴字符倾斜角度 DotPlot (sce_final, features = marker) + RotatedAxis # (2)ggplot2的theme函数,调整坐标轴字符倾斜角度 DotPlot (sce_final, features = marker) + theme ( axis.text.x= element_text ( angle= 45 , hjust = 1 )) # (3)ggplot2的coord_flip函数,坐标轴翻转 DotPlot (sce_final, features = marker) + coord_flip 5、ggplot2复现DotPlot函数源码 DotPlot <- function ( object, features, assay = NULL , cols = c ( "lightgrey" , "blue" ), col.min = - 2.5 , col.max = 2.5 , dot.min = 0 , dot.scale = 6 , idents = NULL , group.by = NULL , split.by = NULL , cluster.idents = FALSE , scale = TRUE , scale.by = 'radius' , scale.min = NA , scale.max = NA ) { assay <- assay %||% DefaultAssay ( object = object) DefaultAssay ( object = object) <- assay split.colors <- ! is.null ( x = split.by) && ! any (cols %in% rownames ( x = brewer.pal.info)) scale.func <- switch ( EXPR = scale.by, 'size' = scale_size, 'radius' = scale_radius, stop ( "'scale.by' must be either 'size' or 'radius'" ) ) feature.groups <- NULL if ( is.list (features) | any ( ! is.na ( names (features)))) { feature.groups <- unlist ( x = sapply ( X = 1 : length (features), FUN = function (x) { return ( rep ( x = names ( x = features)[x], each = length (features[[x]]))) } )) if ( any ( is.na ( x = feature.groups))) { warning ( "Some feature groups are unnamed." , call. = FALSE , immediate. = TRUE ) } features <- unlist ( x = features) names ( x = feature.groups) <- features } cells <- unlist ( x = CellsByIdentities ( object = object, cells = colnames (object[[assay]]), idents = idents)) data.features <- FetchData ( object = object, vars = features, cells = cells) data.features $ id <- if ( is.null ( x = group.by)) { Idents ( object = object)[cells, drop = TRUE ] } else { object[[group.by, drop = TRUE ]][cells, drop = TRUE ] } if ( ! is.factor ( x = data.features $ id)) { data.features $ id <- factor ( x = data.features $ id) } id.levels <- levels ( x = data.features $ id) data.features $ id <- as.vector ( x = data.features $ id) if ( ! is.null ( x = split.by)) { splits <- FetchData ( object = object, vars = split.by)[cells, split.by] if (split.colors) { if ( length ( x = unique ( x = splits)) > length ( x = cols)) { stop ( paste0 ( "Need to specify at least " , length ( x = unique ( x = splits)), " colors using the cols parameter" )) } cols <- cols[ 1 : length ( x = unique ( x = splits))] names ( x = cols) <- unique ( x = splits) } data.features $ id <- paste (data.features $ id, splits, sep = '_' ) unique.splits <- unique ( x = splits) id.levels <- paste0 ( rep ( x = id.levels, each = length ( x = unique.splits)), "_" , rep ( x = unique ( x = splits), times = length ( x = id.levels))) } data.plot <- lapply ( X = unique ( x = data.features $ id), FUN = function (ident) { data.use <- data.features[data.features $ id == ident, 1 : ( ncol ( x = data.features) - 1 ), drop = FALSE ] avg.exp <- apply ( X = data.use, MARGIN = 2 , FUN = function (x) { return ( mean ( x = expm1 ( x = x))) } ) pct.exp <- apply ( X = data.use, MARGIN = 2 , FUN = PercentAbove, threshold = 0 ) return ( list ( avg.exp = avg.exp, pct.exp = pct.exp)) } ) names ( x = data.plot) <- unique ( x = data.features $ id) if (cluster.idents) { mat <- do.call ( what = rbind, args = lapply ( X = data.plot, FUN = unlist) ) mat <- scale ( x = mat) id.levels <- id.levels[ hclust ( d = dist ( x = mat)) $ order] } data.plot <- lapply ( X = names ( x = data.plot), FUN = function (x) { data.use <- as.data.frame ( x = data.plot[[x]]) data.use $ features.plot <- rownames ( x = data.use) data.use $ id <- x return (data.use) } ) data.plot <- do.call ( what = 'rbind' , args = data.plot) if ( ! is.null ( x = id.levels)) { data.plot $ id <- factor ( x = data.plot $ id, levels = id.levels) } ngroup <- length ( x = levels ( x = data.plot $ id)) if (ngroup == 1 ) { scale <- FALSE warning ( "Only one identity present, the expression values will be not scaled" , call. = FALSE , immediate. = TRUE ) } else if (ngroup < 5 & scale) { warning ( "Scaling data with a low number of groups may produce misleading results" , call. = FALSE , immediate. = TRUE ) } avg.exp.scaled <- sapply ( X = unique ( x = data.plot $ features.plot), FUN = function (x) { data.use <- data.plot[data.plot $ features.plot == x, 'avg.exp' ] if (scale) { data.use <- scale ( x = log1p (data.use)) data.use <- MinMax ( data = data.use, min = col.min, max = col.max) } else { data.use <- log1p ( x = data.use) } return (data.use) } ) avg.exp.scaled <- as.vector ( x = t ( x = avg.exp.scaled)) if (split.colors) { avg.exp.scaled <- as.numeric ( x = cut ( x = avg.exp.scaled, breaks = 20 )) } data.plot $ avg.exp.scaled <- avg.exp.scaled data.plot $ features.plot <- factor ( x = data.plot $ features.plot, levels = features ) data.plot $ pct.exp[data.plot $ pct.exp < dot.min] <- NA data.plot $ pct.exp <- data.plot $ pct.exp * 100 if (split.colors) { splits.use <- unlist ( x = lapply ( X = data.plot $ id, FUN = function (x) sub ( paste0 ( ".*_(" , paste ( sort ( unique ( x = splits), decreasing = TRUE ), collapse = '|' ), ")$" ), " \\ 1" , x ) ) ) data.plot $ colors <- mapply ( FUN = function (color, value) { return ( colorRampPalette ( colors = c ( 'grey' , color))( 20 )[value]) }, color = cols[splits.use], value = avg.exp.scaled ) } color.by <- ifelse ( test = split.colors, yes = 'colors' , no = 'avg.exp.scaled' ) if ( ! is.na ( x = scale.min)) { data.plot[data.plot $ pct.exp < scale.min, 'pct.exp' ] <- scale.min } if ( ! is.na ( x = scale.max)) { data.plot[data.plot $ pct.exp > scale.max, 'pct.exp' ] <- scale.max } if ( ! is.null ( x = feature.groups)) { data.plot $ feature.groups <- factor ( x = feature.groups[data.plot $ features.plot], levels = unique ( x = feature.groups) ) } plot <- ggplot ( data = data.plot, mapping = aes_string ( x = 'features.plot' , y = 'id' )) + geom_point ( mapping = aes_string ( size = 'pct.exp' , color = color.by)) + scale.func ( range = c ( 0 , dot.scale), limits = c (scale.min, scale.max)) + theme ( axis.title.x = element_blank , axis.title.y = element_blank ) + guides ( size = guide_legend ( title = 'Percent Expressed' )) + labs ( x = 'Features' , y = ifelse ( test = is.null ( x = split.by), yes = 'Identity' , no = 'Split Identity' ) ) + theme_cowplot if ( ! is.null ( x = feature.groups)) { plot <- plot + facet_grid ( facets = ~ feature.groups, scales = "free_x" , space = "free_x" , switch = "y" ) + theme ( panel.spacing = unit ( x = 1 , units = "lines" ), strip.background = element_blank ) } if (split.colors) { plot <- plot + scale_color_identity } else if ( length ( x = cols) == 1 ) { plot <- plot + scale_color_distiller ( palette = cols) } else { plot <- plot + scale_color_gradient ( low = cols[ 1 ], high = cols[ 2 ]) } if ( ! split.colors) { plot <- plot + guides ( color = guide_colorbar ( title = 'Average Expression' )) } return (plot) } 从中可以看到,DotPlot函数主要依赖于下面这段代码绘图,使用到的主要绘图函数是ggplot2 接下来我们尝试用终极大法“ggplot2”还原DotPlot的结果 # DotPlot绘图数据获取 p1 <- DotPlot (sce_final, features = marker) p1 data <- p1 $ data # 添加cluster对应细胞类型名称 data $ celltype <- as.character (new.cluster.ids[data $ id]) # 数据查看 head (data) # avg.exp pct.exp features.plot id avg.exp.scaled celltype # CCR7 2.966887045 43.459916 CCR7 0 2.0782862 Naive CD4 T # LEF1 2.080254074 33.333333 LEF1 0 2.0344535 Naive CD4 T # PRKCQ-AS1 2.092014612 32.770745 PRKCQ-AS1 0 1.8243866 Naive CD4 T # S100A8 0.415138208 8.298172 S100A8 0 -0.5995877 Naive CD4 T # FOLR3 0.005816666 0.140647 FOLR3 0 -0.3653701 Naive CD4 T # S100A12 0.012959228 0.281294 S100A12 0 -0.4006327 Naive CD4 T # 设置展示顺序,将相同细胞类型的cluster放在一起展示 data $ id <- factor (data $ id, levels= c ( "0" , "2" , "4" , "6" , "1" , "5" , "7" , "3" , "8" )) # ggplot2绘图 p2 <- ggplot (data , aes ( x = features.plot, y = id, size = pct.exp, color = avg.exp.scaled)) + geom_point ( shape= 16 ) + = 16实心圆 theme_bw + scale_colour_gradient2 + xlab ( NULL ) + ylab ( NULL ) + theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 )) + theme ( panel.grid.major= element_blank ) + geom_hline ( yintercept= c ( 4.5 , 7.5 , 8.5 ), linewidth= . 5 ) + theme ( panel.border = element_rect ( fill= NA , color= "black" , linewidth= 1.5 , linetype= "solid" )) + 线型为实心线 coord_flip p2 new.cluster.ids <- c ( "TCells" , "Myeloid" , "TCells" , "BCells" , "TCells" , "Myeloid" , "TCells" , "Myeloid" , "Platelet" ) names (new.cluster.ids) <- 0 : 8 anno1 <- data.frame (new.cluster.ids) anno1 <- anno1[ c ( "0" , "2" , "4" , "6" , "1" , "5" , "7" , "3" , "8" ),,drop = F] anno1 $ cluster <- rownames (anno1) anno1 $ cluster <- factor (anno1 $ cluster, levels= c ( "0" , "2" , "4" , "6" , "1" , "5" , "7" , "3" , "8" )) anno1 $ other <- "anno1" anno1 # new.cluster.ids cluster other # 0 TCells 0 anno1 # 2 TCells 2 anno1 # 4 TCells 4 anno1 # 6 TCells 6 anno1 # 1 Myeloid 1 anno1 # 5 Myeloid 5 anno1 # 7 Myeloid 7 anno1 # 3 BCells 3 anno1 # 8 Platelet 8 anno1 p2_anno <- ggplot (anno1, aes ( x= cluster, y= other, fill= new.cluster.ids)) + geom_tile + scale_fill_manual ( values = c ( "TCells" = " , "Myeloid" = " , "BCells" = " , "Platelet" = " )) + scale_y_discrete ( position = "right" ) + theme_bw + theme ( panel.grid.major= element_blank , panel.border = element_blank ) + theme ( axis.text.x = element_blank , axis.text.y = element_blank , axis.ticks.x= element_blank , axis.ticks.y = element_blank ) + labs ( x= "" , y= "" ) p2_anno set.seed ( 7 ) # 合并dotplot和细胞注释条 p2 %>% insert_top (p2_anno, height = . 05 ) expr <- as.data.frame (sce_final @ assays $ RNA $ data[marker,]) hc <- hclust ( dist (expr)) ggtree (hc) ggtree (hc) + geom_text ( aes ( label = node), hjust = - 0.3 , vjust = - 0.3 ) hcc1 <- ggtree (hc) + geom_nodepoint ( color= " , alpha= 1 / 4 , size= 10 ) + geom_tippoint ( color= " , shape= 8 , size= 3 ) hcc1 hcc2 <- ggtree (hc) + geom_point ( aes ( shape= isTip, color= isTip), size= 3 ) + geom_hilight ( node= 31 , fill= "steelblue" , alpha= . 1 ) + geom_hilight ( node= 34 , fill= "darkgreen" , alpha= . 1 ) + geom_hilight ( node= 37 , fill= " , alpha= . 1 ) + geom_hilight ( node= 40 , fill= " , alpha= . 1 ) + geom_hilight ( node= 41 , fill= " , alpha= . 1 ) hcc2 # 合并dotplot、细胞注释条和基因聚类树 p2 %>% insert_top (p2_anno, height = . 05 ) %>% insert_left (hcc1, width = 1 ) p2 %>% insert_top (p2_anno, height = . 05 ) %>% insert_left (hcc2, width = 1 ) 第十四讲:改造单细胞图形-DoHeatmap 1、加载数据 suppressMessages ({ library (Seurat) library (ggplot2) library (dplyr) library (tidyr) library (aplot) library (pheatmap) library (circlize) library (ComplexHeatmap) library (dendextend) }) 本期测试的数据来源于数据集 GSE139107。
前期的数据处理可以参考我们之前推送的DimPlot优化内容改造单细胞降维图|1.DimPlot的探索 sce <- readRDS ( "./iri.intergrate.rds" ) sce # An object of class Seurat # 29133 features across 126578 samples within 2 assays # Active assay: RNA (27133 features, 0 variable features) # 2 layers present: counts, data # 1 other assay present: integrated # 3 dimensional reductions calculated: pca, umap, tsne DimPlot (sce, raster= FALSE ) levels (sce) # [1] "PTS1" "PTS2" "PTS3" "NewPT1" "NewPT2" "DTL-ATL" "MTAL" # [8] "CTAL1" "CTAL2" "MD" "DCT" "DCT-CNT" "CNT" "PC1" # [15] "PC2" "ICA" "ICB" "Uro" "Pod" "PEC" "EC1" # [22] "EC2" "Fib" "Per" "Mø" "Tcell" sub_celltype <- c ( "MD" , "Uro" , "Pod" , "PEC" , "EC2" , "Per" , "Tcell" ) sub # standardGeneric for "sub" defined from package "base" # # function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, # fixed = FALSE, useBytes = FALSE) # standardGeneric("sub") # <environment: 0x557cc51c4b90> # Methods may be defined for arguments: pattern, replacement, x, ignore.case, perl, fixed, useBytes # Use showMethods(sub) for currently available ones. sub <- subset (sce, subset= celltype %in% sub_celltype) table ( Idents (sub)) # # MD Uro Pod PEC2 Per Tcell # 155 1167 772 518 434 397 619 sub <- ScaleData (sub) sub.markers <- FindAllMarkers (sub, only.pos = TRUE ) top <- sub.markers %>% group_by (cluster) %>% dplyr :: filter (avg_log2FC > 1 ) %>% slice_head ( n = 5 ) %>% ungroup marker <- top $ gene marker # [1] "Slc12a1" "Nos1" "Tenm2" "Enox1" "Erbb4" "Crybg1" # [7] "Naaladl2" "Gpc5" "Grip1" "Gsdmc2" "Magi2" "Srgap1" # [13] "Ptpro" "Robo2" "Wt1" "Ncam1" "Akap12" "Bicc1" # [19] "Cp" "Cdh6" "Ptprb" "Prex2" "Arhgap31" "Heg1" # [25] "Shank3" "Ebf1" "Cacna1c" "Slit3" "Stac" "Dgkb" # [31] "Gm2682" "Arhgap45" "Arhgap15" "Skap1" "Mir142hg" 2、DoHeatmap参数 DoHeatmap ( object, features = NULL , cells = NULL , group.by = "ident" , group.bar = TRUE , group.colors = NULL , disp.min = - 2.5 , (下面的所有值都被剪裁) disp.max = NULL , (以上所有值都被剪切) ; 如果槽是 “scale.data”,默认值为 2.5,否则为 6 slot = "scale.data" , scale.data,也可以选择 data assay = NULL , label = TRUE , TRUE size = 5.5 , hjust = 0 , vjust = 0 , angle = 45 , raster = TRUE , draw.lines = TRUE , lines.width = NULL , group.bar.height = 0.02 , combine = TRUE ) 3、DoHeatmap及其自带参数美化 # 默认参数效果 DoHeatmap (sub, features = marker) # 修改参数展示 # 一般比较常用的调整参数包括以下:cells / group.by / group.colors / angle / lines.width / size # (1)挑选部分细胞类型展示 sub_sub <- subset (sub, idents= c ( "Per" , "Tcell" )) cells <- colnames (sub_sub) DoHeatmap (sub, features = marker, cells= cells) # 这样展示可能会导致被剔除的细胞标签还显示在上面,可以直接用挑选后的seurat对象绘图 DoHeatmap (sub_sub, features = marker) # (2)按照指定列作为分组信息展示,例如使用Group分组 DoHeatmap (sub, features = marker, group.by = "Group" ) # (3)修改分组颜色 DoHeatmap (sub, features = marker, group.colors = rainbow ( 7 )) # (4)调整字体显示角度 DoHeatmap (sub, features = marker, angle= 90 ) # (5)增加分割线的宽度 DoHeatmap (sub, features = marker, lines.width= 50 ) # (6)修改文本字体大小 DoHeatmap (sub, features = marker, size = 10 ) 4、ggplot2复现DoHeatmap函数源码 DoHeatmap <- function ( object, features = NULL , cells = NULL , group.by = 'ident' , group.bar = TRUE , group.colors = NULL , disp.min = - 2.5 , disp.max = NULL , slot = 'scale.data' , assay = NULL , label = TRUE , size = 5.5 , hjust = 0 , vjust = 0 , angle = 45 , raster = TRUE , draw.lines = TRUE , lines.width = NULL , group.bar.height = 0.02 , combine = TRUE ) { assay <- assay %||% DefaultAssay ( object = object) DefaultAssay ( object = object) <- assay cells <- cells %||% colnames ( x = object[[assay]]) if ( is.numeric ( x = cells)) { cells <- colnames ( x = object)[cells] } features <- features %||% VariableFeatures ( object = object) features <- rev ( x = unique ( x = features)) disp.max <- disp.max %||% ifelse ( test = slot == 'scale.data' , yes = 2.5 , no = 6 ) # make sure features are present possible.features <- rownames ( x = GetAssayData ( object = object, slot = slot)) if ( any ( ! features %in% possible.features)) { bad.features <- features[ ! features %in% possible.features] features <- features[features %in% possible.features] if ( length ( x = features) == 0 ) { stop ( "No requested features found in the " , slot, " slot for the " , assay, " assay." ) } warning ( "The following features were omitted as they were not found in the " , slot, " slot for the " , assay, " assay: " , paste (bad.features, collapse = ", " )) } data <- as.data.frame ( x = as.matrix ( x = t ( x = GetAssayData ( object = object, slot = slot)[features, cells, drop = FALSE ]))) object <- suppressMessages ( expr = StashIdent ( object = object, save.name = 'ident' )) group.by <- group.by %||% 'ident' groups.use <- object[[group.by]][cells, , drop = FALSE ] # group.use <- switch( # EXPR = group.by, # 'ident' = Idents(object = object), # object[[group.by, drop = TRUE]] # ) # group.use <- factor(x = group.use[cells]) plots <- vector ( mode = 'list' , length = ncol ( x = groups.use)) for (i in 1 : ncol ( x = groups.use)) { data.group <- data group.use <- groups.use[, i, drop = TRUE ] if ( ! is.factor ( x = group.use)) { group.use <- factor ( x = group.use) } names ( x = group.use) <- cells if (draw.lines) { # create fake cells to serve as the white lines, fill with NAs lines.width <- lines.width %||% ceiling ( x = nrow ( x = data.group) * 0.0025 ) placeholder.cells <- sapply ( X = 1 : ( length ( x = levels ( x = group.use)) * lines.width), FUN = function (x) { return ( RandomName ( length = 20 )) } ) placeholder.groups <- rep ( x = levels ( x = group.use), times = lines.width) group.levels <- levels ( x = group.use) names ( x = placeholder.groups) <- placeholder.cells group.use <- as.vector ( x = group.use) names ( x = group.use) <- cells group.use <- factor ( x = c (group.use, placeholder.groups), levels = group.levels) na.data.group <- matrix ( data = NA , nrow = length ( x = placeholder.cells), ncol = ncol ( x = data.group), dimnames = list (placeholder.cells, colnames ( x = data.group)) ) data.group <- rbind (data.group, na.data.group) } lgroup <- length ( levels (group.use)) plot <- SingleRasterMap ( data = data.group, raster = raster, disp.min = disp.min, disp.max = disp.max, feature.order = features, cell.order = names ( x = sort ( x = group.use)), group.by = group.use ) if (group.bar) { # TODO : Change group.bar to annotation.bar default.colors <- c ( hue_pal ( length ( x = levels ( x = group.use)))) if ( ! is.null ( x = names ( x = group.colors))) { cols <- unname ( obj = group.colors[ levels ( x = group.use)]) } else { cols <- group.colors[ 1 : length ( x = levels ( x = group.use))] %||% default.colors } if ( any ( is.na ( x = cols))) { cols[ is.na ( x = cols)] <- default.colors[ is.na ( x = cols)] cols <- Col2Hex (cols) col.dups <- sort ( x = unique ( x = which ( x = duplicated ( x = substr ( x = cols, start = 1 , stop = 7 ))))) through <- length ( x = default.colors) while ( length ( x = col.dups) > 0 ) { pal.max <- length ( x = col.dups) + through cols.extra <- hue_pal (pal.max)[(through + 1 ) : pal.max] cols[col.dups] <- cols.extra col.dups <- sort ( x = unique ( x = which ( x = duplicated ( x = substr ( x = cols, start = 1 , stop = 7 ))))) } } group.use2 <- sort ( x = group.use) if (draw.lines) { na.group <- RandomName ( length = 20 ) levels ( x = group.use2) <- c ( levels ( x = group.use2), na.group) group.use2[placeholder.cells] <- na.group cols <- c (cols, " ) } pbuild <- ggplot_build ( plot = plot) names ( x = cols) <- levels ( x = group.use2) # scale theight of the bar y.range <- diff ( x = pbuild $ layout $ panel_params[[ 1 ]] $ y.range) y.pos <- max (pbuild $ layout $ panel_params[[ 1 ]] $ y.range) + y.range * 0.015 y.max <- y.pos + group.bar.height * y.range x.min <- min (pbuild $ layout $ panel_params[[ 1 ]] $ x.range) + 0.1 x.max <- max (pbuild $ layout $ panel_params[[ 1 ]] $ x.range) - 0.1 plot <- plot + annotation_raster ( raster = t ( x = cols[group.use2]), xmin = x.min, xmax = x.max, ymin = y.pos, ymax = y.max ) + coord_cartesian ( ylim = c ( 0 , y.max), clip = 'off' ) + scale_color_manual ( values = cols[ - length ( x = cols)], name = "Identity" , na.translate = FALSE ) if (label) { x.max <- max (pbuild $ layout $ panel_params[[ 1 ]] $ x.range) # Attempt to pull xdivs from x.major in ggplot2 < 3.3.0; if NULL, pull from the >= 3.3.0 slot x.divs <- pbuild $ layout $ panel_params[[ 1 ]] $ x.major %||% attr ( x = pbuild $ layout $ panel_params[[ 1 ]] $ x $ get_breaks , which = "pos" ) x <- data.frame ( group = sort ( x = group.use), x = x.divs) label.x.pos <- tapply ( X = x $ x, INDEX = x $ group, FUN = function (y) { if ( isTRUE ( x = draw.lines)) { mean ( x = y[ - length ( x = y)]) } else { mean ( x = y) } }) label.x.pos <- data.frame ( group = names ( x = label.x.pos), label.x.pos) plot <- plot + geom_text ( stat = "identity" , data = label.x.pos, aes_string ( label = 'group' , x = 'label.x.pos' ), y = y.max + y.max * 0.03 * 0.5 + vjust, angle = angle, hjust = hjust, size = size ) plot <- suppressMessages (plot + coord_cartesian ( ylim = c ( 0 , y.max + y.max * 0.002 * max ( nchar ( x = levels ( x = group.use))) * size), clip = 'off' ) ) } } plot <- plot + theme ( line = element_blank ) plots[[i]] <- plot } if (combine) { plots <- wrap_plots (plots) } return (plots) } 从中可以看到,DoHeatmap函数主要依赖于下面这段代码绘图,使用到的主要绘图函数还是基于ggplot2 接下来我们尝试用“ggplot2”还原DoHeatmap的结果 # DoHeatmap绘图横坐标顺序获取 p1 <- DoHeatmap (sub, features = marker) gg_build <- ggplot_build (p1) x_order <- gg_build $ layout $ panel_params[[ 1 ]] $ x $ get_labels x_order <- x_order[ ! (x_order %in% setdiff (x_order, colnames (sub)))] # 目标基因表达矩阵获取 data1 <- as.data.frame (Seurat :: GetAssayData (sub, layer = "data" )) rownames (data1) <- rownames (sub) colnames (data1) <- colnames (sub) data1 <- data1[marker,x_order] data1 <- as.data.frame ( scale (data1)) data2 <- data1 data2 $ feature <- rownames (data2) data2 <- data2 %>% pivot_longer ( cols = colnames (data1), names_to = "variable" , values_to = "value" ) data2 <- as.data.frame (data2 ) data2 $ variable <- factor (data2 $ variable, levels= x_order) data2 $ feature <- factor (data2 $ feature, levels= rev (marker)) p2 <- ggplot (data2, aes ( x = variable, y = feature, fill = value)) + geom_tile + scale_fill_gradient2 + labs ( x = "Cell ID" , y = "Gene" ) + theme ( axis.text.x = element_blank , axis.text.y = element_text ( size = 14 , color = "black" ), axis.title.x = element_text ( size = 16 , color = "black" ), axis.title.y = element_text ( size = 16 , color = "black" ) ) p2 anno1 <- sub @ meta.data[x_order,] anno1 $ Cell <- rownames (anno1) anno1 $ other <- "anno" anno1 $ Cell <- factor (anno1 $ Cell, levels= x_order) anno1 $ celltype <- factor (anno1 $ celltype, levels= c ( "MD" , "Uro" , "Pod" , "PEC" , "EC2" , "Per" , "Tcell" )) # 绘制机械分群注释条 p2_anno1 <- ggplot (anno1, aes ( x= Cell, y= other, fill = celltype)) + geom_tile + scale_y_discrete ( position = "right" ) + scale_fill_manual ( values = c ( "MD" = " , "Uro" = " , "Pod" = " , "PEC" = " , "EC2" = " , "Per" = " , "Tcell" = " )) + theme_bw + theme ( panel.grid.major= element_blank , panel.border = element_blank ) + theme ( axis.text.x = element_blank , axis.text.y = element_blank , axis.ticks.x= element_blank , axis.ticks.y = element_blank ) + labs ( x= "" , y= "" ) p2_anno1 # 合并heatmap和机械分群注释条 p2 %>% insert_top (p2_anno1, height = 0.05 ) 5、pheatmap复现DoHeatmap epithelial cells; Per:pericytes; Pod:podocytes; MD:macula densa;
Uro:urothelium anno2 <- anno1 %>% mutate ( group = case_when ( celltype %in% c ( "Tcell" ) ~ "Immune" , celltype %in% c ( "EC2" , "PEC" , "Per" , "Uro" , "MD" , "Pod" ) ~ "other" )) anno2_color <- list ( celltype = c ( "MD" = " , "Uro" = " , "Pod" = " , "PEC" = " , "EC2" = " , "Per" = " , "Tcell" = " ), group = c ( "Immune" = " , "other" = " )) pheatmap (data1, cluster_rows = F, cluster_cols = F, show_colnames = F, color = colorRampPalette ( c ( "white" , " ))( 100 ), annotation_col = anno2[, c ( "celltype" , "group" )], annotation_colors = anno2_color ) pheatmap (data1, cluster_rows = F, cluster_cols = F, show_colnames = F, color = colorRampPalette ( c ( "white" , " ))( 100 ), annotation_col = anno2[, c ( "celltype" , "group" )], annotation_colors = anno2_color, gaps_col = c ( 155 , 1322 , 2094 , 2612 , 3046 , 3443 , 4062 ), gaps_row = c ( 5 , 10 , 15 , 20 , 25 , 30 ) ) 6、complexheatmap复现DoHeatmap 6.1 计算差异基因 sub1 <- subset (sce, subset= celltype %in% c ( "MD" , "PEC" )) sub1 <- ScaleData (sub1) sub.markers1 <- FindAllMarkers (sub1, only.pos = TRUE ) # write.table(sub.markers1,"MD_PEC.diffgene.xls") top1 <- sub.markers1 %>% group_by (cluster) %>% dplyr :: filter (avg_log2FC > 1 ) %>% slice_head ( n = 50 ) %>% ungroup marker1 <- top1 $ gene marker1 # [1] "Slc12a1" "Thsd4" "Robo2" "Mecom" "Erbb4" "Enox1" # [7] "Nos1" "Slit2" "Tenm2" "Pde10a" "Tshr" "Stk32b" # [13] "Col4a3" "Esrrg" "Prkd1" "Flvcr1" "Kcnc2" "Nadk2" # [19] "Ndst4" "Slc16a5" "Cacna1d" "Plcb1" "Cables1" "Tmem117" # [25] "Wdr72" "Pappa2" "Slc5a1" "Stox2" "Pappa" "Efhd1" # [31] "Tmem116" "Cntn5" "Enpp1" "Neat1" "Prdm16" "Dock8" # [37] "Vav3" "Paqr5" "Pla2g3" "Pantr1" "Ranbp3l" "Syne1" # [43] "Rap1gap2" "Car15" "Sgms2" "Unc5d" "Ppp1r1a" "Ksr2" # [49] "Blnk" "Tbxas1" "Ncam1" "Akap12" "Cp" "Ext1" # [55] "Bicc1" "Kcnma1" "Glis3" "Slc6a6" "Pkhd1" "Ror1" # [61] "Tshz2" "Fmn1" "Rbpms" "Gpc6" "Palld" "Arid1b" # [67] "Chrm3" "Igf1r" "Sdk1" "Bnc2" "Wwc1" "Cdh6" # [73] "Tnc" "Spns2" "Bmp6" "Col18a1" "Wt1" "Pax8" # [79] "Dcdc2a" "Prkg1" "Crim1" "Ptpn13" "Rai14" "Rcan2" # [85] "Cald1" "Nav2" "Tbc1d4" "Pard3b" "Robo1" "Chn2" # [91] "Pakap" "Mical2" "Zbtb20" "Actn1" "Fam19a1" "Tns1" # [97] "Arhgap28" "Agap1" "Nfia" "Hspa12a" marker1_MD <- marker1[ 1 : 50 ] marker1_PEC <- marker1[ 51 : 100 ] DoHeatmap (sub1,marker1) data7 <- as.data.frame (Seurat :: GetAssayData (sub1, layer = "data" )) rownames (data7) <- rownames (sub1) colnames (data7) <- colnames (sub1) data7 <- data7[marker1, intersect (x_order, colnames (sub1))] data7 <- scale (data7) 6.2 complexheatmap绘图 这里我们可以在初始热图的基础上,添加更多的信息,例如我们可以加上基因通路富集分析的结果。
没有这部分知识的同学可以看这里:基因集富集分析和差异分析的热图如何用富集分析词云做注释 suppressMessages ( library (org.Mm.eg.db)) suppressMessages ( library (clusterProfiler)) ha = HeatmapAnnotation ( foo = anno_block ( gp = gpar ( fill = 2 : 3 ), labels = c ( "PEC" , "MD" ), labels_gp = gpar ( col = "white" , fontsize = 10 )) ) gene.df <- bitr ( gsub ( ' \\ . \\ d$' , '' ,marker1), fromType= "SYMBOL" , toType= c ( "ENTREZID" , "ENSEMBL" ), OrgDb = org.Mm.eg.db) gene can be mapped....”,建议修改参数为T,参考https://www.jianshu.com/p/2d0e371bb7fe ekegg <- enrichKEGG ( unique (gene.df $ ENTREZID), organism= 'mmu' , pvalueCutoff= 0.05 , pAdjustMethod= 'BH' , minGSSize= 10 , maxGSSize= 500 , use_internal_data= F) ekegg <- setReadable (ekegg, 'org.Mm.eg.db' , 'ENTREZID' ) my.kegg.res <- as.data.frame (ekegg @ result) head (my.kegg.res) # category subcategory ID # mmu04970 Organismal Systems Digestive system mmu04970 # mmu04270 Organismal Systems Circulatory system mmu04270 # mmu04730 Organismal Systems Nervous system mmu04730 # mmu04911 Organismal Systems Endocrine system mmu04911 # mmu04713 Organismal Systems Environmental adaptation mmu04713 # mmu04973 Organismal Systems Digestive system mmu04973 # Description # mmu04970 Salivary secretion - Mus musculus (house mouse) # mmu04270 Vascular smooth muscle contraction - Mus musculus (house mouse) # mmu04730 Long-term depression - Mus musculus (house mouse) # mmu04911 Insulin secretion - Mus musculus (house mouse) # mmu04713 Circadian entrainment - Mus musculus (house mouse) # mmu04973 Carbohydrate digestion and absorption - Mus musculus (house mouse) # GeneRatio BgRatio RichFactor FoldEnrichment zScore pvalue # mmu04970 5/50 87/9575 0.05747126 11.005747 6.792365 8.471584e-05 # mmu04270 6/50 144/9575 0.04166667 7.979167 6.113715 9.606816e-05 # mmu04730 4/50 60/9575 0.06666667 12.766667 6.624055 2.587172e-04 # mmu04911 4/50 86/9575 0.04651163 8.906977 5.336404 1.019846e-03 # mmu04713 4/50 99/9575 0.04040404 7.737374 4.881968 1.720174e-03 # mmu04973 3/50 49/9575 0.06122449 11.724490 5.452788 2.084891e-03 # p.adjust qvalue geneID Count # mmu04970 0.007781521 0.006674209 Nos1/Plcb1/Kcnma1/Chrm3/Prkg1 5 # mmu04270 0.007781521 0.006674209 Cacna1d/Plcb1/Pla2g3/Kcnma1/Prkg1/Cald1 6 # mmu04730 0.013970730 0.011982693 Nos1/Plcb1/Igf1r/Prkg1 4 # mmu04911 0.041303769 0.035426235 Cacna1d/Plcb1/Kcnma1/Chrm3 4 # mmu04713 0.055733624 0.047802718 Nos1/Cacna1d/Plcb1/Prkg1 4 # mmu04973 0.056292061 0.048281689 Cacna1d/Plcb1/Slc5a1 3 marker.gene <- top_n (my.kegg.res, 1 , GeneRatio) %>% pull (geneID) %>% .[ 1 ] %>% strsplit (., '/' ) %>% unlist marker.gene # [1] "Cacna1d" "Plcb1" "Pla2g3" "Kcnma1" "Prkg1" "Cald1" Heatmap (data7, column_km = 2 , border = TRUE , top_annotation = ha, right_annotation = rowAnnotation ( most.enrich.gene= anno_mark ( at = ( 1 : length (marker1))[marker1 %in% marker.gene], labels = marker.gene)), show_row_names = F, show_column_names = F ) 6.3 制作文本注释 up.gene.df <- bitr ( gsub ( ' \\ . \\ d$' , '' ,marker1_MD), fromType= "SYMBOL" , toType= c ( "ENTREZID" , "ENSEMBL" ), OrgDb = org.Mm.eg.db) down.gene.df <- bitr ( gsub ( ' \\ . \\ d$' , '' ,marker1_PEC), fromType= "SYMBOL" , toType= c ( "ENTREZID" , "ENSEMBL" ), OrgDb = org.Mm.eg.db) up.ekegg <- enrichKEGG ( unique (up.gene.df $ ENTREZID), organism= 'mmu' , pvalueCutoff= 0.05 , pAdjustMethod= 'BH' , minGSSize= 10 , maxGSSize= 500 , use_internal_data= F) up.ekegg <- setReadable (up.ekegg, 'org.Mm.eg.db' , 'ENTREZID' ) my.up.ekegg.res <- as.data.frame (up.ekegg @ result) head (my.up.ekegg.res) # category subcategory ID # mmu04973 Organismal Systems Digestive system mmu04973 # mmu04713 Organismal Systems Environmental adaptation mmu04713 # mmu04925 Organismal Systems Endocrine system mmu04925 # mmu04024 Environmental Information Processing Signal transduction mmu04024 # mmu04020 Environmental Information Processing Signal transduction mmu04020 # mmu04926 Organismal Systems Endocrine system mmu04926 # Description # mmu04973 Carbohydrate digestion and absorption - Mus musculus (house mouse) # mmu04713 Circadian entrainment - Mus musculus (house mouse) # mmu04925 Aldosterone synthesis and secretion - Mus musculus (house mouse) # mmu04024 cAMP signaling pathway - Mus musculus (house mouse) # mmu04020 Calcium signaling pathway - Mus musculus (house mouse) # mmu04926 Relaxin signaling pathway - Mus musculus (house mouse) # GeneRatio BgRatio RichFactor FoldEnrichment zScore pvalue # mmu04973 3/28 49/9575 0.06122449 20.936589 7.576801 0.0003771123 # mmu04713 3/28 99/9575 0.03030303 10.362554 5.070977 0.0029116826 # mmu04925 3/28 104/9575 0.02884615 9.864354 4.922187 0.0033476944 # mmu04024 4/28 224/9575 0.01785714 6.106505 4.188042 0.0038406248 # mmu04020 4/28 255/9575 0.01568627 5.364146 3.825196 0.0060811968 # mmu04926 3/28 130/9575 0.02307692 7.891484 4.284259 0.0062518856 # p.adjust qvalue geneID Count # mmu04973 0.04789326 0.03652035 Cacna1d/Plcb1/Slc5a1 3 # mmu04713 0.12193984 0.09298355 Nos1/Cacna1d/Plcb1 3 # mmu04925 0.12193984 0.09298355 Prkd1/Cacna1d/Plcb1 3 # mmu04024 0.12193984 0.09298355 Pde10a/Tshr/Cacna1d/Vav3 4 # mmu04020 0.12546150 0.09566894 Erbb4/Nos1/Cacna1d/Plcb1 4 # mmu04926 0.12546150 0.09566894 Nos1/Col4a3/Plcb1 3 up.pathway <- top_n (my.up.ekegg.res, - 10 , pvalue) down.gene.df <- bitr ( gsub ( ' \\ . \\ d$' , '' ,marker1_PEC), fromType= "SYMBOL" , toType= c ( "ENTREZID" , "ENSEMBL" ), OrgDb = org.Mm.eg.db) down.ekegg <- enrichKEGG ( unique (down.gene.df $ ENTREZID), organism= 'mmu' , pvalueCutoff= 0.05 , pAdjustMethod= 'BH' , minGSSize= 10 , maxGSSize= 500 , use_internal_data= F) down.ekegg <- setReadable (down.ekegg, 'org.Mm.eg.db' , 'ENTREZID' ) my.down.ekegg.res <- as.data.frame (down.ekegg @ result) head (my.down.ekegg.res) # category subcategory ID # mmu04970 Organismal Systems Digestive system mmu04970 # mmu04270 Organismal Systems Circulatory system mmu04270 # mmu04730 Organismal Systems Nervous system mmu04730 # mmu04913 Organismal Systems Endocrine system mmu04913 # mmu04510 Cellular Processes Cellular community - eukaryotes mmu04510 # mmu05202 Human Diseases Cancer: overview mmu05202 # Description # mmu04970 Salivary secretion - Mus musculus (house mouse) # mmu04270 Vascular smooth muscle contraction - Mus musculus (house mouse) # mmu04730 Long-term depression - Mus musculus (house mouse) # mmu04913 Ovarian steroidogenesis - Mus musculus (house mouse) # mmu04510 Focal adhesion - Mus musculus (house mouse) # mmu05202 Transcriptional misregulation in cancer - Mus musculus (house mouse) # GeneRatio BgRatio RichFactor FoldEnrichment zScore pvalue # mmu04970 3/22 87/9575 0.03448276 15.007837 6.298419 0.0009848422 # mmu04270 3/22 144/9575 0.02083333 9.067235 4.680748 0.0041602216 # mmu04730 2/22 60/9575 0.03333333 14.507576 5.036592 0.0082294941 # mmu04913 2/22 64/9575 0.03125000 13.600852 4.853611 0.0093214491 # mmu04510 3/22 202/9575 0.01485149 6.463771 3.766319 0.0106002026 # mmu05202 3/226/9575 0.01327434 5.777353 3.487764 0.0143489610 # p.adjust qvalue geneID Count # mmu04970 0.07484801 0.06738394 Kcnma1/Chrm3/Prkg1 3 # mmu04270 0.15808842 0.14232337 Kcnma1/Prkg1/Cald1 3 # mmu04730 0.16112308 0.14505540 Igf1r/Prkg1 2 # mmu04913 0.16112308 0.14505540 Igf1r/Bmp6 2 # mmu04510 0.16112308 0.14505540 Igf1r/Tnc/Actn1 3 # mmu05202 0.17688661 0.15924695 Igf1r/Wt1/Pax8 3 down.pathway <- top_n (my.down.ekegg.res, - 10 , pvalue) text = list ( 'up' = up.pathway $ Description, 'down' = down.pathway $ Description) text # $up # [1] "Carbohydrate digestion and absorption - Mus musculus (house mouse)" # [2] "Circadian entrainment - Mus musculus (house mouse)" # [3] "Aldosterone synthesis and secretion - Mus musculus (house mouse)" # [4] "cAMP signaling pathway - Mus musculus (house mouse)" # [5] "Calcium signaling pathway - Mus musculus (house mouse)" # [6] "Relaxin signaling pathway - Mus musculus (house mouse)" # [7] "Nicotinate and nicotinamide metabolism - Mus musculus (house mouse)" # [8] "Vascular smooth muscle contraction - Mus musculus (house mouse)" # [9] "Adrenergic signaling in cardiomyocytes - Mus musculus (house mouse)" # [10] "Long-term depression - Mus musculus (house mouse)" # # $down # [1] "Salivary secretion - Mus musculus (house mouse)" # [2] "Vascular smooth muscle contraction - Mus musculus (house mouse)" # [3] "Long-term depression - Mus musculus (house mouse)" # [4] "Ovarian steroidogenesis - Mus musculus (house mouse)" # [5] "Focal adhesion - Mus musculus (house mouse)" # [6] "Transcriptional misregulation in cancer - Mus musculus (house mouse)" # [7] "Insulin secretion - Mus musculus (house mouse)" # [8] "Adherens junction - Mus musculus (house mouse)" # [9] "Pancreatic secretion - Mus musculus (house mouse)" # [10] "Thyroid hormone signaling pathway - Mus musculus (house mouse)" library (ggsci) Heatmap (data7, column_km = 2 , border = TRUE , top_annotation = ha, right_annotation = rowAnnotation ( most.enrich.gene= anno_mark ( at = ( 1 : length (marker1))[marker1 %in% marker.gene], labels = marker.gene)), show_row_names = F, show_column_names = F, left_annotation = rowAnnotation ( textbox = anno_textbox ( rep ( c ( "up" , "down" ), each= 50 ), text, by = "anno_block" , add_new_line = T, gp = gpar ( col = c ( pal_npg ( 10 ), pal_nejm ( 10 ))), round_corners = TRUE ) ) ) 7、 拓展:基因表达均值热图 7.1 pheatmap avg <- AverageExpression (sub, features = marker)[[ "RNA" ]] avg <- as.data.frame (avg) anno3 <- as.data.frame (sub_celltype) rownames (anno3) <- sub_celltype anno3 $ group <- c ( "other" , "other" , "other" , "other" , "other" , "other" , "Immune" ) anno3_color <- list ( sub_celltype = c ( "MD" = " , "Uro" = " , "Pod" = " , "PEC" = " , "EC2" = " , "Per" = " , "Tcell" = " ), group = c ( "Immune" = " , "other" = " )) pheatmap (avg, cluster_rows = F, cluster_cols = F, show_colnames = T, color = colorRampPalette ( c ( "navy" , "white" , "firebrick3" ))( 50 ), annotation_col = anno3, annotation_colors = anno3_color, scale = "row" ) 7.2 pheatmap美化 # 修改边框颜色 # 修改方格长方形为正方形 # 增加间隔 pheatmap (avg, cluster_rows = F, cluster_cols = F, show_colnames = T, color = colorRampPalette ( c ( "navy" , "white" , "firebrick3" ))( 50 ), annotation_col = anno3, treeheight_row = 50 , annotation_colors = anno3_color, scale = "row" , border_color = "white" , # 设置方格边框 gaps_row = c ( 5 , 10 , 15 , 20 , 25 , 30 ), cellwidth = 17 , cellheight = 17 ) 7.3 增加散点图或者柱状图 ha_column1 = HeatmapAnnotation ( points = anno_points ( rnorm ( 35 )), annotation_name_side = "left" ) ht1 = Heatmap ( scale ( t (avg)), name = "ht1" , column_title = "Heatmap 1" , km= 3 , col = colorRamp2 ( c ( - 2 , 0 , 2 ), c ( "navy" , "white" , "firebrick3" )), top_annotation = ha_column1, row_names_side = "left" ) ha_column2 = HeatmapAnnotation ( type = rep ( c ( "MD" , "Uro" , "Pod" , "PEC" , "EC2" , "Per" , "Tcell" ), each= 5 ), col = list ( type = c ( "MD" = " , "Uro" = " , "Pod" = " , "PEC" = " , "EC2" = " , "Per" = " , "Tcell" = " ))) ht2 = Heatmap ( scale ( t (avg)), name = "ht2" , column_title = "Heatmap 2" , column_km = 6 , col = colorRamp2 ( c ( - 2 , 0 , 2 ), c ( "navy" , "white" , "firebrick3" )), bottom_annotation = ha_column2) ht_list = ht1 + ht2 + rowAnnotation ( bar = anno_barplot ( rowMeans ( scale ( t (avg))), width = unit ( 2 , "cm" ))) draw (ht_list, row_title = "Heatmap list" , column_title = "Heatmap list" ) 7.4 circlize圈形热图 col_fun = colorRamp2 ( c ( - 2 , 0 , 2 ), c ( "navy" , "white" , "firebrick3" )) circos.clear data3 <- as.data.frame ( scale ( t (avg))) split <- rownames (data3) split <- factor (split, levels = rownames (data3)) circos.heatmap (data3, col = col_fun, split= split, bg.border = "red" , bg.lwd = 2 , bg.lty = 2 , show.sector.labels = TRUE , cluster = F, ) lgd = Legend ( title = "avg" , col_fun = col_fun) grid.draw (lgd) circos.clear data4 <- as.data.frame ( scale (avg)) split <- rep ( 1 : 7 , each= 5 ) split <- factor (split, levels = 1 : 7 ) circos.clear circos.heatmap (data4, col = col_fun, split= split, rownames.side = "outside" , bg.border = "red" , bg.lwd = 2 , bg.lty = 2 , cluster = F, ) lgd = Legend ( title = "avg" , col_fun = col_fun) grid.draw (lgd) circos.clear circos.heatmap (data4, col = col_fun, dend.side = "inside" , rownames.side = "outside" , rownames.col = 1 : nrow (avg) %% 10 + 1 , rownames.cex = runif ( nrow (avg), min = 0.3 , max = 1.5 ), rownames.font = 1 : nrow (avg) %% 4 + 1 ) lgd = Legend ( title = "avg" , col_fun = col_fun) grid.draw (lgd) circos.clear dend_col <- structure ( rainbow ( 7 ), names = 1 : 7 ) circos.clear circos.heatmap (data4, col = col_fun, split= split, rownames.side = "outside" , bg.border = "red" , bg.lwd = 2 , bg.lty = 2 , dend.side = "inside" , dend.track.height = 0.2 , dend.callback = function (dend, m, si) { color_branches (dend, k = 1 , col = dend_col[si]) } ) lgd = Legend ( title = "avg" , col_fun = col_fun) grid.draw (lgd) 本教程用到的环境 sessionInfo # R version 4.4.1 (2024-06-14) # Platform: x86_64-pc-linux-gnu # Running under: Ubuntu 20.04.6 LTS # # Matrix products: default # BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 # LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0 # # locale: # [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 # [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 # [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C # [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C # # time zone: Asia/Shanghai # tzcode source: system (glibc) # # attached base packages: # [1] grid stats4 stats graphics grDevices utils datasets # [8] methods base # # other attached packages: # [1] dendextend_1.17.1 pheatmap_1.0.12 # [3] ggtree_3.12.0 aplot_0.2.3 # [5] ggnewscale_0.5.0 gghalves_0.1.4 # [7] ggpirate_0.1.2 scRNAtoolVis_0.1.0 # [9] plot1cell_0.0.0.9000 wordcloud_2.6 # [11] rlang_1.1.4 cowplot_1.1.3 # [13] Matrix_1.7-0 data.table_1.16.0 # [15] ComplexUpset_1.3.3 ggbeeswarm_0.7.2 # [17] RColorBrewer_1.1-3 progress_1.2.3 # [19] scales_1.3.0 MASS_7.3-61 # [21] ggh4x_0.2.8 circlize_0.4.16 # [23] loomR_0.2.1.9000 hdf5r_1.3.11 # [25] R6_2.5.1 DoubletFinder_2.0.4 # [27] ComplexHeatmap_2.20.0 simplifyEnrichment_1.14.1 # [29] GEOquery_2.72.0 EnsDb.Hsapiens.v86_2.99.0 # [31] ensembldb_2.28.1 AnnotationFilter_1.28.0 # [33] GenomicFeatures_1.56.0 XML_3.99-0.17 # [35] biomaRt_2.60.1 devtools_2.4.5 # [37] usethis_3.0.0 plotly_4.10.4 # [39] scatterplot3d_0.3-44 gg3D_0.0.0.9000 # [41] ggrepel_0.9.5 ggthemes_5.1.0 # [43] ggpubr_0.6.0 ggsignif_0.6.4 # [45] reshape2_1.4.4 ggsci_3.2.0 # [47] plotrix_3.8-4 enrichplot_1.24.2 # [49] org.Hs.eg.db_3.19.1 org.Mm.eg.db_3.19.1 # [51] AnnotationDbi_1.66.0 clusterProfiler_4.12.6 # [53] rvcheck_0.2.1 scater_1.32.1 # [55] scuttle_1.14.0 SingleCellExperiment_1.26.0 # [57] textshape_1.7.5 alabaster.schemas_1.4.0 # [59] celldex_1.14.0 SingleR_2.6.0 # [61] SummarizedExperiment_1.34.0 GenomicRanges_1.56.1 # [63] GenomeInfoDb_1.40.1 IRanges_2.38.1 # [65] S4Vectors_0.42.1 MatrixGenerics_1.16.0 # [67] matrixStats_1.3.0 BiocManager_1.30.25 # [69] SeuratData_0.2.2.9001 patchwork_1.2.0 # [71] gridExtra_2.3 future_1.34.0 # [73] lubridate_1.9.3 forcats_1.0.0 # [75] stringr_1.5.1 purrr_1.0.2 # [77] readr_2.1.5 tidyr_1.3.1 # [79] tibble_3.2.1 ggplot2_3.5.1 # [81] tidyverse_2.0.0 dplyr_1.1.4 # [83] multtest_2.60.0 Biobase_2.64.0 # [85] BiocGenerics_0.50.0 mindr_1.3.3 # [87] magrittr_2.0.3 Seurat_5.1.0 # [89] SeuratObject_5.0.2 sp_2.1-4 # # loaded via a namespace (and not attached): # [1] igraph_2.0.3 ica_1.0-3 # [3] zlibbioc_1.50.0 doParallel_1.0.17 # [5] tidyselect_1.2.1 bit_4.0.5 # [7] tm_0.7-14 clue_0.3-65 # [9] lattice_0.22-6 rjson_0.2.22 # [11] blob_1.2.4 urlchecker_1.0.1 # [13] S4Arrays_1.4.1 parallel_4.4.1 # [15] png_0.1-8 cli_3.6.3 # [17] ggplotify_0.1.2 usdata_0.3.1 # [19] ProtGenerics_1.36.0 askpass_1.2.0 # [21] goftest_1.2-3 textshaping_0.4.0 # [23] BiocIO_1.14.0 BiocNeighbors_1.22.0 # [25] uwot_0.2.2 shadowtext_0.1.4 # [27] curl_5.2.2 mime_0.12 # [29] evaluate_0.24.0 tidytree_0.4.6 # [31] leiden_0.4.3.1 openintro_2.5.0 # [33] stringi_1.8.4 backports_1.5.0 # [35] httpuv_1.6.15 rappdirs_0.3.3 # [37] splines_4.4.1 ggraph_2.2.1 # [39] sctransform_0.4.1 sessioninfo_1.2.2 # [41] DBI_1.2.3 HDF5Array_1.32.1 # [43] jquerylib_0.1.4 withr_3.0.1 # [45] systemfonts_1.1.0 lmtest_0.9-40 # [47] tidygraph_1.3.1 rtracklayer_1.64.0 # [49] splancs_2.01-45 htmlwidgets_1.6.4 # [51] fs_1.6.4 cherryblossom_0.1.0 # [53] labeling_0.4.3 SparseArray_1.4.8 # [55] reticulate_1.38.0 zoo_1.8-12 # [57] XVector_0.44.0 knitr_1.48 # [59] UCSC.utils_1.0.0 airports_0.1.0 # [61] timechange_0.3.0 foreach_1.5.2 # [63] fansi_1.0.6 rhdf5_2.48.0 # [65] R.oo_1.26.0 RSpectra_0.16-2 # [67] irlba_2.3.5.1 ggrastr_1.0.2 # [69] fastDummies_1.7.4 gridGraphics_0.5-1 # [71] ellipsis_0.3.2 lazyeval_0.2.2 # [73] yaml_2.3.10 survival_3.7-0 # [75] scattermore_1.2 BiocVersion_3.19.1 # [77] crayon_1.5.3 RcppAnnoy_0.0.22 # [79] progressr_0.14.0 tweenr_2.0.3 # [81] later_1.3.2 GlobalOptions_0.1.2 # [83] ggridges_0.5.6 codetools_0.2-20 # [85] profvis_0.3.8 alphahull_2.5 # [87] KEGGREST_1.44.1 shape_1.4.6.1 # [89] Rtsne_0.17 limma_3.60.4 # [91] Rsamtools_2.20.0 filelock_1.0.3 # [93] pkgconfig_2.0.3 xml2_1.3.6 # [95] spatstat.univar_3.0-0 GenomicAlignments_1.40.0 # [97] spatstat.sparse_3.1-0 alabaster.base_1.4.2 # [99] ape_5.8 viridisLite_0.4.2 # [101] xtable_1.8-4 interp_1.1-6 # [103] car_3.1-2 highr_0.11 # [105] plyr_1.8.9 httr_1.4.7 # [107] tools_4.4.1 globals_0.16.3 # [109] pkgbuild_1.4.4 beeswarm_0.4.0 # [111] broom_1.0.6 nlme_3.1-166 # [113] dbplyr_2.5.0 ExperimentHub_2.12.0 # [115] crosstalk_1.2.1 ggunchull_1.0.1 # [117] digest_0.6.37 qpdf_1.3.3 # [119] farver_2.1.2 tzdb_0.4.0 # [121] yulab.utils_0.1.7 viridis_0.6.5 # [123] glue_1.7.0 cachem_1.1.0 # [125] BiocFileCache_2.12.0 polyclip_1.10-7 # [127] proxyC_0.4.1 generics_0.1.3 # [129] Biostrings_2.72.1 plot3D_1.4.1 # [131] parallelly_1.38.0 pkgload_1.4.0 # [133] statmod_1.5.0 RcppHNSW_0.6.0 # [135] ragg_1.3.2 ScaledMatrix_1.12.0 # [137] carData_3.0-5 pbapply_1.7-2 # [139] httr2_1.0.3 tcltk_4.4.1 # [141] pdftools_3.5.0 spam_2.10-0 # [143] gson_0.1.0 utf8_1.2.4 # [145] graphlayouts_1.1.1 alabaster.se_1.4.1 # [147] shiny_1.9.1 GenomeInfoDbData_1.2.12 # [149] R.utils_2.12.3 rhdf5filters_1.16.0 # [151] RCurl_1.98-1.16 memoise_2.0.1 # [153] rmarkdown_2.28 R.methodsS3_1.8.2 # [155] gypsum_1.0.1 RANN_2.6.2 # [157] Cairo_1.6-2 spatstat.data_3.1-2 # [159] rstudioapi_0.16.0 cluster_2.1.6 # [161] spatstat.utils_3.1-0 hms_1.1.3 # [163] fitdistrplus_1.2-1 sgeostat_1.0-27 # [165] munsell_0.5.1 misc3d_0.9-1 # [167] colorspace_2.1-1 ggdendro_0.2.0 # [169] DelayedMatrixStats_1.26.0 sparseMatrixStats_1.16.0 # [171] dotCall64_1.1-1 ggforce_0.4.2 # [173] dbscan_1.2-0 xfun_0.47 # [175] alabaster.matrix_1.4.2 iterators_1.0.14 # [177] remotes_2.5.0 abind_1.4-5 # [179] GOSemSim_2.30.2 treeio_1.28.0 # [181] Rhdf5lib_1.26.0 bitops_1.0-8 # [183] promises_1.3.0 scatterpie_0.2.4 # [185] RSQLite_2.3.7 qvalue_2.36.0 # [187] fgsea_1.30.0 DelayedArray_0.30.1 # [189] GO.db_3.19.1 compiler_4.4.1 # [191] alabaster.ranges_1.4.2 prettyunits_1.2.0 # [193] beachmat_2.20.0 listenv_0.9.1 # [195] Rcpp_1.0.13 AnnotationHub_3.12.0 # [197] BiocSingular_1.20.0 tensor_1.5 # [199] BiocParallel_1.38.0 spatstat.random_3.3-1 # [201] fastmap_1.2.0 fastmatch_1.1-4 # [203] rstatix_0.7.2 vipor_0.4.7 # [205] ROCR_1.0-11 rsvd_1.0.5 # [207] gtable_0.3.5 KernSmooth_2.23-24 # [209] miniUI_0.1.1.1 deldir_2.0-4 # [211] htmltools_0.5.8.1 bit64_4.0.5 # [213] spatstat.explore_3.3-2 lifecycle_1.0.4 # [215] NLP_0.3-0 restfulr_0.0.15 # [217] sass_0.4.9 vctrs_0.6.5 # [219] isoband_0.2.7 slam_0.1-52 # [221] spatstat.geom_3.3-2 DOSE_3.30.4 # [223] ggfun_0.1.6 future.apply_1.11.2 # [225] bslib_0.8.0 pillar_1.9.0 # [227] magick_2.8.4 jsonlite_1.8.8 # [229] GetoptLong_1.0.5