Lesson 12: Plotting with the ggstatsplot package

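Compared with ggplot2, the ggpubr package was created for drawing publication-quality figures; ggstatsplot goes one step further and tightly combines plotting with statistics. In this lesson we walk through how to use the ggstatsplot package (ggstatsplot.zip). In general, a statistical figure shows the distribution of the data and the results of a test.

Unlike ggplot2 and ggpubr, ggstatsplot displays the data distribution and the results of the statistical analysis in the same figure.

A picture is worth a thousand words, so this greatly reduces the amount of text needed to describe the statistical results. ggstatsplot currently supports the most common types of statistical tests: 1) t-test; 2) ANOVA; 3) non-parametric tests; 4) correlation analysis; 5) contingency table analysis; 6) regression analysis. Let's see how to use ggstatsplot to draw this "statistics plus" version of our figures.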

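1. R packages and data preparation

1.1 Installing and loading the R package

ggstatsplot is installed in the same way as ggpubr. There are two routes: from CRAN or from GitHub.

Pick either one. After installation, call library() to check that the package was installed successfully.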

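1.2 Data preparation

As with ggpubr in the earlier lesson, we need to build our own dataset for plotting. First, extract the expression values of the target gene; then read in the immune data.

Merge the two tables. Once the dataset is built, a closer look shows that it still lacks grouping information. We change the grouping slightly this time and split the samples into three groups by quartiles: values below the 25th percentile are "low", values above the 75th percentile are "high", and everything in between is the middle group.

After assigning low and high, label the middle group median, then convert the grouping variable to a factor.

The result is 125 samples each in the high and low groups and 251 in the median group. A side note: why convert to a factor? If we do not, later plots order the groups alphabetically (high, then low, then median), and the resulting order looks wrong.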
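The grouping step can be sketched as follows. This is a minimal illustration on simulated data, not the course dataset: the data frame `dat` and the columns `sample_id` and `expression` are hypothetical names, and the values are random numbers standing in for the target-gene expression.

```r
# Simulated stand-in for the merged table (501 samples, as in the lesson).
# `dat`, `sample_id` and `expression` are hypothetical names.
set.seed(123)
dat <- data.frame(
  sample_id  = sprintf("S%03d", 1:501),
  expression = rnorm(501, mean = 8, sd = 2)
)

# 25th and 75th percentiles of the target-gene expression
q <- quantile(dat$expression, probs = c(0.25, 0.75))

# Below Q1 -> low, above Q3 -> high, everything else -> median
dat$group <- ifelse(dat$expression < q[1], "low",
             ifelse(dat$expression > q[2], "high", "median"))

# Fix the display order; without explicit levels the groups would be
# ordered alphabetically (high, low, median)
dat$group <- factor(dat$group, levels = c("low", "median", "high"))

table(dat$group)
```

With 501 distinct values the two cut-offs fall exactly on the 126th and 376th ordered values, which is why the tails hold 125 samples each and the middle group 251.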

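2. Drawing common plots

2.1 The ggbetweenstats() function

This function is mainly used to compare means between groups. It can draw a box plot, a violin plot, or a combination of the two, selected with the plot.type argument. We first plot with the default arguments and then adjust some of them. The arguments used are:

1) type: p stands for a parametric test and np for a non-parametric test; the default is parametric.
2) mean.ci = TRUE displays the 95% confidence interval of the mean.
3) pairwise.comparisons = TRUE displays pairwise comparisons between the groups (this is not a paired test).
4) pairwise.display controls which comparisons are shown: ns for non-significant only, all for all of them, s for significant only.

In the figure, each group is shown with its mean, the 95% confidence interval, and the comparisons between groups.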

2.2 The ggwithinstats() function

ggwithinstats() works almost exactly like ggbetweenstats(), but it is intended for within-subject (repeated-measures) comparisons.

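2.3 The gghistostats() function

The gghistostats() function draws histograms and dot plots. It is mainly used to show the distribution of a single variable and to check, with a one-sample test, whether that variable differs significantly from a specified value.

In the figure, the blue line marks the median and the black line marks the value used for the comparison.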
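As a rough sketch of what such a plot reports, the base R call below runs the one-sample t-test that corresponds to a parametric test against a fixed value, followed by a guarded gghistostats() call. The dataset (iris) and the comparison value 5.5 are arbitrary choices for illustration, not taken from the lesson, and the plot is only drawn if ggstatsplot is installed.

```r
# Arbitrary comparison value chosen for illustration
test_value <- 5.5

# One-sample t-test: does mean Sepal.Length differ from test_value?
tt <- t.test(iris$Sepal.Length, mu = test_value)
tt$estimate
tt$p.value

# The same comparison drawn as a histogram with the test in the subtitle
if (requireNamespace("ggstatsplot", quietly = TRUE)) {
  p_hist <- ggstatsplot::gghistostats(
    data       = iris,
    x          = Sepal.Length,
    test.value = test_value, # value the variable is compared against
    type       = "p"         # parametric test
  )
  print(p_hist)
}
```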

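2.4 The ggcorrmat() function

Next is ggcorrmat(), whose main job is to draw correlation heatmaps. Remember the corrplot package from yesterday, which also draws correlation heatmaps? Here we show how to draw the correlation heatmap of the immune-infiltrating cells with ggcorrmat() instead. First, read in the immune cell data and do some initial processing. The immune cell data were saved in an Rdata file beforehand, so we can simply load() it.

Next, extract the immune cell values and remove the cell types whose values are all 0. Then set the ggcorrmat() arguments. We first draw the plot with the default settings; the colors argument defines the colours of the correlation gradient. In the result, the colour intensity increases with the magnitude of the correlation coefficient. After that we change a few arguments. sig.level sets the P-value threshold for significance. Because the upper and lower halves of the matrix are identical, matrix.type = "upper" can be used to show only the upper triangle.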
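The filtering and plotting steps can be sketched like this. The data frame `immune` and its column names are invented for illustration (random numbers, not real immune-infiltration estimates), and the ggcorrmat() call only runs if ggstatsplot is installed.

```r
# Simulated stand-in for the immune cell table; all names are hypothetical
set.seed(123)
immune <- data.frame(
  B_cells     = runif(50),
  T_cells_CD8 = runif(50),
  NK_cells    = runif(50),
  Eosinophils = rep(0, 50) # a cell type that is 0 in every sample
)

# Drop cell types whose values are all 0 (their correlation is undefined)
keep <- colSums(immune != 0) > 0
immune_kept <- immune[, keep, drop = FALSE]

# The correlation matrix that the heatmap visualises
cor_mat <- cor(immune_kept, method = "pearson")
round(cor_mat, 2)

if (requireNamespace("ggstatsplot", quietly = TRUE)) {
  p_corr <- ggstatsplot::ggcorrmat(
    data        = immune_kept,
    matrix.type = "upper",                 # show one triangle only
    sig.level   = 0.05,                    # significance threshold
    colors      = c("blue", "white", "red") # low, middle, high correlation
  )
  print(p_corr)
}
```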

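2.5 The ggscatterstats() function

ggstatsplot also has a function for correlation analysis between two variables. The method argument selects the statistical method, and the marginal.type argument selects how the marginal distributions are displayed, for example as histograms, density curves, box plots, or violin plots.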

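2.6 The ggbarstats() function

ggbarstats() plays the role that the position argument plays for bar charts in ggplot2: it draws stacked bars for categorical data. As before, we use the built-in diamonds dataset that we used with ggplot2.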
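A possible sketch with diamonds is shown below. The choice of variables (cut within each clarity grade) is an assumption made for illustration, and the argument names x and y assume a reasonably recent ggstatsplot release; older releases used different argument names for this function. The base R lines compute the within-group proportions and the chi-squared test of independence that a plot of this kind summarises.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  diamonds <- ggplot2::diamonds

  # Contingency table: cut (rows) by clarity (columns)
  tab <- table(diamonds$cut, diamonds$clarity)

  # Proportion of each cut within a clarity grade; each column sums to 1,
  # which is what a proportion-stacked bar displays
  prop <- prop.table(tab, margin = 2)
  round(prop, 2)

  # Pearson chi-squared test of independence between cut and clarity
  chi <- chisq.test(tab)
  chi$p.value
}

if (requireNamespace("ggstatsplot", quietly = TRUE)) {
  p_bar <- ggstatsplot::ggbarstats(
    data = ggplot2::diamonds,
    x    = cut,     # variable that fills the bars (assumed choice)
    y    = clarity  # variable on the x-axis (assumed choice)
  )
  print(p_bar)
}
```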

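That completes a quick tour of the plots this package can draw. When you need one of them later, remember which package it came from and look it up again. Today's material is not difficult, but it is dense: we covered a range of functions and how to draw the corresponding plots.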

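The ggstatsplot package

Outline: background; common functions; code in practice (ggbetweenstats function, ggwithinstats function, ggscatterstats function); further extensions.

Background

ggstatsplot is an extension of the ggplot2 package. It creates attractive figures and automatically adds the results of the statistical analysis, including its details, to the figure. This makes it very useful for researchers who run statistical analyses regularly.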

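Instructor's notes:

Normally, visualization and statistics are two separate stages. The core idea of ggstatsplot is simple: merge the two stages and output figures that carry the statistical details, which makes data exploration simpler and faster.

On the statistics side, ggstatsplot currently supports the most common types of tests: t-test / ANOVA, non-parametric tests, correlation analysis, contingency table analysis, and regression analysis. On the plotting side, it produces:
(1) violin plots (comparing continuous data between groups);
(2) pie charts (distribution tests for categorical data);
(3) bar charts (distribution tests for categorical data);
(4) scatter plots (correlation between two variables);
(5) correlation matrices (correlation among several variables);
(6) histograms and dot plots/charts (hypothesis tests about distributions);
(7) dot-and-whisker plots (regression models).

Instructor's note: how do we choose a statistical method?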

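See: "Choosing a statistical method, with full plotting code".

Common functions

Instructor's note: Parametric and Non-parametric here refer to parametric and non-parametric tests.
If the data are normally distributed with equal variances, use a parametric test.
If the data are not normally distributed or the variances are unequal, use a non-parametric test.
Robust: robustness has two meanings, robustness of validity and robustness of efficiency. Robustness of validity means, roughly, that small fluctuations in the data do not change the estimator drastically. Robustness of efficiency means, roughly, that the precision of the estimator is affected only slightly when the assumed distribution does not hold.
The package provides the main algorithms of each type.

Code in practice

# Install the R package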

### Installation method 1: CRAN

install.packages("ggstatsplot")

### Installation method 2: GitHub

# https://github.com/IndrajeetPatil/ggstatsplot
if (!require(devtools)) install.packages("devtools")
devtools::install_github("IndrajeetPatil/ggstatsplot")

# Load the R package
library(ggstatsplot)

Here we use the built-in iris dataset for the demonstration. Let's take a look at it:

head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

ggbetweenstats function

# Back to the main topic: plotting
ggstatsplot::ggbetweenstats(
  data = iris,
  x = Species,
  y = Sepal.Length
)

# To modify the plot above further
ggstatsplot::ggbetweenstats(
  data = iris,
  x = Species,
  y = Sepal.Length,
  messages = FALSE
) + # further modification outside of ggstatsplot
  ggplot2::coord_cartesian(ylim = c(3, 8))
# The plot shifts upward, and the cluster box at the top is removed

# Modify it once more
# Set the breaks of the continuous y-axis

# loading needed libraries
library(ggstatsplot)

# for reproducibility
set.seed(123)

# plot
ggstatsplot::ggbetweenstats(
  data = iris,
  x = Species,
  y = Sepal.Length,
  messages = FALSE
) + # further modification outside of ggstatsplot
  ggplot2::coord_cartesian(ylim = c(3, 8)) +
  ggplot2::scale_y_continuous(breaks = seq(3, 8, by = 1))

# A brief introduction to the arguments of this function
ggbetweenstats(
  data = iris,
  x = Species,
  y = Sepal.Length,
  type = "np", # non-parametric test (default "p"; other options: "np", "r", "bf")
  mean.ci = TRUE,
  pairwise.comparisons = FALSE, # whether to show pairwise comparisons between groups
  pairwise.display = "s",
  p.adjust.method = "fdr", # p-value adjustment: holm, bonferroni, fdr, etc.
  effsize.type = "biased",
  messages = FALSE
)

type: p stands for a parametric test and np for a non-parametric test; the default is parametric.
mean.ci: displays the 95% confidence interval of the mean.
pairwise.comparisons: logical; TRUE displays pairwise comparisons between the groups.
pairwise.display: controls which results are shown: ns for non-significant, all for all, s for significant.

Clearly, the iris species differ significantly in Sepal.Length.

# More elaborate calls are of course possible
# plot
ggstatsplot::ggbetweenstats(
  data = iris,
  x = Species,
  y = Sepal.Length,
  notch = TRUE, # show notched box plot
  mean.plotting = TRUE, # whether mean for each group is to be displayed
  mean.ci = TRUE, # whether to display confidence interval for means
  mean.label.size = 2.5, # size of the label for mean
  type = "p", # which type of test is to be run
  k = 3, # number of decimal places for statistical results
  outlier.tagging = TRUE, # whether outliers need to be tagged
  outlier.label = Sepal.Width, # variable to be used for the outlier tag
  outlier.label.color = "darkgreen", # changing the color for the text label
  xlab = "Type of Species", # label for the x-axis variable
  ylab = "Attribute: Sepal Length", # label for the y-axis variable
  title = "Dataset: Iris flower data set", # title text for the plot
  ggtheme = ggthemes::theme_fivethirtyeight(), # choosing a different theme
  ggstatsplot.layer = FALSE, # turn off ggstatsplot theme layer
  package = "wesanderson", # package from which color palette is to be taken
  palette = "Darjeeling1", # choosing a different color palette
  messages = FALSE
)

ggwithinstats function

ggwithinstats() works almost exactly like ggbetweenstats().

# for reproducibility and data
set.seed(123)
data("iris")

ggstatsplot::ggwithinstats(
  data = iris,
  x = Species,
  y = Sepal.Length,
  messages = FALSE
)

# A more advanced version
# plot
ggstatsplot::ggwithinstats(
  data = iris,
  x = Species,
  y = Sepal.Length,
  sort = "descending", # ordering groups along the x-axis based on
  sort.fun = median, # values of `y` variable
  pairwise.comparisons = TRUE,
  pairwise.display = "s",
  pairwise.annotation = "p",
  title = "iris",
  caption = "Data from: iris",
  ggtheme = ggthemes::theme_fivethirtyeight(),
  ggstatsplot.layer = FALSE,
  messages = FALSE
)

ggscatterstats function

ggstatsplot also has a function for correlation analysis between two variables. The method argument selects the statistical method, and the marginal.type argument selects how the marginal distributions are displayed, for example as histograms, density curves, box plots, or violin plots.

# Here we switch to a different input dataset; let's take a look
head(ggplot2::msleep)

# Plot
ggstatsplot::ggscatterstats(
  data = ggplot2::msleep,
  x = sleep_rem,
  y = awake,
  xlab = "REM sleep (in hours)",
  ylab = "Amount of time spent awake (in hours)",
  title = "Understanding mammalian sleep",
  messages = FALSE
)

# Optional arguments
# method: sets the statistical method (lm, glm, loess, or a custom function)
# marginal.type: sets how the marginal distributions are displayed; the default is histogram

The figure shows that sleep_rem and awake are correlated, with sleep_rem on the x-axis and awake on the y-axis. The histograms on the right and at the top show the distribution of each variable: the more data points fall in an interval, the taller its bar.

# A more customised version
# for reproducibility
set.seed(123)

# plot
ggstatsplot::ggscatterstats(
  data = dplyr::filter(.data = ggstatsplot::movies_long, genre == "Action"),
  x = budget,
  y = rating,
  type = "robust", # type of test that needs to be run
  conf.level = 0.99, # confidence level
  xlab = "Movie budget (in million/ US$)", # label for x axis
  ylab = "IMDB rating", # label for y axis
  label.var = "title", # variable for labeling data points
  label.expression = "rating < 5 & budget > 100", # expression that decides which points to label
  smooth.line.args = list(size = 1.5, color = "yellow"), # changing regression line color
  title = "Movie budget and IMDB rating (action)", # title text for the plot
  caption = expression( # caption text for the plot
    paste(italic("Note"), ": IMDB stands for Internet Movie DataBase")
  ),
  ggtheme = ggplot2::theme_bw(), # choosing a different theme
  ggstatsplot.layer = FALSE, # turn off ggstatsplot theme layer
  marginal.type = "density", # type of marginal distribution to be displayed
  xfill = "#0072B2", # color fill for x-axis marginal distribution
  yfill = "#009E73", # color fill for y-axis marginal distribution
  xalpha = 0.6, # transparency for x-axis marginal distribution
  yalpha = 0.6, # transparency for y-axis marginal distribution
  messages = FALSE # turn off messages and notes
)

Instructor's note: being built on ggplot2 syntax does not mean that arbitrary ggplot2 code can be grafted onto these functions.

This package contains many statistical functions. A screenshot is posted here; if you are interested, see the documentation below for further study.

# ggplot2 syntax can be combined through the ggplot.component argument
ggscatterstats(
  data = movies_long, # dataframe from which variables are taken
  x = budget, # predictor/independent variable
  y = rating, # dependent variable
  xlab = "Budget (in millions of US dollars)", # label for the x-axis
  ylab = "Rating on IMDB", # label for the y-axis
  label.var = title, # variable to use for labeling data points
  label.expression = rating < 5 & budget > 100, # expression for deciding which points to label
  point.label.args = list(alpha = 0.7, size = 4, color = "grey50"),
  xfill = "#CC79A7", # fill for marginals on the x-axis
  yfill = "#009E73", # fill for marginals on the y-axis
  title = "Relationship between movie budget and IMDB rating",
  caption = "Source: www.imdb.com",
  ggplot.component = list(
    ggplot2::stat_summary(
      ggplot2::aes(x = 0.1, xintercept = ggplot2::stat(y)),
      fun = median, geom = "vline", color = "red", linetype = 2
    )
  )
)