其实WGCNA本身是对基因进行合理(加权共表达)的分组。
也就是说,只要是多分组,就涉及到多次差异分析,而且多分组意味着样品数量肯定不少,这样的话,在这个表达量矩阵里面,不同基因之间可以计算合理的相关性, 就可以根据基因之间的相似性进行基因划分为不同的模块了。
那么,单细胞表达量矩阵理论上每个细胞就是一列,哪怕是单个10x样品也是好几千个细胞,现在是多样品时代,所以数万或者几十万细胞数量也是很常见的。所以理论上也可以使用单细胞表达量矩阵进行wgcna分析,但是由于单细胞表达量矩阵的稀疏性,必然是需要对wgcna算法进行一些调整。恰好学徒最近在她的公众号系统性介绍了即高维high dimensional WGCNA(hdWGCNA),也称为scWGCNA,我这里借花献佛分享给大家哈。
高维high dimensional WGCNA(hdWGCNA),也称为scWGCNA
作用:使用单细胞加权基因相关网络分析(scWGCNA),鉴定共表达的基因模块和差异表达的中枢基因(hub gene),同时也可以揭示了细胞状态相关的基因调控网络(GRN)
帮助文档:https://smorabit.github.io/hdWGCNA/index.html
文献:https://www.nature.com/articles/s41588-021-00894-z
数据:https://github.com/smorabit/hdWGCNA
作为单细胞高级分析一种,目前使用的单细胞文章挺少的,可作为分析的后备方案~
分析前的单细胞数据,需要先进行标准分析,如归一化标准化等直到降维聚类
A single-cell or single-nucleus transcriptomics dataset in Seurat format
Normalize the gene expression matrix NormalizeData
Identify highly variable genes VariableFeatures
Scale the normalized expression data ScaleData
Perform dimensionality reduction RunPCA
and batch correction if needed RunHarmony
Non-linear dimensionality reduction RunUMAP
for visualizations
Group cells into clusters (FindNeighbors
and FindClusters
)
使用已处理过的人脑的单细胞数据作为例子分析(需要自己下载这个rds文件哦)
wget https://swaruplab.bio.uci.edu/public_data/Zhou_2020.rds
# single-cell analysis package
library(Seurat)
# plotting and data science packages
library(tidyverse)
library(cowplot)
library(patchwork)
library(devtools)
library(igraph)
# co-expression network analysis packages:
library(WGCNA)
# devtools::install_github('smorabit/hdWGCNA', ref='dev')
library(hdWGCNA)
# using the cowplot theme for ggplot
theme_set(theme_cowplot())
# set random seed for reproducibility
set.seed(12345)
# load the Zhou et al snRNA-seq dataset
seurat_obj <- readRDS('data/Zhou_2020.rds')
seurat_obj
# An object of class Seurat
# 36601 features across 36671 samples within 1 assay
# Active assay: RNA (36601 features, 0 variable features)
# 2 dimensional reductions calculated: harmony, umap
p <- DimPlot(seurat_obj, group.by='cell_type', label=TRUE)+
ggtitle('Zhou et al Control Cortex') + NoLegend()
p
SetupForWGCNA
:创建对象,选择基因进行分析
在进行hdWGCNA分析前,需要建立Seurat对象,通常储存在对象的@misc中。并且可以储存多个hdWGCNA结果,就像在前面我们可以储存不同res的细胞降维聚类情况,同样道理这里可以调整参数储存多个hdWGCNA结果
并且通常单细胞矩阵基因数量很多,此函数可以根据调整参数gene_select
,筛选出一部分基因进行分析
SetupForWGCNA(
seurat_obj,
wgcna_name,
features = NULL,
metacell_location = NULL,
group = NULL,
fraction = 0.05,
group.by = NULL,
gene_list = NULL
...
)
参数注释:
seurat_obj
A Seurat object
gene_select
的三个选择:
variable
: 使用储存在VariableFeatures
的高变基因,默认2000
fraction
: 使用在整个数据集或每组细胞中以特定比例fraction
细胞表达的基因,不同组别由group.by
指定
custom
: 使用人为特定的基因
seurat_obj <- SetupForWGCNA(
seurat_obj,
gene_select = "fraction", # the gene selection approach
fraction = 0.05, # fraction of cells that a gene needs to be expressed in order to be included
wgcna_name = "tutorial" # the name of the hdWGCNA experiment
)
length(seurat_obj@misc$tutorial$wgcna_genes) #[1] 12639
这里挑选出1万多个基因进行后续分析。
上面创建了Seurat对象后,第一步就是构建metacells
单细胞数据具有严重的稀疏性(data sparsity):每个细胞中绝大部分基因的表达值被计数为零,该比例甚至可以高达 90%。这样的稀疏性对于数据分析带来了挑战,也引发了争议。
不同的研究人员以不同的方式看待单细胞数据中的大量零计数:有些研究者认为零计数代表生物信号(即基因没有表达或低表达),而另一些研究者则认为是零计数是需要被纠正的缺失数据(missing data)
但WGCNA分析对数据的稀疏性极其的敏感,因此需要构建metacells,来代替原始表达矩阵
核心算法:k-Nearest Neighbors (KNN)算法,来识别来自同一生物来源的相同样本的一小群相似细胞,然后计算这些细胞的平均或总和表达,从而产生metacell基因表达矩阵,这样就能降低数据的稀疏性。如研究单细胞表观组的方法 Cicero,也用了类似聚集算法来构建可及性网络
核心函数:MetacellsByGroups
参数:group.by
参数group.by
很重要,一般指定细胞类型和样本类型。通常先挑出相同生物样本的细胞,再使用算法将多次重复中都能聚在一起的细胞确定为 metacell
k值
:选取与当前测试对象最近的k个训练对象,来判断测试对象的类别。其设置也是KNN算法的关键。
一般取决于数据集的细胞数量,细胞数目越小k值倾向更小。通常设置k=20~75。此数据集有 40039 个细胞,每个生物样本中从 890 到 8,188,因此使用k=25
参考:
基本格式:
MetacellsByGroups(
seurat_obj,
group.by = c("seurat_clusters"),
ident.group = "seurat_clusters",
k = 25,
reduction = "pca",
assay = NULL,
cells.use = NULL,
slot = "counts",
mode = "average",
min_cells = 100,
max_shared = 15,
target_metacells = 1000,
max_iter = 5000,
verbose = FALSE,
wgcna_name = NULL
)
参数注释:
seurat_obj
A Seurat object
group.by
指定细胞类型和样本类型
table(seurat_obj$cell_type)
# INH EX OPC ODC ASC MG
# 3763 10692 1206 16210 3162 1638
table(seurat_obj$Sample)
# C1 C11 C12 C2 C3 C4 C5 C6 C7 C8 C9
# 1788 3345 2333 1899 3002 711 6841 3320 2934 7618 2880
seurat_obj
# An object of class Seurat
# 36601 features across 36671 samples within 1 assay
# Active assay: RNA (36601 features, 0 variable features)
# 2 dimensional reductions calculated: harmony, umap
# construct metacells in each group
seurat_obj <- MetacellsByGroups(
seurat_obj = seurat_obj,
group.by = c("cell_type", "Sample"), # specify the columns in seurat_obj@meta.data to group by
k = 25, # nearest-neighbors parameter
reduction = "harmony", # 降维方法
slot='counts',
max_shared = 10, # maximum number of shared cells between two metacells
ident.group = 'cell_type' # set the Idents of the metacell seurat object
)
标准化:NormalizeMetacells
提取metacell:GetMetacellObject
# normalize metacell expression matrix:
seurat_obj <- NormalizeMetacells(seurat_obj)
# get the metacell object from the hdWGCNA experiment
metacell_obj <- GetMetacellObject(seurat_obj)
metacell_obj
# An object of class Seurat
# 36601 features across 6308 samples within 1 assay
# Active assay: RNA (36601 features, 0 variable features)
算法将相似细胞聚在一起计算平均表达量,可以看到和原始数据相比,metacell的细胞数量大幅度减少了
seurat_obj
# An object of class Seurat
# 36601 features across 36671 samples within 1 assay
# Active assay: RNA (36601 features, 0 variable features)
# 2 dimensional reductions calculated: harmony, umap
metacell_obj
# An object of class Seurat
# 36601 features across 6308 samples within 1 assay
# Active assay: RNA (36601 features, 0 variable features)
> table(seurat_obj$cell_type,seurat_obj$Sample)
C1 C11 C12 C2 C3 C4 C5 C6 C7 C8 C9
INH 330 170 458 149 485 155 607 341 350 348 370
EX 755 584 990 412 1326 343 1239 1308 1191 1460 1084
OPC 112 105 77 146 125 19 101 113 124 191 93
ODC 171 2128 457 895 610 79 4385 1089 691 4709 996
ASC 276 250 223 229 374 89 320 280 426 472 223
MG 144 108 128 68 82 26 189 189 152 438 114
> table(metacell_obj$cell_type,metacell_obj$Sample)
C1 C11 C12 C2 C3 C4 C5 C6 C7 C8 C9
ASC 30 19 22 21 47 0 35 28 46 60 21
EX 101 54 121 38 185 35 147 139 139 205 125
INH 26 10 35 8 38 7 57 28 27 25 30
MG 11 7 10 0 0 0 16 19 12 66 6
ODC 18 947 89 218 141 0 1000 327 136 1000 306
OPC 8 4 0 11 9 0 4 9 7 18 0
流程:
提取感兴趣的细胞亚群:SetDatExpr
选取软阈值:TestSoftPowers
软阈值绘图:PlotSoftPowers
and wrap_plots
参数输出:wrap_plots
构建拓扑重叠矩阵(TOM):ConstructNetwork
利用TOM进行聚类:PlotDendrogram
输出TOM矩阵:GetTOM
hdwgcna提取细胞群:SetDatExpr
seurat_obj <- SetDatExpr(
seurat_obj,
group_name = "INH", # the name of the group of interest in the group.by column
group.by='cell_type', # the metadata column containing the cell type info. This same column should have also been used in MetacellsByGroups
assay = 'RNA', # using RNA assay
use_metacells = TRUE, # use the metacells (TRUE) or the full expression matrix (FALSE)
slot = 'data' # using normalized data
)
如果是多个亚群
调整参数:group_name = c("INH", "EX")
WGCNA为权重基因共表达网络,权重是指对Pearson相关性值进行幂次运算(也就是软阈值,power,pickSoftThreshold)。这种处理方式可强化强相关,弱化了弱相关,使得相关性数值更符合无标度网络特征,更具生物学意义。而不是使用一刀切的硬阈值,认为高于某个数值两个基因存在连接,万一低于但接近它的直接认为两基因没连接。
选取软阈值:TestSoftPowers
绘图:PlotSoftPowers
and wrap_plots
# Test different soft powers:
seurat_obj <- TestSoftPowers(
seurat_obj,
setDatExpr=FALSE,
powers = c(seq(1, 10, by = 1), seq(12, 30, by = 2))) # 选取soft powers(默认)
# networkType = 'signed' # 网络类型 "unsigned" or "signed hybrid"
# plot the results:
plot_list <- PlotSoftPowers(seurat_obj,
point_size = 5,
text_size = 3)
# selected_power = NULL
# assemble with patchwork
wrap_plots(plot_list, ncol=2)
不知道为什么,当想调整参数 networkType = 'signed'决定有向网络or无向网络时,出现报错
Error in TestSoftPowers(seurat_obj, setDatExpr = FALSE, powers = c(seq(1, : unused argument (networkType = "signed")
WGCNA 和 hdWGCNA 的一般选择软阈值大于或等于 0.8 。因此在这里我们选择软阈值=9
软阈值可以自动函数选择,也可以人工指定selected_power = NULL
进行绘图
参数输出:wrap_plots
power_table <- GetPowerTable(seurat_obj)
head(power_table)
Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
1 1 0.192964167 10.1742363 0.9508966 5887.9290 5901.7852 6501.0283
2 2 0.001621749 0.4801145 0.9828730 3092.6379 3092.7878 3835.2202
3 3 0.150812707 -3.2345892 0.9918393 1657.6198 1646.6169 2349.6967
4 4 0.407513393 -4.5383131 0.9898440 905.8402 890.9431 1486.7695
5 5 0.585949838 -4.8303680 0.9917528 504.4288 490.2683 968.1157
6 6 0.677668481 -4.6749295 0.9935437 286.1585 274.3233 646.6741
接下来计算拓扑重叠矩阵(TOM),用TOM矩阵反应基因表达的相似性,意味着两基因有相似的表达模式,考虑了直接相关和间接相关,更符合生物中存在的直接调控和间接调控作用
比如gene A 与 gene B,不仅关注的是这俩基因之间直接的共表达关系,拓扑重叠矩阵还考虑中间因素的影响,不止关注gene A对B的调控,还要关注geneA对B的次级影响。比如gene A调控 gene C,gene C又调控gene B。但更次一级的调控就不计算,考虑影响很小
进而构建pheatmap聚类树 从而挑选相似的模块出来
seurat_obj <- ConstructNetwork(
seurat_obj,
soft_power=9,
setDatExpr=FALSE,
corType = "pearson",
networkType = "signed",
TOMType = "signed",
detectCutHeight = 0.995,
minModuleSize = 50,
mergeCutHeight = 0.2,
tom_outdir = "TOM", # 输出文件夹
tom_name = 'INH' # name of the topoligical overlap matrix written to disk
)
PlotDendrogram(seurat_obj, main='INH hdWGCNA Dendrogram')
函数里包括更多参数,例如:
soft_power:软阈值的选择
corType:计算相关性的方法;可选pearson(默认),bicor。后者更能考虑离群点的影响。
networkType:计算邻接矩阵时,是否考虑正负相关性;"unsigned","signed", "signed hybrid"
TOMType:计算TOM矩阵时,是否考虑正负相关性;选择"none", "unsigned", "signed", "signed Nowick", "unsigned 2", "signed 2" and "signed Nowick 2"
minModuleSize:模块的最少基因数
mergeCutHeight:合并模块的阈值
numericLabels:模块名是否为数字;若设置FALSE,表示映射为颜色名。
saveConsensusTOMs:是否保存TOM矩阵;
nThreads:交代线程数,适用于Linux环境
verbose:0默认安静的执行,值越大表示给出的运行提示信息越多
参考:
https://search.r-project.org/CRAN/refmans/WGCNA/html/blockwiseConsensusModules.html
聚类时可将相关性强的基因聚集到一个分枝中。下图就是一个根据相关性值绘制的一个聚类树,图形横向表示不同的基因,纵向表示基因间的不相关性。因此分枝越靠下,表明分枝内的基因不相关性越小,即相关性越强
“灰色”模块是未分组的基因组成,对于所有下游分析应忽略灰色模块
输出TOM矩阵:GetTOM
上回聚类成模块后,可以使用模块特征值(Module eigengene E)开展模块与模块的相关性分析,或者与性状的相关性分析,得出与感兴趣的性状最相关的模块出来
ME是将模块内的基因进行PCA分析,得到PC1值。PC1值可整体描述模块中基因的变化模式
归一化:ScaleData
计算ME值:ModuleEigengenes
输出ME or hME 值:GetMEs
提前对性状变量进行处理
相关性分析:ModuleTraitCorrelation
输出结果:GetModuleTraitCorrelation
相关性图:PlotModuleTraitCorrelation
# need to run ScaleData first or else harmony throws an error:
seurat_obj <- ScaleData(seurat_obj, features=VariableFeatures(seurat_obj))
# compute all MEs in the full single-cell dataset
seurat_obj <- ModuleEigengenes(
seurat_obj,
scale.model.use = "linear", # choices are "linear", "poisson", or "negbinom"
assay = NULL, # 默认:DefaultAssay(seurat_obj)
pc_dim = 1,
group.by.vars="Sample" # 根据样本去批次化 harmonize
)
看到最后得到两个值,除了ME值,还有根据样本harmonize去批次化后的harmonized module eigengenes (hMEs)数值
# harmonized module eigengenes:
hMEs <- GetMEs(seurat_obj)
# module eigengenes:
MEs <- GetMEs(seurat_obj, harmonized=FALSE)
它两的结果还是有所区别
> colnames(seurat_obj@misc[["tutorial"]][["hMEs"]])
[1] "midnightblue" "pink" "salmon" "tan" "black" "lightcyan" "purple"
[8] "red" "yellow" "grey" "green" "brown" "greenyellow" "blue"
[15] "magenta" "turquoise" "cyan"
> seurat_obj@misc[["tutorial"]][["MEs"]][1:6,1:6]
midnightblue salmon tan black lightcyan grey
AGTCTTTGTTGATTCG-11 -3.659012 6.149156 5.360062 7.227669 4.843664 32.37577
AGCAGCCTCCAGATCA-11 -3.108567 6.262729 4.638732 7.985272 5.923548 32.11585
CTTAGGATCTCATTCA-11 -3.667789 6.826362 5.693103 9.340326 5.446225 30.73525
AACTCAGAGGAATGGA-11 -3.602661 6.281440 3.828821 6.927603 5.959433 27.51765
ACGAGGAGTCTCCACT-11 -3.480822 6.684249 6.770255 8.346960 5.314208 29.94229
ACTTACTTCGGAGCAA-11 -3.095366 5.244673 4.612340 6.427003 4.169680 27.57476
> seurat_obj@misc[["tutorial"]][["hMEs"]][1:6,1:6]
midnightblue pink salmon tan black lightcyan
AGTCTTTGTTGATTCG-11 -3.783981 5.309818 5.899195 6.212114 7.203034 4.336715
AGCAGCCTCCAGATCA-11 -3.189240 5.589570 6.018907 5.510649 8.139280 5.752451
CTTAGGATCTCATTCA-11 -3.736299 5.140163 6.593372 6.500594 9.502204 4.942244
AACTCAGAGGAATGGA-11 -3.728414 5.799119 6.222626 4.716675 7.123539 5.489100
ACGAGGAGTCTCCACT-11 -3.535288 5.054225 6.509797 7.671965 8.551482 4.821368
ACTTACTTCGGAGCAA-11 -3.191317 3.976533 5.008423 5.529623 6.580254 3.605844
需要提前对变量进行处理
数值变量:as.numeric
只有两个类别的分类变量:as.factor
具有顺序关系的两个类别以上的分类变量:as.factor
,设置level
无顺序关系的两个类别以上的分类变量
这个情况比较特殊,无法将其按上面的设置,这样完全没有生物学意义
这种情况下有两种方法:
设置成对指标
# 设置成对指标
library(WGCNA)
# Define a categorical variable with 3 levels
x = rep(c("A", "B", "C"), each = 3);
# Binarize it into pairwise indicators
out = binarizeCategoricalVariable(x,
includePairwise = TRUE,
includeLevelVsAll = FALSE);
# Print the variable and the indicators
data.frame(x, out);
x B.vs.A C.vs.A C.vs.B
1 A 0 0 NA
2 A 0 0 NA
3 A 0 0 NA
4 B 1 NA 0
5 B 1 NA 0
6 B 1 NA 0
7 C NA 1 1
8 C NA 1 1
9 C NA 1 1
设置一个水平与所有其他水平进行对比
# 设置一个水平与所有其他水平进行对比
out = binarizeCategoricalVariable(x,
includePairwise = FALSE,
includeLevelVsAll = TRUE);
# Print the variable and the indicators
data.frame(x, out);
x A.vs.all B.vs.all C.vs.all
1 A 1 0 0
2 A 1 0 0
3 A 1 0 0
4 B 0 1 0
5 B 0 1 0
6 B 0 1 0
7 C 0 0 1
8 C 0 0 1
9 C 0 0 1
其实本质上还是设置成二分类变量,这样得到的结果就具有生物学意义
参考:https://peterlangfelder.com/2018/11/25/working-with-categorical-variables/
言归正传,
# convert sex to factor
seurat_obj$msex <- as.factor(seurat_obj$msex)
# convert age_death to numeric
seurat_obj$age_death <- as.numeric(seurat_obj$age_death)
# list of traits to correlate
cur_traits <- c('braaksc', 'pmi', 'msex', 'age_death', 'doublet_scores',
'nCount_RNA', 'nFeature_RNA', 'total_counts_mt')
str(seurat_obj@meta.data[,cur_traits])
# 'data.frame': 36671 obs. of 8 variables:
# $ braaksc : num 3 3 3 3 3 3 3 3 3 3 ...
# $ pmi : num 5.75 5.75 5.75 5.75 5.75 5.75 5.75 5.75 5.75 5.75 ...
# $ msex : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
# $ age_death : num 84.8 84.8 84.8 84.8 84.8 ...
# $ doublet_scores : num 0.108 0.0985 0.1131 0.1131 0.1186 ...
# $ nCount_RNA : num 52105 52011 51777 51717 51484 ...
# $ nFeature_RNA : int 8876 8807 8640 7199 7702 7799 8472 6764 7944 7668 ...
# $ total_counts_mt: num 916 1635 1354 321 645 ...
# 使用去批次化后的hME
seurat_obj <- ModuleTraitCorrelation(
seurat_obj,
traits = cur_traits,
features = "hMEs", # Valid choices are hMEs, MEs, or scores
cor_method = "pearson", # Valid choices are pearson, spearman, kendall.
group.by='cell_type'
)
输出结果:GetModuleTraitCorrelation
相关性图:PlotModuleTraitCorrelation
# get the mt-correlation results
mt_cor <- GetModuleTraitCorrelation(seurat_obj)
names(mt_cor)
# "cor" "pval" "fdr"
PlotModuleTraitCorrelation(
seurat_obj,
label = 'fdr', # add p-val label in each cell of the heatmap
label_symbol = 'stars', # labels as 'stars' or as 'numeric'
text_size = 2,
text_digits = 2,
text_color = 'white',
high_color = '#fc9272',
mid_color = '#ffffbf',
low_color = '#9ecae1',
plot_max = 0.2,
combine=T # 合并结果
)
因为我们一开始是选择了 INH细胞群做分析的,所以单独拎出combine=F
作为结果
很多作图的细节可以单独修改PlotModuleTraitCorrelation
函数绘图,或者直接生成plot_df自己作图
参考:https://github.com/smorabit/hdWGCNA/blob/dev/R/plotting.R
计算连接度:ModuleConnectivity
module重命名:ResetModuleNames
查看module结果:GetModules
可视化hubgene:PlotKMEs
输出hubgene:GetHubGenes
基因功能评分:ModuleExprScore
接下来,寻找hubgenes,通过计算得出连接度(eigengene-based connectivity,kME),来确定各个模块内高度连接的gene。本质上是计算基因与模块特征值module eigengenes间的相关性
# compute eigengene-based connectivity (kME):
seurat_obj <- ModuleConnectivity(
seurat_obj,
group.by = 'cell_type',
corFnc = "bicor", # to obtain Pearson correlation
corOptions = "use='p'", # to obtain Pearson correlation
harmonized = TRUE,
assay = NULL,
slot = "data", # default to normalized 'data' slot
group_name = 'INH' # 感兴趣的细胞群
)
# rename the modules
# 改名后后模块赋予新的名称,以new_name为基础,后续附加个数字
seurat_obj <- ResetModuleNames(
seurat_obj,
new_name = "INH-M"
)
# print out the new module names
modules <- GetModules(seurat_obj)
print(levels(modules$module))
# [1] "grey" "INH-M1" "INH-M2" "INH-M3" "INH-M4" "INH-M5" "INH-M6" "INH-M7" "INH-M8" "INH-M9"
# [11] "INH-M10" "INH-M11" "INH-M12" "INH-M13" "INH-M14" "INH-M15" "INH-M16"
# show the first 6 columns:
head(modules[,1:6])
结果分别对应基因名,模块名称,对应颜色,kME值
> head(modules[,1:6])
gene_name module color kME_grey kME_INH-M1 kME_INH-M2
AL627309.1 AL627309.1 grey grey 0.06589398 0.022642034 0.07024651
LINC01409 LINC01409 INH-M1 yellow 0.11053490 0.144832154 0.08732725
LINC01128 LINC01128 grey grey 0.13877882 0.007831324 0.12543496
NOC2L NOC2L grey grey 0.14437972 -0.025191042 0.16172439
AGRN AGRN grey grey 0.05887587 0.061095383 0.05356501
C1orf159 C1orf159 grey grey 0.03702322 0.037830622 0.02719629
# plot genes ranked by kME for each module
p <- PlotKMEs(seurat_obj,
ncol=5,
n_hubs = 10, # number of hub genes to display
text_size = 2,
plot_widths = c(3, 2) # the relative width between the kME rank plot and the hub gene text
)
p
每个模块,根据KME值由高到低排列,取前n_hubs个
# get hub genes
hub_df <- GetHubGenes(seurat_obj, n_hubs = 10)
head(hub_df)
saveRDS(seurat_obj, file='data/hdWGCNA_object.rds')
> head(hub_df)
gene_name module kME
1 AC105052.4 INH-M1 0.2445930
2 FP700111.1 INH-M1 0.2622643
3 ANKRD26 INH-M1 0.2627551
4 LINC01278 INH-M1 0.2914837
5 GRIN1 INH-M1 0.2954171
6 LINC00342 INH-M1 0.3161212
类似之前的缺氧,焦亡等生物学特征,可以收集它们的功能基因列表,在单细胞转录组学进行评分,确定哪个细胞群或者细胞模块打分最高
9种常见的功能集打分方法:GSEA、GSVA、PLAGE、Zscore、AddModuleScore、ssGSEA、AUCell、UCell和singscore。
提供两种算法:Seurat(AddModuleScore)和 UCell
# compute gene scoring for the top 25 hub genes by kME for each module
# with Seurat method
seurat_obj <- ModuleExprScore(
seurat_obj,
n_genes = 25,
method='Seurat'
)
# compute gene scoring for the top 25 hub genes by kME for each module
# with UCell method
library(remotes)
remotes::install_github("carmonalab/UCell", ref="v1.3")
library(UCell)
seurat_obj <- ModuleExprScore(
seurat_obj,
n_genes = 25,
method='UCell'
)
ModuleFeaturePlot
:FeaturePlot
ModuleCorrelogram
:模块间相关性
DotPlot
:气泡图
VlnPlot
:小提琴图
ModuleNetworkPlot
:可视化每个模块的单独网络图,按 kME 显示前 25 个基因
HubGeneNetworkPlot
:可视化所有模块特定hubgene的复合网络
ModuleUMAPPlot
:使用UMAP降维算法可视化共表达中的所有基因
wgcna前面算出四种分数(hMEs, MEs, scores, or average),可以用各种方式进行展示
# make a featureplot of hMEs for each module
plot_list <-ModuleFeaturePlot(
seurat_obj,
reduction = "umap",
features = "hMEs",
order_points = TRUE, # order so the points with highest hMEs are on top
restrict_range = TRUE,
point_size = 0.5,
alpha = 1,
label_legend = FALSE,
raster_dpi = 500,
raster_scale = 1,
plot_ratio = 1,
title = TRUE
)
# stitch together with patchwork
wrap_plots(plot_list, ncol=6)
截取函数的一部分,其实features可以展示四种分数(hMEs, MEs, scores, or average)
if (features == "hMEs") {
MEs <- GetMEs(seurat_obj, TRUE, wgcna_name)
}
else if (features == "MEs") {
MEs <- GetMEs(seurat_obj, FALSE, wgcna_name)
}
else if (features == "scores") {
MEs <- GetModuleScores(seurat_obj, wgcna_name)
}
else if (features == "average") {
MEs <- GetAvgModuleExpr(seurat_obj, wgcna_name)
restrict_range <- FALSE
}
else (stop("Invalid feature selection. Valid choices: hMEs, MEs, scores, average"))
例如展示基因功能分数 scores
# make a featureplot of hub scores for each module
plot_list <- ModuleFeaturePlot(
seurat_obj,
features='scores', # plot the hub gene scores
order='shuffle', # order so cells are shuffled
ucell = TRUE # depending on Seurat vs UCell for gene scoring
)
# stitch together with patchwork
wrap_plots(plot_list, ncol=6)
# plot module correlagram
ModuleCorrelogram(seurat_obj,
exclude_grey = TRUE, # 默认删除灰色模块
features = "hMEs" # What to plot? Can select hMEs, MEs, scores, or average
)
# get hMEs from seurat object
MEs <- GetMEs(seurat_obj, harmonized=TRUE)
mods <- colnames(MEs); mods <- mods[mods != 'grey']
# add hMEs to Seurat meta-data:
seurat_obj@meta.data <- cbind(seurat_obj@meta.data, MEs)
# plot with Seurat's DotPlot function
p <- DotPlot(seurat_obj, features=mods, group.by = 'cell_type')
# flip the x/y axes, rotate the axis labels, and change color scheme:
p <- p +
coord_flip() +
RotatedAxis() +
scale_color_gradient2(high='red', mid='grey95', low='blue')
# plot output
p
# Plot INH-M4 hME using Seurat VlnPlot function
p <- VlnPlot(
seurat_obj,
features = 'INH-M12',
group.by = 'cell_type',
pt.size = 0 # don't show actual data points
)
# add box-and-whisker plots on top:
p <- p + geom_boxplot(width=.25, fill='white')
# change axis labels and remove legend:
p <- p + xlab('') + ylab('hME') + NoLegend()
# plot output
p
批量出图
plot_list <- lapply(mods, function(x) {
print(x)
p <- VlnPlot(
seurat_obj,
features = x,
group.by = 'cell_type',
pt.size = 0 # don't show actual data points
)
# add box-and-whisker plots on top:
p <- p + geom_boxplot(width=.25, fill='white')
# change axis labels and remove legend:
p <- p + xlab('') + ylab('hME') + NoLegend()
p
})
wrap_plots(plot_list, ncol = 5)
单个模块网络图
# network analysis & visualization package:
remotes::install_github("igraph/rigraph@master")
library(igraph)
# Visualizes the top hub genes for selected modules as a circular network plot
ModuleNetworkPlot(
seurat_obj,
mods = "all", # all modules are plotted.
outdir = "ModuleNetworks", # The directory where the plots will be stored.
plot_size = c(6, 6),
label_center = FALSE,
edge.alpha = 0.25,
vertex.label.cex = 1, # 基因标签的字体大小
vertex.size = 6 # 节点的大小
)
每个节点nodes代表一个基因,每条边edges代表网络中两个基因之间的共表达关系,颜色对应模块的
kME的前10个中心基因位于图的中心,其余15个基因放置在外圈
可视化每个模块的前 3 个中心基因和 6 个其他基因(随机的)
# hubgene network
HubGeneNetworkPlot(
seurat_obj,
mods = "all", # all modules are plotted.
n_hubs = 3,
n_other=6,
edge_prop = 0.75,
)
可以使用return_graph = TRUE
返回可绘制的 igraph 对象
g <- HubGeneNetworkPlot(seurat_obj, return_graph=TRUE)
这里使用另一种方法umap来可视化共表达网络中的所有基因,主要在topological overlap matrix (TOM)矩阵计算出umap坐标
seurat_obj <- RunModuleUMAP(
seurat_obj,
n_hubs = 10, # number of hub genes to include for the UMAP embedding
n_neighbors=15, # neighbors parameter for UMAP
min_dist=0.1 # min distance between points in UMAP space
)
# get the hub gene UMAP table from the seurat object
umap_df <- GetModuleUMAP(seurat_obj)
# plot with ggplot
ggplot(umap_df, aes(x=UMAP1, y=UMAP2)) +
geom_point(
color=umap_df$color, # color each point by WGCNA module
size=umap_df$kME*2 # size of each point based on intramodular connectivity
) +
umap_theme()
ModuleUMAPPlot(
seurat_obj,
edge.alpha=0.25,
sample_edges=TRUE,
edge_prop=0.1, # proportion of edges to sample (20% here)
label_hubs=2 ,# how many hub genes to plot per module?
keep_grey_edges=FALSE
)
联系客服