一文掌握单细胞wgcna分析

其实WGCNA本身是对基因进行合理（加权共表达）的分组。

如果一个表达量矩阵，里面的样品是两个分组，比如正常和对照，那么简单的差异分析就可以拿到上下调基因，各自可以去富集生物学通路，就是基因分组了，并没有太多的进行WGCNA分析的必要性，而且绝大部分的两个分组的表达量矩阵里面的样品数量通常是小于15个的，官方也并不推荐WGCNA分析。
如果一个表达量矩阵，里面的样品是时间序列这样的多分组，比如处理前后以及处理过程的不同时间梯度或者不同浓度，那么我们就需要每个分组都去跟对照组进行差异分析，上下调的组合非常多，结果也很难精炼出生物学结论，这个时候就可以选择WGCNA或者mfuzz这样的时间序列分析。

也就是说，只要是多分组，就涉及到多次差异分析，而且多分组意味着样品数量肯定不少，这样的话，在这个表达量矩阵里面，不同基因之间可以计算合理的相关性，就可以根据基因之间的相似性进行基因划分为不同的模块了。

那么，单细胞表达量矩阵理论上每个细胞就是一列，哪怕是单个10x样品也是好几千个细胞，现在是多样品时代，所以数万或者几十万细胞数量也是很常见的。所以理论上也可以使用单细胞表达量矩阵进行wgcna分析，但是由于单细胞表达量矩阵的稀疏性，必然是需要对wgcna算法进行一些调整。恰好学徒最近在她的公众号系统性介绍了即高维high dimensional WGCNA（hdWGCNA），也称为scWGCNA，我这里借花献佛分享给大家哈。

高维high dimensional WGCNA（hdWGCNA），也称为scWGCNA

作用：使用单细胞加权基因相关网络分析（scWGCNA），鉴定共表达的基因模块和差异表达的中枢基因（hub gene），同时也可以揭示了细胞状态相关的基因调控网络（GRN）
帮助文档：https://smorabit.github.io/hdWGCNA/index.html
文献：https://www.nature.com/articles/s41588-021-00894-z
数据：https://github.com/smorabit/hdWGCNA

作为单细胞高级分析一种，目前使用的单细胞文章挺少的，可作为分析的后备方案~

分析前

分析前的单细胞数据，需要先进行标准分析，如归一化标准化等直到降维聚类

A single-cell or single-nucleus transcriptomics dataset in Seurat format
Normalize the gene expression matrix NormalizeData
Identify highly variable genes VariableFeatures
Scale the normalized expression data ScaleData
Perform dimensionality reduction RunPCA and batch correction if needed RunHarmony
Non-linear dimensionality reduction RunUMAP for visualizations
Group cells into clusters (FindNeighbors and FindClusters)

1. 下载数据

使用已处理过的人脑的单细胞数据作为例子分析（需要自己下载这个rds文件哦）

wget https://swaruplab.bio.uci.edu/public_data/Zhou_2020.rds

2. 加载R包

# single-cell analysis package
library(Seurat)

# plotting and data science packages
library(tidyverse)
library(cowplot)
library(patchwork)
library(devtools)
library(igraph)

# co-expression network analysis packages:
library(WGCNA)
# devtools::install_github('smorabit/hdWGCNA', ref='dev')
library(hdWGCNA)

# using the cowplot theme for ggplot
theme_set(theme_cowplot())

# set random seed for reproducibility
set.seed(12345) 

# load the Zhou et al snRNA-seq dataset
seurat_obj <- readRDS('data/Zhou_2020.rds')
seurat_obj
# An object of class Seurat 
# 36601 features across 36671 samples within 1 assay 
# Active assay: RNA (36601 features, 0 variable features)
# 2 dimensional reductions calculated: harmony, umap

p <- DimPlot(seurat_obj, group.by='cell_type', label=TRUE)+
  ggtitle('Zhou et al Control Cortex') + NoLegend()

p

3. 创建对象，选择基因

SetupForWGCNA:创建对象，选择基因进行分析

在进行hdWGCNA分析前，需要建立Seurat对象，通常储存在对象的@misc中。并且可以储存多个hdWGCNA结果，就像在前面我们可以储存不同res的细胞降维聚类情况，同样道理这里可以调整参数储存多个hdWGCNA结果

并且通常单细胞矩阵基因数量很多，此函数可以根据调整参数gene_select，筛选出一部分基因进行分析

SetupForWGCNA(
  seurat_obj,
  wgcna_name,
  features = NULL,
  metacell_location = NULL,
  group = NULL,
  fraction = 0.05,
  group.by = NULL,
  gene_list = NULL
  ...
)
参数注释:
seurat_obj 
A Seurat object

gene_select的三个选择：

variable: 使用储存在VariableFeatures的高变基因，默认2000
fraction: 使用在整个数据集或每组细胞中以特定比例fraction细胞表达的基因,不同组别由group.by指定
custom: 使用人为特定的基因

seurat_obj <- SetupForWGCNA(
  seurat_obj,
  gene_select = "fraction", # the gene selection approach
  fraction = 0.05, # fraction of cells that a gene needs to be expressed in order to be included
  wgcna_name = "tutorial" # the name of the hdWGCNA experiment
)
length(seurat_obj@misc$tutorial$wgcna_genes) #[1] 12639

这里挑选出1万多个基因进行后续分析。

创建metacells

上面创建了Seurat对象后，第一步就是构建metacells

单细胞数据具有严重的稀疏性（data sparsity）：每个细胞中绝大部分基因的表达值被计数为零，该比例甚至可以高达 90%。这样的稀疏性对于数据分析带来了挑战，也引发了争议。
不同的研究人员以不同的方式看待单细胞数据中的大量零计数：有些研究者认为零计数代表生物信号（即基因没有表达或低表达），而另一些研究者则认为是零计数是需要被纠正的缺失数据（missing data）

但WGCNA分析对数据的稀疏性极其的敏感，因此需要构建metacells，来代替原始表达矩阵

使用方法

核心算法：k-Nearest Neighbors (KNN)算法，来识别来自同一生物来源的相同样本的一小群相似细胞，然后计算这些细胞的平均或总和表达，从而产生metacell基因表达矩阵，这样就能降低数据的稀疏性。如研究单细胞表观组的方法 Cicero，也用了类似聚集算法来构建可及性网络
核心函数：MetacellsByGroups
参数：group.by

参数group.by很重要，一般指定细胞类型和样本类型。通常先挑出相同生物样本的细胞，再使用算法将多次重复中都能聚在一起的细胞确定为 metacell

k值：选取与当前测试对象最近的k个训练对象，来判断测试对象的类别。其设置也是KNN算法的关键。

一般取决于数据集的细胞数量，细胞数目越小k值倾向更小。通常设置k=20~75。此数据集有 40039 个细胞，每个生物样本中从 890 到 8，188，因此使用k=25

参考：
单细胞分析十八般武艺12：metacell
kNN算法根据不同病理特征来预测乳腺癌转移与否

基本格式：

MetacellsByGroups(
  seurat_obj,
  group.by = c("seurat_clusters"),
  ident.group = "seurat_clusters",
  k = 25,
  reduction = "pca",
  assay = NULL,
  cells.use = NULL,
  slot = "counts",
  mode = "average",
  min_cells = 100,
  max_shared = 15,
  target_metacells = 1000,
  max_iter = 5000,
  verbose = FALSE,
  wgcna_name = NULL
)
参数注释：
seurat_obj
A Seurat object

group.by指定细胞类型和样本类型

table(seurat_obj$cell_type)
# INH    EX   OPC   ODC   ASC    MG 
# 3763 10692  1206 16210  3162  1638
table(seurat_obj$Sample)
# C1  C11  C12   C2   C3   C4   C5   C6   C7   C8   C9 
# 1788 3345 2333 1899 3002  711 6841 3320 2934 7618 2880 
seurat_obj
# An object of class Seurat 
# 36601 features across 36671 samples within 1 assay 
# Active assay: RNA (36601 features, 0 variable features)
# 2 dimensional reductions calculated: harmony, umap

# construct metacells  in each group
seurat_obj <- MetacellsByGroups(
  seurat_obj = seurat_obj,
  group.by = c("cell_type", "Sample"), # specify the columns in seurat_obj@meta.data to group by
  k = 25, # nearest-neighbors parameter
  reduction = "harmony", # 降维方法
  slot='counts',
  max_shared = 10, # maximum number of shared cells between two metacells
  ident.group = 'cell_type' # set the Idents of the metacell seurat object
)

标准化：NormalizeMetacells
提取metacell：GetMetacellObject

# normalize metacell expression matrix:
seurat_obj <- NormalizeMetacells(seurat_obj)

#  get the metacell object from the hdWGCNA experiment
metacell_obj <- GetMetacellObject(seurat_obj)
metacell_obj
# An object of class Seurat 
# 36601 features across 6308 samples within 1 assay 
# Active assay: RNA (36601 features, 0 variable features)

结果比对

算法将相似细胞聚在一起计算平均表达量，可以看到和原始数据相比，metacell的细胞数量大幅度减少了

seurat_obj
# An object of class Seurat 
# 36601 features across 36671 samples within 1 assay 
# Active assay: RNA (36601 features, 0 variable features)
# 2 dimensional reductions calculated: harmony, umap
metacell_obj
# An object of class Seurat 
# 36601 features across 6308 samples within 1 assay 
# Active assay: RNA (36601 features, 0 variable features)

> table(seurat_obj$cell_type,seurat_obj$Sample)
     
        C1  C11  C12   C2   C3   C4   C5   C6   C7   C8   C9
  INH  330  170  458  149  485  155  607  341  350  348  370
  EX   755  584  990  412 1326  343 1239 1308 1191 1460 1084
  OPC  112  105   77  146  125   19  101  113  124  191   93
  ODC  171 2128  457  895  610   79 4385 1089  691 4709  996
  ASC  276  250  223  229  374   89  320  280  426  472  223
  MG   144  108  128   68   82   26  189  189  152  438  114
> table(metacell_obj$cell_type,metacell_obj$Sample)
     
        C1  C11  C12   C2   C3   C4   C5   C6   C7   C8   C9
  ASC   30   19   22   21   47    0   35   28   46   60   21
  EX   101   54  121   38  185   35  147  139  139  205  125
  INH   26   10   35    8   38    7   57   28   27   25   30
  MG    11    7   10    0    0    0   16   19   12   66    6
  ODC   18  947   89  218  141    0 1000  327  136 1000  306
  OPC    8    4    0   11    9    0    4    9    7   18    0

共表达网络分析

流程：

提取感兴趣的细胞亚群:SetDatExpr
选取软阈值：TestSoftPowers
软阈值绘图：PlotSoftPowers and wrap_plots
参数输出：wrap_plots
构建拓扑重叠矩阵（TOM）:ConstructNetwork
利用TOM进行聚类:PlotDendrogram
输出TOM矩阵：GetTOM

提取感兴趣的细胞亚群

hdwgcna提取细胞群：SetDatExpr

seurat_obj <- SetDatExpr(
  seurat_obj,
  group_name = "INH", # the name of the group of interest in the group.by column
  group.by='cell_type', # the metadata column containing the cell type info. This same column should have also been used in MetacellsByGroups
  assay = 'RNA', # using RNA assay
  use_metacells = TRUE, # use the metacells (TRUE) or the full expression matrix (FALSE)
  slot = 'data' # using normalized data
)

如果是多个亚群

调整参数：group_name = c("INH", "EX")

选取软阈值

WGCNA为权重基因共表达网络，权重是指对Pearson相关性值进行幂次运算（也就是软阈值，power，pickSoftThreshold）。这种处理方式可强化强相关，弱化了弱相关，使得相关性数值更符合无标度网络特征，更具生物学意义。而不是使用一刀切的硬阈值，认为高于某个数值两个基因存在连接，万一低于但接近它的直接认为两基因没连接。

选取软阈值：TestSoftPowers
绘图：PlotSoftPowers and wrap_plots

# Test different soft powers:
seurat_obj <- TestSoftPowers(
  seurat_obj,
  setDatExpr=FALSE,
  powers = c(seq(1, 10, by = 1), seq(12, 30, by = 2))) # 选取soft powers(默认)
  # networkType = 'signed' # 网络类型 "unsigned" or "signed hybrid"

# plot the results:
plot_list <- PlotSoftPowers(seurat_obj,
                            point_size = 5,
                            text_size = 3)
                            # selected_power = NULL
# assemble with patchwork
wrap_plots(plot_list, ncol=2)

不知道为什么，当想调整参数 networkType = 'signed'决定有向网络or无向网络时，出现报错
Error in TestSoftPowers(seurat_obj, setDatExpr = FALSE, powers = c(seq(1, : unused argument (networkType = "signed")

WGCNA 和 hdWGCNA 的一般选择软阈值大于或等于 0.8 。因此在这里我们选择软阈值=9

软阈值可以自动函数选择，也可以人工指定selected_power = NULL进行绘图

参数输出：wrap_plots

power_table <- GetPowerTable(seurat_obj)
head(power_table)

Power    SFT.R.sq      slope truncated.R.sq   mean.k. median.k.    max.k.
1     1 0.192964167 10.1742363      0.9508966 5887.9290 5901.7852 6501.0283
2     2 0.001621749  0.4801145      0.9828730 3092.6379 3092.7878 3835.2202
3     3 0.150812707 -3.2345892      0.9918393 1657.6198 1646.6169 2349.6967
4     4 0.407513393 -4.5383131      0.9898440  905.8402  890.9431 1486.7695
5     5 0.585949838 -4.8303680      0.9917528  504.4288  490.2683  968.1157
6     6 0.677668481 -4.6749295      0.9935437  286.1585  274.3233  646.6741

拓扑重叠矩阵（TOM）

接下来计算拓扑重叠矩阵（TOM），用TOM矩阵反应基因表达的相似性，意味着两基因有相似的表达模式，考虑了直接相关和间接相关，更符合生物中存在的直接调控和间接调控作用

比如gene A 与 gene B，不仅关注的是这俩基因之间直接的共表达关系，拓扑重叠矩阵还考虑中间因素的影响，不止关注gene A对B的调控，还要关注geneA对B的次级影响。比如gene A调控 gene C，gene C又调控gene B。但更次一级的调控就不计算，考虑影响很小

进而构建pheatmap聚类树从而挑选相似的模块出来

seurat_obj <- ConstructNetwork(
  seurat_obj, 
  soft_power=9,
  setDatExpr=FALSE,
  corType = "pearson",
  networkType = "signed",
  TOMType = "signed",
  detectCutHeight = 0.995,
  minModuleSize = 50,
  mergeCutHeight = 0.2,
  tom_outdir = "TOM", # 输出文件夹
  tom_name = 'INH' # name of the topoligical overlap matrix written to disk
)
PlotDendrogram(seurat_obj, main='INH hdWGCNA Dendrogram')

函数里包括更多参数，例如：

soft_power：软阈值的选择
corType：计算相关性的方法；可选pearson(默认)，bicor。后者更能考虑离群点的影响。
networkType:计算邻接矩阵时，是否考虑正负相关性；"unsigned","signed", "signed hybrid"
TOMType：计算TOM矩阵时，是否考虑正负相关性；选择"none", "unsigned", "signed", "signed Nowick", "unsigned 2", "signed 2" and "signed Nowick 2"
minModuleSize：模块的最少基因数
mergeCutHeight：合并模块的阈值
numericLabels：模块名是否为数字；若设置FALSE，表示映射为颜色名。
saveConsensusTOMs：是否保存TOM矩阵；
nThreads：交代线程数，适用于Linux环境
verbose：0默认安静的执行，值越大表示给出的运行提示信息越多

参考：
https://search.r-project.org/CRAN/refmans/WGCNA/html/blockwiseConsensusModules.html

聚类时可将相关性强的基因聚集到一个分枝中。下图就是一个根据相关性值绘制的一个聚类树，图形横向表示不同的基因，纵向表示基因间的不相关性。因此分枝越靠下，表明分枝内的基因不相关性越小，即相关性越强

“灰色”模块是未分组的基因组成，对于所有下游分析应忽略灰色模块

输出TOM矩阵：GetTOM

模块特征值(ME)

上回聚类成模块后，可以使用模块特征值（Module eigengene E）开展模块与模块的相关性分析，或者与性状的相关性分析，得出与感兴趣的性状最相关的模块出来

ME是将模块内的基因进行PCA分析，得到PC1值。PC1值可整体描述模块中基因的变化模式

流程

归一化：ScaleData
计算ME值：ModuleEigengenes
输出ME or hME 值：GetMEs
提前对性状变量进行处理
相关性分析：ModuleTraitCorrelation
输出结果：GetModuleTraitCorrelation
相关性图：PlotModuleTraitCorrelation

计算ME值

# need to run ScaleData first or else harmony throws an error:
seurat_obj <- ScaleData(seurat_obj, features=VariableFeatures(seurat_obj))

# compute all MEs in the full single-cell dataset
seurat_obj <- ModuleEigengenes(
  seurat_obj,
  scale.model.use = "linear", #  choices are "linear", "poisson", or "negbinom"
  assay = NULL, # 默认:DefaultAssay(seurat_obj)
  pc_dim = 1,
  group.by.vars="Sample" # 根据样本去批次化 harmonize
)

看到最后得到两个值，除了ME值，还有根据样本harmonize去批次化后的harmonized module eigengenes (hMEs)数值

# harmonized module eigengenes:
hMEs <- GetMEs(seurat_obj)

# module eigengenes:
MEs <- GetMEs(seurat_obj, harmonized=FALSE)

它两的结果还是有所区别

> colnames(seurat_obj@misc[["tutorial"]][["hMEs"]])
 [1] "midnightblue" "pink"         "salmon"       "tan"          "black"        "lightcyan"    "purple"      
 [8] "red"          "yellow"       "grey"         "green"        "brown"        "greenyellow"  "blue"        
[15] "magenta"      "turquoise"    "cyan"

> seurat_obj@misc[["tutorial"]][["MEs"]][1:6,1:6]
                    midnightblue   salmon      tan    black lightcyan     grey
AGTCTTTGTTGATTCG-11    -3.659012 6.149156 5.360062 7.227669  4.843664 32.37577
AGCAGCCTCCAGATCA-11    -3.108567 6.262729 4.638732 7.985272  5.923548 32.11585
CTTAGGATCTCATTCA-11    -3.667789 6.826362 5.693103 9.340326  5.446225 30.73525
AACTCAGAGGAATGGA-11    -3.602661 6.281440 3.828821 6.927603  5.959433 27.51765
ACGAGGAGTCTCCACT-11    -3.480822 6.684249 6.770255 8.346960  5.314208 29.94229
ACTTACTTCGGAGCAA-11    -3.095366 5.244673 4.612340 6.427003  4.169680 27.57476
> seurat_obj@misc[["tutorial"]][["hMEs"]][1:6,1:6]
                    midnightblue     pink   salmon      tan    black lightcyan
AGTCTTTGTTGATTCG-11    -3.783981 5.309818 5.899195 6.212114 7.203034  4.336715
AGCAGCCTCCAGATCA-11    -3.189240 5.589570 6.018907 5.510649 8.139280  5.752451
CTTAGGATCTCATTCA-11    -3.736299 5.140163 6.593372 6.500594 9.502204  4.942244
AACTCAGAGGAATGGA-11    -3.728414 5.799119 6.222626 4.716675 7.123539  5.489100
ACGAGGAGTCTCCACT-11    -3.535288 5.054225 6.509797 7.671965 8.551482  4.821368
ACTTACTTCGGAGCAA-11    -3.191317 3.976533 5.008423 5.529623 6.580254  3.605844

模块与性状间的相关性

需要提前对变量进行处理

数值变量：as.numeric
只有两个类别的分类变量：as.factor
具有顺序关系的两个类别以上的分类变量：as.factor，设置level
无顺序关系的两个类别以上的分类变量

这个情况比较特殊，无法将其按上面的设置，这样完全没有生物学意义

这种情况下有两种方法：

设置成对指标

# 设置成对指标
library(WGCNA)
# Define a categorical variable with 3 levels
x = rep(c("A", "B", "C"), each = 3);
# Binarize it into pairwise indicators
out = binarizeCategoricalVariable(x,
includePairwise = TRUE,
includeLevelVsAll = FALSE);
# Print the variable and the indicators
data.frame(x, out);

  x B.vs.A C.vs.A C.vs.B
1 A      0      0     NA
2 A      0      0     NA
3 A      0      0     NA
4 B      1     NA      0
5 B      1     NA      0
6 B      1     NA      0
7 C     NA      1      1
8 C     NA      1      1
9 C     NA      1      1

设置一个水平与所有其他水平进行对比

# 设置一个水平与所有其他水平进行对比
out = binarizeCategoricalVariable(x,
includePairwise = FALSE,
includeLevelVsAll = TRUE);
# Print the variable and the indicators
data.frame(x, out);

  x A.vs.all B.vs.all C.vs.all
1 A        1        0        0
2 A        1        0        0
3 A        1        0        0
4 B        0        1        0
5 B        0        1        0
6 B        0        1        0
7 C        0        0        1
8 C        0        0        1
9 C        0        0        1

其实本质上还是设置成二分类变量，这样得到的结果就具有生物学意义

参考：https://peterlangfelder.com/2018/11/25/working-with-categorical-variables/

言归正传，

# convert sex to factor
seurat_obj$msex <- as.factor(seurat_obj$msex)

# convert age_death to numeric
seurat_obj$age_death <- as.numeric(seurat_obj$age_death)

# list of traits to correlate
cur_traits <- c('braaksc', 'pmi', 'msex', 'age_death', 'doublet_scores', 
                'nCount_RNA', 'nFeature_RNA', 'total_counts_mt')
                
str(seurat_obj@meta.data[,cur_traits])
# 'data.frame': 36671 obs. of  8 variables:
#   $ braaksc        : num  3 3 3 3 3 3 3 3 3 3 ...
# $ pmi            : num  5.75 5.75 5.75 5.75 5.75 5.75 5.75 5.75 5.75 5.75 ...
# $ msex           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
# $ age_death      : num  84.8 84.8 84.8 84.8 84.8 ...
# $ doublet_scores : num  0.108 0.0985 0.1131 0.1131 0.1186 ...
# $ nCount_RNA     : num  52105 52011 51777 51717 51484 ...
# $ nFeature_RNA   : int  8876 8807 8640 7199 7702 7799 8472 6764 7944 7668 ...
# $ total_counts_mt: num  916 1635 1354 321 645 ...

# 使用去批次化后的hME
seurat_obj <- ModuleTraitCorrelation(
  seurat_obj,
  traits = cur_traits,
  features = "hMEs", # Valid choices are hMEs, MEs, or scores
  cor_method = "pearson", # Valid choices are pearson, spearman, kendall.
  group.by='cell_type'
)

输出结果：GetModuleTraitCorrelation
相关性图：PlotModuleTraitCorrelation

# get the mt-correlation results
mt_cor <- GetModuleTraitCorrelation(seurat_obj)

names(mt_cor)
# "cor"  "pval" "fdr"

PlotModuleTraitCorrelation(
  seurat_obj,
  label = 'fdr', # add p-val label in each cell of the heatmap
  label_symbol = 'stars', # labels as 'stars' or as 'numeric'
  text_size = 2,
  text_digits = 2,
  text_color = 'white',
  high_color = '#fc9272',
  mid_color = '#ffffbf',
  low_color = '#9ecae1',
  plot_max = 0.2,
  combine=T # 合并结果
)

因为我们一开始是选择了 INH细胞群做分析的，所以单独拎出combine=F作为结果

很多作图的细节可以单独修改PlotModuleTraitCorrelation函数绘图，或者直接生成plot_df自己作图

参考：https://github.com/smorabit/hdWGCNA/blob/dev/R/plotting.R

hubgenes和功能评分

流程

计算连接度：ModuleConnectivity
module重命名：ResetModuleNames
查看module结果：GetModules
可视化hubgene：PlotKMEs
输出hubgene：GetHubGenes
基因功能评分：ModuleExprScore

接下来，寻找hubgenes，通过计算得出连接度（eigengene-based connectivity,kME），来确定各个模块内高度连接的gene。本质上是计算基因与模块特征值module eigengenes间的相关性

# compute eigengene-based connectivity (kME):
seurat_obj <- ModuleConnectivity(
  seurat_obj,
  group.by = 'cell_type', 
  corFnc = "bicor", # to obtain Pearson correlation
  corOptions = "use='p'", # to obtain Pearson correlation
  harmonized = TRUE,
  assay = NULL,
  slot = "data", # default to normalized 'data' slot
  group_name = 'INH' # 感兴趣的细胞群
)

# rename the modules
# 改名后后模块赋予新的名称，以new_name为基础，后续附加个数字
seurat_obj <- ResetModuleNames(
  seurat_obj,
  new_name = "INH-M"
)

# print out the new module names
modules <- GetModules(seurat_obj)

print(levels(modules$module))
# [1] "grey"    "INH-M1"  "INH-M2"  "INH-M3"  "INH-M4"  "INH-M5"  "INH-M6"  "INH-M7"  "INH-M8"  "INH-M9" 
# [11] "INH-M10" "INH-M11" "INH-M12" "INH-M13" "INH-M14" "INH-M15" "INH-M16"

# show the first 6 columns:
head(modules[,1:6])

结果分别对应基因名，模块名称，对应颜色，kME值

> head(modules[,1:6])
            gene_name module  color   kME_grey   kME_INH-M1 kME_INH-M2
AL627309.1 AL627309.1   grey   grey 0.06589398  0.022642034 0.07024651
LINC01409   LINC01409 INH-M1 yellow 0.11053490  0.144832154 0.08732725
LINC01128   LINC01128   grey   grey 0.13877882  0.007831324 0.12543496
NOC2L           NOC2L   grey   grey 0.14437972 -0.025191042 0.16172439
AGRN             AGRN   grey   grey 0.05887587  0.061095383 0.05356501
C1orf159     C1orf159   grey   grey 0.03702322  0.037830622 0.02719629

输出结果

# plot genes ranked by kME for each module
p <- PlotKMEs(seurat_obj, 
              ncol=5,
              n_hubs = 10, # number of hub genes to display
              text_size = 2,
              plot_widths = c(3, 2) # the relative width between the kME rank plot and the hub gene text
              )
p

每个模块，根据KME值由高到低排列，取前n_hubs个

# get hub genes
hub_df <- GetHubGenes(seurat_obj, n_hubs = 10)

head(hub_df)

saveRDS(seurat_obj, file='data/hdWGCNA_object.rds')

> head(hub_df)
   gene_name module       kME
1 AC105052.4 INH-M1 0.2445930
2 FP700111.1 INH-M1 0.2622643
3    ANKRD26 INH-M1 0.2627551
4  LINC01278 INH-M1 0.2914837
5      GRIN1 INH-M1 0.2954171
6  LINC00342 INH-M1 0.3161212

基因功能评分

类似之前的缺氧，焦亡等生物学特征，可以收集它们的功能基因列表，在单细胞转录组学进行评分，确定哪个细胞群或者细胞模块打分最高

9种常见的功能集打分方法：GSEA、GSVA、PLAGE、Zscore、AddModuleScore、ssGSEA、AUCell、UCell和singscore。
参考：8种方法可视化你的单细胞基因集打分

提供两种算法：Seurat（AddModuleScore）和 UCell

# compute gene scoring for the top 25 hub genes by kME for each module
# with Seurat method
seurat_obj <- ModuleExprScore(
  seurat_obj,
  n_genes = 25,
  method='Seurat'
)

# compute gene scoring for the top 25 hub genes by kME for each module
# with UCell method
library(remotes)
remotes::install_github("carmonalab/UCell", ref="v1.3")
library(UCell)
seurat_obj <- ModuleExprScore(
  seurat_obj,
  n_genes = 25,
  method='UCell'
)

基本可视化

ModuleFeaturePlot：FeaturePlot
ModuleCorrelogram：模块间相关性
DotPlot：气泡图
VlnPlot：小提琴图
ModuleNetworkPlot：可视化每个模块的单独网络图，按 kME 显示前 25 个基因
HubGeneNetworkPlot：可视化所有模块特定hubgene的复合网络
ModuleUMAPPlot：使用UMAP降维算法可视化共表达中的所有基因

wgcna前面算出四种分数（hMEs, MEs, scores, or average），可以用各种方式进行展示

四种分数的FeaturePlot

# make a featureplot of hMEs for each module
plot_list <-ModuleFeaturePlot(
  seurat_obj,
  reduction = "umap",
  features = "hMEs",
  order_points = TRUE, # order so the points with highest hMEs are on top
  restrict_range = TRUE,
  point_size = 0.5,
  alpha = 1,
  label_legend = FALSE,
  raster_dpi = 500,
  raster_scale = 1,
  plot_ratio = 1,
  title = TRUE
)
# stitch together with patchwork
wrap_plots(plot_list, ncol=6)

截取函数的一部分，其实features可以展示四种分数（hMEs, MEs, scores, or average）

if (features == "hMEs") {
        MEs <- GetMEs(seurat_obj, TRUE, wgcna_name)
     }
     else if (features == "MEs") {
         MEs <- GetMEs(seurat_obj, FALSE, wgcna_name)
     }
     else if (features == "scores") {
         MEs <- GetModuleScores(seurat_obj, wgcna_name)
     }
     else if (features == "average") {
         MEs <- GetAvgModuleExpr(seurat_obj, wgcna_name)
         restrict_range <- FALSE
     }
     else (stop("Invalid feature selection. Valid choices: hMEs, MEs, scores, average"))

例如展示基因功能分数 scores

# make a featureplot of hub scores for each module
plot_list <- ModuleFeaturePlot(
  seurat_obj,
  features='scores', # plot the hub gene scores
  order='shuffle', # order so cells are shuffled
  ucell = TRUE # depending on Seurat vs UCell for gene scoring
)

# stitch together with patchwork
wrap_plots(plot_list, ncol=6)

模块间相关性

# plot module correlagram
ModuleCorrelogram(seurat_obj,
                  exclude_grey = TRUE, # 默认删除灰色模块
                  features = "hMEs" # What to plot? Can select hMEs, MEs, scores, or average
                  )

气泡图

# get hMEs from seurat object
MEs <- GetMEs(seurat_obj, harmonized=TRUE)
mods <- colnames(MEs); mods <- mods[mods != 'grey']

# add hMEs to Seurat meta-data:
seurat_obj@meta.data <- cbind(seurat_obj@meta.data, MEs)

# plot with Seurat's DotPlot function
p <- DotPlot(seurat_obj, features=mods, group.by = 'cell_type')

# flip the x/y axes, rotate the axis labels, and change color scheme:
p <- p +
  coord_flip() +
  RotatedAxis() +
  scale_color_gradient2(high='red', mid='grey95', low='blue')

# plot output
p

小提琴图

# Plot INH-M4 hME using Seurat VlnPlot function
p <- VlnPlot(
  seurat_obj,
  features = 'INH-M12',
  group.by = 'cell_type',
  pt.size = 0 # don't show actual data points
)

# add box-and-whisker plots on top:
p <- p + geom_boxplot(width=.25, fill='white')

# change axis labels and remove legend:
p <- p + xlab('') + ylab('hME') + NoLegend()

# plot output
p

批量出图

plot_list <- lapply(mods, function(x) {
  print(x)
  p <- VlnPlot(
    seurat_obj,
    features = x,
    group.by = 'cell_type',
    pt.size = 0 # don't show actual data points
  )
  
  # add box-and-whisker plots on top:
  p <- p + geom_boxplot(width=.25, fill='white')
  
  # change axis labels and remove legend:
  p <- p + xlab('') + ylab('hME') + NoLegend()
  
  p
})
wrap_plots(plot_list, ncol = 5)

网络可视化

单个模块网络图

# network analysis & visualization package:
remotes::install_github("igraph/rigraph@master")
library(igraph)
# Visualizes the top hub genes for selected modules as a circular network plot
ModuleNetworkPlot(
  seurat_obj,
  mods = "all", # all modules are plotted.
  outdir = "ModuleNetworks", # The directory where the plots will be stored.
  plot_size = c(6, 6),
  label_center = FALSE,
  edge.alpha = 0.25,
  vertex.label.cex = 1, # 基因标签的字体大小
  vertex.size = 6 # 节点的大小
)

每个节点nodes代表一个基因，每条边edges代表网络中两个基因之间的共表达关系，颜色对应模块的

kME的前10个中心基因位于图的中心，其余15个基因放置在外圈

组合网络图

可视化每个模块的前 3 个中心基因和 6 个其他基因（随机的）

# hubgene network
HubGeneNetworkPlot(
  seurat_obj,
  mods = "all", # all modules are plotted.
  n_hubs = 3, 
  n_other=6,
  edge_prop = 0.75,
)

可以使用return_graph = TRUE返回可绘制的 igraph 对象

g <- HubGeneNetworkPlot(seurat_obj,  return_graph=TRUE)

umap网络图

这里使用另一种方法umap来可视化共表达网络中的所有基因，主要在topological overlap matrix (TOM)矩阵计算出umap坐标

seurat_obj <- RunModuleUMAP(
  seurat_obj,
  n_hubs = 10, # number of hub genes to include for the UMAP embedding
  n_neighbors=15, # neighbors parameter for UMAP
  min_dist=0.1 # min distance between points in UMAP space
)

# get the hub gene UMAP table from the seurat object
umap_df <- GetModuleUMAP(seurat_obj)

# plot with ggplot
ggplot(umap_df, aes(x=UMAP1, y=UMAP2)) +
  geom_point(
   color=umap_df$color, # color each point by WGCNA module
   size=umap_df$kME*2 # size of each point based on intramodular connectivity
  ) +
  umap_theme()

ModuleUMAPPlot(
  seurat_obj,
  edge.alpha=0.25,
  sample_edges=TRUE,
  edge_prop=0.1, # proportion of edges to sample (20% here)
  label_hubs=2 ,# how many hub genes to plot per module?
  keep_grey_edges=FALSE

)

本站仅提供存储服务，所有内容均由用户发布，如发现有害或侵权内容，请点击举报。