KEGG数据库的12大代谢通路分类

最近看到不少文章，从kegg数据库里面的下载了86个代谢通路的1660个基因，然后就针对这些代谢基因集做后续分析。

其实这样的KEGG数据库的12大代谢通路数据挖掘文章很多，其中一个佼佼者是复旦大学邵志敏团队三阴性乳腺癌的代谢组学文章，文献标题是：《Metabolic-Pathway-Based Subtyping of Triple- Negative Breast Cancer Reveals Potential Therapeutic Targets》，其数据挖掘仅仅是一个引子，后续仍然是有大量真实病人自己的代谢组数据做支撑。如下所示，可以看到在the tumor samples versus paired normal samples in the FUSCC cohort. 的差异分析里面，统计学显著（upregulated or downregulated (FDR < 0.05)）的失调代谢通路，在 10 metabolic categories 分类展示：

分类展示失调代谢通路

现在就给大家演示一下如何获取KEGG数据库的12大代谢通路以及其分类，首先KEGG官网在：https://www.genome.jp/kegg/pathway.html

进入官网就可以看到12大代谢通路分类，列表如下所示：

Carbohydrate
Energy
Lipid
Nucleotide
Amino acid
Other amino
Glycan
Cofactor/vitamin
Terpenoid/PK
Other secondary metabolite
Xenobiotics
Chemical structure

其中Nucleotide这个分类的通路是最少了，仅仅是嘌呤和嘧啶的这两个不同通路。

通过KEGGREST包来探索KEGG数据库的12大代谢通路

正常情况下，大家安装R包应该是都问题不大了。这里BiocManager安装"KEGGREST"即可，代码如下所示：

#BiocManager安装"KEGGREST"，
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("KEGGREST")

#加载"KEGGREST"
library(KEGGREST) #用于提取通路及基因信息

然后查询人类在KEGG数据库中的缩写

library(KEGGREST) #用于提取通路及基因信息
#获取KEGG数据库收录的所有物种的清单
org <- keggList('organism') 
# 在中国大陆地区耗时2-3分钟，在海外耗时一秒钟不到。
head(org) 
# 查询 人类  在KEGG数据库中的缩写
library(stringr)
org[str_detect(org[,3],"human"),]

当然，也可以网页查询。https://www.genome.jp/kegg/catalog/org_list.html ，可以看到，人类在KEGG数据库对应的缩写为“hsa”

接下来获取人类的KEGG数据库的全部通路及基因集：

hsa_path <- keggLink("pathway","hsa") 
#得到字符型向量。元素名为基因id，元素为通路名
# 因为是在线获取，所以根据网络情况，耗时4-10分钟不等
#查看hsa_path的前6个
length(hsa_path)
hsa_path[1:6] 
length(unique(names(hsa_path)))
length(unique(hsa_path))

> length(unique(names(hsa_path)))
[1] 8114
> length(unique(hsa_path))
[1] 345
>

可以看到人类的KEGG通路有 345 条，涉及到的基因数量是 8114 个（2022-02-20查询）。

另外，这个KEGGREST包的函数并不多，值得摸索学习：

keggConv Convert KEGG identifiers to/from outside identifiers
keggFind Finds entries with matching query keywords or other query data in a given database
keggGet Retrieves given database entries
keggInfo Displays the current statistics of a given database
keggLink Find related entries by using database cross-references.
keggList Returns a list of entry identifiers and associated definition for a given database or a given set of database entries.

另外，通过观察KEGG官网：https://www.genome.jp/kegg/pathway.html 的12大代谢通路，可以看到通路的id都是00开头，所以很容易使用下面的代码进行批量查询：

meta= unique(hsa_path)[grepl('hsa00',unique(hsa_path))]
hsa_info <- lapply(meta, keggGet)  

nm=unlist(lapply( hsa_info , function(x) x[[1]]$NAME))
library(stringr)
genes = unlist(lapply( hsa_info , function(x) {
  g = x[[1]]$GENE
  paste(str_split(g[seq(2,length(g),by=2)],';',simplify = T)[,1],collapse =';')
}))
df =data.frame(
  hsa= meta,
  nm=nm,
  genes =genes
)

得到的 84个通路和对应的基因组成的表格如下所示：

84个通路和对应的基因

但是很多文章说的是kegg数据库里面的下载了86个代谢通路的1660个基因，所以我重新认真看了看 KEGG官网：https://www.genome.jp/kegg/pathway.html 的12大代谢通路，发现有一些通路居然是01开头，并不是00开头，让我有点费解，就是这个 1.12 Chemical structure transformation maps

 01060  Biosynthesis of plant secondary metabolites
 01061  Biosynthesis of phenylpropanoids
 01062  Biosynthesis of terpenoids and steroids
 01063  Biosynthesis of alkaloids derived from shikimate pathway
 01064  Biosynthesis of alkaloids derived from ornithine, lysine and nicotinic acid
 01065  Biosynthesis of alkaloids derived from histidine and purine
 01066  Biosynthesis of alkaloids derived from terpenoid and polyketide
 01070  Biosynthesis of plant hormones

但是我去查询01开头的，并没有发现前面的 1.12 Chemical structure transformation maps ：

> unique(hsa_path)[grepl('hsa01',unique(hsa_path))]
 [1] "path:hsa01040" "path:hsa01100" "path:hsa01200" "path:hsa01210"
 [5] "path:hsa01212" "path:hsa01230" "path:hsa01240" "path:hsa01250"
 [9] "path:hsa01521" "path:hsa01522" "path:hsa01523" "path:hsa01524"

所以这个并不是导致kegg数据库里面的通路数量和基因数量产生冲突的原因！

聪明的读者朋友们，你们知道为什么吗？

不过，我看作者在后面的热图展现的时候，剩下来的就7大代谢通路里面的几十个细节展现：

所以，个别通路的通路缺失应该是无伤大雅。

写在文末

我在《生信技能树》，《生信菜鸟团》，《单细胞天地》的大量推文教程里面共享的代码都是复制粘贴即可使用的，有任何疑问欢迎留言讨论，也可以发邮件给我，详细描述你遇到的困难的前因后果给我，我的邮箱地址是 jmzeng1314@163.com

如果你确实觉得我的教程对你的科研课题有帮助，让你茅塞顿开，或者说你的课题大量使用我的技能，烦请日后在发表自己的成果的时候，加上一个简短的致谢，如下所示：

We thank Dr.Jianming Zeng(University of Macau), and all the members of his bioinformatics team, biotrainee, for generously sharing their experience and codes.

十年后我环游世界各地的高校以及科研院所（当然包括中国大陆）的时候，如果有这样的情谊，我会优先见你

本站仅提供存储服务，所有内容均由用户发布，如发现有害或侵权内容，请点击举报。