fgsea包做GSEA分析

GSEA图

这篇教程翻译自生信博客DAVE TANGE’S BLOG(https://davetang.org/muse/)上的一篇教程,参考资料中已经列出了地址。

fgsea这个包用于做GSEA分析,先来看一下使用这个包做的图,如下所示:

现在简单解释一下这个图形:

x轴——排序后的基因列表L位置对应的坐标,也就是我们自己通过RNA-seq,芯片,qPCR等手段获得的基因表达值倍数变化,或p值排序,总之,这是一个有序列表。

垂直的黑色细胞——上图中类似条形码的图形,这是指的是某一个基因集S中基因对应于L基因中,的位置,在上图中,这个基因集是细胞周期(Cell Cycle),明天看到S中的成员在L的左侧比较密集。

y轴——富集分布,从上面我们可以看到,细胞周期(Cell Cycle)这个基因集在左侧富集,也就是绿色曲线表示的位置。

fgsea使用

安装

先安装fgsea包,如下所示:

1
2
3
install.packages("BiocManager")
BiocManager::install("fgsea")
library(fgsea)

数据

fgsea包中内置的有数据集examplePathways,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
library(fgsea)
data(examplePathways)
# examplePathways是从'reactome.db'包中
# 提取的信息,并以列表的形式存了这些信息,
# 这些信息主要是小鼠的基因
help("examplePathways")
# 这个数据集是一个列表
class(examplePathways)
# 一共有1,457 "通路"
length(examplePathways)
# 第1列含有Meiotic_Synapsis pathway的EntrezID
examplePathways[1]
# exampleRanks储存的是排序信息
data(exampleRanks)
# exampleRnak是一个数字型的向量
class(exampleRanks)
# exampleRanks中的向量名称则是对称的Entrez ID
head(exampleRanks)
tail(unname(exampleRanks))

运行结果如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
> library(fgsea)
>
> data(examplePathways)
>
> # examplePathways是从'reactome.db'包中
# 提取的信息,并以列表的形式
> # 储存了这些信息,这些信息主要是小鼠的基因
> help("examplePathways")
>
> # 这个数据集是一个列表
> class(examplePathways)
[1] "list"
>
> # 一共有1,457 "通路"
> length(examplePathways)
[1] 1457
>
> # 第1列含有Meiotic_Synapsis pathway的EntrezID
> examplePathways[1]
$`1221633_Meiotic_Synapsis`
[1] "12189" "13006" "15077" "15078" "15270" "15512"
[7] "16905" "16906" "19357" "20842" "20843" "20957"
[13] "20962" "21749" "21750" "22196" "23856" "24061"
[19] "28113" "50878" "56739" "57321" "64009" "66654"
[25] "69386" "71846" "74075" "77053" "94244" "97114"
[31] "97122" "97908" "101185" "140557" "223697" "260423"
[37] "319148" "319149" "319150" "319151" "319152" "319153"
[43] "319154" "319155" "319156" "319157" "319158" "319159"
[49] "319160" "319161" "319565" "320332" "320558" "326619"
[55] "326620" "360198" "497652" "544973" "625328" "667250"
[61] "100041230" "102641229" "102641751" "102642045"
>
> # exampleRanks储存的是排序信息
> data(exampleRanks)
>
> # exampleRnak是一个数字型的向量
> class(exampleRanks)
[1] "numeric"
>
> # exampleRanks中的向量名称则是对称的Entrez ID
> head(exampleRanks)
170942 109711 18124 12775 72148 16010
-63.33703 -49.74779 -43.63878 -41.51889 -33.26039 -32.77626
> tail(unname(exampleRanks))
[1] 47.58235 49.87543 50.25179 50.86532 51.16110 53.28400

分析

现在我们使用上面的数据进行GSEA分析,进行GSEA分析时,我们的通路文件(也就是GSEA官网中的GMT文件,对应的就是基因集S)储存在pathways参数中,用户自己的数据(排过序的数据)放在stats数据集中,剩下的参数不用管,如下所示:

1
2
3
4
5
fgseaRes <- fgsea(pathways = examplePathways,
stats = exampleRanks,
minSize=15,
maxSize=500,
nperm=100000)

分析的结果fgseaRes是一个data.table格式的文件,使用plotEnrichment函数可以绘制出GSEA分析的结果,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
library(ggplot2)
# 参考 https://github.com/Rdatatable/data.table/wiki
# 来查看data.table相关的信息
class(fgseaRes)
# 查看p值小于0.01的通路的数目
sum(fgseaRes[, padj < 0.01])
# 绘制出最显著的富集通路
plotEnrichment(examplePathways[[head(fgseaRes[order(pval), ], 1)$pathway]],
exampleRanks) +
labs(title=head(fgseaRes[order(pval), ], 1)$pathway)

结果如下所示:

1
2
3
4
5
6
> class(fgseaRes)
[1] "data.table" "data.frame"
>
> # 查看p值小于0.01的通路的数目
> sum(fgseaRes[, padj < 0.01])
[1] 78

可以画出具体的某条通路,如下所示;

1
2
3
plotEnrichment(examplePathways[["5991130_Programmed_Cell_Death"]],
exampleRanks) +
labs(title="Programmed Cell Death")

还可以在一张图中绘制出前10个富集通路,以及后10个富集通路,一共20个,如下所示:

1
2
3
4
5
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(examplePathways[topPathways], exampleRanks, fgseaRes,
gseaParam = 0.5)

Reactome

也可以使用Reactome通路来进行GSEA分析,此时需要安装reactome.db包,这个包很大,600多M,下载的时间很长,安装过程如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# BiocManager::install("reactome.db")
library(reactome.db)
my_pathways <- reactomePathways(names(exampleRanks))
# Reactome pathways have a median of 11 genes
summary(sapply(my_pathways, length))
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 4.0 11.0 30.3 30.0 1140.0
fgsea_reactome <- fgsea(pathways = my_pathways,
stats = exampleRanks,
minSize=15,
maxSize=500,
nperm=100000)
head(fgsea_reactome[order(pval), ])
sum(fgsea_reactome[, padj < 0.01])

结果如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
> head(fgsea_reactome[order(pval), ])
pathway pval padj
1: Cell Cycle 1.212856e-05 0.0002571089
2: Cell Cycle, Mitotic 1.236552e-05 0.0002571089
3: Neutrophil degranulation 1.274974e-05 0.0002571089
4: Signaling by Rho GTPases 1.291339e-05 0.0002571089
5: M Phase 1.303509e-05 0.0002571089
6: Cell Cycle Checkpoints 1.350749e-05 0.0002571089
ES NES nMoreExtreme size
1: 0.5324037 2.674889 0 414
2: 0.5475346 2.719737 0 363
3: 0.4258088 2.074081 0 296
4: 0.4073481 1.966039 0 271
5: 0.5022189 2.408762 0 255
6: 0.6085986 2.836881 0 200
leadingEdge
1: 66336,66977,15366,12442,107995,66442,...
2: 66336,66977,15366,12442,107995,66442,...
3: 11676,14190,53381,12306,20430,12505,...
4: 66336,66977,20430,104215,233406,107995,...
5: 66336,66977,12442,107995,66442,52276,...
6: 66336,66977,12442,107995,66442,12428,...
> sum(fgsea_reactome[, padj < 0.01])
[1] 103

Leading edge

在GSEA分析中,我们通常会提取那些构成高得分的核心基因。我们对高得分的核心基因的定义就是,基因集S中位于排序基因列表L位置中的得分最大处之前或之后的那些基因集(也就是GSEA结果中绿色曲线最高点的前面或后面)。

前面我们注意到,GSEA分析的结果中有1列被命名为leadingEdge。这一列包含了高得分的基因。我们使用Reactome通路的富集结果来提取这些基因,如下所示:

1
2
3
4
5
6
7
8
9
10
11
# the most significant pathway
fgsea_reactome[order(pval),][1,]
# list of Entrez gene IDs that contributed to the enrichment score
fgsea_reactome[order(pval),][1,]$leadingEdge
# how many genes are in the leading edge?
length(fgsea_reactome[order(pval),][1,]$leadingEdge[[1]])
# how many genes are in the Cell Cycle pathway?
length(my_pathways[['Cell Cycle']])

结果如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
> # the most significant pathway
> fgsea_reactome[order(pval),][1,]
pathway pval padj ES
1: Cell Cycle 1.212856e-05 0.0002571089 0.5324037
NES nMoreExtreme size
1: 2.674889 0 414
leadingEdge
1: 66336,66977,15366,12442,107995,66442,...
>
> # list of Entrez gene IDs that contributed to the enrichment score
> fgsea_reactome[order(pval),][1,]$leadingEdge
[[1]]
[1] "66336" "66977" "15366" "12442" "107995"
[6] "66442" "12571" "12428" "52276" "54392"
[11] "66311" "215387" "67629" "12649" "72415"
[16] "56150" "57441" "20877" "67121" "12615"
[21] "11799" "66468" "67849" "19053" "73804"
[26] "76044" "20878" "15270" "13555" "60411"
[31] "12580" "17219" "69270" "12575" "69263"
[36] "12448" "14211" "20873" "18005" "72119"
[41] "71988" "12189" "17215" "12534" "66156"
[46] "208628" "237911" "22390" "68240" "228421"
[51] "68014" "269582" "19348" "12236" "72151"
[56] "18817" "21781" "18968" "217653" "66934"
[61] "272551" "227613" "67141" "67951" "68612"
[66] "68298" "108000" "23834" "106344" "56452"
[71] "242705" "18141" "223921" "26886" "13557"
[76] "26909" "72787" "268697" "72155" "56371"
[81] "17535" "107823" "12531" "327762" "12567"
[86] "229841" "67052" "16319" "66634" "171567"
[91] "26931" "67203" "12235" "19891" "74470"
[96] "72083" "381318" "66570" "17216" "76308"
[101] "19687" "17218" "102920" "29870" "18973"
[106] "16881" "17463" "75786" "19645" "19075"
[111] "26417" "69736" "19357" "76816" "70385"
[116] "70645" "22628" "225182" "22627" "52683"
[121] "19076" "18972" "231863" "26932" "12544"
[126] "17997" "51788" "26440" "68549" "12445"
[131] "19088" "269113" "26444" "19324" "103733"
[136] "59001" "107976" "19179" "12579" "232987"
[141] "17420" "228769" "219072" "26445" "105988"
[146] "69745" "18538" "69928" "11651" "235559"
[151] "68097" "57296" "63955" "14235" "19170"
[156] "17246" "17220" "12144" "50793" "77605"
[161] "18392" "236930" "67151" "70024" "59126"
[166] "66296" "16906" "109145" "71819" "67733"
[171] "50883" "12447" "12532" "14156" "26442"
[176] "19177" "230376" "245688"
>
> # how many genes are in the leading edge?
> length(fgsea_reactome[order(pval),][1,]$leadingEdge[[1]])
[1] 178
>
> # how many genes are in the Cell Cycle pathway?
> length(my_pathways[['Cell Cycle']])
[1] 414

总结

GSEA是于2005年首次提出来的,现在已经成了基因表达分析中的常规分析手段,它不同于GO分析,GO分析只关注差异基因,而GSEA分析则关注所有的基因。fgsea包可以使用预先排列好的基因一R中进行GSEA分析。p值的计算结果是基于置换检验(permutation test),这种方法并不是十分精我,因为它忽略了基因之间的相关性,有可能会导致假阳性。但是,在这种方法在研究上调与上调基因方面还是很有用的。即使你计算出的GSEA结果中,p值大于0.05,但是是也可以参考leading edge基因集,为你的实验进行指导。

案例分析

作者提供了用于生成类似于exampleRanks文件的R脚本,不过使用的GEO的数据,平时自己利用fgsea包进行GSEA分析时,生成就好,现在看一下如何将GEO的数据生成类似于exampleRanks文件的排序信息,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
source("https://raw.githubusercontent.com/assaron/r-utils/master/R/exprs.R")
library(GEOquery)
library(limma)
gse14308 <- getGEO("GSE14308")[[1]]
pData(gse14308)$condition <- sub("-.*$", "", gse14308$title)
es <- collapseBy(gse14308, fData(gse14308)$ENTREZ_GENE_ID, FUN=median)
es <- es[!grepl("///", rownames(es)), ]
es <- es[rownames(es) != "", ]
exprs(es) <- normalizeBetweenArrays(log2(exprs(es)+1), method="quantile")
es.design <- model.matrix(~0+condition, data=pData(es))
fit <- lmFit(es, es.design)
fit2 <- contrasts.fit(fit, makeContrasts(conditionTh1-conditionNaive,
levels=es.design))
fit2 <- eBayes(fit2)
de <- data.table(topTable(fit2, adjust.method="BH", number=12000, sort.by = "B"), keep.rownames = T)
ranks <- de[order(t), list(rn, t)]
head(ranks)
tail(ranks)

结果如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
> head(ranks)
rn t
1: 109711 -50.99316
2: 18124 -44.40559
3: 12775 -43.22109
4: 72148 -33.74441
5: 16010 -33.34034
6: 16206 -30.67962
> tail(ranks)
rn t
1: 80901 47.55618
2: 58801 48.53756
3: 15937 49.88068
4: 13730 50.17693
5: 12772 50.32302
6: 80876 51.72987

现在绘制一下上述数据中6个上升与6个下降的基因热图:

1
2
3
4
5
6
7
8
9
10
11
12
library(pheatmap)
my_group <- data.frame(group = pData(es)$condition)
row.names(my_group) <- colnames(exprs(es))
pheatmap(
mat = es[c(head(de[order(t), 1])$rn, tail(de[order(t), 1])$rn),],
annotation_col = my_group,
cluster_rows = FALSE,
cellwidth=25,
cellheight=15
)

参考资料

  1. Using fgsea package
  2. Comparison of clusterProfiler and GSEA-P
  3. Using the fast preranked gene set enrichment analysis (fgsea) package