GEO笔记

前言

此篇笔记主要是学习生信技术树Jimmy在B站上的视频教程。

背景知识

GEO的全称是Gene Expression Omnibus，数据库地址是：https://www.ncbi.nlm.nih.gov/geo/。

GEO是一个公共的基因组数据库，含有大量的芯片以及二代测序数据。在挖掘GEO上的数据集时，常规流程就是选择GSE号，下载表达矩阵，得到基因集，使用数据库进行注释，PPI等网络等。

GEO里面的数据前缀有4种，如下所示：

GEO Platform (GPL)
GEO Sample (GSM)
GEO Series (GSE)
GEO Dataset (GDS)

其中GSE开头的是表达矩阵，GPL开头的是测序的平台信息。一篇文章可以有一个或者多个GSE数据集，一个GSE里面可以有一个或者多个GSM样本。多个研究的GSM样本可以根据研究目的整合为一个GDS，不过GDS本身用的很少。而每个数据集都有着自己对应的芯片平台，就是GPL。

下载数据

这个数据已经被其他人挖掘过发了文章，文章信息如下所示：

Wang Q L , Chen X , Zhang M H , et al. Identification of hub genes and pathways associated with retinoblastoma based on co-expression network analysis[J]. Genetics and Molecular Research, 2015, 14(4):16151-16161.

现在去GEO官网下载数据：

代码如下所示：

rm(list = ls())
setwd("D:\\GEO_test")
library(GEOquery)
eSet <- getGEO('GSE42872', destdir=".", 
                 AnnotGPL = F,
                 getGPL = F)
save(eSet, file='GSE42872_eSet.Rdata')

这里用到了getGEO函数，这个函数的参数如下所示：

destdir：下载的地址，使用.则下载到当前的工作目录；
AnnotGPL：是否下载注释的信息。
getGPL：若为TRUE，则下载GPL注释文件；若为FALSE，则不下载。默认为TRUE。
GSEMatrix：若为TRUE，则下载Matrix文件；若为FALSE，则下载SOFT文件。默认为TRUE。

随后，D:\GEO_test目录下就会出现下载后的表达矩阵，名称为GSE42872_series_matrix.txt.gz。

GSE24673

下载后的这个eSet文件信息如下所示：

> eSet
$GSE24673_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 28869 features, 11 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM607938 GSM607939 ... GSM607948 (11 total)
  varLabels: title geo_accession ... tissue:ch1 (40 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: GPL6244 
> class(eSet)
[1] "list"
> eSet[[1]]
ExpressionSet (storageMode: lockedEnvironment)
assayData: 28869 features, 11 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM607938 GSM607939 ... GSM607948 (11 total)
  varLabels: title geo_accession ... tissue:ch1 (40 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: GPL6244

从上面信息可以知道：

eSet是一个列表(list)；
eSet这个列表只有一个元素。

获取表达矩阵

提取表达矩阵

使用exprs函数可以获取表达矩阵，如下所示：

1 2	raw_profile_data <- eSet[[1]] exprSet = exprs(raw_profile_data)

表达矩阵如下所示：

获取样本信息

使用sampleNames函数可以获取样本信息，如下所示：

1	sampleNames(raw_profile_data)

运行后如下所示：

1
2
3

> sampleNames(raw_profile_data)
 [1] "GSM607938" "GSM607939" "GSM607940" "GSM607941" "GSM607942" "GSM607943" "GSM607944" "GSM607945"
 [9] "GSM607946" "GSM607947" "GSM607948"

使用pData可以获取分组信息，如下所示：

1	pData <- pData(raw_profile_data)

运行如下所示：

从title这一列就可以知道，里面有几个是干预组，有几个是健康组。

表达矩阵过滤

这一部分主要是将探针名称转化为基因名。

转换探针名

表达矩阵的列的名称是探针名，如下所示：

现在将探针名转换为基因名，从前面的eSet数据集中可以获取以下信息：

> eSet
$GSE24673_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 28869 features, 11 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM607938 GSM607939 ... GSM607948 (11 total)
  varLabels: title geo_accession ... tissue:ch1 (40 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: GPL6244

从Annotation: GPL6244这行可以知道，这个数据集是用GPL6244这个平台测的，同时在GEO的官网也能查到这个信息，如下所示：

现有的一些常用平台都对应的有相应的R包，如下所示：

	gpl	organism	bioc_package
1	GPL32	Mus musculus	mgu74a
2	GPL33	Mus musculus	mgu74b
3	GPL34	Mus musculus	mgu74c
6	GPL74	Homo sapiens	hcg110
7	GPL75	Mus musculus	mu11ksuba
8	GPL76	Mus musculus	mu11ksubb
9	GPL77	Mus musculus	mu19ksuba
10	GPL78	Mus musculus	mu19ksubb
11	GPL79	Mus musculus	mu19ksubc
12	GPL80	Homo sapiens	hu6800
13	GPL81	Mus musculus	mgu74av2
14	GPL82	Mus musculus	mgu74bv2
15	GPL83	Mus musculus	mgu74cv2
16	GPL85	Rattus norvegicus	rgu34a
17	GPL86	Rattus norvegicus	rgu34b
18	GPL87	Rattus norvegicus	rgu34c
19	GPL88	Rattus norvegicus	rnu34
20	GPL89	Rattus norvegicus	rtu34
22	GPL91	Homo sapiens	hgu95av2
23	GPL92	Homo sapiens	hgu95b
24	GPL93	Homo sapiens	hgu95c
25	GPL94	Homo sapiens	hgu95d
26	GPL95	Homo sapiens	hgu95e
27	GPL96	Homo sapiens	hgu133a
28	GPL97	Homo sapiens	hgu133b
29	GPL98	Homo sapiens	hu35ksuba
30	GPL99	Homo sapiens	hu35ksubb
31	GPL100	Homo sapiens	hu35ksubc
32	GPL101	Homo sapiens	hu35ksubd
36	GPL201	Homo sapiens	hgfocus
37	GPL339	Mus musculus	moe430a
38	GPL340	Mus musculus	mouse4302
39	GPL341	Rattus norvegicus	rae230a
40	GPL342	Rattus norvegicus	rae230b
41	GPL570	Homo sapiens	hgu133plus2
42	GPL571	Homo sapiens	hgu133a2
43	GPL886	Homo sapiens	hgug4111a
44	GPL887	Homo sapiens	hgug4110b
45	GPL1261	Mus musculus	mouse430a2
49	GPL1352	Homo sapiens	u133x3p
50	GPL1355	Rattus norvegicus	rat2302
51	GPL1708	Homo sapiens	hgug4112a
54	GPL2891	Homo sapiens	h20kcod
55	GPL2898	Rattus norvegicu	adme16cod
60	GPL3921	Homo sapiens	hthgu133a
63	GPL4191	Homo sapiens	h10kcod
64	GPL5689	Homo sapiens	hgug4100a
65	GPL6097	Homo sapiens	illuminaHumanv1
66	GPL6102	Homo sapiens	illuminaHumanv2
67	GPL6244	Homo sapiens	hugene10sttranscriptcluster
68	GPL6947	Homo sapiens	illuminaHumanv3
69	GPL8300	Homo sapiens	hgu95av2
70	GPL8490	Homo sapiens	IlluminaHumanMethylation27k
71	GPL10558	Homo sapiens	illuminaHumanv4
72	GPL11532	Homo sapiens	hugene11sttranscriptcluster
73	GPL13497	Homo sapiens	HsAgilentDesign026652
74	GPL13534	Homo sapiens	IlluminaHumanMethylation450k
75	GPL13667	Homo sapiens	hgu219
76	GPL15380	Homo sapiens	GGHumanMethCancerPanelv1
77	GPL15396	Homo sapiens	hthgu133b
78	GPL17897	Homo sapiens	hthgu133a

如果是比较少见的平台，那么可以通过下面的方式来下载相应的平台探针信息：

1	# gpl <- getGEO('GPL6244',destdir = ".")

只是这种方式下载起来比较慢。

此时将探针转换为基因名，我们前面提到，这个数据使用的是GPL6244平台来测的（上表第67行），它对应的R包是hugene10sttranscriptcluster，此时安装这个包并加载即可：

1
2
3

biocLite("hugene10sttranscriptcluster.db")
library(hugene10sttranscriptcluster.db)
ls('package:hugene10sttranscriptcluster.db')

运行结果如下所示：

> ls('package:hugene10sttranscriptcluster.db')
 [1] "hugene10sttranscriptcluster"              "hugene10sttranscriptcluster.db"          
 [3] "hugene10sttranscriptcluster_dbconn"       "hugene10sttranscriptcluster_dbfile"      
 [5] "hugene10sttranscriptcluster_dbInfo"       "hugene10sttranscriptcluster_dbschema"    
 [7] "hugene10sttranscriptclusterACCNUM"        "hugene10sttranscriptclusterALIAS2PROBE"  
 [9] "hugene10sttranscriptclusterCHR"           "hugene10sttranscriptclusterCHRLENGTHS"   
[11] "hugene10sttranscriptclusterCHRLOC"        "hugene10sttranscriptclusterCHRLOCEND"    
[13] "hugene10sttranscriptclusterENSEMBL"       "hugene10sttranscriptclusterENSEMBL2PROBE"
[15] "hugene10sttranscriptclusterENTREZID"      "hugene10sttranscriptclusterENZYME"       
[17] "hugene10sttranscriptclusterENZYME2PROBE"  "hugene10sttranscriptclusterGENENAME"     
[19] "hugene10sttranscriptclusterGO"            "hugene10sttranscriptclusterGO2ALLPROBES" 
[21] "hugene10sttranscriptclusterGO2PROBE"      "hugene10sttranscriptclusterMAP"          
[23] "hugene10sttranscriptclusterMAPCOUNTS"     "hugene10sttranscriptclusterOMIM"         
[25] "hugene10sttranscriptclusterORGANISM"      "hugene10sttranscriptclusterORGPKG"       
[27] "hugene10sttranscriptclusterPATH"          "hugene10sttranscriptclusterPATH2PROBE"   
[29] "hugene10sttranscriptclusterPFAM"          "hugene10sttranscriptclusterPMID"         
[31] "hugene10sttranscriptclusterPMID2PROBE"    "hugene10sttranscriptclusterPROSITE"      
[33] "hugene10sttranscriptclusterREFSEQ"        "hugene10sttranscriptclusterSYMBOL"       
[35] "hugene10sttranscriptclusterUNIGENE"       "hugene10sttranscriptclusterUNIPROT"

这是一个R包的注释文件，它提供的是hugene10sttranscriptcluster这个芯片平台的详细信息，每两个月更新一次。这个包中含有的很多数据，上表已经列出来了，我们将探针转换为基因要使用的是文件是hugene10sttranscriptclusterSYMBOL这个，它的功能在于：将探针编号转换为基因名。

同理，hugene10sttranscriptclusterUNIPROT是将探针编号转换为蛋白名，

同时，hugene10sttranscriptclusterREFSEQ是将探针编号转换为转录组名，转换时使用的函数是toTable()函数，如下所示：

probe_trans_SYMBOL <- toTable(hugene10sttranscriptclusterSYMBOL)
probe_trans_UNIPROT <- toTable(hugene10sttranscriptclusterUNIPROT)
probe_trans_REFSEQ <- toTable(hugene10sttranscriptclusterREFSEQ)
head(probe_trans_SYMBOL)
head(probe_trans_UNIPROT)
head(probe_trans_REFSEQ)

运行结果为：

> head(probe_trans_SYMBOL)
  probe_id    symbol
1  7896759 LINC01128
2  7896761    SAMD11
3  7896779    KLHL17
4  7896798   PLEKHN1
5  7896817     ISG15
6  7896822      AGRN
> head(probe_trans_UNIPROT)
  probe_id uniprot_id
1  7896761     Q96NU1
2  7896779     Q6TDP4
3  7896798     Q494U1
4  7896817     P05161
5  7896822     O00468
6  7896865     Q6ZVT0
> head(probe_trans_REFSEQ)
  probe_id accession
1  7896759 NR_015368
2  7896759 NR_047519
3  7896759 NR_047520
4  7896759 NR_047521
5  7896759 NR_047522
6  7896759 NR_047523

我们这里只使用基因SYMBOL即可，也就是probe_trans_SYMBOL <- toTable(hugene10sttranscriptclusterSYMBOL)这个语句。

由于一个基因对应好几个探针，因此转换后的基因名必然是小于探针名的，如下所示：

1 2	length(unique(probe_trans_SYMBOL$symbol)) length(unique(probe_trans_SYMBOL$probe_id))

运行结果如下所示：

> length(unique(probe_trans_SYMBOL$symbol))
[1] 18834
> length(unique(probe_trans_SYMBOL$probe_id))
[1] 19827

从上面可以看出来，基因SYMBOL有18834个，探针有19827个。

现在我们将下载的表达谱中的探针转换为基因SYMBOL，先来看下表达谱，如下所示：

现在的任务就是将左侧的探针编号转换为基因SYMBOL。

此时有两步操作：

第一步就是我们先把表达谱中能够在平台芯片注释文件中找到基因SYMBOL给挑出来，因为表达谱中的有的探针名没有对应的基因SYMBOL，不要这种探针的数据，筛选过程如下所示：

table(rownames(exprSet) %in% probe_trans_SYMBOL$probe_id)
dim(exprSet)
exprSet_trans_symbol <- exprSet[rownames(exprSet) %in% probe_trans_SYMBOL$probe_id,]
dim(exprSet_trans_symbol)

运行结果如下所示：

> dim(exprSet)
[1] 28869    11
> exprSet_trans_symbol <- exprSet[rownames(exprSet) %in% probe_trans_SYMBOL$probe_id,]
> dim(exprSet_trans_symbol)
[1] 19631    11

从结果中可以看出来，表达谱中有28869行，第1列就是探针名，探针中有19631个能找到相应的基因SYMBOL，因此表达谱经过第一步操作，就从原来的28869行缩小到了19631行，再强调一下这一步的目的，就是筛选表达谱数据，将那些表达谱中有探针号，但没有对应基因SYMBOL的数据给剔除掉。

第二步：将表达谱中的这些探针号与基因名对应起来，如下所示：

1	probe_trans_SYMBOL_filter <- probe_trans_SYMBOL[match(rownames(exprSet_trans_symbol), probe_trans_SYMBOL$probe_id),]

这里用到了match()函数，简单看一个案例：

1
2
3

A<-1:10
B<-seq(5,15,2)
match(A,B) #遍历A中的元素，看A的元素是否在B里面出现过，如果A中的元素在B中出现，就返回此元素在B中的索引号

运行结果如下所示：

1 2	> match(A,B) #遍历A中的元素，看A的元素是否在B里面出现过，如果A中的元素在B中出现，就返回此元素在B中的索引号 [1] NA NA NA NA 1 NA 2 NA 3 NA

因此上面代码的目的在于：将表达谱中的探针位置与转换后的探针文件probe_trans_SYMBOL中的位置对应起来，先看一下这两个文件的长度：

> length(rownames(exprSet_trans_symbol))
[1] 19631
> length(probe_trans_SYMBOL$probe_id)
[1] 19827
> length(probe_trans_SYMBOL_filter$symbol)
[1] 19631

经过前面的match()函数操作，我们就可以看到，探针的位置已经与基因名一一对应了，如下所示：

左侧是探针对应的文件，右侧是表达谱数据，由于现在这两个文件的探针与基因的位置都是一一对应的，使用merge()函数或cbind()都可以，由于名称对应，直接改名也行，如下所示：

1 2	rownames(exprSet_trans_symbol) <- probe_trans_SYMBOL_filter$symbol str(exprSet_trans_symbol)

运行结果如下所示：

> str(exprSet_trans_symbol)
 num [1:19631, 1:11] 7.27 8.05 7.47 7.52 7.08 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:19631] "LINC01128" "SAMD11" "KLHL17" "PLEKHN1" ...
  ..$ : chr [1:11] "GSM607938" "GSM607939" "GSM607940" "GSM607941" ...

此时，这个表达谱是有19631行数据，但是，由于芯片的设计，一个基因名有可能对应好几个探针编号，查看一下：

1 2	> length(unique(rownames(exprSet_trans_symbol))) [1] 18775

从这个结果可以看出来，其实基因一共是18775个，比19631要少，因此我们还要继续剔除掉一些相同的基因名，此时的问题就是，如果有3行的基因相同，要剔掉哪个？按教程中的标准，剔除标准是：只保留在各个样本中均值最大的那个基因名，这个处理过程的示意图如下所示：

在上表中，一共有4个样本，6个基因，其中Gene_5有3个重复，Gene_6有4个重复，因此我们要剔除掉那些重复的基因名，假设我们按照这2个基因在不同样本中的最高均值这个条件来筛选，那么Gene_5只需要保留第1行，Gene_6只需要保留第3行就行，代码如下所示：

exprSet_trans_symbol2 <- as.data.frame(exprSet_trans_symbol)
exprSet_trans_symbol2$names <-rownames(exprSet_trans_symbol)
result <- exprSet_trans_symbol2 %>% group_by(names) %>% summarise_all(max)
result2 <- as.matrix(result[-1])
rownames(result2) <- result$names
final_data <- result2
boxplot(final_data)

现在提取分组信息，在开头的代码中，我们把分组信息储存在了pData变量中，如下所示：

library(stringr)
group_sample <- pData$title
group_clean <- str_sub(group_sample,1,6)
group_clean[10:11] <- rep("Health",2)

有了分组信息后，把原始表达谱再整理一下，如下所示：

library(reshape2)
exprSet_L <- melt(final_data)
colnames(exprSet_L) <- c("probe", "Sample", "value")
exprSet_L$group <- rep(group_clean, each=nrow(final_data))

绘图：

library(ggplot2)
p <- ggplot(exprSet_L, aes(x=Sample, y=value, fill=group))+geom_boxplot()
p <- p + theme(axis.text.x = element_text(angle = 45, hjust = 0.5, vjust = 0.5))
p

参考资料

解读GEO数据存放规律及下载，一文就够