转录组学习笔记07-差异基因分析

前言

利用DEseq2包进行分析不同组之间的差异。都是参考徐洲更.(伪)从零开始学转录组(7):差异基因表达分析的。

导入数据，构建 DESeq2 所需的 DESeqDataSet 对象，如下所示：

library(DESeq2)
countData <- raw_count_filt[,2:5]
condition <- factor(c("control","KD","KD","control"))
dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), design= ~ condition )

过滤掉低low count

接着，过滤掉一些low count数据（通常是count为0的read），节省内存，提高运行速度，如下所示：

1
2
3

nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)

使用DESeq进行差异表达分析：

DESeq包含三步，estimation of size factors（estimateSizeFactors)， estimation of dispersion（estimateDispersons)， Negative Binomial GLM fitting and Wald statistics（nbinomWaldTest），可以分布运行，也可用一步到位，最后返回 results可用的DESeqDataSet对象，如下所示：

1	dds <- DESeq(dds)

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

出现上述结果表示运行结束，过程大概是十几秒。
可以通过result来查看运行的结果，如下所示：

1	res <- results(dds)

可用mcols查看每一项结果的具体含义，比如说log2FoldChange 表示倍数变化取log2结果，还能画个火山图。padj 就是用BH对多重试验进行矫正。如下所示：

1	mcols(res, use.names = TRUE)

用summary可以看描述性的结果，如下所示：

绘制MA图

An MA plot is an application of a Bland–Altman plot for visual representation of genomic data. The plot visualises the differences between measurements taken in two samples, by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values. Though originally applied in the context of two channel DNA microarray gene expression data, MA plots are also used to visualise high-throughput sequencing analysis —From wikipeida

M表示log fold change，衡量基因表达量变化，上调还是下调。A表示每个基因的count的均值。根据summary可知，low count的比率很高，所以大部分基因表达量不高，也就是集中在0的附近（log2(1)=0，也就是变化1倍）,提供了模型预测系数的分布总览。

plotMA(res, ylim = c(-5,5))
topGene <- rownames(res)[which.min(res$padj)]
with(res[topGene, ], {
  points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
  text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})

如下所示：

如果经过lfcShrink 收缩log2 fold change，结果会好看很多

res.shrink <- lfcShrink(dds, contrast = c("condition","KD","control"), res=res)
plotMA(res.shrink, ylim = c(-5,5))
topGene <- rownames(res)[which.min(res$padj)]
with(res[topGene, ], {
  points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
  text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})

如下所示：

寻找差异基因

提取差异基因

1	res.deseq2 <- subset(res, padj < 0.05)

使用edgeR进行差异基因分析

edgeR使用DGEList函数读取count matrix数据，也就说你需要提供一个现成的matrix数据，而不是指望它能读取单独的文件，然后进行合并，其实可以用 tximport 或 DESeqDataSetFromHTSeq 读取单独的文件，然后传递给DGEList。

第一步：构建DGEList对象

1
2
3

library(edgeR)
group <- factor(c("control","KD","KD","control"))
genelist <- DGEList(counts=raw_count_filt[,2:5], group = group)

第二步：过滤 low counts数据。

与DESeq2的预过滤不同，DESeq2的预过滤只是为了改善后续运算性能，在运行过程中依旧会自动处理low count数据，edgeR需要在分析前就要排除那些low count数据，而且非常严格。从生物学角度，有生物学意义的基因的表达量必须高于某一个阈值。从统计学角度上， low count的数据不太可能有显著性差异，而且在多重试验矫正阶段还会拖后腿。综上所诉，放心大胆的过滤吧。

根据经验，基因至少在某一些文库的count超过10 ~ 15 才被认为是表达。这一步全靠尝试，剔除太多就缓缓，剔除太少就严格点。我们可以简单的对每个基因的raw count进行比较，但是建议用CPM（count-per-million)标准化后再比较，避免了文库大小的影响。

# 简单粗暴的方法
keep <- rowSums(genelist$count) > 50
# 利用CPM标准化
keep <- rowSums(cpm(genelist) > 0.5 ) >=2
table(keep)
genelist.filted <- genelist[keep,,keep.lib.sizes=FALSE]

这里的0.5(即阈值）等于 10/(最小的文库的 read count数 /1000000)，keep.lib.size=FALSE表示重新计算文库大小。

第三步：根据组成偏好(composition bias)标准化。edgeR的calcNormFactors函数使用TMM算法对DGEList标准化.

1	genelist.norm <- calcNormFactors(genelist.filted)

注大部分的mRNA-Seq数据分析用TMM标准化就行了，但是也有例外，比如说single-cell RNA-Seq(Lun, Bach, and Marioni 2016), 还有就是global differential expression，基因组一半以上的基因都是差异表达的，请尽力避免，(D. Wu et al. 2013)，不然就需要用到内参进行标准化了(Risso et al. 2014).

第四步：实验设计矩阵(Design matrix)，类似于DESeq2中的design参数。 edgeR的线性模型和差异表达分析需要定义一个实验设计矩阵。很直白的就能发现是1vs0

1 2	design <- model.matrix(~0+group) colnames(design) <- levels(group)

第五步：估计离散值（Dispersion）。前面已经提到负二项分布（negative binomial，NB)需要均值和离散值两个参数。edgeR对每个基因都估测一个经验贝叶斯稳健离散值（mpirical Bayes moderated dispersion），还有一个公共离散值（common dispersion，所有基因的经验贝叶斯稳健离散值的均值）以及一个趋势离散值

1 2	genelist.Disp <- estimateDisp(genelist.norm, design, robust = TRUE) plotBCV(genelist.Disp)

还可以进一步通过quasi-likelihood (QL)拟合NB模型，用于解释生物学和技术性导致的基因特异性变异 (Lund et al. 2012; Lun, Chen, and Smyth 2016).

1 2	fit <- glmQLFit(genelist.Disp, design, robust=TRUE) head(fit$coefficients)

第六步: 差异表达检验（1）。这一步主要构建比较矩阵，类似于DESeq2中的results函数的 contrast 参数。

1
2
3

cntr.vs.KD <- makeContrasts(control-KD, levels=design)
res <- glmQLFTest(fit, contrast=cntr.vs.KD)
ig.edger <- res$table[p.adjust(res$table$PValue, method = "BH") < 0.01, ]

这里用的是glmQLFTest而不是glmLRT是因为前面用了glmQLTFit进行拟合，所以需要用QL F-test进行检验。如果前面用的是glmFit，那么对应的就是glmLRT. 作者称QL F-test更加严格。多重试验矫正用的也是BH方法。

后续就是提取显著性差异的基因用作下游分析，做一些图看看

topTags(res,n=10)
is.de <- decideTestsDGE(res)
summary(is.de)
plotMD(res, status=is.de, values=c(1,-1), col=c("red","blue"),legend="topright")

如下所示：

第六步：差异表达检验（2）。上面找到的显著性差异的基因，没有考虑效应值，也就是具体变化了多少倍。我们也可用找表达量变化比较大的基因，对应的函数是 glmTreat。

使用limma进行差异分析

导入read count，保存为专门的对象用于后续分析
原始数据过滤，根据标准化read count 或者 raw count 作为筛选标准
raw read count 标准化
通过各种算法（如经验贝叶斯，EM）预测dispersion离散值
广义线性模型拟合数据
差异分析，也就是统计检验部分

Limma原先用于处理基因表达芯片数据。如果你仔细看edgeR导入界面，你就会发现，edgeR有一部分功能依赖于limma包。Limma采用经验贝叶斯模型（ Empirical Bayesian model）让结果更稳健。

在处理RNA-Seq数据时，raw read count先被转成log2-counts-per-million (logCPM)，然后对mean-variance关系建模。建模有两种方法：

精确权重法（precision weights）也就是“voom”
经验贝叶斯先验趋势（empirical Bayes prior trend），也就是”limma-trend“

数据预处理： Limma使用edgeR的DGEList对象，并且过滤方法都是一致的，对应edgeR的第一步,第二步，第三步

library(edgeR)library(limma)
group <- factor(c("control","KD","KD","control"))
genelist <- DGEList(counts=raw_count_filt[,2:5], group = group)
### filter base  use CPM
keep <- rowSums(cpm(genelist) > 0.5 ) >=2table(keep)
genelist.filted <- genelist[keep, ,keep.lib.sizes=FALSE]
### normalizaition
x <- calcNormFactors(x, method = "TMM")

差异表达分析: 使用”limma-trend”

design <- model.matrix(~0+group)
colnames(design) <- levels(group)
logCPM <- cpm(genelist.norm, log=TRUE, prior.count=3)
fit <- lmFit(logCPM, design)
fit <- eBayes(fit, trend=TRUE)
topTable(fit, coef=ncol(design))

差异表达分析: 使用”limma-voom“

### DGE with voom
v <- voom(genelist.norm, design, plot=TRUE)
#v <- voom(counts, design, plot=TRUE)
fit <- lmFit(v, design)
fit <- eBayes(fit)
all <- topTable(fit, coef=ncol(design), number=10000)
sig.limma <- all[all$adj.P.Val < 0.01, ]
fit <- treat(fit, lfc=log2(1.2))
topTreat(fit, coef=ncol(design))

前言

过滤掉低low count

使用DESeq进行差异表达分析：

绘制MA图

寻找差异基因

使用edgeR进行差异基因分析

第一步： 构建DGEList对象

第二步： 过滤 low counts数据。

使用limma进行差异分析

第一步：构建DGEList对象

第二步：过滤 low counts数据。