轉(zhuǎn)錄組全記錄:從rawdata到DEG功能富集

寫在前面:
本文是新手學(xué)習(xí)全記錄,僅供參考,歡迎交流!

1. rawdata

1.1 QC

fastqc -o out_path filename
multiqc ./

1.2 數(shù)據(jù)量統(tǒng)計(jì)

調(diào)用工具:iTools
使用:

iTools Fqtools stat -InFq <1.fq> -InFq <2.fq>  -OutStat <info.out>

2. cleandata

2.1 fastp過濾

fastp
-q 15 #期望堿基質(zhì)量值
-u 40 #去除高于40%低質(zhì)量堿基數(shù)量的reads
-n 5 #去除多于5個(gè)N堿基的reads
-l 80 #去除總長(zhǎng)度低于80個(gè)堿基的reads
-i read1 -o read1_clean
-I read2 -O read2_clean

2.2 fastqc情況

3. alignment

3.1 hisat2

1)建index
2)比對(duì)
參數(shù):

hisat2 -t -x -1 -2 -S

3)sam2bam, sort

samtools view -S -b in.sam > out.bam
samtools sort -n in.bam out.prefix

3.2 RNA-seq specific QC

用到了qualimap,檢測(cè)reads的基因組來源:intron exon intergenic等

qualimap rnaseq 
-bam  
-gtf 
-pe  -s --java-mem-size=8G

4.差異表達(dá)分析

說明:此處兩套流程:stringtie+ballgown和featureCount+DESeq2都可以,目前主流是fc+DESeq2,但stringtie可做轉(zhuǎn)錄本水平的差異。

5. StringTie

5.1 組裝轉(zhuǎn)錄本

stringtie in.bam -G ref.gtf -o out.gtf -p 8 

StringTie運(yùn)行速度很快,30min/樣品

5.2 merge 轉(zhuǎn)錄本

首先創(chuàng)建mergelist.txt

ls *sample.gtf >mergelist.txt

開始merge,運(yùn)行時(shí)間5min

stringtie --merge -p 8 
-G 
-o   #只有這一個(gè)輸出文件
mergelist.txt

5.3 檢測(cè)轉(zhuǎn)錄本的組裝情況

gffcompare -r gtf 
-G -o merged 
stringtie_merged.gtf

結(jié)果文件:

  1. merged.stringtie_merged.gtf.refmap
  2. merged.stringtie_merged.gtf.tmap

5.4 重新組裝轉(zhuǎn)錄本

stringtie -e # 僅定量-G中存在的轉(zhuǎn)錄本
-p 8 
sample_sorted.bam
-G stringtie_merge.gtf #使用上一步merge的gtf
-o sample.gtf
-b file_for_ballgown/sample/ #具體到每個(gè)樣品的文件夾,或 -B /dir/sample.gtf
-A file_for_abundance/sample.txt #要具體到file名字

生成的文件直接用于下一步的ballgown分析。

6. ballwn差異表達(dá)基因

問題:差異基因分不清是上調(diào)還是下調(diào);

biostar論壇:
ballgown和FPKM不適合differential expression analysis;
用DESeq2 for DE analysis,用Ballgown看基因表達(dá)水平FPKM。
FPKM被認(rèn)為是inferior for sample comparisions;
well-maintained gene level solutions: DESeq2 or edgeR.

腳本:

setwd( )

library(ballgown)
library(RSkittleBrewer)
library(genefilter)
library(dplyr)
library(devtools)

pheno_data <- read.csv("geuvadis_phenodata.csv",sep = ",", header = T) #表型數(shù)據(jù)
bg <- ballgown(dataDir = , samplePattern = "sample", pData=pheno_data) #dataDir是數(shù)據(jù)的地方
bg_filter <- subset(bg, "rowVars(texpr(bg))>1", genomesubset=TRUE) #過濾低豐度基因,濾掉了樣本間差異少于一個(gè)轉(zhuǎn)錄本的數(shù)據(jù)
########################確認(rèn)組間差異###########################
result_tran <- stattest(bg_filter, feature = "transcript", covariate ="stage", adjustvars = c("idv"), getFC=TRUE,meas="FPKM") #組間有差異的轉(zhuǎn)錄本
result_gene <- stattest(bg_filter, feature = "gene", covariate ="stage", adjustvars = c("idv"), getFC=TRUE,meas="FPKM") #組間有差異的基因
result_tran <- data.frame(geneNames=ballgown::geneNames(bg_filter), geneIDs = ballgown::geneIDs(bg_filter), result_tran) #為trans添加基因名

indices <- match(result_gene$id, texpr(bg, 'all')$gene_id) #為gene加基因名,依據(jù)bg
gene_names_for_result <- texpr(bg, 'all')$gene_name[indices]
result_gene <- data.frame(geneNames=gene_names_for_result, result_gene)

result_tran <- arrange(result_tran, pval) #排序
result_gene <- arrange(result_gene, pval)
write.csv(result_tran, "result_tran.csv",row.names=FALSE) #保存
write.csv(result_gene, "result_gene.csv",row.names=FALSE)

indices <- match(result_gene$id, texpr(bg_filter, 'all')$gene_id) #為DEG加基因名,bg_filter
gene_names_for_result <- texpr(bg_filter, 'all')$gene_name[indices]
result_gname <- data.frame(geneNames=gene_names_for_result, result_gene)
write.csv(result_gname, "result_gname_filter.csv",row.names=FALSE)
#########################確定DEG/DET##########################
result_DET <- subset(result_tran, result_tran$qval<0.05) #篩選出q值小于0.05的,即差異
 result_DEG <- subset(result_gene, result_gene$qval<0.05)
write.csv(result_DET,"DET.csv",row.names=FALSE)
write.csv(result_DEG,"DEG.csv",row.names=FALSE)
 ########################畫FPKM圖#############################
 tropical <- c('darkorange', 'dodgerblue','hotpink', 'limegreen', 'yellow')
palette(tropical)
fpkm <- texpr(bg, meas = "FPKM")
fpkm <- log2(fpkm+1)
pdf("FPKM.pdf")
boxplot(fpkm, col=as.numeric(pheno_data$stage), las=2, ylab='log2(FPKM+1)')
dev.off()

###################單個(gè)轉(zhuǎn)錄本的樣品分布箱線圖#####################
#查看單個(gè)轉(zhuǎn)錄本在樣品中的分布
ballgown::transcriptNames(bg)[12]
    12
    "NM_012227"
ballgown::geneNames(bg)[12]
    12
    "GTPBP6"
#繪制箱線圖
plot(fpkm[12,] ~ pheno_data$stage, border=c(1,2),+
    main=paste(ballgown::geneNames(bg)[12], ' : ',+
    ballgown::transcriptNames(bg)[12]), pch=19, xlab="stage",+
    ylab='log2(FPKM+1)')
 points(fpkm[12,] ~ jitter(as.numeric(pheno_data$stage)),+
    col=as.numeric(pheno_data$stage))
############查看某一基因位置上所有的轉(zhuǎn)錄本#########################
# plotTranscripts函數(shù)可以根據(jù)指定基因的id畫出在特定區(qū)段的轉(zhuǎn)錄本
#可以通過sample函數(shù)指定看在某個(gè)樣本中的表達(dá)情況,這里選用id=1750, sample=ERR188234
plotTranscripts(ballgown::geneIDs(bg)[1729], bg,+
    main=c('Gene XIST in sample ERR188234'), sample=c('ERR188234'))
plotMeans('MSTRG.56', bg_filter, groupvar="stage",legend=FALSE)

腳本參考:

  1. http://m.itdecent.cn/p/1f5d13cc47f8
  2. Hisat+stringtie+ballgown文章

7. featureCount+DESeq

7.1 featureCounts

featureCounts 
-p -t exon -g gene_id 
-a Sscrofa11.1.gtf 
-o  
bamfile

結(jié)果: 兩種文件

1, txt文件,很多列,有ENSG基因號(hào)、位置、call到的reads數(shù)等
2, summary文件,很小類似一個(gè)日志文件,顯示了比對(duì)的情況,未必對(duì)上的是什么。

7.2 DESeq2

7.2.1 基本原理

1.1 概述
  1. 全稱:DESeq2 package for differential analysis of count data;
  2. 利用負(fù)二項(xiàng)分布廣義線性模型( negative binomial generalized linear models),同時(shí),還利用了離散型估計(jì)、logFoldChange;
  3. 負(fù)二項(xiàng)分布是一個(gè)離散分布,符合測(cè)序reads分布;
1.2 構(gòu)建dds
  1. 要求輸入原始 reads count 數(shù);不接受已經(jīng)做過處理的FPKM/TPM等,因?yàn)檐浖凶约旱臉?biāo)準(zhǔn)化計(jì)算方法;
  2. 構(gòu)建dds。需要設(shè)置design公式,即告訴軟件你的數(shù)據(jù)是怎樣來的,基本試驗(yàn)設(shè)計(jì)如何,軟件會(huì)根據(jù)幾個(gè)變量綜合計(jì)算;
    一般:design =~ variable1 + variable2 + ...;
    只有一個(gè)變量時(shí):design=~ condition;
    很多醫(yī)學(xué)分析會(huì)加入年齡、性別等:design=~sex+disease+condition
    可以對(duì)應(yīng)幾個(gè)變量,但如果沒有額外參數(shù),log2FC和p值是默認(rèn)對(duì)design公式中的最后一個(gè)變量或者最后一個(gè)因子與參考因子進(jìn)行比較;
1.3 函數(shù)與計(jì)算
1.3.1 標(biāo)準(zhǔn)化:DESeq函數(shù)
  1. 不同樣品的測(cè)序量有差異,最簡(jiǎn)單的標(biāo)準(zhǔn)化方式是計(jì)算counts per million (CPM) = 原始reads count ÷ 總reads數(shù) x 1,000,000
  2. 這種計(jì)算方式,易受到極高表達(dá)且在不同樣品中存在差異表達(dá)的基因的影響:這些基因的打開或關(guān)閉會(huì)影響到細(xì)胞中總的分子數(shù)目,可能導(dǎo)致這些基因標(biāo)準(zhǔn)化之后就不存在表達(dá)差異了,而原本沒有差異的基因標(biāo)準(zhǔn)化之后卻有差異了;
  3. RPKM、FPKM、TPM 是 CPM按照基因或轉(zhuǎn)錄本長(zhǎng)度歸一化后的表達(dá),都會(huì)受到這一影響;
  4. DESeq2的方法:
  • 量化因子 (size factor,SF),首先計(jì)算每個(gè)基因在所有樣品中表達(dá)的幾何平均值;每個(gè)細(xì)胞的SF是所有基因與其在所有樣品中的表達(dá)值的幾何平均值的比值的中位數(shù);由于幾何平均值的使用,只有在所有樣品中表達(dá)都不為0的基因才能用來計(jì)算。這一方法又被稱為RLE(relative log expression)。
  • 不但考慮了測(cè)序深度的問題,還考慮了表達(dá)量超高或者極顯著差異表達(dá)的基因?qū)е耤ount的分布出現(xiàn)偏倚。
  1. DESeq函數(shù)分析:
  • 三步:estimation of size factors(estimateSizeFactors), estimation of dispersion(estimateDispersons), Negative Binomial GLM fitting and Wald statistics(nbinomWaldTest);
  • 可以分步運(yùn)行,也可一步到位,最后返回 results可用的DESeqDataSet對(duì)象。
1.3.2 歸一化:rlog/vst

是我自己去的名字,可能不準(zhǔn)確,我用在對(duì)dds進(jìn)行vst然后做PCA分析。

全稱:快速估算離散趨勢(shì)并應(yīng)用方差穩(wěn)定轉(zhuǎn)換;
若 samples<30 用 rlog函數(shù),>30用 vst;
類似的函數(shù):gmodels - fast.prcomp,輸入數(shù)據(jù)為TPM;或者TMM;

1.3.3 數(shù)據(jù)收縮:lfcShrink:

shrink the log2 foldchange,不會(huì)改變顯著差異的基因總數(shù),作者很推薦這個(gè)新功能。

  1. 為何采用lfcShrink?log2FC estimates do not account for the large dispersion we observe with low read counts. 因此,兩種數(shù)據(jù)特別需要:低表達(dá)量占比高的;數(shù)據(jù)特別分散的。
  2. 但是我只用來做MA plot并沒用來差異分析,因?yàn)椋?/li>
  3. lfcShrink 不改變p值q值,但改變了fc,使 foldchange范圍變小,所以選擇DEG時(shí)會(huì)有不同結(jié)果,一般會(huì)偏少!所以,根據(jù)數(shù)據(jù)情況,本次分析DEG還是不做shrink。
1.3.4 p-value和q-value
  • 作者給出的建議:
    Need to filter on adjusted p-values, not p-values, to obtain FDR control. 10% FDR is common because RNA-seq experiments are often exploratory and having 90% true positives in the gene set is ok.
    即:用padj為標(biāo)準(zhǔn)做結(jié)果篩選。
  • 事實(shí)上,在軟件計(jì)算過程中,多次以alpha表示padj,并默認(rèn)alpha=0.1;
1.3.5 MA plot
  1. MA plot也叫 mean-difference plot或者Bland-Altman plot,用來估計(jì)模型中系數(shù)的分布;
  2. X軸, the "A"(average);Y軸,the "M"(minus) – subtraction of log values is equivalent to the log of the ratio;
  3. M表示log fold change,衡量基因表達(dá)量變化,上調(diào)還是下調(diào);A表示每個(gè)基因的count的均值;
  4. 根據(jù)summary(res)可知,low count的比率很高,所以大部分基因表達(dá)量不高,也就是集中在0的附近(log2(1)=0,也就是變化1倍),提供了模型預(yù)測(cè)系數(shù)的分布總覽。
1.4 DESeq(dds)結(jié)果矩陣每一列的含義:
  1. baseMean: is a just the average of the normalized count values, dividing by size factors, taken over all samples in the DESeqDataSet;是對(duì)照組的樣本標(biāo)準(zhǔn)化counts的均值;
  2. log2FoldChange: the effect size estimate. It tells us how much the gene’s expression seems to have changed due to treatment with dexamethasone in comparison to untreated samples;也不是簡(jiǎn)單的用標(biāo)準(zhǔn)化的counts進(jìn)行計(jì)算,因?yàn)橛?jì)算的時(shí)候需要考慮零值以及其他效應(yīng);結(jié)果是log2fc(trt/untrt)所以要注意對(duì)照和處理的指定;
  3. lfcSE: the standard error estimate for the log2 fold change estimate,(the effect size estimate has an uncertainty associated with it,);
  4. p value: statistical test , the result of this test is reported as a p value. Remember that a p value indicates the probability that a fold change as strong as the observed one, or even stronger, would be seen under the situation described by the null hypothesis;
  • p value有時(shí)候是NA:Sometimes a subset of the p values in res will be NA (not available); This is DESeq's way of reporting that all counts for this gene were zero, and hence no test was applied. In addition, p values can be assigned NA if the gene was excluded from analysis because it contained an extreme count outlier. For more information, see the outlier detection section of the DESeq2 vignette.
  1. padj: adjusted p-value;
1.5 實(shí)際遇到的其它問題
1.5.1 pre-analysis 預(yù)分析

就是開始熟悉你的數(shù)據(jù),選擇合適的分析方法;

先做三個(gè)圖:PCA,相關(guān)性熱圖,聚類圖;

1.5.2 批次效應(yīng)

1. 本項(xiàng)目的批次效應(yīng):

design =~ batch + condition   
  1. 一般批次效應(yīng):
  • 可以用limma removeBatchEffect或者sva Combat等去除;
  • 但是在做差異分析時(shí),ballgown, DESeq2等軟件建議不要提前去批次,而是將批次作為covariate進(jìn)行分析;
  1. 如果想做差異表達(dá)分析,但數(shù)據(jù)中又有已知的批次問題,則應(yīng)該在構(gòu)建模型矩陣時(shí)加入批次因素,我做ballgown時(shí)用了adjust cov = idv。

7.2.2 實(shí)操

經(jīng)過多次分析和調(diào)整,最后用的代碼是:

(1)安裝包
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("DESeq2")
(2)導(dǎo)入數(shù)據(jù)兩兩比較
setwd( )
colData <- read.table('', header=TRUE, row.names=1)
readscount <- read.table('', header=TRUE, row.names=1)
condition <- factor(c(rep("A",1),rep("F",1)))
batch <- factor(c(rep(),rep(),rep(),
                  rep(),rep(),rep()))
library(DESeq2) 
dds <- DESeqDataSetFromMatrix(readscount, colData, design =~ batch + condition)
keep <- rowSums(counts(dds) >= 10) >= 3  
dds <- dds[keep, ] 
(3)PCA
vsdata <- vst(dds, blind=FALSE)  #歸一化
assay(vsdata) <- limma::removeBatchEffect(assay(vsdata), vsdata$batch)  #去批次效應(yīng)
plotPCA(vsdata, intgroup = "condition") 
(4)差異分析
dds_norm <- DESeq(dds, minReplicatesForReplace = Inf) #標(biāo)準(zhǔn)化; 
dds_norm$condition   #保證是levels是按照后一個(gè)比前一個(gè)即trt/untrt,否則需在results時(shí)指定
res <- results(dds_norm, contrast = c("condition","A","F"), cooksCutoff = FALSE) #alpha=0.05可指定padj; cookCutoff是不篩選outliers因?yàn)樘嗔?summary(res)  
#resOrdered <- res[order(res$pvalue), ] #排序
sum(res$padj<0.05, na.rm = TRUE)
res_data <- merge(as.data.frame(res),
              as.data.frame(counts(dds_norm,normalize=TRUE)),
              by="row.names",sort=FALSE)
up_DEG <- subset(res_data, padj < 0.05 & log2FoldChange > 1)
down_DEG <- subset(res_data, padj < 0.05 & log2FoldChange < -1)
write.csv(up_DEG, "up.csv")
write.csv(down_DEG, "down.csv")
(5)判斷歐氏距離,若有異常樣品則不用cooksCutoff;當(dāng)有上千個(gè)異常值時(shí)也不用:(完全可以不做)
par(mar=c(8,5,2,2))
boxplot(log10(assays(dds_norm)[["cooks"]]), range=0, las=2)
(6)lfcshrink & MA plot
library(apeglm)  
resultsNames(dds_norm)  #看一下要shrink的維度;shrink數(shù)據(jù)更加緊湊,少了一項(xiàng)stat,并未改變padj,但改變了foldchange
res_shrink <- lfcShrink(dds_norm, coef="condition_A_vs_F", type="apeglm") #最推薦apeglm算法;根據(jù)resultsNames(dds)的第5個(gè)維度,coef=5,也可直接""指定;apeglm不allow contrast,所以要指定coef
pdf("MAplot.pdf", width = 6, height = 6) 
plotMA(res_shrink, ylim=c(-10,10), alpha=0.1, main="MA plot")
dev.off()
(7)火山圖,需要根據(jù)lfc添加significant列,分別為down,up,stable
library(ggplot2)
voldata <-read.csv(file = "allDEGs.csv",header = TRUE, row.names =1,sep = ",")
pdf("volcano.pdf", width = 6, height = 5)
ggplot(data=voldata, aes(x=log2FoldChange,y= -1*log10(padj))) +
  geom_point(aes(color=significant)) +
  scale_color_manual(values=c("#546de5", "#d2dae2","#ff4757")) + 
  labs(title="Volcano Plot", x=expression(log[2](FC), y=expression(-log[10](padj)))) +
  geom_hline(yintercept=1.3,linetype=4) +  
  geom_vline(xintercept=c(-1,1),linetype=4) +
  theme_bw() + theme(panel.grid = element_blank())  
dev.off()
心得:

DESeq2官方說明: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#indfilt

8. 富集分析 Gene Ontology Enrichment Analysis

富集分析分為兩類:

A:差異基因富集分析(不需要表達(dá)值,只需要gene name)
B:基因集(gene set)富集分析(不管有無差異,需要全部genes表達(dá)值)

8.1 差異基因富集:ID轉(zhuǎn)換

  • 如果是做 人、小鼠,則不需要轉(zhuǎn)換。

方法:
網(wǎng)頁(yè)在線版 g: profiler https://biit.cs.ut.ee/gprofiler/convert

8.2 GO富集與畫圖

  • 工具:Y叔的R包 clusterProfiler;

代碼:

setwd('')
library(clusterProfiler)
library(DOSE)
library(org.Ss.eg.db)
library(ggplot2)
library(stringr)
###get the ENTREZID for the next analysis
keytypes(org.Ss.eg.db) 
gene_All <- read.csv(file = "", header = T)
gene_Alias <- gene_All[ ,2]
gene_ID <- bitr(gene = gene_Alias, fromType = "ALIAS", 
              toType = c("SYMBOL","ENTREZID"),
              OrgDb = org.Ss.eg.db) 
###Go classification
go_res <- enrichGO(gene = gene_ID$ENTREZID, 
                   OrgDb = "org.Ss.eg.db", 
                   ont = "all",
                   pvalueCutoff = 0.9,
                   qvalueCutoff = 0.9)
go_dose <- DOSE::setReadable(go_res, OrgDb = 'org.Ss.eg.db', keyType = 'ENTREZID')
write.csv(go_dose, 'goresult.csv')
pdf('goresult.pdf')
barplot(go_dose, split = "ONTOLOGY", font.size = 10, 
        title="DEGs GO enrichment") + 
  facet_grid(ONTOLOGY~., scale = "free") + 
  scale_x_discrete(labels = function(x) str_wrap(x, width=50)) 
dev.off()

KEGG enrichment:

kegg <- enrichKEGG(gene = gene_ID$ENTREZID,
                   organism = 'ssc',
                   keyType = "kegg",
                   pvalueCutoff = 0.9,
                   qvalueCutoff = 0.9,
                   pAdjustMethod = "BH",
                   minGSSize = 10, 
                   maxGSSize = 500)
kegg[1:30]
pdf('keggresult.pdf')
barplot(kegg, showCategory = 20, font.size = 10, xlab = "Gene Counts",
        title = "kegg") + 
  scale_size(range = c(2, 12)) + 
  scale_x_discrete(labels = function(kegg) str_wrap(kegg, width = 50)) 
dev.off()

8.3 Allenricher GO+KEGG

Linux運(yùn)行;

perl /software/AllEnricher-v1.0/AllEnricher 
-l id.txt 
-s ssc -v v20190612 
-o /Allenricher_GO_KEGG/ 
-r /bin/Rscript 
-i KEGG+GO

# 富集genenum padj 可在allenricher軟件腳本修改。

8.4 基因集富集:GESA(clusterProfiler包)

結(jié)果說明
  1. Enrichment Score:作者認(rèn)為當(dāng)其表達(dá)矩陣的gene list在gene sets中是隨機(jī)分布的話,那么最終的ES值是相對(duì)較小的;當(dāng)是非隨機(jī)分布時(shí),則對(duì)應(yīng)的ES值是相對(duì)較大的。
  2. 一般|NES|>1, p-value<0.05, FDR<=25%的條目有意義;
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

友情鏈接更多精彩內(nèi)容