TCGA新版數(shù)據(jù)差異分析

上一篇文章里面簡單學(xué)習(xí)了一下表達(dá)矩陣的提取,順便探索了一下SummarizedExperiment對(duì)象。

今天學(xué)習(xí)下用TCGAbiolinks做差異分析。

加載R包和數(shù)據(jù)

rm(list = ls())

library(SummarizedExperiment)
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
## 
##     windows
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
library(TCGAbiolinks)

首先是查詢、下載、整理,但是這一步我們?cè)谥耙呀?jīng)做好了,直接加載就可以了!

下載方法:請(qǐng)翻看之前的推文

# 查詢
query <- GDCquery(project = "TCGA-COAD",
                    data.category = "Transcriptome Profiling",
                    data.type = "Gene Expression Quantification",
                    workflow.type = "STAR - Counts"
                    )
# 下載
GDCdownload(query, files.per.chunk = 100) #每次下載100個(gè)文件
  
# 整理
GDCprepare(query,save = T,save.filename = "TCGA-COAD_mRNA.Rdata")

我們已經(jīng)下載好了,就直接加載即可:

load("TCGA-mRNA/TCGA-COAD_mRNA.Rdata")

通常我們會(huì)區(qū)分mRNA和lncRNA,所以我們這里只選擇mRNA即可,方法在上一篇也說過了,非常簡單!直接對(duì)SummarizedExperiment對(duì)象取子集即可!

se_mrna <- data[rowData(data)$gene_type == "protein_coding",]

se_mrna
## class: RangedSummarizedExperiment 
## dim: 19962 521 
## metadata(1): data_release
## assays(6): unstranded stranded_first ... fpkm_unstrand fpkm_uq_unstrand
## rownames(19962): ENSG00000000003.15 ENSG00000000005.6 ...
##   ENSG00000288674.1 ENSG00000288675.1
## rowData names(10): source type ... hgnc_id havana_gene
## colnames(521): TCGA-A6-5664-01A-21R-1839-07
##   TCGA-D5-6530-01A-11R-1723-07 ... TCGA-A6-2683-01A-01R-0821-07
##   TCGA-A6-2683-11A-01R-A32Z-07
## colData names(107): barcode patient ... paper_vascular_invasion_present
##   paper_vital_status

數(shù)據(jù)預(yù)處理

在進(jìn)行差異分析前進(jìn)行一些預(yù)處理。

dim(se_mrna)
## [1] 19962   521

首先根據(jù)spearman相關(guān)系數(shù)去除異常值:

coad_coroutliers <- TCGAanalyze_Preprocessing(se_mrna,cor.cut = 0.7)
## Number of outliers: 0

dim(coad_coroutliers)
## [1] 19962   521

還會(huì)生成一張相關(guān)圖:Array Array Intensity correlation (AAIC):


image.png

接下來進(jìn)行標(biāo)準(zhǔn)化:

# normalization of genes
coadNorm <- TCGAanalyze_Normalization(
    tabDF = coad_coroutliers, 
    geneInfo =  geneInfoHT
)
## I Need about  127 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]
## Step 1 of 4: newSeqExpressionSet ...
## Step 2 of 4: withinLaneNormalization ...
## Step 3 of 4: betweenLaneNormalization ...
## Step 4 of 4: exprs ...

會(huì)使用EDASeq包中的方法:

EDASeq::newSeqExpressionSet
EDASeq::withinLaneNormalization
EDASeq::betweenLaneNormalization
EDASeq::counts
dim(coadNorm)
## [1] 19469   521

然后過濾掉低表達(dá)的基因:

# quantile filter of genes
coadFilt <- TCGAanalyze_Filtering(
    tabDF = coadNorm,
    method = "quantile", 
    qnt.cut =  0.25
)

看看一通操作下來后還剩多少基因?

dim(coadFilt)
## [1] 14600   521

從最開始的19962變成了現(xiàn)在的14600,過濾掉了5000+基因......

最后是根據(jù)腫瘤組織和正常組織進(jìn)行分組:
這里我們只選擇了實(shí)體瘤和部分正常組織。如果你想選擇更多,只要在typesample參數(shù)中添加更多類型即可。

可選類型見下圖,也是根據(jù)TCGA-barcode進(jìn)行判斷的:


image.png
# selection of normal samples "NT"
samplesNT <- TCGAquery_SampleTypes(
    barcode = colnames(coadFilt),
    typesample = c("NT")
)

# selection of tumor samples "TP"
samplesTP <- TCGAquery_SampleTypes(
    barcode = colnames(coadFilt), 
    typesample = c("TP")
)

所以分組這一步我們還是自己搞定!就是根據(jù)barcode的第14,15位數(shù),結(jié)合上面那張圖判斷,毫無難度。


image.png
# 小于10就是tumor
samplesTumor <- as.numeric(substr(colnames(coadFilt),14,15))<10

差異分析

非常簡單,支持edgeRlimma兩種方法,當(dāng)然也可以無縫連接DESeq2進(jìn)行差異分析!

# Diff.expr.analysis (DEA)
coadDEGs <- TCGAanalyze_DEA(
    mat1 = coadFilt[,!samplesTumor], # normal矩陣
    mat2 = coadFilt[,samplesTumor], # tumor矩陣
    Cond1type = "Normal",
    Cond2type = "Tumor",
    fdr.cut = 0.01, 
    logFC.cut = 1,
    pipeline = "edgeR", # limma
    method = "glmLRT"
)
## Batch correction skipped since no factors provided
## ----------------------- DEA -------------------------------
## o 41 samples in Cond1type Normal
## o 480 samples in Cond2type Tumor
## o 14600 features as miRNA or genes
## This may take some minutes...
## ----------------------- END DEA -------------------------------

差異分析就做好了!結(jié)果非常完美,同時(shí)提供了gene_name和gene_type,也就是說我們一開始不取子集也是可以的~~,最后再取也行!

image.png

使用DESeq2進(jìn)行差異分析

連接DESeq2那真是太簡單了,無縫銜接??!

library(DESeq2)

直接把SummarizedExperiment對(duì)象傳給DESeqDataSet()函數(shù)即可。

不過需要分組信息,這個(gè)需要我們手動(dòng)制作一下。

制作分組信息

其實(shí)我們的對(duì)象中包含了sample_type這一列信息,就在coldata中,但是有點(diǎn)過于詳細(xì)了。

table(colData(se_mrna)$sample_type)
## 
##          Metastatic       Primary Tumor     Recurrent Tumor Solid Tissue Normal 
##                   1                 478                   1                  41

我們給它修改一下~

new_type <- ifelse(colData(se_mrna)$sample_type == "Solid Tissue Normal",
                   "Normal",
                   "Tumor")
colData(se_mrna)$sample_type <- new_type

table(colData(se_mrna)$sample_type)
## 
## Normal  Tumor 
##     41    480

這樣就把分組搞定了!skr!

差異分析

然后就是愉快的進(jìn)行差異分析~

ddsSE <- DESeqDataSet(se_mrna, design = ~ sample_type)
## renaming the first element in assays to 'counts'
ddsSE
## class: DESeqDataSet 
## dim: 19962 521 
## metadata(2): data_release version
## assays(6): counts stranded_first ... fpkm_unstrand fpkm_uq_unstrand
## rownames(19962): ENSG00000000003.15 ENSG00000000005.6 ...
##   ENSG00000288674.1 ENSG00000288675.1
## rowData names(10): source type ... hgnc_id havana_gene
## colnames(521): TCGA-A6-5664-01A-21R-1839-07
##   TCGA-D5-6530-01A-11R-1723-07 ... TCGA-A6-2683-01A-01R-0821-07
##   TCGA-A6-2683-11A-01R-A32Z-07
## colData names(107): barcode patient ... paper_vascular_invasion_present
##   paper_vital_status

先過濾,這里我們簡單點(diǎn),

keep <- rowSums(counts(ddsSE)) >= 50
ddsSE <- ddsSE[keep,]

ddsSE
## class: DESeqDataSet 
## dim: 18820 521 
## metadata(2): data_release version
## assays(6): counts stranded_first ... fpkm_unstrand fpkm_uq_unstrand
## rownames(18820): ENSG00000000003.15 ENSG00000000005.6 ...
##   ENSG00000288674.1 ENSG00000288675.1
## rowData names(10): source type ... hgnc_id havana_gene
## colnames(521): TCGA-A6-5664-01A-21R-1839-07
##   TCGA-D5-6530-01A-11R-1723-07 ... TCGA-A6-2683-01A-01R-0821-07
##   TCGA-A6-2683-11A-01R-A32Z-07
## colData names(107): barcode patient ... paper_vascular_invasion_present
##   paper_vital_status

確定誰和誰比,我們?cè)O(shè)定Tumor-Normal。

ddsSE$sample_type <- factor(ddsSE$sample_type, levels = c("Tumor","Normal"))

接下來就可以進(jìn)行差異分析了:

dds <- DESeq(ddsSE)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 1544 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
res <- results(dds, contrast = c("sample_type","Tumor","Normal"))
res
## log2 fold change (MLE): sample_type Tumor vs Normal 
## Wald test p-value: sample_type Tumor vs Normal 
## DataFrame with 18820 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE       stat      pvalue
##                    <numeric>      <numeric> <numeric>  <numeric>   <numeric>
## ENSG00000000003.15  5198.688       0.456284 0.1445636    3.15628 1.59793e-03
## ENSG00000000005.6     42.064       0.414580 0.3014120    1.37546 1.68989e-01
## ENSG00000000419.13  1743.704       0.849473 0.1134251    7.48929 6.92491e-14
## ENSG00000000457.14   463.909      -0.261646 0.0690276   -3.79046 1.50371e-04
## ENSG00000000460.17   328.546       1.272811 0.0940574   13.53229 1.00837e-41
## ...                      ...            ...       ...        ...         ...
## ENSG00000288658.1   5.317261    -1.45986802  0.274148 -5.3251160 1.00889e-07
## ENSG00000288660.1   4.648268     1.47152585  0.450554  3.2660389 1.09063e-03
## ENSG00000288669.1   0.143343    -0.13840184  1.339247 -0.1033431 9.17691e-01
## ENSG00000288674.1   3.201667     0.00548347  0.174731  0.0313824 9.74965e-01
## ENSG00000288675.1  15.048025     2.01434132  0.174631 11.5348224 8.80688e-31
##                           padj
##                      <numeric>
## ENSG00000000003.15 2.75622e-03
## ENSG00000000005.6  2.13290e-01
## ENSG00000000419.13 3.14116e-13
## ENSG00000000457.14 2.95220e-04
## ENSG00000000460.17 2.77449e-40
## ...                        ...
## ENSG00000288658.1  2.74461e-07
## ENSG00000288660.1  1.92231e-03
## ENSG00000288669.1  9.29645e-01
## ENSG00000288674.1  9.78866e-01
## ENSG00000288675.1  1.20917e-29

結(jié)果很棒,不過沒有g(shù)ene_symbol了,需要自己添加哦~

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容