關(guān)鍵詞
- 批次效應(yīng)
- 批次校正
- 單細(xì)胞數(shù)據(jù)整合
適用背景
單細(xì)胞數(shù)據(jù)由于實驗平臺或樣本等原因會造成不同數(shù)據(jù)集之間存在批次效應(yīng),這種批次效應(yīng)是人為因素造成的,沒有實際的生物學(xué)意義,可能對研究結(jié)果產(chǎn)生極大影響。批次效應(yīng)可由聚類結(jié)果確定,如果聚類出的某些亞群絕大部分都來自同一個樣本一般就認(rèn)為存在批次效應(yīng),因此需要進(jìn)行批次校正。
本文總結(jié)了三種常用的整合方法代碼:CCA,SCTransform和Harmony。
方法一
CCA整合方法是目前應(yīng)用最多方法,是Seurat自帶的,大多數(shù)情況以及夠用了,效果也還可以,但是對于較大數(shù)據(jù)集,耗時較長,占內(nèi)存也較大。目前,Seurat官網(wǎng)在此基礎(chǔ)上推薦reference-based,也就是指定參考數(shù)據(jù)集進(jìn)行整合,但對于自產(chǎn)數(shù)據(jù)集,一般根本無法預(yù)先知道哪個樣本效果最好,這種reference-based的思路更適合數(shù)據(jù)挖掘類的研究。
參數(shù)簡介
- obj,Seurat對象
- group.by,整合分組
- mt.pattern或mt.list,指定線粒體基因,mt.pattern支持模糊搜索,mt.list直接給定基因集,格式為向量
- dim.use PCA主成分的選擇個數(shù)
- mt.cutoff 線粒體百分比閾值
- nf.low 基因數(shù)下限
- nf.high 基因數(shù)上限
- nfeatures,用于整合的高變基因選擇數(shù)
- res,聚類亞群的分辨率
seurat_integ <- function(obj,group.by=NULL,
mt.pattern="^MT-",mt.list=NULL,dim.use=30,mt.cutoff=5,
nf.low=500,nf.high=6000,nfeatures=3000,
res=1.5) {
all <- obj
if (is.null(mt.list)) {
all[["percent.mt"]] <- PercentageFeatureSet(all, pattern = mt.pattern)
}else{
mt.list <- mt.list[which(mt.list %in% rownames(all))]
all[["percent.mt"]] <- PercentageFeatureSet(all, features = mt.list)
}
all <- subset(all, subset = nFeature_RNA > nf.low & percent.mt < mt.cutoff & nFeature_RNA < nf.high)
all <- NormalizeData(all, normalization.method = "LogNormalize", scale.factor = 10000)
all.list <- SplitObject(all, split.by = group.by)
for (i in 1:length(all.list)){
all.list[[i]] <- NormalizeData(all.list[[i]], verbose = FALSE)
all.list[[i]] <- FindVariableFeatures(all.list[[i]], selection.method = "vst",nfeatures = nfeatures, verbose = FALSE)}
reference.list <- all.list
all.anchors <- FindIntegrationAnchors(object.list = reference.list, dims =1:dim.use)
all.integrated <- IntegrateData(anchorset = all.anchors, dims = 1:dim.use)
DefaultAssay(all.integrated) <- "integrated"
all.integrated <- ScaleData(all.integrated, verbose = FALSE)
npcs <- dim.use+10
all.integrated <- RunPCA(all.integrated, npcs = npcs, verbose = FALSE)
all.integrated <- FindNeighbors(all.integrated, reduction = "pca", dims = 1:dim.use)
all.integrated <- FindClusters(all.integrated, resolution = res)
all.integrated <- RunUMAP(all.integrated, reduction = "pca", dims = 1:dim.use)
return(all.integrated)
}
方法二
第二種方法是Seurat官網(wǎng)極度推薦的,主要由于方法一的Normalization and variance stabilization流程存在一定問題,會造成基因表達(dá)量會與測序深度存在明顯的相關(guān)關(guān)系等,因此提出了SCTransform進(jìn)行預(yù)處理,然后再整合,其實后面的整合方法跟方法一的類型,只不過這里的前期預(yù)處理用的是SCTransform,而方法一用的是LogNormalize,因此整合的對象結(jié)構(gòu)是類似的。詳細(xì)內(nèi)容可閱讀這篇文獻(xiàn)。這種方法理論上更為合理,但是也更耗運行內(nèi)存和運行時間。參數(shù)與方法一一致。
sct_integ <- function(obj,group.by=NULL,
mt.pattern="^MT-",mt.list=NULL,dim.use=30,mt.cutoff=5,
nf.low=500,nf.high=6000,nfeatures=3000,
res=1.5) {
all <- obj
if (is.null(mt.list)) {
all[["percent.mt"]] <- PercentageFeatureSet(all, pattern = mt.pattern)
}else{
mt.list <- mt.list[which(mt.list %in% rownames(obj))]
all[["percent.mt"]] <- PercentageFeatureSet(all, features = mt.list)
}
all <- subset(all, subset = nFeature_RNA > nf.low & percent.mt < mt.cutoff & nFeature_RNA < nf.high)
all <- NormalizeData(all, normalization.method = "LogNormalize", scale.factor = 10000)
obj <- all
ifnb.list <- SplitObject(obj, split.by = group.by)
if (!requireNamespace("glmGamPoi", quietly = TRUE)) {
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform, vars.to.regress = "percent.mt", verbose = FALSE)
}else{
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform, vars.to.regress = "percent.mt", verbose = FALSE, method = "glmGamPoi")
}
features <- SelectIntegrationFeatures(object.list = ifnb.list, nfeatures = nfeatures)
ifnb.list <- PrepSCTIntegration(object.list = ifnb.list, anchor.features = features)
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, normalization.method = "SCT",
anchor.features = features)
immune.combined.sct <- IntegrateData(anchorset = immune.anchors, normalization.method = "SCT")
immune.combined.sct <- RunPCA(immune.combined.sct, verbose = FALSE)
immune.combined.sct <- RunUMAP(immune.combined.sct, reduction = "pca", dims = 1:dim.use)
all.integrated <- immune.combined.sct
all.integrated <- FindNeighbors(all.integrated, reduction = "pca", dims = 1:dim.use)
all.integrated <- FindClusters(all.integrated, resolution = res)
all.integrated <- RunUMAP(all.integrated, reduction = "pca", dims = 1:dim.use)
return(all.integrated)
}
方法三
第三種方法是一種降維整合,基于harmony包,這種方法的優(yōu)勢在于夠快,大部分情況都能有比較好的結(jié)果。參數(shù)與方法一一致。
harmony_integ <- function(obj,group.by=NULL,
mt.pattern="^MT-",mt.list=NULL,dim.use=20,mt.cutoff=5,
nf.low=500,nf.high=6000,nfeatures=3000,
res=1.5) {
library(harmony)
all <- obj
if (is.null(mt.list)) {
all[["percent.mt"]] <- PercentageFeatureSet(all, pattern = mt.pattern)
}else{
mt.list <- mt.list[which(mt.list %in% rownames(all))]
all[["percent.mt"]] <- PercentageFeatureSet(all, features = mt.list)
}
all <- subset(all, subset = nFeature_RNA > nf.low & percent.mt < mt.cutoff & nFeature_RNA < nf.high)
all <- NormalizeData(all, normalization.method = "LogNormalize", scale.factor = 10000)
all <- FindVariableFeatures(all, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(all)
all <- ScaleData(all , features = all.genes, vars.to.regress = "nCount_RNA")
saveRDS(all,"regress.rds")
all <- RunPCA(all, features = VariableFeatures(object = all))
all <- RunHarmony(all, group.by , plot_convergence = F,dims.use = 1:dim.use)
Combine <- all
Combine = RunTSNE(Combine, reduction = "harmony", dims = 1:dim.use)
Combine = RunUMAP(Combine, reduction = "harmony", dims = 1:dim.use)
Combine = FindNeighbors(Combine, reduction = "harmony",dims = 1:dim.use)
Combine = FindClusters(Combine, resolution = res)
return(Combine)
}
小結(jié)與補充
單細(xì)胞數(shù)據(jù)整合一直是個玄學(xué),沒有說哪一種整合方法是最好的,不同方法針對不同樣本會出現(xiàn)不同效果,只能每種方法都試一下才能知道哪種較好。而且還需要結(jié)合實際情況進(jìn)行選擇,例如數(shù)據(jù)集太大,或者運行內(nèi)存不夠,可能harmony的方法更適合,當(dāng)然如果數(shù)據(jù)集適中,各種運行條件也合適那就可以考慮理論上更為合理的SCTransform方法。