CUT&Tag的分析——demo重復(fù)

其他拓展閱讀

http://m.itdecent.cn/p/27e91c8dac2f #ChIP-seq與CUT&Tag的介紹與對比

https://www.sohu.com/a/331749726_120054289 #CUT&Tag的詳解

CUT&Tag for efficient epigenomic profiling of small samples and single cells #技術(shù)原文

https://www.abcam.cn/epigenetics/chromatin-profiling-guide-1 #ChIP-seq、CUT&RUN、CUT&Tag等的綜合介紹

Single-cell CUT&Tag analysis of chromatin modifications in differentiation and tumor progression #單細胞CUT&Tag技術(shù)文章

重復(fù)分析主要參考:

https://yezhengstat.github.io/CUTTag_tutorial/

流程中出現(xiàn)K4me3寫為K4m3的錯誤,請注意

流程如下:


pipeline

01.數(shù)據(jù)的下載

根據(jù)流程,從SRA數(shù)據(jù)庫下載相關(guān)數(shù)據(jù):

cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/00.data
wget -c https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-20/SRR12246717/SRR12246717.1
wget -c https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-19/SRR11074240/SRR11074240.1
wget -c https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-20/SRR11074254/SRR11074254.1
wget -c https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-19/SRR11074258/SRR11074258.1
wget -c https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-20/SRR11923224/SRR11923224.1
wget -c https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-15/SRR8754611/SRR8754611.1
wget -c https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-15/SRR8754612/SRR8754612.1

for i in `ls *.1`;do `fastq-dump --split-3 $i`;done
#重命名數(shù)據(jù)
mv SRR11074240.1_1.fastq K27me3_rep2_1.fastq
mv SRR11074240.1_2.fastq K27me3_rep2_2.fastq
mv SRR11074258.1_1.fastq K4me3_rep2_1.fastq
mv SRR11074258.1_2.fastq K4me3_rep2_2.fastq
mv SRR11074254.1_1.fastq K4me3_rep1_1.fasyq
mv SRR11074254.1_2.fastq K4me3_rep1_2.fasyq
mv SRR11923224.1_1.fastq IgG_rep1_1.fastq
mv SRR11923224.1_2.fastq IgG_rep1_2.fastq 
mv SRR12246717.1_1.fastq K27me3_rep1_1.fastq
mv SRR12246717.1_2.fastq K27me3_rep1_2.fastq
cat SRR8754611.1_1.fastq SRR8754612.1_1.fastq >IgG_rep2_1.fastq
cat SRR8754611.1_2.fastq SRR8754612.1_2.fastq>IgG_rep2_2.fastq

樣品名稱與數(shù)據(jù)名稱的對應(yīng)關(guān)系為:


image.png

02.數(shù)據(jù)的預(yù)處理

01.使用fastqc進行質(zhì)量的檢查

所有樣本的數(shù)據(jù)都符合要求:包括GC content、adapter含量等,所以不進行低質(zhì)量reads的過濾、adapter的trim等。如果數(shù)據(jù)質(zhì)量不符合,則進行reads的過濾,個人一般使用的是fastp(可以自動識別接頭)。

cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/01.Data_Pre_processing/
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/00.data/*_1.fastq`;do dir_path=$(dirname  $i); file_name=$(basename $i _1.fastq); echo "fastqc -t 10 -o ./ $dir_path/${file_name}_1.fastq">>fastqc.sh; echo "fastqc -t 10 -o ./ $dir_path/${file_name}_2.fastq">>fastqc.sh;done
ParaFly -c fastqc.sh -CPU 2

02.如果有技術(shù)重復(fù),可以考慮進行數(shù)據(jù)的合并

如果有技術(shù)重復(fù),可以在fastq層面直接進行合并。

03.序列的比對

包含Tn5接頭和PCR條形碼的插入序列結(jié)構(gòu)如下:


image.png

01.比對至hg38基因組

該測序數(shù)據(jù)不需要進行數(shù)據(jù)的修剪:因為它使用的是25*25 PE的標準測序,因此接頭序列不會被包含在數(shù)據(jù)里面。在更長的測序中,需要使用以下的參數(shù):“--local --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700”進行比對,以規(guī)避可能存在的接頭序列?;蛘咛崆斑M行修剪。

cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/02.mapping
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/00.data/*_1.fastq`;do dir_path=$(dirname  $i); file_name=$(basename $i _1.fastq); echo "bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 -p 10 -x /public/home/jlwang_gibh/project/00.data/03.GRCh38/GRCh38.primary_assembly.genome -1 $dir_path/${file_name}_1.fastq -2 $dir_path/${file_name}_2.fastq -S $file_name.sam 2>$file_name.summary">>bowtie2.sh;done
nohup ParaFly -c bowtie2.sh -CPU 2  2>&1 &

02.比對至E.coli基因組

大腸桿菌的DNA攜帶著細菌產(chǎn)生的pA-Tn5蛋白,并在反應(yīng)過程中被非特異性標記。比對到大腸桿菌基因組的reads數(shù)量取決于與目標蛋白結(jié)合的CUT&Tag數(shù)量,因此也取決于使用的細胞數(shù)量和染色質(zhì)中該目標蛋白的豐度。由于在CUT&Tag反應(yīng)中加入了固定數(shù)量的pATn5,并攜帶了固定數(shù)量的大腸桿菌DNA,因此在一系列實驗中大腸桿菌reads可以被用來spike-in目標蛋白所提取信號的豐度。但這邊的假設(shè)是,樣本的細胞數(shù)和所加的pATn5需要一樣/接近,而一般沒有經(jīng)過精心設(shè)計的實驗,無法達到該假設(shè),所以想用這種spike-in方法,進行矯正,反而可能是錯誤的。聚體是否使用此來進行spike-in需考慮實驗過程產(chǎn)生的影響。

cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/02.mapping
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/00.data/*_1.fastq`;do dir_path=$(dirname  $i); file_name=$(basename $i _1.fastq); echo "bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 -p 10 -x /public/home/jlwang_gibh/project/00.data/04.Ecoli_Genome/Ecoli.genome -1 $dir_path/${file_name}_1.fastq -2 $dir_path/${file_name}_2.fastq -S $file_name.ecoli.sam 2>$file_name.ecoli.summary">>bowtie2.ecoli.sh;done
nohup ParaFly -c bowtie2.ecoli.sh -CPU 2  2>&1 &
for i in `ls *sam`; do echo $i >>seqDepth; echo `samtools view -F 0x04 $i|wc -l `>>seqDepth;done

進行spike-in標準化時候,需額外使用--no-overlap和--no-dovetail兩個額外的參數(shù)將序列比對到e.coli基因組上(--end-to-end --very-sensitive --no-overlap --no-dovetail --no-mixed --no-discordant --phred33 -I 10 -X 700),去避免可能的實驗的基因組序列交叉比對到用于 E. coli DNA校準的序列。

03.對比對結(jié)果進行統(tǒng)計

對測序深度、比對率、可比對reads的數(shù)量、重復(fù)序列比率、唯一序列的比率片段大小分布等進行統(tǒng)計

查看全局比對與唯一比對率,比對率高于80%的,認為是高質(zhì)量的數(shù)據(jù)。因為CUT&Tag有著十分低的背景噪音,所以只需要超過1 Million的reads數(shù)量便可以獲得較可靠的結(jié)果。而一些低豐度的轉(zhuǎn)錄因子和染色體蛋白,需要超過10倍于比對數(shù)量的reads用于下游分析。所以需要考慮CUT&Tag的所檢測信號類型,不同的轉(zhuǎn)錄因子差異會十分大。

perl bowtie2_summary.pl

并手動將比對深度與其他比對統(tǒng)計指標合并
比對結(jié)果:


image.png

04.腳本

bowtie2_summary.pl

#!perl
@file= `ls *summary`;
open OUT,">All_bowtie2_summary";
print OUT "File Total reads aligned 0 times aligned exactly 1 time  overall alignment rate\n";
foreach $file(@file){
    #print $file,"\n";
    open IN,"$file";
    $file=~s/.summary\n//g;
    while(<IN>){
        chomp;
        $_=~s/\r//g;
        if(/reads; of these:/){
            $_=~/(.*) reads; of these:/;
            $total_reads{$file}=$1;
        }
        if(/aligned concordantly exactly 1 time/){
            $_=~/\s+(.*) aligned concordantly exactly 1 time/;
            $aligned_exactly_1_time{$file}=$1;
        }
        if(/ overall alignment rate/){
            $_=~/(.*) overall alignment rate/;
            $overall_alignment_rate{$file}=$1;
            #print $overall_alignment_rate{$file},"\n";
        }
    }
    close IN;
}
foreach(sort keys %overall_alignment_rate){
    
    print OUT "$_\t$total_reads{$_}\t$aligned_exactly_1_time{$_}\t$overall_alignment_rate{$_}\n";
}

04.重復(fù)序列的去除

因為CUT&Tag的特性,將會整合接頭到pA-Tn5結(jié)合抗體周圍的DNA區(qū)域中,且整合的位點受到DNA可及性的影響。所以在此因素的影響下,獲取的片段有著一致的起始與終止位點是可能的,而不是因為PCR擴增的原因。在高質(zhì)量的文庫中,不需要進行CUT&Tag重復(fù)序列的去除。而當文庫初始濃度、總量很少或者PCR擴增次數(shù)很多的時候,重復(fù)序列需要進行去除。不過因為很多數(shù)據(jù)都不能像此demo的數(shù)據(jù)一樣,質(zhì)量十分好,所以一般推薦把重復(fù)序列去除。特別是一些檢測轉(zhuǎn)錄因子結(jié)合信號的,一般重復(fù)序列的比例都很高,不大可能是由CUT&Tag的特性導(dǎo)致的,更可能的原因應(yīng)該是光學(xué)重復(fù)和PCR重復(fù)。

cd  /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/03.picard
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/02.mapping/*sam`; do dir_path=$(dirname  $i); file_name=$(basename $i .sam);echo "picard SortSam I=$i O=$file_name.sorted.sam SORT_ORDER=coordinate">>$file_name.picard.sh; echo "picard MarkDuplicates I=$file_name.sorted.sam O=$file_name.sorted.dupMarked.sam METRICS_FILE=$file_name.picard.dupMark.txt">>$file_name.picard.sh; echo "picard MarkDuplicates I=$file_name.sorted.sam O=$file_name.sorted.rmDup.sam REMOVE_DUPLICATES=true METRICS_FILE=$file_name.picard.rmDup.txt" >>$file_name.picard.sh;done
rm *.ecoli.picard.sh
for i in `ls *sh`;do echo "bash $i ">>picard.sh;done
ParaFly -c picard.sh -CPU 10
perl picard_summary.pl

查看結(jié)果


image.png

01.腳本

picard_summary.pl

#!perl
@file= `ls *dupMark.txt`;
open OUT,">All_picard_summary";
print OUT "File name\tLibrary size\tDuplication ratios\n";
foreach $file(@file){
    #print $file,"\n";
    open IN,"$file";
    $file=~s/.picard.dupMark.txt\n//g;
    while(<IN>){
        chomp;
        $_=~s/\n//g;
        if(/SECONDARY_OR_SUPPLEMENTARY_RDS/){
            $value=<IN>;
            $value=~s/\n//g;
            @values=split /\t/,$value;
            $ESTIMATED_LIBRARY_SIZE=$values[9];
            $PERCENT_DUPLICATION=$values[8];
        }
    }
    print OUT "$file\t$ESTIMATED_LIBRARY_SIZE\t$PERCENT_DUPLICATION\n";
}

05.評估比對片段大小的分布

因為CUT&Tag的特性,其產(chǎn)生的片段長度應(yīng)該在一個核小體覆蓋DNA的長度(180bp)或者多個核小體覆蓋DNA的長度,所以在分布圖上,會在180×n位置處出現(xiàn)鼓包??梢砸源藖碓u估數(shù)據(jù)質(zhì)量的好壞。
首先在linux中統(tǒng)計比對上序列的長度

cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/04.assessing_length_of_fragments
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/02.mapping/*sam`; do dir_path=$(dirname  $i); file_name=$(basename $i .sam); samtools view -F 0x04 $i|awk -F '\t' 'function abs(x){return ((x < 0.0) ? -x : x)}{print abs($9)}' | sort | uniq -c | awk -v OFS="\t" '{print $2, $1/2}'>$file_name.fragmentLen.txt;done
rm -rf *.ecoli.fragmentLen.txt

之后在R中生成用于繪圖的表格數(shù)據(jù)

sampleList = c("K27me3_rep1","K27me3_rep2","K4me3_rep1","K4me3_rep2","IgG_rep1","IgG_rep2")
histList = c("K27me3","K4me3","IgG")
All<-data.frame(matrix(ncol=6,nrow=0))
colnames(All)<-c("V1","V2","Weight","Histone","Replicate","sampleInfo")
for(hist in sampleList){
  histInfo = strsplit(hist, "_")[[1]]
  AA = read.table(paste0(hist,".fragmentLen.txt"), header = FALSE) 
  for(i in colnames(AA)){
    AA[,i]<-as.numeric(AA[,i])
  }
    BB<-mutate(AA, Weight = as.numeric(V2)/sum(as.numeric(V2)), Histone = histInfo[1], Replicate = histInfo[2], sampleInfo = hist) 
    
    All=rbind(All,BB)
}
write.table(All,"length_of_fragments.txt",sep="\t")
library("ggpubr")
library("ggplot2")
library("viridis")
setwd("C:/Users/Kong_lab_user/Desktop")
sampleList = c("K27me3_rep1","K27me3_rep2","K4me3_rep1","K4me3_rep2","IgG_rep1","IgG_rep2")
histList = c("K27me3","K4me3","IgG")
length_of_fragments<-read.table("length_of_fragments.txt",sep="\t",header = T)
length_of_fragments$sampleInfo = factor(length_of_fragments$sampleInfo, levels = sampleList)
length_of_fragments$Histone = factor(length_of_fragments$Histone, levels = histList)
## Generate the fragment size density plot (violin plot)
fig5A=ggplot(length_of_fragments,aes(x = sampleInfo, y = V1, weight = Weight, fill = Histone)) +
  geom_violin(bw = 5) +
  scale_y_continuous(breaks = seq(0, 800, 50)) +
  scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
  scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
  theme_bw(base_size = 20) +
  ggpubr::rotate_x_text(angle = 20) +
  ylab("Fragment Length") +
  xlab("")

fig5B =ggplot(length_of_fragments,aes(x = V1, y = V2, color = Histone, group = sampleInfo, linetype = Replicate)) +
  geom_line(size = 1) +
  scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma") +
  theme_bw(base_size = 20) +
  xlab("Fragment Length") +
  ylab("Count") +
  coord_cartesian(xlim = c(0, 500))

ggarrange(fig5A, fig5B, ncol = 2)
image.png

06.比對結(jié)果的過濾與格式轉(zhuǎn)換

01.根據(jù)比對分數(shù)進行過濾

有一些項目需要嚴格的對比對分數(shù)進行過濾,可以使用samtools進行過濾。這邊使用的數(shù)據(jù)是未去除重復(fù)序列的。

for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/02.mapping/*sam`; do dir_path=$(dirname  $i); file_name=$(basename $i .sam);echo "grep ^@ $i>$file_name.qualityScore2.sam">>$file_name.filtering_based_on_score.sh;echo "samtools view -q 2 $i>>$file_name.qualityScore2.sam">>$file_name.filtering_based_on_score.sh;done
for i in `ls *filtering_based_on_score.sh`;do echo "bash $i">>filtering_based_on_score.sh;done
ParaFly -c filtering_based_on_score.sh -CPU 12
 rm *.ecoli.qualityScore2.sam

02.格式轉(zhuǎn)換

  • 此處存在著一個error,即在使用bedtools bamtobed在 -bedpe 參數(shù)下,將bam轉(zhuǎn)為bed時,需要先將bam依據(jù)名字排序,不然會出現(xiàn)warning:

WARNING: Query SRR3286802-24999 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.

這時整個數(shù)據(jù)會因為出現(xiàn)這個warning而導(dǎo)致信息丟失(skipping掉很多reads,造成分析的不準確),以致于后續(xù)的分析出現(xiàn)錯誤。此處錯誤在本demo中未修改。

for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/05.filtering_and_conversion/*sam`;do dir_path=$(dirname  $i); file_name=$(basename $i .qualityScore2.sam);echo "samtools view -bS -F 0x04 $i >$file_name.qualityScore2.mapped.bam">>$file_name.conversion.sh;echo "bedtools bamtobed -i $file_name.qualityScore2.mapped.bam -bedpe >$file_name.qualityScore2.bed">>$file_name.conversion.sh;echo "awk '\$1==\$4 && \$6-\$2 < 1000 {print \$0}' $file_name.qualityScore2.bed >$file_name.qualityScore2.clean.bed">>$file_name.conversion.sh; echo "cut -f 1,2,6 $file_name.qualityScore2.clean.bed | sort -k1,1 -k2,2n -k3,3n  >$file_name.qualityScore2.fragments.bed">>$file_name.conversion.sh;done
for i in `ls *conversion.sh`;do echo "bash $i">>conversion.sh;done
ParaFly -c conversion.sh -CPU 12

03.評估重復(fù)樣品的重復(fù)性

以500bp的bin長度統(tǒng)計基因組上的mapping率

for i in `ls *qualityScore2.fragments.bed`;do dir_path=$(dirname  $i); file_name=$(basename $i .bed); echo "awk -v w=500 '{print \$1, int((\$2 + \$3)/(2*w))*w + w/2}' $i | sort -k1,1V -k2,2n | uniq -c | awk -v OFS=\"\t\" '{print \$2, \$3, \$1}' |  sort -k1,1V -k2,2n >$file_name.bin500.bed">>$file_name.replicate_reproducibility.sh;done
for i in `ls *replicate_reproducibility.sh`;do echo "bash $i">>replicate_reproducibility.sh;done
ParaFly -c replicate_reproducibility.sh -CPU 12

之后在R中進行數(shù)據(jù)的整合和相關(guān)性分析,將生成的bin500.bed下載到本地

library("ggpubr")
library("ggplot2")
library("viridis")
library("dplyr")
library("corrplot")
setwd("C:/Users/Kong_lab_user/Desktop")
#設(shè)置讀取的sampels
sampleList = c("K27me3_rep1","K27me3_rep2","K4me3_rep1","K4me3_rep2","IgG_rep1","IgG_rep2")

#創(chuàng)建空的bed數(shù)據(jù)框
bed_data_frame=NULL
#循環(huán)讀取數(shù)據(jù)
for(i in sampleList){
    #設(shè)置文件名
    bed_file_name<-paste(i,".qualityScore2.fragments.bin500.bed",collapse = "",sep="")
    #讀取文件
    bed_file<-read.table(bed_file_name,header = F,sep="\t")
    #設(shè)置讀入數(shù)據(jù)框的列名
    colnames(bed_file)<-c("chrom", "bin",i)
    #整合不同文件的bed數(shù)據(jù)
    if(length(bed_data_frame)==0){
        bed_data_frame<-bed_file
    }else{
        bed_data_frame<-full_join(bed_data_frame, bed_file, by = c("chrom", "bin"))
  }
}
#進行相關(guān)性分析
M = cor(bed_data_frame %>% select(-c("chrom", "bin")) %>% log2(), use = "complete.obs") 
#繪制相關(guān)性圖
corrplot(M, method = "color", 
         outline = T, 
         addgrid.col = "darkgray",
         order="hclust", addrect = 3, 
         rect.col = "black", 
         rect.lwd = 3,
         cl.pos = "b", 
         tl.col = "indianred4", 
         tl.cex = 1, 
         cl.cex = 1, 
         addCoef.col = "black", 
         number.digits = 2, 
         number.cex = 1, 
         col = colorRampPalette(c("midnightblue","white","darkred"))(100))
image.png

07.Spike-in校正

能進行校正的基本假設(shè)是:在一系列使用相同數(shù)量細胞的樣本中,比對到E.coli基因組上reads的比例是相同的。由于這個假設(shè),分析流程沒有進行樣品間和批次間的標準化,這會使得殘余的E.coli DNA數(shù)量差異很大??梢远x一個S,然后去標準化比對深度。

S定義為
S = C / (fragments mapped to E. coli genome)
Normalized coverage的定義為
Normalized coverage = (primary_genome_coverage) * S
首先計算出每個樣本的S值(scale_factor)

根據(jù)03.03中獲得的表格,可以進行每個樣本的S值,設(shè)定C為10000

cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/06.spike_in
#生成bedtools所需的基因組信息文件
perl -e 'while(<>){if(/>/){chomp;$chr=$_;$chr=~/>(.*)\s/;$chr=$1}else{$sequence{$chr}.=$_}};open OUT,">genome_information.txt";foreach $chr(sort keys %sequence){$length=length($sequence{$chr});print OUT "$chr\t$length\n"}' GRCh38.primary_assembly.genome.fa
#獲得校正后的bedgraph文件
perl -e 'while(<>){chomp;if(/E.coli/){@arr=split /\t/,$_;$file_name=$arr[1];$file_name=~s/.ecoli//g;$S=10000/$arr[5];print "bedtools genomecov -bg -scale $S -g /public/home/jlwang_gibh/project/00.data/03.GRCh38/genome_information.txt -i /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/05.filtering_and_conversion/${file_name}.qualityScore2.fragments.bed >${file_name}.qualityScore2.fragments.normalized.bedgraph\n"}}' summary.txt>>Spike_in.sh
ParaFly -c Spike_in.sh  -CPU 10

08.鑒定Peak

01.使用SEACR進行peak的calling

因為不知道原文流程中對Control樣本是如何選擇的(有兩個IgG重復(fù)樣本),所以只能全部都用來做Control。

cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/07.Peak_calling/01.SEACR
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/06.spike_in/K*fragments.normalized.bedgraph`;do dir_path=$(dirname  $i); file_name=$(basename $i .qualityScore2.fragments.normalized.bedgraph);echo "/public/home/jlwang_gibh/opt/bin/SEACR-1.3/SEACR_1.3.sh  $i $dir_path/IgG_rep1.qualityScore2.fragments.normalized.bedgraph non stringent $file_name.IgG_rep1">>$file_name.SEACR.peakcalling.sh;echo "/public/home/jlwang_gibh/opt/bin/SEACR-1.3/SEACR_1.3.sh $i 0.01 non stringent $file_name.0.01">>$file_name.SEACR.peakcalling.sh;echo "/public/home/jlwang_gibh/opt/bin/SEACR-1.3/SEACR_1.3.sh $i $dir_path/IgG_rep2.qualityScore2.fragments.normalized.bedgraph non stringent $file_name.IgG_rep2">>$file_name.SEACR.peakcalling.sh;done

01.進行peaks數(shù)量的統(tǒng)計

for i in `ls *SEACR.peakcalling.sh`;do echo "bash $i">>SEACR.peakcalling.sh;done
ParaFly -c SEACR.peakcalling.sh -CPU 10
#對peak文件進行統(tǒng)計
perl -e '@file= `ls *stringent.bed`;foreach $file(@file){$file=~/(K.*)_(rep\d).(.*).stringent.bed/;$Histone=$1;$Replicate=$2;$Control=$3;$peak_number=`wc -l $file`;$peak_number=~/(\d+) /;$peak_number=$1;print "$Histone\t$Replicate\t$Control\t$peak_number\n"}'>SEACR_calling_summary.txt

統(tǒng)計結(jié)果:


image.png

02.統(tǒng)計峰的重復(fù)性

通過比對重復(fù)樣品各自獲得的峰,尋找重復(fù)樣品之間都存在的峰,定義為可重復(fù)峰。其中進一步選擇前1%的峰(由每個區(qū)塊的peak強度)作為高置信度位點。

library(GenomicRanges)
setwd("C:/Users/Kong_lab_user/Desktop")
#設(shè)置樣品名稱
histL = c("K27me3", "K4me3")
repL = paste0("rep", 1:2)
peakType = c("0.01", "IgG_rep1","IgG_rep2")
peakOverlap_infor<-data.frame(matrix(ncol = 3,nrow = 0))
colnames(peakOverlap_infor)<-c("peakReprod", "Histone", "Control")
#讀取數(shù)據(jù)并查看重復(fù)之間的overlap
for(type in peakType){
  for(hist in histL){
    overlap.gr = GRanges()
    for(rep in repL){
      peakInfo = read.table(paste0(hist,"_",rep,".",type,".stringent.bed"), header = FALSE, fill = TRUE)
      #將起始數(shù)據(jù)轉(zhuǎn)換為GRanges格式的位置信息數(shù)據(jù)
      peakInfo.gr = GRanges(peakInfo$V1, IRanges(start = peakInfo$V2, end = peakInfo$V3), strand = "*")
      if(length(overlap.gr) >0){
        #查看不同rep的peak之間是否存在overlap
        overlap.gr = overlap.gr[findOverlaps(overlap.gr, peakInfo.gr)@from]
        }else{
        overlap.gr = peakInfo.gr
        }
    }
    #統(tǒng)計overlap peak之間的數(shù)量
    peakOverlap = data.frame(peakReprod = length(overlap.gr), Histone = hist, Control = type) 
    peakOverlap_infor<-rbind(peakOverlap_infor, peakOverlap)
  }
}
#讀取080101中peak數(shù)量統(tǒng)計信息,并與此步驟獲取的overlap信息合并
peakN<-read.table("PeakN.txt",header = T,sep = "\t")
peakReprod = left_join(peakN, peakOverlap_infor, by = c("Histone", "Control")) 
#計算重現(xiàn)率:`# peaks overlapping rep1 and rep2/# peaks of rep1 or rep2 * 100
peakReprod$peakReprodRate=peakReprod$peakReprod/peakReprod$Peak_number*100
write.table(peakReprod,file="peakReprod.txt",sep="\t")
image.png

03.FRagment proportion in Peaks regions (FRiPs)的計算

計算FRiPs用于衡量信噪比,并將其與IgG進行對比。

library(GenomicRanges)
library(dplyr)
library(chromVAR)
setwd("C:/Users/Kong_lab_user/Desktop")
#設(shè)置sample名稱
sampleList = c("K27me3_rep1", "K27me3_rep2", "K4me3_rep1", "K4me3_rep2", "IgG_rep1", "IgG_rep2")
histL = c("K27me3", "K4m3")
repL = c("rep1", "rep2")
peakTypes = c("0.01", "IgG_rep1","IgG_rep2")
inPeakData = c()
#循環(huán)讀取peak的bed文件和比對的bam文件,并計算出位于每個peak區(qū)域的reads數(shù)量
for(hist in histL){
  for(rep in repL){
    for(peakType in peakTypes){
      #讀取peak的bed文件
      peakRes = read.table(paste0(hist, "_", rep, ".",peakType,".stringent.bed"), header = FALSE, fill = TRUE)
      #轉(zhuǎn)為GRanges格式,以便后續(xù)的分析
      peak.gr = GRanges(seqnames = peakRes$V1, IRanges(start = peakRes$V2, end = peakRes$V3), strand = "*")
      #讀取bam文件
      bamFile = paste0(hist, "_", rep, ".qualityScore2.mapped.bam")
      #獲取位于peak上的reads數(shù)量
      fragment_counts <- getCounts(bamFile, peak.gr, paired = TRUE, by_rg = FALSE, format = "bam")
      inPeakN = counts(fragment_counts)[,1] 
      inPeakData = rbind(inPeakData, data.frame(inPeakN = inPeakN, Histone = hist, Replicate = rep,peakTypes=peakType))
    }
  }
}
#讀取0303步驟中獲得的比對情況表
mapping_info<-read.table("mapping_info.txt",header = T,sep="\t")
Histone<-c()
rep<-c()
#整合信息表
for(i in rownames(mapping_info)){
  Histone<-c(Histone,strsplit(mapping_info[i,"Samples"],"_")[[1]][1])
  rep<-c(rep,strsplit(mapping_info[i,"Samples"],"_")[[1]][2])
}
mapping_info$Histone<-Histone
mapping_info$Replicate<-rep
#提取所需的部分信息
mapping_info_hg38<-mapping_info[mapping_info$Genome%in%"hg38" & mapping_info$Histone  %in% histL,]
mapping_info_hg38$Overall.alignment.rate<-as.numeric(gsub("%","",mapping_info_hg38$Overall.alignment.rate))
mapping_info_hg38$MappedFragNum_hg38<-mapping_info_hg38$Total.reads*mapping_info_hg38$Overall.alignment.rate/100

#將0303步驟獲取的信息與peak上reads數(shù)量的信息整合
summary_inPeakData<-data.frame(matrix(ncol=4,nrow=0))
colnames(summary_inPeakData)<-c("Histone","Replicate","peakTypes","Sum of reads in peak")
for(i in unique(inPeakData$Histone)){
  for(j in unique(inPeakData$Replicate)){
    for(m in unique(inPeakData$peakTypes)){
      rowname_tmp<-paste(i,j,m)
      #獲取樣品中比對至peak區(qū)域的reads總數(shù)
      sum1<-sum(inPeakData[inPeakData$Histone%in%i & inPeakData$Replicate %in% j & inPeakData$peakTypes%in% m,"inPeakN"])
      summary_inPeakData[rowname_tmp,]=c(i,j,m,sum1)
    }
    
  }
}
#合并兩者信息
summary_inPeakData<-left_join(summary_inPeakData,mapping_info_hg38,by=c("Histone", "Replicate"))
summary_inPeakData$`Sum of reads in peak`<-as.numeric(summary_inPeakData$`Sum of reads in peak`)
#計算FRiPs,分母為比對至hg38基因組上的序列數(shù)量
summary_inPeakData$FRiPs<-summary_inPeakData$`Sum of reads in peak`/summary_inPeakData$MappedFragNum_hg38
#輸出summary_inPeakData
write.table(summary_inPeakData,file = "summary_inPeakData.txt",sep="\t")

結(jié)果:


image.png

09.可視化

01.使用IGV進行可視化

將由07步驟獲得的bedgraph文件輸入到IGV進行可視化,結(jié)果:


image.png

02.繪制heatmap進行可視化——轉(zhuǎn)錄單元

這邊使用的比對文件為0601步驟獲取的去除低質(zhì)量比對的sam文件,去進行可視化。其中computeMatrix的-R參數(shù)文件由http://genome.ucsc.edu/cgi-bin/hgTables下載,僅僅取10000個基因作圖(加快作圖速度)。

cd /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/08.Visualization
#生成bw文件
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/05.filtering_and_conversion/*qualityScore2.sam`;do dir_path=$(dirname  $i); file_name=$(basename $i .qualityScore2.sam); echo "samtools view -@ 16 -b -S $i -o $file_name.qualityScore2.bam" >>$file_name.heatmap_Visualization.sh; echo "samtools sort -o $file_name.qualityScore2.sorted.bam $file_name.qualityScore2.bam " >>$file_name.heatmap_Visualization.sh; echo "samtools index $file_name.qualityScore2.sorted.bam" >>$file_name.heatmap_Visualization.sh ; echo "bamCoverage -b $file_name.qualityScore2.sorted.bam -o $file_name.qualityScore2.bw ">>$file_name.heatmap_Visualization.sh;done
for i in `ls *heatmap_Visualization.sh`;do echo "bash $i">>heatmap_Visualization.sh;done
ParaFly -c heatmap_Visualization.sh -CPU 10
#進行圖形的繪制
computeMatrix scale-regions -S IgG_rep1.qualityScore2.bw \
                               IgG_rep2.qualityScore2.bw \
                               K27me3_rep1.qualityScore2.bw \
                               K27me3_rep2.qualityScore2.bw \
                               K4m3_rep1.qualityScore2.bw \
                               K4m3_rep2.qualityScore2.bw \
                              -R test1 \
                              --beforeRegionStartLength 3000 \
                              --regionBodyLength 5000 \
                              --afterRegionStartLength 3000 \
                              --skipZeros -o matrix_gene.mat.gz -p 8

結(jié)果:


image.png

03.繪制heatmap進行可視化——Peak

#生成用于繪制圖形的peak bed文件、bw文件,并進行圖形的繪制
for i in `ls /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/07.Peak_calling/01.SEACR/*stringent.bed`;do dir_path=$(dirname  $i); file_name=$(basename $i .stringent.bed); file_name1=${file_name%%.*};echo "awk '{split(\$6, summit, \":\"); split(summit[2], region, \"-\"); print summit[1]\"\t\"region[1]\"\t\"region[2]}' $i>$file_name.summitRegion.bed" >>$file_name.heatmap_peak.sh; echo "computeMatrix reference-point -S /public/home/jlwang_gibh/project/05.gu/01.CUTTag_test/08.Visualization/01.heatmap/$file_name1.qualityScore2.bw  -R $file_name.summitRegion.bed --skipZeros -o $file_name.SEACR.mat.gz -p 8 -a 3000 -b 3000 --referencePoint center">>$file_name.heatmap_peak.sh; echo "plotHeatmap -m $file_name.SEACR.mat.gz -out $file_name.png --sortUsing sum --startLabel \"Peak Start\" --endLabel \"Peak End\" --xAxisLabel \"\" --regionsLabel \"Peaks\" --samplesLabel \" $file_name1\" ">>$file_name.heatmap_peak.sh;done

for i in `ls *heatmap_peak.sh`;do echo "bash $i">>heatmap_peak.sh;done
ParaFly -c heatmap_peak.sh -CPU 1 

K4me3 peak的圖形:


image.png

K27me3 peak的圖形:


image.png

10.差異分析

使用負二項分布模型進行差異分析,使用包為DESeq2。一般要進行差異分析的數(shù)據(jù)應(yīng)該為同一類型的,但因為這邊數(shù)據(jù)的限制,使用不同histone的數(shù)據(jù)進行演示。

library(GenomicRanges)
library(dplyr)
library(chromVAR)
library(DESeq2)
setwd("C:/Users/Kong_lab_user/Desktop")
#設(shè)置sample名稱
sampleList = c("K27me3_rep1", "K27me3_rep2", "K4me3_rep1", "K4me3_rep2", "IgG_rep1", "IgG_rep2")
histL = c("K27me3", "K4m3")
repL = c("rep1", "rep2")
peakTypes = c("0.01", "IgG_rep1","IgG_rep2")
inPeakData = c()
#循環(huán)讀取peak的bed文件,并整合進mPeak
for(hist in histL){
  for(rep in repL){
    for(peakType in peakTypes){
      #讀取peak的bed文件
      peakRes = read.table(paste0(hist, "_", rep, ".",peakType,".stringent.bed"), header = FALSE, fill = TRUE)
      #將所有peak儲存與mpeak中
      mPeak = GRanges(seqnames = peakRes$V1, IRanges(start = peakRes$V2, end = peakRes$V3), strand = "*") %>% append(mPeak, .)    
  }
}
}
#合并具有overlap的peak,保留主峰
masterPeak = reduce(mPeak)

#獲取主峰列表中峰的比對數(shù)

countMat = matrix(NA, length(masterPeak), length(histL)*length(repL))

## overlap with bam file to get count
i = 1
for(hist in histL){
  for(rep in repL){
    
    bamFile = paste0( hist, "_", rep, ".qualityScore2.mapped.bam")
    fragment_counts <- getCounts(bamFile, masterPeak, paired = TRUE, by_rg = FALSE, format = "bam")
    countMat[, i] = counts(fragment_counts)[,1]
    i = i + 1
  }
}
colnames(countMat) = paste(rep(histL, 2), rep(repL, each = 2), sep = "_")

#進行測序深度的歸一化和差異peak的檢測
##刪除low counts的峰
selectR = which(rowSums(countMat) > 5) 
dataS = countMat[selectR,]
condition = factor(rep(histL, each = length(repL)))
##構(gòu)建DDS對象
dds = DESeqDataSetFromMatrix(countData = dataS,
                             colData = DataFrame(condition),
                             design = ~ condition)
#進行差異分析
DDS = DESeq(dds)
#進行歸一化
normDDS = counts(DDS, normalized = TRUE) ## normalization with respect to the sequencing depth
colnames(normDDS) = paste0(colnames(normDDS), "_norm")
res = results(DDS, independentFiltering = FALSE, altHypothesis = "greaterAbs")
#合并結(jié)果
countMatDiff = cbind(dataS, normDDS, res)

結(jié)果:


image.png
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容