GATK官方給出了從RNA-seq數(shù)據(jù)中尋找變異位點(diǎn)的流程,但這個(gè)示意圖比較簡潔,實(shí)際操作時(shí)一不小心就會(huì)報(bào)錯(cuò),故經(jīng)過探索,記錄下這個(gè)流程的細(xì)節(jié)以及半自動(dòng)化的腳本。

1. 軟件及環(huán)境
samtools: 1.9
STAR: 2.7.0e
GATK: 4.1.4.1
操作系統(tǒng):Red Hat 4.8.5-28
2.流程概述
STAR進(jìn)行比對生成BAM文件 -> GATK內(nèi)置Picard工具對BAM文件進(jìn)行處理,以適配后續(xù)分析流程,否則會(huì)報(bào)錯(cuò) -> HaplotypeCaller尋找變異 -> 分SNP,Indel兩種模式對變異進(jìn)行過濾。
*若有生物學(xué)重復(fù),可以在call變異這一步為每一個(gè)樣本生成gvcf文件,然后合并gvcf,進(jìn)行joint-calling。
3.代碼
1)mapping
對于RNA-seq數(shù)據(jù),GATK建議使用STAR這一比對軟件。這一步我并沒有寫到一個(gè)腳本里。
首先建立index:
STAR --runThreadN 20 \ #線程數(shù)
--runMode genomeGenerate \
--genomeDir /your_path/STAR/1.index \ #生成index的路徑
--genomeFastaFiles /your_path/hg38_UCSC.fa \ #參考基因組文件
--sjdbGTFfile /your_path/gencode.v32.annotation.gtf \ #參考基因組注釋文件
--sjdbOverhang 140 #reads長度-1
結(jié)果如下:

然后進(jìn)行mapping:
STAR --runMode alignReads\
--runThreadN 20 \
--genomeDir /your_path/STAR/1.index/ \ #上一步生成的index路徑
--outFileNamePrefix /your_path/2.mapping/${sample} \ #mapping結(jié)果的路徑及前綴
--outSAMmapqUnique 255 \ #唯一比對的質(zhì)量值設(shè)為255
--outSAMtype BAM SortedByCoordinate \ #直接輸出排序好的BAM文件
--outBAMsortingThreadN 10 \
--readFilesIn /your_path/${sample}_1.clean.trimmed.fq /your_path/${sample}_2.clean.trimmed.fq #雙端測序文件
結(jié)果如下:

接著利用上一步mapping結(jié)果收集到的junction信息,也就是SJ.out.tab文件第二次構(gòu)建index,多個(gè)樣本時(shí)將它們的結(jié)果一起輸入:*
STAR --runThreadN 40 \
--runMode genomeGenerate \
--genomeDir /your_path/STAR/3.index \
--genomeFastaFiles /your_path/hg38_UCSC.fa \
--sjdbGTFfile /your_path/gencode.v32.annotation.gtf \
--sjdbFileChrStartEnd /your_path/STAR/2.mapping/*SJ.out.tab \ #第一次mapping得到的SJ.out.tab文件
--sjdbOverhang 140 \
--limitSjdbInsertNsj 1100000 #由于我的樣本太多,得到的junction數(shù)量超過上限,因此加上這個(gè)參數(shù)更改上限(默認(rèn)值:1000000)
最后進(jìn)行第二次mapping,代碼和第一次mapping類似,只是更換下index和output路徑,代碼就不貼了。
2)多樣本模式進(jìn)行variants calling
首先從bam文件到gvcf文件,我寫到了一個(gè)腳本里,命名bam2gvcf.bash:
*#!/bin/bash*
#mapping BAM to vcf
#軟件, reference genome路徑,根據(jù)實(shí)際情況修改,=兩邊不能有空格
gatk=your path to GATK
samtools=your path to samtools
ref_genome=your path to reference genome
dbsnp=your path to dbsnp
#輸入?yún)?shù)
BAM=$1 #mapping后的BAM文件
RGID=$2 #read group ID
library=$3 #測序文庫編號(hào)
sample=$4 #sample名稱
outdir=$5 #輸出文件路徑
#按樣本設(shè)置目錄
outdir = ${outdir}/${sample}
#output diretory
if [ ! -d $outdir/variants ]
then mkdir -p $outdir/variants
fi
if [ ! -d $outdir/processBAM ]
then mkdir -p $outdir/processBAM
fi
#為BAM文件添加RG標(biāo)簽,這步省略后續(xù)會(huì)報(bào)錯(cuò)
time $gatk AddOrReplaceReadGroups \
--INPUT $BAM \
--OUTPUT $outdir/processBAM/${sample}.RG.bam \
--RGID $RGID \
--RGLB $library \
--RGPL illumina \
--RGPU snpcall \
--RGSM ${sample} && echo "** ADD RG done **"
#標(biāo)記因PCR產(chǎn)生的重復(fù)reads,mark duplicates & index
time $gatk MarkDuplicates \
-I $outdir/processBAM/${sample}.RG.bam \
-M $outdir/processBAM/${sample}.markdup_metrics.txt \
-O $outdir/processBAM/${sample}.RG.markdup.bam &&\
time $samtools index $outdir/processBAM/${sample}.RG.markdup.bam && echo "** ${sample}.RG.markdup.bam index done **"
#split N Trim,去除落到內(nèi)含子上的片段,但運(yùn)行結(jié)果發(fā)現(xiàn)仍然有很多intron上的變異
time $gatk SplitNCigarReads \
--R $ref_genome \
--I $outdir/processBAM/${sample}.RG.markdup.bam \
--O $outdir/processBAM/${sample}.RG.markdup.split.bam && echo "** ${sample}.RG.markdup.bam split N done **"
#HaplotypeCaller
time $gatk HaplotypeCaller \
-R $ref_genome \
-I $outdir/processBAM/${sample}.RG.markdup.split.bam \
-O $outdir/variants/${sample}.g.vcf.gz \
-dont-use-soft-clipped-bases \
--emit-ref-confidence GVCF \
--standard-min-confidence-threshold-for-calling 20 \
--dbsnp $dbsnp && echo "** ${sample}.gvcf done **"
#執(zhí)行腳本,參數(shù)根據(jù)實(shí)際情況填進(jìn)去
bash bam2gvcf.sh bam文件 RG LB SAMPLE_ID OUTPUT
然后合并gvcf,并對變異進(jìn)行硬過濾,腳本名gvcf2vcf.sh:
#!/bin/bash
*#gvcf to vcf*
#路徑
gatk=path to GATK
ref_genome=path to reference genome
#執(zhí)行參數(shù)
samples=$1 #所有的樣本ID, 用","分隔開
indir=$2 #輸入目錄的路徑,與上個(gè)腳本的輸出路徑相同
outname=$3 #輸出文件名的前綴
outdir=$indir
#設(shè)置群體變異檢測結(jié)果輸出目錄
if [ ! -d $outdir/population ]
then mkdir -p $outdir/population
fi
#按照",",把所有樣本ID拆分出來存至數(shù)組中
samples = $(echo $samples | tr "," "\n")
##群體的joint genotyping, 先合并所有的gvcf結(jié)果,然后統(tǒng)一進(jìn)行GenotypeGVCFs
sample_gvcfs=""
for sample in $samples; do
sample_gvcfs=${sample_gvcfs}"-V $outdir/${sample}/variants/${sample}.g.vcf.gz "
done
time $gatk CombineGVCFs \
-R $ref_genome \
${sample_gvcfs} \
-O $outdir/population/${outname}.combine.g.vcf.gz && echo "** ${outname}.combine.g.vcf done **" && \
time $gatk GenotypeGVCFs \
-R $ref_genome \
-V $outdir/population/${outname}.combine.g.vcf.gz \
-O $outdir/population/${outname}.vcf.gz && echo "** ${outname}.vcf.gz done **"
#select SNP & filter
time $gatk SelectVariants \
-select-type SNP \
-V $outdir/population/${outname}.vcf.gz \
-O $outdir/population/${outname}.snp.vcf.gz && \
time $gatk VariantFiltration \
-V $outdir/population/${outname}.snp.vcf.gz \
-O $outdir/population/${outname}.snp.filter.vcf.gz \
-R $ref_genome \
--filter-expression 'QD < 2.0 || FS > 60.0 || MQ <25.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0' \ #過濾的閾值可以根據(jù)實(shí)驗(yàn)進(jìn)行設(shè)置
--filter-name "my_filter" && echo "** ${outname}.snp filter done **"
對于indel,過濾的方法類似,但由于我的實(shí)驗(yàn)?zāi)康牟魂P(guān)心indel,因此腳本中不涉及。
3)單樣本進(jìn)行variants calling
可以放在一個(gè)腳本中:
#!/bin/bash
#mapping BAM to vcf
#軟件, reference genome路徑,根據(jù)實(shí)際情況修改
gatk=path to GATK
samtools=path to samtools
ref_genome=path to reference genome
dbsnp=path to dbsnp
#參數(shù)
BAM=$1 #mapping后的BAM文件
RGID=$2 #read group ID
library=$3 #測序文庫編號(hào)
sample=$4 #sample名稱
outdir=$5 #輸出文件路徑
#按樣本設(shè)置目錄
outdir = ${outdir}/${sample}
#output diretory
if [ ! -d $outdir/variants ]
then mkdir -p $outdir/variants
fi
if [ ! -d $outdir/processBAM ]
then mkdir -p $outdir/processBAM
fi
#add RG & sort
time $gatk AddOrReplaceReadGroups \
--INPUT $BAM \
--OUTPUT $outdir/processBAM/${sample}.RG.bam \
--RGID $RGID \
--RGLB $library \
--RGPL illumina \
--RGPU snpcall \
--RGSM ${sample} && echo "** ADD RG done **"
#mark duplicates & index
time $gatk MarkDuplicates \
-I $outdir/processBAM/${sample}.RG.bam \
-M $outdir/processBAM/${sample}.markdup_metrics.txt \
-O $outdir/processBAM/${sample}.RG.markdup.bam &&\
time $samtools index $outdir/processBAM/${sample}.RG.markdup.bam && echo "** ${sample}.RG.markdup.bam index done **"
#split N Trim
time $gatk SplitNCigarReads \
--R $ref_genome \
--I $outdir/processBAM/${sample}.RG.markdup.bam \
--O $outdir/processBAM/${sample}.RG.markdup.split.bam && echo "** ${sample}.RG.markdup.bam split N done **"
#HaplotypeCaller
time $gatk HaplotypeCaller \
-R $ref_genome \
-I $outdir/processBAM/${sample}.RG.markdup.split.bam \
-O $outdir/variants/${sample}.vcf.gz \
-dont-use-soft-clipped-bases \
--standard-min-confidence-threshold-for-calling 20 \
--dbsnp $dbsnp && echo "** ${sample}.vcf done **"
#選擇SNP,并filter
time $gatk SelectVariants \
-select-type SNP \
-V $outdir/variants/${sample}.vcf.gz \
-O $outdir/variants/${sample}.snp.vcf.gz && \
time $gatk VariantFiltration \
-V $outdir/variants/${sample}.vcf.gz \
-O $outdir/variants/${sample}.snp.filter.vcf.gz \
-R $ref_genome \
--filter-expression 'QD < 2.0 || FS > 60.0 || MQ <25.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0' \ #過濾的閾值可以根據(jù)實(shí)驗(yàn)進(jìn)行設(shè)置
--filter-name "my_filter" && echo "** ${sample}.snp filter done **"
接下來就可以用ANNOVAR對變異進(jìn)行注釋了。