GATK官方給出了從RNA-seq數(shù)據(jù)中尋找變異位點(diǎn)的流程，但這個(gè)示意圖比較簡潔，實(shí)際操作時(shí)一不小心就會(huì)報(bào)錯(cuò)，故經(jīng)過探索，記錄下這個(gè)流程的細(xì)節(jié)以及半自動(dòng)化的腳本。

image

1. 軟件及環(huán)境

samtools: 1.9

STAR: 2.7.0e

GATK: 4.1.4.1

操作系統(tǒng)：Red Hat 4.8.5-28

2.流程概述

STAR進(jìn)行比對生成BAM文件 -> GATK內(nèi)置Picard工具對BAM文件進(jìn)行處理，以適配后續(xù)分析流程，否則會(huì)報(bào)錯(cuò) -> HaplotypeCaller尋找變異 -> 分SNP，Indel兩種模式對變異進(jìn)行過濾。

*若有生物學(xué)重復(fù)，可以在call變異這一步為每一個(gè)樣本生成gvcf文件，然后合并gvcf，進(jìn)行joint-calling。

3.代碼

1）mapping

對于RNA-seq數(shù)據(jù)，GATK建議使用STAR這一比對軟件。這一步我并沒有寫到一個(gè)腳本里。

首先建立index：


STAR --runThreadN 20 \ #線程數(shù)

    --runMode genomeGenerate \

    --genomeDir /your_path/STAR/1.index \ #生成index的路徑

    --genomeFastaFiles /your_path/hg38_UCSC.fa \  #參考基因組文件

    --sjdbGTFfile /your_path/gencode.v32.annotation.gtf \ #參考基因組注釋文件

    --sjdbOverhang 140 #reads長度-1

結(jié)果如下：

image

然后進(jìn)行mapping：


STAR --runMode alignReads\

      --runThreadN 20 \ 

      --genomeDir /your_path/STAR/1.index/  \ #上一步生成的index路徑

      --outFileNamePrefix /your_path/2.mapping/${sample} \ #mapping結(jié)果的路徑及前綴

      --outSAMmapqUnique 255  \ #唯一比對的質(zhì)量值設(shè)為255

      --outSAMtype BAM SortedByCoordinate \ #直接輸出排序好的BAM文件

      --outBAMsortingThreadN 10 \

      --readFilesIn /your_path/${sample}_1.clean.trimmed.fq /your_path/${sample}_2.clean.trimmed.fq #雙端測序文件

結(jié)果如下：

image

接著利用上一步mapping結(jié)果收集到的junction信息，也就是SJ.out.tab文件第二次構(gòu)建index，多個(gè)樣本時(shí)將它們的結(jié)果一起輸入：*


STAR --runThreadN 40 \

    --runMode genomeGenerate \

    --genomeDir  /your_path/STAR/3.index \

    --genomeFastaFiles /your_path/hg38_UCSC.fa \

    --sjdbGTFfile /your_path/gencode.v32.annotation.gtf \

    --sjdbFileChrStartEnd /your_path/STAR/2.mapping/*SJ.out.tab \ #第一次mapping得到的SJ.out.tab文件

    --sjdbOverhang 140 \

    --limitSjdbInsertNsj 1100000 #由于我的樣本太多，得到的junction數(shù)量超過上限，因此加上這個(gè)參數(shù)更改上限（默認(rèn)值：1000000）

最后進(jìn)行第二次mapping，代碼和第一次mapping類似，只是更換下index和output路徑，代碼就不貼了。

2）多樣本模式進(jìn)行variants calling

首先從bam文件到gvcf文件，我寫到了一個(gè)腳本里，命名bam2gvcf.bash：


*#！/bin/bash* 

#mapping BAM to vcf

#軟件, reference genome路徑，根據(jù)實(shí)際情況修改，=兩邊不能有空格

gatk=your path to GATK

samtools=your path to samtools

ref_genome=your path to reference genome

dbsnp=your path to dbsnp

#輸入?yún)?shù)

BAM=$1 #mapping后的BAM文件

RGID=$2 #read group ID

library=$3 #測序文庫編號(hào)

sample=$4 #sample名稱

outdir=$5 #輸出文件路徑

#按樣本設(shè)置目錄

outdir = ${outdir}/${sample}

#output diretory

if [ ! -d $outdir/variants ]

then mkdir -p $outdir/variants

fi

if [ ! -d $outdir/processBAM ]

then mkdir -p $outdir/processBAM

fi

#為BAM文件添加RG標(biāo)簽，這步省略后續(xù)會(huì)報(bào)錯(cuò)

time $gatk AddOrReplaceReadGroups \

    --INPUT $BAM \

    --OUTPUT $outdir/processBAM/${sample}.RG.bam \

    --RGID $RGID \

    --RGLB $library \

    --RGPL illumina \

    --RGPU snpcall \

    --RGSM ${sample} && echo "** ADD RG done **"

#標(biāo)記因PCR產(chǎn)生的重復(fù)reads，mark duplicates & index

time $gatk MarkDuplicates \

    -I $outdir/processBAM/${sample}.RG.bam \

    -M $outdir/processBAM/${sample}.markdup_metrics.txt \

    -O $outdir/processBAM/${sample}.RG.markdup.bam &&\

time $samtools index $outdir/processBAM/${sample}.RG.markdup.bam && echo "** ${sample}.RG.markdup.bam index done **"

#split N Trim，去除落到內(nèi)含子上的片段，但運(yùn)行結(jié)果發(fā)現(xiàn)仍然有很多intron上的變異

time $gatk SplitNCigarReads \

    --R $ref_genome \

    --I $outdir/processBAM/${sample}.RG.markdup.bam \

    --O $outdir/processBAM/${sample}.RG.markdup.split.bam && echo "** ${sample}.RG.markdup.bam split N done **"

#HaplotypeCaller

time $gatk HaplotypeCaller \

    -R $ref_genome \

    -I $outdir/processBAM/${sample}.RG.markdup.split.bam \

    -O $outdir/variants/${sample}.g.vcf.gz \

    -dont-use-soft-clipped-bases \

    --emit-ref-confidence GVCF \

    --standard-min-confidence-threshold-for-calling 20 \

    --dbsnp $dbsnp && echo "** ${sample}.gvcf done **"


#執(zhí)行腳本，參數(shù)根據(jù)實(shí)際情況填進(jìn)去

bash bam2gvcf.sh bam文件 RG LB SAMPLE_ID OUTPUT

然后合并gvcf，并對變異進(jìn)行硬過濾，腳本名gvcf2vcf.sh：


#!/bin/bash

*#gvcf to vcf* 

#路徑

gatk=path to GATK

ref_genome=path to reference genome

#執(zhí)行參數(shù)

samples=$1 #所有的樣本ID， 用","分隔開

indir=$2 #輸入目錄的路徑，與上個(gè)腳本的輸出路徑相同

outname=$3 #輸出文件名的前綴

outdir=$indir

#設(shè)置群體變異檢測結(jié)果輸出目錄

if [ ! -d $outdir/population ]

then mkdir -p $outdir/population

fi

#按照","，把所有樣本ID拆分出來存至數(shù)組中

samples = $(echo $samples | tr "," "\n")

##群體的joint genotyping, 先合并所有的gvcf結(jié)果，然后統(tǒng)一進(jìn)行GenotypeGVCFs

sample_gvcfs=""

for sample in $samples; do

    sample_gvcfs=${sample_gvcfs}"-V $outdir/${sample}/variants/${sample}.g.vcf.gz "

done

time $gatk CombineGVCFs \

    -R $ref_genome \

    ${sample_gvcfs} \

    -O $outdir/population/${outname}.combine.g.vcf.gz && echo "** ${outname}.combine.g.vcf done **" && \

time $gatk GenotypeGVCFs \

    -R $ref_genome \

    -V $outdir/population/${outname}.combine.g.vcf.gz \

    -O $outdir/population/${outname}.vcf.gz && echo "** ${outname}.vcf.gz done **"

#select SNP & filter

time $gatk SelectVariants \

    -select-type SNP \

    -V $outdir/population/${outname}.vcf.gz \

    -O $outdir/population/${outname}.snp.vcf.gz && \

time $gatk VariantFiltration \

    -V $outdir/population/${outname}.snp.vcf.gz \

    -O $outdir/population/${outname}.snp.filter.vcf.gz \

    -R $ref_genome \

    --filter-expression 'QD < 2.0 || FS > 60.0 || MQ <25.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0' \ #過濾的閾值可以根據(jù)實(shí)驗(yàn)進(jìn)行設(shè)置

    --filter-name "my_filter" && echo "** ${outname}.snp filter done **"

對于indel，過濾的方法類似，但由于我的實(shí)驗(yàn)?zāi)康牟魂P(guān)心indel，因此腳本中不涉及。

3）單樣本進(jìn)行variants calling

可以放在一個(gè)腳本中：


#！/bin/bash

#mapping BAM to vcf

#軟件, reference genome路徑，根據(jù)實(shí)際情況修改

gatk=path to GATK

samtools=path to samtools

ref_genome=path to reference genome

dbsnp=path to dbsnp

#參數(shù)

BAM=$1 #mapping后的BAM文件

RGID=$2 #read group ID

library=$3 #測序文庫編號(hào)

sample=$4 #sample名稱

outdir=$5 #輸出文件路徑

#按樣本設(shè)置目錄

outdir = ${outdir}/${sample}

#output diretory

if [ ! -d $outdir/variants ]

then mkdir -p $outdir/variants

fi

if [ ! -d $outdir/processBAM ]

then mkdir -p $outdir/processBAM

fi

#add RG & sort

time $gatk AddOrReplaceReadGroups \

    --INPUT $BAM \

    --OUTPUT $outdir/processBAM/${sample}.RG.bam \

    --RGID $RGID \

    --RGLB $library \

    --RGPL illumina \

    --RGPU snpcall \

    --RGSM ${sample} && echo "** ADD RG done **"

#mark duplicates & index

time $gatk MarkDuplicates \

    -I $outdir/processBAM/${sample}.RG.bam \

    -M $outdir/processBAM/${sample}.markdup_metrics.txt \

    -O $outdir/processBAM/${sample}.RG.markdup.bam &&\

time $samtools index $outdir/processBAM/${sample}.RG.markdup.bam && echo "** ${sample}.RG.markdup.bam index done **"

#split N Trim

time $gatk SplitNCigarReads \

    --R $ref_genome \

    --I $outdir/processBAM/${sample}.RG.markdup.bam \

    --O $outdir/processBAM/${sample}.RG.markdup.split.bam && echo "** ${sample}.RG.markdup.bam split N done **"

#HaplotypeCaller

time $gatk HaplotypeCaller \

    -R $ref_genome \

    -I $outdir/processBAM/${sample}.RG.markdup.split.bam \

    -O $outdir/variants/${sample}.vcf.gz \

    -dont-use-soft-clipped-bases \

    --standard-min-confidence-threshold-for-calling 20 \

    --dbsnp $dbsnp && echo "** ${sample}.vcf done **"

#選擇SNP，并filter

time $gatk SelectVariants \

    -select-type SNP \

    -V $outdir/variants/${sample}.vcf.gz \

    -O $outdir/variants/${sample}.snp.vcf.gz && \

time $gatk VariantFiltration \

    -V $outdir/variants/${sample}.vcf.gz \

    -O $outdir/variants/${sample}.snp.filter.vcf.gz \

    -R $ref_genome \

    --filter-expression 'QD < 2.0 || FS > 60.0 || MQ <25.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0' \ #過濾的閾值可以根據(jù)實(shí)驗(yàn)進(jìn)行設(shè)置

    --filter-name "my_filter" && echo "** ${sample}.snp filter done **"

接下來就可以用ANNOVAR對變異進(jìn)行注釋了。

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

使用GATK從RNA-seq數(shù)據(jù)中call variants

使用GATK從RNA-seq數(shù)據(jù)中call variants

1. 軟件及環(huán)境

2.流程概述

3.代碼

1）mapping

2）多樣本模式進(jìn)行variants calling

3）單樣本進(jìn)行variants calling

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

使用GATK從RNA-seq數(shù)據(jù)中call variants

1. 軟件及環(huán)境

2.流程概述

3.代碼

1）mapping

2）多樣本模式進(jìn)行variants calling

3）單樣本進(jìn)行variants calling

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av