Extract high-quality variants from VCF files

This post shows a convenient way to extract the high-quality variant sites from multiple VCF files using a simple Python script.
The script scans the whole VCF file, selects the high-quality variants, and writes a new VCF file that includes only the high-quality variant sites.

The following low-quality variants are excluded:
  1. variant sites with a sequencing depth lower than min_depth or higher than max_depth
  2. heterozygous variant sites
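
The two criteria above could be written as a single check on the genotype string and the depth. This is only a minimal sketch, assuming unphased genotypes separated by "/" as in the script; the function name is_high_quality is hypothetical and is not part of the script.

```python
def is_high_quality(gt, dp, min_depth, max_depth):
    """Return True if the genotype is homozygous and the depth is within bounds."""
    alleles = gt.split("/")
    # Homozygous means both alleles are identical (e.g. "1/1"); "0/1" is heterozygous.
    homozygous = len(alleles) == 2 and alleles[0] == alleles[1]
    # The depth must lie in the closed interval [min_depth, max_depth].
    return homozygous and min_depth <= dp <= max_depth


print(is_high_quality("1/1", 20, 10, 50))  # True
print(is_high_quality("0/1", 20, 10, 50))  # False (heterozygous)
print(is_high_quality("1/1", 5, 10, 50))   # False (depth too low)
```

Note that a check of this kind also accepts homozygous reference calls such as "0/0", because it only compares the two alleles with each other.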

Here's the Python script:

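# -*- coding:utf-8 -*-
import sys

# Depth thresholds from the command line, converted to integers once
min_depth = int(sys.argv[1])
max_depth = int(sys.argv[2])

for line in sys.stdin:
    if line.startswith("#"):
        # Keep header lines unchanged
        print(line.strip())
    elif line.startswith("Chr"):
        # Assumes chromosome names start with "Chr" and one sample in column 10
        sample = line.split()[9].split(":")
        # Assumes GT is the first sample sub-field and DP is the third
        gt = sample[0].split("/")
        dp = int(sample[2])
        # Keep homozygous sites whose depth lies within [min_depth, max_depth]
        if gt[0] == gt[1] and min_depth <= dp <= max_depth:
            print(line.strip())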
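
The script takes DP from the third sample sub-field, which only works when the FORMAT column of your VCF files puts DP in that position. As a hedged alternative, the sketch below looks the value up by name in the FORMAT column (column 9). The helper name get_sample_field and the example record are made up for illustration.

```python
def get_sample_field(record_line, key, sample_index=0):
    """Return the value of `key` for one sample, or None if it is absent."""
    fields = record_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")  # e.g. ["GT", "AD", "DP"]
    sample_values = fields[9 + sample_index].split(":")
    if key not in format_keys:
        return None
    idx = format_keys.index(key)
    # A sample column may carry fewer sub-fields than FORMAT, so guard the index.
    return sample_values[idx] if idx < len(sample_values) else None


# Illustrative record, not taken from real data
record = "Chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:AD:DP\t1/1:0,20:20"
print(get_sample_field(record, "GT"))  # 1/1
print(get_sample_field(record, "DP"))  # 20
```

With such a helper, the depth would be read as int(get_sample_field(line, "DP")) after checking that the result is not None.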

VCF files are usually gzip-compressed ("*.vcf.gz"). If you store all your VCF files in one folder, you just need to run the following command line in that folder:

for sample in *.vcf.gz; do zcat "$sample" | python3 select_hq_vcf.py <your_min_depth> <your_max_depth> > "${sample%.vcf.gz}.hq.vcf"; done
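
If you would rather stay in Python than pipe through zcat, the gzip module from the standard library can read the compressed files directly. The following is only a sketch: the helper names output_name and read_vcf_lines are hypothetical, and the temporary file exists just to make the example self-contained.

```python
import gzip
import os
import tempfile


def output_name(vcf_gz_path):
    """Map 'sampleA.vcf.gz' to 'sampleA.hq.vcf'."""
    suffix = ".vcf.gz"
    if vcf_gz_path.endswith(suffix):
        vcf_gz_path = vcf_gz_path[:-len(suffix)]
    return vcf_gz_path + ".hq.vcf"


def read_vcf_lines(vcf_gz_path):
    """Yield text lines from a gzip-compressed VCF, similar to zcat."""
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            yield line


# Self-contained demonstration on a temporary file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sampleA.vcf.gz")
    with gzip.open(path, "wt") as out:
        out.write("##fileformat=VCFv4.2\n")
    print(list(read_vcf_lines(path)))
    print(os.path.basename(output_name(path)))
```

Each input file gets its own output name, so results from different samples do not overwrite one another.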