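Before we start
Why use this tool?
Because I heard it is fast, and because samtools markdup and Picard had both burned me. samtools markdup told me to run fixmate first, which in turn wants the input sorted by read name, but I had already sorted with the default (coordinate) order, hmm. And after GATK Picard removed the duplicates, the output was larger than the original file. What on earth did it add?
Here is the link to the tool's documentation: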
http://lomereiter.github.io/sambamba/docs/sambamba-markdup.html
Getting started
gzip -d sambamba-0.6.8.gz
chmod a+x sambamba-0.6.8
./sambamba-0.6.8
Download it, decompress it, and put it on your PATH. That is all there is to it; no installation is required.
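The "put it on your PATH" step can be sketched roughly as follows. This is only one way to do it: the helper name install_sambamba and the default target directory $HOME/bin are my own assumptions, not something sambamba prescribes, and the PATH change only lasts for the current shell session.

```shell
#!/bin/sh
# Sketch: make a decompressed sambamba binary callable as plain "sambamba".
# install_sambamba is a hypothetical helper; $HOME/bin is an assumed default.
install_sambamba() {
    src=$1                       # decompressed binary, e.g. ./sambamba-0.6.8
    bin_dir=${2:-"$HOME/bin"}    # target directory (assumption; any directory works)
    mkdir -p "$bin_dir" || return 1
    cp "$src" "$bin_dir/sambamba" || return 1
    chmod a+x "$bin_dir/sambamba" || return 1
    # Prepend the directory to PATH for the current shell, unless already there
    case ":$PATH:" in
        *":$bin_dir:"*) ;;
        *) PATH="$bin_dir:$PATH"; export PATH ;;
    esac
}

# Usage, once the binary has been downloaded and decompressed:
# install_sambamba ./sambamba-0.6.8
# command -v sambamba
```

To keep the change across sessions, the same PATH line would have to go into your shell's startup file.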
NAME
sambamba-markdup - finding duplicate reads in BAM file
SYNOPSIS
sambamba markdup OPTIONS <input.bam> <output.bam>
DESCRIPTION
Marks (by default) or removes duplicate reads. For determining whether a read is a duplicate or not, the same criteria as in Picard are used.
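To make the mark-versus-remove distinction concrete, here is a small sketch built only from the synopsis and the -r flag. The wrapper run_markdup, the DRY_RUN switch, and the file names are hypothetical and just for illustration; with DRY_RUN=1 the command is printed instead of executed.

```shell
#!/bin/sh
# Sketch: mark duplicates (the default) or remove them (-r).
# run_markdup and DRY_RUN are hypothetical; file names are placeholders.
run_markdup() {
    mode=$1; in_bam=$2; out_bam=$3
    case "$mode" in
        mark)   set -- sambamba markdup "$in_bam" "$out_bam" ;;
        remove) set -- sambamba markdup -r "$in_bam" "$out_bam" ;;
        *) echo "usage: run_markdup mark|remove <input.bam> <output.bam>" >&2
           return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"    # only print the command that would run
    else
        "$@"         # actually invoke sambamba
    fi
}

# Example (prints the command without running it):
# DRY_RUN=1 run_markdup remove sample.sorted.bam sample.dedup.bam
```

In mark mode the reads are kept and only flagged, so the output holds the same reads as the input; in remove mode the duplicates are dropped from the output.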
OPTIONS
-r, --remove-duplicates
remove duplicates instead of just marking them
-t, --nthreads=NTHREADS
number of threads to use
-l, --compression-level=N
specify compression level of the resulting file (from 0 to 9)
-p, --show-progress
show progressbar in STDERR
--tmpdir=TMPDIR
specify directory for temporary files; default is /tmp
--hash-table-size=HASHTABLESIZE
size of hash table for finding read pairs (default is 262144 reads); will be rounded down to the nearest power of two; should be > (average coverage) * (insert size) for good performance
--overflow-list-size=OVERFLOWLISTSIZE
size of the overflow list where reads, thrown away from the hash table, get a second chance to meet their pairs (default is 200000 reads); increasing the size reduces the number of temporary files created
--io-buffer-size=BUFFERSIZE
controls sizes of two buffers of BUFFERSIZE megabytes each, used for reading and writing BAM during the second pass (default is 128)
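The --hash-table-size description gives two rules: the value is rounded down to a power of two, and it should exceed (average coverage) * (insert size). A rough sketch of picking a value under those rules is below. The helper names hash_table_size and markdup_cmd are hypothetical, and never going below the documented default of 262144 is my own choice, not a requirement of the tool.

```shell
#!/bin/sh
# Sketch: choose --hash-table-size from coverage and insert size.
# Picks the smallest power of two strictly greater than coverage * insert_size,
# starting from the documented default (262144, itself a power of two), so the
# tool's round-down step leaves the value unchanged.
hash_table_size() {
    coverage=$1; insert_size=$2
    need=$((coverage * insert_size))
    size=262144                      # documented default, used as a floor (assumption)
    while [ "$size" -le "$need" ]; do
        size=$((size * 2))
    done
    echo "$size"
}

# Print (not run) a markdup command line using options from the list above.
# Meant for review; file names containing spaces would need extra quoting.
markdup_cmd() {
    in_bam=$1; out_bam=$2; threads=$3; coverage=$4; insert_size=$5
    hts=$(hash_table_size "$coverage" "$insert_size")
    echo "sambamba markdup -t $threads -p --hash-table-size=$hts $in_bam $out_bam"
}

# Example: 1000x coverage, 500 bp inserts, 8 threads
# markdup_cmd in.bam out.bam 8 1000 500
```

For ordinary coverage the product stays well below the default, so the option only needs changing for very deep data.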
Test
Deduplication is really fast: removing duplicates from a 3 GB BAM file took only about 1 minute.
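If you want to check the timing on your own data, something like the sketch below works. The helper elapsed_seconds is hypothetical and only measures whole seconds of wall-clock time; the file names in the usage line are placeholders.

```shell
#!/bin/sh
# Sketch: run a command and print how many whole seconds it took.
# elapsed_seconds is a hypothetical helper, not part of sambamba.
elapsed_seconds() {
    start=$(date +%s)
    # Send the command's own stdout to stderr so only the number is printed
    "$@" >&2 || return $?
    end=$(date +%s)
    echo $((end - start))
}

# Usage:
# elapsed_seconds sambamba markdup -r -t 8 input.bam dedup.bam
```

The result will depend on thread count, disk speed, and how many duplicates the file contains.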