BWA Alignment in Detail

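BWA is mainly used to align low-divergence short sequences (usually from the same species) against a reference genome. It provides three alignment algorithms: BWA-backtrack, BWA-SW, and BWA-MEM. The first supports only short reads (<100 bp); the other two support longer sequences (70 bp to 1 Mbp) and also support split alignment. BWA-MEM is the newest of the three and is the one officially recommended.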

 Note: if a read is chimeric, two alignment records may be output for it.

 Note: each individual chromosome in the reference genome must be shorter than 2G.

Basic syntax:


bwa index -a bwtsw ref.fa            # build the FM-index for a large genome

bwa index -a is ref.fa               # build the index for a small genome; fast but memory-hungry

bwa mem ref.fa reads.fq > aln-se.sam

bwa mem ref.fa read1.fq read2.fq > aln-pe.sam

bwa aln ref.fa short_read.fq > aln_sa.sai

bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.sam

bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.sam

bwa bwasw ref.fa long_read.fq > aln.sam
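
The single-end and paired-end forms of `bwa mem` above differ only in the number of read files. The following is a minimal sketch of a wrapper that picks the right form. The function name `bwa_mem_cmd` is hypothetical, and the function only prints the command it would run, so nothing is executed until the output has been inspected.

```shell
#!/bin/sh
# Print (do not run) the bwa mem command for single-end or paired-end input.
# bwa_mem_cmd is a hypothetical helper; the bwa syntax is the one listed above.
bwa_mem_cmd() {
    ref=$1
    shift
    case $# in
        1) echo "bwa mem $ref $1 > aln-se.sam" ;;
        2) echo "bwa mem $ref $1 $2 > aln-pe.sam" ;;
        *) echo "usage: bwa_mem_cmd ref.fa reads.fq [mates.fq]" >&2
           return 1 ;;
    esac
}

bwa_mem_cmd ref.fa reads.fq            # single-end
bwa_mem_cmd ref.fa read1.fq read2.fq   # paired-end
```

Printing the command instead of running it keeps the sketch safe to try on a machine where the index has not been built yet.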

Built-in subcommands include index, mem, aln, samse, sampe, and bwasw.

index: builds an index of the reference genome

-p: Prefix of the output database
-a: Algorithm for constructing the index
        is     : fast but memory-hungry; suited to small genomes (<2G)
        bwtsw  : slower but memory-efficient; suited to large reference genomes
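
The choice between `is` and `bwtsw` can be scripted from the size of the reference. This is only a sketch of the rule of thumb stated above: the helper name `pick_index_algo`, the byte-based cutoff, and the example file names are assumptions made for illustration, and the commands are printed, not run.

```shell
#!/bin/sh
# Choose the -a value for bwa index from the reference size in bytes.
# pick_index_algo is a hypothetical helper; the 2 GB cutoff encodes the
# "<2G" guidance above and is an assumption, not a limit read from bwa.
pick_index_algo() {
    size_bytes=$1
    limit=2000000000
    if [ "$size_bytes" -lt "$limit" ]; then
        echo is
    else
        echo bwtsw
    fi
}

# Example sizes: a 5 MB genome and a 3.1 GB genome (file names are made up).
echo "bwa index -a $(pick_index_algo 5000000) -p small_ref small_ref.fa"
echo "bwa index -a $(pick_index_algo 3100000000) -p large_ref large_ref.fa"
```

In practice the size could come from something like `wc -c < ref.fa`, which counts header and newline bytes too, so it slightly overestimates the sequence length.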

MEM: maximal exact matches

  Seeds alignments first, then extends the seeds with the Smith-Waterman (SW) algorithm.
  Main options:
  -t INT    Number of threads [1]
            Thread count; defaults to 1.
  -k INT    Minimum seed length. Matches shorter than INT will be missed. The alignment speed is usually insensitive to this value unless it significantly deviates from 20. [19]
            Minimum seed match length. It affects alignment speed noticeably only when set far from 20. Default 19.
  -w INT    Band width. Essentially, gaps longer than INT will not be found. Note that the maximum gap length is also affected by the scoring matrix and the hit length, not solely determined by this option. [100]
            Roughly the maximum gap length: longer gaps will not be found. Default 100.
  -d INT    Off-diagonal X-dropoff (Z-dropoff). Stop extension when the difference between the best and the current extension score is above |i-j|*A+INT, where i and j are the current positions of the query and reference, respectively, and A is the matching score. Z-dropoff is similar to BLAST’s X-dropoff except that it doesn’t penalize gaps in one of the sequences in the alignment. Z-dropoff not only avoids unnecessary extension, but also reduces poor alignments inside a long good alignment. [100]
            Controls when extension stops.
  -r FLOAT  Trigger re-seeding for a MEM longer than minSeedLen*FLOAT. This is a key heuristic parameter for tuning the performance. Larger value yields fewer seeds, which leads to faster alignment speed but lower accuracy. [1.5]
            A key heuristic parameter: a larger value means fewer seeds and faster alignment. The default of 1.5 means re-seeding is triggered for MEMs longer than 1.5 times the minimum seed length.
  -c INT    Discard a MEM if it has more than INT occurrences in the genome. This is an insensitive parameter. [10000]
  -P    In the paired-end mode, perform SW to rescue missing hits only but do not try to find hits that fit a proper pair.
  -A INT    Matching score. [1]
  -B INT    Mismatch penalty. The sequence error rate is approximately: {.75 * exp[-log(4) * B/A]}. [4]
  -O INT    Gap open penalty. [6]
  -E INT    Gap extension penalty. A gap of length k costs O + k*E (i.e. -O is for opening a zero-length gap). [1]
  -L INT    Clipping penalty. When performing SW extension, BWA-MEM keeps track of the best score reaching the end of query. If this score is larger than the best SW score minus the clipping penalty, clipping will not be applied. Note that in this case, the SAM AS tag reports the best SW score; clipping penalty is not deducted. [5]
  -U INT    Penalty for an unpaired read pair. BWA-MEM scores an unpaired read pair as scoreRead1+scoreRead2-INT and scores a paired read pair as scoreRead1+scoreRead2-insertPenalty. It compares these two scores to determine whether pairing should be forced. [9]
  -p    Assume the first input query file is interleaved paired-end FASTA/Q. See the command description for details.
  -R STR    Complete read group header line. '\t' can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is '@RG\tID:foo\tSM:bar'. [null]
  -T INT    Don't output alignments with a score lower than INT. This option only affects output. [30]
  -a    Output all found alignments for single-end or unpaired paired-end reads. These alignments will be flagged as secondary alignments.
  -C    Append the FASTA/Q comment to the SAM output. This option can be used to transfer read meta information (e.g. barcode) to the SAM output. Note that the FASTA/Q comment (the string after a space in the header line) must conform to the SAM spec (e.g. BC:Z:CGTAC). Malformed comments lead to incorrect SAM output.
  -H    Use hard clipping ’H’ in the SAM output. This option may dramatically reduce the redundancy of output when mapping long contig or BAC sequences.
  -M    Mark shorter split hits as secondary (for Picard compatibility).
  -v INT    Control the verbose level of the output. This option has not been fully supported throughout BWA. Ideally, a value 0 for disabling all the output to stderr; 1 for outputting errors only; 2 for warnings and errors; 3 for all normal messages; 4 or higher for debugging. When this option takes value 4, the output is not SAM. [3]
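
Several of the options above are defined by small formulas, and working through them with the default values makes their effect easier to see. The helper functions below are a sketch only: their names are hypothetical and not part of bwa, and the `zdrop_stops` inputs are made-up numbers chosen to illustrate the -d rule as it is worded above.

```shell
#!/bin/sh
# Arithmetic behind a few bwa mem options, using the formulas quoted above.
# All function names are hypothetical helpers, not part of bwa.

# Cost of a gap of length k: O + k*E  (-O and -E).
gap_cost() {  # usage: gap_cost O E k
    echo $(( $1 + $3 * $2 ))
}

# Approximate sequence error rate implied by -A and -B:
# 0.75 * exp(-log(4) * B/A)
error_rate() {  # usage: error_rate A B
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.6f\n", 0.75 * exp(-log(4) * b / a) }'
}

# MEM length above which re-seeding is triggered: minSeedLen * FLOAT (-k and -r).
reseed_threshold() {  # usage: reseed_threshold k r
    awk -v k="$1" -v r="$2" 'BEGIN { printf "%.1f\n", k * r }'
}

# Z-dropoff test (-d): extension stops when best - current > |i-j|*A + d.
zdrop_stops() {  # usage: zdrop_stops best current i j A d
    diff=$(( $3 - $4 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - $diff ))
    fi
    if [ $(( $1 - $2 )) -gt $(( $diff * $5 + $6 )) ]; then
        echo stop
    else
        echo continue
    fi
}

gap_cost 6 1 3                      # defaults -O 6 -E 1, 3-base gap: prints 9
error_rate 1 4                      # defaults -A 1 -B 4: prints 0.002930
reseed_threshold 19 1.5             # defaults -k 19 -r 1.5: prints 28.5
zdrop_stops 150 40 120 118 1 100    # 110 > 2*1 + 100: prints stop
```

With the defaults, a 3-base gap costs as much as a little over two mismatches (9 versus 4 each), and only MEMs longer than 28.5 bp are re-seeded.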