high-throughput seq analysis急先鋒——STAR的使用介紹

在所有物是人非的景色里，我最中意你。

正體

這次給大家?guī)淼氖荅NCODE project的御用比對軟件STAR，ENCODE項目是一個由美國國家人類基因組研究所(NHGRI)在2003年9月發(fā)起的一項公共聯(lián)合研究項目，旨在找出人類基因組中所有功能組件[。這是既完成人類基因組計劃后國家人類基因組研究所開始的最重要的項目之一。所有在該項目中產(chǎn)生的數(shù)據(jù)都會被迅速的在公共數(shù)據(jù)庫中公開。
在我之前的那篇RNA-seq數(shù)據(jù)分析---方法學(xué)文章的實(shí)戰(zhàn)練習(xí)文章里關(guān)于比對軟件的比較中STAR也展現(xiàn)了不俗的表現(xiàn)。所以在處理比對時我也考慮了將HISAT2與STAR共同使用，查看它們的表現(xiàn)情況，選取適合的比對工具。

STAR的安裝

cd biosoft && mkdir STAR && cd STAR
wget https://github.com/alexdobin/STAR/archive/2.5.3a.tar.gz
tar -xzf 2.5.3a.tar.gz
cd STAR-2.5.3a

# for easy use, add bin/ to your PATH

下載需要參考基因組并進(jìn)行index構(gòu)建

# downloading dna index fasta file
nohup wget -r -np -nH -nd -R index.html -L ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/dna_index/ &

# download gft annotation file
nohup wget ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/Homo_sapiens.GRCh38.90.chr_patch_hapl_scaff.gtf.gz &

mkdir STAR_index && cd STAR_index
STAR --runMode genomeGenerate --genomeDir ~/reference/STAR_index/ --genomeFastaFiles ~/reference/genome/hg38/Homo_sapiens.GRCh38.dna.toplevel.fa --sjdbGTFfile ~/reference/genome/hg38/Homo_sapiens.GRCh38.90.chr_patch_hapl_scaff.gtf --sjdbOverhang 199

# --sjdbOverhang 數(shù)值為reads長度-1
# Mode 為generate
# --genomeFastaFiles --sjdbGTFfile 分別對應(yīng)fasta文件和GTF文件

STAR的使用

# STAR的manual里面給了最基本的比對參數(shù)示例
STAR
--runThreadN NumberOfThreads
--genomeDir /path/to/genomeDir
--readFilesIn /path/to/read1 [/path/to/read2 ]

# 基本示例，針對fastq.gz文件增加--readFilesCommand gunzip -c 參數(shù)/--readFilesCommand zcat參數(shù)，針對bzip2文件使用--readFilesCommand bunzip2 -c參數(shù)
STAR --runThreadN 20 --genomeDir ~/reference/STAR_index/ --readFilesCommand zcat --readFilesIn ~/RNA-seq/LiuPing_data/RNA-seq/SC_w2q20m35_N_1.fq.gz ~/RNA-seq/LiuPing_data/RNA-seq/SC_w2q20m35_N_2.fq.gz

# 輸出unsorted or sorted bam file
--outSAMtype BAM Unsorted 實(shí)際上就是-name 的sort，下游可以直接接HTSeq
--outSAMtype BAM SortedByCoordinate
--outSAMtype BAM Unsorted SortedByCoordinate 兩者都輸出

額外參數(shù)說明

# 單獨(dú)指定注釋文件，而不用在構(gòu)建的時候使用
--sjdbGTFfile /path/to/ann.gtf
--sjdbFileChrStartEnd /path/to/sj.tab

# ENCODE參數(shù)

# 減少偽junction的幾率
--outFilterType BySJout

# 最多允許一個reads被匹配到多少個地方
--outFilterMultimapNmax 20

# 在未有注釋的junction區(qū)域，最低允許突出多少個bp的單鏈序列
--alignSJoverhangMin 8

# 在有注釋的junction區(qū)域，最低允許突出多少個bp的單鏈序列
--alignSJDBoverhangMin 1

# 過濾掉每個paired read mismatch數(shù)目超過N的數(shù)據(jù)，999代表著忽略這個過濾
--outFilterMismatchNmax 999

# 相對paired read長度可以允許的mismatch數(shù)目，如果read長度為100，數(shù)值設(shè)定為0.04，則會過濾掉100*2*0.04=8個以上的數(shù)據(jù)
--outFilterMismatchNoverReadLmax 0.04

# 最小的intro長度
--alignIntronMin 20

# 最大的intro長度
--alignIntronMax 1000000

# maximum genomic distance between mates，翻譯不出來，自行理解
--alignMatesGapMax 1000000

STAR的輸出

STAR可以根據(jù)你的參數(shù)設(shè)定輸出多個結(jié)果文件，包含各種信息，下面對默認(rèn)參數(shù)情況下的輸出文件做了一個詳細(xì)的展示，有些不好翻譯的地方我選擇使用原汁原味的manual text

Aligned.out.sam
Aligned.out.sam當(dāng)然就是我們的比對結(jié)果啦！

E00516:168:H37WKCCXY:8:1101:6400:59130  99  1   92836373    255 20M1063N129M    =   92837548    4244    GGCTTGTCTATCCCTCACAGTACCAAACGATTCCCTGGTTATGATTCTGAAAGCAAGGAATTTAATGCAGAAGTACATCGGAAGCACATCATGGGCCAGAATGTTGCAGATTACATGCGCTACTTAATGGAAGAAGATGAAGATGCTTA   AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ   NH:i:1  HI:i:1  AS:i:289    nM:i:0

# 我截取了一條比對信息
我們來看一下最后面的 NH:i:1  HI:i:1  AS:i:289    nM:i:0
NH:i:后面的數(shù)值代表著此條read比對到幾個loci，1代表著unique map，數(shù)值大于1代表著multi-mappers
HI:i:后面的數(shù)值attrbiutes enumerates multiple
alignments of a read starting with 1，下游分析接cufflinks or stringtie的時候需要使用參數(shù)--outSAMattrIHstart 0
AS:i:的數(shù)值代表著local alignment score (paired for paired-edn reads)
nM:i:的數(shù)值代表著the number of mismatches per (paired) alignment, not to be confused with NM, which is the number of mismatches in each mate
關(guān)于下游處理工具的兼容性還需要使用者自己仔細(xì)參考manual

Log.out文件
Log.out文件記錄了程序運(yùn)行時的信息，可以用來回溯錯誤信息。

tail Log.out
Joined thread # 12
Completed: thread #13
Joined thread # 13
Joined thread # 14
Joined thread # 15
Joined thread # 16
Joined thread # 17
Joined thread # 18
Joined thread # 19
ALL DONE!

Log.progress.out文件
Log.progress.out報告比對進(jìn)程情況，1分鐘記錄一次

tail Log.progress.out
Sep 08 17:57:52     33.1    23115987      285    94.1%    284.0     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 17:58:53     34.0    24349711      285    94.1%    284.0     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:00:23     33.5    24789186      285    94.1%    284.1     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:01:51     33.3    25493588      285    94.1%    284.0     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:02:58     33.5    26284824      285    94.1%    284.1     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:04:23     33.7    27163519      285    94.1%    284.1     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:05:36     33.1    27428080      285    94.1%    284.1     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:06:54     33.8    28659661      285    94.1%    284.1     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:08:00     34.3    29741743      285    94.1%    283.9     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
ALL DONE!

Log.final.out文件
Log.final.out包含了比對結(jié)束后比對統(tǒng)計的信息

head Log.progress.out 
           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Sep 08 17:17:47      2.9       88583      288    94.2%    287.4     0.1%     4.0%     0.1%     0.0%     1.7%     0.0%
Sep 08 17:18:53     14.5      711158      282    94.1%    281.9     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 17:20:06     19.2     1329197      284    94.1%    283.8     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 17:21:19     32.7     2923414      284    94.1%    283.7     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 17:22:39     32.5     3629649      285    94.1%    283.9     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 17:23:49     32.4     4248206      285    94.1%    284.0     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 17:24:57     36.6     5483555      285    94.1%    284.3     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 17:26:03     35.7     6012744      285    94.1%    284.4     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
tail Log.progress.out 
Sep 08 17:57:52     33.1    23115987      285    94.1%    284.0     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 17:58:53     34.0    24349711      285    94.1%    284.0     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:00:23     33.5    24789186      285    94.1%    284.1     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:01:51     33.3    25493588      285    94.1%    284.0     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:02:58     33.5    26284824      285    94.1%    284.1     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:04:23     33.7    27163519      285    94.1%    284.1     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:05:36     33.1    27428080      285    94.1%    284.1     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:06:54     33.8    28659661      285    94.1%    284.1     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
Sep 08 18:08:00     34.3    29741743      285    94.1%    283.9     0.2%     4.0%     0.1%     0.0%     1.8%     0.0%
ALL DONE!

SJ.out.tab文件
SJ.out.tab包含了剪切信息，其實(shí)目前我還沒怎么看懂，等以后再來補(bǔ)坑。

head SJ.out.tab 
1   14830   14969   2   2   0   1   9   69
1   14844   14969   2   2   0   0   2   30
1   15039   15795   2   2   1   2   7   53
1   15948   16606   2   2   1   1   1   41
1   16028   16606   2   2   0   0   1   57
1   16311   16606   2   2   0   2   0   67
1   16766   16853   2   2   0   2   0   43
1   16766   16857   2   2   1   17  108 73
1   16766   16875   2   2   0   0   1   61
1   16789   16875   2   2   0   0   1   53

# 參數(shù)釋義
column 1: chromosome
column 2: first base of the intron (1-based)
column 3: last base of the intron (1-based)
column 4: strand (0: undened, 1: +, 2: -)
column 5: intron motif: 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5:AT/AC, 6: GT/AT
column 6: 0: unannotated, 1: annotated (only if splice junctions database is used)
column 7: number of uniquely mapping reads crossing the junction
column 8: number of multi-mapping reads crossing the junction
column 9: maximum spliced alignment overhang