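This post mainly draws on the Zhihu post "Three ways to extract mature miRNA sequences".
There are three ways to extract the mature miRNA sequences of a species of interest: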
- Perl or Python scripts
- R scripts
- Regular-expression find-and-replace in Notepad++ or EmEditor
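As a rough illustration of the script route, here is a minimal Python sketch. It assumes the mature sequences come as a FASTA file whose identifiers begin with a species prefix such as "ath-"; that layout, the function names, and the example records are assumptions made for illustration, not something taken from the referenced post.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_species(lines, prefix):
    """Keep records whose header starts with the species prefix, e.g. 'ath-'."""
    return [(h, s) for h, s in read_fasta(lines) if h.startswith(prefix)]


# Hypothetical records for illustration only (not real database entries).
example = [
    ">ath-miR-example1 ID0000001",
    "UGACAGAAGAGAGUGAGCAC",
    ">osa-miR-example2 ID0000002",
    "UGACAGAAGAGAGUGAGCAC",
]

for header, seq in extract_species(example, "ath-"):
    print(">" + header)
    print(seq)
```

To use it on a real file, pass an open file handle instead of the example list, for instance `extract_species(open("mature.fa"), "ath-")`, where the file name is again a placeholder.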
miRNA analysis workflow
For details see [miRNA data filtering: I use cutadapt](miRNA 數(shù)據(jù)過濾我使用cutadapt). I have reorganized that material here; thanks to the author for sharing.
I. miRNA data filtering (plants: 18-30 nt)
cutadapt -a AGATCGGAAGAGCACACGTCT -m 15 -q 20 --discard-untrimmed -o outname .fa
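- --discard-untrimmed: discard reads that do not contain the adapter.
- -a: trim the 3' adapter (for paired-end data, the adapter on the first read). Appending $ to the sequence anchors the adapter at the 3' end of the read. The adapter sequence can be requested from the sequencing company.
- -g: trim the 5' adapter (for paired-end data, the adapter on the first read). Prefixing the sequence with ^ anchors the adapter at the 5' end of the read.
- -q: quality cutoff for trimming low-quality bases (20 here).
- -m: discard reads shorter than 15 nt after trimming.
This yields reads of suitable length.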
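The command above only enforces a minimum length of 15 nt, whereas the heading mentions an 18-30 nt window for plants. A minimal Python sketch of such a window filter follows; the function name and the toy reads are hypothetical, and cutadapt's own -m and -M length options could likely do the same job directly.

```python
def filter_by_length(records, min_len=18, max_len=30):
    """Keep (name, seq) records with min_len <= len(seq) <= max_len."""
    return [(name, seq) for name, seq in records if min_len <= len(seq) <= max_len]


# Toy reads for illustration only.
reads = [
    ("read1", "A" * 17),    # too short
    ("read2", "ACGT" * 5),  # 20 nt, kept
    ("read3", "A" * 30),    # upper bound, kept
    ("read4", "A" * 31),    # too long
]

kept = filter_by_length(reads)
print([name for name, _ in kept])  # ['read2', 'read3']
```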
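II. miRNA alignment
- Option 1. Align the reads to the ncRNAs in Rfam and remove snRNA, snoRNA, rRNA, tRNA and the like.
- Option 2. Align the reads to the reference genome of the target species and discard sequences that do not map.
To reduce alignment time, identical reads within each sample can be collapsed before alignment into a FASTA file, with records named according to this rule:
sample_rN_xM, where N after r is the read's serial number and M after x is the number of times that read occurs.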
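The collapsing step can be sketched in Python as follows. This is an illustration under assumptions rather than the tool's own script: the function name is hypothetical, and both the ordering (most abundant first) and the serial number starting at 1 are choices made here, since the naming rule only fixes the sample_rN_xM pattern.

```python
from collections import Counter


def collapse_reads(sequences, sample):
    """Collapse identical reads and name each unique one <sample>_r<index>_x<count>."""
    counts = Counter(sequences)
    records = []
    # most_common() sorts by descending count; ties keep first-seen order.
    for index, (seq, count) in enumerate(counts.most_common(), start=1):
        records.append(("{}_r{}_x{}".format(sample, index, count), seq))
    return records


# Toy reads for illustration only.
seqs = ["ACGT", "TTTT", "ACGT", "ACGT", "TTTT", "GGGG"]

for name, seq in collapse_reads(seqs, "S1"):
    print(">" + name)
    print(seq)
```

For the toy input this gives three records, S1_r1_x3, S1_r2_x2 and S1_r3_x1, so six reads are aligned as three sequences while the counts stay recoverable from the names.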
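Using miR-PREFeR
Introduction: miR-PREFeR (microRNA PREdiction From small RNAseq data). This post mainly follows the tutorial on GitHub.
miR-PREFeR works from reads aligned to the reference genome and is used here to identify novel miRNAs.
Analysis workflow
1. Required programs
a. Install ViennaRNA beforehand, preferably version 1.8.5, 2.1.2, or 2.1.5 and above.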
wget https://www.tbi.univie.ac.at/RNA/download/sourcecode/2_4_x/ViennaRNA-2.4.18.tar.gz
tar zvxf ViennaRNA-2.4.18.tar.gz
cd ViennaRNA-2.4.18
./configure --prefix="/user/tools/ViennaRNA/" --without-perl
make
make install
b. Install samtools (version 0.1.15 or later)
cd /manager/biosoft/
tar jfx samtools-0.1.19.tar.bz2
cd samtools-0.1.19
make
Note: miR-PREFeR is written for Python 2, so running it under Python 3 fails with errors.
The current version is only tested under Python 2.6.7, Python 2.7.2 and Python 2.7.3 and should work under Python 2.6.* and Python 2.7.*.
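Before moving on, it may help to confirm that the external programs are actually reachable. The sketch below checks PATH for samtools and for RNALfold (the ViennaRNA program the pipeline calls). The helper name is hypothetical, and because it uses shutil.which it has to be run with Python 3.3 or later, separately from the Python 2 interpreter that miR-PREFeR itself needs.

```python
import shutil


def missing_programs(programs, which=shutil.which):
    """Return the names in `programs` that cannot be found on PATH."""
    return [name for name in programs if which(name) is None]


required = ["samtools", "RNALfold"]
missing = missing_programs(required)
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All required programs were found on PATH.")
```

If a program was installed under a custom prefix, as with the --prefix option used for ViennaRNA above, its bin directory has to be added to PATH before the check can succeed.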
2. Obtain and install the pipeline
git clone https://github.com/hangelwen/miR-PREFeR.git
If you cannot download it with git, you can get it from my network drive.
Link: https://pan.baidu.com/s/1UqkKYDOGcjv13dHm9pi9ew
Extraction code: volh
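3. Test the pipeline (for checking the installation; can be skipped)
The authors helpfully provide test data (example/exampledata.tar.gz) together with a walkthrough for testing the whole pipeline (HOW_TO_RUN_EXAMPLE.txt).
The contents of HOW_TO_RUN_EXAMPLE.txt are reproduced below.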
================================================================================
1. Test the pipeline.
# The package provides a small example dataset for testing the pipeline. The
# dataset is for Arabidopsis, chromosome 1. To run the example, first change
# directory to the example folder:
cd example
# Then decompress the exampledata.tar.gz file:
tar xvf exampledata.tar.gz
# Then open the config.example file, change the PIPELINE_PATH to the path where
# you put the miR-PREFeR package folder. For example, if you put miR-PREFeR at
# /home/username/tools/miR-PREFeR-v0.09, then set PIPELINE_PATH as:
PIPELINE_PATH=/home/username/tools/miR-PREFeR-v0.09
# Save the config.example file. In the example folder, execute command:
python ../miR_PREFeR.py -L -k pipeline config.example
# The -L option generates a log file in the output directory example-result. The
# -k option keeps the temp directory used to store the intermediate files. The
# temp directory is in the example-result directory.
# If you have python, samtools, RNALfold installed and in the PATH, you should be
# able to run the test program. It takes about one or two minutes to
# finish. You'll be able to see the result in the example-result folder.
================================================================================
2. Test how to do checkpointing.
# Before testing this, if you have run the pipeline with the config.example file
# in this folder, please remove the example-result folder first.
# Then change the 'CHECKPOINT_SIZE' option to a smaller value (30, for
# example). The reason to do this is that by default the pipeline makes a
# checkpoint after finishing folding every 3000 sequences, but the sample data is
# so small that the total number of sequences is smaller than the default.
# Then run the pipeline with 'pipeline' command:
python ../miR_PREFeR.py -L -k pipeline config.example
# After running for a while (10 seconds, for example. You should let it run for
# enough time to do at least one checkpoint. A "Done" is shown when a checkpoint
# is applied), kill the process by "Ctrl-C". To check where the pipeline was stopped,
# run:
python ../miR_PREFeR.py -L check config.example
# This will show the checkpoint information.
# To restart the pipeline from where it was stopped, run:
python ../miR_PREFeR.py -L recover config.example
# The pipeline will continue to finish the job specified in the config.example
# file.
================================================================================
4. How to run the pipeline (now for the real work)
a. Prepare input data for the pipeline.
- A fasta file, which contains the genome sequences of the species under study.
- One or more SAM files, which contain the alignments of small RNAseq data with the genome.
- (Optional) A GFF (http://www.sanger.ac.uk/resources/software/gff/spec.html) file which lists regions in the genome sequences that should be ignored from miRNA analysis.
a). Genome fasta file (explanation of the fasta file input)
Fasta format specification can be found at http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml. In miR-PREFeR, for the string following ">", only the first word, delimited by any whitespace character (space, tab, etc.), is used. For example, for the following sequence, 'ath-MIR773a' is used as the identifier of the sequence. Thus, please ensure that all the sequences in the FASTA files have different identifiers.
>ath-MIR773a MI0005103
AGGAGGCAAUAGCUUGAGCAAAUAAUUGAUUGCAGAAGUCCAUCGACUAAAGCUGUCACCUGUUUGCUUCCAGCUUUUGUCUCCU
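Since only the first word of each header is used, two records that differ only in their descriptions would collide. The following Python sketch shows one way to check a FASTA file for that before running the pipeline; the function names and the second and third example records are hypothetical.

```python
from collections import Counter


def fasta_identifiers(lines):
    """Return the first whitespace-delimited word of every '>' header line."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            ids.append(words[0] if words else "")
    return ids


def duplicated_identifiers(lines):
    """Return the identifiers that occur more than once, sorted."""
    counts = Counter(fasta_identifiers(lines))
    return sorted(name for name, n in counts.items() if n > 1)


# First header is from the text; the other records are made up for illustration.
example = [
    ">ath-MIR773a MI0005103",
    "AGGAGGCAAUAGCUUGAGC",
    ">ath-MIR773a another description",
    "AGGAGGCAAUAGC",
    ">chr1\tsome note",
    "ACGT",
]

print(fasta_identifiers(example))       # ['ath-MIR773a', 'ath-MIR773a', 'chr1']
print(duplicated_identifiers(example))  # ['ath-MIR773a']
```

An empty result from `duplicated_identifiers` means every sequence would get a distinct identifier under the first-word rule.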
b). SAM alignment files (explanation of the SAM files input)
The miR-PREFeR pipeline takes SAM format alignment files. SAM alignment files can be generated by many aligners. Here we use Bowtie (http://bowtie-bio.sourceforge.net/index.shtml) as an example.
That's all for today; to be continued...