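IRanges addresses the problem of representing where sequences sit on a genome, and GRanges extends IRanges with chromosome and DNA strand information. Inside a GRanges object the positions are stored as an IRanges, which makes range operations much faster than doing the same work on a data.frame.
Basic operations
A single ranges object can hold many intervals. We can operate on each interval individually or on the ranges object as a whole.
Creating an object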
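Creating a GRanges requires seqnames, ranges and strand (names are optional); these are the core components of the object. Any additional columns are metadata columns, shown to the right of the | separator when the object is printed. Besides interval information the object also carries chromosome information: the seqinfo line at the bottom of the output below summarizes it, covering seqnames (chromosome names), seqlengths (total chromosome lengths), isCircular (whether the chromosome is circular) and genome (the genome build).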
- Rle: compactly records redundant information as run values and their repeat counts
- IRanges: compactly records positional information as start, end and width
> library(GenomicRanges)
> gr <- GRanges(
+ seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
+ ranges = IRanges(101:110, end = 111:120, names = head(letters, 10)),
+ strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
+ score = 1:10,
+ GC = seq(1, 0, length=10))
> gr
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
a chr1 101-111 - | 1 1
b chr2 102-112 + | 2 0.888888888888889
c chr2 103-113 + | 3 0.777777777777778
d chr2 104-114 * | 4 0.666666666666667
e chr1 105-115 * | 5 0.555555555555556
f chr1 106-116 + | 6 0.444444444444444
g chr3 107-117 + | 7 0.333333333333333
h chr3 108-118 + | 8 0.222222222222222
i chr3 109-119 - | 9 0.111111111111111
j chr3 110-120 - | 10 0
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
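To make the two building blocks more concrete, here is a minimal sketch that builds the same kind of Rle and IRanges used in the constructor call and inspects them on their own. The variable names `chrom` and `ir` are illustrative only.

```r
library(GenomicRanges)  # also attaches S4Vectors (Rle) and IRanges

# Run-length encoding: 4 runs are enough to describe 10 values
chrom <- Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4))
runValue(chrom)      # the value stored for each run
runLength(chrom)     # how many times each value repeats
as.character(chrom)  # expand back to the full vector of 10 values

# Ranges defined by start and end; the width is derived as end - start + 1
ir <- IRanges(start = 101:110, end = 111:120)
width(ir)
```

Because only the runs are stored, a long vector with few distinct consecutive values (such as chromosome names in a sorted file) takes very little memory.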
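Accessing attributes
- start(): the start of each interval
- end(): the end of each interval
- width(): the width of each interval
# Get the width of every range, e.g. to inspect the width distribution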
> width(gr)
[1] 11 11 11 11 11 11 11 11 11 11
- length(): returns the number of ranges in the object
> length(gr)
[1] 10
- strand(): gets the strand of each range
- names(): gets the range names (row names)
> names(gr)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
# Names can also be assigned
> names(gr) <- 1:10
> names(gr)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
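- seqinfo(): gets the chromosome information, including seqnames, seqlengths, isCircular and genome
- mcols(): gets the metadata columns. Note that seqnames, ranges and strand are core components of a GRanges rather than metadata columns, so they are not returned by mcols(); use seqnames(), ranges() and strand() instead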
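The split between core components and metadata columns can be sketched as follows. This is a small illustrative object, not the `gr` from the transcript, and the chromosome lengths assigned at the end are made-up values.

```r
library(GenomicRanges)

gr_small <- GRanges(
  seqnames = c("chr1", "chr2", "chr2"),
  ranges   = IRanges(start = c(101, 201, 301), width = 11),
  strand   = c("+", "-", "*"),
  score    = c(1L, 2L, 3L)   # extra argument becomes a metadata column
)

# Core components each have a dedicated accessor
as.character(seqnames(gr_small))
start(ranges(gr_small))
as.character(strand(gr_small))

# Metadata columns live in mcols() and can also be reached with `$`
colnames(mcols(gr_small))
gr_small$score

# Chromosome lengths are part of seqinfo and can be assigned after creation
# (hypothetical lengths, for illustration only)
seqlengths(gr_small) <- c(chr1 = 1000, chr2 = 2000)
seqlengths(gr_small)
```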
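Range operations
IRanges and GRanges objects also support vector-like operations: they can be indexed by position, by name or by logical values, arithmetic addition and subtraction also work, and different objects can be combined, split, intersected and so on. When there are several related objects, a GRangesList is useful, for example to represent grouping such as the exons of each gene. Each element of the list is a gene, and within each element the exon ranges are stored as a GRanges. The structure behaves like a list, so lapply can be applied to it.
# Subset with a logical condition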
> gr[gr$score < 5]
GRanges object with 4 ranges and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
1 chr1 101-111 - | 1 1
2 chr2 102-112 + | 2 0.888888888888889
3 chr2 103-113 + | 3 0.777777777777778
4 chr2 104-114 * | 4 0.666666666666667
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
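The exons-per-gene grouping described above can be sketched like this. The gene names and coordinates are invented for illustration, and `split()` is only one of several ways to obtain a GRangesList.

```r
library(GenomicRanges)

# Three exons on one chromosome, labelled with a hypothetical gene id
exons <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(100, 300, 1000), width = c(50, 80, 120)),
  strand   = "+",
  gene     = c("geneA", "geneA", "geneB")
)

# Group the exons by gene: one list element (a GRanges) per gene
grl <- split(exons, exons$gene)
elementNROWS(grl)   # number of exons in each gene

# List-style iteration: total exonic length per gene
exon_bp <- sapply(grl, function(g) sum(width(g)))
exon_bp
```

For large lists, the vectorized form `sum(width(grl))` should give the same per-gene totals without an explicit loop.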
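- shift: moves the intervals by a given offset
# Shift every range: positive values move toward higher coordinates, negative values toward lower coordinates, regardless of strand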
> shift(gr, shift = 10)
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
1 chr1 111-121 - | 1 1
2 chr2 112-122 + | 2 0.888888888888889
3 chr2 113-123 + | 3 0.777777777777778
4 chr2 114-124 * | 4 0.666666666666667
5 chr1 115-125 * | 5 0.555555555555556
6 chr1 116-126 + | 6 0.444444444444444
7 chr3 117-127 + | 7 0.333333333333333
8 chr3 118-128 + | 8 0.222222222222222
9 chr3 119-129 - | 9 0.111111111111111
10 chr3 120-130 - | 10 0
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
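A minimal sketch of the direction of the shift, using a small two-range object (not the `gr` above) with one range on each strand. It suggests that the sign of the offset refers to coordinates and not to the strand.

```r
library(GenomicRanges)

x <- GRanges("chr1",
             IRanges(start = c(101, 201), width = 11),
             strand = c("+", "-"))

right <- shift(x, 10)    # both ranges move to higher coordinates
left  <- shift(x, -10)   # both ranges move to lower coordinates

start(right)
start(left)
width(right)             # widths are unchanged by shifting
```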
- restrict: clips ranges to a window; given start and end bounds, it keeps only the part of each range that lies inside them
> restrict(gr, 105, 110)
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
1 chr1 105-110 - | 1 1
2 chr2 105-110 + | 2 0.888888888888889
3 chr2 105-110 + | 3 0.777777777777778
4 chr2 105-110 * | 4 0.666666666666667
5 chr1 105-110 * | 5 0.555555555555556
6 chr1 106-110 + | 6 0.444444444444444
7 chr3 107-110 + | 7 0.333333333333333
8 chr3 108-110 + | 8 0.222222222222222
9 chr3 109-110 - | 9 0.111111111111111
10 chr3 110 - | 10 0
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
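The clipping behaviour can be checked on a plain IRanges, as in this small sketch. Both example ranges overlap the window, so the case of a range lying entirely outside the bounds is deliberately not covered here.

```r
library(GenomicRanges)

ir <- IRanges(start = c(1, 8), end = c(6, 12))

# Keep only the part of each range inside the window [5, 10]
clipped <- restrict(ir, start = 5, end = 10)
start(clipped)
end(clipped)
```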
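- flank: gets the flanking region of a given width upstream or downstream of each range; promoters is an enhanced version that makes it easy to get a region spanning both upstream and downstream of the 5' end of each range
# Get the 10 bp upstream of each range (strand-aware)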
> flank(gr, width = 10)
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
1 chr1 112-121 - | 1 1
2 chr2 92-101 + | 2 0.888888888888889
3 chr2 93-102 + | 3 0.777777777777778
4 chr2 94-103 * | 4 0.666666666666667
5 chr1 95-104 * | 5 0.555555555555556
6 chr1 96-105 + | 6 0.444444444444444
7 chr3 97-106 + | 7 0.333333333333333
8 chr3 98-107 + | 8 0.222222222222222
9 chr3 120-129 - | 9 0.111111111111111
10 chr3 121-130 - | 10 0
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
# Get the 10 bp downstream of each range by setting start = FALSE
> flank(gr, width = 10, start = FALSE)
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
1 chr1 91-100 - | 1 1
2 chr2 113-122 + | 2 0.888888888888889
3 chr2 114-123 + | 3 0.777777777777778
4 chr2 115-124 * | 4 0.666666666666667
5 chr1 116-125 * | 5 0.555555555555556
6 chr1 117-126 + | 6 0.444444444444444
7 chr3 118-127 + | 7 0.333333333333333
8 chr3 119-128 + | 8 0.222222222222222
9 chr3 99-108 - | 9 0.111111111111111
10 chr3 100-109 - | 10 0
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
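# When taking flanking regions, make sure they do not run past the chromosome ends; the chromosome bounds have to be specified (e.g. through seqlengths)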
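As a sketch of both points, the example below first calls promoters on two hypothetical genes, one per strand, and then shows one way to handle a flank that runs off the start of a chromosome, assuming the chromosome length is known and trim() is used to clip. The coordinates and the chromosome length are invented.

```r
library(GenomicRanges)

genes <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(5000, 5000), end = c(6000, 6000)),
  strand   = c("+", "-")
)

# 2000 bp upstream to 200 bp downstream of the 5' end of each range;
# for the "-" strand gene the 5' end is its end coordinate
prom <- promoters(genes, upstream = 2000, downstream = 200)
start(prom)
end(prom)

# A range close to the chromosome start, with a known chromosome length
near_edge <- GRanges("chr1", IRanges(start = 5, end = 50), strand = "+",
                     seqlengths = c(chr1 = 10000))

# The upstream flank would start below position 1 (this raises a warning)
up <- suppressWarnings(flank(near_edge, width = 10))

# trim() clips out-of-bound ranges to the chromosome limits
trimmed <- trim(up)
start(trimmed)
end(trimmed)
```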
- reduce: merges overlapping and adjacent ranges into one, giving the union of the ranges (per chromosome and strand)
> reduce(gr)
GRanges object with 7 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 106-116 +
[2] chr1 101-111 -
[3] chr1 105-115 *
[4] chr2 102-113 +
[5] chr2 104-114 *
[6] chr3 107-118 +
[7] chr3 109-120 -
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
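- disjoin: splits the ranges at every start and end boundary into the smallest set of non-overlapping pieces that covers the same positions; this is useful when studying alternative splicing. It is related to gaps(), which instead returns the complement, i.e. the regions not covered by any range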
> disjoin(gr)
GRanges object with 13 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 106-116 +
[2] chr1 101-111 -
[3] chr1 105-115 *
[4] chr2 102 +
[5] chr2 103-112 +
... ... ... ...
[9] chr3 108-117 +
[10] chr3 118 +
[11] chr3 109 -
[12] chr3 110-119 -
[13] chr3 120 -
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
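The difference between reduce, disjoin and gaps is easier to see on a tiny strand-free example. The sketch below uses a plain IRanges with three invented intervals, two of which overlap.

```r
library(GenomicRanges)

ir <- IRanges(start = c(1, 5, 20), end = c(10, 15, 25))

# reduce: overlapping ranges 1-10 and 5-15 are merged into 1-15
merged <- reduce(ir)

# disjoin: the same positions, cut at every boundary into disjoint pieces
pieces <- disjoin(ir)

# gaps: the uncovered positions between the ranges
holes <- gaps(ir)

merged
pieces
holes
```

Here reduce should give 1-15 and 20-25, disjoin should give 1-4, 5-10, 11-15 and 20-25, and gaps should give 16-19.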
- sort: sorts the ranges within a GRanges object
# Sort in genomic order: by chromosome first, then by strand, then by position
> sort(gr)
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
6 chr1 106-116 + | 6 0.444444444444444
1 chr1 101-111 - | 1 1
5 chr1 105-115 * | 5 0.555555555555556
2 chr2 102-112 + | 2 0.888888888888889
3 chr2 103-113 + | 3 0.777777777777778
4 chr2 104-114 * | 4 0.666666666666667
7 chr3 107-117 + | 7 0.333333333333333
8 chr3 108-118 + | 8 0.222222222222222
9 chr3 109-119 - | 9 0.111111111111111
10 chr3 110-120 - | 10 0
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
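- findOverlaps: finds the overlaps between two objects, for example to check whether given sequences fall within given regions. Each row of the result pairs the index of a range in the first object (the query) with the index of a range in the second object (the subject) that it overlaps. %over% is similar but returns a logical vector directly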
> gr6 <- GRanges(seqnames = "chr2",
+ ranges = IRanges(start = c(6,8,12,14,21,22,23),width = c(11,4,2,5,7,7,7)),
+ strand = "*")
> gr7 <- GRanges(seqnames = "chr2",
+ ranges = IRanges(start = c(6,15),width = 10),
+ strand = "*")
> gr6
GRanges object with 7 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2 6-16 *
[2] chr2 8-11 *
[3] chr2 12-13 *
[4] chr2 14-18 *
[5] chr2 21-27 *
[6] chr2 22-28 *
[7] chr2 23-29 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> gr7
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2 6-15 *
[2] chr2 15-24 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> findOverlaps(gr6,gr7)
Hits object with 9 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 1
[2] 1 2
[3] 2 1
[4] 3 1
[5] 4 1
[6] 4 2
[7] 5 2
[8] 6 2
[9] 7 2
-------
queryLength: 7 / subjectLength: 2
# Or use the logical result to subset the overlapping ranges directly
> gr6[gr6 %over% gr7]
GRanges object with 7 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2 6-16 *
[2] chr2 8-11 *
[3] chr2 12-13 *
[4] chr2 14-18 *
[5] chr2 21-27 *
[6] chr2 22-28 *
[7] chr2 23-29 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
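The Hits object can be taken apart with its accessors, and two related helpers cover common follow-up questions. The sketch below rebuilds gr6 and gr7 so that it runs on its own; countOverlaps and subsetByOverlaps are not used in the transcript above and are shown here only as closely related options.

```r
library(GenomicRanges)

gr6 <- GRanges(seqnames = "chr2",
               ranges = IRanges(start = c(6, 8, 12, 14, 21, 22, 23),
                                width = c(11, 4, 2, 5, 7, 7, 7)),
               strand = "*")
gr7 <- GRanges(seqnames = "chr2",
               ranges = IRanges(start = c(6, 15), width = 10),
               strand = "*")

hits <- findOverlaps(gr6, gr7)
queryHits(hits)     # indices into gr6
subjectHits(hits)   # indices into gr7

# How many ranges of gr6 overlap each range of gr7
per_region <- countOverlaps(gr7, gr6)
per_region

# The ranges of gr6 that overlap the first region of gr7 only
in_first <- subsetByOverlaps(gr6, gr7[1])
in_first
```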
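- tile: creates windows, specified either by the number of windows or by the window width
- slidingWindows: creates sliding windows with a given window width and step size
- tileGenome: returns a set of genomic regions that together form a partition of a given genome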
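A short sketch of the three windowing functions on a toy 100 bp region and a made-up two-chromosome genome. The region, the chromosome lengths and the window sizes are assumptions chosen to keep the numbers easy to follow.

```r
library(GenomicRanges)

region <- GRanges("chr1", IRanges(start = 1, end = 100))

# tile: non-overlapping windows, by width or by number of windows;
# the result is a GRangesList with one element per input range
tiles_w <- tile(region, width = 25)
tiles_n <- tile(region, n = 4)
width(tiles_w[[1]])

# slidingWindows: overlapping windows of width 50, moved in steps of 25
windows <- slidingWindows(region, width = 50, step = 25)
start(windows[[1]])

# tileGenome: partition a whole genome given its chromosome lengths;
# cut.last.tile.in.chrom = TRUE keeps tiles within chromosome boundaries
bins <- tileGenome(c(chr1 = 100, chr2 = 50), tilewidth = 25,
                   cut.last.tile.in.chrom = TRUE)
bins
```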