GRanges

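IRanges addresses the problem of describing where sequences sit on a genome; GRanges builds on IRanges by adding chromosome and DNA strand information. Inside a GRanges object the positions are stored as an IRanges, which makes range operations much faster than doing the same work on a data.frame.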

Basic operations

A single ranges object can hold many intervals. We can operate on each interval individually or on the ranges object as a whole.

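Creating an object

A GRanges is typically built from seqnames, ranges, and strand (plus optional names); these are the core fields of the object. Additional per-range metadata columns (score and GC in the example) can be supplied as well, and are displayed to the right of the | separator. Besides the intervals themselves, the object carries chromosome information: the seqinfo line at the bottom of the output summarizes it, covering seqnames (chromosome names), seqlengths (chromosome lengths), isCircular (whether the chromosome is circular), and genome (the genome build).

  • Rle: run-length encoding; stores redundant values compactly as a value plus the number of times it repeats
  • IRanges: stores positions efficiently as start, end, and width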
> gr <- GRanges(
+     seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
+     ranges = IRanges(101:110, end = 111:120, names = head(letters, 10)),
+     strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
+     score = 1:10,
+     GC = seq(1, 0, length.out = 10))
> gr
GRanges object with 10 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  a     chr1   101-111      - |         1                 1
  b     chr2   102-112      + |         2 0.888888888888889
  c     chr2   103-113      + |         3 0.777777777777778
  d     chr2   104-114      * |         4 0.666666666666667
  e     chr1   105-115      * |         5 0.555555555555556
  f     chr1   106-116      + |         6 0.444444444444444
  g     chr3   107-117      + |         7 0.333333333333333
  h     chr3   108-118      + |         8 0.222222222222222
  i     chr3   109-119      - |         9 0.111111111111111
  j     chr3   110-120      - |        10                 0
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
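
The two building blocks listed above, Rle and IRanges, can also be inspected on their own. The following is a minimal sketch, assuming the GenomicRanges package is installed (it attaches IRanges and S4Vectors); the variable names chrom and ir are made up for illustration.

```r
library(GenomicRanges)  # also attaches IRanges and S4Vectors

# Rle stores runs of repeated values as (value, run length) pairs
chrom <- Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4))
runValue(chrom)    # the value of each run, in order
runLength(chrom)   # how many times each value repeats
length(chrom)      # total length after expansion
as.vector(chrom)   # expand back to an ordinary character vector

# IRanges stores start, end and width; any two of them determine the third
ir <- IRanges(start = 101:103, width = 11)
end(ir)            # end = start + width - 1
```

Because the run lengths sum to 10, this Rle lines up with the 10 ranges used in the GRanges call above.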

Accessing attributes

  • start(): the start of each interval
  • end(): the end of each interval
  • width(): the width of each interval
# Get the width of each range, e.g. to inspect the width distribution
> width(gr)
 [1] 11 11 11 11 11 11 11 11 11 11
  • length(): the number of ranges in the object
> length(gr)
[1] 10
  • strand(): get the strand of each range
  • names(): get the range names (row names)
> names(gr)
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

# Names can also be assigned
> names(gr) <- 1:10
> names(gr)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
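  • seqinfo(): get the chromosome information, including seqnames, seqlengths, isCircular, and genome
  • mcols(): get the metadata columns. Note that seqnames, ranges, and strand are core fields of a GRanges rather than metadata columns, so mcols() does not return them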
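
The accessors listed above can be tried on a small object. This is a sketch under stated assumptions: the three ranges and the chromosome lengths (1000 and 2000) are hypothetical values chosen only for illustration.

```r
library(GenomicRanges)

# A small GRanges with one metadata column (score)
gr <- GRanges(
    seqnames = c("chr1", "chr2", "chr2"),
    ranges   = IRanges(start = c(101, 102, 103), width = 11),
    strand   = c("-", "+", "*"),
    score    = 1:3)

start(gr); end(gr); width(gr)   # core positional fields
as.character(seqnames(gr))      # chromosome of each range
as.character(strand(gr))        # strand of each range

mcols(gr)                       # DataFrame holding only the score column
gr$score                        # shortcut for mcols(gr)$score

# seqlengths are NA until assigned; the lengths below are hypothetical
seqlengths(gr) <- c(chr1 = 1000L, chr2 = 2000L)
seqinfo(gr)
```

Note how mcols(gr) contains score but not seqnames, ranges, or strand, which have their own accessors.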
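
Interval operations

IRanges and GRanges objects also support vector-like operations: they can be indexed by position, by name, or by logical values, they support arithmetic with + and -, and different objects can be combined, split, intersected, and so on. When there are several related objects, a GRangesList is useful, for example to represent grouping such as the exons of each gene. The elements of the list are genes, and within each element the exon ranges are stored as a GRanges. The structure behaves like a list, so lapply can be used on it.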
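
The exons-per-gene grouping described above might look like the following sketch. The gene names (geneA, geneB) and all coordinates are hypothetical.

```r
library(GenomicRanges)

# Exons of two hypothetical genes
geneA <- GRanges("chr1", IRanges(start = c(100, 300), end = c(200, 400)),
                 strand = "+")
geneB <- GRanges("chr2", IRanges(start = c(50, 150, 500), end = c(80, 220, 600)),
                 strand = "-")

# One list element per gene, each element a GRanges of exons
exons_by_gene <- GRangesList(geneA = geneA, geneB = geneB)

length(exons_by_gene)         # number of genes
elementNROWS(exons_by_gene)   # number of exons per gene
exons_by_gene[["geneB"]]      # list-style access returns a GRanges

# lapply works as on an ordinary list: total exonic bases per gene
exonic_bp <- unlist(lapply(exons_by_gene, function(g) sum(width(g))))
exonic_bp
```

For large lists, vectorized calls such as sum(width(exons_by_gene)) are usually preferable to lapply, but the lapply form makes the list-like structure explicit.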

# Subset using a logical condition
> gr[gr$score < 5]
GRanges object with 4 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  1     chr1   101-111      - |         1                 1
  2     chr2   102-112      + |         2 0.888888888888889
  3     chr2   103-113      + |         3 0.777777777777778
  4     chr2   104-114      * |         4 0.666666666666667
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
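  • shift: translate the intervals by a given offset
# Shift every range; a positive value moves ranges toward higher coordinates and a negative value toward lower coordinates, regardless of strand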
> shift(gr, shift = 10)
GRanges object with 10 ranges and 2 metadata columns:
     seqnames    ranges strand |     score                GC
        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
   1     chr1   111-121      - |         1                 1
   2     chr2   112-122      + |         2 0.888888888888889
   3     chr2   113-123      + |         3 0.777777777777778
   4     chr2   114-124      * |         4 0.666666666666667
   5     chr1   115-125      * |         5 0.555555555555556
   6     chr1   116-126      + |         6 0.444444444444444
   7     chr3   117-127      + |         7 0.333333333333333
   8     chr3   118-128      + |         8 0.222222222222222
   9     chr3   119-129      - |         9 0.111111111111111
  10     chr3   120-130      - |        10                 0
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
  • restrict: clip the ranges to a given start and end, keeping only the part of each range that falls inside those bounds
> restrict(gr, 105, 110)
GRanges object with 10 ranges and 2 metadata columns:
     seqnames    ranges strand |     score                GC
        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
   1     chr1   105-110      - |         1                 1
   2     chr2   105-110      + |         2 0.888888888888889
   3     chr2   105-110      + |         3 0.777777777777778
   4     chr2   105-110      * |         4 0.666666666666667
   5     chr1   105-110      * |         5 0.555555555555556
   6     chr1   106-110      + |         6 0.444444444444444
   7     chr3   107-110      + |         7 0.333333333333333
   8     chr3   108-110      + |         8 0.222222222222222
   9     chr3   109-110      - |         9 0.111111111111111
  10     chr3       110      - |        10                 0
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
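
A compact way to check how shift and restrict behave is to compare coordinates before and after on a two-range object. This is a minimal sketch; the coordinates are chosen only for illustration.

```r
library(GenomicRanges)

gr <- GRanges("chr1",
              IRanges(start = c(101, 106), end = c(111, 116)),
              strand = c("+", "-"))

# shift() ignores strand: both ranges move 10 bp toward higher coordinates
shifted <- shift(gr, 10)
start(shifted)   # 111 116
width(shifted)   # widths are unchanged

# A negative value moves the ranges toward lower coordinates
start(shift(gr, -10))   # 91 96

# restrict() clips each range to the window [105, 110]
clipped <- restrict(gr, start = 105, end = 110)
start(clipped)   # 105 106
end(clipped)     # 110 110
```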
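  • flank: get a flanking region of a given width upstream or downstream of each range; promoters is a more convenient variant that returns a region spanning a given distance upstream and downstream of each range's 5' end
# Get the 10 bp upstream of each range (strand-aware: on the - strand, upstream means higher coordinates)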
> flank(gr, width = 10)
GRanges object with 10 ranges and 2 metadata columns:
     seqnames    ranges strand |     score                GC
        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
   1     chr1   112-121      - |         1                 1
   2     chr2    92-101      + |         2 0.888888888888889
   3     chr2    93-102      + |         3 0.777777777777778
   4     chr2    94-103      * |         4 0.666666666666667
   5     chr1    95-104      * |         5 0.555555555555556
   6     chr1    96-105      + |         6 0.444444444444444
   7     chr3    97-106      + |         7 0.333333333333333
   8     chr3    98-107      + |         8 0.222222222222222
   9     chr3   120-129      - |         9 0.111111111111111
  10     chr3   121-130      - |        10                 0
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

# Get the 10 bp downstream of each range by setting start = FALSE
> flank(gr, width = 10, start = FALSE)
GRanges object with 10 ranges and 2 metadata columns:
     seqnames    ranges strand |     score                GC
        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
   1     chr1    91-100      - |         1                 1
   2     chr2   113-122      + |         2 0.888888888888889
   3     chr2   114-123      + |         3 0.777777777777778
   4     chr2   115-124      * |         4 0.666666666666667
   5     chr1   116-125      * |         5 0.555555555555556
   6     chr1   117-126      + |         6 0.444444444444444
   7     chr3   118-127      + |         7 0.333333333333333
   8     chr3   119-128      + |         8 0.222222222222222
   9     chr3    99-108      - |         9 0.111111111111111
  10     chr3   100-109      - |        10                 0
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

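# When extracting flanking regions, make sure the result does not run past the ends of the chromosome; clip it to the valid range where necessary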
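
The strand-awareness of flank and promoters, and one way to stay inside the chromosome, can be sketched as follows. The chromosome length of 125 bp is a hypothetical value picked so that one promoter region overruns the end; trim() is a GenomicRanges function not mentioned in the text above, used here as one possible way to do the clipping.

```r
library(GenomicRanges)

# Two identical intervals on opposite strands; chr1 length is hypothetical
gr <- GRanges("chr1",
              IRanges(start = c(101, 101), end = c(111, 111)),
              strand = c("+", "-"),
              seqlengths = c(chr1 = 125L))

up <- flank(gr, width = 10)                    # upstream of each range
start(up)                                      # 91 112
down <- flank(gr, width = 10, start = FALSE)   # downstream of each range
start(down)                                    # 112 91

# promoters(): 20 bp upstream and 5 bp downstream of the 5' end.
# The - strand region (107-131) overruns the 125 bp chromosome, which
# triggers an out-of-bound warning, suppressed here for the demo.
prom <- suppressWarnings(promoters(gr, upstream = 20, downstream = 5))
end(prom)                                      # 105 131

# trim() clips out-of-bound ranges using the seqlengths
prom_ok <- trim(prom)
end(prom_ok)                                   # 105 125
```

trim() only works when seqlengths are known; without them, restrict() with explicit bounds is an alternative.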
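  • reduce: merge overlapping and adjacent ranges into their union (computed per chromosome and strand; metadata columns are dropped)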
> reduce(gr)
GRanges object with 7 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   106-116      +
  [2]     chr1   101-111      -
  [3]     chr1   105-115      *
  [4]     chr2   102-113      +
  [5]     chr2   104-114      *
  [6]     chr3   107-118      +
  [7]     chr3   109-120      -
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
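  • disjoin: split the ranges at every start and end point, so that the resulting pieces no longer overlap one another while still covering exactly the same positions as the original ranges; useful when studying alternative splicing. Unlike gaps(), which returns the complement (the regions not covered by any range), disjoin keeps only covered regions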
> disjoin(gr)
GRanges object with 13 ranges and 0 metadata columns:
       seqnames    ranges strand
          <Rle> <IRanges>  <Rle>
   [1]     chr1   106-116      +
   [2]     chr1   101-111      -
   [3]     chr1   105-115      *
   [4]     chr2       102      +
   [5]     chr2   103-112      +
   ...      ...       ...    ...
   [9]     chr3   108-117      +
  [10]     chr3       118      +
  [11]     chr3       109      -
  [12]     chr3   110-119      -
  [13]     chr3       120      -
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
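
The difference between reduce, disjoin, and gaps is easiest to see on three ranges where two overlap. This is a minimal sketch with made-up coordinates, all on the + strand of one chromosome.

```r
library(GenomicRanges)

gr <- GRanges("chr1",
              IRanges(start = c(1, 5, 20), end = c(10, 15, 25)),
              strand = "+")

# reduce(): merge overlapping ranges into their union
merged <- reduce(gr)
start(merged); end(merged)    # 1-15 and 20-25

# disjoin(): cut at every boundary; pieces do not overlap but cover
# the same positions as the input
pieces <- disjoin(gr)
start(pieces); end(pieces)    # 1-4, 5-10, 11-15, 20-25

# gaps(): the complement, i.e. positions not covered by any range.
# gaps() is computed per strand, so keep only the + strand here.
g <- gaps(gr)
g_plus <- g[strand(g) == "+"]
start(g_plus); end(g_plus)    # 16-19
```

reduce and disjoin cover the same 21 positions; only gaps returns positions outside the input ranges.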
  • sort: sort the ranges within a GRanges object
# Sort in genome order: first by chromosome, then by strand
> sort(gr)
GRanges object with 10 ranges and 2 metadata columns:
     seqnames    ranges strand |     score                GC
        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
   6     chr1   106-116      + |         6 0.444444444444444
   1     chr1   101-111      - |         1                 1
   5     chr1   105-115      * |         5 0.555555555555556
   2     chr2   102-112      + |         2 0.888888888888889
   3     chr2   103-113      + |         3 0.777777777777778
   4     chr2   104-114      * |         4 0.666666666666667
   7     chr3   107-117      + |         7 0.333333333333333
   8     chr3   108-118      + |         8 0.222222222222222
   9     chr3   109-119      - |         9 0.111111111111111
  10     chr3   110-120      - |        10                 0
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
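  • findOverlaps: find the overlaps between two objects, for example as a first step toward asking whether a set of sequences is enriched in given regions. Each row of the result says which range of the first object (the query) overlaps which range of the second object (the subject). %over% is similar but directly returns a logical vector with one value per query range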
> gr6 <- GRanges(seqnames = "chr2",
+               ranges = IRanges(start = c(6,8,12,14,21,22,23),width = c(11,4,2,5,7,7,7)),
+               strand =  "*")
> gr7 <- GRanges(seqnames = "chr2",
+               ranges = IRanges(start = c(6,15),width = 10),
+               strand =  "*")
> gr6
GRanges object with 7 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2      6-16      *
  [2]     chr2      8-11      *
  [3]     chr2     12-13      *
  [4]     chr2     14-18      *
  [5]     chr2     21-27      *
  [6]     chr2     22-28      *
  [7]     chr2     23-29      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> gr7
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2      6-15      *
  [2]     chr2     15-24      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> findOverlaps(gr6,gr7)
Hits object with 9 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         1           2
  [3]         2           1
  [4]         3           1
  [5]         4           1
  [6]         4           2
  [7]         5           2
  [8]         6           2
  [9]         7           2
  -------
  queryLength: 7 / subjectLength: 2

# Or use the logical result to directly subset the overlapping ranges
> gr6[gr6 %over% gr7]
GRanges object with 7 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2      6-16      *
  [2]     chr2      8-11      *
  [3]     chr2     12-13      *
  [4]     chr2     14-18      *
  [5]     chr2     21-27      *
  [6]     chr2     22-28      *
  [7]     chr2     23-29      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
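
The Hits object returned by findOverlaps can be unpacked with queryHits() and subjectHits(), and countOverlaps() gives a per-range tally. The sketch below rebuilds gr6 and gr7 with the same coordinates as above so that it runs on its own; the strand is left at its default of "*".

```r
library(GenomicRanges)

gr6 <- GRanges("chr2",
               IRanges(start = c(6, 8, 12, 14, 21, 22, 23),
                       width = c(11, 4, 2, 5, 7, 7, 7)))
gr7 <- GRanges("chr2", IRanges(start = c(6, 15), width = 10))

hits <- findOverlaps(gr6, gr7)
queryHits(hits)      # indices into gr6
subjectHits(hits)    # indices into gr7

# Number of gr7 ranges overlapped by each gr6 range
per_query <- countOverlaps(gr6, gr7)
per_query            # 2 1 1 2 1 1 1

# Number of gr6 ranges overlapping each gr7 range
per_subject <- countOverlaps(gr7, gr6)
per_subject          # 4 5

# %over% is TRUE wherever the query range has at least one hit
identical(gr6 %over% gr7, per_query > 0)
```

Here every gr6 range overlaps at least one gr7 range, which is why the %over% subset above returns all seven ranges.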
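  • tile: split each range into windows, given either the number of windows or the window width
  • slidingWindows: create sliding windows over each range, given the window width and the step by which the window moves
  • tileGenome: return a set of genomic regions that together form a partition of a given genome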
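
The three windowing functions above can be compared on a toy example. This is a sketch under stated assumptions: a single 100 bp range and hypothetical chromosome lengths of 100 and 50 bp.

```r
library(GenomicRanges)

gr <- GRanges("chr1", IRanges(start = 1, end = 100))

# tile(): give exactly one of n or width; returns a GRangesList
# with one element per input range
tiles <- tile(gr, n = 4)
width(tiles[[1]])    # 25 25 25 25

# slidingWindows(): 50 bp windows moving in steps of 25 bp
wins <- slidingWindows(gr, width = 50, step = 25)
start(wins[[1]])     # 1 26 51
end(wins[[1]])       # 50 75 100

# tileGenome(): partition whole chromosomes from their lengths
# (the seqlengths here are hypothetical)
bins <- tileGenome(c(chr1 = 100L, chr2 = 50L),
                   tilewidth = 50,
                   cut.last.tile.in.chrom = TRUE)
length(bins)         # 3 bins: two on chr1, one on chr2
```

Tiles never overlap, whereas sliding windows overlap whenever the step is smaller than the window width.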