R Study Notes (7): Working with Strings Using stringr (Part 2)

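Goal: combine stringr with regular expressions to

determine which strings match a pattern
find the positions of matches
extract the matched text
replace the matched text
split strings on matches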

1. Detecting matches

1.1 str_detect()
# returns a logical vector
> str_detect(c("huang","si","yuan"),"a")
[1]  TRUE FALSE  TRUE
# how many elements match
> sum(str_detect(c("huang","si","yuan"),"a"))
[1] 2
# proportion of elements that match
> mean(str_detect(c("huang","si","yuan"),"a"))
[1] 0.6666667

str_detect() can also be used to select the elements that match a pattern.

Background: logical subsetting, for example

> hsy <- c("huang","si","yuan")
> hsy[c(TRUE,TRUE,FALSE)]
[1] "huang" "si"

Continuing, with str_detect() supplying the logical vector:

> hsy <- c("huang","si","yuan")
> hsy[str_detect(hsy,"a")]
[1] "huang" "yuan"

Alternatively, str_subset() does the same in one step:

> str_subset(hsy,"a")
[1] "huang" "yuan"
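
The equivalence of the two approaches can be checked directly. The sketch below is a minimal illustration, assuming a stringr version that supports the negate argument (1.4.0 or later, as far as I know); the variable names by_index, by_subset and no_a are made up for this example.

```r
library(stringr)

hsy <- c("huang", "si", "yuan")

# Logical subsetting and str_subset() select the same elements
by_index  <- hsy[str_detect(hsy, "a")]
by_subset <- str_subset(hsy, "a")
identical(by_index, by_subset)
# [1] TRUE

# negate = TRUE keeps the elements that do NOT match the pattern
no_a <- str_subset(hsy, "a", negate = TRUE)
no_a
# [1] "si"
```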
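
A more practical scenario

Filter a data frame down to the rows where one column matches a pattern. This assumes tibble and dplyr are loaded (for example via library(tidyverse)); words is a character vector that ships with stringr.

> df <- tibble(
+   word=words,
+   i=seq_along(word) # add a row number
+ )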
> df %>% filter(str_detect(word,"x$"))
# A tibble: 4 x 2
  word      i
  <chr> <int>
1 box     108
2 sex     747
3 six     772
4 tax     841
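
For readers without the tidyverse installed, the same filtering can be sketched in base R. This is only an illustrative alternative, not the approach used in the post; the name ends_in_x is made up here, and the expected words are the ones shown in the tibble output above.

```r
library(stringr)

# Base-R data frame with the same two columns as the tibble above
df <- data.frame(
  word = words,            # word list that ships with stringr
  i = seq_along(words),    # row number
  stringsAsFactors = FALSE
)

# Logical row indexing plays the role of filter()
ends_in_x <- df[str_detect(df$word, "x$"), ]
ends_in_x$word
# [1] "box" "sex" "six" "tax"
```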

1.2 str_count()
# number of matches in each element
> str_count(hsy,"a")
[1] 1 0 1
# average number of matches per element
> mean(str_count(hsy,"a"))
[1] 0.6666667
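
One detail worth knowing when counting: matches never overlap. The short sketch below illustrates this and also counts vowels and consonants in the same hsy vector; the variable names are chosen for this example only.

```r
library(stringr)

# Matches do not overlap: "aba" is found twice in "abababa", not three times
overlap_count <- str_count("abababa", "aba")
overlap_count
# [1] 2

# Counting vowels and non-vowels in each element
hsy <- c("huang", "si", "yuan")
vowels <- str_count(hsy, "[aeiou]")
vowels
# [1] 2 1 2
consonants <- str_count(hsy, "[^aeiou]")
consonants
# [1] 3 1 2
```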

2. Extracting matches

Here we extract the matched text itself, which is different from selecting whole vector elements as in the previous section.

2.1 str_extract()

The sentences dataset ships with the stringr package; it is a character vector with 720 elements.

First, pull out the sentences that match and take a look:

> has_red_blue <- str_subset(sentences,"red|blue")
> head(has_red_blue)
[1] "Glue the sheet to the dark blue background."
[2] "Two blue fish swam in the tank."            
[3] "The colt reared and threw the tall rider."  
[4] "The wide road shimmered in the hot sun."    
[5] "See the cat glaring at the scared mouse."   
[6] "A wisp of cloud hung in the blue air."  

Now extract the matched text. Note that str_extract() returns only the first match in each element.

> matches <- str_extract(has_red_blue,"red|blue") 
> head(matches)
[1] "blue" "blue" "red"  "red"  "red"  "blue"
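
Several of the "red" matches above come from inside longer words such as "reared" and "scared". If only the whole words are wanted, one option is to wrap the alternation in word boundaries. The sketch below uses three sentences copied from the outputs in this post; loose and strict are names made up for the illustration.

```r
library(stringr)

# Three sentences that appear in the outputs of this post
x <- c("Two blue fish swam in the tank.",
       "The colt reared and threw the tall rider.",
       "It is hard to erase blue or red ink.")

# Without boundaries, "red" is found inside "reared"
loose <- str_extract(x, "red|blue")
loose
# [1] "blue" "red"  "blue"

# With \\b on both sides, only whole words match; no match gives NA
strict <- str_extract(x, "\\b(red|blue)\\b")
strict
# [1] "blue" NA     "blue"
```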

2.2 str_extract_all()

How do we extract every match rather than just the first?
First check whether any sentence contains more than one match:

> more <- has_red_blue[str_count(has_red_blue,"red|blue") > 1]
> more
[1] "It is hard to erase blue or red ink."

Extract them with str_extract_all():

> str_extract_all(more,"red|blue") # returns a list
[[1]]
[1] "blue" "red" 
> str_extract_all(more,"red|blue",simplify = TRUE) # returns a matrix
     [,1]   [,2] 
[1,] "blue" "red"
> head(str_extract_all(has_red_blue,"red|blue",simplify = TRUE)) # short rows are padded with "" to a common width
     [,1]   [,2]
[1,] "blue" ""  
[2,] "blue" ""  
[3,] "red"  ""  
[4,] "red"  ""  
[5,] "red"  ""  
[6,] "blue" ""

3. Grouped matches

str_match() returns the text matched by each capture group, which is more convenient than combining parentheses with backreferences such as \1 and \2.

> two_words <- "(a|the) ([^ ]+)"
> has_two_words <- sentences %>% str_subset(two_words) %>% head(10) 
> has_two_words %>% str_extract(two_words) # complete match of the pattern
 [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked" "the sun"   
 [7] "the huge"   "the ball"   "the woman"  "a helps"   
> has_two_words %>% str_match(two_words) # complete match plus one column per group
      [,1]         [,2]  [,3]     
 [1,] "the smooth" "the" "smooth" 
 [2,] "the sheet"  "the" "sheet"  
 [3,] "the depth"  "the" "depth"  
 [4,] "a chicken"  "a"   "chicken"
 [5,] "the parked" "the" "parked" 
 [6,] "the sun"    "the" "sun"    
 [7,] "the huge"   "the" "huge"   
 [8,] "the ball"   "the" "ball"   
 [9,] "the woman"  "the" "woman"  
[10,] "a helps"    "a"   "helps"
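
The matrix returned by str_match() can be indexed by column to pull out a single group. The sketch below uses three made-up strings rather than the sentences data. It also shows why "a helps" appears above: the pattern has no word boundary, so the "a" at the end of a word such as "soda" can match. Adding \\b is one possible fix, offered here as a suggestion rather than as part of the original notes.

```r
library(stringr)

# Hypothetical inputs; the last one contains neither "a " nor "the "
x <- c("the smooth planks", "a chicken leg", "no article here")
m <- str_match(x, "(a|the) ([^ ]+)")

whole_match <- m[, 1]  # column 1: the complete match
second_word <- m[, 3]  # column 3: the second capture group
second_word
# [1] "smooth"  "chicken" NA

# Without a boundary, the final "a" of "soda" counts as the article
unbounded <- str_match("soda helps", "(a|the) ([^ ]+)")[, 1]
unbounded
# [1] "a helps"

# With \\b the article must start at a word boundary, so nothing matches
bounded <- str_match("soda helps", "\\b(a|the) ([^ ]+)")[, 1]
bounded
# [1] NA
```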

4. Replacing matches

str_replace() replaces only the first match in each element; str_replace_all() replaces every match.

> hsy <- c("huang","si","yuan")
> str_replace(hsy,"[aeiou]"," ")
[1] "h ang" "s "    "y an" 
> str_replace_all(hsy, "[aeiou]", " ")
[1] "h  ng" "s "    "y  n" 

Several replacements can be performed at once by passing a named vector:

> x <- c("1 house", "2 cars", "3 people")
> str_replace_all(x, c("1" = "one","2" = "two", "3" = "three"))
[1] "one house"    "two cars"     "three people"
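
Replacements can also refer back to capture groups with \\1, \\2 and so on, which ties in with the grouping section above. A minimal sketch with made-up input strings that swaps the first two words:

```r
library(stringr)

# Hypothetical two-word strings
x <- c("huang si", "si yuan")

# \\1 and \\2 in the replacement insert the text of the first and second group
swapped <- str_replace(x, "(\\w+) (\\w+)", "\\2 \\1")
swapped
# [1] "si huang" "yuan si"
```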

5. Splitting

str_split()
str_split() returns a list; with simplify = TRUE it returns a matrix instead.

> sentences %>% head(4) %>% str_split(" ",simplify = TRUE)
     [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]          [,9]   
[1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."     ""     
[2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background." ""     
[3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"           "well."
[4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"        "dish."

How to extract elements from the list that str_split() returns:

> "a|b|c" %>% str_split("\\|") %>% .[[1]]
[1] "a" "b" "c"
> "a|b|c" %>% str_split("\\|") %>% .[[1]] %>% .[2]
[1] "b"
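
Two further options may be handy here, shown as a small sketch rather than a full treatment: the n argument limits the number of pieces, and fixed() treats the pattern as a literal string so the pipe character does not need escaping. Plain [[1]] indexing is used instead of the pipe.

```r
library(stringr)

# n = 2 stops after the first split; the remainder stays in one piece
two_parts <- str_split("a|b|c", "\\|", n = 2)[[1]]
two_parts
# [1] "a"   "b|c"

# fixed() matches the literal "|" with no regex escaping
all_parts <- str_split("a|b|c", fixed("|"))[[1]]
all_parts
# [1] "a" "b" "c"

# Pick the second piece
all_parts[2]
# [1] "b"
```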

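6. Locating matches

str_locate() returns the start and end position of the first match in each element; str_locate_all() returns every match, as a list with one matrix per element.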

> str_locate(hsy,"[aeiou]")
     start end
[1,]     2   2
[2,]     2   2
[3,]     2   2
> str_locate_all(hsy,"[aeiou]")
[[1]]
     start end
[1,]     2   2
[2,]     3   3

[[2]]
     start end
[1,]     2   2

[[3]]
     start end
[1,]     2   2
[2,]     3   3
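
The positions become useful when combined with str_sub(). The sketch below feeds the start and end columns back into str_sub() and checks that the result agrees with str_extract(); loc and first_vowel are illustrative names.

```r
library(stringr)

hsy <- c("huang", "si", "yuan")

# Start and end of the first vowel in each element
loc <- str_locate(hsy, "[aeiou]")

# Use the positions to cut out the matched text
first_vowel <- str_sub(hsy, loc[, "start"], loc[, "end"])
first_vowel
# [1] "u" "i" "u"

# Same result as extracting the first match directly
identical(first_vowel, str_extract(hsy, "[aeiou]"))
# [1] TRUE
```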

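7. Adjusting matching rules with regex()

str_view_all(hsy,regex("[aeiou]",ignore_case = TRUE,multiline = TRUE,comments = TRUE,dotall = TRUE))

ignore_case = TRUE: matching is case-insensitive
multiline = TRUE: ^ and $ match the start and end of each line rather than of the whole string
comments = TRUE: allows comments and whitespace inside the pattern to make it more readable
dotall = TRUE: lets the dot . match any character, including the newline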
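
The effect of each option is easier to see one at a time. The sketch below uses made-up inputs and str_detect(), str_extract() and str_extract_all() instead of the viewer, so the results print in the console; all variable names are assumptions for this illustration.

```r
library(stringr)

# ignore_case: "^a" also matches a leading capital "A"
case_hits <- str_detect(c("Apple", "banana"), regex("^a", ignore_case = TRUE))
case_hits
# [1]  TRUE FALSE

# multiline: ^ matches at the start of every line, not only of the string
x <- "Line 1\nLine 2\nLine 3"
single <- str_extract_all(x, "^Line")[[1]]
single
# [1] "Line"
multi <- str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
multi
# [1] "Line" "Line" "Line"

# dotall: by default . does not match a newline
dot_default <- str_detect("a\nb", "a.b")
dot_all <- str_detect("a\nb", regex("a.b", dotall = TRUE))
c(dot_default, dot_all)
# [1] FALSE  TRUE

# comments: whitespace and # comments inside the pattern are ignored
number_pair <- regex("
  (\\d+)  # first run of digits
  -       # a literal hyphen
  (\\d+)  # second run of digits
", comments = TRUE)
pair <- str_extract("tel 010-1234", number_pair)
pair
# [1] "010-1234"
```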
