目標(biāo):結(jié)合正則表達(dá)式,實(shí)現(xiàn)
確定與某種模式匹配的字符串
找出匹配位置
提取匹配內(nèi)容
替換匹配內(nèi)容
基于匹配拆分字符串
1. 匹配檢測(cè)
1.1 str_detect()
#返回邏輯向量
> str_detect(c("huang","si","yuan"),"a")
[1] TRUE FALSE TRUE
#能匹配上幾個(gè)向量元素
> sum(str_detect(c("huang","si","yuan"),"a"))
[1] 2
#匹配上的向量元素的占比
> mean(str_detect(c("huang","si","yuan"),"a"))
[1] 0.6666667
還能選取出匹配某種模式的元素
預(yù)備知識(shí):邏輯取子集,如
> hsy <- c("huang","si","yuan")
> hsy[c(TRUE,TRUE,FALSE)]
[1] "huang" "si"
繼續(xù)
> hsy <- c("huang","si","yuan")
> hsy[str_detect(hsy,"a")]
[1] "huang" "yuan"
其他方法
> str_subset(hsy,"a")
[1] "huang" "yuan"
更實(shí)用的場(chǎng)景
針對(duì)數(shù)據(jù)框的某一列,篩選出符合條件的行記錄
> df <- tibble(
+ word=words,
+ i=seq_along(word) #添加行號(hào)
+ )
> df %>% filter(str_detect(word,"x$"))
# A tibble: 4 x 2
word i
<chr> <int>
1 box 108
2 sex 747
3 six 772
4 tax 841
1.2 str_count()
#返回每一個(gè)元素匹配的次數(shù)
> str_count(hsy,"a")
[1] 1 0 1
#平均每個(gè)元素匹配的次數(shù)
> mean(str_count(hsy,"a"))
[1] 0.6666667
2. 提取匹配內(nèi)容
這里是指匹配的內(nèi)容,與上面的提取向量元素有區(qū)別
2.1 str_extract()
sentences數(shù)據(jù)集是stringr包自帶的,為720個(gè)元素的字符串向量
先提取能匹配上的句子/行看看
> has_red_blue <- str_subset(sentences,"red|blue")
> head(has_red_blue)
[1] "Glue the sheet to the dark blue background."
[2] "Two blue fish swam in the tank."
[3] "The colt reared and threw the tall rider."
[4] "The wide road shimmered in the hot sun."
[5] "See the cat glaring at the scared mouse."
[6] "A wisp of cloud hung in the blue air."
提取匹配內(nèi)容, 注意str_extract()只會(huì)提取第一個(gè)匹配
> matches <- str_extract(has_red_blue,"red|blue")
> head(matches)
[1] "blue" "blue" "red" "red" "red" "blue"
2.2 str_extract_all()
如何提取多個(gè)匹配呢?
先來(lái)看看有沒(méi)有多次匹配的行
> more <- has_red_blue[str_count(has_red_blue,"red|blue") > 1]
> more
[1] "It is hard to erase blue or red ink."
用str_extract_all()提取
> str_extract_all(more,"red|blue") #返回列表
[[1]]
[1] "blue" "red"
> str_extract_all(more,"red|blue",simplify = T) #返回矩陣
[,1] [,2]
[1,] "blue" "red"
> head(str_extract_all(has_red_blue,"red|blue",simplify = T)) #每一行長(zhǎng)度自動(dòng)統(tǒng)一
[,1] [,2]
[1,] "blue" ""
[2,] "blue" ""
[3,] "red" ""
[4,] "red" ""
[5,] "red" ""
[6,] "blue" ""
3. 分組匹配
str_match()可以給出每個(gè)分組的詳細(xì)匹配內(nèi)容,比括號(hào)搭配\1, \2方便
> two_words <- "(a|the) ([^ ]+)"
> has_two_words <- sentences %>% str_subset(two_words) %>% head(10)
> has_two_words %>% str_extract(two_words) #給出模式的完整匹配
[1] "the smooth" "the sheet" "the depth" "a chicken" "the parked" "the sun"
[7] "the huge" "the ball" "the woman" "a helps"
> has_two_words %>% str_match(two_words) #給出完整匹配以及分組匹配
[,1] [,2] [,3]
[1,] "the smooth" "the" "smooth"
[2,] "the sheet" "the" "sheet"
[3,] "the depth" "the" "depth"
[4,] "a chicken" "a" "chicken"
[5,] "the parked" "the" "parked"
[6,] "the sun" "the" "sun"
[7,] "the huge" "the" "huge"
[8,] "the ball" "the" "ball"
[9,] "the woman" "the" "woman"
[10,] "a helps" "a" "helps"
4. 替換匹配內(nèi)容
str_replace()
> hsy <- c("huang","si","yuan")
> str_replace(hsy,"[aeiou]"," ")
[1] "h ang" "s " "y an"
> str_replace_all(hsy, "[aeiou]", " ")
[1] "h ng" "s " "y n"
同時(shí)執(zhí)行多種替換
> x <- c("1 house", "2 cars", "3 people")
> str_replace_all(x, c("1" = "one","2" = "two", "3" = "three"))
[1] "one house" "two cars" "three people"
5. 拆分
str_split()
str_split()返回列表,加了simplify之后變?yōu)榫仃?/p>
> sentences %>% head(4) %>% str_split(" ",simplify = T)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks." ""
[2,] "Glue" "the" "sheet" "to" "the" "dark" "blue" "background." ""
[3,] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
[4,] "These" "days" "a" "chicken" "leg" "is" "a" "rare" "dish."
如何提取str_split()返回的列表元素
> "a|b|c" %>% str_split("\\|") %>% .[[1]]
[1] "a" "b" "c"
> "a|b|c" %>% str_split("\\|") %>% .[[1]] %>% .[2]
[1] "b"
6. 定位匹配內(nèi)容
str_locate()
> str_locate(hsy,"[aeiou]")
start end
[1,] 2 2
[2,] 2 2
[3,] 2 2
> str_locate_all(hsy,"[aeiou]")
[[1]]
start end
[1,] 2 2
[2,] 3 3
[[2]]
start end
[1,] 2 2
[[3]]
start end
[1,] 2 2
[2,] 3 3
7. 使用regex()調(diào)整模式規(guī)則
str_view_all(hsy,regex("[aeiou]",ignore_case = T,multiline = T,comments = T,dotall = T))
ignore_case = T:不區(qū)分大小寫(xiě)
multiline = T:^和$分別表示每一行的開(kāi)頭和結(jié)尾,而不是整個(gè)字符串的
comments = T:添加注釋
dotall = T:點(diǎn)號(hào).能夠代表?yè)Q行符