R for data science ||使用dplyr處理關(guān)系數(shù)據(jù)

在處理數(shù)據(jù)的時(shí)候,很少遇到那種只對一張表進(jìn)行處理的情況。經(jīng)常的狀況是關(guān)于某個(gè)主題有幾張表,這些表以某種關(guān)鍵字聯(lián)系在一起。

這時(shí)候:
? 合并連接:向數(shù)據(jù)框中加入新變量,新變量的值是另一個(gè)數(shù)據(jù)框中的匹配觀測。
? 篩選連接:根據(jù)是否匹配另一個(gè)數(shù)據(jù)框中的觀測,篩選數(shù)據(jù)框中的觀測。
? 集合操作:將觀測作為集合元素來處理。

> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
√ ggplot2 3.2.0       √ purrr   0.3.2  
√ tibble  2.1.1       √ dplyr   0.8.0.1
√ tidyr   0.8.3       √ stringr 1.4.0  
√ readr   1.3.1       √ forcats 0.4.0  
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Warning messages:
1: 程輯包‘tidyverse’是用R版本3.5.3 來建造的 
2: 程輯包‘ggplot2’是用R版本3.5.3 來建造的 
3: 程輯包‘tibble’是用R版本3.5.3 來建造的 
4: 程輯包‘tidyr’是用R版本3.5.3 來建造的 
5: 程輯包‘readr’是用R版本3.5.2 來建造的 
6: 程輯包‘purrr’是用R版本3.5.3 來建造的 
7: 程輯包‘dplyr’是用R版本3.5.2 來建造的 
8: 程輯包‘stringr’是用R版本3.5.2 來建造的 
> setwd("D:\\Users\\Administrator\\Desktop\\RStudio\\R-Programming")

描述航空公司的四個(gè)tibble數(shù)據(jù):

  • airlines:可以根據(jù)航空公司的縮寫碼查到公司全名。
  • airports:給出了每個(gè)機(jī)場的信息,通過 faa 機(jī)場編碼進(jìn)行標(biāo)識。
  • planes:給出了每架飛機(jī)的信息,通過 tailnum 進(jìn)行標(biāo)識。
  • weather:給出了紐約機(jī)場每小時(shí)的天氣狀況。
> library(nycflights13)
Warning message:
程輯包‘nycflights13’是用R版本3.5.3 來建造的 
> airlines
# A tibble: 16 x 2
   carrier name                       
   <chr>   <chr>                      
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.         
> airports
# A tibble: 1,458 x 8
   faa   name                             lat    lon   alt    tz dst   tzone              
   <chr> <chr>                          <dbl>  <dbl> <int> <dbl> <chr> <chr>              
 1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/New_York   
 2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/Chicago    
 3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/Chicago    
 4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/New_York   
 5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/New_York   
 6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/New_York   
 7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/New_York   
 8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/New_York   
 9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/New_York   
10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/Los_Angeles
# ... with 1,448 more rows
> planes
# A tibble: 3,322 x 9
   tailnum  year type                    manufacturer     model     engines seats speed engine   
   <chr>   <int> <chr>                   <chr>            <chr>       <int> <int> <int> <chr>    
 1 N10156   2004 Fixed wing multi engine EMBRAER          EMB-145XR       2    55    NA Turbo-fan
 2 N102UW   1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 3 N103US   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 4 N104UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 5 N10575   2002 Fixed wing multi engine EMBRAER          EMB-145LR       2    55    NA Turbo-fan
 6 N105UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 7 N107US   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 8 N108UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 9 N109UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
10 N110UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
# ... with 3,312 more rows
> weather
# A tibble: 26,115 x 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour          
   <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>  <dbl>    <dbl> <dbl> <dttm>             
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA      0    1012     10 2013-01-01 01:00:00
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA      0    1012.    10 2013-01-01 02:00:00
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA      0    1012.    10 2013-01-01 03:00:00
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7         NA      0    1012.    10 2013-01-01 04:00:00
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7         NA      0    1012.    10 2013-01-01 05:00:00
 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5         NA      0    1012.    10 2013-01-01 06:00:00
 7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0         NA      0    1012.    10 2013-01-01 07:00:00
 8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4         NA      0    1012.    10 2013-01-01 08:00:00
 9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0         NA      0    1013.    10 2013-01-01 09:00:00
10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8         NA      0    1012.    10 2013-01-01 10:00:00
# ... with 26,105 more rows

  • flights connects to planes via a single variable, tailnum.
  • flights connects to airlines through the carrier variable.
  • flights connects to airports in two ways: via the origin and dest variables.
  • flights connects to weather via origin (the location), and year, month, day and hour (the time).

用于連接每對數(shù)據(jù)表的變量稱為鍵。鍵是能唯一標(biāo)識觀測的變量(或變量集合)。
? 主鍵:唯一標(biāo)識其所在數(shù)據(jù)表中的觀測。例如, planes
$tailnum 是一個(gè)主鍵,因?yàn)樗梢晕ㄒ槐硎緋lanes的每架飛機(jī)。

  • 外鍵:唯一表示另一個(gè)數(shù)據(jù)框的觀測數(shù)據(jù)。如filghts$tailnum 是一個(gè)外鍵,因?yàn)槠涑霈F(xiàn)在 flights 表中,并可以將每次航班與唯一一架飛機(jī)匹配。
> # 對主鍵進(jìn)行 count() 操作,然后查看是否有 n 大于 1 的記錄,即查看是否唯一值
> planes %>%
+   count(tailnum) %>%
+   filter(n > 1)
# A tibble: 0 x 2
# ... with 2 variables: tailnum <chr>, n <int>
> weather %>%
+   count(year, month, day, hour, origin) %>%
+   filter(n > 1)
# A tibble: 3 x 6
   year month   day  hour origin     n
  <dbl> <dbl> <int> <int> <chr>  <int>
1  2013    11     3     1 EWR        2
2  2013    11     3     1 JFK        2
3  2013    11     3     1 LGA        2
> #flights 表中的主鍵是什么?
> flights %>%
+   count(year, month, day, flight) %>%
+   filter(n > 1)
# A tibble: 29,768 x 5
    year month   day flight     n
   <int> <int> <int>  <int> <int>
 1  2013     1     1      1     2
 2  2013     1     1      3     2
 3  2013     1     1      4     2
 4  2013     1     1     11     3
 5  2013     1     1     15     2
 6  2013     1     1     21     2
 7  2013     1     1     27     4
 8  2013     1     1     31     2
 9  2013     1     1     32     2
10  2013     1     1     35     2
# ... with 29,758 more rows
> flights %>%
+   count(year, month, day, tailnum) %>%
+   filter(n > 1)
# A tibble: 64,928 x 5
    year month   day tailnum     n
   <int> <int> <int> <chr>   <int>
 1  2013     1     1 N0EGMQ      2
 2  2013     1     1 N11189      2
 3  2013     1     1 N11536      2
 4  2013     1     1 N11544      3
 5  2013     1     1 N11551      2
 6  2013     1     1 N12540      2
 7  2013     1     1 N12567      2
 8  2013     1     1 N13123      2
 9  2013     1     1 N13538      3
10  2013     1     1 N13566      3
# ... with 64,918 more rows
> 

如果一張表沒有主鍵,有時(shí)就需要使用 mutate() 函數(shù)和 row_number() 函數(shù)為表加上一個(gè)主鍵。這樣一來,
如果你完成了一些篩選工作,并想要使用原始數(shù)據(jù)檢查的話,就可以更容易地匹配觀測。這種主鍵稱為代理鍵。

理解鏈接
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)
Inner join
x %>% 
  inner_join(y, by = "key")
#> # A tibble: 2 x 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1   
#> 2     2 x2    y2
Outer joins

A left join keeps all observations in x.
A right join keeps all observations in y.
A full join keeps all observations in x and y.

Another way to depict the different types of joins is with a Venn diagram:

Duplicate keys
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)
left_join(x, y, by = "key")
#> # A tibble: 4 x 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1   
#> 2     2 x2    y2   
#> 3     2 x3    y2   
#> 4     1 x4    y1
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4"
)
left_join(x, y, by = "key")
#> # A tibble: 6 x 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1   
#> 2     2 x2    y2   
#> 3     2 x2    y3   
#> 4     2 x3    y2   
#> 5     2 x3    y3   
#> 6     3 x4    y4
Defining the key columns
flights2 %>% 
  left_join(weather)
#> Joining, by = c("year", "month", "day", "hour", "origin")
#> # A tibble: 336,776 x 18
#>    year month   day  hour origin dest  tailnum carrier  temp  dewp humid
#>   <dbl> <dbl> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
#> 1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
#> 2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
#> 3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
#> 4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
#> 5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
#> 6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
#> # … with 3.368e+05 more rows, and 7 more variables: wind_dir <dbl>,
#> #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#> #   visib <dbl>, time_hour <dttm>
flights2 %>% 
  left_join(planes, by = "tailnum")
#> # A tibble: 336,776 x 16
#>   year.x month   day  hour origin dest  tailnum carrier year.y type 
#>    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>
#> 1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixe…
#> 2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixe…
#> 3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixe…
#> 4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixe…
#> 5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixe…
#> 6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixe…
#> # … with 3.368e+05 more rows, and 6 more variables: manufacturer <chr>,
#> #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
flights2 %>% 
  left_join(airports, c("dest" = "faa"))
#> # A tibble: 336,776 x 15
#>    year month   day  hour origin dest  tailnum carrier name    lat   lon
#>   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl>
#> 1  2013     1     1     5 EWR    IAH   N14228  UA      Geor…  30.0 -95.3
#> 2  2013     1     1     5 LGA    IAH   N24211  UA      Geor…  30.0 -95.3
#> 3  2013     1     1     5 JFK    MIA   N619AA  AA      Miam…  25.8 -80.3
#> 4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>   NA    NA  
#> 5  2013     1     1     6 LGA    ATL   N668DN  DL      Hart…  33.6 -84.4
#> 6  2013     1     1     5 EWR    ORD   N39463  UA      Chic…  42.0 -87.9
#> # … with 3.368e+05 more rows, and 4 more variables: alt <int>, tz <dbl>,
#> #   dst <chr>, tzone <chr>

flights2 %>% 
  left_join(airports, c("origin" = "faa"))
#> # A tibble: 336,776 x 15
#>    year month   day  hour origin dest  tailnum carrier name    lat   lon
#>   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl>
#> 1  2013     1     1     5 EWR    IAH   N14228  UA      Newa…  40.7 -74.2
#> 2  2013     1     1     5 LGA    IAH   N24211  UA      La G…  40.8 -73.9
#> 3  2013     1     1     5 JFK    MIA   N619AA  AA      John…  40.6 -73.8
#> 4  2013     1     1     5 JFK    BQN   N804JB  B6      John…  40.6 -73.8
#> 5  2013     1     1     6 LGA    ATL   N668DN  DL      La G…  40.8 -73.9
#> 6  2013     1     1     5 EWR    ORD   N39463  UA      Newa…  40.7 -74.2
#> # … with 3.368e+05 more rows, and 4 more variables: alt <int>, tz <dbl>,
#> #   dst <chr>, tzone <chr>
Other implementations
dplyr merge
inner_join(x, y) merge(x, y)
left_join(x, y) merge(x, y, all.x = TRUE)
right_join(x, y) merge(x, y, all.y = TRUE)
full_join(x, y) merge(x, y, all.x = TRUE, all.y = TRUE)
Filtering joins

semi_join(x, y): 保留 x 表中與 y 表中的觀測相匹配的所有觀測。
? anti_join(x, y): 丟棄 x 表中與 y 表中的觀測相匹配的所有觀測。

top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
# 篩選出前10個(gè)
flights %>%
filter(dest %in% top_dest$dest)
# 半連接,只保留兩個(gè)數(shù)據(jù)框都有的行的數(shù)據(jù)
flights %>%
semi_join(top_dest)

# 反連接保留 x 表中那些沒有匹配 y 表的行。
# 反連接可以用于診斷連接中的不匹配。例如,在連接 flights 和 planes 時(shí),你可能想知道flights 中是否有很多行在 planes 中沒有匹配記錄:
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
Join problems
Set operations

The final type of two-table verb are the set operations. Generally, I use these the least frequently, but they are occasionally useful when you want to break a single complex filter into simpler pieces. All these operations work with a complete row, comparing the values of every variable. These expect the x and y inputs to have the same variables, and treat the observations like sets:

intersect(x, y): return only observations in both x and y.
union(x, y): return unique observations in x and y.
setdiff(x, y): return observations in x, but not in y.
Given this simple data:

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)
The four possibilities are:
intersect(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1
# Note that we get 3 rows, not 4
union(df1, df2)
#> # A tibble: 3 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     2
#> 2     2     1
#> 3     1     1
setdiff(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     2     1
setdiff(df2, df1)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     2

r4ds
R數(shù)據(jù)科學(xué)(九)使用dplyr處理關(guān)系數(shù)據(jù)

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請結(jié)合常識與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容