在處理數(shù)據(jù)的時(shí)候,很少遇到那種只對一張表進(jìn)行處理的情況。經(jīng)常的狀況是關(guān)于某個(gè)主題有幾張表,這些表以某種關(guān)鍵字聯(lián)系在一起。
這時(shí)候:
? 合并連接:向數(shù)據(jù)框中加入新變量,新變量的值是另一個(gè)數(shù)據(jù)框中的匹配觀測。
? 篩選連接:根據(jù)是否匹配另一個(gè)數(shù)據(jù)框中的觀測,篩選數(shù)據(jù)框中的觀測。
? 集合操作:將觀測作為集合元素來處理。
> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
√ ggplot2 3.2.0 √ purrr 0.3.2
√ tibble 2.1.1 √ dplyr 0.8.0.1
√ tidyr 0.8.3 √ stringr 1.4.0
√ readr 1.3.1 √ forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
Warning messages:
1: 程輯包‘tidyverse’是用R版本3.5.3 來建造的
2: 程輯包‘ggplot2’是用R版本3.5.3 來建造的
3: 程輯包‘tibble’是用R版本3.5.3 來建造的
4: 程輯包‘tidyr’是用R版本3.5.3 來建造的
5: 程輯包‘readr’是用R版本3.5.2 來建造的
6: 程輯包‘purrr’是用R版本3.5.3 來建造的
7: 程輯包‘dplyr’是用R版本3.5.2 來建造的
8: 程輯包‘stringr’是用R版本3.5.2 來建造的
> setwd("D:\\Users\\Administrator\\Desktop\\RStudio\\R-Programming")
描述航空公司的四個(gè)tibble數(shù)據(jù):
- airlines:可以根據(jù)航空公司的縮寫碼查到公司全名。
- airports:給出了每個(gè)機(jī)場的信息,通過 faa 機(jī)場編碼進(jìn)行標(biāo)識。
- planes:給出了每架飛機(jī)的信息,通過 tailnum 進(jìn)行標(biāo)識。
- weather:給出了紐約機(jī)場每小時(shí)的天氣狀況。
> library(nycflights13)
Warning message:
程輯包‘nycflights13’是用R版本3.5.3 來建造的
> airlines
# A tibble: 16 x 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 EV ExpressJet Airlines Inc.
7 F9 Frontier Airlines Inc.
8 FL AirTran Airways Corporation
9 HA Hawaiian Airlines Inc.
10 MQ Envoy Air
11 OO SkyWest Airlines Inc.
12 UA United Air Lines Inc.
13 US US Airways Inc.
14 VX Virgin America
15 WN Southwest Airlines Co.
16 YV Mesa Airlines Inc.
> airports
# A tibble: 1,458 x 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_York
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Chicago
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chicago
4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_York
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_York
6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/New_York
7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/New_York
8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/New_York
9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/New_York
10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/Los_Angeles
# ... with 1,448 more rows
> planes
# A tibble: 3,322 x 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NA Turbo-fan
2 N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
3 N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
4 N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
5 N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
6 N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
7 N107US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
8 N108UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
9 N109UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
10 N110UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
# ... with 3,312 more rows
> weather
# A tibble: 26,115 x 15
origin year month day hour temp dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour
<chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dttm>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA 0 1012 10 2013-01-01 01:00:00
2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA 0 1012. 10 2013-01-01 02:00:00
3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA 0 1012. 10 2013-01-01 03:00:00
4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA 0 1012. 10 2013-01-01 04:00:00
5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA 0 1012. 10 2013-01-01 05:00:00
6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA 0 1012. 10 2013-01-01 06:00:00
7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0 NA 0 1012. 10 2013-01-01 07:00:00
8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4 NA 0 1012. 10 2013-01-01 08:00:00
9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0 NA 0 1013. 10 2013-01-01 09:00:00
10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8 NA 0 1012. 10 2013-01-01 10:00:00
# ... with 26,105 more rows

- flights connects to planes via a single variable, tailnum.
- flights connects to airlines through the carrier variable.
- flights connects to airports in two ways: via the origin and dest variables.
- flights connects to weather via origin (the location), and year, month, day and hour (the time).
用于連接每對數(shù)據(jù)表的變量稱為鍵。鍵是能唯一標(biāo)識觀測的變量(或變量集合)。
? 主鍵:唯一標(biāo)識其所在數(shù)據(jù)表中的觀測。例如, planes
$tailnum 是一個(gè)主鍵,因?yàn)樗梢晕ㄒ槐硎緋lanes的每架飛機(jī)。
- 外鍵:唯一表示另一個(gè)數(shù)據(jù)框的觀測數(shù)據(jù)。如filghts$tailnum 是一個(gè)外鍵,因?yàn)槠涑霈F(xiàn)在 flights 表中,并可以將每次航班與唯一一架飛機(jī)匹配。
> # 對主鍵進(jìn)行 count() 操作,然后查看是否有 n 大于 1 的記錄,即查看是否唯一值
> planes %>%
+ count(tailnum) %>%
+ filter(n > 1)
# A tibble: 0 x 2
# ... with 2 variables: tailnum <chr>, n <int>
> weather %>%
+ count(year, month, day, hour, origin) %>%
+ filter(n > 1)
# A tibble: 3 x 6
year month day hour origin n
<dbl> <dbl> <int> <int> <chr> <int>
1 2013 11 3 1 EWR 2
2 2013 11 3 1 JFK 2
3 2013 11 3 1 LGA 2
> #flights 表中的主鍵是什么?
> flights %>%
+ count(year, month, day, flight) %>%
+ filter(n > 1)
# A tibble: 29,768 x 5
year month day flight n
<int> <int> <int> <int> <int>
1 2013 1 1 1 2
2 2013 1 1 3 2
3 2013 1 1 4 2
4 2013 1 1 11 3
5 2013 1 1 15 2
6 2013 1 1 21 2
7 2013 1 1 27 4
8 2013 1 1 31 2
9 2013 1 1 32 2
10 2013 1 1 35 2
# ... with 29,758 more rows
> flights %>%
+ count(year, month, day, tailnum) %>%
+ filter(n > 1)
# A tibble: 64,928 x 5
year month day tailnum n
<int> <int> <int> <chr> <int>
1 2013 1 1 N0EGMQ 2
2 2013 1 1 N11189 2
3 2013 1 1 N11536 2
4 2013 1 1 N11544 3
5 2013 1 1 N11551 2
6 2013 1 1 N12540 2
7 2013 1 1 N12567 2
8 2013 1 1 N13123 2
9 2013 1 1 N13538 3
10 2013 1 1 N13566 3
# ... with 64,918 more rows
>
如果一張表沒有主鍵,有時(shí)就需要使用 mutate() 函數(shù)和 row_number() 函數(shù)為表加上一個(gè)主鍵。這樣一來,
如果你完成了一些篩選工作,并想要使用原始數(shù)據(jù)檢查的話,就可以更容易地匹配觀測。這種主鍵稱為代理鍵。
理解鏈接

x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)


Inner join

x %>%
inner_join(y, by = "key")
#> # A tibble: 2 x 3
#> key val_x val_y
#> <dbl> <chr> <chr>
#> 1 1 x1 y1
#> 2 2 x2 y2
Outer joins
A left join keeps all observations in x.
A right join keeps all observations in y.
A full join keeps all observations in x and y.

Another way to depict the different types of joins is with a Venn diagram:

Duplicate keys

x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
1, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2"
)
left_join(x, y, by = "key")
#> # A tibble: 4 x 3
#> key val_x val_y
#> <dbl> <chr> <chr>
#> 1 1 x1 y1
#> 2 2 x2 y2
#> 3 2 x3 y2
#> 4 1 x4 y1

x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
3, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
2, "y3",
3, "y4"
)
left_join(x, y, by = "key")
#> # A tibble: 6 x 3
#> key val_x val_y
#> <dbl> <chr> <chr>
#> 1 1 x1 y1
#> 2 2 x2 y2
#> 3 2 x2 y3
#> 4 2 x3 y2
#> 5 2 x3 y3
#> 6 3 x4 y4
Defining the key columns
flights2 %>%
left_join(weather)
#> Joining, by = c("year", "month", "day", "hour", "origin")
#> # A tibble: 336,776 x 18
#> year month day hour origin dest tailnum carrier temp dewp humid
#> <dbl> <dbl> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
#> 2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
#> 3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
#> 4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
#> 5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
#> 6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
#> # … with 3.368e+05 more rows, and 7 more variables: wind_dir <dbl>,
#> # wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#> # visib <dbl>, time_hour <dttm>
flights2 %>%
left_join(planes, by = "tailnum")
#> # A tibble: 336,776 x 16
#> year.x month day hour origin dest tailnum carrier year.y type
#> <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixe…
#> 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixe…
#> 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixe…
#> 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixe…
#> 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixe…
#> 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixe…
#> # … with 3.368e+05 more rows, and 6 more variables: manufacturer <chr>,
#> # model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
flights2 %>%
left_join(airports, c("dest" = "faa"))
#> # A tibble: 336,776 x 15
#> year month day hour origin dest tailnum carrier name lat lon
#> <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 2013 1 1 5 EWR IAH N14228 UA Geor… 30.0 -95.3
#> 2 2013 1 1 5 LGA IAH N24211 UA Geor… 30.0 -95.3
#> 3 2013 1 1 5 JFK MIA N619AA AA Miam… 25.8 -80.3
#> 4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA
#> 5 2013 1 1 6 LGA ATL N668DN DL Hart… 33.6 -84.4
#> 6 2013 1 1 5 EWR ORD N39463 UA Chic… 42.0 -87.9
#> # … with 3.368e+05 more rows, and 4 more variables: alt <int>, tz <dbl>,
#> # dst <chr>, tzone <chr>
flights2 %>%
left_join(airports, c("origin" = "faa"))
#> # A tibble: 336,776 x 15
#> year month day hour origin dest tailnum carrier name lat lon
#> <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 2013 1 1 5 EWR IAH N14228 UA Newa… 40.7 -74.2
#> 2 2013 1 1 5 LGA IAH N24211 UA La G… 40.8 -73.9
#> 3 2013 1 1 5 JFK MIA N619AA AA John… 40.6 -73.8
#> 4 2013 1 1 5 JFK BQN N804JB B6 John… 40.6 -73.8
#> 5 2013 1 1 6 LGA ATL N668DN DL La G… 40.8 -73.9
#> 6 2013 1 1 5 EWR ORD N39463 UA Newa… 40.7 -74.2
#> # … with 3.368e+05 more rows, and 4 more variables: alt <int>, tz <dbl>,
#> # dst <chr>, tzone <chr>
Other implementations
| dplyr | merge |
|---|---|
| inner_join(x, y) | merge(x, y) |
| left_join(x, y) | merge(x, y, all.x = TRUE) |
| right_join(x, y) | merge(x, y, all.y = TRUE) |
| full_join(x, y) | merge(x, y, all.x = TRUE, all.y = TRUE) |
Filtering joins
semi_join(x, y): 保留 x 表中與 y 表中的觀測相匹配的所有觀測。
? anti_join(x, y): 丟棄 x 表中與 y 表中的觀測相匹配的所有觀測。
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
# 篩選出前10個(gè)
flights %>%
filter(dest %in% top_dest$dest)
# 半連接,只保留兩個(gè)數(shù)據(jù)框都有的行的數(shù)據(jù)
flights %>%
semi_join(top_dest)
# 反連接保留 x 表中那些沒有匹配 y 表的行。
# 反連接可以用于診斷連接中的不匹配。例如,在連接 flights 和 planes 時(shí),你可能想知道flights 中是否有很多行在 planes 中沒有匹配記錄:
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)



Join problems
Set operations
The final type of two-table verb are the set operations. Generally, I use these the least frequently, but they are occasionally useful when you want to break a single complex filter into simpler pieces. All these operations work with a complete row, comparing the values of every variable. These expect the x and y inputs to have the same variables, and treat the observations like sets:
intersect(x, y): return only observations in both x and y.
union(x, y): return unique observations in x and y.
setdiff(x, y): return observations in x, but not in y.
Given this simple data:
df1 <- tribble(
~x, ~y,
1, 1,
2, 1
)
df2 <- tribble(
~x, ~y,
1, 1,
1, 2
)
The four possibilities are:
intersect(df1, df2)
#> # A tibble: 1 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 1
# Note that we get 3 rows, not 4
union(df1, df2)
#> # A tibble: 3 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 2
#> 2 2 1
#> 3 1 1
setdiff(df1, df2)
#> # A tibble: 1 x 2
#> x y
#> <dbl> <dbl>
#> 1 2 1
setdiff(df2, df1)
#> # A tibble: 1 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 2