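Chapter 5 Toolbox
Contents
- 5.1 Introduction
- 5.2 Overall layering strategy
- 5.3 Basic plot types
- 5.4 Displaying distributions
- 5.5 Dealing with overplotting
- 5.6 Surface plots
- 5.7 Drawing maps
- 5.8 Revealing uncertainty
- 5.9 Statistical summaries
- 5.10 Adding annotations
- 5.11 Weighted data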
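5.1 Introduction
This chapter mixes ggplot() and qplot() calls to give an overview of the basic geoms and stats.
5.2 Overall layering strategy
Layers serve three purposes:
- To display the data itself.
- To display a statistical summary of the data.
- To add extra metadata, context, and annotations.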
library(ggplot2)
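As a rough illustration of the three purposes, the sketch below stacks one layer of each kind on a single plot. The mpg dataset ships with ggplot2; the choice of variables, the linear smoother, and the label text and position are assumptions made for this example only.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  # Layer 1: the data itself
  geom_point() +
  # Layer 2: a statistical summary (a fitted straight line with its band)
  geom_smooth(method = "lm") +
  # Layer 3: context added as an annotation (text and position are made up)
  annotate("text", x = 6, y = 40, label = "larger engines, lower mileage")

# One entry per layer added above
n_layers <- length(p$layers)
```

Printing p draws the plot; each layer is drawn on top of the previous one, in the order it was added.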
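5.3 Basic plot types
The basic geoms cover area plots, bar charts, line plots, scatterplots, polygons, text labels, and tile plots (image or level plots). The code below draws each of them from the same small data frame.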
df <- data.frame(
x = c(3, 1, 5),
y = c(2, 4, 6),
label = c("a","b","c")
)
p <- ggplot(df, aes(x, y)) + xlab(NULL) + ylab(NULL)
p + geom_point() + labs(title = "geom_point")
p + geom_bar(stat = "identity") + labs(title = "geom_bar(stat = \"identity\")")
p + geom_line() + labs( title = "geom_line")
p + geom_area() + labs(title = "geom_area")
p + geom_path() + labs(title = "geom_path")
p + geom_text(aes(label = label)) + labs(title = "geom_text")
p + geom_tile() + labs(title = "geom_tile")
p + geom_polygon() + labs(title = "geom_polygon")
These geoms are simple, so the resulting plots are not shown here.
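5.4 Displaying distributions
Example: for a one-dimensional continuous distribution, the most important tools are the histogram and the frequency polygon. Both bin the data and show counts by default; the code below maps y to ..density.. explicitly. Never expect the default parameters to give a revealing plot.
All three plots below show an interesting pattern: as diamond quality increases, the distribution of depth shifts to the left and becomes more symmetric.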
depth_dist <- ggplot(diamonds, aes(depth)) + xlim(58, 68)
depth_dist +
geom_histogram(aes(y = ..density..), binwidth = 0.1) +
facet_grid(cut ~.)

depth_dist + geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill")

depth_dist + geom_freqpoly(aes(y = ..density.., colour = cut), binwidth = 0.1)
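The ..density.. variable used above rescales the bin counts so that the bars integrate to one, which makes groups of different sizes comparable. A minimal base R sketch of that rescaling, assuming equal-width bins; the faithful dataset and the bin width of 0.25 are chosen only for illustration:

```r
x <- faithful$eruptions
binwidth <- 0.25

# Equal-width bins that cover the whole range of the data
breaks <- seq(floor(min(x)), ceiling(max(x)), by = binwidth)
h <- hist(x, breaks = breaks, plot = FALSE)

# density = count / (number of observations * bin width)
density_manual <- h$counts / (length(x) * binwidth)

# Bar heights times bar widths sum to one
total_area <- sum(density_manual * binwidth)
```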

Example: boxplots conditioned on a categorical or a continuous variable
library(plyr)
qplot(cut, depth, data = diamonds, geom = "boxplot")

qplot(carat, depth, data = diamonds, geom = "boxplot", group = round_any(carat, 0.1, floor),xlim = c(0, 3))
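The group argument above bins the continuous carat variable into intervals of width 0.1 with round_any() from plyr. As far as I understand it, round_any(x, accuracy, f) computes f(x / accuracy) * accuracy; the helper below (bin_floor, a name made up here) re-implements that in base R on a few hand-picked values.

```r
# Hypothetical stand-in for plyr::round_any(x, accuracy, floor)
bin_floor <- function(x, accuracy) floor(x / accuracy) * accuracy

carat <- c(0.23, 0.29, 0.31, 0.38, 1.01, 1.09)
bins <- bin_floor(carat, 0.1)

# Number of observations that fall into each bin; each bin gets its own box
bin_sizes <- as.vector(table(bins))
```

Values that sit exactly on a bin edge can land in the lower bin because of floating-point division, so the edges are approximate.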

Example: a jittered plot adds random noise to a discrete variable to avoid overplotting. It is a rather crude method.
qplot(class, cty, data = mpg, geom = "jitter")

qplot(class, drv, data = mpg, geom = "jitter")
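What jittering does to the positions can be sketched in base R. The example assumes uniform noise with an amplitude of 0.4, a value picked for illustration that is small enough for rounding to recover the original value; mtcars is used only because it ships with R.

```r
set.seed(1)  # make the noise reproducible

x <- mtcars$cyl                       # discrete: only 4, 6 and 8 occur
noise <- runif(length(x), -0.4, 0.4)  # amplitude chosen for illustration
jittered <- x + noise

n_distinct_before <- length(unique(x))         # many points share a position
n_distinct_after  <- length(unique(jittered))  # positions are now spread out
recovered <- round(jittered)  # noise stays below 0.5, so rounding undoes it
```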

Example: density plots. Use a density plot only when you know that the underlying density is smooth, continuous, and unbounded.
qplot(depth, data = diamonds, geom = "density", xlim = c(54, 70))

qplot(depth, data = diamonds, geom = "density", xlim = c(54, 70), fill = cut, alpha = I(0.2))

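5.5 Dealing with overplotting
The scatterplot is an important tool for studying the relationship between two continuous variables. With large datasets, however, the points often overlap and hide the true relationship, so any conclusion drawn from such a plot is suspect. This problem is called overplotting.
- Method 1: small amounts of overplotting can be reduced by drawing smaller points or hollow glyphs.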
df <- data.frame(x = rnorm(2000), y = rnorm(2000))
norm <- ggplot(df, aes(x, y))
norm + geom_point()

norm + geom_point(shape = 1)

norm + geom_point(shape = ".") ## points are one pixel in size

- Method 2: for larger datasets, adjust the transparency (alpha) of the points. The smallest alpha R supports is 1/256.
norm + geom_point(colour = "black", alpha = 1/3)

norm + geom_point(colour = "black", alpha = 1/5)

norm + geom_point(colour = "black", alpha = 1/10)
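To see why small alpha values help, it is useful to work out how opaque a stack of overlapping points becomes. Assuming ordinary "over" compositing (an assumption about the graphics device, not something stated above), k overlapping points with transparency alpha give a combined opacity of 1 - (1 - alpha)^k:

```r
# Combined opacity of k overlapping points, each drawn with the given alpha
stacked_opacity <- function(alpha, k) 1 - (1 - alpha)^k

k <- c(1, 5, 10, 50)
opacity_third <- stacked_opacity(1/3, k)   # saturates after a few points
opacity_tenth <- stacked_opacity(1/10, k)  # keeps distinguishing denser areas
```

The lower the alpha, the more points must overlap before a region looks solid, so dense and very dense regions remain distinguishable.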

- Method 3: add random jitter to the points to reduce overlap.
td <- ggplot(diamonds, aes(table, depth)) + xlim(50, 70) + ylim(50, 70)
td + geom_point()
td + geom_jitter()

jit <- position_jitter(width = 0.5)
td + geom_jitter(position = jit)

td + geom_jitter(position = jit, colour = "black", alpha = 1/10)

td + geom_jitter(position = jit, colour = "black", alpha = 1/50)

td + geom_jitter(position = jit, colour = "black", alpha = 1/200)

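- Method 4: borrowing the idea behind two-dimensional kernel density plots, divide the plane into bins, count the points in each bin, and visualise the counts.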
d <- ggplot(diamonds, aes(carat, price)) + xlim(1,3) +theme(legend.position = "none")
d + stat_bin2d()

d + stat_bin2d(bins = 10)

d + stat_bin2d(binwidth = c(0.02, 200))

d + stat_binhex()

d + stat_binhex(bins = 10)

d + stat_binhex(binwidth = c(0.02, 200))
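The counting step behind rectangular binning can be sketched in base R with cut() and table(). This is only an illustration of the idea on simulated data, not the code ggplot2 runs; the 10 x 10 grid mirrors bins = 10 above.

```r
set.seed(42)
n <- 2000
x <- rnorm(n)
y <- rnorm(n)

# Cut each axis into 10 equal-width intervals, then cross-tabulate
x_bin <- cut(x, breaks = 10)
y_bin <- cut(y, breaks = 10)
counts <- table(x_bin, y_bin)  # 10 x 10 matrix of points per cell
```

Each cell of counts corresponds to one tile, and the fill colour encodes the count.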

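- Method 5: use stat_density2d to estimate the two-dimensional density, then display it with contours, with coloured tiles, or with points whose size is proportional to the density.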
d <- ggplot(diamonds, aes(carat, price)) + xlim(1, 3) + theme(legend.position = "none")
d + geom_point() + geom_density2d()

d + stat_density2d(geom = "point", aes(size = ..density..), contour = FALSE) + scale_size_area()

d + stat_density2d(geom = "tile", aes(fill = ..density..), contour = FALSE)

last_plot() + scale_fill_gradient(limits = c(1e-5, 8e-4))
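The ggplot2 documentation describes stat_density2d as a wrapper around kde2d() from the MASS package. The sketch below calls kde2d() directly on simulated data (the data and the 50 x 50 grid are assumptions for illustration) to show what the stat hands to the contour, tile, or point geoms: a grid of x and y positions with a density value at each.

```r
library(MASS)  # provides kde2d()

set.seed(42)
x <- rnorm(2000)
y <- rnorm(2000)

# Estimate the density on a 50 x 50 grid spanning the range of the data
dens <- kde2d(x, y, n = 50)

# A density should integrate to roughly 1 over the grid
cell_area <- diff(dens$x[1:2]) * diff(dens$y[1:2])
total_mass <- sum(dens$z) * cell_area
```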

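5.6 Surface plots
Common tools: coloured tiles, contour plots, and bubble plots.
5.7 Drawing maps
The maps package works well with ggplot2. There are two main reasons to use map data: to add reference outlines to a plot of spatial data, and to build a choropleth map by filling regions with colour.
Map borders can be added with borders(), as in the following example.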
library(maps)
data(us.cities)
big_cities <- subset(us.cities, pop > 500000)
qplot(long, lat, data = big_cities) +borders("state", size = 0.5)

tx_cities <- subset(us.cities, country.etc == "TX")
ggplot(tx_cities, aes(long, lat))+
borders("county", "texas", colour = "grey70") +
geom_point(colour = "black", alpha = 0.5)

Choropleth maps: use map_data() to convert the map data into a data frame, merge() it with the values to display, restore the original row order, and then draw the filled polygons, as shown below:
library(maps)
states <- map_data("state")
arrests <- USArrests
names(arrests) <- tolower(names(arrests))
arrests$region <- tolower(rownames(USArrests))
choro <- merge(states, arrests, by = "region")
choro <- choro[order(choro$order),]
qplot(long, lat, data = choro, group = group, fill = assault, geom = "polygon")

qplot(long, lat, data = choro, group = group, fill = assault / murder, geom = "polygon")
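The reordering step matters because merge() sorts its result by the key column, which scrambles the sequence in which polygon vertices must be connected. A toy base R sketch with made-up data shows the effect:

```r
# Toy outline: 'order' records the sequence in which vertices must be drawn
outline <- data.frame(
  region = c("b", "b", "a", "a", "a"),
  order  = 1:5
)
values <- data.frame(region = c("a", "b"), assault = c(10, 20))

merged <- merge(outline, values, by = "region")  # sorted by region, not order
scrambled <- is.unsorted(merged$order)

restored <- merged[order(merged$order), ]  # back to drawing order
```

Without the final step the polygons would be drawn with their vertices in the wrong sequence.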

Example: labelling map data
library(plyr)
ia <- map_data("county", "iowa")
mid_range <- function(x) mean(range(x, na.rm = TRUE))
centres <- ddply(ia, .(subregion), colwise(mid_range, .(lat, long)))
ggplot(ia, aes(long, lat))+
geom_polygon(aes(group = group), fill = NA, colour = "grey60") +
geom_text(aes(label = subregion), data = centres, size = 2, angle = 45)
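The label position used above is the centre of each county's bounding box (the midpoint of the range), which does not depend on how densely the outline is sampled. A base R sketch with made-up coordinates, using aggregate() in place of ddply():

```r
mid_range <- function(x) mean(range(x, na.rm = TRUE))

# Toy outline with vertices bunched at one end (values are made up)
outline <- data.frame(
  subregion = c("a", "a", "a", "a", "b", "b"),
  long = c(0, 0.1, 0.2, 10, 20, 30),
  lat  = c(0, 0, 1, 1, 5, 7)
)

# One label position per subregion, analogous to the ddply() call above
centres <- aggregate(cbind(long, lat) ~ subregion, data = outline, FUN = mid_range)

# A plain mean is pulled towards the densely sampled end of region "a"
plain_mean <- mean(outline$long[outline$subregion == "a"])
```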

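5.8 Revealing uncertainty
ggplot2 has four families of geoms for visualising uncertainty, depending on whether x is continuous or discrete and whether you want to show only the interval or the interval together with its centre:
Continuous x: geom_ribbon (interval only), geom_smooth(stat = "identity") (interval and centre)
Discrete x: geom_errorbar (interval only), geom_crossbar (interval and centre); geom_linerange (interval only), geom_pointrange (interval and centre)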
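All of these geoms expect the interval to be supplied as ymin and ymax aesthetics, with y for the centre where one is drawn. A minimal sketch of preparing such a summary table in base R, assuming a normal-approximation 95% interval for group means; the mtcars data and the column names fit, lower and upper are illustrative choices.

```r
# Mean and approximate 95% interval of mpg for each number of cylinders
ci <- do.call(rbind, lapply(split(mtcars$mpg, mtcars$cyl), function(y) {
  se <- sd(y) / sqrt(length(y))  # standard error of the mean
  data.frame(fit = mean(y), lower = mean(y) - 1.96 * se, upper = mean(y) + 1.96 * se)
}))
ci$cyl <- factor(rownames(ci))  # discrete x variable: "4", "6", "8"
```

A table like this can then be mapped with aes(x = cyl, y = fit, ymin = lower, ymax = upper) and drawn with any of the discrete-x geoms listed above.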
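For linear models, the effects package (Fox, 2008) is well suited to extracting these values. The following example fits a two-way model with an interaction and shows how to extract the marginal and conditional effects.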
d <- subset(diamonds, carat <2.5 & rbinom(nrow(diamonds), 1, 0.2) == 1)
d$lcarat <- log10(d$carat)
d$lprice <- log10(d$price)
# Remove the overall linear trend
detrend <- lm(lprice ~ lcarat, data = d)
d$lprice2 <- resid(detrend)
mod <- lm(lprice2 ~ lcarat*color, data = d)
library(effects)
effectdf <- function(...){
suppressWarnings(as.data.frame(effect(...)))
}
color <- effectdf("color", mod)
both1 <- effectdf("lcarat:color", mod)
carat <- effectdf("lcarat", mod, default.levels = 50)
both2 <- effectdf("lcarat:color", mod, default.levels = 3)
## Figure: transforming the data to remove the obvious effects. Plot 1 takes log10 of both x and y to remove the non-linearity; plot 2 also removes the main linear trend.
qplot(lcarat, lprice, data = d, colour = color)

qplot(lcarat, lprice2, data = d, colour = color)

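## Figure: uncertainty in the model estimates for color. The left plot shows the marginal effect of color; the right plot shows the conditional effects of color at different levels of carat. The error bars show 95% pointwise confidence intervals.
fplot <- ggplot(mapping = aes(y = fit, ymin = lower, ymax = upper)) +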
ylim(range(both2$lower, both2$upper))
fplot %+% color + aes(x = color) + geom_point() + geom_errorbar(aes(ymin = lower, ymax = upper))

fplot %+% both2 +
aes(x = color, colour = lcarat, group = interaction(color, lcarat)) +
geom_errorbar(aes(ymin = lower, ymax = upper)) +
geom_line(aes(group = lcarat)) +
scale_colour_gradient()

## Figure: uncertainty in the model estimates for carat
fplot %+% carat + aes(x = lcarat) + geom_smooth(stat = "identity", se = TRUE)

ends <- subset(both1, lcarat == max(lcarat))
fplot %+% both1 + aes(x = lcarat, colour = color)+
geom_smooth(stat = "identity", se = TRUE) +
scale_colour_hue() +
theme(legend.position = "none")+
geom_text(aes(label = color, x = lcarat +0.02),ends)

5.9 Statistical summaries
stat_summary(): for each value of x, computes a summary of the corresponding y values.
5.9.1 Individual summary functions
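# m2 here, and m in section 5.9.2, are plot objects whose definitions are not included in these notes
midm <- function(x) mean(x, trim = 0.5)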
m2 + stat_summary(aes(colour = "trimmed"), fun.y = midm, geom = "point") +
stat_summary(aes(colour = "raw"), fun.y = mean, geom = "point") +
scale_colour_hue("Mean")
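Note that a trim of 0.5 discards everything except the middle of the sorted values, so midm() returns the median. A small base R check on made-up numbers, which also shows the kind of per-group computation that stat_summary() performs for each x:

```r
midm <- function(x) mean(x, trim = 0.5)

y <- c(1, 2, 3, 4, 100)   # one extreme value
raw_mean     <- mean(y)   # dragged upwards by the outlier
trimmed_mean <- midm(y)   # same as median(y)

# Per-group summaries, one value for each level of g
g <- c("a", "a", "a", "b", "b")
by_group <- tapply(y, g, midm)
```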
5.9.2 Unified summary functions
fun.data supports more complex summary functions, such as those from the Hmisc package.
iqr <- function(x,...) {
qs <- quantile(as.numeric(x), c(0.25,0.75), na.rm = TRUE)
names(qs) <- c("ymin", "ymax")
qs
}
m + stat_summary(fun.data = "iqr", geom = "ribbon")
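To check what a fun.data function produces, it can be called directly on a vector. The variant below returns a one-row data frame rather than a named vector, on the assumption that recent ggplot2 versions expect a data frame from fun.data; iqr_df is a name invented for this sketch.

```r
# Interquartile range as a one-row data frame with the columns a ribbon needs
iqr_df <- function(x, ...) {
  qs <- quantile(as.numeric(x), c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  data.frame(ymin = qs[1], ymax = qs[2])
}

res <- iqr_df(1:9)  # lower and upper quartiles of 1, 2, ..., 9
```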
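5.10 Adding annotations
Annotations are just extra data. They can be added one at a time or many at once.
The following example adds information about US presidents to the economic data.
# Plot the raw unemployment series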
(unemp <- qplot(date, unemploy, data = economics, geom = "line", xlab = "", ylab = "No. unemployed (1000s)"))

# Add vertical lines marking the start of each presidential term
presidential <- presidential[-(1:3),]
yrng <- range(economics$unemploy)
xrng <- range(economics$date)
unemp + geom_vline(aes(xintercept = as.numeric(start)), data = presidential)

library(scales)
unemp + geom_rect(aes(NULL, NULL, xmin = start, xmax = end, fill = party), ymin = yrng[1], ymax = yrng[2], data = presidential, alpha = 0.2)+
scale_fill_manual(values = c("blue","red"))

last_plot() + geom_text(aes(x = start, y = yrng[1],label = name), data = presidential, size = 3, hjust = 0, vjust = 0)

caption <- paste(strwrap("Unemployment rates in the US have varied a lot over the years", 40), collapse = "\n")
unemp + geom_text(aes(x, y, label = caption), data = data.frame(x = xrng[2], y = yrng[2]), hjust = 1, vjust = 1, size = 4)
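The caption above is wrapped by hand because geom_text() draws the label string as given, line breaks included. A short sketch of what strwrap() and paste() produce, using the same text:

```r
text <- "Unemployment rates in the US have varied a lot over the years"

lines <- strwrap(text, width = 40)        # character vector, one element per line
caption <- paste(lines, collapse = "\n")  # single string with embedded newlines
```

Calling cat(caption) prints the wrapped text and is a quick way to preview the line breaks before plotting.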

highest <- subset(economics, unemploy == max(unemploy))
unemp + geom_point(data = highest, size = 3, colour = "red", alpha = 0.5)

5.11 Weighted data
Example: using point size to show the weight
qplot(percwhite, percbelowpoverty, data = midwest)

qplot(percwhite, percbelowpoverty, data = midwest, size = poptotal / 1e6) +
scale_size_area("Population\n(millions)", breaks = c(0.5, 1, 2, 4))

qplot(percwhite, percbelowpoverty, data = midwest, size = area) +
scale_size_area()

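Example: using population density as the weight when examining the relationship between percent white and percent below the poverty line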
lm_smooth <- geom_smooth(method = lm, size = 1)
qplot(percwhite, percbelowpoverty, data = midwest) + lm_smooth

qplot(percwhite, percbelowpoverty, data = midwest, weight = popdensity, size = popdensity) +lm_smooth
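The weight aesthetic is passed on to the model fitted by geom_smooth(), so with method = lm the smoother is a weighted least-squares fit. The effect can be sketched with lm() directly; the four data points and weights below are made up so that one low-weight outlier dominates the unweighted fit.

```r
toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(1, 2, 3, 10),       # last point is an outlier
  w = c(100, 100, 100, 1)   # ...but carries very little weight
)

unweighted <- lm(y ~ x, data = toy)
weighted   <- lm(y ~ x, data = toy, weights = w)

slope_unweighted <- coef(unweighted)[["x"]]  # pulled up by the outlier
slope_weighted   <- coef(weighted)[["x"]]    # much closer to 1
```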

Example: the unweighted histogram shows the number of counties; the weighted histogram shows the number of people
qplot(percbelowpoverty, data = midwest, binwidth = 1)

qplot(percbelowpoverty, data = midwest, weight = poptotal, binwidth = 1) +ylab("population")
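The difference between the two histograms comes down to what is added up in each bin: one per row, or the weight of each row. A base R sketch on made-up counties, with bin edges placed at whole numbers for simplicity (ggplot2 may align its bins differently):

```r
# Made-up counties: poverty rate (%) and population
poverty    <- c(5.2, 5.8, 6.1, 6.9, 7.3)
population <- c(1000, 3000, 500, 200, 9000)

bin <- floor(poverty)  # bins of width 1, as with binwidth = 1

counties_per_bin <- tapply(poverty, bin, length)  # unweighted: number of rows
people_per_bin   <- tapply(population, bin, sum)  # weighted: sum of weights
```

Here the bin with the fewest counties holds the most people, which is exactly the kind of difference the weighted histogram is meant to reveal.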

End of the chapter. Hooray!