R for Data Science, Chapters 2 and 3 (Data visualisation): study notes
References
- 《R for data science》
- 《R数据科学》 (the Chinese edition of R for Data Science)
- The Layered Grammar of Graphics.
- ggplot2: Points
“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey
A graphing template
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
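To make the template concrete, here is a minimal sketch that fills every slot using the built-in mpg data. The specific choices (geom_point(), coord_cartesian(), faceting by drv) are only illustrative assumptions; any geom, coordinate system, or facet function could take their place.

```r
library(ggplot2)

# One concrete instance of the template, using the built-in mpg data.
# stat = "identity" and position = "identity" are geom_point()'s defaults;
# they are spelled out here only to show where each slot goes.
p <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    stat = "identity",
    position = "identity"
  ) +
  coord_cartesian() +
  facet_wrap(~ drv)
p
```

In practice most of these slots are left at their defaults, so only the data, the geom, and the mappings need to be written out.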
Aesthetic mappings
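library(tidyverse)  # ggplot2, plus dplyr for filter(), group_by(), summarise()
library(patchwork)  # lets p1 + p2 place plots side by side

# Left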
p1 <- ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
# Right
p2 <- ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
p1 + p2
# Warning messages:
# 1: Using alpha for a discrete variable is not advised.
# 2: The shape palette can deal with a maximum of 6 discrete values
# because more than 6 becomes difficult to discriminate; you have
# 7. Consider specifying shapes manually if you must have them.
# 3: Removed 62 rows containing missing values (geom_point).

ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic.

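- How do these aesthetics behave differently for categorical vs. continuous variables?
'''
color (ordered attribute)
1. Categorical variable: each level gets its own distinct color
2. Continuous variable: a color gradient with a fixed range is built, and each value picks its color from within that gradient
size (ordered attribute)
1. Categorical variable: each level gets its own point size, but the sizes carry no meaning, and ggplot2 issues a warning
2. Continuous variable: point size grows with the value (by default the value is mapped to point area)
shape (unordered attribute)
1. Categorical variable: each level gets its own shape, at most 6 at a time; further levels are not drawn and a warning is issued
2. Continuous variable: cannot be mapped (ggplot2 raises an error)
'''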
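The differences listed above can be checked directly. This is a small sketch, assuming drv as the categorical variable and cty as the continuous one (my choice of columns, not the book's); layer_data() returns the values ggplot2 actually computed for a layer.

```r
library(ggplot2)

# color mapped to a categorical variable: one color per level
p_cat <- ggplot(mpg, aes(displ, hwy, color = drv)) + geom_point()
# color mapped to a continuous variable: colors taken from a gradient
p_con <- ggplot(mpg, aes(displ, hwy, color = cty)) + geom_point()

n_cat_colors <- length(unique(layer_data(p_cat)$colour))
n_con_colors <- length(unique(layer_data(p_con)$colour))

# shape mapped to a continuous variable: building the plot fails,
# so the error message is captured instead of stopping the script
shape_error <- tryCatch(
  {
    ggplot_build(ggplot(mpg, aes(displ, hwy, shape = cty)) + geom_point())
    NULL
  },
  error = function(e) conditionMessage(e)
)

n_cat_colors
n_con_colors
shape_error
```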
- Variable types in mpg
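One quick way to list the column types is sketched below; this is just one option, and str(mpg) or dplyr::glimpse(mpg) would show the same information.

```r
library(ggplot2)

# Class of every column in mpg: character columns are the categorical
# variables, numeric and integer columns are the continuous ones
mpg_types <- sapply(mpg, class)
mpg_types
```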
- The stroke aesthetic
p1 <- ggplot(mpg,aes(x = displ, y = hwy)) +
geom_point(shape = 1)
p2 <- ggplot(mpg,aes(x = displ, y = hwy)) +
geom_point(shape = 1,stroke = 2)
p1 + p2

Facets
- Wrapped layout: facet_wrap()
ggplot(mpg) +
geom_point(aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)

facet_wrap() takes several further arguments; the full list is in ?facet_wrap.
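Since the argument list differs between ggplot2 versions, a safe way to see it is to ask the installed function itself. A minimal sketch using base R's formals():

```r
library(ggplot2)

# Names of all arguments accepted by the installed facet_wrap()
facet_wrap_args <- names(formals(facet_wrap))
facet_wrap_args
```

The output should include nrow, ncol, scales and strip.position, the arguments used in these notes.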

# strip.position controls on which side of each panel the facet label is placed
p1 <- ggplot(mpg) +
geom_point(aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2, strip.position = 'bottom')
p2 <- ggplot(mpg) +
geom_point(aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2, strip.position = 'right')
p1 + p2

- Showing the full dataset in every facet
ggplot(mpg, aes(displ, hwy)) +
geom_point(data = transform(mpg, class = NULL),
colour = "grey85") +
geom_point() +
facet_wrap(~ class)
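Why this works, as I understand it: transform(mpg, class = NULL) drops the class column, and a layer whose data lacks the faceting variable is drawn in every panel. The sketch below only checks the first half of that claim; the name background is my own.

```r
library(ggplot2)

# Copy of mpg without the faceting variable
background <- transform(mpg, class = NULL)

has_class <- "class" %in% names(background)  # the column is gone
same_rows <- nrow(background) == nrow(mpg)   # but no rows were lost
has_class
same_rows
```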

- Grid layout: facet_grid()
# a . in the formula means "do not facet along this dimension" (rows or columns)
p1 <- ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .) # rows ~ columns
p2 <- ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
p1 + p2

Geometric objects
- Hiding the legend and the confidence interval
p1 <- ggplot(mpg) +
geom_smooth(aes(x = displ, y = hwy))
p2 <- ggplot(mpg,aes(x = displ, y = hwy, group = drv)) +
geom_smooth(se = FALSE)
p3 <- ggplot(mpg) +
geom_smooth(
aes(x = displ, y = hwy, color = drv),
show.legend = FALSE)
p1 + p2 + p3

- Combining with filter() (from dplyr)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

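- Plotting details
Both plots draw points with a white outline and a colored interior. In the first, the white outlines stay visible where points overlap; in the second, no white shows up inside overlapping clusters.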
p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(fill=drv),shape=21,color='white',size=2.5,stroke=1.5)
p2 <- ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(color='white',size=3.5)+
geom_point(aes(color=drv),shape=16,size=2.3)
p1 + p2

Statistical transformations
Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
Smoothers fit a model to your data and then plot predictions from the model.
Boxplots compute a robust summary of the distribution and then display a specially formatted box.
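The values a stat computes can be inspected with layer_data(). A minimal sketch for the bar chart case, using the diamonds data that appears throughout these notes:

```r
library(ggplot2)

p <- ggplot(diamonds) +
  geom_bar(aes(x = cut))

# One row per bar; the count column holds what stat_count() computed
bar_data <- layer_data(p)
bar_data[, c("x", "count")]

# Every diamond lands in exactly one bar
total_count <- sum(bar_data$count)
total_count == nrow(diamonds)
```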

- Common interchangeable geoms and stats
You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar().
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
# equivalent to
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut), stat = 'count') # 'count' is the default stat, so it can be omitted
# fun.min / fun.max / fun replace fun.ymin / fun.ymax / fun.y,
# which are deprecated since ggplot2 3.3.0
ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )
# equivalent to
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
# it can also be reproduced by hand
ggplot(diamonds, aes(cut, depth)) +
  geom_line(linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  # a layer that uses different data must name it explicitly with data = ...
  geom_point(data = diamonds %>%
               group_by(cut) %>%
               summarise(depth = median(depth)),
             size = 2)

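- Overriding the default mapping
ggplot(diamonds) +
  geom_bar(aes(x = cut, y = after_stat(prop), group = 1, fill = after_stat(prop)))
# equivalent, in the older ..prop.. notation (deprecated since ggplot2 3.4.0, as is stat(prop))
p1 <- ggplot(diamonds) +
  geom_bar(aes(x = cut, y = ..prop.., group = 1, fill = ..prop..))
# grouping by color instead: proportions are computed within each color, then stacked
p2 <- ggplot(diamonds) +
  geom_bar(aes(x = cut, y = after_stat(prop), group = color, fill = color))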
p1 + p2

- What does geom_col() do? How is it different to geom_bar()?
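- geom_col() also draws bar charts, but uses stat = "identity": bar heights are taken directly from the y values, with no statistical transformation
- geom_bar() uses stat = "count" by default: bar heights are the number of rows at each x value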
- Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
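One way to start on such a list is to ask each geom for its default stat, and each stat for its default geom. This is a rough sketch rather than a complete answer: default_stat() is a hypothetical helper of mine, and it relies on the layer object exposing $stat and $geom, which is how current ggplot2 versions behave as far as I know.

```r
library(ggplot2)

# Hypothetical helper: class name of the stat a geom uses by default
default_stat <- function(geom_fun) class(geom_fun()$stat)[1]

pairs <- c(
  geom_bar       = default_stat(geom_bar),
  geom_col       = default_stat(geom_col),
  geom_histogram = default_stat(geom_histogram),
  geom_boxplot   = default_stat(geom_boxplot),
  geom_smooth    = default_stat(geom_smooth)
)
pairs

# The pairing also works in the other direction
count_geom <- class(stat_count()$geom)[1]
count_geom
```

The result also illustrates the geom_col() / geom_bar() difference above: the two draw the same kind of bar and differ only in their default stat.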



Position adjustments
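position = "identity": places each object exactly where it falls, so objects overlap one another; rarely suitable for presenting results
position = "fill": stacked bars scaled to 100%
position = "dodge": bars placed side by side
position = "stack": bars stacked on top of each other
position = "jitter": adds random noise to each position; mostly used for scatterplots
Borrowing Dr. Liu's example: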
library(ggplot2)
library(patchwork)
v <- data.frame(x = 1:20,
y = runif(40,min = 10,max = 20),
z = rep(c("A","B"),each = 20))
p1 <- ggplot(v, aes(x, y, fill = z))+
geom_area(position = position_dodge(), alpha = 0.5) +
labs(title = "position_dodge()")
p2 <- ggplot(v, aes(x, y, fill = z))+
geom_area(position = position_fill(), alpha = 0.5) +
labs(title = "position_fill()")
p3 <- ggplot(v, aes(x, y, fill = z))+
geom_area(position = position_stack(), alpha = 0.5) +
labs(title = "position_stack()")
p4 <- ggplot(v, aes(x, y, fill = z))+
geom_area(position = position_identity(), alpha = 0.5) +
labs(title = "position_identity()")
p5 <- ggplot(v, aes(x, y, fill = z))+
geom_area(position = position_jitter(), alpha = 0.5) +
labs(title = "position_jitter(), usually for point")
(p1 + p2 + p3)/(p4 + p5)

- Jittering with geom_jitter()
geom_jitter() adds random noise to the position of each point
geom_count() counts the observations at each overlapping position
p1 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter()
# equivalent to
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point(position = position_jitter())
# equivalent to
p2 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point(position = 'jitter')
# geom_count()
p3 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_count()

Coordinate systems
- coord_flip()
coord_flip() switches the x and y axes. This is useful (for example), if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.
p1 <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
p2 <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
p1 + p2

- coord_quickmap()
coord_quickmap() sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2.
nz <- map_data("nz") # requires the maps package to be installed
p1 <- ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
p2 <- ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
p1 + p2

- coord_polar()
bar <- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
show.legend = FALSE,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
p1 <- bar + coord_flip()
p2 <- bar + coord_polar()
p1 + p2

Going further:
- Turn a stacked bar chart into a pie chart using coord_polar()
p1 <- ggplot(diamonds) +
geom_bar(aes(x = cut, fill = clarity)) +
coord_polar()
p2 <- ggplot(diamonds) +
geom_bar(aes(x = cut, fill = clarity),
position = 'fill') +
coord_polar()
# theta names the variable to map angle to (x or y)
# i.e. each value's share of the total is computed first and then mapped to an angle
p3 <- ggplot(diamonds) +
geom_bar(aes(x = cut, fill = clarity),
position = 'fill') +
coord_polar(theta = "y")
p1 + p2 + p3

- What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
'''
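City and highway fuel efficiency are positively correlated.
coord_fixed() fixes the ratio between the x and y axis units, so one unit has the same length on both axes.
geom_abline() draws a straight reference line; by default slope = 1 and intercept = 0. It appears at exactly 45 degrees only because coord_fixed() equalises the axis units.
Both intercept and slope can be set explicitly.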
'''
p1 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
p2 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline(intercept=-5,slope=1) +
coord_fixed()
p1 + p2
