[R] Visualization with the ggplot2 package: R for Data Science, part 1

Study notes on chapters 2 and 3 (Data visualisation) of R for Data Science

References

  1. R for Data Science
  2. R數據科學 (the Chinese edition of R for Data Science)
  3. The Layered Grammar of Graphics.
  4. ggplot2: Points

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey

A graphing template

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
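
As a concrete instance of the template, the sketch below fills in every slot using the mpg data set that ships with ggplot2. The stat, position and coordinate system shown are simply the defaults for geom_point(), and the facet on drv is an arbitrary illustrative choice; none of this comes from the book's template itself.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    stat = "identity",      # default stat for geom_point()
    position = "identity"   # default position for geom_point()
  ) +
  coord_cartesian() +       # default coordinate system
  facet_wrap(~ drv)         # illustrative choice: one panel per drive type

p
```

In everyday use only the data, the geom and the mappings are usually written out; the other slots are needed when the defaults do not fit.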

Aesthetic mappings

library(ggplot2)
library(patchwork)  # provides p1 + p2 for placing plots side by side

# Left
p1 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

# Right
p2 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))  

p1 + p2
# Warning messages:
# 1: Using alpha for a discrete variable is not advised. 
# 2: The shape palette can deal with a maximum of 6 discrete values
# because more than 6 becomes difficult to discriminate; you have
# 7. Consider specifying shapes manually if you must have them. 
# 3: Removed 62 rows containing missing values (geom_point). 

ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic.
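
One way around the six-shape limit is to supply the shapes by hand, as the warning suggests. The sketch below assumes scale_shape_manual() is an acceptable fix here; the seven shape codes are an arbitrary pick for illustration.

```r
library(ggplot2)

# mpg$class has 7 distinct values, so supply 7 shape codes by hand
# (the particular codes are an arbitrary illustrative choice)
class_shapes <- c(0, 1, 2, 3, 4, 5, 6)

p <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = class_shapes)

p
```

With a shape for every class, no rows are dropped from the plot.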

- How do these aesthetics behave differently for categorical vs. continuous variables?

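'''
color (ordered aesthetic)
1. Mapped to a categorical variable: each level gets its own distinct color
2. Mapped to a continuous variable: a color gradient with a fixed range is built, and each value takes its color from that gradient

size (ordered aesthetic)
1. Mapped to a categorical variable: each level gets its own point size, but the sizes carry no meaningful order, and ggplot2 issues a warning
2. Mapped to a continuous variable: point size grows with the value of the variable

shape (unordered aesthetic)
1. Mapped to a categorical variable: each level gets its own shape; at most 6 shapes are shown at once, and extra levels are not drawn (with a warning)
2. Mapped to a continuous variable: not possible
'''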
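
The last point can be checked directly. The sketch below maps the numeric column cty to shape and captures the error raised when the plot is built; wrapping the build in tryCatch() is just a convenient way to show the message without stopping the script.

```r
library(ggplot2)

# cty is numeric (continuous), so it cannot be mapped to shape;
# the error surfaces when the plot is built
p <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, shape = cty))

msg <- tryCatch(
  {
    ggplot_build(p)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

msg
```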

- Variable types in mpg
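
A quick way to see which columns are categorical and which are continuous is to print the class of each column. This is a minimal sketch; mpg_types is a name introduced here for illustration.

```r
library(ggplot2)

# Class of every column in mpg: character columns are categorical,
# integer and numeric columns are continuous
mpg_types <- vapply(mpg, function(col) class(col)[1], character(1))

mpg_types
```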

  • The stroke aesthetic
p1 <- ggplot(mpg,aes(x = displ, y = hwy)) +
  geom_point(shape = 1)

p2 <- ggplot(mpg,aes(x = displ, y = hwy)) +
  geom_point(shape = 1,stroke = 2)

p1 + p2

Facets

- Wrapped facets: facet_wrap()

ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

Arguments of facet_wrap() (see ?facet_wrap):
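
The argument list can also be read off the function itself. This small sketch prints the formal arguments of facet_wrap() for whichever ggplot2 version is installed, so the exact list may differ between versions.

```r
library(ggplot2)

# List the formal arguments of facet_wrap() for the installed ggplot2 version
facet_wrap_args <- names(formals(facet_wrap))

facet_wrap_args
```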


# strip.position controls which side of each panel the strip label is placed on
p1 <- ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2, strip.position = 'bottom')

p2 <- ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2, strip.position = 'right')

p1 + p2

- Showing the full data set in every facet

ggplot(mpg, aes(displ, hwy)) +
  # dropping the class column makes this layer appear in every panel
  geom_point(data = transform(mpg, class = NULL), 
             colour = "grey85") +
  geom_point() +
  facet_wrap(~ class)

- Grid facets: facet_grid()

# A . means: do not facet along that dimension (rows or columns)
p1 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .) # rows ~ columns

p2 <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

p1 + p2
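
For completeness, a sketch of faceting on both dimensions at once: the left side of the formula gives the rows and the right side gives the columns. The choice of drv and cyl just reuses the two variables from the plots above.

```r
library(ggplot2)

# rows are split by drv, columns by cyl
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

p
```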

Geometric objects

- Hiding the legend and the confidence interval

p1 <- ggplot(mpg) +
  geom_smooth(aes(x = displ, y = hwy))

p2 <- ggplot(mpg,aes(x = displ, y = hwy, group = drv)) +
  geom_smooth(se = FALSE)

p3 <- ggplot(mpg) +
  geom_smooth(
    aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE)

p1 + p2 + p3

- Combining with filter()

library(dplyr)  # filter() below is dplyr::filter()

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

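- Fine-tuning point appearance

Both plots draw colored points with a white outline. In the first, the white outline stays visible where points overlap; in the second, overlapping points merge with no white between them.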

p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(fill=drv),shape=21,color='white',size=2.5,stroke=1.5)

p2 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color='white',size=3.5)+
  geom_point(aes(color=drv),shape=16,size=2.3)

p1 + p2

Statistical transformations

Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
Smoothers fit a model to your data and then plot predictions from the model.
Boxplots compute a robust summary of the distribution and then display a specially formatted box.

- Some common geom/stat equivalents

You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar().

ggplot(data = diamonds) + 
  stat_count(mapping = aes(x = cut))
# is equivalent to
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut), stat = "count") # "count" is the default stat, so it can be omitted

ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,   # fun.ymin / fun.ymax / fun.y were renamed in ggplot2 3.3.0
    fun.max = max,
    fun = median
  )
# is equivalent to
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

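# The same plot can also be built by hand (%>%, group_by() and summarise() come from dplyr)
ggplot(diamonds, aes(cut, depth)) + 
  geom_line(linewidth = 1) +   # vertical line from min to max depth per cut (linewidth needs ggplot2 >= 3.4.0)
  # a layer that uses different data must name it explicitly with data = ...
  geom_point(data = diamonds %>%   
               group_by(cut) %>% 
               summarise(depth = median(depth)),
             size = 2) 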

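- Overriding the default mapping

ggplot(diamonds) + 
  geom_bar(aes(x = cut, y = stat(prop), group = 1, fill = stat(prop)))
# is equivalent to
# (stat(prop) and ..prop.. are both older spellings of after_stat(prop))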
p1 <- ggplot(diamonds) + 
  geom_bar(aes(x = cut, y = ..prop.., group = 1, fill = ..prop..))

p2 <- ggplot(diamonds) + 
  geom_bar(aes(x = cut, y = ..prop.., group = color, fill = color))

p1 + p2
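
In ggplot2 3.3.0 and later the same computed variable is written with after_stat(). The sketch below, which assumes such a version is installed, also checks why group = 1 matters: with a single group the proportions are computed over the whole data set, so the bar heights add up to 1.

```r
library(ggplot2)

p <- ggplot(diamonds) +
  geom_bar(aes(x = cut, y = after_stat(prop), group = 1))

# With group = 1 the proportions are computed over the whole data set,
# so the bar heights add up to 1
sum(layer_data(p)$y)
```

Without group = 1, each level of cut forms its own group and every bar has proportion 1.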

- What does geom_col() do? How is it different to geom_bar()?

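  1. geom_col() also draws bar charts, but it uses stat = "identity": no statistical transformation is applied, so bar heights come directly from the y values in the data
  2. geom_bar() uses stat = "count" by default: bar heights are the number of rows in each group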
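
The difference can be seen by counting the rows by hand and passing the result to geom_col(). This is a minimal sketch; cut_counts is a helper table introduced here, built with base R's table().

```r
library(ggplot2)

# Pre-compute the counts that geom_bar() would compute internally
cut_counts <- as.data.frame(table(cut = diamonds$cut))  # columns: cut, Freq

p_bar <- ggplot(diamonds) + geom_bar(aes(x = cut))
p_col <- ggplot(cut_counts) + geom_col(aes(x = cut, y = Freq))

# Both layers end up with the same bar heights
all.equal(as.numeric(layer_data(p_bar)$y), as.numeric(layer_data(p_col)$y))
```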

- Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
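
A starting point for this exercise, rather than a full answer: each layer object records the stat and geom it will use, so the pairing can be inspected from the console. The helpers default_stat() and default_geom() are hypothetical names defined here, and reading the stat and geom fields of a layer is an assumption about ggplot2's layer objects rather than a documented interface.

```r
library(ggplot2)

# Hypothetical helpers: report the class of the stat / geom stored in a layer
default_stat <- function(layer) class(layer$stat)[1]
default_geom <- function(layer) class(layer$geom)[1]

default_stat(geom_bar())      # stat paired with geom_bar()
default_geom(stat_count())    # geom paired with stat_count()
default_stat(geom_boxplot())  # stat paired with geom_boxplot()
default_stat(geom_smooth())   # stat paired with geom_smooth()
```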

Position adjustments

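position = "identity": draws each object exactly where it falls in the data; objects then overlap each other, so it is rarely suitable for presenting results
position = "fill": stacked bars scaled to the same height (a 100% stacked bar chart)
position = "dodge": bars placed side by side
position = "stack": objects stacked on top of each other
position = "jitter": adds random noise to each point; mostly used for scatterplots

An example borrowed from 劉博: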

library(ggplot2)
library(patchwork)

v <- data.frame(x = 1:20, 
                y = runif(40,min = 10,max = 20),
                z = rep(c("A","B"),each = 20))
                
p1 <- ggplot(v, aes(x, y, fill = z))+
  geom_area(position = position_dodge(), alpha = 0.5) +
  labs(title = "position_dodge()")

p2 <- ggplot(v, aes(x, y, fill = z))+
  geom_area(position = position_fill(), alpha = 0.5) +
  labs(title = "position_fill()")

p3 <- ggplot(v, aes(x, y, fill = z))+
  geom_area(position = position_stack(), alpha = 0.5) +
  labs(title = "position_stack()")

p4 <- ggplot(v, aes(x, y, fill = z))+
  geom_area(position = position_identity(), alpha = 0.5) +
  labs(title = "position_identity()")

p5 <- ggplot(v, aes(x, y, fill = z))+
  geom_area(position = position_jitter(), alpha = 0.5) +
  labs(title = "position_jitter(), usually for point")

(p1 + p2 + p3)/(p4 + p5) 

  • Jittering with geom_jitter()

geom_jitter() adds a small amount of random noise to each point.
geom_count() counts the observations at each location and maps the count to point size.

p1 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter()
# is equivalent to
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = position_jitter())
# is equivalent to
p2 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = 'jitter')

# geom_count()
p3 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()
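
Because the jitter is random, the plot changes slightly on every draw. A sketch of how the amount of noise can be controlled and made reproducible with position_jitter(); the width, height and seed values are arbitrary choices for illustration.

```r
library(ggplot2)

# width / height set the amount of jitter in data units; seed fixes the noise
jit <- position_jitter(width = 0.3, height = 0.3, seed = 42)

p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = jit)

# Building the plot twice gives identical jittered positions
identical(layer_data(p)$x, layer_data(p)$x)
```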

Coordinate systems

- coord_flip()

coord_flip() switches the x and y axes. This is useful (for example), if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.

p1 <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

p2 <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

p1 + p2 

- coord_quickmap()

coord_quickmap() sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2.

nz <- map_data("nz")  # map_data() needs the maps package installed

p1 <- ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

p2 <- ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()

p1 + p2 

- coord_polar()

bar <- ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut, fill = cut), 
    show.legend = FALSE,
    width = 1
  ) + 
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

p1 <- bar + coord_flip()
p2 <- bar + coord_polar()

p1 + p2 

Going further:

- Turn a stacked bar chart into a pie chart using coord_polar()

p1 <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity)) + 
  coord_polar()

p2 <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity),
           position = 'fill') + 
  coord_polar()

# theta is the variable to map angle to (x or y)
# i.e. the share of each value is computed first and then mapped to an angle
p3 <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity),
           position = 'fill') + 
  coord_polar(theta = "y")

p1 + p2 + p3
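
The three plots above keep cut on the x axis, so they come out as rose or ring charts rather than a classic pie. One common way to get a single pie, sketched here under the assumption that one pie for the whole data set is what the exercise asks for, is to stack everything into one bar (a constant x) and map y to the angle.

```r
library(ggplot2)

pie <- ggplot(diamonds) +
  # a constant x collapses the data into a single stacked bar
  geom_bar(aes(x = "", fill = clarity), width = 1) +
  # mapping y (the stacked counts) to the angle turns the bar into a pie
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL)

pie
```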

- What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

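'''
City and highway fuel efficiency are positively correlated.
coord_fixed() fixes the ratio between the units of the x and y axes (1:1 by default).
geom_abline() draws a straight reference line; with no arguments it uses intercept = 0 and slope = 1, which appears at 45 degrees once coord_fixed() is applied.
The intercept and slope can also be set explicitly.
'''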

p1 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

p2 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(intercept=-5,slope=1) +
  coord_fixed()

p1 + p2