dplyr: 数据塑形利器

介绍

dplyr是data manipulation的包,其包含多个处理数据的函数。主要函数有:

  • mutate() 添加新变量.

  • select() 选择列名.

  • filter() 过滤行.

  • summarise() 求和统计.

  • arrange() 排序.

  • group_by() 分组处理.

安装

1
2
3
4
5
#install.packages("dplyr")
#install.packages("nycflights13")
#devtools::install_github("tidyverse/dplyr")
library(dplyr)
library(nycflights13)

tibbles数据类型

tibbles可以取代data.frame,虽然前者仍然可以认为是数据框类型,在数据处理过程中,tibbles数据类型消耗资源更少,处理速度更快。dplyr的函数可以直接处理tibbles数据类型。

1
2
3
#install.packages("tibble")
library(tibble)
as_tibble(iris)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows
1
2
3
4
5
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
1
2
3
4
5
6
7
8
## # A tibble: 5 x 3
## x y z
## <int> <dbl> <dbl>
## 1 1 1 2
## 2 2 1 5
## 3 3 1 10
## 4 4 1 17
## 5 5 1 26

Add new variables with mutate()

添加新的列:新的列一般在数据集的最后一列

1
2
3
4
5
6
7
8
9
flights %>%
select(
year:day,
ends_with("delay"),
distance,
air_time) %>%
mutate(gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
## # A tibble: 336,776 x 9
## year month day dep_delay arr_delay distance air_time gain speed
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 -9 370.
## 2 2013 1 1 4 20 1416 227 -16 374.
## 3 2013 1 1 2 33 1089 160 -31 408.
## 4 2013 1 1 -1 -18 1576 183 17 517.
## 5 2013 1 1 -6 -25 762 116 19 394.
## 6 2013 1 1 -4 12 719 150 -16 288.
## 7 2013 1 1 -5 19 1065 158 -24 404.
## 8 2013 1 1 -3 -14 229 53 11 259.
## 9 2013 1 1 -3 -8 944 140 5 405.
## 10 2013 1 1 -2 8 733 138 -10 319.
## # ... with 336,766 more rows

如果只想保留新生成的列,则使用transmute():

1
2
3
4
flights %>% 
transmute(gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
## # A tibble: 336,776 x 3
## gain hours gain_per_hour
## <dbl> <dbl> <dbl>
## 1 -9 3.78 -2.38
## 2 -16 3.78 -4.23
## 3 -31 2.67 -11.6
## 4 17 3.05 5.57
## 5 19 1.93 9.83
## 6 -16 2.5 -6.4
## 7 -24 2.63 -9.11
## 8 11 0.883 12.5
## 9 5 2.33 2.14
## 10 -10 2.3 -4.35
## # ... with 336,766 more rows

Select columns with select()

筛选列,可以给出确切列名,也可通过函数匹配列名:

  • starts_with(“abc”): matches names that begin with “abc”.

  • ends_with(“xyz”): matches names that end with “xyz”.

  • contains(“ijk”): matches names that contain “ijk”.

  • matches(“(.)\1”): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.

  • num_range(“x”, 1:3): matches x1, x2 and x3.

1
2
flights %>% 
select(ends_with("time"))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
## # A tibble: 336,776 x 5
## dep_time sched_dep_time arr_time sched_arr_time air_time
## <int> <int> <int> <int> <dbl>
## 1 517 515 830 819 227
## 2 533 529 850 830 227
## 3 542 540 923 850 160
## 4 544 545 1004 1022 183
## 5 554 600 812 837 116
## 6 554 558 740 728 150
## 7 555 600 913 854 158
## 8 557 600 709 723 53
## 9 557 600 838 846 140
## 10 558 600 753 745 138
## # ... with 336,766 more rows

everything()选择所有剩余列名(除已选择列名外),可以将某些关心的列排在前面

1
2
flights %>% 
select(time_hour, air_time, everything())
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
## # A tibble: 336,776 x 19
## time_hour air_time year month day dep_time sched_dep_time dep_delay arr_time
## <dttm> <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013-01-01 05:00:00 227 2013 1 1 517 515 2 830
## 2 2013-01-01 05:00:00 227 2013 1 1 533 529 4 850
## 3 2013-01-01 05:00:00 160 2013 1 1 542 540 2 923
## 4 2013-01-01 05:00:00 183 2013 1 1 544 545 -1 1004
## 5 2013-01-01 06:00:00 116 2013 1 1 554 600 -6 812
## 6 2013-01-01 05:00:00 150 2013 1 1 554 558 -4 740
## 7 2013-01-01 06:00:00 158 2013 1 1 555 600 -5 913
## 8 2013-01-01 06:00:00 53 2013 1 1 557 600 -3 709
## 9 2013-01-01 06:00:00 140 2013 1 1 557 600 -3 838
## 10 2013-01-01 06:00:00 138 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 10 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## # hour <dbl>, minute <dbl>

Filter rows with filter

根据阈值筛选数据的行: 多个筛选条件,可以通过,链接。判断条件可以是逻辑运算符,如 >, < != 等。

1
2
flights %>% 
filter(month == 1, day == 1)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
## # A tibble: 842 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # ... with 832 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Grouped summaries with summarise()

summarise()单独使用时,直接对数据集加和

1
2
flights %>% 
summarise(delay = mean(dep_delay, na.rm = TRUE))
1
2
3
4
## # A tibble: 1 x 1
## delay
## <dbl>
## 1 12.6

summarise()结合group_by()使用:分组求和,过滤并画图(使用 %>% 管道符)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
library(ggplot2)

# 分组
flights %>% group_by(dest) %>%
# 每组均值
summarise(count = n(),
dist = mean(distance, na.rm = TRUE), # na.rm=TRUE移除NA值
delay = mean(arr_delay, na.rm = TRUE)) %>%
# 解除分组
ungroup() %>%
# 过滤
filter(count > 20, dest != "HNL") %>%
# 画图
ggplot(aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE) +
theme_bw()

除了mean()函数外,还有其他summary函数:

  • Measures of Location: mean(x), median(x).

  • Measures of spread: sd(x), IQR(x), mad(x).

  • Measures of rank: min(x), quantile(x, 0.25), max(x).

  • Measures of position: first(x), nth(x, 2), last(x).

1
2
3
4
5
6
7
flights %>% 
filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time),
last_dep = last(dep_time)
)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
## # A tibble: 365 x 5
## # Groups: year, month [12]
## year month day first_dep last_dep
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # ... with 355 more rows
  • Counts : n()简单计数加和延误航班次数,例如D942DN有四次延误记录。
1
2
3
4
5
6
flights %>% 
filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay),
n = n())
1
2
3
4
5
6
7
8
9
10
11
12
13
14
## # A tibble: 4,037 x 3
## tailnum delay n
## <chr> <dbl> <int>
## 1 D942DN 31.5 4
## 2 N0EGMQ 9.98 352
## 3 N10156 12.7 145
## 4 N102UW 2.94 48
## 5 N103US -6.93 46
## 6 N104UW 1.80 46
## 7 N10575 20.7 269
## 8 N105UW -0.267 45
## 9 N107US -5.73 41
## 10 N108UW -1.25 60
## # ... with 4,027 more rows
1
flights %>% filter(tailnum == "D942DN")
1
2
3
4
5
6
7
8
9
## # A tibble: 4 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 2 11 1508 1400 68 1807 1636 91 DL
## 2 2013 3 23 1340 1300 40 1638 1554 44 DL
## 3 2013 3 24 859 835 24 1142 1140 2 DL
## 4 2013 7 5 1253 1259 -6 1518 1529 -11 DL
## # ... with 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

统计非NA值使用sum(!is.na(x)),统计unique值使用n_distinct(x).

1
2
3
4
5
6
7
8
flights %>% 
filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
group_by(dest) %>%
summarise(carriers_unique = n_distinct(carrier),
carrriers_nona = sum(!is.na(carrier)),
carrriers_n = n()) %>%
arrange(desc(carriers_unique)) %>%
ungroup()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
## # A tibble: 104 x 4
## dest carriers_unique carrriers_nona carrriers_n
## <chr> <int> <int> <int>
## 1 ATL 7 16837 16837
## 2 BOS 7 15022 15022
## 3 CLT 7 13674 13674
## 4 ORD 7 16566 16566
## 5 TPA 7 7390 7390
## 6 AUS 6 2411 2411
## 7 DCA 6 9111 9111
## 8 DTW 6 9031 9031
## 9 IAD 6 5383 5383
## 10 MSP 6 6929 6929
## # ... with 94 more rows
1
2
3
flights %>% 
filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
count(dest)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows

Grouping by multiple variables

根据多个变量分组计算

1
2
3
flights %>% 
group_by(year, month, day) %>%
summarise(flights = n())
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day flights
## <int> <int> <int> <int>
## 1 2013 1 1 842
## 2 2013 1 2 943
## 3 2013 1 3 914
## 4 2013 1 4 915
## 5 2013 1 5 720
## 6 2013 1 6 832
## 7 2013 1 7 933
## 8 2013 1 8 899
## 9 2013 1 9 902
## 10 2013 1 10 932
## # ... with 355 more rows

Arrange rows with arrange()

按照从大到小对行排序:desc(rownames)

1
2
flights %>% 
arrange(year, desc(month), day)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 12 1 13 2359 14 446 445 1 B6
## 2 2013 12 1 17 2359 18 443 437 6 B6
## 3 2013 12 1 453 500 -7 636 651 -15 US
## 4 2013 12 1 520 515 5 749 808 -19 UA
## 5 2013 12 1 536 540 -4 845 850 -5 AA
## 6 2013 12 1 540 550 -10 1005 1027 -22 B6
## 7 2013 12 1 541 545 -4 734 755 -21 EV
## 8 2013 12 1 546 545 1 826 835 -9 UA
## 9 2013 12 1 549 600 -11 648 659 -11 US
## 10 2013 12 1 550 600 -10 825 854 -29 B6
## # ... with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

across()

对多个列做同一个操作,可以使用across()处理

1
2
3
4
flights %>% group_by(dest) %>%                  
summarise(dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)) %>%
ungroup()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
## # A tibble: 105 x 3
## dest dist delay
## <chr> <dbl> <dbl>
## 1 ABQ 1826 4.38
## 2 ACK 199 4.85
## 3 ALB 143 14.4
## 4 ANC 3370 -2.5
## 5 ATL 757. 11.3
## 6 AUS 1514. 6.02
## 7 AVL 584. 8.00
## 8 BDL 116 7.05
## 9 BGR 378 8.03
## 10 BHM 866. 16.9
## # ... with 95 more rows
1
2
3
4
flights %>% select(dest, distance, arr_delay)  %>%
group_by(dest) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE)) %>%
ungroup()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
## # A tibble: 105 x 3
## dest distance arr_delay
## <chr> <dbl> <dbl>
## 1 ABQ 1826 4.38
## 2 ACK 199 4.85
## 3 ALB 143 14.4
## 4 ANC 3370 -2.5
## 5 ATL 757. 11.3
## 6 AUS 1514. 6.02
## 7 AVL 584. 8.00
## 8 BDL 116 7.05
## 9 BGR 378 8.03
## 10 BHM 866. 16.9
## # ... with 95 more rows

across() 常用场景

1
2
3
4
5
6
7
8
9
10
11
df %>% mutate_if(is.numeric, mean, na.rm = TRUE)
# ->
df %>% mutate(across(where(is.numeric), mean, na.rm = TRUE))

df %>% mutate_at(vars(x, starts_with("y")), mean, na.rm = TRUE)
# ->
df %>% mutate(across(c(x, starts_with("y")), mean, na.rm = TRUE))

df %>% mutate_all(mean, na.rm = TRUE)
# ->
df %>% mutate(across(everything(), mean, na.rm = TRUE))

R information

1
sessionInfo()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
## system code page: 936
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.2 tibble_3.0.3 dplyr_1.0.2 nycflights13_1.0.1
##
## loaded via a namespace (and not attached):
## [1] pillar_1.4.6 compiler_4.0.2 highr_0.8 tools_4.0.2 digest_0.6.25
## [6] jsonlite_1.7.1 evaluate_0.14 lifecycle_0.2.0 gtable_0.3.0 nlme_3.1-150
## [11] lattice_0.20-41 mgcv_1.8-33 pkgconfig_2.0.3 rlang_0.4.7 Matrix_1.2-18
## [16] cli_2.1.0 yaml_2.2.1 xfun_0.18 withr_2.3.0 stringr_1.4.0
## [21] knitr_1.30 generics_0.1.0 vctrs_0.3.4 grid_4.0.2 tidyselect_1.1.0
## [26] glue_1.4.2 R6_2.5.0 fansi_0.4.1 rmarkdown_2.5 purrr_0.3.4
## [31] farver_2.0.3 magrittr_1.5 scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0
## [36] splines_4.0.2 assertthat_0.2.1 colorspace_1.4-1 labeling_0.4.2 utf8_1.1.4
## [41] stringi_1.5.3 munsell_0.5.0 crayon_1.3.4

引用

  1. dplyr

  2. R for Data Science

  3. dplyr across

参考文章如引起任何侵权问题,可以与我联系,谢谢。


------------- The End Thanks for reading --------