介绍 dplyr是data manipulation的包,其包含多个处理数据的函数。主要函数有:
mutate() 添加新变量.
select() 选择列名.
filter() 过滤行.
summarise() 求和统计.
arrange() 排序.
group_by() 分组处理.
安装 1 2 3 4 5 library(dplyr) library(nycflights13)
tibbles数据类型 tibbles
可以取代data.frame
,虽然前者仍然可以认为是数据框类型,在数据处理过程中,tibbles
数据类型消耗资源更少,处理速度更快。dplyr
的函数可以直接处理tibbles
数据类型。
1 2 3 library(tibble) as_tibble(iris)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## # A tibble: 150 x 5 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## <dbl> <dbl> <dbl> <dbl> <fct> ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa ## 8 5 3.4 1.5 0.2 setosa ## 9 4.4 2.9 1.4 0.2 setosa ## 10 4.9 3.1 1.5 0.1 setosa ## # ... with 140 more rows
1 2 3 4 5 tibble( x = 1 :5 , y = 1 , z = x ^ 2 + y )
1 2 3 4 5 6 7 8 ## # A tibble: 5 x 3 ## x y z ## <int> <dbl> <dbl> ## 1 1 1 2 ## 2 2 1 5 ## 3 3 1 10 ## 4 4 1 17 ## 5 5 1 26
Add new variables with mutate()
添加新的列:新的列一般在数据集的最后一列
1 2 3 4 5 6 7 8 9 flights %>% select( year:day, ends_with("delay" ), distance, air_time) %>% mutate(gain = dep_delay - arr_delay, speed = distance / air_time * 60 )
1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## # A tibble: 336,776 x 9 ## year month day dep_delay arr_delay distance air_time gain speed ## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2013 1 1 2 11 1400 227 -9 370. ## 2 2013 1 1 4 20 1416 227 -16 374. ## 3 2013 1 1 2 33 1089 160 -31 408. ## 4 2013 1 1 -1 -18 1576 183 17 517. ## 5 2013 1 1 -6 -25 762 116 19 394. ## 6 2013 1 1 -4 12 719 150 -16 288. ## 7 2013 1 1 -5 19 1065 158 -24 404. ## 8 2013 1 1 -3 -14 229 53 11 259. ## 9 2013 1 1 -3 -8 944 140 5 405. ## 10 2013 1 1 -2 8 733 138 -10 319. ## # ... with 336,766 more rows
如果只想保留新生成的列,则使用transmute()
:
1 2 3 4 flights %>% transmute(gain = dep_delay - arr_delay, hours = air_time / 60 , gain_per_hour = gain / hours)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## # A tibble: 336,776 x 3 ## gain hours gain_per_hour ## <dbl> <dbl> <dbl> ## 1 -9 3.78 -2.38 ## 2 -16 3.78 -4.23 ## 3 -31 2.67 -11.6 ## 4 17 3.05 5.57 ## 5 19 1.93 9.83 ## 6 -16 2.5 -6.4 ## 7 -24 2.63 -9.11 ## 8 11 0.883 12.5 ## 9 5 2.33 2.14 ## 10 -10 2.3 -4.35 ## # ... with 336,766 more rows
Select columns with select()
筛选列,可以给出确切列名,也可通过函数匹配列名:
starts_with(“abc”): matches names that begin with “abc”.
ends_with(“xyz”): matches names that end with “xyz”.
contains(“ijk”): matches names that contain “ijk”.
matches(“(.)\1”): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
num_range(“x”, 1:3): matches x1, x2 and x3.
1 2 flights %>% select(ends_with("time" ))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## # A tibble: 336,776 x 5 ## dep_time sched_dep_time arr_time sched_arr_time air_time ## <int> <int> <int> <int> <dbl> ## 1 517 515 830 819 227 ## 2 533 529 850 830 227 ## 3 542 540 923 850 160 ## 4 544 545 1004 1022 183 ## 5 554 600 812 837 116 ## 6 554 558 740 728 150 ## 7 555 600 913 854 158 ## 8 557 600 709 723 53 ## 9 557 600 838 846 140 ## 10 558 600 753 745 138 ## # ... with 336,766 more rows
everything()
选择所有剩余列名(除已选择列名外),可以将某些关心的列排在前面
1 2 flights %>% select(time_hour, air_time, everything())
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ## # A tibble: 336,776 x 19 ## time_hour air_time year month day dep_time sched_dep_time dep_delay arr_time ## <dttm> <dbl> <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013-01-01 05:00:00 227 2013 1 1 517 515 2 830 ## 2 2013-01-01 05:00:00 227 2013 1 1 533 529 4 850 ## 3 2013-01-01 05:00:00 160 2013 1 1 542 540 2 923 ## 4 2013-01-01 05:00:00 183 2013 1 1 544 545 -1 1004 ## 5 2013-01-01 06:00:00 116 2013 1 1 554 600 -6 812 ## 6 2013-01-01 05:00:00 150 2013 1 1 554 558 -4 740 ## 7 2013-01-01 06:00:00 158 2013 1 1 555 600 -5 913 ## 8 2013-01-01 06:00:00 53 2013 1 1 557 600 -3 709 ## 9 2013-01-01 06:00:00 140 2013 1 1 557 600 -3 838 ## 10 2013-01-01 06:00:00 138 2013 1 1 558 600 -2 753 ## # ... with 336,766 more rows, and 10 more variables: sched_arr_time <int>, arr_delay <dbl>, ## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>, ## # hour <dbl>, minute <dbl>
Filter rows with filter
根据阈值筛选数据的行: 多个筛选条件,可以通过,
链接。判断条件可以是逻辑运算符,如 >, < != 等。
1 2 flights %>% filter(month == 1 , day == 1 )
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ## # A tibble: 842 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> ## 1 2013 1 1 517 515 2 830 819 11 UA ## 2 2013 1 1 533 529 4 850 830 20 UA ## 3 2013 1 1 542 540 2 923 850 33 AA ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 ## 5 2013 1 1 554 600 -6 812 837 -25 DL ## 6 2013 1 1 554 558 -4 740 728 12 UA ## 7 2013 1 1 555 600 -5 913 854 19 B6 ## 8 2013 1 1 557 600 -3 709 723 -14 EV ## 9 2013 1 1 557 600 -3 838 846 -8 B6 ## 10 2013 1 1 558 600 -2 753 745 8 AA ## # ... with 832 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>, ## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Grouped summaries with summarise()
summarise()
单独使用时,直接对数据集加和
1 2 flights %>% summarise(delay = mean(dep_delay, na.rm = TRUE ))
1 2 3 4 ## # A tibble: 1 x 1 ## delay ## <dbl> ## 1 12.6
summarise()
结合group_by()
使用:分组求和,过滤并画图(使用 %>% 管道符)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 library(ggplot2) flights %>% group_by(dest) %>% summarise(count = n(), dist = mean(distance, na.rm = TRUE ), delay = mean(arr_delay, na.rm = TRUE )) %>% ungroup() %>% filter(count > 20 , dest != "HNL" ) %>% ggplot(aes(x = dist, y = delay)) + geom_point(aes(size = count), alpha = 1 /3 ) + geom_smooth(se = FALSE ) + theme_bw()
除了mean()
函数外,还有其他summary函数:
Measures of Location: mean(x)
, median(x)
.
Measures of spread: sd(x)
, IQR(x)
, mad(x)
.
Measures of rank: min(x)
, quantile(x, 0.25)
, max(x)
.
Measures of position: first(x)
, nth(x, 2)
, last(x)
.
1 2 3 4 5 6 7 flights %>% filter(!is.na (dep_delay), !is.na (arr_delay)) %>% group_by(year, month, day) %>% summarise( first_dep = first(dep_time), last_dep = last(dep_time) )
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ## # A tibble: 365 x 5 ## # Groups: year, month [12] ## year month day first_dep last_dep ## <int> <int> <int> <int> <int> ## 1 2013 1 1 517 2356 ## 2 2013 1 2 42 2354 ## 3 2013 1 3 32 2349 ## 4 2013 1 4 25 2358 ## 5 2013 1 5 14 2357 ## 6 2013 1 6 16 2355 ## 7 2013 1 7 49 2359 ## 8 2013 1 8 454 2351 ## 9 2013 1 9 2 2252 ## 10 2013 1 10 3 2320 ## # ... with 355 more rows
Counts : n()
简单计数加和延误航班次数,例如D942DN 有四次延误记录。
1 2 3 4 5 6 flights %>% filter(!is.na (dep_delay), !is.na (arr_delay)) %>% group_by(tailnum) %>% summarise( delay = mean(arr_delay), n = n())
1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## # A tibble: 4,037 x 3 ## tailnum delay n ## <chr> <dbl> <int> ## 1 D942DN 31.5 4 ## 2 N0EGMQ 9.98 352 ## 3 N10156 12.7 145 ## 4 N102UW 2.94 48 ## 5 N103US -6.93 46 ## 6 N104UW 1.80 46 ## 7 N10575 20.7 269 ## 8 N105UW -0.267 45 ## 9 N107US -5.73 41 ## 10 N108UW -1.25 60 ## # ... with 4,027 more rows
1 flights %>% filter(tailnum == "D942DN" )
1 2 3 4 5 6 7 8 9 ## # A tibble: 4 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> ## 1 2013 2 11 1508 1400 68 1807 1636 91 DL ## 2 2013 3 23 1340 1300 40 1638 1554 44 DL ## 3 2013 3 24 859 835 24 1142 1140 2 DL ## 4 2013 7 5 1253 1259 -6 1518 1529 -11 DL ## # ... with 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
统计非NA值使用sum(!is.na(x))
,统计unique值使用n_distinct(x)
.
1 2 3 4 5 6 7 8 flights %>% filter(!is.na (dep_delay), !is.na (arr_delay)) %>% group_by(dest) %>% summarise(carriers_unique = n_distinct(carrier), carrriers_nona = sum (!is.na (carrier)), carrriers_n = n()) %>% arrange(desc(carriers_unique)) %>% ungroup()
1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## # A tibble: 104 x 4 ## dest carriers_unique carrriers_nona carrriers_n ## <chr> <int> <int> <int> ## 1 ATL 7 16837 16837 ## 2 BOS 7 15022 15022 ## 3 CLT 7 13674 13674 ## 4 ORD 7 16566 16566 ## 5 TPA 7 7390 7390 ## 6 AUS 6 2411 2411 ## 7 DCA 6 9111 9111 ## 8 DTW 6 9031 9031 ## 9 IAD 6 5383 5383 ## 10 MSP 6 6929 6929 ## # ... with 94 more rows
1 2 3 flights %>% filter(!is.na (dep_delay), !is.na (arr_delay)) %>% count(dest)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## # A tibble: 104 x 2 ## dest n ## <chr> <int> ## 1 ABQ 254 ## 2 ACK 264 ## 3 ALB 418 ## 4 ANC 8 ## 5 ATL 16837 ## 6 AUS 2411 ## 7 AVL 261 ## 8 BDL 412 ## 9 BGR 358 ## 10 BHM 269 ## # ... with 94 more rows
Grouping by multiple variables
根据多个变量分组计算
1 2 3 flights %>% group_by(year, month, day) %>% summarise(flights = n())
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ## # A tibble: 365 x 4 ## # Groups: year, month [12] ## year month day flights ## <int> <int> <int> <int> ## 1 2013 1 1 842 ## 2 2013 1 2 943 ## 3 2013 1 3 914 ## 4 2013 1 4 915 ## 5 2013 1 5 720 ## 6 2013 1 6 832 ## 7 2013 1 7 933 ## 8 2013 1 8 899 ## 9 2013 1 9 902 ## 10 2013 1 10 932 ## # ... with 355 more rows
Arrange rows with arrange()
按照从大到小对行排序:desc(rownames)
1 2 flights %>% arrange(year, desc(month), day)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ## # A tibble: 336,776 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> ## 1 2013 12 1 13 2359 14 446 445 1 B6 ## 2 2013 12 1 17 2359 18 443 437 6 B6 ## 3 2013 12 1 453 500 -7 636 651 -15 US ## 4 2013 12 1 520 515 5 749 808 -19 UA ## 5 2013 12 1 536 540 -4 845 850 -5 AA ## 6 2013 12 1 540 550 -10 1005 1027 -22 B6 ## 7 2013 12 1 541 545 -4 734 755 -21 EV ## 8 2013 12 1 546 545 1 826 835 -9 UA ## 9 2013 12 1 549 600 -11 648 659 -11 US ## 10 2013 12 1 550 600 -10 825 854 -29 B6 ## # ... with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>, ## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
across()
对多个列做同一个操作,可以使用across()
处理
1 2 3 4 flights %>% group_by(dest) %>% summarise(dist = mean(distance, na.rm = TRUE ), delay = mean(arr_delay, na.rm = TRUE )) %>% ungroup()
1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## # A tibble: 105 x 3 ## dest dist delay ## <chr> <dbl> <dbl> ## 1 ABQ 1826 4.38 ## 2 ACK 199 4.85 ## 3 ALB 143 14.4 ## 4 ANC 3370 -2.5 ## 5 ATL 757. 11.3 ## 6 AUS 1514. 6.02 ## 7 AVL 584. 8.00 ## 8 BDL 116 7.05 ## 9 BGR 378 8.03 ## 10 BHM 866. 16.9 ## # ... with 95 more rows
1 2 3 4 flights %>% select(dest, distance, arr_delay) %>% group_by(dest) %>% summarise(across(where(is.numeric ), mean, na.rm = TRUE )) %>% ungroup()
1 2 3 4 5 6 7 8 9 10 11 12 13 14 ## # A tibble: 105 x 3 ## dest distance arr_delay ## <chr> <dbl> <dbl> ## 1 ABQ 1826 4.38 ## 2 ACK 199 4.85 ## 3 ALB 143 14.4 ## 4 ANC 3370 -2.5 ## 5 ATL 757. 11.3 ## 6 AUS 1514. 6.02 ## 7 AVL 584. 8.00 ## 8 BDL 116 7.05 ## 9 BGR 378 8.03 ## 10 BHM 866. 16.9 ## # ... with 95 more rows
across()
常用场景
1 2 3 4 5 6 7 8 9 10 11 df %>% mutate_if(is.numeric , mean, na.rm = TRUE ) df %>% mutate(across(where(is.numeric ), mean, na.rm = TRUE )) df %>% mutate_at(vars(x, starts_with("y" )), mean, na.rm = TRUE ) df %>% mutate(across(c (x, starts_with("y" )), mean, na.rm = TRUE )) df %>% mutate_all(mean, na.rm = TRUE ) df %>% mutate(across(everything(), mean, na.rm = TRUE ))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 ## R version 4.0.2 (2020-06-22) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 10 x64 (build 19042) ## ## Matrix products: default ## ## locale: ## [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 ## [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C ## [5] LC_TIME=English_United States.1252 ## system code page: 936 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] ggplot2_3.3.2 tibble_3.0.3 dplyr_1.0.2 nycflights13_1.0.1 ## ## loaded via a namespace (and not attached): ## [1] pillar_1.4.6 compiler_4.0.2 highr_0.8 tools_4.0.2 digest_0.6.25 ## [6] jsonlite_1.7.1 evaluate_0.14 lifecycle_0.2.0 gtable_0.3.0 nlme_3.1-150 ## [11] lattice_0.20-41 mgcv_1.8-33 pkgconfig_2.0.3 rlang_0.4.7 Matrix_1.2-18 ## [16] cli_2.1.0 yaml_2.2.1 xfun_0.18 withr_2.3.0 stringr_1.4.0 ## [21] knitr_1.30 generics_0.1.0 vctrs_0.3.4 grid_4.0.2 tidyselect_1.1.0 ## [26] glue_1.4.2 R6_2.5.0 fansi_0.4.1 rmarkdown_2.5 purrr_0.3.4 ## [31] farver_2.0.3 magrittr_1.5 scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0 ## [36] splines_4.0.2 assertthat_0.2.1 colorspace_1.4-1 labeling_0.4.2 utf8_1.1.4 ## [41] stringi_1.5.3 munsell_0.5.0 crayon_1.3.4
引用
dplyr
R for Data Science
dplyr across
参考文章如引起任何侵权问题,可以与我联系 ,谢谢。