R语言笔记之数据管理dplyr,sql以及数据合并

R语言的数据管理

在进行数据分析时,常常会遇到某些情形,例如我要挑出满足某一条的行,我要挑出某些列。此时可以使用dplyr包,或SQL操作。

dplyr包的操作

dplyr包的安装

1
2
install.packages("dplyr")
library(dplyr)

数据集

现在使用datasets包中的ChickWeight数据集,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
> library(datasets)
> dim(ChickWeight)
[1] 578 4
> head(ChickWeight)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
> str(ChickWeight)
Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame': 578 obs. of 4 variables:
$ weight: num 42 51 59 64 76 93 106 125 149 171 ...
$ Time : num 0 2 4 6 8 10 12 14 16 18 ...
$ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
$ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "formula")=Class 'formula' language weight ~ Time | Chick
.. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
- attr(*, "outer")=Class 'formula' language ~Diet
.. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
- attr(*, "labels")=List of 2
..$ x: chr "Time"
..$ y: chr "Body weight"
- attr(*, "units")=List of 2
..$ x: chr "(days)"
..$ y: chr "(gm)"

从上面结果可以看出来,ChickWeight含有4个变量,分别为数值型变量weight,数值型变量Time,因子型变量Chick,因子型变量Diet。

tibble数据类型

tibble是R语言中一个用来替换data.frame类型的扩展的数据框,tibble继承了data.frame,是弱类型的,同时与data.frame有相同的语法,使用起来更方便。tibble对data.frame做了重新的设计:

1
2
3
4
5
6
7
tibble,不关心输入类型,可存储任意类型,包括list类型
tibble,没有行名设置 row.names
tibble,支持任意的列名
tibble,会自动添加列名
tibble,类型只能回收长度为1的输入
tibble,会懒加载参数,并按顺序运行
tibble,是tbl_df类型

这里再注明一下,tibble这个包是包含在tidyverse项目中的,tidyverse是一组处理与可视化R包的集合,其中包含以下的包:

1
2
3
4
5
6
7
8
9
10
11
12
ggplot2 - 可视化数据
dplyr - 数据操作语法,可以用它解决大部分数据处理问题
tidyr - 清理数据
readr - 读入表格数据
purrr - 提供一个完整一致的工具集增强R的函数编程
tibble - 新一代数据框
stringr - 提供函数集用来处理字符数据
forcats - 提供有用工具用来处理因子问题
# 安装tidyverse:
install.packages("tidyverse")
library(tidyverse)

计数count

dplyr包中的count函数可以进行计数,用法为count(x, ..., wt = NULL, sort = FALSE),其中x是目标数据集,...是数据集中的变量,如下所示:

1
2
3
4
5
6
7
8
> count(ChickWeight,Diet)
# A tibble: 4 x 2
Diet n
<fct> <int>
1 1 220
2 2 120
3 3 120
4 4 118

从上面的结果我们可以看出来,输出的结果是一个tibble,其中它与data.frame不同的地方在于,它会在行名称的下面再加一行,就上面的结果而言,它添加了一行为<fct>, <int>的信息,指明了不同列的类型,其它的数据类型信息如下所示:

1
2
3
4
5
6
7
int,代表integer
dbl,代表double
chr,代表character向量或字符串。
dttm,代表日期+时间(a date + a time)
lgl,代表逻辑判断TRUE或者FALSE
fctr,代表因子类型factor
date,代表日期dates.

汇总summaryise()

summaryise(),用法为summarise(.data, ...),其中.data是数据集,...是需要汇总的函数,用法如下所示

1
2
3
4
5
6
> summarise(ChickWeight,mean(weight,na.rm=TRUE))
mean(weight, na.rm = TRUE)
1 121.8183
> summarise(ChickWeight,mean(Time,na.rm=TRUE),min(weight))
mean(Time, na.rm = TRUE) min(weight)
1 10.71799 35

分组group_by

group_by,用法为group_by(.data, ..., add = FALSE),其中.data是数据集,...是分组对象,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
> CW_g1 <- group_by(ChickWeight,Diet)
> CW_g1
# A tibble: 578 x 4
# Groups: Diet [4]
weight Time Chick Diet
* <dbl> <dbl> <ord> <fct>
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
7 106 12 1 1
8 125 14 1 1
9 149 16 1 1
10 171 18 1 1
# ... with 568 more rows

group_by可以和summarise连用,如下所示:

1
2
3
4
5
6
7
8
> summarise(group_by(ChickWeight,Diet),mean(Time,na.rm=TRUE))
# A tibble: 4 x 2
Diet `mean(Time, na.rm = TRUE)`
<fct> <dbl>
1 1 10.5
2 2 10.9
3 3 10.9
4 4 10.8

筛选filter

filter()函数,用法为filter(.data, ...)。举例说明,如下所示:

挑选出weight大于等于40的数据,如下所示:

1
2
3
4
5
6
7
8
9
> CW_F1 <- filter(ChickWeight,weight >= 40)
> head(CW_F1)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1

挑选出weight大于等于40,并且Time大于等于2的数据,如下所示:

1
2
3
4
5
6
7
8
9
> CW_F2<- filter(ChickWeight,weight >= 40,Time>=2)
> head(CW_F2)
weight Time Chick Diet
1 51 2 1 1
2 59 4 1 1
3 64 6 1 1
4 76 8 1 1
5 93 10 1 1
6 106 12 1 1

挑选出Diet为1的数据,如下所示:

1
2
3
4
5
6
7
8
9
> CW_F3<- filter(ChickWeight,Diet==1)
> head(CW_F3)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1

操行出Diet为1或3的数据,此时需要使用%in%符号,如下所示:

1
2
CW_F4<- filter(ChickWeight,Diet %in% c(1,3))
CW_F4

挑选出Chick为2,Diet为1的数据,如下所示:

1
2
3
4
5
6
7
8
9
> CW_F5<- filter(ChickWeight,Diet==1,Chick==2)
> head(CW_F5)
weight Time Chick Diet
1 40 0 2 1
2 49 2 2 1
3 58 4 2 1
4 72 6 2 1
5 84 8 2 1
6 103 10 2 1

挑选满足条件的某一列select

filter在使用时,它是将满足某一条件的整个数据集都给拿了出来,而select函数则是只将某一列拿出来,例如我们现在只将weight和Diet挑选出来,并与filter对比一下,如下所示:

1
CW_s1<-selet(ChickWeight,weight,Diet)

排列某一列,只选择剩下的部分,如下所示:

1
2
3
4
5
6
7
8
> head(select(ChickWeight,-weight))
Time Chick Diet
1 0 1 1
2 2 1 1
3 4 1 1
4 6 1 1
5 8 1 1
6 10 1 1

除去某2列,如下所示:

1
2
3
4
5
6
7
8
> head(select(ChickWeight,-weight,-Diet))
Time Chick
1 0 1
2 2 1
3 4 1
4 6 1
5 8 1
6 10 1

选择从Time到Diet这之间的列,如下所示:

1
2
3
4
5
6
7
8
> head(select(ChickWeight,Time:Diet))
Time Chick Diet
1 0 1 1
2 2 1 1
3 4 1 1
4 6 1 1
5 8 1 1
6 10 1 1

添加mutate

mutate()可以添加不同的列,现在我们在原始数据后面加上一列,名称是Diff,列的内容是该行weight值与weight均值之差,如下所示:

1
2
3
4
5
6
7
8
9
> CW_m1 <- mutate(ChickWeight,Diff=(weight-mean(weight)))
> head(CW_m1)
weight Time Chick Diet Diff
1 42 0 1 1 -79.81834
2 51 2 1 1 -70.81834
3 59 4 1 1 -62.81834
4 64 6 1 1 -57.81834
5 76 8 1 1 -45.81834
6 93 10 1 1 -28.81834

管道命令%in%

%in%是dplyr的管道命令,可以将几个函数串联起来。看下面的案例:

1
2
3
4
CW_p1 <- ChickWeight%>%
filter(Chick!=c(15,18,20,17))%>%
group_by(Diet)%>%
summarise(mean(Time,na.rm=TRUE))

结果如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
> CW_p1 <- ChickWeight%>%
+ filter(Chick!=c(15,18,20,17))%>%
+ group_by(Diet)%>%
+ summarise(mean(Time,na.rm=TRUE))
Warning messages:
1: In is.na(e1) | is.na(e2) :
longer object length is not a multiple of shorter object length
2: In `!=.default`(Chick, c(15, 18, 20, 17)) :
longer object length is not a multiple of shorter object length
> CW_p1
# A tibble: 4 x 2
Diet `mean(Time, na.rm = TRUE)`
<fct> <dbl>
1 1 10.5
2 2 10.9
3 3 10.9
4 4 10.8

再看一案例,如下所示:

1
2
3
4
5
CW_p2 <- ChickWeight%>%
filter(weight>=50,Diet!=1)%>%
mutate(DiffTime=(Time-mean(Time,na.rm=TRUE)))%>%
select(Diet:DiffTime)
head(CW_p2)

运行结果如下所示:

1
2
3
4
5
6
7
8
> head(CW_p2)
Diet DiffTime
1 2 -10.2310127
2 2 -8.2310127
3 2 -6.2310127
4 2 -4.2310127
5 2 -2.2310127
6 2 -0.2310127

排序arrange

arrange()可以将数据集进行排序,用法如下所示:

1
2
3
4
5
6
7
8
> head(arrange(ChickWeight,desc(weight)))
weight Time Chick Diet
1 373 21 35 3
2 361 20 35 3
3 341 21 34 3
4 332 18 35 3
5 331 21 21 2
6 327 20 34 3

sqldf包的操作

sqldf包提供了使用SQL操作数据框的方式,通常这种操作非常适合那些有数据库背景的人使用。

安装sqldf

1
2
install.packages("sqldf")
library(sqldf)

sqldf包的使用

我们使用learningr包中的deer_endocranial_volume数据集来演示一下sqldf的操作,如下所示:

1
2
3
library(learningr)
data("deer_endocranial_volume")
head(deer_endocranial_volume)

运行结果如下所示:

1
2
3
4
5
6
7
8
> head(deer_endocranial_volume)
SkullID VolCT VolBead VolLWH VolFinarelli VolCT2 VolBead2 VolLWH2
1 DIC44 389 375 1484 337 NA NA NA
2 B11 389 370 1722 377 NA NA NA
3 DIC90 352 345 1495 328 NA NA NA
4 DIC83 388 370 1683 377 NA NA NA
5 DIC787 375 355 1458 328 NA NA NA
6 DIC1573 325 320 1363 291 NA NA NA

现在挑选出VolCt大于400,或者是VolCT2大于400的行。

我们先用R语言来操作,如下所示:

1
2
3
4
5
subset(
deer_endocranial_volume,
VolCT > 400 | VolCT2 > 400,
c(VolCT, VolCT2)
)

运行结果如下所示:

1
2
3
4
5
6
7
8
9
10
> subset(
+ deer_endocranial_volume,
+ VolCT > 400 | VolCT2 > 400,
+ c(VolCT, VolCT2)
+ )
VolCT VolCT2
10 410 413
11 405 408
13 416 417
16 418 NA

现在使用sqldf包来操作,如下所示:

1
2
3
4
5
6
7
8
9
10
query <-
"SELECT
VolCT,
VolCT2
FROM
deer_endocranial_volume
WHERE
VolCT > 400 OR
VolCT2 > 400"
sqldf(query)

运行结果如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
> query <-
+ "SELECT
+ VolCT,
+ VolCT2
+ FROM
+ deer_endocranial_volume
+ WHERE
+ VolCT > 400 OR
+ VolCT2 > 400"
> sqldf(query)
VolCT VolCT2
1 410 413
2 405 408
3 416 417
4 418 NA

具体的SQL语法可以参考专业的SQL书籍。

数据的合并

R中与合并有关的函数包括:

1
2
3
4
5
cbind
rbind
merge
plyr::join
tidyr::inter_join/full_join/left_join/right_join

构建原始数据

1
2
3
4
5
6
7
8
ID<-c(1,2,3,4)
Name<-c("A","B","C","D")
Score<-c(60,70,80,90)
Sex<-c("M","F","M","M")
One<-data.frame(ID,Name)
Two<-data.frame(Score,Sex)
One
Two

数据如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
> One
ID Name
1 1 A
2 2 B
3 3 C
4 4 D
> Two
Score Sex
1 60 M
2 70 F
3 80 M
4 90 M

横向合并cbind()

cbind()函数是横向合并,也就是按column进行合并,即左右合并,如下所示:

1
2
Total_by_column <- cbind(One, Two)
Total_by_column

结果如下所示:

1
2
3
4
5
6
7
> Total_by_column <- cbind(One, Two)
> Total_by_column
ID Name Score Sex
1 1 A 60 M
2 2 B 70 F
3 3 C 80 M
4 4 D 90 M

纵向合并rbind()

rbind()是按row进行合并,即纵向合并,在进行纵向合并时,两个数据框要有相同的列,如下所示:

1
2
3
4
5
6
7
8
ID<-c(1,2,3,4)
Name<-c("A","B","C","D")
Student1<-data.frame(ID,Name)
ID<-c(5,6,7,8)
Name<-c("E","F","G","H")
Student2<-data.frame(ID,Name)
Student1
Student2

数据如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
> Student1
ID Name
1 1 A
2 2 B
3 3 C
4 4 D
> Student2
ID Name
1 5 E
2 6 F
3 7 G
4 8 H

此时按列进行合并,即使用rbind()函数进行合并,如下所示:

1
2
Total_by_row <- rbind(Student1,Student2)
Total_by_row

合并后的结果如下所示:

1
2
3
4
5
6
7
8
9
10
11
> Total_by_row <- rbind(Student1,Student2)
> Total_by_row
ID Name
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H

按照某字段进行合并merge()

merge()函数主要是针对横向的列字段进行合并,而且可以针对主字段进行匹配,如果主字段名称不同,还可以指定前后相匹配的字段,使用语法如下所示:

1
merge(x, y, by = , by.x = , by.y = , all = , all.x = , all.y = , sort = , suffixes = , incomparables = , ...)

具体的参数为:

  • x:数据框1;
  • y:数据框2;
  • by:连接的列名,默认是intersect(names(x),names(y))
  • by.x第一个数据框要连接的列名,默认是by
  • by.y第二个数据框要连接的名,默认是by
  • all:如果连接后没有这一行记录,是否要包含此行,默认是FALSE;
  • all.x:如果连接后第一个数据框中没有这一行的记录,是否要包含此行,默认是all;
  • all.y:如果连接后第一个数据框中没有这一行的记录,是否要包含此行,默认是all。

列名相同的merge()

原始数据如下所示:

1
2
3
4
5
6
7
ID<-c(1,2,3,4)
name<-c("A","B","C","D")
score<-c(60,70,80,90)
student1<-data.frame(ID,name)
student2<-data.frame(ID,score)
student1
student2

数据为:

1
2
3
4
5
6
7
8
9
10
11
12
> student1
ID name
1 1 A
2 2 B
3 3 C
4 4 D
> student2
ID score
1 1 60
2 2 70
3 3 80
4 4 90

从上面结果可以看出来,student1student2中都有相同的列ID,按此列进行合并,如下所示:

1
2
3
4
5
6
> merge(student1,student2,by="ID")
ID name score
1 1 A 60
2 2 B 70
3 3 C 80
4 4 D 90

如果两具数据框中有相同的列,by参数可以省略,如下所示:

1
2
3
4
5
6
> merge(student1,student2)
ID name score
1 1 A 60
2 2 B 70
3 3 C 80
4 4 D 90

横向合并

先创建一个原始数据集,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
authors <- data.frame(
surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
nationality = c("US", "Australia", "US", "UK", "Australia"),
deceased = c("yes", rep("no", 4)),
hobby = rep('surf', 5))
books <- data.frame(
name = I(c("Tukey", "Venables", "Tierney",
"Ripley", "Ripley", "McNeil", "R Core")),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics ...",
"LISP-STAT",
"Spatial Statistics", "Stochastic Simulation",
"Interactive Data Analysis",
"An Introduction to R"),
other.author = c(NA, "Ripley", NA, NA, NA, NA,
"Venables & Smith"),
hobby = rep('chess', 7))
## FIN
# Echamos un vistazo a ambas:
authors
books

数据如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
> authors
surname nationality deceased hobby
1 Tukey US yes surf
2 Venables Australia no surf
3 Tierney US no surf
4 Ripley UK no surf
5 McNeil Australia no surf
> books
name title other.author hobby
1 Tukey Exploratory Data Analysis <NA> chess
2 Venables Modern Applied Statistics ... Ripley chess
3 Tierney LISP-STAT <NA> chess
4 Ripley Spatial Statistics <NA> chess
5 Ripley Stochastic Simulation <NA> chess
6 McNeil Interactive Data Analysis <NA> chess
7 R Core An Introduction to R Venables & Smith chess

从上面的两个数据框,authorsbooks可知,它们有1列属性相同,也就是authors中的surname列和books中的name列,都是字符串,但是名称不同,因此,在使用merge()进行合并时,需要指定匹配的主字段。

内部连接

此时,我们按照authorssurname列和booksname列进行合并,如下所示:

1
m1 <- merge(authors, books,by.x='surname',by.y="name")

合并后的数据如下所示:

1
2
3
4
5
6
7
8
> m1
surname nationality deceased hobby.x title other.author hobby.y
1 McNeil Australia no surf Interactive Data Analysis <NA> chess
2 Ripley UK no surf Spatial Statistics <NA> chess
3 Ripley UK no surf Stochastic Simulation <NA> chess
4 Tierney US no surf LISP-STAT <NA> chess
5 Tukey US yes surf Exploratory Data Analysis <NA> chess
6 Venables Australia no surf Modern Applied Statistics ... Ripley chess

从上面可以看出来,合并的时候,只合并了authorsbooks的交集。

左连接

1
m2 <- merge(authors, books, by.x = "surname", by.y = "name",all.x=TRUE)

合并后的数据如下所示:

1
2
3
4
5
6
7
8
> m2
surname nationality deceased hobby.x title other.author hobby.y
1 McNeil Australia no surf Interactive Data Analysis <NA> chess
2 Ripley UK no surf Spatial Statistics <NA> chess
3 Ripley UK no surf Stochastic Simulation <NA> chess
4 Tierney US no surf LISP-STAT <NA> chess
5 Tukey US yes surf Exploratory Data Analysis <NA> chess
6 Venables Australia no surf Modern Applied Statistics ... Ripley chess

从上面可以看出来,内部连接与左连接合并的效果是相同的,使用==判断一下,也是如此:

1
2
3
4
5
6
7
8
> m1 == m2
surname nationality deceased hobby.x title other.author hobby.y
[1,] TRUE TRUE TRUE TRUE TRUE NA TRUE
[2,] TRUE TRUE TRUE TRUE TRUE NA TRUE
[3,] TRUE TRUE TRUE TRUE TRUE NA TRUE
[4,] TRUE TRUE TRUE TRUE TRUE NA TRUE
[5,] TRUE TRUE TRUE TRUE TRUE NA TRUE
[6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

右连接

1
m3 <- merge(authors, books, by.x = "surname", by.y = "name",all.y=TRUE)

合并后的数据框如下所示:

1
2
3
4
5
6
7
8
9
> m3
surname nationality deceased hobby.x title other.author hobby.y
1 McNeil Australia no surf Interactive Data Analysis <NA> chess
2 R Core <NA> <NA> <NA> An Introduction to R Venables & Smith chess
3 Ripley UK no surf Spatial Statistics <NA> chess
4 Ripley UK no surf Stochastic Simulation <NA> chess
5 Tierney US no surf LISP-STAT <NA> chess
6 Tukey US yes surf Exploratory Data Analysis <NA> chess
7 Venables Australia no surf Modern Applied Statistics ... Ripley chess

右连接合并后的,取是的并集。

外连接

1
m4 <- merge(authors, books, by.x = "surname", by.y = "name",all=TRUE)

合并后的数据如下所示:

1
2
3
4
5
6
7
8
9
> m4
surname nationality deceased hobby.x title other.author hobby.y
1 McNeil Australia no surf Interactive Data Analysis <NA> chess
2 R Core <NA> <NA> <NA> An Introduction to R Venables & Smith chess
3 Ripley UK no surf Spatial Statistics <NA> chess
4 Ripley UK no surf Stochastic Simulation <NA> chess
5 Tierney US no surf LISP-STAT <NA> chess
6 Tukey US yes surf Exploratory Data Analysis <NA> chess
7 Venables Australia no surf Modern Applied Statistics ... Ripley chess

外连接与右连接相同,判断一下:

1
2
3
4
5
6
7
8
9
> m3 == m4
surname nationality deceased hobby.x title other.author hobby.y
[1,] TRUE TRUE TRUE TRUE TRUE NA TRUE
[2,] TRUE NA NA NA TRUE TRUE TRUE
[3,] TRUE TRUE TRUE TRUE TRUE NA TRUE
[4,] TRUE TRUE TRUE TRUE TRUE NA TRUE
[5,] TRUE TRUE TRUE TRUE TRUE NA TRUE
[6,] TRUE TRUE TRUE TRUE TRUE NA TRUE
[7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

使用dplyr包进行合并

先创建一个原始数据,如下所示:

1
2
x<-data.frame(Name=c("John","Paul","George","Ringo","Stuart","Pete"),instrument=c("guitar","bass","guitar","drums","bass","drums"))
y<-data.frame(name=c("John","Paul","George","Ringo","Brian"),band=c("TRUE","TRUE","TRUE","TRUE","FALSE"))

数据如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
> x
Name instrument
1 John guitar
2 Paul bass
3 George guitar
4 Ringo drums
5 Stuart bass
6 Pete drums
> y
name band
1 John TRUE
2 Paul TRUE
3 George TRUE
4 Ringo TRUE
5 Brian FALSE

inner_join(x,y):只包含同时出现在x和y数据框中的行:

1
2
3
4
5
6
> inner_join(x,y,by=c("Name"="name"))
Name instrument band
1 John guitar TRUE
2 Paul bass TRUE
3 George guitar TRUE
4 Ringo drums TRUE

left_join(x, y):包含所有x中以及y中匹配的行

1
2
3
4
5
6
7
8
9
10
> left_join(x,y,by=c("Name"="name"))
Name instrument band
1 John guitar TRUE
2 Paul bass TRUE
3 George guitar TRUE
4 Ringo drums TRUE
5 Stuart bass <NA>
6 Pete drums <NA>
Warning message:
Column `Name`/`name` joining factors with different levels, coercing to character vector

left_join(x,y)包含所有y中以及x中匹配的行

1
2
3
4
5
6
7
8
9
> right_join(x, y,by=c("Name"="name"))
Name instrument band
1 John guitar TRUE
2 Paul bass TRUE
3 George guitar TRUE
4 Ringo drums TRUE
5 Brian <NA> FALSE
Warning message:
Column `Name`/`name` joining factors with different levels, coercing to character vector

full_join(x,y):包含所有x、y中的行

1
2
3
4
5
6
7
8
9
10
11
> full_join(x,y,by=c("Name"="name"))
Name instrument band
1 John guitar TRUE
2 Paul bass TRUE
3 George guitar TRUE
4 Ringo drums TRUE
5 Stuart bass <NA>
6 Pete drums <NA>
7 Brian <NA> FALSE
Warning message:
Column `Name`/`name` joining factors with different levels, coercing to character vector

semi_join(x, y) :包含x中,在y中有匹配的行,结果为x的子集

1
2
3
4
5
6
7
8
> semi_join(x,y,by=c("Name"="name"))
Name instrument
1 John guitar
2 Paul bass
3 George guitar
4 Ringo drums
Warning message:
Column `Name`/`name` joining factors with different levels, coercing to character vector

anti_join(x, y) :包含x中,不匹配y的行,结果为x的子集,与semi_join相反

1
2
3
4
5
6
> anti_join(x,y,by=c("Name"="name"))
Name instrument
1 Stuart bass
2 Pete drums
Warning message:
Column `Name`/`name` joining factors with different levels, coercing to character vector

参考资料

  1. 时光加速器-dplyr.李楠 克里克学苑
  2. R语言数据科学新类型tibble
  3. 学习R.[美] Richard,Cotton 著刘军 译
  4. manipular-tablas.R
  5. 数据追加与合并