R for Stata users

Join datasets

SQL joins

The packages dplyr uses the SQL concepts of left, right, full, inner, semi, anti and cross joins. The following table gives the correspondence between the Stata commands merge and joinby with the dplyr functions:

Stata merge v1, keep(master matched using)
dplyr full_join(df1, df2, by = "v1")
Base R merge(df1, df2, by = "v1", all.x = TRUE, all.y = TRUE)
Stata merge v1, keep(master matched)
dplyr left_join(df1, df2, by = "v1")
Base R merge(df1, df2, by = "v1", all.x = TRUE, all.y = FALSE)
Stata merge v1, keep(matched using)
dplyr right_join(df1, df2, by = "v1")
Base R merge(df1, df2, by = "v1", all.x = FALSE, all.y = TRUE)
Stata merge v1, keep(matched)
dplyr inner_join(df1, df2, by = "v1")
Base R merge(df1, df2, by = "v1", all.x = FALSE, all.y = FALSE)

There are two important differences with Stata merge


Semi and anti joins

A semi join keeps rows with a match:

Stata merge v1, keep(matched) keepusing(v1)
dplyr semi_join(df1, df2, by = "v1")

An anti join deletes rows with a match:

Stata merge v1, keep(master) keepusing(v1)
dplyr anti_join(df1, df2)


Cross joins

The last join is the cartesian productiopn (the only one without key variable)

Stata crossby
dplyr cross_join(df1, df2)


Append datasets

To append one dataset to another:

Stata append using "using.dta"
dplyr row_binds(df1, df2)

When string columns are append too numeric columns, Stata returns an error (and with the option force, it replaces the using dataset to missing values). In contrast, in R, columns with different types are coerced to their higher type (logical < strings < int < real < character).


Reshape datasets

The package tidyr includes the functions gather and spread that alternate between wide and long datasets.


From wide to long

dfWide --> dfLong
 i stub1 stub2
1  4.1    4.5
2  4.5    3.0
--> i variable value
1 stub1     4.1
1 stub2     4.5
2 stub1     3.3
2 stub2     3.0
Stata reshape long stub, i(i) j(variable) string
tidyr gather(dfwide, variable, value, starts_with("stub"))

From long to wide

dfLong --> dfWide
id variable value
1 stub1     4.1
1 stub2     4.5
2 stub1     3.3
2 stub2     3.0
-->  id stub1 stub2
1  4.1    4.5
2  4.5    3.0
Stata reshape long stub, i(id) j(variable) string
rename stub value
tidyr gather(dfwide, variable, value, starts_with("stub"))
The names in red specify the name of new variables. While `reshape long` requires to specify both the variable to reshape and the key variables, `gather` only requires to specify the variables to reshape - the key variable are inferred to be the remaining ones.
Stata reshape wide value, i(i) j(variable) string
tidyr spread(dflong, variable, value)