R for Stata users

Join datasets

The packages dplyr uses the SQL concepts of left, right, full, inner, semi, anti and cross joins. The following table gives the correspondence between the Stata commands merge and joinby with the dplyr functions:

Stata merge v1, keep(master matched using)
dplyr full_join(df1, df2, by = "v1")
Stata merge v1, keep(master matched)
dplyr left_join(df1, df2, by = "v1")
Stata merge v1, keep(matched using)
dplyr right_join(df1, df2, by = "v1")
Stata merge v1, keep(matched)
dplyr inner_join(df1, df2, by = "v1")
Stata merge v1, keep(matched) keepusing(v1)
dplyr semi_join(df1, df2, by = "v1")
Stata merge v1, keep(master) keepusing(v1)
dplyr anti_join(df1, df2)
Stata crossby
dplyr cross_join(df1, df2)

Wen there are multiple matches both in the master and using datasets, these functions create all combinations of rows, similarly to Stata joinby (and contrary to Stata merge m:m).


Append datasets

To append one dataset to another:

Stata append using "using.dta"
dplyr row_binds(df1, df2)

When string columns are append too numeric columns, Stata returns an error (and with the option force, it replaces the using dataset to missing values). In contrast, in R, columns with different types are coerced to their higher type (logical < strings < int < real < character).


Reshape datasets

The package tidyr includes the functions pivot_longer and pivot_wider that alternate between wide and long datasets.


From wide to long

dfwide –> dflong
 id stub1 stub2
1  4.1    4.5
2  4.5    3.0
–> id variable value
1 stub1     4.1
1 stub2     4.5
2 stub1     3.3
2 stub2     3.0
Stata reshape long stub, i(id) j(variable) string
tidyr pivot_longer(dfwide, starts_with("stub"), names_to = "variable", values_to = "value")

The names in red specify the name of new variables. While reshape long requires to specify both the variable to reshape and the key variables, picot_longer only requires to specify the variables to reshape - the key variable are inferred to be the remaining ones.


From long to wide

dflong –> dfwide
id variable value
1 stub1     4.1
1 stub2     4.5
2 stub1     3.3
2 stub2     3.0
–>  id stub1 stub2
1  4.1    4.5
2  4.5    3.0
Stata reshape wide value, i(i) j(variable) string
tidyr pivot_wider(dflong, names_from = variable, values_from = value)