The packages `dplyr`

uses the SQL concepts of left, right, full, inner, semi, anti and cross joins. The following table gives the correspondence between the Stata commands `merge`

and `joinby`

with the dplyr `functions`

:

Stata | merge v1, keep(master matched using) |

dplyr | full_join(df1, df2, by = "v1") |

Base R | merge(df1, df2, by = "v1", all.x = TRUE, all.y = TRUE) |

Stata | merge v1, keep(master matched) |

dplyr | left_join(df1, df2, by = "v1") |

Base R | merge(df1, df2, by = "v1", all.x = TRUE, all.y = FALSE) |

Stata | merge v1, keep(matched using) |

dplyr | right_join(df1, df2, by = "v1") |

Base R | merge(df1, df2, by = "v1", all.x = FALSE, all.y = TRUE) |

Stata | merge v1, keep(matched) |

dplyr | inner_join(df1, df2, by = "v1") |

Base R | merge(df1, df2, by = "v1", all.x = FALSE, all.y = FALSE) |

There are two important differences with Stata `merge`

- When the option
`by`

is not specified, merges are based on variables with common names - Wen there are multiple matches both in the master and using datasets, R functions create all combinations of rows, similarly to
`joinby`

(and contrary to`merge m:m`

).

A semi join keeps rows with a match:

Stata | merge v1, keep(matched) keepusing(v1) |

dplyr | semi_join(df1, df2, by = "v1") |

An anti join deletes rows with a match:

Stata | merge v1, keep(master) keepusing(v1) |

dplyr | anti_join(df1, df2) |

The last join is the cartesian productiopn (the only one without key variable)

Stata | crossby |

dplyr | cross_join(df1, df2) |

To `append`

one dataset to another:

Stata | append using "using.dta" |

dplyr | row_binds(df1, df2) |

When string columns are append too numeric columns, Stata returns an error (and with the option `force`

, it replaces the using dataset to missing values).
In contrast, in R, columns with different types are coerced to their higher type (logical < strings < int < real < character).

The package `tidyr`

includes the functions `gather`

and `spread`

that alternate between wide and long datasets.

From wide to long

dfWide | --> | dfLong |
---|---|---|

i stub1 stub2 1 4.1 4.5 2 4.5 3.0 |
--> | i variable value 1 stub1 4.1 1 stub2 4.5 2 stub1 3.3 2 stub2 3.0 |

Stata | reshape long stub, i(i) j(variable) string |

tidyr | gather(dfwide, variable, value, starts_with("stub")) |

From long to wide

dfLong | --> | dfWide |
---|---|---|

id variable value 1 stub1 4.1 1 stub2 4.5 2 stub1 3.3 2 stub2 3.0 |
--> | id stub1 stub2 1 4.1 4.5 2 4.5 3.0 |

Stata | reshape long stub, i(id) j(variable) string rename stub value |

tidyr | gather(dfwide, variable, value, starts_with("stub")) |

Stata | reshape wide value, i(i) j(variable) string |

tidyr | spread(dflong, variable, value) |