Make sure your version of R is newer than 3.1.0, and install the following package:

```
install.packages("dplyr")
```

The structure that corresponds most closely to a Stata dataset is a `tibble`.

```
library(dplyr)  # provides tibble() and the %>% pipe

N <- 100
df <- tibble(
  id = sample(c("id01", "id02", "id03"), N, TRUE),
  v1 = sample(5, N, TRUE),
  v2 = sample(round(runif(N, max = 100), 4), N, TRUE)
)
```

To select a few columns from a dataset:

Stata | keep id v1 |

dplyr | df %>% select(id, v1) |

In Stata, wildcards let you select multiple variables at once. In dplyr, helper functions give very similar results:

Stata | keep v* |

dplyr | select(df, starts_with("v")) |

This table lists the helper functions:

Stata | dplyr |
---|---|
`keep v*` | `select(df, starts_with("v"))` |
`keep *v` | `select(df, ends_with("v"))` |
`keep *v*` | `select(df, contains("v"))` |
`keep v?` | `select(df, matches("^v.$"))` |
`keep *` | `select(df, everything())` |
`drop v1` | `select(df, -v1)` |
`keep id-v2` | `select(df, id:v2)` |
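As a quick check of the helpers listed above, here is a minimal sketch on a small hypothetical tibble (`toy` and its column names are made up for illustration). It only inspects which column names each call keeps.

```r
library(dplyr)

# Hypothetical toy data; column names chosen only to exercise the helpers
toy <- tibble(id = c("a", "b"), v1 = 1:2, v2 = 3:4, rev = 5:6)

starts <- names(select(toy, starts_with("v")))  # v1, v2
ends   <- names(select(toy, ends_with("v")))    # rev
has_v  <- names(select(toy, contains("v")))     # v1, v2, rev
v_one  <- names(select(toy, matches("^v.$")))   # v1, v2
no_v1  <- names(select(toy, -v1))               # id, v2, rev
ranged <- names(select(toy, id:v2))             # id, v1, v2
```

Note that `contains("v")` also picks up `rev`, just as `keep *v*` would in Stata.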

To rename columns:

Stata | rename id id1 |

dplyr | df %>% rename(id1 = id) |

To reorder columns:

Stata | order v1 |

dplyr | df %>% select(v1, everything()) |

To create new columns:

Stata | gen new = 1 |

dplyr | df %>% mutate(new = 1) |

To create or modify a column using a function of other columns:

Stata | egen cov = cov(v1, v2) |

dplyr | df %>% mutate(cov = cov(v1, v2)) |

To modify only certain rows of a column:

Stata | replace v1 = 0 if id == "id01" |

dplyr | df %>% mutate(v1 = ifelse(id == "id01", 0, v1)) |

To apply the same function to multiple columns, use `across`:

Stata | tostring v1 v2, replace force |

dplyr | df %>% mutate(across(c(v1, v2), as.character)) |

The syntax for collapsing a dataset is very similar to the syntax for modifying columns: just use `summarize` instead of `mutate`.

To return a dataset composed of summary statistics computed over multiple rows:

Stata | collapse (mean) v1 (sd) v2 |

dplyr | df %>% summarize(mean(v1, na.rm = TRUE), sd(v2, na.rm = TRUE)) |

To apply each function to multiple variables:

Stata | collapse (mean) v* (sd) v* |

dplyr | df %>% summarize(across(starts_with("v"), list(~mean(., na.rm = TRUE), ~sd(., na.rm = TRUE)))) |
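When several functions are passed to `across`, naming the list entries makes the output columns easier to read. The sketch below assumes a small hypothetical tibble `toy`; the names `mean` and `sd` given to the list entries are an illustrative choice, not something the call requires.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(v1 = c(1, 2, 3), v2 = c(2, 4, NA))

out <- toy %>%
  summarize(across(
    starts_with("v"),
    # Named list entries become suffixes of the output column names
    list(mean = ~mean(.x, na.rm = TRUE), sd = ~sd(.x, na.rm = TRUE))
  ))

names(out)  # v1_mean, v1_sd, v2_mean, v2_sd
```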

Unlike Stata's `collapse`, these commands do not overwrite the existing dataset: they return a new one, which you must assign to a name if you want to keep it.
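A minimal sketch of this difference, using a hypothetical tibble `toy` and a hypothetical result name `stats`:

```r
library(dplyr)

toy <- tibble(v1 = c(1, 2, 3))

# summarize() returns a new one-row tibble; toy itself is left untouched
stats <- toy %>% summarize(mean_v1 = mean(v1))

nrow(toy)    # still 3 rows
nrow(stats)  # 1 row
```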

You can filter rows using logical conditions:

Stata | keep if v1 >= 2 |

dplyr | df %>% filter(v1 >= 2) |

You can also filter rows based on their position:

Stata | keep if _n <= 100 |

dplyr | df %>% filter(row_number() <= 100) |

The equivalent of Stata's `inlist` is `%in%`:

Stata | keep if inlist(id, "id01", "id02") |

dplyr | df %>% filter(id %in% c("id01", "id02")) |

The equivalent of Stata's `inrange` is `between`:

Stata | keep if inrange(v1, 3, 5) |

dplyr | df %>% filter(between(v1, 3, 5)) |
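Like `inrange`, `between` includes both endpoints. The sketch below, on a hypothetical tibble `toy`, compares it with the explicit pair of comparisons:

```r
library(dplyr)

toy <- tibble(v1 = 1:6)

# between(x, left, right) is shorthand for x >= left & x <= right
a <- toy %>% filter(between(v1, 3, 5))
b <- toy %>% filter(v1 >= 3 & v1 <= 5)

a$v1                     # 3, 4, 5
identical(a$v1, b$v1)    # TRUE
```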

In Stata, missing values behave like `+Inf`. In R, missing values (`NA`) are special values that represent epistemic uncertainty: an operation involving `NA` returns `NA` whenever its result cannot be determined.

```
NA + 1
#> [1] NA
TRUE | NA
#> [1] TRUE
```

Use `is.na` to test for missing values:

```
1 == NA
#> [1] NA
is.na(NA)
#> [1] TRUE
```

In Stata, the empty string `""` is a missing value. This is not true in R:

```
is.na("")
#> [1] FALSE
```
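If you want empty strings to be treated as missing, as Stata does, one possible approach is dplyr's `na_if`, which replaces a given value with `NA`. The vector `s` below is hypothetical data for illustration.

```r
library(dplyr)

# Hypothetical character vector with one empty string
s <- c("a", "", "b")

# na_if() replaces every element equal to "" with NA
s_clean <- na_if(s, "")

is.na(s_clean)  # FALSE, TRUE, FALSE
```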

To drop rows where `y` is missing:

```
df <- tibble(y = c(1, 2, 3, 4, 5, NA), x = c(3, 1, NA, 4, 6, 4))
df %>% filter(!is.na(y))
```

`filter(df, condition)` keeps only the rows where the condition evaluates to TRUE. In particular, rows for which the condition evaluates to NA are dropped. Contrast the following behavior with Stata, where both conditions would also keep the row with the missing value:

```
df <- tibble(x = c(1, 2, NA))
df
#> x
#> 1 1
#> 2 2
#> 3 NA
filter(df, x >= 2)
#> x
#> 1 2
filter(df, !(x == 1))
#> x
#> 1 2
```
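To reproduce Stata's behavior, where `keep if x >= 2` retains missing values, the missing rows have to be kept explicitly. A minimal sketch (the result name `kept` is hypothetical):

```r
library(dplyr)

df <- tibble(x = c(1, 2, NA))

# Keep rows where the condition is TRUE or where x is missing,
# mimicking Stata, which treats missing as larger than any number
kept <- filter(df, x >= 2 | is.na(x))

kept$x  # 2 and NA
```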

To sort rows:

Stata | sort id v1 |

dplyr | arrange(df, id, v1) |

Missing values are sorted last, like in Stata.
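The sketch below illustrates this on a hypothetical tibble `toy`, and also shows that `arrange` keeps missing values at the end when sorting in descending order with `desc`:

```r
library(dplyr)

toy <- tibble(x = c(2, NA, 1))

asc <- arrange(toy, x)$x         # 1, 2, NA
dsc <- arrange(toy, desc(x))$x   # 2, 1, NA (NA stays last)
```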