Make sure you have a version of R > 3.1.0, then install the following package:
install.packages("dplyr")
The structure that corresponds most closely to a Stata dataset is a tibble.
library(dplyr)
N <- 100
df <- tibble(
id = sample(c("id01", "id02", "id03"), N, TRUE),
v1 = sample(5, N, TRUE),
v2 = sample(round(runif(100, max = 100), 4), N, TRUE)
)
To select a few columns from a dataset:
| Stata | keep id v1 |
| dplyr | df %>% select(id, v1) |
In Stata, wildcards let you select multiple variables. In dplyr, helper functions give very similar results:
| Stata | keep v* |
| dplyr | select(df, starts_with("v")) |
This table gives the list of helper functions:
| Stata | dplyr |
|---|---|
| keep v* | select(df, starts_with("v")) |
| keep *v | select(df, ends_with("v")) |
| keep *v* | select(df, contains("v")) |
| keep v? | select(df, matches("^v.$")) |
| keep * | select(df, everything()) |
| drop v1 | select(df, -v1) |
| keep id-v2 | select(df, id:v2) |
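The helpers in the table can be checked on a small example. The following is a minimal sketch, assuming a hypothetical tibble called toy (not the df defined earlier) whose column names were chosen so that the helpers give different results:

```r
library(dplyr)

# Hypothetical toy data; "wave" contains a "v" but does not start with one
toy <- tibble(id = c("a", "b"), v1 = 1:2, v2 = c(0.5, 1.5), wave = c(1L, 2L))

starts   <- names(select(toy, starts_with("v")))  # "v1" "v2"
contain  <- names(select(toy, contains("v")))     # "v1" "v2" "wave"
matched  <- names(select(toy, matches("^v.$")))   # "v1" "v2"
dropped  <- names(select(toy, -v1))               # "id" "v2" "wave"
in_range <- names(select(toy, id:v2))             # "id" "v1" "v2"
```

Note that id:v2 selects by column position, so the result depends on the order of the columns in the tibble.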
To rename columns:
| Stata | rename id id1 |
| dplyr | df %>% rename(id1 = id) |
To reorder columns:
| Stata | order v1 |
| dplyr | df %>% select(v1, everything()) |
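As a quick check of renaming and reordering, here is a minimal sketch on a hypothetical tibble called toy (an assumption for illustration, not the df defined earlier). It also shows relocate(), which is available in dplyr 1.0.0 or later and may read more naturally than select() with everything():

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(id = c("id01", "id02"), v1 = 1:2, v2 = c(0.5, 1.5))

# rename() takes new_name = old_name
renamed <- toy %>% rename(id1 = id)
names(renamed)    # "id1" "v1" "v2"

# Put v1 first, then every remaining column in its original order
reordered <- toy %>% select(v1, everything())
names(reordered)  # "v1" "id" "v2"

# relocate() moves the named columns to the front by default
relocated <- toy %>% relocate(v1)
names(relocated)  # "v1" "id" "v2"
```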
To create new columns:
| Stata | gen new = 1 |
| dplyr | df %>% mutate(new = 1) |
To modify a column:
| Stata | egen cov = cov(v1, v2) |
| dplyr | df %>% mutate(cov = cov(v1, v2)) |
To modify only certain rows of a column:
| Stata | replace v1 = 0 if id =="id01" |
| dplyr | df %>% mutate(v1 = ifelse(id == "id01", 0, v1)) |
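The conditional replacement can be tried on a small example. This is a minimal sketch on a hypothetical tibble called toy; it also shows dplyr's if_else(), a stricter variant of base ifelse() that checks that both branches have compatible types (v1 is stored as a double here so that it matches the replacement value 0):

```r
library(dplyr)

# Hypothetical toy data; v1 is a double so both branches share a type
toy <- tibble(id = c("id01", "id02", "id01"), v1 = c(3, 4, 5))

# Base R ifelse(), as in the table above
out_base <- toy %>% mutate(v1 = ifelse(id == "id01", 0, v1))

# dplyr if_else(), which checks the types of the two branches
out_strict <- toy %>% mutate(v1 = if_else(id == "id01", 0, v1))

out_base$v1    # 0 4 0
out_strict$v1  # 0 4 0
```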
To apply the same function to multiple columns, use across:
| Stata | tostring v1 v2, replace force |
| dplyr | df %>% mutate(across(c(v1, v2), as.character)) |
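To see what across does to the column types, here is a minimal sketch on a hypothetical one-row tibble called toy (an assumption for illustration):

```r
library(dplyr)

# Hypothetical toy data with one integer and one double column
toy <- tibble(id = "id01", v1 = 1L, v2 = 2.5)

# Convert v1 and v2 to character; id is left untouched
out <- toy %>% mutate(across(c(v1, v2), as.character))

classes <- sapply(out, class)
classes  # id, v1 and v2 are all "character"
out$v2   # "2.5"
```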
The syntax for collapsing a dataset is very similar to the syntax for modifying columns: just use summarize instead of mutate.
To return a dataset composed of summary statistics computed over multiple rows:
| Stata | collapse (mean) v1 (sd) v2 |
| dplyr | df %>% summarize(mean(v1, na.rm = TRUE), sd(v2, na.rm = TRUE)) |
To apply each function to multiple variables:
| Stata | collapse (mean) v* (sd) v* |
| dplyr | df %>% summarize(across(starts_with("v"), list(~mean(., na.rm = TRUE), ~sd(., na.rm = TRUE)))) |
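When the functions are passed as an unnamed list, the output columns are numbered (v1_1, v1_2, and so on). Naming the list elements gives more readable column names. The following is a minimal sketch on a hypothetical tibble called toy; the names mean and sd are chosen here for illustration:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(v1 = c(1, 2, 3), v2 = c(2, 4, 6))

# A named list of functions yields columns named <column>_<function name>
out <- toy %>%
  summarize(across(
    starts_with("v"),
    list(mean = ~mean(.x, na.rm = TRUE), sd = ~sd(.x, na.rm = TRUE))
  ))

names(out)  # "v1_mean" "v1_sd" "v2_mean" "v2_sd"
```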
Unlike in Stata, these commands don't overwrite the existing dataset. To keep the result, assign it to a name.
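The difference with Stata can be illustrated with a minimal sketch on a hypothetical tibble called toy: calling mutate returns a modified copy, and the original only changes when the result is assigned back.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(v1 = 1:3)

# Returns a modified copy; toy itself is unchanged
toy %>% mutate(new = 1)
names_before <- names(toy)  # "v1"

# Assign the result back to overwrite toy, as Stata would do implicitly
toy <- toy %>% mutate(new = 1)
names_after <- names(toy)   # "v1" "new"
```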
You can filter rows using logical conditions:
| Stata | keep if v1 >= 2 |
| dplyr | df %>% filter(v1 >= 2) |
You can also filter rows based on their position:
| Stata | keep if _n <= 100 |
| dplyr | df %>% filter(row_number() <= 100) |
The equivalent of Stata inlist is %in%:
| Stata | keep if inlist(id, "id01", "id02") |
| dplyr | df %>% filter(id %in% c("id01", "id02")) |
The equivalent of Stata inrange is between:
| Stata | keep if inrange(v1, 3, 5) |
| dplyr | df %>% filter(between(v1, 3, 5)) |
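The filters above can be compared side by side. This is a minimal sketch on a hypothetical tibble called toy; it also shows that between() includes both endpoints, like Stata's inrange:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  id = c("id01", "id02", "id03", "id01"),
  v1 = c(1, 3, 5, 6)
)

by_condition <- filter(toy, v1 >= 2)$v1                     # 3 5 6
by_position  <- filter(toy, row_number() <= 2)$v1           # 1 3
by_list      <- filter(toy, id %in% c("id01", "id02"))$v1   # 1 3 6
by_range     <- filter(toy, between(v1, 3, 5))$v1           # 3 5 (endpoints included)
```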
In Stata, missing values behave like +Inf. In R, missing values (NA) are special values that represent epistemic uncertainty. Operations involving NA return NA when the result of the operation cannot be determined:
NA + 1
#> [1] NA
TRUE | NA
#> [1] TRUE
Use is.na to test for missing values:
1 == NA
#> [1] NA
is.na(NA)
#> [1] TRUE
In Stata, the empty string "" is a missing value. This is not true in R:
is.na("")
#> [1] FALSE
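A few more cases may help build intuition for how NA propagates. This is a minimal sketch in base R (the variable names are chosen here for illustration): a logical operation returns NA only when the missing value could change the answer, and most summary functions accept na.rm = TRUE to ignore missing values.

```r
# Comparisons with NA are undetermined
cmp <- NA > 1                 # NA

# FALSE & anything is FALSE, so the result is determined
and_false <- FALSE & NA       # FALSE

# TRUE & NA depends on the unknown value, so the result is NA
and_true <- TRUE & NA         # NA

# Summary functions propagate NA unless told to remove it
m_all   <- mean(c(1, NA))                # NA
m_clean <- mean(c(1, NA), na.rm = TRUE)  # 1
```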
To filter rows with missing observations for y:
df <- tibble(y = c(1, 2, 3, 4, 5, NA), x = c(3, 1, NA, 4, 6, 4))
df %>% filter(!is.na(y))
filter(df, condition) keeps only the rows where the condition evaluates to TRUE. In particular, rows where the condition evaluates to NA are dropped. Contrast the following behavior with Stata:
df <- tibble(x = c(1, 2, NA))
df
#> x
#> 1 1
#> 2 2
#> 3 NA
filter(df, x >= 2)
#> x
#> 1 2
filter(df, !(x == 1))
#> x
#> 1 2
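In Stata, keep if x >= 2 would also keep the row where x is missing, because missing values compare as larger than any number. To reproduce that behavior in dplyr, the missing case has to be requested explicitly. A minimal sketch, using a hypothetical tibble called toy:

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(x = c(1, 2, NA))

# dplyr default: rows where the condition is NA are dropped
dplyr_default <- filter(toy, x >= 2)$x           # 2

# Stata-like result: keep the NA row explicitly
stata_like <- filter(toy, x >= 2 | is.na(x))$x   # 2 NA
```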
To sort rows
| Stata | sort id v1 |
| dplyr | arrange(df, id, v1) |
Missing values are sorted last, as in Stata.
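The placement of missing values can be checked with a minimal sketch on a hypothetical tibble called toy. Note that arrange keeps NA at the end even when sorting in descending order with desc():

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(x = c(2, NA, 1))

ascending  <- arrange(toy, x)$x        # 1 2 NA
descending <- arrange(toy, desc(x))$x  # 2 1 NA
```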