R for Stata users

Manipulate datasets in R

Make sure you have a version of R > 3.1.0 and install the following package:

install.packages("dplyr")

The structure that corresponds most closely to a Stata dataset is a tibble.

library(dplyr)

N <- 100
df <- tibble(
  id = sample(c("id01", "id02", "id03"), N, TRUE),  # character identifier
  v1 = sample(5, N, TRUE),                          # integers from 1 to 5
  v2 = sample(round(runif(100, max = 100), 4), N, TRUE)  # decimals between 0 and 100
)

Select columns

To select a few columns from a dataset:

Stata keep id v1
dplyr df %>% select(id, v1)

Memory

Contrary to Stata, R returns a new dataset without destroying the existing one.
This does not always require more memory: when subsetting columns, the new dataset is a shallow copy of the existing one - at least until the new dataset is modified.
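
As a small illustration of the point above, the result of `select` has to be assigned to a name to be kept, and the original tibble is left untouched. This is only a sketch: the name `small` and the toy data are hypothetical.

```r
library(dplyr)

# Toy data, made up for this illustration
df <- tibble(id = c("id01", "id02"), v1 = c(1, 2), v2 = c(0.5, 1.5))

# select() returns a new tibble; df itself is not modified
small <- df %>% select(id, v1)

names(small)  # "id" "v1"
names(df)     # "id" "v1" "v2"

# To mimic Stata's in-place `keep`, overwrite the original name
df <- df %>% select(id, v1)
```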

In Stata, wildcards let you select multiple variables. In dplyr, helper functions give very similar results:

Stata keep v*
dplyr select(df, starts_with("v"))

This table gives the list of helper functions:

Stata           dplyr
keep v*         select(df, starts_with("v"))
keep *v         select(df, ends_with("v"))
keep *v*        select(df, contains("v"))
keep v?         select(df, matches("^v.$"))
keep *          select(df, everything())
drop v1         select(df, -v1)
keep id-v2      select(df, id:v2)
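
To see what a few of these helpers return, here is a short sketch on a toy tibble. The data are made up, and the extra column `value` is hypothetical: it is added only to show the difference between `v*` and `v?`.

```r
library(dplyr)

# Toy data; `value` is a hypothetical extra column
df <- tibble(id = c("id01", "id02"), v1 = 1:2, v2 = c(0.5, 1.5), value = c(10, 20))

names(select(df, starts_with("v")))  # "v1" "v2" "value"
names(select(df, matches("^v.$")))   # "v1" "v2"
names(select(df, -v1))               # "id" "v2" "value"
names(select(df, id:v2))             # "id" "v1" "v2"
```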

Modify columns

To rename columns:

Stata rename id id1
dplyr df %>% rename(id1 = id)

To reorder columns:

Stata order v1
dplyr df %>% select(v1, everything())

To create new columns:

Stata gen new = 1
dplyr df %>% mutate(new = 1)

To modify a column:

Stata egen cov = cov(v1, v2)
dplyr df %>% mutate(cov = cov(v1, v2))

To modify only certain rows of a column:

Stata replace v1 = 0 if id == "id01"
dplyr df %>% mutate(v1 = ifelse(id == "id01", 0, v1))
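
When several conditions are involved, nested `ifelse` calls become hard to read. One possible alternative is dplyr's `case_when`, sketched below. The toy data and the replacement values are made up, and `v1` is assumed to be stored as a double so that every branch has the same type.

```r
library(dplyr)

# Toy data; v1 is a double here (an assumption of this sketch)
df <- tibble(id = c("id01", "id02", "id03"), v1 = c(4, 5, 6))

# Plays the role of two successive Stata `replace ... if ...` commands
out <- df %>%
  mutate(v1 = case_when(
    id == "id01" ~ 0,
    id == "id02" ~ -1,
    TRUE         ~ v1   # all remaining rows keep their value
  ))

out$v1  # 0 -1 6
```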

To apply the same function to multiple columns, use across:

Stata tostring v1 v2, replace force
dplyr df %>% mutate(across(c(v1, v2), as.character))

Memory

When replacing every variable in the dataset, `dplyr` requires twice as much memory as data.table, since a whole new dataset is temporarily created. If your dataset is very large, `mutate` one variable at a time rather than using `across`.

Collapse datasets

The syntax for collapsing a dataset is very similar to the syntax for modifying columns: just use summarize instead of mutate. To return a dataset composed of summary statistics computed over multiple rows:

Stata collapse (mean) v1 (sd) v2
dplyr df %>% summarize(mean(v1, na.rm = TRUE), sd(v2, na.rm = TRUE))

To apply each function to multiple variables:

Stata collapse (mean) v* (sd) v*
dplyr df %>% summarize(across(starts_with("v"), list(mean = ~mean(., na.rm = TRUE), sd = ~sd(., na.rm = TRUE))))

Unlike in Stata, these commands don’t overwrite the existing dataset.
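
The sketch below illustrates this on a toy tibble: `summarize` returns a new one-row tibble and leaves the original data in place. The data and the names `stats`, `mean_v1` and `sd_v2` are hypothetical.

```r
library(dplyr)

# Toy data, made up for this illustration
df <- tibble(v1 = c(1, 2, 3, NA), v2 = c(2, 4, 6, 8))

# summarize() returns a new one-row tibble
stats <- df %>%
  summarize(mean_v1 = mean(v1, na.rm = TRUE), sd_v2 = sd(v2, na.rm = TRUE))

stats$mean_v1  # 2
nrow(stats)    # 1
nrow(df)       # 4: the original rows are still there
```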

Filter rows

You can filter rows using logical conditions:

Stata keep if v1 >= 2
dplyr df %>% filter(v1 >= 2)

You can also filter rows based on their position:

Stata keep if _n <= 100
dplyr df %>% filter(row_number() <= 100)

The equivalent of Stata inlist is %in%:

Stata keep if inlist(id, "id01", "id02")
dplyr df %>% filter(id %in% c("id01", "id02"))

The equivalent of Stata inrange is between:

Stata keep if inrange(v1, 3, 5)
dplyr df %>% filter(between(v1, 3, 5))
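
As a quick check of how `between` treats its bounds, the sketch below applies it to a made-up vector. Both bounds are inclusive, and a missing input gives NA.

```r
library(dplyr)

# between(x, left, right) is shorthand for x >= left & x <= right
between(c(2, 3, 5, 6, NA), 3, 5)
#> [1] FALSE  TRUE  TRUE FALSE    NA
```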

Memory

When subsetting a dataset by rows, R returns a new dataset without destroying the existing one. This means memory is required both for the existing and the new dataset. This contrasts with column subsetting, which only creates shallow copies.

Filter non-missing values

In Stata, missing values behave like +Inf. In R, missing values are special values that represent epistemic uncertainty. Operations involving NA return NA when the result of the operation cannot be determined.

NA + 1
#> [1] NA
TRUE | NA
#> [1] TRUE
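
A few more cases, all in base R, may help to see when the result can or cannot be determined:

```r
# The result is FALSE whatever the missing value is, so it is not NA
FALSE & NA
#> [1] FALSE

# The result depends on the missing value, so it is NA
TRUE & NA
#> [1] NA
NA > 1
#> [1] NA

# Summary functions propagate NA unless asked to drop it
mean(c(1, 2, NA))
#> [1] NA
mean(c(1, 2, NA), na.rm = TRUE)
#> [1] 1.5
```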

Use is.na to test for missing values:

1 == NA
#> [1] NA
is.na(NA)
#> [1] TRUE

In Stata, the empty string "" is a missing value. This is not true in R:

is.na("") 
#> [1] FALSE

To keep only the rows where y is not missing:

df <- tibble(y = c(1, 2, 3, 4, 5, NA), x = c(3, 1, NA, 4, 6, 4))
df %>% filter(!is.na(y))

filter(df, condition) keeps only the rows where the condition evaluates to TRUE. In particular, rows where it evaluates to NA are dropped. Contrast the following behaviors with Stata:

df <- tibble(x = c(1, 2, NA))
df
#>    x
#> 1  1
#> 2  2
#> 3 NA
filter(df, x >= 2)
#>    x
#> 1  2
filter(df, !(x == 1))
#>    x
#> 1  2
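
To reproduce the Stata result, where the row with a missing value would be kept, the missing case has to be requested explicitly. A minimal sketch (the name `kept` is hypothetical):

```r
library(dplyr)

df <- tibble(x = c(1, 2, NA))

# Stata's `keep if x >= 2` also keeps the missing row, because missing
# behaves like +Inf; in dplyr that row must be asked for with is.na()
kept <- filter(df, x >= 2 | is.na(x))
kept$x
#> [1]  2 NA
```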

Sort rows

To sort rows:

Stata sort id v1
dplyr arrange(df, id, v1)

Missing values are sorted last, like in Stata.
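
The sketch below shows this on a made-up vector. Note that arrange also keeps missing values last when sorting in descending order with `desc`.

```r
library(dplyr)

# Toy data, made up for this illustration
df <- tibble(x = c(2, NA, 3, 1))

arrange(df, x)$x
#> [1]  1  2  3 NA

# Descending order: NA is still placed last
arrange(df, desc(x))$x
#> [1]  3  2  1 NA
```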

Memory

When sorting a dataset, dplyr returns a new dataset without destroying the existing one. This means memory is required both for the existing and the new dataset.