Select and Sort | R for Stata Users

Manipulate datasets in R

Make sure your version of R is newer than 3.1.0, then install the following package:

install.packages("dplyr")

The structure that corresponds most closely to a Stata dataset is a tibble.

library(dplyr)  # provides tibble() and the verbs used below

N <- 100
df <- tibble(
  id = sample(c("id01", "id02", "id03"), N, TRUE),
  v1 = sample(5, N, TRUE),
  v2 = sample(round(runif(100, max = 100), 4), N, TRUE)
)

Select columns

To select a few columns from a dataset:

Stata	keep id v1
dplyr	df %>% select(id, v1)

In Stata, wildcards let you select multiple variables at once. In dplyr, helper functions give very similar results:

Stata	keep v*
dplyr	select(df, starts_with("v"))

This table gives the list of helper functions:

Stata	dplyr
keep v*	select(df, starts_with("v"))
keep *v	select(df, ends_with("v"))
keep *v*	select(df, contains("v"))
keep v?	select(df, matches("^v.$"))
keep *	select(df, everything())
drop v1	select(df, -v1)
keep id-v2	select(df, id:v2)
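
As a quick check of the helpers in the table, here is a minimal sketch on a toy tibble. The column names (in particular `wv`) are made up for illustration and are not part of the example dataset above.

```r
library(dplyr)

# Hypothetical toy data: names chosen only to exercise each helper
toy <- tibble(id = c("a", "b"), v1 = 1:2, v2 = c(0.5, 1.5), wv = c(TRUE, FALSE))

starts   <- names(select(toy, starts_with("v")))   # "v1" "v2"
ends     <- names(select(toy, ends_with("v")))     # "wv"
has_v    <- names(select(toy, contains("v")))      # "v1" "v2" "wv"
two_char <- names(select(toy, matches("^v.$")))    # "v1" "v2"
dropped  <- names(select(toy, -v1))                # "id" "v2" "wv"
ranged   <- names(select(toy, id:v2))              # "id" "v1" "v2"
```

Note that `matches()` takes a regular expression, so the Stata pattern `v?` becomes `^v.$`.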

Modify columns

To rename columns:

Stata	rename id id1
dplyr	df %>% rename(id1 = id)

To reorder columns:

Stata	order v1
dplyr	df %>% select(v1, everything())
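
Recent versions of dplyr (1.0.0 and later) also provide `relocate()`, which may read closer to Stata's `order`. The sketch below uses a small toy tibble whose column names mirror the example dataset; treat it as an illustration rather than the only way to reorder columns.

```r
library(dplyr)

# Toy data with the same column names as the example dataset
toy <- tibble(id = c("id01", "id02"), v1 = c(1, 2), v2 = c(0.1, 0.2))

# By default, relocate() moves the named columns to the front
front <- toy %>% relocate(v1)
names(front)
#> [1] "v1" "id" "v2"

# .after (or .before) places a column relative to another one
moved <- toy %>% relocate(id, .after = v2)
names(moved)
#> [1] "v1" "v2" "id"
```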

To create new columns:

Stata	gen new = 1
dplyr	df %>% mutate(new = 1)

To create a column from a statistic computed over all rows (the single value is repeated on every row):

Stata	egen cov = cov(v1, v2)
dplyr	df %>% mutate(cov = cov(v1, v2))

To modify only certain rows of a column:

Stata	replace v1 = 0 if id == "id01"
dplyr	df %>% mutate(v1 = ifelse(id == "id01", 0, v1))
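
One difference worth checking when translating `replace ... if`: when `id` is missing, `id == "id01"` evaluates to NA in R, so `ifelse` returns NA for that row instead of leaving `v1` unchanged. The sketch below, on a made-up three-row tibble, shows the effect and one possible workaround using `%in%`, which never returns NA.

```r
library(dplyr)

# Toy data with a missing id (values are made up for illustration)
toy <- tibble(id = c("id01", "id02", NA), v1 = c(5, 6, 7))

# Direct translation: the row with a missing id becomes NA
naive <- toy %>% mutate(v1 = ifelse(id == "id01", 0, v1))
naive$v1
#> [1]  0  6 NA

# %in% returns FALSE for NA, so the third row keeps its value
safe <- toy %>% mutate(v1 = ifelse(id %in% "id01", 0, v1))
safe$v1
#> [1] 0 6 7
```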

To apply the same function to multiple columns, use across:

Stata	tostring v1 v2, replace force
dplyr	df %>% mutate(across(c(v1, v2), as.character))
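
By default `across` overwrites the columns it is applied to, much like `tostring ..., replace`. If you would rather keep the originals, the `.names` argument of `across` can write the results to new columns. A minimal sketch on toy data follows; the `_str` suffix is an arbitrary choice.

```r
library(dplyr)

toy <- tibble(v1 = c(1, 2), v2 = c(0.5, 1.5))

# {.col} is replaced by the name of each input column
out <- toy %>%
  mutate(across(c(v1, v2), as.character, .names = "{.col}_str"))

names(out)
#> [1] "v1"     "v2"     "v1_str" "v2_str"
```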

Collapse datasets

The syntax for collapsing a dataset is very similar to the syntax for modifying columns: just use summarize instead of mutate.

To return a dataset composed of summary statistics computed over multiple rows:

Stata	collapse (mean) v1 (sd) v2
dplyr	df %>% summarize(mean(v1, na.rm = TRUE), sd(v2, na.rm = TRUE))

To apply each function to multiple variables:

Stata	collapse (mean) v* (sd) v*
dplyr	df %>% summarize(across(starts_with("v"), list(~mean(., na.rm = TRUE), ~sd(., na.rm = TRUE))))
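
When the list of functions is unnamed, the output columns get numeric suffixes. Naming the list elements is one way to get more readable column names. The sketch below uses toy data, and the labels `mean` and `sd` are simply the names chosen for the list elements.

```r
library(dplyr)

toy <- tibble(id = c("a", "b", "c"), v1 = c(1, 2, 3), v2 = c(2, 4, 6))

stats <- toy %>%
  summarize(across(
    starts_with("v"),
    list(mean = ~mean(.x, na.rm = TRUE), sd = ~sd(.x, na.rm = TRUE))
  ))

# Output columns are named <column>_<function label>
names(stats)
#> [1] "v1_mean" "v1_sd"   "v2_mean" "v2_sd"
```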

Unlike in Stata, these commands don’t overwrite the existing dataset: they return a new one, which you must assign to a name if you want to keep it.
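
A small sketch of what this means in practice, on toy data (the name `collapsed` is arbitrary):

```r
library(dplyr)

toy <- tibble(v1 = c(1, 2, 3), v2 = c(2, 4, 6))

# The result is stored under a new name; toy itself is left untouched
collapsed <- toy %>% summarize(v1 = mean(v1), v2 = sd(v2))

nrow(collapsed)
#> [1] 1
nrow(toy)
#> [1] 3
```

To mimic Stata's in-place behavior, assign the result back to the original name instead.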

Filter rows

You can filter rows using logical conditions:

Stata	keep if v1 >= 2
dplyr	df %>% filter(v1 >= 2)

You can also filter rows based on their position:

Stata	keep if _n <= 100
dplyr	df %>% filter(row_number() <= 100)

The equivalent of Stata's inlist is %in%:

Stata	keep if inlist(id, "id01", "id02")
dplyr	df %>% filter(id %in% c("id01", "id02"))

The equivalent of Stata's inrange is between:

Stata	keep if inrange(v1, 3, 5)
dplyr	df %>% filter(between(v1, 3, 5))
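
Like inrange, between includes both endpoints. A minimal check on toy data:

```r
library(dplyr)

toy <- tibble(v1 = c(1, 2, 3, 4, 5, 6))

# between(x, left, right) is shorthand for x >= left & x <= right
kept <- toy %>% filter(between(v1, 3, 5))
same <- toy %>% filter(v1 >= 3 & v1 <= 5)

kept$v1
#> [1] 3 4 5
```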

Filter non-missing values

In Stata, missing values behave like +Inf. In R, a missing value (NA) is a special value that represents epistemic uncertainty: an operation involving NA returns NA whenever its result cannot be determined.

NA + 1
#> [1] NA
TRUE | NA
#> [1] TRUE
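
A few more cases may help make the rule concrete. This is only an illustrative sketch in base R; the variable names are arbitrary.

```r
# Comparisons with NA cannot be determined, so they return NA
cmp <- NA > 1                  # NA

# Logical operators return a value when the other operand settles the result
and_false <- NA & FALSE        # FALSE, whatever the missing value is
and_true  <- NA & TRUE         # NA, the result depends on the missing value
or_true   <- NA | TRUE         # TRUE

# Summary functions propagate NA unless told to drop missing values
m_all  <- mean(c(1, 2, NA))                # NA
m_drop <- mean(c(1, 2, NA), na.rm = TRUE)  # 1.5
```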

Use is.na to test for missing values; a comparison with == returns NA rather than TRUE or FALSE:

1 == NA
#> [1] NA
is.na(NA)
#> [1] TRUE

In Stata, the empty string "" is a missing value. This is not true in R:

is.na("") 
#> [1] FALSE

To keep only the rows where y is not missing:

df <- tibble(y = c(1, 2, 3, 4, 5, NA), x = c(3, 1, NA, 4, 6, 4))
df %>% filter(!is.na(y))

filter(df, condition) keeps only the rows where the condition evaluates to TRUE. In particular, rows where the condition evaluates to NA are dropped. Contrast the following behavior with Stata, where both conditions would also keep the row with the missing value:

df <- tibble(x = c(1, 2, NA))
df
#>    x
#> 1  1
#> 2  2
#> 3 NA
filter(df, x >= 2)
#>    x
#> 1  2
filter(df, !(x == 1))
#>    x
#> 1  2
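
If you want the Stata outcome, where the missing row is kept, the condition has to say so explicitly. One possible way to write it, sketched on the same three-row example (the names `keep_ge2` and `keep_not1` are arbitrary):

```r
library(dplyr)

toy <- tibble(x = c(1, 2, NA))

# Keep rows where x >= 2 or x is missing
keep_ge2 <- filter(toy, x >= 2 | is.na(x))
keep_ge2$x
#> [1]  2 NA

# %in% returns FALSE for NA, so negating it keeps the missing row
keep_not1 <- filter(toy, !(x %in% 1))
keep_not1$x
#> [1]  2 NA
```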

Sort rows

To sort rows:

Stata	sort id v1
dplyr	arrange(df, id, v1)

Missing values are sorted last, like in Stata.
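
In dplyr, missing values stay at the end even when sorting in descending order with `desc()`. The sketch below, on toy data, shows this and one possible way to put missing values first by sorting on `is.na()`.

```r
library(dplyr)

toy <- tibble(x = c(2, NA, 1))

asc <- arrange(toy, x)$x
asc
#> [1]  1  2 NA

# desc() reverses the order of the non-missing values only
dsc <- arrange(toy, desc(x))$x
dsc
#> [1]  2  1 NA

# Sort on the missingness indicator first to bring NA to the top
na_first <- arrange(toy, desc(is.na(x)), x)$x
na_first
#> [1] NA  1  2
```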
