R for Stata users

Manipulate datasets in R

Make sure you have version of R > 3.1.0 and install the following package:

install.packages("dplyr")

The structure that corresponds the most to a Stata datase is a data_frame.

N <- 100
df <- data_frame(
  id = sample(c("id01", "id02", "id03"), N, TRUE),     
  v1 = sample(5, N, TRUE),                          
  v2 = sample(round(runif(100, max = 100), 4), N, TRUE) 
)

Select columns

To select a few columns from a dataset:

Stata keep id v1
dplyr df %>% select(id, v1)

Memory

Contrary to Stata, R returns a new dataset without destroying the existing one.
This does not always require more memory: when subsetting columns, the new dataset is a shallow copy of the existing one - at least until the new dataset is modified.

In Stata, wildcards allow to select multiple variables. In dplyr, helper functions allow very similar results:

Stata keep v*
dplyr select(df, starts_with(v))

This table gives the list of helper functions:

Stata dplyr
keep v* select(df, starts_with("v"))
keep *v select(df, ends_with("v"))
keep *v* select(df, contains("v"))
keep v? select(df, matches("^v.$"))
keep * select(df, everything())
drop v1 select(df, -v1)
keep id-v2 select(df, id:v2)

Modify columns

To rename columns

Stata rename id id1
dplyr df %>% rename(id1 = id)

To reorder columns,

Stata order v1
dplyr df %>% select(DT, v1, everything())

To create new columns

Stata gen new = 1
dplyr df %>% mutate(new = 1)

To modify a column

Stata egen cov = cov(v1, v2)
dplyr df %>% mutate(cov = cov(v1, v2))

To modify only certain rows of a column:

Stata replace v1 = 0 if id =="id01"
dplyr df %>% mutate(v1 = ifelse(id == "id01", 0, v1))

In Stata, the command replace v1 = v1[_n-1] replaces the variable v recursively. In R, the same behavior is obtained via Reduce.

Stata replace v1 = v1[n-1]
dplyr df %>% mutate(v1 = Reduce(sum, v1, accumulate = TRUE))

Modify multiple columns

To apply the same function to multiple columns :

Stata tostring v1 v2, replace force
dplyr df %>% mutate_at(vars(v1, v2), funs(as.character))

funs allows to apply multiple functions

   df %>% mutate_at(vars(v1, v2), funs(as.character, mean))

funs also allows to apply anonymous functions

   df %>% summarize_at(var(v1, v2), funs(quantile(., probs = 0.5, na.rm = TRUE)))

Memory

When replacing every variable in the dataset, dplyr requires twice the amount of memory compared to data.table since a whole new dataset is temporarly created. In case your dataset is very large, mutate one variable at a timer rather than using mutate_at


Collapse datasets

The syntax for collapsing dataset is very similar to the syntax for modifying columns : just use summarize instead of mutate To return a dataset composed of summary statistics computed over multiple rows :

Stata collapse (mean) v1 (sd) v2
dplyr df %>% summarize(mean(v1), sd(v2))
Stata collapse (mean) * (sd) *
dplyr df %>% summarize_all(funs(mean, sd))

Compared to Stata, these commands don't overwrite the existing dataset.

Select rows

Similarly to Stata in, you can select rows based on their position:

Stata keep in 1/100
dplyr slice(1:100)

Similarly to Stata if, you can select rows using logical conditions

Stata keep if v1 >= 2
dplyr df %>% filter(v1 >= 2)

The equivalent of Stata inlist is %in%

Stata keep if inlist(id, "id01", "id02")
dplyr df %>% filter(id %in% c("id01", "id02"))

The equivalent of Stata inrange is between

Stata keep if inrange(v1, 3, 5)
dplyr d df %>% filter(between(v2, 3, 5))
Stata egen temp = max(v1)
keep if v1 == temp
dplyr df %>% filter(v1 == max(v1))

Memory

When subsetting a dataset wrt rows, R returns a new dataset without destroying the existing one. This means memory is required both for the existing and the new dataset. This contrasts with column subsetting, which only creates shallow copies.

Select non-missing values

In Stata, missing values behave like +Inf. In R, missing values are special values that represents epistemic uncertainty. Operations involving NA return NA when the result of the operation cannot be determined.

  NA + 1
  #> NA
  TRUE | NA
  #> [1] TRUE

Use is.na to test for missing values

  1 == NA
  #> [1] NA 
  is.na(NA)
  #> [1] 1

In Stata, the empty character "" is a missing value. This is not true in R:

  is.na("") 
  #> [1] FALSE

To select rows with a missing observations for y are missing:

  df <- data_frame(y = c(1, 2, 3, 4, 5, NA), x = c(3, 1, NA, 4, 6, 4))
  df %>% filter(!is.na(y))

To select rows with a missing observations for any variable:

  df <- data_frame(y = c(1, 2, 3, 4, 5, NA), x = c(3, 1, NA, 4, 6, 4))
  df %>% filter(complete.cases(df))

filter(df, condition) only selects rows where the condition evaluates to TRUE. In particular, rows that evaluate to NA are dropped. Contrast the following behaviors with Stata

  df <- data.frame(x = c(1, 2, NA))
  #>   v
  #> 1  1
  #> 2  2
  #> 3 NA
  filter(df, x >= 2))
  #>    x
  #> 1  2
  filter(df, !(x == 1))
  #>    x
  #> 1  2

Sort rows

To sort rows

Stata sort id v1
dplyr arrange(df, id, v1)

Missing values are sorted last, like in Stata.

Memory

When sorting a dataset, dplyr returns a new dataset without destroying the existing one. This means memory is required both for the existing and the new dataset.