Split Apply Combine | R for Stata Users

The dplyr verbs filter, mutate and summarize can be applied on a grouped data.frame.

Filter within groups

Filter based on a logical condition

Stata	egen temp = max(v1), by(id) keep if v1 == temp
dplyr	df %>% group_by(id) %>% filter(v1 == max(v1))

Filter based on relative row numbers

Stata	bys id: keep if _n == 1
dplyr	df %>% group_by(id) %>% filter(row_number() == 1)

Filter the 2 observation with lowest v1 for each group defined by id

Stata	bys id (v1): keep if _n <= 2
dplyr	df %>% group_by(id) %>% filter(row_number(v2) <= 2)

Apply functions within groups

Compared to Stata, any function that is defined on vectors can be applied within group.

Modify variables within groups:

Stata	egen v1mean = mean(v1), by(id)
dplyr	df %>% group_by(id) %>% mutate(v1mean = mean(v1))

Replace by first observation within group:

Stata	bys id : replace v1 = v1[1]
dplyr	df %>% group_by(id) %>% mutate(v1 = v1[1])

Collapse observations within groups:

Stata	collapse (mean) v1 (sd) v2, by(id)
dplyr	df %>% group_by(id) %>% summarize(mean(v1), sd(v2))

Return distinct rows

Return a dataset that contains distinct values taken by a set of variables

Stata	duplicates drop v1 v2, force
dplyr	distinct(df, v1, v2)

Count the number of distinct values taken by a set of variables

Stata	distinct v1 v2, joint
dplyr	df %>% group_by(v1, v2) %>% n_groups()

Panel data

The package statar includes functions that make working with unbalanced panel data easier. The examples will refer to the following data frame

df <- tibble(
    id    = c(1, 1, 1, 1, 1, 2, 2),
    date  = c(1992, 1989, 1991, 1990, 1994, 1992, 1991),
    value = c(4.1, NA , 3.3, 5.3, 3.0, 3.2, 5.2)
)

Lag variables

The functions lead/lag accept three arguments: the fist argument is the vector of values to lag, the second argument is the number of lags, the third argument corresponds to the time vector.

To create a lagged variable based on the previous row, use the function lag/lead from dplyr

Stata	by id : gen value_l = value[_n-1]
statar	df %>% group_by(id) %>% mutate(value_l = lag(value, n = 1, order_by = date))

To create a lagged variable based on the previous date , use the function tlag/tlead from statar

Stata	tsset id date value_l = L.value
statar	df %>% group_by(id) %>% mutate(value_l = tlag(value, n = 1, date))

lag and tlag differ when the previous date is missing. In this case, the function lag returns the value in the most recent date while the function tlag returns a missing value.

From unbalanced to balanced panel

To make your unbalanced panel balanced ( i.e. to add rows corresponding to gaps in dates), use a combination of the functions complete and full_seq in tidyr

Stata	tsset id date tsfill
tidyr	complete(df, id, year = full_seq(year, 1))

R for Stata users