### R for Stata users

The `dplyr` verbs `select`, `mutate` and `summarize` can be applied on a grouped data.frame.

### Select within groups

Subset based on a logical condition

 Stata egen temp = max(v1), by(id) keep if v1 == temp dplyr df %>% group_by(id) %>% filter(v1 == max(v1))

Subset based on relative row numbers

 Stata bys id: keep if _n == 1 dplyr df %>% group_by(id) %>% slice(1) dplyr df %>% group_by(id) %>% filter(row_number() == 1)

Select the 2 observation with lowest `v1` for each group defined by `id`

 Stata bys id (v1): keep if _n <= 2 dplyr df %>% group_by(id) %>% filter(row_number(v2) <= 2)

### Apply functions within groups

Compared to Stata, any function that is defined on vectors can be applied within group.

Modify variables within groups:

 Stata egen v1mean = mean(v1), by(id) dplyr df %>% group_by(id) %>% mutate(v1mean = mean(v1))

Replace by first observation within group:

 Stata bys id : replace v1 = v1[1] dplyr df %>% group_by(id) %>% mutate(v1 = v1[1])

Collapse observations within groups:

 Stata collapse (mean) v1 (sd) v2, by(id) dplyr df %>% group_by(id) %>% summarize(mean(v1), sd(v2))

The `dplyr` verbs `mutate` and `summarize` work with functions that return vectors. In contrast, the `dplyr` verb `do` works with functions that return dataframes. For instance, to create a dataset of regression coefficients estimated within each group:

 Stata statsby _b[cons] _se[cons] _b[v1] _se[v1] clear, by(id): reg v2 v1 dplyr df %>% group_by(id, v1) %>% do(tidy(lm(v2 ~ v1, .)))
``````#>      id        term  estimate std.error  statistic      p.value
#> 1: id02 (Intercept) 83.700643 10.047882  8.3301777 4.605927e-09
#> 2: id02          v1 -9.516896  3.446395 -2.7614060 1.004458e-02
#> 3: id01 (Intercept) 66.394417 14.672724  4.5250232 8.337954e-05
#> 4: id01          v1 -7.037230  4.098241 -1.7171341 9.593145e-02
#> 5: id03 (Intercept) 49.510889 10.494152  4.7179505 3.749716e-05
#> 6: id03          v1 -1.200503  3.318542 -0.3617561 7.197092e-01
``````

### Return distinct rows

Return a dataset that contains distinct values taken by a set of variables

 Stata duplicates drop v1 v2, force dplyr distinct(df, v1, v2)

#### Memory

When getting only distinct observations, dplyr returns a new dataset without destroying the existing one. This means memory is required both for the existing and the new dataset.

Count the number of distinct values taken by a set of variables:

 Stata distinct v1 v2, joint dplyr df %>% group_by(v1, v2) %>% n_groups()

### Panel data

The package `statar` includes functions that make working with unbalanced panel data easier. The examples will refer to the following data frame

``````df <- data_frame(
id    = c(1, 1, 1, 1, 1, 2, 2),
date  = c(1992, 1989, 1991, 1990, 1994, 1992, 1991),
value = c(4.1, NA , 3.3, 5.3, 3.0, 3.2, 5.2)
)
``````

#### Lag variables

The functions lead/lag accept three arguments: the fist argument is the vector of values to lag, the second argument is the number of lags, the third argument corresponds to the time vector.

To create a lagged variable based on the previous row, use the function `lag/lead` from `dplyr`

 Stata by id : gen value_l = value[_n-1] statar df %>% group_by(id) %>% mutate(value_l = lag(value, n = 1, order_by = date))

To create a lagged variable based on the previous date , use the function `tlag/tlead` from `statar`

 Stata tsset id date value_l = L.value statar df %>% group_by(id) %>% mutate(value_l = tlag(value, n = 1, date))

`lag` and `tlag` differ when the previous date is missing. In this case, the function `lag` returns the value in the most recent date while the function `tlag` returns a missing value.

#### From unbalanced to balanced panel

To make your unbalanced panel balanced ( i.e. to add rows corresponding to gaps in dates), use a combination of the functions `complete` and `full_seq` in `tidyr`

 Stata tsset id date tsfill tidyr complete(df, id, year = full_seq(year, 1))