The `dplyr`

verbs `filter`

, `mutate`

and `summarize`

can be applied on a grouped data.frame.

Filter based on a logical condition

Stata | egen temp = max(v1), by(id)
keep if v1 == temp |

dplyr | df %>% group_by(id) %>% filter(v1 == max(v1)) |

Filter based on relative row numbers

Stata | bys id: keep if _n == 1 |

dplyr | df %>% group_by(id) %>% filter(row_number() == 1) |

Filter the 2 observation with lowest `v1`

for each group defined by `id`

Stata | bys id (v1): keep if _n <= 2 |

dplyr | df %>% group_by(id) %>% filter(row_number(v2) <= 2) |

Compared to Stata, *any* function that is defined on vectors can be applied within group.

Modify variables within groups:

Stata | egen v1mean = mean(v1), by(id) |

dplyr | df %>% group_by(id) %>% mutate(v1mean = mean(v1)) |

Replace by first observation within group:

Stata | bys id : replace v1 = v1[1] |

dplyr | df %>% group_by(id) %>% mutate(v1 = v1[1]) |

Collapse observations within groups:

Stata | collapse (mean) v1 (sd) v2, by(id) |

dplyr | df %>% group_by(id) %>% summarize(mean(v1), sd(v2)) |

Return a dataset that contains distinct values taken by a set of variables

Stata | duplicates drop v1 v2, force |

dplyr | distinct(df, v1, v2) |

Count the number of distinct values taken by a set of variables

Stata | distinct v1 v2, joint |

dplyr | df %>% group_by(v1, v2) %>% n_groups() |

The package `statar`

includes functions that make working with unbalanced panel data easier. The examples will refer to the following data frame

```
df <- tibble(
id = c(1, 1, 1, 1, 1, 2, 2),
date = c(1992, 1989, 1991, 1990, 1994, 1992, 1991),
value = c(4.1, NA , 3.3, 5.3, 3.0, 3.2, 5.2)
)
```

The functions lead/lag accept three arguments: the fist argument is the vector of values to lag, the second argument is the number of lags, the third argument corresponds to the time vector.

To create a lagged variable based on the *previous row*, use the function `lag/lead`

from `dplyr`

Stata | by id : gen value_l = value[_n-1] |

statar | df %>% group_by(id) %>% mutate(value_l = lag(value, n = 1, order_by = date)) |

To create a lagged variable based on the *previous date* , use the function `tlag/tlead`

from `statar`

Stata | tsset id date value_l = L.value |

statar | df %>% group_by(id) %>% mutate(value_l = tlag(value, n = 1, date)) |

`lag`

and `tlag`

differ when the previous date is missing. In this case, the function `lag`

returns the value in the most recent date while the function `tlag`

returns a missing value.

To make your unbalanced panel balanced ( i.e. to add rows corresponding to gaps in dates), use a combination of the functions `complete`

and `full_seq`

in `tidyr`

Stata | tsset id date tsfill |

tidyr | complete(df, id, year = full_seq(year, 1)) |