❤ dplyr

dplyr makes my life 10x easier. 5x of that is speed (just compare base R `merge` with `left_join` ❤❤). The other 5x (or is that 2x?) is that it reads like language. Base R was a step up from Stata for programming econometrics, but it takes me forever to wade through its nit-picking logic and syntax. Great for programming, but I’d rather spend more time econometricking and less time translating into this weird language.

It’s like translating English to French and having to switch all your adjectives: you start with “the red house” and end up with “la maison rouge” (but not always: the old house is still “la vieille maison”). It’s no issue if you’re bilingual, but I don’t want to have to be an expert to translate my statistics into R. I just want to get the job done (or, more likely, see if the job is even possible / will this give me a defensible answer / should I just quit forever).

The problem

I have panel data and I want to trim outliers of growth rates by year. I need to calculate the 2.5th and 97.5th percentiles of growth rates by year, then drop observations with growth below the 2.5th percentile or above the 97.5th. [Note: this is one of many strategies one can use: winsorizing and hard-thresholding are two others, then you pray your results are robust to any choice of trimming.]

In Stata, this is `egen low = pctile(g), p(2.5) by(year)`, then `drop if g < low` (and the same again for the upper tail). Now consider base vs. dplyr:
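Here is a minimal sketch of that comparison. It is not necessarily the code the points below refer to (this base version happens to use `ave()`, not `unlist`), and it assumes a data frame I'm calling `df` with columns `id`, `year`, and `g`. Those names and the simulated data are made up for illustration.

```r
library(dplyr)

# Hypothetical toy panel: 200 units observed over 3 years.
# `df`, `id`, `year`, and `g` are illustrative names, not from real data.
set.seed(1)
df <- data.frame(
  id   = rep(1:200, times = 3),
  year = rep(2001:2003, each = 200),
  g    = rnorm(600)
)

# Base R: ave() applies FUN within each year and returns the result
# aligned to the original rows, so it can be compared with g directly.
low  <- ave(df$g, df$year, FUN = function(x) quantile(x, 0.025))
high <- ave(df$g, df$year, FUN = function(x) quantile(x, 0.975))
trimmed_base <- df[df$g >= low & df$g <= high, ]

# dplyr: start with data, look within each year, keep the middle 95%.
trimmed_dplyr <- df %>%
  group_by(year) %>%
  filter(g >= quantile(g, 0.025), g <= quantile(g, 0.975)) %>%
  ungroup()
```

Both versions should keep the same rows in the same order. The difference is how much of the code you can read aloud.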

It’s around the same number of keystrokes, so why is this so great?

1. I can read and understand my own code without documentation.

I can’t even read the first line of the base code. Why `unlist`? Why am I switching from matrices to data frames? In dplyr, I know exactly what each bit does, and why I need it, without having to spell out each line.

2. It’s easy to document.

If you have #1, you (the author) don’t need documentation to understand it, but since you do understand it, it’s really easy to write documentation for people who don’t. Win-win.

3. I can write code the way I think about code.

I don’t have to translate the logic. The logic is “start with data, look within each year, calculate the 2.5th percentile of growth g.” In base…?? In dplyr, that’s how you write it: start with data, then `group_by(year)`, then call the function `quantile` on growth `g` to return percentile `.025`.
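To make that reading concrete, here is a small sketch with one comment per step of the logic. The data frame `df` and its values are hypothetical, and I use `summarise` here only to print one threshold per year.

```r
library(dplyr)

# Hypothetical toy data; `df`, `year`, and `g` are illustrative names.
df <- data.frame(
  year = rep(c(2001, 2002), each = 5),
  g    = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)

thresholds <- df %>%                                  # start with data
  group_by(year) %>%                                  # look within each year
  summarise(low = quantile(g, 0.025, names = FALSE))  # 2.5th percentile of g

print(thresholds)
```

Swapping `summarise` for `mutate` would attach the threshold to every row instead, which is closer to what Stata's `egen` does.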

4. It’s easy to learn via tutorial.

5. Stata code is much easier to translate into dplyr than into base R.

For coding in base R, the general strategy for writing was:

  1. Google “egen stata r”
  2. Click Stack Overflow link
  3. ctrl+c, ctrl+v
  4. 🙏

In dplyr, the logic is simple enough that I can translate it myself with only the dplyr docs for reference.

6. It does a lot of the dirty work.

See above. I don’t know why I need all this `as.data.frame/as.matrix/unlist` stuff. In dplyr, I don’t need to worry about it.

7. There’s a lot of help available.

It’s so easy to use that (a) everyone knows it, so everyone can offer help, and (b) the easiest solution to any of my questions like “help me do `egen x=mode(y), by(z)`” usually starts with `library(dplyr)…`.
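For that particular question, one possible answer looks like the sketch below. As far as I know, dplyr does not ship a statistical mode function, so `stat_mode` is a hypothetical helper I define here. It breaks ties by taking the smallest value, which is my choice for the sketch and may not match Stata's default tie handling. It also assumes each group has at least one non-missing value.

```r
library(dplyr)

# Hypothetical helper: most frequent value of v, ignoring NAs.
# Ties go to the smallest value, because vals is sorted and
# which.max() returns the first maximum.
stat_mode <- function(v) {
  vals <- sort(unique(v))
  vals[which.max(tabulate(match(v, vals)))]
}

# Illustrative data: group "a" has a clear mode, group "b" has a tie.
df <- data.frame(
  z = c("a", "a", "a", "b", "b", "b", "b"),
  y = c(1, 2, 2, 5, 5, 3, 3)
)

# Rough analogue of: egen x = mode(y), by(z)
df <- df %>%
  group_by(z) %>%
  mutate(x = stat_mode(y)) %>%
  ungroup()
```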

8. Your code will look so good you’ll want to share it.

That’s rare in economics, especially before you publish something, and especially for the dirty stuff like concordances, conversions, processing, and so on. You just figure it out on your own.

I wouldn’t want to share the base R code above. It’s ugly, I can’t explain it, I can’t document it, and I won’t be able to understand it one day after I write it. In dplyr, it’s easy to document and easy to format so people can understand it. The payoff: after I’ve done all the research and written the paper, revised, resubmitted, and so on, I don’t have to spend weeks sorting through my code to figure out how to post it online.