How to Learn SQL as an R User in Academia

This simple approach lets you use your knowledge of R to learn SQL, broadening your skill set and your future career options.

Jack Cao
Meta Research
9 min read · Jun 15, 2021


I joined Facebook’s research team after completing a PhD in psychology. Although many of my skills transferred from academia to industry, there were others I still had to learn. One of those was using SQL (Structured Query Language) to extract insights from data. I’d like to help academics learn SQL in a familiar way, so they can apply this new skill to their current research.

There are plenty of good, free SQL tutorials available, and they’re an excellent starting point for mastering the basics. (I’m a big fan of this one.) But relying on them exclusively raises two problems. First, there’s opportunity cost: every hour spent learning SQL is an hour that can’t be spent finishing a dissertation or crafting a diplomatic response to Reviewer 2. Second, they can feel unfamiliar: many academics use R for data analysis, but SQL tutorials don’t connect with R, and the tutorials’ datasets aren’t useful for scholarly publication.

Learn SQL in a familiaR environment

In this article, I’ll walk through how to learn SQL in R using datasets collected by academics. My hope is that it inspires current R users in academia to incorporate SQL into their analysis pipelines. Doing so will help them develop a valuable skill for the future while contributing to their current analyses.

To illustrate with a toy example, we’ll use the mtcars dataset that’s built into R. Its name is short for Motor Trend, the car magazine from which the data were extracted. The dataset contains 32 rows and 11 columns, where each row is a car and each column describes an attribute of that car, like gas mileage or number of cylinders.
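To get a quick feel for the data yourself:

# peek at the first few rows and confirm the dimensions
utils::head(x = mtcars)
base::dim(x = mtcars)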

Let’s say we wanted to know the average miles per gallon for cars with different numbers of cylinders (which is akin to wanting to know the average response for participants in different conditions of an experiment). The code below does this in R.

library(tidyverse)

mtcars %>%
  dplyr::group_by(cyl) %>%
  dplyr::summarise(avg_mpg = base::mean(x = mpg))

We can see from the output that cars with fewer cylinders get, on average, more miles per gallon than cars with more cylinders.

## # A tibble: 3 x 2
##     cyl avg_mpg
##   <dbl>   <dbl>
## 1     4    26.7
## 2     6    19.7
## 3     8    15.1

To learn SQL in R, let’s reproduce the above analysis using the sqldf package, which allows SQL code to be run in R.
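If you don’t already have it, the sqldf package is available on CRAN. By default it runs queries against a temporary in-memory SQLite database built from your data frame, so the SQL dialect used throughout this article is SQLite’s.

install.packages("sqldf")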

library(sqldf)

sqldf::sqldf(x = "
  SELECT
    cyl,
    AVG(mpg) AS avg_mpg
  FROM mtcars
  GROUP BY
    1
")

And we get the same results (differences are due to rounding).

##   cyl  avg_mpg
## 1   4 26.66364
## 2   6 19.74286
## 3   8 15.10000

Let’s break down the SQL and compare it to R:

  • We’re pulling data from the mtcars dataset, which is specified in SQL as FROM mtcars. This is equivalent to mtcars %>% at the very beginning of the pipe in R.
  • Our output is aggregated by the number of cylinders, which is reflected in SQL as GROUP BY 1. The 1 refers to the first column under SELECT, namely cyl, and is equivalent to dplyr::group_by(cyl) in R. (You can also group by the column name itself, as shown after this list.)
  • For each number of cylinders, we want the average miles per gallon, which is indicated by AVG(mpg) and has been renamed avg_mpg. This is equivalent to base::mean(x = mpg) in R.
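If positional grouping feels opaque, SQLite (sqldf’s default backend) also accepts the column name; the query below returns the same result as the one above:

# grouping by column name rather than position
sqldf::sqldf(x = "
  SELECT
    cyl,
    AVG(mpg) AS avg_mpg
  FROM mtcars
  GROUP BY
    cyl
")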

Using SQL with a real academic dataset

Now let’s see this approach in action with a real dataset that was part of a scholarly publication. We’ll use a dataset collected by Adam Mastroianni and colleagues, whose paper showed that conversations between two people rarely end when either person wants them to end (see this Jimmy Kimmel video for a funnier take). You can follow along yourself by downloading the raw data here, which were collected by a team unaffiliated with Facebook.

We’ll reproduce some analyses from Study 1 of Adam’s paper, first using R and then in SQL. To begin, we’ll need to import the raw data, which consists of survey responses collected on Amazon Mechanical Turk, into R.

# import the raw Study 1 data and check its dimensions
study1_raw <- readr::read_csv(file = "study1.csv")
base::dim(x = study1_raw)

There are 1,100 rows, one per participant. Each of the 53 columns contains information such as whether the participant passed an attention check or how they answered a particular question.

## [1] 1100 53
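To see which columns are available before writing any filters:

# list the column names in the raw data
base::names(x = study1_raw)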

Study 1 reports data from 806 participants, 367 of whom identified as female and 439 of whom identified as male. The raw data contain 1,100 participants, so 294 participants were excluded (e.g., for failing manipulation checks). To count the number of participants who identified as female vs. male, we’ll first need to reproduce the exclusions, which means filtering from 1,100 rows to 806 rows.

The R code below excludes 294 participants and then counts the number of participants who identified as female vs. male.

study1_raw %>%

  # exclusion process
  dplyr::filter(

    # participants who showed they were paying attention
    kindergarten == "Kindergarten"
    & ohtwoone == "2138"
    & turkey == "Eating turkey"

    # participants who finished; make_sense was the last question
    & !base::is.na(make_sense)

    # participants who remembered a last conversation that met the criteria
    & remember == "Yes"

    # participants who remembered the length of that conversation
    & remember_length == "Yes"

    # participants who remembered whether there was a point they were ready to end
    & wanted_end != "I don't remember"

    # participants whose conversation lasted 45 minutes or less
    & actual <= 45

    # participants who passed the attention check
    & tree == "Other"
    & !(tree_4_TEXT %in% c("free", "three", "tree541"))
  ) %>%

  # one row per gender
  dplyr::group_by(gender) %>%

  # count participants per gender
  dplyr::summarise(n = dplyr::n())

As reported in the paper, 367 participants identified as female and 439 identified as male.

## # A tibble: 2 x 2
##   gender     n
##   <chr>  <int>
## 1 Female   367
## 2 Male     439

The equivalent in SQL is below.

sqldf::sqldf(
  x = "
  SELECT
    gender,
    COUNT(*) AS n
  FROM study1_raw
  WHERE
    kindergarten = 'Kindergarten'
    AND ohtwoone = '2138'
    AND turkey = 'Eating turkey'
    AND make_sense IS NOT NULL
    AND remember = 'Yes'
    AND remember_length = 'Yes'
    AND wanted_end IN ('No', 'Yes')
    AND actual <= 45
    AND tree = 'Other'
    AND tree_4_TEXT NOT IN ('free', 'three', 'tree541')
  GROUP BY
    1
  "
)

And we see the same results:

##   gender   n
## 1 Female 367
## 2   Male 439

Let’s break down the SQL:

  • We’re pulling data from study1_raw, which is specified as FROM study1_raw. This is equivalent to study1_raw %>% at the very beginning of the pipe in R.
  • Everything between WHERE and GROUP BY (e.g., kindergarten = 'Kindergarten') is equivalent to what’s inside dplyr::filter(). This is where the filtering from 1,100 rows to 806 rows takes place. In order for a row to be kept, it must meet all the specified conditions (note that AND in SQL is equivalent to & in R). One condition is written differently: wanted_end IN ('No', 'Yes') keeps the same rows here as R’s wanted_end != "I don't remember", while sidestepping the apostrophe that would otherwise need escaping inside a SQL string literal.
  • To count gender, we don’t want 806 rows; we want one row per each of the study’s two gender options and two columns: one column for gender and another showing how many participants identified as such. GROUP BY 1 creates one row per gender option, where 1 refers to the first column under SELECT. This is equivalent to dplyr::group_by(gender) in R.
  • COUNT(*) creates a count column, which we’ve renamed n. The * inside COUNT() means all rows. So because there’s one row per participant and we’re grouping by gender, we end up counting how many participants identified as each of the study’s gender options. The logic here is equivalent to dplyr::n() (a dplyr shorthand for this pattern is shown after this list).
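As an aside, dplyr offers a shorthand for this group-and-count pattern; on the toy dataset, the following is equivalent to dplyr::group_by(cyl) %>% dplyr::summarise(n = dplyr::n()):

# count rows per cylinder value in one step
mtcars %>%
  dplyr::count(cyl)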

The process above can be adapted to reproduce the average age of the 806 participants, which was 36.78 years.
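Here’s a minimal sketch of that adaptation. The exclusion conditions are copied from the query above; the column name age is an assumption for illustration, so check base::names(x = study1_raw) for the actual name in the raw file.

sqldf::sqldf(
  x = "
  SELECT
    AVG(age) AS avg_age  -- 'age' is an assumed column name
  FROM study1_raw
  WHERE
    kindergarten = 'Kindergarten'
    AND ohtwoone = '2138'
    AND turkey = 'Eating turkey'
    AND make_sense IS NOT NULL
    AND remember = 'Yes'
    AND remember_length = 'Yes'
    AND wanted_end IN ('No', 'Yes')
    AND actual <= 45
    AND tree = 'Other'
    AND tree_4_TEXT NOT IN ('free', 'three', 'tree541')
  "
)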

After describing the sample demographics, Adam and colleagues write that 78.41% of participants reported on a conversation that occurred the day they took the survey or the day before.

This particular question has its own column in study1_raw, called when. The response options were Today, Yesterday, A few days ago, and More than a few days ago. The R and SQL below replicate the 78.41% figure. The same process can be used to replicate the other percentages (79.16%, 88.59%, and 84.12%), which could be a good exercise if you’re interested.

In R:

study1_raw %>%

  # exclusion process
  dplyr::filter(
    kindergarten == "Kindergarten"
    & ohtwoone == "2138"
    & turkey == "Eating turkey"
    & !base::is.na(make_sense)
    & remember == "Yes"
    & remember_length == "Yes"
    & wanted_end != "I don't remember"
    & actual <= 45
    & tree == "Other"
    & !(tree_4_TEXT %in% c("free", "three", "tree541"))
  ) %>%

  # numeric version of when
  dplyr::mutate(when_num = dplyr::case_when(
    when == "Today" ~ 1,
    when == "Yesterday" ~ 2,
    when == "A few days ago" ~ 3,
    when == "More than a few days ago" ~ 4
  )) %>%

  # group by both versions of when
  dplyr::group_by(when_num, when) %>%

  # count how many participants gave each response
  dplyr::count() %>%

  # order from 1 to 4
  dplyr::arrange(when_num) %>%

  # calculate a cumulative total, the overall total, and a cumulative percent
  dplyr::ungroup() %>%
  dplyr::mutate(cum_total = base::cumsum(x = n),
                total = base::sum(n),
                cum_pct = cum_total / total)

The result shows how many participants gave each response option, along with a cumulative total and a cumulative percent.

## # A tibble: 4 x 6
##   when_num when                         n cum_total total cum_pct
##      <dbl> <chr>                    <int>     <int> <int>   <dbl>
## 1        1 Today                      404       404   806   0.501
## 2        2 Yesterday                  228       632   806   0.784
## 3        3 A few days ago             147       779   806   0.967
## 4        4 More than a few days ago    27       806   806   1

And in SQL:

sqldf::sqldf(
  x = "
  SELECT
    *,
    SUM(n) OVER(ORDER BY when_num) AS cum_total,
    SUM(n) OVER(PARTITION BY 1) AS total,
    1.0 * SUM(n) OVER(ORDER BY when_num) / SUM(n) OVER(PARTITION BY 1) AS cum_pct
  FROM (
    SELECT
      CASE
        WHEN \"when\" = 'Today' THEN 1
        WHEN \"when\" = 'Yesterday' THEN 2
        WHEN \"when\" = 'A few days ago' THEN 3
        WHEN \"when\" = 'More than a few days ago' THEN 4
      END AS when_num,
      \"when\",
      COUNT(*) AS n
    FROM study1_raw
    WHERE
      kindergarten = 'Kindergarten'
      AND ohtwoone = '2138'
      AND turkey = 'Eating turkey'
      AND make_sense IS NOT NULL
      AND remember = 'Yes'
      AND remember_length = 'Yes'
      AND wanted_end IN ('No', 'Yes')
      AND actual <= 45
      AND tree = 'Other'
      AND tree_4_TEXT NOT IN ('free', 'three', 'tree541')
    GROUP BY
      1, 2
  )
  "
)

Which gives us the same result:

##   when_num                     when   n cum_total total   cum_pct
## 1        1                    Today 404       404   806 0.5012407
## 2        2                Yesterday 228       632   806 0.7841191
## 3        3           A few days ago 147       779   806 0.9665012
## 4        4 More than a few days ago  27       806   806 1.0000000

Let’s break down the SQL:

  • This consists of two queries, one nested inside the other.
  • The inner query creates the first three columns in the output by pulling data from study1_raw (FROM study1_raw) and filtering it from 1,100 rows to 806 rows (everything between WHERE and GROUP BY). We aggregate down to four rows, one per response option, by specifying GROUP BY 1, 2, where 1 and 2 are the numeric and text versions of when. The three columns are a numeric version of when (note the parallel between CASE WHEN in SQL and dplyr::case_when() in R), the text version of when, and a count of how many participants provided each response option. Because when is a reserved word in SQL, the column name is wrapped in double quotes, which are escaped as \" inside the R string.
  • The outer query adds three additional columns: a cumulative total, the total number of participants, and a cumulative percent. Under SELECT, there’s a * to indicate that we want all columns from our initial sub-query. The cumulative total is formed by summing the counts in order of the numeric version of when, as reflected by SUM(n) OVER(ORDER BY when_num). The total number of participants is formed similarly, except the summing doesn’t require any ordering, so it’s just SUM(n) OVER(PARTITION BY 1). Lastly, the cumulative percent is formed by dividing the cumulative total by the total number of participants; the 1.0 * forces floating-point division, since both sums are integers. The entry in the second row of the rightmost column shows that 78.41% of participants responded with either Today or Yesterday. (A variant of these window functions is sketched after this list.)
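As a side note, an empty OVER() clause computes the same grand total as OVER(PARTITION BY 1), and some readers find it more direct. Here’s a minimal sketch on the toy mtcars dataset (it relies on the same window-function support the queries above already use):

# running and grand totals over a per-cylinder count of cars
sqldf::sqldf(x = "
  SELECT
    *,
    SUM(n) OVER(ORDER BY cyl) AS cum_total,
    SUM(n) OVER() AS total
  FROM (
    SELECT
      cyl,
      COUNT(*) AS n
    FROM mtcars
    GROUP BY 1
  )
")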

A useful skill now — and down the road

The code here can be adapted for your own academic research. The next time you have raw data to clean and analyze, try doing it in both R and SQL and compare the results. You’ll expand your technical skills while advancing your research. And if you ever decide to pivot to industry, that practice will pay dividends in what you’ll be able to discover by analyzing data with SQL.

Author: Jack Cao, Quantitative UX Researcher at Facebook

Contributor: David Kille, Research Manager at Facebook

Illustrator: Drew Bardana
