Building a Data Pipeline — Application of Data Science in Education Data Sets. Part 2
Part 1 of this series can be found here.
Loading Packages
####Load Packages####
library(tidyverse)
library(readxl)
library(apaTables)
library(sjPlot)
library(strengejacke)
# {dataedu}, the package that accompanies Data Science in Education Using R,
# supplies the datasets and the theme_dataedu()/dataedu_colors() helpers used below
library(dataedu)
Import Data
We load the log-trace data and the self-report survey data. Note that we assign a dataset to an object three different times, once for each of the three datasets.
####Import Data####
# Pre-survey for the F15 and S16 semesters
pre_survey <- dataedu::pre_survey

# Gradebook and log-trace data for the F15 and S16 semesters
course_data <- dataedu::course_data

# Log-trace data for the F15 and S16 semesters; this is for time spent
course_minutes <- dataedu::course_minutes
Data Processing
1. Pre-survey data
Often, survey data needs to be processed in order to be most useful. We are going to start with one of the three datasets we just saved as objects into our environment: the dataset with the pre-survey items. Here, we process the self-report items into three scales for 1) interest, 2) perceived competence, and 3) utility value. We do this by:
- Renaming the question variables to something more manageable
- Reversing the response scales on questions 4 and 7
- Categorizing each question into a measure
- Computing the mean of each measure
pre_survey <-
pre_survey %>%
# Rename the questions to something easier to work with because R is case-sensitive
# and working with variable names in mixed case is prone to error
rename(
q1 = Q1MaincellgroupRow1,
q2 = Q1MaincellgroupRow2,
q3 = Q1MaincellgroupRow3,
q4 = Q1MaincellgroupRow4,
q5 = Q1MaincellgroupRow5,
q6 = Q1MaincellgroupRow6,
q7 = Q1MaincellgroupRow7,
q8 = Q1MaincellgroupRow8,
q9 = Q1MaincellgroupRow9,
q10 = Q1MaincellgroupRow10
) %>%
# Convert all question responses to numeric
mutate_at(vars(q1:q10), list( ~ as.numeric(.)))
Next we’ll reverse the scale of the survey responses on questions 4 and 7 so the responses for all questions can be interpreted in the same way. As you can see from the survey questions listed earlier in this series, the phrasing of questions 4 and 7 is opposite the phrasing of the other questions.
# This part of the code is where we write the function:
# Function for reversing scales
reverse_scale <- function(question) {
# Reverses the response scales for consistency
# Arguments:
# question — survey question
# Returns:
# a numeric converted response
# Note: even though 3 is not transformed, case_when expects a match for all
# possible conditions, so it’s best practice to label each possible input
# and use TRUE ~ as the final statement returning NA for unexpected inputs
x <- case_when(
question == 1 ~ 5,
question == 2 ~ 4,
question == 3 ~ 3,
question == 4 ~ 2,
question == 5 ~ 1,
TRUE ~ NA_real_
)
x
}

# And here's where we use that function to reverse the scales
# We use the pipe operator %>% here
# Reverse scale for questions 4 and 7
pre_survey <-
pre_survey %>%
mutate(q4 = reverse_scale(q4),
q7 = reverse_scale(q7))
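As a quick check that the function behaves as intended, we can call it on a small vector. Based on the case_when() mapping above, a 1 should become a 5, a 3 should stay a 3, and a 5 should become a 1.
# Quick check of reverse_scale() on a small example vector
reverse_scale(c(1, 3, 5))
# Returns: 5 3 1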
Next, we’ll use a function called pivot_longer() in order to transform our pre_survey dataset from wide format to long format. That means instead of having 1,102 observations of 12 variables, we will now have 11,020 observations of 4 variables. By using pivot_longer(), we make it so that each question & response pair has its own line in the data.
# Pivot the dataset from wide to long format
measure_mean <-
pre_survey %>%
# Gather questions and responses
pivot_longer(cols = q1:q10,
names_to = "question",
values_to = "response")
Next, we’ll take our new measure_mean dataset and create a column called measure. We’ll fill that column with one of three question categories:
- int: interest
- uv: utility value
- pc: perceived competence
When we pivoted from wide format to long format in the prior step, we ended up with one variable containing all possible question numbers (q1, q2, etc.). Now, we want to tell R which question numbers correspond to which categories. To do this, we will supply case_when() with a list of all the question numbers that correspond to each category: interest, utility value, and perceived competence.
# Add measure variable
measure_mean <- measure_mean %>%
# Here's where we make the column of question categories called "measure"
mutate(
measure = case_when(
question %in% c("q1", "q4", "q5", "q8", "q10") ~ "int",
question %in% c("q2", "q6", "q9") ~ "uv",
question %in% c("q3", "q7") ~ "pc",
TRUE ~ NA_character_)
)
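If you want to verify that each question landed in the intended category, a quick count by measure and question is one way to do it. This check is optional and not part of the original pipeline.
# Optional check: which questions were assigned to each measure?
measure_mean %>%
  count(measure, question)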
Last, we’ll take that same measure_mean dataset and create a new variable called mean_response. Since we are calculating the mean by category, we will need to first group the responses together using a function called group_by(). This function helps us get set up to calculate new columns with grouped data. Next, we’ll use the function summarize() to create two new variables: mean_response and percent_NA. We’ll find the mean response of each category using the mean() function.
# Compute the mean of each measure
measure_mean <- measure_mean %>%
# First, we group by the new variable "measure"
group_by(measure) %>%
# Here’s where we compute the mean of the responses
summarize(
# Creating a new variable to indicate the mean response for each measure
mean_response = mean(response, na.rm = TRUE),
# Creating a new variable to indicate the percent of each measure that
# had NAs in the response field
percent_NA = mean(is.na(response))
)

measure_mean
We have now finished processing the pre_survey dataset: we renamed the relevant variables and computed some means that we can use later.
Course Data
Information about the course subject, semester, and section is stored in a single column, CourseSectionOrigID. This format of data storage is not ideal. If we instead give each piece of information its own column, we’ll have more opportunities for later analysis. We’ll use a function called separate() to do this. Below, we take course_data and run the separate() function to split up the subject, semester, and section so we can use them later on.
# split course section into components
course_data <-
course_data %>%
# Give course subject, semester, and section their own columns
separate(
col = CourseSectionOrigID,
into = c("subject", "semester", "section"),
sep = "-",
remove = FALSE
)
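To confirm the split worked, we can count the unique combinations of the three new columns. This is just a quick, optional check.
# Optional check: unique subject-semester-section combinations and their row counts
course_data %>%
  count(subject, semester, section)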
Joining the Data
To join the course data and pre-survey data, we need to create similar keys. Our goal here is to have one variable that matches across both datasets. Once we have that common variable in both datasets, we can merge the datasets on the basis of that variable.
When we look at the course_data and pre_survey datasets in our environment, we see that both have variables for the course and the student. However, this information is captured in different variable names in each dataset. Our first goal will be to rename two variables in each dataset so that they will match. One variable will correspond to the course, and the other will correspond to the student. We are not changing anything in the data itself at this step; instead, we are just cleaning up the column headers so we can look at the data all in one place.
Let’s start with the pre-survey data. We will rename opdata_username and opdata_CourseID to student_id and course_id, respectively.
pre_survey <-
pre_survey %>%
rename(student_id = opdata_username,
course_id = opdata_CourseID)

pre_survey
Deleting characters from the end of a string
# Re-create the variable "student_id" so that it excludes the extraneous characters
pre_survey <- pre_survey %>%
mutate(student_id = str_sub(student_id, start = 2, end = -3))

# Save the new variable as numeric so that R no longer thinks it is text
pre_survey <- pre_survey %>%
mutate(student_id = as.numeric(student_id))
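To see what str_sub() is doing here, consider a hypothetical raw ID such as "_99888_1" (a made-up value for illustration; the real IDs may look different). Setting start = 2 drops the first character, and end = -3 drops the last two characters, leaving only the numeric portion.
# Illustration with a made-up ID: start = 2 and end = -3 trim both ends
str_sub("_99888_1", start = 2, end = -3)
# "99888"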
Next, we rename the corresponding variables in the course data to student_id and course_id.
course_data <-
course_data %>%
rename(student_id = Bb_UserPK,
course_id = CourseSectionOrigID)
Joining the data
Now we have two variables that are consistent across both datasets: we have called them course_id and student_id. We can join the two datasets using the left_join() function from {dplyr}.
dat <-
left_join(course_data, pre_survey,
by = c("student_id", "course_id"))
dat
Our aim with that code is that all of the rows in course_data are retained in our new data frame, dat, with matching rows of pre_survey joined to it. An important note is that there are not multiple matching rows of pre_survey; otherwise, you would end up with more rows in dat than expected.
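To see why duplicate keys would be a problem, here is a tiny, self-contained illustration (toy data, not part of our datasets): when the right-hand table has two rows with the same key, left_join() returns both matches, so the result has more rows than the left-hand table.
# Toy example: a duplicate key on the right-hand side inflates the row count
a <- tibble(id = c(1, 2), x = c("a", "b"))
b <- tibble(id = c(1, 1), y = c("first", "second"))
left_join(a, b, by = "id") # 3 rows, because id 1 matches twice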
course_minutes <-
course_minutes %>%
rename(student_id = Bb_UserPK,
course_id = CourseSectionOrigID)

course_minutes <-
course_minutes %>%
# Change the data type for student_id in course_minutes so we can match to
# student_id in dat
mutate(student_id = as.integer(student_id))

dat <-
dat %>%
left_join(course_minutes,
by = c("student_id", "course_id"))
Note that they’re now combined, even though the course data had many more rows. The pre-survey data has been joined for each student-course combination.
Finding Distinct Cases at the Student-Level
If a student was enrolled in two courses, they will have a different final grade for each of those two courses. However, our data in its current form has many rows representing each course. Before keeping only the distinct student-course rows, we can visually inspect the data to see how those rows differ.
glimpse(dat)
# Distinct gradebook items in the data
distinct(dat, Gradebook_Item)

# Distinct combinations of course and gradebook item
distinct(dat, course_id, Gradebook_Item)

# Keep only one row per student per course, retaining all other columns
dat <-
distinct(dat, course_id, student_id, .keep_all = TRUE)
dat <- rename(dat, final_grade = FinalGradeCEMS)
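As a final check that distinct() left us with exactly one row per student per course, we can count the student-course combinations and look for any that appear more than once. If everything worked, this returns zero rows.
# Verify there is exactly one row per student-course combination
dat %>%
  count(student_id, course_id) %>%
  filter(n > 1)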
Analysis
We focus on some initial analyses in the form of visualizations and a regression model.
The Relationship between Time Spent on Course and Final Grade
One thing we might be wondering is how the time spent on the course is related to students’ final grades.
Let’s make a plot to depict that relationship.
dat %>%
# aes() tells ggplot2 what variables to map to what feature of a plot
# Here we map variables to the x- and y-axis
ggplot(aes(x = TimeSpent, y = final_grade)) +
# Creates a point with x- and y-axis coordinates specified above
geom_point(color = dataedu_colors("green")) +
theme_dataedu() +
labs(x = "Time Spent",
y = "Final Grade")
There appears to be some relationship. Let’s add a line of best fit based on a linear model.
dat %>%
ggplot(aes(x = TimeSpent, y = final_grade)) +
geom_point(color = dataedu_colors("green")) + # same as above
# this adds a line of best fit
# method = "lm" tells ggplot2 to fit the line using linear regression
geom_smooth(method = "lm") +
theme_dataedu() +
labs(x = "Time Spent",
y = "Final Grade")
Looking at this plot, it appears that the more time students spent on the course, the higher their final grade is.
Based upon the trend observable in the data, the line of best fit predicts that students who spend a large enough amount of time on the course earn a final grade greater than 100. Of course, this is not possible. This highlights the importance of understanding your data and interpreting lines of best fit (and other, more sophisticated analyses) carefully, keeping that understanding and knowledge in mind as you present and make sense of the results.
Linear Model (Regression)
We can quantify the relationship between these two variables, time spent on the course and students’ final grades, using a linear model.
m_linear <-
lm(final_grade ~ TimeSpent, data = dat)

# Generating table output with a function from the {sjPlot} package, tab_model()
tab_model(m_linear,
title = "Table 1.1")

# apa.reg.table(m_linear, filename = "regression-table-output.doc")
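If you prefer base R output to the {sjPlot} table, summary() shows the same coefficients. And to revisit the over-100 predictions we noticed in the plot, predict() lets you see the fitted value at a large amount of time spent; the 5000 below is a made-up value purely for illustration.
# Base R view of the coefficients and R-squared
summary(m_linear)

# Predicted final grade at a hypothetical (made-up) value of TimeSpent
predict(m_linear, newdata = data.frame(TimeSpent = 5000))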
This is the end of Part 2. I hope to start Part 3 soon. References can be found in Part 1 of this series.