Building a Data Pipeline — Application of Data Science in Education Data Sets. Part 2
Part 1 of this series can be found here.
Loading Packages
####Load Packages####
library(tidyverse)
library(readxl)
library(apaTables)
library(sjPlot)
library(strengejacke)
# {dataedu}, the package that accompanies Data Science in Education Using R,
# supplies the datasets and the theme_dataedu()/dataedu_colors() helpers used below
library(dataedu)
Import Data
We load the log-trace data and the self-report survey data. Note that we assign a dataset to an object three different times, once for each of the three datasets.
####Import Data####
# Pre-survey for the F15 and S16 semesters
pre_survey <- dataedu::pre_survey

# Gradebook and log-trace data for the F15 and S16 semesters
course_data <- dataedu::course_data

# Log-trace data for the F15 and S16 semesters; this is for time spent
course_minutes <- dataedu::course_minutes
Data Processing
1. Pre-survey data
Often, survey data needs to be processed in order to be most useful. We are going to start with one of the three datasets we just saved as objects into our environment: the dataset with the pre-survey items. Here, we process the self-report items into three scales for 1) interest, 2) perceived competence, and 3) utility value. We do this by:
- Renaming the question variables to something more manageable
- Reversing the response scales on questions 4 and 7
- Categorizing each question into a measure
- Computing the mean of each measure
pre_survey <-
pre_survey %>%
# Rename the questions to something easier to work with because R is case-sensitive
# and working with variable names in mixed case is prone to error
rename(
q1 = Q1MaincellgroupRow1,
q2 = Q1MaincellgroupRow2,
q3 = Q1MaincellgroupRow3,
q4 = Q1MaincellgroupRow4,
q5 = Q1MaincellgroupRow5,
q6 = Q1MaincellgroupRow6,
q7 = Q1MaincellgroupRow7,
q8 = Q1MaincellgroupRow8,
q9 = Q1MaincellgroupRow9,
q10 = Q1MaincellgroupRow10
) %>%
# Convert all question responses to numeric
mutate_at(vars(q1:q10), list( ~ as.numeric(.)))
Next we’ll reverse the scale of the survey responses on questions 4 and 7 so the responses for all questions can be interpreted in the same way. As you can see from the survey questions listed earlier in this series, the phrasing of questions 4 and 7 is opposite the phrasing of the other questions.
# This part of the code is where we write the function:
# Function for reversing scales
reverse_scale <- function(question) {
# Reverses the response scales for consistency
# Arguments:
# question — survey question
# Returns:
# a numeric converted response
# Note: even though 3 is not transformed, case_when expects a match for all
# possible conditions, so it’s best practice to label each possible input
# and use TRUE ~ as the final statement returning NA for unexpected inputs
x <- case_when(
question == 1 ~ 5,
question == 2 ~ 4,
question == 3 ~ 3,
question == 4 ~ 2,
question == 5 ~ 1,
TRUE ~ NA_real_
)
x
}

# And here's where we use that function to reverse the scales
# We use the pipe operator %>% here
# Reverse scale for questions 4 and 7
pre_survey <-
pre_survey %>%
mutate(q4 = reverse_scale(q4),
q7 = reverse_scale(q7))
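As a quick check that the function behaves as intended, we can call it on a small vector. Based on the case_when() mapping above, a 1 should become a 5, a 3 should stay a 3, and a 5 should become a 1.
# Quick check of reverse_scale() on a small example vector
reverse_scale(c(1, 3, 5))
# Returns: 5 3 1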
Next, we’ll use a function called pivot_longer() in order to transform our pre_survey dataset from wide format to long format. That means instead of having 1,102 observations of 12 variables, we will now have 11,020 observations of 4 variables. By using pivot_longer(), we make it so that each question & response pair has its own line in the data.
# Pivot the dataset from wide to long format
measure_mean <-
pre_survey %>%
# Gather questions and responses
pivot_longer(cols = q1:q10,
names_to = "question",
values_to = "response")
Next, we’ll take our new measure_mean dataset and create a column called measure. We’ll fill that column with one of three question categories:
- int: interest
- uv: utility value
- pc: perceived competence
When we pivoted from wide format to long format in the prior step, we ended up with one variable containing all possible question numbers (q1, q2, etc.). Now, we want to tell R which question numbers correspond to which categories. To do this, we will supply case_when() with a list of all the question numbers that correspond to each category: interest, utility value, and perceived competence.
# Add measure variable
measure_mean <- measure_mean %>%
# Here's where we make the column of question categories called "measure"
mutate(
measure = case_when(
question %in% c("q1", "q4", "q5", "q8", "q10") ~ "int",
question %in% c("q2", "q6", "q9") ~ "uv",
question %in% c("q3", "q7") ~ "pc",
TRUE ~ NA_character_)
)
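If you want to verify that each question landed in the intended category, a quick count by measure and question is one way to do it. This check is optional and not part of the original pipeline.
# Optional check: which questions were assigned to each measure?
measure_mean %>%
  count(measure, question)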
Last, we’ll take that same measure_mean dataset and create a new variable called mean_response. Since we are calculating the mean by category, we will need to first group the responses together using a function called group_by(). This function helps us get set up to calculate new columns with grouped data. Next, we’ll use the function summarize() to create two new variables: mean_response and percent_NA. We’ll find the mean response of each category using the mean() function.
# Compute the mean of each measure
measure_mean <- measure_mean %>%
# First, we group by the new variable "measure"
group_by(measure) %>%
# Here’s where we compute the mean of the responses
summarize(
# Creating a new variable to indicate the mean response for each measure
mean_response = mean(response, na.rm = TRUE),
# Creating a new variable to indicate the percent of each measure that
# had NAs in the response field
percent_NA = mean(is.na(response))
)

measure_mean
We have now finished processing the pre_survey dataset: we renamed the relevant variables and computed some means that we can use later.
Course Data
Information about the course subject, semester, and section is stored in a single column, CourseSectionOrigID. This format of data storage is not ideal. If we instead give each piece of information its own column, we’ll have more opportunities for later analysis. We’ll use a function called separate() to do this. Below, we take course_data and run the separate() function to split up the subject, semester, and section so we can use them later on.
# split course section into components
course_data <-
course_data %>%
# Give course subject, semester, and section their own columns
separate(
col = CourseSectionOrigID,
into = c("subject", "semester", "section"),
sep = "-",
remove = FALSE
)
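To confirm the split worked, we can count the unique combinations of the three new columns. This is just a quick, optional check.
# Optional check: unique subject-semester-section combinations and their row counts
course_data %>%
  count(subject, semester, section)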
Joining the Data
To join the course data and pre-survey data, we need to create similar keys. Our goal here is to have one variable that matches across both datasets. Once we have that common variable in both datasets, we can merge the datasets on the basis of that variable.
When we look at the course_data and pre_survey datasets in our environment, we see that both have variables for the course and the student. However, this information is captured in different variable names in each dataset. Our first goal will be to rename two variables in each dataset so that they will match. One variable will correspond to the course, and the other will correspond to the student. We are not changing anything in the data itself at this step; instead, we are just cleaning up the column headers so we can look at the data all in one place.
Let’s start with the pre-survey data. We will rename opdata_username and opdata_CourseID to student_id and course_id, respectively.
pre_survey <-
pre_survey %>%
rename(student_id = opdata_username,
course_id = opdata_CourseID)

pre_survey
Deleting characters from the end of a string
# Re-create the variable "student_id" so that it excludes the extraneous characters
pre_survey <- pre_survey %>%
mutate(student_id = str_sub(student_id, start = 2, end = -3))

# Save the new variable as numeric so that R no longer thinks it is text
pre_survey <- pre_survey %>%
mutate(student_id = as.numeric(student_id))
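To see what str_sub() is doing here, consider a hypothetical raw ID such as "_99888_1" (a made-up value for illustration; the real IDs may look different). Setting start = 2 drops the first character, and end = -3 drops the last two characters, leaving only the numeric portion.
# Illustration with a made-up ID: start = 2 and end = -3 trim both ends
str_sub("_99888_1", start = 2, end = -3)
# "99888"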
Next, we rename the corresponding variables in the course data to student_id and course_id.
course_data <-
course_data %>%
rename(student_id = Bb_UserPK,
course_id = CourseSectionOrigID)
Joining the data
Now we have two variables that are consistent across both datasets: we have called them course_id and student_id. We can join the two datasets using the left_join() function from {dplyr}.
dat <-
left_join(course_data, pre_survey,
by = c("student_id", "course_id"))
dat
Our aim with that code is that all of the rows in course_data are retained in our new data frame, dat, with matching rows of pre_survey joined to it. An important note is that there are not multiple matching rows of pre_survey; otherwise, you would end up with more rows in dat than expected.
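To see why duplicate keys would be a problem, here is a tiny, self-contained illustration (toy data, not part of our datasets): when the right-hand table has two rows with the same key, left_join() returns both matches, so the result has more rows than the left-hand table.
# Toy example: a duplicate key on the right-hand side inflates the row count
a <- tibble(id = c(1, 2), x = c("a", "b"))
b <- tibble(id = c(1, 1), y = c("first", "second"))
left_join(a, b, by = "id") # 3 rows, because id 1 matches twice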
course_minutes <-
course_minutes %>%
rename(student_id = Bb_UserPK,
course_id = CourseSectionOrigID)

course_minutes <-
course_minutes %>%
# Change the data type for student_id in course_minutes so we can match to
# student_id in dat
mutate(student_id = as.integer(student_id))

dat <-
dat %>%
left_join(course_minutes,
by = c("student_id", "course_id"))
Note that they’re now combined, even though the course data had many more rows. The pre-survey data has been joined for each student-course combination.
Finding Distinct Cases at the Student-Level
If a student was enrolled in two courses, they will have a different final grade for each of those two courses. However, our data in its current form has many rows representing each course. Before keeping only the distinct student-course rows, we can visually inspect the data to see how those rows differ.
glimpse(dat)
# Distinct gradebook items in the data
distinct(dat, Gradebook_Item)

# Distinct combinations of course and gradebook item
distinct(dat, course_id, Gradebook_Item)

# Keep only one row per student per course, retaining all other columns
dat <-
distinct(dat, course_id, student_id, .keep_all = TRUE)
dat <- rename(dat, final_grade = FinalGradeCEMS)
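As a final check that distinct() left us with exactly one row per student per course, we can count the student-course combinations and look for any that appear more than once. If everything worked, this returns zero rows.
# Verify there is exactly one row per student-course combination
dat %>%
  count(student_id, course_id) %>%
  filter(n > 1)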
Analysis
We focus on some initial analyses in the form of visualizations and a regression model.
The Relationship between Time Spent on Course and Final Grade
One thing we might be wondering is how the time spent on the course is related to students’ final grades.
Let’s make a plot to depict that relationship.
dat %>%
# aes() tells ggplot2 what variables to map to what feature of a plot
# Here we map variables to the x- and y-axis
ggplot(aes(x = TimeSpent, y = final_grade)) +
# Creates a point with x- and y-axis coordinates specified above
geom_point(color = dataedu_colors("green")) +
theme_dataedu() +
labs(x = "Time Spent",
y = "Final Grade")
There appears to be some relationship. Let’s add a line of best fit based on a linear model.
dat %>%
ggplot(aes(x = TimeSpent, y = final_grade)) +
geom_point(color = dataedu_colors("green")) + # same as above
# this adds a line of best fit
# method = "lm" tells ggplot2 to fit the line using linear regression
geom_smooth(method = "lm") +
theme_dataedu() +
labs(x = "Time Spent",
y = "Final Grade")
Looking at this plot, it appears that the more time students spent on the course, the higher their final grade is.
Based upon the trend observable in the data, the line of best fit predicts that students who spend a large enough amount of time on the course earn a final grade greater than 100. Of course, this is not possible. This highlights the importance of understanding your data and interpreting lines of best fit (and other, more sophisticated analyses) carefully, keeping that understanding and knowledge in mind as you present and make sense of the results.
Linear Model (Regression)
We can quantify the relationship between these two variables, time spent on the course and students’ final grades, using a linear model.
m_linear <-
lm(final_grade ~ TimeSpent, data = dat)

# Generating table output with a function from the {sjPlot} package, tab_model()
tab_model(m_linear,
title = "Table 1.1")

# apa.reg.table(m_linear, filename = "regression-table-output.doc")
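If you prefer base R output to the {sjPlot} table, summary() shows the same coefficients. And to revisit the over-100 predictions we noticed in the plot, predict() lets you see the fitted value at a large amount of time spent; the 5000 below is a made-up value purely for illustration.
# Base R view of the coefficients and R-squared
summary(m_linear)

# Predicted final grade at a hypothetical (made-up) value of TimeSpent
predict(m_linear, newdata = data.frame(TimeSpent = 5000))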
This is the end of Part 2. I hope to start Part 3 soon. References can be found in Part 1 of this series.