Crunching app survey data with logistic regression in R
“It does other things,” says an exasperated Jerry in the sitcom Seinfeld, as his father labors under the notion that a personal digital assistant is nothing more than a tip calculator. While the scene is designed to get laughs, it nevertheless illustrates that creating technology is one thing, but understanding how people actually use it is another.
Each year, S&P Global Market Intelligence (my employer) surveys people to learn more about how they use financial technology. More specifically, we survey users of mobile banking and payment apps. This year’s survey revealed many findings, but one of the more surprising ones was that not only did most bank app users still visit physical bank branches, but the more someone visited a branch the more likely he or she was to be a frequent app user.
To get at this, we analyzed our data in R with the handy “survey” package. First, we pulled in the data from a .csv file and loaded the survey package library.
data <- read.csv("survey_2017.csv")
library(survey)
The data set consisted of 4,000 observations and 191 variables, with the observations being the people who took the survey and the variables, for the most part, being the questions they answered (some variables were used for other purposes, such as unique identifiers).
Next, we used the survey package to apply weightings to our data. Our survey_2017.csv file contained an extra column at the end, entitled “WT,” which had the weight for each row (the “responseid” variable is a unique identifier for each row in our spreadsheet).
app <- svydesign(ids = data$responseid, data = data, weights = data$WT)
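To make the effect of weighting concrete, here is a minimal base-R sketch. The toy data frame, its values, and the "daily_user" column are made-up assumptions for illustration, not survey results; only the "WT" and "responseid" column names mirror our file.

```r
# Toy data: hypothetical values, not taken from the survey
toy <- data.frame(
  responseid = 1:5,
  daily_user = c(1, 0, 1, 1, 0),           # hypothetical yes/no answer
  WT         = c(0.5, 2.0, 1.0, 0.5, 1.0)  # weight for each respondent
)

# Unweighted share: every row counts equally
unweighted_share <- mean(toy$daily_user)

# Weighted share: each row counts in proportion to its weight
weighted_share <- weighted.mean(toy$daily_user, toy$WT)

print(c(unweighted = unweighted_share, weighted = weighted_share))
```

In this toy case the share drops from 0.6 to 0.4 once the weights are applied, because the heavily weighted respondent answered "no." The survey package applies the same idea, and additionally uses the design when computing standard errors.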
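Then, we created a binary variable called “active” based on whether the survey taker used his or her app on a daily basis. A response of “one” or “two” to question nine indicated once a day or more than once a day, so we assigned these a value of one and the other responses a value of zero. Because the design object keeps its own copy of the data, the new variable also has to be added to “app” so that the model can find it.
data$active <- ifelse(data$q9 == 1 | data$q9 == 2, 1, 0)
app <- update(app, active = ifelse(q9 == 1 | q9 == 2, 1, 0))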
Once we had this, we performed a logistic regression with “active” as the dependent variable. The syntax in the survey package is slightly different from the traditional “glm” function in R, but not too far off. For the sake of illustration, we have only included a few independent variables.
model <- svyglm(active ~ q3 + q6 + q18 + q19, family = binomial, design = app)
summary(model)
Based on our summary output, we can see that question six, which asks how often the user goes to a physical branch, appears to be significant. Its coefficient is positive (the intercept, by contrast, is negative), indicating that someone who goes to a branch more often is more likely to be a frequent app user, holding the other variables constant.
Call:
svyglm(formula = active ~ q3 + q6 + q18 + q19, family = binomial,
    design = app)

Survey design:
svydesign(ids = data$responseid, data = data, weights = data$WT)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.119293   0.184154  -6.078 1.33e-09 ***
q3          -0.055435   0.058635  -0.945    0.345
q6           0.177783   0.016629  10.691  < 2e-16 ***
q18         -0.008664   0.036386  -0.238    0.812
q19          0.062604   0.063457   0.987    0.324
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1.001364)

Number of Fisher Scoring iterations: 4
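Logistic regression coefficients are on the log-odds scale, so it can help to convert the q6 estimate into an odds ratio. The sketch below copies the estimate and standard error from the q6 row of the output above. The variable names are ours, and the 1.96 multiplier is a normal approximation that we assume is close enough for a sample of this size. It also assumes q6 is treated as numeric, so that "one unit" means one step up the response scale.

```r
# Estimate and standard error for q6, copied from the summary output
est_q6 <- 0.177783
se_q6  <- 0.016629

# Odds ratio: multiplicative change in the odds of being a daily user
# for a one-unit increase in q6
or_q6 <- exp(est_q6)

# Approximate 95% interval using the normal critical value 1.96
# (an assumption; a design-based t reference would differ slightly)
ci_q6 <- exp(est_q6 + c(-1.96, 1.96) * se_q6)

print(round(c(odds_ratio = or_q6, lower = ci_q6[1], upper = ci_q6[2]), 2))
```

This gives an odds ratio of roughly 1.19, with an approximate interval of about 1.16 to 1.23. In other words, each step up in branch-visit frequency is associated with about 19% higher odds of daily app use, holding the other variables constant.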
With this in mind, we compared the percentage of daily app users who went to a bank branch with the percentage of less frequent users who did, and, sure enough, the percentage was higher for daily users.
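A comparison like this one can be sketched in base R as a weighted share within each group. The toy data and the "visits_branch" column below are hypothetical and only illustrate the calculation; they are not our survey figures.

```r
# Toy data: hypothetical values, not taken from the survey
toy <- data.frame(
  active        = c(1, 1, 1, 0, 0, 0),  # 1 = daily app user
  visits_branch = c(1, 1, 0, 1, 0, 0),  # 1 = visited a branch (hypothetical)
  WT            = c(1.0, 2.0, 1.0, 1.0, 1.0, 2.0)
)

# Weighted share of branch visitors within each group of app users
groups <- split(toy, toy$active)
branch_share <- sapply(groups, function(g) weighted.mean(g$visits_branch, g$WT))

# Percentages, named "0" (not daily) and "1" (daily)
print(round(100 * branch_share, 1))
```

In the toy data, 75% of daily users visited a branch versus 25% of the others. With the survey package, a call along the lines of svyby(~visits_branch, ~active, design = app, svymean) should give the same kind of comparison with design-based standard errors, assuming such a variable exists in the design.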
The full survey report is available to S&P Global Market Intelligence clients. If you are interested in obtaining a subscription, click here to request a free trial. I plan to submit this blog post to r-bloggers.com, and will provide a link if it is accepted.