A nonlinear model to answer two questions from the data set “Web IQ, Productivity, Being Informed (Sept. 12–18, 2014)”
The report accompanying the dataset “Web IQ, Productivity, Being Informed (Sept. 12–18, 2014)” [Pew Research Center, 1] describes some statistics of internet users and the role that digital technology is reported to play in their work lives.
Of particular interest is the comparison between “office-based” and “non-office-based” workers and how their productivity is affected by internet usage.
While many key points are already addressed with parametric statistics, two additional questions could be addressed by harnessing machine learning:
i) Is there anything distinctive in the way ‘office-based’ and ‘non-office-based’ employees respond to this questionnaire?
ii) What happens to productivity when employers restrict access to certain websites?
Answering the first question is a matter of capturing the structure in the data and seeing which factors are important for grouping the employees’ answers as coming from office- or non-office-based workers. Given the nature of the dataset (survey data), a Random Forest could be a good place to start (number of trees = 15). This model should tell us if there is a trend.
Data preparation included discarding responses from non-employed users. Data normalisation and regularization were not necessary with the chosen approach. The data were partitioned into 90% for training and 25% for testing, with a 15% overlap between the two sets to make up for the small number of data points. This choice isn’t best practice for the class of non-linear model chosen (tree bagger), as test points already seen during training can mask overfitting and inflate the reported accuracy; yet, let’s go ahead for the purposes of this investigation.
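To make the overlap concrete, here is a minimal sketch of such a partition. It is not the code used for the analysis: the function name `overlapping_split`, the seed and the row count of 200 are hypothetical, and it assumes the two sets are simply taken from opposite ends of one shuffled list of row indices.

```python
import random

def overlapping_split(n_rows, train_frac=0.90, test_frac=0.25, seed=0):
    """Return (train_idx, test_idx) taken from opposite ends of one shuffled index list."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(train_frac * n_rows)
    n_test = round(test_frac * n_rows)
    train_idx = indices[:n_train]          # first 90% of the shuffled rows
    test_idx = indices[n_rows - n_test:]   # last 25% of the shuffled rows
    return train_idx, test_idx

# Hypothetical row count, chosen only for illustration.
train_idx, test_idx = overlapping_split(200)
shared = set(train_idx) & set(test_idx)
print(len(train_idx), len(test_idx), len(shared))  # 180 50 30
```

The 30 shared rows (15% of 200) are exactly the ones a tree ensemble can memorise, which is why the accuracy measured on such a test set should be read as optimistic.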
The prediction accuracy of our classifier is 82.6%.
Let me explain what this means. Since one can predict from all the other answers whether a user is office-based or not, we’ve partially answered whether these two groups responded differently to the questionnaire. They do, and the classifier tells them apart with an accuracy of 82.6%. But we haven’t quite answered which features are more important.
Several approaches could be pursued at this stage (e.g. PCA or correlation coefficients), but the accuracy of our non-linear model suggests we can infer the importance of each feature from the increase in prediction error when that feature is permuted. Features that don’t affect the prediction error when permuted are less important, and can be dropped from the model. The permutation analysis reveals that using as few as 20 features (a roughly fourfold reduction in dimensionality) is enough to maintain good classification performance.
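The permutation idea can be sketched in a few lines. This is an illustration under stated assumptions rather than the analysis code: `toy_predict` is a hypothetical stand-in for the fitted forest, the two-feature data are randomly generated, and importance is measured here as the mean drop in accuracy over ten shuffles.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)

def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle one column across rows, leaving all other columns intact.
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical stand-in for a fitted classifier: it only looks at feature 0.
def toy_predict(row):
    return int(row[0] > 0.5)

# Synthetic data, not survey responses.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [toy_predict(row) for row in rows]
scores = permutation_importance(toy_predict, rows, labels)
print(scores)
```

In this toy setup, shuffling feature 0 destroys the prediction while shuffling feature 1 changes nothing, so only the first score is non-zero. Ranking features by such scores and keeping the top ones is the kind of reduction described above.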
Ultimately, this is what we learned from looking into the 20-feature subset:
Some of the factors that predict office- and non-office-based workers are nearly collinear with that distinction, such as the answer to Q6a (“How often do you work outside your workplace?”). In other cases, there might be an intuitive connection between working remotely and, for instance, the answer to Q9 (“How important is a landline phone to your job?”). In yet other cases the relationship is less intuitive, such as the answer to whether or not Moore’s Law applies to transistors (Q39): if you say yes, then you probably work at home more.
Overall, while these features alone are poor predictors, they can be used successfully in the nonlinear model described above for classification of office/non-office based workers.
Finding an answer to the second question (the effect of internet restrictions on productivity at work) is perhaps harder. This is mostly due to the self-reported nature of the data, especially in evaluating productivity, and partially due to non-specificity of the questions.
Responses were analysed from three questions in the survey that could plausibly be causally related. The questions are (paraphrasing):
- Q7: How important is internet to your job?
- Q14: How important is internet when working remotely?
- Q26: How has internet affected your productivity at work?
These responses were used to reveal possible differences between two groups: i) those whose employers apply restricted access to certain websites at work (in green), and ii) those whose employers do not (in grey) (from Q15: Does your company block your access to certain websites while you are at work?).
While Q7 does not show any significant group difference in the importance of internet for work, Q14 reveals a statistically significant prevalence of respondents in the internet-restricted category saying that yes, internet is very important when working remotely (Fisher’s exact test, p<0.01). Conversely, those who do not have internet restrictions at work find internet less essential when working remotely.
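For readers unfamiliar with the test, here is a minimal sketch of a two-sided Fisher’s exact test on a 2x2 table. It assumes the responses are collapsed into two groups (restricted or not) by two answers (very important or not), which may not match how the test was actually run, and the counts in the usage line are a toy table, not the survey’s.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def table_prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    observed = table_prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Sum every table that is no more likely than the observed one
    # (small tolerance guards against floating-point ties).
    p_value = sum(
        table_prob(k)
        for k in range(k_min, k_max + 1)
        if table_prob(k) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)

# Toy counts for illustration only, not taken from the survey.
p = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p, 4))  # 0.4857
```

The exact test is a reasonable choice for survey subgroups because it does not rely on large cell counts, unlike the chi-squared approximation.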
Remarkably, however, this has no effect on (self-reported) productivity (Q26) for either group (cf. figure to the left).
If one were pressed to summarize this in a sentence, it could be that restricting internet access at work does not affect self-reported productivity, and it may make remote work more reliant on internet. Perhaps because remote work is unrestricted.
(Slight caveat: these differences could arise from the job typology, whereby those who work at home also have fewer reasons to be subject to internet restrictions, e.g. in the case of freelancers. And this is why one should never rush to unsubstantiated conclusions…).
[1] “Web IQ, Productivity, Being Informed”, Pew Research Center, Washington, D.C. (September 12–18, 2014). http://www.pewinternet.org/datasets/sep-12-18-2014-web-iq-productivity-being-informed/
The present analysis was conducted by the author using the data available from the Pew Research Center. Pew Research Center bears no responsibility for the interpretations presented or conclusions reached based on analysis of the data.