How I Used Machine Learning to Predict My Productivity

Ankur Boyed · Beavr Labs · Jul 4, 2023

Introduction

I’ve been tracking my productivity on all my devices for the past three years using a platform called RescueTime. While this app has been useful in helping me analyze my day-to-day productivity, I’ve gained no insight regarding my trends in productivity. Having recently delved into the field of AI, I wanted to find patterns and, if possible, predict whether or not I’d be productive in a given hour.

Tools and Libraries

I used several Python libraries: pandas for data manipulation, requests for API calls, scikit-learn for the Random Forest Classifier, and matplotlib for visualizations. The data was collected from the RescueTime API.

Data Collection

The first step was to gather the data. I scraped productivity data for the years 2020, 2021, and 2022 from RescueTime, a service that helps keep track of computer usage and categorizes it into different productivity levels. I didn’t manage to get any reliable time tracking data for 2023.

I used the requests library to fetch the hour-by-hour usage for any app I was using on my laptop and stored the data in csv files. This gave me a total of 150,678 records which I could analyze. Check out the code for more details on how I did this.
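
For reference, here's a minimal sketch of what that fetch could look like. It assumes the endpoint and parameter names of RescueTime's Analytic Data API (perspective=interval, resolution_time=hour); the API key and file names are placeholders, and the real script in the repo may differ.

```python
import requests
import pandas as pd

API_KEY = "YOUR_RESCUETIME_API_KEY"  # placeholder

def fetch_year(year: int) -> pd.DataFrame:
    """Pull hourly, per-activity usage for one calendar year."""
    params = {
        "key": API_KEY,
        "perspective": "interval",   # time-ordered rows instead of ranked totals
        "resolution_time": "hour",   # one row per activity per hour
        "restrict_begin": f"{year}-01-01",
        "restrict_end": f"{year}-12-31",
        "format": "json",
    }
    resp = requests.get("https://www.rescuetime.com/anapi/data", params=params)
    resp.raise_for_status()
    payload = resp.json()
    # The JSON response carries column names in "row_headers" and data in "rows".
    # Depending on API limits, you may need to request smaller date chunks.
    return pd.DataFrame(payload["rows"], columns=payload["row_headers"])

for year in (2020, 2021, 2022):
    fetch_year(year).to_csv(f"rescuetime_{year}.csv", index=False)
```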

RescueTime tracks how long you spend on a type of task on an hourly basis, so the data tells me how long I spent on a given activity and its productivity score for every hour.

Data Preprocessing

After collecting the data, the next step was to clean it and prepare it for the machine learning model.

I removed irrelevant columns from the data such as ‘Number of People’ and ‘Activity’.

The “Productivity” column is what I needed the model to predict. I wanted it to output 1 if it thinks I’ll be productive during the hour I provide it, -1 if it thinks I’ll be unproductive, or 0 if it’s neither (i.e. I’m sleeping). My “productivity” label is calculated by multiplying “Time Spent (Seconds)” by “Productivity” for each activity and summing these values across a given hour. For example, if at 8:00am on a given day I spent 300 seconds on a task with a productivity score of 1 and 400 seconds on a task with a productivity score of -1, then my productivity score would be 1*300 + (-1)*400 = -100. I then take the sign of this result, so -100 becomes -1, meaning it was an unproductive hour.
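
As a rough sketch of that labelling step (the column names below follow the exported RescueTime data and may need adjusting to match your own CSV), the weighted sum and the sign mapping look something like this:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("rescuetime_2022.csv", parse_dates=["Date"])

# Weight each activity's productivity score by the seconds spent on it,
# then sum within each hour and keep only the sign:
# 1 = productive hour, -1 = unproductive hour, 0 = neither.
df["weighted"] = df["Time Spent (Seconds)"] * df["Productivity"]

hourly = (
    df.set_index("Date")["weighted"]
      .resample("H").sum()
      .rename("score")
      .reset_index()
)
hourly["label"] = np.sign(hourly["score"]).astype(int)  # e.g. 1*300 + (-1)*400 = -100 -> -1
```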

A problem with this data is that it has no rows at all for hours with no tracked activity, such as when I’m sleeping. To fix this, I first created a dataframe covering every hour with productivity values filled with zeroes and then merged in the RescueTime data I collected previously.
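
Roughly, reusing the hourly dataframe from the previous step, that gap-filling might look like this (the date range is a placeholder):

```python
import pandas as pd

# Build a complete hourly index for the tracked period, merge in the recorded
# hours, and mark every hour with no data (e.g. sleep) as 0.
full = pd.DataFrame({"Date": pd.date_range("2020-01-01", "2022-12-31 23:00", freq="H")})
merged = full.merge(hourly[["Date", "label"]], on="Date", how="left")
merged["label"] = merged["label"].fillna(0).astype(int)
```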

Finally, I extracted several important features from the timestamps, such as whether it is on a weekend, whether it is during work hours, the hour, the day of the week, and the day of the month.
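
In pandas, those features can be derived straight from the timestamp. The 9am–5pm window I use below for “work hours” is my own assumption, not necessarily the one in the original notebook:

```python
merged["hour"] = merged["Date"].dt.hour
merged["day_of_week"] = merged["Date"].dt.dayofweek          # 0 = Monday, 6 = Sunday
merged["day_of_month"] = merged["Date"].dt.day
merged["is_weekend"] = (merged["day_of_week"] >= 5).astype(int)
merged["is_work_hours"] = merged["hour"].between(9, 17).astype(int)  # assumed 9am-5pm
```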

Model Creation

After the data was prepared, it was time to build the ML model. I chose the RandomForestClassifier from the sklearn library for this task.

The data was split into a training set (70%) and a test set (30%). The model was then trained on the training set.
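
A minimal version of that setup, using the features built above (the random_state values are arbitrary, just there to make runs reproducible):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

FEATURES = ["hour", "day_of_week", "day_of_month", "is_weekend", "is_work_hours"]
X, y = merged[FEATURES], merged["label"]

# 70/30 split, then fit the forest on the training portion.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
```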

Model Evaluation and Predictions

After the model was trained, I used it to make predictions on the test set and compared these predictions to the actual values. The accuracy of the model was 70%, which is pretty good given how few features I could extract from the data.
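
The evaluation itself is only a couple of lines with scikit-learn:

```python
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")  # roughly 0.70 in the run described above
```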

Finally, I used the model to predict my productivity for the next 72 hours from a specific timestamp. The predictions were plotted over time to give a visual representation of how my productivity might change over the next few days.
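
A sketch of that forecasting step, reusing the same feature logic as before (the start timestamp is a placeholder):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Build feature rows for the next 72 hours, predict each hour's label, and plot it.
start = pd.Timestamp("2023-01-01 00:00")  # placeholder start time
future = pd.DataFrame({"Date": pd.date_range(start, periods=72, freq="H")})
future["hour"] = future["Date"].dt.hour
future["day_of_week"] = future["Date"].dt.dayofweek
future["day_of_month"] = future["Date"].dt.day
future["is_weekend"] = (future["day_of_week"] >= 5).astype(int)
future["is_work_hours"] = future["hour"].between(9, 17).astype(int)

future["prediction"] = model.predict(future[FEATURES])

plt.figure(figsize=(12, 4))
plt.step(future["Date"], future["prediction"], where="mid")
plt.yticks([-1, 0, 1], ["unproductive", "neither", "productive"])
plt.xlabel("Time")
plt.title("Predicted productivity over the next 72 hours")
plt.tight_layout()
plt.show()
```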

The graph shows that the model can correctly identify some basic trends in my productivity, such as my downtime during non-work hours between 12am and 8am, and my general productivity dips in the evening. However, the discrete nature of the model’s output hinders the amount of insight it provides — a barely productive hour is treated the same as a very productive hour.

How the Classifier Works

The Random Forest Classifier is a beefier version of the Decision Tree. Decision Trees repeatedly split the samples into groups based on whether or not they meet a condition on a feature (e.g. the day of the week is greater than 5). After a series of these splits, each group reaches a leaf node, which contains the prediction associated with that group. For example, a group whose day of the week is greater than 5 and whose hour is greater than 12 might be predicted to have a productivity value of -1.

An example decision tree. Courtesy of HackerEarth

The issue with Decision Trees is that we need the right number of “splits” to come to a valid conclusion. Too many splits and the classifier overfits: it latches onto quirks of the training data that don’t generalize, making it unreliable on the test data. Too few splits and the classifier underfits and can’t capture the real patterns in the data.

To fix this issue, we use Random Forests, which build many Decision Trees on random subsets of the data and features and then aggregate their predictions (for classification, by majority vote).
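
To make the splitting idea concrete, here’s a small sketch (on the training data from earlier, with arbitrary depth and tree counts) that prints the conditions of a single shallow tree and then builds a forest of such trees:

```python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier

# One shallow tree: each node tests a single feature (e.g. "hour <= 8.5") and
# each leaf holds the class predicted for samples that end up there.
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X_train, y_train)
print(export_text(tree, feature_names=FEATURES))

# A forest builds many such trees on random subsets of rows and features and
# aggregates their predictions (majority vote for classification).
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)
```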

Improvements

Alternative Models: My current model is a RandomForest, which has been a solid choice, but I’m curious to see how other models might fare with this kind of time series data. A model like ARIMA (Autoregressive Integrated Moving Average), which was built specifically for this type of data, could be a game-changer for my productivity predictions.

Deep Learning: I’ve read a lot about deep learning models, especially LSTM (Long Short Term Memory) and GRU (Gated Recurrent Units), and how they can be incredibly effective with time series data. It might be time for me to dive in and explore these models for myself.

Cross Validation: Up until now, I’ve used a simple train/test split. But to make my model even more robust and reliable, I could introduce cross-validation during the model training process. Techniques like K-Fold cross-validation could help ensure my model performs well not just on my test set, but on any new data it encounters.
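
For example, with scikit-learn (note that for strictly temporal data, TimeSeriesSplit would be the more careful choice than a shuffled K-Fold):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# 5-fold cross-validation: each fold takes a turn as the held-out set, giving a
# less split-dependent estimate of accuracy than a single 70/30 split.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)
print(f"{scores.mean():.2f} +/- {scores.std():.2f}")
```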

Feature Engineering: So far, I’ve been extracting features from the date and time, but I’m starting to think there could be other relevant features hiding in plain sight. I’m considering delving deeper into the nature of my tasks and looking for correlations between my productivity and different time slots.

Hyperparameter Tuning: There’s always the possibility of making my model perform better by tuning the hyperparameters. With something like GridSearchCV or RandomizedSearchCV, I could find the sweet spot — the optimal parameters for my model — which could give a real boost to my accuracy.
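
A sketch of what that tuning could look like; the grid values are arbitrary starting points, not the result of any actual search:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```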

Conclusions

This project was a fun way to unearth new trends in my productivity. However, this is a very basic project, and there is plenty of room for improvement to ensure the validity of its predictions.

The code can be found here: GitHub. Instructions are provided so you can run the Jupyter Notebooks yourself and forecast your productivity :)
