Machine Learning Ideas: Separating background jobs with machine learning

Javier Segovia
lemontech-engineering
9 min read · Nov 20, 2017

In computer programs it is common to trigger or schedule tasks that should be performed without user intervention. These kinds of tasks could be: logging, system monitoring, scheduling, user notification, etc. These tasks do not need to be executed immediately, and in some cases the end user will never notice them; for that reason, they should be executed separately from the current thread.

A background or asynchronous job (or task) is one that is processed outside of the usual request/response workflow that is part of any modern web framework. Normally, web applications receive a request from the outside world, do some processing (such as querying a database) and immediately return a response within a few milliseconds. This is the normal pattern that we have all become accustomed to when developing applications for the web, and is known as synchronous communication.

Asynchronous tasks on the other hand, are those that may be started from a normal web request, but require a longer time to complete than the normal request. Because these requests cannot be processed immediately and return a response, they are known as asynchronous. In order to avoid interrupting the normal synchronous workflow of an application, asynchronous tasks are normally processed on a separate thread or are spawned as a separate process entirely.

These asynchronous jobs are added to job queues: data structures, maintained by a job scheduler, that contain the jobs to run. The goal of this project is to separate the jobs into different job queues, grouping them by similarity, which helps us use our servers’ resources wisely by assigning each queue to a machine that fits the computational requirements of its jobs.
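For illustration only (not necessarily our stack), here is a minimal sketch of named job queues in Python using the `rq` library backed by Redis; `process_task` and the queue names are hypothetical:

```python
# Minimal sketch, assuming a Python stack with Redis and the `rq` library.
# `process_task` and the queue names are hypothetical examples.
from redis import Redis
from rq import Queue

redis_conn = Redis()

# One queue per group of similar jobs; each queue is consumed by workers
# running on machines sized for that group's computational needs.
light_queue = Queue("light-jobs", connection=redis_conn)
heavy_queue = Queue("heavy-jobs", connection=redis_conn)


def process_task(task_id):
    """Placeholder for the actual work performed by the background job."""
    ...


# A quick notification-style job goes to the light queue...
light_queue.enqueue(process_task, 42)
# ...while a long-running job goes to the heavy queue.
heavy_queue.enqueue(process_task, 43)
```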

Problem Statement

We were having difficulties scaling our background jobs: we have to process many tasks that are categorically similar, but each one can use external services, run heavy processes, or call a service located in a different country, among other things that make it unique or similar to other tasks. On top of that, our customers are not subscribed to every task, which means tasks that consume few resources and complete quickly end up in the same queue as tasks that take a long time to finish.

How did we solve it?

Simple: we replicated our server until we could complete our tasks within N time.

This solution is not good enough for two main reasons:

  • It’s not optimal: we have to deploy our services on machines with the same specs, even though some tasks need almost no computational power and others do.
  • It’s not fair: customers who are subscribed to fewer, lower-complexity tasks end up in the same queue as everyone else’s diverse tasks.

Because of that, we started to design a solution to get the best result.

First Approach: Original Solution

The goal of the solution is to cluster these tasks and serve them from different queues, each with the specific characteristics needed to scale efficiently.
As inputs, we are given information such as the customer reference, the type of task, where the service requested by the task is located, whether the task performs an external process, whether it performs a heavy process, the count of this kind of task for each customer, and the average duration to complete the task. Our original solution was to separate the tasks by different criteria (sketched in code after the list):
• Split the tasks by country.
• Split every subgroup by use of external processes.
• We observed that Task #11 from Chile had too many jobs to process, which led us to decide that this task should be separate from the rest.
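A rough sketch of that rule-based routing (the field names and queue names are hypothetical; the special case for Task #11 comes from the observation above):

```python
def assign_queue(task):
    # Task #11 from Chile had too many jobs to process, so it gets its own queue.
    if task["country"] == "Chile" and task["task_type"] == 11:
        return "chile-task-11"
    # Otherwise: split by country, then by use of external processes.
    suffix = "external" if task["external_process"] else "internal"
    return f"{task['country'].lower()}-{suffix}"


assign_queue({"country": "Chile", "task_type": 11, "external_process": True})
# => "chile-task-11"
```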

This solution was developed after analyzing our clients and data and understanding what we had and how our service was behaving. We realized it could work perfectly for the moment, but because we are constantly adding new tasks and gaining more customers, it would soon become obsolete and we would have to repeat the whole analysis to design another solution that fits the new data.

Final Solution: Machine Learning

Data Exploration

As part of this problem, I collected all task data (178,571 records) and grouped it by customer, producing a dataset of 1,503 records. Each record is a row in the CSV file and has several features: the customer reference, the task type, the country where the requested service is located, whether the task performs an external process, whether it performs a heavy process, the count of that task type for the customer, and the average duration to complete it.

This means we have a total of six (out of seven) features that we can train on: two categorical, two boolean, and two numerical.
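As a minimal sketch (the file name and column names are assumptions based on the feature list above), loading and inspecting the raw dataset could look like this:

```python
import pandas as pd

# Hypothetical file and column names, mirroring the features described above.
df = pd.read_csv("tasks_by_customer.csv")
print(df.shape)   # expected: (1503, 7)
print(df.dtypes)
# customer            object    categorical (identifier, not trained on)
# task                object    categorical
# country             object    categorical
# external_process    bool
# heavy_process       bool
# count               int64
# avg_duration        float64
```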

Data Visualization

After creating a lot of plots and graphs, I got a better understanding of what we had, how each task is distributed, and where the limits for each one are.

Algorithms and Techniques

The algorithm I used to create the clusters was K-means clustering, a method of vector quantization that is popular for cluster analysis in data mining. It works by partitioning the input points (observations) into K distinct sets (clusters). The idea of the algorithm is as follows:

First, K initial centroids are selected, usually at random. Starting from these, the algorithm proceeds by repeating two main steps:

1. Cluster Assignment: each observation (each point in the dataset) is assigned to the cluster centroid such that the Within-Cluster Sum of Squares (WCSS) objective function is minimized. For most distance metrics this amounts to assigning each observation to its closest centroid.

2. Update Centroids: once every observation has been assigned to a centroid, each centroid is recomputed as the mean of the observations assigned to its cluster.

These steps repeat until the algorithm ‘converges’. Convergence can be detected in several ways, but the most common is to run until no observation changes cluster membership.
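A minimal NumPy sketch of those two steps and the convergence check (illustrative only; it uses Euclidean distance and doesn’t handle empty clusters):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm sketch: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Initial centroids: K points picked at random from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 1. Cluster assignment: each point goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2. Update centroids: mean of the points assigned to each cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Converged: centroids stopped moving, so assignments won't change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```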

I chose this algorithm because it is easy to understand, easily adaptable, works well on small or large datasets, and is fast and efficient. Its main drawback is that we need to choose the number of clusters up front, which can be addressed with several techniques; also, since the underlying clustering problem is NP-hard, it is not recommended if you are looking for performance with a lot of data. For this problem, the data is small after preprocessing, so that drawback doesn’t apply.
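One of those techniques is the elbow method: fit the model for a range of K values and look for the point where the WCSS (scikit-learn’s `inertia_`) stops dropping sharply. A sketch, assuming `X` is the preprocessed feature matrix:

```python
from sklearn.cluster import KMeans

wcss = []
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)  # within-cluster sum of squares for this K

# Choose the K at the "elbow": the point where adding another cluster
# stops reducing the WCSS significantly.
```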

Data Preprocessing

To convert the data into a format our model can use effectively (only numerical features are supported), the “country” feature must be transformed and the whole dataset must be grouped by task.

To group the data by task, I created different datasets to merge later:

• Task’s count: the sum of each count per task type.
• Average duration: the mean of avg_duration for each task type.
• Discrete values: the unique combinations of the discrete values (task, country, external process, heavy process).

Then I did some feature engineering to create new features:

• Customers by task: how many customers have each task type.
• Total duration: each task’s count multiplied by its average duration; it represents how long it takes to complete all tasks of a given type.

Then all these datasets can be merged into a single one.

Because the new feature Total Duration is the product of two other features, I decided to drop its multipliers (count and avg_duration) to keep one feature of that kind instead of three.

Because the country values are strings, I needed to transform them into numerical values, so I converted them into dummy variables.
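Putting those steps together, a pandas sketch of the preprocessing might look like this (the column names are assumptions mirroring the features described earlier):

```python
import pandas as pd

raw = pd.read_csv("tasks_by_customer.csv")  # hypothetical file/column names

# Group by task type: total count, mean duration and number of customers.
counts = raw.groupby("task", as_index=False)["count"].sum()
avg_dur = raw.groupby("task", as_index=False)["avg_duration"].mean()
customers = (raw.groupby("task", as_index=False)["customer"].nunique()
                .rename(columns={"customer": "customers"}))

# Discrete values: unique combinations of the discrete features.
discrete = raw[["task", "country", "external_process",
                "heavy_process"]].drop_duplicates()

# Merge everything into a single dataset, one row per task.
tasks = (discrete.merge(counts, on="task")
                 .merge(avg_dur, on="task")
                 .merge(customers, on="task"))

# Feature engineering: total time spent per task type...
tasks["total_duration"] = tasks["count"] * tasks["avg_duration"]
# ...then drop its multipliers to keep one feature of that kind instead of three.
tasks = tasks.drop(columns=["count", "avg_duration"])

# Country is a string, so convert it into dummy (one-hot) variables.
tasks = pd.get_dummies(tasks, columns=["country"])
```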

In the data above, we can observe how distant some column values are from others, for example the boolean ones versus the continuous ones. This skews the calculation and plays down some features’ importance, so we need to scale them using a Standard Scaler, which standardizes features by removing the mean and scaling to unit variance. The image below shows a sample of the scaled data.
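A sketch of the scaling step with scikit-learn, assuming the merged `tasks` DataFrame from above with the task identifier set aside:

```python
from sklearn.preprocessing import StandardScaler

# Remove the mean and scale to unit variance, feature by feature, so that
# booleans and large continuous values contribute comparably to distances.
features = tasks.drop(columns=["task"])          # keep only numeric columns
scaled = StandardScaler().fit_transform(features)
```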

The algorithm I chose to solve this problem depends on a distance measure: points near each other end up in the same cluster, and points far apart end up in different clusters. But distance measures do not work very well in high-dimensional spaces, so I reduced the dimensionality of the data by applying Principal Component Analysis (PCA), selecting the most useful components by their explained variance.

As we can see, the first six components are the most useful, but I picked the first four to keep things compact. Before reducing the dimensionality our data’s shape was (26, 35); now it is (26, 4).
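A sketch of that reduction, assuming `scaled` is the standardized matrix from the previous step:

```python
from sklearn.decomposition import PCA

# Inspect how much variance each component explains before choosing
# how many components to keep.
pca = PCA().fit(scaled)
print(pca.explained_variance_ratio_.cumsum())

# Keep the first 4 components: the data goes from (26, 35) to (26, 4).
reduced = PCA(n_components=4).fit_transform(scaled)
```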

Results

Before analyzing the results, let’s recap our original solution, which separated the tasks by different criteria:

• Split the tasks by country.
• Split every subgroup by use of external processes.
• We observed that Task #11 from Chile had too many jobs to process, which led us to decide that this task should be separate from the rest.

This solution was developed after analyzing our service by client and by data, understanding what we had and how it was behaving. Our original solution is similar to our final one; let’s compare the results:

Comparing the results with our benchmark model (our hand-crafted original solution), we can see that each cluster contains only one country, Task #11 is alone, and the tasks that use an external process are separated from the rest, as we expected. The exception is Colombia’s tasks, which are split by whether they perform a heavy process, but that’s fine because that feature wasn’t included in the original solution.

Comparing the results with random data:

We can observe that our model is robust enough: it groups almost all the sample data into the corresponding clusters, just as our final model does. The exception is Task #25 in cluster 8, which doesn’t make sense; I’d expect it to be assigned to cluster 3 (because of its country) or to stand alone, like clusters 4 and 9, because it uses a heavy and external process.

Conclusion

This solution is about creating an unsupervised model that clusters the data and separates job queues by cluster. To achieve that, I first had to create a pipeline to preprocess the data: convert all non-numeric values to numeric, group the data by task type, do some feature engineering, and remove unnecessary features. Then I trained the model to assign a cluster to each task.

This solution automates the analysis work our team used to do to understand our processes and pick the best way to separate tasks into different job queues, saving us a lot of time.

Machine learning is a method of data analysis that automates analytical model building. It can be applied to almost anything we can imagine, and it is especially useful for building solutions with algorithms that iteratively learn from data: machine learning allows computers to find hidden insights without being explicitly programmed where to look.

If you have enough data, try building a model instead of coding complex algorithmic solutions that have to be re-tuned as time passes, and have fun with it!

Javier Segovia
lemontech-engineering

Software Engineer, Game Dev, AI enthusiast (In Skynet We Trust), hobbyist photographer and sarcasm native language speaker. CTO at sosafeapp.com