Stories by Vladimir Lazovskiy on Medium

Travel Time Optimization With Machine Learning And Genetic Algorithm

Vladimir Lazovskiy — Mon, 11 Jun 2018 19:58:08 GMT

What is the relationship between machine learning and optimization? — On the one hand, mathematical optimization is used in machine learning during model training, when we are trying to minimize the cost of errors between our model and our data points. On the other hand, what happens when machine learning is used to solve optimization problems?

Consider this: a UPS driver with 25 packages has about 15 septillion (25! ≈ 1.55 × 10^25, or 15 trillion trillion) possible routes to choose from. And if each driver drives just one more mile each day than necessary, the company would be losing $30 million a year.

While UPS would have all the data for their trucks and routes, there is no way they can evaluate that many routes for each driver with 25 packages. However, this traveling salesman problem can be approached with something called the “genetic algorithm.” The issue here is that this algorithm requires some inputs, say the travel time between each pair of locations, and UPS wouldn’t have this information for every pair, since there are far too many combinations of addresses. But what if we use the predictive power of machine learning to power up the genetic algorithm?
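As a quick sanity check on the size of this search space, here is a minimal sketch that assumes a route is simply one ordering of the stops (the function name is illustrative):

```python
import math

def count_routes(n_stops):
    """Number of distinct visit orders for n_stops deliveries."""
    return math.factorial(n_stops)

routes = count_routes(25)
print(f"{routes:.3e}")  # prints 1.551e+25
```

Even at a billion route evaluations per second, checking every order would take far longer than a working day, which is why a heuristic search is attractive here.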

The Idea

In simple terms, we can use the power of machine learning to forecast travel times between each two locations and use the genetic algorithm to find the best travel itinerary for our delivery truck.

The very first problem we run into is that no commercial company would share their data with strangers. So how can we proceed with such a project, whose goal would be to help a service like UPS, without having the data? What data set could we possibly use that would be a good proxy for our delivery trucks? Well, what about taxis? Just like delivery trucks, they are motor vehicles, and they deliver… people. Thankfully, taxi data sets are public because they are provided by city governments.

The following diagram illustrates the design of the project: we start with taxi data, use this data to make predictions for travel times between locations, and then run the genetic algorithm to optimize the total travel time. We can also read the chart backwards: in order to optimize the travel time, we need to know how long it takes to get from one point to another for each pair of points, and to get that information, we use predictive modeling based on the taxi data.

Data And Feature Engineering

For illustration purposes, let’s stick to this Kaggle data set, which is a sample of the full taxi data set provided by the city of New York. Data sets on Kaggle are generally well processed and do not always require much work (which is a downside if you want to practice data cleansing), but it is always important to look at the data to check for errors and think about feature selection.

Since we are given each location’s coordinates, let’s calculate the Manhattan distances between each pair of points and count the longitude and latitude differences to get a sense of direction (East to West, North to South). We can clean up timestamps a little and keep the original features which might look useless to us at first glance.
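As a minimal sketch, assuming coordinates in decimal degrees and a mean Earth radius of 6,371 km, these distance and direction features could be computed as follows. The function names and the choice of haversine legs are illustrative assumptions:

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def manhattan_km(lat1, lon1, lat2, lon2):
    """Approximate Manhattan distance: one east-west leg plus one north-south leg."""
    east_west = haversine_km(lat1, lon1, lat1, lon2)
    north_south = haversine_km(lat1, lon1, lat2, lon1)
    return east_west + north_south

def direction_features(lat1, lon1, lat2, lon2):
    """Signed differences: positive d_lat means heading north, positive d_lon means heading east."""
    return {"d_lat": lat2 - lat1, "d_lon": lon2 - lon1}
```

The signed differences keep the sense of direction that a plain distance throws away, which is the point of adding them as separate features.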

When working with geospatial data, Tableau is a very useful alternative to mapping data points in pandas. A quick preliminary check on the map shows that some drop off locations are in Canada, in the Pacific Ocean, or on Ellis Island (next to the Statue of Liberty), where cars simply don’t go. Removing these points is quite a difficult task without built-in geo-oriented packages, but we can also leave them in the data, since some machine learning models can deal very well with outliers.

In Tableau, we can quickly get a sense of our drop off locations and their density, as well as outliers.

As a bonus, Kaggle conveniently provides data for New York City weather in 2016. This is something we might want to look into in our analysis. But because there are many similar weather conditions—partly cloudy or mostly cloudy—let’s bucket them into major ones to have a smaller variation for these specific features. We can do it in pandas like this:

sample_df["Conditions"] = sample_df["Conditions"].fillna('Unknown')

weather_dict = {'Overcast' : 0, 
                'Haze' : 0,
                'Partly Cloudy' : 0, 
                'Mostly Cloudy' : 0, 
                'Scattered Clouds' : 0, 
                'Light Freezing Fog' : 0,
                
                'Unknown' : 1,
                'Clear' : 2, 
                
                'Heavy Rain' : 3, 
                'Rain' : 3, 
                'Light Freezing Rain' : 3,
                'Light Rain' : 3, 
                
                'Heavy Snow' : 4,
                'Light Snow' : 4,
                'Snow' : 4}

sample_df["Conditions"] = sample_df["Conditions"].apply(lambda x: weather_dict[x])

Picking The Right Model

With our data and goals, a simple linear regression won’t do. Not only do we want to have a low variance model, we also know that the coordinates, while being numbers, do not carry numeric value for the given target variable. Additionally, we want to add direction of the route as a positive or negative numeric value and try supplementing the model with the weather data set, which is almost entirely categorical.

With a few random outliers in a huge data set, possibly extraneous features which came with it, and a number of possible categorical features, we need a tree-based model. Specifically, boosted trees will perform very well on this particular data set and be able to easily capture non-linear relationships, accommodate for complexity, and handle categorical features.

In addition to the standard XGBoost model, we can try the LightGBM model because it is faster and has better encoding for categorical features. Once you encode these features as integers, you can simply specify the columns with categorical variables, and the model will treat them accordingly:

bst = lgb.train(params,
                dtrain,
                num_boost_round = nrounds,
                valid_sets = [dtrain, dval],
                valid_names = ['train', 'valid'],
                categorical_feature = [20, 24]
                )

These models are fairly easy to set up, but are much harder to fine-tune and interpret. Kaggle conveniently offers root mean squared logarithmic error (RMSLE) as the evaluation metric; because it compares the logarithms of predicted and actual trip durations, it dampens the impact of a few very large errors. With RMSLE, we can run different parameters for tree depth and learning rate and compare the results. Let’s also create a validation “watchlist” set to track the errors as the model iterates:

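import numpy as np
import xgboost as xgb

# Train on log(1 + duration) so that RMSE on the transformed target equals RMSLE
dtrain = xgb.DMatrix(X_train, np.log1p(y_train))
dval = xgb.DMatrix(X_val, np.log1p(y_val))

# Report the error on both sets after every boosting round
watchlist = [(dval, 'eval'), (dtrain, 'train')]

gbm = xgb.train(params,
                dtrain,
                num_boost_round = nrounds,
                evals = watchlist,
                verbose_eval = True
                )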
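For reference, the RMSLE metric itself can be sketched in a few lines of plain Python. This is a minimal illustration with made-up trip durations in seconds, not the competition’s scoring code:

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Being 100 seconds off matters more on a 10-minute trip than on a 1-hour trip
short_trip = rmsle([600], [700])
long_trip = rmsle([3600], [3700])
```

Because the error is measured on a log scale, the same absolute miss is penalized less on long trips, so the metric behaves more like a relative error.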

Optimization With Genetic Algorithm

Now, the machine learning part is only the first step of the project. Once the model is trained and saved, we can start on the genetic algorithm. For those who don’t know, in the genetic algorithm a population of candidate solutions to an optimization problem is evolved toward better solutions, and each candidate solution has a set of properties which can be mutated and altered. Basically, we start with a random solution to the problem and try to “evolve” the solution based on some fitness metric. The result is not guaranteed to be the best solution possible, but it should be close enough.

Let’s say we have 11 points on the map. We would like our delivery truck to visit all these locations on the same day, and we want to know the best route. However, we don’t know how long it will take the driver to go between each point because we don’t have the data for all address combinations.

This is where the machine learning part comes in. With our predictive model, we can find out how long it will take for a truck to get from one point to another, and we can make predictions for each pair of points.

When we use our model as a part of the genetic algorithm, what we start with is a random visit order for each point. Then, based on the fitness score being the shortest total time traveled, the algorithm attempts to find a better visit order, getting predictions from the machine learning model. This process repeats until we figure out a close to ideal solution.
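The loop described above can be sketched as follows. This is a simplified illustration, not the project’s actual code: `travel_time` stands in for the trained model’s predictions, and the specific operators (swap mutation, ordered crossover, keeping the better half of each generation) are assumptions chosen for brevity.

```python
import random

def route_time(order, travel_time):
    """Total predicted time for visiting the points in the given order."""
    return sum(travel_time(a, b) for a, b in zip(order, order[1:]))

def mutate(order, rng):
    """Swap two stops, keeping the starting point (index 0) fixed."""
    child = order[:]
    i, j = rng.sample(range(1, len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child

def crossover(a, b, rng):
    """Ordered crossover: keep a slice of parent a, fill the rest in parent b's order."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    middle = a[i:j]
    rest = [p for p in b[1:] if p not in middle]
    return [a[0]] + rest[:i - 1] + middle + rest[i - 1:]

def evolve(points, travel_time, pop_size=50, generations=200, seed=0):
    """Evolve visit orders that start at points[0]; needs at least three points."""
    rng = random.Random(seed)
    start, others = points[0], list(points[1:])
    population = [list(points)]  # include the given order as one candidate
    while len(population) < pop_size:
        population.append([start] + rng.sample(others, len(others)))
    for _ in range(generations):
        # Fitness: shorter total travel time is better
        population.sort(key=lambda order: route_time(order, travel_time))
        survivors = population[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = survivors + children
    return min(population, key=lambda order: route_time(order, travel_time))

# Toy stand-in for model predictions: Manhattan distance between grid points
def toy_travel_time(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

stops = [(0, 0), (3, 0), (1, 0), (4, 0), (2, 0)]
best = evolve(stops, toy_travel_time)
```

Since the better half of each generation survives unchanged, the best route found never gets worse from one generation to the next, although there is still no guarantee that it is the global optimum.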

Results

After testing both predictive models, I was surprised to find out that the more basic XGBoost with no weather data performed slightly better than LightGBM, giving a mean absolute error of 4.8 minutes against LightGBM’s 4.9 minutes.

After picking XGBoost and saving the model, I passed it to my genetic algorithm to generate a sample solution and make a demo. Here is a visualization of the end result: we start at a given location, and the genetic algorithm together with machine learning plans out a near-optimal route for our delivery truck.

Note that there is a little loop between points 3–4–5–6. If you look closely at the map, you will see that the suggested route goes through the freeway, which is a faster and shorter drive than going through the residential area. Also, note that this demo is not an exact route planner: it merely suggests the visit order.

Next Steps

Route planning would be the next logical step for this project. For instance, it is possible to incorporate Google Maps API and plan out the exact pathing between each pair of points. Also, the genetic algorithm assumes static time of the day. Accounting for time of the day, while important, is a much more complex problem to solve, and it might require a different approach to constructing the input space altogether.

If you have any questions, thoughts, or suggestions, please feel free to reach out to me on LinkedIn. Code and project description can be found on GitHub.

Thanks for reading!

Travel Time Optimization With Machine Learning And Genetic Algorithm was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

What’s In Your Customer’s Next Shopping Cart?

Vladimir Lazovskiy — Mon, 19 Mar 2018 19:48:49 GMT

The Instacart Market Basket Analysis competition on Kaggle is a great example of how machine learning can be applied to a business problem and a useful exercise for feature engineering. Basically, the problem comes down to predicting which products a user will buy again, try for the first time, or add to their cart next during a session. The motivation behind it is quite simple: as a grocery delivery company, you would want to optimize your supply chains, minimize waste, and avoid backorders. And the machine learning part is what I am going to cover in this blog.

If you checked the link at the beginning of this article, you would know that quite a few people tried this problem and submitted their models dozens of times and with different approaches. While it could be fun to fiddle with random forests and boosted trees, for this project we will stick with good old logistic regression and investigate just how much can be improved with feature engineering and basic model tuning.

It All Starts With Data

As usual, the first step in approaching any machine learning model is to look at the data. Here, I drafted a couple of basic visualizations in Tableau, and just from these plots alone we can draw insights about user behavior patterns.

For instance, this plot shows how many items in relative size are reordered within a month. We can already see that most people reorder products within a week or never order again (30 stands for 30 or more days since last order).

And here, we can see that the average order size for a customer is 10 items.

Although this exploratory data analysis alone provides useful insights, the goal of this project is to do machine learning and turn these insights into predictive modeling.

And Into Machine Learning

This is the part where I will somewhat diverge from Kaggle. To train a logistic regression model, I am going to construct a new feature which represents the last cart for a given user:

train_carts = (order_products_train_df.groupby('user_id',as_index=False)
                                      .agg({'product_id':(lambda x: set(x))})
                                      .rename(columns={'product_id':'latest_cart'}))

df_X = df_X.merge(train_carts, on='user_id')
df_X['in_cart'] = (df_X.apply(lambda row: row['product_id'] in row['latest_cart'], axis=1).astype(int))

This new feature is a result of looking at user and product id’s and recreating their previous cart. So we end up with their latest order, represented as a set of product id’s. From there, we can create a column which indicates whether an item was ordered previously and fill it with values based on the column which contains product id’s from the previous cart. When displayed, the new feature space would look like this:

Since the product with id 1 was not ordered previously, its values in the in_cart column are 0. This in_cart column, therefore, will be the target of classification. If a user is likely to reorder a certain item, we would get a prediction of 1, and 0 otherwise.

Running a baseline logistic regression on this feature space yielded very poor results, so this is where we can turn to feature engineering magic. I took an iterative approach to my feature engineering and tested how each new set of features (user features, product features, and user-product features) affected my model at each step.

Specifically, when dealing with averages, I checked how well raw average values for days and hours performed against their rounded down values. The reasoning here is that it doesn’t make sense to consider 12.33 days since last purchase because the dataset itself provides only integer values and decimal values for discrete variables are not very insightful. To my surprise, unrounded raw averages performed better overall and gave more signal to my model in the long run.

Most of my engineered features revolved around order frequency and some averaging metric to compare the general ordering trends for the entire customer base with specific customer behavior. I also converted department names into categorical variables because I hoped that they might provide additional signal for my model (they didn’t).

One thing I did not mention at the beginning is that my prediction classes were fairly unbalanced. The engineered in_cart feature showed that items were reordered about one time out of ten. To compensate for this class imbalance, I used sklearn’s default weight balancing:

lr_balanced = LogisticRegression(class_weight='balanced', C=1000000)

However, I took it a step further and tested custom weights for further fine-tuning. It turned out that the manually balanced model performed better overall.
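For reference, sklearn’s 'balanced' mode weights each class by n_samples / (n_classes * class_count). A minimal sketch of that formula in plain Python, using a made-up sample with the roughly one-in-ten reorder rate mentioned above:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights as n_samples / (n_classes * count(class)), as in class_weight='balanced'."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# One reordered item for every nine that were not reordered
labels = [1] * 1 + [0] * 9
weights = balanced_class_weights(labels)
```

Here the rare class ends up weighted nine times more heavily than the common one. Custom weights can be passed to LogisticRegression the same way, as a dictionary such as class_weight={0: 1, 1: 6}; the ratio in that example is hypothetical, not the one used in this project.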

Conclusions

In the end, this simple logistic regression with newly engineered features and manual class balancing yielded pretty good results. With F1 = 0.381, I was not far behind Kaggle’s leaders who hovered around F1 = 0.41 with many submissions and fancier models.

And if we look at the confusion matrix, we can get a better breakdown of what my F1 score actually represents. With precision of 0.3, about 30% of the items my model predicted as reordered actually were reordered, and with recall of 0.52, the model caught about 52% of all items that were in fact reordered. At least, the customers can be assured that they won’t need to backorder.
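To see how precision and recall combine into the F1 score, here is a small sketch using the rounded values quoted above (the function name is illustrative):

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = f1_from(0.30, 0.52)
```

With these rounded inputs the result is roughly 0.38, consistent with the reported F1 = 0.381.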

Such suboptimal results can be explained by class imbalance—my model predicted much better when items would NOT be reordered—or the fact that, perhaps, this is not a problem suited for machine learning.

Aside from these somewhat abstract scores, we could look at the coefficients of the model. You can find all the coefficient values in my code, but to highlight the most important findings, I will say that customer order frequency was by far the biggest predictor of the odds that the customer would reorder a certain product. User and product total orders also played a role: the more frequently, and the more in general, users order a product, the higher the chances of that product being reordered.

However, the counting order of a product being added to cart played a negative role in the odds of an item being reordered, which makes perfect sense: we saw at the beginning that the average cart size for a user is about 10 products. If an item gets consistently placed as the 15th item in the shopping cart, then it will most likely not make it to the average basket.

The End

And this is it for today. You can find my incredibly fun presentation here. All the code is accessible in that repository as well.

I also used AWS EC2 (Amazon Web Services, Elastic Compute Cloud) for my modeling, and I highly recommend Chris Albon’s guide for setting up a virtual machine and Jupyter Notebook to run on it.

What’s In Your Customer’s Next Shopping Cart? was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Predicting Film Ratings With Simple Linear Regression

Vladimir Lazovskiy — Tue, 20 Feb 2018 17:27:13 GMT

The 2017–18 awards season is in full swing, and Hollywood pundits are betting on which films will take home the most awards. Although predicting Oscars is quite tempting, it is more complex and requires classification models. In this blog, I will demonstrate the use of simple linear regression because it turns out that even a basic model with just a few features can be a decent predictor for movie ratings.

Lupita Nyong’o dazzles at Black Panther premiere

First things first: let’s define the data set. For this project, I looked at all films on IMDb.com from the date of the site’s creation to the end of last year, so about 27 years in total. Of course, I did not consider every film, only full-length feature films which were released in the USA. Scraping all of them from IMDb gave me around 50,000 raw samples. Sounds like a lot? Not so fast: only 9% of the samples had complete information and were available for analysis right away. I ended up using 4455 rows of data.

A few notes on imputation and feature engineering.

First of all, imputing film data is simply not very effective. You can’t “impute” actors, writers, or directors. Picking mean budget or runtime values is also questionable, since budget values, for example, increase over time with inflation and other factors (Note. I did not account for inflation, at least in this iteration of the project).

I also avoided engineering new features. Although I could have created genre or keyword clusters, keywords on IMDb were bizarre, unfitting, and even inappropriate. For simplicity’s sake, I avoided them altogether.

Of data setup and tools.

Now I must admit that going into this problem, I had high expectations for directors’ and actors’ influence on movie ratings. To see if they actually made a difference, along with other features, I divided my data into four sets.

For my base set, I selected only numeric features: budget, total gross, number of votes on the website, and runtime. When doing exploratory data analysis and looking at the pair plots, I noticed strong logarithmic relationships between these variables and my target, movie ratings. Therefore, I applied a log transform on my features and further scaled them to be between 0 and 1. That way, millions of dollars spent on budget would not potentially outweigh meager runtime values.
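A minimal sketch of that preprocessing step is shown below. It assumes log(1 + x) as the log transform and plain min-max scaling; the function name and the sample values are made up for illustration:

```python
import math

def log_minmax_scale(values):
    """Apply log(1 + x), then rescale the result to the [0, 1] range."""
    logged = [math.log1p(v) for v in values]
    low, high = min(logged), max(logged)
    if high == low:
        return [0.0 for _ in logged]  # constant column: nothing to scale
    return [(v - low) / (high - low) for v in logged]

budgets = [50_000, 2_000_000, 150_000_000]  # made-up dollar amounts
runtimes = [85, 110, 180]                    # made-up minutes
scaled_budgets = log_minmax_scale(budgets)
scaled_runtimes = log_minmax_scale(runtimes)
```

After scaling, both columns live on the same 0 to 1 range, so neither dominates the regression purely because of its units.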

After that, I built three more data sets on top of my base: one set added MPAA ratings and genres, the next set added top directors and top languages on top of set two, and finally I threw in top writers, top actors, and top countries. Naturally, all categorical variables had to be converted into a sparse matrix of 0’s and 1’s.
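To illustrate the conversion into 0’s and 1’s, here is a small sketch for a single-valued categorical column such as the MPAA rating. The helper name is illustrative; in practice a library routine like pandas’ get_dummies does the same job:

```python
def one_hot(values):
    """Turn a list of category labels into rows of 0/1 dummy columns."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

columns, dummies = one_hot(["PG-13", "R", "PG", "R"])
# columns -> ['PG', 'PG-13', 'R']; the first film becomes [0, 1, 0]
```

Every distinct label adds a column, which is why thousands of distinct directors, writers, and actors quickly make the matrix very wide and very sparse.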

Conveniently, the target variable follows a roughly normal (slightly left-skewed) distribution. Great!

In case anyone wonders, my stack for this problem is the following: pandas for data cleaning and preliminary analysis, matplotlib and seaborn libraries for plotting, sklearn library for modeling, and statsmodels for extra nerdiness.

America’s next top (machine learning) model.

So how did my data sets perform? I will start from the last two sets and just say that they were too sparse and too complex. Adding more complexity to them would not make any sense, and regularization did not improve my prediction, so I discarded them after initial exploration.

After that, I focused on two remaining data sets:

Set 1: budget, gross, runtime, number of votes
Set 2: budget, gross, runtime, number of votes, MPAA ratings, genres

For set 1, it did not make a lot of sense to regularize (too simple), so instead I added complexity to my model by creating polynomial features. Degree 2 polynomial features provided the best results in my cross-validation runs, so this model was considered for final evaluation.
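To make the degree 2 expansion concrete, here is a sketch of what it does to one row of four features. It is the same kind of expansion that sklearn’s PolynomialFeatures(degree=2) produces; the helper name and the sample values are made up:

```python
from itertools import combinations_with_replacement

def degree2_features(row):
    """Expand one row into a bias term, the original features, and all degree-2 products."""
    expanded = [1.0] + list(row)
    for i, j in combinations_with_replacement(range(len(row)), 2):
        expanded.append(row[i] * row[j])
    return expanded

# budget, gross, runtime, votes after scaling (made-up values)
example = degree2_features([0.2, 0.5, 0.4, 0.9])
```

Four input features turn into 15 columns: one bias term, the four originals, four squares, and six pairwise products. That is still a small model, but it can now capture curvature and interactions.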

For set 2, I attempted LASSO and Ridge regularizations (with cross-validation, of course!) because genres and MPAA ratings added a lot of extra “data” when expanded into dummy columns. Unfortunately, MPAA ratings and genres were rather meaningless (only Drama and Animation had substantial coefficients), which made me realize that the former four-feature model was my best bet.

Interestingly enough, the residual plots for both models looked very much alike. Although the variance wasn’t truly random, it had no clear patterns, and therefore no signal could be picked up from these graphs.

And just to test my decision, I compared cross-validation scores for these two models. Model 1, a simple linear regression with added degree 2 polynomial features, had r² = 0.421, while model 2 with MPAA ratings and genres showed r² = 0.546.

However, r² was not the sole metric for my evaluation. In fact, I was more interested in the actual root mean squared error. Model 1 had RMSE = 0.786, while model 2 gave RMSE = 0.685. The difference of 0.101 in error terms is quite meager when compared to the rating scale of 0 to 10. Hence, I chose the simpler model over a more complex model with mostly meaningless features!
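For readers who want to see the two metrics side by side, here is a minimal sketch of how RMSE and r² are computed. The ratings below are made up and are not taken from my data:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual = [6.0, 7.0, 8.0, 5.0]      # made-up ratings
predicted = [6.5, 6.5, 7.5, 5.5]   # every prediction is off by half a point
error = rmse(actual, predicted)
explained = r_squared(actual, predicted)
```

RMSE reads directly in rating points (here, half a point), which makes it easier to judge against the rating scale than r².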

Model 1 gave r² = 0.4 on my test set, which isn’t very high, but as follows from the previous paragraph, actual RMSE is a better accuracy metric for this problem than r². I am pretty satisfied with my result, although it can be improved in the future.

Note. When checking p-values with statsmodels, I noticed that p-values for model 1 were all 0’s, while model 2 had p-values ranging from 0 to 0.8.

And just for fun

As much as I wanted actors to influence my predictions, they were simply insignificant when combined with other features. But when I looked at actor-rating correlation separately and built a model with prediction based only on actor names, I saw that some of them did better than others (based on coefficients). Here is who you would and would not want in your next movie.

From left to right, row by row: Denzel Washington, Ryan Gosling, Leonardo DiCaprio, Jean-Claude Van Damme, Johnny Depp, Tom Hanks, Christian Bale, Michael Madsen

PS: Six months from now I am going to revisit the model and see if it can accurately predict the ratings for Black Panther. Until then…

The Joy of Cleaning Data

Vladimir Lazovskiy — Tue, 06 Feb 2018 08:01:50 GMT

Turns out getting workable data is not as simple as it sounds. Sure, there are many ways to obtain data sets — for instance, by visiting the New York MTA website — but downloading data is not enough. There will almost certainly be errors of different degrees in what you get. These errors can vary in type and magnitude: from mismatched data types to absurd values.

Some errors can be very sneaky. For instance, one of the columns in the mentioned New York subway data set contains turnstile hourly counters. Not only are they cumulative counts of how many people have passed through each turnstile, rather than counts per hour, but the turnstiles occasionally break and count backwards. The same subway lines are described by differently ordered letter combinations, and sometimes, the station in question may not even exist (or, rather, a certain line does not go through that station, which means there was an error entering data). Wow.
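As an illustration of the counter problem, here is one possible way to turn running counter readings into per-interval counts. The function name, the sample readings, and the max_plausible threshold are all hypothetical, and the right cutoff depends on the station:

```python
def interval_counts(readings, max_plausible=10_000):
    """Convert cumulative counter readings into per-interval counts."""
    counts = []
    for previous, current in zip(readings, readings[1:]):
        delta = abs(current - previous)  # a backwards-running counter gives negative steps
        if delta > max_plausible:        # likely a counter reset or a data-entry error
            counts.append(None)
        else:
            counts.append(delta)
    return counts

forward = interval_counts([1000, 1200, 1500])    # [200, 300]
backward = interval_counts([5000, 4700, 4600])   # [300, 100]
reset = interval_counts([999_000, 120, 300])     # [None, 180]
```

Marking implausible jumps as missing, rather than silently dropping them, keeps the rows aligned with their timestamps for later analysis.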

So… what exactly is data cleaning? A dictionary definition would be something like “data cleaning is the act of taking collected data and making it usable in your preferred statistical software.” Cleaning includes removing bad data, creating correct labels and codes, and making everything consistent and to a degree interpretable. Every data scientist will get data that needs cleaning, even if her collection techniques were perfect.

And while it’s true that data cleaning is the most time-consuming part of any research endeavor, it is a necessary step which will lead to easier data analysis and more reliable insights. Besides, cleaning data is quite a satisfying process. It is as fun as gardening or washing dirty dishes—you get something done, and you make order out of chaos.

Finally, working with clean and processed data is enjoyable. You can draw insights and do pretty plots… like this one.

I’ve a feeling we’re not in Kansas anymore.

Vladimir Lazovskiy — Wed, 24 Jan 2018 18:11:27 GMT

Metis is called a boot camp for a reason. You start your mornings with a programming “drill,” followed by a few hours of lecture, followed by continuous work on projects, where you apply the knowledge you just gained. You stay up late at night, get frustrated with your code, and you fail. Yet, you get up to continue working because this is just the beginning. Just like it is always hard to start exercising, learning something new is no easy feat.

Boot camp can be very intense even in its first days. There is plenty of room for self-doubt and anxiety. It can be hard to accept that you may not be fully proficient in all areas. But it is extremely empowering to knock out that piece of code you spent several days writing and optimizing. It is thrilling to finish a project and move on to solve the next machine learning mystery.

But why do a boot camp? Aren’t there plenty of free resources online aimed to teach you virtually anything?

Well, it is entirely possible to teach yourself how to code, build robust algorithms, and work with data. Aside from the fact that you are going to be flooded with information without knowing what truly matters, you will soon realize that studying by yourself is… boring. Unless you belong to the rare breed of unicorns who know how to find the right sources and can naturally commit to studying daily, you will most likely start to get overwhelmed by all the available information and begin slacking on those free YouTube videos.

For me, the social aspect of learning is paramount to successful results. This is exactly what I want from my boot camp experience—to be surrounded by people with whom I can share my ideas, my struggles, and my victories—and this is what I am getting at Metis.

So, the journey has begun, and I am excited for what is to come!