It is always a trending topic that big companies are spying on us. But can we really say they are spying? In my opinion it is a “confusing” topic, because companies are not listening to everything you do. Companies only prepare their product to be as profitable as possible. They just want you to buy or use their product, and for that they prefer to give you a better experience. And how do they improve that experience? By recommending things they already know you like.
Okay, if the company knows what I like, it’s because they’re spying on me.
False. For example, when you create an Amazon account and browse their products, a few days later Amazon sends you the best offers on products similar to the ones you visited. The people at Amazon are not really spying on you; they only review the latest products you visited in order to build a good recommendation for the user. Personally, I prefer to be recommended products that I have visited or tried to find: that way, if I see something I like, I can buy it.
Most importantly, companies that don’t spy only use simple techniques to improve their product and deliver the best user experience.
Let’s begin. First of all we need to create our dataset. For example, in a ticketing service, we want to recommend to our users events that they might like. We start by analysing the Orders table.
SELECT event_id, user_id FROM Orders;
That table is very large, so we need to filter the information for our dataset. The first filter tries to eliminate the “mini events” and keep only events with more than N sales, where N is the number of sales that tells us an event is big enough for our dataset; for example, 300 tickets sold.
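As a sketch of that first filter, assuming the Orders rows live in a pandas DataFrame with the same columns as the SQL query above (the tiny sample and N = 3 are purely illustrative; on the real table N would be 300):

```python
import pandas as pd

# Hypothetical sample of the Orders table (event_id, user_id pairs).
orders = pd.DataFrame({
    "event_id": [1, 1, 1, 2, 3, 3],
    "user_id":  [10, 11, 12, 10, 13, 14],
})

N = 3  # minimum sales for an event to be considered "big enough"

# Count tickets sold per event and keep only events at or above the threshold.
sales_per_event = orders.groupby("event_id")["user_id"].count()
big_events = sales_per_event[sales_per_event >= N].index
filtered = orders[orders["event_id"].isin(big_events)]

print(filtered["event_id"].unique())  # only event 1 survives the N=3 cut
```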
Well, the dataset begins to take shape. Next I propose a filter by time, because this table contains very old events and we prefer to recommend our new ones. Be careful with this filter: if the time window is too short, you may not get good results. In our example we take the last 3 months; with this filter our dataset reaches the perfect size to work with.
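A minimal sketch of the time filter, assuming a hypothetical order_date column that the SQL query above does not show; the dates and reference timestamp are made up for the example:

```python
import pandas as pd

# Hypothetical orders with an order_date column.
orders = pd.DataFrame({
    "event_id": [1, 2, 3],
    "user_id":  [10, 11, 12],
    "order_date": pd.to_datetime(["2024-01-05", "2024-05-20", "2024-06-01"]),
})

# Keep only orders from the last 3 months, relative to a fixed reference date.
now = pd.Timestamp("2024-06-15")
cutoff = now - pd.DateOffset(months=3)   # 2024-03-15
recent = orders[orders["order_date"] >= cutoff]

print(recent["event_id"].tolist())  # events 2 and 3 fall inside the window
```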
import pandas as pd
from sklearn import cluster

# Load the filtered (event_id, user_id) pairs exported earlier.
dataset = pd.read_csv("eventId_userId.csv")

# Split into training and test sets.
limit = 500000
dataset_train = dataset[:limit]
dataset_test = dataset[limit:]

# Fit K-means on the training set, then assign each test row to a cluster.
n = 5
k_means = cluster.KMeans(n_clusters=n)
k_means.fit(dataset_train)
k_pred = k_means.predict(dataset_test)
We use 500,000 rows for the training process and 500 for testing. To create our recommendation system we use K-means, a method that observes similar events and users and groups them into the same cluster. In this way we organize the most similar events and users into groups, and we learn the preferences of the users. We started with only 5 clusters; in the image I call them categories.
😅 Ok, don’t worry, 5 clusters is not enough. As you can see, all events are present in all clusters. Let’s try with 100 clusters.
😄 GREAT!!! In this histogram you can see a few groups: some categories are clearly more highlighted than others. Now let’s try with 300 clusters.
😃 👌 PERFECT!!! 300 categories is perfect: you can clearly see that there are independent groups where users and events are clustered together. We are now ready to spam our users with recommendations based on their preferences.
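Instead of eyeballing histograms for 5, 100 and 300 clusters, the search for a good cluster count can also be sketched with the elbow method: compare the inertia (within-cluster sum of squares) for several values of k and look for where it stops dropping sharply. This is a common heuristic, shown here on toy 2-D data, not the exact procedure I used for the histograms above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data with three obvious groups (a stand-in for the real dataset).
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.1, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.1, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.1, size=(50, 2)),
])

# Inertia drops sharply until k matches the true number of groups,
# then flattens out -- that bend is the "elbow".
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
            for k in (1, 2, 3, 5)}
print(inertias)
```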
I hope this example has shown how companies work when they recommend something. In this process I don’t know anything about the users or the events: I only use mathematics to group them, and after processing the data we get information about potential buyers. And when these potential buyers enter our website, we already know what to put in the recommendations section. Just like Netflix, Amazon, HBO… nobody forces you to use or buy their product; they just do their job well so that you find what you like.
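That last step, knowing what to put in the recommendations section, can be sketched like this. The cluster column and the recommend helper are hypothetical, just to illustrate the idea: suggest the most popular events in the user’s cluster that the user has not bought yet.

```python
import pandas as pd

# Hypothetical cluster assignments: each order row got a cluster id from K-means.
orders = pd.DataFrame({
    "user_id":  [10, 10, 11, 11, 12, 12],
    "event_id": [1, 2, 2, 5, 3, 4],
    "cluster":  [0, 0, 0, 0, 1, 1],
})

def recommend(user_id, orders, top=2):
    # Find the cluster(s) the user's orders fall in, then suggest the most
    # popular events in those clusters that the user has not bought yet.
    user_rows = orders[orders["user_id"] == user_id]
    clusters = user_rows["cluster"].unique()
    pool = orders[orders["cluster"].isin(clusters)]
    seen = set(user_rows["event_id"])
    counts = pool[~pool["event_id"].isin(seen)]["event_id"].value_counts()
    return counts.head(top).index.tolist()

print(recommend(10, orders))  # user 10 is in cluster 0 and has not seen event 5
```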
Thank you for reading.