When Magento meets Python (episode Anomalies Detection)

Second post about applying some Python magic to deal with real use cases with Magento.

If you missed the first episode, check it out here.

Anomalies detection is a way to detect strange data behaviour (not necessary an error), finding the outliers, observations that are “far” from the rest, in terms of proximity or statistically, depending from the approach.

Let’s make an example: in an e-commerce scenario, most of the orders are ok, following the same steps and being in a consistent state.

Now let’s suppose that an external factor — of course all the internal code is 100% bug free :), is performing some unwanted changes, updating occasionally orders status in a bad way.

Is it possible to have something to assess the past situation and alert when something is wrong (and make customer care team your best friends)? Let’s find out.

In this PoC, we’re using fictional data orders with just three features: qty invoiced , order state and order status, fields present in Magento.

Let’s take a look at the data.

import pandas as pd df = pd.read_csv(filepath_or_buffer='anomalies_example.csv', sep=';')
df.head()

A very simple table…

Let’s do some very basic EDA (Exploratory Data Analysis)…something to always perform when dealing with a new data set.

print(df.info())
RangeIndex: 5200 entries, 0 to 5199
Data columns (total 3 columns):
state 5200 non-null object
status 5200 non-null object
qty_invoiced 5200 non-null int64
dtypes: int64(1), object(2)
memory usage: 122.0+ KB

We have 5200 entries, with state and status as strings (there are categorical values) and qty_invoiced as integer

print(df.describe())
qty_invoiced
count 5200.000000
mean 0.882885
std 0.325748
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 3.000000

Interesting, the max qty invoiced for an item is 3 and at least the 75% of records have just one item invoiced.

Let’s see the numbers

df.groupby('qty_invoiced').state.count()
qty_invoiced
0 615
1 4580
2 4
3 1
Name: state, dtype: int64

Only one order with 3 items, 4 with 2 and the majority with 1..let’s keep in mind this.

Before using some algorithm, let’s try to assess the situation visually, using Seaborn, a powerful graphical library

import seaborn as sns
_ = sns.stripplot(x="qty_invoiced", y="state", data​=df)

Whoa, definitely something wrong here. For sure qty_invoiced positive with “pending payment” as status is not ok!

About “hold” and “processing”, well it depends from the implemented BL if is ok or not. But we can make an assumption from the data, using the algorithm as input/output processor, considering the BL simply as a black box.

The main characteristic of data set for anomalies is that the “wrong” examples usually are very, very few and often you don’t know them in advance (especially for continuous features), so it’s difficult to train a supervised algorithm to teach what is right and what is wrong.

So better “throw” all the data to an unsupervised learning algorithm and see what it can find.

As said earlier, there are different approaches.We’ll use an algorithm called “k-nearest neighbors” as unsupervised classifier, meaning that the output will be 1 if an order item is consider an anomaly or 0 if not.

Let’s start but, before, we have to transform the categorical features (state and status values) in something more understandable to an algorithm.

from sklearn import preprocessing df_dummy = pd.get_dummies(df) df_dummy

We transformed the data set, placing a 0/1 for every possibile value for state and status. Now, we deal with a matrix of 5200 rows × 17 columns.

Let’s proceed with KNN.

from pyod.models.knn import KNN

clf_name = 'KNN'
clf = KNN()
clf.fit(df_dummy)

y_train_pred = clf.labels_ # binary labels (0: inliers, 1: outliers)

anomalies = [i for i, x in enumerate(y_train_pred) if x == 1]

print(anomalies)
[975, 1071, 1262, 1573, 1946, 1947, 1948, 1949, 2866, 3308, 3309, 3896, 5120, 5121]

Ok, we found some…let’s take a look at one

df_dummy.iloc[975,:]
qty_invoiced              2
state_canceled 0
state_closed 0
state_complete 1
state_holded 0
state_new 0
state_payment_review 0
state_pending_payment 0
state_processing 0
status_canceled 0
status_closed 0
status_complete 1
status_holded 0
status_payment_review 0
status_pending 0
status_pending_payment 0
status_processing 0

It seems ok as order configuration, but is consider an anomaly….and KNN is right, because it is! Remember the most of the order items have one item?

So we have to change a little bit the data to help the algorithm, doing what is called “feature engineering”. To deal with the imbalance of order items, we can remove the qty_invoiced number and replace with a new feature telling just if qty_invoiced is positive. Let’s try it and let’s run the KNN again

df_dummy['invoiced_positive'] = df.apply(lambda row: row.qty_invoiced > 0, axis=1)

df_dummy.drop(['qty_invoiced'], axis = 1, inplace=True)

clf.fit(df_dummy)

y_train_pred = clf.labels_ # binary labels (0: inliers, 1: outliers)

anomalies = [i for i, x in enumerate(y_train_pred) if x == 1]

print(anomalies)


[1071, 1262, 1946, 1947, 1948, 1949, 3308, 3309, 5120, 5121]

The 975 value is gone! Let’s take a look to 1071

It’s an anomaly and it’s wrong! Let’s see all of them, using the original data set for a better readability.

df.iloc[anomalies,:]

The red ones are wrong, so it’s ok..the others seem ok, there are just few cases in the data set.

Lastly, we have a tool to check new data too (imagine exposed as an API), using the trained classifier. Let’s check two new observations, one ok and one not.

new_data = pd.DataFrame([[False,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0], 
[True,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0]])

pred = clf.predict(new_data)
anomalies = [i for i, x in enumerate(pred) if x == 1]

print(anomalies)
[1]

The classifier found an anomaly, the second record (in Python array starts from 0).

This is just a little example, but it shows how this tools can be used to gain valuable info from data.

See you next time!


Originally published at https://www.linkedin.com.