Who are alike? Use BigObject feature vectors to find similarities

Jocelyn Chen
Oct 2, 2015

Cluster analysis is a common technique for grouping a set of objects so that the objects in the same group share certain attributes. It’s commonly used in marketing and sales planning to define market segments.

Here at BigObject we adopt a simple approach to explore the similarities between objects. We calculate a “feature vector” for each object from the given attributes and use the distance between vectors to determine which objects are “alike.”

The following simple example shows how to use BigObject to extract product features and then find similar products in retail data. We use the default sample data in the BigObject Docker image for the demonstration. You can run the Docker image on your own computer or experiment in our sandbox.

The sample data schema is:

For example, suppose we want to extract each product’s features based on the average quantity sold in each channel.

First, build a table “avg_qty_by_channel” that stores each product’s average quantity sold in each channel, using the statement:

BUILD TABLE avg_qty_by_channel AS (FIND Product.id, channel_name , avg(qty) FROM sales)
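To make the aggregation concrete, here is a minimal Python sketch of the same idea. It assumes the FIND statement averages qty for each product and channel pair, as the table name suggests. The rows and the function name are hypothetical and are not taken from the BigObject sample data.

```python
from collections import defaultdict

# Hypothetical sales rows: (product_id, channel_name, qty). Made-up values.
sales = [
    ("p1", "web", 2),
    ("p1", "web", 4),
    ("p1", "store", 1),
    ("p2", "web", 5),
]


def avg_qty_by_channel(rows):
    """Return the average qty for every (product_id, channel_name) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of qty, row count]
    for product_id, channel, qty in rows:
        acc = totals[(product_id, channel)]
        acc[0] += qty
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


averages = avg_qty_by_channel(sales)
```

Each key of `averages` corresponds to one row of a table shaped like “avg_qty_by_channel”: a product, a channel, and the average quantity.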

The table “avg_qty_by_channel” would be:

Now, convert the “avg_qty_by_channel” table into a product feature vector table with the trans-pivot operation:

BUILD TABLE Product_feature(*, channel_name[*]:'AVG(qty)') FROM avg_qty_by_channel {default_type:DOUBLE}
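The pivot can be pictured as turning one row per (product, channel) pair into one row per product with one column per channel. The Python sketch below illustrates that reshaping on made-up values. The function name is hypothetical, and filling a missing product and channel combination with 0.0 is an assumption made for this sketch, not documented BigObject behavior.

```python
# Hypothetical averages keyed by (product_id, channel_name). Made-up values.
averages = {
    ("p1", "web"): 3.0,
    ("p1", "store"): 1.0,
    ("p2", "web"): 5.0,
}


def pivot(avgs, default=0.0):
    """Build one feature vector per product, with one slot per channel."""
    channels = sorted({channel for _, channel in avgs})
    products = sorted({product_id for product_id, _ in avgs})
    table = {
        # Assumption: a product never sold in a channel gets `default`.
        pid: [avgs.get((pid, ch), default) for ch in channels]
        for pid in products
    }
    return channels, table


channels, product_feature = pivot(averages)
```

Here `channels` gives the column order, and each list in `product_feature` is one product's feature vector in that order.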

The feature table “Product_feature” would be:

Finally, we write a simple Lua function that defines a distance function (the average absolute difference between two feature vectors) and scans the product feature table to find the most similar product for each product, which takes O(n^2) comparisons.
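As a quick illustration of the distance described above, here is a small Python sketch of the average absolute difference between two vectors. The function name and the sample vectors are hypothetical.

```python
def mean_abs_diff(a, b):
    """Average of the absolute element-wise differences of two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# (|1 - 2| + |3 - 1| + |0 - 0|) / 3 = (1 + 2 + 0) / 3 = 1.0
d = mean_abs_diff([1.0, 3.0, 0.0], [2.0, 1.0, 0.0])
```

A smaller value means the two products sell in more similar quantities across channels, and a value of 0 means the vectors are identical.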

The function writes its results to the “simProduct” table, which is created with the statement:

CREATE TABLE simProduct (Product.id STRING, simProductId STRING, distance DOUBLE, KEY(Product.id))

After this, upload the Lua function “findSimP” and run it with:

APPLY findSimP(Product_feature, simProduct )

This may take a while (80 to 90 seconds) because the implementation is not yet optimized: it compares every product with every other product, which is O(n^2).
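To see where the quadratic cost comes from, the following Python sketch runs the same kind of brute-force scan over a tiny, made-up feature table and counts the distance calls. It is a plain-Python stand-in written for illustration, not the BigObject implementation, and all names and values are hypothetical.

```python
def mean_abs_diff(a, b):
    """Average of the absolute element-wise differences of two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def most_similar(features):
    """For each product, find the other product with the smallest distance."""
    result = {}
    comparisons = 0
    for pid, vec in features.items():
        best_id, best_dist = None, float("inf")
        for other_id, other_vec in features.items():
            if other_id == pid:
                continue  # never compare a product with itself
            comparisons += 1
            dist = mean_abs_diff(vec, other_vec)
            if dist < best_dist:
                best_id, best_dist = other_id, dist
        result[pid] = (best_id, best_dist)
    return result, comparisons


# Hypothetical feature vectors, one per product. Made-up values.
features = {"p1": [1.0, 3.0], "p2": [0.0, 5.0], "p3": [1.0, 4.0]}
nearest, comparisons = most_similar(features)
```

With n products the scan makes n(n-1) distance calls, so three products need 6 calls and doubling the number of products roughly quadruples the work. Because this distance is symmetric, one possible optimization would be to compute each pair only once.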

The result can be displayed with a SELECT statement:

SELECT * FROM simProduct LIMIT 10

The result shall be:

How to upload findSimP.lua with Python bosh

bosh>admin
bosh:admin>luaupload findSimP.lua
ok
bosh:admin>exit

The findSimP.lua file

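-- Average absolute difference between the feature columns of rows idx1 and idx2.
function dist(bt, col_size, idx1, idx2)
  local dist_v = 0.0
  for i = 1, col_size do
    local v1 = bt:getValue(idx1, i)
    local v2 = bt:getValue(idx2, i)
    dist_v = dist_v + math.abs(v1 - v2)
  end
  return dist_v / col_size
end

-- For every product, find the product with the smallest distance and store the pair.
function findSimP(productTable, resultTable)
  local product = bo.getTable(productTable)
  local resbt = bo.getTable(resultTable)
  local row_size = product:size()
  local col_size = 7  -- number of feature columns (hard-coded)
  for i = 1, row_size do
    local product_name = product:getValue(i, 0)  -- column 0 holds the product id
    local min_dist = 9999
    local sim_p = ""
    -- Compare product i with every other product j.
    for j = 1, row_size do
      if i ~= j then
        local temp_v = dist(product, col_size, i, j)
        if min_dist > temp_v then
          sim_p = product:getValue(j, 0)
          min_dist = temp_v
        end
      end
    end
    -- Store one result row: product id, most similar product id, distance.
    local insert_table = {}
    table.insert(insert_table, product_name)
    table.insert(insert_table, sim_p)
    table.insert(insert_table, min_dist)
    resbt:insert(insert_table)
  end
end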
