Google’s Kaggle challenge

Manish Pawar
4 min read · Jan 7, 2019


AI to predict item prices for online sellers



Can we teach AI to suggest product prices to sellers who sell online?
Well, there’s a Kaggle competition called the Mercari Prize: Price Suggestion Challenge in which, given a product’s brand, name, category, and a short description, we predict its price.

Here, we will walk through a simple Keras model for this task.

We take our data from here.

Our data looks like this…

For the price column, it would be problematic to feed the neural network values with wildly different ranges. A well-known practice for such data is feature-wise normalization: for each feature in the input data (a column of the input matrix), we apply log(x+1). And it’s as easy as

train['target'] = np.log1p(train['price'])

Now, we have to simplify the text: replace you’re with you are, ain’t with am not, and so on. These are called contractions.

contractions = {
    "ain't": "am not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    # more contraction pairs ...
}

for contraction in contractions:
    train['item_description'] = train['item_description'].str.replace(contraction, contractions[contraction])
    test['item_description'] = test['item_description'].str.replace(contraction, contractions[contraction])
    train['name'] = train['name'].str.replace(contraction, contractions[contraction])
    test['name'] = test['name'].str.replace(contraction, contractions[contraction])

Now, we need to handle missing values, since ignoring them can lead to inaccurate inferences about the data. It’s simply data.category_name.fillna(value="missing", inplace=True), which we wrap up as…

def handle_missing(dataset):
    dataset.category_name.fillna(value="missing", inplace=True)
    dataset.brand_name.fillna(value="missing", inplace=True)
    dataset.item_description.fillna(value="missing", inplace=True)
    return dataset

train = handle_missing(train)
test = handle_missing(test)

Now, many products share the same category name or brand name, so it helps to create categorical columns from them. We use LabelEncoder for this…

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(np.hstack([train.category_name, test.category_name]))
train['category'] = le.transform(train.category_name)
test['category'] = le.transform(test.category_name)

le.fit(np.hstack([train.brand_name, test.brand_name]))
train['brand'] = le.transform(train.brand_name)
test['brand'] = le.transform(test.brand_name)

Then we tokenize, which basically means splitting the text into words and mapping each word to an integer index.

from keras.preprocessing.text import Tokenizer

raw_text = np.hstack([train.category_name.str.lower(),
                      train.item_description.str.lower(),
                      train.name.str.lower()])

# Tokenizing...
tok_raw = Tokenizer()
tok_raw.fit_on_texts(raw_text)

# converting text to integer sequences
train["seq_category_name"] = tok_raw.texts_to_sequences(train.category_name.str.lower())
test["seq_category_name"] = tok_raw.texts_to_sequences(test.category_name.str.lower())
train["seq_item_description"] = tok_raw.texts_to_sequences(train.item_description.str.lower())
test["seq_item_description"] = tok_raw.texts_to_sequences(test.item_description.str.lower())
train["seq_name"] = tok_raw.texts_to_sequences(train.name.str.lower())
test["seq_name"] = tok_raw.texts_to_sequences(test.name.str.lower())

We now pad the sequences: every sequence in a batch must have the same length so we can pack them into a single tensor. Sequences shorter than the maximum length are padded with zeros, and longer ones are truncated.
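The padding code below also relies on sequence-length caps and a train/validation split (dtrain, dvalid) that the post doesn’t show. Here is a minimal sketch; the maxlen values and split ratio are my assumptions, not taken from the original:

from sklearn.model_selection import train_test_split

# Maximum sequence lengths fed to pad_sequences (assumed values; tune to your data)
MAX_NAME_SEQ = 10
MAX_ITEM_DESC_SEQ = 75
MAX_CATEGORY_NAME_SEQ = 10

# Hold out a slice of the training data for validation (split ratio is an assumption)
dtrain, dvalid = train_test_split(train, random_state=123, test_size=0.01)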

from keras.preprocessing.sequence import pad_sequences

def get_keras_data(dataset):
    X = {
        'name': pad_sequences(dataset.seq_name, maxlen=MAX_NAME_SEQ),
        'item_desc': pad_sequences(dataset.seq_item_description, maxlen=MAX_ITEM_DESC_SEQ),
        'brand': np.array(dataset.brand),
        'category': np.array(dataset.category),
        'category_name': pad_sequences(dataset.seq_category_name, maxlen=MAX_CATEGORY_NAME_SEQ),
        'item_condition': np.array(dataset.item_condition_id),
        'shipping': np.array(dataset[["shipping"]]),
    }
    return X

X_train = get_keras_data(dtrain)
X_valid = get_keras_data(dvalid)
X_test = get_keras_data(test)

Now off to the model (check the code comments for a better understanding). We use simple GRU layers on top of embeddings, then flatten and concatenate everything. Go here & here for a better grasp.
We use Adam as the optimizer (I tried RMSprop, but observed lower accuracy) and mean_squared_error as the loss. Click here for an outstanding comparison of various losses.

from keras.layers import Input, Dropout, Dense, concatenate, GRU, Embedding, Flatten
from keras.models import Model
from keras import optimizers
def get_model():
    # Inputs
    name = Input(shape=[X_train["name"].shape[1]], name="name")
    item_desc = Input(shape=[X_train["item_desc"].shape[1]], name="item_desc")
    brand = Input(shape=[1], name="brand")
    category = Input(shape=[1], name="category")
    category_name = Input(shape=[X_train["category_name"].shape[1]], name="category_name")
    item_condition = Input(shape=[1], name="item_condition")
    shipping = Input(shape=[X_train["shipping"].shape[1]], name="shipping")

    # Embedding layers
    emb_size = 60
    emb_name = Embedding(MAX_TEXT, emb_size//3)(name)
    emb_item_desc = Embedding(MAX_TEXT, emb_size)(item_desc)
    emb_category_name = Embedding(MAX_TEXT, emb_size//3)(category_name)
    emb_brand = Embedding(MAX_BRAND, 10)(brand)
    emb_category = Embedding(MAX_CATEGORY, 10)(category)
    emb_item_condition = Embedding(MAX_CONDITION, 5)(item_condition)

    rnn_layer1 = GRU(16)(emb_item_desc)
    rnn_layer2 = GRU(8)(emb_category_name)
    rnn_layer3 = GRU(8)(emb_name)

    # main layer
    main_l = concatenate([
        Flatten()(emb_brand),
        Flatten()(emb_category),
        Flatten()(emb_item_condition),
        rnn_layer1,
        rnn_layer2,
        rnn_layer3,
        shipping])
    main_l = Dropout(dr)(Dense(512, activation='relu')(main_l))
    main_l = Dropout(dr)(Dense(64, activation='relu')(main_l))
    main_l = Dropout(dr)(Dense(32, activation='relu')(main_l))

    # output
    output = Dense(1, activation="linear")(main_l)

    # model
    model = Model([name, item_desc, brand, category, category_name, item_condition, shipping], output)
    model.compile(loss="mse", optimizer="Adam")
    return model
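The model relies on a few constants that the post doesn’t show: the embedding vocabulary sizes (MAX_TEXT, MAX_BRAND, MAX_CATEGORY, MAX_CONDITION), the dropout rate dr, and the BATCH_SIZE used during training. A minimal sketch, with the specific values being my assumptions:

# Vocabulary sizes for the embedding layers, derived from the fitted tokenizer/encoders
MAX_TEXT = len(tok_raw.word_index) + 1
MAX_BRAND = np.max([train.brand.max(), test.brand.max()]) + 1
MAX_CATEGORY = np.max([train.category.max(), test.category.max()]) + 1
MAX_CONDITION = np.max([train.item_condition_id.max(), test.item_condition_id.max()]) + 1

# Training hyperparameters (assumed values)
dr = 0.25           # dropout rate
BATCH_SIZE = 20000  # batch size used in model.fit below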

Now, we instantiate and train the model…

model = get_model()
history = model.fit(X_train, dtrain.target, epochs=10, batch_size=BATCH_SIZE, validation_split=0.2)  # 20% of the data for validation, 80% for training

Here’s how it goes…

Train on 1453031 samples, validate on 14678 samples
Epoch 1/10
1453031/1453031 [==============================] - 139s - loss: 0.4857 - val_loss: 0.2264
Epoch 2/10
1453031/1453031 [==============================] - 138s - loss: 0.2112 - val_loss: 0.2117
Epoch 3/10
1453031/1453031 [==============================] - 137s - loss: 0.1779 - val_loss: 0.1996
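To see curves instead of raw numbers, you can plot the history object returned by fit (a minimal sketch, assuming matplotlib is installed):

import matplotlib.pyplot as plt

# Plot training vs. validation loss per epoch
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('MSE loss (log-price space)')
plt.legend()
plt.show()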

Then we evaluate it…
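The evaluation below relies on an rmsle helper that the post doesn’t define; a minimal sketch of the root mean squared logarithmic error it computes would be:

def rmsle(y_true, y_pred):
    # Root mean squared logarithmic error between true and predicted prices
    return np.sqrt(np.mean(np.square(np.log1p(y_pred) - np.log1p(y_true))))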

def eval_model(model):
    val_preds = model.predict(X_valid)
    val_preds = np.expm1(val_preds)
    y_true = np.array(dvalid.price.values)
    y_pred = val_preds[:, 0]
    v_rmsle = rmsle(y_true, y_pred)
    return v_rmsle

v_rmsle = eval_model(model)

which outputs
RMSLE error on dev test: 0.44780704411885014

We now create the predictions and plot the prices.

preds = model.predict(X_test, batch_size=BATCH_SIZE) 
preds = np.expm1(preds)
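# The post doesn't show how `submission` is built; a minimal sketch, assuming the
# usual Kaggle submission format with a test_id column in the test data:
submission = test[["test_id"]].copy()
submission["price"] = preds[:, 0]
submission.to_csv("submission.csv", index=False)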
submission.price.hist(bins=20, range=[0, 100])

Be sure to check out & participate in its Kaggle competition. Visit here.

Credits to Chengwei Zhang. I’ve merely created a wrapper around his work to make it easier to understand.
See ya!

Originally published at blog.lipishala.com on January 7, 2019.

