Text and Image Classification for Craigslist using GloVe and MobileNet — Transfer Learning

shikhar kanaskar
Shikhar’s Data Science Projects
14 min read · Dec 10, 2022

Business Problem:

Craigslist is an American classified advertisements website with sections for housing, jobs, services (beauty, legal, health, etc.), and products for sale. Anyone can list a product or service on Craigslist for free, and interested buyers can contact the poster directly. However, many listings on Craigslist are not properly classified and end up posted in incorrect sections.

The category of interest here is the ‘Bikes’ section. Like many other sections of the site, it contains numerous listings that do not belong there: bike parts, helmets, scooters, bags, and many other items are posted on this page and are effectively misclassified.

When visitors to this section keep running into misclassified products, their satisfaction with the website drops and they are less likely to become repeat users. By reducing misclassification, I aim to improve the user experience.

Model Architecture — using an ensemble of text and image classification

I scraped both text and image data from Craigslist for around 2,500 advertisements and manually labelled those posts as bicycles or non-bicycles. I then trained separate image and text classification models and later ensembled them into a single model (see the sketch below).
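As a preview of the final step, the ensembling is just a weighted average of the two models’ predicted probabilities; the weights used later in this post are 0.35 for the text model and 0.65 for the image model. A minimal sketch, illustrative only, assuming both models output the probability that a listing is a bicycle:

def ensemble_probability(p_text, p_image, w_text=0.35, w_image=0.65):
    # Weighted average of the two models' bicycle probabilities
    return w_text * p_text + w_image * p_image

p = ensemble_probability(p_text=0.40, p_image=0.90)   # e.g. text model is unsure, image model is confident
label = 1 if p > 0.5 else 0                           # 1 = bicycle, 0 = not a bicycle
print(p, label)                                       # 0.725 1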

Data Collection:

To collect the data required for our analysis, I scraped Craigslist for both images and text using BeautifulSoup and Selenium.

Because the scraped data from the Craigslist bikes section contained numerous incorrect listings, I manually cleaned it by labelling roughly 3,000 records (~2,500 for training from Chicago and ~660 for testing from Cincinnati). The images collected from Google also contained pictures from categories other than bikes; these incorrect images were removed manually.

!pip install selenium
!pip install webdriver_manager

import json
import time
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# Install chromedriver on the Colab (Ubuntu) runtime and make it discoverable
!apt-get update  # update package lists so apt install runs correctly
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0, '/usr/lib/chromium-browser/chromedriver')

# Headless Chrome options for scraping from a notebook environment
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')               # run Chrome without a UI
chrome_options.add_argument('--no-sandbox')             # required when Chrome runs as root (e.g. in Colab)
chrome_options.add_argument('--disable-dev-shm-usage')  # use /tmp instead of the small /dev/shm partition

# Alternative: let webdriver_manager download a matching driver
# from selenium.webdriver.chrome.service import Service
# driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)

# Build the paginated search URLs for the bikes section (120 results per page)
url_links = []
for i in range(0, 960, 120):
    url_links.append('https://atlanta.craigslist.org/search/bia?s=' + str(i))

import requests
import random

# Rotate between a few realistic browser header sets so the per-post requests look less automated
headers_list = [
    # Firefox 77 Mac
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    },
    # Chrome 92.0 Win10
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "https://www.google.com/",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    },
    # Chrome 91.0 Win10
    {
        "Connection": "keep-alive",
        "DNT": "1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Dest": "document",
        "Referer": "https://www.google.com/",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"
    },
    # Firefox 90.0 Win10
    {
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-User": "?1",
        "Sec-Fetch-Dest": "document",
        "Referer": "https://www.google.com/",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9"
    }
]
post_title = []
post_url = []
post_image = []
post_date = []
post_price = []
post_text = []
post_id = []

driver = webdriver.Chrome('chromedriver', options = chrome_options)

for i in url_links:
    page = i
    driver.get(page)
    time.sleep(np.random.uniform(3, 6))
    html = driver.page_source
    soup = bs(html, 'html.parser')
    for j in range(len(soup.find_all('h3', class_ = "result-heading"))):
        post_title.append(soup.find_all('h3', class_ = "result-heading")[j].find('a').getText())
        post_url.append(soup.find_all('h3', class_ = "result-heading")[j].find('a')['href'])
        # Append the thumbnail URL if the listing has one, otherwise a placeholder
        if len(soup.find_all('a', href = soup.find_all('h3', class_ = "result-heading")[j].find('a')['href'])[0]['class']) == 2:
            post_image.append(soup.find_all('a', href = soup.find_all('h3', class_ = "result-heading")[j].find('a')['href'])[0].find_all("img")[0]['src'])
        else:
            post_image.append('no image available')
        post_date.append(soup.find_all('time', class_ = "result-date")[j]['datetime'])
        post_price.append(soup.find_all('span', class_ = "result-price")[j].getText())
        # Fetch the full post page with a randomly chosen user agent and a polite delay
        headers = random.choice(headers_list)
        r = requests.Session()
        r.headers = headers
        time.sleep(1.0 + np.random.uniform(0, 1))
        html_post = r.get(soup.find_all('h3', class_ = "result-heading")[j].find('a')['href']).text
        soup_post = bs(html_post, 'html.parser')
        post_text.append(soup_post.find_all("section", id = "postingbody")[0].getText())
        post_id.append(soup_post.find_all('p', class_ = "postinginfo")[1].getText()[9:])  # drop the "post id: " prefix
driver.quit()
post_text_1 = [s.replace('\n', ' ') for s in post_text]
import pandas as pd
final_df = pd.DataFrame({
    'Post ID': post_id,
    'Post Title': post_title,
    'Post URL': post_url,
    'Post Image': post_image,
    'Date': post_date,
    'Item Cost': post_price,
    'Post Text': post_text_1
})

final_df.to_csv(r'/content/drive/MyDrive/AUD Project/Craigslist_Web_Scrape_Atlanta v1.csv', index=False, header=True)
scraped_data=final_df.copy()
scraped_data=scraped_data.sort_values('Date').drop_duplicates(['Post Title','Post Text'] ,keep='last')
import urllib.request

# Download each post's thumbnail image, named by its post ID; skip posts with no usable image URL
for url, post_id in zip(scraped_data['Post Image'], scraped_data['Post ID']):
    image_name = str(post_id)
    try:
        urllib.request.urlretrieve(url, "/content/drive/MyDrive/AUD Project/post_images_data/" + image_name + ".jpg")
    except:
        continue

Text Classification Model using GloVe and CNN:

For text classification, I used the listing text and its label (bike or not a bike) as input. I performed the necessary text preprocessing steps, such as removing stop words, punctuation, and HTML tags, so that no noise remains when the model is trained (a brief illustration of these cleaning steps follows).
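The notebook code below handles punctuation and number masking; as an illustrative, assumed version of the stop-word and HTML-tag removal mentioned above (not the project’s exact code), something along these lines with BeautifulSoup and NLTK would do the job:

# Illustrative cleaning helpers (assumed, not the project's exact code):
# strip HTML tags with BeautifulSoup and drop English stop words with NLTK.
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

nltk.download('stopwords')
STOP_WORDS = set(stopwords.words('english'))

def strip_html(text):
    return BeautifulSoup(text, 'html.parser').get_text(separator=' ')

def remove_stop_words(text):
    return ' '.join(w for w in text.split() if w.lower() not in STOP_WORDS)

print(remove_stop_words(strip_html('<p>This is a <b>mountain bike</b> for sale</p>')))
# -> mountain bike sale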

import pandas as pd
import numpy as np
data = pd.read_csv("Craigslist_Web_Scrape v0.13.csv", encoding = "utf-8")
labels = pd.read_csv("labels_cragislist.csv")
data = pd.concat([data, labels], axis = 1)
data_deduped = data.drop_duplicates(subset=['Post Title'], keep=False)
data_deduped = data_deduped.drop(["Post ID"], axis=1)
data_deduped['Post Text'] = data_deduped['Post Text'].str[30:]  # drop the first 30 characters (leading boilerplate in each post body)
data_deduped['Post Text'].iloc[6]                               # spot-check one cleaned post
data_deduped['Item Cost'] = data_deduped['Item Cost'].replace('[\$,]', '', regex=True).astype(float)  # "$1,200" -> 1200.0
data_deduped['Label'].value_counts()
# Importing libraries
import os
from gensim import models
import numpy as np
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
import matplotlib.pyplot as plt
%matplotlib inline
# Load the pre-trained 50-dimensional GloVe vectors into a {word: vector} dictionary
embeddings_index = {}
f = open('C:/BAIM - Purdue/MGMT590AUD - Analyzing Unstructured Data/Project/glove.6B.50d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.' % len(embeddings_index))
# Cleaning and pre-processing text
import re

def clean_numbers(text):
    # Mask runs of digits so the vocabulary does not explode with unique numbers
    text = re.sub('[0-9]{5,}', '#####', text)
    text = re.sub('[0-9]{4}', '####', text)
    text = re.sub('[0-9]{3}', '###', text)
    text = re.sub('[0-9]{2}', '##', text)
    return text

def clean_text(text):
    text = str(text)
    text = clean_numbers(text)

    # Replace separators with spaces, pad '&', then strip remaining punctuation
    for punct in "/-'":
        text = text.replace(punct, ' ')
    for punct in '&':
        text = text.replace(punct, f' {punct} ')
    for punct in '?!.,"$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        text = text.replace(punct, '')

    text = text.lower()
    return text
from numpy import zeros
import keras
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.callbacks import EarlyStopping
from keras.layers import Embedding

data_deduped["processed_data"] = data_deduped['Post Text'].progress_apply(lambda x: clean_text(x))
# train_df["length"] = data_deduped['Post Text'].progress_apply(lambda x: len(x.split()))
docs = data_deduped["processed_data"].values
# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)
# pad documents to a max length of 150
max_length = 150
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

labels = data_deduped["Label"].values

# Create a weight matrix for words in the training docs: row i holds the GloVe vector of word index i
embedding_matrix = zeros((vocab_size, 50))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

For the first model I tried, a Stochastic Gradient Descent (SGD) classifier, I used SMOTE to treat the class imbalance. The data was then split into training, validation and test sets, and the SGD classifier was trained. Hyperparameters such as the loss function, maximum iterations, tolerance and alpha were tuned with grid search to minimise misclassification error, and K-fold cross-validation was run to check that the model generalises rather than overfitting the training data. However, the model's outputs were poorly calibrated, with almost all predicted probabilities falling between 0.4 and 0.6, so I implemented another model (CNN classification) to improve the accuracy of the text classifier.
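The SGD pipeline itself is not reproduced in this post; the sketch below illustrates the approach described above, with the grid values being assumptions rather than the exact parameters used:

# Hedged sketch of the SGD baseline (grid values are illustrative assumptions)
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(padded_docs, labels, stratify=labels, random_state=0)
X_tr_bal, y_tr_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # balance the training split only

param_grid = {
    'loss': ['modified_huber'],  # a loss that exposes predict_proba ('log_loss' on newer scikit-learn)
    'alpha': [1e-4, 1e-3, 1e-2],
    'max_iter': [1000, 2000],
    'tol': [1e-3, 1e-4],
}
grid = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=5, scoring='accuracy')  # 5-fold CV
grid.fit(X_tr_bal, y_tr_bal)
print(grid.best_params_, grid.score(X_te, y_te))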

# Create train/test data, then oversample the minority class in the training split only
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X_train_SMOTE, X_test, y_train_SMOTE, y_test = train_test_split(
    padded_docs, labels, stratify=labels, random_state=0)

smt = SMOTE(random_state=0)
X_train, y_train = smt.fit_resample(X_train_SMOTE, y_train_SMOTE)

I embedded the pre-processed text using pre-trained Global Vectors (GloVe), an unsupervised learning algorithm for obtaining vector representations of words. The resulting embeddings capture linear substructure among words in the vector space, so semantically related words sit close together. Each tokenised listing is padded to a fixed length so that all sequences entering the network have the same shape, and the embedded data is split into training, validation and test sets.
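As a quick illustrative check of what these vectors encode, one can compare cosine similarities using the embeddings_index dictionary loaded earlier (exact values depend on the 50-dimensional glove.6B file):

import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two GloVe vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_sim(embeddings_index['bike'], embeddings_index['bicycle']))  # related words score high
print(cosine_sim(embeddings_index['bike'], embeddings_index['sofa']))     # unrelated words score lower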


Next, I defined and trained a CNN text classifier. The embedding layer (initialised with the GloVe weight matrix) feeds a 1D convolution with ReLU activation, global max pooling keeps the maximum value of each feature map, and the pooled vector passes through a small ReLU dense layer into a final, fully connected classification layer. The classification layer is trained to output a single value between 0 and 1, where values close to 0 indicate a non-bike listing and values close to 1 indicate a bike. The model was then evaluated on held-out bike and non-bike text listings.

model_new = Sequential([
    keras.layers.Embedding(vocab_size, 50, weights=[embedding_matrix],
                           input_length=max_length, trainable=True),
    keras.layers.Conv1D(32, 3, activation='relu'),
    keras.layers.GlobalMaxPooling1D(),
    # keras.layers.LSTM(100),
    # keras.layers.Conv1D(64, 3, activation='relu'),
    keras.layers.Dense(10, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')])
model_new.summary()
model_new.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
# early_stop = EarlyStopping(monitor='val_loss', patience=2)
history = model_new.fit(X_train, y_train, epochs=20, validation_data=(X_test, y_test),
                        # callbacks=[early_stop],
                        batch_size=256)

# Plot training & validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
prob_test = model_new.predict(X_test)

# Convert sigmoid probabilities to hard 0/1 labels at a 0.5 threshold
y_pred_1 = []
for i in range(len(prob_test)):
    if prob_test[i][0] > 0.5:
        y_pred_1.append(1)
    else:
        y_pred_1.append(0)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt

def c_report(y_true, y_pred):
    print("Classification Report")
    print(classification_report(y_true, y_pred))
    acc_sc = accuracy_score(y_true, y_pred)
    print("Accuracy : " + str(acc_sc))
    return acc_sc

def plot_confusion_matrix(y_true, y_pred):
    mtx = confusion_matrix(y_true, y_pred)
    sns.heatmap(mtx, annot=True, fmt='d', linewidths=.5,
                cmap="Blues", cbar=False)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Score the held-out Cincinnati listings with the trained text model
data_test = pd.read_csv("C:\\BAIM - Purdue\\MGMT590AUD - Analyzing Unstructured Data\\Project\\Cincinnati - test data.csv")
data_test['Post Text'] = data_test['Post Text'].str[30:]
data_test['Item Cost'] = data_test['Item Cost'].replace('[\$,]', '', regex=True).astype(float)
data_test["processed_data"] = data_test['Post Text'].progress_apply(lambda x: clean_text(x))
docs_test = data_test["processed_data"].values
encoded_docs = t.texts_to_sequences(docs_test)
print(encoded_docs)
# pad documents to the same max length of 150 used in training
max_length = 150
padded_docs_new = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
labels_test = data_test["y_true_img"].values

prob_test_new = model_new.predict(padded_docs_new)
y_pred_1_new = []
for i in range(len(prob_test_new)):
    if prob_test_new[i][0] > 0.5:
        y_pred_1_new.append(1)
    else:
        y_pred_1_new.append(0)

c_report(labels_test, y_pred_1_new)
plot_confusion_matrix(labels_test, y_pred_1_new)

Image Classification Model using MobileNet

I started by pre-processing the images, reshaping and padding them to a common size; smaller images speed up training because there are fewer pixels to process. I also applied image augmentation, including random horizontal flips and random rotations of up to 20% of a full turn (the RandomRotation(0.2) layer used below), which effectively enlarges the training set and makes the model more robust. A sketch of the padding step follows.
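The resizing itself happens later via image_dataset_from_directory; as an illustrative, assumed way to pad an image to a square before resizing so the aspect ratio is preserved (not necessarily the project’s exact preprocessing):

import tensorflow as tf

def pad_and_resize(image, target=160):
    # Letterbox the image to target x target, preserving the aspect ratio
    # (assumed helper for illustration only)
    return tf.image.resize_with_pad(image, target, target)

img = tf.random.uniform((300, 480, 3))   # dummy landscape image
print(pad_and_resize(img).shape)         # (160, 160, 3)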

For training the model, transfer learning was leveraged to enhance the prediction power of our image classification model. The intuition behind transfer learning for image classification is that if a model is trained on a large and general enough dataset (such as ImageNet, which is a research training dataset with categories such as jackfruits and syringes), it will effectively serve as a generic model of the visual world.

import matplotlib.pyplot as plt
import numpy as np
import os
import tensorflow as tf

# Count the validation bicycle images as a quick sanity check on the directory structure
lst = os.listdir('/content/drive/MyDrive/Mobilenet_pretrained2/Validation/bicycle')
number_files = len(lst)
print(number_files)

train_dir = os.path.join("/content/drive/MyDrive/Mobilenet_pretrained2", 'Train')
validation_dir = os.path.join("/content/drive/MyDrive/Mobilenet_pretrained2", 'Validation')

BATCH_SIZE = 32
IMG_SIZE = (160, 160)

train_dataset = tf.keras.utils.image_dataset_from_directory(train_dir,
                                                             shuffle=True,
                                                             batch_size=BATCH_SIZE,
                                                             image_size=IMG_SIZE)
validation_dataset = tf.keras.utils.image_dataset_from_directory(validation_dir,
                                                                 shuffle=True,
                                                                 batch_size=BATCH_SIZE,
                                                                 image_size=IMG_SIZE)
class_names = train_dataset.class_names

# Preview nine training images with their class labels
plt.figure(figsize=(10, 10))
for images, labels in train_dataset.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(class_names[labels[i]])
        plt.axis("off")

# Carve a test set out of the validation batches (one fifth of them)
val_batches = tf.data.experimental.cardinality(validation_dataset)
test_dataset = validation_dataset.take(val_batches // 5)
validation_dataset = validation_dataset.skip(val_batches // 5)

AUTOTUNE = tf.data.AUTOTUNE

train_dataset = train_dataset.prefetch(buffer_size=AUTOTUNE)
validation_dataset = validation_dataset.prefetch(buffer_size=AUTOTUNE)
test_dataset = test_dataset.prefetch(buffer_size=AUTOTUNE)

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),
    tf.keras.layers.RandomRotation(0.2),
])

# Visualise the augmentation applied repeatedly to a single training image
for image, _ in train_dataset.take(1):
    plt.figure(figsize=(10, 10))
    first_image = image[1]
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        augmented_image = data_augmentation(tf.expand_dims(first_image, 0))
        plt.imshow(augmented_image[0] / 255)
        plt.axis('off')
preprocess_input = tf.keras.applications.mobilenet_v2.preprocess_input
rescale = tf.keras.layers.Rescaling(1./127.5, offset=-1)

IMG_SHAPE = IMG_SIZE + (3,)
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')
image_batch, label_batch = next(iter(train_dataset))
feature_batch = base_model(image_batch)
print(feature_batch.shape)

# Freeze the convolutional base so only the new classification head trains at first
base_model.trainable = False
base_model.summary()

The base model comes from MobileNet V2, which is pre-trained on roughly 1.4M images across 1,000 classes. It uses 3 × 3 depthwise separable convolutions, which need roughly 8 to 9 times less computation than standard convolutions at the cost of only a small reduction in accuracy. Rather than building a model from scratch and training it on a large dataset, I reuse these learned feature maps and transfer that knowledge to this task, which is where transfer learning gets its name.
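The 8–9× figure follows from the standard cost ratio of a depthwise separable convolution versus a full convolution, 1/N + 1/Dk², where Dk is the kernel size and N the number of output channels. A quick check with illustrative numbers:

# Cost of a depthwise separable conv relative to a standard conv:
# ratio = 1/N + 1/(Dk*Dk), with Dk = kernel size and N = output channels
def separable_cost_ratio(kernel_size, out_channels):
    return 1 / out_channels + 1 / (kernel_size ** 2)

ratio = separable_cost_ratio(kernel_size=3, out_channels=256)
print(ratio, 1 / ratio)   # ~0.115 of the multiply-adds, i.e. roughly 8.7x fewer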

global_average_layer = tf.keras.layers.GlobalAveragePooling2D()
feature_batch_average = global_average_layer(feature_batch)
print(feature_batch_average.shape)
prediction_layer = tf.keras.layers.Dense(1)
prediction_batch = prediction_layer(feature_batch_average)
print(prediction_batch.shape)

inputs = tf.keras.Input(shape=(160, 160, 3))
x = data_augmentation(inputs)
x = preprocess_input(x)
x = base_model(x, training=False)
x = global_average_layer(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = prediction_layer(x)
model = tf.keras.Model(inputs, outputs)

base_learning_rate = 0.0001
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=base_learning_rate),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.summary()

For the fine-tuning step, a few of the top layers of the frozen base model are unfrozen, and the newly added classifier layers are trained jointly with those last base-model layers. This lets the higher-order feature representations in the base model adapt to this specific task.

In the classification head, the GlobalAveragePooling2D layer converts each image's feature maps into a single 1280-element vector, a Dropout layer with a rate of 0.2 guards against overfitting, and the final layer is a single node whose sigmoid output gives the probability of the bike class. During fine-tuning (shown further below), the first 100 layers of the base model stay frozen, with their trainable flag set to False, so their pre-trained weights are reused and only the layers above that cut-off are updated.

initial_epochs = 10

loss0, accuracy0 = model.evaluate(validation_dataset)
print("initial loss: {:.2f}".format(loss0))
print("initial accuracy: {:.2f}".format(accuracy0))
#OUTPUT:
# initial loss: 0.64
# initial accuracy: 0.84

history = model.fit(train_dataset,
                    epochs=initial_epochs,
                    validation_data=validation_dataset)
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

# Unfreeze the base model, then re-freeze everything below the fine-tune cut-off
base_model.trainable = True

# Let's take a look to see how many layers are in the base model
print("Number of layers in the base model: ", len(base_model.layers))

# Fine-tune from this layer onwards
fine_tune_at = 100

# Freeze all the layers before the `fine_tune_at` layer
for layer in base_model.layers[:fine_tune_at]:
    layer.trainable = False

# Recompile with a lower learning rate so fine-tuning does not wreck the pre-trained weights
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.RMSprop(learning_rate=base_learning_rate / 10),
              metrics=['accuracy'])
model.summary()

fine_tune_epochs = 10
total_epochs = initial_epochs + fine_tune_epochs

history_fine = model.fit(train_dataset,
                         epochs=total_epochs,
                         initial_epoch=history.epoch[-1],
                         validation_data=validation_dataset)
acc += history_fine.history['accuracy']
val_acc += history_fine.history['val_accuracy']

loss += history_fine.history['loss']
val_loss += history_fine.history['val_loss']

loss, accuracy = model.evaluate(test_dataset)
# X, y = model.evaluate(test_dataset)
print('Test accuracy :', accuracy)

# Retrieve a batch of images from the test set

image_batch, label_batch = test_dataset.as_numpy_iterator().next()
predictions = model.predict_on_batch(image_batch).flatten()

# Apply a sigmoid since our model returns logits
predictions = tf.nn.sigmoid(predictions)

predictions = tf.where(predictions < 0.5, 0, 1)

print('Predictions:\n', predictions.numpy())
print('Labels:\n', label_batch)

# Show 30 test images with their predicted class names
plt.figure(figsize=(10, 10))
for i in range(30):
    ax = plt.subplot(6, 5, i + 1)
    plt.imshow(image_batch[i].astype("uint8"))
    plt.title(class_names[predictions[i]])
    plt.axis("off")

Modelling Outputs

I validated the binary classification models using metrics such as the confusion matrix, precision/recall, and accuracy on the validation and test sets. The values of these metrics for the ensembled model and for the individual image and text models are summarized below.

Ensembling Text and Image Model for better performance

The final model, an ensemble of the text and image models, achieves an overall accuracy of 98% and a misclassification rate of 1.9%. By weighting the two probability outputs (0.35 for text and 0.65 for image, as in the code below), it compensates for the individual models' weaknesses and improves on either one alone.

import glob
from PIL import Image

# Extract the 10-character post ID from each downloaded test image's file name
post_id = [f[-14:-4] for f in glob.glob("/content/drive/MyDrive/Test_Images/*.jpg")]

# Open every test image, resize to the 160x160 model input size, and add a batch axis
craig_imgs_arr = np.array([np.array(Image.open("/content/drive/MyDrive/Test_Images/" + i + ".jpg")) for i in post_id])
craig_imgs_arr_resized = np.array([Image.fromarray(i).resize((160, 160)) for i in craig_imgs_arr])
craig_imgs_arr_resized_axised = np.array([np.array(i)[np.newaxis, ...] for i in craig_imgs_arr_resized])

# Score each image with the fine-tuned model, keeping both the sigmoid probability and the hard label
craig_labels = {'Post ID': [], 'y_pred_label_img': [], 'prob_img': []}
for i in range(0, len(post_id)):
    pred = model.predict(craig_imgs_arr_resized_axised[i]).flatten()
    pred = tf.nn.sigmoid(pred)
    craig_labels['prob_img'].append(pred.numpy()[0])
    pred = tf.where(pred < 0.5, 0, 1)
    craig_labels['Post ID'].append(post_id[i])
    craig_labels['y_pred_label_img'].append(pred.numpy()[0])

import pandas as pd
df_craig_pred = pd.DataFrame(craig_labels)
df_craig_pred.to_csv('/content/drive/MyDrive/post_labels_cincinnati.csv')

# Merge the image predictions with the text model's output for the same posts
df_text_op = pd.read_csv('/content/drive/MyDrive/Cincinnati_text_model_output.csv')
df_text_op['Post ID'] = df_text_op['Post ID'].astype(str)
df_ensemble = pd.merge(df_craig_pred, df_text_op, on='Post ID', how='left')

# Align the text probability with the image model's label convention, then blend the two
# probabilities with a 0.35 (text) / 0.65 (image) weighting and threshold at 0.5
df_ensemble['text_pred_new'] = 1 - df_ensemble['text_pred']
df_ensemble['y_pred_text_label'] = df_ensemble['text_pred_new'].apply(lambda x: 1 if (x > 0.5) else 0)
df_ensemble['y_pred_ensemble'] = 0.35 * df_ensemble['text_pred_new'] + 0.65 * df_ensemble['prob_img']
df_ensemble['y_pred_ensemble_label'] = df_ensemble['y_pred_ensemble'].apply(lambda x: 1 if (x > 0.5) else 0)
df_ensemble.head()

from sklearn.metrics import confusion_matrix
cf_matrix=confusion_matrix(df_ensemble['y_true'], df_ensemble['y_pred_ensemble_label'])
import seaborn as sns

sns.heatmap(cf_matrix, annot=True,fmt=".0f")

CONCLUSION

BUSINESS IMPACT

As the misclassification rate in the listings decreases, both the user and seller experience with the website improves. Users no longer have to spend time filtering out products that are of no interest to them, which helps Craigslist increase overall engagement and earn positive user reviews, in turn attracting new customers to the website. On the seller side, sellers are less likely to be flagged by users for incorrect listings. Both factors would contribute towards increased revenue and profit for our client.

FUTURE SCOPE

For this project, I restricted myself to essentially one location, Chicago. The geographical scope can be expanded to include listings from all locations in the US.

Currently, I have focused only on cleaning up the bicycles section of the website; however, multiple other sections still have high misclassification rates. This approach can be extended to almost all other categories, such as phones, cars, watches, and jewellery.

Lastly, a multi-level classification approach could be used to further classify subgroups within each category. For instance, bike parts (seats, handlebars, pedals) and bike accessories could each be routed to their own sections.
