In this tutorial, we set out to analyze the Amazon product data set using Spark MLlib. The training data set includes ASIN, Brand Name, Category Name, Product Title, and Image URL. For a detailed description of the Amazon product data, the reader can refer to Julian McAuley's webpage.

For the purpose of our analyses, the following key features and label are defined:

Features

  1. ASIN: ID of the product
  2. BrandName: name of the brand
  3. Title: title of the product
  4. ImageUrl: URL of the product image

Label

  1. CategoryName: Name of the category.

Our objective is to use the Scala programming language to write a classifier that learns from the key product features provided in the training data set and then predicts the category of an unseen data set that carries no category label. In machine learning and statistics, classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

There are several reasons for performing classification on the Amazon data set, such as gaining a better understanding of customers' reviews beyond what summary statistics alone can show. It may also help reveal how a product's features relate to customers' choices across categories such as food, clothing, musical instruments, and electronics.

It is worth noting that for the Amazon product data, the features and label are mainly categorical in nature; hence it is necessary to convert them to a numerical type that the machine understands. We will get to this shortly.
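As a brief preview, the sketch below shows one common way to do this with Spark MLlib's StringIndexer. The output column name "label" is an assumption for illustration; the details follow on the next page.

import org.apache.spark.ml.feature.StringIndexer

// Illustrative sketch: map the categorical label to a numeric index.
// "CategoryName" is the label column defined above; "label" is an assumed output name.
val labelIndexer = new StringIndexer()
  .setInputCol("CategoryName")
  .setOutputCol("label")

// Once a DataFrame df is loaded (see below), fitting learns the string-to-index
// mapping and transform() appends the numeric "label" column:
// val indexed = labelIndexer.fit(df).transform(df)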

To accomplish this objective, the following main steps are required:

  • Reading the data and exploring the data characteristics
  • Model transformation
  • Creating a pipeline
  • Training and testing the model
  • Evaluating the model

Reading the data and exploring the data characteristics

Typically, most data are saved as files in CSV or JSON format. Hence, the first step is to take a quick preview of the data:

[Figure: preview of the CSV data]

[Figure: preview of the JSON data]
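If you prefer to peek at the raw file directly from Scala before involving Spark, a minimal sketch like this works (the path is a placeholder for your local copy of the file):

import scala.io.Source

// Print the first few raw lines to get a feel for the format.
val preview = Source.fromFile("amazon_dataset/data.csv")
try preview.getLines().take(5).foreach(println)
finally preview.close()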

Next, we get the file location and set up the SparkSession builder to read the file.

// Location of the data set (the **** prefix stands in for your own path)
val dir = "****/amazon_dataset"

val csv_file = dir + "/data.csv"

val json_file = dir + "/data.json"

// Build (or reuse) a local SparkSession for this project
val spark = SparkSession.builder()
  .master("local")
  .appName("My ML project")
  .config("spark.some.config.option", "")
  .getOrCreate()

import spark.implicits._

You would have to put all these lines of code in a Scala object and make the required imports, e.g.

import org.apache.spark.sql.SparkSession

object amazon extends App {
  // all of the code in this tutorial goes here
}

Note: the import spark.implicits._ statement enables implicit conversions, such as converting RDDs to DataFrames.
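As a small illustration of what the implicits enable (with made-up values, not part of the tutorial's data flow), a local collection can be turned straight into a DataFrame via toDF:

// With spark.implicits._ in scope, toDF becomes available on local collections.
val demo = Seq(("B00001", "SomeBrand")).toDF("ASIN", "BrandName")
demo.show()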

To read the CSV file, all you have to do is write:

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load(csv_file)
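The header option treats the first line of the file as column names, while the DROPMALFORMED mode tells Spark to silently discard any rows it cannot parse rather than failing the whole load.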

And if it is a JSON file:

val df = spark.read.json(json_file)
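Note that Spark infers the JSON schema by scanning the records, so no extra options are needed here; each line of the file is expected to contain a single JSON object.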

And then we can finally explore the data characteristics:

val numOfRowsToShow = 10 // change this as you prefer

df.describe().show(numOfRowsToShow)

df.printSchema()
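Beyond the summary statistics and the schema, it is often worth checking how the label is distributed. A small sketch, assuming the category column is indeed named CategoryName as described above:

// Count how many products fall into each category (the label we want to predict)
df.groupBy("CategoryName")
  .count()
  .orderBy($"count".desc)
  .show(numOfRowsToShow)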

Now what? This page is getting rather long, so I will end here and continue with the next topic on a new page. Thanks for reading and see you there!

References:

R. He, J. McAuley. Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016

J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015

Next: Model transformation

Taiwo Adetiloye

Taiwo O. Adetiloye is very interested in large-scale data processing and analytics using AI and ML frameworks like Spark, Keras, and TensorFlow.