Model transformation (Part 2)

In the previous lesson, we performed the basic steps of exploratory data analysis. That lesson and the current one constitute important preprocessing steps before a model can be trained.

During model transformation, the features often need to be converted into a form the computer can process without error. Machines like numbers, and they do better with little or no noise in the data, e.g. nulls, missing values, and outliers.

The following steps form our TODO list for the transformation:

- StringIndexer()
- SQLTransformer()
- VectorAssembler()
- Normalizer()
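The last two items on the list are covered in the next lesson, but as a hedged preview, here is a minimal sketch of how they typically fit together, assuming indexed Double columns like the ones we produce below (the column names are illustrative only):

```scala
import org.apache.spark.ml.feature.{Normalizer, VectorAssembler}

// VectorAssembler packs several numeric columns into one feature vector,
// which is the input format Spark ML estimators expect.
val assembler = new VectorAssembler()
  .setInputCols(Array("ASINIndex", "BrandNameIndex", "TitleNameIndex", "ImageUrlIndex"))
  .setOutputCol("features")

// Normalizer rescales each feature vector to unit norm (here the L2 norm),
// so that rows with large raw index values do not dominate.
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(2.0)
```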

From the dataset that was displayed based on the schema, we see that all our selected variables are categorical. Hence, we define a small helper function, stringIndex(), that builds a StringIndexer for a given input/output column pair:

def stringIndex(input: String, output: String) = {
  new StringIndexer()
    .setInputCol(input)
    .setOutputCol(output)
    .setHandleInvalid("keep")
}

The StringIndexer takes a categorical String input column and transforms it into numerical (Double) values.
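To make the behaviour concrete, here is a small self-contained sketch, assuming a live SparkSession named `spark` (the sample brand values are made up for illustration):

```scala
import org.apache.spark.ml.feature.StringIndexer

// Hypothetical toy data: a String categorical column.
val df = spark.createDataFrame(Seq(
  (0, "Sony"), (1, "Samsung"), (2, "Sony"), (3, "LG")
)).toDF("id", "BrandName")

val indexer = new StringIndexer()
  .setInputCol("BrandName")
  .setOutputCol("BrandNameIndex")
  .setHandleInvalid("keep") // unseen labels at transform time get an extra index instead of raising an error

// fit() learns the label-to-index mapping (the most frequent label maps to 0.0);
// transform() appends the numeric BrandNameIndex column.
indexer.fit(df).transform(df).show()
```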

Next, we create a trait called DataProcessor to perform the SQL transformation for both the train and test dataframes. Note that the test data lacks the category column; our goal is to predict this column for each row.

trait DataProcessor {
  def mainDataProcessor(): DataFrame
}

class trainDataProcessor(data: DataFrame) extends DataProcessor {
  // Index each column of the training data as numeric
  def mainDataProcessor(): DataFrame = {
    val asinIndexed = stringIndex("ASIN", "ASINIndex").fit(data).transform(data)
    val brandNameIndex = stringIndex("BrandName", "BrandNameIndex").fit(asinIndexed).transform(asinIndexed)
    val titleNameIndex = stringIndex("Title", "TitleNameIndex").fit(brandNameIndex).transform(brandNameIndex)
    val imageUrlIndex = stringIndex("ImageUrl", "ImageUrlIndex").fit(titleNameIndex).transform(titleNameIndex)
    val categoryNameIndex = stringIndex("CategoryName", "CategoryNameIndex").fit(imageUrlIndex).transform(imageUrlIndex)
    val amzBasicTransformation = new SQLTransformer()
      .setStatement("""
        SELECT ASINIndex, BrandNameIndex, TitleNameIndex, ImageUrlIndex, CategoryNameIndex
        FROM __THIS__""")
    amzBasicTransformation.transform(categoryNameIndex)
  }
}
class testDataProcessor(data: DataFrame) extends DataProcessor {
  // Index each column of the test data as numeric (no CategoryName here)
  def mainDataProcessor(): DataFrame = {
    val asinIndexed = stringIndex("ASIN", "ASINIndex").fit(data).transform(data)
    val brandNameIndex = stringIndex("BrandName", "BrandNameIndex").fit(asinIndexed).transform(asinIndexed)
    val titleNameIndex = stringIndex("Title", "TitleNameIndex").fit(brandNameIndex).transform(brandNameIndex)
    val imageUrlIndex = stringIndex("ImageUrl", "ImageUrlIndex").fit(titleNameIndex).transform(titleNameIndex)
    val amzBasicTransformation = new SQLTransformer()
      .setStatement("""
        SELECT ASINIndex, BrandNameIndex, TitleNameIndex, ImageUrlIndex
        FROM __THIS__""")
    amzBasicTransformation.transform(imageUrlIndex)
  }
}

Let’s assume we have dataframes training_df and testing_df containing our train and test data, which we can pass as parameters into our trainDataProcessor and testDataProcessor respectively.

val trainProcessed = new trainDataProcessor(training_df)

val testProcessed = new testDataProcessor(testing_df)
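As a side note on design, the chain of fit/transform calls above can also be expressed as a single ML Pipeline, which fits all stages in order with one call. This is an alternative sketch, not the author's code; the column list is taken from the training processor above:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

// One StringIndexer stage per categorical column.
val cols = Seq("ASIN", "BrandName", "Title", "ImageUrl", "CategoryName")
val indexers = cols.map { c =>
  new StringIndexer()
    .setInputCol(c)
    .setOutputCol(c + "Index")
    .setHandleInvalid("keep")
}

// The Pipeline fits and applies each indexer in sequence,
// replacing the manual fit(...).transform(...) chaining.
val pipeline = new Pipeline().setStages(indexers.toArray)
// val indexedTrain = pipeline.fit(training_df).transform(training_df)
```

An advantage of the Pipeline form is that the fitted PipelineModel can be reused to transform the test dataframe with exactly the same learned label-to-index mappings.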

At every new step, it is always good to check if our model is doing just fine. So we make use of the following:

trainProcessed.mainDataProcessor().show(10) // show only the top 10 rows

trainProcessed.mainDataProcessor().printSchema()

[Figure: Train transformation]

In the figure above, we can see the effect of our transformations on the training dataframe, with the columns now indexed as Double.

And below you find the same for our test transformation.

[Figure: Test transformation]

Now, we see that our test dataframe is missing the category column. Our goal is to predict the category names on the basis of the features we have.

In the next lesson, we will continue with our transformation but this time, our main focus will be towards building our model pipeline.

Next: Creating the model pipeline

Taiwo Adetiloye
Analyzing the Amazon Product Data Set using SparkMLlib LogisticRegression Classification Model

Taiwo O. Adetiloye is very interested in large-scale data processing and analytics using AI and ML frameworks like Spark, Keras, and TensorFlow.