Data loading and categorical feature binary mapping

Jayasagar · 2 min read · Aug 14, 2018


In this post, I will walk through an example of loading a dataset and mapping its categorical features to binary vectors.

This dataset is inspired by the TMDB movie dataset on Kaggle; it is a simplified version of that data.

Sample implementation can be found at https://github.com/Jayasagar/sparkml-regression-models-movie-revenue-predictions

Loading the dataset and inspecting it:

I removed certain variables that are not required, as they do not add value for this task: id, imdb_id, original_title, cast, homepage, director, keywords, overview, production_companies, release_date, budget_adj, and revenue_adj.

path = "/regression-models/tmdb-movies-final-features-no-header.csv"
raw_data = sc.textFile(path)
num_data = raw_data.count()
records = raw_data.map(lambda x: x.split(","))
first = records.first()
print('First record: ', first)
print('Total number of records: ', num_data)

Output

First record: ['Action|Adventure|Science Fiction|Thriller', '2015', '32.985763', '150000000', '124', '5562', '6.5', '1513528810']

Total number of records: 10866

In the above code, sc is the SparkContext that is available by default when you run PySpark in a Jupyter notebook.
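
If you are running a standalone script rather than a notebook, the context has to be created manually. A minimal sketch, assuming a local setup (the master URL and app name here are illustrative choices, not part of the original post):

from pyspark import SparkConf, SparkContext

# illustrative local configuration; adjust the master and app name to your environment
conf = SparkConf().setMaster('local[*]').setAppName('movie-revenue-predictions')
sc = SparkContext(conf=conf)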

As the dataset has many input features and I am not sure all of them are required, I consider only some of the interesting ones, to keep this implementation simple.

The variables considered are:

popularity

budget

runtime

genres

vote_count

vote_average

release_year

revenue

Of these, genres and release_year are categorical; the others are real-valued numerical variables.

Extract categorical feature into a binary vector form

We have two categorical features. Using the two helper functions extract_label and extract_features (shown below), we extract the last column (revenue) as a float label and build mappings to convert the categorical features into binary-encoded features.

1. genres is at index 0

2. release_year is at index 1

def get_mappings(rdd, idx):
    # map each distinct value in column idx to a unique integer index
    return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()
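
For example, applied to the release_year column (index 1), get_mappings returns a plain Python dict from each distinct value to an index. A quick illustrative check (the exact index assignments depend on the data and partitioning):

year_mapping = get_mappings(records, 1)
print('Distinct release years:', len(year_mapping))
# e.g. {'2015': 0, '1999': 1, ...} -- actual indices will vary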

Apply the mapping function to each categorical column (0, 1)

mappings = [get_mappings(records, i) for i in range(0, 2)]
cat_len = sum(map(len, mappings))
num_len = len(records.first()[1:7])
total_len = num_len + cat_len
print("Feature vector length for categorical features: %d" % cat_len)
print("Feature vector length for numerical features: %d" % num_len)
print("Total feature vector length: %d" % total_len)

We now have the mappings for each variable, and we can see how many values in total we need for our binary vector representation.

Output:

Feature vector length for categorical features: 2096
Feature vector length for numerical features: 6
Total feature vector length: 2102

The next step is to use our extracted mappings to convert the categorical features to binary-encoded features.

import numpy as np

def extract_features(record):
    cat_vec = np.zeros(cat_len)
    i = 0
    step = 0
    for field in record[0:1]:  # categorical features; note only genres (index 0) is binary-encoded here
        m = mappings[i]
        idx = m[field]
        cat_vec[idx + step] = 1
        i = i + 1
        step = step + len(m)
    # numerical features: release_year, popularity, budget, runtime, vote_count, vote_average
    num_vec = np.array([float(field) for field in record[1:7]])
    return np.concatenate((cat_vec, num_vec))
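
As a quick sanity check (illustrative, not part of the original post), we can encode the first record and confirm the vector has the expected total length:

first_vec = extract_features(records.first())
print('Feature vector length:', len(first_vec))  # should equal total_len, i.e. 2102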

Extract the data so that we are ready for training and prediction on the Decision Tree model.

from pyspark.mllib.regression import LabeledPoint

def extract_label(record):
    return float(record[-1])

def extract_features_dt(record):
    # alternative extractor for tree models; expects purely numeric columns
    return np.array([float(field) for field in record[0:6]])

data_dt = records.map(lambda r: LabeledPoint(extract_label(r), extract_features(r)))
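
It is worth peeking at the first LabeledPoint to confirm the transformation (a quick check, not part of the original pipeline):

first_point = data_dt.first()
print('Label:', first_point.label)
print('Number of features:', len(first_point.features))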

Split the data into training and test sets (30% held out for testing)

(trainingData_dt, testData_dt) = data_dt.randomSplit([0.7, 0.3])
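
As a preview of what this split feeds into (model details are the topic of the next post), here is a minimal sketch of training a decision tree regressor with MLlib; the hyperparameters shown are illustrative defaults, not tuned values:

from pyspark.mllib.tree import DecisionTree

# illustrative hyperparameters; tuning is covered in the next post
dt_model = DecisionTree.trainRegressor(trainingData_dt, categoricalFeaturesInfo={},
                                       impurity='variance', maxDepth=5, maxBins=32)
predictions = dt_model.predict(testData_dt.map(lambda p: p.features))
labels_and_preds = testData_dt.map(lambda p: p.label).zip(predictions)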

In this part, we learned how to map a dataset's categorical columns to binary vectors for regression models.

In the next post, we will look at some of the Spark regression models and their performance tuning.
