Marvin — A Machine Learning based approach to Performance Artist Recommendation — Part II

Data Gathering, Cleaning and Exploratory Data Analysis.

Yash Sharma
StarClinch Blog
Jul 8, 2018


All the code and the notebook are available in the marvin GitHub repository.


Data

Collection

Collecting data is the first step in any data science project, whether it is image classification, translation or, in our case, a recommendation system. As mentioned earlier, requests made by clients via Post Your Requirement or Get the Quote are stored in the StarClinch Pipedrive automatically.

To fetch the data from Pipedrive through its official API endpoints, I have written a Python class, Deals, in DealDetails.py.

The Deals class takes an API token (API_Token), a limit and the number of pages to fetch from Pipedrive as input. Pipedrive responds to every request in JSON. The fetch() function in the Deals class handles this: it sends a request to the API endpoint, parses the JSON response into a human-readable format and stores it in a list, which is then converted into a pandas DataFrame. The save() function writes the DataFrame to disk in CSV format.
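The paging loop behind a function like fetch() can be sketched as below. This is not the code from DealDetails.py: the names `fetch_deals`, `to_csv` and `get_page` are hypothetical, the response keys (`data`, `additional_data`, `pagination`, `more_items_in_collection`, `next_start`) are an assumption about the shape of a paginated Pipedrive response, and the standard csv module stands in for pandas so the sketch has no dependencies.

```python
import csv
import io
from typing import Callable, Dict, List

# Hypothetical transport: takes (start, limit) and returns one decoded JSON page.
PageGetter = Callable[[int, int], dict]


def fetch_deals(get_page: PageGetter, limit: int, pages: int) -> List[Dict]:
    """Collect at most `pages` pages of deals, `limit` deals per page."""
    rows: List[Dict] = []
    start = 0
    for _ in range(pages):
        payload = get_page(start, limit)
        # Assumed response shape: the deals sit under "data" (may be null).
        rows.extend(payload.get("data") or [])
        extra = payload.get("additional_data") or {}
        pagination = extra.get("pagination") or {}
        if not pagination.get("more_items_in_collection"):
            break  # the server reports no further pages
        start = pagination.get("next_start", start + limit)
    return rows


def to_csv(rows: List[Dict]) -> str:
    """Serialise the rows as CSV text; the header is the sorted union of keys."""
    fields = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

In practice `get_page` would wrap an authenticated HTTP GET against the deals endpoint. Passing it in as a parameter keeps the paging logic testable without network access.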

Cleaning

The next big step in a Data Science project is cleaning the data that we have gathered.


1. Parse URLs

Deals Dataset

The Artists Pitched and Artists Requested URL columns contain the Profile-URLs of the artists. The URL format is https://starclinch.com/<Artists-URL>, and we only need the <Artists-URL> part of each Profile-URL. A cell in either column can hold multiple Profile-URLs, because it is possible to pitch or request multiple artists at once.

The parseURL() function takes the raw URL string and cleans it by removing the noise that might be present in it: it replaces every `,` with `\n` (the newline character), replaces every `\r` with an empty string and splits the result into individual URLs.

The getArtists() function takes the list of Profile-URLs for each row, splits each URL and selects the <Artists-URL> part. set(artists) ensures that no duplicates are present in the dataset.

The allArtists() function takes the set and joins the URLs into one complete string.

Apply these three functions to both the Artists_Pitched and Artists_Requested_URL columns.
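The three steps can be sketched roughly as follows. The snake_case names, the use of urllib.parse, the newline used as the join separator and the artist slugs in the example are assumptions made for illustration. The notebook's own parseURL(), getArtists() and allArtists() may differ in detail.

```python
from typing import List, Set
from urllib.parse import urlparse


def parse_url(cell: str) -> List[str]:
    """Split one raw cell into individual profile URLs."""
    cleaned = cell.replace(",", "\n").replace("\r", "")
    return [part.strip() for part in cleaned.split("\n") if part.strip()]


def get_artists(urls: List[str]) -> Set[str]:
    """Keep only the <Artists-URL> part of each URL, without duplicates."""
    artists: Set[str] = set()
    for url in urls:
        # Last path segment, ignoring a trailing slash; empty for a bare domain.
        slug = urlparse(url).path.strip("/").split("/")[-1]
        if slug:
            artists.add(slug)
    return artists


def all_artists(artists: Set[str]) -> str:
    """Join the slugs into one string (sorted so the result is deterministic)."""
    return "\n".join(sorted(artists))


# Hypothetical cell with a comma, a Windows line ending and a duplicate.
raw = (
    "https://starclinch.com/artist-a,\r\n"
    "https://starclinch.com/artist-b/\n"
    "https://starclinch.com/artist-a"
)
print(all_artists(get_artists(parse_url(raw))))
```

With a DataFrame, the same chain could be applied cell by cell to each of the two columns.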

Cleaned URLs

2. Multiple Categories

Categories

Each deal has a category associated with it. There are 14 categories in total (e.g. Singer, Dancer, DJ), but some deals have multiple categories associated with them. We need each deal to belong to exactly one category.

Fill NaN with -1, convert the categories to strings and split each one on the `,` delimiter. I chose to keep the first category after splitting as the final category for that deal. Finally, convert the categories back to integers.
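As a minimal sketch of this rule, the function below handles one cell at a time. The name `first_category` is hypothetical, and treating an empty string like a missing value is an added assumption.

```python
import math
from typing import Union


def first_category(value: Union[str, int, float, None]) -> int:
    """Return the first category id of a deal, or -1 when it is missing."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return -1
    first = str(value).split(",")[0].strip()
    # int(float(...)) also accepts ids that were stored as floats, e.g. "12.0".
    return int(float(first)) if first else -1


print(first_category("12,3"))        # 12
print(first_category(float("nan")))  # -1
```

On a pandas column, the function could be passed to `Series.apply` to get the same result as the fill, split and convert steps described above.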

Unique Categories

For more details and other preprocessing steps applied to the deals dataset, refer to the main notebook.

EDA

Exploratory Data Analysis is the next step in the Data Science Pipeline, where the aim is to answer as many questions as one can using the data they collected and pre-processed in the earlier stage of the pipeline.

question_1.py
Solution

We have 14,894 deals in total, out of which 92 are Unknown, 247 have a budget of 0 and 1,096 do not have a location. Out of the 14,894 deals, only 1,369 (9.1916%) have artists pitched.
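The percentage follows directly from the two counts. The helper name `share` is only illustrative.

```python
def share(part: int, total: int) -> float:
    """Percentage of `total` represented by `part`."""
    return 100.0 * part / total


total_deals = 14894
with_artists_pitched = 1369
print(round(share(with_artists_pitched, total_deals), 4))  # 9.1916
```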

The next big question is what percentage of artists has been pitched by the company in the past two years.

Artists Pitched

Out of 11,679 registered artists, only 22.7% have been pitched by the company so far. One reason for such a small percentage could be a bias in the company's choices towards certain artists, for example depending on whether an artist is a premium subscriber. Budget is another factor that could affect the number of artists pitched to a client.
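A figure like this can be obtained by collecting every distinct artist that appears in the pitched column and dividing by the number of registered artists. The sketch below uses made-up slugs and assumes each cleaned cell holds newline-separated slugs. The function names are hypothetical.

```python
from typing import Iterable, Set


def pitched_artists(cells: Iterable[str]) -> Set[str]:
    """Union of artist slugs over all deals; empty cells contribute nothing."""
    pitched: Set[str] = set()
    for cell in cells:
        pitched.update(slug for slug in cell.split("\n") if slug)
    return pitched


def pitched_share(cells: Iterable[str], registered: Set[str]) -> float:
    """Percentage of registered artists that appear in at least one deal."""
    return 100.0 * len(pitched_artists(cells) & registered) / len(registered)


# Toy data, not taken from the StarClinch dataset.
registered = {"artist-a", "artist-b", "artist-c", "artist-d"}
cells = ["artist-a\nartist-b", "", "artist-a"]
print(pitched_share(cells, registered))  # 50.0
```

Intersecting with the registered set keeps slugs that no longer match a registered profile from inflating the share.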

Next is the distribution of deals by category and by event type.

1. Categories
Deals Distribution

`Singer` (Category 12) is at the top, followed by `Comedian` (Category 3) and `Live Band` (Category 7). There are also nearly 2,000 deals with a missing (-1) category.

2. Events

Deals Distribution

`Campus Events` (Event Type 1) are the most in demand, followed by `Weddings` (Event Type 14) and `Corporate Events` (Event Type 4).
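Both distributions are frequency counts over a single column of ids. A minimal sketch with collections.Counter, using a made-up list of category ids rather than the real data:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def distribution(values: Iterable[int]) -> List[Tuple[int, int]]:
    """Count deals per id and return (id, count) pairs, most frequent first."""
    return Counter(values).most_common()


# Toy category ids; -1 marks a deal whose category is missing.
categories = [12, 3, 12, -1, 7, 12, 3]
print(distribution(categories))
```

On a pandas column, `Series.value_counts()` gives the same counts and can be plotted directly as a bar chart.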

Thank you

For more details refer to this notebook.
