Optimizing E-Commerce for Hip Hop Merch Using Data Science

Sunishchal Dev
Jul 20, 2017 · 9 min read

How to Learn and Create Business Value Simultaneously

My final project during my data science immersive bootcamp at General Assembly required me to choose any big data set I wanted and carry out an entire data science project: querying, exploring, structuring, visualizing, analyzing, and implementing a machine learning algorithm. Being a business consultant at heart, I got to thinking about how I could position my project to create some tangible ROI (return on investment). With the wealth of data that’s available today, I knew I could be selective, so I decided my project had to be something I’m personally interested in and something that could create value. I know that I love technology, I love strategy, I love business, and I love music.

I got in touch with one of my serial entrepreneur friends from UW and he told me about a creative startup he’s founding called Integral Studio. His company lives at the intersection of music, tech, and commerce. They design state of the art digital experiences for today’s most influential hip hop artists and record labels. This entails building websites, mobile apps, and e-commerce portals. As a result, they collect massive amounts of data about an artist’s fan base and consumer behavior. Integral’s strategic advantage lies in their ability to create high volume sales channels for artist merchandise by learning from this data. The only thing they had missing was a data scientist.

Exploratory Data Analysis

After learning more about Integral Studio’s business model, I decided this was the perfect challenge for me to tackle. The NDAs were signed and I was given access to hundreds of GB worth of raw data in Google Cloud Platform so my exploratory analysis could begin. I was lucky to be given the creative freedom to experiment with different analytic methods and frame my own business case. I got to treat my capstone project like a freelance consulting gig.

I’ve been looking at data for years as a management consultant and IT analyst, in a world where we get tidy BI dashboards and think Excel is the greatest piece of software available to man. This time was different…I never really grasped the concept of big data until I saw this data warehouse. Every single piece of data you could imagine a website would generate was collected: clicks, page views, orders, IP addresses, etc. At first, I felt pretty overwhelmed and didn’t know where to start; both a blessing and a curse. Wading in ambiguity was frustrating, but the ability to explore different analysis strategies eventually paid off.

Hours of structuring, processing, and cleansing were required before any meaningful analysis was possible. When the data is so vast, there’s a lot of noise standing in the way of insights. However, it afforded me incredible granularity in my analysis. Once I got the data in the format I wanted, I was able to be a detective and reverse interpret customer journeys in great detail, leading to a thorough understanding of what makes people buy merchandise.

I looked at things like geographic breakdowns, web traffic sources, screen sizes, and session lengths. I dug into details like which devices were more likely to convert and what the main drop-off points were along the sales funnel, and I even ended up doing manual attribution modeling to find where original referrals were coming from. This analysis led to an epiphany about “on the fence customers” and how I could implement a machine learning algorithm to automatically identify these users and influence a purchase.

WARNING: The article becomes highly technical from this point forward. If you’d like to skip the nerdy digital marketing and data science related talk, scroll down to “Seeing the Model in Action” to find out how my final implementation of this algorithm looks.

Customer Behavior Insights

There were a few key pieces of information I pulled from Google Analytics that led to this solution. Follow along in my presentation slide deck to see the data visualizations which led to these insights (starting on slide 5).

During my geographic analysis, I noticed a huge disparity between the percentage of users and percentage of revenue coming from each country. North America had a conversion rate twice as high as other regions. A little more digging revealed that customers in other geographies were being asked to pay astronomical shipping costs and wait several weeks to have their order fulfilled. This told me that a user’s location has predictive power over whether or not they will convert.
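To make the comparison concrete, here is a minimal sketch of how user share, revenue share, and conversion rate per country could be tabulated. It is not the code from the project: the record layout (`country`, `user_id`, `revenue`) and the function name `country_summary` are assumptions made for illustration.

```python
from collections import defaultdict


def country_summary(sessions):
    """Summarize sessions per country.

    `sessions` is an iterable of dicts with the (assumed) keys
    'country', 'user_id' and 'revenue'. Returns a dict mapping each
    country to its share of users, share of revenue, and conversion rate.
    """
    users = defaultdict(set)     # distinct visitors per country
    buyers = defaultdict(set)    # distinct visitors with at least one order
    revenue = defaultdict(float)

    for s in sessions:
        users[s["country"]].add(s["user_id"])
        revenue[s["country"]] += s["revenue"]
        if s["revenue"] > 0:
            buyers[s["country"]].add(s["user_id"])

    total_users = sum(len(ids) for ids in users.values())
    total_revenue = sum(revenue.values())

    summary = {}
    for country, ids in users.items():
        summary[country] = {
            "user_share": len(ids) / total_users,
            "revenue_share": revenue[country] / total_revenue if total_revenue else 0.0,
            "conversion_rate": len(buyers[country]) / len(ids),
        }
    return summary
```

A country whose `revenue_share` is far below its `user_share` is the kind of disparity described above.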

When looking at the user agent (device) breakdown, I learned that most users are visiting the site from a mobile device and are more likely to convert while using one. However, desktop devices had an average order size twice as large. Another thing I noticed is that over 80% of the users are visiting the site from Apple devices, which have a higher average conversion rate than devices from other manufacturers. As a result, I decided to create binary features indicating whether a user is on a desktop computer and whether they are using an Apple device.
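The two binary features could be derived from the raw user agent string along these lines. This is a rough sketch under stated assumptions: the token lists and the function name `device_features` are illustrative, and plain substring matching is much cruder than a dedicated user agent parser.

```python
# Assumed token lists for illustration; real user agents are messier.
APPLE_TOKENS = ("iPhone", "iPad", "iPod", "Macintosh")
MOBILE_TOKENS = ("Mobi", "Android", "iPhone", "iPad", "iPod")


def device_features(user_agent):
    """Return 0/1 flags for 'is_desktop' and 'is_apple' from a user agent."""
    is_apple = any(token in user_agent for token in APPLE_TOKENS)
    is_mobile = any(token in user_agent for token in MOBILE_TOKENS)
    # Anything not recognized as mobile is treated as desktop here.
    return {"is_desktop": int(not is_mobile), "is_apple": int(is_apple)}
```

Note that the token `Macintosh` is used instead of `Apple`, because most browsers include `AppleWebKit` in their user agent regardless of the device manufacturer.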

When I got to looking at traffic sources, the first thing that jumped out was that direct traffic accounted for 25% of user sessions and 95% of revenue. What?! This created a massive black box, since Google Analytics was telling me that almost everyone who buys products is typing in the URL of the website, not being referred from any of our marketing or social media channels. Furthermore, it was showing that referrals from the artist website of Kendrick Lamar, the hottest rapper right now, contributed to zero sales. My years of experience in marketing tell me this is preposterous.

A bit of research online taught me that Google Analytics treats direct as a catch-all for cases where it can’t determine where the traffic came from. This includes but isn’t limited to: direct URL/bookmarks, going from https to http, mobile app links, and many others. Since this tool wasn’t getting the job done, I decided to take matters into my own hands and do some manual attribution modeling.

For those unfamiliar with digital marketing jargon, attribution modeling is where you determine which web traffic sources and touch points are given credit for each sale. To do this, I used the raw data to map out each customer’s history throughout all their sessions and extracted the original referrer that first led them to the store. I parsed through all the referrer URLs and grouped similar links together. Finally, I added up the total revenue for each source and came up with the following breakdown.

First off, I proved that a good chunk of sales are coming from kendricklamar.com. Take that, Google Analytics! I also noticed that Shopify is the largest source of revenue… which is interesting to see because Shopify is the host for the e-commerce portion of this website. After reading into my user journey maps, I realized that these are users coming from anonymous sources who only start getting tracked once they register for an account through Shopify, effectively making them the same as direct traffic for my purposes. So it turns out that direct traffic really is the largest source, but definitely not 95% of sales. I took note that social media, search engines, lifestyle websites (Hypebeast, HotNewHipHop, Complex, etc.), and email campaigns could be attributed to a significant portion of revenue as well.
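The first-touch attribution described above can be sketched roughly as follows. This is an illustration, not the project's code: the session fields (`user_id`, `timestamp`, `referrer`, `revenue`), the `SOURCE_GROUPS` mapping, and the function names are all assumptions.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical grouping of referrer hosts into broader sources.
SOURCE_GROUPS = {
    "facebook.com": "social",
    "twitter.com": "social",
    "instagram.com": "social",
    "google.com": "search",
    "bing.com": "search",
}


def source_of(referrer):
    """Map a referrer URL to a source label; empty referrers count as direct."""
    if not referrer:
        return "direct"
    host = urlparse(referrer).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if not host:  # referrer without a scheme or host
        return "direct"
    return SOURCE_GROUPS.get(host, host)


def first_touch_revenue(sessions):
    """Credit all of a user's revenue to the source of their earliest session."""
    first_source = {}
    revenue = defaultdict(float)
    for s in sorted(sessions, key=lambda s: s["timestamp"]):
        # setdefault keeps the source from the user's first session only.
        source = first_source.setdefault(s["user_id"], source_of(s["referrer"]))
        revenue[source] += s["revenue"]
    return dict(revenue)
```

With this rule, a purchase made in a later session with no referrer is still credited to whichever site originally sent the user to the store, which is how a sale that an analytics tool labels as direct can end up attributed to a referring site.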

During my attribution modeling, I also looked at behavior flows and conversion funnels to find that users who view certain pages within the site are more likely to convert. This gave me the idea to vectorize (count a running total of) the number of times each user views each webpage as a way to quantify the user’s journey through the site. Now that all the exploration and preprocessing is done, it’s time to turn my insights into a machine learning algorithm that can help drive sales.
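A running pageview count of this kind could look like the sketch below. The page paths, the `vocabulary` argument, and the function name are hypothetical; the idea is only that each pageview produces an updated count vector that a model can score mid-session.

```python
from collections import Counter


def running_pageview_vectors(pages_viewed, vocabulary):
    """Yield one count vector per pageview, in the order of `vocabulary`.

    `pages_viewed` is the ordered list of pages in a user's session and
    `vocabulary` is the fixed list of pages used as feature columns.
    """
    counts = Counter()
    for page in pages_viewed:
        counts[page] += 1
        # Pages outside the vocabulary are counted but not emitted.
        yield [counts[p] for p in vocabulary]
```

For example, with the columns `["/", "/collections/all", "/cart"]`, a session that visits the homepage, the collection, the cart, and the homepage again yields four vectors, the last of which is `[2, 1, 1]`.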

Identifying “On The Fence” Users

Follow along on model training and evaluation in my Jupyter Notebook.

My objective was to create a model that can predict the probability for each customer to convert throughout their browsing session. This would allow us to identify on the fence customers who could be offered a discount on their order or free shipping to help persuade a sale. To recap, the features I’m using to train the model include user location, device type, traffic source, and vectorized pageviews. This resulted in a 62,801 row by 639 column matrix…talk about big data!

After experimenting with several different algorithms, I found that the Random Forest Classifier from scikit-learn best suited my purposes. My classifier evaluations showed that this model performed 63% better than baseline and the misclassification rate was only 1.8%. When I started framing the business case, I knew a standard classifier algorithm wouldn’t be the final implementation for this project; I would need a more innovative solution to find these on the fence customers.
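For readers who want to reproduce this kind of evaluation, here is a small sketch of two common metrics. It rests on an assumption: "better than baseline" is read here as the relative reduction in error compared with always predicting the majority class, which may not be the exact definition used in the notebook. The function names are illustrative.

```python
from collections import Counter


def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def majority_baseline_error(y_true):
    """Error rate of a model that always predicts the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return 1 - most_common_count / len(y_true)


def relative_error_reduction(y_true, y_pred):
    """Share of the baseline's errors that the model eliminates (assumed metric)."""
    baseline = majority_baseline_error(y_true)
    return (baseline - misclassification_rate(y_true, y_pred)) / baseline
```

Because buyers are a small minority of visitors, the majority-class baseline is already quite accurate, so comparing against it is more informative than quoting accuracy alone.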

What I mean by this is we don’t really care to know which users are classified as 1s (definite buyers) and 0s (definite non-buyers). I want to know those who are in the middle, the users who my model is unsure about. Under the hood, this classifier algorithm is really just a probability predictor; anything above a 0.5 probability returns a 1 and under 0.5 returns a 0. To predict on the fence users, I reconfigured the algorithm so I could set custom thresholds. I also deconstructed the estimators (trees) from the Random Forest Classifier algorithm to calculate a confidence interval for each prediction.
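One way to get an interval from a forest is to look at how much the individual trees disagree. The sketch below takes the per-tree probabilities as a plain list; in scikit-learn these could be collected by calling `predict_proba` on each tree in a fitted forest's `estimators_` attribute. The normal-approximation interval (mean plus or minus `z` standard deviations) and the function name are assumptions, not necessarily the method used in the notebook.

```python
from statistics import mean, pstdev


def forest_prediction(tree_probs, z=1.96):
    """Return (probability, lower, upper) from per-tree buy probabilities.

    The point estimate is the mean across trees; the interval is the mean
    plus or minus z standard deviations of the tree outputs (an assumed,
    simple choice), clipped to the valid probability range [0, 1].
    """
    p = mean(tree_probs)
    spread = z * pstdev(tree_probs)
    return p, max(0.0, p - spread), min(1.0, p + spread)
```

When the trees agree, the interval collapses toward the point estimate; when they split, it widens, which signals that the model has not seen enough of the session to be sure.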

We don’t want to be handing out coupons to everyone who visits the site. We want to minimize the number of definite buyers who get promotions, since that’s lost profit margin. We also want to minimize definite non-buyers who get promotions because that would dilute the perceived value of the offer. After some statistical analysis, I determined my custom probability thresholds and set my confidence interval to discourage premature offers (for users we haven’t collected enough data on yet). The combination of these two criteria limited the population of users who are offered promotions to 2% of total visitors.
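Combining the two criteria amounts to a simple rule, sketched below. The threshold values (`low`, `high`, `max_width`) are placeholders chosen for illustration only; the post does not state the values that were actually used.

```python
def should_offer(prob, ci_low, ci_high, low=0.3, high=0.7, max_width=0.2):
    """Decide whether to show a promotion to the current visitor.

    An offer is made only if the predicted conversion probability falls
    inside the "on the fence" band [low, high] AND the confidence interval
    is narrow enough to trust the prediction. All defaults are hypothetical.
    """
    in_band = low <= prob <= high
    confident = (ci_high - ci_low) <= max_width
    return in_band and confident
```

Under this rule a likely buyer (say 0.9) and a likely non-buyer (say 0.05) never see the coupon, and neither does a mid-range visitor whose interval is still too wide.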

Seeing the Model in Action

So how will this actually work in real life? I put together a quick demo that simulates a user session through a hip hop merchandise website where my algorithm predicts an on the fence customer and offers a flash sale. Follow along below (starting from the homepage):

You’ll see that as the user navigates through the shop, the conversion probability increases, but the confidence interval is too wide to know with enough certainty. After navigating away from the shopping cart and leaving the store, the algorithm predicts a conversion probability within the custom threshold and a confidence interval narrow enough to identify an on the fence customer. Once they are identified, a coupon pops up offering 15% off if the order is completed within 5 minutes.

Creating ROI through ML

How many more sales can we expect from a promotional offer like this? It’s impossible to know for sure until A/B testing is done on real life data. But if we assume that half of the people who see this flash sale end up making a purchase, annual revenue increases by 8% for Integral Studio’s client. For a multi-million dollar sales channel, this is pretty healthy growth from just implementing a few Python scripts.
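The arithmetic behind an estimate like this can be sketched as a small function. It makes simplifying assumptions that are labeled in the code: offered users would not have bought otherwise, their orders match the average order value, and the only cost is the discount. The baseline conversion rate is an input that is not given in this post, and the function name is illustrative.

```python
def revenue_uplift(offer_share, take_rate, discount, baseline_conversion):
    """Estimate the relative revenue increase from a targeted promotion.

    offer_share: fraction of visitors shown the offer (2% in this post)
    take_rate: fraction of offered visitors who buy (assumed 50% above)
    discount: discount applied to those orders (15% in the demo)
    baseline_conversion: fraction of visitors who already buy (not stated)

    Assumes extra orders have the same average value as existing orders
    and that offered visitors would not have purchased without the offer.
    """
    extra_orders_per_visitor = offer_share * take_rate
    discounted_value = extra_orders_per_visitor * (1 - discount)
    return discounted_value / baseline_conversion
```

The result is expressed as a fraction of current revenue, so the average order value cancels out and only the four rates matter.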

Although I was using real world data, this is still a theoretical project, so you probably won’t see flash sales popping up in online merchandise stores any time soon…(don’t go clicking around hoping you’ll get a free coupon). I think if online stores start leveraging data science to better understand their users, it will lead to happier customers. Imagine the feeling you’d get if you were looking at a hoodie from your favorite singer but decide you can’t afford the shipping costs, then all of a sudden the website offers you a discount. Now you can show your support without having to break the bank!

This experience of digging deeply into e-commerce data gave me a glimpse of what’s possible with machine learning in the world of online shopping and web optimization. I’m excited to find out where my next data science adventures will lead me!

Link to Jupyter Notebook in GitHub.

Link to presentation slide deck.
