All the Data Science in the World won’t save you from crappy data
This started as a personal exercise to test my new Data Science skills (thank you scikit-learn, DS @ the Command Line & Numerical Python) by finding usable patterns in a real-world sales funnel dataset. I was especially interested in finding predictable order patterns using CRM data, historical usage and product lifecycle info.
First: a little background. I’m a B2B hardware product manager for a publicly traded company. We have a product catalog with tens of thousands of orderable SKUs and an identifiable customer list in the low hundreds. Product lives are measured in quarters or years; a “design win” describes a future order stream of a given SKU to a single customer entity (but often to multiple physical locations). Shipment volumes can range from tens to thousands of units per month. Most products are true commodities, with a few build-to-order products in the mix.
There were three potential customers for this exercise:
- An internal operations / quality team. They rely on having accurate product attributes (financials, quality status, manufacturing BOMs) for best-in-class order fulfillment. The value-add would be accurate modeling of product lifecycle data, so that our 3rd party manufacturing partners could have confidence in our build plans.
- Multiple sales channel partners. They include direct salespeople with Tier-1 accounts, manufacturer’s rep firms (each with their own CRM tools; more on that later), distributors, and 3rd party data aggregation firms. Their needs: price guidelines, competitors’ SKUs, go-to-market talking points, supporting literature, and availability info (lead times and ramp-up/ramp-down dates).
- An internal web development team. Their needs: 20–30 searchable product features for each SKU, corresponding documents (datasheets, quality reports, ECNs, EOLs), visibility flags (public vs. custom specs) and recommended replacement SKUs (if a product is obsolete). The value-add is a “heads up” on catalog changes, which lets the team schedule their work accordingly.
Internal datasets are sourced from an ERP tool; external datasets are largely the result of website scraping exercises. The resulting corpus is an ever-growing directory of datestamped CSV files.
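For the curious, here is a minimal sketch of how a directory like that can be stitched into one table. It is not the code I actually ran. The `<dataset>_<YYYY-MM-DD>.csv` naming convention, the `load_snapshots` function and the `snapshot_date` column are all assumptions made up for illustration, and it sticks to the Python standard library.

```python
import csv
import re
from datetime import datetime
from pathlib import Path

# Assumed (hypothetical) naming convention: <dataset>_<YYYY-MM-DD>.csv
STAMP = re.compile(r"^(?P<dataset>.+)_(?P<stamp>\d{4}-\d{2}-\d{2})\.csv$")


def load_snapshots(directory, dataset):
    """Return every row of one dataset, oldest snapshot first.

    Each row is a dict of strings (as produced by csv.DictReader) plus a
    'snapshot_date' key holding the date parsed from the file name.
    """
    found = []
    for path in Path(directory).iterdir():
        match = STAMP.match(path.name)
        if match and match["dataset"] == dataset:
            # Raises ValueError on an impossible date such as 2019-13-45.
            stamp = datetime.strptime(match["stamp"], "%Y-%m-%d").date()
            found.append((stamp, path))

    rows = []
    for stamp, path in sorted(found):  # chronological order
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["snapshot_date"] = stamp
                rows.append(row)
    return rows
```

Tagging each row with its snapshot date keeps the history intact, so the same SKU can be compared across snapshots instead of being overwritten by the latest file.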
I wish I could say this exercise was a roaring success. It would have been a great demonstration of my new Data Science skillset, not to mention that it could have been a turbocharger for my company’s operations. But it wasn’t meant to be. Here’s why.
Learning #1: data hygiene is one thing. Institutional knowledge is a different beast. At first glance my product shipment history and order backlog corpus looked awesome. Tens of thousands of records, annotated, with precise date / logistical / channel / end use info — winning!
Then I realized that many customer attributes, recorded by name rather than by ID number, are misspelled, obsolete (think acquisitions), or proxies for the real end customer (think contract manufacturers). If you don’t know ABC was purchased by DEF, you’re hosed.
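To make the problem concrete, here is a rough sketch of the kind of cleanup this calls for: a hand-maintained alias table for the institutional knowledge, plus fuzzy matching from Python’s `difflib` for the misspellings. The company names, the `ALIASES` table, the `canonical_customer` function and the 0.8 cutoff are all hypothetical placeholders, not my actual data or method.

```python
import difflib

# Hypothetical alias table. This is the "institutional knowledge" part:
# somebody has to know that ABC was purchased by DEF and write it down.
ALIASES = {
    "abc industries": "DEF Corp",  # acquired by DEF
    "def corp": "DEF Corp",
    "ghi systems": "GHI Systems",
}


def normalize(name):
    """Lowercase, drop commas and periods, and collapse whitespace."""
    cleaned = name.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def canonical_customer(raw_name, aliases=ALIASES, cutoff=0.8):
    """Map a free-text customer name to a canonical one, or None if unknown."""
    key = normalize(raw_name)
    if key in aliases:
        return aliases[key]
    # Fall back to the closest known spelling, if one is close enough.
    close = difflib.get_close_matches(key, list(aliases), n=1, cutoff=cutoff)
    return aliases[close[0]] if close else None
```

With this table, `canonical_customer("ABC Industires")` resolves to `"DEF Corp"` despite the typo, while a name nobody has catalogued returns `None` and needs a human. The fuzzy matching only handles the hygiene half of the problem. It cannot discover an acquisition, and a contract manufacturer that builds for several end customers has no single correct mapping at all.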
Learning #2: the most sophisticated algorithm in the world is going to choke on your sales funnel if your salespeople hate, or don’t use, CRM tools. If you use 3rd party sales channels, there’s an excellent chance those reps use somebody else’s tool and hate porting the information to you. In many cases the end customer is not identified because they are too small. This means using CRM touch points (customer visit, price quote, sample product delivery, …) to build a sales funnel regression is not going to happen.
Learning #3: access to data is organizational power. Your success is a function of your organization’s “need to know” culture. Learning how many orders are the result of deep discounts, or publishing an average number of development schedule slips, are great ways of getting your hand slapped.
Learning #4: whether an enterprise customer continues to buy your SKU is largely out of your control. If your end customer can’t forecast their business, or they need to stop production because of an unrelated device issue, it’s quite likely you’ll never hear about it. There’s no regression line there to help you.
Now if you’ll excuse me, I need a coffee and an aspirin. Thanks for reading.