Timothy Dalbey
Aug 31, 2018 · 2 min read

Hey Tim (agreement),

Thanks for taking the time to check the ideas out — I’m glad you’re into the technique, broad strokes and all.

I think that what you’re asking is really the first hurdle and probably the most significant hurdle in statistical efforts of this nature. Building a really robust, groomed dataset can be a struggle or it can totally be a breeze.

I’ve seen companies acquire customer data in a variety of ways, and homogenize the data in more ways than I care to disclose. With regard to your particular use case, consider the following ideas:

  1. Using marketing channel campaigns and appended identifiers (affiliate codes, what have you), you can make broad assumptions about your users. In particular, when a channel reliably delivers users with known properties, you’ll be up to your neck in campaign configurations, but your data will be much more digestible. It will also condition you to collapse too much variability into a single property, but maybe that’s OK for early iterations?
  2. There are services that provide demographic data for most email addresses. This is the easy/expensive route. Note that it would be really easy to use data like this in a cluster analysis (something that normally happens infrequently and in an “offline” context), but it doesn’t really solve the challenges around realtime implementation of classifier functions.
  3. Abandon all hope. This work is the worst. I should have been a doctor.
  4. Surveys, but please also see the notes in (2) above with regard to the persistent classifier function issues.
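
To make idea (1) a bit more concrete, here is a minimal sketch of turning an appended affiliate code into assumed user properties. Everything in it is hypothetical: the `aff` query parameter, the campaign codes, and the `segment` and `age_band` properties are placeholders I made up for illustration, not anything from a real campaign setup.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical lookup table: affiliate code -> properties we are willing
# to assume about users who arrive through that campaign.
CAMPAIGN_PROFILES = {
    "aff_students": {"segment": "student", "age_band": "18-24"},
    "aff_retirees": {"segment": "retiree", "age_band": "65+"},
}

# Fallback for traffic with a missing or unrecognized code.
UNKNOWN_PROFILE = {"segment": "unknown", "age_band": "unknown"}


def infer_profile(landing_url: str) -> dict:
    """Return the assumed properties for the campaign that delivered a user.

    Assumes the affiliate code is appended as an `aff` query parameter.
    """
    query = parse_qs(urlparse(landing_url).query)
    codes = query.get("aff", [])
    code = codes[0] if codes else None
    # Copy the profile so callers cannot mutate the shared lookup table.
    return dict(CAMPAIGN_PROFILES.get(code, UNKNOWN_PROFILE))


profile = infer_profile("https://example.com/signup?aff=aff_students")
```

Note how the whole campaign collapses into one label per user, which is exactly the loss of variability mentioned in (1). The upside is that the lookup is cheap enough to run at request time, unlike an offline cluster analysis.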

I hope those ideas help. I’ve seen them all in live play — with pros and cons of each. Other stuff works too. Let me know how your analysis goes and what you were able to do with the results.

Champagne will be served in the Starlight Room at 8PM. Until then!

TMD