QCon 2017 — Data, Visualisation and Machine Learning
How do we know if our customers are happy?
90% of the world’s data has been created in the past 2 years, but most of that isn’t being used. Being a data-centric company is about using data to drive your business and make your customers happy.
Data & Visualisation
Cathy Polinsky (CTO of StitchFix and formerly of Salesforce and Amazon) gave a keynote presentation called “Data as DNA: Building a company on Data”. These are my notes from that presentation (if you’d like to see this video, get in touch with me via LinkedIn)…
Salesforce needed to rewrite a system because updates to it were single-threaded, which meant defining a new architecture. So which architecture to use? They started by looking at the data they already had.
- If the data is not visible, it’s meaningless. Data must be seen and visualised.
- Make data open
- Make it interpretable
- Define your questions and metrics
- What problems are you solving?
- How will you know you’ve succeeded?
- Make it visual
- Opening up the data does not mean you should violate privacy laws — mask sensitive data
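As a rough illustration of the last point, here is a minimal sketch of masking sensitive fields before a record is shared more widely. Nothing here comes from the talk: the field names, the salt and the `mask_record` helper are all hypothetical, and a real system would need a properly managed secret and a review against the relevant privacy rules.

```python
import hashlib

# Assumption: which fields count as sensitive is decided by you, not by this code.
SENSITIVE_FIELDS = {"name", "email"}


def mask_value(value: str, salt: str = "demo-salt") -> str:
    """Replace a value with a short, stable token derived from a salted hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "masked_" + digest[:10]


def mask_record(record: dict) -> dict:
    """Return a copy of the record with the sensitive fields masked."""
    return {
        key: mask_value(str(val)) if key in SENSITIVE_FIELDS else val
        for key, val in record.items()
    }


# Hypothetical customer record, for illustration only.
customer = {"name": "Ada Example", "email": "ada@example.com", "size": "M", "spend": 120}
print(mask_record(customer))
```

Because the same input always produces the same token, masked rows can still be grouped and counted without showing the raw value.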
Data Science — Using data strategically
- Effective Data Science is creating solutions and algorithms that are testable and iterable.
- Testing via experiments (e.g. A/B testing)
- When changing navigation though, A/B tests showed that the original Amazon tab design was best! Why? Because it takes time for people to understand new navigation systems. So Amazon had to make a call and change the nav with a view to it being better in the long term.
- Is there a way to learn without an A/B test?
- Don’t skip the hypothesis
- Pick a metric that is related to your test
- Don’t peek at the results early
- Test big things (time is limited!)
- True North vs Magnetic North (the closer you are to your goal, the more important it is to re-evaluate your goal metrics)
- Goldilocks of Data (what is meaningful to the organisation, not focussing on how much data is enough)
- Not all big data is interesting data
- Small data (Stanford technique)
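To make the A/B testing bullets a little more concrete, here is a minimal sketch of one common way to judge a result: a two-proportion z-test on conversion rates. This is my own illustration rather than anything shown in the keynote, and the visitor and conversion counts are made up. It assumes the hypothesis, the metric and the sample size were fixed before the experiment started, which is what "don't peek" is protecting.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: 10,000 visitors per variant.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value only says the difference is unlikely to be noise. As the Amazon navigation story shows, it does not say whether the metric captured the long-term effect you care about.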
Data Enables Personalisation
This is an old idea! Shops and store-keepers have been doing it since the beginning of time.
StitchFix is a personalisation company.
- You complete a survey which asks about yourself: your size, your style, …. A stylist/curator assesses you (with machine learning making recommendations) and sends you clothes. This is all driven by personalisation.
- This is about “data that matters”: price, cut, colour, length, how good an item is for most people, age, where they work, size, past purchases, …
Lesson 1: Feedback loops unlock personalisation
- Style profile
- When returning/accepting clothes, they provide feedback on the clothes (too big, too small, wrong colour)
- Inventory feedback — which items are selling, and where
Lesson 2: Data incentives matter
- Personalisation depends on getting good data. But most companies ask for data without giving consumers a reason to provide that feedback. Physical clothing stores never get feedback from customers when trying on clothes.
- So create “compelling self interest”.
- First order benefit: your experience will get better if you give feedback
- Second order benefit: your feedback helps our company be better (not as compelling)
- Make data collection fun!
- No customer will share data with you if they don’t trust you
Lesson 3: Humans + Machines = Better
- Machines are good at some things. Humans are good at some things
- “The Second Machine Age”: self-driving cars, Alexa, Deep Blue
- Helps leverage unstructured data
- Provides empathy & creativity (“no more skinny jeans!”)
- Frees the algorithm developer from dealing with edge cases
Being a data-centric company is about using data to drive your business and make your customers happy.
- Give access to the data to all employees and make it easy for them to interpret
- Invest in data science to find the problems and guide decision making
- Create highly personalised experiences by blending humans and machines
Machine Learning (ML)
The goal of ML is not to make perfect guesses, because ML deals in domains where there is no such thing. The goal is to make guesses that are good enough to be useful.
I attended a couple of talks and a workshop which looked at Machine Learning. Developers who are used to writing code after soliciting requirements should prepare to be disappointed.
ML is not about writing code. It is about teaching a computer to learn an algorithm that is too complex to program. It involves the following steps:
1. Collecting relevant data
2. Analysing the data
3. Creating features from data
4. Selecting the best model for the problem
5. Selecting the best training algorithm
6. Training the model (← you might need some code here)
7. Evaluating the accuracy of the model
8. Deploying the model
2. Analyzing data
- Normalizing (maybe reducing values to between 0 and 1)
- Statistical tests (how the data itself is distributed)
- Visualization (to help with understanding what the data is)
Goal: Determine possible ways to mathematically represent the data.
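As a small illustration of the normalising bullet, the sketch below rescales a list of values to the range 0 to 1 (min-max normalisation). The function name and the sample prices are my own, not from the workshop, and other schemes (such as standardising to zero mean and unit variance) are equally valid choices.

```python
def min_max_normalise(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up house prices, for illustration only.
prices = [150_000, 200_000, 350_000, 500_000]
print(min_max_normalise(prices))
```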
3. Creating features
- a compact representation of original data
- cleaned and normalized
- redundant data removed
- correlated data removed (when two pieces of data say essentially the same thing. e.g. Nationality and Residency)
Feature generation is both an art and a science.
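One possible way to spot correlated features is to compute the Pearson correlation between each pair of columns and drop one of any pair that is almost perfectly correlated. The sketch below is an assumption-laden illustration: the 0.95 threshold, the helper names and the sample columns are all made up, it only works for numeric columns that are not constant, and categorical pairs like Nationality and Residency would need a different measure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features: dict, threshold: float = 0.95) -> dict:
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = {}
    for name, column in features.items():
        if all(abs(pearson(column, other)) < threshold for other in kept.values()):
            kept[name] = column
    return kept


# Made-up data: height in cm and in inches say essentially the same thing.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [38, 44, 39, 43, 41],
}
print(list(drop_correlated(features)))
```

Which of two correlated features to keep is a judgment call; this sketch simply keeps whichever it sees first.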
4. Selecting a model
- Mathematical representation of data (hypothesis)
- Independent of tool sets… portable between tools
- Not all data can be represented by all models
5. Selecting an algorithm
Goal: How to learn the model parameters from the data
Linear (Regression) Model — a simple model, but doesn’t fit all data
- Model represented as
f(x) = ax + c (a line, plane)
- Could appear when analysing housing prices, college scores
- There are lots of algorithms to find the linear regression model
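One of those algorithms is ordinary least squares, which for a single input has a closed-form solution. The sketch below fits the a and c of f(x) = ax + c to a handful of points. The floor areas and prices are invented so that the answer is easy to check by hand; they are not data from the talk.

```python
def fit_line(xs, ys):
    """Fit f(x) = a*x + c by ordinary least squares and return (a, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # The fitted line always passes through the point (mean_x, mean_y).
    c = mean_y - a * mean_x
    return a, c


# Made-up housing data: floor area in square metres against price.
areas = [50, 70, 90, 110]
prices = [150_000, 190_000, 230_000, 270_000]

a, c = fit_line(areas, prices)
print(f"f(x) = {a:.1f}x + {c:.1f}")
print("Predicted price for 100 square metres:", a * 100 + c)
```

Real data will not sit exactly on a line, which is where evaluating the model against data it has not seen comes in.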
6. Training Set
- the training set is a subset of the collected data
- it must be statistically representative of your data (it’s not about the amount of data, but how representative it is)
- it is used by the algorithm to learn the model
- the test set is independent of the training set. E.g. hold back 10% of the whole data which NEVER gets used for training. Maybe 90% is used for training, 10% is for testing.
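A minimal sketch of the 90/10 hold-back described above, using only the standard library. The function name, the fixed seed and the use of a plain random shuffle are my assumptions; a random shuffle is only representative on average, so skewed data may need a stratified split instead.

```python
import random


def train_test_split(rows, test_fraction=0.1, seed=42):
    """Shuffle the rows and hold back a fraction that is never used for training."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    test = shuffled[:n_test]
    train = shuffled[n_test:]
    return train, test


# Stand-in for 100 collected records.
records = list(range(100))
train, test = train_test_split(records)
print(len(train), "training rows,", len(test), "test rows")
```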
The main types of ML:
- “Pick One of a Set”
  - Spam detection
  - Manufacturing defect detection
  - Handwriting analysis
  - Decision Trees/Forests (example: Titanic survivor)
  - Naïve Bayes
- “Score or Rank”
  - Likelihood of Purchase
  - How: Fitting to some kind of curve
  - Linear Regression
  - Logistic Regression
- “Group Similar”
  - Find similar items
  - Customer segmentation
  - Cohort detection
  - Hierarchical Clustering
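To give a feel for the "Pick One of a Set" family, here is a toy Naïve Bayes spam detector written from scratch. It is a sketch under heavy assumptions: the four training messages are invented, words are split on whitespace only, and add-one (Laplace) smoothing is my choice for handling unseen words. A real project would more likely use an existing library and far more data.

```python
import math
from collections import Counter


def train_naive_bayes(messages):
    """messages is a list of (text, label) pairs; returns the learned counts."""
    word_counts = {}          # label -> Counter of words seen with that label
    label_counts = Counter()  # label -> number of messages with that label
    for text, label in messages:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the text."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_messages in label_counts.items():
        counts = word_counts[label]
        total_words = sum(counts.values())
        # Start from the prior: how common this label is overall.
        score = math.log(n_messages / total_messages)
        for word in text.lower().split():
            # Add-one smoothing gives unseen words a small non-zero probability.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data, for illustration only.
training = [
    ("win money now", "spam"),
    ("free prize win", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]
word_counts, label_counts = train_naive_bayes(training)
print(classify("win a free prize", word_counts, label_counts))
print(classify("meeting tomorrow at noon", word_counts, label_counts))
```

The "naïve" part is the assumption that each word contributes independently, which is rarely true but often good enough to be useful.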
There are a bunch of skills needed to become a data scientist:
- Data Analysis Skills
- Data Visualisation skills
- Programming skills (Python, R, Scala, Java)
- Statistics Knowledge (applied stats)
- Distributed system skills
- Apache Spark
- Andrew Ng’s http://cs229.stanford.edu/ course
- Book: Pattern Recognition and Machine Learning (Bishop)
- Book: Pattern Classification (Duda, Hart, Stork)