Machine Learning & Learning to Love it

by Matt Folz

Yammer Analytics
We Are Yammer


The Yammer Analytics team has long been skeptical about using machine learning in our day-to-day work. That’s not to say we don’t recognize the value of machine learning algorithms: they detect fraud on our American Express cards, recommend TV series and movies to us on Netflix, and serve content to our News Feeds on Facebook. The results are far superior to anything that simple heuristics could produce. But given the other high-ROI activities that an analyst could be working on, like analyzing user behavior or running experiments, we’ve rarely found use cases that would justify investing in machine learning.

Until now.

Picking the right metric matters. A lot.

One of the first major insights of the Yammer Analytics team was that engagement was the best leading indicator of all of the future outcomes we cared about — namely, whether our users would continue using Yammer in the future (retention), and whether networks would convert from free to paid (monetization).

This probably isn’t surprising, as current usage should be an indicator of future usage. But it was surprising just how large the correlation was — the correlation coefficient between current engagement and retention is almost 4 times higher than the correlation coefficient between message usage and retention, for example. This simple analysis gave our team the confidence to orient our entire analytics and product design methodology around maximizing user engagement rather than usage of individual features.
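To make that comparison concrete, here is a minimal sketch of how two candidate predictors could be compared by their correlation with retention. The data are invented for illustration, and the names `days_engaged`, `messages_posted`, and `retained` are hypothetical; this is not the team's actual analysis or data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy, invented data: one entry per user.
days_engaged    = [0, 1, 1, 2, 3, 4, 5, 5, 6, 7]
messages_posted = [0, 4, 0, 1, 0, 9, 2, 0, 3, 1]
retained        = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # 1 = still active later

r_engagement = pearson(days_engaged, retained)
r_messages = pearson(messages_posted, retained)
print(f"engagement vs retention: {r_engagement:.2f}")
print(f"messages vs retention:   {r_messages:.2f}")
```

Because `retained` is a 0/1 flag, each value is a point-biserial correlation, which is just Pearson's coefficient applied to a binary variable.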

Days engaged was a good core metric for a few reasons:

  • Because it had the right level of granularity. A single user can engage on at most 7 days in a week, so no single user could skew days engaged. Days engaged is also robust to differences in logging across devices: our logs could reliably determine whether an individual used Yammer on a given day, but not whether they were actively using Yammer at a given time.
  • Because it measured something we cared about. Choosing a metric invites stakeholders to game that metric, but long-term engagement is almost impossible to artificially manipulate. When we shipped a feature that boosted engagement, we could be confident that we were improving the product and benefitting our users.
  • Because it was easy to understand, easy to communicate, and relevant to all teams at Yammer.
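The granularity argument is easier to see in code. The sketch below counts distinct active days per user from a hypothetical event log; the log layout (`user_id`, timestamp, device) and the sample rows are assumptions made for illustration, not Yammer's actual schema.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event log: (user_id, ISO timestamp, device).
events = [
    ("alice", "2014-03-03T09:15:00", "web"),
    ("alice", "2014-03-03T18:40:00", "ios"),      # same day, second device
    ("alice", "2014-03-05T11:02:00", "web"),
    ("bob",   "2014-03-04T08:00:00", "android"),
    ("bob",   "2014-03-04T08:01:00", "android"),
    ("bob",   "2014-03-04T08:02:00", "android"),  # many events, still one day
]

def days_engaged(events):
    """Count distinct calendar days with at least one event, per user."""
    active_days = defaultdict(set)
    for user_id, timestamp, _device in events:
        active_days[user_id].add(datetime.fromisoformat(timestamp).date())
    return {user: len(days) for user, days in active_days.items()}

weekly_days_engaged = days_engaged(events)
print(weekly_days_engaged)
```

Collapsing events into a set of dates is what caps the metric at 7 per week: a burst of events from one user, or the same session logged from two devices, adds nothing beyond the first event of the day.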

But can machine learning do even better?

Machine learning is very good at some tasks, such as being able to work with a large number of features at once, or identifying complex combinations of signals which have high predictive power. So for a recent Yammer Hackday, fellow analyst Peter Loscutoff and I decided to see whether we could use machine learning to gain deeper insights into the problem of predicting user retention.

From the start, we wanted to have a model that would do a good job of predicting user retention. And not only that: we also needed something that would:

  • Offer actual human-digestible insights into what features best predicted retention
  • Be relatively simple to tune, because neither Peter nor I am a machine learning expert

Random forests were a natural choice for this task. In general, they are robust to overfitting (that is, they usually don’t read patterns into data that aren’t actually there), they don’t require feature scaling, and they tell us which features have the highest predictive value in the model. So with that in mind, we trained a random forest using just about every metric we could think of, including engagement (at different levels of granularity), feature usage (messages, likes, mentions, etc.), network data (whether the user’s network was paid, network size and activity, etc.), and other user statistics (time since activation, whether the user was an admin, etc.).
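For readers who want to try something similar, here is a minimal sketch of this kind of workflow. It assumes scikit-learn and NumPy, which the post does not name, and it runs on synthetic data with a handful of illustrative feature names; it is not the model or the feature set we actually trained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_users = 2000

# Synthetic stand-in features; the names are illustrative only.
feature_names = ["days_engaged", "messages_posted", "network_size", "is_network_admin"]
X = np.column_stack([
    rng.integers(0, 8, n_users),     # days engaged in a week (0 to 7)
    rng.poisson(3, n_users),         # messages posted
    rng.integers(5, 5000, n_users),  # size of the user's network
    rng.integers(0, 2, n_users),     # admin flag
])
# Synthetic label: the chance of being retained rises with days engaged.
p_retained = 0.2 + 0.1 * X[:, 0]
y = (rng.random(n_users) < p_retained).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# No feature scaling is needed: trees split on thresholds, not distances.
forest = RandomForestClassifier(n_estimators=200, min_samples_leaf=20, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
importances = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
print(f"held-out accuracy: {accuracy:.3f}")
for name, importance in importances:
    print(f"{name:>18}: {importance:.3f}")
```

The `feature_importances_` attribute reports impurity-based importances, which sum to 1 and tend to favor features with many distinct values, so the ranking is best read as a starting point for investigation rather than a verdict.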

The results were as follows:

Overall, the machine learning model was 71.4% accurate, compared with 69.4% for our naive model (in which a user with at most 3 days engaged is predicted to churn and a user with more than 3 days engaged is predicted to be retained).
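The naive model is simple enough to write down in a few lines. The sketch below applies the 3-day threshold described above to a handful of invented rows; the function names and the data are hypothetical and only show how such a baseline can be scored.

```python
def predict_retained(days_engaged, threshold=3):
    """Naive rule: more than `threshold` days engaged means retained."""
    return days_engaged > threshold

def accuracy(rows, threshold=3):
    """Fraction of (days_engaged, was_retained) rows the rule gets right."""
    correct = sum(
        predict_retained(days, threshold) == was_retained
        for days, was_retained in rows
    )
    return correct / len(rows)

# Invented rows: (days engaged, whether the user was actually retained).
rows = [
    (0, False), (1, False), (2, True), (3, False), (4, True),
    (5, True), (6, False), (7, True), (7, True), (2, False),
]
baseline_accuracy = accuracy(rows)
print(f"naive baseline accuracy: {baseline_accuracy:.1%}")
```

Scoring a rule like this on the same held-out users as the more complex model is what makes the two accuracy figures comparable.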

Data needs to be actionable.

Here are the top features identified by the random forest.

There’s an interesting mix of signals here. Surprisingly, days engaged wasn’t the best feature; a user’s time since activation was. However, features such as time_since_activation or is_network_admin are almost completely inactionable: we can’t change our users’ activation dates, nor would making everyone a network admin be sensible. On the other hand, there are some interesting signals: for example, social_post_score is essentially a measure of network effects, and the random forest weights it heavily.

Incidentally, a simple SQL query showed why neither the simple heuristic nor the random forest could do much better than 70% accuracy. Even at very high levels of activity, about 10% of Yammer users churn. Active users can leave their jobs, and their networks can churn.
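A query of roughly this shape would surface that pattern. The sketch below runs it against an in-memory SQLite database through Python's `sqlite3` module; the `users` table, its columns, and the rows are all invented for illustration and do not reflect Yammer's schema or numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id INTEGER, days_engaged INTEGER, churned INTEGER)"
)
# Invented rows: (user_id, days engaged last week, 1 if the user later churned).
rows = [
    (1, 0, 1), (2, 0, 1), (3, 0, 0),
    (4, 3, 1), (5, 3, 0),
    (6, 7, 0), (7, 7, 0), (8, 7, 0), (9, 7, 1), (10, 7, 0),
]
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)

# Churn rate at each activity level; AVG over a 0/1 column is a proportion.
query = """
    SELECT days_engaged,
           COUNT(*)     AS n_users,
           AVG(churned) AS churn_rate
    FROM users
    GROUP BY days_engaged
    ORDER BY days_engaged
"""
churn_by_activity = conn.execute(query).fetchall()
for days, n_users, churn_rate in churn_by_activity:
    print(f"{days} days engaged: {n_users} users, churn rate {churn_rate:.0%}")
conn.close()
```

If the most active group still shows a non-trivial churn rate, no model built only on activity can predict those users correctly, which puts a ceiling on achievable accuracy.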

Humans make decisions, not machines.

So where does this leave us? It came as no surprise to us that days engaged could be improved upon as a predictor, but it was surprising that our naive model fared so well. At Yammer Analytics, we joke that analytics is just counting and domain knowledge, but that philosophy underlies a lot of the analyses we do. A simple, robust model that allows us to iterate quickly is typically more valuable than a complex, slightly more accurate one, and a metric that can be understood by analysts, PMs, and engineers alike is more valuable than a black box.

The real output of analytics is decisions, not numbers.

However, the right use cases could motivate us to invest more in complex analytical techniques such as machine learning and social network analysis. A little while later, I repeated much of the analysis here in the context of network level retention, making some interesting discoveries in the process. We’ve also done some work towards trying to understand network effects in Yammer.

But these are stories for another time.

Matt Folz is a Data Scientist at Yammer. His favorite tree is the bonsai.
