Data and machine learning 2017 review: The silent star of digital experiences

Machine learning is like Rodney Dangerfield or the Dark Knight: It gets no respect, but it really is the hero we deserve. It lurks in the dark, fighting bad metadata, while making us smile with quality search results. Maybe this comparison is a little dramatic, but you get the idea: Machine learning is useful, but most people don’t even know it’s making their digital experiences better.

Google search results, product recommendations, and facial recognition technology are all powered by machine learning, and have transformed how we interact with technology. The data team at Digital Services wanted to give a little love to the neglected champion of delightful digital. Here’s a recap of the lessons and work we did around machine learning in 2017:

  • Document type classifier: One of the first things we had to do to create a new and improved site was tackle the document problem. We had more than 150,000 documents with no metadata. To make a big impact quickly, we designed an API to classify documents by type (form, map, law, regulation, and so on), which improved the relevance of documents appearing in search results. If we add a new class or taxonomy item, our machine learning system can reclassify all documents in a couple of hours, and we’re currently working to integrate visual features to help classify documents that aren’t text-heavy.
  • Bad title classifier: When we started work on the document type classifier, we realized it was just the tip of the iceberg. Without quality metadata, titles like “cmr710.pdf” were pretty common. To address this problem, we built a second API to catch bad titles, so we could systematically improve them to clarify what a user will find in each document. Our simple machine learning model was accurate 97 percent of the time — a significant improvement over the regex process built for the same purpose, which was correct only 76 percent of the time.
  • User journey clustering: Prioritizing what content to start migrating and restructuring was maybe the most important task Digital Services faced at the start of the redesign. We clustered web sessions on the legacy site into groups. Once patterns started to emerge among the clusters, we could take a look at what people were really searching for. We discovered the top 20 services that drive the majority of our traffic, helping us prioritize that content out of the gate, and gained enough information to build an IA that we verified through user testing.
A scatter plot chart showing outcomes of the user journey clustering model.
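To make the classifier work concrete, here’s a minimal sketch of the kind of text-classification pipeline described above. The training snippets, labels, and model choice (TF-IDF features plus logistic regression in sklearn) are our illustration, not the production system:

```python
# Hypothetical sketch: classify documents by type from extracted text.
# The example documents and labels below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "application form name address signature date",
    "section 12 of chapter 90 of the general laws is amended",
    "scale 1:24000 contour interval legend north arrow",
    "the department shall promulgate rules pursuant to",
]
labels = ["form", "law", "map", "regulation"]

# TF-IDF features plus a linear model: fast to train, and cheap to
# retrain when a new taxonomy item is added.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(docs, labels)

print(model.predict(["please print your name and sign this application form"]))
```

Because retraining the whole pipeline is cheap, reclassifying every document after a taxonomy change stays a batch job measured in hours, not weeks.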
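The clustering step can be sketched the same way. This toy example uses k-means over invented session features (visit counts per content area); the real feature set and algorithm details are not spelled out in the post:

```python
# Hypothetical sketch: cluster user sessions by which content areas they visited.
# The feature columns and session data are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans

# Rows are sessions; columns are visit counts for broad content areas,
# e.g. [licensing, taxes, health, transportation].
sessions = np.array([
    [5, 0, 0, 1],
    [4, 1, 0, 0],
    [0, 6, 1, 0],
    [0, 5, 0, 1],
    [1, 0, 4, 4],
    [0, 0, 5, 3],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(sessions)
print(kmeans.labels_)  # sessions sharing a label share a journey pattern
```

Inspecting which pages dominate each cluster is what surfaces the high-traffic services worth migrating first.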

What we learned

We’re happy with what we did last year, but we learned a couple of important lessons along the way. The following rules are going to help us launch even more streamlined machine learning efforts in the future:

  • Start simple and go fast: Faster models mean sacrificing accuracy, but it’s usually a small difference. This approach gives you more options and flexibility, because your response latency is low. Quick and simple helps you learn more in less time, giving you the information and data you need to build on models down the line.
  • Leverage cloud storage and hashing: Machine learning models and the representations we feed them take time to construct and train. If hardware or software fails, we’ve wasted a lot of time. To guard against that, we fixed random seeds for stochastic algorithms and hashed the ordered parameter dicts of our sklearn models, then used that hash to check remote artifact registries for matching artifacts. If the hash already exists in the registry, you can fetch the stored artifacts instead of rebuilding them yourself.
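The caching idea above can be sketched in a few lines. The registry path and parameter names here are illustrative; a real setup would point at cloud storage rather than a local directory:

```python
# Hypothetical sketch: derive a cache key from an ordered parameter dict,
# then skip retraining when a matching artifact already exists.
import hashlib
import json
from pathlib import Path

params = {
    "model": "logistic_regression",
    "ngram_range": [1, 2],
    "random_state": 42,  # fixed seed keeps stochastic steps reproducible
}

# sort_keys makes the hash independent of dict insertion order
key = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()
artifact = Path("registry") / f"{key}.joblib"

if artifact.exists():
    print(f"fetch cached artifact {artifact}")
else:
    print(f"train model, then store artifact as {artifact}")
```

The same idea extends to intermediate representations (tokenized corpora, feature matrices), which are often more expensive to rebuild than the models themselves.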
  • Limit your toolbox: It can be tempting to use a variety of tools to increase flexibility and stay on top of trends, but it presents a big risk. If you’re using too many frameworks or languages, it’s going to be difficult to maintain everything. We recommend developing a strong list of criteria to justify adding a new tool.

We’re not taking our foot off the machine learning gas in 2018. In fact, we’ve already started building a model to classify and organize user feedback across all pages. We also want to explore text summarization for long documents and reports, automated image captioning for accessibility purposes, and promoted search answers for common questions.

Digital Services team members (lured by the promise of pizza) participate in a feedback tagging lunch to inform the feedback classifier model.

With a little creativity and effort, there are endless ways to apply machine learning, and we plan to explore as many as we can. Let us know what you think about our machine learning efforts, and maybe, just maybe, we can start to give machine learning the love and respect it deserves.