Building a Recommendation System from Unreliable Information

An exercise in heuristics

Foreword

This summer, I’ve been working as a software engineer at Cinchapi, a technology startup based in Atlanta. The company’s product is a platform for gleaning insights from data, through real-time analytics, natural language querying, and machine learning.

The underlying back-end technology making the whole thing possible is ConcourseDB, an open-sourced, self-tuning, NoSQL database that offers a more intuitive approach to managing data.

In my first few weeks of the internship, I spent time getting familiar with the massive codebase, implementing new features and fixing bugs along the way.

However, the meat of the work I have done has been in developing a recommendation system for visualizations, which we commonly refer to as the visualization engine. To understand the motivation behind this project, consider the following scenario.

Motivation

Imagine you’re the owner of a mid-sized fashion retailer in a busy metropolitan area. Since opening up shop, you have been using a database to keep track of a variety of factors and statistics, including store prices, exchange rates, competitor prices, weather, customer preferences, and more.

One day, you notice your revenues are dropping. But why? You remember you recently changed the layout of the store, that could be it! Hmm, but it’s also approaching Winter and that affects customers’ tastes, maybe it’s that. Well, of course you may need to consider the fact that your competitor across the street just released a new line of designer handbags, and that’s a plausible cause.

It’s a catch-22. With such large amounts of data being accumulated through sensors in mobiles, computers, cameras, buildings, cars, and more, it’s increasingly difficult to understand relationships and correlations. At the same time, collecting insufficient amounts of data may lead you to miss out on important problems that you’d miss otherwise, like the fact that customers spend 40% less time in a particular section of your store.

This is where the power of visualization comes into play. Often overlooked, visualizations allow us to digest data as information. It’s a simple transformation that converts raw, unintelligible matter into actionable, intuitive insight.

Maybe not the best example

Well, “simple” to the extent that individuals understand which visualizations will serve them best. See, there are an abundance of plots and graphs and charts and figures out there, each of which is suited for a particular kind of dataset. Got some categorical data indexed by frequency? Throw that on a bar chart! What about bivariate numerical data abiding by a non-functional relationship? Best use a scatter plot!

But being able to characterize data as such is difficult for a human, especially if that data contains thousands, if not millions, of entries. Enter the visualization engine.


Understanding the Problem

The objective of the project was fairly clear: help users understand which visualizations will benefit them, based on their data. However, the nature of the solution was open-ended, and related to the problem. Recommendation systems are a highly researched and published topic, and have seen widespread implementation, most notably in consumer-facing products from companies such as Google, Netflix, Amazon, Spotify, and Apple. But each of these companies employ their systems slightly differently.

They implement their systems to solve the same general problem of recommending something (whatever it may be) to the user. If this sounds ambiguous, it’s because it is. The specifics of a recommendation system often rely on the problem being solved, and differ from one use case to the next. We can broadly classify systems into a dichotomy: dynamic recommendation vs. static recommendation.

Dynamic

This is the category of recommendation system employed by Google search, Netflix, Amazon, and Spotify. These systems collect data to profile the user, through past transactions and behavior, and recommend items viewed/bought by others with similar profiles. Alternatively, the system characterizes items and recommends items with similar profiles to those viewed/bought by the user in the past.

For example, after a bout of researching Apache Spark on Google, I began to type the letters ‘ho’ and search autocompletion provided the rest:

Google search: recommendations based on user history

Of course, Hortonworks is a company focusing on the development of other Apache platforms, such as Hadoop. Google understands the topic I’m interested in through my history, and decides to open some related doors.

After that, I decided to look up a recipe for Eggs Benedict. Then, I typed the same letters and autocompletion offered to finish my sentence:

Contextual recommendations

Immediately, I’m offered with suggestions relating to my most recent searches.

This system is dynamic in the sense that the user’s profile is ever-changing as they continue to use the product, and the recommendation evolves to suit the newest and latest information.

Static

On the other hand, the systems employed by Google images and Apple’s predictive text can be described as largely static recommenders. These systems do process user behavior and history, but do not use these (to a large extent) to influence their recommendations.

For example, observe the following stream of messages and the Predictive Text output:

Trying to get Siri’s attention

Unlike the example from Google search earlier, it seems as if Predictive Text does not completely base recommendations on user history. I say “completely”, because Predictive Text actually suggested ‘Siri’ after I had typed ‘Hi Siri’ twice, but reverted to a generic array of predictions after I sent the the third one (maybe Siri got annoyed).

It is extremely important to note here that Predictive Text is in no way worse than Google’s search suggestions. They are both trying to solve completely different problems.

The Google Search Problem

The problem that Search is addressing is a way to improve search experience for users by opening them to new, yet related, doors. After looking up a recipe to Eggs Benedict, I was serenaded with recipes for home fries, poached eggs, hollandaise sauce, and more. This kind of system, building on the user’s cues and profile, makes perfect sense.

The Predictive Text Problem

The problem that Predictive Text is addressing is quick, yet coherent, sentence construction. Many individuals use shorthand, improper grammar, and unknown words when texting. To train a system to propagate this kind of language would lead to a broken system. No, I don’t want to type “soz”, I’d actually prefer “sorry” in this case. This kind of system needs a separate set of rules, if you will, by which it operates. The user is unreliable and the system should not mimic them. For this case, this design makes perfect sense.


The Customer is Not Always Right

Classification helps when breaking down problems. If I can find a problem to which mine is similar, I can abstract similar concepts and apply them to my case. I kept this in mind when ideating the solution that my visualization engine would embody.

Thinking about the two classifications above, I identified that my solution would take the form of a static recommender. The underlying assumption motivating the project was that users have trouble picking appropriate visualizations for their data, so it would not make sense to build a model that recommended choices based on their behavior. If a user picked a bar chart for absolutely everything, would it make sense for the model to also recommend a bar chart for everything? In this case, the customer is not always right.

The Item-User Feature Matrix

It’s a mouthful, but it’s an important concept. Let’s back up a bit.

As I mentioned earlier, a common way to produce recommendations is to compare the tastes of one user to other users. Let’s say User A is most similar to User Z. The system will then determine the items that User Z liked the most that User A has not yet seen, and recommend those. The issue with this approach for our use case is that there is no community of users between which profiles can be compared.

Alternatively, recommendations can be made on the basis of comparisons between items themselves. Let’s say User A loves Item P. Each item is given its own profile, through which it is quantifiably characterized across several ‘features’. Then, suppose Item Q is very similar to Item P. User A would probably like Item Q, right? So, the system recommends it.

Example Item Feature Matrix

The figure above is called an item feature matrix, since each item offered is characterized along several different features. This is closer to what we want, but it’s not perfect. We can’t base our recommendations on what the user likes, since the user may not be right. We must incorporate another dimension.

Example User Feature Matrix

The above matrix is called a user feature matrix, as it depicts the preferences of each user along the same features as the items.

Combining the two concepts, we have two matrices, one for characterizing the user and one for characterizing the items. Collectively, these are considered the item-user feature matrix.

In our case, we don’t characterize the user’s preferences, but rather their data within ConcourseDB. Further, we don’t characterize by the number of characters, action scenes, length, and rating, but a series of data characteristics relating to data types, variable types, uniqueness, and more. This provides a framework to quantifiably determine the similarity between the user’s data and possible visualizations. This is one piece of our system, the DataCharacterizer, which serves to characterize the user’s data across some set of characteristics. But how do we characterize the items, in our case the visualizations, themselves? We employ a heuristic.

Heuristics

So, considering the case of Predictive Text, there is some core ‘rulebook’ from which recommendations originate. For a language predictor in general, this may be in the form of an expression graph or Markov model, where vertices are words, a connection represents a logical next word in a sentence, and each edge is weighted by a certain probability or likelihood.

Expression graph

This could explain why repeatedly tapping Predictive Text suggestions produce something like this as a result of a cycle in the graph:

Nonsen-cycle

In the case of a visualization engine, however, where is the literature? There is no ‘rulebook’ of any sorts that a model can be trained upon, at least not of a size or magnitude that would produce meaningful results.

This is where the heuristic process comes into action. Loosely defined, a heuristic is an approximation. More formally, it is an algorithm designed to find an approximate solution when an exact solution cannot be found.

This formed the basis of my recommendation system, and resolved the problem of having incomplete or unreliable data from which to learn. I developed a table, where the rows represented the same features as in the matrices above, and the columns represented different visualizations. Each visualization was characterized based on the kind of data that it would best represent.

This is another piece of the system, a HeuristicTable, which holds pre-defined, static characterization across the same set of characteristics as the user’s data for each visualization.

Pulling the Pieces Together

The bulk of the system is comprised of the two aforementioned components. I’m glossing over the details of the DataCharacterizer, but it measures a series of characteristics of the user’s data, namely the percentage of Strings, Numbers, and Booleans, whether or not there are linkages between entries, whether or not the data is bijective, the uniqueness of values, and the number of unique values (dichtomous, nominal, or continuous).

Treating a particular characterization as a vector, a cosine similarity function is executed on the user’s data and each column of the HeuristicTable, which measures the similarity between two vectors on a scale from zero to one.

From this point, it’s a matter of sorting the results in descending order of similarity and the recommendation set is ready.

Below is an overview of the system’s design:


Closing Thoughts

Recommendation systems come in all shapes and sizes. Although the problems seem similar from a 10,000 feet view, each use case requires a unique solution to propose the best experience for users.

This was how I built a recommendation system from unreliable data, and I hope it inspires some new ideas.