Give the user something valuable: a suggestion on how to publish better content.
Machine learning for publishers: Part 2
This is the second blog post in my series on machine learning for publishers, and just like the previous one it will not really venture into actual machine learning territory. Eventually we will get there, but for now I will continue down a more conceptual path. The reason for this approach is that I believe the real challenge when applying machine learning is to understand what actually constitutes a machine learning problem. Also, staying away from code talk is probably a good idea when it comes to accessibility.
In a previous blog post, the last part of a series of articles on how to build addictive news products, I wrote about relevancy. That one is somewhat of a long read, to say the least, so I'll try to give you the gist of it. In short, there are two aspects that come into play when trying to understand what kind of value a news product gives the end user, and both have to do with relevancy.
First we have something that we can call expected relevancy. This is what happens before the user consumes the content. When a user decides to engage with the product (for example, by opening a push notification, following a link in a social media feed or just clicking an article on the site's start page), that interaction comes with an expectation of how rewarding or “good” the content will be. A very meta way of defining this could be: if this is the first time you are reading this text, you have some expectations of this blog post before you start to read it.
Secondly, we have experienced relevancy. This is how well the content lives up to the user’s expectations. For example, let’s say that you missed a concert last night with an artist that you like, and today you want to read a review of that concert. Before you open the article you have a pretty good idea of what you would like to take away from the content. Maybe you would like to know if it was a good performance and how the setlist was composed. Once again, this is the expected relevancy. So if the article doesn’t provide a setlist, you will come away from it only semi-pleased. From your perspective, the content could have been more relevant, so in this case the experienced relevancy was not as high as you hoped it would be.

To provide valuable information
With that sorted out we can move on to the primary topic of this blog post. As a news publisher, is there a way to figure out what parts of the content the user finds relevant (and, by extension, valuable)? Well, yes, I would argue that we can apply a simple thought experiment that will help us understand the true value of our content. The solution will look very similar to how an artificial neural network works, for somewhat obvious reasons, and we will get back to that later.
Let’s say that you own a website and that you publish articles on that website. To make things simpler, we can assume that the only way for anyone to read the content you produce is by visiting the website (as in: no newsletters, no RSS feeds and so on). If we use the term information as a way to describe the articles on the website, then we can say that the original information (once again, in this case an article) is only available in one place. This means that we can define your website as an information source.
Losing data
Now that we have defined the information source, we can focus on what happens to that information as it is being consumed. You publish a new article, and after a few minutes a user enters the website and finds it. Measuring user engagement is always a tricky subject, but for the sake of keeping things simple we now imagine that we have a definite way of knowing that the user has actually read the entire article. So, to recap so far: we have an information source (a website) with information (the new article) and a consumer (the user who reads the article).

What does this mean when we look at the article as information? Well, we might say that this information is now stored in two different places. First we have the original article, still available on your website. But the information is no longer stored only at the source; there is also a version floating around in the user’s memory. The information has been copied. Of course, we don’t expect the user to be able to recite the entire article to someone, word for word. The version of the article that the user remembers is only partial. The term for when some information is lost during transmission from the source to a new copy is data loss.
In other words, the data loss in this case is the stuff that didn’t stick in the user’s head.

By now there are two big questions that need answering: what information is left behind when the user reads an article, and why? I will provide a simple answer to both: the less relevant the information is to the user, the higher the probability that the user won’t remember it after reading. Somewhat obvious, right? This means that there is a theoretical way to understand how relevant your article really is. But just going by what parts of an article a single user remembers doesn’t give us much to go on; it is far too subjective to provide any distinct value. However, now that we have established a way of defining relevance, we can try to come up with a way to formalize it.
A simple approach
How about this? As you write your article, you also write a list of bullet points with the key information from your article. For example, if the article is about a hockey game you might write bullet points for the final score, the players who scored goals, the names of the coaches, how the puck possession was divided between the two teams and so on. Then, after the user has read the article, you ask that person to write down (from memory) a list of key takeaways. When you compare your list with the one the user provided, you can simply count the overlap. You end up with a percentage (the number of information points remembered by the user divided by the total number of points in your list), and with that you now have a quantified way of measuring relevancy. The only thing left to do now is to scale up!
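I know I said I would stay away from code talk, but a tiny sketch might make the arithmetic concrete. This is purely an illustration on my part: the function, the key points and the reader's takeaways below are all made up, and in reality you would need some human judgment (or fuzzy matching) to decide whether a free-text takeaway matches one of your bullet points.

```python
# A minimal sketch of the single-reader comparison described above.
# The data is invented, and the exact string matching is a simplification.

def recall_percentage(key_points, reader_takeaways):
    """Share of the editor's key points that the reader wrote down from memory."""
    remembered = [point for point in key_points if point in reader_takeaways]
    return len(remembered) / len(key_points)

key_points = [
    "final score 3-2",
    "two goals by the home team captain",
    "coach blamed the penalty kill",
    "home team had 60% puck possession",
]

reader_takeaways = [
    "final score 3-2",
    "two goals by the home team captain",
]

print(f"{recall_percentage(key_points, reader_takeaways):.0%}")  # 50%
```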

Instead of only one reader, we now imagine that your article is read by a group of users. Every user does the same thing: first the article is read all the way to the end, and then a list of key takeaways is written down. Now you have a bunch of lists to go through, and every time you read a row in a list provided by a user, you add one point to the corresponding row in your original list. After a while you notice that some parts of the article are remembered much better than others, and some information is not present at all in the user lists. If you printed out the article and highlighted different parts with different colors depending on how they scored, you would end up with a heat map showing which parts of the article the users value most.
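As a sketch of the scaled-up version, the same idea could look something like this (again, the data and the pre-matched takeaway lists are my own invented example): each key point gets a score between 0 and 1, the share of readers who remembered it, and those scores are exactly what you would paint onto the printed article to get the heat map.

```python
# A minimal sketch of the aggregation step: many readers, one tally per key point.
# The takeaway lists are assumed to already be matched against the key points.
from collections import Counter

def relevance_scores(key_points, all_reader_takeaways):
    """For each key point, the share of readers who remembered it (0.0-1.0)."""
    counts = Counter()
    for takeaways in all_reader_takeaways:
        for point in key_points:
            if point in takeaways:
                counts[point] += 1
    return {point: counts[point] / len(all_reader_takeaways) for point in key_points}

key_points = ["final score", "goal scorers", "coach names", "puck possession"]
all_reader_takeaways = [
    ["final score", "goal scorers"],
    ["final score"],
    ["final score", "goal scorers", "puck possession"],
]

for point, score in relevance_scores(key_points, all_reader_takeaways).items():
    print(f"{point:16s} {score:.0%}")
# final score      100%
# goal scorers      67%
# coach names        0%
# puck possession   33%
```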

What the users value
One way to look at this is to say that the stuff few users remember has a low relevancy and thereby a low value. If you did this with several articles you would probably be able to find clear patterns, meaning that you would learn what content the users find value in, and what type of information you produce that is just not relevant enough.
The difference between what is presented at the information source and what ends up in the copy is waste, since it has no value.

There are also other kinds of conclusions to be drawn. For example, maybe it turns out that a name tends to stick in the user’s head if it’s presented within the first sentences of an article, but much less so if the name only shows up in the later part of the text. The thing is, there is no easy way to manually take all aspects of the data into account, meaning that some crucial things might go unnoticed for subjective reasons. This, of course, is a problem. And guess what: it qualifies as a true machine learning problem.
Basically, if you took it upon yourself to draw conclusions from this data, you would be putting your brain to work (using your biological neural network, the neural circuit) doing something that an artificial neural network would be able to do much (much) faster, and more precisely.
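To give a rough idea of what that could look like, here is a purely hypothetical sketch where each key information point is turned into a row of simple features (where in the text it appears, whether it is a name or a number) and a small neural network is trained to predict how likely the point is to be remembered. The features, the numbers and the choice of scikit-learn's MLPRegressor are my assumptions for illustration, not a recipe.

```python
# A hypothetical sketch: framing "what sticks in the reader's head" as a
# supervised learning problem. All features and labels below are invented.
from sklearn.neural_network import MLPRegressor

# Features per key point: [position in text (0 = start, 1 = end), is a name, is a number]
X = [
    [0.0, 1, 0],  # a name mentioned in the first sentence
    [0.1, 0, 1],  # the final score, early in the text
    [0.6, 1, 0],  # a name only mentioned halfway through
    [0.9, 0, 1],  # possession statistics near the end
]
# Labels: share of readers who recalled each point (from the tally above)
y = [0.8, 0.9, 0.3, 0.1]

model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
model.fit(X, y)

# How well would a name placed late in the article be remembered?
print(model.predict([[0.8, 1, 0]]))
```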
In a way, we can describe this as a quest to create a closed system. Information flows from the source to the user, and by understanding what data is lost on the way (and why), we can learn how to produce content that is more relevant to the user. Of course, the main idea here is to create better content by minimizing data loss over time. Actually, I guess we could even choose to see this as a matter of entropy, but that is something for a later blog post.
That’s it for now. If you feel like giving some feedback, it would be much appreciated. There is a wide array of fun subjects to cover in this blog series, but let me know if I should continue on this (mostly) conceptual path, or if you would prefer a more technical approach.

