Give the user something valuable: a suggestion on how to publish better content.

Machine learning for publishers: Part 2

Magnus Engström
Sep 7, 2018 · 8 min read

This is the second blog post in my series on machine learning for publishers, and just like the previous one it will not really venture into actual machine learning territory. Eventually we will get there, but for now I will continue down a more conceptual path. The reason for this approach is that I believe the real challenge when applying machine learning is to understand what actually constitutes a machine learning problem. Also, staying away from code talk is probably a good idea when it comes to accessibility.


In a previous blog post, the last part in a series of articles on how to build addicting news products, I wrote about relevancy. It is somewhat of a long read, to say the least, so I’ll try to give you the gist of it here. In short, there are two aspects that come into play when trying to understand what kind of value a news product gives the end user, and both have to do with relevancy.

First we have something we can call expected relevancy. This is what happens before the user consumes the content. When a user decides to engage with the product (for example, opening a push notification, following a link in a social media feed or just clicking an article on the site’s start page), that interaction comes with an expectation of how rewarding or “good” the content will be. A very meta way of defining this could be: if this is the first time you are reading this text, you have some expectations of this blog post before you start to read it.

Secondly, we have experienced relevancy. This is how closely the content lives up to the user’s expectations. For example, let’s say that you missed a concert last night by an artist you like, and today you want to read a review of that concert. Before you open the article you have a pretty good idea of what you would like to take away from it. Maybe you would like to know if it was a good performance and how the setlist was composed. Once again, this is the expected relevancy. If the article doesn’t provide a setlist, you will come away from it only semi-pleased. From your perspective, the content could have been more relevant, so in this case the experienced relevancy was not as high as you hoped it would be.

The process of rewarding the user by living up to the expectations.
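To make the distinction a bit more concrete, here is a minimal sketch in Python. The topic labels are invented purely for illustration, and a real system would need something far less naive than exact matching:

```python
# Toy model: both the reader's expectations and the article's actual
# contents are represented as sets of topics (all labels invented).
expected = {"performance quality", "setlist", "encore", "crowd"}
delivered = {"performance quality", "crowd", "venue history"}

# Experienced relevancy: the share of expectations the article met.
experienced_relevancy = len(expected & delivered) / len(expected)

print(experienced_relevancy)  # 0.5 -- the review left out the setlist
```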

To provide valuable information

Let’s say that you own a website and publish articles on it. To make things simpler, we can assume that the only way for anyone to read the content you produce is by visiting the website (as in: no newsletters, no RSS feeds etc.). If we use the term information to describe the articles on the website, then we can say that the original information (once again, in this case an article) is only available in one place. This means that we can define your website as an information source.

Losing data

The information source and the consumer.

Now imagine that a user visits your website and reads the article. What does this mean when we look at the article as information? Well, we might say that the information is now stored in two different places. First we have the original article, still available on your website (the original). But the information is no longer stored only at the source; there is also a version floating around in the user’s memory. The information has been copied. Of course, we don’t expect the user to be able to recite the entire article to someone, word for word. The version of the article that the user remembers is only partial. The term for when some information is lost during transmission from the source to a new copy is data loss.

In other words, the data loss in this case is the stuff that didn’t stick in the head of the user.

The consumer has a partial (subjective) copy of the information.
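Expressed as a minimal sketch (with invented information items), the partial copy and the data loss are just set operations:

```python
# Hypothetical information items that make up an article.
original = {"who", "what", "where", "when", "why", "background"}

# The partial, subjective copy in one reader's memory.
remembered = {"who", "what", "where"}

# Data loss: everything that did not survive the transmission
# from the source to the copy.
data_loss = original - remembered

print(data_loss)  # {'when', 'why', 'background'} (order may vary)
```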

By now there are two big questions that need answering: what information is left behind when the user reads an article, and why? I’ll offer a simple answer to both: the less relevant the information is to the user, the higher the probability that the user won’t remember it after reading. Somewhat obvious, right? This means that there is a theoretical way to understand how relevant your article really is. But just going by what parts of an article a single user remembers doesn’t give us much to go on; it is far too subjective to provide any distinct value. However, now that we have established a way of defining relevance, we can try to come up with a way to formalize it.

A simple approach

The information source has several consumers, each having a partial copy of the information.

Instead of only one reader, we now imagine that your article is read by a group of users. Every user does the same thing: first the article is read all the way to the end, and then a list of key takeaways is written down. Now you have a bunch of lists to go through, and every time you read a row in a list provided by a user, you add one point to the corresponding row in your original list. After a while you notice that some parts of the article are remembered much better than others, and some information is not present at all in the user lists. If you were to print out the article and highlight different parts with different colors depending on how they scored, you would be able to produce a heat map showing which parts of the article are most valued by the users.

Combining the partial copies gives a heat map, showing what is remembered by many users and what has been lost in transmission.
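Here is a minimal sketch of that scoring procedure in Python. Everything in it is invented for illustration, and to keep it short the takeaways are already matched to article parts by index; in reality, matching free-text notes to the article would be the hard part:

```python
from collections import Counter

# The article split into parts (invented example sentences).
article_parts = [
    "The band opened with their biggest hit.",
    "The new guitarist impressed the crowd.",
    "The venue recently renovated its sound system.",
]

# Each reader's key takeaways, given as indexes into article_parts.
reader_takeaways = [
    [0, 1],  # reader 1 remembered parts 0 and 1
    [0],     # reader 2 remembered part 0
    [0, 1],  # reader 3 remembered parts 0 and 1
]

# One point per mention: the "heat" of each article part.
scores = Counter(part for notes in reader_takeaways for part in notes)

for index, part in enumerate(article_parts):
    print(f"{scores[index]}/{len(reader_takeaways)} readers remembered: {part}")
```

Part three of the invented article scores zero here, which is exactly the kind of waste the next section is about.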

What the users value

The difference between what is presented at the information source and what ends up in the copy is waste, since it has no value.

The value of the content is represented by what the users take away from it, not by the original article.

There are also other kinds of conclusions to be drawn. For example, maybe it turns out that a name tends to stick in the user’s head if it’s presented within the first sentences of an article, but much less so if the name only shows up in the later part of the text. The thing is, there is no easy way to manually take all aspects of the data into account, meaning that some crucial patterns might go unnoticed for subjective reasons. This, of course, is a problem. And guess what: it qualifies as a true machine learning problem.

Basically, if you took it upon yourself to draw conclusions from this data, you would be putting your brain to work (using your biological neural network, the neural circuit) on something that an artificial neural network would be able to do much (much) faster, and more precisely.
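As a small taste of what that could look like, here is a minimal sketch assuming scikit-learn is available. The training data is entirely made up: each row is the relative position of a name in an article, and the label says whether readers recalled that name in their takeaways:

```python
from sklearn.linear_model import LogisticRegression

# Made-up training data: relative position of a name in the text
# (0.0 = very first sentence, 1.0 = very last), and whether readers
# recalled the name in their takeaways (1 = recalled, 0 = not).
X = [[0.05], [0.10], [0.15], [0.30], [0.60], [0.70], [0.85], [0.95]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

model = LogisticRegression()
model.fit(X, y)

# The model picks up the pattern: early mentions are more likely
# to stick than late ones.
print(model.predict_proba([[0.1]])[0][1])  # high probability of recall
print(model.predict_proba([[0.9]])[0][1])  # low probability of recall
```

A single hand-picked feature like this is exactly the kind of thing you would otherwise have to spot manually; the point of a learned model is that it can weigh many such signals at once.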

In a way, we can describe this as a quest to create a closed system. Information flows from the source to the user, and by understanding what data is lost on the way (and why), we can learn how to produce content that is more relevant to the user. Of course, the main idea here is to create better content by minimizing data loss over time. Actually, I guess we could even choose to see this as a kind of entropy, but that is something for a later blog post.


That’s it for now. If you feel like giving some feedback it would be much appreciated. There is a wide array of fun subjects to cover in this blog series, but let me know if I should continue on this (mostly) conceptual path, or if you would prefer a more technical approach.


Thanks to Micke Tjernström
