How the Kaffae Extension Recognizes and Tracks Articles

Masatoshi Nishimura · Published in Kaffae · 6 min read · Aug 17, 2019

Today I want to go over how the Kaffae extension works internally. It will get into technical detail, so if you'd rather see the high-level picture of what the app is about, please jump straight to the website :)

A little bit of background.

Over the past two decades, we've seen all kinds of tracking and logging software: running apps, sleep apps, brain-wave apps, and so on. I have always believed in knowing who you are, so you can act on it if you choose to. And it's never been easier with the help of technology. It's like technology-assisted ultimate self-awareness.

That is why I wanted to tackle this issue.

How to track articles you read.

Challenge

Throughout development, the hardest part of the task has been recognizing which sites are articles and which are not. Tracking the wrong site is bad. Not tracking the right site is also bad. It's been a constant struggle to find the right balance between the two.

You see, the word "article" comes more intuitively to the academic community, where articles are thrown around in academic papers. Here, I treat articles as a broader concept around textual content, while trying to make the detection feel a little bit intelligent. At the end of the day, however, there's no crystal-clear definition of what constitutes an article, nor is there an HTML tag that reliably says "this is an article." The algorithm is nothing more than an experiment-based implementation that isn't 100% accurate but works well enough.

So when we say articles, that includes news stories and blog posts. It also covers most of the categories found in large magazine publishers, such as travel photo galleries. What won't be identified as articles are forums, Q&A pages, reviews, product descriptions, or tech documentation.

There are 3 steps to how articles are tracked with the extension.

  1. Extension and Browser URL
  2. Server Accessibility
  3. User Prompt

Let me go over each in detail.

1. First Check — Extension and Browser URL

Ends in Headline

At first glance, checking against the URL feels unreliable, given that a site owner can create infinitely many combinations. But it is surprisingly effective.

First, the extension looks at the shape of the URL. Modern blog URLs end with a headline connected by dashes, and this is surprisingly consistent. The whole URL looks like https://domain/date/headline or https://domain/category/headline. If the title were "The best comedian of the year," the URL would be something like https://funnyguy.com/2019/08/03/the-best-comedian-of-the-year.

Sometimes the headline is followed by hashes as well, but the text will always be there. This is surprisingly consistent, from the New York Times to Medium to TechCrunch. The predictability is thanks to the widespread popularity of WordPress-based infrastructure and a well-accepted standard of URL best practices (thanks to all the full-stack developer folks).

Min 3 Words

On top of that, the extension requires the headline snippet to have at least 3 words (separated by dashes). That makes it easy to distinguish articles from web apps whose URLs end with a random hash string or an action value. For example, the Medium page I am editing right now has 2f477092b9d1/edit at the end of its URL. It is rare, but I do come across articles with a 2-word headline (Google Design had one blog post called "UX AI"). The future plan is to match the last URL path segment against natural English vocabulary to see whether the URL is human-given rather than a machine-generated hash.
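
To make this concrete, here is a minimal sketch of the combined headline and minimum-word check in TypeScript. The function name and the choice to look only at the last path segment are my own simplifications, not the extension's actual code.

```typescript
// Hypothetical sketch: does the URL end in a dash-separated headline slug?
function looksLikeArticleUrl(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a valid URL at all
  }

  // Take the last non-empty path segment, e.g. "the-best-comedian-of-the-year"
  const segments = url.pathname.split("/").filter(Boolean);
  const slug = segments[segments.length - 1] ?? "";

  // Split the slug on dashes and keep only word-like pieces
  const words = slug.split("-").filter((w) => /^[a-z0-9]+$/i.test(w));

  // Require at least 3 dash-separated words, so random hashes like
  // "2f477092b9d1" or action paths like "edit" are rejected
  return words.length >= 3;
}

// looksLikeArticleUrl("https://funnyguy.com/2019/08/03/the-best-comedian-of-the-year") -> true
// looksLikeArticleUrl("https://medium.com/p/2f477092b9d1/edit") -> false
```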

Remove Unwanted Site Type

Lastly, I want to keep forums and Q&A sites from being tracked. Technically you'd still be reading text there, but it can be quite messy to figure out which content is actually being read.

Again, this came as a surprise. After browsing through 100+ popular forums, I realized that many of them follow consistent URL structures, just like blogs. The trick is amazingly simple: they contain subdomains or directory paths with "forum", "thread", "questions", or "community". A simple regex check to filter those sites turns out to be sufficient. It successfully removed 80% of the unwanted URLs that had passed the initial headline check.
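
For illustration, a filter like that could be a single regular expression over the hostname and path, built from the keywords above. The exact pattern and keyword list in the extension are likely more elaborate than this sketch.

```typescript
// Hypothetical sketch of the forum/Q&A filter described above.
const FORUM_PATTERN = /(^|[./])(forum|forums|thread|threads|questions|community)([./-]|$)/i;

function isForumLikeUrl(rawUrl: string): boolean {
  const url = new URL(rawUrl);
  // Check both the subdomain part and the directory path
  return FORUM_PATTERN.test(url.hostname) || FORUM_PATTERN.test(url.pathname);
}

// isForumLikeUrl("https://forum.example.com/some-long-discussion-title") -> true
// isForumLikeUrl("https://example.com/community/how-to-fix-my-router")   -> true
// isForumLikeUrl("https://example.com/2019/08/03/the-best-comedian")     -> false
```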

Explicit Domain Matching

The combination of the headline and forum checks works in most cases, but some sites of course bypass those rules. This is where I cheated: I've added whitelists and blacklists for major sites. Over the past 8 months, the lists have been built up from dealing with multiple publishers and from user feedback.

Examples of whitelisted sites:

Wikipedia, Associated Press

These sites should be tracked but do not fall under the rules above. Wikipedia is obvious: many of its articles have one-word headlines like Canada or Goofy. Associated Press is the outlier among news publishers, using only a hash to identify every one of its articles.

Examples of blacklisted sites:

Quora, Goodreads

Blacklisted sites are those whose URLs look very similar to articles, but whose content doesn't quite qualify. They are filtered out.

Where to position Quora has been tricky, because it contains many lengthy answers with proper paragraph breaks. Unlike regular forums, Quora does not encourage back-and-forth between users, and each question works rather like a blog title. Some of the answers are so well written that it is debatable whether they should be treated as articles. But to draw a clear boundary, I've decided to omit any site that promotes itself as Q&A. The same logic goes for Goodreads, where reviews are written in long form.
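
As a sketch, the explicit matching layer can be as simple as a pair of domain lists consulted before the URL rules. The entries below are just the examples mentioned in this post, and the function shape is my own assumption.

```typescript
// Hypothetical sketch of the whitelist/blacklist layer.
const WHITELIST = ["wikipedia.org", "apnews.com"];   // always track
const BLACKLIST = ["quora.com", "goodreads.com"];    // never track

type DomainVerdict = "track" | "skip" | "unknown";

function checkDomainLists(rawUrl: string): DomainVerdict {
  const host = new URL(rawUrl).hostname;
  const matches = (domain: string) =>
    host === domain || host.endsWith("." + domain);

  if (WHITELIST.some(matches)) return "track";   // skip the URL heuristics
  if (BLACKLIST.some(matches)) return "skip";    // filter out entirely
  return "unknown";                              // fall back to the URL checks
}
```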

2. Second Check — Server Accessibility

Public Accessibility

No parsing is done on the frontend (the Chrome extension) except for the URL. After a page passes the first check, the URL string is sent to the server (minus the query variables). The server then tries to access the website on its own via HTTP. This step ensures that the articles being tracked do not contain any private or sensitive info; they should be retrievable by anyone on the Internet.
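
Conceptually, that server-side check could look like the sketch below, assuming a Node-style backend with a global fetch. The real backend presumably handles redirects, timeouts, and rate limits more carefully.

```typescript
// Hypothetical sketch: can the URL be fetched anonymously over HTTP?
async function fetchPublicHtml(rawUrl: string): Promise<string | null> {
  // Strip query variables before fetching, as described above
  const url = new URL(rawUrl);
  url.search = "";

  try {
    const res = await fetch(url.toString(), { redirect: "follow" });
    if (!res.ok) return null;   // 4xx/5xx: not publicly retrievable
    return await res.text();    // raw HTML for the next parsing step
  } catch {
    return null;                // network error, DNS failure, etc.
  }
}
```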

There is a catch, however. This also means the extension cannot retrieve any article that sits behind a paywall, so it will not track Bloomberg, the Financial Times, or the Wall Street Journal. For now, the server needs the content itself in order to count an article as read and build an accurate picture of your personal category graph.

This area is still under experimentation. In the future, URLs coming from those major publishers could be accepted as legitimate articles regardless of content accessibility.

Determine Article Content

Once the site is determined to be accessible from the backend server, the app tries to parse out the HTML content. This is more art than science. Based on experience with hundreds of articles, I've settled on the major factors governing web articles:

  • Does it have an author name?
  • Does it have a published date?
  • Does it have enough paragraphs?
  • Does it have enough content length?
  • Does it have HTML p tags that are consistent in the layout?

In the future, it would be possible to bring in machine learning (maybe Naive Bayes, similar to spam detection), but at the end of the day it will never achieve 100% accuracy. The handpicked rules seem to be working quite well right now.
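
For illustration, here is one way those handpicked factors could be combined into a single decision. The field names, thresholds, and scoring below are assumptions for the sketch, not the extension's actual values.

```typescript
// Hypothetical signals extracted from the fetched HTML.
interface ParsedPage {
  authorName: string | null;
  publishedDate: string | null;
  paragraphCount: number;     // number of <p> tags in the main content block
  textLength: number;         // total characters of paragraph text
  consistentLayout: boolean;  // do the <p> tags share a common parent/class?
}

function looksLikeArticleContent(page: ParsedPage): boolean {
  let score = 0;
  if (page.authorName) score += 1;
  if (page.publishedDate) score += 1;
  if (page.paragraphCount >= 4) score += 1;
  if (page.textLength >= 1000) score += 1;
  if (page.consistentLayout) score += 1;

  // Require most, but not necessarily all, signals to be present
  return score >= 4;
}
```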

3. Third Check — User Prompt

How much control to give to a user is a heavy UX challenge.

On average, people spend about 30 seconds on each article, but sometimes they go as deep as 6 minutes. What counts as reading and what counts as skimming is quite subjective. Not only that, as a user you might want to decide for yourself what reading means. I personally do not like to track articles that I have glanced at but not understood at all; that feels like cheating.

Right now, the extension gives a 30-second buffer before an article gets counted as read. During that period, you can cancel the tracking at any time with a single click.
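
A minimal sketch of that buffer, as it might look inside a content script, is below. markAsRead and showCancelPrompt are hypothetical helper names, not the extension's real ones.

```typescript
const READ_BUFFER_MS = 30_000; // 30-second buffer before counting a read

function markAsRead(url: string): void {
  console.log("counted as read:", url); // would notify the backend in practice
}

function showCancelPrompt(onCancel: () => void): void {
  // Simplification: the real extension shows a small in-page prompt;
  // here any click on the page cancels the pending read
  document.addEventListener("click", onCancel, { once: true });
}

function startReadCountdown(articleUrl: string): void {
  const timer = setTimeout(() => markAsRead(articleUrl), READ_BUFFER_MS);
  showCancelPrompt(() => clearTimeout(timer));
}

startReadCountdown(location.href);
```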

Lastly, why bother?

We spend a great amount of time on our laptops, but that time spans many activities, from watching YouTube to checking Facebook. Defining article reading has been a tough challenge, but I've felt it is important to put a spotlight on this part of it.

In this blog post, I went over how article tracking is done in the Kaffae extension. Much of the technique is based on repeated experiments and going through countless articles.

What is reading an article to you?

Please leave a comment below if you have any thoughts.

Also, the extension has tracked 15,188 articles as of today. If you'd like to know how much you read every day, you can check out the extension.

Chrome Extension: https://chrome.google.com/webstore/detail/read-with-kaffae/cdopdmmkjbdmffleiaajlplpgfbikekc

Masatoshi Nishimura, maker of Kaffae: remember more from articles you read. NLP enthusiast. UofT grad. Toronto. https://kaffae.com