Building a world-class news aggregator on a coffee-a-day budget

Richard Nelson
Jul 25, 2021 · 10 min read


Earlier this year in Australia, Google was in the news over the government’s nonsensical media bargaining code, which eventually resulted in Google launching News Showcase in Australia, paying media companies for content to showcase and promote. I have a number of thoughts on this that I won’t go into. However, it got me thinking, and I decided to do a little proof of concept: building a news site and search engine that indexes Australian news content.

To build such a site, we need several components, and sub-components:

  • A search engine, obviously. This is the powerhouse of the system, providing indexing, scaling capabilities, complex queries and more. I use vespa.ai for this, which is relatively unknown compared to Elasticsearch/Solr and co, but it’s extremely good and fits my requirements perfectly. It has been used to power many large-scale engineering efforts at Yahoo, who open-sourced it in 2017. It is the search engine to use on mutable data sets with modern ranking methods. If you need this kind of search capability I’d highly recommend looking at Vespa Cloud. For my application, most search queries execute in under 20ms.
  • Spiders. These crawl news RSS feeds and sitemaps, looking for, visiting and scraping content. They feed data directly into Vespa, which processes and indexes it according to the schema I’ve defined.
  • If we want to rank articles on attributes other than published time, we also need something to gather those signals and update article documents with them. I’ve called this component the “augmenter”.
  • A web API. I didn’t want clients/browsers hitting the Vespa document API directly (though that is possible), so there’s a simple Kotlin/Micronaut application which abstracts some of the Vespa query logic and can scale independently. This is the “web api”.
  • A site! I’m not a particularly good frontend developer, so I’ve built a static Angular Material site which gives me decent UI components without a whole lot of effort.

A picture may paint this a little better — here’s an architecture diagram of the components:

Infrastructure & Build pipelines

All components except the frontend site are deployed into a Kubernetes cluster. I originally used Google Kubernetes Engine (GKE), as the $300 intro credit was enticing — however the overall cost compared to, say, Linode’s LKE was prohibitive on a coffee-a-day budget, and after a month I ended up migrating to LKE. This took less than half a day from Linode account sign-up to complete data migration, DNS repointing etc. The Kubernetes cluster is currently made up of two 8GB Linodes, costing about $80/month, plus the load balancer at $20/month and the storage volumes at $8.50/month, for a total of $108.50/month. GKE pricing is like black magic, but it was substantially more than this ($70/month just for the k8s control plane!).

All components of the application sit in the same git repository, and GitHub Actions workflows are triggered depending on which paths a commit touches. For example, if a change is made to files in the web-api path, the appropriate workflow runs to:

  • Build and run tests against the web-api component
  • Push the Docker image to Docker Hub (spiders, web-api, augmenter)
  • Deploy the new image with kustomize and kubectl apply -f

Each component is deployed the same way, with the interesting exception of the search application. The Vespa components run a plain Vespa Docker image, and when the application package is built in the pipeline, the resulting zip file is copied to the Vespa config node, which manages rolling the update out to the cluster. In the deployment pipeline:

kubectl cp target/application.zip vespa-0:/workspace
kubectl exec vespa-0 -- bash -c '/opt/vespa/bin/vespa-deploy prepare /workspace/application.zip && /opt/vespa/bin/vespa-deploy activate'

Here, the Vespa config node validates the application package (prepare), then on the activate step Vespa takes care of installing it on all other nodes.

The Vespa components are StatefulSets, with Persistent Volume Claims on the content nodes to persist the data. There are three separate StatefulSets here:

  • admin/config nodes (this is why we copy to the pod named vespa-0)
  • container nodes (named vespa-container-X)
  • content nodes (named vespa-content-X)

The admin/config node doubles as a content node, and container/content nodes can be scaled independently. For example, if more space to store content is needed, we can simply scale up the content StatefulSet, add the appropriate entries to the application’s hosts.xml/services.xml, and redeploy the application. Vespa then takes care of redistributing and rebalancing documents onto the new node, with no effort on my part.

In the search application’s services.xml, we also define content redundancy — this means we can take out a content node without affecting search results. This is also useful if the application is updated in a way that requires a restart of content nodes; they can be restarted one by one on a live system.

Queries and feeds into Vespa go through the container nodes, which have a Kubernetes service in front of them. The spider, augmenter and web-api components all use this HTTP service. For example, the Python component of the augmenter simply uses the service DNS name to reach Vespa for feeding:

self.vespa = Vespa(url = "http://vespa-search", port = 8080)

Since the web-api (a very efficient Micronaut application) is completely stateless and also utilises this service, it can be scaled independently and very quickly.

Also in the cluster is an install of Prometheus/Grafana. Vespa has built-in Prometheus metric endpoints, so I simply have a scrape definition that lets Prometheus pick these up.

The Grafana dashboards also show query rates and latency. Using only 2 content and 2 container nodes, I can get over 1,000 qps. Latency is generally around 10ms, and this performance does not degrade with scale.

The Search Engine

Out of the box, Vespa is a powerful platform. It supports many of the user query types you’d expect:

  • covid -china: Searches for articles containing the word "covid" and not "china"
  • bylines:"Josh Taylor" google: Searches for articles with "google" written by Josh Taylor.
  • source:abc bushfire: Articles from abc.net.au containing "bushfire"
  • covid source:guardian firstpubtime:>1612211170: "covid" articles on The Guardian published after a certain time.

And we can create our own query profiles/definitions, ranking algorithms and expressions. Functionality can be extended with document processors when documents are fed in, and custom searchers can extend user queries to provide more useful results.

The foundation of the search is within the schema definition. It contains the attributes that belong to news articles (headline, body, bylines, publish time etc), as well as ranking profiles. There are several ranking profiles in the definition, including one for standard search queries, as well as profiles for ranking the “Top News” section. For example, the ranking profile for a standard search looks like:

rank-profile bm25_freshness inherits default {
    first-phase {
        expression: bm25(headline) + bm25(bodytext) + freshness(firstpubtime).logscale * 2
    }
    rank-properties {
        freshness(firstpubtime).halfResponse: 172800
    }
}

This uses BM25 (fast, and good enough) together with the “freshness” of an article to determine ranking — someone searching for “covid NSW” is probably after more recent articles.
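
As a rough sketch (not how the production web API actually issues queries, and assuming a default fieldset covering the headline and body text), a search against this profile via pyvespa might look like:

from vespa.application import Vespa

# Connect to the container nodes through the in-cluster service described earlier.
app = Vespa(url="http://vespa-search", port=8080)

# Hypothetical "covid NSW" search, ranked with the bm25_freshness profile.
response = app.query(body={
    "yql": 'select * from sources newsarticle where default contains "covid" and default contains "NSW";',
    "ranking": "bm25_freshness",
    "hits": 10,
})

for hit in response.hits:
    print(hit["relevance"], hit["fields"]["headline"])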

One feature I wanted to have was the ability to find related articles. This is useful both in search, where one may want to find similar articles from any time, and also for the Top News section where articles are grouped into topics. When fetching news from multiple sources, especially for large news events, there can be many articles on the same story. This makes it difficult to present a mix of news. So the Top News section limits topics to 3 articles per group, with a link to full coverage to see the rest — and visually separates out stories.

To achieve this, there’s a “Related Articles” searcher, a Kotlin class which is part of the application package. Given a particular article, it searches for similar ones. But this isn’t practical for top news ranking: we would need groups for every recent article, and running the related-article searcher across all of them at query time would be slow. As fast as Vespa is, retrieval time would be sub-optimal (and it would be painful to implement). So for this, a DocumentProcessor is implemented: when an article is fed in, the processor runs a related-article search and groups articles which match a required relevance score. If it finds an already existing group, the article is added to it, and further matches are also added if they meet the required relevance score.

Vespa then allows results to be grouped by a particular attribute. So for the Top News list, articles are grouped by the pre-calculated topic:

select * from sources newsarticle WHERE group_doc_id matches "^id" |
all(group(group_doc_id) max(15) order(-avg(relevance())) each(output(count()) max(3) each(output(summary()))))

This query retrieves all articles with a group_doc_id, orders the groups by the average “relevance” of their articles, and outputs 3 items per group with the document summary. Here’s the rank profile for top_news:

rank-profile top_news {
    rank-properties {
        freshness(firstpubtime).maxAge: 86400
    }

    first-phase {
        expression: freshness(firstpubtime) * max(10, attribute(twitter_favourite_count))
    }

    summary-features: freshness(firstpubtime)
}

The linear freshness is multiplied by the number of Twitter favourites for a story — this gives a surprisingly good mix of top news!

The extra signals — twitter_favourite_count and twitter_retweet_count — are part of the document schema, and their values are regularly updated by the augmenter, which I cover later.

By tuning the required relevance scores, extremely accurate topic groups are obtained.
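
To give a feel for how that grouping query is consumed, here’s a hedged pyvespa sketch that runs it with the top_news profile and dumps the returned group tree (the web API does the real parsing; this is just to poke at the structure):

import json
from vespa.application import Vespa

app = Vespa(url="http://vespa-search", port=8080)

yql = ('select * from sources newsarticle where group_doc_id matches "^id" | '
       'all(group(group_doc_id) max(15) order(-avg(relevance())) '
       'each(output(count()) max(3) each(output(summary()))))')

# hits=0: we only want the grouped topics, not a flat hit list.
response = app.query(body={"yql": yql, "ranking": "top_news", "hits": 0})

# Each topic group contains up to 3 article summaries, nested under the root group list.
print(json.dumps(response.json["root"], indent=2))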

Augmenter

The augmenter is a Python script, deployed in the k8s cluster as a standalone Deployment, with a list of Twitter handles. It polls the Twitter API and regularly updates the documents in Vespa with retweet and like counts for matched articles:

# Build the partial update from the tweet's engagement counts.
vespa_fields = {
    'twitter_favourite_count': status.favorite_count,
    'twitter_retweet_count': status.retweet_count,
    'twitter_link': 'https://twitter.com/{}/status/{}'.format(status.user.screen_name, status.id),
}
# Document ids are the sha256 of the article URL.
response = self.vespa.update_data(
    schema = "newsarticle",
    data_id = hashlib.sha256(article['fields']['url'].encode()).hexdigest(),
    fields = vespa_fields
)

I used the pyvespa library for the integration (the spiders also use it), and k8s secrets to store the Twitter credentials.
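
The Twitter client setup isn’t shown above; assuming tweepy (my guess — any Twitter API client would do) and credentials exposed to the pod as environment variables from a k8s Secret, it might look roughly like this:

import os
import tweepy

# Credentials come from a Kubernetes Secret mounted as environment variables.
auth = tweepy.OAuthHandler(os.environ["TWITTER_CONSUMER_KEY"], os.environ["TWITTER_CONSUMER_SECRET"])
auth.set_access_token(os.environ["TWITTER_ACCESS_TOKEN"], os.environ["TWITTER_ACCESS_TOKEN_SECRET"])
api = tweepy.API(auth)

handles = ["example_news_outlet"]  # hypothetical list of tracked Twitter handles

for handle in handles:
    for status in api.user_timeline(screen_name=handle, count=50):
        # Match the tweeted link back to an article, then apply the partial
        # update shown above (favourite count, retweet count, tweet link).
        pass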

Spiders

The spiders are built in Python on the https://scrapy.org framework. All of them start from RSS feeds or sitemap XMLs.

A pipeline component is implemented that runs at the end of a scrape, to feed the items into Vespa. Fortunately, the pyvespa module can be used for this too. The only trouble I came across here was that it didn’t support creating a document when updating, but it was easy to code so I raised a PR to implement this. Create-on-update is needed because the augmenter adds extra fields to documents: when a spider crawls a URL again it should update the document rather than overwrite it, and this avoids having to fetch the document first.
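
Whether items are fed one by one or batched when the crawl finishes is an implementation detail, but a minimal per-item sketch of such a pipeline (the class name and field mapping are my assumptions) looks like:

import hashlib
from vespa.application import Vespa

class VespaFeedPipeline:
    """Scrapy item pipeline that feeds scraped articles into Vespa."""

    def open_spider(self, spider):
        # Reach the container nodes through the in-cluster service.
        self.vespa = Vespa(url="http://vespa-search", port=8080)

    def process_item(self, item, spider):
        # Document id is the sha256 of the article URL, matching the augmenter.
        doc_id = hashlib.sha256(item["url"].encode()).hexdigest()
        self.vespa.update_data(
            schema="newsarticle",
            data_id=doc_id,
            fields=dict(item),
            create=True,  # create the document if it doesn't exist yet
        )
        return item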

Individual spiders are very easy to add — they inherit from one of the existing RSS or sitemap spiders and add XPath or other means to extract specific attributes from articles.
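
For illustration, a site-specific spider might look something like this; the base class, URLs and XPaths here are placeholders rather than the real extraction rules:

from scrapy.spiders import SitemapSpider

class ExampleNewsSpider(SitemapSpider):
    # Hypothetical spider for a single outlet, starting from its news sitemap.
    name = "example-news"
    sitemap_urls = ["https://www.example.com/news-sitemap.xml"]

    def parse(self, response):
        yield {
            "url": response.url,
            # Prefer Open Graph metadata, falling back to on-page markup.
            "headline": response.xpath('//meta[@property="og:title"]/@content').get()
                        or response.xpath("//h1/text()").get(),
            "bodytext": " ".join(response.xpath("//article//p//text()").getall()),
            "bylines": response.xpath('//meta[@name="author"]/@content').getall(),
            # ISO timestamp; would need converting to epoch seconds for firstpubtime.
            "firstpubtime": response.xpath('//meta[@property="article:published_time"]/@content').get(),
        }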

News sites are awful at standardised markup. There are many standards they could use, but most implement only half of them (e.g. Open Graph), get things wrong (e.g. dates/timezones) or are otherwise poorly structured (e.g. byline information). So some of them require a fair amount of manual massaging to get useful extracts, which is absolutely key to having a decent search engine! This work, probably the most important, is also unfortunately the most tedious — manually working out extraction paths and making sure they hold for most article types on a site.

One bit of trouble I had was that I was getting related articles that seemed to be completely unrelated. It turned out this was because some sites (e.g. the Guardian) have pages that summarise a bunch of different news stories, and while these aren’t typical “article” pages, they were being indexed. The related searcher would then link completely unrelated stories through them. So I had to find ways to manually exclude these pages. Once this was done, topics and related news articles suddenly became much better.
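
One way to express that exclusion, purely as a sketch (the og:type check and URL patterns are my guesses, not the actual rules), is to filter before yielding anything:

def is_probably_article(response):
    # Skip live blogs, story round-ups and other pages that bundle many stories.
    og_type = response.xpath('//meta[@property="og:type"]/@content').get()
    if og_type and og_type != "article":
        return False
    # Hypothetical URL patterns for pages that aggregate multiple stories.
    return not any(part in response.url for part in ("/live/", "/roundup/"))

Pages that fail the check are simply never yielded as items, so they never reach the index or the related-article grouping.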

Web API

The web api is essentially a proxy between users’ browsers and the Vespa API. It is possible to configure the Vespa containers to serve the site component directly (this is what vespa.cord19.ai does), but it’s more costly — a lightweight Micronaut application does the job just fine, is easier to work with for this purpose, and can be scaled and deployed completely separately.

This application is about as simple as it gets, and just abstracts some of the Vespa query functionality away to keep things simple for the JavaScript.

There’s a simple Controller with functions that build the YQL for Vespa queries, and a SearchClient which takes care of the HTTP requests to Vespa.

Deployment is also straightforward: jibDockerBuild creates the Docker image, which is pushed to Docker Hub and then rolled out with kustomize and kubectl apply.

On the k8s side, in front of the deployment exists a service which simply selects the ausnews-web-api app. Since www.ausnews.org is a CNAME for ausnews.github.io, I set up s.ausnews.org to point at the external IP for the ingress, and the ingress rule for s.ausnews.org points at the service.

SSL certs for both s.ausnews.org and m.ausnews.org (monitoring) are managed by the letsencrypt ClusterIssuer. One challenge was that friends would just type “ausnews.org” into their browser, and a zone apex can’t be CNAMEd. So I added a rewrite rule to the ingress to redirect those browsers back to the GitHub Pages-hosted site.

Since the deployment is stateless, it’s incredibly easy to scale up and down as required. The Micronaut framework is very efficient, and in my tests it was never the bottleneck for search queries on their way to Vespa.

Site

I am not a frontend developer, but unfortunately I needed a site. I chose Angular Material for this implementation, and decided to go with a single-page web app which can be deployed to free static hosting — in this case, GitHub Pages. Angular Material lets me concentrate on getting things to work without having to think much about UI widget design.

Most of the site’s code is simple UI work. There’s a service with simple functions for retrieving lists or groups of articles; getTopics , getAuthors , getRelated etc. These hit the relevant web API endpoints and return lists of article objects for display in the various component types.

One nice feature of the search results is the ability to filter by source after searching. Each time you check one of the sources, a new search is run. The search is so fast (almost always <100ms including all networking) that it almost feels like local filtering.

Deployment is done to GitHub pages, with a CNAME file in the repo — www.ausnews.org is a CNAME for ausnews.github.io. The action is largely:

npm run ng -- deploy --repo=https://github.com/ausnews/ausnews.github.io.git --cname=www.ausnews.org --name="Ausnews Dev" --email=dev@ausnews.org --no-silent
