Scalable Sitemap Solutions with AWS Lambda+JS
Background
The first step in improving SEO on any web application is to make sure that search engines can index every piece of content. Search engines discover pages by following the links on every page they crawl. However, if a site has a lot of pages that are not interconnected, search engines may never crawl a large portion of the content. This is the case with Vevo. We have hundreds of thousands of videos and a hundred thousand artists across fourteen geographic territories, so to make sure everything is crawled, we use sitemaps.
Vevo’s site structure is pretty simple. Our main function is, of course, to play music videos. Videos are associated with the artists that create them, and artists can have many videos. We also have users and playlists. However, since these can be either public or private and their content tends to change frequently, we’ve decided not to include them in our sitemaps for now.
What are sitemaps?
Sitemaps are XML documents that contain a small amount of metadata for a particular page on a website. In the simplest form, a sitemap is just a list of URLs for search engines to crawl. A basic sitemap looks like:
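As a minimal sketch, a basic sitemap can be rendered from a list of URLs in a few lines of JavaScript. The `buildBasicSitemap` helper and the example.com URLs are assumptions made for illustration, not Vevo's code; the `urlset` namespace comes from the sitemaps.org protocol.

```javascript
// Escape the five characters that are special in XML text and attributes.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Render one <url> entry per URL inside a <urlset> root element.
function buildBasicSitemap(urls) {
  const entries = urls.map(
    (url) => `  <url>\n    <loc>${escapeXml(url)}</loc>\n  </url>`
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</urlset>',
  ].join('\n');
}

// Placeholder URLs for illustration only.
const basicSitemap = buildBasicSitemap([
  'https://www.example.com/watch/first-video',
  'https://www.example.com/artist/some-artist?tab=videos&page=2',
]);
console.log(basicSitemap);
```

Note that the ampersand in the second URL is escaped as `&amp;`, which the protocol requires for every URL in the file.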
If your site is separated by country, like Vevo’s, you can also add those alternative URLs that contain the same content. For instance:
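The sketch below shows one way to emit those alternates using the `xhtml:link` annotation with `rel="alternate"` and `hreflang`. The helper names, locale codes, and URLs are hypothetical. Each `<url>` entry lists every alternate, including itself, so that the annotations are reciprocal.

```javascript
// Escape characters that are special inside XML text and attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build one <url> entry per localized page, each carrying the full link set.
function buildLocalizedEntries(alternates) {
  const links = alternates
    .map(
      ({ hreflang, href }) =>
        '    <xhtml:link rel="alternate" ' +
        `hreflang="${escapeXml(hreflang)}" href="${escapeXml(href)}"/>`
    )
    .join('\n');
  return alternates
    .map(({ href }) => `  <url>\n    <loc>${escapeXml(href)}</loc>\n${links}\n  </url>`)
    .join('\n');
}

function buildLocalizedSitemap(alternates) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"',
    '        xmlns:xhtml="http://www.w3.org/1999/xhtml">',
    buildLocalizedEntries(alternates),
    '</urlset>',
  ].join('\n');
}

// Placeholder territories and URLs.
const localizedSitemap = buildLocalizedSitemap([
  { hreflang: 'en-us', href: 'https://www.example.com/us/watch/first-video' },
  { hreflang: 'en-gb', href: 'https://www.example.com/gb/watch/first-video' },
]);
console.log(localizedSitemap);
```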
With more complex pages you can add additional information about the web page. Google supports the additional media types of videos, images, and news. For instance, for a music video site like Vevo, we’re able to add metadata about each of the video pages, including the video title, length, view count, and description. An example video sitemap might look like:
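As a rough illustration, the following JavaScript renders a video entry using tags from Google's video sitemap extension (`video:title`, `video:description`, `video:duration`, `video:view_count`, and so on). The input field names (`pageUrl`, `durationSeconds`, and the rest) and all values are assumptions made for this sketch.

```javascript
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Render a single <url> entry with its <video:video> metadata.
// `video` uses hypothetical field names chosen for this example.
function buildVideoEntry(video) {
  return [
    '  <url>',
    `    <loc>${escapeXml(video.pageUrl)}</loc>`,
    '    <video:video>',
    `      <video:thumbnail_loc>${escapeXml(video.thumbnailUrl)}</video:thumbnail_loc>`,
    `      <video:title>${escapeXml(video.title)}</video:title>`,
    `      <video:description>${escapeXml(video.description)}</video:description>`,
    `      <video:player_loc>${escapeXml(video.playerUrl)}</video:player_loc>`,
    // Duration is expressed in whole seconds.
    `      <video:duration>${Math.round(video.durationSeconds)}</video:duration>`,
    `      <video:view_count>${Number(video.viewCount)}</video:view_count>`,
    '    </video:video>',
    '  </url>',
  ].join('\n');
}

function buildVideoSitemap(videos) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"',
    '        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">',
    ...videos.map(buildVideoEntry),
    '</urlset>',
  ].join('\n');
}

// Placeholder data, not a real video.
const videoSitemap = buildVideoSitemap([
  {
    pageUrl: 'https://www.example.com/watch/first-video',
    thumbnailUrl: 'https://img.example.com/first-video.jpg',
    playerUrl: 'https://www.example.com/embed/first-video',
    title: 'Rock & Roll (Official Video)',
    description: 'The official music video.',
    durationSeconds: 215.4,
    viewCount: 1234567,
  },
]);
console.log(videoSitemap);
```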
That’s a lot of data to gather and keep up to date for every single music video on Vevo! So how were we able to do it? We built a suite of sitemap tools known as Magellan.
Additional Sitemap Resources:
Magellan
Magellan is Vevo’s sitemap-building tool, built using serverless JavaScript functions deployed to AWS Lambda. It is a collection of Lambdas that build, monitor, and update sitemap files based on changes to Vevo’s catalog. The five main functions of Magellan are: Backfill, Receiver, Generate, Index, and Sitemap.
Together these Lambda functions allow Vevo to build sitemaps that catalog every single video, artist, and miscellaneous page on Vevo.com. Below is a diagram of Magellan’s architecture.
Backfill
This Lambda does a lot of the heavy lifting for Magellan. It queries every page of videos and artists from the Vevo API to pull out all of the information that is necessary to build a sitemap entry. It runs this data through a translation function that converts the data into what is expected in a sitemap and inserts the translated record into an Aurora database in AWS’s RDS system.
An example of a translation decorator:
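One way such a translation step might look is sketched below. Every field name on both the API side (`durationMs`, `territories`, `status`) and the database side (`duration_seconds`, `countries`, `enabled`) is hypothetical; the point is only that a nested API record is flattened into the columns a sitemap entry needs.

```javascript
// Convert a (hypothetical) API video record into the flat row stored in the
// database. Field names on both sides are assumptions made for this sketch.
function translateVideo(apiVideo) {
  return {
    id: apiVideo.id,
    title: apiVideo.title,
    description: `Watch ${apiVideo.title} by ${apiVideo.artistName}`,
    // Sitemaps expect the duration in whole seconds.
    duration_seconds: Math.round(apiVideo.durationMs / 1000),
    view_count: apiVideo.views.total,
    // Keep only the calendar date (YYYY-MM-DD, in UTC).
    publish_date: new Date(apiVideo.releaseDate).toISOString().slice(0, 10),
    // Store territories as a sorted, comma-separated list of lowercase codes.
    countries: apiVideo.territories.map((code) => code.toLowerCase()).sort().join(','),
    enabled: apiVideo.status === 'live' ? 1 : 0,
  };
}

// Placeholder record for illustration.
const translatedRow = translateVideo({
  id: 'video-123',
  title: 'First Video',
  artistName: 'Some Artist',
  durationMs: 215400,
  views: { total: 1234567 },
  releaseDate: '2016-03-04T10:00:00Z',
  territories: ['US', 'GB'],
  status: 'live',
});
console.log(translatedRow);
```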
Because there is so much content to process and Lambdas have a hard time limit of five minutes, the Backfill Lambda must keep a cursor marking its position in the catalog and spin up another instance of itself when time is running out. This way, the chain of invocations can complete the entire job. There is, however, a caveat with this technique: a bug in the code could cause an infinite loop of invocations. If that happens during the backfill process, the only way to stop the function is to set its maximum execution time to 1 second.
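The cursor-and-reinvoke pattern can be sketched roughly as follows. `fetchPage`, `saveRecords`, and `invokeSelf` are hypothetical helpers injected by the caller, and the safety margin is an arbitrary choice. The `MAX_GENERATIONS` cap is not described above; it is an extra guard added in this sketch as one possible defense against the runaway loop. `context.getRemainingTimeInMillis()` is the method the Lambda Node.js runtime provides on the context object.

```javascript
// Stop this far before the Lambda timeout so there is time to hand off.
const SAFETY_MARGIN_MS = 30 * 1000;
// Hard cap on chained invocations (an assumption of this sketch).
const MAX_GENERATIONS = 500;

// Build a handler from injected dependencies (all hypothetical names):
//   fetchPage(cursor)    -> Promise<{ records, nextCursor }>; nextCursor is null at the end
//   saveRecords(records) -> Promise that resolves once the rows are stored
//   invokeSelf(payload)  -> Promise that asynchronously re-invokes this Lambda
function createBackfillHandler({ fetchPage, saveRecords, invokeSelf }) {
  return async function handler(event, context) {
    let cursor = event.cursor === undefined ? null : event.cursor;
    const generation = event.generation || 0;
    if (generation >= MAX_GENERATIONS) {
      throw new Error(`Backfill stopped after ${generation} chained invocations`);
    }

    // Always process at least one page so every invocation makes progress.
    do {
      const { records, nextCursor } = await fetchPage(cursor);
      await saveRecords(records);
      cursor = nextCursor;
    } while (cursor !== null && context.getRemainingTimeInMillis() > SAFETY_MARGIN_MS);

    if (cursor === null) {
      return { done: true, generation };
    }
    // Out of time: hand the cursor to a fresh invocation and exit.
    await invokeSelf({ cursor, generation: generation + 1 });
    return { done: false, cursor, generation };
  };
}

// With the AWS SDK v2, invokeSelf might be written as:
//   (payload) => lambda.invoke({
//     FunctionName: context.functionName,
//     InvocationType: 'Event',
//     Payload: JSON.stringify(payload),
//   }).promise();
```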
In theory, this function only ever needs to be run once for each type of asset (video or artist), just to initially fill the database. Because of this infrequency of use, it is triggered manually via the AWS console.
Receiver
Two instances of this Lambda function run: one for videos and the other for artists. These functions are triggered by an AWS Kinesis stream that fires when any asset is created, updated, or destroyed. Their purpose is to make sure that the database stays in sync with Vevo’s catalog.
These work similarly to the Backfill’s repair functionality: when a new update comes through the Kinesis stream, the function grabs the updated data from the Vevo API and updates the database entry.
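A hedged sketch of a Receiver-style handler follows. Lambda delivers Kinesis records with a base64-encoded payload in `record.kinesis.data`; the JSON shape of that payload (`{ id, action }`) and the injected helpers `fetchAsset`, `translate`, `upsertRow`, and `deleteRow` are assumptions made for this example.

```javascript
// Decode the base64-encoded payload of each Kinesis record in a Lambda event.
// The JSON shape ({ id, action }) is an assumption for this sketch.
function decodeKinesisRecords(event) {
  return event.Records.map((record) =>
    JSON.parse(Buffer.from(record.kinesis.data, 'base64').toString('utf8'))
  );
}

// fetchAsset, translate, upsertRow, and deleteRow are hypothetical helpers
// supplied by the caller (API client, translation function, database access).
function createReceiverHandler({ fetchAsset, translate, upsertRow, deleteRow }) {
  return async function handler(event) {
    // Process changes in order so a later event for the same asset wins.
    for (const change of decodeKinesisRecords(event)) {
      if (change.action === 'destroyed') {
        await deleteRow(change.id);
      } else {
        // Created or updated: re-read the asset so the row reflects its latest state.
        const asset = await fetchAsset(change.id);
        await upsertRow(translate(asset));
      }
    }
  };
}
```

Fetching the asset again, rather than trusting the stream payload, keeps the stored row consistent with what the API currently returns, at the cost of one extra request per change.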
Sitemap
The Sitemap Lambda does the rest of the heavy lifting. This Lambda is responsible for building a single sitemap document. It takes an asset type, country, and page number as arguments, queries those specific items from the database, converts those records to XML, and uploads the whole thing to AWS’s S3.
A major advantage of the Lambda architecture is its scalability with batchable actions. This allows us to spin up an instance of this function for every sitemap page that we need to build. Because the sitemap protocol limits each file to 50,000 URLs or 50MB, that can be around 15 sitemaps for each territory. Building all of the sitemaps at once ensures that we don’t have race conditions with new assets being added during the building process.
One of the challenges with this function is making sure to select the correct set of data from the database. A single sitemap will have around 50,000 videos taken from the set of videos that have viewing rights in the specified country, are enabled, and have a publish date prior to today’s date. To make this easier, we built a MySQL interface class that allows simple composing of complex SQL queries. For instance, we can compose a WHERE clause like this:
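The class below is a small, hypothetical query composer written for this illustration; it is not Vevo's actual interface class. The table and column names (`videos`, `countries`, `enabled`, `publish_date`) are assumptions. Conditions are collected with `?` placeholders so that values are passed separately from the SQL text, and `FIND_IN_SET` and `CURDATE` are standard MySQL functions.

```javascript
// A tiny, hypothetical query composer. Each where() call adds one condition;
// build() joins them with AND and appends paging if it was requested.
class SelectQuery {
  constructor(table) {
    this.table = table;
    this.conditions = [];
    this.values = [];
    this.paging = null;
  }

  where(condition, ...values) {
    this.conditions.push(condition);
    this.values.push(...values);
    return this; // allow chaining
  }

  // pageNumber is 1-based; page 1 starts at offset 0.
  page(pageNumber, pageSize) {
    this.paging = { limit: pageSize, offset: (pageNumber - 1) * pageSize };
    return this;
  }

  build() {
    let sql = `SELECT * FROM ${this.table}`;
    const values = [...this.values];
    if (this.conditions.length > 0) {
      sql += ` WHERE ${this.conditions.join(' AND ')}`;
    }
    if (this.paging) {
      // A stable ordering keeps pages from overlapping.
      sql += ' ORDER BY id LIMIT ? OFFSET ?';
      values.push(this.paging.limit, this.paging.offset);
    }
    return { sql, values };
  }
}

// Second page of enabled, already-published videos viewable in the UK.
const query = new SelectQuery('videos')
  .where('FIND_IN_SET(?, countries) > 0', 'gb')
  .where('enabled = ?', 1)
  .where('publish_date < CURDATE()')
  .page(2, 50000)
  .build();
console.log(query.sql);
console.log(query.values);
```

The resulting `{ sql, values }` pair matches the shape that placeholder-based MySQL clients for Node.js generally accept, for example as the first two arguments of a `query` call.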
Once we’ve gathered the subset of records that we are looking for, the final steps are simply building and uploading the XML structure. For that, we use a simple npm package, js2xmlparser. This allows us to compose simple JSON objects that map to the correct XML Schema. To upload, we use the AWS library and point it to the correct S3 bucket.
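A rough sketch of these two steps is shown below, with the XML serializer and the S3 client passed in as parameters. `parse` is assumed to behave like `js2xmlparser.parse(rootName, object)`, where by default the `"@"` key holds attributes and an array becomes repeated elements. `s3` is assumed to look like an AWS SDK v2 S3 client. The row fields, bucket, and key are placeholders.

```javascript
// Shape database rows as a plain object for the XML serializer.
// Row fields (url, updated) are hypothetical.
function toUrlsetObject(rows) {
  return {
    '@': { xmlns: 'http://www.sitemaps.org/schemas/sitemap/0.9' },
    url: rows.map((row) => ({ loc: row.url, lastmod: row.updated })),
  };
}

// Serialize the rows and upload the document to S3.
async function buildAndUpload({ parse, s3, bucket, key, rows }) {
  const xml = parse('urlset', toUrlsetObject(rows));
  await s3
    .putObject({
      Bucket: bucket,
      Key: key,
      Body: xml,
      ContentType: 'application/xml',
    })
    .promise();
  return { key, bytes: Buffer.byteLength(xml, 'utf8') };
}

// In a real Lambda this might be wired up as follows, assuming the
// js2xmlparser and aws-sdk packages are installed:
//   const js2xmlparser = require('js2xmlparser');
//   const AWS = require('aws-sdk');
//   buildAndUpload({
//     parse: (root, obj) => js2xmlparser.parse(root, obj),
//     s3: new AWS.S3(),
//     bucket: 'my-sitemap-bucket',        // placeholder
//     key: 'sitemaps/gb/videos-1.xml',    // placeholder
//     rows,
//   });
```

Returning the byte count makes it easy to check the result against the protocol's file size limit before publishing.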
Generate
The Generate Lambda is relatively simple. Its main function is to analyze the database size and orchestrate building each of the sitemaps for each territory. For instance, to build the UK’s sitemaps, this Lambda will get a count of all of the valid records in the database that have UK viewing rights. It then sends an SNS message to the Sitemap Lambda for each of the sitemaps that need to be built for the UK. This process happens for each of the available territories. The Lambda also initiates the building of each sitemap index file and a sitemap for Vevo’s base URLs. The Generate Lambda is initiated on a timer. This ensures that sitemap records are never more than a couple of hours out of date.
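The orchestration can be sketched as two small steps: divide the record count by the protocol's 50,000-URL limit to get the number of pages, then publish one SNS message per page. The message shape, the record count, and the function names are assumptions; `sns` is expected to look like an AWS SDK v2 SNS client, and the topic ARN is supplied by the caller.

```javascript
const MAX_URLS_PER_SITEMAP = 50000; // limit set by the sitemap protocol

// Work out which pages need building for one territory and asset type.
function planSitemapPages(country, assetType, recordCount) {
  const pageCount = Math.ceil(recordCount / MAX_URLS_PER_SITEMAP);
  const pages = [];
  for (let page = 1; page <= pageCount; page += 1) {
    pages.push({ country, assetType, page });
  }
  return pages;
}

// Publish one SNS message per page so each page is built by its own
// Sitemap Lambda invocation.
async function requestSitemaps(sns, topicArn, pages) {
  await Promise.all(
    pages.map((job) =>
      sns.publish({ TopicArn: topicArn, Message: JSON.stringify(job) }).promise()
    )
  );
  return pages.length;
}

// A made-up count of 720,000 valid UK videos needs 15 sitemap pages.
const ukVideoPages = planSitemapPages('gb', 'video', 720000);
console.log(ukVideoPages.length); // 15
```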
Index
Based on the Generate Lambda’s analysis, the Index Lambda builds the entry point for a territory’s sitemaps. A sitemap index file is a very simple XML document that catalogs the location of all of the individual sitemaps for a territory. This Lambda is initiated by the Generate Lambda.
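For illustration, a sitemap index can be rendered with a helper like the one below. The `sitemapindex`, `sitemap`, `loc`, and `lastmod` elements come from the sitemaps.org protocol; the function name, file naming scheme, and URLs are hypothetical.

```javascript
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// List every individual sitemap file for a territory in one index document.
function buildSitemapIndex(sitemapUrls, lastModified) {
  const lastmod = lastModified.toISOString().slice(0, 10); // YYYY-MM-DD (UTC)
  const entries = sitemapUrls.map((url) =>
    [
      '  <sitemap>',
      `    <loc>${escapeXml(url)}</loc>`,
      `    <lastmod>${lastmod}</lastmod>`,
      '  </sitemap>',
    ].join('\n')
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</sitemapindex>',
  ].join('\n');
}

// Placeholder file names for a territory with three video sitemap pages.
const pageUrls = [1, 2, 3].map(
  (page) => `https://www.example.com/sitemaps/gb/videos-${page}.xml`
);
const sitemapIndex = buildSitemapIndex(pageUrls, new Date('2017-01-15T08:00:00Z'));
console.log(sitemapIndex);
```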
Advantages of Magellan
By implementing the Magellan sitemap tool, we’re able to point directly to every item in our catalog. Rather than diving through pages linked from our other pages, Googlebot now knows exactly where our content is, which pages are duplicates, and the priority of each page. That helps the crawler determine where to spend its limited time, and as a result we’ve seen the number of pages crawled per day skyrocket by nearly a factor of ten.
Magellan’s Receiver functions and timed sitemap generation ensure that Google’s listings are never out of date and links are never broken. With fewer errors and broken pages, our reputation with Google increases and we’re able to see increased impressions, rankings, and click-through rates.
Many factors have contributed to Vevo’s SEO efforts and success, but the most important and impactful has been maintaining consistently precise sitemaps through our suite of tools known as Magellan.