
We’ll do it in SQL

Photo by Martin Adams on Unsplash

I wrote a service that scrapes data from web comic sites, stores it in a database, and generates RSS feeds. I call it Comic Gator, and I built it so that I have an automated way to replay a comic’s history when I first discover it. While thinking about how the state of the RSS feeds is managed, I converged on a design that could be expressed completely in a single standard PL/pgSQL statement; in effect, a data pipeline described inside the database. This is an analysis of the statement that updates the RSS feed state, and of how the overall design of the schema allows it to be done this way.

The comic database (cdb) has four resource tables:

  • geek (user is a reserved word in postgres)
  • comic
  • strip
  • feed

and two linking tables:

  • feed_comic
  • feed_strip

Their relationship is summarized by this diagram.

[Schema diagram]

When an RSS feed is generated, it can simply read off the latest feed_strip rows for a given feed_id and left join them to the resource tables to fill in the context. The fancy part is populating the feed_strip table while following the rule defined in the feed_comic table. This rule is based on the state of five columns:
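The generation query might look something like the following sketch. The join keys and column names (id, comic_id, title, url, updated_at) are assumptions about the schema, not taken from the actual source.

```sql
-- Sketch: read the latest entries for one feed, joined to their context.
-- All column names here are assumed for illustration.
SELECT c.title  AS comic_title,
       s.title  AS strip_title,
       s.url,
       fs.updated_at
FROM feed_strip fs
LEFT JOIN strip s ON s.id = fs.strip_id
LEFT JOIN comic c ON c.id = s.comic_id
WHERE fs.feed_id = $1
ORDER BY fs.updated_at DESC
LIMIT 20;
```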

  • mark is a pointer to the number of the last strip inserted into feed_strip.
  • step is how many strips are inserted on each update.
  • interlude is the interval of time between feed updates.
  • start_at is a lower bound timestamp restricting when the feed starts.
  • next_at is the time when the next update is scheduled.

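A sketch of the feed_comic table built around those five columns might look like this. The exact types, defaults, and the is_replay flag are assumptions; only the five state columns are described in the text.

```sql
-- Hypothetical feed_comic definition; types and extra columns are assumed.
CREATE TABLE feed_comic (
    feed_id   integer NOT NULL REFERENCES feed (id),
    comic_id  integer NOT NULL REFERENCES comic (id),
    is_replay boolean NOT NULL DEFAULT false,        -- replay feature toggle (assumed name)
    mark      integer NOT NULL DEFAULT 0,            -- number of the last strip inserted
    step      integer NOT NULL DEFAULT 1,            -- strips inserted per update
    interlude interval NOT NULL DEFAULT '1 day',     -- time between feed updates
    start_at  timestamptz NOT NULL,                  -- lower bound: when the feed starts
    next_at   timestamptz NOT NULL,                  -- when the next update is scheduled
    PRIMARY KEY (feed_id, comic_id)
);
```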

The replay update statement is easy to understand if we break it into five parts.

The opening declaration tells the engine that this is a PL/pgSQL statement. PL/pgSQL is a procedural DSL that comes standard with PostgreSQL and has neat features like variables and loops.

The DECLARE section lets us set up variables that will be used throughout the statement. rec is a record with the same shape as a row selected from the feed_comic table.
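The skeleton of such a statement, sketched as an anonymous block:

```sql
-- Sketch: an anonymous PL/pgSQL block with a record variable.
DO $$
DECLARE
    rec feed_comic%ROWTYPE;  -- rec has the same columns as a feed_comic row
BEGIN
    -- the loop and its two statements go here
END $$;
```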

The FOR LOOP section populates our rec variable with each result of the executed query in turn. Here we filter feed_comic for rows that have the replay feature enabled, are active according to their start_at timestamp, and are overdue for an update. Then we loop, with rec representing each consecutive row.
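Based on that description, the loop might be sketched as follows; the is_replay column name is an assumption.

```sql
-- Sketch: iterate over replay feeds that are active and overdue.
FOR rec IN
    SELECT *
    FROM feed_comic
    WHERE is_replay                            -- has the replay feature (assumed flag)
      AND start_at <= CURRENT_TIMESTAMP        -- active according to its start timestamp
      AND next_at  <= CURRENT_TIMESTAMP        -- overdue for an update
LOOP
    -- two statements per row
END LOOP;
```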

Inside the loop there are two statements.

The first statement inserts a number of strips equal to step, starting from mark, into the feed_strip linking table. If the feed/strip combination already exists, it performs an update instead. The feed_id does not change in that update, but the write still fires the update_timestamp_trigger, so the feed_strip row will be included again when the RSS feed is generated.
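One way to write that insert; the strip.number column and the conflict target are assumptions about the schema.

```sql
-- Sketch: insert the next `step` strips after `mark` for this feed.
INSERT INTO feed_strip (feed_id, strip_id)
SELECT rec.feed_id, s.id
FROM strip s
WHERE s.comic_id = rec.comic_id
  AND s.number >  rec.mark
  AND s.number <= rec.mark + rec.step
ON CONFLICT (feed_id, strip_id) DO UPDATE
    SET feed_id = EXCLUDED.feed_id;  -- value is unchanged, but the write
                                     -- fires update_timestamp_trigger
```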

The second statement updates the feed_comic row, advancing mark by step and pushing next_at forward by one interlude. This has a nice property: if there is accidental downtime, say the service stops for more than several interludes, the feed rapidly catches up, because CURRENT_TIMESTAMP will continually be greater than next_at until the schedule is current again.
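A sketch of that bookkeeping update, using the five columns described earlier:

```sql
-- Sketch: advance the replay state for this feed/comic pair.
UPDATE feed_comic
SET mark    = mark + step,        -- progress the pointer by one step
    next_at = next_at + interlude -- schedule the next update
WHERE feed_id  = rec.feed_id
  AND comic_id = rec.comic_id;
```

Because next_at advances by a fixed interlude rather than jumping to CURRENT_TIMESTAMP plus interlude, each loop iteration after downtime still finds overdue rows, which is what produces the rapid catch-up behavior.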

This design avoids round trips between the database and the application layer. Linking tables let us define relationships without copying a lot of data around, and PL/pgSQL lets us run the business logic in the database itself.
