We’ll do it in SQL

Photo by Martin Adams on Unsplash

I wrote a service that scrapes data from web comic sites, stores them in a database, and generates RSS feeds. I call it Comic Gator and I made it so that I can have an automated way to replay a comic’s history when I first discover it. While thinking about how the state of the RSS feeds are managed I converged on a design that could be completely expressed in a standard PL/pgSQL statement; in effect describing a data pipeline inside of the database. This is an analysis of the statement that updates the RSS feed state and how the overall design of the schema allows it to be done in this way.

The comic database (cdb) has four resource tables:

  • geek (user is a reserved word in postgres)
  • comic
  • strip
  • feed

and two linking tables

  • feed_comic
  • feed_strip

Their relationship is summarized by this diagram.

When an RSS feed is generated it can simply read off the latest feed_strip rows for a given feed_id and left join them to all of the resource tables to fill in the context. The fancy part is populating the feed_strip table while following the rule defined in the feed_comic table. This rule is based on the state of five columns.

  • mark is a pointer to the number of the last strip inserted into feed_strip.
  • step is how many strips are inserted on each update.
  • interlude is the interval of time between feed updates.
  • start_at is a lower bound timestamp restricting when the feed starts.
  • next_at is the time when the next update is scheduled.
CREATE TABLE cg.feed_comic (
feed_id VARCHAR(24) NOT NULL REFERENCES cg.feed(id),
comic_id VARCHAR(24) NOT NULL REFERENCES cg.comic(id),
is_replay BOOLEAN NOT NULL DEFAULT false,
mark INT NOT NULL DEFAULT 0,
step INT NOT NULL DEFAULT 0,
interlude INTERVAL NOT NULL DEFAULT '1 day',
start_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
next_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP);

Source

This is the replay update statement in its entirety.

DO $$
DECLARE
rec cg.feed_comic%ROWTYPE;
BEGIN
FOR rec IN EXECUTE 'SELECT * FROM cg.feed_comic fc
WHERE fc.is_replay = TRUE
AND fc.start_at < CURRENT_TIMESTAMP
AND fc.next_at < CURRENT_TIMESTAMP'
LOOP
INSERT INTO cg.feed_strip (feed_id, strip_id) (
SELECT
rec.feed_id,
s.id
FROM cg.strip s
WHERE s.comic_id = rec.comic_id
AND s.number > rec.mark
AND s.number <= rec.mark + rec.step)
ON CONFLICT (feed_id, strip_id)
DO UPDATE SET (feed_id) = (rec.feed_id);

UPDATE cg.feed_comic
SET (mark, next_at) = (rec.mark + rec.step,
rec.next_at + rec.interlude)
WHERE feed_id = rec.feed_id
AND comic_id = rec.comic_id;
END LOOP;
END $$

Source

It is easy to understand if we break it into five parts.

DO $$
...
END $$

This declaration tells the engine that this is a PL/pgSQL statement, which is a DSL that comes standard in PostgreSQL databases. It has neat features like variables and loops.

  ...
DECLARE
rec cg.feed_comic%ROWTYPE;
BEGIN
...

The DECLARE section allows us to set up variables that will be used throughout the statement. rec is a record and possesses the same properties as a row selected from the feed_comic table.

FOR rec IN EXECUTE 'SELECT * FROM cg.feed_comic fc
WHERE fc.is_replay = TRUE
AND fc.start_at < CURRENT_TIMESTAMP
AND fc.next_at < CURRENT_TIMESTAMP'
LOOP
...
END LOOP;

The FOR LOOP section allows us to populate our rec variable with all the results of the executed query. Here we are filtering the feed_comics for only those that have the replay feature, are active according to their start timestamp and are overdue for an update. Then we loop with rec representing each consecutive row.

Inside the loop there are two statements.

    INSERT INTO cg.feed_strip (feed_id, strip_id) (
SELECT
rec.feed_id,
s.id
FROM cg.strip s
WHERE s.comic_id = rec.comic_id
AND s.number > rec.mark
AND s.number <= rec.mark + rec.step)
ON CONFLICT (feed_id, strip_id)
DO UPDATE SET (feed_id) = (rec.feed_id);

The first statement inserts an amount of strips equal to ‘step’ starting from ‘mark’ into the feed_strip linking table. If the feed/strip combination already exists then it performs an update. The feed_id will not change in this update but it will activate the update_timestamp_trigger and this feed_strip row will be included again when the RSS feed is generated.

    UPDATE cg.feed_comic
SET (mark, next_at) = (rec.mark + rec.step,
rec.next_at + rec.interlude)
WHERE feed_id = rec.feed_id
AND comic_id = rec.comic_id;

The second statement updates the feed_comic; progressing the ‘mark’ by the ‘step’ amount and increasing the next_at timestamp by the interlude. This has the nice feature that if there is accidental downtime — say that the service has stopped for more than several interludes — then the feed will rapidly catch up as the CURRENT_TIMESTAMP will continually be greater than the next_at timestamp.

This design allows us to avoid round trips between the database and the application layer. By using linking tables we define relationships without having to copy a lot of data around and by using PL/pgSQL we can expand our capabilities to do business logic in the database itself.