How to back up data from Postgres to AWS S3 from an Elixir application using Ports

CERN PHOTOWALK 2010 — Computer Centre — Roger Claus

Here at Rakuten Brazil, one of the main products that my team maintains is an Elixir system that functions as a kind of proxy, allowing access to several services that, for a long time, were only consumed internally.

This system records all transactions in an Audit Log table, so every team can observe what is going on with the calls made through their services, enhancing the debugging experience of these distributed services.

From time to time this table grows a lot, so we need to clean up old records. But a question arose: what should we do with this old data? After some research, I discovered Amazon Athena, which allows one to back up structured data and query it, paying only for the queries that are issued. There are two excellent posts¹ about it.

Why Ports?

So I decided to store all this data in gzipped JSON files in a way that we don't need to touch the filesystem. But why?

The initial plan was to put the old data in a temporary file, gzip it, and call aws s3 cp to store its contents.

Coming from the Ruby world, the first thing I tried was to find an Elixir counterpart to the Tempfile class, without success. After some research, I discovered that even Plug.Upload has a custom implementation of temporary files, used during file uploads.

mktemp could be used here, but after confirming that the aws s3 cp command accepts data from stdin, a new route was unveiled: Elixir Ports.

Ports enable us to communicate with external processes through stdin and stdout, allowing us to feed data directly to the command without creating an intermediary file.
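As a minimal sketch of this round trip (using cat as a stand-in for the real upload command, since it simply echoes stdin back to stdout):

```elixir
# Open a Port to an external command; :binary delivers its output as binaries.
port = Port.open({:spawn, "cat"}, [:binary])

# Write straight to the command's stdin, no intermediary file involved.
Port.command(port, ~s({"id": 1}\n))

# Whatever cat echoes back arrives as a message in our mailbox.
receive do
  {^port, {:data, data}} -> IO.puts("command received: #{data}")
after
  1_000 -> IO.puts("no data received")
end

Port.close(port)
```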

Building the task to back up old data.

The basic premise is to make this command testable, so that we can ensure all data is sent through the Port. The following Mix task was built:

Some important notes:

  1. The backup_cmd argument on line 5 allows constructing the command to be executed; the default argument, build_backup_cmd/2, lies on line 58.
  2. The use of Repo.stream/2 allows retrieving records without blowing up the RAM. There is another interesting approach that I found here, using Postgres COPY.
  3. The JSON encoding uses the Jason pretty: true option on line 39, to ease the later work.
  4. The use of Ecto.Multi on line 45 wraps both the data transfer and the subsequent delete operation in a single transaction.
  5. The execution of the operations inside a try block ensures the appropriate Port release.
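The overall shape of the task can be condensed into the database-free skeleton below. This is a sketch, not the original code: a plain list stands in for Repo.stream/2 and the Ecto.Multi transaction, inspect/1 stands in for Jason.encode!/2, and all names are illustrative.

```elixir
defmodule BackupSketch do
  # `backup_cmd` is injectable, so tests can substitute `cat` for the
  # real `aws s3 cp` pipeline, just like the task's backup_cmd argument.
  def run(rows, backup_cmd \\ "cat") do
    port = Port.open({:spawn, backup_cmd}, [:binary])

    try do
      rows
      # In the real task this is Repo.stream(query) inside an Ecto.Multi
      # transaction, and each row goes through Jason.encode!(row, pretty: true).
      |> Stream.map(&inspect/1)
      |> Enum.each(&Port.command(port, &1 <> "\n"))

      :ok
    after
      # Release the Port even if encoding or the external command fails.
      Port.close(port)
    end
  end
end

BackupSketch.run([%{id: 1, action: "GET /users"}])
```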

Testing the implementation.

Testing this solution is a challenge: how can we guarantee that all records are being passed to the backup command?

After some tinkering with this problem, I was surprised by the wonderful set of tools that Unix offers; cat is our secret weapon:

$ man cat
CAT(1)              BSD General Commands Manual              CAT(1)
NAME
cat -- concatenate and print files
DESCRIPTION
The cat utility reads files sequentially, writing them to the standard output. The file operands are processed in command-line order. If file is a single dash (`-') or absent, cat reads from the standard input. If file is a UNIX domain socket, cat connects to it and then reads it until EOF. This complements the UNIX domain binding capability available in inetd(8).

Using a cat command in the test allows us to capture on the output all the data sent to the input, so below is our test:
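The actual test lives in the embedded gist; an ExUnit sketch of the core idea looks like the following. It exercises only the cat trick itself, not the real task, so the module and assertions are illustrative:

```elixir
ExUnit.start(autorun: false)

defmodule CatEchoTest do
  use ExUnit.Case

  test "everything written to cat's stdin comes back on stdout" do
    # With `cat` injected as the backup command, whatever the task writes
    # to stdin is echoed straight back, so we can assert on it.
    port = Port.open({:spawn, "cat"}, [:binary])
    Port.command(port, ~s({"id": 1}\n))

    # The test process opened the Port, so it is the Port owner and the
    # echoed data arrives in its own mailbox.
    assert_receive {^port, {:data, data}}, 1_000
    assert data =~ ~s("id": 1)

    Port.close(port)
  end
end
```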

But one may ask: how do we get, in our test, the messages that are sent back by the Port?

This excerpt of Ports and Port Drivers explains that the process that creates the Port is the Port owner, meaning that all communication sent to and received from the Port must go through it. In other words, the Port owner receives, in its process mailbox, the messages sent back by the Port; in our case, the Port owner is the AuditLogPurge Mix task.

The next step is to capture the process mailbox of our Mix task, which is done by setting the current Mix shell to Mix.Shell.Process, so all messages sent through the Mix shell are received in our test process mailbox, allowing the use of assert_received/2 and refute_received/2 on lines 95 and 104.
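A small, task-independent demonstration of that mechanism (the message text is made up):

```elixir
# Mix.State must be running before Mix.shell/1 can be called from a script.
{:ok, _} = Application.ensure_all_started(:mix)

# With Mix.Shell.Process installed, Mix.shell().info/1 sends a
# {:mix_shell, :info, [msg]} message to the calling process instead of
# printing, which is what makes assert_received/2 work in tests.
Mix.shell(Mix.Shell.Process)

Mix.shell().info("2 records sent to backup")

receive do
  {:mix_shell, :info, [msg]} -> IO.puts("captured: #{msg}")
after
  1_000 -> IO.puts("nothing captured")
end
```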

So the last step is to set up a cron job to run the task once per month.
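Hypothetically, the crontab entry could look like this (the paths, schedule, and log location are illustrative, and the task name assumes a module called Mix.Tasks.AuditLogPurge):

```shell
# Run the purge task at 03:00 on the 1st of every month.
0 3 1 * * cd /srv/my_app && MIX_ENV=prod mix audit_log_purge >> /var/log/audit_log_purge.log 2>&1
```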

[1] Amazon Athena — A “sneak peak” and Store JSON logs on S3 for search and analytics using Amazon Athena