Crafting production-ready Backup as a Service solution using Akka Streams
Using Akka Streams to backup MongoDB to Amazon S3.
Edit: both parts of this article are up to date as of 16 Jan 2019.
Imagine you’re developing a database as a service solution like Compose or mLab. You’re hosting a horde of database instances/containers which are used by your customers. One of the key features of your solution is performing backups since no one is willing to lose data. You are already doing backups in form of e.g. volume snapshots, just in case of a disaster in your infrastructure (and you obviously test your automated recovery procedure frequently to prove it’s up to date ;)).
What if you wanted to let your customers do automatic backups of their data (or some selected part of it), so they could restore it on demand?
In this post, we’ll craft a production-ready backup as a service solution. We’ll see how to use Akka Streams in connection with Alpakka, which is a powerful open source library of connectors for Akka Streams, to stream the whole MongoDB collection to Amazon S3. What’s more, at the end we’ll enhance our stream with compression and encryption as we obviously want to guarantee the maximum safety with a minimal storage cost. And, of course, we’ll do all of this in Scala! Sounds cool? Read on!
Disclaimer: If you don’t have any experience with MongoDB, this article is still suitable for you.
The diagram below is a high-level overview of a stream we’re going to implement in a minute.
The stream will be constructed as follows: it’ll start from a MongoDB collection, in our case cookies. Then, there will be a logic that converts BSONs to JSONs, which will eventually end up being streamed to a file in Amazon S3 bucket. Now it’s time to see some code!
The snippet below presents the whole solution. It streams a MongoDB collection to a file named CookieCollectionBackup.json into the bucket bucketWithCookies on Amazon S3.
The code below uses 3 dependencies: Akka Streams, Alpakka S3 connector and MongoDB Reactive Streams driver.
I added a surplus amount of type annotations in order to help you better understand it just by looking at types.
- In the 1. line we start our stream with a Source of BSON documents. We do it by passing
collection.find(), which returns a
Publisherfrom Reactive Streams with BSON documents, to
Source.fromPublisherhelper. It accepts Reactive Streams publishers and creates Akka Streams source out of them.
- In the 2. line we transform each BSON document received from MongoDB to JSON* form which is a String. In order to achieve this, we use
toJsonmethod on BSON document.
- In the 3. line we transform our String with JSON inside to ByteString. That’s because it’s required by the Sink, in our case
- In the 4. line we want our stream to run with
s3Sink, which is defined in line 6. It writes to S3 using
In fact, it’s an Extended JSON. Since BSON has more types than plain JSON, MongoDB introduced an Extended JSON format. Read more about it in the MongoDB manual.
I highly encourage you to run the stream yourself. You can find the full runnable code in the Github repo: bszwej/mongodb-s3-stream-example.
The multipartUpload sink
Let’s take a closer look at our sink (
s3client.multipartUpload), which signature is defined as follows in Alpakka:
def multipartUpload(bucket: String,
contentType: ContentType = ContentTypes.`application/octet-stream`,
metaHeaders: MetaHeaders = MetaHeaders(Map()),
cannedAcl: CannedAcl = CannedAcl.Private,
chunkSize: Int = MinChunkSize,
chunkingParallelism: Int = 4): Sink[ByteString, Future[MultipartUploadResult]]
Apart from specifying the bucket and file name, it allows us to:
- Specify a chunk size, which is 5mb by default.
- Specify parallelism of sending chunks, which is 4 by default.
- Specify a content type and metadata.
If you’re curious about how parallel upload of chunks is realized, then take a look at S3Stream#createRequests.
When building a solution like backup, it’s highly likely you want to apply compression to the backups in order to optimize the cost of storage. In fact, Akka Streams has a few built-in flows, that allow compressing streams of ByteStrings. They can be found here: akka.stream.scaladsl.Compression. After adding a gzip compression to our stream, the code looks like the following:
As you might have noticed, in line 4. we make our stream go through a flow, which is a
Compression.gzip. It accepts and outputs ByteStrings performing the gzip compression.
We can distinguish 2 kinds of encryption when crafting a backup solution:
- In transit encryption. It’s applied when data is being sent/downloaded to/from S3. As far as S3 is concerned, data in transit can be secured by using either SSL or Amazon’s Client-Side encryption of which the latter is, sadly, not supported by Alpakka S3 connector. In this case, you can always rely on SSL, which is always enabled by default.
- At rest encryption. It’s applied when data is persisted on disk in one of Amazon’s data centers. It’s decrypted when you download the data. From the client’s perspective, there’s really no difference because it’s encrypted/decrypted internally by S3. It can be realized by Amazon’s Server-Side Encryption and can be used with Alpakka. You can easily do it just by specifying a header, namely
x-amz-server-side-encryption. In the snippet below we added such encryption.
In the line 8. we used
s3Client.multipartUploadWithHeaders sink and specified a
ServerSideEncryption header. Under the hood, it just adds a header:
x-amz-server-side-encryption: AES256. After running the stream, the file is encrypted which can be seen in the properties on S3.
If the above solution is enough for your project, then go with it. It’s the simplest one. But what if you wanted to encrypt the data in the stream before sending it to AWS? The data would be then encoded in transit as well as at rest. This topic is huge and is a good material for a separate blog post. In short, you can use this snippet made by @TimothyKlim.
So… that would be it! You could have noticed how easy it is to do integrations between systems using Akka Streams and Alpakka. One could argue, that a similar effect of streaming a large amount of data could be realized solely by actors. It’d be much more difficult since things like buffering, throttling or backpressure, which actors don’t provide by default, would come with an increased code as well as test complexity (btw. Collin Breck wrote an outstanding article about it).
To sum up, Akka Streams seems to perfectly fit for the problem we were trying to solve: doing on-demand backups for our customers. I’d also like to mention, that the solution shown in this post is working like a charm on production for months in one of the projects in my organization. Crafted with scalability and reliability in mind, it turned out to be very stable over the period of the last few months.
In the subsequent post, we’ll implement the restore part of our backup solution. Thanks for reading! Stay tuned and happy hAkking :)!