Cost-Effective Data Streaming

Kenneth Ostner
Published in Mynd Technology
7 min read · Aug 26, 2019

How we created a real-time reporting database from multiple sources using the cost-effective AWS services DMS and Lambda.

By Ken Ostner and Roman Laguta

Moving data is nothing new. However, as the focus on data escalates, so do the demands businesses place on that data. One such demand that is difficult to meet is real-time reporting. We are talking real-time as in within a couple of seconds. Not all companies can throw teams of data engineers and devops engineers at it. We couldn’t; we needed a high “bang for your buck” solution with the resources we had.

Specifically, here at Mynd we needed to find a solution to achieve the following:

  • Real-time data, within a few seconds of submitting a change in our application;
  • Easy to maintain with minimal resources;
  • Dynamic, so that our scrum teams can change their services without breaking downstream dependencies;
  • A single-source-of-truth reporting database;
  • A database capable of executing complex cross-schema queries within seconds.

At Mynd, we have built our own internal system, named Otto, on which we run all of our operations. Otto is built on a services architecture within AWS. Currently, we are streaming a mix of roughly 20 different Aurora MySQL and DynamoDB data stores into a single aggregated Aurora PostgreSQL database. Replication generally completes within milliseconds and, during a large spike of new or changing data, can take up to a few seconds. We achieve this by using AWS Database Migration Service (DMS) and Lambda functions.

Target: Aurora PostgreSQL

We chose Aurora PostgreSQL as our target database. The key reasons: Postgres can quickly ingest data and change existing data (the streamed data is not immutable), it performs complex queries across many different schemas quickly, and it handles data-warehouse-like workloads well. Additionally, our domains are complex and require many joins and analytic (window) functions across schemas to produce the reporting our operations need. Lastly, we needed a database that is performant and can scale; Aurora’s flavor of PostgreSQL adds some nice performance enhancements, and its managed replicas have made scaling easy for us. If you go with Aurora PostgreSQL, we recommend the following:

  • Keep your main (read/write) instance reserved for data replication, and use read-only replicas as the workhorses for your reporting/BI. This keeps replication as fast as possible, with the r/w instance devoted to replication: even if your reporting workloads spike, replication is unaffected.
  • Use functions, not views, on the target Aurora PostgreSQL database. The reason for this is dependency chains and how PostgreSQL handles them. To keep this solution dynamic and low-maintenance, we wanted our schema to change as the source schemas changed. Views hold onto column dependencies, but functions don’t. Because of this, and because functions can be parameterized, we’ve written all of our reporting logic as functions so that the schema can stay in sync without additional overhead (a sketch follows this list).
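
To make the functions-over-views point concrete, here is a minimal sketch of a parameterized reporting function created from Python. The reporting schema, the units_service.units table, the column names, and the psycopg2 connection string are all hypothetical; the point is that, unlike a view, the function does not pin the referenced columns with dependencies, so DMS can keep altering the source-synced tables underneath it.

```python
import psycopg2  # assumed Postgres driver

REPORT_FUNCTION = """
CREATE OR REPLACE FUNCTION reporting.vacant_units(p_as_of date)
RETURNS TABLE (unit_id bigint, status text, days_vacant integer)
LANGUAGE sql STABLE AS $$
    SELECT u.id,
           u.status,
           (p_as_of - u.vacated_at::date)::integer
    FROM units_service.units AS u
    WHERE u.status = 'vacant';
$$;
"""

conn = psycopg2.connect("host=reporting-db dbname=reporting user=bi")  # placeholder DSN

with conn, conn.cursor() as cur:
    # Unlike CREATE VIEW, this does not register column-level dependencies,
    # so the source-synced schema can keep changing underneath it.
    cur.execute(REPORT_FUNCTION)

# Callers parameterize the report instead of filtering a view:
#   SELECT * FROM reporting.vacant_units('2019-08-26');
```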

Note: Aurora PostgreSQL isn’t being proposed as an all-in-one data warehouse. Focus on streaming the data that is time-sensitive, and move big data, data lake, and batched analytic workloads to the more specialized data warehouses (Redshift, BigQuery, Snowflake, etc.). Those workloads can typically tolerate delayed data, and they solve a different problem.

Structured Source Data: Database Migration Service

For our service applications that use Aurora MySQL, we leverage AWS Database Migration Service (DMS). DMS can be used for a one-time migration from A to B, where A and B can be the same or different database engines, and it can also be used for continuous migrations (streams). DMS works with many different types of structured databases; we have used it for other flavors of SQL migration outside of this specific use case with similar success.

Configuration is relatively simple:

  1. Configure the source and target endpoints with a username/password.
  2. Create a replication instance, which needs to be within the same VPC as your source/target. Bigger instances typically translate to more migration capacity and speed.
  3. Create a replication task. For this purpose, you want full migration + ongoing replication (CDC).
  4. An additional setting to look at is ‘Target table preparation mode’. If you don’t have locking dependencies, go with the drop-and-recreate option so that restarts keep everything up to date.
  5. If you have sensitive data (PII or PCI, for example), exclude it in the DMS configuration and don’t take on the extra overhead of securing that data downstream; leave it in your application, which handles it already (hopefully). A boto3 sketch of steps 3–5 follows this list.
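
For reference, steps 3–5 can also be scripted rather than clicked through the console. Below is a rough boto3 sketch under stated assumptions: the ARNs, the leases schema, and the excluded ssn column are placeholders, not our actual configuration.

```python
import json
import boto3

dms = boto3.client("dms")

table_mappings = {
    "rules": [
        {   # replicate every table in the service's schema
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-leases-schema",
            "object-locator": {"schema-name": "leases", "table-name": "%"},
            "rule-action": "include",
        },
        {   # leave sensitive data behind at the source
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "drop-ssn",
            "rule-target": "column",
            "object-locator": {"schema-name": "leases", "table-name": "tenants", "column-name": "ssn"},
            "rule-action": "remove-column",
        },
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="leases-to-reporting",
    SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",      # placeholder
    TargetEndpointArn="arn:aws:dms:...:endpoint:TARGET",      # placeholder
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",    # placeholder
    MigrationType="full-load-and-cdc",                        # full migration + ongoing replication
    TableMappings=json.dumps(table_mappings),
    ReplicationTaskSettings=json.dumps({
        "TargetMetadata": {"TargetSchema": "leases"},
        "FullLoadSettings": {"TargetTablePrepMode": "DROP_AND_CREATE"},  # 'Target table preparation mode'
    }),
)
```

MigrationType="full-load-and-cdc" corresponds to the full migration + ongoing replication choice in step 3.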

These features of DMS have made it a great tool for us. As our source changes, our target stays in sync without our having to maintain additional migrations, and we leave the sensitive data behind since it isn’t needed for general reporting.

For the times when DMS does have a problem, you need monitoring. Since DMS is an AWS service, it naturally integrates well with CloudWatch. For the metrics that aren’t available in CloudWatch, we augment the logging with Datadog, using Python to pull DMS stats through the CLI. Here are some helpful configurations and monitors we’ve found:

  • Try to avoid having dependencies on your target tables, or DMS won’t be able to make changes because the target won’t allow them. This breaks the dynamic part of the solution and then requires going in and manually making updates to get back in sync.
  • Additional task settings are available through the AWS CLI by passing a JSON config; not everything is exposed in the console UI.
  • Watch the ‘cdclatency’ metrics. They tell you the lag between source and target. Set an alert if the latency goes above your desired threshold and then look into what’s slowing it down.
  • Watch the ‘cdcchanges’ metrics. They tell you how many changes are pending. Typically, if you get a ‘cdclatency’ alert, your change volume is high or even spiking. If this happens often, look into upgrading your replication instance.
  • There is a lot of data you can get from the raw service logs and task statistics. Experiment with polling these and creating custom alerts if you need them; a polling sketch follows this list.
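
As an illustration of that polling, here is a rough Python sketch of the kind of checks we run. The task ARN, the identifiers, the threshold, and the Datadog shipping (left as a comment) are placeholders rather than our production code.

```python
from datetime import datetime, timedelta
import boto3

dms = boto3.client("dms")
cw = boto3.client("cloudwatch")

TASK_ARN = "arn:aws:dms:...:task:LEASES"   # placeholder
LATENCY_ALERT_SECONDS = 30                 # placeholder threshold

# Per-table counters (inserts, updates, deletes) straight from DMS.
stats = dms.describe_table_statistics(ReplicationTaskArn=TASK_ARN)
for table in stats["TableStatistics"]:
    print(table["SchemaName"], table["TableName"], table["Inserts"], table["Updates"], table["Deletes"])
    # e.g. forward to Datadog here: statsd.gauge("dms.updates", table["Updates"], tags=[...])

# CDC latency between source capture and target apply, from CloudWatch.
latency = cw.get_metric_statistics(
    Namespace="AWS/DMS",
    MetricName="CDCLatencyTarget",
    Dimensions=[
        {"Name": "ReplicationTaskIdentifier", "Value": "leases-to-reporting"},       # placeholder
        {"Name": "ReplicationInstanceIdentifier", "Value": "reporting-repl-instance"},  # placeholder
    ],
    StartTime=datetime.utcnow() - timedelta(minutes=5),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Maximum"],
)
worst = max((p["Maximum"] for p in latency["Datapoints"]), default=0)
if worst > LATENCY_ALERT_SECONDS:
    print(f"CDC latency {worst}s; investigate what is slowing replication down")
```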

Unstructured Source Data: Lambda

We like DMS; however, it does not handle unstructured data stores like DynamoDB. Lucky for us, we have data stored in DynamoDB that needs to be queried in real time alongside our structured data. We looked for a tool similar to DMS that would sync our DynamoDB data stores into our aggregated Aurora PostgreSQL reporting database in real time but didn’t find anything suitable, so we built our own lightweight tool: Lambda functions that move data from DynamoDB into our target.

DynamoDB, like many NoSQL databases, has an event stream. Lambda can subscribe to these streams and trigger a process that takes all of the table events and moves them into our target database. The stream data received from DynamoDB is ordered, comprehensive (it can carry the full item data), and tags each record as an INSERT, MODIFY (update), or REMOVE (delete) event. You can think of it as a very stripped-down version of Kafka. A handler skeleton is sketched below.
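
To give a feel for what Lambda receives, here is a minimal handler skeleton. It assumes the stream is configured to include new images; the upsert_into_postgres and delete_from_postgres helpers are hypothetical, and one possible shape for them is sketched later in this post.

```python
def handler(event, context):
    for record in event["Records"]:
        action = record["eventName"]        # INSERT | MODIFY | REMOVE
        data = record["dynamodb"]
        if action in ("INSERT", "MODIFY"):
            # Full item after the change, in DynamoDB attribute-value format
            # (e.g. {"id": {"S": "unit-42"}, "status": {"S": "vacant"}}).
            upsert_into_postgres(data["NewImage"])   # hypothetical helper
        else:  # REMOVE: only the key is needed
            delete_from_postgres(data["Keys"])       # hypothetical helper
```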

Here’s some of the logic we used for this (which should work generically for most NoSQL to SQL streaming):

  • For INSERT and MODIFY (update) events we receive the event type and the full item. For REMOVE (delete) events we only use the id.
  • Collect any new fields and compare them with the target table’s columns. If there are more, add them.
  • Based on the above, we know which statement is needed and can dynamically create the SQL (DDL for new columns, DML for the data) to execute against the target; a sketch follows this list.
  • As an additional safeguard, we save these events to S3 as an incremental backup that can be replayed over a given range of time if ever needed.
  • Data type mapping is needed; however, as a generic fallback one can use VARCHAR(255).
  • We found that if we later change a column’s data type on the target to something else, for example INTEGER, and the data fits, then the load from Lambda continues to work.
  • Since this only covers ongoing replication, an additional script is needed to do a full load from scratch. That one is simple, and almost any plain NoSQL-dump-to-SQL conversion will work. The main thing to note is to do a full table scan of your NoSQL store first so that you create your SQL schema with all fields.
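
Here is a rough sketch of how those helpers could look, assuming psycopg2, a target table keyed on an id column, and illustrative schema/table names; real code needs proper identifier quoting, batching, and type mapping.

```python
import psycopg2

conn = psycopg2.connect("host=reporting-db dbname=reporting user=streamer")  # placeholder DSN

def scalar(attr):
    """Unwrap a DynamoDB attribute value like {'S': 'x'} or {'N': '3'}."""
    return next(iter(attr.values()))

def upsert_into_postgres(item, table="dynamo.units"):
    schema, name = table.split(".")
    row = {k: scalar(v) for k, v in item.items()}
    with conn, conn.cursor() as cur:
        # 1. Schema sync: add any columns the target has not seen yet.
        cur.execute(
            "SELECT column_name FROM information_schema.columns "
            "WHERE table_schema = %s AND table_name = %s",
            (schema, name),
        )
        existing = {c for (c,) in cur.fetchall()}
        for col in row.keys() - existing:
            # VARCHAR(255) as the generic fallback type; tighten later if needed.
            cur.execute(f'ALTER TABLE {table} ADD COLUMN "{col}" VARCHAR(255)')

        # 2. Upsert keyed on the DynamoDB hash key (assumed to be "id").
        cols = list(row)
        col_list = ", ".join(f'"{c}"' for c in cols)
        placeholders = ", ".join(["%s"] * len(cols))
        updates = ", ".join(f'"{c}" = EXCLUDED."{c}"' for c in cols if c != "id")
        cur.execute(
            f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
            f"ON CONFLICT (id) DO UPDATE SET {updates}",
            [row[c] for c in cols],
        )

def delete_from_postgres(keys, table="dynamo.units"):
    with conn, conn.cursor() as cur:
        cur.execute(f"DELETE FROM {table} WHERE id = %s", [scalar(keys["id"])])
```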

Just like with DMS, this needs monitoring to ensure your data is up to date and the homegrown solution is working. Since we are using Lambda, it again integrates well with CloudWatch and Datadog. These are the two metrics we watch most carefully:

  • ‘IteratorAge’ tells us the latency: as it grows, so does the lag between data syncs.
  • The error count (‘Errors’ in CloudWatch) tells us how many invocations failed. Don’t be afraid if there are some errors; Lambda will retry and usually takes care of itself. However, if this number continues to grow along with ‘IteratorAge’, then something is most likely broken. An example alarm is sketched below.
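
For completeness, here is a sketch of wiring an ‘IteratorAge’ alarm directly in CloudWatch with boto3; the function name, threshold, and SNS topic are placeholders, and the equivalent monitor can just as easily be built in Datadog.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="dynamo-stream-sync-iterator-age",
    Namespace="AWS/Lambda",
    MetricName="IteratorAge",
    Dimensions=[{"Name": "FunctionName", "Value": "dynamo-stream-sync"}],  # placeholder function
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=60_000,                       # milliseconds the oldest record has waited
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:...:data-eng-alerts"],  # placeholder SNS topic
)
```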

With both DMS and Lambda in place, we have a real-time single source of truth for reporting with minimal ongoing effort or cost. Additionally, we can granularly tune DMS and Lambda for speed vs. cost when needed. The schemas are updated automatically, and the only coordination between teams is a notification when SQL functions need to be updated for new logic. If a new data point is needed for a report, an engineer typically only needs to add one line of code to the SQL function because the data is already available. Currently, we have multiple services using DMS with a latency of milliseconds and Lambda syncs with a latency of 1 to 3 seconds. This low-cost, low-maintenance solution is performing well for us, with indications that it will continue to scale.

Ken Ostner is the VP of Business Intelligence at Mynd Property Management. Contact him at ken@mynd.co.
