Seven Lessons We Learned — Building A Real-Time Machine Learning Pipeline In AWS

Matthew Cloney
Published in SingleStone
Sep 19, 2019

We created a serverless streaming data machine learning pipeline for one of our clients, and we learned quite a bit.

Amazon’s data services have evolved to the point that they’re making database vendors nervous. Where companies previously made big investments in relational database systems like Oracle and SQL Server, today they are not as tied to these expensive, proprietary systems as they once were. Amazon’s Database Migration Service (DMS) lets data teams migrate existing databases from on-premises to the cloud, from other cloud vendors (e.g., Azure) to AWS, and even from one database vendor to another within the cloud.

A lesser-known feature of DMS is the ability to combine it with a database feature called change data capture (CDC) to stream changes from source databases to S3. This is extremely powerful, especially when combined with services like AWS Lambda that can feed real-time machine learning models hosted on platforms like Amazon SageMaker. With a database hosted on Amazon’s Relational Database Service (RDS), DMS, Lambda functions, and SageMaker, you can have a completely serverless, scalable, resilient, real-time machine learning pipeline at an attractive price point. In our case, a mid-range DMS instance (r4.large) was easily powerful enough to handle tens of thousands of streaming transactions per hour, and cost under $300 per month including a standby instance for high availability.

DMS monitors relational databases for changes and can write events to S3. Lambda functions can monitor S3 and send data to Amazon SageMaker, which can produce real-time predictions used to optimize the customer experience.
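To make that last hop concrete, here is a minimal sketch of calling a SageMaker real-time endpoint from Python with boto3. The endpoint name, CSV payload format, and feature handling are illustrative assumptions, not details of our client’s system.

```python
import json
import boto3

# Real-time SageMaker inference goes through the sagemaker-runtime API.
runtime = boto3.client("sagemaker-runtime")

def predict(features):
    """Score one record against a deployed SageMaker endpoint.

    The endpoint name and CSV payload here are hypothetical; use whatever
    name and content type your model was deployed with.
    """
    response = runtime.invoke_endpoint(
        EndpointName="churn-predictor",  # hypothetical endpoint name
        ContentType="text/csv",          # must match what the model container expects
        Body=",".join(str(f) for f in features),
    )
    return json.loads(response["Body"].read())
```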

Database Migration Service (DMS) Concepts:

  • Subnet groups — a subnet group is necessary to set up your DMS replication instance. The path of least resistance is to put your replication instance in the same VPC as your source and target endpoints, but it’s possible to have your source endpoint in a different VPC, or even on-premises. To set up DMS for high availability, you’ll need to select at least two subnets, each in a different Availability Zone.
  • Endpoints — You’ll need a source and a target endpoint for each replication task.
  • Replication instances — a replication instance is essentially a managed VM, similar to an RDS instance in that you can’t access the underlying OS the way you can with EC2 instances or EMR cluster nodes. In a high-availability setup, it runs in an active/passive configuration.
  • Replication tasks — a task is essentially a source endpoint, a destination endpoint, and a replication method of either Full load plus CDC or CDC only. Both methods are explained in more detail in lesson 5 below, and the sketch after this list shows how a task is defined.
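Here is a hedged boto3 sketch of wiring those pieces together: a source endpoint, a target endpoint, and a replication instance combined into a “Full load plus CDC” task. The ARNs, task name, and the “sales” schema in the table mapping are placeholders.

```python
import json
import boto3

dms = boto3.client("dms")

# A table-mapping rule telling DMS which schemas/tables to replicate.
# This one selects every table in a hypothetical "sales" schema.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-sales-schema",
        "object-locator": {"schema-name": "sales", "table-name": "%"},
        "rule-action": "include",
    }]
}

# A task ties a source endpoint, a target endpoint, and a replication
# instance together. The ARNs below are placeholders.
dms.create_replication_task(
    ReplicationTaskIdentifier="sales-to-s3",
    SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",    # placeholder
    TargetEndpointArn="arn:aws:dms:...:endpoint:TARGET",    # placeholder
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",  # placeholder
    MigrationType="full-load-and-cdc",  # "Full load plus CDC"; use "cdc" for CDC only
    TableMappings=json.dumps(table_mappings),
)
```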

Once your replication task is associated with a source and a target endpoint, you can use DMS to stream insert, update, and delete events to a variety of targets. In our setup, we’re using S3 as the target endpoint, then using a Lambda file watcher (sketched below) to copy that data into Kinesis Data Streams. From there, the data can be read by multiple subscribers to feed multiple machine learning algorithms in real time. Use cases include automatic price adjustments and individually tailored product suggestions for an online store. Another possibility is streaming updated call center notes, or even using speech-to-text services like Amazon Transcribe to send data to sentiment analysis algorithms that predict customer churn, so a business can intervene while the caller is still on the line. As you can imagine, this could have a significant effect on curtailing customer turnover, and it can help a business be truly data-driven in ways it never could before.
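As an illustration of the file-watcher step, here is a minimal sketch of a Lambda handler triggered by S3 object-created events that forwards each new DMS output file’s rows to a Kinesis data stream. The stream name and the one-record-per-line assumption are ours, not a fixed part of the architecture.

```python
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

STREAM_NAME = "cdc-events"  # hypothetical Kinesis Data Stream

def handler(event, context):
    """Triggered by S3 ObjectCreated events on the DMS target bucket.

    Assumes DMS is writing one CSV row per change record, which is its
    default S3 output format.
    """
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the file DMS just wrote and fan its rows out to Kinesis.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        rows = [line for line in body.decode("utf-8").splitlines() if line]

        # A production version would batch in chunks of at most 500 records,
        # Kinesis's per-call limit for put_records.
        kinesis.put_records(
            StreamName=STREAM_NAME,
            Records=[
                {"Data": row.encode("utf-8"), "PartitionKey": key}
                for row in rows
            ],
        )
```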

Here are seven lessons learned while working with DMS:

  1. Only certain versions of each vendor database support CDC. Before you start down this path, be sure that the source database supports this feature. See Sources for Data Migration for information on your source system.
  2. While Amazon details a procedure to use a non-master RDS user account for the source endpoint, our team was only able to use the master account to create both the full table loads and the incremental CDC data.
  3. There are other vendor-specific considerations when using a database as a DMS source. For instance, with PostgreSQL, the source database’s storage can fill up if there is no activity for an extended period, because PostgreSQL retains write-ahead log (WAL) files until the replication slot advances. The workaround is to add a “heartbeat” to the source endpoint by adding HeartbeatEnable=true;HeartbeatFrequency=1; to the endpoint’s “Extra Connection Attributes.” This prevents the dreaded “disk full” scenario that keeps database administrators up at night.
  4. Starting DMS requires some parameter changes on your source database; for an RDS instance, these changes are made in its associated database parameter group. Once the changes are made, an RDS instance reboot is also required, so you’ll want to do this in a maintenance window.
  5. There are two ways to start DMS replication: resume and restart. When getting started, use “restart” if you want to ensure a full table load is created; this is known as “Full load plus CDC.” If you have to stop DMS for any reason, you’ll want to do a “resume,” also referred to as “CDC only.” A resume picks up replication where it left off and does not create another full load. If you accidentally do a “restart” instead of a “resume,” any changes captured between stopping the task and restarting it will be lost, and you’ll get an additional full load file for each table defined in your task’s table mappings.
  6. When using S3 as a target endpoint, you’ll need to add addColumnName=true to the target endpoint’s extra connection attributes if you want DMS to write the column names into the files it creates from your source table data (see the sketch after this list).
  7. DMS can scale up, but at this time it cannot scale out. While you can move a replication instance to a more powerful instance class, you cannot have more than a single DMS instance serving each replication task.
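Several of these lessons boil down to specific API settings. The sketch below shows, with placeholder ARNs, how the heartbeat attributes (lesson 3), the resume-versus-restart choice (lesson 5), and the addColumnName setting (lesson 6) look in boto3.

```python
import boto3

dms = boto3.client("dms")

# Lesson 3: add a heartbeat to a PostgreSQL source endpoint so an idle
# database doesn't accumulate WAL and fill its storage.
dms.modify_endpoint(
    EndpointArn="arn:aws:dms:...:endpoint:SOURCE",  # placeholder
    ExtraConnectionAttributes="HeartbeatEnable=true;HeartbeatFrequency=1;",
)

# Lesson 6: ask DMS to write column names into the files it drops on S3.
dms.modify_endpoint(
    EndpointArn="arn:aws:dms:...:endpoint:TARGET",  # placeholder
    ExtraConnectionAttributes="addColumnName=true;",
)

# Lesson 5: "restart" forces a fresh full load plus CDC, while "resume"
# picks up CDC where the task left off without reloading the tables.
dms.start_replication_task(
    ReplicationTaskArn="arn:aws:dms:...:task:TASK",  # placeholder
    StartReplicationTaskType="resume-processing",    # or "reload-target" to restart
)
```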

While Database Migration Service has some limitations and peculiarities, overall it’s a great way to stream changed data from an operational source system. In our case, we sent this data to a machine learning algorithm using S3, Lambda functions, and Kinesis Data Streams, but this architecture could also serve scenarios such as real-time backup and system synchronization.
