Exploring the Limitations: Challenges While Migrating an AWS CDK Application With Thousands of IoT Devices

Çağrı
Published in PostNL Engineering
7 min read · Jun 21, 2023

PostNL is a modern logistics company that processes more than a billion IoT messages per day, relying on AWS IoT and fully serverless computing to manage its IoT devices. Thousands of IoT devices are connected to AWS IoT and send an enormous amount of data to the AWS platform via the MQTT protocol. Based on this IoT data, different types of events are generated in real time and used for better forecasting and planning, making our logistics more flexible and sustainable.

Connecting objects from the physical world to the cloud also comes with its own set of challenges. Even though many things can be managed with proper monitoring and observability, there are times when things go wrong unexpectedly. As my favorite quote from Amazon’s Chief Technology Officer Werner Vogels puts it: “Everything fails, all the time.” In this blog post, we will go over some of the challenges we faced while migrating a CDK application with thousands of IoT devices.

How are we using AWS IoT?

AWS IoT Core is a service that enables communication between IoT devices via the cloud environment. It provides us with tools and services to connect, manage, and interact with IoT devices at scale. A “thing” represents a physical or virtual device that can connect to the internet and send or receive data. In our use case, each device needs to be registered in AWS IoT Core with an X.509 certificate. The necessary permissions, such as the “iot:Publish” or “iot:Connect” actions, are granted by attaching policies to those certificates. The following flow diagram illustrates how we are connecting the physical world to our AWS infrastructure:

Having a secure IoT infrastructure

Thousands of devices moving around in the physical world bring their own set of challenges. Since those devices are heavily used, there are times when they are out of service, temporarily or permanently. Some of them may be broken or under maintenance, or may lose their certificates due to unexpected situations. Devices may need to refresh their certificates not only for such external reasons but also because of changes you apply to your infrastructure. When a device certificate is refreshed, any existing certificates for that thing must also be revoked to reduce the attack surface. Last but not least, since we all know that “Everything fails, all the time”, incidents or failures can prevent us from revoking those certificates. In any case, as soon as a device certificate becomes idle for some reason, it is good practice to revoke or delete it.

Our challenge

In addition to such cases, we faced another situation that forced us to clean up our AWS IoT Core certificates. The PostNL IoT-Platform is a large system that consists of different microservices built on a serverless architecture. The application is modeled and provisioned as infrastructure as code using AWS CDK. Recently, due to an organizational change, we had to migrate our existing CDK application to a new one. Even though this looks like a pretty straightforward process, having thousands of IoT devices connected to the platform brought additional challenges. The following simplified architecture shows part of the device management module of the IoT-Platform:

The IoT-Platform uses its own private Certificate Authority, from which all thing certificates are generated. When the stack is deployed for the first time, a Lambda-backed custom resource is triggered to register the CA certificate in AWS IoT Core as a Certificate Authority, which is then used to validate all device certificates. When we created a new stack for those resources, everything worked as expected. The problem was not creating the new stack, but deleting the old one. The same Lambda-backed custom resource is also triggered during stack deletion to clean up the existing certificates in IoT Core. Let’s look at the required API calls to delete a certificate that is attached to a device:

Deleting a single certificate takes at least 6 API requests to AWS IoT Core, on top of the requests needed to list all certificates first. While running our end-to-end tests before going to production, we realized that this would take a significant amount of time for thousands of certificates and result in a timeout due to the maximum execution duration of Lambda-backed custom resources.
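As a rough sketch of that per-certificate sequence, assuming the boto3 IoT client and a certificate attached to exactly one thing and one policy, the calls could look like the following. The function name and the exact ordering are illustrative assumptions, not the platform's actual custom resource code, and pagination of the two list calls is left out for brevity.

```python
def delete_device_certificate(iot, certificate_arn: str, certificate_id: str) -> int:
    """Detach and delete one certificate (hypothetical helper).

    `iot` is expected to behave like boto3.client("iot").
    Returns the number of AWS IoT Core API requests that were made.
    """
    requests_made = 0

    # 1. Find the things this certificate (the principal) is attached to.
    things = iot.list_principal_things(principal=certificate_arn)["things"]
    requests_made += 1

    # 2. Detach the certificate from each thing.
    for thing_name in things:
        iot.detach_thing_principal(thingName=thing_name, principal=certificate_arn)
        requests_made += 1

    # 3. Find the policies attached to the certificate.
    policies = iot.list_attached_policies(target=certificate_arn)["policies"]
    requests_made += 1

    # 4. Detach each policy.
    for policy in policies:
        iot.detach_policy(policyName=policy["policyName"], target=certificate_arn)
        requests_made += 1

    # 5. A certificate must not be ACTIVE when it is deleted.
    iot.update_certificate(certificateId=certificate_id, newStatus="INACTIVE")
    requests_made += 1

    # 6. Finally delete the certificate itself.
    iot.delete_certificate(certificateId=certificate_id)
    requests_made += 1

    return requests_made

# Possible usage (requires AWS credentials and boto3):
#   import boto3
#   iot = boto3.client("iot")
#   delete_device_certificate(iot, cert["certificateArn"], cert["certificateId"])
```

With one thing and one policy attached, this adds up to six requests per certificate, and each extra attachment adds one more.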

Can we use parallel programming?

The first solution that comes to mind is parallel programming, which could work well as long as no existing production workload uses the same APIs. However, we are also aware of the AWS IoT Core API throttling limits, which are very low for some APIs. For example, by default the DeleteCertificate API is limited to 10 transactions per second (TPS) and the DetachPolicy API to 15 TPS, and our existing production workload draws on those same limits.
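A back-of-the-envelope estimate shows why parallelism alone cannot beat the quota: however many workers run, the slowest API caps the overall rate. The sketch below assumes one DeleteCertificate call per certificate; the certificate count and the share of the quota left over for the cleanup are hypothetical numbers, while 15 minutes is Lambda's maximum function timeout.

```python
LAMBDA_MAX_TIMEOUT_S = 15 * 60  # maximum timeout of a single Lambda invocation


def min_cleanup_seconds(cert_count: int, tps_limit: float,
                        share_for_cleanup: float = 1.0) -> float:
    """Lower bound on cleanup time imposed by one throttled API.

    share_for_cleanup is the (assumed) fraction of the quota that is not
    needed by the production workload using the same API.
    """
    if not 0 < share_for_cleanup <= 1:
        raise ValueError("share_for_cleanup must be in (0, 1]")
    return cert_count / (tps_limit * share_for_cleanup)


if __name__ == "__main__":
    # Hypothetical: 20,000 certificates, DeleteCertificate at 10 TPS,
    # half of the quota reserved for production traffic.
    estimate = min_cleanup_seconds(20_000, tps_limit=10, share_for_cleanup=0.5)
    print(f"at least {estimate:.0f} s, Lambda allows {LAMBDA_MAX_TIMEOUT_S} s")
```

Under these assumed numbers the lower bound is 4,000 seconds, more than four times what a single Lambda invocation allows, regardless of how many threads issue the requests.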

Let’s try to increase those API limits!

Almost all AWS IoT Core quotas are adjustable, and you can request a quota increase via AWS Service Quotas. As a first step, we created several support tickets to increase those quotas and prevent throttling during parallel execution. For example, we asked for the DeleteCertificate API to be raised to 100 TPS, with similar values for the other APIs. However, we quickly realized that adjusting IoT Core limits was not easy, since the service team came back with additional questions. Although we provided more details, we could only increase the limits up to a certain level: according to the AWS service team, some of the values we requested were beyond safe limits, an assessment we later agreed with.

Suggestions from AWS Support

While working on the PostNL IoT Platform, we frequently contact AWS Support for assistance whenever we have questions, issues, or concerns. This was also the case when we encountered this problem. We provided a detailed explanation of the issue and asked whether they could assist us. The suggested solution was to run a console application on an EC2 instance or in a Fargate container. However, as we aim for a 100% serverless architecture at PostNL, running an EC2 instance for a one-time operation was not an option. While a Fargate container could have resolved our issue, we did not want to spend time setting one up for a one-time operation.

Solution

Ideally, we prefer never to make manual changes from the AWS console or run any code against our production account from outside the AWS environment unless there is a very good reason. In this case, we knew that it was a one-time operation, that the certificates to be deleted belonged to the old application, and that deleting them from a remote machine would not cause any problems. After analyzing the problem and considering the possible solutions, we opted to run a console application on a remote machine in a secure environment. Running the console application for a few hours deleted thousands of redundant certificates, and we deleted the old AWS CloudFormation stack once there were no certificates left.
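The post does not include the console application itself, so the following is only a minimal sketch of what such a one-off script could look like, assuming the boto3 IoT client. The names `iter_certificates`, `run_cleanup`, `delete_one`, and `should_delete` are hypothetical. Certificates are listed up front and then processed one at a time at a fixed pace, so that each IoT Core API sees fewer requests per second than its quota allows.

```python
import time
from typing import Callable, Iterator


def iter_certificates(iot) -> Iterator[dict]:
    """Yield every certificate summary, following ListCertificates markers."""
    kwargs = {}
    while True:
        page = iot.list_certificates(**kwargs)
        yield from page.get("certificates", [])
        marker = page.get("nextMarker")
        if not marker:
            return
        kwargs["marker"] = marker


def run_cleanup(iot,
                delete_one: Callable[[object, str, str], object],
                max_certs_per_second: float,
                should_delete: Callable[[dict], bool] = lambda cert: True,
                sleep: Callable[[float], None] = time.sleep,
                clock: Callable[[], float] = time.monotonic) -> int:
    """Delete matching certificates sequentially, paced below the quota.

    delete_one(iot, certificate_arn, certificate_id) performs the detach and
    delete calls for a single certificate. should_delete is where a real run
    would restrict the cleanup to certificates of the old application.
    """
    # Collect first, so deletions do not interfere with pagination.
    certificates = [c for c in iter_certificates(iot) if should_delete(c)]
    min_interval = 1.0 / max_certs_per_second
    deleted = 0
    for cert in certificates:
        started = clock()
        delete_one(iot, cert["certificateArn"], cert["certificateId"])
        deleted += 1
        # Sleep away the rest of the time slot for this certificate.
        remaining = min_interval - (clock() - started)
        if remaining > 0:
            sleep(remaining)
    return deleted

# Possible usage (requires AWS credentials and boto3):
#   import boto3
#   run_cleanup(boto3.client("iot"), my_delete_function, max_certs_per_second=5)
```

Pacing per certificate rather than per request is a deliberate simplification: when every certificate triggers roughly one call to each API, staying below the lowest per-API limit leaves headroom for the production workload. Unlike a custom resource, a script like this has no execution time limit to run into, which is what makes a run of a few hours possible.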

Conclusion

The journey of migrating an AWS CDK application can be fraught with challenges, particularly when it comes to the low API quotas of AWS IoT Core and the maximum execution duration of Lambda functions.

The low API quotas of AWS IoT Core pose constraints on the number of requests that can be made within a specific time frame. This limitation can hinder the scalability and responsiveness of IoT applications, making it crucial for developers to carefully consider and optimize their API usage.

The maximum execution duration of Lambda functions imposes a time constraint on the execution of serverless code. This limitation can be a major stumbling block when migrating applications that rely on complex or time-consuming operations. Developers must thoroughly analyze their Lambda functions and consider potential optimizations or architectural adjustments to fit within the execution time limits.

The combined impact of these limitations can make migrating an AWS CDK application that manages IoT devices a daunting task. Working within restricted API quotas while keeping Lambda functions within their execution time limits requires careful planning, optimization, and potentially rethinking the architecture of the application.

In conclusion, migrating an existing AWS CDK application to a new one can be challenging due to the low API quotas of AWS IoT Core and the maximum execution duration of Lambda functions. However, with careful planning, strategic optimization, and a comprehensive understanding of the constraints, developers can successfully navigate these challenges.
