Building A Serverless IoT FinTech App with AWS and NodeJS

Building an AWS native, Node.js, serverless, event-driven system to back up an IOT device capable of real time messaging and events.

Joshua Toth
Feb 10, 2019 · 15 min read

In August last year I had the opportunity to work on a greenfield project within the financial sector. The pitch to me was: An AWS native, Node.js, serverless, event-driven system to back up an IOT device capable of real time messaging and events. I was very intrigued.


  • Lambda (Node.js) For both the API’s and event handlers.
  • DynamoDB
  • SNS + SQS for eventing
  • API Gateway (Authorised, Unauthorised and several 3rd party restricted APIs for integrations)
  • Cognito (Security) With both ‘User Pools’ and ‘Federated Identities’
  • SES for emails

On the Build/Ci side of things:

  • Terraform
  • Travis
  • Github
  • Swaggerhub

We also utilized Swaggerhub for our API documentation along with more detailed information on Confluence.

Lambda and Node.js

When working on new services, a massively underestimated part of the project is the initial skeleton end-to-end standup. This step tends to blow out as unknowns become known and limitations are discovered. This project was somewhat an anomaly; the system just worked. It was a very quick process to get the initial infrastructure coded up in Terraform and deployed with a simple ‘hello world’ message to and from the DB through API Gateway.

There were some teething issues with specific security permissions but once the debugging was in place it was fairly straightforward. There is a lot of documentation available around what needs to be done and the whole serverless ecosystem has evolved into quite a business-ready, mature product.

Node.js lessons

Having a shared DB + services library available for the projects to use would have worked a lot better and we did have code in place to make that migration (but never got implemented before MVP).

The library we used to manage our API lambdas was aws-serverless-express. Express itself is a super easy library to use with a tonne of middleware plugins available. Creating new routes was a breeze and it’s super easy to understand for people approaching the project. The library integrates seamlessly with API gateways {proxy} feature and we had no issues with getting it setup and in use. The library also supports the development of bespoke middleware which we took advantage of for routing.

Test all the things! Our code had close to 100% unit test coverage, both red and green pathing. I’m a big advocate for TDD and BDD. Being able to exercise both even on a project like this was fantastic. All the developers had very high confidence in the code that was produced and the end result was something you could be proud of. We used Jest for our testing.

Setting standards at the beginning of the project is a big must. It’s an easy trap to fall into when your spike code becomes production code and things are forgotten. A mistake we made when approaching this was not setting error response standards across the API’s from the beginning. We had multiple API’s being developed simultaneously and it was difficult to keep up with all the different error response formats. We ended up utilising the boom package. Once that was in place we had stable error responses and all the integrations were a lot smoother.

ESlint is a great way to keep your project neat and tidy. We were targeting node 8.10 as that’s what Lambda supported at the time with 2016 support. I recommend adding the additional ‘experimentalObjectRestSpread’ spread operator feature here. It’s supported by 8.10 although ESlint will yell at you for using it otherwise. (The spread operator is amazing)

Function lambdas that weren’t APIs were a great way of consuming and emitting events throughout the system. Small processes such as sending an email because x happened or completing a workflow because y notification came in were written and deployed in no time. In almost all cases it was just a handler function and a state machine with one purpose.


Services publish to an SNS topic and then that topic is subscribed to by one or more SQS queues with a Lambda attached. The lambda then consumes the event and does one of three things:

  1. Nothing, consumes and actions the event.
  2. The event isn’t ready to be consumed so it’s placed back on the queue (In this case the 3rd party service could have a delay of several minutes to complete its action)
  3. The event errors and then is reprocessed. We used a deadletter threshold of 10 before events were placed into a separate deadletter queue, causing an alarm to go off.

We had no issues with this method of eventing and it actually really helped to create the distinctions between services.

The shared libraries point comes across strongly with these events as well. Having a well defined event contract is key for adding services that consume them. We did feel a slight bit of pain when it came to keeping these events in sync and having a shared place for these to be defined would have alleviated a lot of that.


This worked fairly well although the fact that you don’t get a bearer token that is verifiable from Identity Pools was frustrating. We ended up using IAM roles to restrict our API endpoints which isn’t ideal and not standard API behaviour. There was a fair bit of education needed to have the teams consuming the endpoints across how the signing method works for AWS. User Roles do actually use a verifiable JWT but as we wanted to have the same endpoints cater to both types of users we couldn’t lean on this functionality.

In hindsight, a better approach to using IAM roles would have been to encode the IAM credentials into a JWT and then use the Lambda Authorizer to verify and pass on the credentials. This would have a saved a lot of time and kept all of our security using the same method.

The authentication workflow for our users was pretty cool though:

  • Initial users use an unrestricted API endpoint to generate an unauthenticated Cognito user. Return IAM credentials restricted to the next API to the consumer. This becomes the users session.
  • Using those credentials to access the API responsible for registration. Once registration is complete, escalate the Cognito user to be ‘developer authenticated’ and return new IAM credentials for that user to the consumer. This is the users new session.
  • Using the newly authenticated credentials, access the rest of the API suite.

The workflow was a bit annoying to put together. A challenge we had that necessitated this was a 3rd party integration that didn’t have an immediate response, meaning the registration could take several minutes. Ideally there wouldn’t have been a registration session at all.

There are some downsides to using IAM credentials for user sessions. One is that you can’t actually invalidate the IAM credentials without deleting the user (That I know of). Meaning that every time you use a federated identity you need to create a new Cognito user. IAM credentials also throw off every alarm a penetration tester has once they come through, even if they are restricted to only act on the APIs there are intended for.


As usual, using DynamoDB and Lambda together is seamless and a pleasure to implement.

AWS and Terraform

Each feature branch had its own stack built when commits were made, meaning each day it was common to have ~30 uniquely deployed stacks within AWS. This number doubled when the feature branch went into pull request and another ‘merged’ branch was deployed. You might be thinking at this point, isn’t that a LOT of infrastructure to have deployed? Doesn’t AWS have soft limits? Doesn’t AWS have hard limits too? Yes, yes and yes.

A couple of weeks into the project, those limits were reached. In most cases the limit could be increased, but in others AWS had a hard limit. Parts that reached their limits:

  • S3 Buckets
  • API Gateways
  • Lambda Functions
  • IAM Roles + Policies
  • User Pools

Teardowns of the infrastructure were a massive problem, across all teams. The API gateway design was suboptimal, with 1 gateway per API, rather than all the API’s sitting under routes within the same gateway. We had a total of 7 API gateway instances per stack. Multiply that by the amount of stacks we had at once and you could exceed 200 API Gateways daily. This is a problem because an AWS account is restricted to deleting 1 API every 30 Seconds. That’s a lot of time to tear down only one type of resource we had.

Trying to terraform destroy the whole stack

This problem coupled with the fact that S3 Buckets have to be empty to be deleted caused havoc, with a huge portion of time dedicated to cleaning up the AWS environment as efficiently as possible. With a small team of developers this wouldn’t have been too bad, but with the velocity of the project and the mono-repo design, it was unsustainable.

A new, more concise design for the API gateway with sub routes for each API was developed towards the end of the project which addressed the issue but it wasn’t implemented before MVP.

Repository: GIT

  • The frontend branches also had their own redundant AWS resource stacks (Where they actually just developed against the ‘develop’ branch backend). This meant the frontend was heavily affected by build issues that were completely unrelated to their development.
  • Merging was a nightmare.
  • Coding standards at one point to were set to the root of the project with ESlint and other tooling that conflicted heavily between the frontend and backend.
  • The project size exploded when a remote branch had binaries added to it, causing massive delays in GIT actions.
  • The build exceeded 15 minutes per commit. This was with the repository cloning, terraform, frond end deployment, QA testing and docker deployments.

There were a few git rules active during the start of the project that needed to be rescinded:

Actual footage of merging just before builds finish running

PR and Branch builds had to both be green. Unfortunately when you have 15 developers all committing code at the same time getting the sweet spot for when your 15 minute build and branch are both in sync was insane. This caused a lot of velocity issues. Eventually we settled on just having the PR build green, this was a build that is automatically merged with develop so we could ensure that the merge would be OK.

The biggest issue we had with Github itself was when GitHub had their outage and that set us back an entire day.

All things considered, the Pull Request system within Github worked very well as a code review tool, once all the rules were in the right place.


While Travis has a very simple UI it can still be somewhat confusing for a lot of users all on the same repository. There is also a huge issue with congestion (particularly around 4:30pm). At points there can be 20+ builds queued as everyone commits for the end of day. Utilising something like a git tagging system to manually trigger builds or just relying on PR builds rather than per-commit may have been a better option in this case.



The general gist is, the cloud system releases events to the IOT device which then receives them and does what it needs to. The devices themselves can also raise events that require a response but the interactions are very limited.

The IOT communication was conducted using AWS IOT. There was a processes to register devices that involved a serial number provided by a supplier. Once the devices was received, the user would register the device and the platform would mark it as ‘active’.

When messages between the cloud platform and the IOT devices were sent, they used unique ID’s per device that were linked during the 3rd party registration. Getting the eventing between AWS IOT and the cloud platform running smoothly was not as difficult as anticipated, just another integration point using lambdas and SNS/SQS.

3rd Parties

Other 3rd party integrations mostly involved notifications and small updates into our cloud platform. Pretty small, very specific integration points which were modular enough to swap in and out.

This is probably where using eventing shows one of its greatest strengths. A small outage of one integration is so insignificant when it comes to the rest of the system. For example: A 3rd party we send a bit of info after every x action just has their events back up, ready to be sent through once they are back online. This makes the system a lot more robust and much less prone to total collapse.


Secondly it would try and load a static request from SwaggerHubs VirtServer, we had files for each 3rd party endpoint and it would be the ‘default’ response.

This was an issue when a lot of the tests would call SwaggerHubs at the same time (a test suite running 10 times concurrently). SwaggerHubs VirtServer actually has a rate limit of 10 per minute which unfortunately isn’t very stable and may serve 10 or 40 requests per minute depending on load. This actually caused intermittent failures during our testing before we figured out what was going on, which was infuriating as the initial thought process is:

How is my code SOMETIMES throwing errors, where did I go so wrong

In our development environments all URLs for 3rd parties were swapped to the mock API at build and prefixed with their name. This worked really well.


The most challenging part of the project was the build pipeline and time wasted waiting for builds and issues that stopped that from happening. The learning curve for Cognito was sharp but in general most pieces within AWS fit together so well that major blockers were rare. This was the largest project I’ve worked on that was 100% AWS backed, I learned a LOT and I eagerly await the next one.

Building A Web Or Mobile App?

Developer? Try out the Crowdbotics App Builder to quickly scaffold and deploy apps with a variety of popular frameworks.

Busy or non-technical? Join hundreds of happy teams building software with Crowdbotics PMs and expert developers. Scope timeline and cost with Crowdbotics Managed App Development for free.


The fastest way to build your next app.