A Serverless Java Solution for Deliveries

Dylan O'Reagan
My Local Farmer Engineering
16 min read · May 25, 2021

Building a Service in AWS

In previous posts, we here at I Love My Local Farmer have discussed the need to create a new service to provide delivery capabilities to our existing business model. With the lockdowns and changes that resulted from COVID, we could no longer rely solely on our existing farm produce pickup model to sustain our business. Considering that this new service is being built entirely from scratch, we saw it as a good opportunity to implement a portion of our platform in AWS that is completely distinct from our previous pickup options.

Disclaimer
I Love My Local Farmer is a fictional company inspired by customer interactions with AWS Solutions Architects. Any stories told in this blog are not related to a specific customer. Similarities with any real companies, people, or situations are purely coincidental. Stories in this blog represent the views of the authors and are not endorsed by AWS.

Previously, our e-commerce platform was developed as a somewhat monolithic structure focused solely on allowing customers to schedule pickups from farms that had subscribed to our service. This new delivery option gave us the opportunity to implement a self-contained, easily manageable portion of that platform in the cloud, and to use it as a test case for whether we could migrate more of our existing platform to a cloud-based architecture. After considering cost and capability, our upper-level management decided to go with AWS as our cloud provider of choice, and so we as the engineering team were tasked with designing a scalable and manageable solution within that environment.

The constraints for this new service were that it be low-cost, scale to meet dynamic customer traffic, and require less operational effort than our current system. We wanted to ensure that we were realizing actual benefits by moving to the cloud, and not simply moving for the sake of trying something new. Our previous on-premises infrastructure had posed many challenges for our engineering teams that we were hoping a cloud-based infrastructure could solve. For instance, it took our operations team a significant amount of time to stand up new infrastructure in different regions, and we wanted to make this process easier and more efficient. We also wanted to reduce their maintenance burden. Finally, we wanted the ability to scale more easily as our business grows and more users adopt our new delivery service.

We also wanted to integrate our new service seamlessly into our existing business model as its own microservice, without having to make sweeping changes to the infrastructure that we currently have in place. This meant that the service had to be managed entirely within AWS and exposed only as an API that our e-commerce platform can call to perform the operations necessary for delivering products. These operations include creating the delivery slots for customers to book, retrieving available slots so that customers can view them, and reserving a delivery slot for a given date and time (sketched below). We also wanted to ensure this new service operated entirely on the backend and did not require major changes to our existing UI and frontend architecture.
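
To make the surface of the service concrete, here is an illustrative sketch of those three operations expressed as a Java interface. The names, fields, and signatures here are placeholders for discussion, not our final API contract.

```java
import java.time.LocalDateTime;
import java.util.List;

/** Illustrative only: the shape of the three operations Magento calls on the Delivery Service. */
public interface DeliveryServiceApi {

    /** A delivery window a customer can book (hypothetical fields). */
    class DeliverySlot {
        public long slotId;
        public long farmId;
        public LocalDateTime start;
        public int capacity;
        public int booked;
    }

    /** Schedulers create the slots that customers can later book. */
    List<DeliverySlot> createSlots(long farmId, List<DeliverySlot> slots);

    /** Customers browse the slots still open for a given farm. */
    List<DeliverySlot> getAvailableSlots(long farmId);

    /** A customer reserves a specific slot; returns false if it filled up first. */
    boolean bookDelivery(long slotId, long customerId);
}
```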

In this post we are going to cover the architectural decisions and tradeoffs we have made, as well as some alternative implementations we considered along the way and why we decided not to go with them. For reference throughout the rest of the post, here is the diagram of the architecture we have decided to implement. It is composed of an API Gateway that proxies three Lambda functions, each of which connects to an RDS database through RDS Proxy. As a quick clarification, Magento is the e-commerce platform we use for our existing application, although technically any e-commerce platform could be dropped in its place. One particularity of Magento is that it requires an Elasticsearch cluster and a database to function. In the diagram below we refer to these as Magento Database and Magento Elasticsearch, even though they are not technically part of the Magento application.

Serverless vs. Serverful: How Would You Like To Be Served?

One of the initial decisions that had to be made when designing our new service was whether to go with a traditional server-based architecture or to leverage the serverless capabilities that AWS offers.

One of the main pain points we experienced with our on-premises infrastructure was the operational effort involved in properly scaling our system to meet the dynamic pattern of customer requests. Customer traffic typically picks up in the afternoon and gradually increases throughout the evening, with some spikes in the middle of the day. When we first considered moving to a cloud-based architecture, we wanted to decrease both the cost and the human effort involved in keeping our system operating at peak efficiency. Previously, our engineering team had to constantly monitor resource usage and requisition more capacity whenever our service seemed in danger of being unable to support customer demand. Otherwise, visitors to our website would experience long delays or wouldn't be able to access the site at all. Additionally, when traffic to our website was low we ended up paying to maintain resources that weren't being used. By moving to AWS Lambda, we realized we could significantly cut down on both cost and operational workload thanks to the inherent scalability of the service.

The biggest advantage that drew us to serverless technologies like Lambda, as opposed to offerings such as EC2, was the decreased level of operational effort required to maintain the service. Because we were trying to get the delivery service up and running as quickly as possible, we wanted to focus on the application itself rather than configuring and maintaining instances. With EC2, we would have had to provision instances and keep them up to date with operating system updates and software dependencies. While our team was familiar with this workload from our on-premises experience, we wanted to move away from needing this level of operational oversight. With Lambda, we can focus on simply writing the application code and uploading it to the cloud. Additionally, we pay only per request to our Lambda functions, whereas with EC2 instances we would be paying around the clock to keep our service available. We briefly considered a containerized solution such as ECS or EKS, but given that we did not have much experience with containers we figured it would be too drastic a shift to implement in the time we had.

The decision to go with API Gateway rather than another technology such as an Application Load Balancer (ALB) was based mostly on the feature differences between the two services. Both can integrate with Lambda functions, but we needed to pick the one that suited our use case best. While not part of our initial design, we expect that in the future we will most likely need to add some form of AuthZ/AuthN to our service, particularly if we end up moving some of our existing features to AWS and need to verify calls involving customer information. We had read that API Gateway can be more expensive than an Application Load Balancer integrated with Lambda functions, particularly for frequently accessed applications, but we wanted to avoid having to redesign our architecture if we ended up needing some of the features API Gateway offers.

To Queue Or Not To Queue: That is the Queuestion

One of the main discussion points we came across when designing our new service was whether to implement synchronous or asynchronous calls in the system. Originally, our thinking was that implementing asynchronous calls would lead to a better user experience for both our customers and our schedulers. The schedulers could create delivery slots and the customers could book deliveries through a faster and more responsive UI. To this end, our initial design for the architecture looked like the following, where SQS queues are integrated with API Gateway and placed in front of each of our Lambdas.

However, upon further consideration we realized that there were several problems with this design. If several users tried to book the last available delivery in a time slot, multiple requests could sit in the queue at the same time and the slot could end up overbooked.

To rectify this issue, we considered two options. The first was to build a notification system in which a customer would attempt to book a slot and then receive a notification once the asynchronous call had completed and the slot had actually been reserved. The other was to simply remove the SQS queues from the architecture and use synchronous calls combined with the atomicity of database transactions to ensure that no slot was booked beyond its capacity.

Implementing the asynchronous solution would have given us a performance advantage because we could respond to customers much more quickly. It would also arguably have been more reliable, because we could implement dead-letter queues to ensure that any messages we couldn't process were properly handled. However, it would have added a great deal of complexity to our system and drastically increased the time it took us to deliver the new service. Additionally, since our customers pay for their order at the time they book a slot, we would have had to figure out how to hold payments until the booking had actually been confirmed.

Considering how quickly we wanted to get our delivery service up and running, we decided there was too much complexity and effort required to implement any kind of asynchronous execution in our initial implementation of the service. However, if it turns out that the latency of calls to the service is too great then we may need to consider moving to this model in the future.
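
Since the synchronous design leans entirely on database atomicity to prevent overbooking, here is a rough sketch of what that check might look like from inside the booking Lambda. The table and column names (delivery_slot, booking, booked, capacity) are placeholders for illustration, not our final schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class SlotBooking {

    /**
     * Attempts to reserve one unit of capacity in a delivery slot.
     * Returns true if the booking succeeded, false if the slot was already full.
     */
    public boolean bookSlot(Connection conn, long slotId, long customerId) throws SQLException {
        conn.setAutoCommit(false);
        try {
            // The conditional UPDATE only succeeds while capacity remains, so two
            // concurrent bookings for the last spot cannot both get through.
            try (PreparedStatement claim = conn.prepareStatement(
                    "UPDATE delivery_slot SET booked = booked + 1 " +
                    "WHERE slot_id = ? AND booked < capacity")) {
                claim.setLong(1, slotId);
                if (claim.executeUpdate() == 0) {
                    conn.rollback();
                    return false; // slot already full
                }
            }
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO booking (slot_id, customer_id) VALUES (?, ?)")) {
                insert.setLong(1, slotId);
                insert.setLong(2, customerId);
                insert.executeUpdate();
            }
            conn.commit();
            return true;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```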

Database Considerations

For our choice of database, we wanted to go with what was most familiar to our engineering team. NoSQL technologies such as DynamoDB are becoming more and more popular, but our team does not have experience writing applications against them and would need more time to get up and running. For our existing Magento application we had been using a MySQL database to store information about the farms' produce. To replicate that setup as closely as possible, we decided to go with MySQL deployed on the Relational Database Service (RDS) to store our delivery order information. We briefly considered Aurora Serverless, but the higher cost did not seem justified, as we don't anticipate needing that level of performance given our expected load.

One concern we read about when integrating Lambda with RDS is that it is very easy to exhaust the database's maximum number of concurrent connections. This is because new Lambda instances are spun up to meet the traffic of customers visiting our website and placing orders, and each one would otherwise create its own connection to the database. To solve this, we decided to place RDS Proxy between our Lambdas and our RDS instance. RDS Proxy maintains a single pool of database connections and shares it across all of the Lambda functions calling the service, so each function's requests reach the database without every function holding its own connection and starving the database. This also reduces the load that comes with opening and closing connections at a high rate.
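
From the function code's perspective, RDS Proxy is largely transparent: the Lambda opens an ordinary JDBC connection, just against the proxy's endpoint instead of the database's. A minimal sketch, with environment variable names that are our own convention (in practice we would pull credentials from Secrets Manager or use IAM authentication rather than plain environment variables):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ProxyConnectionFactory {

    /**
     * Opens a connection through RDS Proxy. PROXY_ENDPOINT, DB_USER, and DB_PASSWORD
     * are illustrative environment variable names, not anything AWS mandates.
     */
    public static Connection open() throws SQLException {
        String url = String.format("jdbc:mysql://%s:3306/delivery",
                System.getenv("PROXY_ENDPOINT"));
        return DriverManager.getConnection(url,
                System.getenv("DB_USER"),
                System.getenv("DB_PASSWORD"));
    }
}
```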

Infrastructure for the Future

Another major challenge with our previous on-premises application was the difficulty of replicating our infrastructure in different regions and/or countries. Every time we expanded to a new area, it meant recreating our entire physical infrastructure stack from the ground up and then redeploying our application onto that new infrastructure. This process could sometimes take months, and it drastically slowed the expansion of our business into new areas.

Moving into AWS, we wanted to make sure that whatever system we designed could be easily replicated if we needed to deploy into new regions in the future. To this end, we started exploring the services AWS offers for specifying infrastructure deployments. The major ones we came across were the Serverless Application Model (SAM), CloudFormation, and the Cloud Development Kit (CDK).

Looking deeper into each of these technologies, we quickly realized that using CloudFormation directly was most likely not the direction we wanted to go in, since both SAM and the CDK provide benefits on top of what CloudFormation alone gives us. Both SAM and the CDK are ultimately turned into CloudFormation behind the scenes, and we wanted to make use of the higher-level abstractions they provide. Plain CloudFormation would not let us benefit from either the pre-built constructs that come with SAM, such as out-of-the-box API configurations, or the ease of development that comes with the programmatic nature of the CDK. This is a major point for us: we came across posts saying that what takes 1,000 lines of CloudFormation JSON can be done in around 50 lines using the CDK, and time was of the essence when it came to developing our new service.

This left us with the choice between SAM and the CDK. In the end, we decided to go with the CDK for a number of reasons. First, the CDK not only offers high-level infrastructure components comparable to those in SAM, but in some cases SAM constructs can be used through the CDK as well. A nice benefit of these high-level constructs is that they typically come with default configurations and security settings that reduce the chance of us misconfiguring parts of our infrastructure.

Second, specifying our infrastructure with code offers us many features from a development perspective. Being able to programmatically define parts of our infrastructure and use the logic constructs of a real programming language is, in our view, an invaluable capability of the CDK. Writing loops and conditional statements for our infrastructure and easily retrieving object properties is far more convenient than performing these manipulations in YAML, which is the language SAM uses. We can easily write logic to adjust infrastructure based on things such as region and account, or create a certain number of resources based on passed-in parameters. The code completion available for the languages the CDK supports also makes development easier and more efficient, in our opinion.

Languages and Frameworks

In what is becoming a common theme throughout our design process, we stuck with technologies familiar to us when choosing programming languages and frameworks. Many people on our engineering teams know Java well, since it has been a predominant backend choice over the years, and Lambda supports Java natively, so it seemed natural to use it as the language for our new service.

We do have some concerns about this choice. Namely, in our research we have seen that cold start times for Java Lambda functions can be longer than those of other languages such as Python or Node. This is a particular concern since we have decided to use synchronous invocations in our service, and any extra latency could hurt our user experience. However, we believe this will be offset by our ability to develop more quickly. We expect the cold start issue to affect only the small subset of users who visit our website during off-peak hours, and if it ends up being too big a problem we can later add Provisioned Concurrency to our design to keep some Lambdas warm at all times.
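
For a sense of what this looks like in practice, here is a simplified sketch of the "retrieve available slots" handler, using the standard aws-lambda-java-core and aws-lambda-java-events libraries. The class and package names are illustrative, and the real function would query MySQL through RDS Proxy rather than return a stub.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;

/**
 * Simplified sketch of the "get available slots" handler behind API Gateway.
 * The real implementation would serialize slot rows fetched from the database.
 */
public class GetSlotsHandler
        implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {

    @Override
    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent request,
                                                      Context context) {
        // farmId arrives as a path parameter from the API Gateway proxy integration.
        String farmId = request.getPathParameters().get("farmId");
        context.getLogger().log("Fetching slots for farm " + farmId);

        // Placeholder response body for illustration only.
        return new APIGatewayProxyResponseEvent()
                .withStatusCode(200)
                .withBody("{\"farmId\": \"" + farmId + "\", \"slots\": []}");
    }
}
```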

Another decision was what language to use for our CDK development. The CDK is currently available in several languages including TypeScript, Java, and Python. We have decided to go with Java. While TypeScript is the native language of the CDK, we wanted to stick with a language our team knows well and play to our strengths as a development team. Moving to a new language would have meant becoming familiar with an entirely different set of tools and package managers, and we wanted to avoid incurring that sort of setup cost as much as possible. Our only concern with this decision is that the majority of code examples for the CDK are written in TypeScript, so we may have to spend a bit more time translating them into their Java equivalents.
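
To give a flavor of what our CDK code might look like in Java, here is a rough sketch of a stack wiring API Gateway to our three functions. The construct IDs, handler package, asset path, and routes are illustrative (and CDK import paths vary by version), but it shows the programmatic looping we mentioned above.

```java
import software.amazon.awscdk.core.Construct;
import software.amazon.awscdk.core.Stack;
import software.amazon.awscdk.services.apigateway.LambdaIntegration;
import software.amazon.awscdk.services.apigateway.Resource;
import software.amazon.awscdk.services.apigateway.RestApi;
import software.amazon.awscdk.services.lambda.Code;
import software.amazon.awscdk.services.lambda.Function;
import software.amazon.awscdk.services.lambda.Runtime;

import java.util.HashMap;
import java.util.Map;

/** Rough sketch of a CDK stack for the delivery API; all names are illustrative. */
public class DeliveryServiceStack extends Stack {

    public DeliveryServiceStack(final Construct scope, final String id) {
        super(scope, id);

        // The programmatic side of the CDK we like: loop over the handlers
        // instead of repeating near-identical resource definitions.
        Map<String, Function> functions = new HashMap<>();
        for (String name : new String[]{"CreateSlots", "GetSlots", "BookDelivery"}) {
            functions.put(name, Function.Builder.create(this, name + "Function")
                    .runtime(Runtime.JAVA_11)
                    .handler("com.ilovemylocalfarmer.delivery." + name + "Handler::handleRequest")
                    .code(Code.fromAsset("target/delivery-functions.jar"))
                    .build());
        }

        // One REST API, with each route wired to its own function.
        RestApi api = RestApi.Builder.create(this, "DeliveryApi").build();
        Resource slots = api.getRoot().addResource("slots");
        slots.addMethod("POST", new LambdaIntegration(functions.get("CreateSlots")));
        slots.addMethod("GET", new LambdaIntegration(functions.get("GetSlots")));
        slots.addResource("{slotId}")
                .addMethod("PUT", new LambdaIntegration(functions.get("BookDelivery")));
    }
}
```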

Compartmentalization of the Service

In order to avoid modifying our current ordering system, we rely on the existing infrastructure in our platform as much as possible. By treating our existing application as the source of truth for customer and farm data, we avoid any unnecessary duplication of data between our current platform and our new Delivery Service, e.g., customer addresses, farm addresses, and the Elasticsearch cluster. That said, while our existing application will continue to store data about customers and farms, the new Delivery Service will be the source of truth for information about delivery bookings.

The below sequence diagram shows the flow that occurs when a customer visits our site and attempts to book a delivery with one of our farms.

Since our existing platform already stores data about customers, farms, and available produce, we retrieve all of that information from the database and Elasticsearch cluster currently integrated with our e-commerce platform. When a customer searches our site for farms to order from, the list of farms within both pickup and delivery distance of their location is retrieved by an Elasticsearch query from within our existing e-commerce application (1). We then go to the farm database table to retrieve the list of available produce from that farm (2).

Initially, we had considered storing information such as customer addresses and farm opt-in status within the Delivery Service. However, the added complexity of keeping this information in sync with the information stored by Magento would have added too much time to our development workflow, as well as raising GDPR concerns about moving personal data between systems. We considered it a benefit to design our service to store as little information as necessary to plan and book deliveries, since it meant we were able to get this feature up and running as soon as possible with the minimum amount of effort.

Only once the customer has selected a farm within delivery distance and made their choice of produce does our Delivery Service come into play. Magento sends a query to the service to retrieve a list of available delivery slots for the given farm (4). This list only includes dates beginning two days in the future out to a maximum of two weeks. This delay allows enough time for the farmers to prepare their orders for pickup, as well as for the schedulers to organize the deliveries based on driver availability. Note also that Magento double-checks whether the user is in delivery range of the farm at this point (3), in case a customer doesn't go through the search page and instead gets a link to the farm's page from someone else.

Once the list of slots is retrieved, the customer can select a slot and Magento makes another call to the Delivery Service to reserve that slot for the specific customer (5). Notice that nowhere in the Delivery Service do we need to replicate any of the information we currently store with Magento, apart from the userIds and farmIds we use to relate deliveries to those entities. We simply track which slots are available and which have already been booked.

In developing our new delivery service, we wanted to adhere to the tenets of microservice architectures as much as possible. As we continue to develop I Love My Local Farmer, we want to be able to make changes to parts of our system without affecting the behavior of other portions of our application. By separating the Delivery Service from the underlying information stored in Magento, we are able to change the backend behavior of the service without affecting the rest of I Love My Local Farmer.

Final Thoughts And Wrap Up

Moving our new project to the cloud has definitely not been a trivial decision, and there are a number of considerations to take into account when designing a system within AWS. For our new system we knew we wanted our service to be cost-effective and reliable, with the bare minimum of operational overhead. To this end, we decided to fully commit to the serverless approach and rely on technologies that require little to no oversight from an ops perspective.

To cut down on the operational overhead of managing our new service, we made use of both Lambda and API Gateway in our design. Using these technologies together means we don't have to provision any instances ourselves and can focus solely on designing our APIs and writing our application code. While there are some worries about cold starts and cost, these are drastically offset by the ease with which we can build our new application using these technologies. Additionally, since our application is not in use around the clock, we are less concerned about the cost of keeping instances up and running at all times.

Another concern that came up when we decided to use Lambda was the potential starvation of database resources due to the large number of open connections coming from many Lambda functions. We were able to mitigate this by incorporating RDS Proxy into our system, which should cut down on the amount of logic we would otherwise have to implement to keep our database reliable and performant.

Overall, we believe we have come up with a robust and easily managed architecture that can readily be redeployed in different environments and regions. We still have to see whether some of our decisions will pan out, and whether we will have to modify parts of our architecture, e.g., whether we will require Provisioned Concurrency for our Lambdas. However, we believe that moving to this new cloud-based architecture will drastically cut down on the time required for our engineering team to develop, deploy, and maintain our new service.
