Box Office App — A Serverless Journey

Published in

Spektrix Engineering

6 min read · Feb 19, 2021

Over the past two years, Spektrix has been migrating its services and infrastructure to the Azure cloud. The journey has been fraught with challenges, but we are now at the point where we will be able to decommission our own hardware and complete our move to the cloud. The Spektrix platform is better prepared than ever to support our users as they scale.

Along the way, we have learnt an incredible amount about Azure cloud services, and I thought it would be great to share one of our many success stories: how we took a legacy self-hosted service and migrated it to a cloud-native, event-driven microservice.

One of the many services that Spektrix provides is integration with the devices used in our users' box offices: printing tickets, taking payments from chip and PIN terminals, and opening cash drawers. As part of the Spektrix solution, we ask users to install a Windows application on their desktops and configure it to suit their requirements (for example, the type of ticket printer or the provider they use for chip and PIN payments). The solution itself comes with several key requirements:

We need to know at any point which applications are actively connected. For example, when printing a ticket from our Spektrix Web application, we need to know which printers are available to perform the printing.
We need to be able to perform automatic upgrades so we can deliver features and improvements to users seamlessly.
It must be responsive and reliable. For example, we do not want the customers of our users to wait unnecessarily while their tickets are being printed.

A pair of tickets being handed over at a box office

Core Goals

We wanted as much of the solution as possible to use Platform as a Service (PaaS) offerings. At Spektrix, we want to spend our time focusing on solving our users’ needs rather than on maintaining infrastructure.
We wanted the solution to be scalable, reliable, and secure.
Communication between the user applications and the service must be fast. For example, the delay between a user requesting a ticket to be printed and the ticket beginning to print should be as short as possible.
We need to know as quickly as possible whether an app is available. Our users should not be able to select a printer connected to an app if that app is not available.
The auto-upgrade of the user apps should be reliable. We wanted all our users to benefit from having the latest version installed.
We wanted the solution to be as independent as possible from our other services. This aligns with our strategy of breaking apart our monolith into separately deployable business capabilities.
It must be cost effective. We wanted to make sure we get the most value out of any service that we deploy to the cloud.

Design Challenges

During our design sessions we acknowledged that we had several challenges that we needed to overcome before the solution would work:

For historical reasons, we had multiple major and minor versions of the app deployed to our users, and some users were running a mixture of them. Version 1 of the application used a proprietary web socket technology to maintain connections to the user. Version 2 used a long-polling technique that did not scale, because it put load on the web servers that host the Spektrix platform. We needed to ensure that both versions could upgrade to the new version.
Our strategy of preferring PaaS over Infrastructure as a Service (IaaS) meant that we had decided to move from RabbitMQ to Azure Service Bus. Version 2 of the app used RabbitMQ, so replacing it was a requirement.
A lot of this technology was new to us, so we needed to iterate quickly and validate that our technology choices were correct and cost effective.

Figure: Architectural High-Level Overview. An illustration showing the various Azure components used in the solution, along with their interaction with the Spektrix Web Platform and Client Box Office App installations.

Technology Choices

Azure SignalR Service

The Azure SignalR Service gives us the ability to maintain a real-time connection between the Spektrix platform and our users' installations of the app. This effectively delegates connection management to the service, so we do not have to concern ourselves with its complications. One of the most beneficial aspects of SignalR is that it can automatically recover lost connections: if our users are having connectivity issues, the client library detects this and attempts to re-establish the connection. The library also abstracts the underlying connection transport (WebSocket by default, falling back to older transports such as Ajax long polling when required), so we will be able to benefit if newer transport types become available in the future.
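To make the auto-recovery idea concrete, here is a minimal sketch of a reconnect loop with a fixed delay schedule. It is not the SignalR client's implementation and not our production code: the `reconnect` function, the `connect` callback and the delay values are all hypothetical, chosen only to illustrate the pattern the library handles for us.

```python
import time
from typing import Callable, Iterable, Optional

# Hypothetical delay schedule in seconds (an assumption for illustration,
# not a value taken from this article).
RETRY_DELAYS = (0, 2, 10, 30)


def reconnect(
    connect: Callable[[], bool],
    delays: Iterable[float] = RETRY_DELAYS,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Wait for each delay in turn, then try to connect.

    Returns the 1-based number of the attempt that succeeded, or None when
    every attempt failed and the caller should report the outage.
    """
    for attempt, delay in enumerate(delays, start=1):
        sleep(delay)       # back off before this attempt
        if connect():      # connect() returns True once the link is restored
            return attempt
    return None
```

Passing `sleep` in as a parameter keeps the loop easy to test without real waiting. With a managed client library, none of this has to live in the application itself.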

Azure Functions

Azure Functions is a serverless offering from Azure that natively supports an event-driven architecture. Functions are triggered by an external event, such as a new message on the Service Bus or an HTTP request to the function. They also give us the ability to scale on demand: as more users use the service, Azure automatically scales the underlying servers hosting our functions. This is the default behaviour of the Consumption hosting plan. Because we only pay while the functions are executing, our costs can drop in times of low demand. At the time of writing, the average daily cost in Production for the entire service is around £5, and with some small changes to our Cosmos configuration we are confident we could get this even lower.

Azure Cosmos DB

We have a requirement to store some state about the user applications connecting to Spektrix, such as when they last connected, the version they are currently running and the descriptive name under which the app was registered. Azure Cosmos DB met this need and also satisfied our technical requirements: it is fast to respond, serverless (in that we do not manage the service), extremely scalable (far more than we would ever need for this use case) and cost effective.
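As an illustration of the kind of state described above, the sketch below models one app registration as a small document. The field names, the `customer_id` partition key and the `record_connection` helper are assumptions made for this example; the article does not describe the actual schema.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class AppRegistration:
    """One document per installed Box Office App (hypothetical shape)."""

    id: str               # unique id of the installation (assumed)
    customer_id: str      # assumed partition key
    display_name: str     # descriptive name the user registered
    version: str          # version the app is currently running
    last_connected: str   # ISO 8601 timestamp in UTC
    available: bool = True


def record_connection(reg: AppRegistration, version: str, now: datetime) -> dict:
    """Update the registration on connect and return the document to upsert."""
    reg.version = version
    reg.last_connected = now.astimezone(timezone.utc).isoformat()
    reg.available = True
    return asdict(reg)
```

A function handling an app's connection could build this dictionary and upsert it into a container. Keeping the document this small helps keep reads and writes cheap.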

Azure Service Bus

Azure Service Bus serves as our message broker and, effectively, as the integration layer between the Spektrix Web platform and the Box Office Service. When a user requests an action through the Spektrix platform (such as printing a series of tickets), a new message is placed on the bus. An Azure function consumes that message and sends the necessary instruction, via SignalR, to the app that has the requested printer configured.
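The flow above can be sketched as a single handler: take a message body from the bus, work out which app connection should receive it, and hand it to SignalR. This is a simplified model rather than our real function. The message fields (`appId`, `printer`, `tickets`), the `printTickets` target name and the `send` callback standing in for the SignalR output are all hypothetical.

```python
import json
from typing import Callable, Dict


def handle_print_request(
    message_body: str,
    connections: Dict[str, str],
    send: Callable[[str, str, dict], None],
) -> bool:
    """Route one print request from the bus to the right app connection.

    `connections` maps an app id to its current SignalR connection id.
    Returns False when the app is not connected, so the caller can decide
    whether to retry or dead-letter the message.
    """
    request = json.loads(message_body)
    connection_id = connections.get(request["appId"])
    if connection_id is None:
        return False
    payload = {"printer": request["printer"], "tickets": request["tickets"]}
    send(connection_id, "printTickets", payload)
    return True
```

Because the web platform only ever places a message on the bus, it never needs to know which connection an app is using or whether a function instance is currently running.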

Application Insights, Log Analytics and Grafana

Monitoring and logging are essential for any production application, and we took advantage of the ‘out-of-the-box’ integration that Application Insights (part of Azure Monitor) has with Azure Functions. We can watch telemetry in real time or query historical telemetry data. We have performance metrics for each of our functions, and smart alerts that fire when anything out of the ordinary occurs. Logs are streamed to Log Analytics, where we have prepared several Kusto queries to enable quick troubleshooting when one of our users has a problem that our support team cannot resolve. All this data is then visualised in several dedicated Grafana dashboards, which give us a rapid way of monitoring and assessing how the whole system is functioning.

Azure Event Grid

Azure Event Grid gives us a great way to integrate Azure services together and we have taken advantage of this by subscribing to user disconnection events from the Azure SignalR service. As soon as a user app disconnects, SignalR produces an event that is published to a topic on Azure Event Grid. We have a function that consumes this event and performs the necessary action to mark that user app as unavailable.
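A minimal sketch of that consumer is shown here, working on plain dictionaries instead of the real Event Grid and Cosmos DB bindings. The event type string and the `userId` field follow the Event Grid schema for SignalR as I understand it, so treat them as assumptions; the `apps` store and both function names are hypothetical.

```python
from typing import Dict, List

# Assumed event type for a SignalR client disconnecting.
DISCONNECTED = "Microsoft.SignalRService.ClientConnectionDisconnected"


def handle_event(event: dict, apps: Dict[str, dict]) -> bool:
    """Mark the app behind a disconnection event as unavailable.

    Returns True only when a known app was updated.
    """
    if event.get("eventType") != DISCONNECTED:
        return False                      # ignore unrelated events
    app = apps.get(event.get("data", {}).get("userId"))
    if app is None:
        return False                      # unknown app: nothing to update
    app["available"] = False
    return True


def available_printers(apps: Dict[str, dict]) -> List[str]:
    """List only the printers attached to apps that are still available."""
    return sorted(
        printer
        for app in apps.values()
        if app["available"]
        for printer in app["printers"]
    )
```

With the availability flag kept up to date this way, the web platform can offer only the printers returned by something like `available_printers`, which is what stops a user selecting a printer whose app has gone offline.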

Summary

Our users are now all running the most up-to-date version of the box office application, reliably connecting to the Spektrix platform via our new Box Office Service. We have a scalable, easy-to-maintain platform with no downtime during deployments. We have visibility of the performance of every aspect of the application, which gives us an improved ability to diagnose and mitigate issues as they arise. We are in a great place to iterate on this design and push more improvements directly into our users’ box offices.