How I Built This: A Sports Betting Exchange (Part 2: The Backend)

Seth Saperstein
13 min read · Nov 29, 2021

If you haven’t already, check out Part 1: The front-end

Hurdle 4: How Do I Go About Building a Back End?

For clarification, I split the “back end” and the “exchange” in order to keep the two components loosely coupled. The original idea was to give the exchange the ability to grow independently and potentially have multiple applications that all hook into it. In order to do so, I started with the sorts of communications that would have to happen between the back end and the exchange:

  • Receive all bets executed on the exchange
  • Submit bets to the exchange
  • Receive event updates from the exchange

Right off the bat, it was evident that there would have to be an API as well as some sort of relational database, all pretty standard. What else was evident was that I would need a whiteboard of some sort because, again, I would be iterating quite frequently on the design and would need a quick way to make changes. While my 300 sq ft studio didn’t have much space, it did have a large open wall, so I got a 4x8 restickable roll-up whiteboard from WriteyBoards and got cracking.

I was looking to create both a microservice architecture as well as an event-driven architecture to keep components loosely coupled, so here’s where I started:

Then I started laying out some non-trivial requirements to guide the rest of the process.

Requirements

  • Users would need to be notified via email of any status change to their bet
  • Users would need to confirm their account creation via email
  • Bets would need to be submitted in real-time
  • Bets would have to be consumed and user profiles updated in real-time
  • Bets would have to be aggregated in real-time over 5-minute windows
  • Events would have to be updated in real-time
  • Users would need some sort of type-ahead functionality for searching events

I’ll approach my design decisions by addressing each of the above requirements individually.

Users would need to be notified via email of any status change to their bet

With regards to an emailing service, it seemed my choices were SNS, SES, or a third-party service on Lambda. Again, I wanted to keep my architecture loosely coupled.

SNS vs SES vs Third Party on Lambda

SNS, or Simple Notification Service, is a channel-based publisher/subscriber service. To receive emails from SNS, the end user first subscribes via a developer-enabled method such as a webhook, then receives emails from that channel.

SES, or Simple Email Service, is designed for sending high-volume email efficiently and securely. Unlike SNS, SES also enables sending emails to recipients without requiring them to first subscribe to a channel.

When looking at a third party for an emailing service, it appeared SendGrid by Twilio was the most popular, and for good reason. SendGrid is a cloud-based tool that can be used for sending mass marketing emails as well as transactional emails. It also has numerous integrations and a very simple Python API library that seemed easy to pick up and get running in minutes.

While SNS is simple and cheap, bet updates were transactional in nature, sending custom emails to select users, which SNS doesn't support. While SES seemed like a great tool for transactional email and is part of the AWS ecosystem, it was only available for enterprise accounts. Thus, hello SendGrid! SendGrid was great as it allowed for not only transactional emails but also promotional emails, an inevitable future for any web application.
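To give a sense of how simple the library is, a minimal transactional send with SendGrid's Python client looks roughly like this (the addresses and API key are placeholders):

```python
from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail

# Placeholder addresses for illustration
message = Mail(
    from_email="noreply@example.com",
    to_emails="user@example.com",
    subject="Your bet has been matched",
    html_content="<p>Good news: your bet was just matched on the exchange.</p>",
)

sg = SendGridAPIClient("SENDGRID_API_KEY")  # placeholder API key
response = sg.send(message)
print(response.status_code)  # 202 means SendGrid accepted the message
```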

Winner: Third Party

Users would need to confirm their account creation via email

While many may not see this as a strict requirement of a system, email confirmation would help verify two assumptions:

  1. The email exists and mail can be sent to it
  2. The user has access to the inbox

I also wanted to put a time to live, or TTL, on the link provided in the email to ensure the user confirmed shortly after creating the account.

Based on these requirements, it seemed like a key-value store would be needed with the ability to handle TTL. I also wanted to keep costs low and stay within the AWS ecosystem if possible. DynamoDB seemed like an obvious choice here. It's worth noting, however, that DynamoDB's TTL typically deletes expired items within 48 hours of expiration, so the API would need an additional check to see whether the expiration had actually been violated.
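A minimal sketch of that pattern with boto3, assuming a hypothetical table named email-confirmation-tokens with TTL enabled on an expires_at attribute:

```python
import time

import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical table: partition key "token", TTL enabled on "expires_at"
table = dynamodb.Table("email-confirmation-tokens")

CONFIRMATION_WINDOW_SECONDS = 24 * 60 * 60  # illustrative 24-hour window


def store_confirmation_token(token: str, user_id: str) -> None:
    table.put_item(Item={
        "token": token,
        "user_id": user_id,
        "expires_at": int(time.time()) + CONFIRMATION_WINDOW_SECONDS,
    })


def confirm_token(token: str) -> bool:
    item = table.get_item(Key={"token": token}).get("Item")
    # TTL deletion is lazy (up to ~48 hours late), so the API re-checks
    # the expiration itself instead of trusting the item's presence
    return item is not None and item["expires_at"] > int(time.time())
```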

I’ll come back to this decision later in the article as, in retrospect, authorization is a piece of this architecture I would have handled differently.

Winner: Third Party & DynamoDB

Bets would need to be submitted in real-time

Working under the assumption that the exchange was loosely coupled from the application back end, submitting bets would involve updating the relational database as well as a fire-and-forget to the exchange, so there's really not much to see there.

At this point, I had to define the bridge connecting the application back end to the exchange. Let’s discuss some requirements for bets being submitted to an exchange:

Order: Bets must maintain order for a user. In other words, a bet must be submitted before it’s canceled or chaos will ensue.

Messaging Semantics: Exactly once.

Throughput: Potentially hundreds of thousands per second.

Latency: Real-time.

If you’re not familiar with the issues of distributed systems and message brokers, here’s a tweet that sums it up nicely:

Let’s give some options here.

SQS vs Kinesis vs MSK

SQS, or Simple Queue Service, is a fully managed message queueing service for decoupling microservices. Essentially, it stores messages for later retrieval, like a TODO list. SQS offers FIFO queues that guarantee both ordering and exactly-once delivery.

Kinesis is a massively scalable and durable real-time data streaming service. The data is available in milliseconds to enable real-time analytics use cases.

MSK, or Managed Streaming for Apache Kafka, is a fully managed service for using Apache Kafka to process streaming data. Kafka is an open-source platform for building real-time streaming data pipelines and applications.

While SQS FIFO queues can offer exactly-once delivery and maintain order, they only support 300 messages per second (without batching), not nearly enough throughput to meet requirements. Kinesis and Kafka are very similar: both scale to hundreds of thousands of messages per second, both maintain order, and both have similar concepts of sharding or partitioning under the hood. While MSK is more customizable and can deliver exactly-once semantics, I preferred Kinesis for its ease of use, minimal operational management, and integration with other AWS services. I would, however, have to use DynamoDB for integrity checks so that my consuming application could guarantee exactly-once processing on top of the at-least-once delivery that Kinesis provides.
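Here's a minimal sketch of both halves, assuming an illustrative stream named bets and a dedup table keyed on bet_id: the partition key preserves per-user ordering on the way in, and a DynamoDB conditional write turns at-least-once delivery into exactly-once processing on the way out.

```python
import json

import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")
dynamodb = boto3.resource("dynamodb")
# Hypothetical table keyed on a unique bet_id, used purely for dedup checks
processed = dynamodb.Table("processed-bets")


def submit_bet(bet: dict) -> None:
    # Partitioning by user ID routes each user's bets to the same shard,
    # which is what preserves per-user ordering in Kinesis
    kinesis.put_record(
        StreamName="bets",  # illustrative stream name
        Data=json.dumps(bet).encode("utf-8"),
        PartitionKey=bet["user_id"],
    )


def process_once(bet: dict) -> bool:
    # The conditional put fails if this bet_id was already recorded,
    # so duplicate deliveries from Kinesis are processed only once
    try:
        processed.put_item(
            Item={"bet_id": bet["bet_id"]},
            ConditionExpression="attribute_not_exists(bet_id)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # duplicate; already processed
        raise
    # ...update the user's profile in the relational database here...
    return True
```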

The infrastructure of the consuming application also plays a role, however, I’ll speak more to that point while addressing the solution design of the exchange.

Winner: Kinesis & DynamoDB

Bets would have to be consumed and user profiles updated in real-time

For the same reasons as above, I decided to use Kinesis for bets going both in and out of the exchange. When it came to consuming from Kinesis, there were two obvious options.

Lambda vs KCL on ECS Fargate

Lambda vs Fargate is a debate as old as time — no, sorry, I meant as old as Fargate’s existence, so closer to roughly 4 years. Lambda has quite a few limitations as noted in the chart below.

Graph explaining when to use Lambda vs AWS Fargate

In this case, while Lambda would be easier to configure, I preferred ECS Fargate because it was environment-independent and supported dockerized development. It's worth noting that at the time of making this decision, Lambda did not yet support Docker, and creating deployment packages for Lambda was quite custom. I also foresaw the CI/CD process being much smoother using GitHub Actions to publish to AWS ECR (Elastic Container Registry).

At this point, I should note that I intended to do a lot of the back-end development in Python, partially because it was the language I was most familiar with, but also because speed of development was more important to me than speed of the application. Most would recommend Java in this scenario, and if I had a team of 5+ developers, I couldn't agree more. I also believe that if the business logic is sound, it can always be ported to another language if and when performance becomes an issue.

The only issue I had with Docker was that configuring a KCL consumer application with Python was non-trivial. KCL is a Java library, though it supports a MultiLangDaemon. The Docker image was difficult to configure, the MultiLangDaemon operated by piping messages over STDIN/STDOUT, and I couldn't manage to get it working in anything outside of Python 2. For these reasons, I opted for NerdWallet's pure-Python implementation of a Kinesis consumer, which was surprisingly easy to use, though it needed a few modifications to work with LocalStack, a test/mocking framework for developing cloud applications.
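For reference, the consumer loop with NerdWallet's kinesis-python library is about as minimal as it gets (stream name illustrative):

```python
from kinesis.consumer import KinesisConsumer

# The library handles shard discovery and checkpointing internally
consumer = KinesisConsumer(stream_name="bets")
for message in consumer:
    print(f"Received message: {message}")
```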

Winner: ECS Fargate with NerdWallet’s Kinesis Consumer

Checkpoint

Now that some decisions have been made, let’s check in to see what the solution is looking like.

Alright, back to our regularly scheduled programming.

Bets would have to be aggregated in real-time over 5-minute windows

I had two options here.

Option 1: Add all bets to the database directly and do aggregates on the fly during API calls to the event overview route.

Option 2: Pre-aggregate windows of bets and upload the aggregate windows to the database for the API to retrieve.

Again, building for scale, it didn't make sense to calculate this aggregation on the fly when potentially thousands of users could be hitting the event page every minute, hence I opted for option 2. For aggregating at scale in real-time, I saw two potential options.

Kinesis Analytics vs Glue Streaming

Kinesis Analytics is an easy way to analyze and transform streaming data in real-time with Flink or SQL applications. Kinesis Analytics is a serverless service that scales to match the volume and throughput of your incoming data.

As of April 2020, AWS Glue supports serverless streaming ETL, making it easy to continually consume data from streaming sources like Kinesis and Kafka. These jobs run on the Apache Spark Structured Streaming engine and can run complex analytics and machine learning operations.

When it came down to it, using Glue for this sort of aggregation would be like using a sledgehammer to kill a fly. While Glue ETL is an extremely flexible tool, Kinesis Analytics scaled better, cost less, and could accomplish the job with a single SQL statement using tumbling windows, along the lines of the following sketch (stream and column names are illustrative):
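```sql
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "event_id"   INTEGER,
    "avg_odds"   DOUBLE,
    "window_end" TIMESTAMP
);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM
    "event_id",
    AVG("odds") AS "avg_odds",
    -- STEP truncates ROWTIME to the start of each 5-minute tumbling window
    STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '5' MINUTE) AS "window_end"
FROM "SOURCE_SQL_STREAM_001"
GROUP BY
    "event_id",
    STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '5' MINUTE);
```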

Query for 5-minute tumbling windows averaging bet odds

Unfortunately, Kinesis Analytics couldn't directly update the relational database with aggregated odds, so some extra legwork was involved. Options for Kinesis Analytics sinks include Kinesis Streams, Kinesis Firehose, and Lambda. Again, I wanted to stick with containerized development and didn't love the arbitrary limits Lambda imposed, so I decided to use Kinesis Streams with an ECS consumer, a familiar pattern at this point.

Winner: Kinesis Analytics

Events would have to be updated in real-time

To understand the solution here, I again had to figure out where events would be read from. Events had similar requirements to bets, requiring order to be maintained and having potentially high throughput; thus, Kinesis seemed like the obvious answer.

Similarly, the consuming application had the same requirements as the bet consumer: reading in real-time and updating a database. The database would also be a crucial factor here, which I'll discuss in the next requirement, but I knew an ECS consumer could again handle the use case for all the same reasons.

Winner: ECS Fargate with NerdWallet’s Kinesis Consumer

Users would need some sort of type-ahead functionality for searching events

I decided to have a search bar where users could search for games; however, I knew very little about type-ahead, or autocomplete, so, per usual, I read everything I could get my hands on, including lectures on the underlying theory, going so far as to implement a quick POC to fully understand it conceptually.
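A toy trie-based autocomplete in the spirit of that POC (my own illustrative sketch, not the original code):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    """Toy prefix tree supporting insert and prefix suggestions."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def suggest(self, prefix: str, limit: int = 5) -> list:
        # Walk down to the node for the prefix, then DFS for completions
        node = self.root
        for ch in prefix.lower():
            if ch not in node.children:
                return []
            node = node.children[ch]
        results, stack = [], [(node, prefix.lower())]
        while stack and len(results) < limit:
            current, word = stack.pop()
            if current.is_word:
                results.append(word)
            stack.extend((child, word + ch) for ch, child in current.children.items())
        return results


trie = Trie()
for event in ["Packers vs Bears", "Packers vs Lions", "Patriots vs Jets"]:
    trie.insert(event)
print(trie.suggest("pa"))  # all three events match; order not guaranteed
```

Now for the options.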

CloudSearch vs ElasticSearch

Amazon CloudSearch is a managed service that makes it simple to set up search with features such as highlighting, autocomplete, and geospatial search.

Amazon ElasticSearch Service is a fully managed service for running ElasticSearch cost-effectively at scale. It's used for full-text search, structured search, analytics, and more. Amazon's ElasticSearch also comes with Kibana, a feature-rich frontend for data visualization, monitoring, and management, right out of the box.

ElasticSearch and CloudSearch were neck and neck in many categories, including provisioning, management, availability, client libraries, and cost. While ElasticSearch's autocomplete wasn't straightforward, having multiple ways to implement it, ElasticSearch is the most popular tool in the industry for search, and I'd always wanted to get hands-on with it but never had the chance outside a few simple queries at work. It was also extremely simple to set up ElasticSearch and Kibana on Docker for local development.
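As an example of one of those approaches, type-ahead with ElasticSearch's completion suggester and the Python client might look roughly like this (index, field, and document values are illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Illustrative index with a dedicated completion-suggester field
es.indices.create(index="events", body={
    "mappings": {"properties": {
        "name": {"type": "text"},
        "suggest": {"type": "completion"},
    }}
})

es.index(index="events", body={
    "name": "Packers vs Bears",
    "suggest": {"input": ["Packers vs Bears", "Bears", "Packers"]},
})
es.indices.refresh(index="events")

resp = es.search(index="events", body={
    "suggest": {"event-suggest": {
        "prefix": "pack",
        "completion": {"field": "suggest"},
    }}
})
for option in resp["suggest"]["event-suggest"][0]["options"]:
    print(option["_source"]["name"])
```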

Winner: ElasticSearch

Loose Ends

While there’s now a solution in place that fits many of the requirements for the app back end, there’s still a few loose ends to tie up architecturally — the API and the relational database.

Aurora vs RDS MySQL

Ok, so I’ve made a lot of assumptions here. First, I went with MySQL over something like PostgreSQL as it has a lot of community support, is the go-to for scalable web applications, and has a nice interface with MySQL Workbench. PostgreSQL seems to be the go-to for performing high-volume and complex operations, which seemed like overkill here. Secondly, I was deciding between Aurora and RDS as I wanted to stay within the AWS environment and minimize management.

Amazon Aurora Serverless is an on-demand, auto-scaling, MySQL-compatible configuration of Amazon Aurora that scales database capacity up or down based on your application's needs. While it's typically used for infrequent and intermittent workloads, it also suits unpredictable workloads while reducing management overhead and keeping costs low.

While RDS takes a little more management, allows flexibility on releases, supports plugins, and allows for other engines outside of InnoDB, I didn’t see a need for any of these features and decided to go with Amazon Aurora.

Winner: Aurora Serverless

API Gateway & Lambda vs ALB & ECS Fargate

API Gateway is a fully managed service that makes it easy to create, publish, maintain, monitor, and secure APIs at any scale. It supports RESTful APIs and WebSocket APIs and handles tasks such as concurrency, traffic management, CORS support, throttling, etc. It's quite feature-rich, to say the least.

An ALB, or Application Load Balancer, serves as the single point of contact for clients, distributing incoming traffic across multiple targets while monitoring their health.

While using an ALB with ECS allows for Python web application frameworks that simplify development, such as Django and Flask, API Gateway allows for native IAM integration and schema validation. What I didn't love about API Gateway was that individual infrastructure needs to be written for each route, although if it came down to it, I could always proxy groups of routes to Lambda and let some Lambda-friendly frameworks do the heavy lifting, such as a mix of FastAPI with Pydantic and Mangum.
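A minimal sketch of that FastAPI-plus-Mangum combination (the route and model are illustrative):

```python
from fastapi import FastAPI
from mangum import Mangum
from pydantic import BaseModel

app = FastAPI()


class Bet(BaseModel):
    # Pydantic validates the request payload against this schema
    event_id: int
    amount: float
    odds: float


@app.post("/bets")
async def submit_bet(bet: Bet):
    return {"status": "submitted", "bet": bet}


# Mangum adapts the ASGI app to API Gateway's Lambda event format
handler = Mangum(app)
```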

Winner: API Gateway & Lambda

The Back End Architecture

Now that any loose ends are wrapped up, let’s take another look at what the solution looks like:

I won’t speak to much of the application code as it was rather rudimentary, for the most part using a combination of the following tools:

  • Boto3: for AWS calls to DynamoDB, SNS, and Kinesis
  • ElasticSearch Client: for reading from and writing to the type-ahead ES index
  • PyMySQL: for interactions with MySQL on Aurora
  • PyTest: for unit tests
  • GitHub Actions: for publishing Docker images and uploading Lambda zip files to S3

The Infrastructure

The last piece of this puzzle was the IaC, or infrastructure as code. While I was familiar with CloudFormation and Troposphere, I decided to go with Terraform, an open-source IaC tool that provides a consistent CLI workflow for managing cloud services. Terraform had the following benefits over Troposphere and CloudFormation:

  • Provider agnostic
  • Simple declarative configuration files
  • Faster feedback via the Terraform CLI
  • Modularity
  • Popularity and community support

I also knew that in the future I wanted to explore GCP tooling, and I didn't want to have to learn a new IaC framework, so Terraform seemed like a great option. Being entirely new to it, however, I knew it was going to be an uphill battle.

The POC

After reading through all of the Terraform documentation to get a conceptual understanding, I spun up a repo and got hands-on with a POC that would prove out the core concepts and get me comfortable with the framework.

At this point I could modularize my code and set variables per environment, and the CI/CD was in place.

Terraform Cloud Workspace runs console (left) and variables console (right)

Development Flow

At this point I should note how I went about development, as well as how I kept track of my progress. While I again used a GitHub Project board to track features and bugs on the back end, I typically completed a POC of a component first, then set up the infrastructure, then moved on to feature-complete, tracking the status of each component on the whiteboard. Prepare yourself, because while you've seen some beautiful diagrams so far, you're about to see the truth of what I was really working with along the way:

An in-progress look at architecture and development progress

Yes, it wasn’t pretty but it got the job done. Another bit of the process that also wasn’t pretty but got the job done was a lot of manual work in the AWS console to test integration points. I’ll talk more about this in the section of things I would’ve done differently.

That being said, after a couple of months of working through components on the back end, it was finally complete and I was ready to move on to the exchange!

What’s Next
