How we built Finimize Markets — Part 2

Mark Carrington
Finimize Engineering
7 min read · Sep 27, 2022
Finimize Engineering: How we built Markets

In the previous post we talked about the process for deciding how to build the Markets feature. In this post I’ll go deeper into what that design looks like, and the thought process behind it.

Data Storage

Before our jam session, I researched various data stores that might suit the Markets data we’d need to serve. This data comes in various forms, including time series data for pricing and key/value data for company information and fundamental metrics. Some of it could have been modelled in a relational way, but the time series data wouldn’t fit naturally in a relational database.

We also had to take the team size into consideration, which pushed us towards managed solutions over anything self-hosted. It also meant we preferred solutions that people on the team had experience with, and so had a low barrier to entry.

We decided to go with MemoryDB for Redis, a Redis-compatible in-memory store managed by AWS. This is a slightly different offering to ElastiCache for Redis, as it is designed for durability, so you can use it as your primary data store. It uses a distributed transactional log, whereas with ElastiCache (and open-source Redis) the append-only file is only persisted on primary nodes.

The MemoryDB cluster is sharded, and is deployed across multiple Availability Zones (AZs), making it resilient to AZ outages. We also have replica nodes for each shard, to scale read throughput.
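For a sense of what that looks like in infrastructure code, here is a rough sketch of a MemoryDB cluster definition using the CDK L1 construct (the node type, shard counts and names are illustrative rather than our production values):

```typescript
import * as memorydb from "aws-cdk-lib/aws-memorydb";
import { Construct } from "constructs";

// Illustrative MemoryDB cluster: data is partitioned across shards, and each
// shard gets a replica in another AZ to scale reads and survive an AZ outage.
function defineMarketsCluster(scope: Construct, subnetGroupName: string, securityGroupId: string) {
  return new memorydb.CfnCluster(scope, "MarketsCluster", {
    clusterName: "markets-data",      // hypothetical name
    nodeType: "db.r6g.large",         // illustrative sizing
    aclName: "open-access",           // a real deployment would use a scoped ACL
    numShards: 2,
    numReplicasPerShard: 1,
    subnetGroupName,                  // subnet group spanning multiple AZs
    securityGroupIds: [securityGroupId],
    tlsEnabled: true,
  });
}
```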

To store our data we use a few of the Redis data structures, including strings and hashes. For our time series pricing data we use Streams. We unfortunately can’t use the RedisTimeSeries module in MemoryDB, as modules aren’t supported, but for our use cases Streams work really well. At the simplest level a Stream is an append-only data structure where entries are keyed by timestamp. There is a range of richer features that can be leveraged with multiple consumers via consumer groups, but we operate with only a single consumer.

Redis Streams

We also make use of the stream capping capabilities to evict old entries from the stream as we add new data. We actually have two streams for pricing data: an Intraday Stream, which stores prices at five-minute intervals and is capped at 31 days, and an End-of-Day Stream, which stores only closing prices and is capped at 15 years.
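As a minimal sketch of what a capped write can look like (using ioredis, with a hypothetical key scheme), the trim can be applied at write time by passing a MINID threshold so anything older than the retention window is evicted:

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.MEMORYDB_ENDPOINT ?? "redis://localhost:6379");

const INTRADAY_RETENTION_MS = 31 * 24 * 60 * 60 * 1000; // roughly 31 days of entries

export async function appendIntradayPrice(symbol: string, price: number): Promise<void> {
  const minId = Date.now() - INTRADAY_RETENTION_MS;

  // XADD ... MINID ~ <timestamp> appends the new entry and trims entries whose
  // IDs (timestamps) fall before the retention window; '~' makes the trim
  // approximate, which is cheaper for Redis to perform.
  await redis.xadd(
    `prices:intraday:${symbol}`, // hypothetical key naming scheme
    "MINID", "~", String(minId),
    "*",                         // let Redis derive the entry ID from the current time
    "price", String(price),
  );
}
```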

Data Ingestion

The data ingestion processes are probably the most complex part of the project. We needed to pull data from a number of different vendors, using different APIs, each with different rate limiting requirements. The data itself varied in terms of shape (scalar vs time series), and how often it needed to be updated. Some of the ingestion processes also have a series of transform steps prior to loading into our data store.

It was clear we needed to use tools that would help us manage these processes. We favoured tools that had good visibility, were low cost, and could ideally be run and tested locally. In the end we decided to use AWS Step Functions. We have a number of State Machines for our various ingestion processes, which are triggered by EventBridge rules. We defined the State Machines using AWS CDK, which provides a really nice API that lets you express the flow in code very succinctly. We were also able to build and test these locally using LocalStack.
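For example, an intraday pricing State Machine might be kicked off on a fixed schedule roughly like this (the five-minute rate is illustrative; the real rules vary per vendor and dataset):

```typescript
import { Duration } from "aws-cdk-lib";
import * as events from "aws-cdk-lib/aws-events";
import * as targets from "aws-cdk-lib/aws-events-targets";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import { Construct } from "constructs";

// An EventBridge rule that starts the ingestion State Machine every 5 minutes.
function scheduleIngestion(scope: Construct, stateMachine: sfn.StateMachine) {
  new events.Rule(scope, "IntradayIngestionSchedule", {
    schedule: events.Schedule.rate(Duration.minutes(5)),
    targets: [new targets.SfnStateMachine(stateMachine)],
  });
}
```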

AWS Step Functions has some really nice constructs for configuring retries, even down to the exponential backoff delay. It also makes it very simple to manage multiple concurrent branches of execution, which helped ensure we stayed within the rate limits of our external vendors. In the image below you can see an example of a State Machine that runs two concurrent branches, and ensures that each takes at least 10 minutes.

Example AWS Step Functions State Machine
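A hedged sketch of how that shape can be expressed with the CDK (the function names, retry settings and branch count are illustrative rather than our exact definitions):

```typescript
import { Duration } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import * as tasks from "aws-cdk-lib/aws-stepfunctions-tasks";
import { Construct } from "constructs";

// Running an ingestion task in parallel with a 10-minute Wait means the branch
// only completes when both finish, so it always takes at least 10 minutes,
// which is a simple way of staying inside vendor rate limits.
const pacedBranch = (scope: Construct, id: string, fn: lambda.IFunction) => {
  const task = new tasks.LambdaInvoke(scope, `${id}Task`, { lambdaFunction: fn });
  task.addRetry({ maxAttempts: 3, interval: Duration.seconds(10), backoffRate: 2 });

  return new sfn.Parallel(scope, id)
    .branch(task)
    .branch(new sfn.Wait(scope, `${id}MinDuration`, {
      time: sfn.WaitTime.duration(Duration.minutes(10)),
    }));
};

export function buildIngestionStateMachine(
  scope: Construct,
  vendorAFn: lambda.IFunction,
  vendorBFn: lambda.IFunction,
): sfn.StateMachine {
  // Two concurrent branches, each paced to take at least 10 minutes.
  const definition = new sfn.Parallel(scope, "IngestPricing")
    .branch(pacedBranch(scope, "VendorA", vendorAFn))
    .branch(pacedBranch(scope, "VendorB", vendorBFn));

  return new sfn.StateMachine(scope, "PricingIngestion", {
    definitionBody: sfn.DefinitionBody.fromChainable(definition),
  });
}
```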

AWS Step Functions has generally been a very good fit for our use case, but there were some limitations that we needed to work around. We found the maximum task output size of 256 KB quite limiting. In most cases the first step in our ingestion processes was to run a map job to output all the security identifiers, which easily exceeded this. So we decided to use an approach that we called ‘state pointers’, where instead of passing the task output directly through the State Machine, we would first write the state to S3, then pass references to those S3 objects between tasks. Although this hid the internal state a little, we felt this was a good trade-off given that we could increase the amount of data we could handle by over 100x. To make it easier for Lambda tasks to work with this S3-based state, we created a series of helper functions that let us read and write state in a consistent way.
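A minimal sketch of what those helpers can look like (the bucket name and key scheme are hypothetical), using the AWS SDK v3 S3 client:

```typescript
import { GetObjectCommand, PutObjectCommand, S3Client } from "@aws-sdk/client-s3";
import { randomUUID } from "node:crypto";

const s3 = new S3Client({});
const STATE_BUCKET = "markets-ingestion-state"; // hypothetical bucket name

interface StatePointer {
  bucket: string;
  key: string;
}

// Write the full task output to S3 and return a small pointer that fits
// comfortably inside the Step Functions payload limit.
export async function writeState<T>(executionId: string, state: T): Promise<StatePointer> {
  const key = `${executionId}/${randomUUID()}.json`;
  await s3.send(new PutObjectCommand({
    Bucket: STATE_BUCKET,
    Key: key,
    Body: JSON.stringify(state),
    ContentType: "application/json",
  }));
  return { bucket: STATE_BUCKET, key };
}

// Resolve a pointer received as task input back into the original state.
export async function readState<T>(pointer: StatePointer): Promise<T> {
  const result = await s3.send(new GetObjectCommand({ Bucket: pointer.bucket, Key: pointer.key }));
  const body = await result.Body!.transformToString();
  return JSON.parse(body) as T;
}
```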

New Microservice for Markets

We decided that the next version of Markets should be built as a microservice. The main reason was so that we could build it on a technology stack that we could more easily scale. At Finimize we have a team of experienced TypeScript and NodeJS developers, and we found that concurrency model and tooling much more appealing than our existing Python (Django) stack.

We decided to go with a serverless-first approach, and built this service using AWS Lambda for our compute tier. Each of these Lambda functions is small and well defined, following the single-responsibility principle. Building this way allowed us to deliver quickly, using tools we were already familiar with, including AWS CDK for writing architecture as code.
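For illustration, an individual function definition looks something like this (the handler path and sizing are hypothetical):

```typescript
import { Duration } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { NodejsFunction } from "aws-cdk-lib/aws-lambda-nodejs";
import { Construct } from "constructs";

// One narrowly scoped function per responsibility, e.g. serving the price
// history for a single security.
function definePriceHistoryFunction(scope: Construct) {
  return new NodejsFunction(scope, "GetPriceHistory", {
    entry: "src/handlers/get-price-history.ts", // hypothetical handler path
    handler: "handler",
    runtime: lambda.Runtime.NODEJS_16_X,
    memorySize: 512,
    timeout: Duration.seconds(10),
  });
}
```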

In order to minimise the latency for these Lambda functions we make use of provisioned concurrency, to keep a baseline number of warm functions ready to go, limiting the impact of the cold-start problem. We also use the auto-scaling features of Lambda, which scale the number of warm instances up (and down) as load patterns change.
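In CDK terms that is roughly the following (the concurrency numbers are illustrative, and `quotesFn` stands in for one of the service's functions):

```typescript
import { Stack } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";

declare const stack: Stack;
declare const quotesFn: lambda.Function; // one of the service's Lambda functions

// Publish the current version behind an alias with a baseline of warm instances.
const alias = new lambda.Alias(stack, "QuotesLiveAlias", {
  aliasName: "live",
  version: quotesFn.currentVersion,
  provisionedConcurrentExecutions: 5,
});

// Application Auto Scaling then adjusts provisioned concurrency with load.
const scaling = alias.addAutoScaling({ minCapacity: 5, maxCapacity: 50 });
scaling.scaleOnUtilization({ utilizationTarget: 0.7 });
```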

This microservice has both a Staging and a Production deployment, each within its own AWS account. In order to communicate with existing backend applications, we set up VPC peering between each stage of the core application and the corresponding stage of the new service. This physical account separation is a security best practice that we decided to adopt out of the gate. We use an internal Application Load Balancer to front the service, which is only accessible from the VPC of the relevant stage of the backend application.

Given that we can’t actually access this new service from our local machines, we also worked on a mock service that we can run locally. This simple tool returns fixture data for each of the endpoints, which we can customise locally if required.

gRPC based communication

For communication with our new Markets Microservice, we considered both REST and gRPC as potential candidates. Whilst the simplicity of REST was attractive, we felt that gRPC had more appeal in terms of being able to define a strongly typed service contract using Protocol Buffers and native code generation.

Given that we decided to start with Lambda functions to run our computation, we can’t yet benefit from the performance of a true gRPC server; right now we only support unary requests over HTTP. But when we decide to move on to less ephemeral compute, we will be able to realise the performance gains of HTTP/2.
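As a rough sketch of what a unary call looks like today, assuming ts-proto-generated message types (GetQuoteRequest and GetQuoteResponse are hypothetical names) and the ALB Lambda target event shape:

```typescript
import type { ALBEvent, ALBResult } from "aws-lambda";
// Hypothetical types generated from the .proto service contract with ts-proto.
import { GetQuoteRequest, GetQuoteResponse } from "./generated/markets";

// A unary gRPC-style call carried over plain HTTP: decode the protobuf request
// body, do the work, and return a protobuf-encoded response.
export async function handler(event: ALBEvent): Promise<ALBResult> {
  const payload = Buffer.from(event.body ?? "", event.isBase64Encoded ? "base64" : "utf-8");
  const request = GetQuoteRequest.decode(payload);

  // ...look up the latest price for request.symbol from MemoryDB (omitted)...
  const response = { symbol: request.symbol, price: 0 }; // placeholder value

  return {
    statusCode: 200,
    isBase64Encoded: true,
    headers: { "content-type": "application/x-protobuf" },
    body: Buffer.from(GetQuoteResponse.encode(response).finish()).toString("base64"),
  };
}
```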

Data Fetching from our Django Application

Our existing backend Django application provides a GraphQL API to communicate with our mobile app, which we’ve built using Graphene. There are a number of different GraphQL nodes which return data from our new Financial Data service by making API calls from their resolver functions. Initially this meant we made those API requests serially, as each resolver function was called. Whilst the API requests are each reasonably quick, the cumulative time really added up.

We wanted to see if we could run these requests in parallel to keep the overall response time fast. Although Django offers support for async views, Graphene doesn’t yet, so we are not able to take advantage of asyncio, which could otherwise have been a good choice here. Instead we looked into making these requests via a ThreadPoolExecutor, which has a nice high-level API for threading and suits this particular case really well.

The challenge then was to coalesce all of these API requests in one place, so we could submit them together via the ThreadPoolExecutor. Fortunately Graphene provides the DataLoader utility, which helps us do just that. The purpose of a DataLoader is to prevent the N+1 problem, where overlapping data is fetched from multiple resolver functions. The DataLoader provides an API where resolvers can request IDs, and those IDs are then fetched as a single batch. In our case the IDs are encoded strings representing the data type we want to fetch from the new service, along with any relevant arguments.

The Solution

Architecture Diagram for Markets microservice

Keen on partnering with Finimize? Finimize is a financial insights platform that supports the most engaged investor community in the world.

We’ve helped partners ranging from fintech disruptors to traditional financial institutions with growth and engagement: https://www.finimize.com/wp/partners/

👉🏻 Get in touch https://business.finimize.com
