Release the Kraken!

Fiona Tang
Published in Street Group
Jul 30, 2021

The dream team

The team is led by Chris, our Head of Data. Ed and Jemma are officially our Data Entry Associates (or, more coolly, our Data Investigators), and Jack and I are the Data Engineers. We’re looking for more people to grow the team, so do keep an eye on our job board!

What have we been up to?

Here in the Data team, we’ve been working on a project with the code name Kraken. So what is Kraken? Kraken is an API that is intended to be used internally to power the different products at Street Group. We’re also in the process of migrating an existing service, built on a different technology stack, over to Kraken so that it’s more maintainable and, as an added bonus, will eventually be cheaper to run!

What was the design approach?

We took a Design First approach with the API and used Stoplight to help us achieve this. Stoplight has been useful for ensuring that our API conforms to the OpenAPI specification and for providing documentation to users of the API.

Within the team, we collaborated on ideas for what we could include in each of the API endpoints, and we worked with our co-founder, Tom, to prioritise development depending on how useful the data would be for our customers and how difficult it was to acquire.

What technology stack do we use?

We have taken a multi-cloud approach: our data ingestion and transform pipelines run on GCP, and we serve the API on AWS to keep things close to other Street Group products.

Python is our primary language; we’ve used it to build our ingestion and transform pipelines in Apache Beam, which are executed on Dataflow. Dataflow has been incredible for loading in large amounts of data in an impressively short time.
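
To give a flavour of what those pipelines look like, here’s a minimal Beam sketch in the same spirit; the bucket, project, table name and fields are placeholders rather than the real Kraken pipeline.

```python
# A minimal sketch of an ingestion/transform pipeline run on Dataflow.
# The bucket, project, table and fields are placeholders, not the real Kraken pipeline.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def to_bq_row(line):
    """Parse one raw JSON record and keep only the fields we want to load."""
    record = json.loads(line)
    return {"id": record["id"], "postcode": record.get("postcode")}


options = PipelineOptions(
    runner="DataflowRunner",
    project="example-project",
    region="europe-west2",
    temp_location="gs://example-bucket/tmp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read raw files" >> beam.io.ReadFromText("gs://example-bucket/raw/*.json")
        | "Parse and trim" >> beam.Map(to_bq_row)
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            "example-project:kraken.properties",
            schema="id:STRING, postcode:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```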

Ingested and transformed data is stored in BigQuery. Having a view of the data in BigQuery has also helped us analyse the data, highlight any potential data issues and build our pipelines more efficiently.
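
As an example of the kind of ad-hoc check this enables, a query like the one below lets us spot gaps in the data quickly; the table and column names are invented for illustration.

```python
# An illustrative data-quality check against BigQuery; the dataset, table and
# column names below are made up for the example.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

query = """
    SELECT COUNTIF(postcode IS NULL) AS missing_postcodes,
           COUNT(*) AS total_rows
    FROM `example-project.kraken.properties`
"""

for row in client.query(query).result():
    print(f"{row.missing_postcodes} of {row.total_rows} rows have no postcode")
```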

Data that has been transformed for use by the API is stored in Cloud Storage in JSON format, ready to be copied over to AWS S3.
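
The hand-off itself is conceptually simple; a stripped-down sketch, assuming placeholder bucket names and prefixes, might look like this.

```python
# A stripped-down sketch of shipping transformed JSON from Cloud Storage to S3.
# Bucket names and the prefix are placeholders.
import boto3
from google.cloud import storage

gcs = storage.Client(project="example-project")
s3 = boto3.client("s3")

for blob in gcs.list_blobs("example-gcs-bucket", prefix="kraken/export/"):
    # Stream each JSON file out of GCS and into the S3 staging bucket,
    # keeping the same object key so the Lambdas know where to look.
    s3.put_object(
        Bucket="example-s3-bucket",
        Key=blob.name,
        Body=blob.download_as_bytes(),
    )
```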

Lambda functions are used to pick up the files and load the data into either DynamoDB or CloudSearch. DynamoDB is used for any endpoint that only requires a key lookup, whereas CloudSearch is useful for endpoints that require an area lookup. We also use Lambdas, also written in Python, to return the response for each API endpoint. Finally, we use API Gateway to route requests to the appropriate Lambda function, which returns the response.
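
A hedged sketch of one of the key-lookup Lambdas is below; the table name, key field and response shape are illustrative rather than the real Kraken schema.

```python
# A hedged sketch of a key-lookup endpoint Lambda; the table name, key field
# and response shape are illustrative rather than the real Kraken schema.
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("kraken-properties")


def handler(event, context):
    # API Gateway passes the path parameters through on the event.
    property_id = event["pathParameters"]["id"]

    result = table.get_item(Key={"id": property_id})
    item = result.get("Item")

    if item is None:
        return {"statusCode": 404, "body": json.dumps({"message": "Not found"})}

    # default=str handles DynamoDB's Decimal numbers when serialising.
    return {"statusCode": 200, "body": json.dumps(item, default=str)}
```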

We’ve fully embraced Infrastructure as Code: we’re using Terraform for all GCP resources and AWS SAM for anything we have in AWS, to ensure we have consistent environments for development and production. CircleCI is our continuous integration and deployment tool, ensuring the underlying code is also consistent across our environments.

The process above has been automated as much as possible with Apache Airflow, which is used to detect whether we have new data available to load into BigQuery using Dataflow. We use Airflow DAGs to manage everything from the Dataflow pipeline dependencies all the way to shifting the data to AWS for the Lambda functions to pick up.
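
A much simplified sketch of how such a DAG could hang together is below; the sensor target, task names and commands are illustrative, and the exact operators depend on your Airflow provider versions.

```python
# A much simplified Airflow DAG; sensor targets, task names and commands are
# illustrative, and the exact operators depend on your provider versions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

with DAG(
    dag_id="kraken_example",
    start_date=datetime(2021, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Wait until the latest raw extract has landed in Cloud Storage.
    wait_for_new_data = GCSObjectExistenceSensor(
        task_id="wait_for_new_data",
        bucket="example-gcs-bucket",
        object="raw/{{ ds }}/extract.json",
    )

    # Kick off the Beam pipeline on Dataflow (simplified to a CLI call here).
    run_dataflow_pipeline = BashOperator(
        task_id="run_dataflow_pipeline",
        bash_command="python ingest_pipeline.py --runner DataflowRunner",
    )

    # Ship the transformed JSON over to S3 for the Lambdas to pick up.
    copy_to_s3 = BashOperator(
        task_id="copy_to_s3",
        bash_command="python copy_gcs_to_s3.py",
    )

    wait_for_new_data >> run_dataflow_pipeline >> copy_to_s3
```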

What technical challenges did we encounter?

We work with some pretty big datasets; one of the biggest we have is about 80GB. Since Lambdas have a timeout limit of 15 minutes, a single invocation is unable to process all the data before timing out. We tried different approaches to solve this problem, including drip-feeding the data for the Lambda to process using SQS, and having the Lambda call itself again just before it times out to process the remaining data.
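
The “call itself again” idea looks roughly like the sketch below; the event shape and the per-file loader are hypothetical, and the real handler obviously does more than this.

```python
# A simplified sketch of the "call yourself again before timing out" approach.
# The event shape and the per-file loader are hypothetical.
import json

import boto3

lambda_client = boto3.client("lambda")


def process_one_file(key):
    """Placeholder for loading one file's records into DynamoDB/CloudSearch."""
    ...


def handler(event, context):
    keys = event["remaining_keys"]

    while keys:
        # Leave ourselves a safety margin well before the 15-minute limit.
        if context.get_remaining_time_in_millis() < 60_000:
            lambda_client.invoke(
                FunctionName=context.function_name,
                InvocationType="Event",  # asynchronous, fire-and-forget
                Payload=json.dumps({"remaining_keys": keys}),
            )
            return {"status": "continued", "remaining": len(keys)}

        process_one_file(keys.pop(0))

    return {"status": "done"}
```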

Another challenge we faced was trying to return entities within a certain radius of a location provided as input, which was required by some of our endpoints. Initially we trialled a Python library to work out which entities were within a radius using the data stored in DynamoDB; however, we noticed it was behaving strangely: the number of entities returned would decrease after hitting a certain radius. We spent some time seeing if we could resolve the issue using the library, but agreed that it would potentially end up being a rewrite. Eventually we discovered CloudSearch, which was perfect for the job, and introduced it to the tech stack.
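
For a sense of what that lookup can look like, here’s a hedged sketch against a CloudSearch domain: the endpoint, field names and coordinates are made up, the filter query is a rough bounding box around the point, and the haversin expression gives the distance used for sorting.

```python
# A hedged sketch of an area lookup against CloudSearch; the endpoint, field
# names and coordinates are made up. The filter query is a rough bounding box
# around the point, and the haversin expression gives the distance for sorting.
import boto3

search = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-example-domain.eu-west-1.cloudsearch.amazonaws.com",
)

lat, lon = 53.4808, -2.2426  # centre of the search, e.g. Manchester

response = search.search(
    query="matchall",
    queryParser="structured",
    filterQuery="location:['53.50,-2.28','53.46,-2.21']",  # upper-left, lower-right corners
    expr=f'{{"distance":"haversin({lat},{lon},location.latitude,location.longitude)"}}',
    sort="distance asc",
    size=50,
)

for hit in response["hits"]["hit"]:
    print(hit["id"])
```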

What are the upcoming plans?

We’re currently in the process of migrating old services to use the new Kraken endpoints; this will simplify maintenance and also help us reduce the costs of having lots of services doing very similar things.

We’re also hoping to improve the data we currently have, potentially by building some machine learning models to fill in the gaps, as well as to highlight any anomalies, such as data issues from our sources.

If this sounds interesting to you, then find out more about our team here.

To find out more about life at Street Group, visit streetgroup.co.uk, check out our Glassdoor page, or head over to our careers site.
