Refreshing the Kraken

Chris Amor
Street Group
Dec 15, 2022

So often with tech solutions, the right answer is a point in time and space around which requirements, technologies and capabilities swirl (a tenuous link to the amazing Carina Nebula captured by the JWST).

Despite building our Property Data API (Kraken) only 18 months ago (see our blog post from the time), we began to feel some of our design decisions weren’t serving us or our customers as well as we would like.

In particular, our left-to-right push model of data updates, managing a complicated chain of dependencies across 50+ source data sets, meant we were often fighting against the system: the path of events needed to keep our API data consistent and up to date was difficult to reason about, compounded by our multi-cloud approach, which required data to be packaged and shipped across clouds between data ingestion and preparation and API serving.

Additionally, limitations of the NoSQL database and search solutions (DynamoDB and CloudSearch) around query flexibility and geographic search meant we often relied on the application layer to filter and sort data, which hurt performance and split business logic across the stack, again making data lineage and logic hard to reason about. Given our customers’ demand for ever more capability and performance to explore our data at scale, it was clear we would need to tackle some of these challenges.

So, as a picture paints a thousand words, here’s the before and after architecture…

Quite a change! But in brief the main drivers and rationale for the choices we made were:

  1. Minimise the movement of data — Our initial plan, supported by early performance testing, was to serve API requests directly from BigQuery using the in-memory BI Engine, giving us a single ingestion, transformation and serving layer. However, we found that as the data size grew, query performance dropped to a level where it no longer met our performance goals (see the final point). After exploring other NoSQL solutions we settled on good old Postgres, with acceptable performance characteristics, excellent geospatial support through the PostGIS extension, and the full flexibility of a SQL-compliant database to support extensive filtering, sorting and aggregation requirements. Whilst we could have implemented a very similar architecture on AWS, the simplicity and performance of BigQuery and Dataflow made GCP the more obvious choice.
  2. Improve data quality — Our V1 data model involved a number of aggregated and transformed layers through which data was pushed on a left-to-right path (e.g. source data updates were propagated through dependency chains, orchestrated by Airflow, to their target tables). This led to complex data lineage, with transformation logic present at each step and the potential for inconsistent end-state data where jobs partially failed. For V2 we flattened the data model and moved to a right-to-left consolidated ELT (i.e. base data tables transformed into the final presentation tables by a single in-database SQL procedure), with the presentation tables used by the API treated as ephemeral and recreated in full from base data on a schedule (a sketch of this rebuild follows the list). This removes the need to manage dependency chains and incremental update logic, improving data quality and simplifying management.
  3. Sub-second geographic area searches — Our V1 architecture used AWS CloudSearch for geospatial search, which supports only point-radius and bounding-box queries (top-left and bottom-right lat/longs), so any more complex geospatial search had to be implemented in the application layer, making some area searches slow. Postgres with the PostGIS extension supports a wide range of geospatial filtering options and, with geospatial indexes, gives us our target of sub-second response times across a range of area searches (see the query sketch after this list).
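
To make that right-to-left rebuild a little more concrete, here’s a minimal sketch of the pattern in Python with psycopg2. The table and column names (base_properties, base_sales, presentation_properties) are illustrative rather than our real schema; the point is that the presentation table the API reads is rebuilt in full from base data inside a single transaction and swapped in atomically, so there is no incremental update logic to get wrong.

```python
# Sketch: rebuild an ephemeral presentation table from base data in one transaction.
# Table and column names are illustrative, not our real schema.
import psycopg2

REBUILD_SQL = """
-- Build the new presentation table from the base tables in a single pass,
-- keeping the most recent sale per property.
CREATE TABLE presentation_properties_new AS
SELECT DISTINCT ON (p.property_id)
    p.property_id,
    p.address,
    p.postcode,
    ST_SetSRID(ST_MakePoint(p.longitude, p.latitude), 4326) AS geom,
    s.sale_date  AS last_sale_date,
    s.sale_price AS last_sale_price
FROM base_properties p
LEFT JOIN base_sales s USING (property_id)
ORDER BY p.property_id, s.sale_date DESC NULLS LAST;

-- Spatial index so area searches can use the geometry column efficiently.
CREATE INDEX ON presentation_properties_new USING GIST (geom);

-- Swap the new table in; the API never sees a half-built table.
DROP TABLE IF EXISTS presentation_properties;
ALTER TABLE presentation_properties_new RENAME TO presentation_properties;
"""

def rebuild_presentation_tables(dsn: str) -> None:
    """Run the full rebuild on a schedule rather than applying incremental updates."""
    conn = psycopg2.connect(dsn)
    try:
        with conn:                      # one transaction: commit on success, rollback on error
            with conn.cursor() as cur:
                cur.execute(REBUILD_SQL)
    finally:
        conn.close()

if __name__ == "__main__":
    rebuild_presentation_tables("postgresql://user:password@localhost:5432/kraken")
```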
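
And to illustrate the kind of query that drove the Postgres / PostGIS decision, here’s a sketch of an area search: an arbitrary polygon filter combined with plain SQL filtering, sorting and a limit, all answered in the database against the GiST index created above. Again, the table, columns and helper function are hypothetical.

```python
# Sketch: the kind of area search CloudSearch couldn't express but PostGIS handles
# directly — an arbitrary polygon, not just a point radius or bounding box.
AREA_SEARCH_SQL = """
SELECT property_id, address, last_sale_price
FROM presentation_properties
WHERE ST_Within(
        geom,
        ST_SetSRID(ST_GeomFromGeoJSON(%(area_geojson)s), 4326)
      )
  AND last_sale_price BETWEEN %(min_price)s AND %(max_price)s
ORDER BY last_sale_date DESC NULLS LAST
LIMIT 100;
"""

def search_area(conn, area_geojson: str, min_price: int, max_price: int):
    """Run a polygon area search; filtering and sorting stay in the database."""
    with conn.cursor() as cur:
        cur.execute(AREA_SEARCH_SQL, {
            "area_geojson": area_geojson,
            "min_price": min_price,
            "max_price": max_price,
        })
        return cur.fetchall()
```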

Along the way there were some other notable opportunities we took to further improve performance / data quality:

  • Source data validation — Validating source data with Great Expectations to catch errors before they can propagate to the API. Each source data update is staged and validated against a range of “expectations” before being loaded into our base data sets (a minimal example follows this list).
  • API framework — Our decoupled V1 architecture, with a separate AWS Lambda function for each endpoint, was simple from a performance perspective but made code re-use hard and gave us none of the benefits of an API framework. We chose to implement our API as a single containerised FastAPI application, using the power of the framework to remove a lot of boilerplate and share code, and to run it on Cloud Run, keeping the serverless benefits we had with Lambda: multiple revisions, fast horizontal scaling and scale to zero (well, one). A skeleton of the approach follows this list.
  • Apigee Gateway — A late addition to the party. Our intention had been to use GCP’s API Gateway product but, whilst functional, its limited feature set, manual key management and the need to structure keys around GCP projects weren’t ideal. Although it comes with a cost, the developer portal, user-managed keys, observability, rate limiting and API product composition offered by Apigee give us an API gateway that will support our broader Street Group API strategy, manage risk and make it easy for internal and third-party developers to integrate with our APIs. Expensive, but in our case well worth it.
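
For the source validation step, here’s roughly what a staged update check looks like with Great Expectations. This is a minimal sketch: the file, columns and expectations are illustrative rather than our real suite, and the exact API differs between Great Expectations versions (this uses the older pandas-backed interface).

```python
# Sketch: validate a staged source update before loading it to base tables.
# File, columns and expectations are illustrative only.
import great_expectations as ge
import pandas as pd

staged = pd.read_csv("staged_source_update.csv")   # hypothetical staged extract
ge_df = ge.from_pandas(staged)

ge_df.expect_column_values_to_not_be_null("property_id")
ge_df.expect_column_values_to_be_unique("property_id")
ge_df.expect_column_values_to_be_between("sale_price", min_value=1, max_value=100_000_000)
ge_df.expect_column_values_to_match_regex("postcode", r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")

result = ge_df.validate()
if not result.success:
    raise ValueError("Source update failed validation; not loading to base data")
```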
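
And a skeleton of the single FastAPI application. The routes, parameters and in-memory data here are purely illustrative; the real service queries Postgres, and the container simply runs uvicorn as its entrypoint on Cloud Run.

```python
# Sketch: one containerised FastAPI app replacing one Lambda per endpoint.
# Routes, parameters and data are illustrative only; the real service pushes
# filtering and sorting down to Postgres.
from typing import Optional

from fastapi import FastAPI, HTTPException, Query

app = FastAPI(title="Property Data API")

# Hypothetical stand-in for the Postgres-backed presentation table.
PROPERTIES = {
    "prop-123": {"property_id": "prop-123", "postcode": "M1 1AA", "last_sale_price": 250_000},
}

# Declared before the parameterised route so "/properties/search" isn't
# captured by "/properties/{property_id}".
@app.get("/properties/search")
def search_properties(
    postcode: Optional[str] = Query(default=None),
    min_price: int = Query(default=0, ge=0),
    max_price: int = Query(default=100_000_000, ge=0),
    limit: int = Query(default=100, ge=1, le=1000),
):
    results = [
        p for p in PROPERTIES.values()
        if (postcode is None or p["postcode"].startswith(postcode))
        and min_price <= p["last_sale_price"] <= max_price
    ]
    return results[:limit]

@app.get("/properties/{property_id}")
def get_property(property_id: str):
    prop = PROPERTIES.get(property_id)
    if prop is None:
        raise HTTPException(status_code=404, detail="Property not found")
    return prop
```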

We expect to continually evolve, and we already have a number of ideas and challenges to tackle. First on the list is a single database solution for both ELT and API serving, removing the duplication of data between BigQuery and Postgres (loading into Postgres is slow), so I’m sure we’ll be back with a follow-up post soon!

To find out more about life at Street Group, follow us on LinkedIn, see what our team are saying on Glassdoor, or visit our careers site.


Chris Amor
Street Group

Long-time data geek, passionate about the power of data to build knowledge-rich organisations.