Employ a lock policy with lock fairness for micro-batch processing

A banner that says the FIFO locking system is the way to go.
Image by the author


A distributed lock policy that enforces lock fairness is vital for handling late-arriving and unique records in a micro-batch architecture. This article explores the design implications of building a first-in first-out (FIFO) locking mechanism to prevent process starvation.
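To make the fairness idea concrete, here is a minimal sketch of a FIFO "ticket" lock in Python. It is an in-process illustration only, not the distributed lock the article describes. The class and method names (`FifoLock`, `take_ticket`, `wait_turn`) are hypothetical and chosen for this example. Each caller takes a numbered ticket on arrival and is served strictly in ticket order, so no waiter can be starved by later arrivals.

```python
import threading


class FifoLock:
    """In-process ticket lock: callers are served strictly in arrival order."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0  # next ticket number to hand out
        self._now_serving = 0  # ticket number currently allowed to hold the lock

    def take_ticket(self):
        """Join the queue and return this caller's position in it."""
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            return ticket

    def wait_turn(self, ticket):
        """Block until every earlier ticket has been released."""
        with self._cond:
            while ticket != self._now_serving:
                self._cond.wait()

    def acquire(self):
        ticket = self.take_ticket()
        self.wait_turn(ticket)
        return ticket

    def release(self):
        """Hand the lock to the next ticket in line."""
        with self._cond:
            self._now_serving += 1
            self._cond.notify_all()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc):
        self.release()
        return False
```

A worker would wrap its micro-batch in `with lock:`. A distributed version would need the ticket counters to live in a shared store and would have to handle crashed lock holders, which this sketch does not attempt.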

Bluecore’s analytics pipeline reads from an append-only data lake and writes to PostgreSQL. If an email is opened in January and then again in February, we need to recognize that the February open isn’t unique because the email was already opened.

This kind of infinite retention necessitates streaming records into a buffer and updating our unique counts in micro-batches. Recent events…

42 — the answer to the ultimate question of life, the universe, and everything. It is also the number of books I read this year. Coincidence? I think not.

screenshot taken from goodreads.com

Quarantine sucks, but at least I had more reading time! This year I explored more non-fiction and tech-related books than I have in the past. My main categories were tech, personal growth, non-fiction, literature, and sci-fi (of course). To see all of the books I read, find me on Goodreads!

8 Memorable Books

1. War & Peace by Leo Tolstoy

A classic that I can’t believe I had never read. I COULD NOT put this down. I know it’s long, but…

made by me

Deciding which database to use for a service requires some understanding of how different databases store data, especially at scale. Choosing between a database that stores data as rows or columns, for example, has a big impact on the database performance. For this reason, certain access patterns are better suited for row oriented databases (row-stores), and others for column oriented databases (columnar-stores). At Bluecore, our production infrastructure for analytics includes both Google CloudSQL (row-store) and BigQuery (columnar-store) in our analytics pipeline. We carefully chose these databases after weighing the tradeoffs of each and analyzing our access patterns. …
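The difference between the two layouts can be sketched with plain Python data structures. This is a toy illustration, not how CloudSQL or BigQuery store data internally, and the field names (`email_id`, `opens`, `clicks`) are made up for the example. A row layout keeps each record together, which suits fetching one whole record. A column layout keeps each field together, which suits aggregating one field across many records.

```python
# Row-oriented layout: one dict per record, all of a record's fields together.
rows = [
    {"email_id": 1, "opens": 3, "clicks": 1},
    {"email_id": 2, "opens": 0, "clicks": 0},
    {"email_id": 3, "opens": 5, "clicks": 2},
]


def to_columns(records):
    """Pivot row-oriented records into a column-oriented layout."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def total_opens_from_rows(records):
    # Touches every record, including fields the query does not need.
    return sum(record["opens"] for record in records)


def total_opens_from_columns(columns):
    # Touches only the single column the query needs.
    return sum(columns["opens"])


def fetch_record_from_columns(columns, index):
    # Rebuilding one record means reading from every column.
    return {field: values[index] for field, values in columns.items()}


columns = to_columns(rows)
```

Both layouts return the same answers. What changes is how much unrelated data each access pattern has to read along the way.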

We start with databases, then venture into how the physical components of a computer store data and how those components differ. Knowing how a particular database stores data is important for understanding the performance of that database and weighing tradeoffs between databases.

drawn by me

Have you ever started learning about a topic and then you’re suddenly 3 hours and 500 tabs in? That’s what this is, but make it a blog post. I’m going to start with a topic I am interested in: databases. This topic is, uh, pretty huge. Therefore, the focus will be on a subset of…

Along with knowing how to use Airflow, it is also important to know when to use it.

Airflow is a popular tool used for managing and monitoring workflows. It works well for most of our data science workflows at Bluecore, but there are some use cases where other tools perform better. Along with knowing how to use Airflow, it is also important to know when to use it.

About Airflow

“Airflow is a platform to programmatically author, schedule and monitor workflows.” — Airflow documentation

Sounds pretty useful, right? Well, it is! Airflow makes it easy to monitor the state of a pipeline in their UI, and you can build DAGs with complex fan-in and fan-out relationships between tasks. …
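Fan-out and fan-in can be illustrated without Airflow itself. The sketch below uses only Python's standard-library `graphlib` (Python 3.9+), not Airflow's API. The task names and the `execution_waves` helper are hypothetical. One task fans out to two independent tasks, which then fan in to a final task, and the helper groups tasks into "waves" that could run in parallel.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical task names).
dependencies = {
    "transform_a": {"extract"},               # fan-out: both transforms
    "transform_b": {"extract"},               # depend on the same upstream task
    "load": {"transform_a", "transform_b"},   # fan-in: load waits for both
}


def execution_waves(deps):
    """Group tasks into waves; tasks in one wave have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock downstream tasks
    return waves


waves = execution_waves(dependencies)
```

Here `waves` comes out as three groups: the extract task alone, the two transforms together, and then the load. A scheduler like Airflow layers retries, scheduling, and monitoring on top of this kind of dependency ordering.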

In Airflow: how and when to use it, we discussed the basic components of Airflow and how to build a DAG. Once you start building a DAG, you will notice that it gets complicated quickly. Choosing operators and setting up the DAG structure takes some time. As we continue to grow and scale at Bluecore, we continue to search for solutions to better create and scale our data pipelines. Here are some of the lessons we’ve learned along the way.

Operators: Advanced

Moving on from the basic concepts, we will discuss some more practical uses of Airflow’s operators. At Bluecore, we developed…

This article is co-authored by my great tech lead Mike Hurwitz.

From the beginning, Bluecore has run on the Google Cloud Platform (GCP). When it came time to build a high-performance service, it was only natural that we looked to leverage the GCP infrastructure, rather than building it ourselves. By using Kubernetes (GKE), Redis (MemoryStore), Bigtable, and other GCP technologies, we were able to build a service with enough bandwidth and low enough latency to send tens of thousands of personalized recommendations per second.

What is a product recommendation?

You’ve most likely received an email tempting you to buy a product you left in your online shopping cart, and…

Bluecore’s Data Science team uses Airflow for our model workflows. Until recently, our Airflow pods used a Cloud SQL proxy as a sidecar container. The Cloud SQL connection gives Airflow access to its database, where we can get information about the state of a task or XComs, for example. Google recently began allowing users to connect to Cloud SQL using a VPC. Because of this, we decided to remove the proxy and implement a private connection to be more secure and save resources. …

I’m definitely a sci-fi fan, but this is the first book I’ve read by Philip K. Dick. When a few people recommended Ubik, I decided to check it out. At ~230 pages, it is a pretty quick read. Skip “Thoughts/Philosophies/WTF is going on??!” if you don’t want spoilers.

TL;DR mind == blown.

from https://en.wikipedia.org/wiki/Ubik

Quick who is who

If you haven’t read the book, the basic idea is that Glenn Runciter owns Runciter Associates, a psychic agency that protects other organizations from bad psychics. It is like a futuristic cyber-security organization. Ray Hollis is in charge of the bad psis, and Runciter is suspicious that Hollis…

So, You’ve Made a Kubernetes Job Operator in Airflow…

At Bluecore, we rely on our Kubernetes Operator, or KubernetesJobOperator, to execute workflows via DAGs (Directed Acyclic Graphs) in Airflow. For example, our data science models generate product recommendations utilizing our Kubernetes Operator. We previously published multiple blog posts detailing our use of Airflow and our reasons for creating our own Kubernetes Operator (see here and here). Airflow is an essential part of our infrastructure, so it’s important that it is easily accessible to a variety of users, including data scientists, analysts, and engineers. An important part of that usability is to understand the current state of the DAGs, to…

Alexa Griffith

Software Engineer at Bluecore
