I’m currently working with an indoor sensor startup, Smooth Sensors, and one of our challenges so far has been bespoke data analysis: when a customer asks us to tell them something interesting from their data that isn’t part of our standard reporting. The main reason this is a problem is that our sensor platform serialises data in a highly compact format to minimise data transfer, and the data is stored in our database without being deserialised. To perform any analysis, including simple daily means and ranges of values, we go through the painful process of first extracting the data of interest locally, then deserialising it, and finally analysing it.
Obviously this isn’t ideal, and as an interim measure, short of a massive system rewrite, we decided to build a basic data lake. Being a startup, the key factor here is cost: we simply don’t have the funds for many of the technologies we’d like to use, because most of them incur substantial upfront costs. The advantage of a data lake is that storage is cheap, and compute only costs us when we need it.
Building a basic data pipeline
We quickly settled on S3 as a good storage medium: we can pack a year’s worth of our data into AWS S3 for the low cost of $2 USD per month, even though we record over five million data points each day! We also already use AWS for the core platform, so while we might get started faster with something like BigQuery, we’d prefer to keep everything in one place for now.
Our main database is hosted on AWS Aurora, so our first step is to extract the data from the database and dump it somewhat unceremoniously into S3. AWS does provide AWS Data Pipeline, which lets you schedule extracts from databases in various ways, but at the time of writing the documentation was woefully inadequate when it came to setting it up with Aurora (oddly?). And although the costs are not excessive ($1 USD per pipeline per month, not including the resources used to run the pipeline), we weren’t sure how this would scale if we needed it to.
In the end, we scripted our pipeline using Prefect. Prefect is a Python library designed specifically for dataflow automation: it handles task scheduling and reports on task success or failure. It has the added advantage of setting us up nicely if we want to scale our pipelines in the future.
In Prefect, each pipeline is a flow, and a flow performs multiple tasks, such as extracting, transforming, or loading data. The flow is a directed acyclic graph (DAG), meaning that tasks which depend on others won’t run unless their predecessors complete successfully. Although Prefect has a cloud offering, and a nice UI if you choose to self-host the engine, it’s also possible to run a flow as a local Python script rather than as part of an orchestrated workflow automation system.
For those of you reading this and wondering “why not Airflow?”: good question. Amazon does have a managed version of Airflow, too. We’ve got nothing against Airflow, but Prefect is pretty swish and was simple to get started with, so we went with that. We used the cloud variant: the free tier covered our first use case of extracting the data, and we felt it would scale easily, and without excessive cost, as we (hopefully!) need it to.
Our first Prefect pipeline is straightforward: one task gets the data out of our MySQL database, a second transforms (deserialises) that data, and a third writes it out to S3. We also ended up including some table copies in our final flow. Some example code for doing this will be provided in a follow-up post.
One key thing we did here to minimise cost was to use CloudWatch Events and AWS Lambda functions to turn our Prefect agents on and off. We only need our pipeline running for a short time each day, so having an instance on and idling doesn’t make much financial sense. AWS documents how to start and stop instances on a schedule, and we were able to set this up in less than half an hour.
Building a basic data store structure
AWS S3 is basically a bottomless file system, and one advantage of this store (aside from it being extremely cheap!) is that if you add a little structure to your filenames you can turn it into a partition-ready big data store.
The trick is in the filenames (we are using Parquet as the data format, because it’s excellent): each level of the path encodes a partition column as a key=value pair, in the style that Hive, Glue, and Athena expect, with the names replaced by whatever is appropriate for your system. For example, our structure looks something like bucket-name/table_name/date=YYYY-MM-DD/data-file.parquet.
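Keeping this convention honest is easiest if path construction lives in one small helper that the load step always goes through. A sketch (the table name, `date` partition key, and file naming are illustrative):

```python
from datetime import date

def partition_key(table: str, day: date, part: int = 0) -> str:
    """Build a Hive-style partitioned S3 key, e.g.
    sensor_readings/date=2021-03-01/part-0000.parquet
    A cataloguer like Glue picks up `date` as a partition column."""
    return f"{table}/date={day.isoformat()}/part-{part:04d}.parquet"

print(partition_key("sensor_readings", date(2021, 3, 1)))
# sensor_readings/date=2021-03-01/part-0000.parquet
```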
Okay, okay, it just looks like a nice way to structure files, but I promise you that being pedantic about this file structure is worth it. Parquet is also worth considering if your data has lots of rows and fixed columns: it’s an efficient file format for that kind of data, and in our case 250MB of uncompressed CSV files (30MB compressed as gzip) turns into 20MB of Parquet with no data loss.
Supercharging the data store with Glue and Athena
By now we’ve got our data piped out to S3 and neatly stashed away in a tidy filesystem, but that doesn’t actually let us do anything useful with it. To operate on it and run SQL queries over it like it’s a database, we use Glue and Athena.
AWS Glue is a cataloguer: it runs over the S3 bucket and records what data it finds and the schema that data is in. Conveniently, because we used Parquet, the schema is encoded in the data files themselves, so we don’t need any custom data classifiers. And because we structured the filesystem the way Glue expects, we can simply point Glue at the S3 bucket, set it to run just after our Prefect pipeline, and hey presto: it finds the fresh data moments after it is written to the store and enters it into its catalog.
Speaking of Presto, our final data lake piece is AWS Athena, AWS’s version of the popular Presto distributed query engine. It uses the Glue catalog to find the data we’ve tucked away in S3, and by default anything we’ve catalogued with Glue appears in the Athena interface. This means we can run a SQL query over all of our deserialised data and summarise it in a matter of seconds… for only a few cents.
The reason it is so quick and cheap is our partitions: if we use date=datehere in our query’s WHERE clause, the query engine only needs to look at the matching folders, quickly narrowing the data that is scanned. This reduces cost (as well as making it fast) because Athena charges based on the amount of data scanned.
In fact, our basic data lake costs us less than $5 USD per month to maintain, which works out to a few cents for every sensor we have deployed in the field. We don’t expect that cost to stay flat (i.e. it’ll go up) as we add more analysis, aggregations, and data points going forward, but it’s a great start to unlocking our data for new customer use cases.
Our final basic setup, end to end: a Prefect flow extracts data from Aurora, deserialises it, and writes partitioned Parquet to S3; Glue catalogs the bucket; and Athena queries it.
Our basic data lake is really nothing momentous, yet. At present it’s just an easy-to-manage, deserialised version of our main database: some tables are copied, one is transformed via Prefect and S3, and everything is exposed using serverless services such as Glue and Athena. But although it might be basic now, it’s also extremely cheap to run, and it scales easily as our data and use cases grow. Our turnaround time on bespoke analysis has dropped dramatically, without a huge cost overhead.
The investment in time to build this wasn’t huge either: it took one developer three working days to complete the basic pipeline described here, even though they had never touched any of the technologies mentioned before. As a startup, this small investment of time and money has set us up extremely well for the future, and it has already paid for itself in reduced time spent on data extraction and transformation.