We have a team of hungry analysts and modelers who want to get at data as soon as it becomes available.
Unfortunately, we relied on a mixture of engineering personas to create environments on demand, which created a massive constraint and a lot of waste.
We set out to solve that.
At Nomis, we work with customers that, at a very high level, give us this requirement:
“Here is my data. Optimize it for me to meet my business objectives and constraints. I don’t care how.”
As you can imagine, a process like the one below prevents scalability, causes frustration, and demoralizes the team over time:
When we had a much smaller team and less rigid definitions of roles and responsibilities, the process was manageable, broken as it was. As we grew, departments grew at different rates, and before long the constraints began to surface.
From a pure technology perspective, the process is even more painful to look at. But instead of bashing many of the cringe-worthy flaws, let’s see what it looks like now:
Note how the engineer is no longer a persona or a constraint in the process. Constraints breed self-sufficiency — and that became the theme of the solution.
So how did we do it?
We wanted to provide a “no-touch” experience for the data analyst, so that it wouldn’t matter when data arrived. It also shouldn’t matter whether an engineer was physically available: the entire department could be at an offsite or on vacation, and the data analyst should still be able to be productive.
The team imagined a world where incoming data, arriving at infrequent intervals, would organize itself based on its business context and what it represented. Then, without anyone manually loading the contents of the S3 data objects anywhere, the data would be accessible immediately via a read-only, multi-tenant SQL interface.
Naturally, this was a good fit for an event-driven reference architecture in AWS:
AWS Glue is a fully managed, serverless ETL service backed by Apache Spark. It also provides a data catalog that is essentially a managed, Hive-compatible metastore. Using crawlers and schema inference, it classifies arriving data (in our case, on S3) and populates the pertinent metadata.
We wrote simple Lambda functions that validate ObjectSummaries and orchestrate the organization of the data arriving on S3:
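A function of this kind might look like the following sketch. The landing layout (`<tenant>/<dataset>/<file>`), the `organized` prefix, the allowed file types, and the date-based partitions are all assumptions made for illustration; only the S3 event shape and the boto3 `copy_object` and `delete_object` calls are standard.

```python
import urllib.parse
from datetime import datetime

# Hypothetical settings, chosen for illustration only.
ORGANIZED_PREFIX = "organized"
ALLOWED_SUFFIXES = (".csv", ".json", ".parquet")


def is_valid(key, size):
    """Accept non-empty objects of a known type located at <tenant>/<dataset>/<file>."""
    return size > 0 and key.lower().endswith(ALLOWED_SUFFIXES) and len(key.split("/")) == 3


def partitioned_key(key, arrived):
    """Map <tenant>/<dataset>/<file> to a Hive-style, date-partitioned key."""
    tenant, dataset, filename = key.split("/")
    return (
        f"{ORGANIZED_PREFIX}/{tenant}/{dataset}/"
        f"year={arrived:%Y}/month={arrived:%m}/day={arrived:%d}/{filename}"
    )


def handler(event, context, s3=None):
    """Lambda entry point for S3 ObjectCreated notifications."""
    if s3 is None:
        import boto3  # imported lazily so the helpers above can be used without it

        s3 = boto3.client("s3")

    moved = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        if not is_valid(key, size):
            continue  # also skips objects already under the organized prefix
        arrived = datetime.fromisoformat(record["eventTime"].replace("Z", "+00:00"))
        target = partitioned_key(key, arrived)
        # copy_object handles objects up to 5 GB; larger files need a multipart copy.
        s3.copy_object(Bucket=bucket, Key=target, CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)
        moved.append(target)
    return moved
```

Keeping the validation and key-building logic in plain functions means they can be tested without touching AWS, and the `key=value` path segments are what lets a crawler register them as partition columns.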
Once the objects are in the partitioned structure, a simple boto3 API call will start a Glue crawler configured to that S3 location:
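As a rough sketch, the call can be wrapped as shown here. `start_crawler` and `CrawlerRunningException` are part of the boto3 Glue client; the wrapper function and the crawler name in the usage comment are hypothetical.

```python
def start_crawler(glue, crawler_name):
    """Start the named Glue crawler; return False if a run is already in progress."""
    try:
        glue.start_crawler(Name=crawler_name)
        return True
    except glue.exceptions.CrawlerRunningException:
        # A crawl is already in progress, so this request is skipped.
        return False


# Usage, for example at the end of the Lambda handler
# (requires boto3 and AWS credentials; the crawler name is made up):
#   import boto3
#   start_crawler(boto3.client("glue"), "incoming-data-crawler")
```

Catching the "already running" case matters when several files land close together, since each upload would otherwise try to start the same crawler.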
Once the Glue Data Catalog is updated, Athena can automatically use it as a metastore to query the files directly, in place, on S3. You cannot modify or delete the S3 data from Athena, which greatly limits the amount of damage that a user can do with their exploratory analysis. That is not quite the case when the data is loaded ad hoc into an RDS instance, which usually means no restrictive roles or access policies have been assigned.
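Analysts would normally query through the Athena console or a SQL client, but the same thing can be done programmatically. The sketch below uses the boto3 Athena client; the helper name, database, table, and results bucket are placeholders, not names from our environment.

```python
import time


def run_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Run a query in Athena and return the first page of rows as lists of strings."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state is reached.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "")
        raise RuntimeError(f"Query {query_id} ended in state {status['State']}: {reason}")

    # Only the first page is read here; for SELECT queries its first row is the header.
    page = athena.get_query_results(QueryExecutionId=query_id)
    return [[col.get("VarCharValue") for col in row["Data"]] for row in page["ResultSet"]["Rows"]]


# Usage (requires boto3 and AWS credentials; all names below are placeholders):
#   import boto3
#   rows = run_query(
#       boto3.client("athena"),
#       "SELECT * FROM loans LIMIT 10",
#       database="acme",
#       output_location="s3://example-athena-results/",
#   )
```

Query results are written to the output location rather than to the source prefix, so the arriving data itself is left untouched.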
The architecture and implementation are almost brutally simple, and the problem itself perhaps even more so. Many of you reading may have solved this long ago. However, if you do experience procedural pains in making data readily accessible to those who need to analyze it, we hope this story encourages you by showing how a broken process was:
- Identified
- Streamlined with fewer than 100 lines of code
- Redeployed without the need for any dedicated infrastructure
- Most importantly, turned into self-service where a severe resource constraint used to exist.