Living the Golden Rule (of Cloud Architecture) #3: Serverless Analytics
Analytics have always been an important element of enterprise software, but in recent years growing awareness of the potential value of data has made them more important still. Handling analytics well is becoming essential, and there is a definite appetite for “big” data, the bigger the better. Which can be pricey.
Whatever analytics solution you choose, you may have to scale it in proportion to the amount of data you’re dealing with, even if you only make light use of it to actually generate results. That’s an easy way to violate my proposed Golden Rule of Cloud Architecture, “Never pay for unused capacity,” with a large, expensive, and mostly idle analytics solution sitting around.
There are some options for tackling the problem. For example, if you can get away with periodically pregenerating and caching reports, you may be able to spin up your analytics cluster, generate your reports, and then shut it down until the next day. That could help.
Another option is to use a serverless analytics solution like AWS Athena or Google BigQuery. Both allow you to store large amounts of data and run queries against it on a pay-per-query basis. You’ll pay for data storage (which is reasonably cheap) on an ongoing basis, but there is no need to pay for a cluster sitting around idle until you need to run a query. They can be a good option, but — as always — there are limitations and potential gotchas.
The most important factor in using Athena or BigQuery successfully is understanding how they work and making proper use of them. As is the case with any tool, if you try to use it the wrong way you’re going to get bad results. For example, both tools use an SQL-like query syntax that is convenient and familiar, but it can get you into big trouble if you assume that they behave like an RDBMS. Athena and BigQuery are most definitely not serverless RDBMSes — you might be able to formulate the same queries that you would on a true RDBMS, but don’t expect the same behavior.
That being said, you can execute queries over very large datasets very efficiently and economically if you arrange your data the right way and use the tool accordingly. Complete guidance on how to do this is beyond the scope of a single blog post, but I can offer a few pointers to help you build a useful mental model for working towards a solution.
For tools like Athena and BigQuery, it’s generally a good idea to arrange your data so that each query can run against a single table. Denormalization may be appropriate, and if you have different classes of queries that use the data in different ways, it may make sense to have a separate table for each class, even if there is some overlap in data. You will probably be doing read-only analytics, and data storage is cheap, so don’t be too afraid of duplication.
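To make that concrete, here is a minimal sketch of what a denormalized, single-table layout might look like in Athena’s Hive-style DDL. The table name, columns, and S3 bucket are all hypothetical; the point is that attributes you might otherwise join in from other tables (like a user’s country or plan) are flattened into the one table your queries will hit:

    -- Hypothetical denormalized "one big table" for page-view analytics.
    -- User attributes are copied into each row instead of living in a
    -- separate users table that would need a join at query time.
    CREATE EXTERNAL TABLE page_views (
      view_time     timestamp,
      user_id       string,
      user_country  string,   -- denormalized from a hypothetical users table
      user_plan     string,   -- denormalized from a hypothetical users table
      page_url      string,
      referrer      string,
      device_type   string
    )
    STORED AS PARQUET
    LOCATION 's3://my-analytics-bucket/page_views/';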
Also remember that Athena and BigQuery charge you based on the amount of data they need to scan in order to answer your query, so anything you can do to limit the amount of data scanned is A Good Thing. The scheme you use for partitioning your data can be quite important in this respect, perhaps important enough to warrant having multiple tables containing the same data, but partitioned differently. Some queries can even be “free” (or at least executed without scanning any data) if the query engine can answer them using only metadata, such as counting the total number of rows in a partition, though you’ll still pay any minimum per-query charge. Rolling up your data and running your frequent queries against the rolled-up tables can also be a very good idea.
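As a sketch of what that can look like in Athena (the table names, columns, dates, and bucket are again hypothetical), here is the same table partitioned by date, a query that only scans the partitions it needs, and a CTAS statement that rolls the data up for a report you run frequently:

    -- The same data, partitioned by date, so queries that filter on
    -- view_date only scan the matching partitions.
    -- (New partitions have to be registered, e.g. with
    -- ALTER TABLE ... ADD PARTITION, before they are visible to queries.)
    CREATE EXTERNAL TABLE page_views_by_day (
      view_time     timestamp,
      user_id       string,
      user_country  string,
      page_url      string
    )
    PARTITIONED BY (view_date string)
    STORED AS PARQUET
    LOCATION 's3://my-analytics-bucket/page_views_by_day/';

    -- Filtering on the partition key keeps the scan (and the bill) small.
    SELECT user_country, count(*) AS views
    FROM page_views_by_day
    WHERE view_date BETWEEN '2018-01-01' AND '2018-01-07'
    GROUP BY user_country;

    -- Roll up a frequently run report once, then point the report at the
    -- (much smaller) rolled-up table instead of the raw data.
    CREATE TABLE daily_views_by_country
    WITH (
      format = 'PARQUET',
      external_location = 's3://my-analytics-bucket/rollups/daily_views_by_country/'
    ) AS
    SELECT view_date, user_country, count(*) AS views
    FROM page_views_by_day
    GROUP BY view_date, user_country;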
You can get very good results with aggregation queries that group and filter. Picking individual rows out of the table doesn’t work nearly as well (indexes are not supported). Don’t be afraid of putting a large number of columns into your table, since you only pay to scan the columns a query actually uses (at least with a columnar format like Parquet, in Athena’s case). By the same token, avoid “select *” where possible, since it forces a scan of every column in the table.
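A rough sketch of the difference, reusing the hypothetical page-view table from above: the first query is the kind these engines are built for, while the second is the kind of point lookup (and full-column scan) you want to avoid:

    -- Aggregation with grouping and filtering: scans only the referenced
    -- columns, and only the partitions matching the filter.
    SELECT page_url, count(DISTINCT user_id) AS unique_visitors
    FROM page_views_by_day
    WHERE view_date = '2018-01-15'
    GROUP BY page_url
    ORDER BY unique_visitors DESC
    LIMIT 20;

    -- A point lookup with "select *": there is no index to use, so the
    -- engine scans every column of every row it has to read. Fine once in
    -- a while, but not something to build an application's hot path on.
    SELECT *
    FROM page_views_by_day
    WHERE user_id = 'some-particular-user';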
All of the above recommendations apply to analytics that you run on a regular basis. If you’re rigorous about it, you can probably get excellent results at a very reasonable cost, with no unused capacity. It may take some work to massage your data into the right form, though, and you won’t get quite the same flexibility that an RDBMS or a graph database would give you.
On the other hand, Athena or BigQuery can also be very useful if you only need to run the occasional ad hoc query. You can throw large amounts of data into them very cheaply, and even if a query costs $10 or $20, that can still be quite economical for infrequent usage, especially compared to continuously running a database instance that sits mostly idle.
Overall, serverless analytics solutions like Athena or BigQuery can be quite useful tools to have in your tool belt. They don’t handle all use cases, but if you can massage your data into a suitable form you may be able to deliver big analytics at a small cost.