Every data system has three characteristics that uniquely identify it: the size of its data, the recency of that data, and the latency of queries on that data. You are probably familiar with the first one, but the other two are often an afterthought.
As a data engineer, I have frequently deployed a big data system for one use case, only to have a new user pick up the same system for a different use case and complain: “My query latencies are slower than my acceptable limit of 500 milliseconds,” or “My query is not finding records that were produced in the most recent 10 seconds.”
At the very outset of engineering a data system, the three questions I ask myself are:
- What is my Data Latency? The data latency requirement of an application can vary widely. An annual budgeting system would be satisfied with access to all of last month’s data and earlier. Similarly, a daily reporting system would probably be happy if it could access the most recent 24 hours of data. An online gaming leaderboard application, on the other hand, would need to analyze data produced within the most recent second.
- What is my Query Latency? If I am building a daily reporting system, I can afford to build a system that is optimized for overall throughput. A query could take a few minutes or a few hours, because I need to produce a set of reports only once a day. On the other hand, a backend application that powers personalized news stories for a reader would typically demand latencies of a few milliseconds, so that data system would be engineered to optimize query latency.
- What is my Queries-Per-Second (QPS)? If my data system is powering an application on a mobile device, my QPS would likely be in the tens or hundreds of concurrent queries per second. If my data system is used to build a daily reporting system, I need to support 5 to 10 concurrent queries at most.
The answers to the above three questions determine the type of data system you would have to use. Data Latency is dominated by data pipelines, also called Extract-Transform-Load (ETL) pipelines. You can use an ETL process to weed out records with bad data or to pre-generate aggregates over time ranges. The ETL process adds latency to your data, so a shorter pipeline means that you get to query your most recent data.
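To make the ETL step concrete, here is a minimal sketch of the two transformations mentioned above: weeding out bad records and pre-aggregating over time ranges. The record fields (`user_id`, `clicks`, `timestamp`) and the hourly bucketing are hypothetical choices for illustration, not any particular pipeline's schema.

```python
from datetime import datetime, timezone

def etl_step(records):
    """Drop records with bad data, then pre-aggregate clicks per hour."""
    # Extract/clean: keep only records with a user_id and a sane click count.
    clean = [r for r in records if r.get("user_id") and r.get("clicks", 0) >= 0]
    # Transform: roll clicks up into hourly buckets for faster querying later.
    hourly = {}
    for r in clean:
        hour = r["timestamp"].replace(minute=0, second=0, microsecond=0)
        hourly[hour] = hourly.get(hour, 0) + r["clicks"]
    return hourly

records = [
    {"user_id": "a", "clicks": 3,
     "timestamp": datetime(2024, 1, 1, 9, 15, tzinfo=timezone.utc)},
    {"user_id": None, "clicks": 5,  # bad record: missing user_id, dropped
     "timestamp": datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)},
    {"user_id": "b", "clicks": 2,
     "timestamp": datetime(2024, 1, 1, 9, 45, tzinfo=timezone.utc)},
]
print(etl_step(records))
```

Note how each stage adds data latency: a record produced at 9:45 is only queryable after the cleaning and bucketing passes have run, which is exactly the trade-off a shorter pipeline avoids.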
Query Latency and QPS are dominated by the database that serves your queries. If you use a key-value store, you get very low query latencies, but you have to implement a larger part of your business logic in your application code. Alternatively, if you use a data warehouse that exposes a SQL API, you can delegate a larger share of your application logic via SQL to the data warehouse, but the latency of your queries would be higher than with a key-value store, and you would be limited to 5 or 10 concurrent queries.
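The trade-off above can be sketched in a few lines. Here a plain Python dict stands in for a key-value store and SQLite stands in for a SQL data warehouse; the key scheme and table schema are invented for the example. The point is where the aggregation logic lives, not the specific stores.

```python
import sqlite3

# Key-value store stand-in: very fast point lookups, but the aggregation
# logic (summing a user's clicks across days) lives in application code.
kv = {"user:1:clicks:2024-01-01": 3, "user:1:clicks:2024-01-02": 7}
total_kv = sum(v for k, v in kv.items() if k.startswith("user:1:clicks:"))

# Data-warehouse stand-in (SQLite as a toy SQL engine): the same
# aggregation is delegated to the database via SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clicks (user_id INTEGER, day TEXT, n INTEGER)")
db.executemany("INSERT INTO clicks VALUES (?, ?, ?)",
               [(1, "2024-01-01", 3), (1, "2024-01-02", 7)])
(total_sql,) = db.execute(
    "SELECT SUM(n) FROM clicks WHERE user_id = 1").fetchone()

print(total_kv, total_sql)
```

Both paths compute the same answer; the key-value version keeps latency in your control at the cost of more application code, while the SQL version trades some latency and concurrency for a much smaller application.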