The 5 Most Important Criteria for Choosing an Analytics Platform
In our previous article, we explained how analytics’ role in the enterprise is changing, and why this has forced businesses to reevaluate their current analytics ecosystems. This has led many to consider implementing a modern analytics platform to simplify technological complexity, enhance analytics performance, and improve operational efficiency.
But how do you evaluate these capabilities? And what’s the best way to go about choosing an analytics platform?
In this article, we’ll take a look at the five most important criteria for evaluating analytics platforms and the questions you should be asking yourself and your vendors when it comes to your search for the right solution.
What Do You Need From an Analytics Platform?
There are many reasons to adopt an analytics platform, but for most enterprises, those reasons can be condensed into two main factors: speed and efficiency.
When it comes to speed, analytics platforms shine in a number of ways. This goes beyond the performance benefits of using a system where all of the pieces are optimized to work together. Analytics platforms also enhance speed by lowering learning curves, streamlining workflows, reducing IT-related downtime, and letting other areas of the business move faster as hardware and software become more centralized.
When it comes to efficiency, the benefits of an analytics platform seem obvious. Consolidating solutions makes your entire analytics ecosystem easier to manage. This cuts out the overhead of building connections between separate tools, connections that add latency and tie up engineering resources. Speaking of resources, the benefits of analytics platforms also show up as cost savings in several ways. These include fewer costly IT and end-user labor hours wasted, the ability to retire solutions that exist only to paper over the shortcomings of working across multiple distributed systems, and faster ramp-up time (and faster time to value) for new users.
As a general rule, analytics platforms have a lot to offer enterprises, but not all platforms are created equal. Deciding to adopt an analytics platform is merely the first step towards modernizing your business' approach to analytics. Understanding what to look for in these platforms, and how to evaluate them effectively, is what will make the difference between successful adoption and just another failed analytics initiative.
Criteria for Choosing an Analytics Platform
Concepts like speed and efficiency may be good reasons to adopt an analytics platform, but they’re too broad to use as evaluation criteria when choosing an analytics platform. So what should you be looking for?
Aside from must-have capabilities based on the unique scenarios you’re working with, there are five key factors you should always consider when choosing an analytics platform. They are:
- Performance
- Timeliness
- Scalability
- Operational Efficiency
- Cost Effectiveness
Let’s dive deeper into what we mean when we talk about these factors and how to best evaluate them.
Performance
Why It Matters
Performance is, undoubtedly, the most important aspect of any analytics platform. In analytics terms, performance refers to query speed. User productivity depends on sub-second query latency, so analysts can keep asking questions without long pauses interrupting their train of thought.
What To Look Out For
It’s important to pay attention to the main technique being used by the query engine to improve query performance. There can be several options, each with its pros and cons:
- Denormalization — Queries against a set of joined tables can be dramatically slower than single-table queries, which is why some query engines are optimized only for single-table queries. With these engines, users need to build a data pipeline that flattens joined tables into a single table (a process called denormalization) in order to get good performance. This comes at a cost: extra data delays and the ongoing effort of maintaining the data transformation pipeline.
- Indexing — Another common technique is to rely on extensive indexing. Indexing improves read performance, but it adds overhead to data ingestion and can cause storage requirements to balloon.
- Materialized Views — Materialized views can also improve query speed, but they can easily become stale. Another challenge is that you must know in advance which queries run frequently so you can define the views up front, which limits the flexibility of the engine.
All of these techniques are helpful, but they don't solve the root problem. A solution built for strong analytics performance should address the issue fundamentally with a cost-based optimizer. With a cost-based optimizer, the engine can exploit the parallel processing capabilities of modern CPUs, the distributed execution of a massively parallel processing (MPP) architecture, and more. The end result should be excellent query performance directly on joined tables, without denormalization.
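To make the denormalization trade-off concrete, here is a minimal sketch using Python's built-in sqlite3 module and a hypothetical orders/customers schema. It shows that a denormalized table answers the same question as a join, but only by maintaining a second copy of the data, which is exactly the pipeline burden described above. The table and column names are illustrative, not from any particular platform.

```python
import sqlite3

# In-memory database with two normalized tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 2, 45.0), (12, 1, 20.0);
""")

# An engine with a capable optimizer can answer the join directly:
joined = conn.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region ORDER BY c.region
""").fetchall()

# A join-averse engine forces you to maintain a denormalized copy instead:
conn.executescript("""
    CREATE TABLE orders_denorm AS
    SELECT o.order_id, o.amount, c.region
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id;
""")
denorm = conn.execute("""
    SELECT region, SUM(amount) FROM orders_denorm
    GROUP BY region ORDER BY region
""").fetchall()

print(joined)  # [('APAC', 45.0), ('EMEA', 119.0)]
print(denorm)  # same answer, but now two copies of the data to keep in sync
assert joined == denorm
```

Every insert, update, or delete against the base tables now has to be replayed into `orders_denorm`, which is where the extra latency and pipeline maintenance come from.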
What Questions You Should Ask
- What TPC-H and SSB benchmark results are available?
- How does this solution improve query performance?
Timeliness
Why It Matters
As Dr. Prashanth Southekal points out in his blog, data loses 50% of its value just 8 hours after it's produced. Timeliness is a decisive factor in how useful data actually is. This is especially true for the operational analytics scenarios discussed in our previous article.
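If we extrapolate that 50%-at-8-hours figure into a simple half-life curve (our assumption, not Dr. Southekal's model), a quick calculation shows why batch latencies are so costly:

```python
# Half-life model of data value: value halves every 8 hours.
# Extrapolating the cited 50%-at-8-hours figure this way is an assumption
# made for illustration only.
def data_value(hours, half_life=8.0, initial=1.0):
    return initial * 0.5 ** (hours / half_life)

print(data_value(8))   # 0.5   -- half the value after one half-life
print(data_value(24))  # 0.125 -- a typical overnight batch window
```

Under this model, data that waits for a nightly batch job arrives with roughly an eighth of its original value.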
What To Look Out For
- Ingestion Speed Support — An analytics platform needs to keep up with the data flowing in from streaming sources. If the platform has to do heavy denormalization or asynchronous data consolidation behind the scenes, it usually can't keep pace with ingestion, and you'll end up paying for it in additional hardware, software, or support costs.
- The Integrity of Ingested Data — Data in a streaming pipeline can be lost, duplicated, mis-transformed, or arrive out of order. An analytics platform needs to handle these errors and guarantee the integrity of ingested data: every record ingested exactly once, and in the right order.
- Data Mutability — Mutable data means ingested records may need to be updated or deleted. For example, a purchase order whose line items change. If the platform doesn't handle update operations directly, developers have to build compensation logic by hand, which is time-consuming and error-prone. Without support for mutable data in real-time streams, the platform can't cover many common use cases.
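The bullets above can be sketched in a few lines of Python. This is a toy version (hypothetical event format, in-memory state) of the compensation logic you inherit when a platform doesn't handle mutable, exactly-once ingestion natively: deduplicate by sequence number, apply upserts and deletes by primary key, and ignore late, stale events.

```python
def apply_stream(events):
    """Apply a stream of upsert/delete events to an in-memory table."""
    table = {}     # primary key -> current row
    last_seq = {}  # primary key -> highest sequence number applied
    for e in events:
        key, seq = e["key"], e["seq"]
        if seq <= last_seq.get(key, -1):
            continue  # duplicate or stale event: skip (idempotent ingestion)
        last_seq[key] = seq
        if e["op"] == "delete":
            table.pop(key, None)
        else:  # "upsert": insert a new row or update the existing one
            table[key] = e["row"]
    return table

events = [
    {"key": "PO-1", "seq": 1, "op": "upsert", "row": {"qty": 2}},
    {"key": "PO-1", "seq": 3, "op": "upsert", "row": {"qty": 5}},  # updated line items
    {"key": "PO-1", "seq": 2, "op": "upsert", "row": {"qty": 4}},  # late and stale: ignored
    {"key": "PO-2", "seq": 1, "op": "upsert", "row": {"qty": 1}},
    {"key": "PO-2", "seq": 2, "op": "delete"},                     # order cancelled
]
print(apply_stream(events))  # {'PO-1': {'qty': 5}}
```

Even this toy glosses over persistence, crash recovery, and backpressure, which is why building and maintaining such logic yourself is the time-consuming, error-prone path the section warns about.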
What Questions You Should Ask
- What ingestion speed does the solution support?
- How does the solution handle data ingestion errors?
- How does the solution handle updates in ingested data streams?
Scalability
Why It Matters
Modern analytics platforms need to support growing data volumes as well as a growing user base. The platform needs to be built on a distributed architecture that has the elasticity to grow and shrink in order to efficiently use hardware resources while supporting business activities.
What To Look Out For
- Data Volume — Storage and query engines need to scale out independently. When new hardware resources are added, any data redistribution should happen automatically and behind the scenes, without impacting users’ query experiences.
- Concurrent Users — Operational analytics workloads are often external user-facing, which means the number of concurrent users is much larger than in traditional internal analytical workloads. The platform needs to support thousands or even tens of thousands of users sending queries at once.
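When comparing vendors' concurrency claims, it helps to translate your user counts into a throughput target. The numbers below are illustrative assumptions, not vendor figures; it's simple throughput arithmetic you can redo with your own estimates.

```python
# Back-of-envelope QPS target from user counts (all numbers are
# illustrative assumptions you should replace with your own).
def required_qps(concurrent_users, queries_per_user_per_minute):
    return concurrent_users * queries_per_user_per_minute / 60.0

# 10,000 external users each firing about 3 queries per minute:
qps = required_qps(10_000, 3)
print(qps)  # 500.0 -- the sustained QPS the platform must absorb
```

A platform whose benchmark tops out below your computed target, with headroom for peaks, can be cut from the short list early.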
What Questions You Should Ask
- How is data redistribution handled?
- How many concurrent users can the solution support?
- What’s the solution’s highest QPS (queries per second)?
Operational Efficiency
Why It Matters
An analytics platform is, in effect, a distributed computing system that processes large amounts of data. Such a system involves several different hardware and software modules. If the system’s architecture is too complex, administration becomes a nightmare and will cost you a fortune.
What To Look Out For
- System Complexity — Some systems may require you to start up several servers even before you load data. In contrast, a streamlined architecture with no dependency on third-party components allows you to start with a small footprint and is much more stable.
- A Unified Platform — A modern analytics platform shouldn’t support only batch analytics or only real-time analytics. Maintaining separate data pipelines and analytics engines to satisfy different data latency requirements is not only counterproductive, but also a source of many data quality and integrity issues. Ideally, a single platform should provide excellent performance for both batch and real-time analytics.
What Questions You Should Ask
- Does this solution support real-time analytics as well as batch analytics?
- What’s the minimum cluster setup, and are there any system architecture diagrams available that showcase this?
- Can the system auto-scale out?
Cost Effectiveness
Why It Matters
Whether you are running on premises or in the cloud, reducing your cluster footprint makes your CFO happy, keeps your operational costs down, and is, in general, good for the environment.
What To Look Out For
- Hardware Resource Consumption — Pay careful attention to how much storage, CPU, and memory the platform needs for a given analytical workload. Also factor in items like network connectivity and cloud data egress costs.
- Support Personnel Cost — This requirement is straightforward. We need to look at how much time we need to involve our DBAs every week.
- Pricing Model — Pricing models vary from vendor to vendor and from project to project. The needs of your business may align better with one model than another.
- Scaling Needs — Do your analytics needs scale regularly? Only at certain times of year? Who is going to manage the platform and its updates? Examine how your business uses and manages its data to identify the pricing structure you can extract the most value from.
- Open Source — Open source solutions can offer significant savings compared to proprietary platforms, but beware: open source isn’t exactly free either. While you may avoid contracts and subscriptions, you can end up paying the equivalent of a commercial license in additional maintenance and support costs. If you go the open source route, the quality of the documentation and the developer community can be the difference between saving money and wasting it.
What Questions You Should Ask
- What are the solution’s typical server configurations?
- How many servers will I need to get started?
- What does this solution’s pricing model look like?
- Are term discounts available?
- Is there a cap on how much data this solution allows you to query?
You’re now all set to establish your evaluation criteria and know what questions to ask in order to narrow down your options. Now it’s time to put these solutions to the test, but what should you look out for?
Selecting your solution criteria and building your solution short list is one thing, but putting those solutions to the test and effectively evaluating them against each other can be even more challenging. Our next article in this series will walk you through the evaluation and rollout process to ensure all of the time you’ve put into building your criteria list pays off.
Join Us on Slack
If you’re interested in the StarRocks project, have questions, or simply seek to discover solutions or best practices, join our StarRocks community on Slack. It’s a great place to connect with project experts and peers from your industry. You can also visit the StarRocks forum for more information.