A 20-Question Checklist for Cloud Data Platforms

Sandeep Uttamchandani
Published in Wrong AI · Oct 16, 2017

There is a plethora of cloud data platforms, and no one size fits all! How do you select the one that best matches your enterprise requirements? Well, like everything else in life, any transformation starts with introspection: essentially, understanding your requirements along the following three dimensions:

  • The characteristics of your data
  • Your requirements for data analysis
  • The production SLAs you must meet

This is the first of several posts on this topic. In this post, we focus on the 20 questions you need to answer to define your requirements. In future posts, we will explore how these requirements map to the data platform choices offered by the cloud providers, as well as the option of using Apache open-source projects to spin up your own clusters in the cloud. So let's get to work!

1. What is the data type?

  • Is your data structured, semi-structured, or unstructured? More specifically, what is the type of your data records, i.e., relational rows, document records, graph objects, key-value pairs, wide-column records, self-describing documents, time-series events, or just unstructured blobs of text?
  • Based on the data type, you have the option to explore specialized data stores and compare them with generic SQL engines that support multiple data sources by separating the persistence layer from the query logic.

2. What is the data schema?

  • Three options: schema-on-write, schema-on-read, or schema-on-the-fly. Relational data has schema-on-write. All other data can be treated as having a dynamic schema that is applied during analysis (schema-on-read) or that varies on a per-record basis (schema-on-the-fly); the sketch below contrasts the first two.
  • It is further helpful to differentiate flat versus complex/nested schemas. For instance, a flat schema is SQL friendly, while nested or sparse data needs to be flattened/pre-processed before SQL analysis.
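
To make the distinction concrete, here is a minimal PySpark sketch contrasting schema-on-read inference with an explicitly declared schema; the file name events.json and its fields are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-example").getOrCreate()

# Schema-on-read: the schema is inferred at read time from the semi-structured file.
inferred_df = spark.read.json("events.json")
inferred_df.printSchema()

# Schema-on-write style: declare the expected schema up front so mismatches surface early.
declared_schema = StructType([
    StructField("user_id", StringType(), nullable=False),
    StructField("event_ts", LongType(), nullable=True),
    StructField("event_type", StringType(), nullable=True),
])
strict_df = spark.read.schema(declared_schema).json("events.json")
```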

3. What is the on-disk format of the data?

  • Defines the persistence format of the data on disk. There are a variety of popular Hadoop data formats such as text files, ORC, Parquet, SequenceFiles, Avro, and proprietary formats. Additionally, the data may be compressed using popular codecs such as zlib, gzip, LZO, and Snappy.
  • The on-disk data format plays an important role w.r.t. performance, interoperability, and dollar cost per query (typically charged based on the data accessed). For instance, queries that search values in a specific column run faster and are cheaper using columnar formats such as ORC or Parquet, since a smaller footprint of the data needs to be accessed (see the sketch below).
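
As an illustration of why columnar formats matter, here is a small sketch using PyArrow; the table contents and file name are made up. Reading back a single column touches only a fraction of the stored bytes, which is what drives both latency and per-query cost.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical table: three columns, but a query may only need one of them.
table = pa.table({
    "user_id": ["u1", "u2", "u3"],
    "country": ["US", "IN", "DE"],
    "spend":   [10.5, 3.2, 7.8],
})

# Write a columnar, Snappy-compressed Parquet file.
pq.write_table(table, "events.parquet", compression="snappy")

# Reading back only the 'spend' column avoids scanning the other columns entirely.
spend_only = pq.read_table("events.parquet", columns=["spend"])
print(spend_only)
```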

4. Data size and cardinality limits?

  • Max data size
  • Max record size
  • Max # of tables (if applicable)
  • Max number of indexes (if applicable)
  • Max # of nodes within the cluster for horizontal scaling

5. Data analysis use-case?

  • Knowing the use-case plays a key role in technology selection. Typical categories of data analysis use-cases are: interactive BI, streaming, batch/reporting, OLTP, point queries, advanced analytic workflows, operational, ETL, self-service data exploration, ad-hoc queries, long-running jobs, and SQL as part of Spark pipelines.

6. SQL support required for the personas involved?

  • Different personas can be involved in using the data platform: business analysts, data engineers, ML engineers, and application programmers. The persona defines the need for self-service data modeling, transformation, and visualization.

7. Programming languages supported

  • Consider integration with existing data analysis programs. Platforms support a wide range of programming languages such as Python, Scala, Go, Java, C#, PHP, Perl, and Erlang.

8. Data federation across sources?

  • Does the use-case require aggregation of data across multiple data sources? For instance, a streaming use-case where access logs are joined with a relational table of items or other feeds (see the sketch below).
  • Having clarity on this requirement is important in the selection process, since solutions vary w.r.t. the number of data sources they can use during data analysis.
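
As a sketch of this kind of federation, the following PySpark snippet joins a stream of access logs with a relational items table loaded over JDBC; the paths, JDBC URL, table name, and schema are all hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("federation-example").getOrCreate()

# Streaming source: access logs arriving as JSON files (hypothetical path and schema).
access_logs = (spark.readStream
               .schema("item_id STRING, ts TIMESTAMP, user STRING")
               .json("s3://logs/access/"))

# Relational source: a static items table loaded over JDBC (hypothetical URL and table).
items = (spark.read.format("jdbc")
         .option("url", "jdbc:postgresql://db-host:5432/shop")
         .option("dbtable", "items")
         .load())

# Stream-static join: enrich every log record with item attributes.
enriched = access_logs.join(items, on="item_id", how="left")
```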

9. Support for multi-user workloads?

  • Most enterprises have concurrent parallel jobs running on the data platform. To use the available resources effectively, the platform needs to support prioritization of selected jobs, as well as appropriate resource allocation policies (see the sketch below).
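
One example of such prioritization is Spark's fair scheduler pools. The sketch below assumes fair scheduling and the pool names (interactive, batch) have been configured by the cluster administrator, and the sales table is hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes spark.scheduler.mode=FAIR and an allocation file defining the pools.
spark = (SparkSession.builder
         .appName("multi-tenant-example")
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

# Interactive BI query runs in the higher-priority 'interactive' pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "interactive")
spark.sql("SELECT count(*) FROM sales WHERE day = current_date()").show()

# Long-running report is tagged with the lower-priority 'batch' pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "batch")
report = spark.sql("SELECT region, sum(amount) FROM sales GROUP BY region")
report.write.mode("overwrite").parquet("s3://reports/daily/")
```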

10. SQL Query constructs

  • There is a wide range of SQL constructs. In selecting a solution, it is important to understand the requirements for these SQL primitives (ANSI SQL compatibility). The following are a few categories (see the sketch after this list):
  • UDFs, stored procedures, triggers support
  • Joins, Partitions, Composite key support
  • Complex Aggregations
  • Materialized views, Windowing functions, Nested queries, Approximate queries
  • Referential integrity/Foreign keys
  • Data filtering operations using single keys, range, faceted search, graph traversals, geospatial
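
To ground a couple of these primitives, here is a sketch that runs a windowing function and a nested (scalar subquery) aggregation through Spark SQL; the orders table and its columns are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-constructs-example").getOrCreate()

# Windowing function: rank each customer's orders by amount.
ranked = spark.sql("""
    SELECT customer_id,
           order_id,
           amount,
           RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank
    FROM orders
""")

# Nested query plus aggregation: customers whose total spend exceeds the overall average order amount.
big_spenders = spark.sql("""
    SELECT customer_id, total_spend
    FROM (
        SELECT customer_id, SUM(amount) AS total_spend
        FROM orders
        GROUP BY customer_id
    ) AS t
    WHERE total_spend > (SELECT AVG(amount) FROM orders)
""")
```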

11. Support for indexes?

  • This is applicable mainly to SQL queries where filters use the primary key as well as other columns. Index support for column values varies across solutions: compound, unique, array, partial, TTL, sparse, hash, text, and geospatial (see the sketch below).
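
As one concrete illustration, the sketch below creates several of these index types with PyMongo; the connection string, database, collection, and field names are hypothetical.

```python
from pymongo import MongoClient, ASCENDING, TEXT

# Hypothetical connection string and collection.
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Compound index on customer + order date for per-customer range scans.
orders.create_index([("customer_id", ASCENDING), ("order_date", ASCENDING)])

# Unique index to enforce one record per order id.
orders.create_index([("order_id", ASCENDING)], unique=True)

# TTL index: documents expire 30 days after 'created_at'.
orders.create_index([("created_at", ASCENDING)], expireAfterSeconds=30 * 24 * 3600)

# Text index for keyword search over item descriptions.
orders.create_index([("description", TEXT)])
```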

12. Data consistency assumptions (made by application programmers)

  • Application programmers assume certain properties of the data platform, analogous to relying on POSIX contracts. It is important to take these assumptions into account in the selection process. A few key constructs are (see the sketch after this list):
  • Transaction support primitives
  • Replication consistency: if read replicas are supported, clarify whether the application program can handle eventual consistency.
  • Idempotent write batches: the assumption that repeated updates are idempotent.
  • Tunable consistency knob: the application program may require different degrees of consistency for different writes.
  • Conditional writes/Atomic counters
  • Consistency of the indexes: whether the indexes are refreshed synchronously on updates or by a lazy background process.
  • Write-write assumption: behavior when concurrent writes to the same record occur
  • Read-write assumption: behavior when reading and writing the same record at the same time
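
A few of these constructs can be sketched against DynamoDB via boto3; the inventory table, its key, and the attribute names are hypothetical.

```python
import boto3
from botocore.exceptions import ClientError

# Hypothetical table with 'item_id' as the partition key.
table = boto3.resource("dynamodb").Table("inventory")

# Conditional write: only create the record if it does not already exist.
try:
    table.put_item(
        Item={"item_id": "sku-123", "stock": 10},
        ConditionExpression="attribute_not_exists(item_id)",
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        pass  # another writer got there first; the application decides how to reconcile

# Atomic counter: increment stock without a read-modify-write race.
table.update_item(
    Key={"item_id": "sku-123"},
    UpdateExpression="ADD stock :delta",
    ExpressionAttributeValues={":delta": 5},
)

# Tunable read consistency: request a strongly consistent read instead of the default eventual read.
table.get_item(Key={"item_id": "sku-123"}, ConsistentRead=True)
```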

13. Need for archiving, i.e., data lifecycle management

  • As the data continues to grow, an increasingly common requirement (especially for interactive use-cases) is to archive the cold data. This allows better performance and dollar cost savings by managing hot and cold data differently. Platforms may offer policy controls, such as time-to-live (TTL) columns, that automatically archive or expire data partitions based on expiry policies (see the sketch below).
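
Policy controls differ by platform; as one example, the sketch below sets an S3 lifecycle rule (the bucket name and prefix are hypothetical) that tiers cold log partitions to cheaper storage and eventually expires them.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket holding date-partitioned log data under the 'logs/' prefix.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-cold-log-partitions",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                # Move data to cheaper tiers as it cools, then expire it.
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```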

14. Whether multi-datacenter support is required?

  • There are two possible scenarios: a writer in a single geographic region with globally distributed readers in other (read) regions, or writers and readers that are both globally distributed. Based on this requirement, the available solution options can be filtered (see the sketch below).
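
As a sketch of multi-region replication, the following snippet uses the Python Cassandra driver to create a keyspace replicated across two data centers; the contact points, keyspace, and data-center names are hypothetical.

```python
from cassandra.cluster import Cluster

# Hypothetical contact points, one node per region/data center.
cluster = Cluster(["10.0.0.1", "10.1.0.1"])
session = cluster.connect()

# Keep three replicas in the write region and two in the remote read region,
# so readers in 'us_east' and 'eu_west' are both served locally.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS analytics
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us_east': 3,
        'eu_west': 2
    }
""")
```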

15. Backup/Snapshot support

  • Depending on the use-case, there might be a need to periodically back up the data. The primary motivation for backups is rolling back after data corruption. The use of backups for high availability is becoming increasingly less popular.

16. Data Security requirements

  • Whether encryption is required
  • Granularity of access control
  • Need for Network isolation
  • Data privacy compliance, such as GDPR

17. High Availability/DR

  • Data durability requirement i.e., # of 9s.
  • Recovery Point Objective: Data updates lost (in mins/secs) during a recovery
  • Recovery Time Objective: Time duration for the system to come online after a failure
  • Automatic Failure detection
  • Need to proactively detect & correct latent disk errors
  • Need to have a pre-warmed cache after the failure
  • Support for multiple availability zones
  • Verification of data integrity before serving to the application/user

18. Performance/Auto-scaling

  • Scaling reads across replicas
  • Sharding tunables for scaling data across the cluster
  • Integrated caching/In-mem capabilities (holding data structures in memory)
  • Auto-scaling: CPU & memory scaling
  • Read and Write throughput requirements
  • Average and 95th-percentile latency (see the sketch below)
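
Averages hide tail behavior, so latency SLAs are usually written against percentiles. A minimal sketch of computing both from a made-up latency sample:

```python
import numpy as np

# Hypothetical sample of query latencies in milliseconds.
latencies_ms = np.array([12, 15, 14, 90, 13, 11, 16, 250, 14, 13])

avg = latencies_ms.mean()
p95 = np.percentile(latencies_ms, 95)

# The p95 (and p99) numbers, not the average, dominate user-perceived performance.
print(f"avg latency: {avg:.1f} ms, p95 latency: {p95:.1f} ms")
```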

19. Integrations with existing solutions

  • Platform Monitoring solution
  • Data Visualization/Reporting solution
  • Map-Reduce API support
  • HCatalog Metadata support
  • Hadoop integration
  • Machine Learning Libraries
  • Support for RESTful APIs

20. Pricing optimization

  • The cost of hosting a data platform can be broadly divided into three buckets: a) data storage cost; b) per-query cost; c) cost of HA (see the sketch below).
  • While pricing is not an SLA, it is absolutely top-of-mind for any reasonably sized deployment.
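
A back-of-envelope model helps compare options before any proof of concept; the sketch below uses placeholder prices and volumes, not actual list prices of any provider.

```python
# Back-of-envelope monthly cost model; all numbers are illustrative placeholders.
storage_tb           = 50       # data stored, in TB
storage_price_per_tb = 20.0     # $/TB-month (placeholder)
queries_per_month    = 10_000
avg_tb_scanned       = 0.2      # TB scanned per query (placeholder)
scan_price_per_tb    = 5.0      # $/TB scanned (placeholder)
ha_multiplier        = 1.3      # uplift for replicas / multi-AZ (placeholder)

storage_cost = storage_tb * storage_price_per_tb
query_cost   = queries_per_month * avg_tb_scanned * scan_price_per_tb
total_cost   = (storage_cost + query_cost) * ha_multiplier

print(f"storage: ${storage_cost:,.0f}  queries: ${query_cost:,.0f}  total: ${total_cost:,.0f}")
```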

Next time you are evaluating which data platform to select in the Cloud, use this template to first define your requirements. In future posts, we will cover examples of existing solutions.
