A 20-Question Checklist for Cloud Data Platforms

Sandeep Uttamchandani
Published in Wrong AI · Oct 16, 2017

There is a plethora of cloud data platforms, and no one size fits all! How do you select the one that best matches your enterprise requirements? Well, like everything else in life, any transformation starts with introspection: essentially, understanding your requirements along the following three dimensions:

  • The characteristics of your data
  • Your requirements for data analysis
  • The production SLAs you must meet

This is the first of several posts on this topic. In this post, we focus on the 20 questions you need to answer to define your requirements. In future posts, we will explore how these requirements map to the data platform choices offered by the cloud providers, as well as the option of using Apache open-source projects to spin up your own clusters in the cloud. So let's get to work!

1. What is the data type?

  • Is your data structured, semi-structured, or unstructured? More specifically, what is the type of your data records, i.e., relational rows, document records, graph objects, key-value pairs, wide-column records, self-describing documents, time-series events, or just unstructured blobs of text?
  • Based on the data type, you have the option to explore specialized data stores and compare them with generic SQL engines that support multiple data sources by separating the persistence layer from the query logic.

2. What is the data schema?

  • Three options: schema-on-write, schema-on-read, or schema-on-the-fly. Relational data has schema-on-write. All other data can be treated as having a dynamic schema that is applied during analysis (schema-on-read) or that varies on a per-record basis (schema-on-the-fly); the sketch below contrasts the first two.
  • It is further helpful to differentiate flat versus complex/nested schemas. For instance, a flat schema is SQL friendly, while nested or sparse data needs to be flattened/pre-processed before SQL analysis.
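
To make the distinction concrete, here is a minimal PySpark sketch contrasting schema-on-read inference with an explicitly declared schema; the file name events.json and its fields are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-example").getOrCreate()

# Schema-on-read: the schema is inferred at read time from the semi-structured file.
inferred_df = spark.read.json("events.json")
inferred_df.printSchema()

# Schema-on-write style: declare the expected schema up front so mismatches surface early.
declared_schema = StructType([
    StructField("user_id", StringType(), nullable=False),
    StructField("event_ts", LongType(), nullable=True),
    StructField("event_type", StringType(), nullable=True),
])
strict_df = spark.read.schema(declared_schema).json("events.json")
```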

3. What is the on-disk format of the data?

  • Defines the persistence format of the data on disk. There are a variety of popular Hadoop data formats such as text files, ORC, Parquet, SequenceFiles, Avro, and proprietary formats. Additionally, the data may be compressed using popular codecs such as zlib, gzip, LZO, and Snappy.
  • The on-disk data format plays an important role w.r.t. performance, interoperability, and dollar cost per query (typically charged based on the data accessed). For instance, queries that search values in a specific column run faster and are cheaper using columnar formats such as ORC or Parquet, since a smaller footprint of the data needs to be accessed (see the sketch below).
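
As an illustration of why columnar formats matter, here is a small sketch using PyArrow; the table contents and file name are made up. Reading back a single column touches only a fraction of the stored bytes, which is what drives both latency and per-query cost.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical table: three columns, but a query may only need one of them.
table = pa.table({
    "user_id": ["u1", "u2", "u3"],
    "country": ["US", "IN", "DE"],
    "spend":   [10.5, 3.2, 7.8],
})

# Write a columnar, Snappy-compressed Parquet file.
pq.write_table(table, "events.parquet", compression="snappy")

# Reading back only the 'spend' column avoids scanning the other columns entirely.
spend_only = pq.read_table("events.parquet", columns=["spend"])
print(spend_only)
```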

4. Data size and cardinality limits?

  • Max data size
  • Max record size
  • Max # of tables (if applicable)
  • Max number of indexes (if applicable)
  • Max # of nodes within the cluster for horizontal scaling

5. Data analysis use-case?

  • Knowing the use-case plays a key role in technology selection. Typical categories of data analysis use-cases are: interactive BI, streaming, batch/reporting, OLTP, point queries, advanced analytic workflows, operational, ETL, self-service data exploration, ad-hoc queries, long-running jobs, and SQL as part of Spark pipelines.

6. SQL support required for the personas involved?

  • Different personas can be involved in using the data platform: business analysts, data engineers, ML engineers, and application programmers. The persona defines the need for self-service data modeling, transformation, and visualization.

7. Programming languages supported

  • Consider integration with existing data analysis programs. Platforms support a wide range of programming languages such as Python, Scala, Go, Java, C#, PHP, Perl, and Erlang.

8. Data federation across sources?

  • Does the use-case require aggregation of data across multiple data sources? For instance, a streaming use-case where access logs are joined with a relational table of items or other feeds (see the sketch below).
  • Having clarity on this requirement is important in the selection process, since solutions vary w.r.t. the number of data sources they can use during data analysis.
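
As a sketch of this kind of federation, the following PySpark snippet joins a stream of access logs with a relational items table loaded over JDBC; the paths, JDBC URL, table name, and schema are all hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("federation-example").getOrCreate()

# Streaming source: access logs arriving as JSON files (hypothetical path and schema).
access_logs = (spark.readStream
               .schema("item_id STRING, ts TIMESTAMP, user STRING")
               .json("s3://logs/access/"))

# Relational source: a static items table loaded over JDBC (hypothetical URL and table).
items = (spark.read.format("jdbc")
         .option("url", "jdbc:postgresql://db-host:5432/shop")
         .option("dbtable", "items")
         .load())

# Stream-static join: enrich every log record with item attributes.
enriched = access_logs.join(items, on="item_id", how="left")
```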

9. Support for multi-user workloads?

  • Most enterprises have concurrent parallel jobs running on the data platform. To use the available resources effectively, the platform needs to support prioritization of selected jobs, as well as appropriate resource allocation policies (see the sketch below).
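
One example of such prioritization is Spark's fair scheduler pools. The sketch below assumes fair scheduling and the pool names (interactive, batch) have been configured by the cluster administrator, and the sales table is hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes spark.scheduler.mode=FAIR and an allocation file defining the pools.
spark = (SparkSession.builder
         .appName("multi-tenant-example")
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

# Interactive BI query runs in the higher-priority 'interactive' pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "interactive")
spark.sql("SELECT count(*) FROM sales WHERE day = current_date()").show()

# Long-running report is tagged with the lower-priority 'batch' pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "batch")
report = spark.sql("SELECT region, sum(amount) FROM sales GROUP BY region")
report.write.mode("overwrite").parquet("s3://reports/daily/")
```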

10. SQL Query constructs

  • There is a wide range of SQL constructs. In selecting a solution, it is important to understand the requirements for these SQL primitives (ANSI SQL compatibility). The following are a few categories (see the sketch after this list):
  • UDFs, stored procedures, triggers support
  • Joins, Partitions, Composite key support
  • Complex Aggregations
  • Materialized views, Windowing functions, Nested queries, Approximate queries
  • Referential integrity/Foreign keys
  • Data filtering operations using single keys, range, faceted search, graph traversals, geospatial
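
To ground a couple of these primitives, here is a sketch that runs a windowing function and a nested (scalar subquery) aggregation through Spark SQL; the orders table and its columns are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-constructs-example").getOrCreate()

# Windowing function: rank each customer's orders by amount.
ranked = spark.sql("""
    SELECT customer_id,
           order_id,
           amount,
           RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank
    FROM orders
""")

# Nested query plus aggregation: customers whose total spend exceeds the overall average order amount.
big_spenders = spark.sql("""
    SELECT customer_id, total_spend
    FROM (
        SELECT customer_id, SUM(amount) AS total_spend
        FROM orders
        GROUP BY customer_id
    ) AS t
    WHERE total_spend > (SELECT AVG(amount) FROM orders)
""")
```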

11. Support for indexes?

  • This is applicable mainly to SQL queries where filters use the primary key as well as other columns. Index support for column values varies across solutions: compound, unique, array, partial, TTL, sparse, hash, text, and geospatial (see the sketch below).
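
As one concrete illustration, the sketch below creates several of these index types with PyMongo; the connection string, database, collection, and field names are hypothetical.

```python
from pymongo import MongoClient, ASCENDING, TEXT

# Hypothetical connection string and collection.
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Compound index on customer + order date for per-customer range scans.
orders.create_index([("customer_id", ASCENDING), ("order_date", ASCENDING)])

# Unique index to enforce one record per order id.
orders.create_index([("order_id", ASCENDING)], unique=True)

# TTL index: documents expire 30 days after 'created_at'.
orders.create_index([("created_at", ASCENDING)], expireAfterSeconds=30 * 24 * 3600)

# Text index for keyword search over item descriptions.
orders.create_index([("description", TEXT)])
```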

12. Data consistency assumptions (made by application programmers)

  • Application programmers assume certain properties of the data platform, analogous to relying on POSIX contracts. It is important to take these assumptions into account in the selection process. A few key constructs are (see the sketch after this list):
  • Transaction support primitives
  • Replication consistency: if read replicas are supported, clarify whether the application program can handle eventual consistency.
  • Idempotent write batches: the assumption that repeated updates are idempotent.
  • Tunable consistency knob: the application program may require different degrees of consistency for different writes.
  • Conditional writes/Atomic counters
  • Consistency of the indexes: whether the indexes are refreshed synchronously on updates or by a lazy background process.
  • Write-write assumption: behavior when concurrent writes to the same record occur
  • Read-write assumption: behavior when reading and writing the same record at the same time
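
A few of these constructs can be sketched against DynamoDB via boto3; the inventory table, its key, and the attribute names are hypothetical.

```python
import boto3
from botocore.exceptions import ClientError

# Hypothetical table with 'item_id' as the partition key.
table = boto3.resource("dynamodb").Table("inventory")

# Conditional write: only create the record if it does not already exist.
try:
    table.put_item(
        Item={"item_id": "sku-123", "stock": 10},
        ConditionExpression="attribute_not_exists(item_id)",
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        pass  # another writer got there first; the application decides how to reconcile

# Atomic counter: increment stock without a read-modify-write race.
table.update_item(
    Key={"item_id": "sku-123"},
    UpdateExpression="ADD stock :delta",
    ExpressionAttributeValues={":delta": 5},
)

# Tunable read consistency: request a strongly consistent read instead of the default eventual read.
table.get_item(Key={"item_id": "sku-123"}, ConsistentRead=True)
```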

13. Need for archiving, i.e., data lifecycle management

  • As the data continues to grow, an increasingly common requirement (especially for interactive use-cases) is to archive the cold data. This allows better performance and dollar cost savings by managing hot and cold data differently. Platforms may offer policy controls, such as time-to-live (TTL) columns, that automatically archive or expire data partitions based on expiry policies (see the sketch below).
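
Policy controls differ by platform; as one example, the sketch below sets an S3 lifecycle rule (the bucket name and prefix are hypothetical) that tiers cold log partitions to cheaper storage and eventually expires them.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket holding date-partitioned log data under the 'logs/' prefix.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-cold-log-partitions",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                # Move data to cheaper tiers as it cools, then expire it.
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```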

14. Whether multi-datacenter support is required?

  • There are two possible scenarios: a writer in a single geographic region with globally distributed readers in other (read) regions, or writers and readers that are both globally distributed. Based on this requirement, the available solution options can be filtered (see the sketch below).
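
As a sketch of multi-region replication, the following snippet uses the Python Cassandra driver to create a keyspace replicated across two data centers; the contact points, keyspace, and data-center names are hypothetical.

```python
from cassandra.cluster import Cluster

# Hypothetical contact points, one node per region/data center.
cluster = Cluster(["10.0.0.1", "10.1.0.1"])
session = cluster.connect()

# Keep three replicas in the write region and two in the remote read region,
# so readers in 'us_east' and 'eu_west' are both served locally.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS analytics
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us_east': 3,
        'eu_west': 2
    }
""")
```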

15. Backup/Snapshot support

  • Depending on the use-case, there might be a need to periodically back up the data. The primary motivation for backups is rolling back after data corruption. The use of backups for high availability is becoming increasingly less popular.

16. Data Security requirements

  • Whether encryption is required
  • Granularity of access control
  • Need for Network isolation
  • Data privacy compliance, such as GDPR

17. High Availability/DR

  • Data durability requirement i.e., # of 9s.
  • Recovery Point Objective: Data updates lost (in mins/secs) during a recovery
  • Recovery Time Objective: Time duration for the system to come online after a failure
  • Automatic Failure detection
  • Need to proactively detect & correct latent disk errors
  • Need to have a pre-warmed cache after the failure
  • Support for multiple availability zones
  • Verification of data integrity before serving to the application/user

18. Performance/Auto-scaling

  • Scaling reads across replicas
  • Sharding tunables for scaling data across the cluster
  • Integrated caching/In-mem capabilities (holding data structures in memory)
  • Auto-scaling: CPU & memory scaling
  • Read and Write throughput requirements
  • Average and 95th-percentile latency (see the sketch below)
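
Averages hide tail behavior, so latency SLAs are usually written against percentiles. A minimal sketch of computing both from a made-up latency sample:

```python
import numpy as np

# Hypothetical sample of query latencies in milliseconds.
latencies_ms = np.array([12, 15, 14, 90, 13, 11, 16, 250, 14, 13])

avg = latencies_ms.mean()
p95 = np.percentile(latencies_ms, 95)

# The p95 (and p99) numbers, not the average, dominate user-perceived performance.
print(f"avg latency: {avg:.1f} ms, p95 latency: {p95:.1f} ms")
```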

19. Integrations with existing solutions

  • Platform Monitoring solution
  • Data Visualization/Reporting solution
  • Map-Reduce API support
  • HCatalog Metadata support
  • Hadoop integration
  • Machine Learning Libraries
  • Support for RESTful APIs

20. Pricing optimization

  • The cost of hosting a data platform can be broadly divided into three buckets: a) data storage cost; b) per-query cost; c) cost of HA (see the sketch below).
  • While pricing is not an SLA, it is absolutely top-of-mind for any reasonably sized deployment.
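
A back-of-envelope model helps compare options before any proof of concept; the sketch below uses placeholder prices and volumes, not actual list prices of any provider.

```python
# Back-of-envelope monthly cost model; all numbers are illustrative placeholders.
storage_tb           = 50       # data stored, in TB
storage_price_per_tb = 20.0     # $/TB-month (placeholder)
queries_per_month    = 10_000
avg_tb_scanned       = 0.2      # TB scanned per query (placeholder)
scan_price_per_tb    = 5.0      # $/TB scanned (placeholder)
ha_multiplier        = 1.3      # uplift for replicas / multi-AZ (placeholder)

storage_cost = storage_tb * storage_price_per_tb
query_cost   = queries_per_month * avg_tb_scanned * scan_price_per_tb
total_cost   = (storage_cost + query_cost) * ha_multiplier

print(f"storage: ${storage_cost:,.0f}  queries: ${query_cost:,.0f}  total: ${total_cost:,.0f}")
```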

Next time you are evaluating which data platform to select in the Cloud, use this template to first define your requirements. In future posts, we will cover examples of existing solutions.
