Cloud Data Stores — Are you aware of the knobs?

Sandeep Uttamchandani · Published in Wrong AI · 4 min read · Nov 22, 2017

Enterprises are increasingly storing data in the cloud. These Cloud-based Data Stores come in all shapes and sizes: Object Stores (AWS S3), Relational Databases (AWS Aurora, Redshift), NoSQL (AWS DynamoDB), Streaming Pipelines (AWS Kinesis), Gateway Servers (AWS Storage Gateway), etc. While the operational headache is borne by the Cloud Service Provider, Enterprises still need to diligently specify their requirements in the form of policies for security, access control, availability, etc. Recent data breaches exposing sensitive consumer data stored in S3 buckets illustrate the need for diligent policy-based management of cloud data stores. Also, with growing compliance regulations, well-defined policies are no longer a nice-to-have, but a must-have!

Irrespective of the differences in Data Store architectures, there is a common set of policy knobs that most Data Stores support in some shape or form. This post covers a generalized template of policy categories, citing specific examples from AWS Data Stores where applicable (nothing AWS-specific, just picking one provider to keep things concrete).

Data Security & Integrity Policies: Policy knobs that help ensure the confidentiality, privacy, and integrity of the data. They fall into several categories:

  • Encryption (Data-at-rest & Data-in-motion): Over the years, multiple approaches have emerged for securing data within the cloud. For instance, beyond the option to use SSL/TLS for data in motion, AWS S3 offers: a) Server-side encryption with S3-managed keys (SSE-S3, AES-256); b) Server-side encryption with keys managed via the AWS KMS service (SSE-KMS); c) Server-side encryption with customer-provided keys (SSE-C); d) Client-side encryption. A minimal configuration sketch follows this list.
  • Data Zeroing Policy: Ensuring resources (such as AWS EBS volumes) are zeroed before being returned to the pool. Having an Enterprise-wide policy ensures that all applications using the resource enforce this requirement.
  • Access Control: Limiting access to data is critical for Enterprises. Multiple knobs are available today; a few examples:
      ◦ AWS IAM & Access Policies (equivalent to ACLs)
      ◦ MFA (Multi-Factor Authentication) for operations such as deletes
      ◦ Integration with existing Active Directory & LDAP solutions
  • VPC Endpoint Policy: Ensures secure network access to data stores such as S3 from instances running within the VPC
  • Key rotation Policy: Configuring frequency of key rotation in AWS KMS
  • Vulnerability Checking Policy: Services such as Amazon Inspector & AWS Trusted Advisor continuously alert on vulnerabilities in open-source stack components, as well as anomalies in configuration policies. Enterprises should have clear policies on the timeframe to address vulnerabilities and on how to react to access anomalies.
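
To make these knobs concrete, here is a minimal Python (boto3) sketch, assuming a hypothetical bucket name and KMS key alias, that sets an SSE-KMS default encryption policy on an S3 bucket and enables rotation of the underlying KMS key; tailor the names and error handling to your environment.

```python
import boto3

s3 = boto3.client("s3")
kms = boto3.client("kms")

BUCKET = "my-enterprise-data"              # hypothetical bucket name
KMS_KEY_ALIAS = "alias/enterprise-data-key"  # hypothetical KMS key alias

# Default encryption policy: every new object is encrypted with SSE-KMS
# unless the writer explicitly specifies otherwise.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": KMS_KEY_ALIAS,
            }
        }]
    },
)

# Key rotation policy: resolve the alias to a key id and have KMS rotate
# the key material automatically.
key_id = kms.describe_key(KeyId=KMS_KEY_ALIAS)["KeyMetadata"]["KeyId"]
kms.enable_key_rotation(KeyId=key_id)
```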

Data Lifecycle Policies: Policies to manage the lifecycle of data:

  • Versioning Policy: The number of versions of data to maintain, and when older versions expire.
  • Policy to read from replicas: In services such as AWS Aurora, whether to read from the replicas (which are eventually consistent)
  • Tiering Policy: Policies to automate moving data across different storage classes (for instance, in AWS S3, moving cold data to Glacier). A configuration sketch follows this list.
  • Backup/Snapshots Policy: These policies primarily serve to protect against data corruption. The policies can be schedule-driven or event-driven.
  • Data classification Policy: Defining metadata tags to be assigned to data objects — having a consistent approach across the DevOps teams is important, especially if automation policies rely on these tags to further decide encryption, tiering, redundancy, etc.
  • Optimization Policies: Policies to govern cost optimizations such as data compression, use of spot instances, scaling down idle resources, etc.
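
Below is a minimal Python (boto3) sketch of a lifecycle policy, assuming a hypothetical bucket and a "logs/" prefix: it enables versioning, tiers cold objects to Glacier after 90 days, and expires old data and noncurrent versions. The retention numbers are illustrative placeholders, not recommendations.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-enterprise-data"  # hypothetical bucket name

# Versioning policy: keep prior versions so accidental overwrites/deletes
# are recoverable.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Tiering + expiry policy: move objects under "logs/" to Glacier after
# 90 days, expire them after 2 years, and drop noncurrent versions after
# 1 year.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-and-expire-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 730},
            "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
        }]
    },
)
```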

Availability Policies:

  • Level of Redundancy: Policy to define the failures-to-tolerate, which dictates the number of replicas to maintain for HA.
  • DR Policy: Defining DR needs w.r.t. RPO and RTO, which translates to picking one or more Availability Zones or Regions for replication. A cross-region replication sketch follows this list.
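
As an illustration of a DR policy expressed as configuration, the following Python (boto3) sketch enables S3 cross-region replication from a source bucket to a DR bucket. The bucket names and IAM role ARN are hypothetical, and both buckets must already exist with versioning enabled.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical names: the IAM role must allow S3 to replicate on your behalf.
SOURCE_BUCKET = "my-enterprise-data"
DEST_BUCKET_ARN = "arn:aws:s3:::my-enterprise-data-dr"
REPLICATION_ROLE_ARN = "arn:aws:iam::123456789012:role/s3-replication-role"

s3.put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE_ARN,
        "Rules": [{
            "ID": "dr-replication",
            "Prefix": "",            # replicate every object in the bucket
            "Status": "Enabled",
            "Destination": {"Bucket": DEST_BUCKET_ARN},
        }],
    },
)
```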

Data Namespace Policies: Cloud Data Stores use scale-out instead of scale-up architectures. The following policy knobs help ensure even data distribution (technically referred to as sharding) across the internal cluster nodes used by the service.

  • Object Naming Policy: Services such as S3 use some variant of consistent hashing. Objects that share the same name prefix (such as object names based on a timestamp) map to the same node within the cluster, creating hotspots. A scale-out-friendly naming policy can ensure better performance and scalability; a naming sketch follows this list.
  • Data Partitioning Policy: Partitioning of table rows or columns is an old database concept. Dividing the data tables into partitions helps with the scalability of queries. Partitioning is based on domain understanding of what the data means, and how it will be accessed.
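
A minimal, illustrative naming scheme (plain Python) that avoids timestamp-prefix hotspots by prepending a short hash to each key; the key layout itself is an assumption for illustration, not an S3 requirement.

```python
import hashlib
from datetime import datetime, timezone


def scale_out_friendly_key(record_id: str) -> str:
    """Prepend a short hash so keys spread across partitions instead of
    clustering on a shared timestamp prefix (which creates hotspots)."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    natural_key = f"{timestamp}/{record_id}"
    prefix = hashlib.md5(natural_key.encode()).hexdigest()[:4]
    return f"{prefix}/{natural_key}"


# Output shape: "<4-hex-chars>/<timestamp>/order-42" instead of
# "<timestamp>/order-42", so consecutive writes land on different partitions.
print(scale_out_friendly_key("order-42"))
```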

Auto-scaling Policies: Services vary w.r.t. the level of automation in resource scaling. Ideally, the user should be able to specify the “what” (such as 1TB/day of ingestion and 100K reads/sec) while the service figures out the “how” (number of nodes required, instance sizes, disk IOPS, etc.). Services such as AWS DynamoDB allow scaling by specifying the what, while services such as AWS Aurora provide finer-grained control w.r.t. the number of servers, etc. Another example is AWS Kinesis Analytics, where auto-scaling policies define scaling groups and triggers for automated scale-up/down of CPU, memory, and storage resources using a combination of metrics (load- or schedule-driven).
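
For example, declaring the “what” for DynamoDB can look like the following Python (boto3) sketch using Application Auto Scaling; the table name and capacity bounds are hypothetical placeholders.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

TABLE = "table/Orders"  # hypothetical DynamoDB table

# Declare the "what": write capacity may float between 5 and 500 units.
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId=TABLE,
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=5,
    MaxCapacity=500,
)

# Target-tracking policy: the service works out the "how", adjusting
# provisioned capacity to keep utilization around 70%.
autoscaling.put_scaling_policy(
    PolicyName="orders-write-target-tracking",
    ServiceNamespace="dynamodb",
    ResourceId=TABLE,
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)
```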

Performance Policies: Policies in this bucket help control the trade-offs between throughput and latency, as well as between data staleness and access latency:

  • Caching/Buffering Policy: Examples include specifying look-aside caching, either as a separate layer (such as Redis/Memcached) or as integrated caching within the Data Service (such as AWS Aurora). These policies essentially allow trading off data staleness against access latency (a look-aside caching sketch follows this list). Other example knobs include: read-ahead configuration (in AWS EBS); persistence of the cache to avoid warm-up degradation after failures (in AWS Aurora).
  • Concurrency Policy: Examples include queue depth (in the case of AWS EBS) and the max number of concurrent streams (in AWS Kinesis). The goal is to tune concurrency limits to deliver better throughput.
  • Resource Isolation: Examples include EBS-optimized EC2 instances, which dedicate separate bandwidth for EBS traffic versus regular network traffic.
  • On-disk Data Layout/Compression: For services that support multiple data formats (such as AWS Athena on S3 files), selecting the right format based on the workload requirements.
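
A minimal look-aside caching sketch in Python (redis-py), assuming a hypothetical cache endpoint and a placeholder fetch_from_store function; the TTL knob is what bounds data staleness in exchange for lower access latency.

```python
import json

import redis  # pip install redis

# Hypothetical cache endpoint (e.g., an ElastiCache Redis node).
cache = redis.Redis(host="my-cache.example.com", port=6379)
CACHE_TTL_SECONDS = 60  # staleness bound: cached reads may be up to 60s old


def fetch_from_store(key: str) -> dict:
    """Placeholder for the slower read against the backing data store."""
    raise NotImplementedError


def cached_read(key: str) -> dict:
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)          # fast path: serve from the cache
    value = fetch_from_store(key)       # slow path: go to the data store
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(value))  # populate cache
    return value
```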
