Cloud Data Platform Security, Governance, and Disaster Recovery Best Practices

Disclaimer: I am a Principal Solutions Architect at Snowflake with experience in data strategy, architecture, and engineering. The views expressed here are mine alone and do not necessarily reflect the views of my current, former, or future employers.

Many articles on my LinkedIn feed compare “speeds and feeds” among the large cloud data vendors: posts like “New feature makes product X 2x faster than product Y.” It is fair to say that these assertions, especially without code, are rarely useful for decision-makers weighing whether a vendor is a good fit for their organization. Speed is one component, alongside use case fit, integrations with existing products, price, and, critically, data security, governance, and availability. Security and governance features may not be as flashy and innovative (AI, anyone?), and they don’t make for viral vendor A vs. vendor B benchmark articles. But in today’s landscape, taking a holistic view of your cloud data platform is imperative: it is the core component of your modern data stack.

Below is a list of best practices I have worked with customers to implement over the past several years to ensure the proper security of their cloud data platforms. It is not an exhaustive list, and it is ever-evolving in a world of continually updated software. And yes, these are all available on Snowflake’s AI Data Cloud platform.

Security

Allowing only authorized users to access your platform in a secure and controlled manner.

  • Data Encryption by Default — When interacting with data in the cloud, data must be encrypted by default, both at rest and in transit. As soon as data lands in the cloud, the platform must hold a key to (transparently) decrypt it, and that key should be rotated regularly (e.g., every 30 days). Many customers take this a step further and control the key used by their cloud platform themselves: if there is a security concern, they can pull the key and all of the data becomes unreadable (see the re-keying sketch after this list).
  • Network & Login Policies—Control Internet Protocol (IP) addresses and Virtual Private Cloud (VPC) networks allowed to access your cloud platform. Most corporations have a Virtual Private Network (VPN) or proxy through which all communication is routed. This makes it easy for your cloud platform only to allow access through trusted network connections. In addition, it should offer fine-grained control of what drivers or client-specific users can access the platform. For example, data analysts can only access the UI and not Python/ODBC.
  • Private Networking — In addition to network and login policies, customers often want to avoid the public Internet entirely, preferring traffic to travel over private IPs on private networks (e.g., AWS PrivateLink). This allows customer VPCs and the cloud data platform's VPC to communicate without traversing the public Internet.
  • Single Sign-On, Password Policies, and MFA — For all users, Single Sign-On (SSO) through an identity provider (IdP) such as Azure Active Directory (AD) or Okta is preferred. SSO is not always feasible for service accounts, however, so for users without it, password policies with Multi-Factor Authentication (MFA) are a must. Password policies set rules for password complexity and how often passwords must be changed, and MFA should be the default for every user with a password (see the password policy sketch after this list).
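
As a concrete example of the key-rotation point above, here is a minimal Snowflake SQL sketch that opts an account into periodic re-keying, which re-encrypts data protected by keys more than a year old; it assumes an Enterprise Edition account.

```sql
-- Opt in to periodic re-keying: data protected by a key older than one
-- year is automatically re-encrypted with a new key (Enterprise Edition).
ALTER ACCOUNT SET PERIODIC_DATA_REKEYING = TRUE;
```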
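
For network and login policies, a sketch of an account-level network policy follows. The policy name (corp_vpn_only) and CIDR ranges are hypothetical placeholders for your VPN's egress addresses.

```sql
-- Allow logins only from the corporate VPN egress range, minus one bad host.
CREATE NETWORK POLICY corp_vpn_only
  ALLOWED_IP_LIST = ('203.0.113.0/24')   -- hypothetical VPN CIDR
  BLOCKED_IP_LIST = ('203.0.113.99');    -- hypothetical blocked host

-- Enforce the policy for the entire account.
ALTER ACCOUNT SET NETWORK_POLICY = corp_vpn_only;
```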
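
And for users who must authenticate with passwords, a password policy sketch; the name and thresholds shown are illustrative, not recommendations.

```sql
-- Password rules for users that cannot use SSO (e.g., service accounts).
CREATE PASSWORD POLICY svc_pw_policy
  PASSWORD_MIN_LENGTH = 14
  PASSWORD_MAX_AGE_DAYS = 90
  PASSWORD_MAX_RETRIES = 5
  PASSWORD_LOCKOUT_TIME_MINS = 30;

-- Apply the policy account-wide.
ALTER ACCOUNT SET PASSWORD POLICY svc_pw_policy;
```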

Governance

Protecting data assets by allowing only authorized users to access them. Access should be closed by default.

  • Coarse-Grained Access Controls — Use Role-Based Access Control (RBAC) to govern who can view which data assets. This must scale to thousands of users with varying needs to create, read, or modify thousands of tables. Programmatically applying roles to objects in the platform keeps your data secure by default while quickly granting access to the users who require it (see the RBAC sketch after this list).
  • Fine-Grained Access Controls — Once it is known who can access which tables and data assets, customers work with governance teams to determine whether data should be further restricted for specific columns (e.g., showing only the last four digits of a credit card number) or for specific rows (e.g., filtering sales data to your region or territory). Column- and row-level security policies make this process easy to manage and scale. In addition, the flexibility to apply policies at different grains of data (e.g., aggregated vs. transaction level), or to allow specific columns to be used in queries, provides feature-rich capabilities, especially when dealing with sensitive data (see the masking and row access policy sketch after this list).
  • Tagging and Classification — Determine and track sensitive data within your environment. For example, if you have Social Security numbers in a column called SSID, that data should be automatically tagged and properly protected. Classification makes it easy to identify and tag columns, tables, or databases so others can act on the data through governance policies or access controls (see the tagging sketch after this list).
  • Logs — Having the ability to trace what was run, by whom, what they accessed, and any other pertinent login/session information is imperative for full visibility into your platform. Logs should be available immediately, with no configuration or setup needed, and retained for a minimum of one year (see the audit query sketch after this list).
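
To make coarse-grained RBAC concrete, here is a minimal sketch; the database (sales), schema (reporting), role (analyst), and user (jdoe) are hypothetical.

```sql
-- Create a read-only role and grant it access to a schema, including
-- tables that will be created in the future.
CREATE ROLE IF NOT EXISTS analyst;
GRANT USAGE ON DATABASE sales TO ROLE analyst;
GRANT USAGE ON SCHEMA sales.reporting TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales.reporting TO ROLE analyst;
GRANT SELECT ON FUTURE TABLES IN SCHEMA sales.reporting TO ROLE analyst;

-- Grant the role to a user (or, better, to another role in a hierarchy).
GRANT ROLE analyst TO USER jdoe;
```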
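
For fine-grained controls, a sketch pairing a masking policy (column-level) with a row access policy (row-level); the table, column, role, and mapping-table names are hypothetical.

```sql
-- Column-level: show full card numbers only to PAYMENTS_ADMIN; everyone
-- else sees just the last four digits.
CREATE MASKING POLICY mask_card AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() = 'PAYMENTS_ADMIN' THEN val
    ELSE '****-****-****-' || RIGHT(val, 4)
  END;

ALTER TABLE orders MODIFY COLUMN card_number SET MASKING POLICY mask_card;

-- Row-level: each role sees only the rows for regions it is mapped to.
CREATE ROW ACCESS POLICY region_rows AS (region STRING) RETURNS BOOLEAN ->
  EXISTS (
    SELECT 1
    FROM security.region_map m            -- hypothetical role-to-region map
    WHERE m.role_name = CURRENT_ROLE()
      AND m.region = region
  );

ALTER TABLE orders ADD ROW ACCESS POLICY region_rows ON (region);
```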
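
For tagging, a sketch that defines a sensitivity tag, applies it to the SSID column from the example above, and then audits where the tag is used; the tag and table names are hypothetical.

```sql
-- Define a tag with a constrained set of values and apply it to a column.
CREATE TAG pii_type ALLOWED_VALUES 'ssn', 'email', 'phone';
ALTER TABLE customers MODIFY COLUMN ssid SET TAG pii_type = 'ssn';

-- Audit every object and column carrying the tag across the account.
SELECT object_name, column_name, tag_value
FROM snowflake.account_usage.tag_references
WHERE tag_name = 'PII_TYPE';
```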
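
And for logs, an audit query sketch against Snowflake's built-in ACCOUNT_USAGE views, which retain a year of history with no setup required.

```sql
-- Who ran what over the past seven days.
SELECT user_name, role_name, query_text, start_time
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY start_time DESC;

-- Recent login attempts, including failures and the client type used.
SELECT event_timestamp, user_name, client_ip, reported_client_type, is_success
FROM snowflake.account_usage.login_history
WHERE event_timestamp >= DATEADD('day', -7, CURRENT_TIMESTAMP());
```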

Backup and Disaster Recovery

Recover from failures or system outages with minimal effort.

  • Data Recovery — The ability to customize automatic data retention periods, so that a version of the data is stored at every change. This retained data can be used to run time-travel queries, undrop objects after accidental deletion, or review data as of a specific point in time. Even after data falls outside its time-travel retention window, it should remain recoverable for up to one week for emergency repair, without any additional configuration (see the Time Travel sketch after this list).
  • Data Redundancy — Since you are in the cloud, your data should be protected by your cloud provider's object storage durability guarantees. In addition, data should be available in at least three availability zones (AZs): triple-replicated without extra cost or additional configuration.
  • Compute Redundancy — Knowing your data is secured across three AZs is great, but what happens if your compute layer goes down? Your compute engines should employ the same redundancy across AZs, so that another availability zone can run your query in case of AZ downtime. And because ephemeral compute is prone to occasional failures, your platform should automatically retry queries rather than fail them, so your pipelines continue to succeed.
  • Replication & Failover — In the rare instance that your Cloud provider or region goes down, you should be able to seamlessly failover to another region or cloud (e.g., AWS West to East or AWS to Azure). The frequency of replication should be based on your Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Having the ability to move data and failover the entire platform objects (users, roles, policies, account setting) along with data objects (database, schemas, and tables) will make this process seamless, so you don't have to worry about missing objects required to access your data.
  • Client Redirection — If you fail over from region A to region B, your clients may not know. For example, if your Tableau users have a stored connection to AWS East but you have failed over to AWS West, those users (and other clients) should be redirected automatically to the new URL. This ensures continuity between the failed-over platforms and wherever clients connect to read from and write to the platform (see the connection sketch after this list).
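
To illustrate data recovery, a Time Travel sketch; the table name and query ID are hypothetical, and 90-day retention requires Enterprise Edition.

```sql
-- Extend Time Travel retention for a critical table (default is 1 day;
-- up to 90 days on Enterprise Edition).
ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 30;

-- Query the table as of a point in time, or just before a bad statement ran.
SELECT * FROM orders AT(TIMESTAMP => '2024-06-01 09:00:00'::TIMESTAMP_LTZ);
SELECT * FROM orders
  BEFORE(STATEMENT => '01a2b3c4-0000-0000-0000-000000000000');  -- hypothetical query ID

-- Recover an accidentally dropped table.
UNDROP TABLE orders;
```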
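
For replication and failover, a failover group sketch; the organization (myorg), account names, and database are hypothetical, and the ten-minute schedule simply stands in for whatever your RPO dictates.

```sql
-- On the primary account: replicate data and account objects to a
-- secondary account in another region or cloud, every 10 minutes.
CREATE FAILOVER GROUP prod_fg
  OBJECT_TYPES = (ACCOUNT PARAMETERS, DATABASES, ROLES, USERS, WAREHOUSES)
  ALLOWED_DATABASES = (sales)
  ALLOWED_ACCOUNTS = (myorg.dr_account)
  REPLICATION_SCHEDULE = '10 MINUTE';

-- On the secondary account: materialize the replica, then promote it to
-- primary during an outage.
CREATE FAILOVER GROUP prod_fg AS REPLICA OF myorg.prod_account.prod_fg;
ALTER FAILOVER GROUP prod_fg PRIMARY;
```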
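
Finally, a connection sketch for client redirection, which gives clients a stable organization-level URL that always resolves to the current primary; the names are again hypothetical.

```sql
-- On the primary account: create a connection. Clients connect through
-- the org-level URL (myorg-prod_conn.snowflakecomputing.com) rather than
-- an account-specific URL.
CREATE CONNECTION prod_conn;

-- On the secondary account: link a replica of the connection.
CREATE CONNECTION prod_conn AS REPLICA OF myorg.prod_account.prod_conn;

-- During failover, promote the secondary; existing client connection
-- strings now resolve to the new primary automatically.
ALTER CONNECTION prod_conn PRIMARY;
```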

Conclusion

Cloud software is ever-evolving. Most cloud products ship feature updates weekly or monthly, continually adding capabilities to better govern and protect your data assets. As you evaluate, modernize, or upgrade your critical data infrastructure, pay close attention to security, governance, and disaster recovery capabilities. If data is the lifeblood of your organization, these features will give leaders and data users confidence that the data will remain available and can be relied upon to drive trustworthy impact and results.
