Addressing the Data and Cyber Interface

Cloud-native trends in modern data security (and establishing best practices for their adoption)

Justin Van Wygerden
Slalom Data & AI
5 min read · Oct 31, 2023


Photo by Christina Morillo from Pexels

In a world where the volume of data is constantly expanding, it is critical to think about the continually evolving cybersecurity impact. With the growth of cloud-native solutions for enterprise, mid-tier, and startup companies alike, modern approaches to data security are critical when addressing the common challenges facing present-day tech initiatives. Several examples of these challenges include:

  • Leveraging cloud-managed services for secure data ingestion (with an emphasis on AWS, Azure, and GCP)
  • Navigating industry-standard methods that protect against unauthorized or undetected data access
  • Keeping track of the vast amount of unstructured data within a company’s data lakes
  • Increasing data security for cloud-native data warehouses and related ETL (extract, transform, and load) processes
  • Approaching data compliance strategies, especially when leveraging multiple cloud vendors

Before addressing these challenges, it’s helpful to review relevant data trends. Recently, Airbyte released results from a data engineering survey of 886 respondents, indicating key information about the direction of the overall data community. Below are several key takeaways from Airbyte’s State of Data 2023 report, each paired with a respective inference on the basis of data security.

1. Multicloud data warehouse usage

Per the report, at least 100 respondents are using the following data warehouses: Amazon Redshift (AWS), Azure Synapse (Microsoft), BigQuery (GCP), Databricks, and Snowflake.

Why it matters:

Companies are beginning to utilize a broad range of cloud vendors for data warehousing, which only elevates the complexity of securing that data.

2. Cloud-native data warehouses are on the rise

At least 500 respondents have either heard of, are interested in, or are already using the data warehouses listed above.

Why it matters:

Interest in cloud-native data warehouses (which will likely increase data security complexity) appears to be on the rise.

3. Companies are leveraging multiple data integrations

More than half of survey respondents reported using six or more data connectors.

Why it matters:

A large number of data connectors indicates that source data is being transformed at a variety of points, which likely heightens the cybersecurity impact.

4. Double-digit data connection points

Approximately one-third of respondents reported using more than 10 data connectors.

Why it matters:

With so many different data connection points, understanding the data flow within an organization is becoming increasingly complex.

Examining data lake and data warehouse security

In examining these patterns from Airbyte’s report with a data security lens, it’s clear that new and updated best practices are needed. By looking at solution techniques for multicloud and high-growth workloads, more effective strategies can be brought to light. The following section breaks down best practices for both data lake security and data warehouse security, with a primary emphasis on cloud-native use cases.

Data lake security

Because data lakes primarily contain unstructured data (typically stored in cloud object storage, e.g., Amazon S3, Azure Blob Storage, or Google Cloud Storage), there are specific preventive controls that can be put in place. Some of these include:

  • Automating data lake configuration and updates (detect configuration drift). By automating the creation and updates of data lakes, the chances of manual errors are significantly reduced, and drift away from the intended configuration can be caught and corrected quickly (see the drift-detection sketch after this list).
  • Upgrading validation/secure transit of unstructured source data. Data entering a data lake should be validated at its source, as well as properly encrypted in transit to data lakes that also encrypt data at rest. Note that this activity will likely take time, but it provides excellent value in understanding data entry points.
  • Enhancing security configuration of data lake object storage. Simple actions can often be the most important, especially when it comes to cloud data security. Blocking public access, updating role-based access control (RBAC) roles, and defining clear, tested access policies can drastically increase data lake security (a minimal hardening sketch follows this list).
  • Emphasizing data auditing. All steps of the ETL pipelines that consume source data from a team’s data lake should have a robust auditing solution. Moreover, consider retaining audit data long enough that teams can look back at consistent patterns and trends (e.g., in the event of a data security anomaly).
  • Data lake monitoring, logging, and alerting. Arguably the most important of these best practices: data lakes should have an “understandable” monitoring strategy (one that both engineers and C-suite team members can make sense of). This helps identify outlying patterns quickly and increases speed to remediation.
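
To make the object-storage hardening above more concrete, here is a minimal sketch using boto3 against a hypothetical bucket name (example-data-lake-bucket): it blocks public access, turns on default encryption at rest, and attaches a bucket policy that denies non-TLS requests. The bucket name and policy details are illustrative assumptions, not a complete hardening baseline.

```python
# Minimal sketch: hardening an S3 data lake bucket with boto3.
# The bucket name is a hypothetical placeholder; adjust for your environment.
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake-bucket"  # hypothetical data lake bucket

# Block all forms of public access at the bucket level.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Enable default encryption at rest (SSE-S3 here; SSE-KMS is also common).
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)

# Require TLS for all requests to the bucket (encryption in transit).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```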
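
Drift detection itself can be handled by dedicated services (e.g., AWS Config) or by IaC plan diffs; purely as an assumed illustration, the sketch below re-reads a bucket’s public access block settings with boto3 and flags any value that has drifted from a desired baseline. The bucket name and baseline are hypothetical.

```python
# Minimal drift-detection sketch: compare the live S3 public access block
# configuration against a desired baseline and report any differences.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake-bucket"  # hypothetical placeholder

DESIRED = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def detect_drift(bucket: str) -> list[str]:
    """Return the settings that no longer match the desired baseline."""
    response = s3.get_public_access_block(Bucket=bucket)
    actual = response["PublicAccessBlockConfiguration"]
    return [
        f"{key}: expected {expected}, found {actual.get(key)}"
        for key, expected in DESIRED.items()
        if actual.get(key) != expected
    ]


drift = detect_drift(BUCKET)
if drift:
    # In practice this would feed an alerting channel rather than stdout.
    print("Configuration drift detected:", *drift, sep="\n  ")
else:
    print("No drift detected.")
```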

Data warehouse security

Data warehouses typically contain structured data that is readily consumable by teams, which creates a clear line of differentiation from data lakes. Some key data security points for data warehouses include:

  • Provisioning your data warehouse(s) as code. Though this is similar to the best practice for data lakes, a key difference is the scalability component. Because an organization is more likely to have multiple data warehouses, investing in data warehouse infrastructure as code (IaC) can increase repeatability and scale across your organization (a provisioning sketch follows this list).
  • Emphasizing readily available testing data. Investing in readily available testing data that suits your team’s needs can unlock a significant amount of automation. Additionally, it gives a structured approach that many teams within an organization can consume.
  • Utilizing a scalable RBAC solution. RBAC roles should be created in a way that is easy to understand. Consider creating fewer roles that are more meaningful (and have clear security boundaries) as opposed to an extensive number of custom roles.
  • Leveraging cloud-native data security features. When using vendor-specific data warehouses (Amazon Redshift, Azure Synapse, or GCP BigQuery), appropriate cloud-native controls are great to put in place. For the majority of use cases, replacing manual controls with cloud-managed services is a win.
  • Achieving real-time data warehouse observability. Data warehouses give teams an efficient opportunity for real-time observability since the data exists in a defined, structured state. In addition, this provides significant benefits for understanding usage trends across consumer teams (see the monitoring sketch after this list).
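
As a sketch of provisioning a warehouse with cloud-native security controls enabled, the example below creates a hypothetical Amazon Redshift cluster via boto3 with encryption at rest turned on and public accessibility turned off. All names and sizing values are assumptions, and in practice this definition would usually live in IaC (CloudFormation, Terraform, or CDK) rather than an ad hoc script.

```python
# Minimal sketch: provisioning an encrypted, non-public Redshift cluster
# with boto3. Identifiers and sizing values are hypothetical placeholders.
import boto3

redshift = boto3.client("redshift")

# In practice, pull the admin password from a secrets manager, not code.
admin_password = "replace-with-secret-from-your-vault"  # placeholder

redshift.create_cluster(
    ClusterIdentifier="example-analytics-warehouse",  # hypothetical name
    NodeType="ra3.xlplus",
    ClusterType="multi-node",
    NumberOfNodes=2,
    DBName="analytics",
    MasterUsername="warehouse_admin",
    MasterUserPassword=admin_password,
    Encrypted=True,            # cloud-native encryption at rest
    PubliclyAccessible=False,  # keep the endpoint off the public internet
)
```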
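
For the observability point, one low-effort starting place (assuming an AWS warehouse) is pulling cluster metrics from CloudWatch; the sketch below reads the last hour of CPU utilization for the hypothetical cluster named above. Production setups would typically wire these metrics into dashboards and alarms rather than polling them in a script.

```python
# Minimal sketch: pulling recent Redshift cluster metrics from CloudWatch
# as a starting point for warehouse observability. The cluster identifier
# and metric choice are assumptions.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Redshift",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterIdentifier", "Value": "example-analytics-warehouse"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,                # 5-minute buckets
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2), "% CPU")
```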

Data best practices in real-world scenarios

It may not be easy to apply every data security best practice all at once, so don’t panic! Like feature development, security enhancements can be managed in an agile, prioritized backlog. Instead of attempting to get everything done at once, take the time to focus on and prioritize the data security needs that affect your teams the most.

Additionally, work on developing an engineering plan that focuses on clear, consistent goals. Remediation is an ongoing journey that favors teams that consistently dedicate bandwidth to security and overcommunicate concurrent changes across data engineers, developers, and DevOps engineers/SREs. Piloting quarterly security enhancements is another practice that can yield new, automated benefits for your organization, especially in the age of cloud-managed services.

Happy data-sec engineering!

Slalom is a global consulting firm that helps people and organizations dream bigger, move faster, and build better tomorrows for all. Learn more and reach out today.
