Snowflake’s migration to Envoy for traffic management

Snowflake is a globally available, cross-cloud, multi-region service. When customers connect to Snowflake, their request is mapped to a cloud region and routed to specific internal services within the region for optimal processing. Snowflake’s traffic layer is responsible for this functionality. It ensures secure connectivity via Transport Layer Security (TLS), load balancing for reliable data flow, and business logic for routing queries. The details of the traffic layer are mostly invisible to customers, except for the URLs they use to interact with Snowflake.

The current traffic stack uses cloud provider application load balancers for TLS termination and load balancing, and NGINX as the L7 traffic routing layer, running on virtual machines (VMs).

To further enhance performance, security, and reliability for customers, Snowflake has deployed a new traffic serving layer. The new stack uses L4 load balancers provided by the underlying cloud platform, and Envoy Proxy as the L7 traffic routing layer, running on Snowflake’s internal Kubernetes platform.

The core of this new stack is Envoy Proxy. Envoy is an open source (Apache 2.0 licensed), high-performance, programmable network proxy originally developed at Lyft, and now maintained as part of the Cloud Native Computing Foundation. It is in wide production use by a number of large companies. It has a vibrant open source community that surfaces and fixes bugs swiftly, and is among the leading adopters of new industry standards.

The customer benefits

Performance

Snowflake is used by customers in more than 35 cloud regions around the world. These customers care deeply about query performance. Snowflake has also launched transactional workloads such as Unistore, where query latency is critical. Performance at the traffic layer is vital for optimizing end-to-end query latency. Envoy’s threading model and Read-Copy-Update strategy allow configuration changes to be applied consistently without creating new processes, which improves performance by preserving state such as connection caches.
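
To make the Read-Copy-Update idea concrete, here is a minimal Python sketch of the general pattern: readers take a reference to an immutable snapshot without locking, while a writer copies the snapshot, changes the copy, and publishes it by swapping a single reference. This is an illustration only, not Envoy’s implementation (Envoy is written in C++ and uses its own thread-local mechanism); the class name ConfigHolder and the "route" key are hypothetical.

```python
import threading
from types import MappingProxyType


class ConfigHolder:
    """Holds an immutable config snapshot that is replaced, never mutated."""

    def __init__(self, initial):
        # MappingProxyType gives readers a read-only view of the dict.
        self._snapshot = MappingProxyType(dict(initial))
        # Only writers take this lock; readers never block on it.
        self._write_lock = threading.Lock()

    def read(self):
        # A single reference load: the caller keeps a consistent snapshot
        # for as long as it holds this reference.
        return self._snapshot

    def update(self, **changes):
        with self._write_lock:
            new_config = dict(self._snapshot)   # copy
            new_config.update(changes)          # update the copy
            # Publish by swapping the reference; old snapshots stay valid
            # for any reader that still holds them.
            self._snapshot = MappingProxyType(new_config)


holder = ConfigHolder({"route": "old-cluster"})
in_flight = holder.read()            # a request that started before the change
holder.update(route="new-cluster")   # config change, no restart needed
```

In this sketch the in-flight request keeps seeing "old-cluster" until it finishes, while any read that starts after the update sees "new-cluster". No process is restarted, so state held elsewhere (for example, open connections) is untouched.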

Security

As the security landscape evolves rapidly, Snowflake is focused on adopting the latest standards and protocols and on responding quickly to vulnerabilities. The migration of Snowflake query traffic to Envoy will improve security for data in transit by adding support for TLS 1.3. It also adds strong TLS 1.2 cipher suites that can improve performance for clients on some devices (e.g., ECDHE-ECDSA-CHACHA20-POLY1305).
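
Customers who want to see which TLS version and cipher suite their client actually negotiates can check with a few lines of Python’s standard ssl module. The sketch below is a generic TLS client check, not a Snowflake tool; the function names make_context and negotiated_tls are our own, and the host you pass in is whatever account URL host you normally connect to.

```python
import socket
import ssl


def make_context():
    """Client context that verifies certificates and refuses anything below TLS 1.2."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def negotiated_tls(host, port=443, timeout=5.0):
    """Connect to host and return (protocol version, cipher suite name)."""
    ctx = make_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            # version() is a string such as "TLSv1.3";
            # cipher() is a (name, protocol, secret_bits) tuple.
            return tls.version(), tls.cipher()[0]
```

Calling negotiated_tls with your account host returns the protocol and cipher your client and the server agreed on. The result depends on both sides, so an older client library may still negotiate TLS 1.2 even when the server offers TLS 1.3.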

Reliability

As Snowflake’s usage grows and the product landscape evolves, the underlying traffic infrastructure needs to be more reliable, flexible and maintainable. Envoy is notable for its extensive programmability, with virtually every facet of its operation dynamically configurable by a set of gRPC APIs. The migration to the Envoy-based stack will bring greater flexibility and reliability as a result of uniform infrastructure, improved observability, and better automation. The shift to a uniform infrastructure stack across cloud regions will allow us to add new features faster, remove accumulated technical debt, and reduce operator overhead.

Envoy is already in use at Snowflake to support features such as the Snowsight UI, Data Loading and Unloading, and Streamlit.

Migration process

Snowflake will undertake a gradual migration of query traffic to the new stack over the coming months. We have developed granular traffic controls that enable us to safely perform this switch, as well as systems to advance the rollout automatically at a controlled pace.

We have started migrating Snowflake-internal traffic first to thoroughly validate functionality and performance. After we are fully satisfied with the behaviors of the new system, we will start migrating customer traffic.
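
One common way to build a percentage-based traffic control of the kind described above is to hash a stable key into a fixed number of buckets and route a request to the new stack when its bucket falls below the current rollout percentage. The sketch below shows that general technique only; it is not a description of Snowflake’s actual controls, and the function names and the choice of the account name as the key are assumptions made for illustration.

```python
import hashlib


def bucket(key, buckets=100):
    """Map a key to a stable bucket in [0, buckets) using SHA-256."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def use_new_stack(key, rollout_percent):
    """Return True when this key should be served by the new stack.

    The bucket for a key never changes, so raising rollout_percent only
    adds keys to the new stack; it never moves one back.
    """
    return bucket(key) < rollout_percent
```

Because the mapping is deterministic, a given key sees consistent behavior at each stage, and an automated system can advance the rollout by raising a single number.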

What’s next?

For the vast majority of customers, we expect that this change will be transparent. Some customers may experience connectivity issues if they:

  • Rely on specific DNS resolution paths: As a part of this change, customer account names will resolve via a different chain of DNS CNAMEs. A Canonical Name (CNAME) Record is used in the Domain Name System (DNS) to create an alias from one domain name to another domain name. Customers using one of Snowflake’s supported URL formats will not be impacted.
  • Rely on public IP addresses used by their Snowflake accounts today: As a part of this change, IP addresses underlying Snowflake account URLs may change. Note that public IP addresses are not part of the connection APIs and are subject to change at any time during the course of routine operations. Customers hardcoding Snowflake IP addresses in their firewall setups should instead use DNS to enforce firewall policies around Snowflake connections (see supported URL formats) to avoid disruption.
  • Have an environment with an outdated root trust store for TLS certificates: As a part of this change, Snowflake in AWS will start using DigiCert-issued server-side TLS certificates and stop using AWS Certificate Manager (ACM) issued server-side TLS certificates. Customers with outdated trust stores will need to update them to trust the G1 and G2 DigiCert Global Root certificate authorities (CAs). Snowflake generally recommends keeping the trust store updated, as the list of trusted root certificates is constantly evolving in response to security threats.
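
For the trust store item above, a quick local check can tell you whether the DigiCert roots are already trusted by the Python runtime on a given machine. The sketch below is a starting point rather than an official diagnostic. It assumes the two roots carry the common names "DigiCert Global Root CA" (often referred to as G1) and "DigiCert Global Root G2"; please confirm the exact certificates against the Knowledge Base article. Other runtimes, such as Java, keep their own trust stores and need to be checked separately.

```python
import ssl

# Assumed common names of the G1 and G2 DigiCert Global Root certificates.
REQUIRED_ROOTS = ("DigiCert Global Root CA", "DigiCert Global Root G2")


def common_names(ca_certs):
    """Collect subject common names from ssl-style certificate dicts."""
    names = set()
    for cert in ca_certs:
        # "subject" is a tuple of RDNs, each a tuple of (key, value) pairs.
        for rdn in cert.get("subject", ()):
            for key, value in rdn:
                if key == "commonName":
                    names.add(value)
    return names


def missing_roots(ca_certs, required=REQUIRED_ROOTS):
    """Return the required root names that are absent from ca_certs."""
    present = common_names(ca_certs)
    return [name for name in required if name not in present]


def system_ca_certs():
    """Certificates loaded into Python's default client context."""
    # Note: on some platforms, certificates in a CA directory are loaded
    # lazily and may not be listed here until they are first used.
    return ssl.create_default_context().get_ca_certs()
```

Running missing_roots(system_ca_certs()) returns an empty list when both roots are found. A non-empty result is a hint to review the trust store, not proof of a problem, given the platform caveat noted in the code.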

Note: For customers connecting to Snowflake over a private network (AWS PrivateLink, Azure Private Link, Google Private Service Connect), there is no change for now. We will provide updates on this topic as appropriate.

For more guidance, please see this Snowflake Knowledge Base article or contact Snowflake Support.

Curious about the engineering challenges described in this blog post? Interested in opportunities to drive reliability, security, and performance improvements for Snowflake? Join us! The Snowflake Traffic Engineering team is hiring in Bellevue, Washington and Dublin, Ireland.
