Scaling Horizons: Effective Strategies for Wix’s Scaling Challenges

Natan Silnitsky
Wix Engineering
Published in
9 min read2 days ago

Have you ever faced the challenge of scaling your services and databases as demand skyrockets? If vertical scaling has led you to resource bottlenecks, service downtime, and potential revenue loss, you’re not alone. And what about when a single tenant’s activity throws the entire system off balance, causing outages and stability issues?

Scaling a platform as expansive as Wix is no small feat. As a leading website-building platform catering to millions of users globally, the need to handle enormous traffic and vast datasets is ever-growing. Today, we delve into the battle-tested strategies Wix employs to tackle these scaling challenges and ensure a smooth, reliable service for our users.

By the end of this article, you’ll have a clearer understanding of which strategies might work best for your own use-case.

A Glimpse into Wix’s Scale

To understand the magnitude of our operations:

  • Our system comprises 4,000 microservices
  • We handle 4 billion HTTP transactions daily
  • We produce 70 billion Kafka messages every day
  • Our systems manage over 4 petabytes of data

Operating at such a colossal scale inherently comes with unique challenges. Our infrastructure must not only accommodate today’s demands but also scale to meet future growth as more users join our platform.

Vertical vs. Horizontal Scaling

When it comes to microservices scaling, there are two basic options:

  • Vertical Scaling (Scaling Up): Increasing the power of an existing machine by adding more CPU, RAM, etc.
  • Horizontal Scaling (Scaling Out): Expanding the system by adding more machines to handle the load.

Initially, vertical scaling might seem straightforward. However, real-world constraints quickly emerge, such as bottlenecks and escalating costs.

For instance, our Redis machine used by our rate limiting service frequently hit 100% CPU utilization. This caused disruptions and triggered fallbacks to simpler algorithms. Despite upgrading to more powerful CPUs multiple times, we still faced these issues.

This clearly illustrates the limitations of vertical scaling. At a certain point of higher traffic loads, all Redis users need to move to a cluster of Redis machines, i.e., horizontally scale Redis usage.

The Horizontal Scaling Paradigm

To effectively manage our scaling needs, we’ve shifted towards horizontal scaling. This approach involves distributing workloads across multiple machines, significantly enhancing system responsiveness and reliability.

It’s more efficient for distributing traffic and preventing bottlenecks, but it requires complex management to ensure coherent operation across all machines. Horizontal scaling requires careful selection of routing/sharding keys (how data is managed and routed).

Routing and Sharding Strategies

Efficient data management across different nodes requires effective routing and sharding strategies. These strategies aim for uniform data distribution and consistency, thereby reducing latency and boosting overall performance by distributing data and queries effectively.

Fixed Routing

In fixed routing, data or traffic is assigned to specific shards or paths based on predefined keys. This method helps maintain consistency and predictability.

Dynamic Routing

Dynamic routing adjusts the distribution of data or traffic on-the-fly based on custom rules or mathematical functions. This method offers flexibility and scalability, particularly for applications with fluctuating demand. For instance, we employ hashing functions to achieve uniform distribution in a consistent manner.

Highlighting Wix Scaling Examples

As Wix scales, we face various challenges in managing high loads, ensuring reliability, and maintaining performance consistency across different services. Here’s how we address these challenges with our routing and sharding key selection strategies:

1. Kafka Infra Services Fixed-Key Scaling Strategies

Wix operates multiple proxy services that handle Kafka traffic. Here, we will focus on two critical services: the Wix Kafka consumer proxy, which caches messages and reduces broker connections for improved efficiency and performance, and Wix’s Kafka-to-HTTP-webhooks translation service, which exposes events to external apps running on the Wix platform.

Managing high volumes of Kafka traffic through these services presents significant challenges:

Pain Points:

  • Single Points of Failure: An issue with even one Kafka connection can potentially bring down the entire service, as the impact can be significant in a single service deployment. This includes issues like CPU or memory exhaustion, which can propagate and affect overall service performance.
  • Noisy Neighbors: Maintaining high reliability and performance for each individual tenant is challenging when all service instances handle requests from all tenants, making it difficult to isolate issues impacting one tenant from affecting others.

Our Solutions:

To tackle these issues, we implement horizontal scaling with straightforward fixed-keys traffic routing:

  • Fixed Clusters Keys Routing: We deploy consumer proxies for each Kafka cluster. This strategy avoids single points of failure by ensuring efficient message consumption and distributing the load evenly among consumers.
  • Fixed QoS Groups Keys Routing on Kafka Dimension: Our Kafka-to-HTTP-webhooks translation system uses QoS tiers to segment data. By doing so, we ensure that higher QoS tenants maintain reliability even when other segments experience frequent timeouts.

Currently, this fixed-keys traffic routing method effectively ensures high availability and performance for all tenants. However, we are considering introducing more dynamic key selections based on live metrics such as traffic and latency to further enhance our capabilities if needed in the future.

2. Web Traffic Management

Fixed Segments: At Wix, web traffic is divided into two fixed segments:
the Editor segment (site editing and business management) and the Public segment (serving sites to visitors). This decoupling — driven by the need to transform data structures during the publishing process — allows us to tailor service levels for different functions.

The Public segment, affecting all users, demands the highest service levels regarding availability, performance, risk of change, and recovery time. The Editor segment, impacting fewer users, can operate with lower service levels, allowing for greater agility.

Dynamic Routing: Load balancers dynamically route traffic across multiple servers. For instance, nginx routes *.wix.com traffic to the Editor segment and wixsite.com or user domains to the Public SSR segment. This ensures efficient web request management and seamless performance during traffic spikes.

3. Reactions App Using Kafka & DynamoDB to Scale

Reactions are a key feature for social interaction on Wix sites and mobile apps, including likes, stars, thumbs up, comments, and more. To support millions of users, the system needs to be fast, reliable, and scalable.

Here’s how we address the key design considerations:

  • High Write Throughput: Many users will post reactions simultaneously, requiring a system that can handle a high volume of concurrent writes.
  • Low Latency Reads: Users expect to see reactions in real-time, necessitating a system that provides immediate feedback.
  • Scalability: As Wix grows, the reactions system must scale seamlessly to accommodate increasing loads.

Dynamic Routing on Kafka Dimension
For the Reactions app, we use Kafka’s partitioning and message-handling capabilities to manage high loads efficiently.

  • Partitioning: Divides topics for parallel processing and horizontal scaling.
  • Performance & Durability: Ensures high throughput, low latency, and reliable data storage.
  • Order Guarantee: Maintains event order within each partition.
  • Scalability: Sharding patterns and increasing partitions enable seamless growth.
  • Dynamic Message Routing: Messages are dynamically produced to different partitions based on their keys. This allows for balanced load distribution and ensures efficient processing.
  • Consumer Parallelism: Consumers in a group are assigned different partitions. To scale, increase the number of partitions.

By dynamically routing messages to partitions, we ensure the Reactions app processes real-time interactions efficiently, supporting high write throughput and low latency reads.

Dynamic data Sharding (DynamoDB):
DynamoDB uses tables with primary keys for data distribution. Composite primary keys (partition and sort keys) enable optimized data access and the storage of multiple items with the same partition key.

  • Partition Key Importance: Distributes data across storage partitions, balancing load and scaling the database.
  • Schema Flexibility: Allows each item to have unique attributes.
  • Capacity Limits: Each partition holds up to 10 GB of data and supports up to 3,000 read or 1,000 write capacity units.
  • Internal Hash Function: Uses the partition key as input to a hash function to determine storage partitions, ensuring efficient data retrieval and load distribution.

These advanced sharding techniques allow DynamoDB to dynamically adjust to data and throughput demands, maintaining consistent performance. This scalability ensures the reactions system can seamlessly grow alongside Wix.

4. Data Locality for Enterprise Customers

Fixed Data Sharding
For enterprise customers with specific data locality requirements, Wix employs fixed geographic sharding to comply with regulations like GDPR. This ensures that data, such as employee-related information, is stored only in specified regions (e.g., only in the EU or the US). Wix has data centers in both the US and Europe and supports tenants in choosing where their data is persisted. Some tenants may opt for regional storage while others prefer global copies. Most of Wix’s microservices data is stored in MySQL tables.

Dynamic Data Routing
Using ProxySQL, a high-performance load balancer for MySQL, Wix dynamically routes database traffic while maintaining geographic sharding. ProxySQL allows for:

  • Intelligent Query Routing: Distributes database traffic for optimal load balancing and performance.
  • Enhanced High Availability: Redirects traffic away from failed nodes to maintain uninterrupted database access.
  • Sharding Support: Manages large datasets and scales out by routing queries to the appropriate shards.

ProxySQL enables configuring custom query rules to route specific SQL traffic to predefined groups of nodes within a MySQL cluster. Routing decisions are made based on regular expressions in SQL statements, ensuring efficient performance and compliance with data locality requirements.

Key Takeaways

Implementing horizontal scaling comes with increased complexity in system modeling, architecture, and management. However, these strategies are essential to ensuring our platform remains fast, reliable, and scalable, meeting the diverse needs of our growing user base.

  • Choosing Keys to Split: Carefully analyze the workload and traffic patterns to select the most effective keys for sharding and routing.
  • Tech Stack Choice: Select a tech stack that supports efficient scaling strategies. Technologies such as Kafka, DynamoDB, and ProxySQL offer robust capabilities for horizontal scaling.

Overall Scaling Strategy Selection

The following diagram illustrates a decision tree to guide you in selecting the optimal scaling strategy for your use case. This decision tree focuses on three key considerations:

  1. Resource Usage and Cost: Evaluate the resources required and the associated costs of different scaling strategies.
  2. Number of Potential Tenants: Consider the number of different tenants your system needs to support.
  3. Complexity of Potential Routing Rules: Assess the complexity of the routing rules needed to manage traffic and data effectively.

Conclusion

Scaling a platform as vast as Wix requires an ongoing commitment to innovative solutions and flexible strategies. By blending vertical and horizontal scaling with fixed and dynamic routing and sharding techniques, Wix optimizes performance, adapts to evolving requirements, and ensures a reliable service.
This approach equips Wix to handle current demands while preparing for future growth, and it can provide valuable insights for anyone facing similar scaling challenges.

Thank you for reading!

If you’d like to get updates on my future software engineering blog posts, follow me on Twitter, LinkedIn, and Medium.

You can also visit my website, where you will find my previous blog posts, talks I gave in conferences and open-source projects I’m involved with.

If anything is unclear or you want to point out something, please comment down below.

--

--