Published in Women in Technology

May 18, 2023

Scaling Data Engineering Infrastructure: Lessons Learned from Industry Experts.

Scalability is a system’s ability to handle a growing workload. It covers how we plan for that growth and how we add computing resources to cope with the increased load.

Approaches to Managing Workload.

How can we ensure optimal performance as our workload parameters increase?

As data volumes increase, data throughput and scalability become critical considerations. It is important to design a system that can easily scale up or down to accommodate the desired data throughput.

Scalability in data systems involves two main capabilities:

Scale up (Vertical Scaling): Vertical scaling involves increasing resources such as CPU, disk, memory, and I/O on a single machine. This allows the system to handle larger data volumes and temporarily manage high loads. The primary reasons for scaling up are to speed up data processing and to handle larger inputs.

Representation: Visualize a machine as a tower-like structure. Vertical scaling involves adding more floors to the tower, representing the increase in resources within the machine and its capacity.

Scale out (Horizontal Scaling): Horizontal scaling adds more machines to meet workload and resource requirements. When the load spike diminishes, the system should scale back in, removing machines to keep costs down. An elastic system can respond dynamically to varying loads in both directions.

Representation: Imagine a network of interconnected nodes, each representing a separate machine. Horizontal scaling involves adding more nodes to the network, expanding the system’s capacity by distributing the workload across multiple machines.
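To make the picture of "more nodes, less load per node" concrete, here is a minimal sketch of spreading records across nodes by hashing a key. The key format (`order-…`), the function names, and the node counts are all hypothetical and chosen only for illustration; real systems use their own placement schemes.

```python
import hashlib
from collections import Counter


def assign_node(key: str, node_count: int) -> int:
    """Map a record key to a node index using a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


def distribute(keys, node_count):
    """Return how many records each node would receive."""
    counts = Counter(assign_node(k, node_count) for k in keys)
    return [counts.get(i, 0) for i in range(node_count)]


# Hypothetical workload: 10,000 record keys.
keys = [f"order-{i}" for i in range(10_000)]

for nodes in (2, 4, 8):
    load = distribute(keys, nodes)
    print(nodes, "nodes -> busiest node holds", max(load), "records")
```

As nodes are added, the busiest node holds fewer records. Note that simple modulo placement like this moves most keys whenever the node count changes, so treat it as an illustration of capacity, not a production placement strategy.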

In horizontally scaled systems, there is often a central leader node that delegates tasks to worker nodes within the system, which execute the tasks and return the results to the leader node.
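The leader and worker pattern described above can be sketched on a single machine. This is a simplified illustration, assuming threads stand in for worker nodes and a sum stands in for the real task; the names `leader` and `worker_task` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_task(chunk):
    """Work done by one worker: here, summing its share of the data."""
    return sum(chunk)


def leader(data, worker_count):
    """Split the data, delegate chunks to workers, and combine their results."""
    # Ceiling division so every element lands in some chunk; at least 1 to
    # avoid a zero step when the input is empty.
    chunk_size = max(1, -(-len(data) // worker_count))
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # Threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        partial_results = list(pool.map(worker_task, chunks))

    # The leader gathers the partial results into the final answer.
    return sum(partial_results)


total = leader(list(range(1, 101)), worker_count=4)
print(total)  # 5050
```

In a real cluster the workers would be separate machines and the leader would also have to cope with slow or failed workers, which this sketch leaves out.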

Choosing the Right Path: Exploring Scaling Options in Data Engineering.

The choice between vertical scaling (upgrading to a more powerful machine) and horizontal scaling (distributing the load across multiple smaller machines) is frequently debated. Running a system on a single machine is simpler, but high-end machines become very expensive, so highly demanding workloads often cannot avoid horizontal scaling. In practice, successful architectures often employ a combination of both approaches.

Some systems are designed to be elastic, automatically adding computing resources when a load increase is detected. Others are scaled manually: a person reviews capacity and decides when to add machines to the system. Elastic systems are advantageous for highly unpredictable workloads, while manually scaled systems are simpler and typically have fewer unexpected operational issues.
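As a rough illustration of how an elastic system might decide on its size, the sketch below applies a simple proportional rule: scale the worker count by the ratio of observed load to target load, then clamp it to fixed bounds. The function name, parameters, and default bounds are assumptions made for this example, not the behavior of any particular platform.

```python
import math


def desired_workers(current_workers, observed_load, target_load,
                    min_workers=1, max_workers=10):
    """Return a worker count that brings per-worker load near the target.

    observed_load and target_load are per-worker averages in the same unit
    (for example CPU percent or records per second). All names and defaults
    here are hypothetical.
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    raw = math.ceil(current_workers * observed_load / target_load)
    # Clamp so the system never scales below the floor or above the budget.
    return max(min_workers, min(max_workers, raw))


print(desired_workers(4, observed_load=90, target_load=60))   # 6: scale out
print(desired_workers(4, observed_load=15, target_load=60))   # 1: scale in
print(desired_workers(4, observed_load=400, target_load=60))  # 10: capped
```

A manually scaled system makes the same calculation, but a person runs it and applies the result, which is slower to react and easier to reason about.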

In conclusion, a well-designed and scalable data engineering system should carefully consider the appropriate scaling choices, leverage managed services when available, and find a balance between scalability, performance, and cost efficiency.

Other things I write:

You might be interested in this series, in which I introduce several important concepts that new Data Engineers should be aware of. The other topics I have covered so far:

Slowly Changing Dimensions

Distinctions Between CTEs, Subqueries, and Temporary Tables.

Replication Lag

Replication

Sharding and Partitioning

Partitioning Data

Optimizing data

Enhanced Query Performance

Indexing

Thanks for the read. Do clap👏 and follow me if you find it useful😊.