What is Parallel Sysplex® and Why Should I Care?

Written by Frank J. De Gilio & Barb Leinberger · Published in Theropod · Aug 31, 2020


Introduction

If you have been around mainframers for any length of time, you will have heard the term Parallel Sysplex. In fact, they talk about it as if it were something magical. This article attempts to describe what the technology is and why it is so important, not only to the platform but to enterprises in general.

Many people see the mainframe as a complex environment requiring thousands of hours of learning. Parallel Sysplex can actually simplify the solution of certain business problems. This article will describe how using this technology can easily solve some problems that are complex in the Linux, Unix, and Windows (distributed) world. It describes how multitenancy in a Sysplex can actually be cheaper and easier to maintain while delivering a higher level of service.

This is not meant to be an exhaustive explanation; rather, it is meant to give an initial thumbnail sketch of the capability and why mainframers think it is so important.

Perspective Matters

Parallel Sysplex is a clustering solution that is fundamentally different from the clustering models used in the non-mainframe world. For convenience this article will refer to those as distributed systems, but the discussion could apply to cloud systems just as easily. Each set of solutions attacks clustering problems from a very different perspective on managing workload.

Distributed Systems

In the distributed world, clustering is based on the idea that each system in the cluster is a solitary unit that acts independently. This keeps every system configuration relatively simple and easily repeatable, which in turn makes managing the stack straightforward. Scaling is handled by adding independent nodes to solve availability and performance issues, often called horizontal scaling. If the systems supporting the traffic are unable to keep up with the load, administrators (or automation) can add another node to the cluster, update the network sprayers to include the new node, and move on. While there are advantages to this model, there are some distinct disadvantages:

1. Adding systems does not scale linearly. Even well-written applications will suffer when scaling. Where the knee in the scaling curve appears depends on a number of factors, but nothing scales linearly forever.

2. Because the systems are not aware that they are part of a cluster, the understanding of what to do in a cluster is the responsibility of either the application or the middleware. Code above the OS has to understand the effects of clustering.

3. Since the nodes are all the same, they treat all workload equally. To provide different qualities of service to different users, the system needs separate clusters for different levels of users, along with smart routing models.

4. Workload can be stuck on a busy node while other nodes are waiting for work. Networks have to be careful about spraying work across the cluster, because pushing work to a busy node means that performance for that set of users will suffer.

5. While the initial set of nodes in a cluster is often inexpensive, the cost of the cluster does not scale linearly. As more nodes are added, the supporting infrastructure (network support, cooling, power, and real estate) increases the cost.

There are reasons for employing this model. One of them is that machines that are cheaply built will eventually fail, and this kind of model can mitigate such failures because a failed node can easily be replaced by a new one. It allows a lower entry investment and puts off worrying about scalability and maintenance costs.
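To make disadvantage 4 concrete, here is a minimal Python sketch of a network sprayer that rotates blindly over the nodes. The node names and queue depths are invented for illustration; the point is only that a sprayer with no view of how busy each node is will keep sending work to a node that is already backed up.

```python
from itertools import cycle

# Hypothetical cluster: each node has a queue of pending requests.
# The sprayer only knows the node list, not the queue depths.
nodes = {"node-a": [], "node-b": [], "node-c": []}
round_robin = cycle(nodes)          # blind rotation over node names

def spray(request):
    """Send the request to the next node in rotation, busy or not."""
    target = next(round_robin)
    nodes[target].append(request)
    return target

# node-b is already stuck behind a long-running job...
nodes["node-b"].extend(["slow-job"] * 10)

# ...but the sprayer still routes every third request to it,
# so users whose work lands on node-b see degraded response times.
for i in range(9):
    spray(f"request-{i}")

for name, queue in nodes.items():
    print(name, "queue depth:", len(queue))
```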

Mainframe Systems

The mainframe uses a fundamentally different approach to clustering. Clustering is done via a Parallel Sysplex, which is composed of multiple systems that look and act as a single system image, allowing each node in the Sysplex to manage the workload cooperatively. Since each node in the Sysplex is a cooperative partner in the cluster, the workload can be shared across all of the nodes. Different nodes can share work, avoiding issues that are inherent in distributed systems, because:

1. A Sysplex can scale vertically as well as horizontally; that is, code running in the Sysplex model can take advantage of more memory, CPUs, or I/O capability while still relying on multiple instances.

2. When systems cooperate in a cluster, the understanding of how to manage the cluster can be removed from the application and, to some extent, the middleware. In its place is a set of services that programs rely on (either implicitly or explicitly). As a result, the business code is less focused on enabling technology and more focused on solving a business problem.

3. When the workload is shared the system can manage different qualities of service in the same environment (more about this later).

4. When all nodes in the Sysplex share work no one node is left to do all the work while partner nodes lie fallow.

5. Because a Parallel Sysplex scales in two dimensions, much of the management of the additional workload is shared, thereby reducing cost.

The mainframe model is based on a fundamentally different set of assumptions. Since mainframes are more expensive than distributed systems, they are expected to fail less often and to protect against interruptions when failures do happen. As such, the focus is on ensuring that the overall system remains available even when a machine in the Sysplex experiences a failure. With the appropriate support processes in place, a Parallel Sysplex can continue running for decades without an outage, even as its nodes are upgraded.

How Sysplex Works

Consider figure 1.

Figure 1: A typical Parallel Sysplex (multiple mainframes connected in a Sysplex)

In Figure 1 we can see multiple Logical PARtitions (LPARs) running z/OS, residing on three separate servers. Each LPAR has its own operating system instance. The servers rely on at least one Coupling Facility (CF) to work together. Think of the coupling facility as the place where these systems share everything they need in order to work as a team while sharing the workload. The CF has three basic responsibilities:

1. Share locks across all systems in the Sysplex.

2. Cache information required to be known across systems (like transactions in progress).

3. Keep track of data shared across the systems.

As you can see in Figure 1, a CF can be on the same machine as the systems managed in the Sysplex (an internal CF) or it can run on a separate machine (an external CF). A Sysplex can hold up to 32 systems, all working together. In either configuration the CF typically has a large memory footprint, since all of its data is stored in memory. Coupling Facility links are specialized, fast links between systems and are the only I/O devices the CF uses besides system memory and some solid-state storage. Additional details regarding the configuration of the CF are included in the November 2019 “Coupling Facility Configuration Options” whitepaper found at https://www.ibm.com/downloads/cas/JZB2E38Q.
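As a rough mental model only, the Python sketch below represents the three kinds of shared state the CF keeps for its members: locks, cached data, and shared lists. This is purely illustrative; real CF lock, cache, and list structures are accessed through z/OS system services, not through anything resembling this code, and the resource and system names are invented.

```python
import threading

class CouplingFacilityModel:
    """Toy in-memory stand-in for the three CF structure types."""

    def __init__(self):
        self._guard = threading.Lock()
        self.locks = {}   # lock structure: resource name -> owning system
        self.cache = {}   # cache structure: key -> data all systems must see
        self.lists = {}   # list structure: queue name -> shared work items

    def acquire_lock(self, resource, system):
        """Grant the lock if it is free or already held by the requester."""
        with self._guard:
            owner = self.locks.setdefault(resource, system)
            return owner == system

    def release_lock(self, resource, system):
        with self._guard:
            if self.locks.get(resource) == system:
                del self.locks[resource]

    def cache_put(self, key, value):
        self.cache[key] = value          # e.g. a transaction in progress

    def list_push(self, queue, item):
        self.lists.setdefault(queue, []).append(item)


cf = CouplingFacilityModel()
print(cf.acquire_lock("ACCOUNT.12345", "SYS1"))   # True: SYS1 now owns the lock
print(cf.acquire_lock("ACCOUNT.12345", "SYS2"))   # False: SYS2 must wait
```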

Smart management of workload

While the Coupling Facility allows multiple systems to share the same data, it is only part of what makes a Parallel Sysplex unique. The other major component is the Workload Manager (WLM). WLM allows administrators to define a policy describing how work will be managed across the systems in the Sysplex. This allows those systems to work together to present a single-system view to the user community. It is also responsible for maintaining uptime through both planned and unplanned outages.

Planned outages

When administrators need to perform maintenance on an LPAR in the Sysplex, work can be drained from that system by letting current work finish without adding anything new, and the system can then be restarted without interrupting any processes running elsewhere. The system can be tested and certified before going back into the Sysplex. WLM is used to take the system out of the Sysplex, by ensuring no new work is routed to it, and later to bring it back in by routing work to it again.
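The drain-and-rejoin flow can be sketched as follows. This is a conceptual illustration only; the system names and data structures are invented and do not represent a real WLM interface.

```python
# Conceptual sketch of a planned-outage drain.
routing_targets = {"SYSA", "SYSB", "SYSC"}      # systems eligible for new work
in_flight = {"SYSA": 3, "SYSB": 0, "SYSC": 5}   # work already running

def drain(system):
    """Stop routing new work to `system`, then let its current work finish."""
    routing_targets.discard(system)             # no new work lands here
    while in_flight[system] > 0:
        in_flight[system] -= 1                  # existing work runs to completion
    print(f"{system} drained; safe to take down, test, and certify")

def rejoin(system):
    """Bring the system back in so work is routed to it again."""
    routing_targets.add(system)

drain("SYSC")
rejoin("SYSC")
```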

Unplanned outages

In the event of a problem with a system in the Sysplex, WLM stops routing work to that system. Since the data is preserved in the CF, the other systems in the Sysplex can continue processing work. Most businesses run multiple CFs to eliminate a single point of failure.

Smart Routing

WLM understands how busy each system in the Sysplex is, allowing it to route workload to the systems that are less busy and ensuring consistent execution across the Sysplex. Processes can also use the knowledge WLM has to request that work be run on a less busy system. For example, TCP/IP running in the Sysplex can hand a routing request to WLM so that the work lands on a less busy system. WLM also allows data to be shared across database instances.
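As an illustration only, the sketch below routes each new request toward the system reporting the most spare capacity. The system names and utilization numbers are made up, and real WLM routing recommendations take far more than a single CPU figure into account.

```python
# Hypothetical snapshot of how busy each z/OS image is (percent CPU used).
utilization = {"SYSA": 85, "SYSB": 40, "SYSC": 60}

def route(request):
    """Send the request to the least busy system in the snapshot."""
    target = min(utilization, key=utilization.get)
    utilization[target] += 5      # assume each routed request adds a little load
    return target

for i in range(6):
    print(f"request-{i} ->", route(f"request-{i}"))
```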

All Workload is not equal

Another powerful capability built into WLM is the ability to classify the different workloads running on the system and define appropriate, often different, performance characteristics for each. This lets administrators identify different classes of work, ensuring that workload classified as important doesn't wait behind unimportant work. WLM looks at the workload classifications and determines whether execution matches the goals set up in the policy. If something is running slower than it should, WLM can add memory, processors, or even new instances of servers within the environment to ensure that it fits the execution profile defined by administrators. This allows workloads with different priorities to share the same Sysplex without colliding, while ensuring that everyone meets the performance requirements identified by the policy. These adjustments kick in only when a workload starts to miss its goals.
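A very rough sketch of that idea follows. The service class names, goals, and measurements are invented for illustration and do not reflect how a real WLM policy is defined; the only piece borrowed from WLM terminology is the performance index, where a value above 1 means a class is missing its goal.

```python
from dataclasses import dataclass

@dataclass
class ServiceClass:
    name: str
    goal_seconds: float   # target average response time
    importance: int       # 1 = most important

# Invented policy: online payments outrank batch reporting.
policy = [
    ServiceClass("ONLINE_PAYMENTS", goal_seconds=0.2, importance=1),
    ServiceClass("BATCH_REPORTS",   goal_seconds=30.0, importance=3),
]

# Invented measurements of how each class is actually running.
observed = {"ONLINE_PAYMENTS": 0.35, "BATCH_REPORTS": 12.0}

for sc in sorted(policy, key=lambda s: s.importance):
    # Performance index > 1 means the class is missing its goal.
    pi = observed[sc.name] / sc.goal_seconds
    if pi > 1.0:
        print(f"{sc.name}: missing goal (PI={pi:.2f}); "
              "give it more resources at the expense of less important work")
    else:
        print(f"{sc.name}: meeting goal (PI={pi:.2f}); leave it alone")
```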

Capacity Planning

Since the WLM policy defines the execution characteristics of each workload, WLM can track how well the system is matching the policy defined by administrators. This allows the system to provide reports and alerts based on the execution of everything on the system. With this information, operators gain the flexibility to create highly efficient capacity plans, and administrators can clearly identify the processor and memory usage each workload is consuming.

Using Parallel Sysplex to solve today’s problems

Parallel Sysplex has been around since the 1990s. This might lead you to think that it can only solve traditional computing problems. Nothing could be further from the truth. Today, modern digital applications are incorporating services made available through Parallel Sysplex technology.

For example, a major retail company leverages Parallel Sysplex to provide a network cache using REST services and a key-value data store. Because it takes advantage of Sysplex facilities, it can provide this caching service without the latency associated with the eventual consistency inherent in other popular REST-based caching facilities.

Other companies have seen the value of the low latency and low operational cost of providing REST services for functions like package tracking and banking on a Parallel Sysplex. In a world where easy access must be met with low latency, low cost, and high availability, Parallel Sysplex is an obvious choice.

Parallel Sysplex in a Cloud World

The fact that this capability can be leveraged in a world of cloud services has already been discussed, but you might be thinking, “I don’t need this, I can use Kubernetes to manage my workloads.” You would be right that Kubernetes is a powerful way of managing a loosely coupled set of services, but it is important to remember that not all services are created equal.

Today we have plenty of examples of really cool services that anyone can use to build powerful cloud and mobile applications. These tools allow programmers to add all kinds of new technological capabilities to an application. They are great for providing the kinds of enhanced features that gain users and give them a more personalized experience.

Many business services aren't quite so simple. They need to coordinate multiple business components, and to do it in a more tightly coupled way. If these services are managed by scaling out, a layer of coordination complexity is added that is unnecessary and expensive.

Kubernetes is great for managing a large number of independent nodes working loosely in concert, but some problems don't lend themselves to that kind of solution. In that environment, applications with complex implementations must take on the responsibility of coordinating multiple business components themselves. This forces developers to solve, again and again, a problem that could be handled by the underlying infrastructure, introducing risk into the development process. Even programmers' creative solutions to these complex business problems risk making the application harder to understand.

Parallel Sysplex Scaling vs Kubernetes

Both Kubernetes and WLM provide automatic workload scaling. Kubernetes bases this primarily on CPU consumption; WLM uses a number of different factors to determine what to do. Both have the concept of growing vertically, but Kubernetes accomplishes this by draining the workload from an active container, killing the container, and then creating a new instance with new sizes, whereas WLM alters the existing environment on the fly. Another advantage of WLM is that the feedback loop between WLM and the workload it is managing is shorter, which matters when workloads need to be managed closely. Workloads that are likely to spike for short periods do better in an environment that can quickly morph to support the demand.
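The contrast can be caricatured as follows. Neither function below is a real Kubernetes or WLM API; the thresholds and field names are invented, and the sketch only captures the difference between reacting by adding or recreating instances versus adjusting a running workload in place.

```python
def kubernetes_style_scale(pod_cpu_percent, replicas, target=70):
    """Horizontal-style reaction: add another replica when CPU runs hot.
    Resizing an existing pod would mean recreating it with new sizes."""
    if pod_cpu_percent > target:
        replicas += 1            # spin up a new instance; the old ones are untouched
    return replicas

def wlm_style_adjust(workload, missing_goal):
    """Goal-based reaction: give the running workload more resources in place."""
    if missing_goal:
        workload["cpu_share"] += 10   # adjust the live environment, no restart
    return workload

print(kubernetes_style_scale(pod_cpu_percent=90, replicas=3))
print(wlm_style_adjust({"name": "payments", "cpu_share": 40}, missing_goal=True))
```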

Why Should I Care About Parallel Sysplex?

Companies with z/OS can leverage Parallel Sysplex to provide powerful capabilities to their existing cloud and mobile applications. This does not mean all of your applications should run on this platform; rather, you should consider a Sysplex one cloud environment within your multi-cloud structure.

The services available with a Parallel Sysplex are a powerful enterprise addition to your mobile and cloud applications. They can handle the parts of the application that:

1. Need to react quickly to spiky workloads

2. Need to manage complex business interactions

3. Need to manage different qualities of service easily

4. Need a high level of efficiency managing requests

It is no coincidence that major banks, retailers, insurance companies, and credit card companies rely on Parallel Sysplex. These companies recognize the need to support transactions at high speed, regardless of the time of day, the number of people and processes using the system, or maintenance windows.

More information can be found at the following links:

https://www.ibm.com/it-infrastructure/z/technologies/parallel-sysplex

https://www.ibm.com/it-infrastructure/z

https://www.ibm.com/it-infrastructure/z/education/what-is-a-mainframe
