The need for speed in risk management applications is unabated. If an application's performance improves, users and consultants tend to fill this space rapidly with additional requirements or more complicated calculation settings. Since processor speeds are hardly improving anymore (Figure 1), applications should adopt other paradigms to meet their computational requirements. At Ortec Finance we tailor the High Performance Computing (HPC) techniques used in our applications to the domain and to how the application is used. Previously, we published articles and posts about heterogeneous computing for nested simulation applications and heavy quantitative model calibrations. In addition, we research web and cloud technologies such as Docker and serverless computing for server-client applications that require elasticity and high availability. These techniques do not translate well to the more traditional desktop applications in our portfolio. Such applications can have very specific requirements (e.g. for security, OS, or workload types) for which no off-the-shelf solutions are available. In this post we focus on our choice of the Microsoft HPC (MS HPC) framework to distribute our calculation-intensive platform GLASS. We will briefly discuss the road to our decision to use MS HPC, and subsequently explain how the MS HPC features match our requirements. Finally, we will show some preliminary results and conclude with future work and release expectations.
Figure 1: Processor speed grew exponentially, as Moore predicted, until around 2005.
GLASS is the Ortec Finance asset and liability scenario model that improves financial decision making for institutional investors by offering short- and long-term balance sheet simulations. Each scenario in such simulations is independent, and hence the model is highly parallel in nature. Previous GLASS releases already included a proprietary multi-process model. Before upgrading this model to a distributed setting, we first assessed the added value of distributed computing for GLASS, since it is a rule of thumb among computer scientists not to distribute an application unnecessarily.
The value proposition
Firstly, we are interested in the scaling potential of GLASS from sixteen to several hundreds of CPUs (Figure 2). To examine this, we simulated the future behavior of the system with the WorkflowSim toolkit (Chen, 2012).
Figure 2: Example performance graph with indicated areas of interest
Before examining the results, we first need to elaborate on the type of workloads GLASS generates. The workloads in GLASS are represented as Directed Acyclic Graphs (DAGs). A DAG is a graph of nodes connected by directed dependencies and containing no cycles (Figure 3). Directed means the dependencies point in one direction. For example, the upper node L-1 depends on I-1 and L-2, which are on a ‘lower level’, so I-1 and L-2 need to be finished first in order to complete the DAG. This means the system can start all leaf nodes of the DAG immediately after initialization of the calculation. Simply put, a single node in the DAG might, for example, simulate the returns for scenarios 1–10. The granularity of the tasks can be set by the user.
Figure 3: Example of exported DAG from GLASS
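As a conceptual illustration, the dependency rule described above amounts to a topological sort: a node becomes runnable only once every node it depends on has finished, and nodes without dependencies (the leaves) can start right away. The mini-DAG below mirrors the node names from Figure 3; it is a sketch, not GLASS's actual task representation.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the tasks it depends on; leaf tasks have no
# dependencies and can start immediately after initialization.
dag = {
    "I-1": [],             # leaf: e.g. simulate returns for scenarios 1-10
    "L-2": [],             # leaf
    "L-1": ["I-1", "L-2"], # may only start once both inputs have finished
}

ts = TopologicalSorter(dag)
order = list(ts.static_order())  # one valid execution order
print(order)  # leaves come first, the dependent node L-1 comes last
```

A real scheduler would run the leaves concurrently rather than in sequence, but the ordering constraint it must respect is exactly the one `static_order` produces.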
Given a set of assumptions on communication, database, and processor efficiency, we can use measurements of the multi-process GLASS as input for the WorkflowSim toolkit. We measured three different benchmark cases (A, B, and C) and plotted their forecasted parallelization efficiencies in the graphs below:
While the first graph gives some insight into the absolute runtime, it contains no information on the efficiency at the targeted number of CPUs (>16). The second graph, however, shows on a log scale how close the parallelization comes to the ideal situation, in which doubling the number of cores halves the runtime. These results are promising, even more so when users initiate anywhere from several to hundreds of such benchmark simulations simultaneously.
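For concreteness, the efficiency plotted in the second graph can be computed as the speedup relative to a baseline run, divided by the core multiple. The sketch below uses made-up runtimes for illustration, not our actual benchmark measurements:

```python
def efficiency(base_cores: int, base_time: float,
               cores: int, time: float) -> float:
    """Parallel efficiency relative to a baseline run: 1.0 means perfect
    scaling, i.e. doubling the number of cores halves the runtime."""
    speedup = base_time / time
    return speedup / (cores / base_cores)

# Illustrative numbers only: a 16-core baseline of 1000 s, scaled up.
base = (16, 1000.0)
for cores, runtime in [(32, 520.0), (64, 280.0), (128, 165.0)]:
    e = efficiency(*base, cores, runtime)
    print(f"{cores:3d} cores: efficiency {e:.2f}")
```

An efficiency close to 1.0 at high core counts is what the "perfect situation" line in the graph represents.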
Although we have to be careful about the aforementioned assumptions, distributing the GLASS product seems to improve its performance significantly.
Microsoft HPC (High Performance Computing) Pack is a software solution for building a distributed computing cluster. It flexibly connects multiple resources, allowing clusters to combine on-premise servers, workstations, and cloud instances.
Microsoft HPC Pack allows many Microsoft Windows machines to communicate in a cluster to work on the same problem or calculation. Each of the machines works on a different part of the problem simultaneously, hence speeding up the overall time needed for the entire calculation.
The most important role is that of the Head Node (Figure 4). The Head Node manages all resources that are part of the cluster. Client applications start a job on the Head Node; a job consists of a (large) number of tasks, which the Head Node schedules onto the available compute nodes.
Figure 4: Overview of the MS HPC architecture
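To make the job/task model concrete, here is a conceptual sketch of the roles described above. This is not the actual MS HPC API (which is .NET-based); all class and command names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One unit of work, e.g. a GLASS worker invocation for one DAG node."""
    name: str
    command: str

@dataclass
class Job:
    """A client-submitted job: a (large) collection of tasks."""
    name: str
    tasks: list = field(default_factory=list)

class HeadNode:
    """Accepts jobs from clients and dispatches their tasks.

    In MS HPC the scheduler would assign queued tasks to compute nodes;
    here we only record the submission.
    """
    def __init__(self):
        self.queue = []

    def submit(self, job: Job):
        self.queue.extend(job.tasks)

# A client builds a job out of tasks and submits it to the head node:
job = Job("glass-simulation")
job.tasks.append(Task("scenarios-1-10", "glass_worker --scenarios 1:10"))
job.tasks.append(Task("scenarios-11-20", "glass_worker --scenarios 11:20"))

head = HeadNode()
head.submit(job)
print(len(head.queue))  # 2 tasks queued for dispatch
```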
Of the solutions we compared, MS HPC is the most suitable for our requirements. Other techniques with roots in the Big Data or cloud industry seemed promising at first, but significantly lacked maturity. We elaborate below on the requirements that MS HPC satisfies:
Scalability: Microsoft Azure has proven to be very scalable; there are many use cases in which thousands of cores are used for parallel calculations.
Security: Because HPC runs tasks in the security context of the user who starts them, integrated security can be used for accessing the database. This also prevents unwanted interaction between jobs from different users.
High availability: HPC clusters can be installed in a high-availability setup, in which the head nodes and the HPC database run in a failover cluster, ensuring the highest possible availability for the cluster. Compute nodes can be taken offline with minimal impact on overall performance, which means cheaper hardware can be used.
Scheduling: HPC has many options for configuring the scheduling of tasks from different users. We use a balanced strategy. While this is less efficient when scheduling multiple large jobs, it prevents smaller jobs from having to wait until a large job is finished.
Tooling: There is an extensive management API which can be used from inside your application to allow users to manage their jobs. There is also sufficient tooling for managing the cluster by system administrators.
Caching: Because GLASS needs to load a large amount of data that is shared by multiple tasks, it is important that MS HPC can execute multiple tasks in the same process. This avoids reloading the data for every task.
The implementation of MS HPC into GLASS was finalized recently. Since GLASS was already well prepared for distributing tasks to multiple processors, the transition to MS HPC as an execution platform was relatively simple. Below we plotted a Gantt chart of an example run on a two-machine, 32-core cluster. The graph displays the scheduling dynamics of the different tasks in the DAG. Although the scheduling features in MS HPC are quite advanced, during testing we faced several challenges in directing the scheduler to distribute the load as efficiently as possible.
While running actual loads during the examination of the value proposition, we obtained very promising results. Nevertheless, when running tests with four compute nodes of 28 cores each, some bottlenecks became apparent:
- Network connection to the database server: Some simulations slowed down or even ran into timeouts because the database server was overloaded. Most of these problems were reduced by minimizing the number of queries GLASS sends to the database, but this remains a bottleneck and will become a bigger problem as more compute nodes are added. We are currently investigating a caching mechanism and hardware improvements to overcome this problem.
- Non-parallel tasks: Some tasks were not (yet) parallelized because their runtime was relatively short. Because the total runtime of a simulation is much shorter now, these tasks make up a relatively large percentage of the simulation lead time. This can be solved by splitting them into multiple subtasks that can be executed in parallel.
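The splitting remedy for the second bottleneck can be sketched in a few lines: one long serial task over all scenarios is cut into independent chunks, each of which becomes a subtask that can run on a different compute node. The chunk size here is illustrative.

```python
def split_task(scenarios: list, chunk_size: int) -> list:
    """Split one long-running task over many scenarios into independent
    subtasks that can be scheduled in parallel."""
    return [scenarios[i:i + chunk_size]
            for i in range(0, len(scenarios), chunk_size)]

# One serial task over 1000 scenarios becomes 10 parallel subtasks:
chunks = split_task(list(range(1000)), 100)
print(len(chunks))  # 10
```

Since each scenario is independent, the chunks need no communication with each other, which is what makes this kind of split safe.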
Microsoft HPC is compatible with GLASS and improves the performance of large-scale ALM simulations; the GLASS simulations are well suited for parallelization. During the preliminary tests, the resulting performance was close to what we expected. Challenges with the network communication to the database server had to be solved, and further improvements such as caching and better hardware are required to scale up further, to 200 CPUs and beyond. Also, a few simulation tasks that are not yet parallelized take up a larger proportion of the simulation time than before; in future releases they will have to be split up into parallel parts. Currently, a GLASS release containing the MS HPC implementation is used in a pilot by consultants of Ortec Finance. Next year this version will be distributed to other users.
Chen, W., & Deelman, E. (2012). WorkflowSim: A toolkit for simulating scientific workflows in distributed environments. In Proceedings of the IEEE 8th International Conference on E-Science, 1–8.