It’s mainframes all the way down

Ruth Grace Wong
Published in Supplyframe
Sep 10, 2019

Increment Magazine published an article about the curious longevity of COBOL. We interviewed Peter Wong*, an engineering lead for mainframe systems, to get the real story on why institutions flush away hundreds of millions of dollars trying to migrate legacy systems, and how companies and engineers can do better.

[Image: person working on a mainframe computer]

Have you noticed that some customer support lines have faster systems than others? “If you call, say, TD or BMO about your Visa card, they respond very quickly because they use green screen in some customer service applications.” Peter is referring to the style of computer terminal that has green text on a black screen, commonly used with mainframe systems. “There’s no wait time. They can immediately tell you your transactions, payments, and bills. But if you call a customer service team using a web-based application, they say, ‘Can you hold on one second? My program is very slow today.’ I say, yeah, every day. Rogers’ customer support system is especially slow. ‘Can you hold on?’ We haven’t even talked yet! Fine, fine.”

Which industries still depend on mainframes? “First are the banks, followed by telecommunications, manufacturing, retail, and insurance. There are many companies around the world who depend on the mainframe for their mission critical applications. In telecommunications, Rogers is the biggest one here [in Canada]. AT&T and Verizon also use mainframes. In manufacturing, companies like Nissan, Toyota, Boeing, and Ford use mainframes. Even in retail: Sears, Walmart, maybe even The Bay. Major insurance companies too. The software is written mostly in COBOL, second in assembler, and third in PL/I [Programming Language One]. For example, I think BMO, HSBC, Canadian Tire, and many others all use PL/I.” Pretty much any large company that was successful in the 1960s still has a mainframe system running. “Only mainframes were available at the time. That’s why it’s so widely used. There was no PC, no Unix at the time. Before the nineties, personal computers were very weak and could not perform the heavy lifting necessary for commercial work. People only used PCs to print reports and do word processing.”

Peter was educated in China, where he studied AI. “I remember seeing that my father had a COBOL book, and even then, I thought it was an ancient language.” However, his first job was on mainframes. “The city I lived in and a couple other cities ran their bank on the same mainframe, an IBM 4381. At that time, China was pretty behind in hardware. It was very expensive. Later on we migrated to System/390 (ES/9000), zSeries, then System z. All five major banks in China run mainframes.” After many years of work in China, Peter moved to Canada, where he continued to work on mainframe technology.

“It’s not uncommon for mainframes to process hundreds of millions of transactions on a regular day, per mainframe. It’s a very high throughput and stable system.” About two thirds of enterprise data goes through a mainframe computer. However, the speed is both an advantage and a drawback, as people try to integrate mainframe with cloud: mainframes were built to process transactions very quickly, but they were not designed to wait. “If you wait, it jams up. A hundred million transactions is nothing, but a hundred million waiting — that’s a disaster. The architectural design of how the mainframe connects to the cloud or servers is extremely important. For example, when you have a glitch on the cloud or the network, and the mainframe-powered IMS [information management system] is trying to connect to the cloud, then the IMS has to wait. If the volume is very high, and the wait is too long, the bank will shut down immediately. All the regions will wait, and all the buffers will blow, and it will affect all the systems. I’ve had this problem before. It’s called message flooding. But people want to open a connection to the mainframe and then wait 2, 3 seconds. If the transaction volume is low, we have logic to handle it, but if transaction volume is high… you know, online and mobile transactions from the web are accounting for more than half of the traffic. Every time a customer looks for their account information on the website or swipes their credit card, it will fire a couple transactions to the mainframe. If you apply for a mortgage, all the inquiries and calculations involved add up to a hundred or two hundred transactions. Engineers using the cloud often don’t have the tools to fire a hundred million transactions. They don’t have this region size on a test environment. But sometimes they use a simulator to fire a couple of transactions per second and claim their proof of concept for migrating the mainframe system to the cloud is successful. 
You’re kidding me — a couple transactions? The test methodology is wrong.”
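The arithmetic behind "a hundred million waiting" can be sketched with Little's law: the average number of requests in flight equals the arrival rate multiplied by how long each request is held. The snippet below is only an illustration. The daily volume and the 2 to 3 second waits come from the interview; the even spread over 24 hours, the 0.05 second baseline, the 2 transactions per second simulator rate, and all function names are assumptions. Real traffic is bursty, so peaks would be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(transactions_per_day: float) -> float:
    """Average arrival rate, assuming (unrealistically) an even spread over the day."""
    return transactions_per_day / SECONDS_PER_DAY


def in_flight(transactions_per_day: float, held_seconds: float) -> float:
    """Little's law: average requests in flight = arrival rate * time held."""
    return average_rate_per_second(transactions_per_day) * held_seconds


def simulated_fraction(test_rate_per_second: float, transactions_per_day: float) -> float:
    """Share of the average production rate that a load simulator actually exercises."""
    return test_rate_per_second / average_rate_per_second(transactions_per_day)


if __name__ == "__main__":
    daily = 100_000_000  # "a hundred million transactions" per day, from the interview

    print(f"average rate: {average_rate_per_second(daily):,.0f} tx/s")
    for held in (0.05, 2.0, 3.0):  # 0.05 s is an assumed fast baseline
        print(f"held {held:>4} s -> about {in_flight(daily, held):,.0f} in flight")

    # A proof of concept firing ~2 tx/s (assumed value for "a couple")
    print(f"simulator coverage: {simulated_fraction(2, daily):.2%} of average load")
```

Under these assumptions the average rate is roughly 1,157 transactions per second, so holding each one for 3 seconds keeps about 3,470 requests occupying regions and buffers at once, compared with a few dozen when each completes in a fraction of a second. A simulator firing two transactions per second exercises well under one percent of that average load.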

But the main reason these systems are so difficult to migrate is a bigger problem than just the transaction volume. “COBOL is not special. You can pick it up relatively easily, in a month or even in a couple of weeks. Why is it so difficult to migrate? It’s because of the history. From the start of computers 60 years ago, the business has used COBOL. And now you want to migrate? You think 60 years of work is trivial? That’s accumulated human intelligence. It’s not the language. It’s not the technology. When you want to replicate or migrate, you need to understand the business you have been doing, or changing, or modifying, or fixing — for decades. The business logic is very complex. And it’s difficult to migrate because no one knows all the business rules buried in the system now. It’s often not documented, and existing documentation may not be up to date. Reverse engineering does not work at all. I majored in mechanical engineering. In mechanical, reverse engineering is widely used. But somehow in software the business logic and technical logic are mixed. You can ‘replicate’ or ‘translate’ a system from one computer language to another, but the translated or replicated system will be unmaintainable. So it’s next to impossible to read the code and reverse engineer a working software system back to its design intention and underlying business logic. It’s even more difficult to reverse the fixes and patches implemented over the years. I challenge those who want us to reverse engineer a software system: I will write 10 lines of C, COBOL, whatever language you prefer, and you try to tell me what I was thinking, what the purpose of the code is. If you can do it, then I can reverse all the systems for you. But nobody can do it. Companies think their code is too old, but it’s their greatest business asset. People think a 30 year old system is old. Is a 50 year old person old? The 30 year old system is far younger and runs flawlessly. 
Some code may be written decades ago, but the system is evolving. One of the biggest advantages of using mainframe is that mainframe preserves your IT investment. It has the best backwards compatibility, so you don’t have to rewrite your program every couple of years due to hardware or system upgrades. The mainframe has the most advanced hardware. One mainframe can have up to 170 10-core 5.2GHz CPUs and 16TB RAM per partition, up to 85 logical partitions, not to mention its enormous I/O capacity. Hardly any part of mainframe is old. It can run modern languages and operating systems. IBM keeps enhancing their IMS and CICS systems along with the traditional COBOL and PL/I languages. The mainframe does not have any problems so long as IBM maintains compatibility. It is very secure, very fast, and rock solid. Even if you rebuilt it in Java or anything else, with a better structure, maybe using microservices, eventually, when the application system’s complexity grows, when the integration grows, you will repeat history. It will be very difficult to maintain again. You can change your microservice and it will affect thousands of people, hundreds of departments. In the business world, it’s all because the business is complex, not because of the technology or language. Many times people compare traditional applications with Google’s or Netflix’s technology, but they fail to realize that the core of traditional business applications is very complex business logic accumulated over the years, while most of Google or Netflix’s services are based on unstructured data and algorithms. This means traditional business applications are more difficult to rewrite or migrate regardless of the technology used.”

Beyond the primary difficulty of complexity built up over decades, there is some technical difficulty specific to migrating mainframe systems: “If you migrate from Unix to Windows, it’s easy. Pure batch process migration is relatively easy. On the other hand, mainframe IMS, CICS online systems, and IMS BMP [Information Management System Batch Message Processing] or DL/I [Data Language/I] batch migration is very difficult. In this case, COBOL is not COBOL alone: there’s a whole environment involved. If you translated mainframe IMS COBOL into Java, it wouldn’t run on other platforms at all. On top of the mainframe operating system is the CICS [Customer Information Control System] to manage transactions, and IMS [Information Management System]. CICS runs on mainframe, Unix, and Windows. But IMS is only on mainframe. Most banks in Canada still use IMS. So the system only works with the same hierarchical/relational database. Without the environment, you cannot succeed. You have to rewrite, not just replatform. It’s a lot of human labor. Companies will get a report made, and it will say, we have two million lines of COBOL, and a tool that can translate one million lines per day. But so what? You have garbage output. Nobody can maintain it, no human can read it.”

[Image: people working on computers]

Peter is personally familiar with the labor of migrating mainframe systems. “We once performed a migration with a four hundred page spec, for migrating a relatively small credit card collections system from one mainframe to another. This was a culmination of more than three years of work. We found a bug the night before the launch. So we spent the night staring at our flow charts and cross checking the database and file integrity using SAS programs, because we wanted to make sure the program we wrote to patch the data would work. The patch is a hard coded patch so you really don’t know if it will work for downstream processing, or if it will ruin the integrity of the whole database. But my coworker did some validations before running the three hour job and we were able to run it successfully. We worked the whole night. This kind of work cannot be avoided using new technology. Luckily, we managed to perform perfectly: no outages for our system. We were so tired, but everything was alright. Later on, not having heard about any difficulties, upper management thought that the work we did was simple. But our manager had seen our design, diagrams, process, hundreds of scripts, and knew it was not simple.”

Upper management will often vastly underestimate the difficulty of a migration. Peter’s migration went well, but the industry abounds with rumors of migrations going over budget by hundreds of millions of dollars, only to be partially successful, or not successful at all. One famous case is Commonwealth Bank of Australia, which spent over a billion Australian dollars ($749.9 million USD) migrating a system off of COBOL, twice as much as they initially budgeted. The migration was not even completed as originally scoped: their mortgage system could not be migrated due to complexity and cost, and runs on mainframe to this day. Peter and his team once worked for 4–5 years to document the requirements of a system meant to be migrated off of COBOL. When they finished, they asked other companies to put bids in to do the work. They received two bids: one from Company A for 4 million Canadian dollars, and one from Company B for 7 million. After interviewing the engineers at the respective companies, Peter and his team recommended that Company B be chosen to do the work, based on the competency of their engineers. Upper management picked Company A instead. Three years later, Company A had spent 20 million, and concluded that they were unable to migrate the system. The company went back to Company B to ask them to do it, and since Company B had seen the difficulties Company A had, they came back with a new quote of 100 million. In the end, Company B was able to complete the migration for 100 million, but with a smaller scope than originally intended. The original motivation for the migration was cost reduction, but Peter thinks that if the management had known what the cost of the migration would be beforehand, they wouldn’t have gone through with it. Company B had predicted that the cost of maintaining the system would be reduced by up to 90%, but when the migration was complete, the engineering personnel cost had tripled, and the hardware cost stayed the same.
“People were fired over it.” In Peter’s experience, cost is rarely saved. It’s difficult to run a migration successfully. But even if the migration goes smoothly, it’s only possible to reduce the cost if the application is small in terms of transaction volume and storage. “Small companies can save maybe fifty percent of the cost by migrating away from mainframe. Mainframe is not suitable for small or even medium sized companies. But big companies shouldn’t expect to save money from migrating completely away. Not to mention the high risk of system outage and business interruptions.”

“Testing these systems is a huge effort. It’s human intelligence and work effort combined. If you have money, it’s no issue. You can replace everything. It’s not that they can’t do it. They can. But would you like to risk your business? Would you like to spend all this money? The spending quota is a waste. It’s rare for a system to be unmaintainable. But companies don’t want to pay to invest, train, and pay people to maintain it. Young and talented people would work on mainframe if, first, they knew that mainframe has a bright future for big industries, and second, that you’ll pay them a fair rate. Some systems are really messy, I know. But you can gradually replace it. As long as you can test it, you can change it. Nobody who modifies the system understands every sentence. What’s the benefit of the migration? It’s zero. Actually, sometimes it’s negative. They gained a GUI, but they were unable to replicate all the original functionality. And often the web-based application is slower. Every year it seems many CIOs, CTOs want to migrate off of the mainframe system. But if they knew mainframe is their biggest IT asset, they would know it isn’t so easy to get away from. They would invest more into mainframe. All the big mainframe shops increase their MIPS processing power more each year, usually by ten to fifteen percent.”

In the context of failed migrations and wasted dollars, what can companies and engineers do better in maintaining and upgrading these systems? “Every time I get a chance to talk to upper management, I say: Please, look at the complexity. Gradually migrate the peripheral applications off, but don’t start with the core system. They think, if the core system is migrated, then others will be easy. Yes, it might sound like logical thinking. But why would you want to migrate your brain off your body first? Maybe you can have an artificial arm, but you cannot have an artificial brain for a long time. The core system connects to every single other system.” Mainframes might not last forever, but considering the high complexity of what’s been running on them, and the even higher cost of migrating, they won’t go away any time soon.

*Name changed at request

Ruth Grace Wong
Supplyframe

Pinterest engineer by day, manufacturing engineer by night. Manufacturing writer for https://medium.com/supplyframe-hardware