Shop Till You Drop: WorldPay’s Service Failure of 2016

Bishr Tabbaa
DataSeries


The third anniversary of the Worldpay service outage is an opportunity to reflect upon computer system failures, human error, process flaws, organizational mistakes, and the best principles and practices for solution delivery in the IT industry. In this blog and my forthcoming book, Bugs: A Short History of Computer System Failure, I will chronicle some illustrative system failures in the past and discuss ideas for improving the future of system quality. As information technology becomes increasingly woven into Life, the quality of hardware and software impacts our commerce, health, infrastructure, military, politics, science, security, and transportation. The Big Idea is that we have no choice but to get better at delivering technology solutions because our lives depend on it.

On July 1, 2016, thousands of retail customers on Etsy, StanJames.com, and other e-commerce websites began experiencing transaction failures due to a payment processing service outage at Worldpay. The outage lasted more than three weeks and affected credit, debit, and gift card transactions worth millions of dollars. This essay will discuss the publicly available details of the event, the possible business and technology factors that contributed to the system failure, and how to prevent such incidents from affecting your organization.

The history of payment systems is an intriguing one that connects how people have organized themselves, exchanged goods and services, and decided the value of economic activity. As people transitioned from nomadic hunting and gathering to settled agriculture around 10,000 years ago, human societies became more complex and hierarchical, individual roles became more specialized through the division of labor, and the need for mediums of exchange arose. The original payment method was barter, in which parties traded goods and services without using money; for example, a farmer might trade eggs, wheat, or sheep for stone tools or construction labor. Starting around 3,000 years ago, humans recognized that an independent, portable unit of account could simplify economic exchanges, and the first currencies took the form of beads and shells. Eventually, people in Lydia (in present-day Turkey) minted coins of gold and silver around 600 BC; thereafter, people in China developed paper money.

During the Renaissance, European governments introduced banknotes that were IOUs backed by precious metals to finance colonial exploration, mercantile trade, and wars. Various countries experimented with gold and silver standards for the next few centuries; however, money remained physical until Western Union debuted the first electronic fund transfer (EFT) transaction in 1871. The US Federal Reserve also began using the telegraph to transfer money in the 1910s. In the 1950s, Diners Club established itself as the first independent credit card company, and it was soon followed by American Express, which in 1959 introduced the first plastic card for electronic payments. In 1972, the Automated Clearing House (ACH) technology was developed to batch process large volumes of financial transactions. In 1979, Visa introduced the credit card terminal.

In the modern payment ecosystem, individual or business Buyers use a credit or debit card account issued by a Bank to make purchases from a merchant Seller associated with a merchant Bank. The Seller sends the transaction request to a payment Gateway that acts as a virtual electronic terminal and abstracts the different payment types (e.g. credit, debit); the Gateway then forwards the request to an Acquirer/Processor. The Processor asks the account Issuer to authorize the Buyer, queues authorized transactions between the Buyer’s Issuer and the Seller’s Bank, handles chargebacks and refunds, manages recurring billing and subscriptions, monitors risk and fraud, and integrates with various payment solutions, communication networks, card schemes, and banks. Merchants send scheduled batches of authorized transactions to their payment processor, typically in an automated nightly process. The payment processor forwards these transactions to the appropriate card network (e.g. American Express, Discover, Mastercard, Visa), which communicates the debits to the issuing banks in its system. The issuing bank charges the account holder for the amount of the transaction and then transfers the appropriate funds to the merchant bank, minus interchange fees. The merchant bank then deposits the funds into the merchant account. Authorization takes a matter of seconds, while settlement and funding take about a day.

Worldpay (NYSE: WP) is a global payments technology company that processes over 40 billion transactions annually through more than 300 payment types across 146 countries and 126 currencies; in 2016, it collected more than $4 billion in revenue and employed more than 5,000 people. The company was started by Nick Ogden in 1997, and it initially partnered with the UK National Westminster Bank for its finances and payments. When NatWest was acquired by the Royal Bank of Scotland (RBS) in 2002, Worldpay began rapidly expanding through acquisitions and mergers with several payment solutions including Streamline, Payment Trust, Bibit, and Lynk. After the 2008 financial crisis, the EU imposed asset sales upon RBS as a condition of state aid, and Worldpay was sold in 2010 to a group of investors; Advent International and Bain Capital each paid $1bn USD for a 40% ownership stake with RBS retaining an equity interest of 20%.

According to the Financial Times, Worldpay attributed the July 2016 outage to an isolated issue with just one of its payment gateways and downplayed it further by claiming that it affected only a small share of customers (1%). The gateway had recently received software changes; those changes triggered a surge of error messages on the gateway server, which was overwhelmed and had to be taken offline. Payment transactions routed through this gateway had to be processed and settled manually for more than three weeks while the automated workflow was being repaired. Some Etsy customers also reported duplicate charges as Worldpay attempted to fix the problem. Adding to the public embarrassment, Worldpay had just completed construction in 2015 of a new payments platform that cost more than $500 million. Furthermore, Worldpay was not forthcoming with information during the three-week outage, so its major customers could not provide updates or set expectations with their own customers. Using Worldpay’s annual revenue of about $4 billion, the 3-week length of the payment gateway failure, and the 1% proportion of affected customers, the system failure resulted in roughly $2.3 million in lost revenue for Worldpay (1% x 4000 / 365 x 21, in millions); this ignores the short-term cost of recovery procedures and the long-term reputational damage from business customers switching to competitors. Furthermore, dividing that lost fee revenue by Worldpay’s transaction fees of 0.75% to 2.75% (for debit and credit cards, respectively) puts the value of the affected customer transactions at an estimated $84–307 million (2.3 / 0.0275 to 2.3 / 0.0075). The persistent payment service outage, poor communication, and total business impact on customers all called into question Worldpay’s IT processes, technology, people, and governance.
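The back-of-envelope estimate above can be reproduced in a few lines of Python. This is only a sketch of the essay's own arithmetic: it assumes revenue is spread evenly across the year and across customers, and the constant and function names are illustrative rather than taken from any Worldpay source.

```python
# Back-of-envelope estimate using the figures quoted in the essay.
# Assumption: revenue is spread evenly over the year and over customers.
ANNUAL_REVENUE_MUSD = 4000   # Worldpay 2016 revenue, in millions of USD
AFFECTED_SHARE = 0.01        # 1% of customers, per Worldpay's statement
OUTAGE_DAYS = 21             # roughly three weeks
DEBIT_FEE = 0.0075           # 0.75% fee quoted for debit cards
CREDIT_FEE = 0.0275          # 2.75% fee quoted for credit cards


def lost_fee_revenue_musd(annual_revenue, affected_share, outage_days):
    """Fee revenue the processor misses while the gateway is down."""
    return annual_revenue * affected_share * outage_days / 365


def affected_volume_musd(lost_fee_revenue, fee_rate):
    """Transaction value that would have generated the lost fee revenue."""
    return lost_fee_revenue / fee_rate


lost = lost_fee_revenue_musd(ANNUAL_REVENUE_MUSD, AFFECTED_SHARE, OUTAGE_DAYS)
volume_low = affected_volume_musd(lost, CREDIT_FEE)   # higher fee, lower volume
volume_high = affected_volume_musd(lost, DEBIT_FEE)   # lower fee, higher volume

print(f"Lost fee revenue: ${lost:.1f}M")
print(f"Affected transaction value: ${volume_low:.0f}M to ${volume_high:.0f}M")
```

The script gives about $2.3 million in lost fee revenue and an affected transaction value between roughly $84 million and $307 million, in line with the range quoted above. Changing the affected share or the outage length shows how sensitive the estimate is to Worldpay's own 1% figure.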

ITIL Service Strategy (Source: AXIOS)

There are several important lessons to be learned from this story:

  • For Etsy and other e-commerce site operators, one must offer customers multiple payment choices and avoid single points of service failure, whether the service in question is related to compute, storage, database, payments, logging, or other infrastructure. As an individual consumer, you likely carry more than one credit card; an e-commerce business should be similarly adaptive and resilient with regard to payment methods. This is common sense for system designers, and it suggests that Etsy and others may have been overly cost-conscious and focused on simplicity.
  • Furthermore, there should be an incident response plan for both the service provider (Worldpay) and the service consumers (Etsy, StanJames, etc.) so that there is a well-understood process, a group of responsible individuals, and a technology tool set for detecting, reacting to, and recovering from such incidents. At minimum, Worldpay should have had a communications template, a standard list of recipients to alert, and the ability to queue transactions in durable storage so that they could be settled manually or eventually routed to electronic workflows once system services were restored.
  • Quality assurance and change control are serious matters for all IT organizations, and especially so for real-time e-commerce systems involving large sums and volumes of financial transactions. Again at minimum, Worldpay should have coordinated its release management plan with major customers, communicated the changes at least 30 days in advance, and allowed them to test and sign off on the system changes in a staging environment so that all stakeholders could learn from the testing, prepare for the final go-live, and make any necessary adjustments. This should be ITSM 101 for ITIL and PMP professionals. Another engineering best practice involves logging customer transactions in production to durable storage so that they can be replayed to simulate actual traffic in the staging environment. A more advanced DevOps practice would have been to make the production release reversible and roll back the changes once the catastrophic failures were noticed.
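The first two lessons (multiple payment providers, plus a durable queue for transactions that cannot be processed) can be sketched in Python. This is a minimal illustration under stated assumptions, not how Etsy or Worldpay built their systems: `GatewayError`, `process_payment`, `replay_queue`, and the stub gateways are all hypothetical names, and a plain JSON-lines file stands in for a real transactional store.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


class GatewayError(Exception):
    """Raised by a (hypothetical) gateway client that cannot take a payment."""


Gateway = Callable[[dict], str]  # takes a payment, returns a confirmation id


def process_payment(payment: dict, gateways: List[Tuple[str, Gateway]],
                    queue_file: Path) -> str:
    """Try each gateway in order; if all fail, queue the payment on disk."""
    for name, charge in gateways:
        try:
            return f"{name}:{charge(payment)}"
        except GatewayError:
            continue  # fail over to the next provider
    # Every gateway failed: append to durable storage for later settlement.
    with queue_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(payment) + "\n")
        fh.flush()
        os.fsync(fh.fileno())
    return "queued"


def replay_queue(gateways: List[Tuple[str, Gateway]], queue_file: Path) -> int:
    """Retry queued payments; those that still fail are queued again.

    A real system would also need idempotency keys to avoid duplicate
    charges, and a transactional store rather than a delete-and-rewrite file.
    """
    if not queue_file.exists():
        return 0
    lines = queue_file.read_text(encoding="utf-8").splitlines()
    pending = [json.loads(line) for line in lines if line]
    queue_file.unlink()
    settled = 0
    for payment in pending:
        if process_payment(payment, gateways, queue_file) != "queued":
            settled += 1
    return settled


# --- Demo with stub gateways (stand-ins for real provider clients) ---
def offline_gateway(payment: dict) -> str:
    raise GatewayError("gateway offline")


def healthy_gateway(payment: dict) -> str:
    return f"ok-{payment['id']}"


queue = Path(tempfile.mkdtemp()) / "pending_payments.jsonl"
order = {"id": "order-1", "amount_cents": 2500, "currency": "USD"}

# Primary is down, secondary is up: the payment fails over.
print(process_payment(order, [("primary", offline_gateway),
                              ("secondary", healthy_gateway)], queue))
# Only one provider and it is down: the payment is queued durably.
print(process_payment(order, [("primary", offline_gateway)], queue))
# Service restored: the queued payment is replayed and settled.
print(replay_queue([("primary", healthy_gateway)], queue))
```

With the stubs above, the demo prints `secondary:ok-order-1`, then `queued`, then `1`. The ordering of the gateway list is the routing policy; a merchant with a single provider still benefits from the queue, since orders are retained for manual or later settlement instead of being lost.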

For all the recent hype around FinTech, comprising mobile-only digital wallets, blockchain, and financial services for those outside the traditional banking system, the consolidation of legacy payment solutions continues onward and upward with ever-larger valuations. Worldpay was acquired by US-based Vantiv for $10 billion in a deal that closed in January 2018. Then in March 2019, Fidelity National Information Services (FIS) announced its intent to merge with Worldpay, valuing the new company at $43 billion. BCG and McKinsey forecast that global payments will total $2–3 trillion a year in revenue sometime in the next decade as more people switch from cash to digital payments for online sales.

