As part of the Digital Service at Veteran Affairs team’s(VA) Appeals Modernization effort, we are developing a React on Rails application called Caseflow. When fully implemented, Caseflow will enable a better, more efficient process for managing Veterans’ appeals by increasing productivity, preventing errors, and automating or eliminating manual steps.
Our current deployment strategy follows the industry’s best practices with Continuous Integration and Continuous Deployment (CI/CD) using Travis and Jenkins. The team develops features quickly and deploys changes to the production daily.
Before we started deploying daily, however, we had to solve an elusive bug that caused our deployment to fail every time.
Five minutes of action
Our deployment pipeline uses the Immutable Server Pattern to create an Amazon Machine Image (AMI) and perform rolling deploys into our AWS AutoScaling group. When we would deploy, the newly deployed EC2 instances would serve traffic beautifully for five minutes. Then the servers would crash. The applications would tie up all their threads on SQL queries to an Oracle database called VACOLS. The failure would prompt PagerDuty to conduct a chorus of ringtones, emails and Slack messages. This meant that our systems were down and preventing Veterans’ Appeals from being certified by Region Offices around the entire nation.
The VA Digital Service team prides itself on making Veterans’ lives better and bringing in modern software development process to the VA. In reality, we were only causing further frustration as a result of our daily application downtime. Despite wanting to deploy daily to production, we could only stand to deploy once a week during off hours. We panicked more and more with each deployment failure, and with little time to think things through, we had to reboot until the problem went away.
Even though we spent hours on this problem, the team was burned out and no one could crack the case.
Unpeeling the onion
Caseflow is a VA intranet application deployed in the AWS GovCloud region. To connect to the VA on-premise Oracle database, Caseflow goes through a IPSec tunnel managed by the VA Network Security Operation Center (NSOC) in Martinsburg, WV. NSOC manages a fleet of Cisco routers that contains a large number of routing and firewall rules for our packets.
We knew that our five minute bug was related to the connection with the Oracle database described above. However, it was difficult to pinpoint the root cause because there were many layers to the issue.
To isolate the problem to a smaller surface area, we first came up with a set of concrete possibilities and verified them one by one.
1. The Oracle database was dropping our session silently.
A quick check with
netstat would show that our database TCP connections were in the
ESTABLISHED state. So at least on the surface, we thought that the connection was healthy. For a second opinion, we used SQL Developer and connected to Oracle. The status check showed that the database was happily serving requests for other applications.
2. Ruby’s OCI8 implementation was dropping the connections to the Oracle database.
To test the Oracle database connection, we crafted up a small Node.js application that connected to the database and performed a query every two seconds. At the fifth minute after OS boot up, the query hung. This verified that the problem was not isolated to Ruby’s OCI8 implementation.
3. Some startup program in our Linux image was dropping the database connections.
We were able to replicate this issue using Ubuntu 14.04 with the 3.16 based Linux kernel. Since we were able to replicate the issues across two major versions of Linux, we were able to conclude that this problem was not specific to our version of Linux.
4. Our Site-to-Site VPN was dropping the connection in the IPSec tunnel.
Since none of the other possibilities panned out, this was likely a Site-to-Site VPN issue. This meant that our packets were probably being dropped by NSOC. We called up NSOC and spent three fruitless hours debugging the issue. To NSOC, the connection was active, and there was no anomaly on their side. Since we couldn’t pinpoint the issue in this network path, the problem was bounced back to us.
At this point, our only tool left was TCP/IP packets. As we gathered more packets, a clear pattern started to emerge. As long as a persistent TCP connection was created within the first five minutes from bootup, the connection would be silently dropped at exactly five minutes from bootup. To diagnose the TCP/IP layer, we used
tcpdump to examine the IP packets for clues. Our efforts paid off when we observed an anomaly: the TCP TS Val was near the 32 bit boundary when the TCP window stopped sliding. This immediately screamed integer overflow.
15:34:25.326520 IP AWS-IP > VACOLS: Flags [.], ack 1802470, win 843, options [nop,nop,TS val 4294967250 ecr 2081125286], length 0
Armed with this information, we again discussed the symptoms with the engineers at NSOC. They were amazed that nobody had seen this before, but through this conversation, we were able to determine why. Most TCP connections on this network path are short-lived HTTP sessions and were very unlikely to see this problem. Since we were one of the few systems that created a persistent TCP connection over a VPN tunnel on startup, our connection was exposed to TS Val wrapping where others weren’t.
The NSOC engineer we were speaking to searched through the Cisco bug database and confirmed that this is a known Cisco firmware bug. If the NSOC routers see a TS ECR value that is ahead of TS Val, they will drop all following packets in the current TCP session. This bug occurred only if we made a persistent TCP connection within five minutes from bootup. The only way to fix this was to upgrade the firmware. We set
net.ipv4.tcp_timestamps = 0 in
/etc/sysctl.conf to disable TCP timestamping as a stopgap until NSOC could make the upgrade. With this patch, we were able to deploy during business hours without alarms and phone calls for the first time in months.
The end? Not quite.
All about the Jiffies
The most interesting aspect of this bug was the magical five minute death mark. Digging deeper into the Linux TCP stack, we found out that the TS Val is initialized to the Jiffies on session creation. Jiffies is a unit of measurement for the Linux kernel to measure time. And for many years, the kernel developers have initialized the Jiffies to 5 minutes before wrapping to force exposure of integer wrapping bugs. This means that the TCP TS Val also wraps at the five minute mark. If the kernel developers hadn’t thought to force this exposure, we would have had no hope of nailing this bug down. Our applications would have crashed after random lengths of time in measures of weeks and months rather than minutes.
Looking at the big picture, this bug is fascinating because we started debugging Ruby on Rails from AWS and found the root cause in the NSOC Cisco Routers thanks to an ancient feature in the Linux kernel. While you may encounter problems like this elsewhere, USDS teams solve them every day knowing that the problem affects not just a nebulous release date, but potentially the lives of hundreds of thousands of people.
Alan Ning is a site reliability engineer working with the Digital Service at Veterans Affairs team