It’s an old programming adage that if you don’t understand the cause, you can’t be said to have fixed the bug.

To network or to not work? That is the question

I’m a Game Designer and Game Programmer at heart and by origin. I’ve spent more time designing and implementing AIs, turn-based systems, interactions between the player character and the world, collisions and physics than I’ve spent on common software development tasks. I lack knowledge when it comes to Web Development, CSS and Networking (beyond the usual TCP/UDP connections and coding small multiplayer games). Still, with my background in games, I considered myself a pretty good software programmer.

If I can make an AI that can challenge the player in a turn-based strategy game inspired by Advance Wars, I can make a responsive button behave properly across platforms, right?

Turns out that’s not the case. And I never saw this so clearly until today, when I finally stood up and said: Done, I fixed it. It wasn’t an actual issue and it wasn’t my fault. So how was the issue fixed? I realized what caused it, what the behavior was and why it happened, and I didn’t have to change a single line of code to “submit the fix”.

Submit is a strong word though. I did submit some code today, but it was more of a refactoring of the previous codebase. No features were changed and no bugs were fixed, but I stood there proudly announcing the end of my whack-a-mole game with the QA team. Let’s start with the hypothesis, shall we?

The details: You have a piece of code that retrieves an IP address from the cloud. It uses that IP to connect to a computer on the same local network. What it does after establishing the connection is of no importance to us. But we need that connection to be re-established every time the power goes out, either of the two computers is shut down or a manual reconnect is issued. Both computers access the cloud: one to store its IP and the other to retrieve it.

The problem: The connection between the two local computers doesn’t happen after a reload/restart, at least not consistently. You can access the cloud from both computers, it responds to pings, and the issue happens seemingly at random. The connection is established after another restart but fails on the third attempt. And the fourth, and the fifth. But the sixth one comes out alright.

Armed with the knowledge of this problem, I’ve spent the last three weeks trying to find a fix. My first idea was to cache the server’s IP address and reuse it, but this doesn’t work when the lease for the server’s DHCP-assigned IP expires. I then decided to log the server’s MAC address and search for the IP in the ARP table, but the ARP table gets flushed (a workaround for an issue with another piece of software on the client computer). And no, I have no possibility of installing another piece of custom software on the server PC to broadcast its IP to the network. For all intents and purposes the server PC is a black box for me.
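The MAC-address idea can be sketched roughly as follows. This is only an illustration, not the code I actually ran: the function names `normalize_mac` and `find_ip_for_mac` are made up for this example, and it assumes the ARP table has already been captured as text (for example from the output of `arp -a`, whose exact layout differs between operating systems). It also shows why the approach breaks down: once the table is flushed, there is nothing left to search.

```python
import re
from typing import Optional

# Loose patterns for an IPv4 address and a MAC address written with ':' or '-'.
IP_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")
MAC_RE = re.compile(r"\b([0-9A-Fa-f]{1,2}(?:[:-][0-9A-Fa-f]{1,2}){5})\b")


def normalize_mac(mac: str) -> str:
    """Lowercase, ':'-separated, zero-padded form so different notations compare equal."""
    parts = re.split(r"[:-]", mac)
    return ":".join(part.lower().zfill(2) for part in parts)


def find_ip_for_mac(arp_output: str, mac: str) -> Optional[str]:
    """Return the IP listed on the same line as `mac`, or None if it is not there."""
    wanted = normalize_mac(mac)
    for line in arp_output.splitlines():
        mac_match = MAC_RE.search(line)
        ip_match = IP_RE.search(line)
        if mac_match and ip_match and normalize_mac(mac_match.group(1)) == wanted:
            return ip_match.group(1)
    return None


# Hypothetical sample of ARP table text (addresses are invented for the example).
sample = (
    "? (192.168.1.1) at 00:11:22:33:44:55 [ether] on eth0\n"
    "? (192.168.1.50) at aa:bb:cc:dd:ee:ff [ether] on eth0"
)

found = find_ip_for_mac(sample, "AA-BB-CC-DD-EE-FF")  # entry is present
missing = find_ip_for_mac("", "AA-BB-CC-DD-EE-FF")    # table was flushed
```

With the sample text the lookup finds the server, but with an empty (flushed) table it returns `None`, which is exactly the situation I kept running into.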

After going back and forth and logging all the data I could (how many times per hour the connection would drop, how many times it would happen in succession and who knows what else), I was about to give up. All the data I stored and logged didn’t help me find the underlying issue. The solution slowly came to me as I was experiencing other, separate issues. For example, sometimes I couldn’t access my home banking app from ING while connected to my shared office space’s wireless network, or GitHub would be down on wireless but accessible over my phone’s 4G connection.

My initial thought was that we had some kind of filtering on the network, so I asked one of the network administrators if this could be the cause. To my shock the answer was no, there is no filtering system in use and no addresses are blocked. However, I learned that our internet connection is supplied by two service providers and the load on the network is balanced between them.

I don’t know how common knowledge of network load balancing is, but a co-worker explained to me how it works and there it was, the only good explanation for what I was experiencing. In short, the way things were supposed to work is:

  • The server would phone the cloud and post data to it, such as its outbound IP and its local IP address
  • The client would call the cloud and ask for the server’s address
  • If the client’s outbound IP matched the outbound IP posted by the local server, a valid address would be passed back

Due to the load balancing, the outbound IPs of the server and the client did not always match, and whenever they didn’t, the cloud would not give away any of the information requested by the client.
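To make the failure concrete, here is a minimal sketch of the lookup described above. Everything in it is an assumption for illustration: `CloudRegistry`, `register` and `lookup` are hypothetical names standing in for the real cloud service (whose internals I can’t show), and the IP addresses come from reserved documentation ranges.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Registration:
    outbound_ip: str  # public IP the cloud saw when the server posted its data
    local_ip: str     # LAN address the server reported for itself


class CloudRegistry:
    """Hypothetical stand-in for the cloud service, keyed by outbound IP."""

    def __init__(self) -> None:
        self._by_outbound_ip: Dict[str, Registration] = {}

    def register(self, outbound_ip: str, local_ip: str) -> None:
        # Step 1: the server phones the cloud and posts its addresses.
        self._by_outbound_ip[outbound_ip] = Registration(outbound_ip, local_ip)

    def lookup(self, client_outbound_ip: str) -> Optional[str]:
        # Steps 2 and 3: the client asks for the server's address and only
        # gets one if its own outbound IP matches a registered server.
        registration = self._by_outbound_ip.get(client_outbound_ip)
        return registration.local_ip if registration else None


registry = CloudRegistry()

# The server's request happened to leave through the first provider.
registry.register(outbound_ip="203.0.113.10", local_ip="192.168.1.50")

# Client request routed through the same provider: outbound IPs match.
same_provider = registry.lookup("203.0.113.10")

# Client request routed through the second provider: no match, no address.
other_provider = registry.lookup("198.51.100.20")
```

Both machines sit on the same local network in both cases. The only thing that changes is which provider the load balancer picked for the request, which is why the failures looked random from where I was standing.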

It’s a simple issue that could have been solved quickly had I known anything about common network issues. Most of my experience comes from making games and tools for games. I wrote AIs that could teach the player how to properly play the game, and I wrote enough strcpy-powered hacks to patch content in games by exploiting zero-byte strings, that I never thought I could find an issue in “simple software development” that would stop me in my tracks. Turns out I lack enough knowledge of the basics that I’ll be keeping my eyes open from now on.
