They Tried to Make Me Go to (Network) Rehab

And I said yes, yes, yes

Martín Beauchamp
Shapeways Tech
5 min read · Oct 20, 2015


As we’ve mentioned in previous Shapeways Tech blog posts, we manage our own servers in real live datacenters. The Ethernet network that connects our servers to each other and to the Internet is another aspect of our infrastructure that we designed and operate to keep Shapeways serving up 3D printed goodness.

If you’ve ever worked at a startup you know that the one constant is change. Relentless change with little to no warning. Sometimes things move so quickly that planning for change isn’t possible. When that happens you whip out the Tech Debt credit card and put it on your tab. How much are those monthly payments?

This is more or less what happened at Shapeways when our primary datacenter vendor announced that they were closing their doors. With very little time to prepare we found ourselves pulling up stakes and traveling to a new datacenter.

Physical networks form the foundation of any website. Whether the site is hosted in the cloud, from a basement, or somewhere in between — there is a server plugged into a switch, a switch into a router, and a router plugged into the Internet. When these physical networks are designed well they are fast, reliable, and easy to maintain and extend.

Some features of a well designed network are:

  • Rapid Spanning Tree Protocol (RSTP) — prevents Ethernet frames from being forwarded around a loop forever. These “packet storms” happen when someone creates a cabling loop between two or more switches. They can also happen with virtual switches (like the kind found on many blade servers). A packet storm will make it impossible for servers to communicate and your site will go down. Without RSTP you can disable your entire datacenter by plugging in one cable. STP’s inventor, Radia Perlman, also wrote a poem summing things up nicely. I’ll bet you didn’t know that poems are the hot new TL;DR.
  • Link Aggregation (LAG) — our access switches connect each server in each rack to our core switches. These links are important because without them an entire rack of servers would be unable to communicate with the outside world. Having only one link per switch is asking for trouble. Also, a single link can get saturated with traffic, slowing down communications for all servers in the rack. Link Aggregation allows multiple links to behave as one with improved bandwidth between the racks and core switches. Unplugging a cable in a LAG and not setting off alarms or bringing down the site is thrilling every time.
  • Bonded Network Interface Cards — each server should have at least two NICs connecting it to the access switch. If you can use LACP (the protocol behind LAGs), all the better. Without this, links go down, servers go offline, tears are shed, and feelings are hurt.
  • Virtual Local Area Networks (VLANs) — VLANs help you segregate traffic. You don’t want the admin GUI of your switches exposed to your servers, or the entire Internet, right? Who would do that?
  • Clearly labeled ports — now that we have all manner of fancy ports (LAGs, different VLANs, multi VLAN trunks) we need an easy way to distinguish one type of port from another. Plugging a server without VLAN tagging into a trunked port results in a lot of head scratching and fist shaking. Save future you some agita by labeling special ports.
  • Clearly labeled cables — when doing maintenance on servers and switches, labeling cables is not OCD, it’s what the pros do. The cables for your production MySQL master (very important) and your 23rd webhead (not very important) look identical, but the impact of unplugging them isn’t. We label each end with the name and port number of what’s plugged in on the other side.
  • Up to date firmware — switches are computers with embedded operating systems. Like all software, that firmware has bugs and security vulnerabilities. Installing the latest firmware on switches helps reduce problems and generally makes life more pleasant.
  • Stacked switches — Two (or more) switches become one logical switch and good things happen:
  1. Bandwidth between the stacked switches can be greater than an ordinary inter-switch link would allow.
  2. LAG links can span multiple (stacked) switches, preserving the link if one switch gives up the ghost.
  3. A stack uses fewer of the core switch’s logical LAGs than the same number of unstacked access switches would.
  4. There are fewer switches to configure and manage.
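As a concrete example, here’s what the bonded-NIC piece of that checklist might look like on a Linux server — a sketch only, in Debian-style `/etc/network/interfaces` syntax, assuming the `ifenslave` package is installed and the access switch has a matching LACP LAG configured. The addresses and interface names are made up:

```
# Hypothetical example: bond eth0 and eth1 with LACP (802.3ad),
# matching a LAG configured on the access switch.
auto bond0
iface bond0 inet static
    address 10.0.0.23
    netmask 255.255.255.0
    gateway 10.0.0.1
    bond-slaves eth0 eth1
    bond-mode 802.3ad      # LACP
    bond-miimon 100        # link monitoring interval (ms)
    bond-lacp-rate fast
```

With the LAG spanning both switches in a stack, the server stays online over the surviving link even if a cable — or an entire switch — fails.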

After the dust from our hasty move had settled, we wanted a network with all of those features and the flexibility to refine things with time, but there were a few obstacles:

  1. We didn’t have administrator passwords for most of the switches in our primary datacenter. From a tech debt perspective, this was like getting a loan, buying a new car, selling the car, taking the cash to the racetrack and betting it all on a horse named 100% Uptime. While the network was working in this state, it was preventing us from making improvements.
  2. No cables were labeled anywhere. We had a color-coding system in place, which was a good start, but it wouldn’t give us the information we needed during the rehabilitation.
  3. Some racks had dumb switches which couldn’t do RSTP, LAGs, or VLANs. That was a show-stopper, since none of those protocols work as expected if some switches don’t participate.

Rolling up our sleeves

We started by replacing the non-managed switches with managed units. Next we verified the cabling and bonding config for every server. Then the rack-by-rack rehab could begin.

Because the maintenance required rebooting both access switches, we had to connect each server to a temporary switch.

Can you spot the temporary switch?

Once the secondary access switch was empty, we could unplug the primary bond connections from the primary access switch. Then we could get down to the switch maintenance:

  1. Reset to factory defaults (default admin password, bingo!)
  2. Upgrade the firmware.
  3. Connect the access switches together (stacked config).
  4. Load the switch configs containing the LAG, VLAN, and other settings.
  5. Label the “special” ports such as those carrying LAG and VLAN traffic.
  6. Label and connect the LAG cables running from the access switches to the core switches.
  7. Test connectivity and then connect all of the servers to their new access switch!
Nicely labeled LAG cables
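The labeling convention is mechanical enough to script. Here’s a small sketch — device names are hypothetical — that generates the text for both ends of a cable, where each end gets the name and port of whatever is plugged in on the other side:

```python
def cable_labels(end_a, end_b):
    """Given (device, port) tuples for both ends of a cable, return the
    label text for each end: the far side's device name and port."""
    dev_a, port_a = end_a
    dev_b, port_b = end_b
    return {
        end_a: f"{dev_b} port {port_b}",
        end_b: f"{dev_a} port {port_a}",
    }

# One leg of a LAG between an access switch and the core switch.
labels = cable_labels(("access-sw1", 47), ("core-sw1", 3))
print(labels[("access-sw1", 47)])  # → core-sw1 port 3
```

Feed it your patch list and you can print a label sheet for a whole rack before you ever touch a cable.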

Betty Ford Would Be Proud

Rehabilitating each rack in each datacenter was a lot of work, but it was also rewarding. Today we have a faster and more manageable network — and we’re not done yet! After the holiday season, we’re planning to roll out all-new networking gear with faster port speeds to drive down site load time and give us more operational flexibility. We’re also working on rolling out Link Layer Discovery Protocol (LLDP) to reduce the time we spend labeling and re-labeling cables. We’re imagining a Raspberry Pi in each rack that displays port, hostname, and NIC information and updates when devices move. It should be easy to use and reliable, and we’ll be sure to post about it if and when it goes from big idea to P-Touch killer.
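The display side of that idea could be a small script on the Pi that polls the switch for LLDP neighbors and renders a port-to-host table. A minimal sketch, assuming the neighbor data has already been fetched and flattened into a simple JSON shape (real `lldpctl` or SNMP LLDP-MIB output is more deeply nested and vendor-specific, and the hostnames below are made up):

```python
import json

# Hypothetical, simplified neighbor data, stood in for real LLDP output.
SAMPLE = json.dumps({
    "neighbors": [
        {"local_port": "ge-0/0/1", "remote_system": "web23", "remote_port": "eth0"},
        {"local_port": "ge-0/0/2", "remote_system": "db-master", "remote_port": "eth1"},
    ]
})

def neighbor_table(raw):
    """Map each local switch port to the (hostname, NIC) seen via LLDP,
    i.e. the table a per-rack display would render."""
    data = json.loads(raw)
    return {n["local_port"]: (n["remote_system"], n["remote_port"])
            for n in data["neighbors"]}

for port, (host, nic) in sorted(neighbor_table(SAMPLE).items()):
    print(f"{port}: {host} {nic}")
```

Because LLDP reports what is actually plugged in rather than what a label claims, a display like this can never drift out of date the way a printed label can.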
