Production: The First 100 Days

Edgemesh Corporation
Jul 12, 2017

Edgemesh Corp’s first 100 days in the real world

General George Patton’s Speech to the Third Army: “Patton” 1970

“Let’s not romanticize production. Production is war, and war is hell”

- Bryan Cantrill (@bcantrill), CTO of Joyent

Launching a tech company is always tough, but there are real differences between companies that are far out on the edge of technology and those that are not (e.g. “It’s like Uber for Cat Sitting!”). When a technology problem has a “glue this to that” solution and there’s not much actual innovation, the technical ‘unknowns’ don’t really exist (until you start hitting scaling issues). In those cases the key metrics generally revolve around customer acquisition and scaling costs.

With a more ‘mad science’ technology, there are many more technical unknowns than knowns. Just getting everything to work is a massive challenge. It’s essentially the difference between “riding in a cab” and “riding in a prototype starship”. Yes, technically they are both modes of transportation, but there is little chance the cab will explode and burn the company and team down in a fiery mess.

For those of us who have blown past the “cutting edge”, waved goodbye to the “bleeding edge” and are now firmly planted on the “impaling edge” of software, we know that before a company can focus on business metrics in earnest, there is a critical test it must pass: PRODUCTION.

Unlike a simple website, software at the impaling edge is generally difficult to test at scale. It’s much harder still when it is distributed in nature and it’s borderline masochistic when it’s distributed globally across devices you cannot directly touch and debug.

So for Edgemesh, where the vast majority of our software runs on client browsers, the first 100 days in production were going to be … stressful.

Get ready

What is Production?

Production can be defined a number of different ways, so let’s define what exactly we mean here.

Production is when you have a paying customer who expects your service to work at a scale that you have not been able to test before … and if you fail there is little chance you will be able to recover the customer.

For Edgemesh, that meant a customer site that exceeded a few hundred visits per minute from multiple geographic locations. That day came on April Fools’ Day 2017.

Getting Battle Ready: Don’t over promise

If you’ve ever been in the trenches here, you know the first rule of Production day zero: don’t over promise. That’s always a thin needle to thread — on the one hand you have to ensure you are committing to deliver enough value to win the account, but at the same time you need to give the team enough leeway to fix issues if (when) they arise.

Getting Battle Ready: Don’t bring a knife to a gun fight

The next step is ensuring you have the tools to debug before you deploy at scale. This is not easy and takes a lot of time to get right, but if (when) the issues arise you need to have the right tools at hand. Essentially, you need the ability to debug post-mortem: when we do have a crash, we need a “flight data recorder” to analyze what happened. Bryan Cantrill has a wonderful presentation on the value of this available here (definitely worth the watch).

For us, that meant we needed a combination of internal logging (with performance data) and client-side error reporting. For client-side error reporting we strongly recommend Sentry (also open source). We also developed some internal adapters to allow multiple systems to report into Sentry. This allows us to see both client-side and server-side issues together in a correlated view.

Example Sentry error event, includes all the information needed to debug when you can’t get a core dump
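
As a rough illustration of the kind of wiring involved (a sketch using Sentry’s current browser SDK, not our actual adapter code; the DSN, release string, and tag names are placeholders):

```typescript
// Minimal client-side error reporting wired into Sentry.
// The DSN, release string, and tag values below are placeholders, not real config.
import * as Sentry from "@sentry/browser";

Sentry.init({
  dsn: "https://examplePublicKey@o0.ingest.sentry.io/0", // hypothetical project DSN
  release: "client@1.6.0", // lets every error link back to a specific release
});

// Report an error with a tag identifying which system produced it, so client
// and server events can be viewed together in one correlated stream.
export function report(err: unknown, system: "client" | "server-adapter"): void {
  Sentry.withScope((scope) => {
    scope.setTag("system", system);
    Sentry.captureException(err);
  });
}
```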

What needs to be in your pre-Production rucksack? At a minimum I’d say:

  • Automated error capture from all systems
  • Full stack trace (with source inlined into the trace analysis tool)
  • Per-function performance metrics (number of times executed, min/max/avg/p95/stdev of runtime) for the server side at least (see the sketch after this list)
  • Client side information (browser/OS/hardware info) if applicable
  • Direct mapping of Error to Owner (all errors should have an obvious project lead)
  • Ability to add comments to an error and link back to releases
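
For the per-function metrics item above, here is a tiny sketch of the idea (illustrative names only, not our production collector): record each call’s runtime, then summarize the distribution on demand.

```typescript
// Per-function runtime metrics: call counts plus min/max/avg/p95/stdev.
const samples = new Map<string, number[]>();

export function timed<T>(name: string, fn: () => T): T {
  const start = performance.now();
  try {
    return fn();
  } finally {
    const list = samples.get(name) ?? [];
    list.push(performance.now() - start);
    samples.set(name, list);
  }
}

export function summarize(name: string) {
  const runs = (samples.get(name) ?? []).slice().sort((a, b) => a - b);
  if (runs.length === 0) return undefined;
  const avg = runs.reduce((a, b) => a + b, 0) / runs.length;
  const variance = runs.reduce((a, b) => a + (b - avg) ** 2, 0) / runs.length;
  return {
    count: runs.length,
    min: runs[0],
    max: runs[runs.length - 1],
    avg,
    p95: runs[Math.min(runs.length - 1, Math.floor(runs.length * 0.95))],
    stdev: Math.sqrt(variance),
  };
}
```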

Getting Battle Ready: Mentally prepare for a siege

“If the mind is willing, the flesh could go on and on without many things”

— Sun Tzu

Production means late nights, hard problems and some of the most trying times for a team. Know this beforehand. Unfortunately, you can’t pre-classify all errors into severity categories, so accept the fact that the core team will need to be online and available during the entire campaign.

Work out who does what and for how long. Set a deadline for analysis (8 hours max?). After the deadline, the lead either hands off to another team member who is just coming on, or the analysis is set aside for a break (sleep management is critical). Ensure everyone knows who will speak to the customer (it’s one person, so the company always speaks with one voice) and what the output of that analysis will look like (have a Root Cause and Remediation template).

Getting Battle Ready: Define victory

It’s important for everyone to know what the goals are and to define realistic metrics for success. This helps keep everyone’s head in the game and keeps the focus of the team on the key value propositions your customers expect. For example, automating the build chain and test suite is important — but it’s not a goal. An actual goal might be: “we want to deliver updates to fix critical errors within a 24 hour window after root cause and remediation”.

For Edgemesh, we set some high level milestones we wanted to hit:

  • Debug all client side errors fully from our internal reporting system without disturbing the client
  • Release automatic updates to remedy all errors in <24 hours from fix, achieve global client update in <72 hours
  • Mesh across 500 distinct networks
  • Offload 1 Terabyte of traffic to mesh
  • Mesh across 50 distinct countries

D-Day: April Fools 2017

On April 1st our first large scale client officially went into production. We had set their expectations by explaining that the first 60 days would be primarily used to measure the traffic patterns, and they shouldn’t expect any significant traffic offload to the mesh. We also did not promise the Real User Metrics feature at the time of sale because we wanted to debug one aspect of the software at a time.

During those first 60 days, we pushed 10 updates into production. Since Edgemesh runs in the end client’s browser, we needed a stable update method. We experienced one major issue when an upstream CDN cache failed to correctly update edge caches, causing clients to roll back to a much earlier release. As a result, in May we moved all Edgemesh code away from the CDN provider and back to our own servers.

Upgrades are harder than installations. Dangling clients can wreak absolute havoc.
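
One mitigation for dangling clients, sketched below with a hypothetical /client-version endpoint and loader path (not our actual mechanism), is to have the running client compare its own release against what the origin reports and re-fetch itself with a cache-busting URL when they differ:

```typescript
// Hypothetical guard against "dangling clients": compare the running release
// against what the origin says is current, and re-fetch the loader if they differ.
const RUNNING_VERSION = "1.6.0"; // illustrative; baked in at build time

async function ensureCurrentRelease(): Promise<void> {
  // "no-store" bypasses any stale edge cache sitting between us and the origin.
  const res = await fetch("/client-version", { cache: "no-store" });
  const { latest } = await res.json();
  if (latest !== RUNNING_VERSION) {
    // Pull the loader with a cache-busting query so a stuck CDN edge
    // can't keep serving the old build.
    const script = document.createElement("script");
    script.src = `/edgemesh-client.js?v=${encodeURIComponent(latest)}`;
    document.head.appendChild(script);
  }
}
```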

During those first 7 weeks the client was given weekly updates. We used the portal to provide insight into where users were coming from, along with custom one-off reports on their current origin latency.

Meanwhile Randy and Eugene debugged 100% of the reported errors using our aforementioned internal reporting system. Although there were a number of bugs, none were user-impacting. When the Edgemesh client hits an exception, we bail out and allow the browser to resume normal operation. The data collected from the error is then reported back to us for analysis; better to be safe than sorry. This, combined with the fact that expectations were set early, gave the team the time they needed to methodically move through the release and upgrade cycles.
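
The bail-out pattern itself is simple; something along these lines (a sketch, with captureError standing in for our Sentry-backed reporting, a hypothetical name):

```typescript
// Wrap mesh work so any exception bails out cleanly: report the error for
// later analysis, then return control so the browser resumes normal operation.
function captureError(err: unknown, context: Record<string, string>): void {
  console.error("mesh error", context, err); // stand-in: in production this ships to Sentry
}

export function guarded<T>(step: string, fn: () => T): T | undefined {
  try {
    return fn();
  } catch (err) {
    captureError(err, { step }); // data goes home for post-mortem analysis
    return undefined;            // bail out; the page behaves as if we were never there
  }
}

// Usage: any failure inside the mesh logic degrades to "do nothing" for the user,
// e.g. guarded("peer-connect", () => { /* mesh work */ });
```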

Lead Engineer leading by example: Closing out bugs reported in Sentry.io

Floor it: May 26th 2017

When we hit v1.6 on May 26th it was time to go all out. We took the governors off the mesh clients, allowing the mesh to operate at full capacity. We also closed out the last few major issues that would cause the client to bail. Lastly, we rolled out our global Supernode network to provide extra capacity and edge peers, just in case we needed to utilize more traditional networks around the globe.

Edgemesh Supernodes in June: spread across Azure and Google Cloud

We released v1.6 on a Friday, since we knew the client’s traffic patterns were much lower on the weekend. We waited for the End of Day process to run and then held our breath. The Real User Metrics were now available, so we could analyze individual page load times, but our internal expectations were tempered. They shouldn’t have been.

By the end of the following week we had offloaded over half a terabyte of traffic from the client’s origin servers. In addition, the customer’s mean page load time decreased by 33%! Both of these numbers exceeded our internal models by a factor of at least 2x.

Over 1 Terabyte of Traffic Offloaded. Page load time decreased >30%

As the weeks went by the mesh continued to expand organically. We crossed 1,000 distinct networks within 10 days (2x our goal).

500 distinct networks, then 1000 🚀: Mesh latency dropped by ~4x 🎉

To put 1,000 networks in perspective, Akamai reports:

Akamai has deployed the most pervasive, highly-distributed content delivery network (CDN) with more than 233,000 servers in over 130 countries and within more than 1,600 networks around the world.

The mesh was growing fast.

As the mesh expanded geographically (hello mainland China!) and reached new network edges, the mean mesh transfer time decreased to ~150ms. For comparison, the connection time to Google is ~100ms, so 150ms for peer-to-peer replication is very fast.

Connection time to Google: ~100ms

It’s important to note that mesh latency has little impact on actual client page load time, since Edgemesh essentially pre-caches content and clients load it directly from local cache. Where intra-mesh time has the largest impact is in distributing assets to increase the pre-cache hit rate: the faster we can connect, the more time we have to deliver content in the background.
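
For a feel of the general pre-cache idea (this is the standard service-worker pattern, not Edgemesh’s actual implementation; the cache name and asset list are placeholders), the sketch below fills the browser Cache API in the background and answers fetches cache-first:

```typescript
// Generic cache-first service worker sketch: assets pulled in the background
// land in the Cache API, so later page loads never wait on the network (or mesh).
const CACHE = "precache-v1"; // placeholder cache name

self.addEventListener("fetch", (event: any) => {
  event.respondWith(
    caches.match(event.request).then((hit) => hit ?? fetch(event.request))
  );
});

// Background fill: the faster assets arrive (e.g. from nearby peers), the more
// of them are already local by the time the user asks for them.
async function precache(urls: string[]): Promise<void> {
  const cache = await caches.open(CACHE);
  await cache.addAll(urls); // placeholder asset list supplied by the caller
}
```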

Post-Production

The client was ecstatic. A 30% reduction in page load time (nearly 2 full seconds) was a major win. The reduced traffic pressure on their origin was also starting to translate into real savings (load balancer and bandwidth fees).

Client’s page load time comparison: Edgemesh Accelerated (top), non-Accelerated (middle) and blended total (bottom)

We’d also met or exceeded all of our internal goals:

  • Debug all client side errors fully from our internal reporting system without disturbing the client 👍
  • Release updates to remedy all errors in <24 hours from fix, achieve global client update in <72 hours 👍
  • Mesh across 500 distinct networks 👍👍👍
  • Offload 1 Terabyte of traffic to mesh 👍👍👍
  • Mesh across 50 distinct countries 👍👍👍 (78 and counting)

Best of all we’d been able to prove that the technology works at scale and that the team was well equipped to add even more edge technology.

🚀Next stop … Edgemesh 2.0 😄
