Stile Downtime: What happened and what we’ll do next.

Byron Scaf
12 min read · May 3, 2020


Last week was rough for the hard-working teachers who have come to rely on Stile. Stile was down, slow or glitchy during school hours on Tuesday, Wednesday and Thursday. A previously unknown bug in some off-the-shelf, widely used database software, together with a massive shift in how schools are using Stile in our new world of remote learning, conspired to cause our worst outage ever.

I am so sorry for the frustration and stress we caused at an already challenging time, when I know teachers are pulling out all the stops to ensure their students’ learning is minimally disrupted by COVID-19.

For a company that prides itself on being reliable, on being there when teachers and students need us, we really fell short. Below is a full and transparent account of what happened, what we learned, and what we plan to do differently to avoid this sort of disruption in the future. If you’re so inclined, grab a cuppa (or a wine) and settle in for the read.

Timeline of events

Lockdown and the lead up to Term 2

Like everyone involved in education, the last month has been a total blur at Stile. Our whole team of 40 transitioned from the office to a fully remote working environment in just a couple of days, and we began the process of learning how to work effectively over video conference. But more importantly, we redirected all our resources to two things:

Monday 27th April

Stile was running as smoothly as always, with no signs of any issues and handling over three times the traffic of the single busiest day of Term 1.

Tuesday 28th April

Before reaching even half of the preceding day’s peak traffic, our primary database server failed. This database is where all lessons and associated student work are safely stored and is critical to Stile functioning. This is reliable and widely-used database software, and is managed for us by the biggest cloud computing provider in the world, AWS. A bug in the database software itself caused the database server to crash.

Under normal circumstances, if our primary database crashes, our system automatically switches to our spare database within seconds. This is part of the service AWS provides us. But, for reasons that have not yet been explained, this did not occur on Tuesday. An attempt to switch manually also inexplicably failed.

It’s worth noting that a failure of both the automatic and the manual switch to the spare database has never happened in our company’s entire history, and we’ve been using this database from the beginning. Our in-house engineering team collectively has dozens of years’ experience working with this database at a number of different companies, and we’ve run many successful internal drills for this exact type of event. None of us had ever seen anything like this before.

Our database was officially busted. We were left with three options:

  1. Start Stile up again on a third spare database that we had “just in case”. This one was working just fine, but it was nowhere near big enough to handle the sort of traffic we were seeing on Stile in this remote learning era. We began increasing its capacity, but it quickly became apparent this would take 5+ hours.
  2. Build a brand new database from the automatic backup we had taken 5 minutes before Stile went down. Doing so would mean students and teachers would lose 5 minutes of work. We started this process too, but it became apparent that this too would take around 5 hours.
  3. Do the same as option 2, but recover to a different backup from 1:30am Tuesday morning, therefore losing hours of student and teacher work. We knew this would be significantly faster, but it would be incredibly frustrating for those who had done the work, and difficult to clearly explain or predict what the impact would be. We’d also never tried to recover a database backup that old into a live system before — we’d only ever done it in our ‘doomsday’ scenario planning where a natural disaster in Sydney impacts all our servers. In that scenario, we’d have days, not hours, to bring Stile back online.

By mid-afternoon, we had the ability to go live with the data from 1:30am (Option 3), but with the school day mostly over, we made the decision to wait for Option 1. This meant we wouldn’t lose any data, but we had to wait for the capacity increase on this third database to finish.

In parallel with this database recovery operation, our infrastructure engineering team was working with the database experts at AWS to try to understand what had gone wrong and how we could avoid the issue in the future. We’re always uneasy when we can’t clearly explain why something has happened, and this was no exception. By late afternoon, the joint teams had isolated the issue to a particular part of the database software, and on the strong recommendation of AWS’s experts, we made a configuration change to bypass the suspicious component by replacing it with an older version.

By 6:30pm Melbourne time, Stile was back up and running as normal with our spare database. We had not lost any data, and we had made configuration changes that we believed would prevent us from running into the same issue again.

Wednesday 29th April

As traffic to Stile ramped up at the start of school on Wednesday morning, everything was looking fine, but by around 10am we could see Stile was starting to slow down. We were starting to see an incredibly strange oscillation that brought Stile to a standstill about once every 5 minutes, then it would clear and start working again. It was getting so slow that it was bordering on unusable for some people. Our monitoring systems told us too many people were receiving errors as they used Stile.

Cyclic bursts of traffic to Stile’s servers, with incredibly high response times on Thursday 30th April. On the left you can see the number of requests for data our servers are receiving every second. On the right, you can see the time it takes for the database to respond. As you can see, some requests were taking over 10 seconds. That’s a long time to be looking at a loading spinner.
What it should normally look like. This is the same window of time on Friday 1st May.

We couldn’t understand why this was happening. The load on Stile was not higher than Monday, which we handled beautifully. No single component of Stile’s infrastructure appeared to be in any way stressed.

So, without any other options, we made the decision to start removing load so students could keep working.

A brief aside for some background on removing load.

We always like to have a Plan B, especially when dealing with something as new and different as COVID-19. While we had some great hypotheses and anecdotal stories about how teachers and students might use Stile differently in the distance learning context created by COVID-19 social distancing measures, there was no way to actually know what would happen. There was no prior experience of anything even remotely similar to draw upon. We thought we’d made a good estimation of the amount of extra traffic, and we’d allowed a buffer on top of that. But in the event that we got it substantially wrong, we built some technology that would allow us to do two things:

  1. Slow down or stop a variety of features that we considered “non-essential” to the core workflows of teachers being able to release lessons to their students, and those students being able to do that work. For example, access to the Markbook, Analytics and modifying lessons. This isn’t to say they aren’t important features that teachers rely on every day, just that in the worst-case scenario, teachers could live without them to preserve the ability for their students to work.
  2. Disable Stile access to schools or teachers that are trialling Stile for free (any school that hasn’t used Stile before is entitled to a free trial) to ensure we prioritise access to Stile, in an emergency situation, for our paying customers.

We thought it was highly unlikely that we would need these tools, and we had so little time to build them that, frankly, we built them poorly. The tools worked in the sense that they enabled us to remove load, but they had a flaw: they didn’t explain to users why certain features of Stile weren’t working. They simply left pages like the Markbook looking and feeling broken, with infinite spinners and worse.
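To make the missing piece concrete, here is a minimal sketch of a load-shedding switch that always returns an explanation alongside a refusal. It is an illustration only, not Stile’s actual tooling: the `LoadShedder` class, the `Tier` enum, the feature names as strings and the message wording are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set, Tuple


class Tier(Enum):
    """Hypothetical account tiers, used only for this illustration."""
    PAYING = "paying"
    FREE_ACCESS = "free_access"


@dataclass
class LoadShedder:
    """Hypothetical switchboard for turning off non-essential features under load."""
    disabled_features: Set[str] = field(default_factory=set)
    free_access_blocked: bool = False

    def check(self, feature: str, tier: Tier) -> Tuple[bool, str]:
        """Return (allowed, message); the message explains any refusal."""
        if self.free_access_blocked and tier is Tier.FREE_ACCESS:
            return False, (
                "Stile is under unusually heavy load, so free access is "
                "temporarily paused. Please try again later."
            )
        if feature in self.disabled_features:
            return False, (
                f"{feature} is temporarily switched off so students can "
                "keep working. Your data is safe."
            )
        return True, ""


# Example: switch off two non-essential features, then check access.
shedder = LoadShedder()
shedder.disabled_features.update({"Markbook", "Analytics"})
allowed, message = shedder.check("Markbook", Tier.PAYING)
```

In this sketch the page would render `message` in place of the disabled feature, so a teacher sees a reason rather than an infinite spinner.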

One of the first ways we tried to remove load was to disable access for schools that we’d provided with free access to help them through COVID-19 (sorry!).

We gave schools free access because it was the right thing to do. We didn’t want students to miss out on an education at home just because their teachers, through no fault of their own, didn’t have access to the best possible resources. It was going to be expensive, but it didn’t matter. However, we knew we could only do that if it wasn’t at the expense of our paying customers — this was only fair. I 100% stand by this decision. We believed we could do it and so we were morally obligated to do so. It was, for me, a clear example of us living one of our core values: education before profit. Some customers have said to us that COVID-19 is “every man for himself”, arguing that we shouldn’t give anyone a free pass. We respectfully disagree.

It’s a moot point though because, unfortunately, it didn’t help at all. As we’ve since verified with some analysis, the proportion of traffic coming from these free access schools was minimal. Instead, the significantly increased traffic came from where we thought it probably would: teachers who have used and trusted Stile for years in their classrooms. They were using it more than ever, and fair enough too.

So we were left with one option: start removing the ability for teachers to use many of the features they’ve come to rely on. I don’t regret us doing this, because it achieved what we needed: record numbers of students could still get their work done. I do regret how poorly we communicated what we were doing. While we were providing regular updates on our status page and on our Facebook group, these are not widely known or used. We needed to provide timely, relevant information inside the Stile platform itself and via email so teachers could make decisions about how to adapt and proceed.

By late Wednesday afternoon, we had a list of possible theories on what was going wrong. We took Stile offline at 10:30pm on Wednesday evening and the team worked all night to test theories and implement changes.

Thursday 30th April

Unfortunately, while Stile was definitely working better thanks to the work completed the preceding evening, Thursday was more of the same.

By 4pm on Thursday, we decided to test what, upon reflection, seems like an obvious thing to try: remove the configuration changes recommended to us by AWS. It was a tough decision; had this been protecting us from the original bug that caused the original crash? It also meant a few minutes of poor performance and risked taking us offline again.

We took the calculated risk. All the problems immediately disappeared. It was like nothing had ever happened.

While most of the engineering team was consumed with simply keeping Stile running (and the rest of the business was answering thousands of teacher, student and parent emails), one group was working to better understand the original bug from Tuesday and had made significant progress.

Again, we worked into the early hours of the morning to implement changes to Stile that we believed would prevent us from running into that bug again.

Friday 1st May

Stile handled record numbers of students and teachers online. Everything was normal.

During the course of Friday alone, approximately 150,000 students used Stile from Australia and New Zealand, including approximately 10% of all years 7–10 students in Australia. At the busiest times during the day, over 50,000 were using it at the same time. These numbers are crazy. These are the sort of numbers that crash critical government websites designed to serve a substantial percentage of Australia’s population.

What we learned, and what we’ll do differently going forward.

Whenever a significant failure or unexpected event occurs at Stile, we run what we call a post mortem: a detailed analysis of the situation, how it unfolded, and what we could have done differently. We process that into a list of immediate and long-term actions and set to work.

Here are the key points from this one:

1. We need to get better at communicating with teachers in real-time.

The most frustrating aspect of last week for teachers was Stile simply feeling glitchy or broken, not knowing why, and rightly assuming that if it was broken for teachers, then it was probably also broken for students.

If in the future for whatever reason we need to turn off certain pages, or certain functionality within Stile, we’ll make sure it tells you explicitly what is going on and why. If you’re keen to receive up-to-the-minute information on issues, please subscribe to email or SMS updates via our newly-created status page.

2. We need an “everything is broken” backup strategy for teachers.

While Squiz is a great way to revise and consolidate Stile-aligned science content even when Stile is fully offline, we need something more. We’re investigating the idea of having a simple backup page that houses all our lessons as PDFs, which teachers could access if Stile was down. PDFs are obviously nowhere near the full Stile experience, but they would at least make it possible to get on with learning.

3. We can and will get better at recovering rapidly from full-blown outages.

While we have documented and tested procedures in place for when Stile goes down (and we definitely got more practice last week!), in hindsight, we can see ways that we could have shaved several hours off our end-to-end response time on Tuesday. We also have a better understanding of how to estimate the length of these types of outages, which will allow us to communicate more fully in the future. We’ll be building these into our procedures going forward.

4. We’re reviewing our decision to outsource the management of our core database.

Everything that has been built by Stile is maintained and managed by our in-house engineering team in Melbourne, Australia. They are truly some of the smartest and most dedicated people I’ve ever had the privilege of working with. However, when it comes to the maintenance and operation of our off-the-shelf infrastructure, including databases, there are enormous advantages to having a third party manage it on our behalf. Not only do they theoretically have more expertise through exposure to thousands of similar systems, but those economies of scale make it cheaper, and that’s a saving we can pass on to schools.

But the cost and logistics savings aren’t worth much if they come at the cost of reliability. Such a decision needs rigorous analysis, which we will undertake in the weeks ahead.

5. When something goes wrong, less is more.

When things aren’t working, it can be tempting to throw the scientific method out the window and change more than one thing at a time. Sometimes it’s unavoidable, but more often than not it isn’t. We made the choice to implement AWS’s recommended configuration changes, but we did not introduce them particularly scientifically, because the decision was coupled with a desire to get Stile working again fast. The fact remains that, had we been more scientific, we would have known far sooner that the change was the cause of our problems. The irony is not lost on us.
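As a rough illustration of what “one change at a time” can look like in practice, the sketch below applies each configuration change on its own and keeps it only if a health check still passes. It is not how Stile’s systems work: the function name `apply_one_at_a_time`, the dictionary-based config and the `is_healthy` callback are all hypothetical, and a real health check would watch live metrics over a period of time rather than return instantly.

```python
from typing import Callable, Dict, List, Tuple


def apply_one_at_a_time(
    config: Dict[str, str],
    changes: List[Tuple[str, str]],
    is_healthy: Callable[[Dict[str, str]], bool],
) -> Tuple[Dict[str, str], List[str]]:
    """Apply each (key, value) change alone; keep it only if the system stays healthy.

    Returns the final config and the keys whose change was rolled back.
    The input config is never modified.
    """
    current = dict(config)
    rolled_back: List[str] = []
    for key, value in changes:
        candidate = dict(current)
        candidate[key] = value
        if is_healthy(candidate):
            current = candidate       # the change is safe, so keep it
        else:
            rolled_back.append(key)   # the change broke something, so drop it
    return current, rolled_back


# Toy example: pretend the older component version is what causes the slowdown.
base = {"component_version": "new", "pool_size": "10"}
final, reverted = apply_one_at_a_time(
    base,
    [("pool_size", "20"), ("component_version", "old")],
    lambda cfg: cfg["component_version"] != "old",
)
```

Because each change is tested in isolation, the list of rolled-back keys points directly at the change that caused the regression.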

Once again, on behalf of the Stile team, I apologise.

I deeply appreciate that when Stile doesn’t work, we make life harder for already hard-working teachers, and that it can end up reflecting poorly on your school in the eyes of parents.

But I also want to say a huge thank you. Thank you so much for your patience and understanding. We were buoyed by the torrent of incredibly supportive messages — every one of them makes the extraordinary efforts required to get through these situations worth it, and honestly sustained us through several all-nighters.

I am completely committed to learning from our mistakes and remaining transparent as we do so.

I’m completely committed to doing better for you, our superhero teachers on the front lines.


Byron Scaf

Entrepreneur and science nerd. CEO at Stile, where we're engaging young citizens in the power and wonder of science.