Losing your data center: Lessons learned from Business Continuity Planning (CoSN17)

Mr. Steven Langford, Beaverton School District, Chief Information Officer
Mr. Jordan Beveridge, Beaverton School District, Administrator for Information Technology

Brief Session Description:

The Beaverton School District suffered a catastrophic data center event at the start of school in 2014, which impacted all learning and business systems. During this session we were walked through the leadership and organisational lessons learned from the event, recovery, and from the subsequent Business Continuity and Disaster Recovery planning process.

Steve Langford:
Keep in mind as you read on that this is a district with 5,000 staff plus a lot of students.

The event

August 30th, 2013: the last day before school began, and the start of a holiday weekend.
When they got into the server room at 6am, every server was offline: no SIS, no finance, no access to anything. They tried a restart, but fewer than 50% of the servers came back up. They had no idea what had happened, but knew they had to restore the entire data centre from tape!

By Sunday night they had most systems back in a usable state, bar HR. They thought they had managed the crisis.

On Monday Sept 2nd the servers were up but not in great condition, and the cause of the failure was still not known. Then it was discovered that they had only been making incremental back-ups, with no full back-up to apply them to, which meant that everything was gone: pay stubs, tax reporting, whether employees were on leave or not. They had zero data. That was the start of the crisis. The notification system had also failed. Nothing had been tested, so no one knew when it happened. They had also stopped paying for alarm calls because of false alarms, so the call was missed too when the alarm was real.
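
To show why incremental back-ups alone left them with nothing, here is a minimal sketch of a restore-chain check. It is not the district's tooling; the Backup class, the "full"/"incremental" labels and the restore_chain function are all made-up names, and it assumes each incremental only holds the changes since the previous back-up.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Backup:
    taken_on: date
    kind: str              # "full" or "incremental" (hypothetical labels)
    readable: bool = True  # False if the tape or file cannot be read


def restore_chain(backups: List[Backup]) -> Optional[List[Backup]]:
    """Return the backups usable for a restore, oldest first, or None.

    Assumption: an incremental only holds changes since the previous
    backup, so a restore needs a readable full backup as its base plus
    the unbroken run of readable backups taken after it.
    """
    ordered = sorted(backups, key=lambda b: b.taken_on)

    # Find the most recent readable full backup to use as the base.
    base_index = None
    for i, b in enumerate(ordered):
        if b.kind == "full" and b.readable:
            base_index = i
    if base_index is None:
        return None  # incrementals alone cannot rebuild the data

    chain = []
    for b in ordered[base_index:]:
        if not b.readable:
            break  # everything after an unreadable link is unusable
        chain.append(b)
    return chain
```

Run regularly against the back-up catalogue, a check like this would flag "no restorable base" long before a real restore is needed.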

They had been working too hard, too fast and too thin. They did not have a system up front to monitor and check. They had to be transparent about everything.

Everyone was called, and they in turn called their staff. The next pay day was about 2 weeks away. They called the superintendent, who did the most unexpected thing and sent the poem “If” by Rudyard Kipling. The poem’s message is that you have to stay true to your character. In a crisis, staying true to who you are is so important. It was a great leadership lesson.

The Cause:

Like a murder mystery — there’s no evidence, so whodunnit?
The company they called in to investigate found the following.
At 1:39 in the morning the fire suppression system discharged into the room. There was no fire. The nozzles (which had been recalled) released the gas at too fast a rate; the gas then slows and causes a sonic boom. Hard drives can’t deal with this, and those that survived would fail within 3–4 months.

They brought in the disaster recovery team, and it was not that team’s first time doing this. They also found a copy of a year-old test database of their HR data and some PDFs of payroll (even though it shouldn’t be done this way). In addition, they had year-old tapes, which had been slightly damaged by heat a year previously.

https://upload.wikimedia.org/wikipedia/commons/f/f1/Takabisha_roller_coaster.jpg

There were a lot of ups and a lot of downs. Transparency was maintained throughout, however, and everyone was told of every up and every down.

They did manage to pay everyone, but everyone also had to be told that some of them might get paid a bit more and some might be paid a bit less. All were told that it would be sorted out though.

The old tapes were corrupt, but a company in Belgium had a $4000 product that could copy the data exactly as it was so that it could be read. It worked!

Aftermath:

Data was back, but all reports were gone. This is when you discover which ones are important and need to be rewritten. It took the rest of the school year, Sept-June, to get everything back up and running.

Lessons learned:

TEST!
Practice and simulate!
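
One way to make "TEST!" concrete is a restore drill: restore a back-up to a scratch location and compare it with the source. The sketch below is only an illustration under assumptions. The function names checksums and restore_drill are made up, and it assumes both the source and the restored copy are ordinary directories of files.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def checksums(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            result[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return result


def restore_drill(source: Path, restored: Path) -> List[str]:
    """Return the problems found; an empty list means the drill passed."""
    want, got = checksums(source), checksums(restored)
    # Files that exist in the source but never came back from the restore.
    problems = [f"missing: {name}" for name in want if name not in got]
    # Files that came back but with different contents.
    problems += [
        f"differs: {name}"
        for name in want
        if name in got and want[name] != got[name]
    ]
    return problems
```

A drill like this only proves something if it is run on a schedule and someone reads the result, which is the point of "practice and simulate".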

Starting the Business Continuity Planning

It took a lot of work — resistance was met along the way from the leadership teams but it was overcome.

BCP Phase 1:

Identify who needs to be involved
Identify what needs to be dealt with
Identify the goals

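They ended up with an applications matrix that set down the different processes and, for each one, its Recovery Time Objective (RTO, how long it can be down) and Recovery Point Objective (RPO, how much data can be lost), both in days.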

Once this was identified, a gap analysis was carried out.
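
The matrix and gap analysis can be sketched in a few lines. This is an illustration only, not the district's actual matrix: the AppRow fields, the gap_analysis function and the example figures in the usage note are all hypothetical. It assumes a gap exists wherever current recovery takes longer than the RTO, or back-ups are taken less often than the RPO allows.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AppRow:
    process: str                  # business process
    system: str                   # system the process depends on
    rto_days: float               # target: maximum tolerable downtime
    rpo_days: float               # target: maximum tolerable data loss
    actual_recovery_days: float   # estimated time to restore today
    backup_interval_days: float   # how often a restorable backup is taken


def gap_analysis(matrix: List[AppRow]) -> List[str]:
    """List every row where current capability misses the stated target."""
    gaps = []
    for row in matrix:
        if row.actual_recovery_days > row.rto_days:
            gaps.append(
                f"{row.process} ({row.system}): recovery takes "
                f"{row.actual_recovery_days:g}d, RTO is {row.rto_days:g}d"
            )
        if row.backup_interval_days > row.rpo_days:
            gaps.append(
                f"{row.process} ({row.system}): backups every "
                f"{row.backup_interval_days:g}d, RPO is {row.rpo_days:g}d"
            )
    return gaps
```

For example, a made-up row AppRow("Payroll", "HR", 2, 1, 14, 7) would be reported twice: once for missing its RTO and once for missing its RPO.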

BCP Phase 2:

Identified who in leadership and IT needed to be involved
Identified which processes were mapped to which systems
Identified the goals

BCP Phase 3:

The plan was sent back to each department to be completed

BCP Phase 4:

IT recovery plan

BCP Phase 5:

Management Plan

BCP Lessons Learned: