Troubleshooting 101

Kris Curtis
Published in CMD'ing Data
Feb 25, 2019

If you manage a service in any business, there will unfortunately come a day when that service breaks. It sometimes feels like a bit of a thankless task.

One of the hats you need to wear is that of a stagehand: someone who works backstage making sure everything works so the show can go on.

There is no spotlight, no curtain call, no encore or round of applause for running a Tableau Server. It's head down and get on with it. If you want a job where you get shout-outs and thank-yous, then maybe have a think about where you want to focus.

The hardy souls who still decide to take on this challenge don’t need the plaudits. Knowing that you are orchestrating the way your company accesses and consumes data might be enough.

So, what happens on the day when you suddenly start hearing from everyone?

Something is broken!

Tableau is down!

I can’t access anything!

You need to be prepared and ready to step into action to save the day. Having a plan means you can apply a systematic approach to solving a challenging problem, rather than running around like a headless chicken.

You want to exude a sense of calm and professionalism. You want people to trust you and know that you have the issue in hand and are doing your best to fix it, even though inside you might be screaming and swearing.

When everything is fine — you don’t hear from anybody. When everything is not fine — you hear from EVERYBODY!

So, I’m going to share my approach to handling situations like one I had last week.

To set the scene: I was offsite, spending the morning networking and presenting at a Tableau event in London (The Tableau Cinema Tour). Right after I came off stage, one of my colleagues who was with me mentioned that “it looks like Tableau is down right now”. Great. Of all the days, I was out of the office. My mind was elsewhere, not on Tableau Server.

Lucky for me, I have a systematic approach and am a process-driven person. It's how I work. Sometimes I’m too process-driven and it drives my wife crazy.

Step 1: Stay calm

This might sound cliché, but if you start stressing and panicking then you lose focus and concentration. If you feel yourself starting to make mistakes, stop. Walk away for 5 minutes. Breathe. Get a drink: water, tea, coffee, Coke, anything which takes you away from your desk, resets your mind and brings you back. Plug in your headphones and find some good tunes. You might be in for the long haul.

Step 2: Stay patient

You are going to get lots of emails, walk-ups, instant messages, text messages and phone calls. On every possible channel of communication, people will think they are the first to let you know that something is wrong. They mean well and are trying to help, but when it's the 20th person saying “Tableau’s broken” you can get a bit snappy.

Again, you need to focus your energy and concentrate on the task. Smile, roll out the standard IT line “I’m working on it”, put your headphones back in and stare blankly at your PC screen. If they don’t take the hint, then disengaging from the conversation will make it clear that you don’t have time to chit-chat.

Step 3: Communicate

Part of stopping all the walk-ups and updates is to communicate. If you get on the front foot, acknowledge the issue and give clear ownership, then people won’t feel they need to let you know. This quickly removes interruptions and allows you to focus. You might only get one or two interruptions from people who don’t read their emails. Quickly pointing out that you “sent an update” will again show your professionalism and ownership of the issue. It is also a polite way of telling them to leave you the hell alone.

Step 4: Rationalise

You’ve communicated that there is an issue and you’ve done some initial assessments. The next thing to do is to really assess what the symptoms are and what is causing them.

My issue last week was related to Google data sources. All of these data sources were failing. No one could connect to a workbook from Server with a Google data source (BigQuery or Google Sheets).

With over 450 workbook extracts failing, and just as many workbooks that could not be connected to, it was a substantial issue. There were lots of key stakeholders asking me what was going on. Business-critical reports were failing.

On the positive side, all the initial comments I had were that Server was down. Upon investigation, I could see from screenshots and my own testing that Server was up and running; it was specific workbooks which were not loading.

Another clue was in the screenshots I collected from these views.

To quote Alan Partridge, “AHA”.

This brought me to a bit of a crossroads over the underlying problem.

Was there an issue between Google OAuth and Tableau?

Was there an issue with access to our Google databases?

Was there an issue with Google?

Step 5: Contact Support

Don’t be afraid to ask for help. Help can come in many forms. It might be as straightforward as finding out that your issue is a downstream impact from a greater issue.

In my case I reached out to my Tableau support team. I raised a support ticket straightaway. They helped me confirm my suspicions, and I confirmed that no changes had been made to Tableau Server.

I also used the Tableau community to search for other clues. Had anyone else mentioned a similar issue? Was anyone talking about Tableau and Google on Twitter? I was not able to find anything, which told me that it was a one-off issue on our side.

I was also able to connect to the Google databases via Tableau Desktop. This eliminated another suspect: there was no problem with access from Google, and our databases were fine.

By talking through the issue with my Tableau support team I was able to systematically rule out pathways which I could have gone down and wasted time on. It left me with one potential cause: something between Google OAuth and Tableau was misbehaving.

Step 6: Resolution

So I reached a point where I knew that there was something wrong between Tableau and Google’s OAuth. My assumption was validated by looking into the supporting metadata: the Tableau logs and the PostgreSQL data all confirmed that the data extract failures were due to authentication issues.

Making use of these resources is also essential in your troubleshooting toolkit. I knew which report to use to find out my error details from extract failures.
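To give a feel for what "finding out the error details" can look like in practice, here is a minimal Python sketch that tallies failure messages by likely cause. It is only an illustration: the keyword lists and the sample messages in the usage note are made up for this example and are not real Tableau log lines or repository columns, so adapt them to whatever your own report exports.

```python
from collections import Counter

# Hypothetical keyword lists (an assumption for this sketch); tune them to
# the wording that actually appears in your own failure report.
CATEGORIES = {
    "authentication": ("oauth", "authentication", "token", "credential"),
    "network": ("timed out", "timeout", "connection refused", "unreachable"),
}


def categorise(message: str) -> str:
    """Return the first category whose keyword appears in the message."""
    text = message.lower()
    for category, keywords in CATEGORIES.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"


def summarise_failures(messages) -> Counter:
    """Count how many failure messages fall into each category."""
    return Counter(categorise(message) for message in messages)
```

For example, `summarise_failures(["OAuth token refresh failed", "Disk full"])` counts one "authentication" failure and one "other". If nearly every failure lands in the same bucket, as mine did, that is a strong hint the extracts share a single root cause.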

While your investigation continues, you should still be communicating updates to your stakeholders. Email and chat rooms should suffice.

Now that I was fairly sure what the issue was, I wanted to confirm it. Accessing my virtual machines, I could see that they were still able to communicate with each other, but when I tested external connectivity I could see the problem: there was no internet access at all. That matters because when Tableau Server tries to refresh an extract, the Google OAuth verification step has to go out over the web.

No internet on these boxes meant no authentication, which resulted in failures for workbook extracts and for live connections to workbooks.
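The internal-versus-external check I describe can be sketched in a few lines of Python. This is a rough sketch under assumptions, not the exact test I ran: the internal host name is a made-up placeholder, and I am assuming `oauth2.googleapis.com` and `accounts.google.com` on port 443 are reasonable stand-ins for "can this box reach Google's sign-in services".

```python
import socket

# Assumed targets for illustration only. The internal host is a placeholder;
# the Google hosts are an assumption about which endpoints matter for OAuth.
TARGETS = {
    "internal: another server node (placeholder)": ("tableau-node-2.example.internal", 443),
    "external: Google OAuth token host": ("oauth2.googleapis.com", 443),
    "external: Google accounts": ("accounts.google.com", 443),
}


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, timeouts and missing routes.
        return False


def connectivity_report(targets, timeout: float = 3.0) -> dict:
    """Map each label to whether its (host, port) target was reachable."""
    return {
        label: can_reach(host, port, timeout)
        for label, (host, port) in targets.items()
    }


def print_report(targets, timeout: float = 3.0) -> None:
    """Print one OK/FAIL line per target."""
    for label, ok in connectivity_report(targets, timeout).items():
        print(f"{'OK  ' if ok else 'FAIL'} {label}")
```

Running `print_report(TARGETS)` on the server itself gives a quick picture. If the internal target is OK and every external one fails, you are looking at the same pattern I saw, and it is time to talk to your networking team.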

So by working through this sequence I found the cause of the issue. I was able to quickly raise a critical incident and our networking team quickly resolved the issue.

During some networking maintenance the default route was removed accidentally. Mistakes happen. Although I was a bit upset that this happened and had an impact on my system, I have to continue a professional relationship with this team. People are human and there is no use pointing fingers and laying blame. As long as people acknowledge the mistake and apologise, things should proceed as normal.

Step 7: Learnings

Once you restore your service back to normal you should always look to capture exactly what happened to learn from the issue. In this case, another team caused the issue by not following standard procedure.

This is being addressed internally, and communication and testing will be improved after this incident. You are not only improving the ways of working but also improving processes along the way. Now that's where I get my satisfaction.

To summarise: outages will happen. It's up to you to decide how to handle these tests. If you can solve a complex problem, keep calm and handle the pressure like a pro, then you might get a thank-you at the end of it. But don’t always count on it.

Kris Curtis
CMD'ing Data

A data professional for 17 years, focusing on educating and creating possibilities for business users to embrace the use of data.