It worked on my machine…

Azure Portal Corrupts Downloaded Files

Credit: http://www.keepcalmandposters.com/poster/5273092_keep_calm_it_works_on_my_machine

Have you ever had a situation where you’re working with a tester who has reported a bug, and you can’t for the life of you recreate it? It does, after all, work on your own machine…

We hit exactly this: downloading a parquet file through the Azure Portal Data Explorer corrupted it and made it unreadable by Spark and parq, whereas downloading the same file with the Azure Storage Explorer gave us a readable copy.

We were two days from deploying to production when our tester raised a bug, reporting that he could not read the file he had downloaded using the Azure Portal Data Explorer. Running the command below, we got a very strange and unhelpful error. We got the same error using the spark-shell.

C:\> parq view "C:\pathToMyFile\part-00000-tid-6568847690690471989-b6379e19-6e93-4b7b-8a8c-9da4bda0817a-26-c000.snappy.parquet"
parq v3.2.0
[FAIL] don't know what type: 15
don't know what type: 15

We started comparing versions of our Scala binaries to see if there were any obvious differences. This worked two days ago, so what had changed? We looked at logic, data types and so on, and found nothing obvious. We tried increasing resources on our Spark cluster, and still had no luck.

We ran our integration tests locally, used the same command, and it all worked a treat. Okay, so the code is fine and produces a valid parquet file. We manually uploaded the same JAR to the cluster and ran the job again. The tester downloaded the file and the error occurred again; the engineer downloaded the file and the command worked fine.
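If you want a quick check on a downloaded copy that does not depend on parq or Spark, one option is to look at the file's magic bytes: a standard Parquet file starts and ends with the four bytes PAR1. The sketch below is in Python, which is not what we used at the time, and the function name looks_like_parquet is made up for illustration. We never established what kind of corruption the portal introduced, so this check may or may not have caught our case; it only rules out the most basic damage.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes.

    Hypothetical helper: this is a structural sanity check only and does
    not prove the rest of the file is intact.
    """
    data_path = Path(path)
    # Smallest possible layout: header magic + 4-byte footer length + footer magic.
    if data_path.stat().st_size < 12:
        return False
    with data_path.open("rb") as handle:
        head = handle.read(4)
        handle.seek(-4, 2)  # whence=2 means "relative to end of file"
        tail = handle.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Calling looks_like_parquet on each downloaded copy gives a True or False answer in milliseconds, which is a cheaper first step than spinning up a spark-shell.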

Credit: http://qadesigngurus.blogspot.com/2016/05/dev-vs-qa.html

This is when it struck me (at the time I was sitting between the tester and the engineer running the job): our tester was using the Azure Portal and the engineer was using the Azure Storage Explorer to download the file. The engineer then tried the portal and, lo and behold, the same error occurred.

Upon further investigation, we noticed that the downloaded files had different sizes: the one from the portal was 5812KB and the Storage Explorer version was 4491KB. This is really strange, given that both are supposedly the same binary file, yet the tool used to download it somehow changed the content.
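Comparing two downloads by eye in a file manager works, but it is easy to script. The following is a minimal sketch in Python, assuming you have both copies on local disk; the names fingerprint and same_download are illustrative and not part of any tool mentioned here. It compares the byte count and a SHA-256 hash, so it also catches the case where the sizes happen to match but the bytes differ.

```python
import hashlib
from pathlib import Path


def fingerprint(path, chunk_size=1024 * 1024):
    """Return (size_in_bytes, sha256_hex) for a file, reading it in chunks."""
    digest = hashlib.sha256()
    size = 0
    with Path(path).open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def same_download(path_a, path_b):
    """True only if both copies have identical size and content hash."""
    return fingerprint(path_a) == fingerprint(path_b)
```

For example, same_download("portal_copy.parquet", "explorer_copy.parquet") (both file names are placeholders) would have returned False for our two downloads, since the sizes alone differed.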

We still don’t know why this happened or what other files this could have an effect on. So if you come across this issue, my advice would be to avoid using the Azure Portal, and instead use the Azure Storage Explorer.

About the author

Eugene is a Senior Data Engineer at ASOS with a passion for Test Driven Development, Agile Methodologies, Continuous Integration and Delivery using Microsoft Azure.

Eugene Niemand

Lead Data QA Engineer at ASOS.com - I have a passion for Test Driven Development, Agile Methodologies, Continuous Integration and Delivery using Microsoft Azure
