Azure Portal Corrupts Downloaded Files
Have you ever had a situation where you’re working with a tester who has reported a bug, and you can’t for the life of you recreate it? It does after all work on your own machine…
We hit exactly this when a Parquet file downloaded through the Azure Portal Data Explorer came back corrupted and could not be read by Spark or parq. Downloading the same file with the Azure Storage Explorer solved the problem.
We were two days from deploying to production when our tester raised a bug: he could not read the file he had downloaded using the Azure Portal Data Explorer. Running the command below produced a very strange and unhelpful error, and we got the same error using the spark-shell.
C:\> parq view "C:\pathToMyFile\part-00000-tid-6568847690690471989-b6379e19-6e93-4b7b-8a8c-9da4bda0817a-26-c000.snappy.parquet"
parq v3.2.0
[FAIL] don't know what type: 15
don't know what type: 15
We started comparing versions of our Scala binaries to see if there were any obvious differences. This had worked two days earlier, so what had changed? We looked at logic, data types and so on, but found nothing obvious. We tried increasing resources on our Spark cluster, still with no luck.
We ran our integration tests locally with the same command and it all worked a treat. Okay, so the code was fine and produced a valid Parquet file. We manually uploaded the same JAR to the cluster and ran the job again. When the tester downloaded the file, the error occurred again; when the engineer downloaded it, the command worked fine.
This is when it struck me (at the time I was sitting between the tester and the engineer running the job): our tester was using the Azure Portal and the engineer was using the Azure Storage Explorer to download the file. The engineer then tried the portal and, lo and behold, the same error occurred.
Upon further investigation, we noticed that the downloaded files had different sizes: the one from the portal was 5812KB and the Storage Explorer version was 4491KB. This is really strange. Both are supposedly the same binary file, yet the tool used to download it somehow affects the file.
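If you want to confirm that two downloads of the same blob really differ, and where, a small script along these lines can help. It is only a sketch in Python, not something we ran at the time: the function names (sha256_of, first_difference, compare_downloads) are made up for illustration and are not part of any Azure tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b, chunk_size=1024 * 1024):
    """Return the offset of the first differing byte, or None if identical."""
    offset = 0
    with Path(path_a).open("rb") as a, Path(path_b).open("rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        return offset + i
                # One chunk is a prefix of the other: the files differ in length.
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                return None  # Both files ended without any difference.
            offset += len(chunk_a)


def compare_downloads(path_a, path_b):
    """Summarise size, checksum match, and first differing offset."""
    return {
        "size_a": Path(path_a).stat().st_size,
        "size_b": Path(path_b).stat().st_size,
        "same_sha256": sha256_of(path_a) == sha256_of(path_b),
        "first_difference": first_difference(path_a, path_b),
    }
```

Calling compare_downloads("portal.parquet", "explorer.parquet") (placeholder file names) returns both sizes in bytes, whether the SHA-256 digests match, and the offset of the first byte that differs, which is a useful starting point for inspecting the files in a hex viewer.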
We still don’t know why this happened or what other files it could affect. So if you come across this issue, my advice would be to avoid using the Azure Portal and use the Azure Storage Explorer instead.
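Whichever tool you use, one way to check a download is to compare it against the blob's Content-MD5 property, which Azure Blob Storage stores as a base64-encoded MD5 digest when the uploader supplied one. The sketch below is a suggestion rather than something we did: it assumes the property is populated for your blob (it is not always) and that you copy its value by hand from the blob's properties. The function names are hypothetical.

```python
import base64
import hashlib


def content_md5(path, chunk_size=1024 * 1024):
    """Return the base64-encoded MD5 of a file, the format used by Content-MD5."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def matches_blob_md5(path, expected_md5):
    """Compare a local file with a Content-MD5 value copied from the blob's properties."""
    return content_md5(path) == expected_md5.strip()
```

If matches_blob_md5 returns False for a freshly downloaded file, the local copy is not byte-for-byte what was uploaded, regardless of what the file extension or size suggests.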
About the author
Eugene is a Senior Data Engineer at ASOS with a passion for Test Driven Development, Agile Methodologies, Continuous Integration and Delivery using Microsoft Azure