Solving a malformed JSON bug in Azure Data Flow

Jão
Published in Data Are Lakes
Jan 5, 2023 · 2 min read

If you work with Azure Data Flow, you have probably already hit the error “Malformed records are detected in schema inference” when trying to consume JSON files from a Data Lake directory.

This bug was solved in a working session with my BI team. First, we inspected a set of JSON files in our data lake: since the error appeared only sometimes, we suspected it was caused by a single file, or a small number of files. To find which file was responsible, we tested files one by one until we found the one causing the problem. Inspecting that file, we found the issue:
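Rather than testing files one by one by hand, the same hunt can be scripted. The sketch below is a hypothetical helper (not the script we actually used) that parses every JSON file in a local directory and reports the ones that fail to load:

```python
import json
from pathlib import Path


def find_malformed_json(directory: str) -> list[Path]:
    """Return the paths of .json files in `directory` that fail to parse."""
    bad_files = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            # A file truncated during upload (e.g. missing its closing
            # brackets) raises json.JSONDecodeError here.
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            bad_files.append(path)
    return bad_files
```

Running it over a local copy of the Data Lake directory narrows the search to the exact files that would break schema inference.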

Screenshot of a JSON file missing its final brackets

As you can see, the file is missing some brackets at the end. But why was this happening, when the error didn’t occur in other files? I checked the same directory on my computer and found a perfectly formatted JSON file, as the image below shows:

Screenshot of a JSON file with its final brackets

So I suspected that the error was occurring during the upload of the data by a Python script. To test this hypothesis, I uploaded the file in question manually and ran my data flow again: it worked perfectly. I then checked the Python script and found that it was authenticating with the storage account key rather than the storage account connection string. I changed this, the issue was fixed, and my Data Flow worked perfectly afterwards.
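For reference, a minimal sketch of the corrected approach using the `azure-storage-blob` SDK (`pip install azure-storage-blob`) might look like this. The function name and parameters are illustrative assumptions, not the team’s actual script; the key point is authenticating with `BlobServiceClient.from_connection_string` rather than the bare account key:

```python
def upload_json(local_path: str, container: str, blob_name: str,
                connection_string: str) -> None:
    """Upload a local JSON file to Azure Blob Storage / Data Lake.

    Authenticates with the full storage account connection string,
    not the bare account key.
    """
    # Imported lazily so the sketch only needs the SDK at call time.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(connection_string)
    blob = service.get_blob_client(container=container, blob=blob_name)
    # Upload the raw bytes of the file, replacing any existing blob.
    with open(local_path, "rb") as data:
        blob.upload_blob(data, overwrite=True)
```

A connection string bundles the account name, key, and endpoint, so the SDK can build the correct client configuration instead of being handed a loose key in the wrong parameter.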

One thing I noticed was that when the key was passed as a parameter, the uploaded files came out compressed and smaller; in the case of the problematic file, the size dropped from 24.8 MB to 22.9 MB. I still wonder why this happened: if I was passing a wrong value to the function that makes the connection, the connection should simply fail rather than work precariously.

I wrote this article to help anyone experiencing this problem who doesn’t know how to solve it, because in my experience I couldn’t find any article about this particular issue. If you are using the storage account key instead of the connection string to connect and upload files to the Data Lake, it may produce malformed JSON files.
