Data engineering daily problems - 2

Iván Gómez Arnedo
Published in Geek Culture · Aug 10, 2022

Improve your implicit knowledge with this double ration of use cases: a custom URL generator and the transformation of compressed XLSX files into Spark tables. Code examples in Python.

Created with Dall-E — Image by author

Introduction

This article is the second part of the series started in:

I encourage you to read the introduction to the series in the previous article, but if you are lazy enough to want to save a click, let me summarize it for you:

I am going to compile a series of problems that I have encountered in my day-to-day work as a data engineer. They may not have value as individual solutions (if you have not encountered exactly the same problem as me), but they can help in the development of the implicit knowledge of the field.

My idea is to compile these examples in this series of articles so that, in the best case, they can help someone facing the same problem and, in the average case, they can serve as a curiosity or help develop that implicit knowledge.

The dose of problems for this article is related to downloading files from the Internet. In the first one we have to infer the possible URL from which to download a file, and in the second one we have to process a ZIP file containing several XLSX files so that each sheet can be used to create a table in the Spark metastore.

Problems

Problem 3: URL generator

Description

Download and extract text from a PDF file generated each week by a human and published at an unknown URL. The file follows a naming convention, but it is clear that it is created by a human because that convention is only partly followed. Some considerations:

  • We do not know what day the file will be uploaded: the date in the URL of the file does not have to be the date it was uploaded (4-day range).
  • As in normal TDD, when a new bug is identified, the test used to detect that bug has to be added to our test pool. This case follows a similar approach: when a new URL is not identified by our process, the case to build that URL format has to be added to our code. That is why we need an extensible process (as the searched URL could be the result of applying several transformations).

Examples:

web.es/f/files/Press%20notes/PRESS_NOTE_DATE_27_9_20.pdf
web.es/f/files/Press%20notes/PRESS_NOTE_DATE_30_09_20.pdf
web.es/f/files/Press%20notes/PRESS_NOTE_DATE_14_02_2021.pdf
web.es/f/files/Press%20notes/PRESS_NOTE_DATE_30_9_2021.pdf
web.es/f/files/Press%20notes/PRESS_NOTE_DATE_2_7_2021.pdf
web.es/f/files/Press%20notes/PRESS_NOTE_DATE_2_07_2021.pdf

We have identified 6 different date formats, but what if we find another date format like:

web.es/f/files/Press%20notes/PRESS_NOTE_DATE__23_03_22.pdf

It would be a nightmare to code conditionals for all the different cases we could face. Furthermore, each time we want to handle a new case that can be combined with the others (like the previous example), we would have to repeat a lot of code and our program would become unmanageable.

Solution

Use of combinations: itertools

An algorithm that generates the possible names of the PDF file and checks whether any of them matches an existing file.

The entry point of the execution is the method “search_pdf_and_get_url”; the process executed there is as follows:

First, the possible dates are generated. We said before that there is a 4-day range; to be more specific, that range is (-1, 3), so if today is the 25th of July, we will try to infer the URL using the dates 24th, 25th, 26th and 27th of July. This is calculated in the method “get_dates_to_check”.
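
The original gist is not reproduced here, but a minimal sketch of what “get_dates_to_check” could look like, assuming it works with standard datetime.date objects (the signature is my assumption):

from datetime import date, timedelta

def get_dates_to_check(today=None):
    """Return the candidate publication dates for the (-1, 3) range described above."""
    today = today or date.today()
    return [today + timedelta(days=offset) for offset in range(-1, 3)]

# Example: for the 25th of July we get the 24th, 25th, 26th and 27th.
print(get_dates_to_check(date(2021, 7, 25)))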

Then, the possible URLs are generated for that day range. First we have to generate the possible combinations; to do so, we call “generate_combinations”, which returns something like:

[[0], [1], [2], [3], [1, 2], [1, 3], [2, 3], [1, 2, 3]]
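
A sketch of how “generate_combinations” could produce exactly that list with itertools (the real implementation may differ in details):

from itertools import combinations

def generate_combinations(number=3):
    """0 means "base URL, no transformation"; 1..number are transformations
    that can be applied alone or combined."""
    result = [[0]]
    for size in range(1, number + 1):
        result.extend(list(combo) for combo in combinations(range(1, number + 1), size))
    return result

print(generate_combinations(3))
# [[0], [1], [2], [3], [1, 2], [1, 3], [2, 3], [1, 2, 3]]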

Once we have the combinations, the possible URLs are created by assigning each number to a method/functionality (all transformations can be applied to the URL at the same time; for example, in the combination [1, 2, 3], leading zeros are removed from the day [1] and from the month [2], and the remaining digits are added to the year [3], e.g. 21 becomes 2021).

Of course, the combinations returned by itertools could have been hard-coded, but what if we want to add a new way of formatting the date? Then we would only have to change the input of that method (from number=3 to number=4, to generate all the possible combinations) and add a new conditional for the ‘4’ case to the “generate_all_urls_per_fecha” method (for example: swap the order of day and month).
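
As an illustration, a hedged sketch of how “generate_all_urls_per_fecha” could map each number to a transformation; the base URL pattern and the exact mapping (1 = day without leading zero, 2 = month without leading zero, 3 = four-digit year) are assumptions derived from the examples above:

# Assumed base pattern, reconstructed from the example URLs.
BASE_URL = "web.es/f/files/Press%20notes/PRESS_NOTE_DATE_{day}_{month}_{year}.pdf"

def generate_all_urls_per_fecha(fecha, combinations_list):
    """Build every candidate URL for one date by applying each combination of transformations."""
    urls = set()
    for combination in combinations_list:
        day = f"{fecha.day:02d}"          # base format: zero-padded day
        month = f"{fecha.month:02d}"      # base format: zero-padded month
        year = f"{fecha.year % 100:02d}"  # base format: two-digit year
        if 1 in combination:
            day = str(fecha.day)          # 02 -> 2
        if 2 in combination:
            month = str(fecha.month)      # 07 -> 7
        if 3 in combination:
            year = str(fecha.year)        # 21 -> 2021
        urls.add(BASE_URL.format(day=day, month=month, year=year))
    return urls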

Finally, with all the generated URLs in a set (to avoid duplicates), we check whether any of them exists; to do so, the method “check_if_url_exist” is used.
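
A sketch of the last two pieces, “check_if_url_exist” and the orchestrating “search_pdf_and_get_url”, built on top of the previous sketches; the use of requests is my assumption, the original code may use another HTTP client:

import requests

def check_if_url_exist(url):
    """Return True if the URL answers with a 200 status code."""
    try:
        return requests.head(url, allow_redirects=True, timeout=10).status_code == 200
    except requests.RequestException:
        return False

def search_pdf_and_get_url():
    """Generate the candidate URLs for every date in the range and return the first one that exists."""
    all_urls = set()
    for fecha in get_dates_to_check():
        all_urls |= generate_all_urls_per_fecha(fecha, generate_combinations(3))
    for url in all_urls:
        if check_if_url_exist(url):
            return url
    return None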

Doing so, we stick to the DRY principle, and extending the functionality of our algorithm (which is going to happen for sure, as the new cases depend on humans) becomes a much easier task.

Question to the readers: does anyone know which methodology I have used here?

Problem 4: Transform compressed XLSX files to Spark tables

Description

Download a compressed file containing XLSX files with an undetermined number of sheets. Each of these sheets has to be transformed into a table that can be referenced from the Spark metastore. Those sheets are not always in a table-like format (the first row is not always the header), so we have to process them in such a way that the unnecessary rows are removed automatically.

Solution

Use of Pandas + openpyxl (the Python library used by Pandas to read Excel files).

The .zip file is downloaded from the URL and extracted.

Once all extracted files are in the same path (xlsx_path), the XLSX files on that path are stored in a list, as we are not sure how many files the compressed one contains.
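
A minimal sketch of these two steps, assuming the archive is downloaded with requests and extracted with the standard zipfile module (the URL and folder name are placeholders):

import io
import zipfile
from pathlib import Path

import requests

def download_and_extract(url, xlsx_path="extracted_xlsx"):
    """Download the ZIP file, extract it and return the list of XLSX files it contained."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
        archive.extractall(xlsx_path)
    # We do not know in advance how many XLSX files the archive contains.
    return sorted(Path(xlsx_path).glob("*.xlsx"))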

Pandas is used to read the Excel file and, for each sheet in that Excel file, starting from the beginning, a valid header is searched: a row with the same number of columns as the rest of the file, in other words, a row without “Unnamed” columns (Pandas auto-fills missing header cells with names like “Unnamed: X”).
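
A sketch of that header search, assuming we simply re-read the sheet with an increasing header offset until no “Unnamed” columns remain (the function name and the limit of rows to try are my assumptions):

import pandas as pd

def read_sheet_with_valid_header(xlsx_file, sheet_name, max_rows_to_try=10):
    """Re-read the sheet, moving the header row down until no 'Unnamed' columns remain."""
    for header_row in range(max_rows_to_try):
        df = pd.read_excel(xlsx_file, sheet_name=sheet_name,
                           header=header_row, engine="openpyxl")
        # pandas names missing header cells "Unnamed: 0", "Unnamed: 1", ...
        if not any(str(column).startswith("Unnamed") for column in df.columns):
            return df
    raise ValueError(f"No valid header found in sheet '{sheet_name}' of '{xlsx_file}'")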

Finally, for each of these sheets, a new CSV file is created by combining the name of the Excel file to which it belongs and the name of the sheet (so we can easily identify the original file). The CSV is stored and a Spark table is then created with the following code:

CREATE TABLE table_name USING CSV OPTIONS (header 'true') LOCATION 'path_to_csv'
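
Putting it together, a hedged sketch of this last step: one CSV per sheet, named after the Excel file and the sheet, and a table registered through spark.sql (the spark session and the read_sheet_with_valid_header helper from the previous sketch are assumed to be available):

from pathlib import Path

import pandas as pd

def create_spark_tables(spark, xlsx_files, csv_path="csv_output"):
    """Store every sheet as <excel_name>_<sheet_name>.csv and register a Spark table on top of it."""
    Path(csv_path).mkdir(parents=True, exist_ok=True)
    for xlsx_file in xlsx_files:
        for sheet_name in pd.ExcelFile(xlsx_file, engine="openpyxl").sheet_names:
            df = read_sheet_with_valid_header(xlsx_file, sheet_name)
            table_name = f"{Path(xlsx_file).stem}_{sheet_name}".lower().replace(" ", "_")
            csv_file = f"{csv_path}/{table_name}.csv"
            df.to_csv(csv_file, index=False)
            spark.sql(f"CREATE TABLE {table_name} USING CSV "
                      f"OPTIONS (header 'true') LOCATION '{csv_file}'")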

Another option could have been to use a Spark library (only available for Scala) that allows us to load an Excel file (XLSX) as a DataFrame. I made some tests with that library and the results were not as good as with Pandas + openpyxl, but for other, simpler use cases that library could be more than enough.

“The success formula: solve your own problems and freely share the solutions.”
Naval Ravikant

Conclusion

In this article we have seen:

  • Problem 3: How to create a URL generator that checks several possible naming conventions.
  • Problem 4: How to download a compressed file, extract its XLSX files and, for each of them, extract their sheets as CSV files, process them and create a Spark table from each one.

The aim of this type of article is not only to expose a solution to a very concrete problem, but to help develop implicit knowledge by exposing readers to concrete problems and the process followed to solve them.
