3 Steps to Improve the Data Quality of a Data Lake
Read the original article on Sicara’s blog here.
This article shares an approach to making the data injection flow in a data lake more transparent by monitoring custom logs in Kibana.
My previous project was dedicated to building a data lake. Its purpose was to inject gigabytes of data from various sources into the system and make them available to multiple users within the organization. As it turned out, it was not always easy to verify that all the data had been inserted successfully, and even when a problem was evident, it took hours to identify its cause. Hence, there was no doubt that the system needed fixing.
From a technical perspective, the solution we proposed can be divided into three main blocks:
- logging the necessary information in the code,
- indexing logs in Elasticsearch using Logstash,
- visualizing logs in custom dashboards on Kibana.
In software development, logging is a means of opening up the black box of a running application. As the app grows in complexity, it becomes trickier to figure out what is going on inside, and this is where logs become more valuable. Who could benefit from them? Both developers and software users! Thanks to logs, developers can retrace the path the program took and get a hint about where a bug might be, while users can obtain the necessary information about the program and its output, such as the execution time, details about the processed files, etc.
To improve the robustness of the application, the logs had to meet two requirements. First, we wanted them to be customized so that they contain only the data we are interested in. Hence, it is important to think about what really matters in the application: it may be the name of a script or an environment, the execution time, the name of the file containing an error, etc. Second, the logs should be human-readable so that a problem can be detected as quickly as possible regardless of the volume of processed data.
Step 1: Logging essential information in the code.
The first sub-goal is to prepare logs that can be easily parsed by Logstash and Elasticsearch. For that reason, we keep the log messages as multi-line JSON that contains the information we would like to display: log message, timestamp, script name, environment (prod or dev), log level (debug, info, warning, error), and stack trace.
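To illustrate why this format is easy to consume, here is a minimal Python sketch that reads back a stream of multi-line JSON records. The field names (`message`, `timestamp`, `script`, `environment`, `level`, `stack_trace`) and the sample values are assumptions made up for the example, and `iter_json_records` is a hypothetical helper. In the actual pipeline, the parsing is done by Logstash, not by Python code like this.

```python
import json


def iter_json_records(text):
    """Yield each JSON object from a string of concatenated, multi-line JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over the
        # newlines that separate one record from the next.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        record, pos = decoder.raw_decode(text, pos)
        yield record


# Illustrative log content (made-up values, assumed field names).
sample_log = """{
  "message": "Injection started",
  "timestamp": "2019-01-01T10:00:00",
  "script": "main.py",
  "environment": "dev",
  "level": "INFO"
}
{
  "message": "Could not read input file",
  "timestamp": "2019-01-01T10:00:05",
  "script": "main.py",
  "environment": "dev",
  "level": "ERROR",
  "stack_trace": "Traceback (most recent call last): ..."
}
"""

records = list(iter_json_records(sample_log))
# Because every record is structured, filtering by level is a one-liner.
errors = [r for r in records if r["level"] == "ERROR"]
```

Each record stays readable for a human because it spans several lines, and it remains unambiguous for a machine because each one is a complete JSON object.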
The code below can help you create customized logs in JSON format for a mock application that consists of the following parts: the application body is written in the main.py script, the logger object is defined in logging_service.py, and its parameters are described in logging_configuration.yml. To add specific fields to the logging statement, we wrote a CustomJsonFormatter class that overrides the add_fields method of its superclass imported from the pythonjsonlogger package. The get_logger function from logging_service.py returns a new logger with the desired configuration. Note: the best practice is to define the logger at the top of every module of your application.
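As a rough single-file illustration of the same idea, the sketch below uses only the Python standard library instead of pythonjsonlogger and the YAML configuration, so it is a stand-in and not the setup described above. It reuses the names CustomJsonFormatter and get_logger for readability, but the constructor arguments, the JSON field names and the default values are assumptions.

```python
import json
import logging
from datetime import datetime, timezone


class CustomJsonFormatter(logging.Formatter):
    """Stdlib-only stand-in: formats each log record as multi-line JSON."""

    def __init__(self, script_name, environment):
        super().__init__()
        self.script_name = script_name  # assumed field: name of the running script
        self.environment = environment  # assumed field: "prod" or "dev"

    def format(self, record):
        log_record = {
            "message": record.getMessage(),
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "script": self.script_name,
            "environment": self.environment,
            "level": record.levelname,
        }
        # Attach the stack trace only when the record carries exception info.
        if record.exc_info:
            log_record["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(log_record, indent=2)


def get_logger(name, log_file="file.log", environment="dev"):
    """Return a logger that appends JSON records to log_file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    # Avoid stacking duplicate handlers when called more than once.
    if not logger.handlers:
        handler = logging.FileHandler(log_file)
        handler.setFormatter(
            CustomJsonFormatter(script_name=name, environment=environment)
        )
        logger.addHandler(handler)
    return logger
```

With this sketch, a module would call something like `logger = get_logger("main.py")` once at the top and then use `logger.info(...)`, or `logger.exception(...)` inside an `except` block so that the stack trace is included in the record.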
To create file.log, place the files above in the same folder and run the following command from your terminal:
Step 2: Indexing logs in Elasticsearch using Logstash.
To improve the readability of the logs, we used the ELK stack: the combination of three open-source projects, Elasticsearch, Logstash, and Kibana. There are many articles that can give you insights into what it is and its pros and cons.
Read the full article on Sicara’s blog here.