Sogeti’s Data Quality Wrapper

Automating your data pre-processing with Streamlit

Tijana Nikolic
Sogeti Data | Netherlands

--

A Sogeti branded gif representing data preprocessing and reporting.

Sogeti NL has a large data science team that's always looking for methods to ensure transparency, ethics, and quality in their AI development process. Additionally, we are involved in ITEA IVVES, a project that focuses on testing AI models in various development phases. As part of this project, we developed the Data Quality Wrapper (DQW), an app for automated EDA, preprocessing, and data quality report generation. Its goal is to automate the preprocessing of data, but also to educate aspiring and experienced data scientists about the different methods that can be used to improve data quality.

While looking for a way to build an app around this solution, we found Streamlit, a framework that makes it easy to develop apps for ML projects and experiments. I've already written about how easy it is to develop apps with it.

In this blog post, we will go through the purpose of the app and its sections, as well as the Streamlit components and packages used to develop it. We will also point to the scripts where the code is located (look for the 🔍 emoji).

TL;DR? Try out the app. 🚀 Or jump into the code! 👩🏽‍💻

The purpose of the app

Sogeti's DQW is used as an accelerator for another product we are developing: the Quality AI Framework (QAIF), used to test AI models in all phases of the AI development cycle. The framework provides a practical and standardized way of working that outputs trustworthy AI. The DQW sits in the Data Understanding and Data Preparation phases of this framework, as an accelerator that ensures the data going into a given ML model is of suitable quality and representative.

An infographic of Sogeti’s QAIF.
The phases of the QAIF and where the DQW is positioned.

The best thing about the app is that it can be applied to more than one data format, including:

  • Structured data. Data in a well-defined, tabular format, used in various ML applications.
  • Unstructured data. This includes images, used in computer vision algorithms such as object detection and classification; text, used in NLP models, be it for classification or sentiment analysis; and audio, used in audio signal processing algorithms such as music genre recognition and automatic speech recognition.
  • Synthetic data. Synthetic data evaluation is a critical step of the synthetic data generation pipeline. Validating the synthetic training set ensures model performance will not be impacted negatively.

These data formats define the app sections, which you can toggle through in the main selectbox. Each of the sections has multiple subsections, which we will go through in the next few paragraphs.

The packages used to enable the EDA (description, visualization) and preprocessing (selection) of these data formats are covered in the sections below. Please note that these are packages we recommend using, not a definitive guide.

Let’s jump into the app sections!

Structured data section

The section of the DQW dedicated to structured data offers automated EDA and preprocessing of your data. The code is placed in the tabular_eda folder.

The structured data subsections include one file analysis and preprocessing, two file comparison, and synthetic data comparison. Let's go through them.

The one file analysis subsection uses pandas-profiling, which is easy to set up thanks to the Streamlit pandas-profiling component. An example of the setup is below.
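
A minimal sketch of the component, assuming the dataframe comes from an upload widget elsewhere in the app:

```python
import pandas as pd
from pandas_profiling import ProfileReport
from streamlit_pandas_profiling import st_profile_report

# In the app, the dataframe comes from a Streamlit upload widget
df = pd.read_csv("your_file.csv")

# Generate the profiling report and render it as a Streamlit component
report = ProfileReport(df, title="One file analysis", explorative=True)
st_profile_report(report)
```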

One file preprocessing with PyCaret — a very useful package for workflow automation. In the DQW, we rely on the setup() function, which creates the preprocessing pipeline. Streamlit widgets make it quite easy to let the user select which preprocessing steps they want to run. We also display these steps as a diagram, offer a comparison of the original and preprocessed files, and let you download the report and the pipeline pickle file for later use. The pipeline pickle is provided so you can easily use it with the PyCaret modelling functions, especially in the case of imbalanced class mitigation with SMOTE. The sampling needs to happen within the training folds, so you won't see any impact of this method on the datasets themselves, but you will see the difference in model performance when you use the pipeline pickle file.
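
A minimal sketch of how user selections can be passed to setup(), assuming a classification task and PyCaret 2.x (the column names and widget labels are illustrative):

```python
import pandas as pd
import streamlit as st
from pycaret.classification import setup

# In the app, the dataframe comes from a Streamlit upload widget
df = pd.read_csv("your_file.csv")
target = st.sidebar.selectbox("Target column", df.columns)

# Streamlit widgets let the user toggle individual preprocessing steps
normalize = st.sidebar.checkbox("Normalize numeric features")
fix_imbalance = st.sidebar.checkbox("Mitigate class imbalance (SMOTE)")

# setup() builds the preprocessing pipeline from the selected options
experiment = setup(
    data=df,
    target=target,
    normalize=normalize,
    fix_imbalance=fix_imbalance,
    silent=True,  # PyCaret 2.x: skip the interactive dtype confirmation
    html=False,   # avoid notebook-style HTML output inside Streamlit
)
```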

A demo gif of the DQW.

🔍 The code used is in the structured_data.py script, see preprocess and show_pp_file functions.

Two file comparison with Sweetviz — another automated EDA library, extremely useful for comparing two files. If we want to show the Sweetviz HTML report, we need to use the Streamlit HTML components function, as seen below.
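
A minimal sketch, assuming two uploaded dataframes df1 and df2:

```python
import sweetviz as sv
import streamlit.components.v1 as components

# Compare the two dataframes and write the report to an HTML file
report = sv.compare([df1, "First file"], [df2, "Second file"])
report.show_html("sweetviz_report.html", open_browser=False, layout="vertical")

# Render the generated HTML inside the Streamlit app
with open("sweetviz_report.html", "r", encoding="utf-8") as f:
    components.html(f.read(), height=1000, scrolling=True)
```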

Synthetic data comparison with table-evaluator. A comprehensive comparison of original and synthetic datasets: it checks all statistical properties (PCA included) and offers multiple model performance comparisons between the original and synthetic datasets.
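
A minimal sketch of the underlying package, assuming real and synthetic dataframes with a shared schema and a placeholder target column named "label":

```python
from table_evaluator import TableEvaluator

evaluator = TableEvaluator(real_df, synthetic_df)

# Visual checks: distributions, correlations, PCA, and more
evaluator.visual_evaluation()

# Statistical and ML-efficacy checks against a target column
evaluator.evaluate(target_col="label")
```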

In Step 4 of the app, you can download the report and files. An example of the code is below.
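
A minimal sketch using Streamlit's download button; the widget label and file name are illustrative, and the report is assumed to have been written to disk already:

```python
import streamlit as st

# Offer the generated report for download
with open("comparison_report.pdf", "rb") as f:
    st.download_button(
        label="Download report",
        data=f,
        file_name="comparison_report.pdf",
        mime="application/pdf",
    )
```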

🔍 The code used is in structured_data.py (table_evaluator_comparison), te.py, viz.py, and metrics.py. I copied the scripts from the table-evaluator repository because I needed to adjust them to make the package work efficiently in Streamlit. If you would like to try out this package, you can simply install it as is.

Text data section

The text data section offers the flexibility of pasting a body of text or uploading a CSV/JSON file for analysis. It currently supports only English, but it offers a lot of analysis methods and automated data preprocessing. The code is placed in the text_eda folder.

Let’s focus on the most interesting subsections.

The data preprocessing subsection relies on various text preprocessing functions, like stemming, lemmatization, de-noising, and stop word removal. These steps put text data into a machine-readable form. The preprocessed file can be downloaded.
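
A minimal sketch of typical steps with NLTK; the app's preprocessor.py implements its own variants of these functions:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    # De-noise: lowercase and strip everything except letters and whitespace
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Remove stop words and lemmatize the remaining tokens
    tokens = [lemmatizer.lemmatize(tok) for tok in text.split()
              if tok not in stop_words]
    return " ".join(tokens)

print(preprocess("The cats were running around the houses!"))
# -> "cat running around house"
```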

🔍 The code used is in the preprocessor.py script.

Topic analysis with LDA, where we offer the flexibility of providing the number of topics you want to model or of calculating the optimal number of topics based on the u_mass coherence score. Furthermore, LDA topics are visualized in an interactive plot using pyLDAvis.
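
A minimal sketch of topic-count selection with gensim's u_mass coherence (the search range here is an arbitrary choice):

```python
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

# docs: a list of tokenized, preprocessed documents
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

def umass_score(num_topics: int) -> float:
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, random_state=42)
    return CoherenceModel(model=lda, corpus=corpus,
                          coherence="u_mass").get_coherence()

# u_mass scores are negative; values closer to zero are more coherent
best_k = max(range(2, 11), key=umass_score)
```

The resulting model can then be visualized with pyLDAvis and rendered in Streamlit via pyLDAvis.prepared_data_to_html() and the same HTML components function used for Sweetviz.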

🔍 The code used is in the lda.py script.

Sentiment analysis with VADER and TextBlob. An easy way to get the polarity of input text data.
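
A minimal sketch of both scorers:

```python
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

text = "Streamlit makes building data apps surprisingly pleasant."

# VADER returns negative/neutral/positive scores plus a compound score
vader_scores = SentimentIntensityAnalyzer().polarity_scores(text)

# TextBlob polarity ranges from -1 (negative) to 1 (positive)
blob_polarity = TextBlob(text).sentiment.polarity

print(vader_scores["compound"], blob_polarity)
```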

🔍 The code is in the polarity.py script.

Audio data section

The audio data section offers data augmentation, EDA, and comparison of two audio files. The code is placed in the audio_eda folder.

Let’s focus on the most interesting subsections.

One file analysis, where we provide several useful plots with librosa to describe the input audio file. The plot descriptions are in the app.

🔍 The code is in the audio_data.py script, function audio_eda. To upload and display the audio file widget, you can use the below code.
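
A minimal sketch of the upload and playback widgets:

```python
import streamlit as st

# Upload widget in the sidebar; st.audio renders a playable audio player
audio_file = st.sidebar.file_uploader("Upload an audio file", type=["wav", "mp3"])
if audio_file is not None:
    st.audio(audio_file)
```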

One file augmentation with audiomentations, a useful library for augmenting audio files. Augmentation is very important for increasing the robustness of a dataset when training data is scarce. The app also runs EDA on the augmented file.

🔍 The code is in the audio_data.py script, function augment_audio. An interesting approach is used to pass the selected augmentation methods to this function with the multiselect API: the user input is parsed as expression arguments and evaluated as a Python expression.
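
A minimal sketch of the same idea; for simplicity, a dictionary mapping is used here instead of evaluating expressions, and the parameter values are illustrative:

```python
import numpy as np
import streamlit as st
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

# Map multiselect labels to transforms
available = {
    "Gaussian noise": AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=1.0),
    "Time stretch": TimeStretch(min_rate=0.8, max_rate=1.25, p=1.0),
    "Pitch shift": PitchShift(min_semitones=-4, max_semitones=4, p=1.0),
}
selected = st.multiselect("Augmentation methods", list(available))

# In the app, samples come from the uploaded file (e.g. via librosa.load)
samples = np.random.uniform(-1, 1, 16000).astype("float32")

# Compose applies the selected transforms in sequence
augment = Compose([available[name] for name in selected])
augmented = augment(samples=samples, sample_rate=16000)
```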

Two file comparison with Dynamic Time Warping (DTW), a method that finds the optimal alignment between two sequences and quantifies their similarity.
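
A minimal sketch with librosa's DTW implementation, comparing the chroma features of two files (the file names are placeholders):

```python
import librosa

y1, sr = librosa.load("first.wav")
y2, _ = librosa.load("second.wav", sr=sr)

# Extract chroma features for both files
c1 = librosa.feature.chroma_cqt(y=y1, sr=sr)
c2 = librosa.feature.chroma_cqt(y=y2, sr=sr)

# D is the accumulated cost matrix, wp the optimal warping path
D, wp = librosa.sequence.dtw(X=c1, Y=c2)
print("Total alignment cost:", D[-1, -1])
```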

🔍 The code is in the audio_data.py script, function compare_files.

Two file comparison with audio analyser, a method that compares two spectra with an applied threshold.

🔍 The code is in the audio_data.py script, function audio_compare.

Image data section

The image data section offers data augmentation and EDA of your images. The code is placed in the image_eda folder.

This section allows multiple files to be uploaded and worked on at once. Let’s focus on the most interesting subsection.

Augmentation of the images using Pillow — the app offers several augmentation methods, including image resizing, noise application, and contrast and brightness adjustment. A nice bit of added flexibility, thanks to the Streamlit session state, is that you can apply multiple augmentations in sequence and go back to the previous state if you need to.

🔍 The code is in the augment.py script. The versatile Streamlit APIs helped configure the augmentation part of this section. You can see example code below.
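
A minimal sketch of sequential augmentations with an undo, assuming a single uploaded image (the widget labels are illustrative):

```python
import streamlit as st
from PIL import Image, ImageEnhance

uploaded = st.sidebar.file_uploader("Upload an image", type=["png", "jpg"])
if uploaded is not None:
    # Keep a history of image states so augmentations can be undone
    if "history" not in st.session_state:
        st.session_state.history = [Image.open(uploaded)]

    factor = st.slider("Brightness factor", 0.5, 2.0, 1.0)
    if st.button("Apply brightness"):
        current = st.session_state.history[-1]
        st.session_state.history.append(
            ImageEnhance.Brightness(current).enhance(factor)
        )

    # Undo pops the latest state, keeping the original image intact
    if st.button("Undo") and len(st.session_state.history) > 1:
        st.session_state.history.pop()

    st.image(st.session_state.history[-1])
```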

Additional tips for Streamlit app design

Personally, I find app design very important. It gets the audience and users excited straight away, so I don't mind spending a lot of time on it. For those who feel app design takes too much of their time, here are a few tricks I use for all my Streamlit apps:

  • Using a local file as the background. This hack is a lifesaver (I got the tip from the very useful Streamlit forum). You can pass a local file and set it as the background, as shown in the first sketch after this list.
  • Custom themes. Streamlit offers a very easy way of secondary app styling through their UI. You can check it out here.
  • The sidebar design. Did you know you can change the width of your sidebar? You can do it with a simple CSS snippet, shown in the second sketch after this list. Adjust the width to what suits you.
  • I prefer to move all of the high-level user-defined steps to the sidebar, like upload widgets, select boxes, etc. It's very simple. Just add .sidebar to the relevant API call.
  • Styling the text with HTML. You can pass HTML to st.markdown with unsafe_allow_html=True to do that.
  • Streamlit expander API for more information and references. The expanders are a space-saver in robust apps with a lot of text.
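
Below are minimal sketches of the background and sidebar tricks. The CSS selectors target Streamlit's rendered markup, so they may need adjusting for your Streamlit version.

```python
import base64
import streamlit as st

def set_background(image_file: str) -> None:
    # Encode a local image and inject it as the app background via CSS
    with open(image_file, "rb") as f:
        encoded = base64.b64encode(f.read()).decode()
    st.markdown(
        f"""
        <style>
        .stApp {{
            background-image: url("data:image/png;base64,{encoded}");
            background-size: cover;
        }}
        </style>
        """,
        unsafe_allow_html=True,
    )

set_background("background.png")
```

And the sidebar width:

```python
import streamlit as st

# Widen the sidebar; the selector may differ across Streamlit versions
st.markdown(
    """
    <style>
    [data-testid="stSidebar"] {
        width: 400px !important;
    }
    </style>
    """,
    unsafe_allow_html=True,
)
```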

🔍 You can find a lot of helpful design functions that are used in the app in the helper_functions.py script.

Wrapping up

The DQW is a useful app for automating your data preprocessing during AI model development. It enables data-driven model development and streamlines the preprocessing workflow, ensuring transparency and quality. This app is still under development and is one of many Streamlit apps the Sogeti NL Data Science team has developed. We find Streamlit very useful for demonstration purposes. Furthermore, we made this app open source to educate the data science community about data-centric model development and to provide advice on the methods that can be used to ensure data quality.

I encourage you to try the app out and let me know how you experienced it. Leave your questions in the comments or reach out to me if you want to hear more about what we do at Sogeti!

--

Tijana Nikolic
Sogeti Data | Netherlands

AI Specialist at Sogeti NL with a focus on data and AI quality (testing of and with AI). Leading AI for good initiatives. Ethics lead in Serbian AI Society.