Exploratory Data Analysis with Desbordante
Hey everyone! Today, we’d like to share with you our data profiling tool. Below, we’re going to talk about data profiling itself, the tool’s features, and who could find this tool useful. You can give the tool a try here.
Introduction
Data profiling [1] is the process of extracting metadata from data. It’s a common task for people working with data, and their extraction goals can be very different.
What is metadata? In general, it’s data about data, such as file size, creation time, and ownership. In reality, the concept is broader and includes various patterns hidden in the data itself. The rest of this article is about this latter type of metadata. One more caveat: for now, we will consider tables only.
The patterns that may be present in data are described using various primitives. By a primitive we mean a mathematically defined description of a rule that holds over the data. Functional dependencies are a good example: a dependency A->B (A and B being columns) holds if, for each pair of rows, equal values in A imply equal values in B.
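To make the definition concrete, here is a minimal pure-Python sketch that checks a single functional dependency on a small table. The function name fd_holds and the sample rows are ours, chosen purely for illustration; this is not Desbordante’s API.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds on rows."""
    seen = {}  # value in lhs -> the value in rhs it was first seen with
    for row in rows:
        key, value = row[lhs], row[rhs]
        if key in seen and seen[key] != value:
            # Two rows agree on lhs but disagree on rhs: the dependency fails.
            return False
        seen[key] = value
    return True


# Hypothetical toy table with two columns, A and B.
rows = [
    {"A": 1, "B": "x"},
    {"A": 1, "B": "x"},
    {"A": 2, "B": "y"},
    {"A": 3, "B": "y"},
]

a_to_b = fd_holds(rows, "A", "B")  # equal A always means equal B
b_to_a = fd_holds(rows, "B", "A")  # B == "y" appears with A == 2 and A == 3
print(a_to_b, b_to_a)  # True False
```

Remembering the first value seen for each A avoids comparing every pair of rows explicitly, while still checking the pairwise condition from the definition.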
There are dozens of primitive types, each with its own specific use case, well-established properties and a sound theory behind it. New types are continually being developed, too. The most well-known examples of primitives are functional dependencies and association rules.
Patterns can be found automatically using specialized algorithms, which vary depending on the primitive. The discovered patterns can be valuable for a fairly wide range of people. Here are some examples:
- Bioinformatics researchers, chemists, geologists, and in fact almost any scientist working with large volumes of data, especially those obtained experimentally.
- People working with business data: financial analysts, salespeople, and traders, all of whom have a lot of data that can be explored.
- Data scientists, data analysts, machine learning specialists.
For scientists that work with data, finding a primitive indicates the presence of some pattern. Based on it, they may be able to formulate a hypothesis that could lead to a scientific discovery or even draw conclusions immediately if there is enough data. At the very least, the found pattern can provide a direction for further study.
In the case of business data, the researcher can also try to obtain some kind of hypothesis. For example, “cars sold on weekdays have larger engine capacity than those sold on the weekends”. Such a hypothesis is easy to check, but only if you already suspect that the pattern exists. If you do not, coming up with such assumptions manually takes significant effort.
There are more mundane and more in-demand applications for business data: clearing errors in databases, finding and removing inexact duplicates, and many more. Note that scientists might also be interested in this functionality, albeit to a much lesser extent.
As for machine learning, the found primitives can help in feature engineering and in choosing the direction for an ablation study. Let’s say you need to train a model to solve a problem and you have a tabular dataset. A question immediately arises: which columns should be used as features and which shouldn’t? “Every column in the dataset” is a bad answer. Firstly, some columns may be derived from (or dependent on) others, and adding them will not improve performance; it will only increase the size of the model and the training time. Secondly, this can provoke overfitting, which will negatively affect the quality of the solution.
Knowing the present patterns will make it possible to discard some features immediately, and for the remaining ones, choose the order in which they should be used with the model.
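As a rough illustration of how discovered dependencies could guide feature selection, the sketch below drops every column that is fully determined by a column we have already decided to keep. It only looks at single-column dependencies, and the helper names and toy data are hypothetical; a real pipeline would take the dependencies from a discovery algorithm instead of brute-forcing them.

```python
def fd_holds(rows, lhs, rhs):
    """Check whether equal values in column lhs imply equal values in rhs."""
    seen = {}
    for row in rows:
        # setdefault stores the first rhs value seen for this lhs value
        # and returns the stored one on every later occurrence.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


def prune_derived_columns(rows, columns):
    """Greedily keep a column only if no already-kept column determines it."""
    kept = []
    for candidate in columns:
        if not any(fd_holds(rows, k, candidate) for k in kept):
            kept.append(candidate)
    return kept


# Hypothetical toy data: price_cents is always price_usd * 100.
rows = [
    {"price_usd": 10, "price_cents": 1000, "weight": 3},
    {"price_usd": 20, "price_cents": 2000, "weight": 3},
    {"price_usd": 10, "price_cents": 1000, "weight": 5},
]
features = prune_derived_columns(rows, ["price_usd", "price_cents", "weight"])
print(features)  # ['price_usd', 'weight']
```

Two caveats apply to this sketch: a unique identifier column trivially determines every other column, so it should be excluded before pruning, and on very small samples a dependency may hold by accident.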
Finally, there are a bunch of other more niche applications, such as database optimization, database reverse engineering, and data integration. You can read about them here [1].
Next, let’s take a look at a small example to show what it’s all about. The table below describes products offered by a company — five different things. For each, its factory serial number and price are given.
The table shows that the first, third and fourth entries contain data on the same product. However, an error has probably crept into the fourth entry: perhaps the operator who entered the data did not finish typing a nine. This could be detected using the primitives described above, and in several ways. For example, one could look for approximate functional dependencies with a small threshold value and find Serial -> Price. Alternatively, one could detect the exact functional dependency by loading only the upper half of the dataset. Of course, such a simple error could be detected in another way, without using primitives. However, more complex examples are not as straightforward or cost-effective to analyze using traditional methods.
The academic database community has a huge number of primitives describing many different patterns that may be present in data. However, they are largely unknown outside academia: at best they exist in the form of a little-known prototype, and in the worst case they simply remain a theoretical result. Our aim is to make these primitives accessible and give everyone the opportunity to study their data.
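Both detection routes can be sketched in a few lines of Python. The rows below are a hypothetical stand-in for the product table (the serial numbers and prices are made up), and the error measure is one common choice from the literature, often called g3: the minimal fraction of rows that must be removed for the dependency to hold exactly. We are not claiming that this is the measure Desbordante uses.

```python
from collections import Counter, defaultdict


def fd_error(rows, lhs, rhs):
    """Minimal fraction of rows to remove so that lhs -> rhs holds exactly."""
    if not rows:
        return 0.0
    groups = defaultdict(Counter)  # lhs value -> counts of rhs values
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # In every group we can keep only the rows with the most frequent rhs.
    kept = sum(max(counts.values()) for counts in groups.values())
    return (len(rows) - kept) / len(rows)


# Hypothetical stand-in data: rows 1, 3 and 4 describe the same product,
# and the price in row 4 is missing a digit.
table = [
    {"Serial": "A-100", "Price": 999},
    {"Serial": "B-200", "Price": 450},
    {"Serial": "A-100", "Price": 999},
    {"Serial": "A-100", "Price": 99},
    {"Serial": "C-300", "Price": 120},
]

full_error = fd_error(table, "Serial", "Price")
upper_error = fd_error(table[:3], "Serial", "Price")
print(full_error)   # 0.2: one of the five rows breaks Serial -> Price
print(upper_error)  # 0.0: the dependency is exact on the first three rows
```

With a threshold of 0.2 or higher, Serial -> Price would be reported as an approximate dependency on the full table, while on the upper part of the table it holds exactly.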
In this article, we present the beta version of the Desbordante platform (Spanish for limitless). This is our data profiler that enables its users to search for and validate various primitives in data. The Desbordante core is available through several interfaces: console, Python, and a web frontend. For the latter, we have a deployed instance which you can try here. The platform is open-source and you can take a look at the project on GitHub.
Disclaimer. The deployed instance is a beta version, and its resources are limited. The service may freeze or crash, and various errors and bugs are also possible. We will be glad to receive feedback, and we will try to fix everything quickly.
Initially, we were inspired by the Metanome project [2], but at the same time we have our own, different vision of this area. Metanome is more of a research prototype, while Desbordante is focused on the end user, and is much closer to a complete product. In addition, Desbordante has better performance, a user-friendly interface, and features that are not available in Metanome.
Positioning and Alternatives
We have come up with a classification of data profilers so we could clearly show the place of our tool among all other alternatives. First, we need to say that all data profiling can be divided into naive and science-intensive. Naive profiling includes extracting basic facts like the number of rows and columns in a table, the minimum and maximum values in a column, the number of NULL values, etc.
There are dozens of such data profiling tools, since almost all big information system vendors offer them. Many open-source tools are available, including Pandas Profiling, which is one of the most prominent [3].
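To show how little is needed for this kind of profiling, here is a minimal pure-Python sketch that collects the basic facts listed above. The function name, the output layout and the sample rows are our own illustrative choices; missing values are represented by None.

```python
def naive_profile(rows, columns):
    """Collect row/column counts and per-column min, max and NULL count."""
    profile = {"rows": len(rows), "columns": len(columns), "per_column": {}}
    for column in columns:
        values = [row[column] for row in rows if row[column] is not None]
        profile["per_column"][column] = {
            "nulls": len(rows) - len(values),
            "min": min(values, default=None),
            "max": max(values, default=None),
        }
    return profile


# Hypothetical sample with one missing price.
rows = [
    {"serial": "A-100", "price": 999},
    {"serial": "B-200", "price": None},
    {"serial": "C-300", "price": 99},
]
profile = naive_profile(rows, ["serial", "price"])
print(profile["rows"], profile["columns"])  # 3 2
print(profile["per_column"]["price"])  # {'nulls': 1, 'min': 99, 'max': 999}
```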
On the other hand, science-intensive profilers focus on extracting the complex metadata that we talked about in the introduction. This requires sophisticated algorithms, and tools that offer such profiling are much rarer. Applications that allow mining or validation of primitives are mostly research systems, or not even systems, but prototypes that scientists wrote for their own purposes. They have all the disadvantages of research systems:
- They are not oriented towards high performance and are quite often written in Java or Python. After all, their purpose is to show the feasibility of an idea or to compare several methods. Furthermore, the code of most research systems is rarely maintained after publication, leading to quick obsolescence. And our subject area is no exception.
- Implementations are difficult to access. In the worst case, there may be no implementation, because some of the primitives or the algorithms for their discovery were invented in the pre-GitHub era. Scientists’ webpages, unfortunately, do not live long, so the implementations could simply get lost. At best, implementations are usually scattered in various places on the Internet, written in different programming languages for different operating systems, and so on. That is, there is no single tool.
- They do not have a user-friendly interface and often require diving into the code in order to simply build the project, run and try it.
There are two prominent systems that offer science-intensive profiling: Metanome [2] and OpenClean [4]. However, both of them can be considered research systems since they share the same Java-based core that is used for primitive discovery. At the same time, Desbordante industrializes the idea of science-intensive profiling: it’s a fast, crash-resilient, and scalable tool which makes it possible to try several different primitives at once.
However, the industrialization of science-intensive profiling does not end here. Existing prototypes do not sufficiently fit use-cases outlined in the introduction. Our primary users are domain experts who have large amounts of data that they would like to explore and discover various patterns within, which express non-trivial facts. This leads to these specific requirements for an industrial-grade primitive discovery tool:
- Focus not only on discovery of primitives, but also on primitive validation and explainability of results. Users need to be told why a particular instance of a primitive does not hold. Consider an example of metric dependency validation (CLI interface) in the picture above: it presents the rows that violate the dependency together with the related information. Rows are shown in context, which allows users to understand what is wrong with the data.
- Focus on tunability. Users can fine-tune the discovery process by specifying various constraints on primitives they are looking for. Besides better control of the output, this also reduces algorithm run time, which is important given the high computational costs of primitive discovery and validation.
- Focus on non-tabular data. While tables are the most popular data type, users are also interested in other types such as graphs and transactional data.
- Focus on approximate primitives. Users work with real data, in which all kinds of errors occur. In this setting exact primitives are of little use since they will rarely be detected, due to the “strictness” of their formulation. Therefore, implementing approximate primitives should be prioritized.
- Focus on supporting multiple interfaces. Some users prefer a console application, some need a rich web UI, and some might need a Python interface.
Unlike other science-intensive profilers, Desbordante aims to comply with all these requirements by providing the functionality that is necessary to support the discussed use-cases. Concluding this section, we can say that currently Desbordante is one of the few projects that open science-intensive primitives to the general public and give everyone the opportunity to study their data.
Features
Desbordante can perform three types of tasks:
- Discovery of all instances of a primitive.
- Validation of a single primitive instance.
- Running jobs that perform some particular real-life task using algorithms for primitive discovery.
The Discovery task looks for and returns all instances of a particular primitive. For example, a functional dependency discovery task returns all dependencies that hold on a user dataset. However, not all types of primitives can be mined efficiently, and thus for some primitives we only support checking a single instance. In this case, a user supplies an instance and the result is either “yes” or “no”. In case of “no”, the tool outputs the rows that prevent the instance from holding, allowing the user to look into the data and analyze the root cause. As an example, consider checking whether a particular metric functional dependency [5] “serial -> price, 1000” holds, which basically states that the price difference between items with the same serial should not exceed 1000. If this rule is violated, Desbordante will present groups of rows with the same serial but with a price difference larger than the specified value.
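The validation logic for this particular rule can be sketched in plain Python as follows. This is only an illustration of the check described above for a single numeric column with absolute difference as the metric; the function name and the sample rows are hypothetical, and the code does not reflect how Desbordante implements or exposes the check.

```python
from collections import defaultdict


def mfd_violations(rows, lhs, rhs, max_diff):
    """Return groups of rows that share lhs but differ in rhs by > max_diff."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs]].append(row)
    violations = {}
    for key, group in groups.items():
        values = [row[rhs] for row in group]
        # The largest pairwise difference in a group is max minus min.
        if max(values) - min(values) > max_diff:
            violations[key] = group
    return violations


# Hypothetical data: prices for S2 differ by 1300, which exceeds 1000.
rows = [
    {"serial": "S1", "price": 5000},
    {"serial": "S1", "price": 5400},
    {"serial": "S2", "price": 1200},
    {"serial": "S2", "price": 2500},
]
violations = mfd_violations(rows, "serial", "price", 1000)
holds = not violations
print(holds)             # False
print(list(violations))  # ['S2']
```

Returning whole groups rather than a bare “no” mirrors the behavior described above: the user sees the offending rows in context and can decide which value is wrong.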
In this article, we will not dwell on primitives in detail. Descriptions with formal definitions can be found at the links below, and we may also present them in future articles. Desbordante also includes several built-in datasets, each showcasing a good example of a specific primitive with fixed parameters.
Moving on to the third category of tasks, these are some scenarios that are based on the available primitives, and are usually a composition of several algorithms. Their goal is to solve practical tasks for non-specialists who are more concerned with ready-to-use functionality rather than the discovered primitives. Currently, only one such scenario is available in Desbordante: a typo detection scenario implemented via combining the searches for exact and approximate functional dependencies. We can say that such scenarios are the distinctive feature of Desbordante, as there was no such functionality in Metanome.
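To give an idea of how exact and approximate functional dependencies can be combined for typo detection, here is a simplified sketch. It treats a dependency that holds approximately but not exactly as a hint that a few rows contain errors, and flags the rows that deviate from the most frequent value in their group. This is our own illustration over single-column dependencies with hypothetical names and data, not the actual pipeline shipped with Desbordante.

```python
from collections import Counter, defaultdict
from itertools import permutations


def group_values(rows, lhs, rhs):
    """Map each lhs value to a Counter of the rhs values it occurs with."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    return groups


def suspected_typos(rows, columns, threshold):
    """Flag rows that break dependencies holding approximately, not exactly."""
    suspects = []
    for lhs, rhs in permutations(columns, 2):
        groups = group_values(rows, lhs, rhs)
        kept = sum(max(counts.values()) for counts in groups.values())
        error = (len(rows) - kept) / len(rows)
        if not 0 < error <= threshold:
            continue  # exact dependency, or too many violations to trust it
        for index, row in enumerate(rows):
            majority = groups[row[lhs]].most_common(1)[0][0]
            if row[rhs] != majority:
                suspects.append((index, lhs, rhs, row[rhs], majority))
    return suspects


# Hypothetical data: the price in the fourth row is missing a digit.
table = [
    {"Serial": "A-100", "Price": 999},
    {"Serial": "B-200", "Price": 450},
    {"Serial": "A-100", "Price": 999},
    {"Serial": "A-100", "Price": 99},
    {"Serial": "C-300", "Price": 120},
]
suspects = suspected_typos(table, ["Serial", "Price"], threshold=0.2)
print(suspects)  # [(3, 'Serial', 'Price', 99, 999)]
```

Each suspect is reported as the row index, the dependency it violates, the suspicious value and the value that the majority of its group agrees on, so a user can review the suggestion before changing the data.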
Desbordante comes with a web and a CLI version, as well as Python bindings. The web version offers a convenient UI and interactivity: the user can look at both the result and the source data, and can apply various filtering and sorting. On the other hand, the CLI version offers only the classic interface in the form of “primitive + params = output”, but supports many more primitives, since a newly implemented algorithm usually appears in the CLI version first. We have also provided the ability to call primitive discovery and validation tasks from Python programs, since contemporary data scientists use Python as their language of choice. Using these bindings, users can call Desbordante algorithms to experiment and construct their own pipelines consisting of operations on primitives and data. The typo detection scenario is, in fact, such a pipeline, provided with the web UI. In the future we plan to add popular pipelines to the web version and develop a user-friendly interface for them.
Some final remarks:
- All three categories of tasks require effort from the user, experimentation and data exploration. Desbordante’s intended workflow is as follows: set parameters, run a selected task, inspect the result. Next, alter data according to the result: remove or add rows, columns, modify individual values and so on — possible actions depend on the task and primitive. After this, run the task again and repeat the process until the desired result is obtained. When running a discovery task, it’s advisable to experiment with different algorithms for the same primitive, as each has its own strengths and limitations. The performance of these algorithms may vary depending on the dataset.
- Searching for any nontrivial primitives is a very computationally expensive operation. Below is a table (taken from [6]) where you can see the performance of Metanome in functional dependency discovery tasks (various datasets for 8 algorithms). As you can see, the datasets are quite modestly sized. Also, note that these experiments were performed on a dual-processor server with Xeon E5 and 128 GB of RAM on board.
Aside from the large run times, the table shows that there may not be enough memory to process a dataset. In the deployed demo of Desbordante, we have added a few restrictions to avoid these problems. Furthermore, only registered users are able to upload their own datasets. Unregistered users, however, can still explore the platform’s functionality by using the built-in datasets.
Currently Supported Primitives
Here is a table with supported primitives and their properties: algorithm type, availability in Web and CLI versions, its input data type, and what it can be used for. The detailed description of each primitive can be found in the provided references.
About the implementation
The core of Desbordante is a console application with an extremely simple usage scenario: a dataset (e.g., a .csv table) is submitted to the input, the desired primitive with the algorithm is selected and, optionally, parameters are provided. The application then produces a set of found primitive instances as the output.
Over time, the console application has been extended with several components. As a result, now Desbordante can be used as a web service, that is, the server performing computationally complex work is already deployed on a remote machine, and the GUI client is accessible from any browser. Thus, the entire work cycle, meaning selecting a dataset, primitives, parameters and subsequent result analysis via filtering and visualization, is available immediately.
In addition, all Desbordante components are containerized, which makes it easy to deploy the service on almost any machine, and also makes it possible to manage the available computational resources, limiting the execution time of the algorithm and the amount of used memory. This is crucial for preventing a situation when a request from one user takes up all the resources of the machine for a long time, obstructing the work of other users. This can easily happen in the context of such a time-consuming task as finding patterns in data.
Recently, we have started to add Python support to make primitive discovery and validation available to Python users. We refer interested readers to [14] which presents a high-level overview of the application architecture.
Finally, unlike in Metanome, the algorithms for discovering primitives are implemented in C++, which in some cases allows for a tenfold increase in processing speed and a twofold reduction in memory consumption [13]. Therefore, at the moment Desbordante is one of the highest-performance open-source profilers of the science-intensive class.
About the project
Desbordante has been in development for several years, some of them with the participation and support of Unidata Labs. The project began in the summer of 2019 and was initially a research project whose main question was “how much will a C++ reimplementation improve performance compared to the original Java implementation?”. We were inspired by the Metanome system created by the Hasso-Plattner Institut research group and compared our results to it. At that stage, our focus was exclusively on functional dependencies.
As time went by, the team expanded and our understanding of the subject deepened. We accumulated a selection of primitives, algorithms and developed a web client. Eventually, we realized that we had the potential to create a valuable tool that could make these primitives accessible to a wider audience and bring benefits to them. This is the vision we continue to pursue.
The project is open-source, and if you are interested in participating, we are ready to cooperate. We’re looking for frontend, backend, and C++ developers, especially those willing to analyze papers describing primitives.
Additionally, we are eager to cooperate with teams that use data profiling for either research or industrial purposes. Our belief is that the direction of Desbordante’s development should be primarily determined by real-world applications of its primitives, which we wish to expand.
For all questions, contact us via website or GitHub. The project is led by George Chernishev, Maxim Strutovsky and Nikita Bobrov. At the moment the team consists of more than ten people.
What’s next?
A few more primitives are on the way: they are already implemented, but the code is still in branches. We are not going to stop at that, as we’re planning to constantly expand the list of available primitives and pipelines. Since it is impossible to predict which primitives will turn out to be a success, we will plan our development according to user feedback, primitives’ popularity and their usage patterns. We also aim to extend our profiler to support dataset types other than tabular. Finally, in future versions we plan to add a number of quality-of-life improvements such as interactive tables, convenient data export, and so on.
References
[1] Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. Profiling relational data: a survey (2015). The VLDB Journal 24, 4 (August 2015), 557–581. https://doi.org/10.1007/s00778-015-0389-y
[2] Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and Felix Naumann. Data Profiling with Metanome (2015). Proc. VLDB Endow. 8, 1860–1863. https://doi.org/10.14778/2824032.2824086
[3] Simon Brugman. Pandas-profiling: Exploratory Data Analysis for Python (2019). https://github.com/pandas-profiling/pandas-profiling.
[4] Heiko Müller, Sonia Castelo, Munaf Qazi, and Juliana Freire. From Papers to Practice: The Openclean Open-Source Data Cleaning Library (2021). Proc. VLDB Endow. 14, 2763–2766. https://doi.org/10.14778/3476311.3476339
[5] N. Koudas, A. Saha, D. Srivastava and S. Venkatasubramanian. Metric Functional Dependencies (2009). IEEE 25th International Conference on Data Engineering, Shanghai, China, pp. 1275–1278, doi: 10.1109/ICDE.2009.219.
[6] Thorsten Papenbrock, Jens Ehrlich, Jannik Marten, Tommy Neubert, Jan-Peer Rudolph, Martin Schönberg, Jakob Zwiener, and Felix Naumann. Functional dependency discovery: an experimental evaluation of seven algorithms (2015). Proc. VLDB Endow. 8, 1082–1093. http://www.vldb.org/pvldb/vol8/p1082-papenbrock.pdf
[7] Sebastian Kruse and Felix Naumann. Efficient discovery of approximate dependencies (2018). Proc. VLDB Endow. 11, 759–772. https://doi.org/10.14778/3192965.3192968
[8] W. Fan, F. Geerts, L. V. S. Lakshmanan and M. Xiong. Discovering Conditional Functional Dependencies (2009). IEEE 25th International Conference on Data Engineering, Shanghai, China, pp. 1231–1234, doi: 10.1109/ICDE.2009.208.
[9] Falco Dürsch et al. Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms (2019). In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM ‘19). Association for Computing Machinery, New York, NY, USA, 219–228. https://doi.org/10.1145/3357384.3357916
[10] Paul G. Brown and Peter J. Haas. BHUNT: automatic discovery of Fuzzy algebraic constraints in relational data (2003). In Proceedings of the 29th international conference on Very large data bases — Volume 29 (VLDB ‘03). VLDB Endowment, 668–679.
[11] Charu C. Aggarwal and Jiawei Han. Frequent Pattern Mining (2014). Springer Publishing Company, Incorporated.
[12] Wenfei Fan, Yinghui Wu, and Jingbo Xu. Functional Dependencies for Graphs (2016). In Proceedings of the 2016 International Conference on Management of Data (SIGMOD ‘16). Association for Computing Machinery, New York, NY, USA, 1843–1857. https://doi.org/10.1145/2882903.2915232
[13] M. Strutovskiy, N. Bobrov, K. Smirnov and G. Chernishev. Desbordante: a Framework for Exploring Limits of Dependency Discovery Algorithms (2021). 29th Conference of Open Innovations Association (FRUCT), pp. 344–354, doi: 10.23919/FRUCT52173.2021.9435469. https://fruct.org/publications/fruct29/files/Strut.pdf
[14] George A. Chernishev et al. Desbordante: from benchmarking suite to high-performance science-intensive data profiler (2023). https://arxiv.org/abs/2301.05965