AI-powered data wrangling. Is it possible?

Ivan Begtin
2 min read · Jun 1, 2022

Data cleaning and preparation is one of many unloved but important topics. Its peculiarity is that the work is usually done either from the command line or inside a DBMS. Among convenient interactive tools, there is only the open-source OpenRefine [1] or costly products like Trifacta [2] and Talend Data Prep [3].

I use OpenRefine quite often. It has handy features such as applying Python code to column data, along with simple operations like splitting and joining columns, data reconciliation, etc.
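To make these column operations concrete, here is a minimal sketch in plain Python. It is not OpenRefine's API: the sample rows and the helper names (`clean_name`, `split_column`, `wrangle`) are hypothetical and only illustrate a per-cell transform followed by a column split.

```python
# Hypothetical sample rows; not taken from any real dataset.
rows = [
    {"name": "  ada LOVELACE ", "city_country": "London, UK"},
    {"name": "alan turing", "city_country": "Wilmslow, UK"},
]


def clean_name(value):
    """Per-cell transform: collapse whitespace and normalise capitalisation."""
    return " ".join(value.split()).title()


def split_column(row, source, targets, sep=","):
    """Split one column into several new columns and drop the source column."""
    parts = [p.strip() for p in row[source].split(sep, len(targets) - 1)]
    # Pad with empty strings when the cell has fewer parts than target columns.
    parts += [""] * (len(targets) - len(parts))
    new_row = {k: v for k, v in row.items() if k != source}
    new_row.update(zip(targets, parts))
    return new_row


def wrangle(rows):
    """Apply the cell transform, then the column split, to every row."""
    out = []
    for row in rows:
        row = dict(row, name=clean_name(row["name"]))
        out.append(split_column(row, "city_country", ["city", "country"]))
    return out


cleaned = wrangle(rows)
```

Every step here walks the whole table in memory, which is fine for small files but hints at why this style of tool struggles as data grows.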

But OpenRefine is severely limited in the amount of data it can handle. It is popular in data journalism and data analytics but, to be honest, not in data engineering. The limitation is especially critical because you cannot predict how much data you can load into it: it may simply hang during upload or become very slow on some transformations.

So there is a shortage of such tools, both free and commercial, at a small-to-medium, predictable price and, most importantly, with fewer restrictions than OpenRefine. From what I have observed, a tool like this built on a modern columnar or fast in-memory database, such as Tarantool, ClickHouse, or a similar engine, could become a top-rated product. It would also need a lot of effort put into the user interface.

An ideal data wrangling tool should:

  • come in open-source, cloud, and enterprise editions
  • support everything that OpenRefine can do
  • handle data of any size (at least up to 100 GB)
  • support SQL/NoSQL databases directly or via a DB proxy
  • offer predictable response times, advanced self-diagnostics, and warnings
  • allow shared usage and review
  • integrate with data catalogs
  • track data transformation lineage

But even more exciting and vital is that data wrangling is underpowered. Datasets and databases are not so hard to understand that AI could not be used for some tasks. AI cannot replace a human, but it could be an intelligent helper: detecting issues automatically, applying standard template-based data preparation operations, and helping analysts improve the data.

References:

[1] https://openrefine.org

[2] https://www.trifacta.com

[3] https://talend.com

#datatools #datawrangling

Ivan Begtin

I am the founder of APICrafter. I write about Data Engineering, Open Data, Data, the Modern Data Stack, and Open Government. Join my Telegram channel: https://t.me/begtin