I am releasing a proof of concept demonstrating my vision for the future of data management: Machine Learning applied to Data Integration. One of the major (and unjustified) costs of big data and business intelligence is getting your stuff together on one platform. AI can and will be used to streamline the tedious process of moving and transforming large data sets to conform to the requirements of analysis.
What is data integration?
As a data scientist, you might be able to write very complex statistical regressions on numerical data, but the infrastructure to keep these analyses running on big data is far from trivial. As a result, there’s an entire industry specializing exclusively in big data services. The concept of a data lake has become commonplace, as businesses seek to aggregate all data in one place for business intelligence purposes.
Data integration is the “fun” part which needs to be taken care of before you can do big data analytics. But it also has many other applications: in a connected world, people will exchange not just small data sets in Google spreadsheets, but also large data sets which cannot even be opened with a standard computer program. The Data Integration and Data Migration market is consequently booming within the larger Enterprise Data Management trend. Big data analytics involves aggregating and processing large amounts of data from multiple sources. The data integration work is generally labour intensive (read: unpleasant) and is traditionally achieved by assembling scripts and bespoke ETL tools, either as a one-off or by solidifying a production stack with great pain.
Data integration is messy
Data integration is messy for many reasons. Philosophically: data captures written and unwritten business rules and reflects the specific thought processes of both the system designers and users. Technically: schemas, syntax, format, storage and access patterns can all change from one data set to another. Moreover, when it comes to relational data, SQL is generally considered tough to master and requires highly technical skills. Beyond relational, data management stacks often include NoSQL components, which are easier to use but can also be far easier to misuse.
AI’s next frontier: Data Integration
We live in a world where AI seems to be everywhere: from Google Translate to self-driving cars, facial and voice recognition, and personal assistants. In tech and engineering, people like my friend Bruno Marnette of prodo.ai are building deep learning tools to assist developers with complex software engineering tasks.
It is past time for the strength of Machine Learning to be applied to the realm of data integration, automating and streamlining an unnecessarily arduous process.
What should AI mean for the future of data integration?
Why is getting data from point A to point B so complex? The data is already available, and although the transformations may be numerous and detailed, they are typically trivial: converting a date-time field, cross-referencing a join key across multiple sources, etc. It’s fairly simple to describe these transformations, but seldom as easy to encode and solidify them in a production-quality data integration solution. Why shouldn’t AI help with that?
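As a sketch of what I mean by “trivial”, here are two such transformations in Python. The field names (order_date, customer_id) and formats are invented for illustration, not taken from any real schema:

```python
from datetime import datetime

def normalise_row(row):
    """Two typical 'trivial' transformations (illustrative field names):
    convert a US-style date to ISO 8601, and normalise a join key so it
    matches the same customer across different source systems."""
    row["order_date"] = datetime.strptime(
        row["order_date"], "%m/%d/%Y"
    ).strftime("%Y-%m-%d")
    row["customer_id"] = row["customer_id"].strip().upper()
    return row

print(normalise_row({"order_date": "03/14/2024", "customer_id": " ab123 "}))
# {'order_date': '2024-03-14', 'customer_id': 'AB123'}
```

Each step is a one-liner to describe and to write; the hard part is wiring hundreds of them, reliably, into a production pipeline.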
What am I doing with CSV Studio?
CSV Studio is a tool designed to automatically correct parsing errors in very large data files. It addresses a simple and oft-lamented pain point in ad-hoc data integration and analytics: the quality of flat files in CSV format. The CSV format is universally recognised, but the files themselves are frequently imprecise and error-ridden. Who has not had to fix or parse a badly formatted CSV file at some point in their life?
CSV files often contain very basic issues which prevent easy ingestion into a database or an ETL system, requiring instead some level of manual editing or regular-expression scripting. For small files this is not a problem, but for database-sized files, which are far too large to be opened in Excel, it becomes awkward. Visual inspection is no longer reliable, or even possible, when the file cannot be opened by anything but a database product.
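To illustrate the kind of basic issue involved, here is a small Python sketch (the file contents are invented) that scans a CSV for rows whose field count does not match the header. Note that the stray quote on line 3 slips through this simple check, while the extra delimiter on line 4 is caught; this is exactly why naive validation is not enough:

```python
import csv
import io

# An in-memory example of a flawed CSV file (contents illustrative):
# line 2 is well formed, line 3 has an unescaped quote in a field,
# and line 4 contains an unescaped delimiter.
raw = ('id,name,comment\n'
       '1,Alice,"said ""hi"""\n'
       '2,Bob,said "hi"\n'
       '3,Carol,one,extra\n')

reader = csv.reader(io.StringIO(raw))
header = next(reader)
bad_lines = []
for lineno, row in enumerate(reader, start=2):
    if len(row) != len(header):
        bad_lines.append(lineno)
        print(f"line {lineno}: expected {len(header)} fields, got {len(row)}")

print(bad_lines)  # [4]  -- the stray quote on line 3 went undetected
```

On a multi-gigabyte file, even listing these lines is slow, and fixing them by hand is out of the question.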
With CSV Studio, the interface and algorithms work together to repair the CSV file. And since only a human being can be relied upon to verify the quality of the repair with 100% accuracy, the algorithm is guided by the user interface to a result verified by the human data expert. This combination of user interface and human operator is crucial to a process known as active learning, where the algorithm seeks feedback from the user to refine the accuracy of the result.
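A minimal sketch of that feedback loop in Python. The names here (Fix, propose_fix, ask_user) are hypothetical stand-ins, not CSV Studio’s actual API:

```python
from collections import namedtuple

# Hypothetical repair proposal: a pattern of error and its replacement.
Fix = namedtuple("Fix", ["pattern", "replacement"])

def active_repair_loop(errors, propose_fix, ask_user):
    """Active-learning-style loop (illustrative): the algorithm proposes
    a fix for each parsing error, the human accepts or rejects it, and
    patterns the human has already approved are reapplied automatically."""
    accepted = set()   # repair patterns the user has already approved
    fixes = []
    for err in errors:
        fix = propose_fix(err)
        # Only interrupt the human for patterns not yet confirmed.
        if fix.pattern in accepted or ask_user(err, fix):
            accepted.add(fix.pattern)
            fixes.append(fix)
    return fixes

# Stub callbacks: the "user" approves everything, and we count the prompts.
questions = []
errors = ["unescaped quote", "extra delimiter", "unescaped quote"]
fixes = active_repair_loop(
    errors,
    propose_fix=lambda e: Fix(pattern=e, replacement="<fixed>"),
    ask_user=lambda e, f: questions.append(e) or True,
)
print(len(fixes), len(questions))  # 3 2
```

Three errors get fixed, but the human is only consulted twice: the second “unescaped quote” reuses the approved pattern. That is the payoff of keeping the human in the loop without drowning them in prompts.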
The CSV Repair App
I am releasing a single page CSV Repair App as my Proof of Concept, to gather feedback on the error correction functionality of CSV Studio and validate the approach.
You can upload a CSV file in the app and navigate to the parsing errors and warnings. You can then apply four error-correction mechanisms designed to fix issues caused by unescaped control characters: improperly quoted strings, unescaped delimiters and the like, which are among the most common CSV parsing problems.
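As a rough illustration of the kind of repair involved (a naive heuristic for exposition, not CSV Studio’s actual algorithm), here is a Python sketch that re-quotes a field when an unescaped delimiter inflates the field count:

```python
def repair_unescaped_delimiter(line, expected_fields, delimiter=","):
    """Naive repair sketch: if a raw line splits into too many fields,
    assume the extras come from an unescaped delimiter inside the LAST
    text column, and re-quote that column. A real tool would weigh
    several candidate repairs and let the user confirm."""
    parts = line.rstrip("\n").split(delimiter)
    if len(parts) <= expected_fields:
        return line  # nothing to repair
    head = parts[:expected_fields - 1]
    tail = delimiter.join(parts[expected_fields - 1:])
    return delimiter.join(head + ['"%s"' % tail.replace('"', '""')])

print(repair_unescaped_delimiter("3,Carol,loves cheese, wine and bread\n", 3))
# 3,Carol,"loves cheese, wine and bread"
```

The heuristic is obviously fallible (the extra delimiter might belong to a middle column), which is precisely why the app keeps a human in the loop to confirm each class of repair.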
This is only a proof of concept — many simple usability features are missing, but my hope is that it is both a useful and promising new development for those who have to process large CSV files.