The abstracted future of data engineering
The way we view data and data sources has more or less stayed the same over time, but new technologies suggest a step change is underway
Technology is an agent for change, but that change can manifest itself in a number of different ways. New technology can change the way we act or feel, but it can also shift our way of thinking and looking at things. For example, the Graphical User Interface (GUI), finished in 1973 at Xerox’s research labs in Palo Alto, enabled computer users to see what they were working on for the first time and think about computer assets in an abstracted way. Advances in mobile processing power and usage patterns are forcing us to think of the mobile phone as the center of our lives. A new shift of thought just like these is starting to happen with data.
Throughout most of modern history and our use of databases, data has been viewed on a source-by-source basis — customer purchases from one database, check-ins from another, etc. All of the work that data scientists and engineers do pre-analysis — cleaning, wrangling, munging, whatever you want to call it — is unique from source to source. This is a real pain for professionals who were educated and recruited on the basis of doing impactful work.
The value in a data pipeline is typically created at the end of it. Everything needs to be set up properly and in a scalable way, but your product is going to rely on the actual models or insights created by your systems. Value is created by outputs, yet Data Scientists are spending tons of time splitting columns. This is understandably annoying.
But the advent of more and more accurate neural networks is spurring a new view: we’re beginning to be able to see data as what it actually is instead of how and where it’s stored. But before we get there, some history is appropriate. The modern record of the flow of data actually fits neatly into three distinct eras.
Epoch #1 — Standard ETL
While humans have been collecting and storing data for as long as modern history records, our ability to process that data has typically lagged well behind our ability to store and maintain it. Data Scientists today are expected to be fluent with big data related technologies, but these are relatively recent: Hadoop was released in 2006, and the integrated ecosystem we now know (see Epoch #2), well, didn’t really exist.
The standard for data analysis is ETL. The Extract-Transform-Load (ETL) process basically moves data from a database, which is optimized for transactional processing, to a data warehouse, which is optimized for data analysis. ETL was and is typically done overnight, and is considered a relatively cumbersome process.
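To make the three steps concrete, here is a minimal sketch of an ETL job. It assumes two in-memory SQLite databases standing in for the transactional database and the data warehouse; the table names (purchases, customer_spend), the columns, and the run_etl function are all hypothetical and chosen only for illustration.

```python
import sqlite3


def run_etl(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
    """Copy per-customer spend from a transactional table into an analysis table."""
    # Extract: pull raw purchase rows out of the transactional store.
    rows = source.execute("SELECT customer, amount_cents FROM purchases").fetchall()

    # Transform: total the spend per customer and convert cents to dollars.
    totals = {}
    for customer, cents in rows:
        totals[customer] = totals.get(customer, 0) + cents
    records = [(customer, cents / 100) for customer, cents in sorted(totals.items())]

    # Load: write the analysis-friendly rows into the warehouse table.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS customer_spend "
        "(customer TEXT PRIMARY KEY, total_dollars REAL)"
    )
    warehouse.executemany("INSERT OR REPLACE INTO customer_spend VALUES (?, ?)", records)
    warehouse.commit()
    return len(records)


# Hypothetical sample data standing in for a real transactional database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE purchases (customer TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("Gage_Justin", 1250), ("Gage_Justin", 750), ("Doe_Jane", 500)],
)
warehouse = sqlite3.connect(":memory:")

loaded = run_etl(source, warehouse)
spend = warehouse.execute("SELECT * FROM customer_spend ORDER BY customer").fetchall()
print(loaded, spend)
```

A real overnight job would run on a schedule against much larger tables, but the shape is the same: read from the transactional side, reshape, and write to the analysis side.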
Partially because of the lack of sophistication on the analysis side, the ETL process began as very manual and ad hoc. If you had a single data source, like a list of credit card purchases, you needed to do a bunch of custom work to ready that data for analysis. Consider the following example.
A given customer’s name in your database might have looked like Gage_Justin, but for analysis that’s not really useful — it needs to be split into two separate fields (First Name and Last Name), shifted to lowercase, and made punctuation (underscore) free. So how exactly did you do this?
Manually. Developers needed to create custom scripts for each data source. Regular expressions were typically used to find patterns in the data field (e.g. to identify the first name and the last name), and a script would separate these into two new fields and adjust the database accordingly. This process didn’t scale well: unless data sources had identical fields, you needed to create a new set of scripts for each source. Everything was done on a source-by-source basis, and you lived and died by where your data came from.
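A script of this kind might have looked something like the sketch below. It assumes the source stores names as Last_First, which is my reading of the Gage_Justin example; the pattern and the function name are hypothetical.

```python
import re

# Hypothetical pattern for ONE source: names stored as "Last_First".
CREDIT_CARD_NAME = re.compile(r"^(?P<last>[A-Za-z]+)_(?P<first>[A-Za-z]+)$")


def split_credit_card_name(raw: str) -> dict:
    """Split a 'Last_First' value into lowercase first and last name fields."""
    match = CREDIT_CARD_NAME.match(raw)
    if match is None:
        # Any other format needs its own pattern and its own script.
        raise ValueError(f"unexpected name format: {raw!r}")
    return {
        "first_name": match.group("first").lower(),
        "last_name": match.group("last").lower(),
    }


print(split_credit_card_name("Gage_Justin"))
```

Feed the same function a value like JustinGage from a different source and it raises an error, which is the scaling problem in miniature: every new format means another pattern and another script.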
Epoch #2 — Cleaning Tools, Better Hardware, and GUIs
Around 2006, much of this began to change. Doug Cutting co-created Hadoop with Mike Cafarella with the goal of making distributed computing simpler and more accessible; as legend has it, the project was named after his son’s toy elephant. Hadoop, which has come to colloquially refer to a broader ecosystem of big data products, enabled large scale batch processing across multiple servers, or a cluster. This was a big deal.
Hadoop meant better storage (HDFS) and warehousing on top of it (Hive), but it also meant that our ability to process and analyze data was beginning to catch up with how well we could store it. A number of companies were founded around this time to commercialize this open-source wave, like Cloudera, which recently went public with Cutting as the Chief Architect. The big data ecosystem quickly progressed and improved with new frameworks like Apache Kafka (2011) and Spark (2014), and easier integrations with languages like Python (PySpark) and R (SparkR).
Alongside the revolution on the framework and software side, hardware was rapidly improving as well: Graphics Processing Units (GPUs) were starting to be used for data processing tasks. Nvidia, a company that until that point was best known for graphics processors used in gaming, released the CUDA architecture in 2006. CUDA allowed developers to create applications and tasks that utilized Nvidia’s GPU power, and it turned out that GPUs were a great fit for certain machine learning tasks. Today in 2017, Nvidia’s stock price is 20 times what it was in 2006 and souped-up GPUs have become the standard for neural networks in production.
Yet even with these two landmark developments happening simultaneously, the way that we looked at data remained largely the same. Data sources were still data sources, and individual work needed to be done on each one to prepare it for the higher-calibre analytics that were now possible. Paxata, founded in 2012, offers a GUI-based alternative to the regex-and-script process. IBM InfoSphere and Trifacta are other common enterprise offerings that take a similar approach.
These products definitely made the cleaning and wrangling process easier, but relied on the same paradigm: the data source. Each data source remained a unique element that required custom or modular work. You needed to do new work for each new data source you attached. Essentially, hardware and our theoretical ability to process data were way ahead of what we were practically able to do, because of the way systems were set up. Data remained defined by its origin.
Epoch #3 — Total Source Abstraction
In 2017, there are some new technologies on the horizon that might just finally and fundamentally change the way we look at data. Instead of data being tied to a source (a format, a stage, or a type), what if we could just see it for what it is and the ideas it represents? This concept might seem foreign, because Data Scientists today don’t get to view data as abstracted from the format it’s in. The majority of a Data Scientist’s time is spent cleaning and munging data: changing formats, splitting columns, ensuring accuracy, and the like.
When I was studying Data Science in school this jumped straight out at me: academics and professionals spoke about data in a very different way than I did. Projects were dominated by cleaning and formatting — modeling and visualization were not the bottlenecks. This is exactly why I was so excited when I met the team at Datalogue for the first time — they had a vision about seeing data in a different way.
(Disclaimer: I’m not employed by Datalogue or affiliated financially with the company in any way. I’m just someone with a Data Science background who finds the company compelling.)
Datalogue’s technology uses neural nets to abstract data from its source: it automatically does all the cleaning and munging that you’d want. Data Scientists can focus on the later end of the pipeline where value is created, driving business insights and creating data-driven products. Going back to our earlier example, Datalogue’s algorithms know that Gage_Justin is a name, and that as a Data Scientist you’d probably prefer it to be split into two columns. But the really interesting part is that this is data-source agnostic: it doesn’t matter where the data comes from. Gage_Justin from a credit card bill, JustinGage from sign-in data, and gageJustin from an email signup are now all the same. The new standard unit of work is the idea.
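To be clear about what "all the same" means in practice, here is a toy sketch of the outcome, not of Datalogue's method. It swaps the neural net for a hand-written rule plus a tiny lookup of known first names, both of which are hypothetical stand-ins for what a trained model would learn from data.

```python
import re

# Hypothetical lookup standing in for knowledge a trained model would learn.
KNOWN_FIRST_NAMES = {"justin", "jane"}


def canonical_name(raw: str) -> dict:
    """Map differently formatted name strings to one canonical record."""
    # Tokenize on underscores and on lowercase-to-uppercase boundaries.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z][a-z]*", raw)]
    if len(tokens) != 2:
        raise ValueError(f"expected two name parts in {raw!r}, got {tokens}")

    a, b = tokens
    # Decide the order from what the tokens are, not from where they came from.
    if a in KNOWN_FIRST_NAMES and b not in KNOWN_FIRST_NAMES:
        first, last = a, b
    elif b in KNOWN_FIRST_NAMES and a not in KNOWN_FIRST_NAMES:
        first, last = b, a
    else:
        raise ValueError(f"cannot tell first name from last name in {raw!r}")
    return {"first_name": first, "last_name": last}


# The same person as three different sources might record them.
for raw in ("Gage_Justin", "JustinGage", "gageJustin"):
    print(raw, "->", canonical_name(raw))
```

The point of the sketch is the interface: one function, any source, one answer. A rule-based version like this breaks down quickly on real data, which is exactly the gap a learned model is meant to fill.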
This concept of viewing data independently of its source is often called data ontology, and rightly so. The dictionary defines ontology as “the branch of metaphysics dealing with the nature of being,” and that’s more or less what’s going on here (alright, maybe a bit dramatic). We should be caring about what data is (a business, a phone number, an address) and not what it looks like or where it comes from. Creating impactful data-driven products shouldn’t require tedious manual processing, and now it might not have to.
Datalogue’s system is built to work and last with large enterprises — security, resilience, and being able to handle huge amounts of data are key focuses of the company. Customers can continue to use the tools they like and dictate the form of delivery — it’s just easier on the backend. This is really built as a foundational tool.
There are some other promising companies and developments that define the third epoch. Tamr, a Cambridge-based startup built by MIT scientists, has raised $40M from notable investors to solve this problem with a different approach. Teradata and others are still providing traditional data integration services that bring at least some value to enterprises. Finally, the hardware front continues to move really quickly. Application-Specific Integrated Circuits (ASICs) are quickly gaining traction in computing, and companies are more closely architecting their software to leverage GPU capacity.
But what may end up defining this period isn’t the jump in efficiency, as significant as it may be; it’s that abstracting data from its source enables a new way of thinking. Thinking about data as it actually is and not as the form it comes in. Thinking about how to focus on the important parts of making great data-driven products and forgetting about the tedious manual work. Thinking about how we can enable the next generation of data-driven builders.
That’s a new paradigm worth thinking about.