Q & A with Eliot Knudsen, Data Science Lead at Tamr
With the amount and variety of data in the enterprise exploding, and with the rising need to compete using analytics, traditional methods of integrating data simply aren’t enough. Data Unification (DU), which uses machine learning (ML) breakthroughs to reinvent traditional data integration technologies like Extract, Transform, Load (ETL) and self-service data prep, is poised to become a disruptive force in data management and big data.
We speak with Eliot Knudsen, Data Science Lead at Tamr, which is leading the DU movement with a patented technology that combines ML with human guidance to connect myriad data sources across departments. While self-service data prep tools hit a brick wall when dirty data from too many silos enter the workflow, Tamr can clean and unify data across hundreds or thousands of sources. Tamr also gets smarter and more precise over time.
Please tell us about Tamr and its unique DU technology.
Founded in 2013, Tamr was launched by start-up collaborators and data management veterans Andy Palmer and Mike Stonebraker. The two had previously co-founded Vertica Systems (a high-performance database management company that was sold to HP for $350 million) and worked together on several other related companies. After working on hundreds of data warehouse implementations, they forged a common belief that the core ideas behind the last 20+ years of data management were failing to meet the changing needs of enterprises. Traditional methods of organizing data for analytics could no longer keep up with the amount and variety of data available to enterprises. In 2012 the team began research at MIT’s Computer Science & AI Lab on a bottom-up solution for managing the radical data volume, velocity and, especially, variety in the modern enterprise. The resulting 2013 paper, “Data Curation at Scale: The Data Tamer System,” described a breakthrough approach for combining machine learning and human expert guidance to unify data across thousands of sources. The paper became the guiding vision for Tamr’s platform and also gave the company its name.
How does DU differ from traditional approaches to data integration?
For data officers and their business counterparts at large enterprises, Tamr is the only system capable of unifying data across hundreds, or even thousands of sources and domains quickly, accurately, and cost-effectively. Unfortunately, traditional data integration technologies hit a brick wall when dirty data from too many silos (typically 10 or more) enter the workflow. Tamr’s patented software breakthroughs combine machine learning with human expertise to efficiently automate data unification at scale to create transformational analytic solutions.
What are the breakthroughs that make DU possible?
It’s a unique combination of machine learning, modern architectures (microservices, HDFS, Spark, ElasticSearch, etc.) and extensive, open APIs that enable integration into today’s data and analytics stack.
What role do humans play in a machine-learning data unification process?
It’s our founding philosophy that machines alone are not sufficient to solve this problem. Subject-matter experts and machines need to work together, iteratively, to institutionalize knowledge about data problems that can’t be solved manually or through automation alone. In other words, algorithms vs. humans is not an either-or decision. “Together” is possible with the proper technology and process. Machine learning is essential for scaling and automation, but human knowledge is critical to ensure accuracy and reliability.
What is the difference between self-service data prep and DU?
Self-service data prep and DU are complementary technologies that solve different problems. Data prep tools enable business analysts to quickly and effectively tackle challenges that they traditionally relied on IT organizations to solve. But they typically hit a brick wall when combining more than about 10 sources. Data unification, on the other hand, applies human-guided machine learning to the task of finding the underlying structure in divergent data across hundreds or even thousands of sources. It evaluates the metadata, offers suggestions for combining similar fields, and queries experts for guidance on possible matches to enhance the models. This way, it quickly creates a single view of the relevant data, ready for analysis.
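As a rough illustration of that workflow (and not Tamr's actual algorithm), the sketch below compares field names across sources, suggests a merge when two names are very similar, and queues borderline pairs for an expert to confirm. The thresholds, the function names, and the use of plain string similarity in place of a trained model are all assumptions made for this example; feeding expert answers back into a model is omitted.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical thresholds, chosen for illustration only.
AUTO_SUGGEST = 0.85   # at or above this score, suggest combining the fields
ASK_EXPERT = 0.55     # between the two thresholds, ask a human expert


def normalize(name: str) -> str:
    """Lowercase a field name and drop punctuation and spacing."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for a learned model's score."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def triage_fields(sources: dict[str, list[str]]):
    """Compare field names across every pair of sources.

    Returns (suggested, ask_expert): two lists of
    (field_a, field_b, score) tuples.
    """
    suggested, ask_expert = [], []
    for (src_a, fields_a), (src_b, fields_b) in combinations(sources.items(), 2):
        for fa in fields_a:
            for fb in fields_b:
                score = similarity(fa, fb)
                pair = (f"{src_a}.{fa}", f"{src_b}.{fb}", round(score, 2))
                if score >= AUTO_SUGGEST:
                    suggested.append(pair)
                elif score >= ASK_EXPERT:
                    ask_expert.append(pair)
                # Pairs below ASK_EXPERT are treated as non-matches.
    return suggested, ask_expert


# Made-up metadata from two hypothetical source systems.
example_sources = {
    "crm": ["Customer_Name", "zip"],
    "erp": ["customer name", "postal_code"],
}
suggested, ask_expert = triage_fields(example_sources)
```

In a real system the score would come from a model that improves as experts answer the queued questions; here the point is only the three-way split between suggesting a match, asking a person, and discarding the pair.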
What are the benefits of unification to the data scientist or Chief Data Officer (CDO)?
Tamr’s machine learning-based approach makes it possible to deliver analytic and operational improvements that were previously unachievable due to the time, expense, and effort associated with preparing the relevant data. When a clean, complete, integrated data set is available, data scientists can apply their skills to analyzing it rather than grinding out the grunt work to prepare it. And for CDOs, a new approach to data unification means that they can be a better partner to their business counterparts by being able to say ‘yes’ to more requests that can lead to transformational outcomes.
What kind of companies can use your solution? Can it be easily integrated with their existing infrastructure and applications?
Tamr works with large, mature enterprises that have accumulated significant ‘data debt’ over their many years of operation. Our customers are mostly members of the Global 2000, including GE, Toyota, GSK, HP, and Thomson Reuters. Part of our guiding product philosophy and design principles is that data unification must be able to integrate into our customers’ current and future data tech stack. As a result, we have built an open platform with an extensive set of APIs that enable our customers to embed Tamr into their infrastructure and applications.
How do you differentiate your platform from data integration software offered by competitors?
Tamr’s data unification platform differs from traditional rules-based data integration technologies in three ways:
- It is powered by machine learning;
- It efficiently incorporates expert knowledge about the data being unified; and
- It is built on a big data architecture (HDFS, Spark, ElasticSearch, Mesos).
As a result, Tamr is able to tackle data unification problems at a scale that is impossible to handle with traditional approaches.
What are your thoughts on the way the data management and big data segments are growing? What are the major trends that will affect and fuel their growth? How is your company equipped to be part of that growth story?
Data unification is a huge unsolved problem for most large enterprises. Today, the data integration market is estimated to be $6.4 billion, and is forecast to grow 13.7% annually for the next 5 years, making it one of the fastest growing enterprise software segments.
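To put that forecast in concrete terms, here is a small worked calculation. It assumes the 13.7% rate compounds once a year from the $6.4 billion base, which the figures above imply but do not state outright; the variable names are illustrative.

```python
# Figures quoted in the interview.
market_today_bn = 6.4      # estimated market size, in billions of dollars
annual_growth = 0.137      # forecast growth rate per year
years = 5

# Assumption: growth compounds annually from today's base.
projected_bn = market_today_bn * (1 + annual_growth) ** years

print(f"Projected market size after {years} years: ${projected_bn:.1f} billion")
```

Under that assumption the market would reach roughly $12.2 billion, a little under double its current size.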
Traditional data integration approaches are based on hard-coded, deterministic rules written by developers who rarely have a deep understanding of what the data represents. As a result, these approaches (often referred to as ETL, for extract, transform, load) are expensive and time-consuming to implement, error-prone, and rarely scale beyond a handful of data sources. In the era of Big Data, where the variety, volume, and velocity of data are exploding, this traditional approach is increasingly unworkable. Tamr’s approach addresses the fatal flaws of traditional data integration:
- It is model-based (powered by machine learning) as opposed to rules-based, so it can easily adapt to changing data sources and automate the process of data integration and cleansing in a scalable, low-cost manner.
- It efficiently incorporates the knowledge of human experts who genuinely understand what the data represents without requiring them to be technical. This enables the machine learning models to get smarter and more accurate over time.
What is the one thing about Tamr’s DU solution that you would like our readers to know right now?
Analytic and operational breakthroughs become possible when radical innovation is applied to traditional data integration roadblocks. This is particularly true for mature, large enterprises that have accumulated significant ‘data debt’ over the years in the form of siloed, redundant data stores of wildly variable quality. Unlike today’s self-service data preparation tools and traditional ETL data integration products, Tamr’s machine learning-based approach can efficiently clean and unify data across hundreds or even thousands of sources. This makes it possible to quickly pay down the accumulated data debt and to move from incremental optimizations to step-change improvements.
What are the top 3 technologies that IT leaders need to watch out for?
- AI: Artificial Intelligence is at the peak of its hype cycle. Like a lot of ‘hot’ technology trends, it is likely to lead to a lot of disappointment in the near term before it achieves the tremendous long-term potential that everyone believes it has. For mature enterprises, one of the biggest challenges to finding success with AI initiatives comes from their ‘data debt’. Without high-quality data as an input, AI projects will flounder.
- Self-service data preparation: These are useful tools to empower business analysts to make rapid progress in tackling questions that they typically had to rely on their IT organizations to solve. Like many things though, their strength can also be their weakness. Their decentralized usage and bespoke rules to combine, clean, and analyze data can lead to multiple versions of the ‘truth’. Employ these technologies accordingly, and use them with clean, curated data in situations where accuracy matters.
- Traditional data integration technologies: Incumbent ETL vendors are wrapping their traditional offerings in the cloak of machine learning in reaction to the disruption being caused by a new wave of vendors. As the Russian proverb (often misattributed to President Reagan) says, “trust, but verify.”