Is collaborating on data impossible? Move away from Tableau and data warehouses
A corporate single version of truth and data-driven collaborative decision making: these are today's holy grails of analytics. In this post I will try to cover why I believe we fail so miserably at collaborating on data, and why we succeeded at it using Keboola and GoodData.
My story: after leaving a lean startup called Keboola, driven by crazy people, where I had not seen an offline Excel sheet for almost two years, I recently ended up in a big (and actually awesome) company, in a team of seven super smart people who analyze the same data for many internal customers using the usual combo of Alteryx and Tableau.
With seven people working on the same data, you would probably expect the data cleaning and ETL process (which usually takes up to 80% of the time) to be done just once, and that we would then collaborate on the data, reusing each other's code and living happily ever after.
The reality is that we work much more as individuals: each of us has a crazy 20 GB RAM computer filled with downloaded data that we clean locally and then feed into Tableau dashboards online, creating an endless stream of separate and independent ETL processes.
Usually no one knows what we have done to the original data and why.
The reason is not human error but the design of the tools we currently use for analysis and, more importantly, for presenting data.
Tableau and the single table paradigm
Tableau is actually at the heart of this problem. Don't get me wrong: I love Tableau for ad hoc analysis, but it is a tool for exploring data, not for maintaining and sustaining a reporting environment. The reason is the way Tableau (and most other tools, with the two exceptions I know of, Qlik and GoodData) stores the data: in Tableau's case it is a single denormalized table that the workbook uses as the base of all its reporting.
With a single table as the most efficient* way to store the data, the most natural way to use Tableau Server is to create an infinite number of ad hoc reporting tables for particular purposes. The end of this is an unmaintainable Tableau hell where every analyst knows his own reports well, but when someone else wants to work with the same data, he is usually better off redeveloping his own ETL script and creating a new dashboard instead.
* I am aware that Tableau supports joins, but it is not the easiest way to use the product, which results in people working with Tableau the way described above.
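To make the pattern concrete, here is a minimal sketch of the kind of flat extract each analyst ends up rebuilding for his own workbook. The file and column names are made up for illustration; the point is that the same join logic gets repeated in every private ETL script.

```python
import pandas as pd

# Hypothetical source extracts each analyst downloads locally
sales = pd.read_csv("sales.csv")          # order_id, customer_id, product_id, amount, date
customers = pd.read_csv("customers.csv")  # customer_id, segment, region
products = pd.read_csv("products.csv")    # product_id, category

# The single denormalized table a Tableau workbook is typically built on;
# every analyst repeats a join like this in his own script.
flat = (
    sales
    .merge(customers, on="customer_id", how="left")
    .merge(products, on="product_id", how="left")
)
flat.to_csv("tableau_extract.csv", index=False)
```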
The value is in collaboration
This is how reporting worked until recently, and the way most analysts are probably used to doing their job: the boss wants a report, the analyst crunches an SQL query against the database and pastes the data into Excel or something similar. The same pattern now works exactly the same way using Tableau, just with better graphics and some workflow features. Is it an improvement? Yes. But is it a qualitative revolution? No way.
Think of it: the business person (the one who can actually create value out of the data) still needs to know a lot about how the data were cleaned and prepared before they reached the dashboard. So instead of doing even the simplest analysis himself, he tasks the analyst with the most trivial requests, and the analyst spends 90% of his time solving basic math problems and cleaning data. (Which is, of course, very expensive.)
This also creates a bottleneck, since the business person needs to communicate and explain what he wants, which is why even simple tasks that take a couple of minutes of real work take days to weeks in a corporate environment, postponing important decisions. (Which is even more expensive.)
Now imagine: what if the business person could solve 80% of his requests by himself, immediately, instead of having to ask someone else?
Well, this is what a data-driven company really is: a company where every person has access to, the means to work with, and a working understanding of the data needed to do his job.
So let’s do a super Data(warehouse, lake, whatever)
I guess we have all been through this: someone complains, “We need to collaborate on data, we need to make sure that we all look at the same data,” and some smart person comes up with the solution: let's build a single Data Warehouse of Everything and Everyone, containing all the reports and data.
Well, don't. It ends like this:
I recently attended a workshop on data warehousing run by Adastra, where a consultant spent two hours describing 52 categories of issues associated with implementing a data warehouse, only to conclude that the major issue he usually faces is the “absence of a user layer” in data warehouses; in other words, no one actually uses them.
Again, I am not here to say that data warehouses are pure evil, but they should not be misused for purposes other than what they were designed for, i.e. storing data.
The purpose of a data warehouse is to keep and maintain the master data about the organization and to make sure there is at least one reliable copy of the data in the firm. That's it.
Attempts to use a DWH for reporting will ultimately fail, not by mistake but by design, since a DWH is inherently:
- Slow: since information needs to be carefully checked before it is committed to the data warehouse, it takes weeks to months before new data is added. New implementations can take years to get right, so by the time you implement the model, the business will have completely changed in the meantime.
- Incomplete: with increasing amounts of data living in cloud services like Google Analytics, you will eventually find that you cannot rely on the DWH alone for reporting. (Well, unless you want to download the whole internet into it.)
- Complex: since it is supposed to store everything, it is not uncommon for it to hold several hundred tables, which is not a place to go without documentation or help, even for an experienced analyst, let alone end users.
I believe that data lakes are even worse in this respect, since they allow storing unstructured data, which makes things easy for the developers but even harder for the analysts.
- Owned by someone else: since the owner's interest is to make sure the data are stored, when he has to choose between making sure the data are written down properly and risking that your super-complicated query crashes the system, he will always choose the former over the latter.
Move fast, break things and keep it simple
Now to the solution. Think of it: what do you need in order to immediately work on data prepared by another person?
It turns out that there are just four requirements for collaboration:
- Understanding: you need to be able to understand the data you are looking at without having to ask anyone.
- Means: you need the technical knowledge to actually do the analysis.
- Your own workspace: a place where you are not afraid of breaking someone else's work or the production environment, and where you are not concerned about available computing power.
- Trust: this is the most important one. You need to trust the data you are given, otherwise you will always fall back to verifying the source to make sure that nothing has happened to the data on their way to you.
My answer to these issues is to make it simple: a big company stores hundreds of tables, but a salesperson, for instance, will always work in an environment that needs just a couple of them: a sales table, a customer table and maybe a product table. The same logic of course applies to marketing, logistics, etc. So:
- Use a simple, contextual data model: by stripping the data in the working environment down to the minimum needed in most cases, you create understanding without anyone needing a tutor to use the data (see the sketch after this list). You could argue that you can do this in Tableau, but in my opinion this is where a tool that supports data models, like GoodData or QlikView, really shines, since they enable a variety of views on the underlying data without further transformations. Be careful, though, about the complexity of the model: the more complex the model, the fewer people will be able to use it effectively, ultimately leading back to bottlenecks.
- Limit the technologies needed to operate the data: for end users the ideal is, of course, a single reporting tool. For the analysts you will probably end up with a set of SQL, R and Python. The more common the technologies, the better; you want the smallest possible set of technologies that empowers the maximum number of users.
- Create a team-based data sandbox: if you want to collaborate, you need a shared workspace where you can safely share code and data and work on them without being concerned about operational safety.
- Let the teams implement their own processes: trust is hard to achieve and easily lost. In my experience, people tend to trust most the things they do themselves, so to achieve this it is necessary to give analysts actual control over their data cleaning and ETL process. This, of course, works only if they have a common goal to work on (1. the contextual data model) and a common working environment to share (3. the data sandbox).
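Here is a minimal sketch of the first point, with made-up table and column names: the sales team's workspace is reduced to a few small tables at exactly the granularity that team needs, instead of exposing the full warehouse.

```python
import pandas as pd

# Hypothetical company-wide extracts; hundreds of tables exist upstream,
# but the sales team's workspace only receives these three.
orders = pd.read_csv("orders.csv")        # order_id, customer_id, product_id, amount, date
customers = pd.read_csv("customers.csv")  # customer_id, segment, region
products = pd.read_csv("products.csv")    # product_id, category

# Strip the data to the granularity the team needs: daily sales per customer
# and product, instead of every raw transactional column.
sales = (
    orders
    .assign(day=pd.to_datetime(orders["date"]).dt.date)
    .groupby(["day", "customer_id", "product_id"], as_index=False)["amount"]
    .sum()
)

# The contextual model for the sales team: three small, understandable tables.
sales.to_csv("model_sales.csv", index=False)
customers[["customer_id", "segment", "region"]].to_csv("model_customers.csv", index=False)
products[["product_id", "category"]].to_csv("model_products.csv", index=False)
```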
How Keboola and GoodData help
First, let's talk about GoodData. As sorry as I am that this tool has not moved an inch in the last three years in terms of UX design and visualizations, GoodData has, from my point of view, some major advantages that together solve the “make it simple” issue stated above and thus actually enable a data-driven environment:
- Completely cloud based: if you want people to have a single version of truth, you cannot have them downloading data to their own computers for analysis. A cloud-based platform (at least a private cloud) is a must, and GoodData delivers that. Say goodbye to patching, version issues and performance issues; that is all their job.
- It has a data model: GoodData's data model pre-calculates all the possible meaningful combinations of joins between the tables in the model. This basically means that once you upload the model, you do not need to revisit the data for transformation and modification unless you need to add new entities. This saves analysts a lot of time, but it also enables the third feature:
- Metrics as objects with predefined manipulation methods: if you are versed in object-oriented programming, this is exactly how the metrics behave. You can create a global metric, share it, derive a new one from it, and predefine what the user can do with it without breaking it.
To make it less abstract, let's say you have a total sales number. GoodData will show you the SUM of all sales and tell you that you can break this number down by all the meaningful attributes included in the model, such as the month of the year or the office that produced the sales.
What this does is HUGE: an end user can examine the metric, decide how to pivot it and get correct results without knowing anything about the underlying model or where the data came from. This makes self-service BI actually achievable.
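To illustrate the "metric as an object" idea, here is a rough Python sketch. This is not GoodData's API; the class, names and expression syntax are invented purely to show a metric that is defined once, shared, and can only be broken down in ways the model allows.

```python
from dataclasses import dataclass, field

@dataclass
class Metric:
    """A shared, reusable metric definition (illustrative only, not GoodData's API)."""
    name: str
    expression: str                      # an aggregation over a fact column
    allowed_breakdowns: set = field(default_factory=set)

    def break_by(self, attribute: str) -> str:
        # The end user can only slice the metric by attributes the model allows,
        # so he cannot "break" it by pivoting it in a meaningless way.
        if attribute not in self.allowed_breakdowns:
            raise ValueError(f"{self.name} cannot be broken down by {attribute}")
        return f"{self.expression} GROUP BY {attribute}"

# A global metric defined once by an analyst and shared with the whole company.
total_sales = Metric(
    name="Total Sales",
    expression="SUM(sales.amount)",
    allowed_breakdowns={"month", "office", "product_category"},
)

print(total_sales.break_by("office"))   # valid: SUM(sales.amount) GROUP BY office
# total_sales.break_by("order_id")      # would raise: not a meaningful breakdown
```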
As for Keboola Connection, it solves most of the other issues mentioned above. Keboola Connection is a cloud-based ETL solution that enables teams to extract data from a variety of online sources (databases, FTP, email attachments, REST APIs and various other services such as Google Apps) and store them in a cloud-based workspace, where they can be enriched with open data, transformed, and sent to a visualization tool (such as GoodData) without ever leaving the web browser.
It covers the same points mentioned for GoodData (cloud based and supporting a data model), but it also brings additional features that make collaboration on data possible:
- It simplifies the data environment: Keboola basically copies the data from different sources into a single working environment, so the team can extract just the minimum data needed for the analysis at the minimum needed level of granularity. By doing this (say, by limiting the financial data to sales only, or by limiting the granularity to a user/day level), it is possible to simplify the data to a standard the whole team understands. And since you can create multiple projects, there is no problem creating multiple context-based workspaces that share data with each other, again limiting the number of concurrent ETL processes run by different teams.
- It limits the technology needed: the problem with extracting data from, say, 50 sources is that someone needs to gain access to each data source, understand how to make an export (and thus understand the source's data structure), and then learn the technical process for the extraction, which can be anything from a click to an API call or a NoSQL query. This is complex, and you want to save time by doing it once. Keboola by design always copies the data from the source and converts it to a CSV file, which means that once the extraction is done and automated, a single person can access data from those 50 sources using just one environment and one technology. Second, Keboola supports a variety of standard technologies for transformations, such as R, Python and SQL, which in my opinion covers the most widely used technologies for data manipulation (see the sketch after this list).
- Individual sandboxing and operational safety: in Keboola you always work on copied data, so the risk of interfering with the source system is limited to the initial data upload. Once the data are there, there is virtually no way to screw things up. When working on transformations, Keboola provisions a temporary sandbox for each user; it basically downloads a copy of the copied data into an independent sandbox where the user can manipulate it without fear of interfering with any other running process. The Snowflake-based backend is astonishingly fast, so it actually gives the user the power to work with big data from the browser.
- Trust: this one is the hardest to tackle, since it mostly depends on the people working on the project. I believe trust can be achieved by the team using the tool correctly and having a clear vision of the model they want to create, but currently Keboola does not offer much to support cooperative user behavior by design. This could be achieved fairly easily by better documentation features for the stored datasets and by promoting good behavior and best practices through some UX tweaks, and I believe it is something that will come in the not-so-distant future.
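As a minimal sketch of the transformation point above: a small script that runs once, in the shared project, over data the platform has already copied and staged as CSV. The input and output paths and the table names are illustrative, not the platform's exact layout.

```python
import pandas as pd

# Read a staged copy of the source data (illustrative path and file name).
raw = pd.read_csv("in/tables/crm_contacts.csv")

# Typical cleaning done once in the shared project instead of on every
# analyst's laptop: normalize emails and drop duplicate contacts.
clean = (
    raw
    .assign(email=raw["email"].str.strip().str.lower())
    .drop_duplicates(subset=["email"])
)

# Write the cleaned table back to shared storage, where teammates can reuse
# it or send it on to a visualization tool such as GoodData.
clean.to_csv("out/tables/crm_contacts_clean.csv", index=False)
```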
Some other tools: comparisons and notes
Here is some tech I have had the opportunity to test lately, and some thoughts about it:
Alteryx: a tool designed to help end users do transformations on their own computers with a drag-and-drop interface.
In my opinion it has two flaws: first, you do not want end users doing the transformations, and second, you do not want them using their laptops for it.
As far as I have had the chance to use this tool, it has turned out to be a trap: it lures you in for a quick transformation, but it ends in a lot of pain. Automation with it is a nightmare, since it works similarly to Tableau in that it forces you to create a workflow on your laptop and then upload it to an Alteryx server that replicates the workflow online. In the end you find yourself checking whether the server uses the same drivers as your PC and whether it runs the same version, not to mention that it again forces you to keep offline data on your PC for testing purposes. Debugging is also a problem.
Tableau: great for ad hoc analysis that can be thrown away after use. If you need to analyze a single dataset and you are pretty sure you will not need to revisit the analysis later or collaborate on it, it is the way to go. I believe the best use is in combination with Alteryx, for people who receive data exports they have to work on and then hand off, such as external consultants. For instance, it is a great time saver when you are doing data exploration prior to building a statistical model.
R: the good thing about R is that everything can be done with it, since it is a programming language; that is also the bad thing, because it is complicated and time-consuming. The advantage of tools like Power BI, Tableau and GoodData is that once the data are prepared, they can be explored and analyzed without further manipulation. That is not the case with R: creating a subset in R means creating a new variable and writing code to do it, and every package and function has different requirements on its data inputs, which again means preparing the data and, worse, double-checking for errors to make sure the data are correct and comparable across several functions.
Power BI: still immature, but there are several things I love about this tool. The DAX language provides a mature framework for creating metrics, with fuzzy search and several other functions borrowed from Excel. Then the community: Power BI has, in my opinion, skyrocketed thanks to the fact that the community can create and actually share custom visualizations and scripts for the platform. Then the R support: many of the problems with R can be mitigated by its integration with Power BI; you can do a quick exploration in Power BI and then drag and drop the relevant attributes from Power BI into the R window, which manages the process mentioned above. It feels like a beta, but the idea is very promising. The bad: the data model is supported only in the desktop version, which seems artificially limited to 250,000 rows per dataset, making it unusable for most cases. The online version is highly dependent on Azure, making it more a good tool for analyzing data on the Microsoft cloud than a truly independent data analysis tool.
Looker: I have only had the chance to see the demo, and it seems it does not bring anything new to town. Looker does what the name says: it creates a supporting layer over a database, enabling you to predefine metrics using SQL and store these predefined metrics in an environment accessible to the users. First of all, to use Looker you need to already have the data cleaned and stored in one database, so it has to operate over a data warehouse or with an additional ETL tool with independent storage, like Keboola Connection. Second, metric creation is really complicated, and changes to metrics mean revisiting the underlying data. To my understanding, that is a process that requires a BI expert even for simple manipulations of dashboards, making self-service BI impossible. And since the computing power is provided by the database run by the client underneath Looker, I believe this brings operational concerns for the end user as well.
Bime: works very similarly to Looker. It reads data over a database, and the heavy hand of a BI expert is needed for data analysis, so the same applies.
RapidMiner: I love it. The data manipulation part is a bit hard, but once the single table is loaded, you can use it like Tableau for statistical exploration over a dataset, as in R but more quickly. The documentation is well done, so you can fairly safely try previously unknown statistical methods, as it guides you through the best practices. A novelty is the recommendation function that suggests tools used by the community while you design a process. I talked with their sales rep recently and I feel this tool is in a bit of trouble, since it provides services every data scientist wants but will not be willing to pay for: if I can use RapidMiner, I can use R as well, which is free, with just somewhat more discomfort. The features the business is willing to pay for, such as maintaining a reporting/data environment, are completely missing.
GoodData: the good: MAQL, a query language like DAX but more abstract, enabling you to write a lot of complex metrics in a few lines of code; and the data model, as mentioned above. The bad: I cannot make a histogram without revisiting the underlying data (come on guys, really? Why can't I use a metric as an attribute?), and the lack of development in visualizations makes GoodData look like a poor relation compared to Tableau.