BigQuery, Looker and Big Data’s Rediscovery of Data Warehousing and Semantic Models … at Google Scale
Hadoop solved two problems data warehouse owners faced in the mid-2000’s; how to handle the explosion in data volumes brought on by the digitisition of business, and how to bring agility and flexibility into the way data was landed, processed and made available to users.
Traditional data warehousing database server technology got more expensive and unreliable the more nodes you added to a cluster whereas Hadoop became more resilient, and as open-source technology it was free to download and run on existing commodity servers. NoSQL databases emphasised speed over immediate consistency and offered storage options other than just tables and columns for applications that needed to load data in real-time and datatypes were increasingly unstructured and non-relational.
The Hadoop and NoSQL ecosystem grew over the next ten years to include technologies such Hive, Kafka and Sqoop that took concepts well-established in the database world and rethought them for the distributed, polyglot datasource and flexible-processing world of big data.
Most, like Hive, have areas of functionality clearly still years behind their commercial equivalent but focus instead on new possibilities opened-up by Hadoop’s flexible architecture and the ability of its open-source developer community to rapidly iterate and fill-in missing features as needs became apparent.
Step-forward to two-or-three years ago and most large commercial organizations had big data initiatives of one form or another hosted in their data centres, using Hadoop distributions from the likes of Cloudera and Hortonworks on white-box hardware or dedicated server appliances from Oracle and IBM. In-house teams managed the often hundreds or thousands of nodes in a cluster, and over time reference architectures emerged that added governance and workload-separation concepts demanded by enterprise customers, along with design patterns that bridged the gap between fast-moving streaming data sources and the slower, batch-oriented world of traditional on-premise data-processing.
And so Hadoop and NoSQL technologies increasingly became the default choice organizations made when storing and organizing data for query and analysis as they were cheaper, more flexible in-terms of how and when you processed data and were orders of magnitude cheaper to license and provision than the high-end relational databases servers they’d used up until. I’m giving at talk on this topic later this week at the UK Oracle User Group’s Database SIG in London, and the problems each Hadoop component solved and the workloads that still make sense to run on traditional relational database technologies are key concepts DBAs and developers need to be aware of when thinking about these new technologies.
But as with any new technology that makes an older one obsolete Hadoop and NoSQL introduced their own new problems, and as we discussed in the Drill to Detail Podcast episode featuring colleague Alex Olivier in this case they were around the complexity and effort required to manage clusters containing thousands of nodes running dozens of services doing so economically and securely, and in addition the cognitive overhead put on users by Hadoop and NoSQL’s schema-on-read storage meant pushing the burden of structuring and interpreting data onto users who didn’t have the skills or time do this effectively.
The first generation of cloud platforms for running big data workloads typically took what was done on-premise and lifted-and-shifted that workload into the cloud replacing physical servers with virtual ones, giving system owners more capacity and flexibility around server provisioning but still requiring the customer to own and manage the environment themselves. The first cloud database server platforms were most likely to be straight ports of that vendor’s on-premise product hosted in VMs running on that cloud vendor’s Infrastructure-as-a-Service (Iaas) platform, but a new generation of elastic, cloud-native data warehouse Platform-as-a-Service products such as Google BigQuery and Amazon Athena bring together the benefits of cloud and big data scale, come with their own similarly elastically-provisioned data processing and data ingestion services while re-introducing concepts such as tabular storage and SQL access and
On-premise technologies such as Apache Kudu started this move towards structuring the storage big data systems used for analytic workloads as tables and columns rather than schema-less filesystems, but BigQuery took this mainstream to the point where organizations such as Qubit, the London startup I featured on the Drill to Detail episode and where I’m currently responsible for their analytics product strategy, use it to ingest, process and stored hundreds of terabytes of customer event data each day and store it neatly organized into tables, columns, projects and industry-specific business views.
If you logged into BigQuery’s SQL web interface and didn’t know about the massively-parallel big data architecture BigQuery took from Dremel you’d assume it was just another database server, and more importantly you can get on with your analysis rather than struggling with JSON SerDes and Hive EXPLODE and LATERAL VIEW clauses to decode your schema-on-read dataset. In today’s service-based PaaS cloud analytics platforms the two previously distinct relational data warehousing and Hadoop distributed processing technologies are converging with cloud storage and elastically-provisioned compute and storage to give us something new and potentially very interesting … just as new BI tool startup Looker coming out of the big data and e-commerce space has begun introducing a new generation of software developers and data architects to the benefits of BI data modeling, semantic models and data-layer abstraction.
The trouble with using tools such as Tableau and Oracle Data Vizualization Desktop against even tabular-structured data sources like Google BigQuery for business-user analysis is that they treat each report as it’s own silo of data; users can’t easily navigate from one area of data to another as reports are based on single data extracts of metrics and attributes from one subject area designed to make data visualizations and data discovery simple for departmental users.
Traditional enterprise BI tools like Business Objects and Cognos do give users the ability to create rich, integrated semantic business models over platforms such Google BigQuery using point-and-click graphical tools, but in the kinds of startups and engineering-led early-adopter companies implementing BigQuery and wanting to structure and join-together reporting data as create enterprise semantic models they’re more interested in working with markup languages, Git repositories and doing it all in code.
LookML, a language developed by Looker for describing dimensions, aggregates, calculations and data relationships in a SQL database is used to define semantic models in Looker’s BI tool and is aimed squarely at the software engineers typically working on the e-commerce and big data platforms that use BigQuery to host and analyze big data customer activity datasets. The code snippet below shows some simple LookML used to define an orders view in a Looker semantic model, and developers are encouraged to share code examples on Looker’s community Looker Blocks Directory.
We’re using Looker at Qubit to create semantic models for use with Live Tap, an analytics service that runs on top of Google BigQuery that marketers use to analyze and explore the event-level customer activity datasets we build for them. Although BigQuery provides SQL access and a web UI for running queries we knew marketers would prefer to use a graphical BI tool, and Looker and the sample code we’re planning on sharing on Github for customers to use and extend in their own Looker environments means we can present all of their data joined-up, with meaningful business names for metrics and dimension attributes and organized into subject areas that hide the complexity of the underlying BigQuery table structures.
And so the story ends up going full loop; big data solves the scale and agility issues that were holding back data warehouses, but added complexity that in-time was addressed by running that infrastructure in the cloud as a serviceeventually was addressed by running it as a service in the cloud, and now in-turn adopts some of the characteristics of the platforms it usurped to make developers more productive and help end-users understand and analyze all their corporate data as one big joined-up dataset.
Google Cloud’s Next developer event is running in San Francisco in a couple of week’s time and it’ll be interesting to see what’s coming next for BigQuery, and in another case of big data adopting traditional database capabilities I’ll definately be interested in hearing about Google Cloud Spanner — and how it could do for OLTP what MapReduce and Dremel did for our data warehousing relational databases.