Where to next with Reporting

Valberg Larusson
The Cloud Builders Guild
4 min read · Sep 14, 2018

Data has changed. We used to have all our data stored in relational tables sitting on local network infrastructure. From there we would schedule data collation from multiple sources and ETL it into another set of relational tables for reporting. Those days are finally gone.

Reporting can now come from more sources than just a set of relational databases

The two fundamental changes at a practical level are that data now comes both structured and unstructured, and that the source is usually no longer on the local network. There is also more data than there was before, and options for linking or even supplementing datasets have become more widely available.

The most exciting part, though, is that the tools and methods we have for working with the data have evolved very rapidly in the last few years: in particular, methods to manage unstructured data in a reliable way and the ability to manage vast amounts of data in ‘lakes’ rather than ‘warehouses’.

Lakes and Warehouses

Let’s start by describing the difference between the concepts of a Data Lake and a Data Warehouse. A Data Lake is a place where you can store both your structured and unstructured data. A Data Warehouse only works well with structured data.

Systems that provide the ability to work with data lakes also have a specific focus on large volumes of data and varied types of data sources and formats. Data in a lake can consist of anything from a relational database to Excel spreadsheets or log files. Lake systems are particularly useful for managing data from NoSQL or document-centric databases.

The challenge of semi- and unstructured data

Semi-structured data, such as document-centric NoSQL data, JSON objects and XML files or feeds, consists of data sets that have some structure to them but do not apply that structure consistently. This variability causes problems for relational database systems, which depend on a fixed, consistent schema.

To illustrate let’s imagine that a data warehouse takes the following table as source input into its process:

Name | Phone | Address

.. then one day the same table looks like this:

Name | Address | Email

Not only has the phone number column been omitted, but a new column has been added and the Address is now in a new location. A relational system is programmed against a fixed table layout; if you change the table, the system simply lacks the instructions to process it reliably, so it fails.
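
To make that concrete, here is a minimal Python sketch (the column names, sample rows and loader are hypothetical) of an ETL step hard-coded against the first layout:

```python
import csv
import io

# A loader hard-coded against the original layout: Name | Phone | Address.
EXPECTED_COLUMNS = ["Name", "Phone", "Address"]

def load_contacts(csv_text):
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        # The pipeline has no instructions for any other layout, so it fails.
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)

day_one = "Name,Phone,Address\nAda,555-0100,1 Lake St\n"
day_two = "Name,Address,Email\nAda,1 Lake St,ada@example.com\n"

load_contacts(day_one)  # works
load_contacts(day_two)  # raises ValueError: the nightly ETL job breaks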

Semi-structured data also often arises when integrating several relational database sources. When the same type of data, such as an address, is recorded with different structure definitions in the relational source systems, joining that data produces a result with no clearly defined structure.

With semi-structured data sets, however, this problem is the norm. XML files only contain tags for which there are values available, and such feeds change on a regular basis. Web services may change over time, and the JSON strings they return can look very different from one day to the next.
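
For example, two XML feeds of the “same” record can carry different tags, so a parser has to treat every field as optional (a small sketch with made-up feeds):

```python
import xml.etree.ElementTree as ET

# Two feeds of the "same" record type: tags appear only when a value
# exists, so every field has to be treated as optional.
day_one = "<contact><name>Ada</name><phone>555-0100</phone></contact>"
day_two = "<contact><name>Ada</name><email>ada@example.com</email></contact>"

for xml_text in (day_one, day_two):
    contact = ET.fromstring(xml_text)
    # findtext() returns None when the tag is missing instead of raising.
    print(contact.findtext("name"), contact.findtext("phone"),
          contact.findtext("email"))
```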

This flexibility is important to application developers. You no longer have to stick with the data schema that was put in place years ago; you can reconfigure the structure of the data to adapt your application more quickly to changing requirements, and create the relationships after the fact as you need them.

Querying semi-structured data with SQL-like commands

Deriving insights from data that does not have a reliable structure to it requires a different approach than relational reporting. The querying language has to be more flexible than traditional SQL. One option is object-oriented: create an object from each record, which lets you query the data with almost no prior knowledge of its structure.
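
A minimal sketch of that idea in Python, using plain dicts as the per-record objects (the records themselves are made up):

```python
# Each record becomes an object (here, a plain dict); queries use .get(),
# so a missing field comes back as None instead of breaking the query.
records = [
    {"Name": "Ada", "Phone": "555-0100", "Address": "1 Lake St"},
    {"Name": "Grace", "Address": "2 Bay Rd", "Email": "grace@example.com"},
]

# "Find everyone with an email address" works even though only some
# records carry that field, and no schema was ever declared up front.
with_email = [r for r in records if r.get("Email") is not None]
print([r["Name"] for r in with_email])  # ['Grace']
```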

Another is to re-structure the data by providing a schema to apply to your data set. This way you can use a SQL-like language to query the data. An example of this is the open-source project https://github.com/Puchaczov/Musoq.
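
Musoq has its own SQL dialect; as a stand-in for the general idea, here is the same schema-on-read pattern sketched with PySpark, where you impose a schema only at query time (the schema fields and file path are assumptions for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Schema-on-read: the JSON lines stay as they are on disk; the structure
# is imposed at query time. Fields absent from a record come back as NULL.
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Address", StringType(), True),
    StructField("Email", StringType(), True),
])

df = spark.read.schema(schema).json("contacts/*.json")  # hypothetical path
df.createOrReplaceTempView("contacts")

spark.sql("SELECT Name, Email FROM contacts WHERE Email IS NOT NULL").show()
```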

It’s in the cloud, man

Today many of the systems we use live in a centralised data centre rather than on our own network. This may mean that the best place to process the data for reporting is not on your own network either, but rather in the “cloud” where your source systems reside.

This can make managed data lake offerings an attractive option. One such option is Snowflake. Another option is to build your own serverless solution using AWS services like S3, Glue and Kinesis. Microsoft Azure also offers a managed service called Azure Data Lake, a Hadoop and Spark environment that Azure manages for you (it’s good to see Microsoft so fully embrace Open Source now).
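
As a rough sketch of what the AWS route can look like, here is a Glue job (PySpark) reading semi-structured data that a Glue crawler has already catalogued from S3; the database and table names are hypothetical placeholders:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# DynamicFrames are Glue's schema-flexible cousin of DataFrames: each
# record carries its own schema, so inconsistent fields don't fail the job.
contacts = glue_context.create_dynamic_frame.from_catalog(
    database="reporting_lake",   # hypothetical catalog database
    table_name="contacts_raw",   # hypothetical crawled S3 table
)

# Convert to a regular DataFrame to run plain Spark SQL over it.
contacts.toDF().createOrReplaceTempView("contacts")
glue_context.spark_session.sql(
    "SELECT Name, COUNT(*) AS n FROM contacts GROUP BY Name"
).show()
```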

You can also run your own set-up of Hadoop and Spark, either in the cloud or on your own hardware. This can be a good option when your project is highly exploratory and you are not planning to run it as a productionised service that needs to be maintained and supported. It is also a good solution when your data cannot be moved off your network for whatever reason.
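
A local, exploratory set-up can be as small as this sketch (the file path and the level field are made up):

```python
from pyspark.sql import SparkSession

# A local Spark session: everything runs on your own machine, so the data
# never leaves your network.
spark = (SparkSession.builder
         .master("local[*]")           # use all local cores, no cluster
         .appName("exploration")
         .getOrCreate())

logs = spark.read.json("exports/app-logs/*.json")
logs.printSchema()                     # Spark infers a schema from the data
logs.groupBy("level").count().show()   # quick exploratory aggregate
```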

So where to?

Going forward, businesses need a solution that can handle varied, incomplete sets of data in a reliable and consistent way. They need to be able to add all those random spreadsheets and purchased data streams to the pool, as well as the data from core business systems. They need to be able to scale up quickly when a marketing campaign is about to launch or research hits a critical stage, and they don’t want to pay for hardware sitting dormant when the system is not in high demand.

We have outgrown the era of the traditional Data Warehouse for reporting and entered the era of cloud-enabled Data Lakes.
