Data just happens, right?!

Hamish Mogan
Jan 15, 2017 · 7 min read

I was browsing through the weekend edition of the national financial paper here in Australia when I stumbled upon an interview with the Managing Director of a company spruiking the virtues of state-of-the-art analytics and artificial intelligence (A.I.) in directing their customers to the right opportunities.

I read with interest, as I am good friends with the person who runs the analytics function for this group of companies, and noticed that the team itself only got one mention, in passing. As I read on, I felt a sense of shared pain that twins must experience when one of the pair is in agony.

I know my friend wasn’t looking for a full page bio, complete with cheerleaders, pom-poms and a marching band, trumpeting the uber-elite skills of his crack team of analysts, data engineers and data scientists. But, it’d be nice, right? The best leaders I’ve seen have always made sure that the limelight falls on the foot soldiers that make the magic happen.

That got me thinking. Right now, the mass media are all salivating about the commercial application of artificial intelligence, from ordering pizzas through to travel and generic customer service. Machine learning is being referred to as this elusive unicorn from another dimension that pisses rainbows and farts out optimised models that print money. I wish it were that easy!

But the fact is, one of the key pillars of success for machine learning is an impeccably clean data pipeline, a pipeline that takes many, many hands of different skills to plumb. Without this asset, organisations are simply rubbing two random sticks together, hoping like crazy they'll generate enough friction for a fire.

And just like our healthcare system where the doctor is attributed with the majority of the accolades in delivering lifesaving treatments (*), or the head chef at the Michelin star restaurant in serving up the incredible meal, this magic unicorn, and its valuable excrement, is merely the headline to an often complicated piece of machinery.

I'm sure you've heard of the term "Garbage in, Garbage out", but if not, it refers to a system that takes an input, performs a calculation or transformation, and then produces an output which has some mathematical or logical relationship to the input. If the quality of the input is not curated and sculpted to be exactly as intended and is therefore sub-optimal, the machine will do exactly what you are asking it to do, and will deliver a sub-optimal outcome.
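A toy sketch makes the point. The data and the -999 missing-value sentinel below are entirely hypothetical, but the pattern is common in source systems: the calculation is perfectly correct, and the output is still garbage.

```python
# Garbage in, garbage out: the same correct calculation, two qualities of input.
# Assume an upstream system encoded missing revenue as -999 (hypothetical data).

clean_revenue = [120.0, 95.5, 130.0, 101.2]
dirty_revenue = [120.0, 95.5, -999.0, 101.2]  # -999 is a missing-value sentinel

def average(values):
    """Does exactly what it is asked to do, no more, no less."""
    return sum(values) / len(values)

print(average(clean_revenue))  # 111.675 -- a sensible answer
print(average(dirty_revenue))  # -170.575 -- garbage out, despite correct code
```

The machine did its job both times; only the curation of the input differed.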

In the world of big data, avoiding Garbage in, Garbage out is really the secret sauce that produces the elusive rainbow, so let's look at the process a little further, as my team here at Vocus Group (you would know us from our brands in the Australian ISP market, Dodo, iPrimus and Commander, and our Corporate and Wholesale division) is deep in the weeds of delivering the river of gold that is the data pipeline.

1. Victor Frankenstein is your DBA

Just like Mary Shelley's book "Frankenstein", where Victor Frankenstein creates a grotesque monster, your Source Systems have often taken on characteristics and functionality as and when they were required, with little planning to integrate this functionality into the core of the system, resulting in a data model that resembles a patchwork quilt. This may sound harsh and disparaging to the DBAs and application designers out there, but it's (often) not their fault: the functionality that causes abrupt or significant change to the schemas or originally intended structures is often critical in nature, be it regulatory or profit driven, or security or privacy driven.

Regardless of the reason, data can go along happily minding its own business until one day, the world changes and now data needs to be structured and interpreted in a completely different manner. This sort of change event can be further complicated when the changes are not documented (and let's be frank here, how many of us are either so vigilant or have sufficient time to document every single change made to a model or schema?) and then the keepers of the knowledge, the person(s) who implemented the change, move on from the organisation.

2. “Business logic” is illogical

Oh, the number of times I've stepped into a dataset relating to customer activities, be they transactions or clickstream data, and tried to tie that together with our current "Active" customers. Lo and behold, the elusive simplifying field exists! customer.status. So you painstakingly join tables, group queries and then develop models and reports based on that field, only to find out it doesn't reconcile to anything!

Welcome to illogical business logic, whereby someone, someday, thought it would be a good idea to effectively "code" into the data fields that flag customers or transactions with certain attributes, to make dissemination of insights so much easier. But, alas, they didn't pass on that knowledge when they took the job at Google, and now no one knows what that piece of code in the ETL (extract, transform, load) process is actually supposed to do when it runs!
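Here is a minimal sketch of the trap. The status codes and the `ACTIVE_CODES` set are invented for illustration; the point is that the magic values only live in the departed author's head (or their ETL script), so the obvious query never reconciles.

```python
# "Illogical business logic": an undocumented convention overloads
# customer.status with magic values. All names and codes are hypothetical.

customers = [
    {"id": 1, "status": "ACTIVE"},
    {"id": 2, "status": "ACTIVE_PENDING_CANCEL"},  # active? depends who you ask
    {"id": 3, "status": "A"},       # legacy code from the old billing system
    {"id": 4, "status": "CLOSED"},
]

# The naive report: filter on the one obvious value...
naive_active = [c for c in customers if c["status"] == "ACTIVE"]

# ...but the departed ETL author knew that "A" and "ACTIVE_PENDING_CANCEL"
# also count as active. Without that tribal knowledge, nothing reconciles.
ACTIVE_CODES = {"ACTIVE", "ACTIVE_PENDING_CANCEL", "A"}
true_active = [c for c in customers if c["status"] in ACTIVE_CODES]

print(len(naive_active), len(true_active))  # 1 vs 3 -- hence the mismatch
```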

3. It's easier to get a table at The Fat Duck!

ETL processes are either proprietary systems (Talend or Alteryx, for example) or custom-built scripts (normally written in Java or Python) whereby data is transported from a source system to a destination environment and undergoes a level of cleaning or transformation. If you are looking for the hero of the meal, the well-executed ETL process is the cherry on top of the sundae. The problem is, it needs time to run, and it needs to run on or next to the production system. If the use case for the data in question needs regular or even near real-time accurate data in the destination environment, the challenge can sometimes be the scarcity of resources on the source systems.
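Stripped to its bones, the extract-transform-load shape looks like this. This is an illustrative toy only (real pipelines use tools like Talend or far more robust scripts), with an in-memory CSV standing in for the source system, a made-up plan name, and SQLite standing in for the destination environment.

```python
# A minimal extract-transform-load sketch (illustrative only).
import csv
import io
import sqlite3

# Extract: read from the "source system" (here, an in-memory CSV stand-in).
source = io.StringIO("id,plan,monthly_spend\n1,dodo-nbn,79.90\n2,,59.90\n")
rows = list(csv.DictReader(source))

# Transform: clean and reshape -- drop rows missing a plan, cast the types.
cleaned = [
    (int(r["id"]), r["plan"].strip(), float(r["monthly_spend"]))
    for r in rows
    if r["plan"]
]

# Load: write into the destination environment (a local SQLite database).
dest = sqlite3.connect(":memory:")
dest.execute("CREATE TABLE customer (id INTEGER, plan TEXT, monthly_spend REAL)")
dest.executemany("INSERT INTO customer VALUES (?, ?, ?)", cleaned)

print(dest.execute("SELECT COUNT(*) FROM customer").fetchone()[0])  # 1 row loaded
```

Even in this toy, the transform step quietly encodes a business rule (rows without a plan are dropped), which is exactly the kind of decision that becomes "illogical business logic" if it goes undocumented.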

If I had a dollar for every time a DBA has told me "No no, you can't run that service now, the database is under incredible load and another process would bring the whole thing down", I would be a very rich man. The reality is, the DBA's job is to ensure that their database is highly available and has the highest of data integrity. So, kicking off the analytics team is not an altogether unreasonable starting position in the negotiation for time on the database. But, assuming the business has identified the reasons why they need this data and how they intend to use it, the game of chess proceeds with all eyes moving to the risk versus reward assessment.

The negotiation almost always finishes with the ETL process being allocated a small execution window at some ungodly hour of the night, and this process requires the Analytics team to summon all of their "Stakeholder Management Skills" to deliver a satisfactory answer to the risk versus reward argument.

4. Is that the best you got?!

Now that you have understood the data model and its changes over time, you've unwrapped and unwrapped and unwrapped the business logic away from the base data like some very expensive game of "parse-the-parcel", and you've negotiated the minefield of securing a slot for your ETL processes to run that would make achieving a global policy agreement on Climate Change look like a walk in the park, you now have your data. Well done.

The kick in the teeth can then be that you don't have enough data, or enough "quality" data. Machine learning models require extensive datasets to train on, and whilst deep learning (the other term rising up the "top terms used by the mass media that are poorly understood" list) offers transfer learning as a potential solution in its toolkit, a shortage of quality data can still mean the party is over.

Unfortunately, the challenges identified above, and they are far from an exhaustive list of those faced by professionals in their daily combat with data, are simply invisible to 99% of the organisations in which they occur, and to the leaders who rely on them for their decision-making capacity and market-differentiating sophisticated models. Those that toil away at the sharp end of these processes are often unsung and unheralded.

So raise a Red Bull or Jolt Cola with me in thanks for the tireless work of the Data Engineers, Data Architects, Cloud Engineers, Analysts and Data Scientists whose efforts enable the unicorn to piss rainbows. And whilst you are there, tip your lid to the DBAs and Application Support teams that are even further down in this process, as without their knowledge, patience and support, the music stops.

(*) With respect to Doctors, their job would not be possible without the selfless and heroic efforts of our paramedics, nurses, medical researchers and other supporting staff in our hospitals and universities. Before you hurl anything within reach at the screen or iPad that you are reading this on, I'm not saying Doctors do not perform an incredible, and also selfless, service for their communities. I am simply pointing to the reality that our ambos and nurses are not rewarded in line with the value they provide. I would also like to point out that our Teachers and Lecturers, who craft and shape young minds as they toil away in learning their crafts, are also severely under-rewarded. It's not right. Anyway, back to data…

Hamish Mogan

Written by

Analytics and Big Data leader who is passionate about the role analytical practices plays in the value chain of small and big business.