Putting the Horse back before the Cart in Data Science

Andrew Zolnai
Zolnai.ca
Jul 11, 2017
Know your terrain (Queensland 1966)

Ian White recently said, “…the biggest limitation of what I refer to as ‘data science thinking’ [is] letting technical skills drive the analysis, only later incorporating domain understanding.”

He likely didn’t know this is the week of the ESRI User Conference, where the premier GIS (geographic information systems) vendor touts “The Science of Where”. That #EsriUC trends at #2, ahead of #Wimbledon, attests to its importance at least to the twitterati. But as the creator (if not the inventor) of GIS, ESRI de facto puts the tech first, even if it does listen to its customers (as I can attest as a former industry manager there).

As a geologist who turned to petroleum, then to mapping, including GIS and business processes, I likewise cannot overstate the importance of ‘métier’, or domain knowledge.

Is Data Science thus not facing the same conundrum? Matching technical input with professional output is especially hard when cross-industry fertilisation happens. Amazon Web Services is said to hire senior engineers in order to better understand the “output”, whilst junior hotshots take care of the coding, or input. Many bemoaned the loss of data-modelling expertise in the new wave of Agile development: the attraction of quick wins and the facility of ‘rapid app dev’ did put the proverbial cart before the horse. As Randall Shane quipped, “technical debt never gets paid back”.

Standards

Know your meaning (Bahrain 2002)

That is why I helped start the Professional Petroleum Data Model (PPDM) 25 years ago, why I follow the Open Geospatial Consortium (OGC) as it evolves across themes, and why I recently participated in Energistics metadata specifications.

These are all efforts to codify the ‘métier’ into something coders and data managers can take up as input, and professionals and executives can understand as output. But as Sun Tzu said, “battles are won before entering the battlefield”: unless the data is properly understood and structured, no amount of massaging will turn it into actionable information.

Metadata

Know your script (bl.uk)

And don’t let the list above restrict you: if those data models grew from the structured RDBMS world, that doesn’t invalidate them in the unstructured world! Every data scientist knows that not all data can be codified, never mind normalised. The proverbial ‘invisible elephant in the room’ is exactly that: the data that is unstructured and cannot be pegged anywhere.

But do you leave anything unstructured on the cutting-room floor? No! If you do your homework and properly document the metadata (the seemingly thankless task of understanding and documenting the data-about-the-data), you discover what can be normalised and what cannot. That is also where you discover the ‘métier incontournable’: the domain knowledge that cannot be done without.
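What such metadata bookkeeping might look like can be sketched in a few lines of Python; the file names, the `describe` helper and the MIME-based structured/unstructured split below are purely illustrative assumptions on my part, not any PPDM or Energistics specification:

```python
import mimetypes
import os
import tempfile
from datetime import datetime, timezone

def describe(path):
    """Record the data-about-the-data for one file: the seemingly
    thankless bookkeeping that later tells you what can be normalised."""
    stat = os.stat(path)
    mime, _ = mimetypes.guess_type(path)
    return {
        "path": path,
        "bytes": stat.st_size,
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "mime": mime or "application/octet-stream",
        # Crude illustrative rule: only well-known tabular/record formats
        # count as candidates for normalisation.
        "structured": mime in ("text/csv", "application/json"),
    }

# Demo on a throwaway directory holding one structured and one opaque file.
with tempfile.TemporaryDirectory() as root:
    for name, payload in [("wells.csv", b"id,depth\n1,2500\n"), ("scan.bin", b"\x00\x01")]:
        with open(os.path.join(root, name), "wb") as f:
            f.write(payload)
    catalogue = [describe(os.path.join(root, n)) for n in sorted(os.listdir(root))]
```

Even this crude catalogue already separates the normalisable (a CSV with known columns) from the opaque (a binary scan), which is precisely the triage the metadata exercise enables.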

Information Supply Chain

Know your team (Baker Lake 1986)

Standards and metadata are all well and good, but how do you wrap your mind around them? I won’t go over the abundance of tools and practices available (in petroleum, the Petroleum Network Education Conference (PNEC) and Oil IT are the go-to places), but rather focus on a Software as a Service (SaaS) platform that enables all of that.

I partnered with LINQ Ltd. because their framework lets you, um, link:

  • People — staffing hours, costs and dependencies
  • Process — data and software “boxes” to drop into drag-and-drop workflows, including the above
  • Technology — the means to tie together all of the above, where focusing on process helps blend technologies as the business outcome requires

Whilst the online workspace is easy and intuitive, only domain knowledge can accurately drive the specs and then the workflows, for both structured and unstructured data.

This was presented at PESGB PETEX last year, and at Finding Petroleum in London last year and this year; the Digital Energy Journal article at the bottom gives some details.

Structured or not?

ImageSorter v.4.3.1 beta (2012)

Above is an attempt to sort 15,000 family slides by colour aspect. This is quintessential unstructured data: the slides, scanned about a decade ago, don’t even have any metadata on them… No amount of ‘métier’ will help here!
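For the curious, colour sorting of this kind can be sketched in a few lines of Python; the slide names and their average RGB colours below are made-up stand-ins, since a real pipeline would first reduce each scanned image to a representative colour:

```python
import colorsys

# Hypothetical stand-in data: each "slide" reduced to its average RGB colour.
# A real scan pipeline would compute this from the image pixels.
slides = {
    "slide_001.jpg": (200, 40, 30),   # reddish
    "slide_002.jpg": (30, 180, 60),   # greenish
    "slide_003.jpg": (40, 60, 210),   # bluish
    "slide_004.jpg": (220, 200, 50),  # yellowish
}

def hue_key(rgb):
    """Map an (R, G, B) tuple to its hue in [0, 1) for sorting."""
    r, g, b = (c / 255.0 for c in rgb)
    h, _lightness, _saturation = colorsys.rgb_to_hls(r, g, b)
    return h

# Sort slide names along the colour wheel: red -> yellow -> green -> blue.
by_colour = sorted(slides, key=lambda name: hue_key(slides[name]))
```

Sorting by hue is one defensible choice among many; sorting by lightness or saturation would yield a different but equally ‘structureless’ ordering, which is rather the point.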

Most corporations do, however, have unstructured data that can be sorted via its metadata. By the same token, they will not trust anyone to put their core assets online, period… No problem! I also teamed up with Pinker Find, who are taxonomy experts and index files in situ. It’s a SaaS service too, only it’s all held on-premise, so that nothing leaves corporate vaults. And like LINQ, its freemium costing model helps the two work together very easily. So you get the best of both worlds:

  • a non-core workflow-generation process that’s online and benefits from an evolving framework
  • a core asset-cataloguing service that’s on-premise and preserves the integrity of client data

Watch this space for Sterling Geo’s Smart M.App, also on a freemium model.

Data Science redux

This is a very simple case of new tech helping businesses elucidate their business processes. The more complex cases in the previous paragraphs indicate there is a whole range of opportunities to apply Data Science, but fit-for-purpose and driven by business needs, not technical capacity. That truly puts corporations back in the driver’s seat, helping them benefit from complex tech such as Ian White proposes, without overwhelming them either. As a business friend just told me, practical is what businesses need, isn’t it?

Digital Energy Journal article on Slideshare:
