Photo via Pixabay, Gerd Altmann

CIO’s Priorities (part 2 — Data Strategy)

As one of my economics professors once said, everyone has an ideology. In the software and technology industry, just like everywhere else, it is common for people to make decisions based on their ideology rather than data. This is at least in part because it’s easier and quicker, but it’s also attributable to the fact that most organizations do a very poor job of understanding, collating, analyzing and exposing the data they have available to them.

In my 15+ years in this industry I’ve seen everything from buying equipment and software to layoffs and restructuring be driven by ideology and emotion. It’s rather astounding that in a scientifically aligned industry, where we often trumpet how smart we are, we still make decisions in such inefficient and non-analytical ways.

I’m sure you’ve witnessed the same variety of ideologically driven decisions. You might even be thinking: “That decision making got us this far, so what’s the big deal?”

There are two problems with using ideology and emotion to drive decision making in business (and life, but that’s another topic!). The first is that the whole economy is evolving to a point where the highest performing companies are using data to make informed decisions about everything from font colors to self-driving cars.

The second is that while the emotionally and ideologically informed decision may appear to be lower stress and quicker to execute, it actually undercuts foundational human characteristics. People want to be informed and they want transparency. Consider trying to explain why you rejected funding a project. If the decision was made because of your personal opinion (ideological or emotional), both you and the person whose budget request you rejected will feel pretty crappy in the long run. If you made the same rejection based on data, both you and the rejected party would feel better about the decision and be better informed about how to make the request successful next time. This applies to everything.

Your competitors are either already doing this or will be soon, even if you and your competition are in a ‘legacy’ economic market. This doesn’t just apply to startups and tech companies. It applies to hiring, marketing, purchasing, M&A and almost every transaction and revenue stream you can imagine in business.

You need to shift your decisions from being emotionally and ideologically driven to being analytical to stay competitive, and in order to do that you need to be more informed and have more data available for analysis.

With all of this in mind it’s no surprise that most CIOs selected developing a Data Strategy as their number 1 priority this year, even though the topic has been at the forefront of many leaders’ minds for years now. Why the shift from side project to number 1 strategic priority? Timing and technology have aligned.

With the explosion of data showing no signs of slowing down (it won’t in our lifetimes) and a large number of Open Source and proprietary toolsets being developed across the spectrum, now is the time to develop the initial strategy for your organization to succeed in using data to make informed decisions moving forward.

My recommended approach to implementing a Data Strategy is broken down into two phases. The first is the preparation and acquisition of data, namely finding it and classifying it. The second is processing and leveraging the data while possibly (depending on use case) exposing it to other internal teams and 3rd parties. I focus mostly on the first phase here, as this is where many struggle. In any case this strategy will need to be a living part of your organization. It should grow as your understanding of your situation evolves, and as technological advances and market changes over the next few years force adaptation. Even so, the current state of the industry allows an initial “1.0” version to address most of the core issues organizations face while developing a data strategy.

Identification & Classification

Within your four walls there is a wealth of data. Some of this data sits in legacy databases, and some of it is stored in documents on laptops or in a document management system. The first, and arguably hardest, part of developing a data strategy is understanding your current situation. Are your data sources many small “data puddles”, or a few large “data lakes”? Finding all of your (valuable) data in a large enterprise can be a long project, so try to break this down into bite-sized chunks and get easy, early wins.

Aim your initial project towards serving an area of the business which is underserved. If you have good customer analytics, help recruiting. If you have a great M&A team, focus on marketing. As you make that team more successful and alleviate its significant burdens, it becomes your champion.

Once you have identified a few key data sources you should start working on a classification matrix or nomenclature. This will allow you to label data sources or data elements based on their criticality, availability, purpose, and potential use cases. For example, logs from your customer facing APIs and websites could be labeled “mission critical, customer, buyer behavior”, depending on the nomenclature you implement. These labels can make running analytics and ‘finding’ relevant data sources much easier for the data scientists and analysts later on.
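To make the labeling idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `DataSource` fields, the label values and the source names are hypothetical, and a real catalog would more likely live in a metadata tool than in a script.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """One entry in a hypothetical data catalog."""
    name: str
    criticality: str            # e.g. "mission critical", "internal"
    domain: str                 # e.g. "customer", "recruiting"
    use_cases: set[str] = field(default_factory=set)


def find_sources(catalog, *, criticality=None, domain=None, use_case=None):
    """Return the sources that match every label that was supplied."""
    matches = []
    for source in catalog:
        if criticality is not None and source.criticality != criticality:
            continue
        if domain is not None and source.domain != domain:
            continue
        if use_case is not None and use_case not in source.use_cases:
            continue
        matches.append(source)
    return matches


# Illustrative entries only; the names and labels are made up.
catalog = [
    DataSource("public-api-logs", "mission critical", "customer", {"buyer behavior"}),
    DataSource("website-logs", "mission critical", "customer", {"buyer behavior"}),
    DataSource("applicant-tracking-export", "internal", "recruiting", {"hiring funnel"}),
]

customer_sources = find_sources(catalog, domain="customer", use_case="buyer behavior")
print([s.name for s in customer_sources])
```

The point is less the code than the habit: once every source carries the same small set of labels, an analyst can ask for “customer data about buyer behavior” without knowing where it lives.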

A last note here is that you want data that is as close to the point of origin as possible, and if you can pull similar or relevant data from multiple sources that is even better. Just remember that running analysis on ‘mucked up’ data tends to produce skewed results.

Collection & Connectivity

Now that you have located and classified some data that you would like to work with, you’ll need to develop a plan to collect and connect that data to the systems running analysis. Usually analytics, deep learning and machine learning tools are deployed against datasets ‘offline’, or on a secondary cluster. This is to prevent any sort of production or customer facing service degradation in the primary environments. This isn’t always the case, as ‘real time’ analytics frameworks and tools often have safeguards in place, but when getting started, and for large data sets, ‘separation’ from production is a good thing.

The approach to take here depends on what your data landscape looks like. If you have multiple databases (Oracle, PostgreSQL etc.), a common step that you might want to take is unifying the view, or creating a common access tier. This can be accomplished using a toolset that either enables a single ‘view’ of the disparate data sources, which is then accessed by the processing tools, or caches some of the data from the sources for close to real time analytic purposes. There is also the good old ETL data warehousing approach. Common examples of tooling in these spaces include AWS Data Pipeline, Hazelcast, JBoss Data Grid & Virtualization and dozens of others.
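As a rough illustration of the ‘single view’ idea, the sketch below puts one query interface in front of two separate databases. It is a toy under stated assumptions: two in-memory SQLite databases stand in for the real backends, and the `UnifiedView` class, the table names and the shared (customer_id, amount) shape are all hypothetical. Commercial and open source virtualization tools do far more than this.

```python
import sqlite3


def make_source(rows, table):
    """Create an in-memory SQLite database standing in for one backend."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} (customer_id INTEGER, amount REAL)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()
    return conn


class UnifiedView:
    """Hypothetical access tier: one interface over several sources."""

    def __init__(self):
        self._sources = []  # list of (label, connection, query)

    def register(self, label, conn, query):
        # Each source supplies its own query that maps its schema onto
        # the shared (customer_id, amount) shape.
        self._sources.append((label, conn, query))

    def rows(self):
        """Yield (source_label, customer_id, amount) from every source."""
        for label, conn, query in self._sources:
            for customer_id, amount in conn.execute(query):
                yield label, customer_id, amount

    def total_by_customer(self):
        """Sum amounts per customer across all registered sources."""
        totals = {}
        for _, customer_id, amount in self.rows():
            totals[customer_id] = totals.get(customer_id, 0.0) + amount
        return totals


orders_db = make_source([(1, 20.0), (2, 5.0)], "orders")
invoices_db = make_source([(1, 10.0), (3, 7.5)], "invoices")

view = UnifiedView()
view.register("orders", orders_db, "SELECT customer_id, amount FROM orders")
view.register("invoices", invoices_db, "SELECT customer_id, amount FROM invoices")

print(view.total_by_customer())
```

The processing tools only ever talk to the view, so adding a third source later means registering one more query rather than changing every consumer.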

When evaluating tools in this space remember the key elements you will want to assess: flexibility (e.g. how many data sources, and of what types, can you access?), speed and processing method (e.g. real time vs batch), integration and of course scale. Your data is going to continue to grow and diversify. Make sure your tools can support that, and tie your initial tool selection to a mid-term roadmap.

Lastly, when talking about data access, it is common in many companies that endpoints are not properly connected. This results in missing out on high value information. Good examples of this are sensors on delivery trucks, mining equipment and other elements in the “internet of things”. If, for example, you are a repair company and have a fleet of trucks with GPS units and drivers with smart phones, you should be asking if you can collect and analyze that data somehow, and better empower your workforce with mobile applications rather than hard copy and phone calls (which is another topic entirely).

This is where integration and messaging come into play. Standardizing communication protocols, installing agents on the endpoint, centralizing logs and/or building an API are all typical approaches. As with any such endeavor, bear in mind the cost of maintaining a custom built solution vs buying a commercial offering, along with the pros and cons of using commercial open source. A good (but slightly old) example of using ActiveMQ in these scenarios can be viewed here. The main objective for right now is getting the data into the pipeline from the endpoints so that it can be leveraged for analysis.
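A small sketch of what ‘standardizing’ endpoint messages can look like, using the repair truck example from above. This is an illustration under assumptions: the envelope fields (`device_id`, `kind`, `sent_at`, `payload`) are a hypothetical schema, and Python’s in-process `queue.Queue` is only a stand-in for a real message broker.

```python
import json
import queue
from datetime import datetime, timezone


def make_message(device_id, kind, payload):
    """Wrap an endpoint reading in a shared JSON envelope (hypothetical schema)."""
    return json.dumps({
        "device_id": device_id,
        "kind": kind,
        "sent_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    })


def drain(broker):
    """Consume every queued message and return the decoded events."""
    events = []
    while True:
        try:
            raw = broker.get_nowait()
        except queue.Empty:
            break
        events.append(json.loads(raw))
    return events


# queue.Queue is only a stand-in for a real message broker.
broker = queue.Queue()
broker.put(make_message("truck-17", "gps", {"lat": 40.71, "lon": -74.0}))
broker.put(make_message("truck-17", "job_status", {"job": "A-102", "state": "done"}))

events = drain(broker)
print([e["kind"] for e in events])
```

Because every endpoint speaks the same envelope, the consuming side does not need to know whether a message came from a GPS unit or a driver’s phone until it looks at the `kind` field.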

Storage & Processing

A key element in the storage and processing part of your decision is whether you are going to be doing real time, batch or both. In the case of ‘real time’ your working data set is usually going to be significantly smaller than in a batch processing environment, and your tooling will be slightly different as well. Even though tools in both spaces use map/reduce as a standard, the actual implementation will differ depending on your approach.

Generally I advocate that clients start with batch processing, as real time processing introduces technical complexities that can slow down projects a fair bit at the start. Since their data is largely ‘untapped’, a wealth of information can be gleaned by starting with the easier approach provided by batch. This isn’t always the case, so make sure you understand what data is available to you now, and through what approach and interface.
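For readers who have not seen the map/reduce pattern spelled out, here is a deliberately tiny batch example that counts requests per path in a handful of log lines. It is a single-process sketch: the log format and the function names are assumptions for illustration, and real batch frameworks distribute these same phases across a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(log_lines):
    """Emit a (path, 1) pair for each well-formed request line."""
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 2:      # skip lines without a method and a path
            yield parts[1], 1


def shuffle(pairs):
    """Group the emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single total."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


# Made-up log lines in a "METHOD PATH STATUS" shape.
log_lines = [
    "GET /products 200",
    "GET /cart 200",
    "GET /products 200",
    "POST /checkout 500",
    "malformed",
]

hits = reduce_phase(shuffle(map_phase(log_lines)))
print(hits)
```

Run as a nightly batch over yesterday’s logs, even something this simple starts answering questions the business could not answer before.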

Given the cost considerations at scale, most big metal storage vendors are not great options in this space. Rather, building clusters from standard x86 or ARM systems with Software Defined Storage (SDS) is the de facto approach for large data sets. Bear in mind that while you can use most general purpose storage devices (SAN or NAS) for big data purposes, the performance and cost won’t be optimal. You can read about SDS offerings here and here.


As the data is now consolidated, labeled and available in a standardized system (and tools!), you now have to decide how to make the data available, and to whom. The most common scenario here is to expose the data via specific tooling for analysts and data scientists. However, depending on the sensitivity of the data and how well thought out the access controls are within your data pipeline, you might want to consider allowing access via more generic APIs or bulk exports of the data sets.

This would allow other internal developers, partners and potentially even 3rd parties and customers access to the data, and the ways they integrate and analyze the data will often yield pleasant results.

This has the potential to increase the value of your data as the user base consuming the data grows in size and diversity. A longer term benefit is that you are now creating data about how your data is being accessed and used (more data!). This could ultimately lead to additional revenue streams and/or product enhancements.
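The exposure decision described above, together with the usage data it generates, could be sketched roughly as follows. All of the specifics are assumptions for illustration: the three sensitivity levels, the dataset names and the role names are hypothetical, and a production system would enforce this in the API gateway or the data platform rather than in a script.

```python
from collections import Counter

# Hypothetical sensitivity levels, ordered from least to most sensitive.
SENSITIVITY_ORDER = ["public", "internal", "restricted"]

# Hypothetical datasets and the sensitivity label each one carries.
DATASETS = {
    "product-catalog": "public",
    "web-traffic-summary": "internal",
    "customer-payments": "restricted",
}

# Hypothetical roles and the highest sensitivity each may read.
CLEARANCE = {
    "customer": "public",
    "partner": "public",
    "internal-developer": "internal",
    "data-scientist": "restricted",
}

access_log = []  # (role, dataset, allowed) tuples: data about your data


def can_access(role, dataset):
    """True when the role's clearance is at or above the dataset's sensitivity."""
    if role not in CLEARANCE or dataset not in DATASETS:
        return False  # unknown roles and datasets are denied by default
    needed = SENSITIVITY_ORDER.index(DATASETS[dataset])
    held = SENSITIVITY_ORDER.index(CLEARANCE[role])
    return held >= needed


def request_dataset(role, dataset):
    """Check access and record the attempt so usage can be analyzed later."""
    allowed = can_access(role, dataset)
    access_log.append((role, dataset, allowed))
    return allowed


request_dataset("partner", "product-catalog")
request_dataset("partner", "customer-payments")
request_dataset("data-scientist", "customer-payments")
request_dataset("internal-developer", "product-catalog")

# Which datasets are actually being used, and how often?
granted = Counter(dataset for _, dataset, allowed in access_log if allowed)
print(granted.most_common())
```

Note that the denied requests are logged too. A partner repeatedly asking for data they cannot have is itself a signal worth looking at.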

There are plenty of options for how you run analysis and which tools fit which use cases. So rather than repeat all that, I’ll advise you to take a look at the resources below, and drop me a line if you have any thoughts in this area.

Using Red Hat JBoss Data Virtualization with Hortonworks (Hadoop)

How Coursera uses AWS Data Pipeline

The Evolution of Netflix’s Data Pipeline
