Six Steps Up: From Zero to Data Science for the Enterprise

Seth E Dobrin, PhD
Inside Machine Learning
10 min read · Mar 19, 2017

Data is often said to have intrinsic value to the enterprise, but quantifying those data assets has been a struggle for many organizations as they establish modern data practices and data organizations. In most organizations, data in and of itself doesn't have intrinsic value; its value emerges only after we build platforms for data science on top of it.

There are six phases to the process:

1. Conceptually organizing data into 360 views

2. Mapping the decisions that the organization makes

3. Valuing each decision for the organization

4. Aligning the data to each decision

5. Establishing data governance as an enabler

6. Execution

As the types of decisions and the value of those decisions change and mature over time, the six phases repeat. Depending on the resources available, the phases can be tackled in series or in parallel.

Phase 1: Conceptually organizing data into 360 views

These days, it's common to hear talk of creating 'Customer360' views of data. Beyond the urge to chase a trend, it's worth considering how the notion of a 360 view can help an organization structure its data and development efforts across the entire data portfolio.

For any given organization, there are only a finite number of ways to group data, given the data opportunities and the data’s own logical organization. Typically, you can boil most data down to one of three key conceptual assets: Customer, Product, and Company. Some enterprises may require one or two others to account for Locations or some understanding of the Transactional aspects of the business. To be trendy, we can re-label these as Customer360, Product360, Company360 — and if necessary Location360 and Transaction360.

At this point, business schools, trade journals, and consultancies have firmly established the concept of a Customer360, and a quick web search reveals quite a bit of insight, including a detailed infographic from IBM.

Different stakeholders want a Customer360 for different reasons, but perhaps the most clarifying one is this: for a company to truly drive value and delight its customers, the business must understand those customers and approach every question from their perspective. Without a Customer360 built on a foundation of data science, the business will only ever have a qualitative view of its customers. I believe a true, quantitative understanding of customers relies on rigorous data science.

Less attention has been paid to the concept of a Product360, but it’s no less important. Depending on the business, a Product360 can potentially drive more value through cost savings and cost avoidance than the business can derive from new revenue. The ultimate goal of a Product360 is creating assets that allow the business to explore each product from earliest inception through the end of its lifecycle. The components of a Product360 include:

  • Product Development Pipeline
  • Supply Chain Manufacturing
  • Supply Chain Inventory
  • Regulatory
  • SKU Mapping
  • Product Categories
  • Product Marketing
  • Sales
  • Lifecycle Management
  • Social Media Response to Product

And lastly, Company360 serves as a catch-all for any data that doesn't fit into Customer360 or Product360: financial, talent, legal, and anything else of value. Company360 might seem like a collection of orphan data, but it's a critical component of a data platform, since this information is required to answer analytical questions about the enterprise. The two most important components of Company360 are usually Talent and Finance. In fact, without finance data, Product360 is incomplete; financial performance data for products and services is crucial for deriving value from those assets. Similarly, insight into the current talent pool, compensation packages, and how to attract and retain top talent is instrumental to the success of any organization. Each enterprise is different, and there are always arguments to be made for more data assets, but simplicity is key to success. Regardless of how the assets are structured, what matters is to begin thinking about and treating data as assets. Just try to keep the total number of assets under seven.

To construct the assets, you'll first need to think conceptually. You'll know you're on target when you can create logical representations of the data that will make up the assets. Creating these logical representations should help to define the physical architecture of each asset. Specifically, let the natural structure of the data define the architecture, rather than any preconceived notion of using Hadoop or of cramming everything into a relational structure. Each resulting asset should be wrapped with a set of secure application programming interfaces (APIs) that are structured using language that's natural to a subject matter expert.
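To make that concrete, here is a minimal sketch of what a thin, SME-friendly API around a hypothetical Customer360 asset might look like, using Flask. The endpoint names, fields, and in-memory store are illustrative assumptions, and authentication and authorization are omitted for brevity even though the real APIs must be secure.

```python
# A minimal sketch of an SME-friendly API wrapping a hypothetical Customer360 asset.
# The endpoints, fields, and backing store are illustrative assumptions only;
# a real asset would sit on a warehouse or document store and require auth.
from flask import Flask, jsonify, abort

app = Flask(__name__)

# Stand-in for the physical store behind the asset.
CUSTOMER_STORE = {
    "C-1001": {
        "name": "Acme Corp",
        "lifetime_value": 125000.0,
        "open_support_tickets": 2,
        "recent_orders": ["O-552", "O-578"],
    }
}

@app.route("/customer360/<customer_id>/profile")
def customer_profile(customer_id):
    """Return the consolidated profile for one customer, in business terms."""
    customer = CUSTOMER_STORE.get(customer_id)
    if customer is None:
        abort(404)
    return jsonify(customer)

@app.route("/customer360/<customer_id>/health")
def customer_health(customer_id):
    """A simple, SME-readable 'health' signal derived from the asset."""
    customer = CUSTOMER_STORE.get(customer_id)
    if customer is None:
        abort(404)
    at_risk = customer["open_support_tickets"] > 1
    return jsonify({"customer_id": customer_id, "at_risk": at_risk})

if __name__ == "__main__":
    app.run(port=8080)
```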

You can drive bulk access to data with a messaging system such as Kafka for near-real-time consistency, and you can move data via conventional ETL tooling or, better yet, by leveraging Spark-based data pipelines. Let these be the primary routes of access to the data, and strictly prohibit direct backend access; it's simply not necessary with the combination of APIs and ETL. The assets can be built in a private cloud and then made available to a public or hybrid cloud as necessary.
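On the data-movement side, one common pattern is a Spark Structured Streaming job that consumes events from a Kafka topic and lands curated records in the asset's storage layer. The sketch below is just that, a sketch; the topic name, schema, and paths are assumptions rather than a prescribed design.

```python
# Sketch of a Spark-based data pipe: consume change events from a Kafka topic
# and land them in the asset's storage layer. Topic, schema, and paths are
# illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("customer360-pipe").getOrCreate()

event_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

# Read the raw event stream from Kafka.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "customer-events")
    .load()
)

# Kafka delivers values as bytes; parse them into typed columns.
events = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), event_schema).alias("event"))
    .select("event.*")
)

# Land the curated stream in the asset's storage (Parquet here for simplicity).
query = (
    events.writeStream.format("parquet")
    .option("path", "/data/customer360/events")
    .option("checkpointLocation", "/data/customer360/_checkpoints")
    .start()
)

query.awaitTermination()
```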

Phase 2: Mapping the decisions that the organization makes

Analytics maturity ranges from traditional business intelligence and descriptive analytics (which are typically retrospective or lagging indicators) to more predictive capabilities, and finally to cognition and artificial intelligence. For most enterprises, the true value is not inherent in the data itself, but in applying that data to make decisions more efficiently and effectively. The first step is walking through the various business functions to create an enterprise-wide map of the decisions being made. Resources permitting, this phase can happen in parallel with Phase 1.

Understanding what decisions need to be made across the entire organization might seem like an easy exercise, but done correctly it is more difficult, and more enlightening, than any other part of this journey. It will also help you understand which parts of the business are 'ready' to leverage more advanced analytics.
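The output of this phase can be as simple as a structured record per decision. The sketch below shows one possible shape for such a record; the fields and example entries are assumptions for illustration.

```python
# One possible (assumed) shape for an entry in the enterprise decision map.
from dataclasses import dataclass, field

@dataclass
class Decision:
    name: str                 # e.g., "Set weekly reorder quantity per SKU"
    business_function: str    # e.g., "Supply Chain"
    frequency_per_year: int   # how often the decision is made
    current_basis: str        # "gut feel", "descriptive BI", "predictive model", ...
    data_assets_used: list = field(default_factory=list)  # filled in during Phase 4

decision_map = [
    Decision("Set weekly reorder quantity per SKU", "Supply Chain", 52, "descriptive BI"),
    Decision("Approve or decline a credit application", "Finance", 20000, "rules engine"),
]
```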

Phase 3: Valuing each decision for the organization

At an essential level, every decision made in an organization falls into one of three categories in terms of monetary value: the decision either generates revenue, reduces costs, or avoids costs altogether. The ability to make a decision in advance, using predictive analytics or cognition, has the potential to increase that decision's value. The delta between the value of making the decision as it is made today and the value of making it in advance with some level of accuracy is the value of applying data science.
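As a back-of-the-envelope illustration, the value of applying data science to a single decision can be estimated as the difference between its expected annual value today and its expected annual value when made in advance with a model of some accuracy. All numbers in the sketch below are made up; work through the real figures with your finance team.

```python
# Back-of-the-envelope valuation of one decision. All numbers are made up
# for illustration; replace them with estimates agreed with finance.

decisions_per_year = 10000      # how often the decision is made
value_if_right = 500.0          # payoff (revenue gained or cost avoided) per correct call
value_if_wrong = -200.0         # cost of a bad call

accuracy_today = 0.60           # current hit rate (rules, gut feel, lagging BI)
accuracy_with_model = 0.78      # expected hit rate with a predictive model

def annual_value(accuracy):
    expected_value_per_decision = accuracy * value_if_right + (1 - accuracy) * value_if_wrong
    return decisions_per_year * expected_value_per_decision

uplift = annual_value(accuracy_with_model) - annual_value(accuracy_today)
print(f"Estimated annual value of applying data science: ${uplift:,.0f}")
```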

Example 1: Generating Revenue: New Arbitrage

We’re all aware that machine learning can generate revenue by personalizing customer experiences to boost product sales, offer just-in-time services, and deliver targeted ads. (Think of Amazon, Uber, and Google.) But using data science to generate revenue also means giving organizations the ability to forecast demand, plan inventory, optimize scheduling, coordinate supply chains, refine pricing, and track down new resources — from oil and wind to undervalued equities.

Consider BlocPower, a New York City startup developing clean energy projects in US inner cities. With help from IBM, BlocPower performs a comprehensive energy audit of each property it manages to identify the mix of high-efficiency technologies that will optimize energy consumption. The company acts as an intermediary, generating revenue for investors (and savings for customers) through an innovative crowdfunding marketplace, with data science as the foundation.

Looking to the future, we can envision data science and machine learning eventually helping to formulate new strategies and discover new lines of business, perhaps by proposing partnerships, spotting opportunities in other sectors, or even designing new products that appeal to untapped demographics. (Deep learning, especially convolutional networks, and reinforcement learning are the areas to watch.)

Example 2: Reducing Costs: Logistics

Because ML algorithms are especially adept at optimization, more and more organizations are using machine learning to reduce costs. One particularly impressive example is Digital Water, a company providing device-level visibility and intelligence to firms treating water and wastewater. As their site says, “The platform provides real-time insight, analysis and user-level controls to interact and benefit from the expanding smart-grid infrastructure.” IBM helped Digital Water reduce costs by monitoring and managing electric loads in conjunction with existing data systems. Collecting data and identifying patterns gives Digital Water the ability to forecast events and allocate resources in advance — at less expense.

Example 3: Avoiding Costs: Biological Product Development

Whether in agriculture, biofuels, or pharma, development pipelines in bioscience tend to be shaped like wide funnels, with many leads entering early in the pipeline and very few emerging as marketable products. To avoid development costs, it's crucial to isolate the promising leads and terminate the rest as early in the development process as possible. Predictive or prescriptive analytics gives investigators that power. Organizations can either pocket the cost avoidance as an increased margin or re-invest it into the pipeline to generate more leads.

Monsanto has executed this approach with huge impact. Combining a state-of-the-art genomics pipeline with a state-of-the-art data science pipeline, they have shaved a year off their product development timeline while simultaneously increasing the number of leads they evaluate. Using this approach, they have essentially moved a full year's worth of testing from the field into the lab, making their genomics lab in Chesterfield, MO, the largest 'field testing' operation in the world. The work was facilitated by an extremely mature data science pipeline.

Phase 4: Aligning the data to each decision

Once the assets are conceptually defined and the decisions have been mapped and valued, the next step is to align the data assets to the decisions and to assign a value to each asset based on the contribution it makes to each decision. Be sure to work with the finance team on this step. The mapping won't typically be one-to-one, but it's generally a straightforward and defensible process. The outcome should be a net present value for the data and analytics by logical domain.
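One way to make that alignment concrete is to attribute a share of each decision's uplift to the assets that feed it and discount the attributed cash flows to a net present value per asset. The sketch below is illustrative only; the contribution weights, uplift figures, discount rate, and horizon are all assumptions to be replaced with numbers agreed with finance.

```python
# Sketch of attributing decision value to data assets and discounting to NPV.
# Contribution weights, uplifts, discount rate, and horizon are all assumptions.

annual_uplift_by_decision = {          # from Phase 3 (illustrative figures)
    "reorder_quantity": 1_200_000.0,
    "credit_approval": 3_500_000.0,
}

# Fraction of each decision's uplift attributed to each data asset.
asset_contribution = {
    "reorder_quantity": {"Product360": 0.7, "Company360": 0.2},
    "credit_approval": {"Customer360": 0.6, "Company360": 0.3},
}

discount_rate = 0.10
horizon_years = 5

def npv(annual_cash_flow, rate, years):
    """Net present value of a constant annual cash flow over a fixed horizon."""
    return sum(annual_cash_flow / (1 + rate) ** t for t in range(1, years + 1))

asset_npv = {}
for decision, uplift in annual_uplift_by_decision.items():
    for asset, weight in asset_contribution[decision].items():
        asset_npv[asset] = asset_npv.get(asset, 0.0) + npv(uplift * weight, discount_rate, horizon_years)

for asset, value in sorted(asset_npv.items()):
    print(f"{asset}: NPV of ${value:,.0f}")
```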

With this process, the organization converts a very nebulous thing called data into specific, tangible data assets that could conceivably be added to the organization's books as such.

Phase 5: Establishing data governance as an enabler

If you've completed the steps above in series, then at this point you've used data to create theoretical value for the company in the form of the 360 data assets, and you have a path to realizing that value via the decision mapping.

However, the effort falls apart if individuals can't access the data. Data science teams need APIs around each asset to facilitate controlled, monitorable access to the assets. I believe it's crucial to apply policies up front; to do so, you'll need to classify all data and specify all role-level entitlements.

More importantly, changing data access policies is likely to require a cultural change as well, as the organization shifts from being highly protective by default to being open by default. Start with the mindset that all data is public, then walk back and justify higher levels of classification step by step. Applying a classification scheme should result in a bell curve, with most of the data in the middle and relatively little under very tight or very loose restriction.
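In practice, the classification scheme and role-level entitlements can be captured as explicit, reviewable policy and enforced at the API layer. The tiers, roles, and field assignments in the sketch below are assumptions for illustration only.

```python
# Sketch of data classification plus role-level entitlements enforced at the API layer.
# The tiers, roles, and field classifications are illustrative assumptions.

CLASSIFICATION_ORDER = ["public", "internal", "confidential", "restricted"]

FIELD_CLASSIFICATION = {
    "product_category": "public",
    "customer_name": "internal",
    "lifetime_value": "confidential",
    "social_security_number": "restricted",
}

ROLE_MAX_CLASSIFICATION = {
    "marketing_analyst": "internal",
    "data_scientist": "confidential",
    "fraud_investigator": "restricted",
}

def can_access(role, field):
    """True if the role's ceiling is at or above the field's classification."""
    field_level = CLASSIFICATION_ORDER.index(FIELD_CLASSIFICATION[field])
    role_ceiling = CLASSIFICATION_ORDER.index(ROLE_MAX_CLASSIFICATION[role])
    return role_ceiling >= field_level

assert can_access("data_scientist", "lifetime_value")
assert not can_access("marketing_analyst", "social_security_number")
```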

Phase 6: Execution

Executing everything you’ve accomplished in the first five phases is likely to require efforts on several fronts:

  • Engineering the various components
  • Continuously improving data quality
  • Optimizing and automating decisions
  • Applying cognition to guide next best action and the development of self-service BI tools
  • Changing the organization itself

Most importantly, the team must be held accountable through solid tracking of metrics.

Around each data asset, build small agile DevOps teams consisting of engineers and data scientists. These teams become the new owners of the data for the enterprise, and they're responsible for everything from architecture to support, including security and access based on the predetermined policies from Phase 5.

Similarly, establish a centralized stewardship team to work in conjunction with the asset teams to resolve any data quality issues. Include the appropriate subject matter expert from the business if necessary.

Conventional customized business intelligence (BI) should go away over time and become self-service, using widely available tools such as Watson Analytics, Cognos, Tableau, and so on. Make an effort to bake the basic data science into the assets so that asset teams can execute it directly in conjunction with the business units.

Ultimately, the most significant insights will come from leveraging multiple data assets to solve the more complex problems defined in Phase 2. Establish a central and/or federated data science team to address the prioritized decisions. Avoid structuring this team to align with the data assets; instead, structure it around the domains that fall out of the decision mapping.

Not yet mentioned are the legacy business intelligence teams, which should also report to the chief data officer (CDO). These are likely large teams that have been critical to the ongoing success of the company. The goal is to reduce the size of these teams over time, likely through a combination of retraining, restructuring, and attrition. Ideally, a relatively small initial investment in the new teams will pay dividends that encourage expansion. Consider funding that expansion by pulling from the legacy teams.

Capturing metrics is essential for both the data asset teams and the stewardship teams. Focus the metrics on the milestones of creating the APIs and, ultimately, on the usage of the APIs. Measure stewardship by the number of data errors identified and resolved, focusing not on the current state of the data but on the desired state. In all probability, data quality will be much different than expected, since the data has likely never had so much visibility.
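A lightweight way to keep those metrics honest is to record every API call and every stewardship action and report on them regularly. The sketch below is illustrative; the event fields and in-memory counters are assumptions standing in for whatever logging and reporting stack you already run.

```python
# Sketch of the two metric streams described above: API usage per asset,
# and data-quality issues identified vs. resolved by stewardship.
# Field names and the in-memory counters are illustrative assumptions.
from collections import Counter
from datetime import date

api_calls = Counter()        # (asset, endpoint) -> call count
issues_identified = Counter()  # asset -> data-quality issues identified
issues_resolved = Counter()    # asset -> data-quality issues resolved

def record_api_call(asset, endpoint):
    api_calls[(asset, endpoint)] += 1

def record_issue_identified(asset):
    issues_identified[asset] += 1

def record_issue_resolved(asset):
    issues_resolved[asset] += 1

def weekly_report():
    print(f"Metrics as of {date.today()}")
    for (asset, endpoint), count in api_calls.most_common():
        print(f"  {asset} {endpoint}: {count} calls")
    for asset in issues_identified:
        print(f"  {asset}: {issues_resolved[asset]}/{issues_identified[asset]} issues resolved")
```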

Measure the success of the data science team by the value they realize from Phase 3 (valuing the decisions) and by the number of decisions they optimize and automate. Ensure that the entire team feels accountable for the size of the legacy BI team: as the assets are built, data quality will go up, more decisions will be optimized and automated, and the amount of custom BI will inevitably go down.

Conclusion

The proliferation and diversity of data demand the forethought to evaluate and align that data with the concrete decisions organizations need to make. The process won't be quick or easy, but it's sure to pay dividends in the long term as users of the data gain the ability to work with speed and confidence within clear, comprehensive policies.

Seth E Dobrin, PhD

Vice President and Chief AI Officer || Leader of exponential change, using data and analytics to change how companies operate || Opinions are my own