The Data Science Mindset: 6 Principles To Build Healthy Data-Driven Skills & Organizations

Briit · Published in Total Data Science · 15 min read · Dec 21, 2021

INTRODUCTION

Five years ago, the McKinsey Global Institute (MGI) released Big data: The next frontier for innovation, competition, and productivity. In the years since, data science has continued to make rapid advances, particularly on the frontiers of machine learning and deep learning.

Organizations now have troves of raw data combined with powerful and sophisticated analytics tools to gain insights that can improve operational performance and create new market opportunities. Most profoundly, their decisions no longer have to be made in the dark or based on gut instinct; they can be based on evidence, experiments, and more accurate forecasts. As we take stock of the progress that has been made over the past five years, we see that companies are placing big bets on data and analytics.

But adapting to an era of more data-driven decision making has not always proven to be a simple proposition for people or organizations. Many are struggling to develop talent, business processes, and organizational muscle to capture real value from analytics. This is becoming a matter of urgency, since analytics prowess is increasingly the basis of industry competition, and the leaders are staking out large advantages. Meanwhile, the technology itself is taking major leaps forward — and the next generation of technologies promises to be even more disruptive. Machine learning and deep learning capabilities have an enormous variety of applications that stretch deep into sectors of the economy that have largely stayed on the sidelines thus far.

According to the Harvard Business Review, the biggest obstacles to creating data-based businesses aren’t technical; they’re cultural, a kind of mindset. It is simple enough to describe how to inject data into a decision-making process. It is far harder to make this normal, even automatic, for employees — a shift in mindset that presents a daunting challenge.

THE FRONTIERS OF MACHINE LEARNING, INCLUDING DEEP LEARNING, HAVE RELEVANCE IN EVERY INDUSTRY AND WIDE-RANGING POTENTIAL TO SOLVE PROBLEMS

Conventional software programs are hard-coded by humans with specific instructions on the tasks they need to execute. By contrast, it is possible to create algorithms that “learn” from data without being explicitly programmed. The concept underpinning machine learning is to give the algorithm a massive number of “experiences” (training data) and a generalized strategy for learning, then let it identify patterns, associations, and insights from the data.

In short, these systems are trained rather than programmed. Some machine learning techniques, such as regressions, support vector machines, and k-means clustering, have been in use for decades. Others, while developed previously, have become viable only now that vast quantities of data and unprecedented processing power are available. Deep learning, a frontier area of research within machine learning, uses neural networks with many layers (hence the label “deep”) to push the boundaries of machine capabilities.
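To make the idea of "trained rather than programmed" concrete, here is a minimal sketch (not from the original article) that fits a scikit-learn classifier on a handful of made-up examples instead of hand-coding the decision rule; the feature names and data are purely illustrative.

```python
# A model learns its decision rule from example "experiences"
# instead of being hard-coded with explicit if/else logic.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [monthly_spend, support_calls] -> churned (1) or stayed (0)
X_train = np.array([[20, 5], [90, 0], [15, 7], [80, 1], [30, 4], [95, 0]])
y_train = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression()
model.fit(X_train, y_train)          # the algorithm identifies the pattern itself

print(model.predict([[25, 6]]))      # prediction for a new, unseen customer
```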

Data scientists have recently made breakthroughs using deep learning to recognize objects and faces and to understand and generate language. Reinforcement learning is used to identify the best actions to take now in order to reach some future goal. These types of problems are common in games but can also be useful for solving dynamic optimization and control theory problems; these are exactly the kinds of issues that come up in modeling complex systems in fields such as engineering and economics. Transfer learning focuses on storing knowledge gained while solving one problem and applying it to a different problem. Machine learning, combined with other techniques, could have an enormous range of uses.

Machine learning can be combined with other types of analytics to solve a large swath of business problems.


Machine learning has broad potential across industries and use cases.

[Chart: industry-specific machine learning use cases, plotted by data richness and size of opportunity]

The industry-specific uses that combine data richness with a larger opportunity are the largest bubbles in the top right quadrant of the chart. These represent areas where organizations should prioritize the use of machine learning and prepare for a transformation to take place. Some of the highest-opportunity use cases include personalized advertising; autonomous vehicles; optimizing pricing, routing, and scheduling based on real-time data in travel and logistics; predicting personalized health outcomes; and optimizing merchandising strategy in retail.

The use cases in the top right quadrant fall into four main categories. First is the radical personalization of products and services for customers in sectors such as consumer packaged goods, finance and insurance, health care, and media — an opportunity that most companies have yet to fully exploit. The second is predictive analytics. This includes examples such as triaging customer service calls; segmenting customers based on risk, churn, and purchasing patterns; identifying fraud and anomalies in banking and cybersecurity; and diagnosing diseases from scans, biopsies, and other data.

The third category is strategic optimization, which includes uses such as merchandising and shelf optimization in retail, scheduling and assigning frontline workers, and optimizing teams and other resources across geographies and accounts.

The fourth category is optimizing operations and logistics in real time, which includes automating plants and machinery to reduce errors and improve efficiency, and optimizing supply chain management.

UNDERSTANDING THE HEALTHY DATA SCIENCE ORGANIZATION FRAMEWORK

Being a data-driven organization implies embedding data science teams that fully engage with the business and adapting the operational backbone of the company (techniques, processes, infrastructure, and culture). The Healthy Data Science Organization Framework is a portfolio of methodologies, technologies, and resources that, used correctly, will help your organization become more data-driven across the entire lifecycle, from business understanding, data generation and acquisition, and modeling through model deployment and management. This framework, shown below in Figure 1, includes six key principles:

  1. Understand the Business and Decision-Making Process
  2. Establish Performance Metrics
  3. Architect the End-to-End Solution
  4. Build Your Toolbox of Data Science Tricks
  5. Unify Your Organization’s Data Science Vision
  6. Keep Humans in the Loop

Figure 1. Healthy Data Science Organization Framework

Given the rapid evolution of this field, data scientists and organizations typically need guidance on how to apply the latest data science techniques to address their business needs or to pursue new opportunities.

In the upcoming pages, we will explore the six principles that data scientists, as well as organizations, need in order to build healthy data-driven skills and organizations.

PRINCIPLE 1:

UNDERSTAND THE BUSINESS AND DECISION-MAKING PROCESS

For most organizations, lack of data is not a problem. In fact, it’s the opposite: there is often too much information available to make a clear decision. With so much data to sort through, organizations need a well-defined strategy to clarify the following business aspects:

  • How can data science help organizations transform business, better manage costs, and drive greater operational excellence?
  • Do organizations have a well-defined and clearly articulated purpose and vision for what they are looking to accomplish?
  • How can organizations get the support of C-level executives and stakeholders to take that data-driven vision and drive it through the different parts of a business?

In short, companies need to have a clear understanding of their business decision-making process and a better data science strategy to support that process.

With the right data science mindset, what was once an overwhelming volume of disparate information becomes a simple and clear decision point.

Driving transformation requires that companies have a well-defined and clearly articulated purpose and vision for what they are looking to accomplish. It often requires the support of a C-level executive to take that vision and drive it through the different parts of a business.

Organizations must begin with the right questions. Questions should be measurable, clear, and concise, and directly correlated to the core business. In this stage, it is important to design questions that either qualify or disqualify potential solutions to a specific business problem or opportunity. For example, start with a clearly defined problem: a retail company is experiencing rising costs and is no longer able to offer competitive prices to its customers. One of many questions to address this business problem might be: can the company reduce its operating costs without compromising quality?

There are two main tasks that organizations need to address to answer those types of questions:

  • Define business goals: the Data Science team needs to work with business experts and other stakeholders to understand and identify the business problems.
  • Formulate the right questions: companies need to formulate tangible questions that define the business goals that the data science teams can target.

USE CASE:

Francesca Lazzeri, PhD (Twitter: @frlazzeri) is a Senior Machine Learning Scientist at Microsoft on the Cloud Advocacy team and an expert in big data technology innovations and the application of machine learning-based solutions to real-world problems.

Last year, the Azure Machine Learning team developed a recommendation-based staff allocation solution for a professional services company. Using the Azure Machine Learning service, we built and deployed a workforce placement recommendation solution that recommends the optimal staff composition, and the individual staff members with the right experience and expertise, for new projects. The final business goal of our solution was to improve our customer's profit.

Project staffing is done manually by project managers and is based on staff availability and prior knowledge of an individual’s past performance. This process is time-consuming, and the results are often suboptimal. This process can be done much more effectively by taking advantage of historical data and advanced machine learning techniques.

In order to translate this business problem into tangible solutions and results, we helped the customer to formulate the right questions, such as:

  1. How can we predict staff composition for a new project? For example, one senior program manager, one principal data scientist and two accounting assistants.
  2. How can we compute the Staff Fitness Score for a new project? We defined our Staff Fitness Score as a metric to measure the fitness of staff with a project.

The goal of our machine learning solution was to suggest the most appropriate employees for a new project, based on each employee's availability, geography, project-type experience, industry experience, and hourly contribution margin generated on previous projects.

These solutions can address gaps or inefficiencies in an organization's staff allocation that need to be overcome to drive better business outcomes. Organizations can gain a competitive edge by using workforce analytics to focus on optimizing the use of their human capital.

In the next few paragraphs, we will see together how Francesca and her team built this solution for their customer through a data science mindset.

PRINCIPLE 2:

ESTABLISH PERFORMANCE METRICS

In order to successfully translate this vision and these business goals into actionable results, the next step is to establish clear performance metrics. In this second step, organizations need to focus on two analytical aspects that are also crucial for defining the data solution pipeline (Figure 2):

  • What is the best analytical approach to tackle that business problem and draw accurate conclusions?
  • How can that vision be translated into actionable results able to improve a business?

Figure 2. Data solution pipeline

This step breaks down into three sub-steps:

  1. Decide what to measure

Let’s take Predictive Maintenance, a technique used to predict when an in-service machine will fail, allowing its maintenance to be planned well in advance. As it turns out, this is a very broad area with a variety of end goals, such as predicting the root causes of failure, predicting which parts will need replacement and when, and providing maintenance recommendations after a failure happens.

Many companies are attempting predictive maintenance and have piles of data available from all sorts of sensors and systems. But, too often, customers do not have enough data about their failure history, and that makes it very difficult to do predictive maintenance; after all, models need to be trained on such failure history data in order to predict future failure incidents. So, while it is important to lay out the vision, purpose, and scope of any analytics project, it is critical that you start off by gathering the right data. For example, if the problem is to predict the failure of a specific component, the training data must capture that component's history; if the problem is to predict the failure of the traction system as a whole, the training data has to encompass all the different components of the traction system. The first case targets a specific component, whereas the second targets the failure of a larger subsystem. The general recommendation is to design prediction systems around specific components rather than larger subsystems.

The two main data types observed in the predictive maintenance domain are:

  1. Temporal data: operational telemetry, machine conditions, work order types, and priority codes, all of which have timestamps at the time of recording. Failure, maintenance/repair, and usage history also have timestamps associated with each event.
  2. Static data: machine features and operator features are, in general, static, since they describe the technical specifications of machines or operator attributes. If these features can change over time, they should also have timestamps associated with them.

Predictor and target variables should be preprocessed/transformed into numerical, categorical, and other data types, depending on the algorithm being used.
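To make the distinction concrete, here is a small pandas sketch (with hypothetical column names, not the data from any real engagement) that joins time-stamped telemetry with static machine features into a single modeling table:

```python
import pandas as pd

# Hypothetical temporal data: time-stamped operational telemetry per machine.
telemetry = pd.DataFrame({
    "machine_id": [1, 1, 2],
    "timestamp": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01"]),
    "vibration": [0.21, 0.35, 0.18],
    "temperature": [71.0, 74.5, 69.8],
})

# Hypothetical static data: technical specifications that do not change over time.
machines = pd.DataFrame({
    "machine_id": [1, 2],
    "model": ["A100", "B200"],
    "age_years": [4, 7],
})

# Combine both data types into a single modeling table and
# encode the categorical machine model as numeric dummy columns.
features = telemetry.merge(machines, on="machine_id", how="left")
features = pd.get_dummies(features, columns=["model"])
print(features.head())
```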

  2. Decide how to measure it

Thinking about how organizations measure their data is just as important, especially before the data collection and ingestion phase. Key questions to ask for this sub-step include:

  • What is the time frame?
  • What is the unit of measure?
  • What factors should be included?

A central objective of this step is to identify the key business variables that the analysis needs to predict. We refer to these variables as the model targets, and we use the metrics associated with them to determine the success of the project. Two examples of such targets are sales forecasts or the probability of an order being fraudulent.

  3. Define the success metrics

After identifying the key business variables, it is important to translate your business problem into a data science question and to define the metrics that will determine the project's success. Organizations typically use data science or machine learning to answer five types of questions:

  • How much or how many? (regression)
  • Which category? (classification)
  • Which group? (clustering)
  • Is this weird? (anomaly detection)
  • Which option should be taken? (recommendation)

Determine which of these questions a company is asking, and how answering it achieves the business goals and enables measurement of the results. At this point it is important to revisit the project goals by asking and refining sharp questions that are relevant, specific, and unambiguous.
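As a rough illustration of this mapping, the sketch below pairs each question type with a typical scikit-learn estimator; the specific estimators are illustrative defaults chosen for this example, not recommendations from the original framework.

```python
# Illustrative mapping from question type to a typical scikit-learn estimator.
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import NearestNeighbors

question_to_model = {
    "How much or how many?":         LinearRegression(),    # regression
    "Which category?":               LogisticRegression(),  # classification
    "Which group?":                  KMeans(n_clusters=3),  # clustering
    "Is this weird?":                IsolationForest(),     # anomaly detection
    "Which option should be taken?": NearestNeighbors(),    # item similarity for recommendation
}
```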

For example, a company might want to predict customer churn with an accuracy rate of "x" percent by the end of a three-month project, so that it can offer targeted promotions to the customers who are most likely to churn.

In the case of our professional services company, we decided to tackle the first business question (how can we predict the staff composition, e.g. one senior accountant and two accounting assistants, for a new project?). For this customer engagement, we used five years of daily historical project data at the individual level. We removed any data that had a negative contribution margin or a negative total number of hours. We first randomly sampled 1,000 projects from the testing dataset to speed up parameter tuning. After identifying the optimal parameter combination, we ran the same data preparation on all the projects in the testing dataset.
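A minimal pandas sketch of that preparation step, assuming hypothetical column names for the contribution margin and total hours (the real project data is not shown here):

```python
import pandas as pd

# Hypothetical slice of the five years of daily project data at the individual level.
projects = pd.DataFrame({
    "project_id": [101, 102, 103, 104],
    "contribution_margin": [1200.0, -350.0, 900.0, 4100.0],
    "total_hours": [80, 45, -3, 160],
})

# Drop records with a negative contribution margin or a negative total number of hours.
clean = projects[(projects["contribution_margin"] >= 0) & (projects["total_hours"] >= 0)]

# Randomly sample projects (1,000 in the actual engagement) to speed up parameter tuning.
tuning_sample = clean.sample(n=2, random_state=42)
print(tuning_sample)
```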

USE CASE:

By Francesca Lazzeri, PhD and team.

Below (Figure 3) is a representation of the type of data and solution flow that we built for this engagement:


Figure 3. Representation of the type of data and solution flow

We used the k-nearest neighbors (KNN) algorithm, a simple, easy-to-implement supervised machine learning algorithm. KNN assumes that similar things exist in close proximity, finds the most similar data points in the training data, and makes an educated guess based on their classifications. Although very simple to understand and implement, this method has seen wide application in many domains, such as recommendation systems, semantic search, and anomaly detection.

In this first step, we used KNN to predict the staff composition, i.e. the number of each staff classification/title, of a new project using historical project data. We found historical projects similar to the new project based on different project properties, such as project type, total billing, industry, client, revenue range, etc. We assigned different weights to each project property based on business rules and standards, and we again removed any data that had a negative contribution margin (profit). For each staff classification, the staff count is predicted by computing a weighted sum of the corresponding staff counts across similar historical projects. The weights are normalized so that they sum to 1, and before calculating the weighted sum we removed the 10% of outliers with the highest values and the 10% with the lowest values.
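Below is a simplified sketch of this weighted-neighbor calculation for a single staff classification; the similarity weights, staff counts, and 10% trimming fractions are placeholder values, not the production implementation.

```python
import numpy as np

def predict_staff_count(similarities, neighbor_staff_counts, trim_frac=0.10):
    """Weighted average of similar projects' staff counts for one staff classification.

    similarities: similarity weight of each historical project to the new project.
    neighbor_staff_counts: staff count of that classification in each historical project.
    """
    counts = np.asarray(neighbor_staff_counts, dtype=float)
    weights = np.asarray(similarities, dtype=float)

    # Trim the top and bottom 10% of staff-count values before averaging.
    lo, hi = np.quantile(counts, [trim_frac, 1 - trim_frac])
    keep = (counts >= lo) & (counts <= hi)

    # Normalize the remaining weights so they sum to 1, then take the weighted sum.
    w = weights[keep] / weights[keep].sum()
    return float(np.dot(w, counts[keep]))

# Hypothetical example: five similar historical projects.
print(predict_staff_count([0.9, 0.8, 0.7, 0.5, 0.3], [2, 3, 2, 1, 4]))
```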

For the second business question (how can we compute the Staff Fitness Score for a new project?), we decided to use a custom content-based filtering method: specifically, we implemented a content-based algorithm to predict how well a staff member's experience matches a project's needs. In a content-based filtering system, a user profile is usually computed from the user's historical ratings of items; this profile describes the user's taste and preferences. To predict a staff member's fitness for a new project, we created two profile vectors for each staff member using historical data: one vector is based on the number of hours worked and describes the staff member's experience and expertise across different types of projects; the other is based on contribution margin per hour (CMH) and describes the staff member's profitability across different types of projects. The Staff Fitness Score for a new project is computed by taking the inner products between these two profile vectors and a binary vector that describes the important properties of the project.
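The following is a minimal sketch of that inner-product computation, using made-up project-type dimensions and profile values rather than the customer's real data.

```python
import numpy as np

# Hypothetical project-type dimensions: [audit, tax, advisory, valuation]
project_needs = np.array([1, 0, 1, 0])   # binary vector of the new project's properties

# Hypothetical staff profiles built from historical data.
hours_profile = np.array([120.0, 10.0, 300.0, 0.0])   # experience: hours per project type
cmh_profile   = np.array([35.0, 12.0, 50.0, 0.0])     # profitability: contribution margin per hour

# Staff Fitness Scores: inner products between the profile vectors
# and the binary vector describing the new project.
experience_score    = float(np.dot(hours_profile, project_needs))
profitability_score = float(np.dot(cmh_profile, project_needs))
print(experience_score, profitability_score)
```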

We implemented these machine learning steps using the Azure Machine Learning service. Using the main Python SDK and the Data Prep SDK for Azure Machine Learning, we built and trained our machine learning models in an Azure Machine Learning service workspace. The workspace is the top-level resource for the service and provides a centralized place to work with all the artifacts we created for this project.

In order to create a workspace, we defined the following configurations:

  • Workspace name: enter a unique name that identifies your workspace. Names must be unique across the resource group. Use a name that is easy to recall and to differentiate from workspaces created by others.
  • Subscription: select the Azure subscription that you want to use.
  • Resource group: use an existing resource group in your subscription, or enter a name to create a new resource group. A resource group is a container that holds related resources for an Azure solution.
  • Location: select the location closest to your users and the data resources. This is where the workspace is created.
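As a sketch of this setup, the snippet below creates a workspace with the Azure Machine Learning Python SDK (azureml-core); the workspace name, subscription ID, resource group, and location are placeholders you would replace with your own values.

```python
from azureml.core import Workspace

# Placeholder values: use your own workspace name, subscription ID,
# resource group, and Azure region.
ws = Workspace.create(
    name="staffing-recommendation-ws",        # workspace name (unique within the resource group)
    subscription_id="<your-subscription-id>",
    resource_group="staffing-rg",
    create_resource_group=True,               # create the resource group if it does not exist
    location="eastus",                        # region closest to users and data
)

# Persist the configuration so later scripts can reconnect with Workspace.from_config().
ws.write_config(path=".azureml")
```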

When we created a workspace, the following Azure resources were added automatically:

  • Azure Container Registry
  • Azure Storage
  • Azure Application Insights
  • Azure Key Vault


The workspace keeps a list of compute targets that you can use to train your model. It also keeps a history of the training runs, including logs, metrics, output, and a snapshot of your scripts. We used this information to determine which training run produced the best model.

Afterwards, we registered our models with the workspace and used the registered model and scoring scripts to create an image for deployment (more details about the end-to-end architecture built for this use case are discussed below). Figure 4 is a representation of the workspace concept and machine learning flow:


Figure 4. Workspace concept and machine learning flow
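The snippet below sketches that registration and deployment step with the same SDK; the model path, scoring script (score.py), environment file, and service name are placeholders, and Azure Container Instances is used here only as an example deployment target.

```python
from azureml.core import Workspace, Environment
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()  # reconnect to the workspace created earlier

# Register the trained model file with the workspace (placeholder path and name).
model = Model.register(
    workspace=ws,
    model_path="outputs/staffing_model.pkl",   # file produced by the training run
    model_name="staffing-recommender",
)

# Package the registered model with a scoring script and its dependencies
# (score.py and environment.yml are placeholders for this sketch).
env = Environment.from_conda_specification("staffing-env", "environment.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# Deploy as a web service, here on Azure Container Instances.
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
service = Model.deploy(ws, "staffing-service", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
```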

There are four other comprehensive principles that you will love to sit down and read while you sip your coffee.

To make it easier for you, I have compiled everything in this book, which you can read for FREE.

If you like this article, kindly give it a like. Thanks in advance.

If you wish to write in this Newsletter, kindly reach out to us via Whatsapp: +919467891831

You may want to check: Full Stack Data Scientist BootCamp

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — -

SUBSCRIBE TO Newsletter For FREE: https://lnkd.in/ewB9KR4j
