Business Problem Solving with Data Science — Planning the Analysis Post 3 of 3

Shweta Doshi
13 min read · Feb 27, 2020


This is the final post in the Business Problem Solving with Data Science series.

Post 1 : https://medium.com/@shwedoshi/business-problem-solving-with-data-science-a76cf3f0fc7

Post 2 : https://medium.com/@shwedoshi/business-problem-solving-with-data-science-scope-your-solution-post-2-of-3-693ab4b568ba

To recap,

Business Problem Statement: Find good restaurants in the specified locality so that they can be onboarded onto the app.

Business Impact: More good restaurants on the app improve brand value and drive traffic to the app, leading to increases in Daily Active Users and daily order value, and ultimately to revenue.

Data Science Problem Statement: Predict the rating bucket of a given restaurant (good or bad) to decide if the restaurant can be onboarded to the app.

Data Science Metric: Precision

So you’ve already answered a lot of important questions about your analysis of the business problem. You know how and why it is important to the business. You’re focused on a specific decision stakeholders need to make. You’ve identified what metrics you will need to make your case to the company’s stakeholders.

You have a few more questions to answer before you begin your actual analysis:

1. What data is available to answer your questions, and is that data sufficient for you to give an answer you feel good about?

2. How difficult is it to obtain the data that you are looking for? Is the data in the public domain, or does obtaining it incur costs?

3. What form is the data you need in? Is it in a neatly labelled format? If it is not available in the required format, how much effort will it take to label the data?

4. Which data can be acquired easily, and which needs additional effort to acquire? Align your milestones so that you build a minimum viable product with the easily acquired data first, and then add more data.

5. Does all the data you need exist in datasets that can be easily joined together, or will you have to spend time figuring out how to link records across datasets?

6. How many of the pieces of data that you want can be missing or inaccessible before you decide that the analysis is simply not feasible?

Always remember that the key to solving the problem is obtaining, cleaning, and wrangling the data. An estimated 80% of the effort is spent in this stage, so be patient and question the data at every step. Data determines the success or failure of a data science project.

The first step of the analysis is to collect the data.

For our case study, our data sources could be:

Scrape data from restaurant search and discovery sites like Zomato, where a lot of information about each restaurant is present, including its ratings. (A minimal scraping sketch follows this list.)

Explore trending venues, especially restaurants, in particular neighbourhoods (using the Foursquare API).

In addition, we can scrape Twitter to obtain the conversations and tweets about the restaurants of interest; these can serve as an additional input.
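To make the scraping step concrete, here is a minimal sketch in Python. The listing URL and CSS selectors are hypothetical placeholders rather than any real site's markup, so treat this as the shape of the code, not a working scraper; real sites have their own structure, terms of service, and rate limits.

```python
# A minimal scraping sketch (illustrative only). The URL and CSS selectors are
# hypothetical placeholders, not any real site's markup. Always check a site's
# terms of service and robots.txt before scraping it.
import requests
from bs4 import BeautifulSoup

def _text(card, selector: str) -> str:
    """Safely extract text for a selector, returning '' if it is missing."""
    node = card.select_one(selector)
    return node.get_text(strip=True) if node else ""

def scrape_listing_page(url: str) -> list:
    """Fetch one listing page and return rough restaurant records."""
    resp = requests.get(url, headers={"User-Agent": "research-bot/0.1"}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    records = []
    for card in soup.select("div.restaurant-card"):        # hypothetical selector
        records.append({
            "name": _text(card, "h3.name"),
            "cuisine": _text(card, "span.cuisine"),
            "rating_raw": _text(card, "span.rating"),
            "cost_for_two_raw": _text(card, "span.cost"),
        })
    return records

# rows = scrape_listing_page("https://example.com/city/restaurants?page=1")
```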

Now, the data that comes from Zomato would be a table containing the restaurant name, cuisine, location, estimated delivery time, minimum order value, a few photos, reviews of the restaurant, the recent streak of reviews, and so on. Suppose you collect data from another competitor like Swiggy or UberEats: while the data points would be more or less the same, the form of the data could differ and the column names could be different. There could also be additional information; for example, a restaurant might offer delivery through Swiggy while that facility is not available on Zomato.

Now imagine getting these disparate, disjoint sources into a single table on which further analysis can be done. And if you plan to do this at scale, you need to involve the engineering team in designing a complete data pipeline that can store the large volume of incoming data and then provide it in a form that you, as a data scientist, can work with.

For now, let us assume you have painstakingly standardized the different columns and obtained the data you needed. Since the data is collected through third-party sources or APIs, there can be a lot of inconsistencies in it, and cleaning and transforming this data becomes crucial.
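As a rough illustration of that standardization, the sketch below renames assumed Zomato-style and Swiggy-style column names to one shared schema and stacks the two tables. Every column name here is an assumption made for the example.

```python
# Sketch of standardizing differently named columns from two sources into one
# table. The source column names below are assumptions made for illustration.
import pandas as pd

zomato_df = pd.DataFrame([{"res_name": "Cafe A", "avg_cost_for_two": 600,
                           "est_delivery_time": "30-40 mins", "aggregate_rating": 4.2}])
swiggy_df = pd.DataFrame([{"restaurant": "Cafe A", "cost_two": 650,
                           "delivery_eta": "35 mins", "stars": 4.0}])

ZOMATO_COLUMNS = {"res_name": "name", "avg_cost_for_two": "cost_for_two",
                  "est_delivery_time": "delivery_time_raw", "aggregate_rating": "rating"}
SWIGGY_COLUMNS = {"restaurant": "name", "cost_two": "cost_for_two",
                  "delivery_eta": "delivery_time_raw", "stars": "rating"}

def standardize(df: pd.DataFrame, mapping: dict, source: str) -> pd.DataFrame:
    out = df.rename(columns=mapping)
    out["source"] = source          # keep provenance for debugging and auditing
    return out

# Columns present in only one source simply become NaN for the other on concat.
combined = pd.concat([standardize(zomato_df, ZOMATO_COLUMNS, "zomato"),
                      standardize(swiggy_df, SWIGGY_COLUMNS, "swiggy")],
                     ignore_index=True, sort=False)
print(combined)
```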

For example, the estimated delivery time may be returned as a string, but meaningful analysis is only possible once it is parsed into a numeric or time format, so that parsing must also be done. What is presented here is just the tip of the iceberg: as you go deeper into data wrangling, analysis, and aligning the data to the problem, more such challenges will arise and need to be overcome.
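For instance, a small sketch of parsing an estimated-delivery-time string such as "30-40 mins" into numeric minutes might look like this; the input formats are assumed examples of what a scraped field could contain.

```python
# Sketch of parsing an estimated-delivery-time string into numeric minutes.
# The input formats are assumed examples of what a scraped field might contain.
import re
from typing import Optional

def parse_delivery_minutes(raw: Optional[str]) -> Optional[float]:
    """Return the midpoint of a range like '30-40 mins', or None if unparsable."""
    if not raw:
        return None
    numbers = [int(n) for n in re.findall(r"\d+", raw)]
    if not numbers:
        return None
    return sum(numbers) / len(numbers)

print(parse_delivery_minutes("30-40 mins"))   # 35.0
print(parse_delivery_minutes("45 mins"))      # 45.0
print(parse_delivery_minutes(None))           # None
```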

Let's say you have persisted and got the data into the format you wanted. Now let us talk about the target variable. The target variable for this analysis is the rating bucket. The first step is to convert the ratings into a standard format: some sites rate a restaurant from 1 to 5, others from 1 to 10. We need to transform these ratings into the buckets of good and bad restaurants. Such conversions, where you derive values from existing values, are often not straightforward.
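A minimal sketch of that bucketing, assuming we know each source's rating scale and pick a cut-off (the 4-out-of-5 threshold below is an assumption, and really a business decision rather than a technical one):

```python
# Sketch of mapping ratings on different scales to a common good/bad bucket.
# The per-source scales and the 0.8 cut-off are assumptions for illustration.
RATING_SCALE = {"zomato": 5, "swiggy": 5, "other_site": 10}   # max rating per source
GOOD_THRESHOLD = 0.8                                          # 4.0/5 or 8.0/10

def rating_bucket(rating: float, source: str) -> str:
    normalised = rating / RATING_SCALE[source]
    return "good" if normalised >= GOOD_THRESHOLD else "bad"

print(rating_bucket(4.3, "zomato"))       # good
print(rating_bucket(7.1, "other_site"))   # bad
```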

Other data operations can be:

Identify data quality issues: is there any corruption that can be spotted, or are there incorrect values? For example, if the delivery time shows negative values, you know the data is wrong, because time cannot be negative.

Resolve missing entries, inconsistencies in the data and any semantic errors like wrongly labeled columns.

Extract new features from existing features or identify new ones.

Encode the data, i.e., convert non-numeric data to numeric. (A short sketch of these operations follows this list.)
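Here is a rough sketch of these operations on the combined table. The column names (delivery_minutes, cuisine, delivery_available) follow the earlier examples and are assumptions about the final schema.

```python
# Rough sketch of the data operations above. Column names are assumptions
# carried over from the earlier examples of the combined table.
import pandas as pd

def basic_cleaning(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Incorrect values: delivery time can never be negative, so blank those out
    negative = df["delivery_minutes"] < 0
    print(f"{int(negative.sum())} rows with negative delivery time")
    df["delivery_minutes"] = df["delivery_minutes"].where(~negative)

    # Missing entries: report them, then decide whether to impute or drop
    print(df.isna().sum())

    # Encoding: one-hot encode a non-numeric feature such as cuisine
    df = pd.get_dummies(df, columns=["cuisine"], prefix="cuisine")

    # Simple yes/no flag to numeric (assumes lower-cased string values)
    df["delivery_available"] = df["delivery_available"].map({"yes": 1, "no": 0})
    return df
```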

Stakeholders often ask for things they can’t have. It is especially common for stakeholders to want answers to questions at the same time they lack the data needed to provide those answers. For example, they want to predict the rating of a brand new restaurant (whose data is not available). It’s your job as a data scientist to identify data problems before you conduct your analysis, and to only spend your time trying methods that are appropriate to the situation. Sometimes you won’t even realize that a crucial data point is missing until you are in the thick of your analysis.

Now, with all the operations you have done and questions you have answered, you have a rough dataset to begin with. Let's assume the final form of the dataset is as follows.

Target variable: Rating Bucket, i.e., Good or Bad (if you decided to go in the direction of regression, it would be the rating itself)

Features (a sample row under this assumed schema is sketched after the list):

URL of the restaurant page from which the data was scraped

Name

Location of the restaurant

Cuisine

Cost for 2

Delivery available (yes or no)

City

Reviews

Dishes typically liked by people

Conversations around the restaurant from Twitter
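To make this assumed schema concrete, here is one illustrative row; every value is made up purely to show the shape of the final table.

```python
# One illustrative row under the assumed schema; every value is invented
# purely to show the shape of the final table.
import pandas as pd

sample = pd.DataFrame([{
    "url": "https://example.com/restaurant/cafe-a",
    "name": "Cafe A",
    "locality": "Indiranagar",
    "city": "Bangalore",
    "cuisine": "Continental",
    "cost_for_two": 600,
    "delivery_available": 1,                             # yes/no encoded as 1/0
    "reviews": ["Great pasta", "Slow service on weekends"],
    "liked_dishes": ["pasta", "tiramisu"],
    "tweet_count": 12,                                   # aggregated from Twitter
    "rating_bucket": "good",                             # target variable
}])
print(sample.dtypes)
```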

By the time you reach this point, you will certainly have aggregated several data sources. Now a basic dataset is ready and your stakeholders are eager for you to start your analysis. Here are some guidelines for planning your datasets:

Identify all dataset needs ahead of time. Make sure you have all the pieces to the data puzzle available. For example, you could say: “At the very least, I need the location data, cost for 2, cuisine and reviews to start with.”

Differentiate between necessity and sufficiency. It is relatively easy to identify when the lack of certain information will make your analysis hard to do. That is a focus on necessity: the things you need in order to proceed with your work. It is harder to focus on sufficiency: even if you have everything you need, that doesn't mean you'll be able to complete the analysis as planned. If the different datasets don't share a common key on which to join the information, or you can't get access to some datasets even though they exist, or some of the data has so many missing values that it cannot support your use case, then your analysis will disappoint both you and your stakeholders. For example, you could say, "We've identified where all the data is. Do all the data stores have a common column like a restaurant id or name that can tie the datasets together?" (A small join sketch follows these guidelines.)

Understand the data-generating process. Even if the data technically exists somewhere in a database, take the time to figure out how it got there. Understand how it was filtered, transformed, or otherwise processed before it reached the place where you will receive it. Also focus on data refresh cycles: how old is the data? When does it get updated? How is it updated? Who or what decides when it is updated? For example, you could ask: "How is our location data stored? Do we have a separate record for every time a user gives a new review of the restaurant?"

Know when additional data collection is necessary. Sometimes the only way to complete an analysis is to collect more data. If additional data collection isn't possible, then the scope and goals of the analysis need to be renegotiated with stakeholders.
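To illustrate the common-key concern above, here is a small sketch that joins two sources on a normalised (name, locality) pair. That key choice is an assumption made for the example; a proper restaurant id, if one exists, is far more reliable.

```python
# Sketch of joining two sources on a shared key. The normalised (name, locality)
# key and the sample rows are assumptions made purely for illustration.
import pandas as pd

listings_df = pd.DataFrame(
    [{"name": "Cafe A ", "locality": "Indiranagar", "cost_for_two": 600}])
tweet_stats_df = pd.DataFrame(
    [{"name": "cafe a", "locality": "indiranagar", "tweet_count": 12}])

def with_key(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Lower-case and strip so that "Cafe A " and "cafe a" produce the same key
    df["key"] = (df["name"].str.lower().str.strip() + "|"
                 + df["locality"].str.lower().str.strip())
    return df

merged = with_key(listings_df).merge(
    with_key(tweet_stats_df)[["key", "tweet_count"]], on="key", how="left")
print(merged[["name", "cost_for_two", "tweet_count"]])
```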

Planning your datasets serves several purposes:

1. It minimizes surprises. It is always easier to plan for contingencies before you begin your analysis than it is to try to adapt in the middle of your work as deadlines approach.

2. It ensures you have all of the support you need. If stakeholders are made aware of problems in the data from the start, they will be more patient and sympathetic when you face delays or unexpected obstacles.

3. It generates good ideas for new datasets.

With the dataset in hand, it is time to do the exploratory data analysis and other explorations.

Exercise — Get the Dataset

Now you have the complete data science problem in hand; all you need to do is run it through the usual ML pipeline. Just as discussed above, plan out the data sources from which you would collect data. Some of the data you require will be readily available, and some will be difficult to obtain. Think through everything you would do with the data, just without actually coding it up in a Jupyter notebook. Identify the bare minimum data you can collect, along with the sources you would collect it from. Identify the challenges you anticipate. Look at apps in the same domain as the problem you are solving and note the different data points that are present. At the end of this exercise, you should have all the features of the dataset on paper, along with the target metric.

Plan your Methods

People create datasets for specific purposes, purposes that have often been forgotten by the time you come around and want to use the dataset for an analysis. It's easy to look at a column name and assume the dataset has what you need. Because of that, it's very common for data scientists to find out, at least halfway into their analysis, that the data they have isn't really the data they need. Some of those problems manifest themselves only through careful exploratory data analysis, so a thorough EDA is essential before applying any methods. It also gives you an opportunity to present the stakeholders with a few insights that might be useful to them.

Consider the following questions:

For a particular city, give a locality-wise breakup of the number of restaurants.

For each locality, break up the number of restaurants by cuisine, e.g., Indian, Continental, Chinese.

Visualize the density of restaurants in each locality: are they clustered close together or spread far apart?

For every locality, give the average cost of restaurants in that locality.

For every locality, what are the percentages of the restaurants that deliver food?

Break up the data into good and bad restaurants and explore further:

Avg cost of good and bad restaurants

Are all the good restaurants present in the same locality or are they spread across different localities?

Cuisine-wise breakup of good and bad restaurants: is there a particular cuisine in a city or locality that is not doing so well? (A few of these breakups are sketched in code below.)
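A few of these breakups, sketched with pandas under the assumption that the combined table has city, locality, cuisine, cost_for_two, delivery_available (encoded as 0/1) and rating_bucket columns as built earlier:

```python
# Sketch of a few EDA breakups. Column names are assumptions carried over from
# the schema built earlier in this post.
import pandas as pd

def eda_breakups(df: pd.DataFrame, city: str) -> None:
    city_df = df[df["city"] == city]

    # Number of restaurants per locality
    print(city_df["locality"].value_counts())

    # Cuisine mix within each locality
    print(city_df.groupby(["locality", "cuisine"]).size().unstack(fill_value=0))

    # Average cost and share of delivering restaurants per locality
    print(city_df.groupby("locality").agg(
        avg_cost=("cost_for_two", "mean"),
        pct_delivering=("delivery_available", "mean")))

    # Average cost of good vs bad restaurants
    print(city_df.groupby("rating_bucket")["cost_for_two"].mean())
```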

It's unlikely that all of the information needed is stored in one place, already formatted in a way that makes it ready for your investigation and able to answer these questions. You'll need to bring all of that data together, which means you need a plan.

If you are able to answer most of these questions in the EDA phase and surface the right insights for the stakeholders, that is itself a huge value add. You could even do some statistical analysis and answer questions like:

Is there any relationship between the rating and the cost of the restaurant?

The above question can be answered with a chi-squared test of independence between the rating bucket and a bucketed version of cost.
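A sketch of that test: bucket cost into categories (the bins below are assumptions), cross-tabulate against the rating bucket, and test the two for independence.

```python
# Sketch of a chi-squared test of independence between rating bucket and cost.
# The cost bins and column names are assumptions from the earlier schema.
import pandas as pd
from scipy.stats import chi2_contingency

def rating_vs_cost_test(df: pd.DataFrame) -> float:
    cost_bucket = pd.cut(df["cost_for_two"], bins=[0, 400, 800, float("inf")],
                         labels=["budget", "mid", "premium"])
    contingency = pd.crosstab(df["rating_bucket"], cost_bucket)
    chi2, p_value, dof, _ = chi2_contingency(contingency)
    print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
    return p_value   # a small p-value suggests the two variables are related
```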

You have your business needs. You have your milestones. You have your data. It might seem like there is (finally) nothing left to do but conduct the analysis. But there is still one more step.

Consider the following questions:

Which methods are inappropriate for your analysis?

Of those methods that are appropriate, what are the costs and benefits of using each one?

If you find a number of methods that are appropriate and have roughly the same costs and benefits (and you probably will), how do you decide how to proceed?

This is the core competency of a data scientist: choosing and using analytic techniques to derive value from data. Given the problem at hand, here are some ways you could go about planning what methods you will investigate:

Identify unsuitable methods first. Judge whether a black-box solution would suffice for the business needs, or whether the model needs to be interpretable so that the results can be explained to the stakeholders.

Keep constraints in mind. If your preferred method requires a GPU but you don’t have easy access to a GPU, then it shouldn’t be your preferred method, even if you think it is analytically superior to its alternatives. Similarly, some methods simply do not work well for large numbers of features, or only work if you know beforehand how many clusters you want. Save time by thinking about the constraints each method places on your work — because every method carries constraints of some kind.

Choose boring technology. Analytic approaches like deep learning and reinforcement learning are exciting. As a general rule, the more exciting the technology is, the less you should use it. Technologies are exciting when they are relatively new, and when technologies are new, they are less stable and harder to support and maintain. A "boring" technology contains far fewer surprises. Look for surprises in your data, not in your technology, and you will tend to build tools that last longer and work better.

Be willing to walk away. Even after you have eliminated unsuitable methods and further narrowed down your list to accommodate your project's constraints, you will still likely have more than one method that could plausibly work for you. There is no way to know beforehand which of these methods is better; you will have to try as many of them as possible, and try each with as many initializing parameters as possible, to know what performs best. You will probably run out of time before you run out of models and configurations to try. Don't fall into the trap of thinking you need to ask for more time in order to test everything. Set yourself a time limit and go with the best you have at the end of that time.

Planning your methods serves several purposes:

1. It keeps you from wasting your time on methods that will not ultimately suit your purpose. If a method works beautifully but does not work at scale, and you need it to work at scale, then it is not a good method to choose. If a method can’t handle a high number of variables without overfitting, and you have a high number of variables, it is not a good method to choose.

2. It keeps your mind open to all opportunities, even the less appealing ones. It's often not particularly fun to implement a simple heuristic or use a model that has been around for decades, but that is often the most appropriate choice for a business.

3. It keeps your work compatible with the rest of the business. Be a good colleague and think about how your work is going to impact others. Your work shouldn't just accomplish your own commitments to stakeholders. It should make it as easy as possible for others, such as engineers, to accomplish their commitments. Build things in a way that others can use them as easily as possible.

Keeping the above thoughts in mind, let's come back to the case study. We need the solution to be interpretable so that the results can be explained to the stakeholders properly. There are constraints on deployment costs, so the use of a GPU must be avoided. And since the stakeholder, the Director of Sales (B2B), needs to hit a target for onboarding new restaurants, he needs the solution soon. Considering all of this, logistic regression or random forests might be the right choices for modelling.
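A minimal modelling sketch consistent with these choices: an interpretable logistic regression evaluated on precision, the metric chosen at the start of the series. The feature columns are assumptions based on the schema built above, and a random forest could be swapped in for the classifier with the rest of the pipeline unchanged.

```python
# Minimal modelling sketch: interpretable logistic regression, evaluated on
# precision. Feature columns are assumptions from the schema built earlier.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def train_baseline(df: pd.DataFrame) -> Pipeline:
    X = df[["cost_for_two", "delivery_available", "cuisine", "locality"]]
    y = (df["rating_bucket"] == "good").astype(int)

    preprocess = ColumnTransformer([
        ("num", StandardScaler(), ["cost_for_two", "delivery_available"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["cuisine", "locality"]),
    ])
    model = Pipeline([("prep", preprocess),
                      ("clf", LogisticRegression(max_iter=1000))])

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    model.fit(X_train, y_train)

    # Precision: of the restaurants we predict as "good", how many really are?
    print("precision:", precision_score(y_test, model.predict(X_test)))
    return model
```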

At every step of the analysis, as explained in the beginning, there could be roadblocks that cause us to revisit our assumptions and go back to the beginning; it is a spiral model of solution development. What you have just seen is an illustration of how a business problem is converted into a data science problem and how the data analysis is done. The approach and the questions might differ from case to case, but overall these guidelines will get the job done.

Exercise — Models and Analysis

With the dataset identified, you now have the final step: analysis and modelling. After arriving at the dataset, think about what data explorations you would do and what questions you will answer. Write down the questions you will definitely ask during the EDA. Looking at the data, think deeply about which ML algorithms are best suited for the analysis, weigh the pros and cons of each, and decide which one you would like to proceed with.

Bonus: Think of the presentation you would like to make for your stakeholders and what you would communicate to them. Think along the lines of what is interesting to them, what would benefit the business objective, and so on.

Summary

Let’s do a quick recap of what we have done till now, mapping the complete steps to the entire pipeline that we began with:

Here is a checklist that summarizes the various guidelines and principles for the journey from a business problem to a data science problem.

Frame the Problem

Get concrete as fast as you can

Focus on consequences

Look for opposites

Look for hidden problems

Understand timing

Understand expectations

Understand downstream effects

Understand when the business problem isn’t a data science problem

Set the objectives

Eliminate possibilities

Think about dependencies

Group milestone activities by entity

Include housekeeping items

Think modularly

Get external advice

Prioritize pain points

Think explicitly about trade-offs

Figure out the business’s “value” units

Subset all metrics

Keep it as explainable as possible

Prepare the Data

Identify all dataset needs ahead of time

Differentiate between necessity and sufficiency.

Understand the data-generating process

Know the data refresh cycles

Build and Train the model

Identify unsuitable methods first

Keep constraints in mind

Choose boring technology

Be willing to walk away

The final step, make predictions and fine-tune, is about looking at the results, fine-tuning the model (and, if needed, the business problem itself), and iterating through the steps again. As mentioned in the beginning, the complete process is an iterative spiral rather than a linear path. To summarize, the process of converting a business problem into a data science problem is extremely important. Data science is not just about taking ready-made data and applying EDA and ML algorithms; no company will have data ready for analysis. Identifying the problem and distilling it into a version that data science can solve is equally important.

At the end of the day, data science is more about problem-solving and structured thinking than just chasing a metric.
