Photo by Vitaly Vlasov from Pexels

How to Analyze Data (When You Don’t Know What You Don’t Know)

by Kyle Kristiansen

Opex Analytics
5 min read · Oct 25, 2019


As an operations research consultant with a lot of data to analyze and no time to waste, I’ve found that it’s easy to get lost if you’re not careful.

It’s a lot like when you’re riding your bike. You might see people playing soccer in the park, the flashy sign of a new restaurant in your neighborhood, or disgruntled commuters in bumper-to-bumper traffic. But which is more valuable: capturing every detail about your surroundings, or observing the minivan to your right that’s veering too close for comfort?

As analytics consultants, we help our clients make better decisions with their data. To ensure that our analysis is sound, we need to know the ins and outs of their data — strengths, weaknesses, assumptions, patterns, and more.

In this post, I’ll share my framework for approaching any new data I encounter. I focus on three main tasks: cleaning, scoping, and exploring. I use this scaffold to convert raw data into information that I can use to make decisions, explore trends, or build a full-scale operations research model.

Cleaning

What fields or records might throw a wrench in my analysis?

I recently processed a big dataset of orders, which included general order info, ship times, and receipt times. I needed to visualize transit lead times, so I subtracted each order’s ship date from its receipt date. To my surprise, I found a nontrivial number of negative values! After some investigation, I found that these illogical time sequences were the result of data input errors. I discussed this finding with my client, and we agreed to drop any records with date misalignment (which made up around 8% of all records). With the data now nice and clean, I was one step closer to making sense of it.
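
As a rough illustration (not the exact code from the project), that lead-time check might look like this in pandas, with hypothetical file and column names:

```python
import pandas as pd

# Hypothetical file and column names; adjust to your own schema.
orders = pd.read_csv("orders.csv", parse_dates=["ship_date", "receipt_date"])

# Transit lead time in days: receipt date minus ship date.
orders["lead_time_days"] = (orders["receipt_date"] - orders["ship_date"]).dt.days

# Flag illogical records where the order was "received" before it shipped.
bad = orders["lead_time_days"] < 0
print(f"{bad.mean():.1%} of records have negative lead times")

# After confirming with the client, drop the misaligned records.
orders = orders[~bad].copy()
```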

Data cleaning can involve many things, but missing values, outliers, and observations that violate assumptions are usually my three biggest speed bumps. When I find sparse columns dominated by null or missing values, I often drop them entirely or think of sensible ways to impute their values. If I notice unrealistically high or low values, I make sure to investigate them more thoroughly (because not all outliers are bad). And I certainly look at negative lead times.
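
A minimal sketch of that kind of screening, again with hypothetical column names and with the sparsity threshold and outlier rule chosen arbitrarily:

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["ship_date", "receipt_date"])
orders["lead_time_days"] = (orders["receipt_date"] - orders["ship_date"]).dt.days

# Drop columns that are mostly empty (the 80% threshold is a judgment call).
sparse_cols = orders.columns[orders.isna().mean() > 0.8]
orders = orders.drop(columns=sparse_cols)

# Flag, rather than delete, potential outliers for a closer look (1.5 * IQR rule).
q1, q3 = orders["lead_time_days"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = orders[
    (orders["lead_time_days"] < q1 - 1.5 * iqr)
    | (orders["lead_time_days"] > q3 + 1.5 * iqr)
]
print(outliers.head())
```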

Analysis performed without removing these errors may look strange, and it could invalidate the conclusions drawn from your work. Make the data as clean as it can be so that your work is clear, effective, and genuinely valuable.

Scoping

Are there specific segments of the data that are relevant to the problem I want to solve?

As part of my aforementioned project, I wanted to pull all receipt records from the second quarter of 2019. These receipts, though logged in 2019, correspond to orders placed and shipped as far back as August 2018. Had I blindly scoped my data to orders shipped in Q2 2019, I would have dropped a number of products with longer lead times, and my analysis would have been incomplete.

On top of that, a sneakier attribute was tucked into the unique identifier of each record — its last letter acted as a suffix that indicated if the order was expedited or a sample. Both of these order types are abnormal, and shouldn’t be considered when analyzing the general lead time distribution.
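
Here's a sketch of that scoping step, assuming hypothetical column names and suffix codes ("E" for expedited, "S" for sample):

```python
import pandas as pd

# Hypothetical schema: order_id, ship_date, receipt_date, ...
orders = pd.read_csv("orders.csv", parse_dates=["ship_date", "receipt_date"])

# Scope on the receipt date, not the ship date, so that long-lead-time
# orders received in Q2 2019 are not dropped.
q2_2019 = orders[
    (orders["receipt_date"] >= "2019-04-01")
    & (orders["receipt_date"] < "2019-07-01")
]

# The last character of the order ID encodes the order type;
# exclude expedited ("E") and sample ("S") orders from the general analysis.
suffix = q2_2019["order_id"].str[-1]
scoped = q2_2019[~suffix.isin(["E", "S"])].copy()
```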

Think about attributes in the data that you can use to focus your analysis. Your project may demand that you spotlight specific parts of a product hierarchy, certain areas of the country, or certain order types. In this stage, discussions with subject matter experts are crucial for understanding what each data field tells you about a given record. Find out what your area of focus should be and make sure you cover it completely; no more, no less.

Photo by Sara Garnica from Pexels

Exploring

What are the statistics telling me about my data?

When reviewing lead time per product for my analysis, I needed to model the data's probability distribution to simulate future scenarios. Looking at each field's statistics gave me intuition as to whether I could test for a normal distribution (where P-P plots are very helpful) or a Poisson distribution. If the data were especially concentrated, the orders might be scheduled for delivery with a penalty for lateness, which would completely change the way I approach the order simulations.
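
One way to run that kind of distribution check is sketched below with scipy; the specific tests and column names are my own choices, not necessarily what was used on the project:

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

orders = pd.read_csv("orders.csv", parse_dates=["ship_date", "receipt_date"])
lead_times = (orders["receipt_date"] - orders["ship_date"]).dt.days.dropna()

# Probability plot against the normal distribution: points hugging the
# diagonal suggest normality is a reasonable assumption.
stats.probplot(lead_times, dist="norm", plot=plt)
plt.show()

# A formal normality test as a second opinion (very sensitive for large n).
stat, p_value = stats.shapiro(lead_times.sample(min(len(lead_times), 5000)))
print(f"Shapiro-Wilk p-value: {p_value:.4f}")
```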

In general, calculating each numeric field’s mean, mode, median, min, max, and standard deviation will help you get a better sense of its distribution. A median that’s significantly offset from the average can suggest that the field is skewed, while checking the standard deviation will give you some insight into the field’s general variability. Along with stats, plotting your data can be a great way to understand it. Univariate (i.e., single-variable descriptive) plots, like histograms or box-and-whisker plots, will allow you to home in on distribution information. Bivariate (i.e., two-variable descriptive) plots, like scatterplots or time series plots, can give you insight into the relationships that exist between any two features.
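
In pandas, most of those summary statistics come from a single describe() call, and the plots are one-liners (the order_quantity field here is a placeholder):

```python
import pandas as pd
import matplotlib.pyplot as plt

orders = pd.read_csv("orders.csv", parse_dates=["ship_date", "receipt_date"])
orders["lead_time_days"] = (orders["receipt_date"] - orders["ship_date"]).dt.days

# Mean, standard deviation, min, quartiles (including the median), and max
# for every numeric column; .mode() covers the mode separately.
print(orders.describe())

# Univariate view: the distribution of lead times.
orders["lead_time_days"].plot.hist(bins=30, title="Transit lead time (days)")
plt.show()

# Bivariate view: lead time against a hypothetical order_quantity field.
orders.plot.scatter(x="order_quantity", y="lead_time_days")
plt.show()
```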

If you’re a fan of using Python for data exploration, the pandas_profiling module (available on GitHub) is a great tool for automating much of the descriptive work in exploratory analysis, which will free up your time for more complicated efforts. A solid understanding of your data will pay dividends as you build models, create reports, and think about the problem.
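
A minimal usage sketch, based on the pandas_profiling API around the time this post was written (the exact call signature may vary between versions):

```python
import pandas as pd
from pandas_profiling import ProfileReport

orders = pd.read_csv("orders.csv")

# One call generates an HTML report with per-column statistics, histograms,
# missing-value summaries, and correlations.
profile = ProfileReport(orders, title="Orders profiling report")
profile.to_file("orders_profile.html")
```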

Photo by Ylanite Koppens from Pexels

Conclusion

Using this framework (cleaning → scoping → exploring) won’t guarantee you total freedom from the puzzles that accompany new data, but it can help you structure the way you turn raw information into a format that’s ready to be analyzed. In my experience, thinking through a data prep plan ahead of time will help you conduct meaningful analysis more efficiently. With some practice, you’ll find what works for you and adapt this process to your own needs.

Whether it’s getting ready for the day or performing data analysis, getting the order right is important. So when you get the urge to skip the preparation and immediately dress up your data with sharp visualizations, slow down and get to know it first — you’ll be glad you did.

_________________________________________________________________

If you liked this blog post, check out more of our work, follow us on social media (Twitter, LinkedIn, and Facebook), or join us for our free monthly Academy webinars.
