Published in Analytics Vidhya

Applying Data Science in Manufacturing: Part IV — Summary and Conclusions

Note: The modifications to this article were to correct typos and grammatical errors

In Part I of this post I discussed how manufacturing processes are data-rich environments and the possibility of improving them by applying machine learning techniques. Of specific interest was the possibility of a new process control paradigm: instead of controlling each process parameter within a value range, control through parameter relationships.

In Part II, model building to establish parameter relationships was performed on a batch process dataset. A classification model was built that demonstrated excellent predictive accuracy.

For Part III, a continuous process dataset was evaluated. This dataset was an order of magnitude larger than the batch process dataset in both rows and columns. Calibration of the target measurements was a concern, and this author felt further information was needed from the process owner to build a useful model. The analysis was completed with that caveat.

Part IV will be a more in-depth discussion of the Thoughts and Lessons Learned in the Post Mortems of Parts II and III. The following will be discussed:

  • Focusing on part and missing the whole


In Part II the feature data provided to Kaggle had been standardized to a z statistic and split into train and test dataframes. The shape attribute was used to compare the number of rows in each dataframe. shape, describe, info, head and value_counts provide basic information about a dataframe. Their results need to be retained for retrieval in the data cleaning, preprocessing and modeling steps.

In this case I didn’t mentally retain that the shape analysis showed the test dataframe had one fewer column than the train dataframe. The dataset author had structured the datasets in a Kaggle competition format. Focusing on the row output of shape, I overlooked the column output. As a result, I did not set aside a validation dataframe to evaluate the models built on my training data.
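A minimal sketch of the check that was missed, assuming pandas and two small fabricated dataframes. The column names and sizes are hypothetical, not the Kaggle data, and the sketch assumes the column missing from test is the target. It compares both dimensions of shape, lists the columns found in only one dataframe, and holds out a validation set from train:

```python
import numpy as np
import pandas as pd

# Fabricated stand-ins for the train and test dataframes (hypothetical names).
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
train["target"] = rng.integers(0, 2, size=100)
test = pd.DataFrame(rng.normal(size=(40, 3)), columns=["f1", "f2", "f3"])

# Keep the shape results in variables so they can be checked later.
train_rows, train_cols = train.shape
test_rows, test_cols = test.shape
print(f"train: {train_rows} rows x {train_cols} cols")
print(f"test:  {test_rows} rows x {test_cols} cols")

# Columns present in train but absent from test (here, the target).
missing_in_test = sorted(set(train.columns) - set(test.columns))
print("missing in test:", missing_in_test)

# With no target in test, hold out part of train for validation.
valid = train.sample(frac=0.2, random_state=0)
fit = train.drop(valid.index)
```

Storing the shape results in named variables, rather than only reading them off the screen, is one way to make sure they are still available at the modeling step.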

In Part III multicollinearity of the feature variables was not checked. I missed something important for good analysis.
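Part III does not say which check would have been used. One common option is the variance inflation factor (VIF), sketched below with NumPy on fabricated features (the names x1, x2 and x3 are hypothetical). A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a warning sign:

```python
import numpy as np

# Fabricated features: x3 is nearly a linear combination of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)
X = np.column_stack([x1, x2, x3])

# Variance inflation factors are the diagonal of the inverse
# of the feature correlation matrix.
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["x1", "x2", "x3"], vif):
    print(f"{name}: VIF = {value:.1f}")
```

Here all three features show a large VIF, because each one can be nearly reconstructed from the other two.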

One may say that these types of mistakes are going to happen. I agree. There is a cost to mistakes, but there is also a cost in trying to ensure mistakes never occur. I believe the story from the chart below is valid: there is a minimum cost in the cost vs. “mistake proofing” curve. I believe that minimum can be approached through a checklist.

Find the minimum cost between error prone and perfect

Checklists can be bureaucratic clunkers adding little value, or useful mental triggers. There is something to be said for stopping to think and filling out a form. If the checklist is completed by rote, no value is added. But if it’s not rote, our mind thinks about what it just checked. It can trigger new thought patterns. Creativity can be stimulated. What begins as a valuable verification activity can also be a seed for innovation.


Far too many of us STEMers are terrible, absolutely terrible, at connecting with non-STEMers. We won’t shift from our left-brain, data-driven mindset towards a more right-brain mindset when presenting to non-STEMers. We think graphs are the answer. Connecting with your audience, an essential skill if your analysis is going to have any practical value, requires more than just pretty graphs. It requires understanding your audience’s concerns and making them your concerns. Authentically, not just with words during a half-hour presentation. It requires letting go of ego, thinking of others, and not getting bent out of shape when you don’t get the credit you deserve.

We also do a bad job connecting with fellow STEMers. In Parts II and III I discussed the need to get the Engineers trained in the Data Science/Machine Learning mindset. The Engineers are just as technically competent as Data Scientists, but haven’t learned large dataset analysis. They may have already experienced the frustration of applying small dataset analysis techniques to large datasets. You are introducing technical ideas that are not only foreign but easily misunderstood as “wrong”. And once a bad first impression is made, you’ve made things a whole lot worse for yourself.

So start early with educating on the Data Science/Machine Learning way of thinking. Take baby steps. Give people time to wrestle with the concepts before showing them your analysis results. Give them time to become internally convinced instead of just believing you.


In my Data Science education projects and Kaggle competition experience, effort was expended to eke out every possible 1% accuracy improvement by refining the model (optimal number of features, hyperparameter optimization, a different model, etc.). Statistical uncertainty is smaller for predictions from large datasets than from small ones. With large datasets, even minimal accuracy improvements are real, i.e. you really will have more accurate predictions for future data. But in the overall picture, was it worth it? What else happened within the organization, i.e. perception changes or attitude shifts, because of your model improvement efforts? Did you have a model that was understood and accepted by the right people, but then drop it for a model giving a 1% accuracy improvement that only you understand? What are the ramifications of doing that?
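The point about dataset size can be made concrete with a rough sketch. It assumes the test rows are independent and uses the normal approximation to the binomial, so it is a back-of-the-envelope figure rather than a formal model comparison. The function name and the 0.90 accuracy are hypothetical:

```python
import math

def accuracy_half_width(accuracy, n_rows, z=1.96):
    """Approximate 95% half-width for an accuracy measured on n_rows."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_rows)
    return z * standard_error

for n_rows in (1_000, 100_000):
    half_width = accuracy_half_width(0.90, n_rows)
    print(f"n = {n_rows:>7,}: 0.90 +/- {half_width:.4f}")
```

Under these assumptions the uncertainty is about 0.019 at 1,000 rows and about 0.002 at 100,000 rows. A 1% gain is inside the noise in the first case and well outside it in the second, which is why the gain is statistically real on large datasets even when its business value is doubtful.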


This point has already been heavily emphasized in Parts I through III. The changes or actions the data analysis suggests must be doable physically, mentally and emotionally (see above), and culturally. If needed, modify recommendations to at least get some success. I’ve made the mistake, and seen others make the mistake, of giving decision makers recommended actions that just aren’t palatable. Sometimes we leave them with a binary, all-or-nothing choice where the “all” is just too much. Baby steps and partial victories are better than defeat.


In Python, data is identified as either numeric (float, int), character (string) or datetime (sort of numeric). Numeric implies either an interval or ratio scale. Just because it’s identified one way, however, doesn’t mean it should be treated as such. In Part III there were only 2 to 4 unique values for each of the 12 columns of raw material properties. It’s likely those properties are measured on a numeric interval or ratio scale (concentration of an impurity, for example), but what we actually have in the dataset cannot be classified as such. Is our chosen model going to properly utilize the data it receives from the dataset? We all know about one hot encoding to convert character to numeric, but are there other data considerations for the model? Think about the actual data types in your dataset.
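A minimal sketch of one way to surface such columns, assuming pandas and a fabricated dataframe. The column names, values and the max_levels threshold are hypothetical. Whether to recode a flagged column remains a judgement call that depends on the model:

```python
import pandas as pd

# Fabricated raw material properties (hypothetical names and values).
df = pd.DataFrame({
    "impurity_pct": [0.1, 0.1, 0.3, 0.3, 0.1, 0.3],
    "viscosity": [12.0, 15.0, 12.0, 18.0, 15.0, 12.0],
    "line_temp": [201.3, 198.7, 205.1, 199.9, 202.4, 200.6],
})

# Count distinct values per column; dtype alone says all are float64.
unique_counts = df.nunique()
print(unique_counts)

# Flag numeric columns with very few levels (threshold is a judgement call).
max_levels = 4
low_cardinality = unique_counts[unique_counts <= max_levels].index.tolist()
print("treat as categorical?", low_cardinality)

# One option: an ordered categorical keeps the ranking without
# claiming the levels are evenly spaced.
for col in low_cardinality:
    levels = sorted(df[col].unique())
    df[col] = pd.Categorical(df[col], categories=levels, ordered=True)
print(df.dtypes)
```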


In Part III of this series measurement results raised calibration concerns. In addition, measurement error for temperature measurement devices eliminated some feature columns. Because of measurement error these column values could not be distinguished one from another. The measurement values being recorded were confounded with measurement error.
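The screening rule below is an illustrative assumption, not the method used in Part III. It flags a column when the full spread of its values is no larger than twice an assumed device tolerance. The column names, readings and the tolerance of 1.0 degree are all hypothetical:

```python
import pandas as pd

# Fabricated temperature readings (hypothetical names and values).
df = pd.DataFrame({
    "zone1_temp": [250.2, 250.6, 249.9, 250.4, 250.1],
    "zone2_temp": [180.0, 195.5, 172.3, 188.9, 201.2],
})

# Assumed device accuracy of +/- 1.0 degree; take this from the
# instrument specification or a gauge study, not from the data.
tolerance = 1.0

# Two readings closer than 2 * tolerance could be the same true value.
spread = df.max() - df.min()
indistinguishable = spread[spread <= 2 * tolerance].index.tolist()
print("spread per column:")
print(spread)
print("within measurement error:", indistinguishable)

usable = df.drop(columns=indistinguishable)
```

In this fabricated case zone1_temp varies by less than the assumed measurement error, so its values cannot be told apart and the column is dropped.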

Measurement confounding does not just apply to Manufacturing. Here’s a made-up example:

Say an organization provides a service that is used by the Financial industry and the Retail industry. Management wants to know how the service is viewed by the clients. A survey is created. The survey is mailed to the Financial industry clients for completion, while the Retail clients are given it in person. Analysis indicates Retail clients are more satisfied with the service than Financial clients. Further resources are expended to find out why there is a difference, when in reality there is none. The data was gathered in different ways for the two client groups, and any difference in client satisfaction is confounded with how the data was gathered. You cannot separate the two.
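The made-up survey can be sketched with fabricated numbers. Both groups are deliberately given identical true scores, and the in-person collection is assumed to shift each answer up by one point. The scores and the size of the shift are hypothetical:

```python
from statistics import mean

# Fabricated survey scores on a 1-5 scale. Both groups are given the
# same true satisfaction so that any observed gap comes from the method.
true_scores = [3, 4, 4, 3, 5, 4, 3, 4]

# Assumed collection effect: in-person respondents answer one point higher.
in_person_shift = 1

financial_mailed = list(true_scores)
retail_in_person = [min(5, s + in_person_shift) for s in true_scores]

gap = mean(retail_in_person) - mean(financial_mailed)
print(f"Financial (mail):   {mean(financial_mailed):.2f}")
print(f"Retail (in person): {mean(retail_in_person):.2f}")
print(f"observed gap:       {gap:.2f}")
```

Because every Retail response was collected in person and every Financial response by mail, the recorded scores alone cannot show whether the gap comes from the industry or from the collection method.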

Know how your data was gathered. Look for possible biases. Look for inconsistencies. When sampling is necessary, don’t get seduced by the random sample cure-all. There may still be bias in the random sample. Think about the sampling. Think about the question the analysis is answering. Just think.


Discipline in analysis is good, but in the extreme it can also help create analysis paralysis, an unhealthy dependency, a counterproductive devotion to the numbers. On several occasions in my career I was asked “how do you know that’s an outlier?” My reaction: “Really? We gathered these data points. This one value is so far removed from the others that it’s obviously an outlier. The lab techs who gathered the data told me what got messed up to cause this outlier. And you don’t want to call this an outlier?”

I completely agree that at times the cause of an outlier needs to be identified. You may learn something valuable by studying it. If you don’t want this outlier to occur again, you need to take some action. But this mentally lazy attitude of “the numbers don’t lie” when physics, chemistry, and what should be common sense say otherwise is counterproductive. Judgement and the accompanying courage will always be needed in the practical application of STEM skills. Yes, sometimes you’re stuck with having to accept the data even though everyone accepts there was a data gathering error. You’ll have to live with it. But don’t hide behind the “how do you know?” mental laziness.


In Part III I experienced such troubleshooting difficulty with a fairly simple concept that it’s inspired a future article on scaling methods. For this upcoming article I will not dig deep into the math/statistics theory and then write what I discovered. What I plan to do is spend some time learning fundamental principles around the different scaling methods. Then I’ll fabricate some datasets and replicate the problem I experienced.

I find this empirical, this “show me instead of tell me” approach useful. I have fiction writers in my family, and one of the principles of fiction writing is to “show” instead of “tell”. Deep mathematical formulas are impressive, and if you want to go that deep go for it. But business value is generated by something that “works”. Want persuasive power? Learn what you need to, then show that it works.

The author currently resides in Pittsburgh, PA and has worked in the Aerospace, Semiconductor, Textile and Medical Device Industries. He can be reached at


