A Machine Learning Use Case for Manufacturing Planning

Vaibhav Mehrotra · Published in Sclable · Apr 14, 2021 · 11 min read

Using classification techniques to support automated manufacturing process design from product specifications.

Spitfire manufactured during WWII. Credits: Birmingham Museums Trust.

As we saw in our article series on AI in Manufacturing, manufacturing plants offer one of the most complex yet most promising environments to deploy large-scale AI-based solutions. This article provides a concrete use case in which a machine learning algorithm learns from historical production data to infer the number of necessary manufacturing steps based solely on the primary product specifications.

Problem Description

A manufacturer specializes in the custom production of small mechanical parts. Their customers provide them with specifications of the parts to be produced.

The manufacturer must first determine the exact steps to follow to produce these parts and then decide on appropriate price quotations. The manufacturer has various machine tools on an assembly line used to produce such parts. These machines, or stages as we call them in this case, are steps on a production line, where each step is responsible for giving the part certain properties.

Known Issues and Bottlenecks

First, this manufacturing process design is done manually by senior engineers who are domain experts, which is a bottleneck in the production pipeline.

Second, the manufacturer has various factories, and the knowledge exchange between these factories is limited. Similar parts might have been made in different factories, but this knowledge is not transferred sufficiently and sometimes not at all.

Third, the manufacturer receives many such quotation requests from its customers, and a large share of them never turn into an order, so this design process drains many resources from the company.

Formulation

Problem Statement: Client’s View
Optimize the production cycle by mitigating the bottleneck caused by the time domain experts need to determine the production steps, aiding them with machine learning recommendations.

Potential time gains: from two weeks to a few minutes.

Problem Statement: Data Scientist’s View
Use supervised learning methods to classify a product into the number of required production stages using its specifications as input features.

This is a significantly reduced version of the problem of overall production planning in manufacturing. The comprehensive definition of the problem could be: given the desired product specifications, generate the entire manufacturing sequence.

Data Science Workflow

For this particular case, the labels are: [‘2’, ‘3’, ‘4’, ‘5’, ‘6’]. This means that some parts require 2 machines on the assembly line, whereas others might need 6.

This is a classification problem. Even though the numeric labels between 2 and 6 might make it seem like a regression problem, we must remember that these are categorical variables: there are no 2.1, 2.5, or 2.9 stage parts.

It's time to look at some code snippets!
This exercise is carried out in Python, using pandas, matplotlib, seaborn, numpy, and scikit-learn.

1. Data Exploration

Real-life databases of industrial companies can be extensive and messy. Even a small production line will log dozens of parameters that may or may not be useful for the given problem statement.

In this example, the dataframe has shape (10232, 308). With 308 columns for only 10232 rows, analysis and modeling are difficult due to the curse of dimensionality. Also, a data scientist with minimal knowledge of mechanical engineering and of the nuances of the client's database needs to build domain knowledge first.

This step is necessary to better understand the data, the different columns/features, and their meanings. The objectives can then be translated into data science problems.

A domain expert has to be constantly in touch with the data scientist for adequate knowledge transfer.

After an initial round of communication, the domain experts named the columns deemed necessary for this analysis. The exercise aims to predict the target_features, in particular the '# of stages' (number of stages).

We can see that some columns are continuous, whereas others are categorical. We separate them into continuous_features, categorical_features.

We use the df.describe() function on the different types of features for a preliminary look.
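A minimal sketch of this preliminary look, assuming the data is loaded from a CSV file; the file name and the column names in the three feature lists are hypothetical placeholders, not the client's actual schema:

```python
import pandas as pd

# Load the production data (file name is hypothetical)
df = pd.read_csv("production_data.csv")
print(df.shape)  # (10232, 308) in this exercise

# Column groups as named by the domain experts (placeholder names)
target_features = ["# of stages"]
continuous_features = ["cont_feature_1", "cont_feature_2", "cont_feature_3", "cont_feature_4"]
categorical_features = ["cat_feature_1", "cat_feature_2", "cat_feature_3"]

# Preliminary look at each feature group
print(df[continuous_features].describe())
print(df[categorical_features].describe())  # count, unique, top, freq
print(df[target_features].describe())
```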

Continuous feature 4 is 0 for all rows. This has to be communicated to the client: either there is an error in their database, or some semantic meaning has not been understood or communicated correctly.

We notice that the target_features need cleaning, as the most frequent value for `# of stages` is Machine not found. This indicates missing/erroneous data.

The description of the categorical features is not displayed here to protect the client's data. It shows that each category's most common value represents 15–50% of all products.

This hints that the frequency of products might follow a long-tail distribution. In simple words, a few products might be produced much more often than the bulk of the others. The top 5 products may cover 85% of the manufacturing pipeline.

2. Data Cleaning

Generally, the data cleaning step involves removing or fixing incomplete, erroneous, irrelevant, duplicated, or improperly formatted data. In this exercise, we only identify and remove null values.

  • Removing Null values from categorical variables

Manual inspection of the dataset reveals different null markers, such as NONE or a string of 9 spaces. Removal of null values can be automated, but we prefer to do this manually to better understand which null markers can be present.

For larger datasets, however, manual inspection is not feasible, and a different logic has to be used to identify all possible null markers.

Removing these null values, we lose 348 rows which account for only ~3% of the initial dataset.
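A sketch of this step; the exact null markers and column lists are illustrative:

```python
# Null markers found by manual inspection (illustrative)
categorical_nulls = ["NONE", " " * 9]

# Drop rows where any categorical feature holds one of these markers
null_mask = df[categorical_features].isin(categorical_nulls).any(axis=1)
df = df[~null_mask]
```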

  • Removing Null values from continuous variables

We remove all rows where any of the continuous features is 0.0.

This might not always be recommended because 0.0 does not necessarily mean a null value. It had to be clarified by the domain expert, who confirmed that in this dataset, for these columns, 0.0 is indeed a null value.

Removing these null values, we lose 1316 rows which account for ~13% of the initial dataset.
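The corresponding sketch for the continuous features:

```python
# Per the domain expert, 0.0 encodes a missing value in these columns
zero_mask = (df[continuous_features] == 0.0).any(axis=1)
df = df[~zero_mask]
```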

  • Removing Null values from target variables.

Values such as 'Machine not found', 'Machine found, no details', and '-' were identified as null markers for the target variable. For supervised learning, a class label is necessary, and hence only rows with relevant labels are kept. If this exercise had aimed only to visualize the distribution of features or to find clusters with unsupervised learning techniques (K-Means, DBSCAN), we could have retained rows without a class label/target variable.

Removing these null values, we lose 5813 rows, accounting for ~57% of the initial dataset.

This is a big loss of data. The client is asked to improve data collection standards and to explore populating these values from other columns that might not have been considered.
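A sketch of the label cleaning; the exact marker strings (and how they split into separate values) are assumptions based on the list above:

```python
# Null markers for the target column (assumed split of the markers listed above)
target_nulls = ["Machine not found", "Machine found, no details", "-"]

# Supervised learning needs a label, so keep only rows with a valid '# of stages'
df = df[~df["# of stages"].isin(target_nulls)]
```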

3. Data Visualization

We have used the seaborn library for simple visualizations and matplotlib for 3-D plots. Categorical data is visualized as count plots (bar graphs), whereas continuous features are visualized as distribution plots.
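A sketch of the two plot types, reusing the hypothetical column names introduced earlier:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Count plot for the target labels (categorical)
sns.countplot(x="# of stages", data=df)
plt.show()

# Distribution plot for one continuous feature
sns.histplot(df["cont_feature_1"], kde=True)
plt.show()
```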

  • Target Labels

There is a class imbalance problem in the target labels. Only 19 examples exist for class label '2', while 1312 exist for '4'. This indicates that the data must be stratified during the train-test split to maintain these proportions.

Here, we need to ask the client if this distribution is a real representation of what happens in their factories, or if it is only due to the subset of data they have provided.

  • Continuous Features

Visualization of the numerical data shows long-tail distributions, which is quite common.

  • Categorical Features
X-axis labels have been disabled intentionally

As predicted, the categorical features also follow a long-tail distribution. This means that we have a lot of data for a few frequently produced parts, as shown in the plot, and very little data for many other parts because they are not produced as often.

At this stage, we can ask the client to subset the data and focus the exercise on only the top 10% most produced parts, as these bring maximum value. At the same time, our algorithms will have more confidence/support for these parts, and we can be more certain about our evaluations.

  • Correlation plots

Identifying a strong positive or negative correlation between pairs of numerical values gives us an insight into how one feature is dependent on another. A diagonal line (at 45°) shows a strong positive correlation.

Highly correlated features show that the dimensionality of the data can be further reduced using techniques such as PCA to more easily visualize the feature space. High correlation points towards redundancy in the data. Correlation analysis also gives us information about which algorithms should be used for data modeling, as some of them are independent of correlation, whereas others might perform better when features are uncorrelated.

As with the 2D correlation plots, we look at 3D scatter plots of continuous features to identify clusters with the same class labels.

Here, points concentrated around a diagonal plane indicate a high positive correlation.
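A sketch of both plot types, under the same placeholder column-name assumptions:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlation heatmap of the continuous features
sns.heatmap(df[continuous_features].corr(), annot=True, cmap="coolwarm")
plt.show()

# 3D scatter of three continuous features, colored by class label
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(df["cont_feature_1"], df["cont_feature_2"], df["cont_feature_3"],
           c=df["# of stages"].astype(int), cmap="viridis")
ax.set_xlabel("cont_feature_1")
ax.set_ylabel("cont_feature_2")
ax.set_zlabel("cont_feature_3")
plt.show()
```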

4. Data Modeling

Tree-based models have proven to be strong for classification and regression tasks.

Model selection is a comprehensive process that considers the use-case, client expectations, type and amount of data, and current state-of-the-art.

One good practice, given enough time and resources, is an iterative-improvement approach. Start with an explainable baseline model as a benchmark and make iterative improvements, either by tuning the model or by switching to a better one. The key is to make sure that every new model or approach beats the benchmark accuracy. Whenever a 'better' model yields lower accuracy, try to find out which aspects of the algorithm fail to work with the data.

We first used Decision Trees for this exercise, then upgraded to Random Forest with hyperparameter optimization and XGBoost. This post covers the Random Forest algorithm.

Categorical features cannot be directly fed into a Random Forest algorithm. They have to be encoded using a technique such as one-hot encoding.
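A minimal sketch of the encoding step, assuming the feature lists defined earlier:

```python
# One-hot encode the categorical features; continuous features pass through unchanged
X = pd.get_dummies(df[continuous_features + categorical_features],
                   columns=categorical_features)
y = df["# of stages"]
```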

We use a train-test split with a test size of 0.2 and pass the label array to the stratify parameter so that the distribution of class labels is the same in the training and test subsets.

The stratification is essential to maintain the distribution of different classes from the dataset. If the data is not stratified in this exercise, the under-represented classes such as '2' could be missed entirely in the training or testing split. In such a case, the classifier trains on very few samples of a given class and does not learn how to classify such labels correctly.
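A minimal sketch of the stratified split, assuming the feature matrix X and label vector y defined above:

```python
from sklearn.model_selection import train_test_split

# 80/20 split, stratified on the labels so class proportions match in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```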

Note that the three graphs look the same except for the scaling on the Y-axis. This shows that the distributions are preserved across y, y_train, and y_test.

We use the RandomForestClassifier module from sklearn for this exercise.

At first, the Random Forest parameters can be left at their defaults or set roughly, guided by some domain experience. We will tune them later using hyperparameter optimization. A detailed explanation of each parameter is out of this post's scope and might be covered in a follow-up post.

We calculate model accuracy and cross-validation score with 10 splits.
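A sketch of the baseline model; the parameter values shown are defaults, not the ones used in the original exercise:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Baseline model with (near-)default parameters
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print("Test accuracy:", rf.score(X_test, y_test))
print("Mean 10-fold CV accuracy:", cross_val_score(rf, X, y, cv=10).mean())
```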

This accuracy is to be used as a benchmark, and every improvement to the model should aim to beat it. One choice at this stage would be to move to a more robust model such as XGBoost, but we choose to further improve the RandomForestClassifier to its best possible accuracy.

We must now find the correct hyperparameters to be fed to this model using Hyper Parameter Optimization.

5. Hyper Parameter Optimization

This step is helpful when you want to squeeze out the best performance from your model.

We use hyperparameter optimization to determine the best parameters for the RandomForestClassifier. We define a dict that acts as the search grid for the RandomizedSearchCV module from sklearn.
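A sketch of such a search; the parameter ranges and number of iterations are illustrative choices, not the values used in the original exercise:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# Search grid (ranges are illustrative)
param_grid = {
    "n_estimators": [int(x) for x in np.arange(200, 2200, 100)],
    "max_depth": [None, 10, 20, 40, 80],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=50, cv=5, n_jobs=-1, random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)
```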

This process provides the best hyperparameters for the model given the grid. We can now narrow down our grid and rerun hyperparameter optimization for a few iterations.

We see that the search selected a value of 1000 for the n_estimators parameter, using a step size of 100. Hence, we should rerun the process with a more precise grid (smaller range and smaller step size), for example the one below. The same must be done for the other hyperparameters too.

n_estimators = [int(x) for x in np.arange(800, 1400, 50)]

Grid search is an expensive process and one must balance the trade-off between the resources spent and accuracy gained.

Once we reach an optimal accuracy given the resources and time spent, we use these hyperparameters to retrain our RandomForestClassifier.

Here, we have taken the final parameters after a few iterations of hyperparameter optimization using the grid search technique.
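A sketch of the retraining step, reusing the best parameters found by the search object above:

```python
# Retrain with the best hyperparameters found by the search
best_rf = RandomForestClassifier(**search.best_params_, random_state=42)
best_rf.fit(X_train, y_train)
print("Tuned test accuracy:", best_rf.score(X_test, y_test))
```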

You can see the improvement in the accuracy.

Instead of grid search, which requires extra effort, one can use Bayesian hyperparameter optimization techniques, which converge to good hyperparameters more smoothly and without manual re-runs.

6. Model Evaluation

Accuracy is not enough for multi-class classification problems, especially when the classes are imbalanced.

We, therefore, use precision and recall as metrics for our evaluation.

  • Precision is the fraction of instances predicted as a given class that actually belong to that class (relevant instances among all retrieved instances).
  • Recall, sometimes referred to as 'sensitivity', is the fraction of instances of a class that are correctly identified (relevant instances that are retrieved).

A perfect classifier has both precision and recall equal to 1.

F1-Score is the harmonic mean of precision and recall and is a single metric used to evaluate the classifier.

To better understand these metrics, we visualize the outputs using a confusion matrix, with predicted labels on the x-axis and true labels on the y-axis.
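A sketch of the evaluation, assuming the tuned model from the previous section and a recent scikit-learn version:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

y_pred = best_rf.predict(X_test)

# Per-class precision, recall, F1-score and support
print(classification_report(y_test, y_pred))

# Confusion matrix: predicted labels on the x-axis, true labels on the y-axis
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()
```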

In this confusion matrix, we want all the numbers to be placed along the diagonal, which gives us precision = recall = F1 score = 1.

We notice decent scores for the classes with high support ('4', '5'). Class '2' shows perfect precision and recall, but only 4 examples exist in the test split, so we cannot have very high confidence here.

The higher the support, the more certain we can be that our algorithm will generalize in the future.

Class '3' shows the lowest precision and recall. These results must be discussed with the client to get to the bottom of why the model cannot classify this class. Maybe the features do not adequately represent the process at the factory.

It might also be an error in the raw data: the client might be using 3-stage machines to produce 2-stage parts when 2-stage machines were not available. In such a case, the labels/data need to be fixed, or we need to enter the domain of multi-label classification.

Insights

Tree-based models such as Random Forest report feature importance, which is very useful in industrial applications. The most important features in the data can be communicated directly to the client. This opens a dialogue for better understanding the data and validating the results with the client.

Relative Feature Importance determined by the model for the given data

We can now directly ask the client if their engineers take these features into account and prioritize them when designing the production process. This brings insights to both the client as well as the data scientist.
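A sketch of how these importances can be extracted from the tuned model defined above:

```python
# Relative feature importances from the trained random forest
importances = pd.Series(best_rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```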

Tree-based methods are a good choice because of their high level of explainability: every decision tree in a random forest can be displayed as an image. Neural networks do not offer this level of explainability and also require much larger datasets.

Conclusion

We tried to establish a proof of concept to see whether a small part of a large manufacturing pipeline can be automated using machine learning. As mentioned, this use-case can be extended and turned into a production-grade system that could support production planning by automating the design of the manufacturing process.

Without any advanced feature engineering, we achieved a reasonable model accuracy. Notice that focusing on the classes (in this example, numbers of steps) with a higher number of occurrences and setting aside the low-count classes would not only help improve the accuracy of the model but would also have higher business value. The reason for the latter is the possibility of automating the process for "mainstream" product specifications with a very reliable model, while letting human engineers focus on rare or exotic customer requests.

This article was written for Sclable's blog on Medium.
If you liked it, give it a 👏 and share if you ❤️
