Applying Data Science in Manufacturing: Part II — Batch Process Methodology and Lessons Learned

Ryan Monson
Published in The Startup
9 min read · Jun 17, 2020

In Part I of this series (https://medium.com/@ryandmonson/applying-data-science-in-manufacturing-part-i-background-and-introduction-ccb15743e001) I hypothesized that Machine Learning modeling and subsequent control of process parameters could help reduce variation in Manufacturing.

In this post I’ll go through the steps of creating a predictive model for alloy grade from a metal alloy manufacturing dataset. The training and testing datasets are on Kaggle at https://www.kaggle.com/esotericazzo/metal-furnace-dataset. All coding is in Python.

READ AND SUMMARIZE DATASETS

Code for importing Numpy, Pandas and OS:

import numpy as np
import pandas as pd

# Input data files are available in the read-only "../input/" directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Code to read in the datasets as Pandas dataframes and summarize:

#Read data into Pandas dataframes, describe

df_test = pd.read_csv("/kaggle/input/metal-furnace-dataset/Test.csv")
df_train = pd.read_csv("/kaggle/input/metal-furnace-dataset/Train.csv")
pd.set_option('display.float_format', lambda x: '%.3e' % x)

print("train shape:",df_train.shape, "test shape:", df_test.shape)
df_train.describe()

From the printed output we learn:

  • The df_train shape is (620, 29); the df_test shape is (266, 28).
  • All column values appear to have been standardized to a z/t statistic.
    - It is unknown whether the train and test datasets were standardized independently to avoid data leakage (see the sketch after this list).
  • Column f9 is all zeros.
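
For reference, a leakage-free standardization fits the scaler on the training data only, then reuses those statistics on the test data. A minimal sketch (assuming x_params holds the feature column names, as defined with the plotting code further below):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit the scaler on the training features only ...
X_train_scaled = scaler.fit_transform(df_train[x_params])
# ... then reuse the training means/variances on the test features
X_test_scaled = scaler.transform(df_test[x_params])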

Histograms of data in the dataframe columns are available on Kaggle. From these we learn:

  • f0 and f22-f24 look like continuous variables; the other x variable columns appear categorical.
  • The y variable column “grade” takes on integer values 0–4.
  • There are no missing values.

From this point forward x variables will be referred to as ‘features’, the y variable as ‘target’.

In order to better understand the feature data, the following code plots the number of unique values per column (x_params, the list of feature column names, is defined at the top):

# plot out the number of unique values in each column
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='ticks')

# feature column names: every column except the target 'grade'
x_params = [col for col in df_train.columns if col != 'grade']

unique_values = {}
for x in x_params:
    unique_values[x] = df_train[x].value_counts().size

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111)
height = list(unique_values.values())
x = np.arange(len(x_params))
y = range(0, 80, 5)
plt.bar(x=x, height=height)
tick_positions = range(0, len(unique_values))
ax.set_xticks(tick_positions)
ax.set_xticklabels(list(unique_values.keys()), rotation=90, size=12)
ax.set_xlabel("Parameter", c='y', size=18)
ax.set_yticks(y)
ax.set_yticklabels(y)
ax.tick_params(axis='both', colors='black')
ax.grid(axis='y')
ax.set_ylabel("# of Unique Values", c='y', size=18)
plt.show()
Figure: # of Unique Values in each Column of the Train DataFrame

From this plot we learn:

  • Parameters f22-f24 have over 20 unique values.
  • Parameters f0-f2 and f5 have 5 to 10 unique values.
  • All other parameters (21 total) have < 5 unique values.

The features f0-f27 are therefore a mixture of continuous and numeric categorical variables.

Since the data is numeric, a regression model for alloy grade prediction makes sense. A classification model also makes sense, given that so many variables are categorical. Regression is understood and used regularly by engineers in manufacturing industries; classification much less so.

Modeling will begin with regression. Before building the model, however, the feature data will be checked for multicollinearity, and a function will be created to bin model output for actual vs. predicted comparison.

LINEAR REGRESSION — MULTICOLLINEARITY

The Variance Inflation Factor (VIF) measures how strongly each feature is linearly related to the remaining features; strong relationships (multicollinearity) can destabilize a regression model’s coefficients. The following code determines the VIF for each feature:

# identify multicollinearity between independent variables with VIF
from sklearn.linear_model import LinearRegression

def vif(x_names, data):
    '''VIF = 1/(1 - r2), where r2 is between an x as the dependent variable
    and all the remaining x's as independent variables'''
    vif_dict = {}
    for name in x_names:
        not_x = [i for i in x_names if i != name]
        X, y = data[not_x], data[name]
        # get r-squared
        r_squared = LinearRegression().fit(X, y).score(X, y)
        # get VIF
        vif = 1/(1 - r_squared)
        # write to dict
        vif_dict[name] = vif
    return vif_dict

x_vifs = vif(x_params, df_train)
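
A quick way to inspect the result (not shown in the original notebook) is to print the features sorted from highest to lowest VIF:

for name, v in sorted(x_vifs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {v:.2f}")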

f3 has the largest VIF at 5, next is f4 at 4; the rest fall between 1 and 3. The literature offers varying views on what VIF magnitude is a concern for modeling. I used the rule of thumb that VIFs < 5 are not a concern, so the analysis will proceed assuming no multicollinearity.

LINEAR REGRESSION — BINNING

The linear regression model will produce non-integer predictions for alloy grade (for example 1.57, 2.83, etc.). I chose to convert these predictions into integers by rounding. This mirrors what the inspectors do when they determine the grade: they force it to be 0, 1, 2, 3 or 4.

The following function bins the predictions:

def binning(predictions):
    for n, p in enumerate(predictions):
        if p <= 0.5:
            predictions[n] = 0
        elif p <= 1.5:
            predictions[n] = 1
        elif p <= 2.5:
            predictions[n] = 2
        elif p <= 3.5:
            predictions[n] = 3
        else:
            predictions[n] = 4
    return predictions
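
An equivalent vectorized version for NumPy arrays (a sketch, not from the original notebook; np.ceil(p - 0.5) reproduces the loop’s half-open bin edges exactly):

import numpy as np

def binning_vectorized(predictions):
    # ceil(p - 0.5) maps each interval (k - 0.5, k + 0.5] to grade k,
    # matching the loop above; clip keeps results in the valid grades 0-4
    return np.clip(np.ceil(np.asarray(predictions) - 0.5), 0, 4).astype(int)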

REGRESSION MODEL A — FEATURE SELECTION

For regression Model A’s feature selection I leaned on my Data Science coursework training: choose features whose Pearson correlation with the target variable exceeds 0.2 in magnitude. The following code produced a ranked set of features with correlation coefficients > 0.2:

#correlations between x variables and grade w/0.2 as a cutoff

train_corr = df_train[x_params].corrwith(df_train['grade'])
sorted_corrs = abs(train_corr).sort_values()
strong_corrs = sorted_corrs[sorted_corrs > .2]

REGRESSION MODEL A — MODEL FIT, PREDICTION & ACCURACY

The following code fit Model A using the strong_corrs features, made grade predictions, binned the predictions and determined the prediction accuracy:

# fit strongly correlated x's, train dataset

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
X = df_train[strong_corrs.index]
y = df_train['grade']
lr.fit(X, y)
train_pred = lr.predict(X)
train_pred_bin = binning(train_pred)

# fraction of LR model grade predictions that were correct

from sklearn.metrics import accuracy_score

acc_perc = accuracy_score(df_train['grade'], train_pred_bin, normalize=True)

Results of model A:

  • training dataset model accuracy was 77%
  • model contained 8 features

It was decided to try a different approach to creating a regression model. Instead of choosing features based on correlation coefficients, why not let a machine learning algorithm choose them (and do the modeling along the way)? Elastic Net Cross Validation (ENCV) seemed like a reasonable way to accomplish this. ENCV combines the regularization penalties of Ridge (l2, sum of squared coefficients) and Lasso (l1, sum of absolute coefficients) regression in addressing the bias/variance tradeoff of building a model.
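
For reference, the objective Scikit-Learn’s ElasticNet minimizes (per its documentation) is:

1/(2n) · ||y − Xw||² + α · l1_ratio · ||w||₁ + 0.5 · α · (1 − l1_ratio) · ||w||₂²

where n is the number of samples and w the coefficient vector; l1_ratio = 1 is pure Lasso and l1_ratio = 0 is pure Ridge.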

REGRESSION MODEL B — ENCV MODEL FIT, PREDICTION & ACCURACY

The following code was used to develop a model based on the ENCV algorithm in Scikit-Learn:

# fit Elastic Net model to training data using different l1_ratio values.
# The ElasticNetCV algorithm will choose the best fitting ratio from the list

from sklearn.linear_model import ElasticNetCV

l1 = [.1, .5, .7, .9, .95, .99, 1]  # candidate l1_ratio values (L1/L2 penalty mix)
X = df_train[x_params]
y = df_train['grade']

lren = ElasticNetCV(l1_ratio=l1)
lren.fit(X, y)
lren_predict = lren.predict(X)
bin_lren_predict = binning(lren_predict)
acc_lren = accuracy_score(y, bin_lren_predict, normalize=True)

Results of model B:

  • Optimal l1_ratio was 0.7
  • training dataset model accuracy was 79%
  • model contained 23 features

Regression model B contained almost 3x as many features as model A, yet did not achieve a significant increase in accuracy. It’s possible, however, that it could show a meaningful improvement in accuracy over model A on new data.

The 23 model B coefficients were sorted by magnitude and compared to the 8 model A features sorted by correlation magnitude. Three features were at the top of both sorts: f2, f14 and f18. The next and final regression model will use only these 3 features.
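
The post doesn’t show that comparison; a minimal sketch of how it could be done (assuming lr and lren are the fitted models from above):

import pandas as pd

# model B coefficients ranked by magnitude
coefs_b = pd.Series(lren.coef_, index=x_params).abs().sort_values(ascending=False)
# model A features ranked by correlation magnitude
corrs_a = strong_corrs.sort_values(ascending=False)

print(coefs_b.head(8))
print(corrs_a)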

REGRESSION MODEL C — 3-FEATURE MODEL FIT, PREDICTION & ACCURACY

The following code was used to fit a regression model to features f2, f14 and f18:

#linear regression Model C

lr_r1 = LinearRegression()
Xr1 = df_train[['f2','f14','f18']]
lr_r1.fit(Xr1,y)
lr_r1_pred = lr_r1.predict(Xr1)
bin_lr_r1_pred = binning(lr_r1_pred)
acc_lr_r1 = accuracy_score(df_train['grade'], bin_lr_r1_pred, normalize=True)

Results of model C:

  • training dataset model accuracy was 77%
  • model contained 3 features

The following table summarizes the Regression models:

+-------+---------------+---------------------------+
| Model | # of features | train prediction accuracy |
+-------+---------------+---------------------------+
|   A   |       8       |            77%            |
|   B   |      23       |            79%            |
|   C   |       3       |            77%            |
+-------+---------------+---------------------------+

The train and test datasets have been structured like a Kaggle competition, i.e. there is no target column in the test dataset. Therefore, the test dataset could not be used to evaluate model accuracy on new data.

The following code was used to make predictions of the target variable from the test dataset, then write those predictions to a .csv file for all 3 regression models. The dataset owner can evaluate these models against the test dataset target actuals.

#Fit test data to all 3 LR models

X_test0 = df_test[strong_corrs.index]
test_pred0 = lr.predict(X_test0)
test_pred0_bin = binning(test_pred0) #predict using Model A

X_testlren = df_test[x_params]
test_lren = lren.predict(X_testlren)
test_lren_bin = binning(test_lren) #predict using Model B

X_test1= df_test[['f2','f14','f18']]
test_pred1 = lr_r1.predict(X_test1)
test_pred1_bin = binning(test_pred1) #predict using Model C

# Write df_test predicted values from the 3 regression predictions to csv files

# Model A
np.savetxt("lr0_test_pred.csv", test_pred0_bin, delimiter=",")
# Model B
np.savetxt("lren_test_pred.csv", test_lren_bin, delimiter=",")
# Model C
np.savetxt("lr1_test_pred.csv", test_pred1_bin, delimiter=",")

As mentioned earlier in this post, classification is another methodology for prediction. Random Forest classification, where the results of many Decision Trees of varying “sizes” are averaged to avoid overfitting, will be used next to model the data.

RANDOM FOREST MODEL — FEATURE SELECTION, FIT and PREDICTION

Features for the Random Forest were selected using the Recursive Feature Elimination Cross Validation (RFECV) method in Scikit-Learn. RFECV takes a model (in this case Random Forest) and recursively eliminates features until it finds the subset with the maximum average cross-validation accuracy score.

The following code was used to execute the RFECV method on the train dataset features:

from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

def feature_selection(df):
    all_X = x_params
    all_y = df['grade']
    rfc = RandomForestClassifier(n_estimators=100, random_state=1)
    grid = RFECV(rfc, cv=5)  # 5 cross-validation folds
    grid.fit(X=df[all_X], y=all_y)
    num_features = grid.n_features_
    rank = grid.ranking_
    best_columns = df[all_X].columns[grid.support_]
    return all_X, all_y, num_features, rank, best_columns

all_X, all_y, num_features, rank, best_columns = feature_selection(df_train)

The number of feature columns chosen by the algorithm was 13. Model fitting and prediction were performed with the following code:

# fit RF classifier using the selected features, make grade predictions,
# determine accuracy score

def my_RFC(df, Xcols, ycol):
    rfc = RandomForestClassifier(n_estimators=100, random_state=1)
    X = df[Xcols]
    y = df[ycol]
    rfc.fit(X, y)
    rfc_prediction = rfc.predict(X)
    rfc_acc = rfc.score(X, y)
    return rfc_prediction, rfc_acc

rfc_prediction, rfc_acc = my_RFC(df_train, best_columns, 'grade')

This model produced a training dataset accuracy of over 99%. For the training dataset this is a significant improvement over regression, but the model may be overfitting, and its use by engineers and operators is not as straightforward as with regression. For these results to be used effectively, the relationship between the features and the target has to be clearly laid out for the operators.
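One way to check for that overfitting (a sketch, not part of the original analysis) is to score the same model with cross-validation, so each prediction is made on data the trees never saw during fitting:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# out-of-sample accuracy is a fairer estimate than training accuracy
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=1),
    df_train[best_columns], df_train['grade'], cv=5)
print("mean CV accuracy:", cv_scores.mean())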

Two more classification models were created from subsets of the features but will not be discussed here. The predictions from the 3 classification models on the test data were written to .csv files for use by the dataset owner.

The entire analysis can be found in a Kaggle notebook at https://www.kaggle.com/ryandmonson/alloy-manufacturing

POST MORTEM — THOUGHTS, LESSONS LEARNED

  • When the shape of the dataframes was determined, my focus was on the difference in the number of rows (the test/train split) and not the difference in columns. I missed that the test set had no target column, and never considered setting aside a portion of the train dataset for validation (see the sketch after this list). That would have indicated which model could be the most accurate on new data. Even with the risk of data leakage and the shrinkage of an already small train dataset, validation would have been useful.
  • From my experience, the 0.2 cutoff for the correlation coefficient would raise skepticism among the engineers; they generally will not consider variables with coefficients that low to be correlated. Process engineers are trained on, and work with, small datasets, while the Data Science/Machine Learning world uses large ones, and inference and uncertainty differ between the two. Educating them on this difference before presenting model results would maximize the probability of their buy-in.
  • The accuracy differences between the regression models were insignificant. The 3-feature model has less risk of overfitting (and is therefore more useful on new data) than the 23-feature model.
  • The classification model was far more accurate than the regressions and may be the best model for prediction. If the classification results were to be used in Operations to adjust features, there would need to be a lookup table or some other mechanism to clearly establish the f0-f27 vs. grade relationship for the operators.
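
A minimal sketch of the hold-out validation suggested above (hypothetical; it was not part of the original analysis):

from sklearn.model_selection import train_test_split

# reserve 20% of Train.csv as a validation set, stratified by grade
X_tr, X_val, y_tr, y_val = train_test_split(
    df_train[x_params], df_train['grade'],
    test_size=0.2, random_state=1, stratify=df_train['grade'])

# fit any of the models on (X_tr, y_tr), then compare accuracy on (X_val, y_val)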

In Part III we’ll discuss modeling of a continuous Manufacturing process.
