Applying Data Science in Manufacturing: Part III — Continuous Process: Methodology and Lessons Learned

In Part II of this series (https://medium.com/@ryandmonson/applying-data-science-in-manufacturing-part-ii-batch-process-methodology-and-lessons-learned-d18d360d8953), several models for predicting alloy grade from a batch manufacturing process were created. Classification modeling was far more accurate than regression modeling in predicting the target variable. The article ended with a post mortem outlining lessons learned and thoughts on making the modeling results useful to the batch process manufacturing operation.

In Part III a continuous manufacturing process will be analyzed. This process differs from the batch process as follows:

  • Processing is sequential, with time lags between process steps

The dataset for this continuous manufacturing process is found on Kaggle at https://www.kaggle.com/supergus/multistage-continuousflow-manufacturing-process. After reading the dataset notes I decided to limit the analysis to Raw Material introduction through Stage 1 Combiner measurements. Data columns for downstream process steps will be dropped from the dataset. The following process flow will be analyzed:

Figure 1 — Parallel and Sequential Steps for the Part III Continuous Process

For this analysis x variables will be referred to as ‘features’, y variables as ‘targets’. All coding is in Python.

READ AND SUMMARIZE DATASETS

The code for importing Numpy, Pandas and OS is the same as in Part II, the only change being the path to the dataset (/kaggle/input/multistage-continuousflow-manufacturing-process/continuous_factory_process.csv). Checking the data types revealed that the date time stamp was an object datatype. It was converted to the datetime data type.
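As a minimal sketch of those two steps (the path comes from the Kaggle dataset; 'time_stamp' is the column name used later in this article):

import pandas as pd

#read the dataset
path = ('/kaggle/input/multistage-continuousflow-manufacturing-process/'
        'continuous_factory_process.csv')
df = pd.read_csv(path)

#the time stamp reads in as object dtype; convert to datetime
df['time_stamp'] = pd.to_datetime(df['time_stamp'])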

Histograms of data in the dataframe columns are available on Kaggle. From these we learn:

  • Besides the date time column, all other data types are numeric

Data columns from processes downstream of the Stage 1 Combiner measurements were dropped. Remaining columns were renamed as follows:

#replace these column title words...
col_list_old = ["Actual", "AmbientConditions.", "AmbientHumidity",
                "AmbientTemperature", "Machine", "RawMaterial",
                "Property", "RawMaterialFeederParameter", "Zone",
                "Temperature", "MotorAmperage", "MaterialPressure",
                "ExitZoneTemperature",
                'FirstStage.CombinerOperation.Temperature',
                'Stage1.Output.Measurement', 'Stage2.Output.Measurement']
#...with these
col_list_new = ["Act", "Amb", "_Hum", "_Temp", "Mach", "RM", "Prop",
                "RM_Feed_Param", "Zone", "Temp", "Amps", "Mat_Press",
                "Exit_Zone_Temp", "Stage1_Temp",
                'Stage1_Measure', 'Stage2_Measure']

#function to modify column names
def clean_column_names(df):
    df.columns = df.columns.str.strip()  # strip whitespace

    # zip the lists as tuples, then into a dictionary
    col_dict = dict(zip(col_list_old, col_list_new))

    # items - list of tuple key/value pairs (old, new)
    for i, j in col_dict.items():
        new_columns = [column.replace(i, j) for column in df.columns]  # replace old with new
        df.columns = new_columns
    return df

df = clean_column_names(df)

As noted earlier, many of the temperature features have little variation (a range of 2 degrees or less). Regardless of whether the scale is Centigrade or Fahrenheit, this range is within the measurement error of the thermocouples, thermistors or infrared devices used in Manufacturing to measure temperature. Looking at the column min and max values indicates that for many columns the temperature variability is just measurement error¹. These columns will not be beneficial for predicting target values and making process adjustments (they are effectively constants), so they will be dropped from the analysis.

The Kaggle histograms indicate other features have extremely low coefficients of variation. These columns will also not be beneficial for predicting target values (too little variation) and will also be dropped.
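I did this by inspection, but here is a sketch of how such low-variation columns might be flagged programmatically (the 2-degree range cutoff and 1% coefficient-of-variation threshold are my assumptions, not values from the dataset notes):

#flag near-constant columns: a range within measurement error,
#or a tiny coefficient of variation (both cutoffs are assumptions)
num_cols = df.select_dtypes(include='number').columns
col_range = df[num_cols].max() - df[num_cols].min()
col_cv = df[num_cols].std() / df[num_cols].mean().abs()
mask = (col_range <= 2) | (col_cv < 0.01)
print(mask[mask].index.to_list())  # candidate columns to drop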

One could argue this is a poor approach statistically: normalization or standardization places all the features on a “level playing field” as to magnitude and variation. That is true, and a dropped feature may have predictive power in the model. But will the operators be able to make the tiny adjustments suggested by the model once the parameters are “unstandardized” and made operational? Will these tiny tweaks be outside the magnitude of measurement error? It may be that tiny tweaks to certain features make a difference in output, but I’m going to move forward with dropping the following columns:

  • Amb_Temp.U.Act

After dropping the columns, Python Lists were created of features and targets:

  • Feature: Raw material columns (rm_cols)

These Lists will be useful for the upcoming analysis.
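The exact list definitions aren’t shown here, but a minimal sketch of how they might be built from the renamed column prefixes (the match patterns are illustrative; the list names are the ones used in the rest of this article):

#build feature/target lists from the renamed columns
#(match patterns are illustrative)
rm_cols = [c for c in df.columns if 'RM' in c]             # raw material features
fspp_cols = [c for c in df.columns if 'Stage1_Temp' in c]  # 1st stage combiner features
meas_cols = [c for c in df.columns
             if 'Stage1_Measure' in c and c.endswith('Act')]  # measurement actuals (targets)
meas_cols_sp = [c for c in df.columns
                if 'Stage1_Measure' in c and c.endswith('Setpoint')]  # measurement setpoints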

PREPROCESSING

E-mail exchange with the dataset owner indicates this is a polymer extrusion process. There are time lags in the process: material entering Machine 1,2 or 3 → material exiting Machine 1,2,3 → material entering Stage 1 Combiner → material exiting Stage 1 Combiner → material being measured.

Consider a 1 inch section of material entering Machine 1. That 1 inch section experiences lags in absolute time as it moves through the process. The structure of the DataFrame does not account for these lags: the date time stamp indexes a machine feature, 1st Stage Combiner feature or measurement target at a point in absolute time, not at the point the 1 inch section of material experienced it. The measurements (target actuals) are for a section of material that experienced certain temperatures, amps, pressures, etc. We want a model that predicts the material’s target values based on the feature values it experienced.

This may be a moot point if the features/targets change slowly relative to their sampling rate. Additionally the dataset owner was unable/unwilling to provide linespeed or equipment physical dimensions for this process.

It was decided to proceed with creating new datasets that introduced a time lag. The following visualizes the process with the time lag and the features:

Figure 2 — Process Flow Showing Retained Features and Time Lag

As Figure 2 shows, the time shift between process steps/machines will be addressed, not the time shift within the machines. The time shifts are assumed to be uniform. Dataframes with 3 and 5 second shifts will be created.

For me this analysis has already involved a fair amount of “hand waving” (when my wife asks about this article I respond by mimicking the penguins from the movie “Madagascar”: ‘You didn’t see anything’). Columns have been dropped because I don’t think small process tweaks matter or are doable (I don’t have access to the engineers/operators, so I don’t know if that’s true). I’m putting time lags into the data columns when I don’t know what those lags are. I do have experience working around polymer extruders, but the 3 and 5 second time lags are guesses. Also, at some point I still have to deal with the negative and zero values in the measurement columns (both the actuals and setpoints). Given the dataset notes indicate all measurements are in mm, the negative and zero measurement values make no sense.

If this analysis were being done “for real”, I would not have made these assumptions/guesses. I would have gotten answers from the operators/engineers before moving forward with the analysis.

Because this article is intended to be instructive, I’ll continue and complete an analysis of this data, culminating in a model for predicting a target from the features.

The following code was used to create new dataframes reflecting the 3 and 5 second time lags:

#create 3 sec shift dataframe
df_init3 = df_init.copy()
df_init3[fspp_cols] = df_init3[fspp_cols].shift(-3, axis=0,
                                                fill_value="shifted")
df_init3[meas_cols] = df_init3[meas_cols].shift(-6, axis=0,
                                                fill_value="shifted")
df_init3[meas_cols_sp] = df_init3[meas_cols_sp].shift(-6, axis=0,
                                                      fill_value="shifted")

#shifted columns dtype changed to 'object'. Change back to float
cols = df_init3.select_dtypes(exclude=['float64', 'int64']).columns.to_list()
cols.remove('time_stamp')
df_init3[cols] = df_init3[cols].apply(pd.to_numeric,
                                      downcast='float', errors='coerce')

#repeat for 5 sec shift dataframe
df_init5 = df_init.copy()
df_init5[fspp_cols] = df_init5[fspp_cols].shift(-5, axis=0)
df_init5[meas_cols] = df_init5[meas_cols].shift(-10, axis=0)
df_init5[meas_cols_sp] = df_init5[meas_cols_sp].shift(-10, axis=0)
cols = df_init5.select_dtypes(exclude=['float64', 'int64']).columns.to_list()
cols.remove('time_stamp')
df_init5[cols] = df_init5[cols].apply(pd.to_numeric,
                                      downcast='float', errors='coerce')

The following code was used to check that the time lag shift is correct for the 3 second time lag df:

#check the shift on df_init3: 1st stage combiner temps shift by 3s,
#1st stage measurements shift by 6s
df_test = pd.concat(objs=[df_init.loc[3, :],
                          df_init3.loc[0, :],
                          df_init.loc[14087, :],
                          df_init3.loc[14081, :]],
                    axis=1, join='outer', ignore_index=True,
                    keys=None, levels=None, names=None,
                    verify_integrity=False, copy=True)
df_test.columns = ['df_init', 'df_init3(3)',
                   'df_init', 'df_init3(6)']
df_test3 = df_test.loc[fspp_cols, :]
df_test6 = df_test.loc[meas_cols, :]
print(df_test3)
print(df_test6)
#checked

Based on these results, I’ll assume the 5 sec time lag dataframe is also correct. An analogous spot-check is sketched below.
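A minimal sketch of that check (the row indices are illustrative, offset by 5 and 10 rows to match the shifts):

#spot-check df_init5: combiner temps shift by 5s,
#measurements by 10s (row indices are illustrative)
df_test5 = pd.concat([df_init.loc[5, :], df_init5.loc[0, :],
                      df_init.loc[14087, :], df_init5.loc[14077, :]],
                     axis=1)
df_test5.columns = ['df_init', 'df_init5(5)', 'df_init', 'df_init5(10)']
print(df_test5.loc[fspp_cols, :])
print(df_test5.loc[meas_cols, :])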

FURTHER DATA CLEANING

As previously indicated there are negative and zero values in the measurement actuals and setpoints. Working with the 3 sec time lag dataframe, count the number of zeros in each setpoint column:

#count the number of values == 0 in each setpoint column
sp_dict_0 = {}
for i in meas_cols_sp:
    value = (df_init3[i] == 0).sum()
    sp_dict_0[i] = value

print(sp_dict_0)

Each measurement setpoint column contains 56 zeros. Are they from the same rows? The dataset owner indicated that rows with all zeros in the setpoints occur when the measurement system “drops out”. Code could be written to verify the setpoint zeros all come from the same rows (a minimal check is sketched below), but given the dataset owner’s feedback, I’ll just drop these rows.
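A minimal verification sketch (not in the notebook):

#verify the setpoint zeros co-occur: if every row containing
#any zero contains all zeros, these two counts will match
zero_mask = (df_init3[meas_cols_sp] == 0)
print(zero_mask.all(axis=1).sum(), zero_mask.any(axis=1).sum())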

#drop all-zero setpoint rows (filter on df_init3, the shifted
#dataframe, not the original df)
df_init3 = df_init3.loc[(df_init3[meas_cols_sp] > 0).all(axis=1)]

Executing the shape method showed that the number of rows in the dataframe was reduced by exactly 56. All the measurement setpoint column zeros were in the same rows.

Because of the time lag shift, certain columns in the 3 second dataframe tail contain “shifted” instead of a numeric value. The drop and tail methods were used to drop the last 6 rows of the dataframe.
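A sketch of that step, assuming the last 6 rows are the ones left incomplete by the -6 measurement shift:

#drop the last 6 rows, left incomplete by the -6 shift
df_init3 = df_init3.drop(df_init3.tail(6).index)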

The describe method was executed on the 3 second dataframe. Even after dropping certain rows there are still negatives and zeros in the measurement actual columns.

The analysis will be simplified to a single target variable. The next steps to complete the analysis were:

  1. Select a single target actual column which contains a large number of nonzero values.

SELECT A TARGET VARIABLE

The describe method applied to the meas_cols columns, together with histograms from the following code, was used to select a target variable:

# Import library and write function for histogram plotting
import matplotlib.pyplot as plt

def hist_plot(df, columns):
    fig = plt.figure(figsize=(20, 20))
    for counter, sp in enumerate(columns):
        ax = fig.add_subplot(5, 3, counter + 1)
        ax.set_title(sp, color='red')
        ax.tick_params(color='red', labelcolor='red')
        bins = 10
        plt.hist(df[sp], bins, align='left')
    plt.subplots_adjust(hspace=0.5)
    plt.show()
    return

hist_plot(df_init3, meas_cols)

Histograms for all 15 measurement actuals were generated. Below is the Measurement 10 Actual histogram:

Measurement 10 values > 0 ranged from 4.2mm to 13.1mm, with relatively few values ≤ 0. Measurement 10 was chosen as the target variable. All other measurement columns (actual and setpoint) were dropped, as were rows where Measurement 10 ≤ 0:

#drop measurement columns (Act and Setpoint) != 10
rm_col = meas_cols.copy()
rm_col.remove('Stage1_Measure10.U.Act')
df_init3.drop(rm_col, axis=1, inplace=True)
rm_colsp = meas_cols_sp.copy()
rm_colsp.remove('Stage1_Measure10.U.Setpoint')
df_init3.drop(rm_colsp, axis=1, inplace=True)

#drop rows where Measure10 <= 0
df_init3 = df_init3[df_init3['Stage1_Measure10.U.Act'] > 0]
df_init3['Stage1_Measure10.U.Act'].describe()

CREATE TRAIN, TEST DATASETS FROM 3 SEC TIME LAG DF

#create train, test datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

train3, test3 = train_test_split(df_init3,
                                 test_size=0.25,
                                 random_state=1)

#create min/max scaled train X, also a train y
X_train3df = train3.drop(['Stage1_Measure10.U.Setpoint',
                          'Stage1_Measure10.U.Act',
                          'time_stamp'],
                         axis=1)
scaler_train = MinMaxScaler()
scaler_train.fit(X_train3df)
X_train3 = scaler_train.transform(X_train3df)  # scale
y_train3 = train3.pop('Stage1_Measure10.U.Act')

#create test X, test y (will scale test X after choosing
#model features)
X_test3 = test3.drop(['Stage1_Measure10.U.Setpoint',
                      'Stage1_Measure10.U.Act',
                      'time_stamp'],
                     axis=1)
y_test3 = test3.pop('Stage1_Measure10.U.Act')

MODELING

The following steps were followed to create a model for predicting Measurement 10:

- Create a function for Recursive Feature Elimination (RFE)
- Create a Linear Regression function
- Execute the RFE function to choose features
- Execute the linear regression function on the 3 sec time lag DataFrame using the RFE selected features from the train and test datasets

Below is code for the RFE and Linear Regression functions:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

#function for recursive feature elimination
def choose_features(X, y):
    lr = LinearRegression()
    grid = RFE(lr, n_features_to_select=5, step=1)
    grid.fit(X, y)
    rem_X = grid.transform(X)
    num_features = grid.n_features_
    column_rank = grid.ranking_
    return num_features, column_rank, rem_X

#function for linear regression
def lr_reg(X_train, X_test, y_train, y_test):
    lr = LinearRegression(fit_intercept=False)
    lr.fit(X_train, y_train)
    prediction = lr.predict(X_test)
    Rsq = lr.score(X_test, y_test)
    coefs = lr.coef_
    return Rsq, coefs, prediction

Call the RFE function and use the results to create the numpy array for the test features:

# call the RFE function (_rem indicates "remaining")
num_features, column_rank, X_train3_rem = choose_features(X_train3,
                                                          y_train3)

#Get the list of remaining feature columns after RFE by
#zipping together the column names and column_rank (integer
#ranks; selected features rank 1). Then sort ascending and
#put the first num_features column names into a list
zipped = dict(zip(X_train3df.columns, column_rank))
sort_zip = sorted(zipped.items(),
                  key=lambda x: x[1], reverse=False)
sort_list = []
for i in range(0, num_features):
    sort_list.append(sort_zip[i][0])

print(sort_list)

#Create X_test w/ reduced feature set
X_test3_temp = test3[sort_list]
scaler_test = MinMaxScaler()
scaler_test.fit(X_test3_temp)
X_test3_rem = scaler_test.transform(X_test3_temp)  # scale
# Now have reduced, scaled train and test feature arrays
# for regression and have sort_list giving the feature names

Call the Linear Regression function and print results:

#call linear regression function
Rsq, coefs, prediction = lr_reg(X_train3_rem,
                                X_test3_rem,
                                y_train3,
                                y_test3)
print("Number of features (3s lag):", num_features, '\n',
      num_features, "Regression Rsq (3s lag):", Rsq, '\n',
      "Model coefficient names:", sort_list, '\n',
      "Model coefficients (3s lag):", coefs)

Compare results to a regression model using all of the features:

# what's the difference in R2 between the 5 feature and
# all feature models?
scaler_test_all = MinMaxScaler()
scaler_test_all.fit(X_test3)
X_test3_sall = scaler_test_all.transform(X_test3)  # scale all
Rsq, coefs, prediction = lr_reg(X_train3,
                                X_test3_sall,
                                y_train3,
                                y_test3)
print("All features regression Rsq (3s lag):", Rsq, '\n',
      "All features model coefficients (3s lag):", coefs)

Results of the modeling:

  • The 5 feature model on the 3 sec lag df gave an R² of 0.30. The ‘Mach3.RM.Prop1’, ‘Mach3.RM.Prop2’, ‘Mach3.RM.Prop3’, ‘Mach3.RM.Prop4’ coefficients were of magnitude 1E11–1E12; the other coefficient was of magnitude 1E-01.

The R² values are low, indicating the proportion of variability explained by the models is small compared to the total variability.

Why do these 4 Machine 3 raw material property features have such large coefficients compared to the other features? I don’t know. The Machine 1 raw material properties are also part of the model and are likewise numeric ordinal, yet they do not have extremely large coefficients. Why Machine 3 differs from Machine 1 is puzzling.
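One diagnostic I did not run, but would try next: strongly correlated features in a linear regression (especially with fit_intercept = False) often produce huge offsetting coefficients. A quick check of the correlations among those features might explain it:

#hypothetical diagnostic: collinear features are a common
#cause of exploding linear regression coefficients
mach3_props = ['Mach3.RM.Prop1', 'Mach3.RM.Prop2',
               'Mach3.RM.Prop3', 'Mach3.RM.Prop4']
print(X_train3df[mach3_props].corr())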

The entire analysis can be found in a Kaggle notebook at https://www.kaggle.com/ryandmonson/rmonson-multistep-mfg.

POST MORTEM — THOUGHTS, LESSONS LEARNED

  • More work could be done on target/feature selection. I may need to rethink including raw material properties in the model. The Kaggle histograms show these features are numeric ordinal (they are likely continuous, but the dataset contains only 3 or 4 distinct values for each raw material property).

*****

¹Measurement error is the uncertainty of a measurement result. Say a thermocouple has a measurement error of ±1℃ and reads 25℃. The true temperature is somewhere in the range 24–26℃ (an uncertainty interval; any value within this range is plausible). Because of measurement error, it cannot be claimed that a 24℃ reading is different from a 26℃ reading. From my experience in Manufacturing, measurement error is frequently ignored (and even fought against).
