Applying Data Science in Manufacturing: Part III — Continuous Process: Methodology and Lessons Learned
In Part II of this post (https://medium.com/@ryandmonson/applying-data-science-in-manufacturing-part-ii-batch-process-methodology-and-lessons-learned-d18d360d8953) several models for predicting alloy grade from a batch manufacturing process were created. Classification modeling was far more accurate than Regression modeling in predicting the training target variable. At the end of the article a post mortem was documented, outlining lessons learned and thoughts on making the modeling results useful to the batch process manufacturing operation.
In Part III a continuous manufacturing process will be analyzed. This process differs from the batch process as follows:
- Processing is sequential, with time lags between process steps
- Multiple measurements are taken of the product after processing, versus a single measurement in the batch process
- The processing data is raw as opposed to being standardized to a t/z statistic
- The dataset contains 14K+ rows vs. ≈880 total rows for the batch process.
The dataset for this continuous manufacturing process is found on Kaggle at https://www.kaggle.com/supergus/multistage-continuousflow-manufacturing-process. After reading the dataset notes I decided to limit the analysis to Raw Material introduction through Stage 1 Combiner measurements. Data columns for downstream process steps will be dropped from the dataset. The following process flow will be analyzed:
For this analysis x variables will be referred to as ‘features’, y variables as ‘targets’. All coding is in Python.
READ AND SUMMARIZE DATASETS
The code for importing Numpy, Pandas and OS is the same as in Part II, the only change being the path to the dataset (/kaggle/input/multistage-continuousflow-manufacturing-process/continuous_factory_process.csv). Checking the data types revealed that the date time stamp was an object datatype. It was converted to the datetime data type.
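A minimal sketch of that read and datetime conversion, assuming the time stamp column is named time_stamp as in the cleaning code later in this article:

import numpy as np
import pandas as pd
import os

# read the dataset and convert the time stamp from object to datetime
path = ('/kaggle/input/multistage-continuousflow-manufacturing-process/'
        'continuous_factory_process.csv')
df = pd.read_csv(path)
df['time_stamp'] = pd.to_datetime(df['time_stamp'])
print(df.dtypes.head())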
Histograms of data in the dataframe columns are available on Kaggle. From these we learn:
- Besides the date time column, all other data types are numeric
- There are no missing values
- The column names are long
- Many of the temperature feature columns have little variation
Data columns from processes downstream of the Stage 1 Combiner measurements were dropped. Remaining columns were renamed as follows:
#replace these column title words...
col_list_old = ["Actual", "AmbientConditions.", "AmbientHumidity",
                "AmbientTemperature", "Machine", "RawMaterial", "Property",
                "RawMaterialFeederParameter", "Zone", "Temperature",
                "MotorAmperage", "MaterialPressure", "ExitZoneTemperature",
                'FirstStage.CombinerOperation.Temperature',
                'Stage1.Output.Measurement', 'Stage2.Output.Measurement']

#...with these
col_list_new = ["Act", "Amb", "_Hum", "_Temp", "Mach", "RM", "Prop",
                "RM_Feed_Param", "Zone", "Temp", "Amps", "Mat_Press",
                "Exit_Zone_Temp", "Stage1_Temp", 'Stage1_Measure',
                'Stage2_Measure']

#function to modify column names
def clean_column_names(df):
    df.columns = df.columns.str.strip()  #strip whitespace
    # zip the lists as tuples, then into a dictionary
    col_dict = dict(zip(col_list_old, col_list_new))
    #items - list of tuple key/value pairs (old, new)
    for i, j in col_dict.items():
        new_columns = [column.replace(i, j) for column in df.columns]  # replace old with new
        df.columns = new_columns
    return df

df = clean_column_names(df)
As noted earlier, many of the temperature features have little variation (range of 2 degs or less). Regardless of whether the scale is Centigrade or Fahrenheit, this range is within the measurement error of thermocouples, thermistors or infrared devices used in Manufacturing to measure temperature. Looking at the column min and max values indicates for many columns the temperature variability is just measurement error¹. These columns will not be beneficial for predicting target values and making process adjustments (they are effectively constants). They will be dropped from the analysis.
The Kaggle histograms indicate other features have extremely low coefficients of variation. These columns will also not be beneficial for predicting target values (too little variation) and will also be dropped.
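The coefficient of variation can also be computed directly as a rough screen; a minimal sketch, run on the renamed dataframe before any columns are dropped:

# coefficient of variation (std / mean) for each numeric column;
# columns whose mean is near zero would need a different screen
numeric = df.select_dtypes(include='number')
cv = numeric.std() / numeric.mean()
print(cv.sort_values().head(20))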
One could argue this is a poor approach statistically, that normalization or standardization places all the features on a “level playing field” as to magnitude and variation. That is true, and a feature being dropped may have predictive power in the model. But will the operators be able to make the tiny adjustments suggested by the model when the parameters are “unstandardized” and made operational? Will these tiny tweaks be outside the magnitude of measurement error? It may be that tiny tweaks to certain features make a difference in output, but I’m going to move forward with dropping the following columns:
- Amb_Temp.U.Act
- Mach1.Zone1Temp.C.Act
- Mach1.Zone2Temp.C.Act
- Mach2.RM.Prop1
- Mach2.RM.Prop2
- Mach2.RM.Prop4
- Mach2.Zone1Temp.C.Act
- Mach2.Zone2Temp.C.Act
- Mach2.Amps.U.Act
- Mach2.MotorRPM.C.Act
- Mach2.ExitZoneTemp.C.Act
- Mach3.Zone1Temp.C.Act
- Mach3.Zone2Temp.C.Act
- Mach3.ExitZoneTemp.C.Act
- FirstStage.CombinerOperation.Temp3.C.Act
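Dropping them is a single call; a sketch, assuming the renamed column names listed above and using df_init as the name of the reduced dataframe (the name the later shift code refers to):

# drop features whose variability is within measurement error or whose
# coefficient of variation is extremely low
low_info_cols = ['Amb_Temp.U.Act',
                 'Mach1.Zone1Temp.C.Act', 'Mach1.Zone2Temp.C.Act',
                 'Mach2.RM.Prop1', 'Mach2.RM.Prop2', 'Mach2.RM.Prop4',
                 'Mach2.Zone1Temp.C.Act', 'Mach2.Zone2Temp.C.Act',
                 'Mach2.Amps.U.Act', 'Mach2.MotorRPM.C.Act',
                 'Mach2.ExitZoneTemp.C.Act',
                 'Mach3.Zone1Temp.C.Act', 'Mach3.Zone2Temp.C.Act',
                 'Mach3.ExitZoneTemp.C.Act',
                 'FirstStage.CombinerOperation.Temp3.C.Act']
df_init = df.drop(columns=low_info_cols)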
After dropping the columns, Python Lists were created of features and targets:
- Feature: Raw material columns (rm_cols)
- Feature: Machine 1-3 parameter columns (mpp_cols)
- Feature: 1st Stage Combiner parameter columns (fspp_cols)
- Target actuals: Actual measurements (meas_cols)
- Target: Measurement setpoints (meas_cols_sp)
These Lists will be useful for the upcoming analysis.
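A sketch of how these lists might be assembled from the renamed column names (the exact string patterns are assumptions):

cols = df_init.columns
rm_cols      = [c for c in cols if '.RM.' in c]                  # raw material features
mpp_cols     = [c for c in cols if c.startswith('Mach')
                and '.RM.' not in c]                             # Machine 1-3 parameters
fspp_cols    = [c for c in cols if 'CombinerOperation' in c]     # 1st Stage Combiner parameters
meas_cols    = [c for c in cols if c.startswith('Stage1_Measure')
                and c.endswith('.Act')]                          # target actuals
meas_cols_sp = [c for c in cols if c.startswith('Stage1_Measure')
                and c.endswith('.Setpoint')]                     # target setpoints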
PREPROCESSING
E-mail exchange with the dataset owner indicates this is a polymer extrusion process. There are time lags in the process: material entering Machine 1,2 or 3 → material exiting Machine 1,2,3 → material entering Stage 1 Combiner → material exiting Stage 1 Combiner → material being measured.
Consider a 1 inch section of material entering Machine 1. That 1 inch section experiences lags in absolute time as it moves through the process. The structure of the DataFrame does not account for these lags. The date time stamp indexes a machine feature, 1st Stage Combiner feature or measurement target at a point in absolute time, not what the 1 inch section of material experienced. The measurements (target actuals) are for a section of material that experienced certain temperatures, amps, pressures, etc. We want a model that predicts what the material targets will be based on the feature values it experienced.
This may be a moot point if the features/targets change slowly relative to their sampling rate. Additionally the dataset owner was unable/unwilling to provide linespeed or equipment physical dimensions for this process.
It was decided to proceed with creating new datasets that introduced a time lag. The following visualizes the process with the time lag and the features:
As Figure 2 shows, the time shift between process steps/machines will be addressed, not the time shift within the machines. The time shifts are assumed to be uniform. Dataframes with 3 and 5 second shifts will be created.
For me this analysis process has already had a fair amount of “hand waving” (When my wife asks about this article I’ve responded by mimicking the penguins from the movie “Madagascar”: ‘You didn’t see anything’). Columns have been dropped because I don’t think small process tweaks matter or are doable (I don’t have access to the engineers/operators so I don’t know if that’s true). I’m putting time lags into the data columns when I don’t know what those lags are. I do have experience working around polymer extruders, but the 3 and 5 second time lags are guesses. Also, I still have to deal at some point with the negative and zero values in the measurement columns (both the actuals and setpoints). Given the dataset notes indicate all measurements are in mm, the negative and zero measurement values make no sense.
If this analysis was being done “for real”, I would not have made these assumptions/guesses. I would have found answers from the operators/engineers before moving forward with the analysis.
Because this article is intended to be instructive, I’ll continue and complete an analysis of this data, culminating in a model for predicting a target from the features.
The following code was used to create new dataframes reflecting the 3 and 5 second time lags:
#create 3 sec shift dataframe
df_init3 = df_init.copy()
df_init3[fspp_cols] = df_init3[fspp_cols].shift(-3, axis=0,
                                                fill_value="shifted")
df_init3[meas_cols] = df_init3[meas_cols].shift(-6, axis=0,
                                                fill_value="shifted")
df_init3[meas_cols_sp] = df_init3[meas_cols_sp].shift(-6, axis=0,
                                                      fill_value="shifted")

#shifted columns dtype changed to 'object'. Change back to float
cols = df_init3.select_dtypes(exclude=['float64', 'int64']).columns.to_list()
cols.remove('time_stamp')
df_init3[cols] = df_init3[cols].apply(pd.to_numeric,
                                      downcast='float', errors='coerce')

#repeat for 5 sec shift dataframe
df_init5 = df_init.copy()
df_init5[fspp_cols] = df_init5[fspp_cols].shift(-5, axis=0)
df_init5[meas_cols] = df_init5[meas_cols].shift(-10, axis=0)
df_init5[meas_cols_sp] = df_init5[meas_cols_sp].shift(-10, axis=0)

cols = df_init5.select_dtypes(exclude=['float64', 'int64']).columns.to_list()
cols.remove('time_stamp')
df_init5[cols] = df_init5[cols].apply(pd.to_numeric,
                                      downcast='float', errors='coerce')
The following code was used to check that the time lag shift is correct for the 3 second time lag df:
#check the shift on df_init3. 1st stage combiner temps shift by 3s,
#1st stage measurements shift by 6s
df_test = pd.concat(objs=[df_init.loc[3, :],
                          df_init3.loc[0, :],
                          df_init.loc[14087, :],
                          df_init3.loc[14081, :]],
                    axis=1, join='outer', ignore_index=True,
                    keys=None, levels=None, names=None,
                    verify_integrity=False, copy=True)
df_test.columns = ['df_init', 'df_init3(3)',
                   'df_init', 'df_init3(6)']

df_test3 = df_test.loc[fspp_cols, :]
df_test6 = df_test.loc[meas_cols, :]
print(df_test3)
print(df_test6)
#checked
Based on these results, I assume the 5 sec time lag dataframe is also correct.
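An analogous spot check could also be run on the 5 second dataframe rather than assuming it; a minimal sketch (np.allclose sidesteps the float32/float64 downcasting done above):

import numpy as np

# combiner features were shifted by 5 rows and measurements by 10 rows,
# so row 0 of df_init5 should match rows 5 and 10 of df_init
print(np.allclose(df_init5.loc[0, fspp_cols].astype('float64'),
                  df_init.loc[5, fspp_cols].astype('float64')))
print(np.allclose(df_init5.loc[0, meas_cols].astype('float64'),
                  df_init.loc[10, meas_cols].astype('float64')))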
FURTHER DATA CLEANING
As previously indicated there are negative and zero values in the measurement actuals and setpoints. Working with the 3 sec time lag dataframe, count the number of zeros in each setpoint column:
#count the number of values == 0 in each setpoint column
sp_dict_0 = {}
for i in meas_cols_sp:
    value = (df_init3[i] == 0).sum()
    sp_dict_0[i] = value
print(sp_dict_0)
Each measurement setpoint column contains 56 zeros. Are they from the same rows? The database owner indicated that rows with all zeros in the setpoints are when the measurement system “drops out”. Code could be written to verify the setpoint zeros are all from the same rows, but given the database owner feedback, I’ll just drop these rows.
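That verification would only take a couple of lines; a sketch, using the column list names from above:

# count rows where every setpoint column is zero; if this equals 56,
# all of the setpoint zeros come from the same rows
print((df_init3[meas_cols_sp] == 0).all(axis=1).sum())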
#drop all zero setpoint rows
df_init3 = df_init3.loc[(df_init3[meas_cols_sp] > 0).all(axis=1)]
Executing the shape method showed that the number of rows in the dataframe was reduced by exactly 56. All the measurement setpoint column zeros were in the same rows.
Because of the time lag shift, certain columns in the tail of the 3 second dataframe contain the “shifted” fill value instead of a numeric value. The drop and tail methods were used to drop the last 6 rows of the dataframe.
The describe method was then executed on the 3 second dataframe. Even after dropping those rows there are still negatives and zeros in the measurement actual columns.
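A minimal sketch of that cleanup and check, using the dataframe and column list names above:

# drop the last 6 rows left over from the -6 row shift
# (they hold fill values rather than real measurements)
df_init3 = df_init3.drop(df_init3.tail(6).index)

# summary statistics still show negatives and zeros in the actuals
print(df_init3.shape)
print(df_init3[meas_cols].describe())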
The analysis will be simplified to a single target variable. The next steps to complete the analysis were:
- Select a single target actual column which contains a large number of nonzero values.
- Drop the remaining target actual and setpoint columns
- Divide the dataframe into train and test datasets, select features, create model, fit and make target predictions
SELECT A TARGET VARIABLE
The describe method applied to the meas_cols, together with histograms from the following code, was used to select a target variable:
# Import library and write function for histogram plotting
import matplotlib.pyplot as plt

def hist_plot(df, columns):
    fig = plt.figure(figsize=(20, 20))
    for counter, sp in enumerate(columns):
        ax = fig.add_subplot(5, 3, counter + 1)
        ax.set_title(sp, color='red')
        ax.tick_params(color='red', labelcolor='red')
        bins = 10
        plt.hist(df[sp], bins, align='left')
    plt.subplots_adjust(hspace=0.5)
    plt.show()
    return

hist_plot(df_init3, meas_cols)
Histograms for all 15 measurement actuals were generated. Below is the Measurement 10 Actual histogram:
Measurement 10 values > 0 ranged from 4.2mm to 13.1mm, with relatively few values ≤ 0. Measurement 10 was chosen as the target variable. All other measurement columns (actual and setpoint) were dropped. Rows where Measurement 10 ≤ 0 were also dropped:
#drop measurement columns (Act and Setpoint) != 10
rm_col = meas_cols.copy()
rm_col.remove('Stage1_Measure10.U.Act')
df_init3.drop(rm_col, axis=1, inplace=True)

rm_colsp = meas_cols_sp.copy()
rm_colsp.remove('Stage1_Measure10.U.Setpoint')
df_init3.drop(rm_colsp, axis=1, inplace=True)  # checked

#drop rows where Measure10 <= 0
df_init3 = df_init3[df_init3['Stage1_Measure10.U.Act'] > 0]
df_init3['Stage1_Measure10.U.Act'].describe()
CREATE TRAIN, TEST DATASETS FROM 3 SEC TIME LAG DF
#create train, test datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

train3, test3 = train_test_split(df_init3,
                                 test_size=0.25,
                                 random_state=1)

#create min/max scaled train X, also a train y
X_train3df = train3.drop(['Stage1_Measure10.U.Setpoint',
                          'Stage1_Measure10.U.Act',
                          'time_stamp'],
                         axis=1)
scaler_train = MinMaxScaler()
scaler_train.fit(X_train3df)
X_train3 = scaler_train.transform(X_train3df)  # scale
y_train3 = train3.pop('Stage1_Measure10.U.Act')

#create test X, test y (will scale test X after choosing model features)
X_test3 = test3.drop(['Stage1_Measure10.U.Setpoint',
                      'Stage1_Measure10.U.Act',
                      'time_stamp'],
                     axis=1)
y_test3 = test3.pop('Stage1_Measure10.U.Act')
MODELING
The following steps were followed to create a model for predicting Measurement 10:
- Create a function for Recursive Feature Elimination (RFE)
- Create a Linear Regression function
- Execute the RFE function to choose features
- Execute the linear regression function on the 3 sec time lag DataFrame using the RFE selected features from the train and test datasets
Below is code for the RFE and Linear Regression functions:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

#function for Recursive Feature Elimination
def choose_features(X, y):
    lr = LinearRegression()
    grid = RFE(lr, n_features_to_select=5, step=1)
    grid.fit(X, y)
    rem_X = grid.transform(X)
    num_features = grid.n_features_
    column_rank = grid.ranking_
    return num_features, column_rank, rem_X

#function for linear regression
def lr_reg(X_train, X_test, y_train, y_test):
    lr = LinearRegression(fit_intercept=False)
    lr.fit(X_train, y_train)
    prediction = lr.predict(X_test)
    Rsq = lr.score(X_test, y_test)
    coefs = lr.coef_
    return Rsq, coefs, prediction
Call the RFE function and use the results to create the numpy array for the test features:
# call the RFE function (_rem indicates "remaining")
num_features, column_rank, X_train3_rem = choose_features(X_train3, y_train3)

#Get list of remaining feature columns after RFE by
#zipping together column_rank and column names.
#Then sort ascending and put the no. of columns equal to
#number of features into a list
zipped = dict(zip(X_train3df.columns, column_rank))
sort_zip = sorted(zipped.items(),
                  key=lambda x: x[1], reverse=False)
sort_list = []
for i in range(0, num_features):
    sort_list.append(sort_zip[i][0])
print(sort_list)

#Create X_test w/ reduced feature set
X_test3_temp = test3[sort_list]
scaler_test = MinMaxScaler()
scaler_test.fit(X_test3_temp)
X_test3_rem = scaler_test.transform(X_test3_temp)  # scale

# Now have reduced, scaled train and test feature arrays
# for regression and have sort_list giving the feature names
Call the Linear Regression function and print results:
#call linear regression function
Rsq, coefs, prediction = lr_reg(X_train3_rem,
                                X_test3_rem,
                                y_train3,
                                y_test3)

print("Number of features (3s lag):", num_features, '\n',
      num_features, "feature regression Rsq (3s lag):", Rsq, '\n',
      "Model coefficient names:", sort_list, '\n',
      "Model coefficients (3s lag):", coefs)
Compare results to a regression model using all of the features:
# what's the difference in R2 between 5 feature and all feature model?
scaler_test_all = MinMaxScaler()
scaler_test_all.fit(X_test3)
X_test3_sall = scaler_test_all.transform(X_test3)  # scale all features

Rsq, coefs, prediction = lr_reg(X_train3,
                                X_test3_sall,
                                y_train3,
                                y_test3)

print("All features regression Rsq (3s lag):", Rsq, '\n',
      "All features model coefficients (3s lag):", coefs)
Results of the modeling:
- The 5 feature model, 3 sec lag df, gave an R² of 0.30. The ‘Mach3.RM.Prop1’, ‘Mach3.RM.Prop2’, ‘Mach3.RM.Prop3’, ‘Mach3.RM.Prop4’ coefficients were of magnitude 1E11 to 1E12. The other coefficient was of magnitude 1E-01.
- The 26 feature model, 3 sec lag df, gave an R² of 0.52. The ‘Mach3.RM.Prop1’, ‘Mach3.RM.Prop2’, ‘Mach3.RM.Prop3’, ‘Mach3.RM.Prop4’ coefficients were of magnitude 1E00 to 1E01. The other coefficients were of magnitude 1E-01 to 1E-02.
The R² values are low, indicating the proportion of variability explained by the models is small compared to the total variability.
Why do the 4 Machine 3 raw material property features have such large coefficients compared to the other features? I don’t know. The Machine 1 raw material properties are also part of the model and their values are numerical ordinal (like the Machine 3 raw material properties), yet they do not have extremely large coefficients. Why Machine 3 is different from Machine 1 is puzzling.
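One way to start investigating would be to look at how strongly those features correlate with one another, since near-duplicate features can inflate linear regression coefficients; a sketch of such a check, using the training data from above:

# correlation among the Machine 3 raw material properties; strong
# correlation between them could explain the inflated coefficients
mach3_props = ['Mach3.RM.Prop1', 'Mach3.RM.Prop2',
               'Mach3.RM.Prop3', 'Mach3.RM.Prop4']
print(train3[mach3_props].corr())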
The entire analysis can be found in a Kaggle notebook at https://www.kaggle.com/ryandmonson/rmonson-multistep-mfg.
POST MORTEM — THOUGHTS, LESSONS LEARNED
- More work could be done on target/feature selection. May need to rethink including raw material properties in the model. The Kaggle columns show these features are numeric ordinal (they are likely continuous, but what the dataset has are only 3 or 4 values for each raw material property).
- Initial feature removal included features where measurement error explained the measurement variability. This idea has broader application and will be discussed further in Part IV.
- Calibration addresses measurement data validity but in this case calibration does not seem to have been addressed. This idea has broader application and will be discussed further in Part IV.
- In evaluating features to include in the model I completely missed checking for multicollinearity. This was done in Part II (batch processing analysis) of this series but was not done for this Part. When elements of an analysis connect strongly with previous experience (in my case measurement error and calibration), other essential elements can get overlooked. Checklists can help in avoiding this mistake.
- In addition to measurement error and calibration, application on the manufacturing floor connects strongly with my experience. Features were dropped based on my assessment that operators could not make impactful adjustments. Analysis results need to be operational.
- Many times we focus on just R² or some other accuracy score to judge our model. In this case the coefficients themselves also told a story that should be investigated. A complete understanding of the model will lead to more robust decision making about process changes.
- Much troubleshooting time was spent with the output of the LinearRegression instance. Originally the StandardScaler was used for preprocessing, and the predict method from LinearRegression provided ridiculous results. After quite a bit of thinking, research and swearing I realized StandardScaler is used when the data is normally distributed. I switched over to the MinMaxScaler and got sensible results. This experience was so frustrating it inspired the next article I’ll compose and publish after this series is completed.
- I originally considered choosing features based on their correlation with the target, an approach taught in my Data Science course of study. This approach requires a judgement call on the correlation cutoff value. To avoid a judgement call, I went the RFECV route (Recursive Feature Elimination w/Cross Validation). RFECV chooses the number of features based on the highest average accuracy score from the Cross Validations. Not surprisingly, the most accurate model from RFECV used all the features. In the end I went the RFE route, making a judgement call of 5 features. Which is “better”, determining features by RFE or by correlation coefficient? Both require a judgement call, and the calculations differ. As numbers-driven and objective as Data Science is, there is still a need for judgement calls.
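For reference, the RFECV approach mentioned above is only a few lines with scikit-learn; a sketch, using the scaled training arrays from earlier:

from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# RFECV chooses the feature count itself from cross-validated R²;
# on this dataset it kept all of the features
selector = RFECV(LinearRegression(), step=1, cv=5, scoring='r2')
selector.fit(X_train3, y_train3)
print(selector.n_features_)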
*****
¹Measurement error is the uncertainty of a measurement result. Say a thermocouple has a measurement error of ±1℃. The thermocouple reads 25℃. The true temperature is somewhere in the range 24–26℃ (this is a confidence interval, so the probability for the true temperature is uniform within this range). For this example, because of measurement error it cannot be claimed that a 24℃ temperature reading is different from a 26℃ reading. From my experience in Manufacturing, measurement error is frequently ignored (and even fought against).