Predictive maintenance on NASA’s Turbofan Engine Degradation dataset (CMAPSS)

Rohit Malhotra
11 min readDec 5, 2021

--

Photo by Pablo Romay on Unsplash
  1. Introduction

Predictive maintenance techniques are designed to help determine the condition of in-service equipment in order to estimate when maintenance should be performed. This approach promises cost savings over routine or time-based preventive maintenance, because here tasks are performed only when warranted.

In this article an attempt has been made to to predict RUL( Remaining Useful Life of NASA Turbofan Engine) by applying various ML Models. Dataset considered is FD001 dataset which can also be accessed from link https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/#turbofan.

Data Set FD001 is the first in the series and the least complex.
Train trjectories: 100 engines. Test trajectories: 100 engines. Fault Modes: ONE. Datasets include simulations of multiple turbofan engines over time, each row contains the following information:

  1. Engine unit number
  2. Time, in cycles
  3. Three operational settings
  4. 21 sensor readings. No information regarding sensors have been given. If there has been some information regarding sensor type i.e. pressure sensor , temperature sensor , vibration sensor etc. then we could have fetch some more information about degradation of engine using domain knowledge.

2. Importing Necessary Libraries

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
pd.set_option("display.max_rows", None)

3. Importing the Dataset

# define column names for easy indexing
index_names = ['unit_nr', 'time_cycles']
setting_names = ['setting_1', 'setting_2', 'setting_3']
sensor_names = ['s_{}'.format(i) for i in range(1,22)]
col_names = index_names + setting_names + sensor_names
# read data
train = pd.read_csv('train_FD001.txt',sep='\s+', header=None, names=col_names)
test = pd.read_csv('test_FD001.txt',sep='\s+', header=None, names=col_names)
y_test = pd.read_csv('RUL_FD001.txt', sep='\s+', header=None, names=['RUL'])
# Train data contains all features (Unit Number + setting parameters & sensor parameters)
# Test data contains all features (Unit Number + setting parameters & sensor parameters)
# Y_test contains RUL for the test data.
train.head()
Train Dataset

There are 100 unique engine.

train['unit_nr'].unique()# There are 100 no unique engines.
100 no of engines.

There are 100 unique engines in the dataset.

train['unit_nr'].unique()

Since the true RUL values (y_test) for the test set are only provided for the last time cycle of each engine ,therefore the test set is subsetted to represent the same.

y_test.shape# RUL value for 100 no of engines
(100,1)

4. Data Pre-Processing + Data Visualization

train.describe()

We can remove setting_3 column as we can see that it’s value is not changing therefore will not not add any information to our prediction.

train=train.drop('setting_3',axis=1)

Now we will calculated and add RUL(Remaining Useful Life) to the train dataset.

RUL = Max Life Time Cycle of Engine-Current Time Cycle

Max Life Time Cycle of Engine means time when engine has degraded.

def add_remaining_useful_life(df):
# Get the total number of cycles for each unit
grouped_by_unit = df.groupby(by="unit_nr")
max_cycle = grouped_by_unit["time_cycles"].max()

# Merge the max cycle back into the original frame
result_frame = df.merge(max_cycle.to_frame(name='max_cycle'), left_on='unit_nr', right_index=True)

# Calculate remaining useful life for each row
remaining_useful_life = result_frame["max_cycle"] - result_frame["time_cycles"]
result_frame["RUL"] = remaining_useful_life

# drop max_cycle as it's no longer needed
result_frame = result_frame.drop("max_cycle", axis=1)
return result_frame
train = add_remaining_useful_life(train)
train[sensor_names+['RUL']].head()
Calculated RUL appended at the end

4.1 Plotting of RUL

df_max_rul = train[['unit_nr', 'RUL']].groupby('unit_nr').max().reset_index()
df_max_rul['RUL'].hist(bins=15, figsize=(15,7))
plt.xlabel('RUL')
plt.ylabel('frequency')
plt.show()
Count of RUL Vs RUL

We can see the RUL for engine is bit skewed on left side. This may affect performance of the model however as on now we will keep it as it.

4.2 Plotting of Sensor Variable Vs our Target Variable

def plot_sensor(sensor_name):
plt.figure(figsize=(13,5))
for i in train['unit_nr'].unique():
if (i % 10 == 0): # only plot every 10th unit_nr
plt.plot('RUL', sensor_name,
data=train[train['unit_nr']==i])
plt.xlim(250, 0) # reverse the x-axis so RUL counts down to zero
plt.xticks(np.arange(0, 275, 25))
plt.ylabel(sensor_name)
plt.xlabel('Remaining Use fulLife')
plt.show()
for sensor_name in sensor_names:
plot_sensor(sensor_name)

Plot for some of the sensors having relation with RUL is :

Sensor S2 Vs RUL
Sensor S3 Vs RUL
Sensor S4 Vs RUL

We can observe sensors (Features or independent variables) have strong relation with the RUL (our dependent variable) in range of RUL from 125 to 0.

In order to ensure model perform better we will later clip all higher value of RUL(above 125) to 125.

5. Extraction of only Important Features which have strong relation with RUL of the engine

This is done to select only important features for model building in order to avoid problem of “ Overfitting”.

First we will plot heat map to see correlation of features with RUL.

plt.figure(figsize=(25,18))
sns.heatmap(train.corr(),annot=True ,cmap='Reds')
plt.show()
Heat Map

Now we will select only those features which have absolute value of correlation with RUL greater & equal than 0.5.

cor=train.corr()
#Selecting highly correlated features
train_relevant_features = cor[abs(cor['RUL'])>=0.5]
train_relevant_features['RUL']
list_relevant_features=train_relevant_features.index
list_relevant_features=list_relevant_features[1:]
list_relevant_features

Above list contains important features have correlation of magnitude greater and equal to 0.5 with our target variable RUL.

Now we will keep only those important features in both train & test dataset.

# Now we will keep onlt these imprtant features in both train & test dataset.
train=train[list_relevant_features]

Separating Train features and RUL (Target Variable) from Train Dataset

# train & y_train
# Calculated RUL variable is our Target variable.
y_train=train['RUL']
X_train=train.drop(['RUL'],axis=1)
X_train.head(5)
X_Train containing only features
# Test data set , keeping only train columns/features.
X_test=test[X_train.columns]
X_test.head(5)

In test data we have keep only train selected features.

Clipping of RUL at 125 as after 125 , RUL is responding to the sensor values after this value. This is done to improve performance of the applied models.

y_train= y_train.clip(upper=125)

6. Building Model

We will create one evaluate function which will contain metrics for regression problem.

# first create an evaluate function
def evaluate(y_true, y_hat, label='test'):
mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)
variance = r2_score(y_true, y_hat)
print('{} set RMSE:{}, R2:{}'.format(label, rmse, variance))
return rmse,variance;

Model 1: Linear Regression

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train1 = sc.fit_transform(X_train)
X_test1 = sc.transform(X_test)
# create and fit model
lm = LinearRegression()
lm.fit(X_train1, y_train)
# predict and evaluate
y_hat_train1 = lm.predict(X_train1)
RMSE_Train,R2_Train=evaluate(y_train, y_hat_train1,'train')
y_hat_test1 = lm.predict(X_test1)
RMSE_Test,R2_Test=evaluate(y_test, y_hat_test1,'test')

Results are :

Now we will create Dataframe which will contains results of all applied models. We will keep on appending the results in this dataframe.

# Make Dataframe which will contain results of all applied Model
Results=pd.DataFrame({'Model':['LR'],'RMSE-Train':[RMSE_Train],'R2-Train':[R2_Train],'RMSE-Test':[RMSE_Test],'R2-Test':[R2_Test]})
Results
Result of LR Model

Now we will Plot Actual Vs Predicted RUL for the Train Data

# Plot Actual Vs Predicted RUL for Train Data
fig = plt.figure();
plt.figure(figsize=[20,12])
plt.plot(y_train,color="blue", linewidth=2.5, linestyle="-",label="Actual")
plt.plot(y_hat_train1,color="red", linewidth=2.5, linestyle="-",label="Predicted")
fig.suptitle('Actual and Predicted', fontsize=20) # Plot heading
#plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('RUL', fontsize=16) # Y-label
plt.legend()
plt.title("Actual RUL Vs Predicted RUL for Train Data")
Actual RUL vs Predicted RUL for Train Data
# Making predictions using test set
y_hat_test1 = lm.predict(X_test1)
evaluate(y_test, y_hat_test1)
Result for Test Det

Now we will Plot Actual Vs Predicted RUL for the Test Data

# Plot Actual Vs Predicted RUL for Test Data
#c = [i for i in range(1,81,1)]
fig = plt.figure();
plt.figure(figsize=[20,12])
plt.plot(y_test,color="blue", linewidth=2.5, linestyle="-",label="Actual")
plt.plot(y_hat_test1,color="red", linewidth=2.5, linestyle="-",label="Predicted")
fig.suptitle('Actual and Predicted', fontsize=20) # Plot heading
#plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('RUL', fontsize=16) # Y-label
plt.legend()
plt.title("Actual RUL Vs Predicted RUL for Test Data")
Actual RUL Vs Predicted RUL for Test Data

We can see that on test data which is unseen by the model , RMSE & R2 score value comes out be 22.9 , 0.695.We will try to build further more models to see how they are performing.

Model 2: Applying SVM Model

from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X_train1, y_train)
# predict and evaluate
y_hat_train1 = regressor.predict(X_train1)
RMSE_Train,R2_Train=evaluate(y_train, y_hat_train1)
y_hat_test1 = regressor.predict(X_test1)
RMSE_Test,R2_Test=evaluate(y_test, y_hat_test1)

Results are :

# Make Dataframe which will contain results of all applied Model
Results=Results.append(pd.DataFrame({'Model':['SVM'],'RMSE-Train':[RMSE_Train],'R2-Train':[R2_Train],'RMSE-Test':[RMSE_Test],'R2-Test':[R2_Test]}),ignore_index=True)
Results
Results after LR & SVM Model

Model 3 : Decision Tree

from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(random_state=42, max_depth=15, min_samples_leaf=10)
dt.fit(X_train1, y_train)
y_hat_train1 = dt.predict(X_train1)
RMSE_Train,R2_Train=evaluate(y_train, y_hat_train1, 'train')
y_hat_test1 = dt.predict(X_test1)
RMSE_Test,R2_Test=evaluate(y_test, y_hat_test1)
# Make Dataframe which will contain results of all applied Model
Results=Results.append(pd.DataFrame({'Model':['DT'],'RMSE-Train':[RMSE_Train],'R2-Train':[R2_Train],'RMSE-Test':[RMSE_Test],'R2-Test':[R2_Test]}),ignore_index=True)
Results

RMSE & R2 score has further increased. There is some overfitting happening as we have kept maximum depth of model high. Overfitting means when there difference between R2 score on train data and test data is high. In intuitive terms it means when model has learned each and every thing about train data but have not taken cognizance of basic generalized pattern of the train data.

Model 4: Random Forest

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=42, n_jobs=-1, max_depth=6, min_samples_leaf=5)
rf.fit(X_train1, y_train)
y_hat_train1 = rf.predict(X_train1)
# Evaluating on Train Data Set
RMSE_Train,R2_Train=evaluate(y_train, y_hat_train1, 'train')
# Evaluating on Test Data Set
y_hat_test1 = rf.predict(X_test1)
RMSE_Test,R2_Test=evaluate(y_test, y_hat_test1)
# Make Dataframe which will contain results of all applied Model
Results=Results.append(pd.DataFrame({'Model':['DF'],'RMSE-Train':[RMSE_Train],'R2-Train':[R2_Train],'RMSE-Test':[RMSE_Test],'R2-Test':[R2_Test]}),ignore_index=True)
Results

Applying Random forest model which is an bagging technique of ensemble type of model , our model overfitting model has been avoided.

Knowing important features. This can be known using syntax rf.feature_importances_

imp_df = pd.DataFrame({
"Varname": X_train.columns,
"Imp": rf.feature_importances_})
imp_df.sort_values(by="Imp", ascending=False)

Above is the list of important features/sensors which have strong relation with RUL.

Hyperparameter Tuning using Grid Search

from sklearn.model_selection import GridSearchCV
rf = RandomForestRegressor(random_state=42, n_jobs=-1)
# Create the parameter grid based on the results of random search
params = {
'max_depth': [1, 2, 5, 10, 20],
'min_samples_leaf': [5, 10, 20, 50, 100],
'max_features': [2,3,4,5,6],
'n_estimators': [10, 30, 50, 100]
}
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=params,
cv=4, n_jobs=-1, verbose=1)
%%time
grid_search.fit(X_train1, y_train)
rf_best = grid_search.best_estimator_
y_hat_train = rf_best.predict(X_train1)
# Evaluating on Train Data Set
RMSE_Train,R2_Train=evaluate(y_train, y_hat_train1, 'train')
# Evaluating on Test Data Set
y_hat_test1 = rf_best.predict(X_test1)
RMSE_Test,R2_Test=evaluate(y_test, y_hat_test1)
# Make Dataframe which will contain results of all applied Model
Results=Results.append(pd.DataFrame({'Model':['DF with Tuning'],'RMSE-Train':[RMSE_Train],'R2-Train':[R2_Train],'RMSE-Test':[RMSE_Test],'R2-Test':[R2_Test]}),ignore_index=True)
Results

DF with hyperparameter tuning model performance has further improved.

Model 5 : Using Ridge Regression

from sklearn.linear_model import Ridge, Lasso
params = {'alpha': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
9.0, 10.0, 20, 50, 100, 500, 1000 ]}
ridge = Ridge()# cross validationfolds = 5
ridge_model_cv = GridSearchCV(estimator = ridge,
param_grid = params,
cv = folds,
return_train_score=True,
verbose = 1)
ridge_model_cv.fit(X_train1, y_train)
# display the mean scoresridge_cv_results = pd.DataFrame(ridge_model_cv.cv_results_)
ridge_cv_results = ridge_cv_results[ridge_cv_results['param_alpha']<=500]
ridge_cv_results[['param_alpha', 'mean_train_score', 'mean_test_score', 'rank_test_score']].sort_values(by = ['rank_test_score'])
# plotting mean test and train scores with alpharidge_cv_results['param_alpha'] = ridge_cv_results['param_alpha'].astype('int32')# plottingplt.plot(ridge_cv_results['param_alpha'], ridge_cv_results['mean_train_score'])
plt.plot(ridge_cv_results['param_alpha'], ridge_cv_results['mean_test_score'])
plt.xlabel('alpha')
plt.ylabel('Negative Mean Absolute Error')
plt.title("Negative Mean Absolute Error and alpha")
plt.legend(['train score', 'test score'], loc='upper right')
plt.show()
# get the best estimator for lambdaridge_model_cv.best_estimator_
# check the coefficient values with lambda = 10alpha = 100
ridge = Ridge(alpha=alpha)
ridge.fit(X_train1, y_train)
ridge.coef_
# Evaluating on Test Data Set
y_hat_test1 = ridge.predict(X_test1)
RMSE_Test,R2_Test=evaluate(y_test, y_hat_test1)
# Make Dataframe which will contain results of all applied Model
Results=Results.append(pd.DataFrame({'Model':['Ridge'],'RMSE-Train':[RMSE_Train],'R2-Train':[R2_Train],'RMSE-Test':[RMSE_Test],'R2-Test':[R2_Test]}),ignore_index=True)
Results

Performance have reduced with Ridge Regression Model. We need to apply Hyper parameter tuning to improve the performance.

Model 6 : XG Boost Algorithm

# Evaluating on Train Data Set
y_hat_train1 = xgb_r.predict(X_train1)
# Evaluating on Train Data Set
RMSE_Train,R2_Train=evaluate(y_train, y_hat_train1, 'train')
# Evaluating on Test Data Set
y_hat_test1 = xgb_r.predict(X_test1)
RMSE_Test,R2_Test=evaluate(y_test, y_hat_test1)
# Make Dataframe which will contain results of all applied Model
Results=Results.append(pd.DataFrame({'Model':['XgBoost'],'RMSE-Train':[RMSE_Train],'R2-Train':[R2_Train],'RMSE-Test':[RMSE_Test],'R2-Test':[R2_Test]}),ignore_index=True)
Results

With XGboost model which is a favourite model in Kaggle competitions and very powerful model , accuracy has increased. We can see that the XGboost model is not overfitting as model offers regularized gradient boosting framework.

Model 7 : Using Tensor Flow ANN Model

import tensorflow as tf
ann = tf.keras.models.Sequential()
ann.add(tf.keras.layers.Dense(units=26, activation='relu'))
ann.add(tf.keras.layers.Dense(units=10, activation='relu'))
ann.add(tf.keras.layers.Dense(units=10, activation='relu'))
ann.add(tf.keras.layers.Dense(units=1, activation='linear'))
ann.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
ann.fit(X_train1, y_train, batch_size = 32, epochs = 75)
# Evaluating on Train Data Set
y_hat_train1 = ann.predict(X_train1)
# Evaluating on Train Data Set
RMSE_Train,R2_Train=evaluate(y_train, y_hat_train1, 'train')
# Evaluating on Test Data Set
y_hat_test1 = ann.predict(X_test1)
RMSE_Test,R2_Test=evaluate(y_test, y_hat_test1,'test')
# Make Dataframe which will contain results of all applied Model
Results=Results.append(pd.DataFrame({'Model':['ANN'],'RMSE-Train':[RMSE_Train],'R2-Train':[R2_Train],'RMSE-Test':[RMSE_Test],'R2-Test':[R2_Test]}),ignore_index=True)
Results

With ANN model performance is nearby to the XGBoost Model.

7 Conclusion

  1. We have applied different types of ML Models and see how they are performing.
  2. Best results we are getting is with XGBoost & ANN Model. But with ANN model we have one well known disadvantage that model is not intuitive in nature.
  3. Further in my future posts i will try to perform below listed feature engineering to enhance the performance of the model:
    # Creating Categorical variable “ Alarm for Low & High Sensor Variable” .
    # Creating new features with lagged values.
    # Applying feature transformation like mean,SD,etc.
    # We can also set new features with related to % change in peak value of the sensors
    4. Many more feature engineering techniques can be applied subject to having domain knowledge of the sensors , Machine past history , machine OEM recommendations etc.

Complete code along with the dataset can be found at my Githubpage.

I would like to interact and listen for your feedback.

You can contact me at my LinkedIn profile.

--

--

Rohit Malhotra

Passionate to Utilize Capabilities of Data Analytics to Improve Performance of Industrial Assets. https://www.linkedin.com/in/rohitmalhotra67/