Predictive Maintenance on Factory Data

Andrey Karpov
7 min read · Sep 17, 2020


In the era of digitization, the concept of the Smart Factory attracts a lot of attention. Modern industry is becoming connected and highly automated, and such factories need their machines to run smoothly with minimal downtime. Predictive maintenance helps deal with breakages: it aims to identify possible failures in advance and to schedule maintenance for the affected devices.

In this blog post I illustrate the process of building a model that detects failures of factory machines. I use an open dataset from one of the Schwan’s factories. It contains time-series values that include telemetry, device errors and failures. The aim is to predict a device failure 12 hours before it happens, the assumption being that half a day is enough for a technician to react and handle the issue.

Exploring the dataset

The first thing to do is to read the dataset and load the data into pandas data frames. This logic is quite trivial, so I do not post it here; instead, you can access the original Jupyter Notebook at this link.

In general there are 3 files to read: telemetry.csv, failures.csv and errors.csv. In the end we get 3 pandas data frames: telemetry_df, failures_df and errors_df.
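For completeness, a minimal sketch of that loading step could look like this (assuming the CSV files sit next to the notebook and each has a datetime column):

import pandas as pd

# Load the three source files into pandas data frames.
# The parse_dates argument assumes each file has a "datetime" column.
telemetry_df = pd.read_csv("telemetry.csv", parse_dates=["datetime"])
failures_df = pd.read_csv("failures.csv", parse_dates=["datetime"])
errors_df = pd.read_csv("errors.csv", parse_dates=["datetime"])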

Telemetry data frame
Failures and Errors data frames

In total there are 876100 rows of telemetry values for 100 machines, collected hourly over one year. This is a lot, but most of the time the devices work well. All in all the dataset contains only 3919 errors and 761 failures. The latter are the values we try to predict.
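These numbers are easy to check directly from the loaded data frames, for example:

# Quick sanity check of the dataset size.
print(telemetry_df.shape[0])                # 876100 telemetry rows
print(telemetry_df["machineID"].nunique())  # 100 machines
print(len(errors_df))                       # 3919 errors
print(len(failures_df))                     # 761 failures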

We have data for 100 machines. In real predictive maintenance projects it often makes sense to create a separate model for each machine to get the best predictions. In this example we assume that one model can work for all the devices. To check this, let’s build box plots of the telemetry values for all machines and compare their distributions.

import matplotlib.pyplot as plt
import numpy as np


volt_values = []
rotate_values = []
pressure_values = []
vibration_values = []

# Collect the telemetry series of every machine separately.
for i in range(1, 101):
    volt_values.append(telemetry_df[telemetry_df['machineID'] == i]["volt"])
    rotate_values.append(telemetry_df[telemetry_df['machineID'] == i]["rotate"])
    pressure_values.append(telemetry_df[telemetry_df['machineID'] == i]["pressure"])
    vibration_values.append(telemetry_df[telemetry_df['machineID'] == i]["vibration"])


fig, axs = plt.subplots(4, 1, constrained_layout=True, figsize=(18, 16))
fig.suptitle('Telemetry values per machine', fontsize=16)


def build_box_plot(plot_index, plot_values, title):
    # One box per machine; label every 25th machine on the x axis.
    axs[plot_index].boxplot(plot_values)
    axs[plot_index].set_title(title)
    axs[plot_index].set_xticks([1, 25, 50, 75, 100])
    axs[plot_index].set_xticklabels([1, 25, 50, 75, 100])


build_box_plot(0, volt_values, "Volt")
build_box_plot(1, rotate_values, "Rotate")
build_box_plot(2, pressure_values, "Pressure")
build_box_plot(3, vibration_values, "Vibration")
plt.show()
Telemetry values of all machines
Telemetry values of all machines

The picture above shows that all machines have approximately the same distributions of their telemetry values. So we might assume that they are similar from a technical point of view and hence we can train one model for all of them.

Of course, it is still hard to decide whether our data is sufficient for predicting failures, so we need to explore it further. Now we combine all 3 data frames into one (see the original notebook) and build scatter plots for the pairs “Rotate-Volt” and “Vibration-Pressure”, marking the failure cases in red. By doing this we want to check whether failures are grouped together or occur at extreme values known as outliers.
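One possible way to build these plots is sketched below; the exact merge logic lives in the original notebook, and matching failures to telemetry rows by the machineID and datetime columns is an assumption about the schema.

# Mark telemetry rows at which a failure was recorded (schema assumption:
# failures_df shares the machineID and datetime columns with telemetry_df).
plot_df = telemetry_df.merge(failures_df.assign(failed=True),
                             on=["machineID", "datetime"], how="left")
plot_df["failed"] = plot_df["failed"].fillna(False).astype(bool)

normal = plot_df[~plot_df["failed"]]
failed = plot_df[plot_df["failed"]]

fig, axs = plt.subplots(1, 2, figsize=(16, 6))
for ax, (x_col, y_col) in zip(axs, [("rotate", "volt"), ("vibration", "pressure")]):
    # Draw normal points first, then failures in red on top.
    ax.scatter(normal[x_col], normal[y_col], s=4, alpha=0.3, label="normal")
    ax.scatter(failed[x_col], failed[y_col], s=12, color="red", label="failure")
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.legend()
plt.show()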

Telemetry values and failures

Looking at the plots we might conclude that there are no simple patterns. This does not mean that telemetry is useless, but it tells us that telemetry alone won’t be a straightforward and strong predictor. On the other hand, there may still be complex patterns, especially in the way the values behave over time. Such patterns cannot be seen on simple plots but can still be detected by non-linear models.

Now let’s explore the errors. They might happen some time before the failures and thus be a good indicator. To check this, we first randomly sample 6 failure cases. Then we pick a time frame of 48 hours before each failure. The plots below show both the telemetry values and the errors, which are marked by red lines.

48 hours before the failure happened. Red lines denote errors.

In all 6 examples there are errors that happen several hours before the actual failure. So, they might be a good predictor.
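For reference, one such 48-hour window could be drawn roughly as follows (again assuming machineID and datetime columns in all three data frames; the original notebook may differ).

# Plot the 48 hours of telemetry before one sampled failure and mark
# every error of that machine within the window with a red vertical line.
failure = failures_df.sample(1, random_state=0).iloc[0]
machine_id = failure["machineID"]
window_start = failure["datetime"] - pd.Timedelta(hours=48)

window_df = telemetry_df[(telemetry_df["machineID"] == machine_id)
                         & (telemetry_df["datetime"] >= window_start)
                         & (telemetry_df["datetime"] <= failure["datetime"])]

fig, ax = plt.subplots(figsize=(18, 5))
for col in ["volt", "rotate", "pressure", "vibration"]:
    ax.plot(window_df["datetime"], window_df[col], label=col)

window_errors = errors_df[(errors_df["machineID"] == machine_id)
                          & (errors_df["datetime"] >= window_start)
                          & (errors_df["datetime"] <= failure["datetime"])]
for err_time in window_errors["datetime"]:
    ax.axvline(err_time, color="red")

ax.set_title("Machine {}, 48 hours before failure".format(machine_id))
ax.legend()
plt.show()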

Preparing the dataset

After exploring the dataset we have several assumptions:

  • We can combine the data for all machines and train one model.
  • Telemetry alone is not a good predictor, but it still might improve the model in some cases, so we calculate rolling statistics over a time window and use them during model training.
  • Errors seem to be a good predictor.
  • The number of failures is low compared to the number of normal functioning samples.

First of all we need to prepare a dataset for training. Preparation includes:

  • Pick all failure cases
  • Randomly sample 30000 normal cases
  • Subtract 12 hours (we predict a failure 12 hours ahead) and take a time window of 36 hours before that point.
  • Calculate the number of errors of each type during the defined time frame
  • Calculate telemetry statistics (min, max, std, mean) during the defined time frame
  • Split the dataset into train, test and validation

Let’s prepare the column names and a method for calculating statistics.

hours_ahead = 12
hours_lag = 36
tel_columns = ["volt", "rotate", "pressure", "vibration"]
error_columns = ["error1", "error2", "error3", "error4", "error5"]

col_names = []
for tel_c in tel_columns:
    col_names.extend([tel_c + "_min", tel_c + "_max", tel_c + "_std", tel_c + "_mean"])

for err_c in error_columns:
    col_names.append(err_c + "_sum")


def get_time_span_statistics(source_df, lag_start, lag_end):
    # Aggregate telemetry statistics and error counts over the given time span.
    lag_values_df = source_df.iloc[lag_start:lag_end]
    failure_record = []

    for col_name in tel_columns:
        failure_record.extend([lag_values_df[col_name].min(),
                               lag_values_df[col_name].max(),
                               lag_values_df[col_name].std(),
                               lag_values_df[col_name].mean()])

    for col_name in error_columns:
        failure_record.append(lag_values_df[col_name].sum())

    return failure_record

The next step is to calculate the feature values for all failure cases.

failure_records = []
failure_ranges = []

for f_index in failure_indexes:
    start_i = f_index - hours_ahead - hours_lag
    end_i = f_index - hours_ahead

    # Remember the rows around each failure so they are excluded from the normal samples.
    failure_ranges.extend(np.arange(f_index - hours_ahead - hours_lag, f_index + hours_ahead + hours_lag))
    failure_records.append(get_time_span_statistics(telemetry_df, start_i, end_i))

failure_records_df = pd.DataFrame(failure_records)
failure_records_df.columns = col_names
failure_records_df['is_error'] = True

Sample normal cases and combine them with failures.

normal_functioning_records = []
# Sample normal rows outside the failure windows collected above.
normal_functioning_indexes = telemetry_df.drop(failure_ranges).sample(30000).index

for n_index in normal_functioning_indexes:
    start_i = n_index - hours_ahead - hours_lag
    end_i = n_index - hours_ahead
    normal_functioning_records.append(get_time_span_statistics(telemetry_df, start_i, end_i))

normal_functioning_records_df = pd.DataFrame(normal_functioning_records)
normal_functioning_records_df.columns = col_names
normal_functioning_records_df['is_error'] = False

combined_df = pd.concat([failure_records_df,
                         normal_functioning_records_df], ignore_index=True)
# shuffle the data set
combined_df = combined_df.sample(len(combined_df))

Now we need to split the combined dataset into train, test and validation subsets.

split_mask = np.random.rand(len(combined_df)) < 0.7

x_df = combined_df.drop(['is_error'], axis=1)
y_df = combined_df['is_error']

x_train = x_df[split_mask]
y_train = y_df[split_mask]

x_test_validation = x_df[~split_mask]
y_test_validation = y_df[~split_mask]

split_mask = np.random.rand(len(x_test_validation)) < 0.5
x_validation = x_test_validation[split_mask]
y_validation = y_test_validation[split_mask]

x_test = x_test_validation[~split_mask]
y_test = y_test_validation[~split_mask]

As a result we have the following subsets:

  • Training set: total items = 21379, failure items = 474
  • Validation set: total items = 4599, failure items = 120
  • Test set: total items = 4741, failure items = 125

The subsets are imbalanced. We need to keep this in mind while picking the validation metrics. For example, accuracy alone is not sufficient; it is better to look at precision, recall and the F1 score.
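A quick check of the class balance makes the imbalance explicit:

# The positive (failure) class makes up only a few percent of each split.
for name, labels in [("train", y_train),
                     ("validation", y_validation),
                     ("test", y_test)]:
    n_failures = int(labels.sum())
    print("{}: total={}, failures={} ({:.1f}%)".format(
        name, len(labels), n_failures, 100.0 * n_failures / len(labels)))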

Training and evaluation

After preparing the dataset we are ready to start training. In this example we are going to train a gradient boosting model, which is quite a powerful ensemble algorithm that provides a non-linear classifier. For evaluation we use AUCPR (area under the precision-recall curve), which is better suited to imbalanced datasets.

from xgboost import XGBClassifier


model = XGBClassifier(max_depth=10, n_estimators=100, seed=0)
model.fit(
    x_train,
    y_train,
    eval_set=[(x_validation, y_validation)],
    early_stopping_rounds=10,
    eval_metric="aucpr"
)

The training reaches an AUCPR value of 0.98345 on the validation set, which is a very high result. To get the final metrics we need to evaluate the model on the test set, which has not been used during training.

The script below calculates precision, recall and the F-score, and builds the ROC curve and confusion matrix.

from sklearn import metrics


test_predictions = model.predict(x_test)
precision, recall, fscore, _ = metrics.precision_recall_fscore_support(y_test, test_predictions, average='weighted')

print("Precision {}, Recall {}, F-Score {}".format(precision, recall, fscore))
metrics.plot_roc_curve(model, x_test, y_test)
metrics.plot_confusion_matrix(model, x_test, y_test)

Final scores:

  • Precision = 0.99719
  • Recall = 0.99722
  • F-Score = 0.99719

ROC curve and confusion matrix

The evaluation on the test set shows that the model performs very well, which means that such a model could be used in production. It is still necessary to keep in mind that some devices might differ from the others, so it is recommended to verify and fine-tune the model on each machine before deploying it. However, model deployment is another important process that is not covered in this article.

To conclude, this blog post provides an example of training a model for predictive maintenance. It uses a real dataset that contains time-series telemetry values as well as information about errors and failures. The article illustrates the process of data exploration and preparation. Finally, it shows the training of a gradient boosting model and its evaluation.

Here you can access the notebook with the original scripts.
