# Machine Learning for Prediction in Hydraulic Fracturing

## Decision Tree to Predict Number of Perf Clusters per Stage

Unconventional shale production now accounts for a major share of the worldwide oil and gas economy, led by major producers such as the USA, Canada, and Russia. In the production phase, one of the most crucial activities is perforating the reservoir from the wellbore. The reservoir, which we encounter as solid rock, needs to be perforated to provide artificial channels (or permeability) through which oil or gas can flow to the wellbore. Later, proppant is pumped to open these channels further, a process known as hydraulic fracturing.

Perforating a reservoir from a wellbore is not easy. For instance, we need to decide how many clusters per stage are needed. What are clusters and stages? A cluster is a set of perforations shot into the casing and repeated at a set spacing along the well. A group of clusters then forms a stage, which is stimulated with proppant. One stage may consist of as few as 1 to more than 15 clusters.

Usually, engineers design the perforation using numerical modeling software, which relies on mathematical models such as finite element models. However, we live in a data-driven world, so we can instead use data to build a data-driven model with Machine Learning.

The objective of this article is to show how to make a predictive model that predicts the number of clusters per stage needed to perforate our reservoir, given the data about the reservoir rock, reservoir fluid, target depth, and so on.

This article is accompanied by source code in a Jupyter notebook for reproducibility. Access it here or scroll down in the viewer below.

# Overview of Dataset

The dataset is obtained from the SPE data repository. One must have an SPE account to access this repository; however, we have already pre-processed this data into a much cleaner version here.

Thank you Sebastien Matringe for granting permission to use the data for this article! Sebastien is one of the committee members at BERG/SPE that owned this repository.

This dataset consists of 53 rows representing individual leases (or wells) and 27 columns. Two columns (the name of the lease, `Lease`, and the name of the formation, `Formation/Reservoir`) are categorical, and the other 25 are numerical variables that represent information about reservoir properties, fluid properties, and perforation geometry. Our target is the number of clusters per stage, `# Clusters per Stage`.

On average, the leases have 6 clusters per stage. The smallest and largest numbers of clusters per stage are 3 and 15. Sorting the leases from the largest to the smallest number of clusters per stage gives the following Top 10: 4 leases (Cardinal, Hawk, Falcon, and Crow) with 15 clusters per stage and 6 leases (Lark, Jay, Osprey, Sparrow, Swift, and Kite) with 9 clusters per stage.
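As a rough illustration, this ranking can be reproduced with pandas. The sketch below uses only the ten leases and values quoted above; the small DataFrame is a stand-in for the full 53-row table, and the variable names are hypothetical.

```python
import pandas as pd

# Ten leases and their clusters per stage, as quoted in the text above.
# The real table has 53 rows; this small frame is only for illustration.
top = pd.DataFrame({
    "Lease": ["Lark", "Cardinal", "Jay", "Hawk", "Osprey",
              "Falcon", "Sparrow", "Crow", "Swift", "Kite"],
    "# Clusters per Stage": [9, 15, 9, 15, 9, 15, 9, 15, 9, 9],
})

# Sort from the largest to the smallest number of clusters per stage.
ranked = top.sort_values("# Clusters per Stage", ascending=False, kind="stable")
print(ranked.to_string(index=False))

# describe() is one way to get mean/min/max; here it only summarizes
# these ten rows, not the full dataset.
print(ranked["# Clusters per Stage"].describe())
```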

# Feature Selection

There are too many features to be used for our predictive model. Feature selection is required to reduce the number of features. We use a heatmap plot of the correlation coefficient to analyze which features have a strong enough correlation to our target, `# Clusters per Stage`.

There are 2 redundant features that we will not use here: `# Clusters` and `# Stages`. Our target is calculated from these 2 features. We then spot 2 other redundant features that have a very strong correlation with each other: `Reservoir Temperature (deg F)` and `Sandface Temp (deg F)`. In oilfield terms, the sandface is defined as the interface between the reservoir and the wellbore. We will not use the sandface feature.
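A minimal sketch of this clean-up step is shown here. The column names follow the article, but the three rows are made up, and computing the target as `# Clusters` divided by `# Stages` is an assumption about how it was derived.

```python
import pandas as pd

# Hypothetical rows; column names follow the article, values are made up.
df = pd.DataFrame({
    "# Clusters": [120, 90, 60],
    "# Stages": [8, 10, 20],
    "Reservoir Temperature (deg F)": [250.0, 300.0, 180.0],
    "Sandface Temp (deg F)": [251.0, 299.0, 181.0],
})

# Assumed derivation of the target: total clusters divided by total stages.
df["# Clusters per Stage"] = df["# Clusters"] / df["# Stages"]

# The two temperature columns carry nearly the same information.
r_temp = df["Reservoir Temperature (deg F)"].corr(df["Sandface Temp (deg F)"])
print(f"temperature correlation: {r_temp:.3f}")

# Drop the columns that leak the target and one of the two temperature columns.
features = df.drop(columns=["# Clusters", "# Stages", "Sandface Temp (deg F)"])
print(features)
```

Keeping `# Clusters` and `# Stages` as inputs would let the model reconstruct the target directly, which is why they are removed before any correlation analysis.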

Which ones will we use as features? And how strong is a strong correlation? We have very few observations, only 53. To answer these questions, we can use a statistical test called the **t-test**.

Find a good intro video about t-test here.

Here, the t-test is applied to determine the **critical value of the correlation coefficient** for a given number of observations N. We found that for 53 observations and a 5% significance level (95% confidence), the critical value of the correlation coefficient is 0.27. Therefore, a correlation between two variables is considered statistically significant only if its absolute value is above 0.27.

## Rcrit = +/- 0.27
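The critical value can be checked with a few lines of SciPy. This is a sketch that assumes a two-tailed test (the article does not say which tail was used) and inverts the usual statistic `t = r * sqrt(N - 2) / sqrt(1 - r^2)` to get `r = t / sqrt(N - 2 + t^2)`. The function name `critical_r` is hypothetical.

```python
import math
from scipy import stats

def critical_r(n, alpha=0.05):
    """Smallest |r| that is significant in a two-tailed t-test with n samples."""
    dof = n - 2                                # degrees of freedom for Pearson r
    t_crit = stats.t.ppf(1 - alpha / 2, dof)   # two-tailed critical t value
    return t_crit / math.sqrt(dof + t_crit ** 2)

r_crit = critical_r(53)
print(round(r_crit, 2))   # 0.27, matching the value quoted above
```

With fewer observations the threshold rises quickly, which is why a fixed rule of thumb for "strong" correlation is less useful on a dataset this small.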

Back to the correlation heatmap: features whose correlation with `# Clusters per Stage` is below 0.27 are cast out. This reduces the feature set from 25 to 12 features.
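The filtering step might look like the following sketch. The data is synthetic (one informative column and one noise column, both hypothetical), and comparing the absolute value of the correlation against the threshold is an assumption, made because negatively correlated features such as gas saturation are kept in the article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 53
target = "# Clusters per Stage"

# Synthetic stand-in data: one informative column, one pure-noise column.
signal = rng.normal(size=n)
df = pd.DataFrame({
    "informative": signal,
    "noise": rng.normal(size=n),
    target: 2.0 * signal + rng.normal(scale=0.5, size=n),
})

r_crit = 0.27  # critical value quoted in the article for 53 observations

# Correlation of every feature with the target, target itself excluded.
corr_with_target = df.corr()[target].drop(target)

# Keep features whose absolute correlation reaches the critical value.
selected = corr_with_target[corr_with_target.abs() >= r_crit].index.tolist()
print(corr_with_target.round(2))
print(selected)
```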

One new feature is added from a **categorical variable**, `Formation/Reservoir`, which is encoded as 0 for Bossier Shale, 1 for Eagle Ford, 2 for Haynesville Shale, 3 for Marcellus, and 4 for Upper Marcellus.
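One way to obtain this mapping, though not necessarily the one used in the notebook, is scikit-learn's `LabelEncoder`, which assigns integer codes in sorted order of the labels. The list of formations below is a made-up sample.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of formation names; the five distinct names come from the article.
formations = ["Marcellus", "Eagle Ford", "Bossier Shale",
              "Upper Marcellus", "Haynesville Shale", "Eagle Ford"]

# LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels.
encoder = LabelEncoder()
codes = encoder.fit_transform(formations)

mapping = {name: int(code) for name, code in
           zip(encoder.classes_, encoder.transform(encoder.classes_))}
print(mapping)
```

These integer codes impose an arbitrary ordering on the formations. A tree can still split on them, but the ordering itself carries no physical meaning.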

Thus, there are in total 13 features.

Oil saturation has a positive correlation with the number of clusters per stage, whereas gas saturation has a negative correlation. Therefore, we could say that shale oil reservoirs tend to have more clusters per stage, whereas shale gas reservoirs tend to have fewer. Net pay also has a negative correlation: the thinner the pay zone, the more clusters per stage we perforate.

# Machine Learning Strategy for Small Datasets

The predictive model (or regressor) that we use is a **Decision Tree**. A Decision Tree does regression by repeatedly dividing the input space into two parts (a binary tree). At each step, all input variables and all possible split points are evaluated, and the split with the lowest cost is selected in a greedy manner (i.e., the very best split point is chosen each time).
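To make the greedy search concrete, here is a minimal sketch of a single split selection in NumPy. It assumes the cost is the sum of squared errors around each side's mean, which is equivalent to the squared-error criterion commonly used for regression trees; the function name `best_split` and the toy data are hypothetical, and this is not scikit-learn's actual implementation.

```python
import numpy as np

def best_split(X, y):
    """Return (feature index, threshold, cost) of the lowest-SSE binary split."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):                 # every input variable
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for thr in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
            # Cost = sum of squared errors around each side's mean prediction.
            cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if cost < best[2]:
                best = (j, thr, cost)
    return best

# Toy data: the target depends only on whether feature 0 exceeds 0.6.
X = np.array([[0.55, 100.0], [0.58, 200.0], [0.62, 150.0], [0.66, 120.0], [0.70, 180.0]])
y = np.array([4.0, 4.0, 9.0, 9.0, 9.0])

feature, threshold, cost = best_split(X, y)
print(feature, threshold, cost)
```

A full tree applies the same search recursively to each resulting subset until a stopping rule such as a depth limit or a minimum leaf size is reached.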

The data is considered a small dataset because it has only 53 observations. For small datasets, there is a risk of overfitting. Here, a specific strategy for cross-validation, called **Leave-One-Out Cross-Validation (LOOCV)**, is used. LOOCV holds out one observation as the validation instance, trains the model on the remaining observations, and repeats this for every observation. LOOCV can also be seen as K-Fold Cross-Validation with the number of folds equal to the number of observations.
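A minimal LOOCV sketch with scikit-learn follows. The features and target are synthetic stand-ins with the same number of rows as the dataset, and `max_depth=3` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(size=(53, 3))                       # synthetic stand-in features
y = 10 * X[:, 0] + rng.normal(scale=0.5, size=53)   # synthetic stand-in target

loo = LeaveOneOut()
model = DecisionTreeRegressor(max_depth=3, random_state=0)  # depth is arbitrary here

# scikit-learn maximizes scores, so MSE is reported with a negative sign.
scores = cross_val_score(model, X, y, cv=loo, scoring="neg_mean_squared_error")

print(loo.get_n_splits(X))   # one fold per observation
print(-scores.mean())        # LOOCV estimate of the MSE
```

Each fold's score is the squared error on a single held-out observation, so individual scores are noisy and only their average is meaningful.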

We also use **validation curves** to evaluate which combination of **hyperparameters** gives us the best model, one that is neither overfitting nor underfitting. The two hyperparameters of the decision tree studied here are the maximum depth of the tree, `max_depth`, and the minimum number of samples required at a leaf node, `min_samples_leaf`.

Below are the validation curves of `max_depth` and `min_samples_leaf` with Mean Squared Error (MSE) as the loss function. The MSE is calculated by averaging the scores from LOOCV. In the left graph, the validation (or test) error stabilizes at a `max_depth` of 5, where the training error is also low. In the right graph, the model overfits at small values of `min_samples_leaf`, where the training error is low but the validation error is high. The validation error reaches its minimum at a `min_samples_leaf` of 4. Beyond that, both the training and validation errors shoot up, so the model starts to underfit.
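Curves of this kind can be computed with scikit-learn's `validation_curve`. The sketch below sweeps `max_depth` on synthetic stand-in data, so the numbers it prints will not match the article's figures; the same call with `param_name="min_samples_leaf"` would give the second curve.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, validation_curve
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(size=(53, 3))                       # synthetic stand-in features
y = 10 * X[:, 0] + rng.normal(scale=0.5, size=53)   # synthetic stand-in target

depths = np.arange(1, 9)
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths,
    cv=LeaveOneOut(), scoring="neg_mean_squared_error",
)

# Average over the LOOCV folds and flip the sign to get MSE.
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

for d, tr, va in zip(depths, train_mse, val_mse):
    print(f"max_depth={d}: train MSE={tr:.3f}, validation MSE={va:.3f}")

best_depth = int(depths[np.argmin(val_mse)])
print("depth with lowest validation MSE:", best_depth)
```

A widening gap between the training and validation MSE as the depth grows is the overfitting signature described above.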

Therefore, the combination most likely to give us the best Decision Tree model is a `max_depth` of 5 and a `min_samples_leaf` of 4. We fit this model to the 53 observations.

# Prediction

The model is ready for our inputs. This model is useful to recommend the number of perforation clusters per stage given reservoir information. Given below is an example of such a case.

The company asks how many perf clusters per stage they need. We have already made a program that uses the above Decision Tree model to predict the number of clusters per stage. Check out the notebook (link above) to use and run this program. Using this program gives us an answer of **7 clusters per stage**.
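The notebook contains the actual program. As a rough sketch of what such a program does, the code below fits a tree with the chosen hyperparameters and predicts for one new lease. All data here is randomly generated, so the printed number is not the article's answer.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(53, 13))                   # 53 leases x 13 features, synthetic
y = rng.integers(3, 16, size=53).astype(float)   # clusters per stage in [3, 15]

# Hyperparameters chosen from the validation curves above.
model = DecisionTreeRegressor(max_depth=5, min_samples_leaf=4, random_state=0)
model.fit(X, y)

new_lease = rng.uniform(size=(1, 13))  # hypothetical reservoir description
prediction = model.predict(new_lease)[0]

# The tree returns a leaf average, so it is rounded to a whole number of clusters.
print(round(prediction))
```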

# Model Interpretability

Up to this stage, we still cannot see how the model makes its predictions. To explain our model, we can **visualize the binary tree**. Below is the visualization of our tree. Five descriptive features stand out in the decision nodes: Gas Specific Gravity, Gas Saturation, Lateral Length, Net Pay, and Bottom of Perforation.
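A fitted tree can also be inspected as plain text with scikit-learn's `export_text` (`plot_tree` is the graphical counterpart). The sketch below trains a tiny tree on synthetic data; the feature names are borrowed from the article, but the values and the 0.631 cutoff used to build the target are only illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
names = ["Gas Specific Gravity", "Net Pay (ft)"]   # illustrative feature names
X = np.column_stack([rng.uniform(0.55, 0.70, 53), rng.uniform(50, 300, 53)])
y = np.where(X[:, 0] <= 0.631, 5.0, 12.0)          # synthetic target

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=4, random_state=0).fit(X, y)

# Text rendering of the fitted tree: one line per decision node or leaf.
rules = export_text(tree, feature_names=names)
print(rules)
```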

We could also plot the **decision space**. Note that a Decision Tree produces discrete predictions (one constant value per leaf), rather than the continuous predictions of a model such as SVM. Taking inputs similar to the above example case but varying two variables, Gas Specific Gravity and Net Pay, produces the following decision partition plot:

Our Decision Tree model produces 4 regions, each represented by a different color on the color bar:

- 3.867 (or 4) clusters per stage — dark blue
- 5.83 (or 6) clusters per stage — orange
- 9 clusters per stage — purple
- 15 clusters per stage — light blue
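A decision space like this can be computed by evaluating the model on a grid of the two varied features. The sketch below does so for a small tree trained on synthetic data (the feature ranges and cutoffs are made up) and shows that the predicted surface contains only a handful of distinct values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
# Two synthetic features standing in for Gas Specific Gravity and Net Pay (ft).
X = np.column_stack([rng.uniform(0.55, 0.70, 53), rng.uniform(50, 300, 53)])
y = np.where(X[:, 0] <= 0.631, 5.0, 12.0) + np.where(X[:, 1] <= 142.5, -1.0, 1.0)

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=4, random_state=0).fit(X, y)

# Evaluate the model on a regular grid spanning both features.
gravity, net_pay = np.meshgrid(np.linspace(0.55, 0.70, 100), np.linspace(50, 300, 100))
grid = np.column_stack([gravity.ravel(), net_pay.ravel()])
surface = tree.predict(grid).reshape(gravity.shape)

# A tree predicts one constant per leaf, so the surface has few distinct values.
levels = np.unique(surface)
print(levels)
```

Passing `gravity`, `net_pay`, and `surface` to a filled-contour routine such as matplotlib's `contourf` would give a colored partition plot, with one color per entry in `levels`.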

If we look at the visualized binary tree above, each of these regions is produced by a chain of decisions, as follows:

- Gas Specific Gravity ≤ 0.631 → Gas Specific Gravity ≤ 0.575 → Net Pay ≤ 142.5 ft → 3.867 clusters per stage
- Gas Specific Gravity ≤ 0.631 → Gas Specific Gravity ≤ 0.575 → Gas Saturation ≤ 0.8 → 5.83 clusters per stage
- Gas Specific Gravity ≤ 0.631 → Bottom Perforation ≤ 16,212.5 ft → 9 clusters per stage
- Gas Specific Gravity ≤ 0.631 → Bottom Perforation >16,212.5 ft → 15 clusters per stage
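Chains like these can be read programmatically from a fitted scikit-learn tree with `decision_path`. The following sketch uses a small tree trained on synthetic data; the helper `decision_chain` is hypothetical, and its output will not reproduce the thresholds listed above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
names = ["Gas Specific Gravity", "Net Pay (ft)"]   # illustrative feature names
X = np.column_stack([rng.uniform(0.55, 0.70, 53), rng.uniform(50, 300, 53)])
y = np.where(X[:, 0] <= 0.631, 5.0, 12.0)          # synthetic target

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=4, random_state=0).fit(X, y)

def decision_chain(model, sample, feature_names):
    """List the tests a single sample passes through, ending with the leaf value."""
    t = model.tree_
    node_ids = model.decision_path(sample.reshape(1, -1)).indices
    steps = []
    for node in node_ids:
        if t.children_left[node] == -1:            # leaf: report its prediction
            steps.append(f"{t.value[node][0][0]:.3f} clusters per stage")
        else:                                      # internal node: report the test outcome
            name, thr = feature_names[t.feature[node]], t.threshold[node]
            op = "<=" if sample[t.feature[node]] <= thr else ">"
            steps.append(f"{name} {op} {thr:.3f}")
    return " -> ".join(steps)

chain = decision_chain(tree, np.array([0.60, 120.0]), names)
print(chain)
```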

Quod erat demonstrandum.

# Conclusion

We have successfully implemented machine learning to predict the number of perforation clusters per stage in a hydraulic fracturing operation. We use a small number of observations, from 53 shale oil and gas leases, with originally 25 features. Through feature selection, we have reduced these 25 features to only 12, plus 1 categorical feature encoded from the Formation Name column. We build a Decision Tree model and use Leave-One-Out Cross-Validation (LOOCV) and validation curves to find the best combination of two hyperparameters that avoids overfitting. We fit this model to the 53 observations and predict from a new input given by the user. We have shown a sample case of a company that needs to know how many clusters per stage to use for their lease. To explain how the model works “behind the curtain”, we visualize the binary decision tree and the decision space. We found that at least 5 features stand out as making a difference in the decisions.

It is clear that we need many more observations to build a more reliable and robust predictive model. If more data similar to this SPE Data Repository were made public, further research could be carried out.

# Data Reference

SPE Data Repository: Data Set: 1, Well Number: All Wells. From URL: https://www.spe.org/datasets/dataset_1/csv_files/dataset_1_all_wells/well_data
