# Partial Least Squares: MATLAB, R and Python codes. All you have to do is prepare the data set (very simple, easy and practical)

I have released MATLAB, R and Python codes for Partial Least Squares (PLS).

You can buy each code from the URLs below.

#### MATLAB

https://gum.co/nVse
Please download the supplemental zip file (this is free) from the URL below to run the PLS code.
http://univprofblog.html.xdomain.jp/code/MATLAB_scripts_functions.zip

#### R

https://gum.co/lfaBt
Please download the supplemental zip file (this is free) from the URL below to run the PLS code.
http://univprofblog.html.xdomain.jp/code/R_scripts_functions.zip

#### Python

https://gum.co/eMrX
Please download the supplemental zip file (this is free) from the URL below to run the PLS code.
http://univprofblog.html.xdomain.jp/code/supportingfunctions.zip

### Procedure of PLS in the MATLAB, R and Python codes

To perform PLS appropriately, the MATLAB, R and Python codes follow the procedure below after the data set is loaded.

1. Autoscale the objective variable (Y) and the explanatory variables (X)
Autoscaling means centering and scaling. In centering, the mean of each variable is subtracted from that variable, so that its mean becomes zero. In scaling, each variable is divided by its standard deviation, so that its standard deviation becomes one.
Scaling is optional (but recommended), whereas centering is required since PLS is based on rotation of the axes.
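The autoscaling in step 1 could look roughly like this in Python. This is only a minimal sketch with NumPy, not the released code: the function name `autoscale`, the small example matrix and the use of the sample standard deviation (`ddof=1`) are assumptions made for illustration.

```python
import numpy as np

def autoscale(data):
    """Center each column to mean 0 and scale it to standard deviation 1."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    # Sample standard deviation (ddof=1) is an assumption of this sketch.
    # A column with zero variance would cause a division by zero here.
    std = data.std(axis=0, ddof=1)
    return (data - mean) / std, mean, std

# Made-up example: 4 samples, 2 X-variables
X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [3.0, 20.0],
              [4.0, 40.0]])
X_scaled, x_mean, x_std = autoscale(X)
```

The mean and standard deviation are returned together with the scaled data because the same values are needed again when new samples are predicted (steps 8 and 10).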

2. Estimate Y with cross-validation (CV), changing the number of components from 1 to m
Leave-one-out CV is well known, but it tends to cause overfitting when the number of training samples is large, so 5-fold or 2-fold CV is better. First, the training samples are divided into 5 or 2 groups. Second, one group is treated as test samples and a model is built with the other group(s). This is repeated 5 or 2 times, until every group has been treated as test samples. In this way, every sample gets an estimated Y from a model that was built without it, as opposed to a calculated Y from a model fitted to that same sample.
m cannot exceed the number of X-variables, and a maximum of m = 30 is sufficient in practice.

3. Calculate the root-mean-squared error (RMSE) between actual Y and estimated Y for each number of components

4. Decide the optimal number of components as the one with the minimum RMSE value
It is also acceptable to choose the number of components at the first local minimum of the RMSE.
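Steps 2 to 4 can be sketched as follows. This is an illustration under several assumptions, not the released code: it uses a small NumPy-only NIPALS implementation of PLS for a single Y variable, synthetic data, and autoscaling inside each fold (the procedure above autoscales once in step 1). The function names `pls1_coefficients` and `cv_estimate` are hypothetical.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """NIPALS PLS1 on autoscaled X (n x p) and y (n,); returns p coefficients."""
    X, y = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = X.T @ y                    # direction of maximum covariance with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:               # nothing left in X that covaries with y
            break
        w = w / norm
        t = X @ w                      # score vector
        p = X.T @ t / (t @ t)          # X loading
        c = (y @ t) / (t @ t)          # y loading
        X = X - np.outer(t, p)         # deflate X
        y = y - c * t                  # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def cv_estimate(X, y, n_components, n_folds=5, seed=0):
    """Estimated Y from k-fold CV: each sample is predicted by a model built without it."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    y_cv = np.empty(n)
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        x_mean, x_std = X[train].mean(axis=0), X[train].std(axis=0, ddof=1)
        y_mean, y_std = y[train].mean(), y[train].std(ddof=1)
        b = pls1_coefficients((X[train] - x_mean) / x_std,
                              (y[train] - y_mean) / y_std, n_components)
        y_cv[test] = ((X[test] - x_mean) / x_std) @ b * y_std + y_mean
    return y_cv

# Synthetic data, made up for this sketch: 60 samples, 8 X-variables
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=60)

max_components = min(X.shape[1], 30)
rmse_cv = np.array([np.sqrt(np.mean((y - cv_estimate(X, y, m)) ** 2))
                    for m in range(1, max_components + 1)])

best_by_minimum = int(np.argmin(rmse_cv)) + 1
# First local minimum: the first number of components after which RMSE goes up
increases = np.flatnonzero(np.diff(rmse_cv) > 0)
best_by_first_local_min = int(increases[0]) + 1 if increases.size else max_components
```

The two selection rules often agree. When they do not, the first local minimum gives the smaller model.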

5. Construct the PLS model with the optimal number of components and get the standard regression coefficients

6. Calculate the coefficient of determination and RMSE between actual Y and calculated Y (r2C and RMSEC), and the coefficient of determination and RMSE between actual Y and estimated Y (r2CV and RMSECV)
r2C is the proportion of the information in Y that the PLS model can explain.
RMSEC is the average Y error of the PLS model.
r2CV is the proportion of the information in Y that the PLS model can be expected to estimate for new samples.
RMSECV is the average Y error that can be expected for new samples.
Better PLS models have higher r2CV values and lower RMSECV values.
A large difference between r2C and r2CV, or between RMSEC and RMSECV, means that the PLS model is overfitted to the training samples.

*Caution! r2CV and RMSECV cannot represent the true predictive ability of the PLS model, since they come from CV, not from external validation.
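The statistics in step 6 can be computed as in the sketch below. The helper name `r2_and_rmse` and all Y values are made up for illustration. The same function gives r2C and RMSEC when it receives calculated Y, and r2CV and RMSECV when it receives Y estimated by CV.

```python
import numpy as np

def r2_and_rmse(y_actual, y_model):
    """Coefficient of determination and root-mean-squared error."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_model = np.asarray(y_model, dtype=float)
    residual = y_actual - y_model
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((y_actual - y_actual.mean()) ** 2)
    rmse = np.sqrt(np.mean(residual ** 2))
    return r2, rmse

# Made-up values for illustration only
y_actual = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_calculated = np.array([1.1, 1.9, 3.0, 4.1, 4.9])  # from the model fitted to all samples
y_estimated = np.array([1.4, 1.6, 3.3, 4.4, 4.5])   # from cross-validation

r2c, rmsec = r2_and_rmse(y_actual, y_calculated)
r2cv, rmsecv = r2_and_rmse(y_actual, y_estimated)
print(f"r2C={r2c:.3f} RMSEC={rmsec:.3f} r2CV={r2cv:.3f} RMSECV={rmsecv:.3f}")
```

In this made-up example r2CV is lower than r2C and RMSECV is higher than RMSEC, which is the usual direction of the gap.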

7. Check the plots of actual Y versus calculated Y, and of actual Y versus estimated Y
Outliers in the calculated and estimated values can be checked.

8. For prediction on new samples, subtract the means used in the autoscaling of X in step 1 from the X-variables, and then divide the X-variables by the standard deviations used in the autoscaling of X in step 1

9. Estimate Y using the standard regression coefficients from step 5

10. Multiply the estimated Y by the standard deviation used in the autoscaling of Y in step 1, and then add the mean used in the autoscaling of Y in step 1
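Steps 8 to 10 amount to three lines of arithmetic. The sketch below assumes the means, standard deviations and standard regression coefficients were stored during steps 1 and 5. All numbers and the function name `predict_new` are made up for illustration.

```python
import numpy as np

# Values that would come from steps 1 and 5 (made-up numbers)
x_mean = np.array([2.0, 10.0])
x_std = np.array([0.5, 4.0])
y_mean, y_std = 50.0, 8.0
standard_regression_coefficient = np.array([0.6, -0.3])

def predict_new(X_new):
    """Predict Y in original units for new samples."""
    X_scaled = (np.asarray(X_new, dtype=float) - x_mean) / x_std  # step 8
    y_scaled = X_scaled @ standard_regression_coefficient         # step 9
    return y_scaled * y_std + y_mean                              # step 10

X_new = np.array([[2.0, 10.0],
                  [2.5, 6.0]])
y_new = predict_new(X_new)
```

The important point is that new samples are scaled with the training mean and standard deviation, not with statistics computed from the new samples themselves.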

### How can I perform PLS?

#### 1. Buy the code and unzip the file

MATLAB: https://gum.co/nVse

R: https://gum.co/lfaBt

Python: https://gum.co/eMrX

#### 4. Prepare the data set. For the data format, see the article below.

https://medium.com/@univprofblog1/data-format-for-matlab-r-and-python-codes-of-data-analysis-and-sample-data-set-9b0f845b565a#.3ibrphs4h

#### 5. Run the code!

The estimated values of Y for “data_prediction2.csv” are saved in “PredictedY2.csv”.