Optimizing Scikit-learn models using Ray backend
Authors: Divyank Garg, Ajit Patankar, Sabyasachi Mukhopadhyay, Subhabrata Banerjee, Pooja Ayanile
Introduction
As described in the previous posts, Ray can be used to distribute model training across a cluster of machines. The previous example did not integrate sklearn with Ray. In this post, we describe a tight integration of sklearn and Ray and show how it improves training performance without sacrificing accuracy. The integration is achieved by specifying Ray as the backend for sklearn, and that is the specific use case we address here.
Case studies
In this use case, we build a classification model using a data set of approximately 20K rows with 10 features. We compare accuracy and model training time with and without Ray as the scikit-learn backend.
1. Evaluation of Scikit-learn Models without Ray
We used scikit-learn estimators such as SVC, AdaBoostClassifier, and RandomForestClassifier for model prediction and accuracy evaluation. Without Ray, the stand-alone scikit-learn package trains on a single core, so performance is limited by that core's capacity. The total time and accuracy for each model are shown in Table-1:
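The baseline setup can be sketched as follows. This is a minimal reproduction, not the original benchmark: it uses a smaller synthetic data set (via make_classification) in place of the post's ~20K-row data, so the timings will differ from Table-1.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the 10-feature data set used in the post
# (smaller than 20K rows so the sketch runs quickly).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = {}
for name, model in [
    ("SVC", SVC(gamma="auto", random_state=0)),
    ("AdaBoost", AdaBoostClassifier(random_state=0)),
    ("RandomForest", RandomForestClassifier(random_state=0)),
]:
    start = time.time()
    model.fit(X_train, y_train)  # single-core training by default
    accuracy = accuracy_score(y_test, model.predict(X_test))
    results[name] = (time.time() - start, accuracy)

for name, (elapsed, accuracy) in results.items():
    print(f"{name}: {elapsed:.2f}s, accuracy={accuracy:.3f}")
```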
Table-1 Time and accuracy result for individual scikit-learn model without using Ray
Note: Model training time for RandomForestClassifier can be reduced with the n_jobs parameter, which distributes the work across the available cores of a single machine.
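A short sketch of the n_jobs note above, on a small synthetic data set (the timings are illustrative; on a single-core machine the two runs will be similar):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# n_jobs=1 trains the trees sequentially; n_jobs=-1 uses every
# available core on this one machine.
single = RandomForestClassifier(n_estimators=200, n_jobs=1, random_state=0)
multi = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)

start = time.time()
single.fit(X, y)
t_single = time.time() - start

start = time.time()
multi.fit(X, y)
t_multi = time.time() - start

print(f"n_jobs=1: {t_single:.2f}s, n_jobs=-1: {t_multi:.2f}s")
```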
2. Evaluation of Scikit-learn Models using Ray Backend
Ray provides a backend that optimizes scikit-learn models. The backend is implemented for joblib using Ray actors and distributes scikit-learn programs from a single node across a Ray cluster. By spreading the work over all available cores in the cluster, it reduces computation time and, in some cases, may even improve accuracy.
To compare against the results without Ray, we trained the same models on the same data set, performing the same training actions with the Ray backend. Model training time and accuracy for the different models are shown below in Table-2:
Table-2 Time and accuracy result for individual scikit-learn model using Ray
Representative code is shown in the following code blocks.
Python implementation with the Ray backend
import time
import joblib
from ray.util.joblib import register_ray
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

register_ray()  # make 'ray' available as a joblib backend

with joblib.parallel_backend('ray'):
    st = time.time()
    model = SVC(gamma='auto', random_state=0)

    # train model
    model.fit(X_train, y_train)

    # predict
    y_train_predict = model.predict(X_train)
    y_cv_predict = model.predict(X_cv_subset)
    y_test_predict = model.predict(X_test_subset)

    # accuracy
    train_accuracy = accuracy_score(y_train, y_train_predict)
    cv_accuracy = accuracy_score(y_cv_subset, y_cv_predict)
    test_accuracy = accuracy_score(y_test_subset, y_test_predict)

    # macro F1 score
    train_f1_score_macro = f1_score(y_train, y_train_predict, average='macro')
    cv_f1_score_macro = f1_score(y_cv_subset, y_cv_predict, average='macro')
    test_f1_score_macro = f1_score(y_test_subset, y_test_predict, average='macro')

    et = time.time()
Note: The Ray joblib backend applies only to scikit-learn; it cannot be used for deep learning methods.
Comparison of Results
A comparison of the above results is shown in Figure-1 and Figure-2.
Fig-1 Comparing total time for individual model with Ray and without Ray
Figure-1 clearly shows that in all three cases the training time is much lower when the Ray backend is used. This speed-up comes from distributing the algorithm's tasks across the Ray cluster and the cores that comprise it. In the Random Forest algorithm, for example, each tree in the ensemble is trained independently, so training can be distributed across all available cores, achieving near-linear scaling. Other algorithms such as SVC are less amenable to such naive parallelization, so their performance improvement may be limited.
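The per-tree parallelism described above can be sketched as follows. Random Forest hands each tree to joblib, so swapping the joblib backend is all it takes to change where the trees are trained; this sketch uses joblib's local 'loky' backend as a single-machine stand-in for 'ray' (which requires a running Ray cluster).

```python
import time
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Each of the 100 trees is an independent fit, so joblib can fan them
# out across whatever workers the active backend provides.
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)

# With Ray installed and register_ray() called, 'loky' below could be
# replaced by 'ray' to fan the same trees out across a cluster instead.
with joblib.parallel_backend("loky"):
    start = time.time()
    model.fit(X, y)
    print(f"trained {len(model.estimators_)} trees "
          f"in {time.time() - start:.2f}s")
```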
Fig-2 Comparing validation accuracy for individual model with Ray and without Ray
Figure-2 shows that in all three cases the validation accuracy is marginally higher with Ray than without. This is possible because more computational power is available when the program is distributed across the cluster.
Table-3 Summary of improvements for individual models with and without Ray
Conclusion
The Ray backend optimizes the scikit-learn library well: it reduces training time, sometimes increases validation accuracy, and almost never decreases it. The benchmarks for a sample data set are summarized in Table-3. Note that part of the performance improvement comes simply from using the cluster effectively, while further improvement comes from the tight integration of sklearn with Ray as a backend.
In upcoming blogs we will describe deep learning models trained with RaySGD, which parallelizes deep learning training using distributed TensorFlow.
Previous Blog: https://medium.com/juniper-team/model-selection-using-ray-c712febd1252