Traditional AI — Start with Scikit-Learn: User Guide | by Heting Li | May, 2024

5 min read · May 18, 2024

1. Introduction

Following the tutorial Traditional AI — Use Case is All You Need!, you may be curious about which Traditional AI models and tools you can apply to your own use cases. This tutorial gives an overview of Scikit-Learn’s features, based on its User Guide, along with practical examples. By the end, we hope you will have an intuitive understanding of the major models and tools available for building your own AI use cases.

2. Scikit-Learn Feature Overview

On the Scikit-Learn main page, six key categories of features are outlined, which fall into two main areas: Modeling and Data Processing.

Modeling

Classification: Models and algorithms that identify which category an object belongs to (e.g., anomaly detection, image recognition).
Regression: Models and algorithms that predict a continuous-valued attribute associated with an object (e.g., consumption forecasts, statistical prediction, stock prices).
Clustering: Models and algorithms that automatically group similar objects into sets (e.g., statistical segmentation, grouping experiment outcomes).
Model Selection: Tools for comparing, validating, and choosing parameters and models.
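To make the modeling categories above concrete, here is a minimal sketch of one estimator from each. The dataset choices (`load_digits`, `load_diabetes`) and all parameter values are assumptions made for illustration; they are not taken from the Scikit-Learn main page.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_diabetes, load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Classification: predict which digit (0-9) each 8x8 image shows.
X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_digits, y_digits, test_size=0.25, random_state=0
)
classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(X_train, y_train)
accuracy = classifier.score(X_test, y_test)  # mean accuracy on held-out data

# Regression: predict a continuous disease-progression score.
X_diab, y_diab = load_diabetes(return_X_y=True)
regressor = Lasso(alpha=0.1)  # alpha is an illustrative value, not a tuned one
regressor.fit(X_diab, y_diab)
predicted_scores = regressor.predict(X_diab[:5])

# Clustering: group the digit images without looking at the labels.
clusterer = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(X_digits)

print(f"classification accuracy: {accuracy:.2f}")
print("regression predictions:", predicted_scores.round(1))
print("cluster ids of first ten images:", cluster_ids[:10])
```

All three estimators follow the same pattern: create the object, call `fit`, then call `predict` (or `fit_predict` for clustering). Model Selection tools such as `train_test_split` sit around that pattern to check how well the fitted model generalizes.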

Data Processing

Preprocessing: Feature extraction and normalization.
Dimensionality Reduction: Reducing the number of random variables (features) to consider in preprocessing and modeling.
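The two data processing categories can be sketched together as follows. This is only an illustration under assumed choices: the digits dataset, `StandardScaler` for normalization, and `PCA` with two components for dimensionality reduction.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)  # one row per image, 64 pixel features each

# Preprocessing: rescale every feature to zero mean and unit variance.
# Dimensionality reduction: keep only the two strongest principal components.
reducer = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = reducer.fit_transform(X)

# make_pipeline names each step after its lowercased class name.
explained = reducer.named_steps["pca"].explained_variance_ratio_
print("before:", X.shape, "after:", X_reduced.shape)
print("share of variance kept by 2 components:", explained.sum().round(3))
```

Two components are convenient for plotting but discard most of the variance in this dataset; in practice, `n_components` is a parameter to experiment with.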

Clicking a feature, for example “Classification”, leads to the corresponding User Guide page, while clicking the “Examples” button leads to practical examples and demos for that feature.

As you may remember, our first tutorial, ‘Let’s Begin! — Showcase your AI demo with Two Clicks!’, is actually one of the “Examples” for the “Classification” feature.

3. Scikit-Learn User Guide Tour by an Example

Instead of going through the detailed User Guide, let’s use an example to demonstrate how to utilize the major components of Scikit-Learn. Hopefully it will provide practical experience, helping us explore the User Guide further based on our own use case requirements in the future.

Let us continue using the dataset from “Handwritten Digits Recognition”, introduced in the first tutorial ‘Let’s Begin! — Showcase your AI demo with Two Clicks!’. This time, instead of performing “Classification”, we will practice with the “Clustering” feature. Here, clustering means grouping the images by the distinct characters represented in the dataset, without considering the actual digit values.

Let’s click “Examples” under the “Clustering” section and open the first example: A demo of K-Means clustering on the handwritten digits data. We can run the demo either online in JupyterLite or in our local development environment, which we set up in Traditional AI Development Environment Setup.

The notebook includes detailed comments that clearly explain the code. The overall modeling flow consists of “Load Data”, “Dimension Reduction by PCA”, “Data Standardization”, and “Clustering by K-Means”, ultimately segmenting the 1,797 handwritten digit images into 10 categories. Additional illustrations of the process are provided in the Appendix as optional reading material.
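The flow described above can be condensed into a short sketch. This is not the demo’s exact code: the parameter values (`n_init=4`, `random_state=0`) and the use of `homogeneity_score` as a quick quality check are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import homogeneity_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load Data: the labels are kept only to check the clusters afterwards.
data, labels = load_digits(return_X_y=True)
n_samples, n_features = data.shape

# Data Standardization + Clustering by K-Means, chained in one pipeline.
kmeans = KMeans(n_clusters=10, init="k-means++", n_init=4, random_state=0)
clustering = make_pipeline(StandardScaler(), kmeans)
cluster_ids = clustering.fit_predict(data)

# Dimension Reduction by PCA: project to 2-D, for example for plotting.
points_2d = PCA(n_components=2).fit_transform(data)

# Homogeneity is 1.0 when every cluster contains images of a single digit.
homogeneity = homogeneity_score(labels, cluster_ids)
print(f"samples: {n_samples}, features: {n_features}")
print(f"homogeneity vs. true digits: {homogeneity:.3f}")
```

Note that K-Means never sees `labels`; the cluster ids it returns are arbitrary group numbers, so cluster 3 does not necessarily contain the digit 3.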

The detailed user guide for the above tools and algorithms can be accessed via the User Guide hyperlink highlighted on the main page.

Detailed API usage can be reviewed by clicking the library hyperlink or browsing the “API” pages.

4. Scikit-Learn User Guide Highlights

After experimenting with the “Clustering” demo, we have an initial understanding of how the Scikit-Learn User Guide connects to practical AI model development. Now, let’s zoom out for an overview of the User Guide to see how we can use it for our own AI use case development.

The Scikit-Learn User Guide includes 11 chapters, which can be categorized into four areas:

These four areas correspond to the AI development process: Data Collection, Data Pre-Processing, Model Selection, Model Evaluation and Tuning and Engineering, as we discussed in the conclusion of the first tutorial ‘Let’s Begin! — Showcase your AI demo with Two Clicks!’.

5. Scikit-Learn Model Selection

Scikit-Learn supports a large number of models and algorithms, each of which requires extensive reading, practice, and experimentation. However, we can start with a few popular ones that apply to many use cases, for example Random Forest for Classification, K-Means for Clustering, and Lasso for Regression. Furthermore, Scikit-Learn provides a cheat-sheet, Choosing the right estimator, to guide the model selection process. The cheat-sheet does not cover all models, algorithms, and tools, but it is a handy guide for building an initial AI use case from scratch.
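Once a few candidates are shortlisted, cross-validation is one way to compare them on your own data. The sketch below is illustrative only: the digits dataset, the second candidate (`LogisticRegression`, which the text above does not mention), and the 5-fold setting are all assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Hypothetical shortlist of classifiers to compare.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

mean_scores = {}
for name, model in candidates.items():
    # 5-fold cross-validation: train on 4/5 of the data, score on the rest.
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")

best_name = max(mean_scores, key=mean_scores.get)
print("best candidate on this data:", best_name)
```

The ranking depends on the dataset and parameters, so treat the result as a starting point for further tuning rather than a general verdict on either model.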

6. Conclusion

Scikit-Learn offers a comprehensive set of models, algorithms, and tools for AI development. The User Guide, API, Examples, and algorithm cheat-sheet provide sufficient information to initiate AI use case development. However, the diverse models and various parameters involved in Data Preprocessing, Model Selection, and Tuning require extensive experimentation to identify the best approach for our AI use case. This process demands a deep understanding of both the data and the models. While consulting data sources and data analytics experts can always help the AI development process, let’s continue our journey to gain deeper insights into data preprocessing and the models offered by Scikit-Learn in the upcoming tutorials: Traditional AI — Scikit-Learn Module Introduction (Under construction)

7. Appendix (Optional Reading)

In A demo of K-Means clustering on the handwritten digits data, the standardization step applies StandardScaler to the digit image data. Each image is represented by 64 features, and each feature is a grayscale integer (range 0–16).
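The effect of the standardization step can be checked directly. The sketch below is a small illustration, not part of the demo: it applies `StandardScaler` to the digits data and inspects the result.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)  # one row per image, 64 pixel features

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Pixels that vary across images end up with mean 0 and standard deviation 1.
# Any pixel that is constant across all images has zero variance;
# StandardScaler leaves its scale untouched, so it simply becomes all zeros.
varying = X.std(axis=0) > 0

print("raw value range:", X.min(), "to", X.max())
print("max |mean| after scaling:", np.abs(X_scaled.mean(axis=0)).max())
print("std of first varying pixels:", X_scaled[:, varying].std(axis=0)[:5].round(3))
```

Standardizing matters for K-Means because it relies on distances between points: without it, pixels with a wide value range would dominate the distance calculation.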

Written by Heting Li