DATA SCIENCE THEORY | MACHINE LEARNING | KNIME ANALYTICS PLATFORM
What is Machine Learning? An Introduction with examples in Python and KNIME
Learn data science theory with code and no-code
Discover the basics of Machine Learning with examples in Python and KNIME in this practice-based introduction. Learn how algorithms can learn from data and make predictions, and see its potential applications.
Machine Learning and artificial intelligence have been hot topics for years, yet many applications still seemed far away or even impossible. With the introduction of ChatGPT, however, the quality and possibilities of artificial intelligence have expanded tremendously, and more seems achievable every day.
But what actually is Machine Learning?
Machine Learning is a subfield of artificial intelligence that focuses on the development of algorithms and models that can learn patterns and relationships in data and make predictions or decisions without explicit programming.
It is based on the idea that systems can learn from experience, identify patterns in data, and improve their performance over time.
Types of Machine Learning
In the field of Data Science, the different types of Machine Learning techniques commonly employed include:
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
- Deep Learning
Supervised Learning
Supervised learning is extensively used in Data Science. The model is trained on labeled data, and the goal is to learn a mapping function that can predict the correct label for new, unseen data.
It’s like a child learning with a teacher who knows the right answers.
For instance, you can train a model to classify incoming emails as either spam or non-spam.
Supervised learning has been successfully used in many applications besides spam filtering, such as predicting customer churn, disease diagnosis, fraud detection, credit risk assessment, sentiment analysis, handwriting recognition, and many more.
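To make this a little more concrete, here is a minimal sketch of such a spam classifier; the handful of example emails and their labels below are invented purely for illustration.
# Train a tiny spam classifier on hand-made example emails
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
emails = [
    "Win a free prize now", "Limited offer, claim your reward",        # spam
    "Meeting agenda for Monday", "Please review the attached report",  # not spam
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam
# Bag-of-words features plus a Naive Bayes classifier in one pipeline
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)
# Predict the label of a new, unseen email
print(model.predict(["Claim your free reward today"]))  # expected: [1] (spam)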
Unsupervised Learning
Unsupervised learning is about finding patterns or structures in unlabeled data.
It’s like exploring data without any specific guidance.
For example, you can use clustering to group customers into different segments based on their purchasing behavior.
Unsupervised learning has proven its value in a wide array of applications, including customer segmentation, anomaly detection, data visualization, market analysis, recommendation systems, clustering of gene expression data, pattern discovery in manufacturing processes, and much more.
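To make this a bit more tangible, here is a minimal clustering sketch with made-up purchasing data (annual spend and number of orders per customer); the numbers are invented for illustration.
# Group customers into segments without using any labels
import numpy as np
from sklearn.cluster import KMeans
customers = np.array([
    [200, 2], [250, 3], [220, 2],        # low spend, few orders
    [1500, 20], [1600, 25], [1400, 18],  # high spend, many orders
])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)
print(segments)  # one segment label per customer, e.g. [0 0 0 1 1 1]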
Semi-supervised Learning
This approach is useful when labeling data is expensive or time-consuming. The algorithm first learns from the labeled data and then uses that learning to predict labels for the unlabeled data.
The labeled data and the newly labeled data are then used together to train a supervised learning model.
Semi-supervised learning has been successfully applied across various fields, such as refining spam filters, assisting medical diagnoses, improving image recognition, enhancing language translation, optimizing fraud detection, refining speech recognition, and supporting autonomous vehicle navigation.
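As a rough illustration of the mechanism, here is a minimal sketch using scikit-learn's SelfTrainingClassifier on synthetic data, where unlabeled samples are marked with -1; the dataset and parameters are purely illustrative.
# Semi-supervised learning: a base classifier labels its own confident predictions
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.tree import DecisionTreeClassifier
X_toy, y_toy = make_classification(n_samples=200, random_state=0)
# Pretend that only about 20% of the labels are known; the rest are -1 (unlabeled)
rng = np.random.RandomState(0)
y_partial = np.where(rng.rand(len(y_toy)) < 0.2, y_toy, -1)
# The wrapper trains on the labeled part, labels confident unlabeled samples,
# and retrains the base model on the enlarged labeled set
model = SelfTrainingClassifier(DecisionTreeClassifier(max_depth=3))
model.fit(X_toy, y_partial)
print(model.score(X_toy, y_toy))  # accuracy against the full (true) labels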
Reinforcement Learning
Reinforcement learning involves training an agent to learn through trial and error. It’s like teaching a computer to play a game by rewarding it for good moves and penalizing it for bad ones. The agent explores different actions and learns to maximize its cumulative reward over time.
Reinforcement learning is not as widely used in data science as supervised and unsupervised learning. Therefore, we will not discuss this topic in any more detail. However, it is still an important part of machine learning and has many applications beyond games and robotics.
Deep Learning
Deep learning is a subset of machine learning that focuses on artificial neural networks inspired by the human brain. These networks have multiple or “deep” layers and can learn complex patterns from large amounts of data.
Unlike traditional machine learning, where humans often need to decide which features are important, deep learning can automatically learn and extract relevant features from the raw data.
Therefore, this approach does not require the human operator to formally specify all the knowledge that the computer needs.
Feature engineering is the process of refining and shaping raw data into a format that machine learning models can effectively use. This process involves selecting, transforming, and often creating new features to improve a model’s accuracy and understanding of complex patterns in the data.
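As a small illustration, here is a sketch of a hand-crafted feature on hypothetical housing columns; the column names and values are invented for the example.
# Derive a new feature (price per square foot) from existing columns
import pandas as pd
houses = pd.DataFrame({
    "price":       [300000, 450000, 250000],
    "sqft_living": [1500, 2200, 1100],
})
# The engineered feature makes a relationship explicit that a model
# would otherwise have to discover on its own
houses["price_per_sqft"] = houses["price"] / houses["sqft_living"]
print(houses)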
Deep learning has showcased remarkable success in a multitude of domains. From image recognition and natural language understanding to autonomous driving, medical diagnosis, recommendation systems, and more, deep learning has transformed industries by enabling computers to learn and make decisions from data on a complex level.
While deep learning is most often applied in supervised settings, it distinguishes itself by leveraging deep architectures and large-scale datasets to achieve superior performance in tasks such as image recognition, natural language processing, and speech recognition. However, it is important to note that deep learning can also be used in unsupervised and reinforcement learning settings, where the objective differs from traditional supervised learning.
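To give a feel for what a layered network looks like in code, here is a minimal sketch using scikit-learn's MLPClassifier with two hidden layers on the built-in digits dataset. It is only a small neural network, not deep learning in the modern sense; real projects typically use frameworks such as TensorFlow or PyTorch.
# A small feed-forward neural network with two hidden layers
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
X_digits, y_digits = load_digits(return_X_y=True)  # 8x8 pixel images of handwritten digits
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=0)
# Two hidden layers with 64 and 32 neurons; deeper networks stack many more
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))  # accuracy on the held-out test set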
Before we get into concrete examples of these types of machine learning, we first need to understand the difference between the two main types of data that machine learning algorithms work with.
Data types
Numerical data
Numerical data is information expressed in numbers. It represents quantities that can be measured or counted, like height, weight, blood pressure, or age. Numerical data can be used for various mathematical operations such as addition, subtraction, multiplication, and division.
Categorical data
Categorical data represents different categories or labels. It’s used to group things. Examples include eye color, animal types, or customer segments. Unlike numerical data, categorical data doesn’t involve numbers that can be mathematically manipulated, but it’s valuable for classification and comparison purposes, helping us understand the distinct groups within a dataset.
Understanding the difference between these two types of data is important in order to use the right algorithms for the corresponding problem.
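As a quick illustration, the following sketch builds a small, made-up DataFrame with both kinds of columns and shows how pandas lets you treat them differently.
# Numerical vs. categorical columns in a toy DataFrame
import pandas as pd
people = pd.DataFrame({
    "age":       [25, 32, 47],                    # numerical
    "height_cm": [170, 165, 180],                 # numerical
    "eye_color": ["blue", "brown", "green"],      # categorical
})
# Numerical columns support arithmetic such as the mean
print(people[["age", "height_cm"]].mean())
# Categorical columns are grouped and counted instead
print(people["eye_color"].value_counts())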
A few words in advance about the paradigm of data science tools:
Python and Pandas have established themselves as the standard tools for Data Science in recent years. Yet visual programming tools like KNIME have their advantages as well.
These tools offer a more visual and intuitive approach to data analysis and modeling, making them valuable for those who prefer a graphical interface. KNIME, for instance, excels in workflows, automation, and accessibility for non-programmers, complementing Python’s strengths in coding and customization.
Depending on your specific needs and preferences, you might find a combination of Python and visual tools like KNIME to be a powerful approach in data science.
The following examples are shown in both tools. Nevertheless, I recommend reading this article before you start:
Examples for Supervised Learning
We will now look at two examples of supervised learning using specific cases. In the first example, numerical values are predicted. For this purpose, a prediction model for house prices is created.
In the second example, categorical values are predicted. For this we will use the data of the Titanic disaster.
Supervised learning for numerical target values
Suppose you are a data scientist and you have a dataset of housing prices in a city, including the size of each house (in square feet) and its price. You would like to build a machine learning model that can predict the price of a house based on its size.
To illustrate this example, we will load the kc_house_data dataset in Python, a compilation of real estate properties sold in Seattle’s King County. This dataset includes 19 key features, such as house price, bedroom and bathroom count, as well as the square footage of both the living space and the lot itself.
This is a supervised learning problem, where you have labeled data (the size and price of each house) and you want to train a model to make predictions for new, unseen data. In this case, the model you would build is a simple linear regression model.
The linear regression model assumes a linear relationship between the input and the output, for example between the size of a house and its price, and it estimates the coefficients of this linear relationship from the training data. The coefficients can then be used to make predictions for new data.
In this example, you would use the size of the house as the input (or feature) and the price as the output (or label).
To make a simple linear regression model for the data above, we can use the scikit-learn library in Python. Here are the steps to follow:
1. First, we need to import the necessary libraries:
# import the libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
2. Next, we need to load the data into a Pandas DataFrame:
# Load the dataset
df = pd.read_csv('Data/kc_final.csv', sep=";")
3. We can then select the “sqft_living” column as the input feature and the “price” column as the output feature:
# Select the input and output variables
X = df['sqft_living'].values.reshape(-1, 1)
y = df['price'].values
4. We can now create an instance of the LinearRegression object and fit the data to the model:
# Create and fit the linear regression model
regressor = LinearRegression()
regressor.fit(X, y)
5. Finally, we can make predictions using the model and plot the results:
# Make predictions
y_pred = regressor.predict(X)
# Plot the data and the regression line
plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red')
plt.title('Linear Regression')
plt.xlabel('sqft_living')
plt.ylabel('price')
plt.show()
By training on labeled data, the model learns the relationship between house size and price. The red line corresponds to the output of the model. House size in square feet seems to be a good indicator of price, but it is still not very accurate. Further input variables could improve the predictive accuracy of the model.
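To follow up on that idea, here is a sketch of the same regression with additional input variables, building on the DataFrame df loaded above. The extra column names ('bedrooms', 'bathrooms') are assumed to be present in the file; adjust them to the actual columns in your copy of the dataset.
# Extend the model with more input variables (assumed column names)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
features = ['sqft_living', 'bedrooms', 'bathrooms']
X_multi = df[features]
y = df['price'].values
# Hold out part of the data to judge how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(X_multi, y, random_state=0)
multi_reg = LinearRegression()
multi_reg.fit(X_train, y_train)
# The learned coefficients and the R^2 score on unseen data
print(multi_reg.intercept_, multi_reg.coef_)
print(multi_reg.score(X_test, y_test))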
Let’s now do the same in KNIME:
The KNIME workflows with all the following examples can be found on my KNIME Community Hub space
In KNIME everything is done with so-called building blocks, or nodes.
One node (the CSV Reader) loads the file, the Column Filter node selects the attributes sqft_living and price, and finally the Linear Regression Learner fits the model that produces the predictions.
Additional input attributes can be added via the Column Filter node, which can lead to an improved prediction model.
It’s up to you which tools you want to use. But listen to the wisdom of an experienced data scientist:
The best tool is the one that helps you reach your result as fast as possible.
Supervised Learning for categorical target values
The sinking of the Titanic in 1912 remains one of the most tragic and well-documented maritime disasters in history. In the quest to uncover hidden insights from this historic event, data scientists have turned to a dataset containing information about passengers onboard.
By analyzing this data set, we hope to answer the question of what factors led to the survival of passengers.
We import the necessary libraries in Python:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree # Import plot_tree
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
Next we load the Titanic dataset, which is also available on Kaggle.
# Load the dataset
data = pd.read_csv('Data/Titanic-Dataset.csv')
data
We will use a decision tree as the prediction model in the following example because its rules can be represented particularly well, although there are many other prediction algorithms that would give even better results.
In order for our decision tree model to work properly, we need to perform one-hot encoding on the ‘Sex’ attribute, transforming it into separate binary columns such as ‘Sex_male’ and ‘Sex_female’, as our decision tree requires numerical input data rather than categorical values like ‘male’ and ‘female’.
# Encode categorical variable 'Sex' using one-hot encoding
data = pd.get_dummies(data, columns=['Sex'])
Furthermore, for our decision tree model to work properly, we need to address missing values in the ‘Age’ attribute by performing imputation. Imputing values improves prediction accuracy by ensuring complete, continuous data for effective pattern learning. In our case we use the mean value.
# Handle missing values
imputer = SimpleImputer(strategy='mean')
data['Age'] = imputer.fit_transform(data[['Age']])
Now we choose the input and output variables:
# Select features (X) and target (y)
X = data[['Pclass', 'Age', 'Sex_female']] # Include the encoded 'Sex_female' column
y = data['Survived']
We split the data into a training and a test set and create a DecisionTreeClassifier.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=51, stratify=y)
# Create a decision tree classifier with the entropy criterion and cost-complexity pruning
clf = DecisionTreeClassifier(random_state=51, min_samples_leaf=20, ccp_alpha=0.01, criterion='entropy')  # 'entropy' uses information gain as the split criterion
In supervised learning, data is split into training and test sets to assess how well the model generalizes to new, unseen data, prevent overfitting, and optimize its performance.
Finally, we fit the model to the training set and measure its prediction accuracy on the test set:
# Fit the classifier on the training data
clf.fit(X_train, y_train)
# Make predictions on the test data
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')
The model achieves 78% accuracy on the test data and shows higher precision in predicting who survived (86%) compared with predicting who did not survive (75%).
We can also visualize the decision tree in a graph.
# Visualize the compact decision tree
plt.figure(figsize=(10, 6))
plot_tree(clf, filled=True, feature_names=X.columns, class_names=['Not Survived', 'Survived'], fontsize=10, impurity=False)  # Label nodes with the feature and class names
plt.title("Compact Decision Tree Visualization")
plt.show()
The rules of the tree are to be read as follows:
There are 801 cases in the top box. If you split them by sex, with males following the left path (Sex_female < 0.5, i.e. Sex_female = 0) and females the right path, you see that females have a greater chance of survival (blue).
Another strong criterion is the passenger class (=Pclass). For the better Pclass 1, there was a higher chance of survival. For women, even Pclass 2 helped to achieve a better chance of survival.
Age is also crucial. Younger boys had a clear advantage.
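If you prefer the rules as text instead of a plot, scikit-learn can also print them directly from the fitted classifier:
# Print the decision rules of the fitted tree as plain text
from sklearn.tree import export_text
print(export_text(clf, feature_names=list(X.columns)))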
So it seemed that the well-known sailor’s saying applied to this disaster:
“Women and children first”
Let’s now do it in KNIME:
In KNIME we load the “Titanic Dataset” with a CSV Reader node.
Then we change the Survived attribute from Number to String, since the decision tree learner expects a categorical target variable. We also impute the Age attribute with the mean and use the Partitioning node to split the data into a 90% training set and a 10% test set.
The accuracy of the decision tree in KNIME is slightly better at 81%.
It seems that the partitioning simply gave us a more favorable split into training and test samples.
The decision tree model shows roughly the same rules as the Python model. The attributes Sex, Pclass, and Age again play the decisive role in this decision tree.
Conclusion
We learned about the basic types of machine learning and used two concrete examples of supervised learning to explore the subject in more depth using both Python and KNIME.
Both tools are freely available and give a solid basis to not only get into Data Science, but also to apply it in a professional environment.
In the articles that follow, we’ll take a closer look at the topics we didn’t cover in detail. So stay tuned!
Thanks for reading and may the Data Force be with you!
Please feel free to share your thoughts or reading tips in the comments.
If you enjoy reading stories like these and want to support me as a writer, consider signing up to become a Medium member.
It’s $5 a month, giving you unlimited access to thousands of Data science articles. If you sign up using my link, I’ll earn a small commission with no extra cost to you.
Follow me on Medium, LinkedIn or Twitter
and follow my Facebook Group “Data Science with Yodime”
Material for this project:
Jupyter-Code: Github
KNIME-workflow: KNIME Community Hub