MNIST Dataset for Machine Learning
MNIST is a collection of 70,000 grayscale images of handwritten digits from 0 to 9. Each image is a 28x28 pixel square. The dataset is divided into training and testing sets, making it suitable for training and evaluating machine learning algorithms.
The MNIST dataset is a popular and widely used dataset in the field of machine learning and computer vision. It stands for “Modified National Institute of Standards and Technology” and consists of a large collection of handwritten digits. The dataset is often used as a benchmark for testing and developing machine learning algorithms, particularly for digit recognition tasks.
- Data: The dataset contains 70,000 grayscale images of handwritten digits (0 through 9). Each image is a 28x28 pixel square, making the images relatively small compared to many other datasets. These images have been preprocessed and centered to ensure consistency.
- Training and Testing Sets: The MNIST dataset is typically divided into two subsets: a training set and a test set. The training set contains 60,000 images, and the test set contains 10,000 images. This separation allows researchers and practitioners to train machine learning models on one subset and evaluate their performance on the other.
- Labeling: Each image in the dataset is associated with a corresponding label, indicating which digit (0–9) is written in the image. These labels are used to train and evaluate the accuracy of machine learning models.
- Use Cases: MNIST is often used as a benchmark for various machine learning algorithms, especially for image classification tasks. It’s a relatively simple dataset compared to more complex image datasets like CIFAR-10 or ImageNet, making it a good starting point for learning and experimentation.
- Limitations: While MNIST is considered a standard dataset, it has limitations. Achieving high accuracy on MNIST is relatively straightforward for modern machine learning models. As a result, researchers often use more challenging datasets to evaluate the robustness and generalization capabilities of models.
The MNIST dataset has been widely used for educational purposes, as a starting point for exploring deep learning, and for benchmarking various machine learning algorithms. However, it’s important to note that it has become somewhat outdated in recent years due to its simplicity. Researchers often seek more complex and diverse datasets to better reflect the challenges encountered in real-world applications.
→ A set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau
→ All images are labelled with the respective digit they represent
→ MNIST is the “hello world” of machine learning
→ There are 70,000 images, and each image has 784 (28*28) features.
→ Each image is 28*28 pixels, and each feature simply represents one pixel’s intensity, from 0 (white) to 255 (black)
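The link between the 28*28 grid and the 784 features can be sketched in a few lines of plain Python. This assumes the image is flattened row by row (the default order that NumPy's reshape also uses); the helper names are hypothetical and exist only for illustration.

```python
IMAGE_SIDE = 28                       # each image is 28 pixels wide and 28 tall
N_FEATURES = IMAGE_SIDE * IMAGE_SIDE  # 784 features per image


def feature_index(row, col):
    """Position of pixel (row, col) in the flattened 784-value vector."""
    return row * IMAGE_SIDE + col


def pixel_position(index):
    """Inverse mapping: flattened index back to (row, col)."""
    return divmod(index, IMAGE_SIDE)


print(N_FEATURES)             # 784
print(feature_index(0, 0))    # 0, the top-left pixel
print(feature_index(27, 27))  # 783, the bottom-right pixel
print(pixel_position(30))     # (1, 2): second row, third column
```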
Requisites:
1. Jupyter Notebook:
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. This will make it easy for us to document and use the code as a notebook.
2. Python Libraries
- NumPy:
NumPy is a fundamental package for scientific computing in Python. It provides support for arrays and matrices, essential for data manipulation.
- scikit-learn:
Scikit-learn (sklearn) is a popular machine-learning library in Python. It includes various tools for data analysis and modeling, including algorithms for classification and regression.
- Matplotlib:
Matplotlib is a plotting library for creating visualizations in Python. It’s useful for visualizing data and results.
You can install these libraries using the following command:
pip install numpy scikit-learn matplotlib
3. Basic Python Programming Knowledge: Familiarity with basic Python programming concepts such as variables, arrays/lists, functions, and control flow is essential for working with the MNIST dataset and implementing machine learning algorithms.
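To confirm that the libraries listed above installed correctly, a quick check like the following can be run in a Python session or notebook cell. It is a minimal sketch that only imports the three packages and prints their versions; the version numbers will differ from machine to machine.

```python
# Import the three libraries installed with pip and report their versions.
import numpy
import sklearn
import matplotlib

versions = {
    "numpy": numpy.__version__,
    "scikit-learn": sklearn.__version__,
    "matplotlib": matplotlib.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

If any of the imports fails, re-running the pip command above is the usual first step.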
Set up Jupyter Notebook
To install and set up Jupyter Notebook:
1. Install Python:
If you don’t have Python installed on your Windows system, you’ll need to install it first. You can download the latest Python installer for Windows from the official Python website (Python Releases for Windows). Be sure to check the box that says “Add Python X.X to PATH” during the installation process, where “X.X” represents the version number of Python.
2. Open Command Prompt or PowerShell:
After installing Python, open either Command Prompt or PowerShell on your Windows system. You can do this by searching for “cmd” or “PowerShell” in the Start menu.
3. Install Jupyter using pip:
To install Jupyter, you can use the pip package manager that comes with Python. Run the following command to install Jupyter:
pip install jupyter
This command will download and install Jupyter and its dependencies.
4. Start Jupyter Notebook:
Once Jupyter is installed, you can start it by running the following command in your Command Prompt or PowerShell:
jupyter notebook
This command will launch a Jupyter Notebook server and open a web browser window displaying the Jupyter Notebook interface.
5. Use Jupyter Notebook:
You can create new Jupyter notebooks, open existing ones, and start working with Python code, text, and visualizations in the Jupyter Notebook interface. It’s an interactive environment for data analysis, machine learning, and more.
6. Click on “New”, located in the top right corner
7. Click on “Notebook”, which opens a new page
8. Select the kernel “Python 3 (ipykernel)”
MNIST: Fetch the data, split it into train and test sets, and apply a few ML algorithms to detect a given digit
- Fetching Dataset
from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784")
This code imports the fetch_openml function from scikit-learn and uses it to fetch the MNIST dataset, which consists of 28x28 pixel images of handwritten digits (0-9). The dataset is stored in the mnist variable for further use in machine learning tasks.
3. We can see ‘data’, the pixel columns (1–784), ‘target’, and the description in mnist when we run the command:
mnist
4. Loading the MNIST Dataset
x, y = mnist['data'], mnist['target']
Here, we’ve loaded the MNIST dataset into x (which contains the input data) and y (which contains the target labels).
5. Checking Data Shapes:
x.shape
y.shape
These lines check the dimensions (shape) of x and y, which hold the dataset's input data and target labels, respectively. With 70,000 images of 784 features each, x.shape is (70000, 784) and y.shape is (70000,).
6. Using %matplotlib inline for Inline Plotting in Jupyter Notebook
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
In Jupyter Notebook, %matplotlib inline is a magic command that enables the rendering of Matplotlib plots directly within the notebook interface. This command is particularly useful for data visualization and generating plots, as it allows you to see the graphical output of Matplotlib commands within your Jupyter Notebook cells.
Overall, %matplotlib inline is a valuable tool for anyone working with data analysis and visualization in Jupyter Notebook, as it simplifies the process of creating and examining plots as part of your data analysis workflow.
7. Extracting and Reshaping a Handwritten Digit Image
some_digit = x.to_numpy()[36002]
some_digit_image = some_digit.reshape(28,28)
- some_digit is a variable that stores a flattened representation (784 values) of a handwritten digit image from the MNIST dataset.
- x represents the MNIST dataset, where each row corresponds to a flattened 28x28 pixel image of a handwritten digit.
- .to_numpy() is used to convert the dataset x into a NumPy array.
- [36002] selects a specific row (sample) from the dataset, in this case the row at index 36002 (the 36,003rd row, since indexing starts at 0), which corresponds to one of the handwritten digit images.
- some_digit_image is another variable that stores the same handwritten digit image, but in its original 2D format (28 rows by 28 columns).
- .reshape(28, 28) is applied to some_digit to transform the flattened representation into a 2D array. This reshaping is done to prepare the image for visualization or further analysis.
In short, this code segment extracts a specific handwritten digit from the MNIST dataset in its flattened form and then reshapes it into a 2D image format, making it ready for display or further processing.
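The same two steps can be tried without downloading anything. The sketch below uses a small made-up array in place of x.to_numpy() (an assumption for illustration only; the real data comes from fetch_openml) to show that reshaping changes the layout of the values, not the values themselves.

```python
import numpy as np

# Stand-in for x.to_numpy(): 5 fake "images", each flattened to 784 values.
fake_x = np.arange(5 * 784).reshape(5, 784)

some_row = fake_x[3]                   # row at index 3, i.e. the fourth row
some_image = some_row.reshape(28, 28)  # back to a 28x28 grid

print(some_row.shape)    # (784,)
print(some_image.shape)  # (28, 28)

# Flattening the grid again recovers the original row exactly.
round_trip_ok = np.array_equal(some_image.ravel(), some_row)
print(round_trip_ok)     # True
```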
8. Visualizing a Handwritten Digit Image
This step visualizes a handwritten digit image using Matplotlib. It displays the digit image in grayscale, removes axis labels, and shows the image within the Jupyter Notebook or Python environment. This is a common step in data exploration and analysis when working with image datasets like MNIST.
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary, interpolation="nearest") is used to display the handwritten digit image stored in the variable some_digit_image using Matplotlib. Here's what each part of the code does:
- plt.imshow: This function from Matplotlib is used to display an image.
- some_digit_image: It's the 2D NumPy array representing the handwritten digit image.
- cmap=matplotlib.cm.binary: The cmap parameter specifies the colormap to be used for displaying the image. In this case, matplotlib.cm.binary is used to display the image in grayscale, where black represents the digit's ink and white represents the background.
- interpolation="nearest": The interpolation parameter determines how the image is resampled when its dimensions don't match the display size. "nearest" interpolation draws each pixel as a solid block, so the original values are shown without smoothing.
- plt.axis("off"): This line turns off the axis labels and ticks in the Matplotlib plot. Since this is an image display, you typically don't need axis labels.
- plt.show(): Finally, this command displays the image plot on your screen.
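The colormap behaviour described above can be checked on a made-up image. This is a minimal sketch under a few assumptions: the 28x28 array is fake data rather than an MNIST digit, the "Agg" backend and the in-memory buffer are there only so the code runs outside a notebook, and plt.get_cmap("binary") is used to look up the same colormap that matplotlib.cm.binary refers to.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Fake 28x28 image: a vertical stroke of full-intensity pixels on zeros.
fake_digit = np.zeros((28, 28), dtype=np.uint8)
fake_digit[4:24, 13:15] = 255

cmap = plt.get_cmap("binary")
lowest = cmap(0.0)   # RGBA colour used for the smallest value (white)
highest = cmap(1.0)  # RGBA colour used for the largest value (black)
print(lowest)
print(highest)

# Same drawing calls as in the step above, written to memory instead of shown.
fig, ax = plt.subplots()
ax.imshow(fake_digit, cmap=cmap, interpolation="nearest")
ax.axis("off")
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
image_bytes = buffer.getbuffer().nbytes
```

In a notebook with %matplotlib inline, the backend line and the buffer are unnecessary and plt.show() can be used instead.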
9. Splitting the MNIST Dataset into Training and Testing Sets
x_train, x_test = x[0:60000], x[60000:70000]
y_train, y_test = y[0:60000], y[60000:70000]
This code snippet splits the MNIST dataset into separate training and testing sets, a critical step in the machine learning workflow. The model is trained on one portion of the data and evaluated on another portion it has not seen, which helps assess how well it generalizes to new data. Typically, the training set is used to train the model, and the testing set is used to evaluate its accuracy and other performance metrics.
x_train and x_test are created to store subsets of the input data x from the MNIST dataset. Specifically:
- x_train contains the first 60,000 samples (images) from x, which are typically used for training machine learning models.
- x_test contains the next 10,000 samples (images) from x, which are reserved for testing the trained model's performance.
y_train and y_test are created to store subsets of the target labels y corresponding to the input data. These subsets correspond to the same 60,000 training samples and 10,000 testing samples, respectively.
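The slicing pattern is easier to see on a tiny example. The sketch below uses ten fake samples instead of 70,000, and split_in_order is a hypothetical helper name introduced only for illustration.

```python
import numpy as np


def split_in_order(x, y, n_train):
    """Use the first n_train samples for training and the rest for testing."""
    return x[:n_train], x[n_train:], y[:n_train], y[n_train:]


# Ten fake samples with two features each, plus one label per sample.
fake_x = np.arange(20).reshape(10, 2)
fake_y = np.arange(10)

x_train, x_test, y_train, y_test = split_in_order(fake_x, fake_y, n_train=6)
print(x_train.shape, x_test.shape)  # (6, 2) (4, 2)
print(y_test.tolist())              # [6, 7, 8, 9]
```

Because both slices use the same boundary, every sample lands in exactly one of the two sets.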
10. Shuffling the Training Data for Improved Model Training (Optional)
By performing this shuffling operation, the training data is presented to the machine learning model in a random order. This randomness can be advantageous, particularly in situations where the data might exhibit inherent patterns or biases related to its order. Shuffling ensures that the model isn’t biased by any specific order in the training data, ultimately leading to more robust and unbiased model training.
import numpy as np
shuffle_index = np.random.permutation(60000)
shuffle_index
x_train, y_train = x_train.iloc[shuffle_index], y_train.iloc[shuffle_index]
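The important detail is that the same index array reorders both the images and the labels, so each image keeps its own label. The following sketch checks this on fake NumPy data; it uses a seeded np.random.default_rng generator instead of np.random.permutation purely so the result is repeatable, which is a swap made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is repeatable

# Fake data where each label equals the first feature of its sample,
# which makes it easy to check that pairs stay together.
fake_x = np.arange(12).reshape(6, 2)
fake_y = fake_x[:, 0].copy()

shuffle_index = rng.permutation(len(fake_x))
shuffled_x, shuffled_y = fake_x[shuffle_index], fake_y[shuffle_index]

# Every shuffled label still matches the first feature of its sample.
pairs_aligned = np.array_equal(shuffled_x[:, 0], shuffled_y)
print(pairs_aligned)  # True
```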
11. Creating a 2-Detector (Is the Digit a 2 or Not?): Preprocessing Target Labels for Binary Classification
y_train = y_train.astype(np.int8)
# Converts all the string labels in y_train to numbers
y_test = y_test.astype(np.int8)
y_train_2 = (y_train==2)
y_test_2 = (y_test==2)
y_train
The purpose of these transformations is to prepare the target labels for binary classification, where you’re interested in distinguishing one specific class (digit 2) from the rest. By converting the labels to numerical format and creating binary labels (True or False), you set up the data for training and evaluating a binary classification model that can predict whether a given image represents the digit 2 or not.
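The label conversion can be tried on a handful of made-up labels. This sketch uses a plain NumPy array of strings as a stand-in for the MNIST targets, which is an assumption for illustration.

```python
import numpy as np

# Stand-in for the MNIST labels, which arrive as strings such as '2'.
labels = np.array(["5", "2", "0", "2", "9"])

labels_int = labels.astype(np.int8)  # string labels become small integers
is_two = (labels_int == 2)           # True where the digit is 2

print(labels_int.tolist())  # [5, 2, 0, 2, 9]
print(is_two.tolist())      # [False, True, False, True, False]
```

Note that the comparison is against the integer 2: once the labels are numeric, comparing them with the string '2' would no longer match anything.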
12. Importing the Logistic Regression Model
from sklearn.linear_model import LogisticRegression
- sklearn.linear_model is a module of the scikit-learn (sklearn) library, which is a popular machine-learning library in Python.
- LogisticRegression is the specific machine learning model being imported from scikit-learn. Logistic Regression is a widely used algorithm for binary and multi-class classification tasks.
By importing LogisticRegression from sklearn.linear_model, you make the Logistic Regression algorithm available for use in your Python code. This model can be used to train and make predictions for classification tasks, including binary classification tasks like the one you've been working on, where you're trying to classify whether an image represents the digit 2 or not. Once imported, you can create an instance of the Logistic Regression model and train it using your training data.
13. Creating and Training a Logistic Regression Classifier
clf = LogisticRegression(tol=0.1)
clf.fit(x_train, y_train_2)
- clf is a variable representing a Logistic Regression classifier.
- LogisticRegression(tol=0.1) creates a Logistic Regression classifier with a specified tolerance (tol) of 0.1. The tolerance parameter controls the stopping criterion for the optimization algorithm used during training: it determines when the optimization is considered to have converged.
- clf.fit(x_train, y_train_2) trains the Logistic Regression classifier (clf) using the training data x_train and the binary target labels y_train_2. This is where the model learns to make predictions based on the input data and the specified binary classification task, which, in this case, is to predict whether a digit image represents the digit 2 (True) or not (False).
After training, the clf classifier will have learned a decision boundary that separates the digit 2 from other digits based on the features (pixel values) of the images. This trained classifier can then be used to make predictions on new, unseen data and evaluate its performance on the test dataset.
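For a quick experiment that does not require downloading MNIST, the same fit-and-predict pattern can be tried on the small 8x8 digits dataset bundled with scikit-learn. This is a sketch under stated assumptions: load_digits is a stand-in for MNIST, the variable names are made up, and max_iter=1000 is an extra setting added here to give the solver more iterations; it is not part of the original code.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Stand-in data: 1,797 images of 8x8 pixels (64 features), bundled with scikit-learn.
digits = load_digits()
x_small, y_small = digits.data, digits.target
y_small_2 = (y_small == 2)  # binary target: is the digit a 2?

# Same tolerance as in the article; max_iter is an added assumption.
clf_small = LogisticRegression(tol=0.1, max_iter=1000)
clf_small.fit(x_small, y_small_2)

# Predict for the first three images; the result is an array of booleans.
predictions = clf_small.predict(x_small[:3])
print(predictions)
print(y_small_2[:3])
```

Predicting on images the model was trained on, as done here, only shows the mechanics; it says little about how well the model generalizes.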
14. Making Predictions with a Trained Logistic Regression Classifier
clf.predict([some_digit])
15. Performing Cross-Validation to Evaluate Model Accuracy
from sklearn.model_selection import cross_val_score
a = cross_val_score(clf, x_train, y_train_2, cv=3, scoring="accuracy")
a
- from sklearn.model_selection import cross_val_score imports the cross_val_score function from scikit-learn's model_selection module. This function is used for cross-validation, which is a technique for assessing the performance of machine-learning models.
- clf is the Logistic Regression classifier created earlier; cross_val_score fits a fresh copy of it on each fold.
- x_train is the training data, and y_train_2 is the corresponding binary target variable indicating whether each training sample represents the digit 2 or not.
- cv=3 specifies the number of cross-validation folds. In this case, the dataset is split into three parts, and the model is trained and evaluated three times, each time using a different fold for testing and the remaining folds for training.
- scoring="accuracy" specifies that the evaluation metric for the cross-validation should be accuracy. Accuracy measures the proportion of correctly classified samples, which is a common metric for classification tasks.
- a is a variable that stores the results of the cross-validation. After running this code, a will contain an array of accuracy scores, with each score representing the accuracy achieved on one of the cross-validation folds.
By performing cross-validation in this way, you obtain multiple accuracy scores for the model, which can help assess its overall performance and robustness. You can then analyze these scores to understand how well the model generalizes to different subsets of the training data and make informed decisions about model tuning and hyperparameter selection.
16. Calculating the Mean Accuracy from Cross-Validation Scores
a.mean()
By calculating the mean accuracy, you can assess the overall performance of the Logistic Regression classifier and gain insights into how well it is likely to perform on new, unseen data. It helps you make informed decisions about the model’s suitability for the given task and whether further model refinement or parameter tuning is necessary.
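One way to put the mean accuracy in context is to compare it with a trivial baseline. Only about one digit in ten is a 2, so a model that always answers "not 2" already scores close to 90% accuracy. The sketch below illustrates this on the small digits dataset bundled with scikit-learn, used as a stand-in for MNIST; the variable names and the max_iter=1000 setting are assumptions added for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data: the 8x8 digits bundled with scikit-learn, not MNIST itself.
digits = load_digits()
x_small = digits.data
y_small_2 = (digits.target == 2)

clf_small = LogisticRegression(tol=0.1, max_iter=1000)
scores = cross_val_score(clf_small, x_small, y_small_2, cv=3, scoring="accuracy")

# Share of samples that are not a 2: the accuracy of always answering "not 2".
baseline = 1.0 - y_small_2.mean()

print(scores)         # one accuracy value per fold
print(scores.mean())  # mean accuracy across the three folds
print(baseline)       # accuracy of the trivial "never a 2" guess
```

A mean accuracy only slightly above the baseline would suggest the classifier has learned little, even if the number itself looks high.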
In conclusion, this analysis and machine learning pipeline showcase the fundamental steps involved in handling and understanding the MNIST dataset, preparing data for model training, training a classifier, making predictions, and evaluating model performance. While MNIST is a classic dataset used for educational purposes, it’s important to recognize its simplicity and consider more challenging datasets for real-world applications.
Here is the full code that you can try out in your Jupyter notebook:
from sklearn.datasets import fetch_openml
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Fetch the dataset and separate the inputs from the labels
mnist = fetch_openml('mnist_784')
x, y = mnist['data'], mnist['target']
# Pick one digit and display it
some_digit = x.to_numpy()[36002]
some_digit_image = some_digit.reshape(28, 28)  # let's reshape to plot it
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary,
           interpolation='nearest')
plt.axis("off")
plt.show()
# Split into 60,000 training and 10,000 testing samples
x_train, x_test = x[:60000], x[60000:70000]
y_train, y_test = y[:60000], y[60000:70000]
# Shuffle the training data, keeping images and labels aligned
shuffle_index = np.random.permutation(60000)
x_train, y_train = x_train.iloc[shuffle_index], y_train.iloc[shuffle_index]
# Creating a 2-detector
y_train = y_train.astype(np.int8)
y_test = y_test.astype(np.int8)
y_train_2 = (y_train == 2)
y_test_2 = (y_test == 2)
# Train a logistic regression classifier
clf = LogisticRegression(tol=0.1)
clf.fit(x_train, y_train_2)
example = clf.predict([some_digit])
print(example)
# Cross Validation
a = cross_val_score(clf, x_train, y_train_2, cv=3, scoring="accuracy")
print(a.mean())