MNIST Dataset for Machine Learning

Binaya Puri
12 min read · Dec 21, 2023


MNIST consists of a collection of 70,000 grayscale images of handwritten digits from 0 to 9. Each image is a 28x28 pixel square. The dataset is divided into training and testing sets, making it suitable for training and evaluating machine learning algorithms.

The MNIST dataset is a popular and widely used dataset in the field of machine learning and computer vision. It stands for “Modified National Institute of Standards and Technology” and consists of a large collection of handwritten digits. The dataset is often used as a benchmark for testing and developing machine learning algorithms, particularly for digit recognition tasks.

  1. Data: The dataset contains 70,000 grayscale images of handwritten digits (0 through 9). Each image is a 28x28 pixel square, making the images relatively small compared to many other datasets. These images have been preprocessed and centered to ensure consistency.
  2. Training and Testing Sets: The MNIST dataset is typically divided into two subsets: a training set and a test set. The training set contains 60,000 images, and the test set contains 10,000 images. This separation allows researchers and practitioners to train machine learning models on one subset and evaluate their performance on the other.
  3. Labeling: Each image in the dataset is associated with a corresponding label, indicating which digit (0–9) is written in the image. These labels are used to train and evaluate the accuracy of machine learning models.
  4. Use Cases: MNIST is often used as a benchmark for various machine learning algorithms, especially for image classification tasks. It’s a relatively simple dataset compared to more complex image datasets like CIFAR-10 or ImageNet, making it a good starting point for learning and experimentation.
  5. Challenges: While MNIST is considered a standard dataset, it’s not without its challenges. Achieving high accuracy on MNIST is a relatively straightforward task for modern machine learning models. As a result, researchers often use more challenging datasets to evaluate the robustness and generalization capabilities of models.

The MNIST dataset has been widely used for educational purposes, as a starting point for exploring deep learning, and for benchmarking various machine learning algorithms. However, it’s important to note that it has become somewhat outdated in recent years due to its simplicity. Researchers often seek more complex and diverse datasets to better reflect the challenges encountered in real-world applications.

→ Set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau

→ All images are labelled with the respective digit they represent

→ MNIST is the “Hello World” of machine learning

→ There are 70,000 images, and each image has 784 (28*28) features.

→ Each image is 28*28 pixels, and each feature simply represents one pixel’s intensity from 0 (white) to 255 (black)
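If you want to verify these numbers yourself, a quick check like the following works (a minimal sketch; fetching the dataset is covered step by step later in this article, and the exact printed values and dtypes may vary by scikit-learn version):

from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784")
image = mnist["data"].to_numpy()[0]  # one flattened image
print(image.shape)                   # (784,) -> 28*28 pixels per image
print(image.min(), image.max())      # intensities run from 0 (white) to 255 (black)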

Prerequisites:

1. Jupyter Notebook:

The Jupyter Notebook is an open-source web application that allows you to create and share documents containing live code, equations, visualizations, and narrative text. This makes it easy for us to document and run the code as a notebook.

2. Python Libraries

  • NumPy:
    NumPy is a fundamental package for scientific computing in Python. It provides support for arrays and matrices, essential for data manipulation.
  • scikit-learn:
    Scikit-learn (sklearn) is a popular machine-learning library in Python. It includes various tools for data analysis and modeling, including algorithms for classification and regression.
  • Matplotlib:
    Matplotlib is a plotting library for creating visualizations in Python. It’s useful for visualizing data and results.

You can install these libraries using the following command:

pip install numpy scikit-learn matplotlib

3. Basic Python Programming Knowledge: Familiarity with basic Python programming concepts such as variables, arrays/lists, functions, and control flow is essential for working with the MNIST dataset and implementing machine learning algorithms.

Set Up Jupyter Notebook

To install and set up Jupyter Notebook:

  1. Install Python:

If you don’t have Python installed on your Windows system, you’ll need to install it first. You can download the latest Python installer for Windows from the official Python website (Python Releases for Windows). Be sure to check the box that says “Add Python X.X to PATH” during the installation process, where “X.X” represents the version number of Python.

2. Open Command Prompt or PowerShell:

After installing Python, open either Command Prompt or PowerShell on your Windows system. You can do this by searching for “cmd” or “PowerShell” in the Start menu.

3. Install Jupyter using pip:

To install Jupyter, you can use the pip package manager that comes with Python. Run the following command to install Jupyter:


pip install jupyter

This command will download and install Jupyter and its dependencies.

4. Start Jupyter Notebook:

Once Jupyter is installed, you can start it by running the following command in your Command Prompt or PowerShell:

jupyter notebook

This command will launch a Jupyter Notebook server and open a web browser window displaying the Jupyter Notebook interface.

5. Use Jupyter Notebook:

You can create new Jupyter notebooks, open existing ones, and start working with Python code, text, and visualizations in the Jupyter Notebook interface. It’s an interactive environment for data analysis, machine learning, and more.

6. Click on “New”, located in the top-right corner.

7. Click on “Notebook”, which opens a new notebook page.

8. Select the kernel Python 3 (ipykernel).

MNIST: fetch the data, split it into train and test sets, and apply a few ML algorithms to detect a given digit.

1. Fetching the Dataset

from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784")

This code imports the fetch_openml function from scikit-learn and uses it to fetch the MNIST dataset, which consists of 28x28 pixel images of handwritten digits (0-9). The dataset is stored in the mnist variable for further use in machine learning tasks.

2. We can see the ‘data’, the pixel feature names (pixel1–pixel784), the ‘target’ labels, and the dataset description (‘DESCR’) in the mnist object when we run:

mnist
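The object returned by fetch_openml is a scikit-learn Bunch. The exact keys can vary slightly between scikit-learn versions, but you can always list them directly:

print(mnist.keys())
# e.g. dict_keys(['data', 'target', 'frame', 'categories', 'feature_names',
#                 'target_names', 'DESCR', 'details', 'url'])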

3. Loading the MNIST Dataset

x, y = mnist['data'], mnist['target']

Here, we’ve loaded the MNIST dataset into x (which contains the input data) and y (which contains the target labels).

4. Checking Data Shapes:

x.shape

y.shape

These lines are used to check the dimensions (shape) of the x and y arrays, which represent the dataset's input data and target labels, respectively.
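If the download succeeded, the shapes should match the numbers quoted earlier (the exact output formatting depends on whether x is a pandas DataFrame or a NumPy array):

print(x.shape)  # (70000, 784): 70,000 images, each flattened to 784 pixel features
print(y.shape)  # (70000,): one label per image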

5. Using %matplotlib inline for Inline Plotting in Jupyter Notebook

%matplotlib inline

import matplotlib 
import matplotlib.pyplot as plt

In Jupyter Notebook, %matplotlib inline is a magic command that enables the rendering of Matplotlib plots directly within the notebook interface. This command is particularly useful for data visualization and generating plots, as it allows you to see the graphical output of Matplotlib commands within your Jupyter Notebook cells.

Overall, %matplotlib inline is a valuable tool for anyone working with data analysis and visualization in Jupyter Notebook, as it simplifies the process of creating and examining plots as part of your data analysis workflow.

6. Extracting and Reshaping a Handwritten Digit Image

some_digit = x.to_numpy()[36002] 
some_digit_image = some_digit.reshape(28,28)
  • some_digit is a variable that stores a flattened representation of a handwritten digit image from the MNIST dataset.
  • x represents the MNIST dataset, where each row corresponds to a flattened 28x28 pixel image of a handwritten digit.
  • .to_numpy() is used to convert the dataset x into a NumPy array.
  • [36002] selects a specific row (sample) from the dataset, in this case the row at index 36002, which corresponds to one of the handwritten digit images.
  • some_digit_image is another variable that stores the same handwritten digit image, but in its original 2D format (28 rows by 28 columns).
  • .reshape(28, 28) is applied to some_digit to transform the flattened representation into a 2D array. This reshaping is done to prepare the image for visualization or further analysis.

This code segment extracts a specific handwritten digit from the MNIST dataset, first in its flattened form, and then reshapes it into a 2D image format, making it ready for display or further processing.
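To see which digit that row is supposed to contain, you can also look up its label (a quick check that is not part of the original code; note that the labels are strings at this point):

print(y[36002])  # the label for the same row, one of the strings '0' through '9'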

7. Visualizing a Handwritten Digit Image

This code visualizes the handwritten digit image using Matplotlib. It displays the digit in grayscale, removes the axis labels, and shows the image within the Jupyter Notebook or Python environment. This is a common step in data exploration and analysis when working with image datasets like MNIST.

plt.imshow(some_digit_image, cmap=matplotlib.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()
  • plt.imshow(some_digit_image, cmap=matplotlib.cm.binary, interpolation="nearest") is used to display the handwritten digit image stored in the variable some_digit_image using Matplotlib. Here's what each part of this line does:
  • plt.imshow: This function from Matplotlib is used to display an image.
  • some_digit_image: It's the 2D NumPy array representing the handwritten digit image.
  • cmap=matplotlib.cm.binary: The cmap parameter specifies the colormap to be used for displaying the image. In this case, matplotlib.cm.binary is used to display the image in grayscale, where black represents the digit's ink and white represents the background (an equivalent shorthand is shown after this list).
  • interpolation="nearest": The interpolation parameter determines how the image should be interpolated (scaled) if its dimensions don't match the display size. "nearest" interpolation is used to maintain the pixel's original values without interpolation.
  • plt.axis("off"): This line is used to turn off the axis labels and ticks in the Matplotlib plot. Since this is an image display, you typically don't need axis labels.
  • plt.show(): Finally, this command is used to display the image plot on your screen.
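On recent Matplotlib versions, the colormap can also be passed by name as a string, which is equivalent to the matplotlib.cm.binary object used above and slightly shorter:

plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()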

8. Splitting the MNIST Dataset into Training and Testing Sets

x_train, x_test = x[0:60000], x[60000:70000]

y_train, y_test = y[0:60000], y[60000:70000]

This code snippet performs a critical step in the machine learning workflow by splitting the MNIST dataset into separate training and testing sets (an alternative one-call approach is sketched after the list below). This separation is crucial for training a machine learning model on one portion of the data and evaluating its performance on another, unseen portion. It helps assess the model’s ability to generalize to new, unseen data. Typically, the training set is used to train the model, and the testing set is used to evaluate its accuracy and performance metrics.

  • x_train and x_test are created to store subsets of the input data x from the MNIST dataset. Specifically:
  • x_train contains the first 60,000 samples (images) from x, which are typically used for training machine learning models.
  • x_test contains the next 10,000 samples (images) from x, which are reserved for testing the trained model's performance.
  • y_train and y_test are created to store subsets of the target labels y corresponding to the input data. These subsets correspond to the same 60,000 training samples and 10,000 testing samples, respectively.
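As an aside, if you do not need MNIST’s conventional “first 60,000 for training” split, scikit-learn’s train_test_split performs the split (and a shuffle) in one call. This is an alternative to the slicing above, not what this article uses:

from sklearn.model_selection import train_test_split

x_train_alt, x_test_alt, y_train_alt, y_test_alt = train_test_split(
    x, y, test_size=10000, random_state=42)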

9. Shuffling the Training Data for Improved Model Training (Optional)

By performing this shuffling operation, the training data is presented to the machine learning model in a random order. This randomness can be advantageous, particularly in situations where the data might exhibit inherent patterns or biases related to its order. Shuffling ensures that the model isn’t biased by any specific order in the training data, ultimately leading to more robust and unbiased model training.

import numpy as np

shuffle_index = np.random.permutation(60000)
shuffle_index

# x_train and y_train are pandas objects here, so use .iloc for positional indexing
x_train, y_train = x_train.iloc[shuffle_index], y_train.iloc[shuffle_index]

10. Creating a 2-Detector (Is the Digit a 2 or Not?): Preprocessing Target Labels for Binary Classification

y_train = y_train.astype(np.int8)
# converts all of y_train's string labels into integers
y_test = y_test.astype(np.int8)
y_train_2 = (y_train == 2)
y_test_2 = (y_test == 2)
y_train

The purpose of these transformations is to prepare the target labels for binary classification, where you’re interested in distinguishing one specific class (digit 2) from the rest. By converting the labels to numerical format and creating binary labels (True or False), you set up the data for training and evaluating a binary classification model that can predict whether a given image represents the digit 2 or not.
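One thing worth checking before training: the two classes are far from balanced. Only about one image in ten is a 2, which matters when interpreting accuracy later (a quick check, not part of the original article):

print(y_train_2.sum(), "of", len(y_train_2), "training images are 2s")
# roughly 6,000 of 60,000; a model that always answers "not 2"
# would already be about 90% accurate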

11. Importing the Logistic Regression Model

from sklearn.linear_model import LogisticRegression

  • sklearn.linear_model is a module of scikit-learn (sklearn), which is a popular machine learning library in Python.
  • LogisticRegression is the specific machine learning model being imported from scikit-learn. Logistic Regression is a widely used algorithm for binary and multi-class classification tasks.

By importing LogisticRegression from sklearn.linear_model, you make the Logistic Regression algorithm available for use in your Python code. This model can be used to train and make predictions for classification tasks, including binary classification tasks like the one you've been working on, where you're trying to classify whether an image represents the digit 2 or not. Once imported, you can create an instance of the Logistic Regression model and train it using your training data.

12. Creating and Training a Logistic Regression Classifier

clf = LogisticRegression(tol=0.1)

clf.fit(x_train, y_train_2)

  • clf is a variable representing a Logistic Regression classifier.
  • LogisticRegression(tol=0.1) creates a Logistic Regression classifier with a specified tolerance (tol) of 0.1. The tolerance parameter controls the stopping criteria for the optimization algorithm used by the Logistic Regression model during training. It determines when the optimization process should converge to find the best parameters.
  • clf.fit(x_train, y_train_2) trains the Logistic Regression classifier (clf) using the training data x_train and the binary target labels y_train_2. This is where the model learns to make predictions based on the input data and the specified binary classification task, which, in this case, is to predict whether a digit image represents the digit 2 (True) or not (False).

After training, the clf classifier will have learned a decision boundary that separates the digit 2 from other digits based on the features (pixel values) of the images. This trained classifier can then be used to make predictions on new, unseen data and evaluate its performance on the test dataset.
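Note that with raw, unscaled pixel values, scikit-learn may emit a ConvergenceWarning for the default solver settings. If that happens, raising max_iter is a common tweak (an optional adjustment, not part of the original code):

clf = LogisticRegression(tol=0.1, max_iter=1000)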

13. Making Predictions with a Trained Logistic Regression Classifier

clf.predict([some_digit])

This returns a one-element array: True if the classifier thinks some_digit shows a 2, and False otherwise.

14. Performing Cross-Validation to Evaluate Model Accuracy

from sklearn.model_selection import cross_val_score

a = cross_val_score(clf, x_train, y_train_2, cv=3, scoring="accuracy")

a

  • from sklearn.model_selection import cross_val_score imports the cross_val_score function from scikit-learn’s model_selection module. This function is used for cross-validation, which is a technique for assessing the performance of machine learning models.
  • clf is the previously trained Logistic Regression classifier.
  • x_train is the training data, and y_train_2 is the corresponding binary target variable indicating whether each training sample represents the digit 2 or not.
  • cv=3 specifies the number of cross-validation folds. In this case, cv=3 means that the dataset will be split into three parts, and the model will be trained and evaluated three times, each time using a different fold for testing and the remaining folds for training.
  • scoring="accuracy" specifies that the evaluation metric for the cross-validation should be accuracy. Accuracy measures the proportion of correctly classified samples, which is a common metric for classification tasks.
  • a is a variable that stores the results of the cross-validation. After running this code, a will contain an array of accuracy scores, with each score representing the accuracy achieved on one of the cross-validation folds.

By performing cross-validation in this way, you obtain multiple accuracy scores for the model, which can help assess its overall performance and robustness. You can then analyze these scores to understand how well the model generalizes to different subsets of the training data and make informed decisions about model tuning and hyperparameter selection.
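Because the classes are imbalanced (about 90% of the images are not 2s), accuracy alone can be flattering. Precision and recall, computed from cross-validated predictions, give a clearer picture (an extra check beyond what this article covers):

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score

y_pred = cross_val_predict(clf, x_train, y_train_2, cv=3)
print(precision_score(y_train_2, y_pred))  # of the images predicted as 2s, how many really are
print(recall_score(y_train_2, y_pred))     # of the real 2s, how many the model catches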

15. Calculating the Mean Accuracy from Cross-Validation Scores

a.mean()

By calculating the mean accuracy, you can assess the overall performance of the Logistic Regression classifier and gain insights into how well it is likely to perform on new, unseen data. It helps you make informed decisions about the model’s suitability for the given task and whether further model refinement or parameter tuning is necessary.

In conclusion, this analysis and machine learning pipeline showcase the fundamental steps involved in handling and understanding the MNIST dataset, preparing data for model training, training a classifier, making predictions, and evaluating model performance. While MNIST is a classic dataset used for educational purposes, it’s important to recognize its simplicity and consider more challenging datasets for real-world applications.

Here is the full code that you can try out in your Jupyter notebook:

from sklearn.datasets import fetch_openml
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

mnist = fetch_openml('mnist_784')
x, y = mnist['data'], mnist['target']

some_digit = x.to_numpy()[36002]
some_digit_image = some_digit.reshape(28, 28)  # let's reshape to plot it

plt.imshow(some_digit_image, cmap=matplotlib.cm.binary, interpolation='nearest')
plt.axis("off")
plt.show()

x_train, x_test = x[:60000], x[60000:70000]
y_train, y_test = y[:60000], y[60000:70000]

# Shuffle the training set (pandas objects, so index positionally with .iloc)
shuffle_index = np.random.permutation(60000)
x_train, y_train = x_train.iloc[shuffle_index], y_train.iloc[shuffle_index]

# Creating a 2-detector: cast the string labels to integers first
y_train = y_train.astype(np.int8)
y_test = y_test.astype(np.int8)
y_train_2 = (y_train == 2)
y_test_2 = (y_test == 2)

# Train a logistic regression classifier
clf = LogisticRegression(tol=0.1)
clf.fit(x_train, y_train_2)
example = clf.predict([some_digit])
print(example)

# Cross-validation
a = cross_val_score(clf, x_train, y_train_2, cv=3, scoring="accuracy")
print(a.mean())
