notMNIST

Here’s the code:

https://gist.github.com/akki2825/e31f146039dca1147e346c553de761f5

Line 3: __future__ statements need to be near the top of the file because they change fundamental things about the language, and so the compiler needs to know about them from the beginning.

Line 4: matplotlib, because we are plotting graphs and images!

Line 5: NumPy, to work with numerical arrays and matrices.

Line 6: os gives us operating-system interfaces like file paths and directory listings; its path attribute points to the OS-specific path module (os.path).

Line 7: to pick a random number in a given range

Line 8: This module implements a common interface to many different secure hash and message digest algorithms.

Line 9: we all know what json is!!

Line 10: This module provides access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter.

Line 11: to access tar files

Line 12: for displaying images in the notebook

Line 13: This package contains various functions for multi-dimensional image processing.

Line 14: sklearn people! (we'll use its logistic regression classifier at the end)

Line 15: to work with URLs (we'll use urlretrieve to download the dataset)

Line 16: "Pickling" is the process whereby a Python object hierarchy is converted into a byte stream.
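Putting those together, the import block presumably looks something like this. This is a sketch inferred from the descriptions above, not a copy of the gist, so check the gist for the exact lines:

```python
from __future__ import print_function  # must sit near the top of the file
import matplotlib.pyplot as plt        # plotting graphs and images
import numpy as np                     # numerical arrays and matrices
import os                              # paths and directory listings
import random                          # random numbers in a given range
import hashlib                         # secure hashes, handy for duplicate detection
import json
import sys
import tarfile                         # reading .tar.gz archives
from IPython.display import display, Image  # displaying images in the notebook
from scipy import ndimage              # multi-dimensional image processing
from sklearn.linear_model import LogisticRegression
from six.moves.urllib.request import urlretrieve  # downloading files from a URL
from six.moves import cPickle as pickle
```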

Line 21: yeah, you guessed it right, here’s where we get the dataset from.

Line 23–33: we are defining a function that takes three parameters: filename, expected_bytes, and force. It retrieves the dataset from the URL using urlretrieve, appending the filename to the URL. (Hope there's enough space on your disk for the download; the files are around 250MB.) We then get the statinfo and compare its size with our expected_bytes value; if they turn out to be the same, voila! We have downloaded the right file. If verification fails, just recheck the filename.

Line 35: train_filename (training data), calling our previous function.

Line 36: test_filename (test data), calling our previous function.
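If you want to see it in one place, the download step looks roughly like this. It's a sketch: the URL and the expected byte counts here are illustrative assumptions, so take the real values from the gist:

```python
url = 'https://commondatastorage.googleapis.com/books1000/'  # assumed dataset location (line 21)

def maybe_download(filename, expected_bytes, force=False):
    """Download a file if it is not already present, then verify its size."""
    if force or not os.path.exists(filename):
        filename, _ = urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
    else:
        raise Exception('Failed to verify ' + filename + '. Recheck the filename/URL.')
    return filename

# illustrative byte counts; use the values from the gist
train_filename = maybe_download('notMNIST_large.tar.gz', 247336696)
test_filename = maybe_download('notMNIST_small.tar.gz', 8458043)
```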

Line 38: number of classes is 10.

Line 39: Seed the random number generator, so it generates the same sequence of numbers every time the program runs.

Line 41: define a function named maybe_extract which takes in filename and force as parameters.

Line 42: strip the .tar.gz extension and assign the result to a variable named root.

Line 43: if the root directory already exists (and force is not set),

Line 45: skip the extraction.

Line 46–60: extract the data. Use tarfile to open the archive, call extractall() to extract the data, and close the file. Create an array named data_folders of the sorted class folders, one per letter (10 classes). If the number of folders is not equal to the number of classes, raise an exception. Return the array!

Line 62 and 63: the extracted folders are assigned to train_folders and test_folders by calling maybe_extract, as in the sketch below.
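Pieced together from lines 38–63, the extraction step probably looks something like this (again a sketch, not the gist verbatim):

```python
num_classes = 10
np.random.seed(133)  # fixed seed, so every run produces the same shuffles

def maybe_extract(filename, force=False):
    root = os.path.splitext(os.path.splitext(filename)[0])[0]  # strip .tar.gz
    if os.path.isdir(root) and not force:
        print('%s already present - skipping extraction.' % root)
    else:
        print('Extracting %s. This may take a while.' % filename)
        tar = tarfile.open(filename)
        tar.extractall()
        tar.close()
    # one folder per class, sorted so index 0 is A, 1 is B, and so on
    data_folders = [
        os.path.join(root, d) for d in sorted(os.listdir(root))
        if os.path.isdir(os.path.join(root, d))]
    if len(data_folders) != num_classes:
        raise Exception('Expected %d folders, one per class; found %d.'
                        % (num_classes, len(data_folders)))
    return data_folders

train_folders = maybe_extract(train_filename)
test_folders = maybe_extract(test_filename)
```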

Line 65–71: index 0 should be all As, 1 all Bs, 2 all Cs, and so on. Open the pickle_file and load its contents by unpickling it, that is, converting the byte stream back into an object. Pick a random image index, extract a 2D slice of the image, and plot and display it using the imshow() function.
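Here's roughly what that looks like. The pickle path is just an example; it assumes the per-letter pickle files (from the pickling step described further down) already exist:

```python
pickle_file = 'notMNIST_large/A.pickle'  # example path, not necessarily the gist's choice
with open(pickle_file, 'rb') as f:
    letter_set = pickle.load(f)                  # bytes back to a 3-D array of images
sample_idx = np.random.randint(len(letter_set))  # pick a random image index
sample_image = letter_set[sample_idx, :, :]      # extract a 2-D slice
plt.imshow(sample_image)
plt.show()
```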

Line 73: assigning image_size to 28. That is, the pixel width and height.

Line 74: the number of levels per pixel is 255; the pixel values are scaled by it so they stay in a small, centered range.

Line 76: let's load the data for a single letter label using a function named load_letter, which takes folder and min_num_images as parameters.

Line 78: list the files in the folder and store the result in a variable (image_files).

Line 79: create an n-dimensional array of shape (number of images, image_size, image_size), let the datatype be float32, and assign it to dataset.

Line 82–91: iterate through image_files, load each image, and verify that it has the size we assigned above.

Line 93–102: print the shape of the full dataset tensor along with its mean and standard deviation.
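Pieced together, load_letter probably looks like the sketch below. One caveat: ndimage.imread was removed in newer SciPy releases, where imageio.imread is the usual replacement.

```python
image_size = 28      # pixel width and height
pixel_depth = 255.0  # number of levels per pixel

def load_letter(folder, min_num_images):
    """Load the data for a single letter label."""
    image_files = os.listdir(folder)
    dataset = np.ndarray(shape=(len(image_files), image_size, image_size),
                         dtype=np.float32)
    num_images = 0
    for image in image_files:
        image_file = os.path.join(folder, image)
        try:
            # scale pixel values into a small, centered range
            image_data = (ndimage.imread(image_file).astype(float) -
                          pixel_depth / 2) / pixel_depth
            if image_data.shape != (image_size, image_size):
                raise Exception('Unexpected image shape: %s' % str(image_data.shape))
            dataset[num_images, :, :] = image_data
            num_images += 1
        except IOError as e:
            print('Could not read:', image_file, ':', e, '- skipping it.')
    dataset = dataset[0:num_images, :, :]
    if num_images < min_num_images:
        raise Exception('Fewer images than expected: %d < %d'
                        % (num_images, min_num_images))
    print('Full dataset tensor:', dataset.shape)
    print('Mean:', np.mean(dataset))
    print('Standard deviation:', np.std(dataset))
    return dataset
```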

Line 104–121: pickle each letter's dataset, saving it under the class folder name with a .pickle extension.
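A sketch of that pickling step; the minimum image counts passed in at the end are illustrative, not taken from the gist:

```python
def maybe_pickle(data_folders, min_num_images_per_class, force=False):
    dataset_names = []
    for folder in data_folders:
        set_filename = folder + '.pickle'  # e.g. notMNIST_large/A.pickle
        dataset_names.append(set_filename)
        if os.path.exists(set_filename) and not force:
            print('%s already present - skipping pickling.' % set_filename)
        else:
            print('Pickling %s.' % set_filename)
            dataset = load_letter(folder, min_num_images_per_class)
            with open(set_filename, 'wb') as f:
                pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
    return dataset_names

train_datasets = maybe_pickle(train_folders, 45000)  # illustrative minimums
test_datasets = maybe_pickle(test_folders, 1800)
```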

Line 123–129: the (empty) dataset and label arrays are created.

Line 131–163: define a function to create the validation dataset, validation labels, training dataset, and training labels. The letters are reshuffled so the validation and training sets are random.
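Here's a sketch of those two pieces. make_arrays is my name for the helper on lines 123–129, so treat the exact names as assumptions:

```python
def make_arrays(nb_rows, img_size):
    # pre-allocate dataset and label arrays (None if nb_rows is 0)
    if nb_rows:
        dataset = np.ndarray((nb_rows, img_size, img_size), dtype=np.float32)
        labels = np.ndarray(nb_rows, dtype=np.int32)
    else:
        dataset, labels = None, None
    return dataset, labels

def merge_datasets(pickle_files, train_size, valid_size=0):
    num_classes = len(pickle_files)
    valid_dataset, valid_labels = make_arrays(valid_size, image_size)
    train_dataset, train_labels = make_arrays(train_size, image_size)
    vsize_per_class = valid_size // num_classes
    tsize_per_class = train_size // num_classes

    start_v, start_t = 0, 0
    end_v, end_t = vsize_per_class, tsize_per_class
    end_l = vsize_per_class + tsize_per_class
    for label, pickle_file in enumerate(pickle_files):
        with open(pickle_file, 'rb') as f:
            letter_set = pickle.load(f)
            np.random.shuffle(letter_set)  # reshuffle so the split is random
            if valid_dataset is not None:
                valid_dataset[start_v:end_v, :, :] = letter_set[:vsize_per_class, :, :]
                valid_labels[start_v:end_v] = label
                start_v += vsize_per_class
                end_v += vsize_per_class
            train_dataset[start_t:end_t, :, :] = letter_set[vsize_per_class:end_l, :, :]
            train_labels[start_t:end_t] = label
            start_t += tsize_per_class
            end_t += tsize_per_class
    return valid_dataset, valid_labels, train_dataset, train_labels
```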

Line 166–168: set the training set size to 200,000, the validation set size to 10,000, and the test set size to 10,000.

Line 170–176: merge the datasets using the merge_datasets function (mentioned above) and print their shapes separately.
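In code, that's roughly (assuming the merge_datasets sketch above):

```python
train_size = 200000
valid_size = 10000
test_size = 10000

valid_dataset, valid_labels, train_dataset, train_labels = merge_datasets(
    train_datasets, train_size, valid_size)
_, _, test_dataset, test_labels = merge_datasets(test_datasets, test_size)

print('Training:', train_dataset.shape, train_labels.shape)
print('Validation:', valid_dataset.shape, valid_labels.shape)
print('Testing:', test_dataset.shape, test_labels.shape)
```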

Line 178–185: randomize the dataset and labels using a random permutation.
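The shuffling helper is short; a sketch:

```python
def randomize(dataset, labels):
    # apply the same random permutation to images and labels
    permutation = np.random.permutation(labels.shape[0])
    return dataset[permutation, :, :], labels[permutation]

train_dataset, train_labels = randomize(train_dataset, train_labels)
test_dataset, test_labels = randomize(test_dataset, test_labels)
valid_dataset, valid_labels = randomize(valid_dataset, valid_labels)
```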

Line 187–205: save all the stuff in the pickle_file.
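Saving everything is a single pickle.dump of a dict; roughly:

```python
pickle_file = 'notMNIST.pickle'
save = {
    'train_dataset': train_dataset, 'train_labels': train_labels,
    'valid_dataset': valid_dataset, 'valid_labels': valid_labels,
    'test_dataset': test_dataset, 'test_labels': test_labels,
}
with open(pickle_file, 'wb') as f:
    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
```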

Line 207: Alright, let's load the notMNIST.pickle.

Line 209–220: check for any duplicates; return the number of duplicate indices, if any, and print it out.
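The gist imports hashlib, so one plausible way to do this is to hash every image and compare digests. duplicate_indices here is a hypothetical helper name, not necessarily what the gist uses:

```python
with open('notMNIST.pickle', 'rb') as f:
    data = pickle.load(f)

def duplicate_indices(dataset_1, dataset_2):
    """Indices of images in dataset_1 that also appear in dataset_2."""
    hashes_2 = set(hashlib.sha256(img.tobytes()).hexdigest() for img in dataset_2)
    return [i for i, img in enumerate(dataset_1)
            if hashlib.sha256(img.tobytes()).hexdigest() in hashes_2]

dups = duplicate_indices(data['test_dataset'], data['train_dataset'])
print('test/train overlap:', len(dups))
```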

Line 222–225: finally, we have our ‘clean’ training data, training labels, testing data, and testing labels.

Line 228–234: we are using logistic regression as our classifier. Before fitting the model, we need to convert the training and test images into 2D arrays, which is done with the flatten() function.

Line 236–240: we get the scores for different sizes of training samples and print them out!
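A sketch of that last step: reshape(num_samples, -1) does the same flattening as flatten(), and the sample sizes in the loop are illustrative, not the gist's exact values:

```python
def train_and_score(num_samples):
    # flatten each 28x28 image into a 784-dimensional row vector
    X_train = train_dataset[:num_samples].reshape(num_samples, -1)
    y_train = train_labels[:num_samples]
    X_test = test_dataset.reshape(len(test_dataset), -1)

    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    return clf.score(X_test, test_labels)

for n in (50, 100, 1000, 5000):  # illustrative training set sizes
    print('%d training samples -> accuracy %.3f' % (n, train_and_score(n)))
```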

Alright that’s it for now! Thank you for spending your time. Cheers!
