Tutorial 3: Participating in a Kaggle Competition

David Yang · Published in Fenwicks · Apr 21, 2019

Prerequisite: Tutorial 2 (Cifar10)

Kaggle is a popular website for data science competitions. In each competition, Kaggle provides a training set (with labels) and a test set (without labels). Your mission is to train a model on the training set and use it to predict labels for the test set. You then submit your predictions to Kaggle, and their server returns a score. During an ongoing competition, your score also appears on a public “leaderboard”, which shows the scores of all contestants. For example, this is the public leaderboard for Kaggle’s Cifar10 competition:

The scores in the Cifar10 competition correspond to accuracy on the Cifar10 test set. So our model from Tutorial 2, which reaches around 94% accuracy, would land us in the top 10 of this competition. In this tutorial, we’ll make a “late submission” to this competition using the model from Tutorial 2. We’ll get a score from Kaggle, but not a rank, since the competition finished 5 years ago.

Setting up Kaggle. To use Kaggle, we need to register an account on their website. Once the account is set up, go to the account profile page, and create an “API token”:

This “API token” is simply a file called “kaggle.json”. With it, we can download the data files from Kaggle, submit predictions, and get scores, using Kaggle’s open-source API. This API is pre-installed on Google Colab, but to use it, we must upload kaggle.json to Colab’s machine. This is bothersome as Colab clears files on disk once in a while, and every time it does that, we have to upload kaggle.json again.

In this tutorial, we instead upload kaggle.json to Google Drive, and download it to Colab from there. In the following, we assume that kaggle.json is stored in the root directory of your Google Drive. Fenwicks provides a one-liner to set up Kaggle:

fw.colab_utils.kaggle_setup_from_gdrive()

The above function also mounts Google Drive as a virtual directory in Colab, called gdrive, after an authorization step.
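For reference, here is a rough manual equivalent of what such a setup step involves; the exact behavior of fw.colab_utils.kaggle_setup_from_gdrive may differ, so treat this as an illustration only:

from google.colab import drive
import os, shutil

drive.mount('/content/gdrive')  # triggers the Google Drive authorization prompt

kaggle_dir = os.path.expanduser('~/.kaggle')
os.makedirs(kaggle_dir, exist_ok=True)
# Assumes kaggle.json sits in the root of "My Drive".
shutil.copy('/content/gdrive/My Drive/kaggle.json', kaggle_dir)
os.chmod(os.path.join(kaggle_dir, 'kaggle.json'), 0o600)  # the Kaggle API expects restricted permissions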

Preparing Data. We download the Cifar10 data from Kaggle with its API:

!kaggle $fw.datasets.URLs.KAGGLE_CIFAR10

The training set from Kaggle is exactly the same as the original one. The test set, however, is different: to prevent cheating (that is, simply submitting the known test labels as predictions), Kaggle inflated the test set to 300,000 images. Among these, only the 10,000 real test images are used to calculate the leaderboard score; your predictions for the rest of the test set are simply ignored.

The dataset consists of two 7zip archives, train.7z and test.7z. Let’s decompress them:

!apt install libarchive-dev
!pip install libarchive
data_dir_local = './data'
fw.io.unzip(['./train.7z', './test.7z'], data_dir_local)

It is possible to decompress these files with Ubuntu’s 7z command, but for some reason that takes a very, very long time on Colab. The code above, which uses the libarchive library, doesn’t have this problem.
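For comparison, the 7z route mentioned above would look roughly like the following; this is the slow path we are avoiding, shown only for reference:

!apt install p7zip-full
!7z x ./train.7z -o./data
!7z x ./test.7z -o./data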

The decompressed data files are .png images. We first convert these images into a single TFRecords file, and pre-compute the mean and standard deviation of each color channel, as in Tutorial 2:

data_dir, work_dir = fw.io.get_gcs_dirs(BUCKET, PROJECT)

local_train_fn = os.path.join(data_dir_local, "train.tfrec")
path_train, y_train, labels = fw.data.data_dir_label_csv_tfrecord(
    data_dir=os.path.join(data_dir_local, 'train'),
    csv_fn='./trainLabels.csv', output_fn=local_train_fn,
    file_ext='png')
n_train, n_classes = len(path_train), len(labels)

X_train_mean, X_train_std, img_size, _ = fw.preprocess.compute_image_mean_std(
    local_train_fn, n_train, batch_size=100)
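Conceptually, these per-channel statistics are just the mean and standard deviation of each color channel over the entire training set. A minimal NumPy sketch of the same computation, assuming the images are loaded into an array X of shape [N, H, W, 3]:

import numpy as np

X = np.random.rand(100, 32, 32, 3).astype('float32')  # stand-in data for illustration
channel_mean = X.mean(axis=(0, 1, 2))  # one value per color channel
channel_std = X.std(axis=(0, 1, 2))
print(channel_mean, channel_std)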

After that, we upload the TFRecords file for training data to Google Cloud Storage (GCS), since TPUs cannot access files on the local disk of the Colab machine.

train_fn = os.path.join(data_dir, "train.tfrec")
fw.io.upload_to_gcs(local_train_fn, train_fn)
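In essence, this upload is a copy from local disk to a gs:// path, which TensorFlow’s file API can also do directly. A hedged sketch of the equivalent operation (the Fenwicks helper may handle more details):

import tensorflow as tf

# Copy the local TFRecords file to the GCS bucket, overwriting any existing file.
tf.io.gfile.copy(local_train_fn, train_fn, overwrite=True)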

Next, we do the same for the test set. This time, we directly write to a TFRecords file on GCS, instead of creating the file locally and subsequently uploading it to GCS.

test_fn = os.path.join(data_dir, "test.tfrec")

path_test = fw.data.data_dir_no_label_tfrecord(
    data_dir=os.path.join(data_dir_local, 'test'),
    output_fn=test_fn, file_ext='png')
n_test = len(path_test)

Input pipeline. Recall from Tutorial 2 that the training set needs to go through a series of data augmentations: padding, random cropping, random flipping, and Cutout. Images in the test set, on the other hand, are fed to the input pipeline directly. In this tutorial, there are two main differences. First, our inputs here are images stored in the .png format, rather than Numpy arrays. Second, we have not yet normalized the inputs. This means that the parser in the input pipeline must decode the PNG images into arrays, and then normalize the resulting tensors. To do so, we define the following transforms:

train_tfms = [
    fw.transform.tfm_standard_scaler(X_train_mean, X_train_std),
    fw.transform.tfm_pad_crop(4),
    fw.transform.tfm_random_flip(),
    fw.transform.tfm_cutout(8, 8),
    fw.transform.tfm_set_shape(img_size, img_size),
]

test_tfms = [
    fw.transform.tfm_standard_scaler(X_train_mean, X_train_std),
    fw.transform.tfm_set_shape(img_size, img_size),
]

The last transform (that is, fw.transform.tfm_set_shape) hardcodes the size of the images. This step is specific to TPUs, which require image size information when compiling the TensorFlow computation graph into the TPU’s machine code.
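For intuition, a transform along these lines only pins down the static shape of each image tensor. A hypothetical version might look like the following; this is not the actual Fenwicks implementation:

import tensorflow as tf

def tfm_set_shape_sketch(h, w, c=3):
    # Returns a transform that fixes the static shape of an image tensor,
    # so the XLA/TPU compiler sees a fully known shape.
    def tfm(x):
        x.set_shape([h, w, c])
        return x
    return tfm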

With these transforms, we create the training and test input pipelines, as follows:

parser_train = fw.data.get_tfexample_image_parser(train_tfms)
parser_test = fw.data.get_tfexample_image_parser(test_tfms,
    has_label=False)
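For context, parsers like these are typically plugged into a tf.data pipeline that reads the TFRecords files and batches the parsed examples. The sketch below shows the general shape of such an input function; the actual Fenwicks input functions may differ in signature and details:

import tensorflow as tf

def make_input_fn(tfrec_fn, parser, batch_size, training):
    def input_fn(params=None):
        ds = tf.data.TFRecordDataset(tfrec_fn)
        if training:
            ds = ds.repeat().shuffle(10000)
        ds = ds.map(parser, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        ds = ds.batch(batch_size, drop_remainder=True)  # TPUs need fixed batch sizes
        return ds.prefetch(1)
    return input_fn

# e.g.: make_input_fn(test_fn, parser_test, batch_size=10000, training=False)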

Model training and prediction. We build and train the neural network model exactly as in Tutorial 2, using the fast DavidNet architecture. One difference here is that we have test data (with no labels) rather than validation data (with labels). So, we need to perform predictions instead of evaluation. To do this, we specify the prediction batch size in the TPUEstimator:

est = fw.train.get_tpu_estimator(steps_per_epoch, model_func,
    work_dir, trn_bs=BATCH_SIZE, val_bs=10000, pred_bs=10000)

Here, the prediction batch size pred_bs is set to 10k, which is an ad hoc choice. In general, this batch size can be any integer as long as the size of the test set, n_test, is a multiple of pred_bs.
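As a quick sanity check (not part of the original notebook): with 300,000 test images, pred_bs=10000 gives exactly 30 prediction batches, and a simple assertion catches a bad choice early:

assert n_test % 10000 == 0, 'pred_bs must evenly divide the number of test images'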

After completing model training, we do prediction with our model, as follows.

y_preds = []
test_ids = []
for i, pred in enumerate(est.predict(test_input_func)):
    y_preds.append(pred['y_pred'])
    fn = os.path.basename(path_test[i])
    test_ids.append(fn[:-4])

In the above code, test_ids saves the IDs of the test images, whose order is given by path_test, obtained when we created the test TFRecords data file. The expression fn[:-4] removes the “.png” extension from the file name, leaving only the test image ID.
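For example, '123456.png'[:-4] yields '123456'. An equivalent, slightly more explicit alternative is os.path.splitext:

import os

fn = '123456.png'               # a typical test file name
print(fn[:-4])                  # -> '123456'
print(os.path.splitext(fn)[0])  # same result, independent of extension length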

Finally, we create the submission file from the predictions, and submit to Kaggle:

d = {'id': test_ids, 'label': [labels[y] for y in y_preds]}
df = pd.DataFrame(data=d)
df.to_csv('submission.csv', index=False)
!kaggle competitions submit -c cifar-10 -f submission.csv -m 'Fenwicks Tutorial 3'
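The resulting submission.csv has an id column and a label column holding the predicted class names. A few illustrative rows might look like this (the actual ids and labels depend on your predictions):

id,label
1,cat
2,ship
3,airplane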

We can check the score of this submission by reviewing all submissions:

!kaggle competitions submissions -c cifar-10

That’s it — we have participated in a Kaggle competition, though an old one held 5 years ago. Here’s the complete Jupyter notebook:

All tutorials:
