This tutorial walks through creating and submitting predictions to Kaggle for the 2nd Annual Women in Data Science (WiDS) Datathon, which asks participants to detect the presence of oil palm plantations in satellite imagery. For an introduction to the datathon challenge and how to participate, check out the competition on Kaggle and this blog post.
The WiDS Datathon is hosted on the Kaggle platform. In order to enter the datathon competition, you will need to submit your model predictions online. Kaggle will then score your predictions and place your team on the competition leaderboard. More information on what the leaderboard represents and how Kaggle scores multiple submissions can be found in these Kaggle docs.
This tutorial picks up after you have completed these steps:
- Create a Kaggle account (https://www.kaggle.com/).
- Download the Planet Oil Palm Satellite Imagery Data, which consists of a training set and a test set. Each image spans 256 x 256 pixels and has a binary label (0 or 1) that denotes whether it captures any oil palm plantations.
- Install your programming language and tools of choice for creating machine learning models.
Once you have the datasets downloaded and your machine learning tools set up, the remaining steps are to:
- Load and process the satellite imagery to be input into algorithms of your choice.
- Optional: apply data augmentation or unsupervised methods to the images to help improve algorithm performance.
- Train an algorithm to classify whether an image captures oil palm plantations on the training dataset.
- Apply the algorithm to the test dataset and obtain predictions.
- Upload the predictions to Kaggle and see your score!
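As an illustration of the optional augmentation step, here is a minimal sketch that uses NumPy (an assumption; any array library would do) to produce flipped and rotated copies of an image. It relies on the further assumption that overhead imagery has no preferred orientation, so the label stays the same after a flip or rotation. The `augment` helper and the random stand-in image are hypothetical and are not part of the competition materials.

```python
import numpy as np


def augment(image):
    """Return the original image plus flipped and rotated copies."""
    return [
        image,
        np.fliplr(image),      # mirror left-right
        np.flipud(image),      # mirror top-bottom
        np.rot90(image, k=1),  # rotate 90 degrees counterclockwise
        np.rot90(image, k=2),  # rotate 180 degrees
        np.rot90(image, k=3),  # rotate 270 degrees counterclockwise
    ]


# Random array standing in for one 256 x 256 RGB satellite image.
rng = np.random.default_rng(seed=0)
image = rng.random((256, 256, 3))

augmented = augment(image)
print(len(augmented), augmented[1].shape)
```

Because the images are square, every copy keeps the same shape as the original, so the copies can be added to the training set with the original image's label.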
Load and Process the Dataset
The dataset of Planet imagery consists of 15,244 training .jpg images and 6,534 test .jpg images, all taken over the island of Borneo, where oil palm production is a major industry. For the training set, there is a corresponding “traininglabels.csv” file that lists the images by identification number (image_id) and gives their corresponding binary label (has_oilpalm): 0 for no oil palm plantation and 1 for presence of oil palm plantation. A third column (score) gives a confidence score based on how much the crowdsourced annotators agree on the binary label (1 denotes full agreement).
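One way to read the labels file is sketched below with Python's standard csv module. The column names come from the description above. The `read_labels` helper, the `min_score` cutoff, and the sample rows (including the file-name format of the ids) are made up for illustration.

```python
import csv
import io
from collections import Counter


def read_labels(csv_file, min_score=0.0):
    """Return {image_id: label} for rows whose score is at least min_score."""
    labels = {}
    for row in csv.DictReader(csv_file):
        if float(row["score"]) >= min_score:
            labels[row["image_id"]] = int(row["has_oilpalm"])
    return labels


# Tiny made-up sample in the same column layout as traininglabels.csv.
sample = io.StringIO(
    "image_id,has_oilpalm,score\n"
    "img_0001.jpg,0,1.0\n"
    "img_0002.jpg,1,0.62\n"
    "img_0003.jpg,1,1.0\n"
)

labels = read_labels(sample, min_score=0.75)
class_counts = Counter(labels.values())
print(class_counts)
```

For the real file, you would pass an open file handle instead of the in-memory sample, for example `open("traininglabels.csv", newline="")`. Whether to drop low-agreement rows at all is a modeling choice; the 0.75 cutoff here is arbitrary.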
Some useful modules and packages for loading and manipulating imagery include PIL, OpenCV, imageio, and matplotlib.image for Python, and magick, OpenImageR, and jpeg for R. Since each image is a 256 x 256 RGB .jpg file, you can represent it as a 256 x 256 x 3 array. The final representation of the images will depend on what image processing steps and classification algorithm you choose. We will cover image analysis and machine learning algorithms in separate blog posts. (Make sure you’re signed up for the Community Mailing List to get updates, and visit widsconference.org/datathon throughout the competition!)
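As a minimal sketch of the array representation, assuming PIL (via the Pillow package) and NumPy are installed, the following loads a JPEG into a 256 x 256 x 3 array. The `load_image` helper and the scaling to the range 0 to 1 are illustrative choices, and a synthetic in-memory image stands in for a real competition file.

```python
import io

import numpy as np
from PIL import Image


def load_image(source):
    """Load an image (path or file-like object) as a float32 RGB array in [0, 1]."""
    with Image.open(source) as img:
        rgb = img.convert("RGB")  # guarantees three channels
        pixels = np.asarray(rgb, dtype=np.float32) / 255.0
    return pixels


# Stand-in for a real competition image: a synthetic 256 x 256 JPEG in memory.
buffer = io.BytesIO()
Image.new("RGB", (256, 256), color=(30, 120, 60)).save(buffer, format="JPEG")
buffer.seek(0)

image = load_image(buffer)
print(image.shape, image.dtype)
```

With the downloaded data, `load_image` would take the path to one of the .jpg files instead of the buffer.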
Generate and Upload Predictions
We won’t discuss specific machine learning algorithms in this post except to mention that frameworks and modules like sklearn, TensorFlow, and PyTorch in Python and e1071, rpart, nnet, and Keras in R are excellent places to look for algorithms that have been implemented.
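Whichever library you pick, the general pattern is to fit a model on the training data and then ask it for class probabilities on the test data. The sketch below shows that pattern with sklearn's LogisticRegression on random stand-in data. It is only meant to show the shape of the calls and is not a recommended model for this competition; the feature size of 48 and all variable names are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random stand-in data: 200 "training images" and 50 "test images",
# each already reduced to a 48-value feature vector (an arbitrary choice).
rng = np.random.default_rng(seed=0)
X_train = rng.random((200, 48))
y_train = rng.integers(0, 2, size=200)  # binary labels, as in has_oilpalm
X_test = rng.random((50, 48))

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is P(label == 1).
test_probabilities = model.predict_proba(X_test)[:, 1]
print(test_probabilities[:5])
```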
Once you’ve trained your classification algorithm on the training set, you’ll want to generate a “testpredictions.csv” file similar to “traininglabels.csv” that stores your algorithm’s predicted probability that each test image contains an oil palm plantation. Note that the columns must be named “image_id” and “has_oilpalm”, and you’ll want to include a prediction for each of the 6,534 test images. Rather than a hard binary classification of 0 or 1, each “has_oilpalm” value should be a probability between 0 and 1.
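A minimal sketch of writing such a file with Python's standard csv module follows. The two column names are the required ones described above. The `write_predictions` helper, its range check, and the sample ids and probabilities are made up for illustration.

```python
import csv
import io


def write_predictions(csv_file, predictions):
    """Write (image_id, probability) pairs in the two-column submission layout."""
    writer = csv.writer(csv_file)
    writer.writerow(["image_id", "has_oilpalm"])
    for image_id, probability in predictions:
        if not 0.0 <= probability <= 1.0:
            raise ValueError(f"{image_id}: {probability} is not in [0, 1]")
        writer.writerow([image_id, probability])


# Made-up predictions; a real file would have one row per test image.
predictions = [("img_9001.jpg", 0.03), ("img_9002.jpg", 0.87)]

output = io.StringIO()
write_predictions(output, predictions)
print(output.getvalue())
```

To produce the actual file, you would pass `open("testpredictions.csv", "w", newline="")` in place of the in-memory buffer. Before uploading, it is worth checking that the number of data rows matches the number of test images.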
When you’re happy with your predictions, navigate over to the WiDS Datathon Competition on Kaggle and click the “Submit Predictions” link. Choose your predictions file from the “Upload Files” option, and add a comment to describe this submission if you’d like to remember what you changed between submissions. Then hit “Make Submission”!
Kaggle will compute the AUC (area under the receiver operating characteristic curve) score for each of your submissions, and your place on the leaderboard will be based on this score. To learn more about AUC, take a look at this Kaggle post.
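To estimate AUC yourself before submitting, you can hold out part of the training set and score your predictions on it. The sketch below uses the standard pairwise definition of AUC: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counting half. It assumes Kaggle's scoring follows this standard definition. The `auc_score` function and the sample values are illustrative.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # ties count as half a correct pair
    return correct / (len(positives) * len(negatives))


# Made-up validation labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(auc_score(y_true, y_prob))  # 0.75
```

Note that AUC depends only on how the predictions are ordered, not on their absolute values. The nested loop is quadratic in the number of examples, so for large validation sets a library routine such as sklearn's `roc_auc_score` is a better fit.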
And that’s it! Now what’s left is the fun part: building the best model you can. Good luck and happy modeling to everyone.