How to Set Up a Computationally Limitless Kaggle Environment for (Virtually) Free in 20 Minutes

Tanner
May 20, 2019

In more ways than one, I was taught data science, or more accurately taught myself, starting 100 feet off the ground. More recently, I’ve filled in the gaps so that I can run a project of massive scale, virtually for free, from my old personal MacBook. Key to “free” is that Google offers a $300 free-trial credit, which goes a long way, and even without that credit it offers many computing and storage services at no cost.

The instructions and videos I’ve consumed tend to describe solutions in isolation rather than as part of a full environment. Since I’m not a software engineer or DevOps professional, I won’t skip any important details (e.g., which GCP environment should I spin up, and where do I store my projects? How do I use SSH? How can I move files between the remote GCP instance and my personal computer?). In this post, I will walk through the steps of spinning up your virtual environment to build shareable, repeatable code for your large personal or Kaggle projects.

1. Create G Cloud Project and Install G Cloud on Local MacBook

Your Gmail account of course lets you do a lot more than access email, including accessing Google Cloud Platform (GCP). Do a Google search for “Google Cloud Console” (or go straight to https://console.cloud.google.com/compute/). Then find the sidebar on the left with a “Compute” section. Select Compute Engine -> VM Instances, which will be your home for a lot of this setup.

Google allows you to spin up a free f1-micro instance (a shared virtual CPU with 0.6 GB of memory) 24x7 for at least one month. (Later, we’ll see they also give you 30 GB of persistent storage space.)

Create an instance with the f1-micro machine type and change the “boot disk” to one of the Ubuntu images (16.04 LTS will work fine).
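If you prefer the command line, the same instance can also be created with the gcloud CLI once the Google Cloud SDK is installed (covered next). This is a minimal sketch; the instance name "my-kaggle-box" and the zone are placeholders of my own choosing:

# Create a free-tier f1-micro instance with an Ubuntu 16.04 boot disk
# (instance name and zone below are placeholders; pick your own)
gcloud compute instances create my-kaggle-box \
    --machine-type=f1-micro \
    --zone=us-east1-b \
    --image-family=ubuntu-1604-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=30GB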

One of the only things I won’t cover: please install the Google Cloud SDK on your MacBook. Make sure Python is also installed, but don’t worry about running any core gcloud commands yet: https://cloud.google.com/sdk/docs/quickstart-macos

G Cloud has a lot of options. We want VM instances.

2. Run SSH on Local and Connect to G Cloud

Your local MacBook can already generate SSH key pairs that allow you to securely connect to a remote server.

In your local terminal, run each line separately (use a complex passphrase, since your remote connection will be open to the Internet):

# Generate a new RSA key pair (you'll be prompted for a passphrase)
ssh-keygen -t rsa -f ~/.ssh/first_rsa -C first_project
# Open the public half of the key so you can copy it
cd ~/.ssh/
vi first_rsa.pub

To exit vi: press Esc, then type :wq and hit Enter. (Memorize these keystrokes; this seemingly random set of characters exits most full-screen terminal programs.)

Now copy and paste everything in the file (including the whitespace, but not the vertical column of ~ characters; oddly, I’ve found it only works when I begin selecting from the bottom).
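If the select-and-copy dance in vi feels fiddly, an alternative on macOS is to pipe the public key straight to the clipboard. A small sketch, assuming the key path generated above:

# Copy the public key to the macOS clipboard in one step
cat ~/.ssh/first_rsa.pub | pbcopy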

Revisit your GCP instance and click on it to access the VM instance details. Click Edit, and paste your SSH key into GCP. Be sure to check off Allow HTTP and Allow HTTPS access as well.

Finally, save and return to VM instances. Click the Connect dropdown and select “View gcloud command.” Copy it into your local terminal, and you should now be connected to your G Cloud instance.
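The command you copy will look roughly like the sketch below; the instance name and zone are the same placeholders I used earlier, and yours will differ:

# Connect to the instance over SSH using the Cloud SDK's wrapper
gcloud compute ssh my-kaggle-box --zone=us-east1-b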

Understanding how SSH works, even at a basic level, is invaluable when implementing your models with engineers or spinning up your own computing instances in the cloud.

3. Install Miniconda with Jupyter and Spyder in the Terminal, and FileZilla on MacBook

Again, run each line in your G Cloud terminal separately (installation will require hitting “Enter” a lot, or holding it down).

# Download and run the Miniconda installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Reload your shell configuration so conda is on the PATH
source ~/.bashrc
# Installing these packages will take ~5 minutes
# (the jupyter package already includes the notebook server)
conda install pandas jupyter ipython
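To sanity-check the installation before moving on, a quick, non-destructive verification sketch:

# Confirm conda and Jupyter are on the PATH and pandas is importable
conda --version
jupyter --version
python -c "import pandas; print(pandas.__version__)"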

Anyone can learn enough to be dangerous with a Linux terminal and git commands in a single night. We won’t run these commands in this step, but learning them can begin to shed light on how applications are built and communicate with one another.

4. Stand up a Jupyter Notebook

Lots of data scientists like to present their work in Jupyter Notebooks (I use JupyterLab locally), but it’s also nice to use an IDE for debugging and viewing your objects and code holistically.

Connecting to your remote Google compute instance requires your remote terminal to “listen” for a connection.

Your Jupyter server (Anaconda notebooks) on the remote G Cloud instance will soon be accessible to anyone with the IP address and password, so pick a complex password to be safe.

# Authenticate your Gmail account
gcloud auth login
# Open port 8888 so Jupyter is reachable from outside.
# Substitute your own G Cloud project ID; this is one single command:
gcloud compute --project=YOUR-GCLOUD-PROJECT-ID firewall-rules create jupyter-rule --direction=INGRESS --priority=1000 --network=default --action=ALLOW --rules=tcp:8888 --source-ranges=0.0.0.0/0
# Generate a Jupyter configuration and set a secure password
jupyter notebook --generate-config
jupyter notebook password
# Host the Jupyter server (Ctrl+C, then y, quits it)
jupyter notebook --ip=0.0.0.0 --no-browser

Access Jupyter in your browser after copying your external IP address from the VM instances page (there is an option to create a static IP): http://your-external-IP:8888/
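If you want that static IP so the address survives restarts, one way is to promote the instance’s current external IP to a reserved static address. A sketch only; the address name, the IP, and the region are placeholders (the region must match your instance):

# Reserve the instance's existing ephemeral external IP as a static address
gcloud compute addresses create jupyter-ip --addresses=YOUR-EXTERNAL-IP --region=us-east1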

5. Remote Connect from local Spyder IDE

Install FileZilla on your local machine (https://filezilla-project.org/) and launch both FileZilla and your Spyder IDE.

Next, SFTP into your G Cloud instance.
Host: sftp://User@External-IP
Username: leave empty
Password: your SSH passphrase
Port: leave empty (default)

# On the G Cloud instance, install the kernel package Spyder connects to
conda install spyder-kernels
exit

Now exit and reconnect to your G Cloud instance to complete the spyder-kernels installation.

# Reveal the directory where the kernel .json files are written (copy this path somewhere!)
jupyter --runtime-dir
# Listen for a Spyder remote connection with this command:
python -m spyder_kernels.console
# Ctrl+Z suspends the process; don't use it yet!

We can’t run any commands while listening for a Spyder connection, and a unique .json file is produced. To connect, we must first SFTP into our G Cloud instance and download the kernel-****.json file shown in the terminal. You must download a fresh .json file each time you connect (I couldn’t get the --existing option to work).
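If you’d rather skip FileZilla for this step, the Cloud SDK’s scp wrapper can pull the file down directly. A sketch; the instance name, username, and kernel file name are placeholders, and the remote path is whatever `jupyter --runtime-dir` printed:

# Download the kernel connection file to your MacBook
gcloud compute scp my-kaggle-box:/home/your_user/.local/share/jupyter/runtime/kernel-12345.json ~/Downloads/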

Spyder Connection

Launch Spyder (via Anaconda), then Consoles -> Connect to an existing kernel.

Kernel ID/Connection file: browse to the kernel-*.json file you just downloaded to your local machine.
User: user@external-IP (from the G Cloud VM instances page; no “sftp://” prefix this time)
Password: your SSH passphrase
SSH Key: leave blank

While Jupyter notebooks are the latest trend for sharing data science work, sometimes it’s helpful to have an IDE for debugging. It’s also nice to have all of the bells and whistles of an IDE so you can look into functions, see all of your objects and data, and more. I like to begin in Spyder and then copy my work to Jupyter when I’m simplifying my code and describing my process.

6. Save your work: G Cloud Storage Bucket

Navigate the GCP sidebar to Storage -> Storage -> Browser, then Create Bucket (30 GB free).

# Create a file just to test. Write something, then Ctrl+O to save
# and Ctrl+X to quit nano.
nano test_file.ipynb
# Set up gsutil so it can talk to your storage bucket
pip install google-compute-engine
gsutil config
# If necessary, copy the resulting link into a browser and authorize
gsutil cp test_file.ipynb gs://practical-lodge-232.appspot.com
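Pulling work back out of the bucket, or backing up a whole folder, follows the same pattern. A sketch; the bucket and folder names below are placeholders:

# Copy a single file back down from the bucket
gsutil cp gs://your-bucket-name/test_file.ipynb .
# Or mirror an entire project folder up to the bucket
gsutil rsync -r ./my_project gs://your-bucket-name/my_project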

7. Install Kaggle and Download a Project

pip install kaggle
# The Kaggle CLI expects an API token at ~/.kaggle/kaggle.json
# (create one under your Kaggle account settings, upload it, then lock it down)
chmod 600 ~/.kaggle/kaggle.json
cd /gcp_storage_bucket_path/
# Download the competition's data (substitute the real competition name)
kaggle competitions download -c kaggle_project
# To submit:
kaggle competitions submit -c kaggle_project -f filename.csv -m "my submission"

Kaggle is one of my favorite places to learn the latest advances in data exploration and model building. Google bought Kaggle and has since added quite a few features to its kernels so that anyone can run and share their work. I’d recommend more computing power than their stock kernel if you’re planning to compete in the competitions, which will really accelerate your modeling abilities.

8. Back up code: GitHub

Make a new repository on your GitHub account. Next we’ll install git and make a backup folder of all code and data on your GCP instance.

# git is not a pip package; install it through the system package manager
sudo apt-get install git
cd /gcloud_storage_bucket_folder_path/
git clone https://github.com/tkulb/elo_modeling.git
# Pulls down the latest changes since the folder was "cloned" or last
# "pulled"
git pull

Edit your code in Spyder or Jupyter as you would. Save new data files. Git will notice anything new, modified, or deleted in the repository folder.
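To see exactly what git has noticed before you commit, a quick check:

# List new, modified, and deleted files in the working tree
git status
# Show the line-by-line changes to tracked files
git diff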

# -A adds all files. Exclude all but small data files;
# there's a 100 MB per-file limit on GitHub, but you could commit a sample.
git add <file>        # or: git add -A
# -m attaches a commit message; always include one,
# otherwise git drops you into an editor to write it
git commit -m "write your message here"
# Uploads "committed" changes to GitHub
git push
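One way to keep bulky data files out of the repository entirely, so `git add -A` stays under GitHub’s 100 MB per-file limit, is a .gitignore file. A sketch; the data/ folder name and file patterns are assumptions about your layout:

# Tell git to ignore large raw data files; commit only code and small samples
echo "data/*.csv" >> .gitignore
echo "data/*.zip" >> .gitignore
git add .gitignore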

Every data science team at my former employer promoted Git-style version control, but very few ever actually used it. Every serious data scientist should learn the basics of GitHub from the command line in order to collaborate with other data scientists, develop their best data science, and showcase their work.

Conclusion

There you have it. One of the best ways to improve or round out your data science skills is by participating in Kaggle competitions. You now have the tools to work on the biggest datasets from your oldest laptop. And you’ve learned about the Linux shell, GitHub, Kaggle, and installing Python dependencies in the process.

Once I burn through my G Cloud credit, I’ll replicate this process with AWS for a Kaggle competition and compare the experience. AWS offers 12 months on their free tier, and may also offer credits for faster computing resources.

Bonus: Virtual Environments

After finishing a Python project, allow someone to replicate it exactly by snapshotting the versions of everything in your “environment.”

Install packages in a virtual environment; this will not affect your system libraries or packages, and it allows someone else to easily create an identical environment if they have your requirements.txt file.

Different virtual environment tools:

Docker: the newest method of the bunch; it spins up an identical container to the instance you’ve built.

venv: we’ll use venv, a newer take on virtualenv that is built into Python 3 and stores all of the libraries and packages you’ve installed.

In the terminal run:

# Create the environment with the built-in venv module
# (add --system-site-packages to include the base packages as well)
python3 -m venv first_virtual_environment
# Activate it
source first_virtual_environment/bin/activate
# Example installation inside the environment
pip install pandas

Run all the Python you want now to create your project. Then export your dependencies so your environment can be easily replicated on a fresh instance.

# Export package dependencies
pip freeze > requirements.txt
# Return to the "base" environment
deactivate
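On a fresh instance, the same environment can then be rebuilt from that file. A minimal sketch, assuming Python 3 and the environment name used above:

# Recreate and activate the environment, then install the pinned dependencies
python3 -m venv first_virtual_environment
source first_virtual_environment/bin/activate
pip install -r requirements.txt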
