Chris Pavlou
May 31 · 10 min read

Today, at KubeCon Europe 2019, Arrikto announced the release of the new MiniKF, which features Kubeflow v0.5. The new MiniKF enables data scientists to run end-to-end Kubeflow Pipelines locally, starting from their Notebook.

This is great news for data scientists because, until now, there was no easy way to run an end-to-end KFP example on-prem; it required strong Kubernetes knowledge to get through several of the steps. Specifically, for Kubeflow’s standard Chicago Taxi (TFX) example, which is the one we will be presenting in this post, one had to:

  • Understand K8s and be familiar with kubectl
  • Understand and compose YAML files
  • Manually create PVCs via K8s
  • Mount a PVC to a container to fill it up with initial data

Using MiniKF and Arrikto’s Rok data management product (MiniKF comes with a free Rok license), we showcase how to streamline all of these operations, saving time and providing a much friendlier user experience. A data scientist starts from a Notebook, builds the pipeline, and uses Rok to take a snapshot of the local data they prepared, with the click of a button. Then, they can seed the Kubeflow Pipeline with this snapshot using only the UIs of KFP and Rok.

We showcase a simplified data science experience, which improves a user’s workflow, removing the need for even a hint of K8s knowledge. We also introduce the first steps towards a unified integration of Notebooks & Kubeflow Pipelines.

In a nutshell, this tutorial will highlight the following benefits of using MiniKF, Kubeflow, and Rok:

  • Easy execution of a local/on-prem Kubeflow Pipelines e2e example
  • Seamless Notebook and Kubeflow Pipelines integration with Rok
  • KFP workflow execution without K8s-specific knowledge

Kubeflow’s Chicago Taxi (TFX) example on-prem tutorial

Let’s put all the above together, and watch MiniKF, Kubeflow, and Rok in action.

One very popular data science example is the Taxi Cab (or Chicago Taxi) example, which predicts trips that result in tips greater than 20% of the fare. This example has already been ported to run as a Kubeflow Pipeline on GCP and is included in the corresponding KFP repository. We are going to showcase the Taxi Cab example running locally, using the new MiniKF, and demonstrate Rok’s integration as well. Follow the steps below and you will run an end-to-end Kubeflow Pipeline on your laptop!
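To make the prediction target concrete, here is a tiny, purely illustrative sketch of the label the example learns, written with pandas and hypothetical column names (“fare”, “tips”) in the spirit of the Chicago Taxi dataset:

import pandas as pd

# Toy data standing in for real taxi trips (hypothetical values).
trips = pd.DataFrame({"fare": [10.0, 42.5, 7.25], "tips": [3.0, 5.0, 2.0]})

# A trip counts as a positive example when the tip exceeds 20% of the fare.
trips["big_tip"] = (trips["tips"] > 0.2 * trips["fare"]).astype(int)
print(trips)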

Install MiniKF

Open a terminal and run:

vagrant init arrikto/minikf
vagrant up

Open a browser, go to 10.10.10.10 and follow the instructions to get Kubeflow and Rok up and running.

For more info about how to install MiniKF, visit the MiniKF page:

https://www.arrikto.com/minikf/

Create a Notebook Server

On the MiniKF landing page, click the “Connect” button next to Kubeflow to connect to the Kubeflow Dashboard:

Once at Kubeflow’s Dashboard, click on the “Notebooks” link on the left pane to go to the Notebook Manager:

You are now at the Kubeflow Notebook Manager, showing the list of Notebook Servers, which is currently empty. Click on “New Server” to create a new Notebook Server:

Enter a name for your new Notebook Server, and select Image, CPU, and RAM:

Add a new, empty Data Volume, for example of size 2GB, and name it “data” (you can give it any name you like, but then you will have to modify some commands in later steps):

Once you select all options, click “Spawn” to create the Notebook Server, and wait for it to get ready:

Once the Server is ready, the “Connect” button will become active (blue color). Click on the “Connect” button to connect to your new Notebook Server:

A new tab will open up with the JupyterLab landing page:

Bring in the Pipelines code and data

Create a new terminal in JupyterLab:

Bring in Arrikto’s pipeline code to run the Chicago Taxi Cab example on-prem. Run the following command in the terminal you just created:

wget https://raw.githubusercontent.com/arrikto/kubeflow-examples/kubecon-demo/taxi-cab-on-prem/tfx-chicago-taxi-pipeline-on-prem-arr.py
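If you are curious what such an on-prem pipeline looks like, the sketch below (not the actual contents of the downloaded file) shows the general pattern: the pipeline accepts a Rok URL as a parameter, creates a PVC seeded from that snapshot, and mounts it into a step. The “rok/origin” annotation key and all names here are assumptions for illustration; check the downloaded file for the real definitions.

import kfp.dsl as dsl

@dsl.pipeline(
    name="taxi-cab-on-prem-sketch",
    description="Seed a PVC from a Rok snapshot and consume it in a step.")
def taxi_pipeline(rok_url="", pvc_size="1Gi"):
    # Create a PVC whose contents Rok pre-populates from the snapshot URL.
    vop = dsl.VolumeOp(
        name="create-data-volume",
        resource_name="taxi-data",
        size=pvc_size,
        annotations={"rok/origin": rok_url})

    # Any pipeline step can then mount the seeded volume.
    dsl.ContainerOp(
        name="list-data",
        image="python:3.6",
        command=["ls", "/mnt/data/taxi-cab-classification"],
        pvolumes={"/mnt/data": vop.volume})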

To bring in the data, we will use the official Kubeflow pipelines repository. Clone the pipelines repository into your home directory:

git clone -b 0.1.25 https://github.com/kubeflow/pipelines

Find the corresponding input files that the original pipeline expects to find in Google Object Storage, and copy them into the root directory of the Data Volume (PVC):

cp -av pipelines/samples/tfx/taxi-cab-classification ~/data/

Note that you should have the taxi-cab-classification directory under the data directory, not just the files. You now have a local Data Volume populated with the data the pipeline code needs.

Compile the pipeline

To compile the pipeline, run:

dsl-compile --py tfx-chicago-taxi-pipeline-on-prem-arr.py --output tfx-chicago-taxi-pipeline-on-prem-arr.tar.gz

After successful compilation, you will see the file containing the compiled pipeline popping up on JupyterLab’s left pane:
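The dsl-compile CLI is a thin wrapper around the KFP SDK compiler, so if you prefer to stay inside a notebook you can compile from Python instead. The sketch below assumes the downloaded file defines a pipeline function called taxi_pipeline; check the file for the actual function name.

import importlib.util
import kfp.compiler as compiler

# Load the downloaded file by path; its hyphenated filename cannot be
# imported with a plain "import" statement.
spec = importlib.util.spec_from_file_location(
    "taxi_pipeline_module", "tfx-chicago-taxi-pipeline-on-prem-arr.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

# Compile the @dsl.pipeline function into the same .tar.gz that dsl-compile
# produces. "taxi_pipeline" is a placeholder for the file's real function name.
compiler.Compiler().compile(
    module.taxi_pipeline,
    "tfx-chicago-taxi-pipeline-on-prem-arr.tar.gz")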

Snapshot the Data Volume

In later steps, the pipeline is going to need the data that we brought into the Data Volume previously. For this reason, we need a snapshot of the Data Volume. As a best practice, we will snapshot the whole JupyterLab, and not just the Data Volume, in case the user wants to go back and reproduce their work.

We will use Rok, which is already included in MiniKF, to snapshot the JupyterLab. Go to the MiniKF landing page and click the “Connect” button next to Rok, to open the Rok UI:

This will open the Rok login page:

Log in using the following credentials:

Username: user
Password: 12341234

This is the Rok UI landing page:

Create a new bucket to host the new snapshot. Click on the “+” button on the top left:

A dialog will appear asking for a bucket name. Give it a name and click “Next”. We will keep the bucket “Local” for this demo:

Clicking “Next” will result in a new, empty bucket appearing on the landing page. Click on the bucket to go inside:

Once inside the bucket, click on the Camera button to take a new snapshot:

By clicking the Camera button, a dialog appears asking for the K8s resource that we want to snapshot. Choose the whole “JupyterLab” option, not just the single Data Volume (“Dataset”):

Most fields will be pre-filled with values automatically by Rok, for convenience. Select your JupyterLab from the dropdown list:

Provide a commit title and a commit message for this snapshot. This is to help you identify the snapshot version in the future, the same way you would do with your code commits in Git:

Then, choose a name for your snapshot:

Take the snapshot, by clicking the “Snapshot” button:

Once the operation completes, you will have a snapshot of your whole JupyterLab. This means you have a snapshot of the Workspace Volume and a snapshot of the Data Volume, along with all the corresponding JupyterLab metadata to recreate the environment with a single click. The snapshot appears as a file inside your new bucket. Expanding the file will let you see the snapshot of the Workspace Volume and the snapshot of the Data Volume:

Now that we have both the pipeline compiled and a snapshot of the Data Volume, let’s move on to run the pipeline and seed it with the data we prepared.

Upload the Pipeline to KFP

Before uploading the pipeline to KFP, we first need to download the compiled pipeline to our laptop. Go back to your JupyterLab, right-click the file on JupyterLab’s left pane, and click “Download”:

Once the file is downloaded to your laptop, go to Kubeflow Dashboard and open the “Pipelines Dashboard”:

This will take you to the Kubeflow Pipelines UI:

Click “+ Upload pipeline”:

In the pop-up window, choose the .tar.gz file you downloaded locally and give a name to your new pipeline. Then click “Upload”:

The pipeline should get uploaded successfully and will now appear on the pipelines list with the defined name:

Click on it to view all the pipeline steps:
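As a side note, the upload can also be done from a notebook with the KFP SDK client instead of the UI. This is a rough sketch; the host value is an assumption (from a notebook running inside the cluster, kfp.Client() with no arguments may also work):

import kfp

# Point the client at the Pipelines API; adjust the host for your setup.
client = kfp.Client(host="http://ml-pipeline.kubeflow.svc.cluster.local:8888")

# Upload the compiled package; it will then appear in the Pipelines UI.
client.upload_pipeline("tfx-chicago-taxi-pipeline-on-prem-arr.tar.gz")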

Create a new Experiment Run

Create a new Experiment by clicking “+ Create experiment”:

Choose an Experiment name, and click “Next”:

By clicking “Next”, the KFP UI sends you to the “Start a new run” page, where you are going to create a new Run for this Experiment. Enter a name for this Run (note that the Pipeline is already selected; if this is not the case, just select the uploaded Pipeline):

Note that the Pipeline’s parameters show up:

Seed the Pipeline with the Notebook’s Data Volume

In the “rok-url” parameter we need to specify the snapshot of the Notebook Server’s Data Volume that we created in a previous step:

Switch back to the Rok UI tab, and copy the Data Volume’s Rok URL. To do so, expand the file inside the bucket, find its Data Volume, and click the “Copy file link” button next to the Data Volume entry. Make sure that you have copied the Data Volume’s Rok URL, and not the Rok URL of the whole JupyterLab or the Workspace Volume:

Paste the Rok URL in the “rok-url” parameter:

Run the Pipeline

Now that we have defined the data to seed the pipeline, we can run it. Leave all other parameters as they are, and click “Start”:
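For reference, the Experiment and Run can also be created programmatically with the KFP SDK. This is a sketch under the same assumptions as before; the params key mirrors the “rok-url” parameter shown in the UI, and the URL placeholder is where you would paste the value copied from the Rok UI:

import kfp

client = kfp.Client()  # assumes a notebook running inside the MiniKF cluster

# Paste the Data Volume's Rok URL copied from the Rok UI.
rok_url = "<the Data Volume's Rok URL>"

experiment = client.create_experiment("taxi-cab-on-prem")
client.run_pipeline(
    experiment.id,
    job_name="taxi-cab-run-1",
    pipeline_package_path="tfx-chicago-taxi-pipeline-on-prem-arr.tar.gz",
    params={"rok-url": rok_url})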

Now, click on the Run you created to view the progress of the Pipeline:

As the Pipeline runs, we see the various steps running and completing successfully. The Pipeline is going to take 10 to 30 minutes to complete, depending on your laptop’s specs. During the first step, Rok will instantly clone the snapshot we provided so that the next steps can use it:

The training step is going to take a few minutes:

You can also watch the video of this tutorial:
