Effortlessly Execute JupyterLab Cells in the Cloud

Jagane Sundar
InfinStor
Dec 14, 2020

There are many occasions when Python code in a JupyterLab notebook needs to be run in the cloud. InfinStor offers the easiest way to run Python code authored in JupyterLab in the cloud.

InfinStor Starter Functionality

InfinStor Starter captures Python code from a JupyterLab cell, combines it with an execution environment, such as Conda or Docker, and saves it in the cloud as an InfinStor Transform.

Subsequently, InfinStor transforms can be run in a Cloud instance of your choice. InfinStor can also help you choose the cheapest cloud.

Additionally, InfinStor transforms may be scheduled for periodic runs in the cloud. Periodic runs can be scheduled hourly, weekly, monthly, etc. There is no need to create Airflow DAGs or configure any other job scheduler for this functionality — it is available straight from the InfinStor JupyterLab sidebar GUI.

Finally, all such transform executions are recorded in the included InfinStor MLflow service.

All of the functionality of InfinStor is accessed through the primary InfinStor GUI, which is the InfinStor JupyterLab sidebar. There is no need to use a command line interface or to author XML/JSON/YAML configuration files.

Capture JupyterLab Cell and Save as InfinStor Transform

The first step in remotely executing Python code from a cell in your JupyterLab notebook is to capture it, along with its execution environment, and save it in your InfinStor account as an InfinStor Transform.

JupyterLab Cell Python Code

The Python code in your JupyterLab cell cannot rely on reading data from the local file system. The transform may execute in a variety of environments: in an IPython kernel on your JupyterLab server machine, in a single VM in the cloud, or on a cluster of machines in the cloud. For this reason, the input data for this code should be stored somewhere in the cloud, preferably in a Cloud Object Store.
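
For example, instead of reading a CSV file from the local disk, the cell would read it from an object store such as S3. The following is a minimal sketch, assuming the boto3 and pandas packages are available in the execution environment; the bucket and key names are purely illustrative and not part of InfinStor:

import boto3
import pandas as pd
from io import BytesIO

# Hypothetical bucket and key, used only for illustration
BUCKET = 'my-company-data'
KEY = 'datasets/train.csv'

# Read the object from S3 rather than the local file system, so the
# code behaves the same wherever the transform runs
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=BUCKET, Key=KEY)
df = pd.read_csv(BytesIO(obj['Body'].read()))
print(df.head())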

Capture Transform

In the following screen capture example, the cell contains a single line of Python code:

print('hello world')

Select this JupyterLab cell and press the ‘Capture’ button in ‘Transforms->Develop’.

Provide an intuitive name for the transform. In this example, we use ‘helloworld’.

Next, you are presented with the transform environment options: Conda, Docker, or ‘Copy From Existing Transform’.

In this example, we choose the option ‘Conda Environment’. The system now lists the Conda environments available on the JupyterLab server machine.

We choose the base environment and capture the transform.

Run Transform Immediately in the Cloud

To run a previously captured transform in the cloud, press the ‘Select’ button in ‘Transforms->Run’.

There are four choices for Input Data in the dialog that pops up.

  1. InfinSnap — Snapshot of the state of a bucket at a specific point in time. This feature is available in InfinStor Premium and above.
  2. InfinSlice — Slice of data that was ingested between a start time and an end time. This feature is available in InfinStor Premium and above.
  3. No Input Data — This is useful for transforms that perform their own I/O, perhaps reading from the Internet, non-InfinSnap-enabled buckets, etc. This feature is available in all editions of InfinStor.
  4. MLflow Artifact — Artifacts from a previous MLflow run can be used as Input Data for this transform execution.

In our simple ‘Hello world’ example, we do not need any input data for the transform, so we choose ‘No Input Data’ and continue. Next, we choose the transform to run. The transform search box is useful for filtering transforms. In this example, we choose the previously captured transform ‘helloworld’.

The next dialog prompts for kwargs to be passed into the transform during execution. We don’t have any kwargs in this example.
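
For readers new to the term, kwargs are simply Python keyword arguments. The sketch below is generic Python, not InfinStor's actual transform interface; it only illustrates what passing kwargs into a piece of code means:

# Generic Python illustration of keyword arguments (kwargs);
# this is not InfinStor-specific code
def my_transform(**kwargs):
    # e.g. kwargs = {'learning_rate': 0.01, 'epochs': 5}
    learning_rate = kwargs.get('learning_rate', 0.01)
    epochs = kwargs.get('epochs', 5)
    print(f'Running with learning_rate={learning_rate}, epochs={epochs}')

my_transform(learning_rate=0.05, epochs=10)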

Following the kwargs selection dialog, we are presented with the ‘Run Location’ dialog. This is where we can choose to run the transform inline in JupyterLab or in the cloud on a single VM; the final choice, AWS EMR Cluster, is for running the transform in a distributed manner. We choose ‘Single Virtual Machine’.

Following the choice of Single Virtual Machine, we choose the instance type and cloud.

When we continue after choosing the instance type, a new cell is injected into the current JupyterLab notebook. This cell contains the code that, when executed, runs the transform on the chosen Single Virtual Machine.

Hit the ‘Run’ button in the notebook toolbar, and the job will be submitted to the InfinStor system for execution in the Cloud.

There are two things to note in the above screen capture. The first line indicates that the MLFLOW_TRACKING_URI environment variable is set to infinstor://infinstor/, which sets the destination for the MLflow tracking client. The second line indicates that this particular run does not have any cached input from a prior run, so the transform will indeed be executed.

Tracking the remote execution of a transform is accomplished using MLflow. If you press the ‘Open MLflow’ button in the MLflow section of the sidebar, a new tab opens with the MLflow UI. The most recent transform run is listed as the first row in the table. This experiment run contains details such as the stderr/stdout output, artifacts, etc.
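
If you prefer to inspect runs programmatically instead of through the UI, the standard MLflow client can be pointed at the same tracking service. Below is a minimal sketch; it assumes the InfinStor MLflow plugin is installed so that the infinstor:// tracking URI scheme resolves, and the experiment ID '0' is purely illustrative:

import mlflow
from mlflow.tracking import MlflowClient

# Point the MLflow client at the InfinStor tracking service (the same
# URI that the injected notebook cell sets via MLFLOW_TRACKING_URI)
mlflow.set_tracking_uri('infinstor://infinstor/')
client = MlflowClient()

# Fetch the most recent run; experiment ID '0' is illustrative
runs = client.search_runs(experiment_ids=['0'],
                          order_by=['attributes.start_time DESC'],
                          max_results=1)
if runs:
    run = runs[0]
    print('Run ID:', run.info.run_id, 'Status:', run.info.status)
    # List the artifacts recorded for the run (e.g. captured stderr/stdout)
    for artifact in client.list_artifacts(run.info.run_id):
        print('Artifact:', artifact.path)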

Run Transform Periodically (on a schedule) in the Cloud

Once a transform has been captured and its basic functionality tested, it can be run periodically in the Cloud.

Click on the ‘Create’ button in ‘Transforms->Periodic Run’.

In the first dialog, name the periodic run; we use ‘helloworld’ here. The periodicity of the run is also chosen here. Our choice is every hour at the 0th minute mark. The other scheduling options supported make this similar to *nix cron.

The next screen allows for the choice of ‘Input Data’. The options available are similar to those in ‘Run Transform’; however, they mean something slightly different for Periodic Runs.

  1. InfinSnap — Snapshot of the chosen bucket/path at the time the run triggers.
  2. InfinSlice — A percentage slice of the data ingested between the previous trigger and the current trigger. Allowed percentage values are 10%, 25%, 50%, and 100% of the data ingested in that interval.
  3. No Input Data

We use ‘No Input Data’ for this trivial example.

Next, we choose the transform. The options are the same as the ones presented in ‘Run Transform’. We choose the ‘helloworld’ transform for our example.

The next dialog provides the opportunity to pass positional arguments and kwargs to the transform.

The run location is chosen in the next dialog; the options include VM instance types from all supported and configured clouds.

Finally, the Periodic Run is scheduled. If we come back later and view the MLflow UI, we can see that the Periodic Run ‘helloworld’ has triggered every hour.
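
To verify the hourly cadence programmatically, the fluent MLflow search API can list recent run start times. Again, this is only a sketch; the tracking URI comes from the earlier example and the experiment ID is illustrative:

import mlflow

mlflow.set_tracking_uri('infinstor://infinstor/')

# Returns a pandas DataFrame of runs, most recent first
runs_df = mlflow.search_runs(experiment_ids=['0'],
                             order_by=['attributes.start_time DESC'])

# Each periodic trigger should appear roughly one hour apart
print(runs_df[['run_id', 'start_time', 'status']].head(10))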

InfinStor Editions

InfinStor is sold as a subscription Software-as-a-Service offering through the AWS Marketplace. Three editions are available.

1. InfinStor Starter, with MLflow — Free

  • Capture transforms from JupyterLab using the InfinStor sidebar and execute these transforms in the cloud of your choice
  • Schedule transforms for periodic runs in the cloud
  • Full MLflow functionality, as a managed service in the cloud

2. InfinStor Premium — 90-day free trial, then $4 per Terabyte per month

  • Adds InfinSnap, the ability to snapshot Cloud Object Stores, and InfinSlice, the ability to read slices of data ingested between a start time and an end time

3. InfinStor Enterprise — Contact sales@infinstor.com

  • All the functionality of InfinStor Premium, installed and operated in your own cloud account.

