Navigating the Data Science Abyss: Working in an Air-Gapped Environment
Data science in air-gapped environments does not need to be an impossible mission. Explore strategies to enhance comfort in sensitive data environments.
Managing, storing, and processing strategic and sensitive data presents formidable security challenges for organizations. Companies deploy diverse strategies to safeguard their data workflows and ML models from known threats.
With the internet being a prevalent avenue for potential breaches, isolating the environment from online vulnerabilities emerges as a compelling defence mechanism. However, the obvious security benefits of “air-gapping” the work environment come with nuanced considerations, impacting the workflows of essential teams, such as the data science team. Navigating these considerations demands a specialized approach to address the day-to-day challenges in this fortified work environment seamlessly.
This article follows a recently onboarded data scientist who, entering an air-gapped environment for the first time, turns their workday into a productive experience within the company.
For this demo, I used machines running Ubuntu 23.04. You can simulate a simplified air-gapped environment on any public or private cloud (AWS, Azure, GCP, or on-premise OpenStack) by configuring networking to block all inbound and outbound internet traffic apart from the SSH communication port, or by using local VMs with Multipass.
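A minimal sketch of such a simulation with Multipass might look like the following. The VM name, resource sizes, and the iptables rules are illustrative assumptions, not part of the original setup:

```shell
# Launch an Ubuntu VM with Multipass (name and sizes are examples)
multipass launch 23.04 --name airgap-node --cpus 2 --memory 4G

# Inside the VM, drop all outbound traffic except SSH replies and
# loopback, so package downloads fail just like in a real air gap
multipass exec airgap-node -- bash -c '
  sudo iptables -A OUTPUT -o lo -j ACCEPT
  sudo iptables -A OUTPUT -p tcp --sport 22 -j ACCEPT
  sudo iptables -P OUTPUT DROP
'
```

After this, `multipass shell airgap-node` still works over SSH, while `curl` to any external host fails, mimicking the behaviour shown later in the article.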
Computing infrastructure
Depending on the company infrastructure, different environment configurations can be expected. Common examples are:
- A bare local machine with its local computing resources, optionally additional user-assigned network file system and network access to the internal company data storage:
- A distributed ML workflow platform like Kubeflow, MLflow or just JupyterHub, accessible over an internal address:
Organizations customize the infrastructure implementation to suit their standards and requirements so that the components setup may vary slightly, but the common denominator stays — no connection to the outside world.
Upon completing security checks, obtaining authentication credentials, and receiving the initial tasks (e.g., conducting an Exploratory Data Analysis (EDA) on internal data or fine-tuning a Large Language Model (LLM)), the next step involves logging into the working environment.
Creating an environment
As distributed computing platforms often offer ready-to-use data science workspaces, in the event of encountering a bare local machine or a VM setup, one would typically commence by setting up the environment manually.
In this case, the choice is to start with a conda environment and register it as a Jupyter notebook kernel:
# Download conda installer
$ curl -o Miniconda3-latest-Linux-x86_64.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Run conda installer
$ bash Miniconda3-latest-Linux-x86_64.sh -bf
# Delete conda installer
$ rm Miniconda3-latest-Linux-x86_64.sh
# Initialize conda
$ ~/miniconda3/bin/conda init
Unfortunately, as soon as curl is executed, a domain resolution error occurs:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl: (6) Could not resolve host: repo.anaconda.com
Using apt install [...], pip install [...], or npm install [...] gives the same result. Even trying to spawn a Jupyter container with docker run -it --rm -p 10000:8888 -v "${PWD}":/home/${USER}/task1 quay.io/jupyter/datascience-notebook fails for the same reason, and the missing connectivity can also become an issue when specifying custom images for platforms such as Kubeflow.
These error messages might appear intimidating, but fear not! Although the presented example confirms the lack of networking access to the outside world, there is a high chance of eventually installing these packages and running the environment. It just needs extra steps…
Installing from private source mirrors
Organizations operating in air-gapped environments have long recognized the limited access to open-source repositories of system packages as a significant obstacle to workflow performance. This challenge has also created a lucrative niche for companies specializing in security-certified private mirror servers, providing a strategic solution to the critical need for secure and controlled package distribution within such restricted settings.
Now, it is common for companies to implement private mirror servers in their infrastructure in different configurations, either as a single registry server provided by a trusted vendor or as multiple registries.
Depending on the security standards and freshness and availability requirements, the configuration often relies on a demilitarized zone (DMZ):
This setup allows access to the required packages over the specified address. Still, as the example has shown, the computing machines cannot associate the default registry addresses with the current workflow. This means the package managers have to be accordingly configured (the following registry examples contain hyperlinks to tutorials on how to create a local setup):
- apt requires appending the registry details to a file in the /etc/apt/sources.list.d/ directory:
$ echo "deb [arch=<your architecture> signed-by=this/is/optional_key.public] http://package.registry.internal:port/apt-repo stable main" | sudo tee /etc/apt/sources.list.d/<registry_name>.list
$ sudo apt update
$ sudo apt upgrade -y
# Reboot if kernel changed
$ sudo reboot
# Install npm package manager for jupyterhub
$ sudo apt install nodejs npm
- npm is straightforward to configure using the CLI:
$ npm config set registry http://package.registry.internal:port/optional_user
# Optionally login if required
$ npm login
# Install a package required to run jupyterhub
$ npm install -g configurable-http-proxy
- pip, like npm, is configurable over the CLI at both user and global levels:
$ pip config --user set global.index http://package.registry.internal:port/optional_user/pypi
$ pip config --user set global.index-url http://<optional username>:<and password>@package.registry.internal:port/optional_user/
$ pip config --user set global.trusted-host package.registry.internal
# Install jupyterhub and notebook packages
$ pip install jupyterlab notebook
After successfully running the commands above, the command jupyterhub should start the local JupyterHub service. Opening http://localhost:8000 in the browser, or http://<vm_address>:8000 when using VMs (make sure security rules allow ingress on that port, or use SSH port forwarding), should allow the use of JupyterHub on the machines.
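The SSH port forwarding mentioned above can be sketched as follows; `<vm_address>` and the key path are placeholders:

```shell
# Forward JupyterHub's port 8000 from the air-gapped VM to the
# local workstation; keep the session open while working
ssh -N -L 8000:localhost:8000 ubuntu@<vm_address> -i ~/.ssh/id_ed25519
# JupyterHub is now reachable locally at http://localhost:8000
```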
- docker (OCI registry) can resolve locally hosted registries when the full image name is provided:
# Optional when secured:
$ docker login package.registry.internal:<port>
$ docker run -it --rm -p 10000:8888 -v "${PWD}":/home/${USER}/task1 package.registry.internal:<port>/jupyter/datascience-notebook
Another way is to implement a persistent change on the client by appending the registry mirror to /etc/docker/daemon.json:
{
"registry-mirrors": ["http://package.registry.internal:<port>"]
}
Bear in mind that if running on a K8s cluster like MicroK8s, the private registry address has to be configured on the cluster as well so that the given distributed computing platform can access it.
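For MicroK8s specifically, a sketch of that registry configuration could look like this. The registry name and port are placeholders; the file layout follows MicroK8s' containerd `hosts.toml` convention:

```shell
# Point MicroK8s' containerd at the private registry
REGISTRY=package.registry.internal:5000
sudo mkdir -p /var/snap/microk8s/current/args/certs.d/${REGISTRY}
cat <<EOF | sudo tee /var/snap/microk8s/current/args/certs.d/${REGISTRY}/hosts.toml
server = "http://${REGISTRY}"

[host."http://${REGISTRY}"]
capabilities = ["pull", "resolve"]
EOF

# Restart MicroK8s so containerd picks up the new registry config
microk8s stop && microk8s start
```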
All this looks like a lot to remember. Fortunately, when the organization's security standards allow, a different registry access mechanism can be used: access over a caching proxy.
Installing over a caching proxy
Caching proxies, like Squid, serve as intermediaries in package pull workflows. They fetch external data, like software updates or patches, and temporarily store it in an isolated environment. This lets internal systems access essential information without constant external connectivity. The caching proxy becomes a local source for data retrieval, ensuring quick access to critical updates while maintaining an air gap — keeping a separation from external networks. The security compromise for package availability depends on the quality of the vendors’ registry.
Configuring the environment (/etc/environment) correctly enables the proper functioning of package managers in such environments. HTTPS_PROXY, HTTP_PROXY, https_proxy, and http_proxy variables need to be configured with the proxy endpoint, along with setting NO_PROXY and no_proxy to include the IP ranges occupied by the pods and services. Assuming the organization's proxy is at http://squid.internal:3128
, /etc/environment
should set the following environment variables:
HTTPS_PROXY=http://squid.internal:3128
HTTP_PROXY=http://squid.internal:3128
NO_PROXY=10.0.0.0/8,192.168.0.0/16,127.0.0.1,172.16.0.0/16 # Internal cluster and machine addresses
https_proxy=http://squid.internal:3128
http_proxy=http://squid.internal:3128
no_proxy=10.0.0.0/8,192.168.0.0/16,127.0.0.1,172.16.0.0/16 # Internal cluster and machine addresses
Bear in mind that if running on a K8s cluster like MicroK8s, the proxy configuration has to be applied to every node, and the nodes need to be restarted for it to take effect:
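On MicroK8s, a sketch of that per-node proxy setup might be the following, assuming the same Squid address as above; the variables go into MicroK8s' containerd environment file:

```shell
# Make MicroK8s' containerd use the caching proxy (run on every node)
cat <<'EOF' | sudo tee -a /var/snap/microk8s/current/args/containerd-env
HTTPS_PROXY=http://squid.internal:3128
HTTP_PROXY=http://squid.internal:3128
NO_PROXY=10.0.0.0/8,192.168.0.0/16,127.0.0.1
EOF

# Restart the node's MicroK8s services to apply the change
microk8s stop && microk8s start
```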
Awesome! With the right caching proxy configuration, GET requests to https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh are also possible, which means a conda environment can be created and connected to the Jupyter Notebook, and no workaround using pip is required:
# Download conda installer
$ curl -o Miniconda3-latest-Linux-x86_64.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Run conda installer
$ bash Miniconda3-latest-Linux-x86_64.sh -bf
# Delete conda installer
$ rm Miniconda3-latest-Linux-x86_64.sh
# Initialize conda
$ ~/miniconda3/bin/conda init
# Restart terminal
$ conda create --name my_env
$ conda activate my_env
$ conda install -c anaconda ipykernel notebook
$ python -m ipykernel install --user --name=my_env
$ jupyter notebook
Copy the access URL from the terminal (http://127.0.0.1:8888/tree?token=<token>) and paste it into the browser. Then install pytorch to try it out:
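A minimal sketch of that test, assuming the `my_env` environment created above and the proxy configuration in place:

```shell
# Install PyTorch into the conda environment through the caching proxy
conda install -n my_env -c pytorch pytorch

# Verify the installation by importing torch and printing its version
conda run -n my_env python -c "import torch; print(torch.__version__)"
```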
With this configuration, it is finally possible to start working on the tasks, and with all the favourable conditions, setting up the environment will not take longer than 20 minutes.
Alright!
With the journey nearly over, let’s consider one more quite common scenario.
Imagine being in the shoes of a data virtuoso navigating an environment without the luxury of a caching proxy. Mid-experiment, the need arises to enrich your training data on the fly with a CSV file perched on an external website. To add a touch of adrenaline, a cutting-edge PyTorch package, hot off the digital presses a mere few hours ago, beckons for immediate incorporation. With the mirror server orchestrating its daily package refresh post-midnight, the clock becomes a relentless adversary: no time for patience, for this experiment is slated for completion by day's end. Behold a snapshot of a data scientist's daily grind, where real-time updates and seamless external data integration sculpt the quality and relevance of the training data.
But how can one access data available on the internet, and, more challenging still, deliver it to the air-gapped environment while keeping the security standards?
Accessing external data
The answer can be found when looking back at the role of DMZ in the infrastructure — it imports files from a public registry over the internet and exports them into the air-gapped environment!
Just like private registries serve a purpose, the demands of a data science workflow call for a solution that enables external data access and initial exploration on a temporary, easily reproducible machine. This machine shouldn’t have access to the strategic environment, maintaining stringent security standards. Think of it as a sandbox!
A sandbox environment
Despite its playful name, the sandbox plays a crucial role. Its main job is to let users create a machine with governed internet and resource access, do what they need, and then wipe it out and set up another machine if there is significant damage. Using a VM service is the most common and suitable solution.
Like the air-gap workstation, the infrastructure setup and overall availability can differ based on the organization’s needs. To emphasize safety, some companies place their DMZ access points physically away from air-gapped ones. This means users have to go through security checks multiple times during working hours.
However, the files must still reach the air-gapped environment so PyTorch gets updated…
And again, depending on the organization’s security standards, this challenge can be solved in many ways.
The file exchange medium
Companies establish strict rules to safely move data from the DMZ to the air-gapped environment. These rules vary based on specific requirements.
For instance, a simple setup involves a portable file system with a write-only mount point in the DMZ and a non-executable, read-only mount point in the air-gapped environment. Before mounting in the air gap, a security scan ensures content compliance.
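On the air-gapped side, that read-only, non-executable mount could be sketched as follows; the device name and mount point are placeholder assumptions:

```shell
# Mount the scanned portable file system read-only and non-executable
sudo mkdir -p /mnt/transfer
sudo mount -o ro,noexec,nosuid /dev/sdb1 /mnt/transfer
```

The `noexec` and `nosuid` options prevent anything on the medium from being executed directly, which complements the prior security scan.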
Intermediate object storage systems allow companies to configure access control settings for buckets on write and read sites and define optional on-event actions. Once approved as secure, the files can be moved to the read-only bucket for the air-gapped environment.
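As an illustration of such access control, a write-only policy for the DMZ-facing bucket could be applied with the AWS CLI against a MinIO endpoint. The bucket name, principal ARN, and endpoint are hypothetical:

```shell
# Define a policy that only allows uploads to the DMZ write bucket
cat > write-only-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["*"]},
    "Action": ["s3:PutObject"],
    "Resource": ["arn:aws:s3:::ds-dmz-write/*"]
  }]
}
EOF

# Apply the policy to the bucket on the MinIO endpoint
aws --endpoint-url http://<object storage endpoint>:<port> \
    s3api put-bucket-policy --bucket ds-dmz-write --policy file://write-only-policy.json
```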
Having a sandbox instance available and a configured object storage system (MinIO in this example), it is finally possible to access external files:
# Install AWS CLI to access MinIO S3 endpoint
$ sudo apt install awscli -y
# Configure AWS CLI to reach MinIO endpoint
$ export AWS_ACCESS_KEY_ID=<provided MinIO username value>
$ export AWS_SECRET_ACCESS_KEY=<provided MinIO password value>
$ export AWS_DEFAULT_REGION=<provided region>
# Test
$ aws --endpoint-url http://<object storage endpoint>:<port> s3 ls
The test command should output the available buckets for the given environment:
Now, let's download the required files on the DMZ machine by running the following commands:
# Download the dataset
$ wget https://<your_data>/<path>.csv
# ... perform operations and save
# Upload the dataset to MinIO
$ aws --endpoint-url http://<object storage endpoint>:<port> s3 cp dataset.csv s3://<write bucket name>/dataset.csv
# Check if data is available
$ aws --endpoint-url http://<object storage endpoint>:<port> s3 ls <write bucket name>
Again, the test command should output the available objects in the bucket for the given environment:
The following commands on the DMZ install the latest conda PyTorch version:
# Download the latest pytorch package
$ conda install --download-only pytorch -c pytorch
# Upload the pytorch package to MinIO
$ aws --endpoint-url http://<object storage endpoint>:<port> s3 cp ./miniconda3/pkgs/pytorch-<version>.conda s3://ds-dmz-write/pkgs/pytorch-<version>.conda
This is an excellent time to run security checks on the uploaded files and move them to the read bucket when signed off.
Once done, let's check the MinIO console to see which files are available in the air-gap read bucket:
Running the following commands facilitates the download of data in the air-gapped environment:
# Download all the files
$ aws --endpoint-url http://<object storage endpoint>:<port> s3 cp s3://<read bucket name> ./minio_data --recursive
download: s3://<read bucket name>/dataset.csv to minio_data/dataset.csv
download: s3://<read bucket name>/pkgs/pytorch-<version>.conda to minio_data/pkgs/pytorch-<version>
# Update PyTorch from the local package file
$ conda install --offline minio_data/pkgs/pytorch-<version>.conda
Downloading and Extracting Packages:
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
# Confirm new PyTorch version
$ conda list
# packages in environment at /home/ubuntu/miniconda3:
#
# Name Version Build Channel
...
pytorch <version> <build> <unknown>
Similarly to conda, the apt, pip, and npm package managers also support downloading and installing packages from local files, as described in their official documentation. Bear in mind that packages may depend on other packages, so make sure to download and install the dependencies accordingly.
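For pip, the round trip could be sketched as follows; the package name and directory are examples, and the transfer step uses the exchange medium described earlier:

```shell
# On the DMZ machine: download a package with all its dependencies
pip download --dest ./pip-pkgs pandas

# ... transfer ./pip-pkgs through the approved exchange medium ...

# In the air-gapped environment: install strictly from the local files
pip install --no-index --find-links ./pip-pkgs pandas
```

The `--no-index` flag prevents pip from ever contacting a remote index, which makes the install fail loudly if a dependency was forgotten rather than silently reaching out.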
Delivering custom OCI images to the distributed computing platform requires building or pulling the necessary images on the DMZ machine and then saving them to a file with docker save. After transferring them to the air-gapped environment, load them with the docker load command, tag them with the private registry name by running docker tag image package.registry.internal:<port>/image, and finally push them to the registry with docker push package.registry.internal:<port>/image. Now, to create a container from that image with docker run, all that is needed is to specify the image as package.registry.internal:<port>/image. The same mechanism applies when pulling custom images from a distributed computing platform.
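The whole image transfer flow described above can be sketched end to end; the image and registry names reuse the examples from earlier in the article:

```shell
# On the DMZ machine: pull the image and save it to a tar archive
docker pull quay.io/jupyter/datascience-notebook
docker save quay.io/jupyter/datascience-notebook -o datascience-notebook.tar

# ... transfer the tar file through the approved exchange medium ...

# In the air-gapped environment: load, retag, and push to the private registry
docker load -i datascience-notebook.tar
docker tag quay.io/jupyter/datascience-notebook \
  package.registry.internal:<port>/jupyter/datascience-notebook
docker push package.registry.internal:<port>/jupyter/datascience-notebook
```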
Tasks finished, ready for next challenges, now fully accommodated to the air-gapped environment!
Summary
Data Science in air-gapped environments introduces new challenges, enforcing a security-first and centralized approach to satisfy workflow comfort and performance.
This article described the journey of a newly hired data scientist who explored the nuances of the air-gapped infrastructure, found his way around securely accessing necessary data and files from inside and outside the system, and eventually made a comfortable workplace.
There are multiple ways to customize the environment. Here are a few ideas:
- Add more tools to your machine images using snaps
- With data engineers, implement data pipelines to keep your data always up-to-date
- Add scheduled power-on and power-off for the VM to optimise the cost
Keep on experimenting with open-source tools and share your results!