Navigating the Data Science Abyss: Working in an Air-Gapped Environment
Data science in air-gapped environments does not need to be an impossible mission. Explore strategies to enhance comfort in sensitive data environments.
Managing, storing, and processing strategic and sensitive data presents formidable security challenges for organizations. Companies deploy diverse strategies to safeguard their data workflows and ML models from known threats.
With the internet being a prevalent avenue for potential breaches, isolating the environment from online vulnerabilities emerges as a compelling defence mechanism. However, the obvious security benefits of “air-gapping” the work environment come with nuanced considerations, impacting the workflows of essential teams, such as the data science team. Navigating these considerations demands a specialized approach to address the day-to-day challenges in this fortified work environment seamlessly.
This article follows a recently onboarded data scientist who, entering an air-gapped environment for the first time, turns their workday into a productive experience within the company.
For this demo, I used machines running Ubuntu 23.04. You can simulate a simplified air-gapped environment on any public or private cloud (AWS, Azure, GCP, or on-premise OpenStack) by configuring networking to block all inbound and outbound internet traffic apart from the SSH communication port, or by using local VMs with Multipass.
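A minimal sketch of such a simulation with Multipass might look like the following. The VM name, resource sizes, and the iptables rules are illustrative assumptions, not part of the original setup:

```shell
# Launch an Ubuntu VM with Multipass (name and sizes are examples)
multipass launch 23.04 --name airgap-node --cpus 2 --memory 4G

# Inside the VM, drop all outbound traffic except SSH replies and
# loopback, so package downloads fail just like in a real air gap
multipass exec airgap-node -- bash -c '
  sudo iptables -A OUTPUT -o lo -j ACCEPT
  sudo iptables -A OUTPUT -p tcp --sport 22 -j ACCEPT
  sudo iptables -P OUTPUT DROP
'
```

After this, `multipass shell airgap-node` still works over SSH, while `curl` to any external host fails, mimicking the behaviour shown later in the article.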
Computing infrastructure
Depending on the company infrastructure, different environment configurations can be expected. Common examples are:
- A bare local machine with its local computing resources, optionally additional user-assigned network file system and network access to the internal company data storage:
- A distributed ML workflow platform like Kubeflow, MLflow or just JupyterHub, accessible over an internal address:
Organizations customize the infrastructure implementation to suit their standards and requirements so that the components setup may vary slightly, but the common denominator stays — no connection to the outside world.
Upon completing security checks, obtaining authentication credentials, and receiving the initial tasks (e.g., conducting an Exploratory Data Analysis (EDA) on internal data or fine-tuning a Large Language Model (LLM)), the next step involves logging into the working environment.
Creating an environment
As distributed computing platforms often offer ready-to-use data science workspaces, in the event of encountering a bare local machine or a VM setup, one would typically commence by setting up the environment manually.
In this case, the choice is to start with a conda environment and register it as a Jupyter notebook kernel:
# Download conda installer
$ curl -o Miniconda3-latest-Linux-x86_64.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Run conda installer
$ bash Miniconda3-latest-Linux-x86_64.sh -bf
# Delete conda installer
$ rm Miniconda3-latest-Linux-x86_64.sh
# Initialize conda
$ ~/miniconda3/bin/conda init
Unfortunately, as soon as curl is executed, a domain resolution error occurs:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl: (6) Could not resolve host: repo.anaconda.com
Using apt install [...], pip install [...], or npm install [...] gives the same result. Even trying to spawn a Jupyter container with docker run -it --rm -p 10000:8888 -v "${PWD}":/home/${USER}/task1 quay.io/jupyter/datascience-notebook fails for the same reason, and the missing connectivity can also become an issue when specifying custom images for platforms such as Kubeflow.
These error messages might appear intimidating, but fear not! Although the presented example confirms the lack of networking access to the outside world, there is a high chance of eventually installing these packages and running the environment. It just needs extra steps…
Installing from private source mirrors
Organizations operating in air-gapped environments have long recognized the limited access to open-source repositories of system packages as a significant obstacle to workflow performance. This challenge has also created a lucrative niche for companies specializing in security-certified private mirror servers, providing a strategic solution to the critical need for secure and controlled package distribution within such restricted settings.
Now, it is common for companies to implement private mirror servers in their infrastructure in different configurations, either as a single registry server provided by a trusted vendor or as multiple registries.
Depending on the security standards and freshness and availability requirements, the configuration often relies on a demilitarized zone (DMZ):
This setup allows access to the required packages over the specified address. Still, as the example has shown, the computing machines cannot associate the default registry addresses with the current workflow. This means the package managers have to be accordingly configured (the following registry examples contain hyperlinks to tutorials on how to create a local setup):
- apt requires appending the registry details to a file in the /etc/apt/sources.list.d/ directory:
$ echo "deb [arch=<your architecture> signed-by=this/is/optional_key.public] http://package.registry.internal:port/apt-repo stable main" | sudo tee /etc/apt/sources.list.d/<registry_name>.list
$ sudo apt update
$ sudo apt upgrade -y
# Reboot if kernel changed
$ sudo reboot
# Install npm package manager for jupyterhub
$ sudo apt install nodejs npm
- npm is straightforward to configure using the CLI:
$ npm config set registry http://package.registry.internal:port/optional_user
# Optionally login if required
$ npm login
# Install a package required to run jupyterhub
$ npm install -g configurable-http-proxy
- pip, like npm, is configurable over the CLI at both user and global levels:
$ pip config --user set global.index http://package.registry.internal:port/optional_user/pypi
$ pip config --user set global.index-url http://<optional username>:<and password>@package.registry.internal:port/optional_user/
$ pip config --user set global.trusted-host package.registry.internal
# Install jupyterhub and notebook packages
$ pip install jupyterlab notebook
After successfully running the commands above, the command jupyterhub should start the local JupyterHub service. Opening http://localhost:8000 in the browser, or http://<vm_address>:8000 when using VMs (make sure security rules allow ingress on that port, or use SSH port forwarding), should allow the use of JupyterHub on the machines.
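The SSH port forwarding mentioned above can be sketched as follows; `<vm_address>` and the key path are placeholders:

```shell
# Forward JupyterHub's port 8000 from the air-gapped VM to the
# local workstation; keep the session open while working
ssh -N -L 8000:localhost:8000 ubuntu@<vm_address> -i ~/.ssh/id_ed25519
# JupyterHub is now reachable locally at http://localhost:8000
```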
- docker (OCI registry) can resolve locally hosted registries when the full image name is provided:
# Optional when secured:
$ docker login package.registry.internal:<port>
$ docker run -it --rm -p 10000:8888 -v "${PWD}":/home/${USER}/task1 package.registry.internal:<port>/jupyter/datascience-notebook
Another way is to implement a persistent change on the client by appending the registry mirror to /etc/docker/daemon.json:
{
"registry-mirrors": ["http://package.registry.internal:<port>"]
}
Bear in mind that if running on a K8s cluster like MicroK8s, the private registry address has to be configured on the cluster as well so that the given distributed computing platform can access it.
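For MicroK8s specifically, a sketch of that registry configuration could look like this. The registry name and port are placeholders; the file layout follows MicroK8s' containerd `hosts.toml` convention:

```shell
# Point MicroK8s' containerd at the private registry
REGISTRY=package.registry.internal:5000
sudo mkdir -p /var/snap/microk8s/current/args/certs.d/${REGISTRY}
cat <<EOF | sudo tee /var/snap/microk8s/current/args/certs.d/${REGISTRY}/hosts.toml
server = "http://${REGISTRY}"

[host."http://${REGISTRY}"]
capabilities = ["pull", "resolve"]
EOF

# Restart MicroK8s so containerd picks up the new registry config
microk8s stop && microk8s start
```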
All this looks like a lot to remember. Fortunately, when the organization's security standards allow, a different registry access mechanism can be used: access over a caching proxy.
Installing over a caching proxy
Caching proxies, like Squid, serve as intermediaries in package pull workflows. They fetch external data, like software updates or patches, and temporarily store it in an isolated environment. This lets internal systems access essential information without constant external connectivity. The caching proxy becomes a local source for data retrieval, ensuring quick access to critical updates while maintaining an air gap — keeping a separation from external networks. The security compromise for package availability depends on the quality of the vendors’ registry.
Configuring the environment (/etc/environment) correctly enables the proper functioning of package managers in such environments. HTTPS_PROXY, HTTP_PROXY, https_proxy, and http_proxy variables need to be configured with the proxy endpoint, along with setting NO_PROXY and no_proxy to include the IP ranges occupied by the pods and services. Assuming the organization's proxy is at http://squid.internal:3128
, /etc/environment
should set the following environment variables:
HTTPS_PROXY=http://squid.internal:3128
HTTP_PROXY=http://squid.internal:3128
NO_PROXY=10.0.0.0/8,192.168.0.0/16,127.0.0.1,172.16.0.0/16 # Internal cluster and machine addresses
https_proxy=http://squid.internal:3128
http_proxy=http://squid.internal:3128
no_proxy=10.0.0.0/8,192.168.0.0/16,127.0.0.1,172.16.0.0/16 # Internal cluster and machine addresses
Bear in mind that if running on a K8s cluster like MicroK8s, the proxy configuration has to be applied to every node, and the nodes need to be restarted for it to take effect:
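On MicroK8s, a sketch of that per-node proxy setup might be the following, assuming the same Squid address as above; the variables go into MicroK8s' containerd environment file:

```shell
# Make MicroK8s' containerd use the caching proxy (run on every node)
cat <<'EOF' | sudo tee -a /var/snap/microk8s/current/args/containerd-env
HTTPS_PROXY=http://squid.internal:3128
HTTP_PROXY=http://squid.internal:3128
NO_PROXY=10.0.0.0/8,192.168.0.0/16,127.0.0.1
EOF

# Restart the node's MicroK8s services to apply the change
microk8s stop && microk8s start
```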
Awesome! With the right caching proxy configuration, GET requests to https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh are also possible, which means a conda environment can be created and connected to the Jupyter Notebook, and no workaround using pip is required:
# Download conda installer
$ curl -o Miniconda3-latest-Linux-x86_64.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Run conda installer
$ bash Miniconda3-latest-Linux-x86_64.sh -bf
# Delete conda installer
$ rm Miniconda3-latest-Linux-x86_64.sh
# Initialize conda
$ ~/miniconda3/bin/conda init
# Restart terminal
$ conda create --name my_env
$ conda activate my_env
$ conda install -c anaconda ipykernel notebook
$ python -m ipykernel install --user --name=my_env
$ jupyter notebook
Copy the access URL from the terminal (http://127.0.0.1:8888/tree?token=<token>) and paste it into the browser. Then install pytorch to try it out:
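A minimal sketch of that test, assuming the `my_env` environment created above and the proxy configuration in place:

```shell
# Install PyTorch into the conda environment through the caching proxy
conda install -n my_env -c pytorch pytorch

# Verify the installation by importing torch and printing its version
conda run -n my_env python -c "import torch; print(torch.__version__)"
```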
With this configuration, it is finally possible to start working on the tasks, and with all the favourable conditions, setting up the environment will not take longer than 20 minutes.
Alright!
With the journey nearly over, let’s consider one more quite common scenario.
Imagine being in the shoes of a data virtuoso navigating an environment without the luxury of a caching proxy. Mid-experiment, the need arises to enrich your training data on the fly with a CSV file perched on an external website. To add a touch of adrenaline, a cutting-edge PyTorch package, hot off the digital presses a mere few hours ago, beckons for immediate incorporation. With the mirror server orchestrating its daily package refresh post-midnight, the clock becomes a relentless adversary: no time for patience, for this experiment is slated for completion by day's end. Behold a snapshot of a data scientist's daily grind, where real-time updates and seamless external data integration sculpt the quality and relevance of the training data.
But how can one access data available on the internet, and, more challenging still, deliver it to the air-gapped environment while keeping the security standards?
Accessing external data
The answer can be found when looking back at the role of DMZ in the infrastructure — it imports files from a public registry over the internet and exports them into the air-gapped environment!
Just like private registries serve a purpose, the demands of a data science workflow call for a solution that enables external data access and initial exploration on a temporary, easily reproducible machine. This machine shouldn’t have access to the strategic environment, maintaining stringent security standards. Think of it as a sandbox!
A sandbox environment
Despite its playful name, the sandbox plays a crucial role. Its main job is to let users create a machine with governed internet and resource access, do what they need, and then wipe it out and set up another machine if there is significant damage. Using a VM service is the most common and suitable solution.
Like the air-gap workstation, the infrastructure setup and overall availability can differ based on the organization’s needs. To emphasize safety, some companies place their DMZ access points physically away from air-gapped ones. This means users have to go through security checks multiple times during working hours.
However, the files must still reach the air-gapped environment so PyTorch gets updated…
And again, depending on the organization’s security standards, this challenge can be solved in many ways.
The file exchange medium
Companies establish strict rules to safely move data from the DMZ to the air-gapped environment. These rules vary based on specific requirements.
For instance, a simple setup involves a portable file system with a write-only mount point in the DMZ and a non-executable, read-only mount point in the air-gapped environment. Before mounting in the air gap, a security scan ensures content compliance.
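On the air-gapped side, that read-only, non-executable mount could be sketched as follows; the device name and mount point are placeholder assumptions:

```shell
# Mount the scanned portable file system read-only and non-executable
sudo mkdir -p /mnt/transfer
sudo mount -o ro,noexec,nosuid /dev/sdb1 /mnt/transfer
```

The `noexec` and `nosuid` options prevent anything on the medium from being executed directly, which complements the prior security scan.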
Intermediate object storage systems allow companies to configure access control settings for buckets on write and read sites and define optional on-event actions. Once approved as secure, the files can be moved to the read-only bucket for the air-gapped environment.
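As an illustration of such access control, a write-only policy for the DMZ-facing bucket could be applied with the AWS CLI against a MinIO endpoint. The bucket name, principal ARN, and endpoint are hypothetical:

```shell
# Define a policy that only allows uploads to the DMZ write bucket
cat > write-only-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["*"]},
    "Action": ["s3:PutObject"],
    "Resource": ["arn:aws:s3:::ds-dmz-write/*"]
  }]
}
EOF

# Apply the policy to the bucket on the MinIO endpoint
aws --endpoint-url http://<object storage endpoint>:<port> \
    s3api put-bucket-policy --bucket ds-dmz-write --policy file://write-only-policy.json
```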
Having a sandbox instance available and a configured object storage system (MinIO in this example), it is finally possible to access external files:
# Install AWS CLI to access MinIO S3 endpoint
$ sudo apt install awscli -y
# Configure AWS CLI to reach MinIO endpoint
$ export AWS_ACCESS_KEY_ID=<provided MinIO username value>
$ export AWS_SECRET_ACCESS_KEY=<provided MinIO password value>
$ export AWS_DEFAULT_REGION=<provided region>
# Test
$ aws --endpoint-url http://<object storage endpoint>:<port> s3 ls
The test command should output the available buckets for the given environment:
Now, let's download the required files on the DMZ machine by running the following commands:
# Download the dataset
$ wget https://<your_data>/<path>.csv
# ... perform operations and save
# Upload the dataset to MinIO
$ aws --endpoint-url http://<object storage endpoint>:<port> s3 cp dataset.csv s3://<write bucket name>/dataset.csv
# Check if data is available
$ aws --endpoint-url http://<object storage endpoint>:<port> s3 ls <write bucket name>
Again, the test command should output the available objects in the bucket for the given environment:
The following commands on the DMZ install the latest conda PyTorch version:
# Download the latest pytorch package
$ conda install --download-only pytorch -c pytorch
# Upload the pytorch package to MinIO
$ aws --endpoint-url http://<object storage endpoint>:<port> s3 cp ./miniconda3/pkgs/pytorch-<version>.conda s3://ds-dmz-write/pkgs/pytorch-<version>.conda
This is an excellent time to run security checks on the uploaded files and move them to the read bucket when signed off.
Once done, let's check the MinIO console to see which files are available in the air-gap read bucket:
Running the following commands facilitates the download of data in the air-gapped environment:
# Download all the files
$ aws --endpoint-url http://<object storage endpoint>:<port> s3 cp s3://<read bucket name> ./minio_data --recursive
download: s3://<read bucket name>/dataset.csv to minio_data/dataset.csv
download: s3://<read bucket name>/pkgs/pytorch-<version>.conda to minio_data/pkgs/pytorch-<version>
# Update PyTorch from the local package file
$ conda install --offline minio_data/pkgs/pytorch-<version>.conda
Downloading and Extracting Packages:
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
# Confirm new PyTorch version
$ conda list
# packages in environment at /home/ubuntu/miniconda3:
#
# Name Version Build Channel
...
pytorch <version> <build> <unknown>
Similarly to conda, the apt, pip, and npm package managers also support downloading and installing packages from local files, as described in their official documentation. Bear in mind that packages may depend on other packages, so make sure to download and install the dependencies accordingly.
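For pip, the round trip could be sketched as follows; the package name and directory are examples, and the transfer step uses the exchange medium described earlier:

```shell
# On the DMZ machine: download a package with all its dependencies
pip download --dest ./pip-pkgs pandas

# ... transfer ./pip-pkgs through the approved exchange medium ...

# In the air-gapped environment: install strictly from the local files
pip install --no-index --find-links ./pip-pkgs pandas
```

The `--no-index` flag prevents pip from ever contacting a remote index, which makes the install fail loudly if a dependency was forgotten rather than silently reaching out.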
Delivering custom OCI images to the distributed computing platform requires building or pulling the necessary images on the DMZ machine and then saving them to a file with docker save. After transferring them to the air-gapped environment, load them with the docker load command, tag them with the private registry name by running docker tag image package.registry.internal:<port>/image, and finally push them to the registry with docker push package.registry.internal:<port>/image. Now, to create a container from that image with docker run, all that is needed is to specify the image as package.registry.internal:<port>/image. The same mechanism applies when pulling custom images from a distributed computing platform.
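The whole image transfer flow described above can be sketched end to end; the image and registry names reuse the examples from earlier in the article:

```shell
# On the DMZ machine: pull the image and save it to a tar archive
docker pull quay.io/jupyter/datascience-notebook
docker save quay.io/jupyter/datascience-notebook -o datascience-notebook.tar

# ... transfer the tar file through the approved exchange medium ...

# In the air-gapped environment: load, retag, and push to the private registry
docker load -i datascience-notebook.tar
docker tag quay.io/jupyter/datascience-notebook \
  package.registry.internal:<port>/jupyter/datascience-notebook
docker push package.registry.internal:<port>/jupyter/datascience-notebook
```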
Tasks finished, ready for next challenges, now fully accommodated to the air-gapped environment!
Summary
Data Science in air-gapped environments introduces new challenges, enforcing a security-first and centralized approach to satisfy workflow comfort and performance.
This article described the journey of a newly hired data scientist who explored the nuances of the air-gapped infrastructure, found his way around securely accessing necessary data and files from inside and outside the system, and eventually made a comfortable workplace.
There are multiple ways to customize the environment. Here are a few ideas:
- Add more tools to your machine images using snaps
- With data engineers, implement data pipelines to keep your data always up-to-date
- Add scheduled power-on and power-off for the VM to optimise the cost
Keep on experimenting with open-source tools and share your results!