HANDS-ON: DATA SCIENCE

Data Science experiments locally using Ubuntu and VSCode with cloud GPUs.

I’m a Data Scientist. I do not own a GPU, but I experiment on one “almost” locally. Ubuntu, VSCode and an NVIDIA GPU-accelerated Virtual Machine (VM) in the cloud are all I need.

Bartłomiej Poniecki-Klotz
Ubuntu AI

--

As a Data Scientist, I want to experiment in a flexible environment. Today I need four powerful GPUs to train a Large Language Model (LLM). Tomorrow I run Exploratory Data Analysis (EDA) and consume 128GB of memory. Each day brings its own challenges, and I need infrastructure that can handle them.

GDPR and government regulations force companies to keep data within specific geographic locations. Sensitive data cannot leave the data centre premises. On the other hand, providing flexible Virtual Machines (VMs) is an essential capability of any cloud, private or public. Combine the two and you get an on-demand, flexible Data Science experimentation environment.

Is it this simple? Yes.

You only need a VM with SSH access from your local computer and VSCode installed. VSCode has a feature called Remote Window. It opens an SSH tunnel, downloads the VSCode server to the VM and opens a new window on your desktop. All code, data and configuration stay on the cloud VM; your computer is only a terminal for input and output. So you get the look and feel of local code development while everything stays in the cloud. Additionally, you can create multiple environments on different VM flavours.

Visual Studio Code integration with a cloud VM with GPU over SSH

The whole setup takes four steps:

  1. Create an EC2 instance with a cheap GPU (T4)
  2. Configure Ubuntu drivers and development tools
  3. Set up the VSCode Remote Window
  4. Run GPU workload

I used an AWS EC2 instance for the demo. You can do the same with other public and private clouds, such as Azure, GCP or on-premises OpenStack. Any Virtual Machine with SSH access works!

GPU in the cloud

Public clouds provide a cost-efficient way to get your hands on GPU-accelerated VMs. I spin one up when needed and power it down after work. In AWS, they start from around $0.50 per hour.
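
Stopping the instance when you finish is what keeps the cost down. You can do it from the console or, once the instance exists, from the AWS CLI; a minimal sketch with a placeholder instance ID:

$ aws ec2 stop-instances --instance-ids i-0123456789abcdef0
$ aws ec2 start-instances --instance-ids i-0123456789abcdef0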

First, I go to the EC2 Dashboard in the AWS console and create a new VM. In the wizard, I select the Ubuntu 22.04 operating system. It is an LTS release, which means up to ten years of security patching and maintenance. Additionally, Ubuntu works well with NVIDIA GPUs and gets frequent driver updates.

AWS EC2 wizard — select the instance type

Next, I select the instance type; g4dn.xlarge is a good choice. It has an NVIDIA T4 GPU and 16GB of memory, which should be sufficient for working with data samples. If you need more memory, select a beefier instance. There are also instances with more powerful GPUs, such as the p3 family with NVIDIA V100s.

AWS EC2 wizard — select the key pair

The next step is selecting a “Key pair”, which I use to SSH into the instance. I create a new key pair or reuse an existing one. If you lose the private key, you cannot recover it; you have to create a new one.

AWS EC2 wizard — create a new security group with SSH access allowed from any IP

The security group configuration needs to allow access to the instance on port 22, the default SSH port. You can create this security group once and reuse it for future instances.
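
The rule created in the wizard above allows SSH from any IP. If you want to tighten it, you can restrict the rule to your own address with the AWS CLI; a sketch with a placeholder security group ID and IP (remember to revoke the open rule as well):

$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 203.0.113.10/32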

Volume size depends entirely on your use case. If you just want to test the approach, the 30GB AWS free tier volume is all you need. For MLOps environments, you need more disk space: most popular tools run on Kubernetes, and the required Docker images take up a significant amount of space. One popular MLOps stack is Kubeflow, which you can deploy on any CNCF-conformant Kubernetes.

If you work in a corporate AWS account, you might need to adjust the configuration above, especially networking and proxy settings. The goal is an EC2 instance with an IP address you can SSH to from your local machine. The VM also needs access to the internet.

AWS EC2 wizard — instance summary (g4dn.xlarge, 275GB across 2 volumes, security group with SSH access)

After launching, the instance is ready for configuration within a few minutes.

Configure VM

The first step is to check that you can SSH into the machine. I use the SSH CLI tool, available on Linux, Mac and Windows. The private key is the one I selected when creating the VM in AWS, and a newly created Ubuntu instance in AWS comes with a default “ubuntu” user. When deploying the EC2 instance, I chose a public VPC subnet so I can reach it directly from my desktop.

$ ssh -i ~/.ssh/bpk.pem ubuntu@<EC2_public_ip>

Quick troubleshooting hints if the connection fails. Check that:

  • The instance is running
  • You connect to the public IP address
  • You log in as the “ubuntu” user
  • You use the key pair selected in the AWS console
  • The security group allows traffic on port 22
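
If all of these look right and the connection still fails, a verbose SSH run usually shows where it stops, for example at the key exchange or the authentication step:

$ ssh -v -i ~/.ssh/bpk.pem ubuntu@<EC2_public_ip>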

System and drivers

It is good practice to update the packages as the first action on a new machine. You can also make this part of a cloud-init script, so most fixable CVEs are patched before you even log in; a sketch follows the commands below.

$ sudo apt update
$ sudo apt upgrade -y

#reboot if kernel changed
$ sudo reboot
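
To bake the update into instance creation, EC2 accepts a shell script as user data, which cloud-init runs as root on the first boot. A minimal sketch:

#!/bin/bash
# EC2 user data: runs once as root on the first boot
apt update
apt upgrade -y
# reboot so a patched kernel is active before the first login
reboot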

I now have the newest versions of the kernel and packages. It’s time to install the GPU drivers. On Ubuntu, the ubuntu-drivers tool detects the hardware and installs the latest recommended NVIDIA driver with a matching CUDA version.

$ sudo apt install ubuntu-drivers-common awscli -y
$ sudo ubuntu-drivers autoinstall

$ sudo reboot

After the reboot, the nvidia-smi tool shows that the NVIDIA T4 is waiting for you. You are ready to go.

$ nvidia-smi
Wed Sep 13 08:15:20 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05              Driver Version: 535.86.05    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   25C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                               Usage     |
|=======================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+

Prepare environment workspace

I created a folder for the data science project and installed conda.

$ mkdir -p workspace

$ curl -o Miniconda3-latest-Linux-x86_64.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

$ bash Miniconda3-latest-Linux-x86_64.sh -bf

$ rm Miniconda3-latest-Linux-x86_64.sh

$ ~/miniconda3/bin/conda init

$ source ~/.bashrc

Both actions above are optional. You could use pip to install packages and point the VSCode Remote Window at your user folder instead. However, some packages are only available through conda, and it’s more convenient to see only the project folder rather than every file in the system.

Create AMI

AWS EC2 AMI — create an image from the running instance (Actions -> Image and templates -> Create image)

This setup is a good base for my future projects. I created an AMI from it and made it public. The AMI is available in the eu-west-1 region as ami-0460245d7f86724fc.

Next time you need an experimentation environment, you can spin it up in a minute. When creating a new EC2 instance, select your AMI instead of the plain Ubuntu 22.04 LTS image.

Create the AMI before you start working on a project, so it stays project-agnostic. If you create it later, it will already contain project-specific VSCode configuration or code.
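
You can also create the image from the command line. A sketch with the AWS CLI, where the instance ID, name and description are placeholders:

$ aws ec2 create-image --instance-id i-0123456789abcdef0 --name "ds-gpu-base" --description "Ubuntu 22.04 with NVIDIA drivers and conda"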

VSCode project

Connect the VSCode Remote Window

In VSCode, I select the environment for the Remote Window from my SSH configurations. If you use a public cloud instance and recreate it each time, change the HostName in ~/.ssh/config to the new public IP. You can also use a static IP, such as an Elastic IP in AWS, but be aware of the cost: AWS charges for an Elastic IP while it is not associated with a running instance.

You can set this up from VSCode or edit the file ~/.ssh/config directly. My SSH config for the data science experimentation environment looks like this:

Host ds-exp
    HostName 3.251.90.107
    IdentityFile ~/.ssh/bpk.pem
    User ubuntu
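
A quick way to verify the entry is to connect with the alias from a plain terminal before involving VSCode:

$ ssh ds-exp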

I open a new window in Visual Studio Code. Next, I select the blue rectangle in the bottom left corner, as in the picture below.

VSCode startup page — open a Remote Window from the bottom left corner

Then I connect over SSH to the virtual machine I want and start working. At this stage, the IDE does all the work: it downloads its server binaries to the VM and connects the terminal sessions. The whole process takes under a minute in a public cloud like AWS.

After connecting to the instance, I select the “workspace” folder created earlier; all work happens there. One of the strong points of VSCode is its extension ecosystem. When using a Remote Window, you need to install the required extensions on the remote host rather than only locally. For Data Science projects, the Python and Jupyter extensions are a must-have. I install them from the Extensions tab.

What I’m still missing is the Python interpreter. The conda installed previously provides a Python environment in the version I need.

$ conda create --name <env-name> python=3.11

After I create a Jupyter notebook, I can select the interpreter; all conda environments are listed there. I run a cell to check that it works. You may see a prompt to install ipykernel into the selected environment.
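
If the prompt does not appear, or you prefer to prepare the environment yourself, installing ipykernel into it from the terminal has the same effect. A sketch, where <env-name> is the environment created above:

$ conda activate <env-name>
$ pip install ipykernel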

VSCode Jupyter notebook — Python code in the cell executes correctly

Now, you have a working experiment environment. Let’s make it even more awesome with some integrations.

GPU test

PyTorch is one of the best-known open-source Machine Learning libraries. It allows you to build, train and save Deep Neural Network models. You can use it to quickly check whether the GPU is working and visible to your workloads.

I installed PyTorch with CUDA support.

$ pip install torch torchvision torchaudio -q
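
On Linux, the default wheels usually ship with CUDA support. If you need a build pinned to a specific CUDA version, the selector on pytorch.org generates the matching command; it looks roughly like this, where cu121 (CUDA 12.1) is only an example:

$ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121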

I ran a simple calculation using a torch tensor.

import torch
x = torch.rand(5, 3)
print(x)

You should see a five-by-three tensor filled with random numbers. Now you know that the library imports correctly.

Does the GPU work in PyTorch?

import torch
print(torch.cuda.is_available())

An output of “True” means that PyTorch can see the GPU.
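
To go one step further than the availability check, you can run a small computation directly on the GPU. A minimal sketch:

import torch

# pick the GPU if it is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# create two tensors on the device and multiply them there
x = torch.rand(5, 3, device=device)
y = torch.rand(3, 5, device=device)
print((x @ y).device)  # expect cuda:0 on the T4 instance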

VSCode Jupyter notebook — GPU available in PyTorch

Git

Integration with the source repository is the bare minimum to start working. In VSCode, you can use the UI or a terminal to execute Git commands.

VSCode Remote Window — clone a Git repository

One thing worth mentioning is that VSCode can share your local Git credentials with the VM, which makes working with private repositories easier.

All your code changes stay on the VM, so remember to commit and push them frequently.

Object storage

VSCode connects to the VM over SSH using the Remote Window, and the VM connects to Object Storage

Data Science experiments require data, and object storage is a cost-efficient way to store it. S3-compatible object storage provides an API to download and upload data, but you need to authenticate to the service first. Most clouds support two ways of authenticating: an access key and secret key pair, and role-based access. The first is cloud-agnostic because it is part of the S3 API, but the keys are static, long-lived credentials, so it is less secure. The second is more convenient and safer because you grant privileges at the VM level; in AWS, this is done with instance profiles.

First, I create a bucket with some data. I want to give read-only access to this bucket, using the IAM policy below. My bucket’s name is bpk-test-data-bucket; change it to your own bucket name in the policy.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListObjectsInBucket",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::bpk-test-data-bucket"
    },
    {
      "Sid": "GetObjectActions",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": "arn:aws:s3:::bpk-test-data-bucket/*"
    }
  ]
}

Here are the steps to follow:

  1. Create an IAM policy with access to the bucket
  2. Create an IAM user and attach the policy directly to it
  3. In the Security credentials tab, go to Access keys and press the “Create access key” button
  4. Select Command Line Interface (CLI) from the radio buttons
  5. Download the keys as a file
  6. Go to the Virtual Machine terminal session and configure the AWS CLI, as shown below
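
With the downloaded access key and secret, configure the AWS CLI on the VM. A minimal sketch, where the values come from the downloaded file and eu-west-1 is only an example region:

$ aws configure
AWS Access Key ID [None]: <access key ID from the downloaded file>
AWS Secret Access Key [None]: <secret access key from the downloaded file>
Default region name [None]: eu-west-1
Default output format [None]: json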

Time to test the results.

As a Data Scientist, I want to list objects in the bucket and download them. I have read-only access, so the upload at the end fails, as expected.

$ aws s3 ls s3://bpk-test-data-bucket/
$ aws s3 cp s3://bpk-test-data-bucket/test.txt .
$ ll
$ aws s3 cp test.txt s3://bpk-test-data-bucket/test-uploaded.txt

You might encounter additional issues in an enterprise setup, like lack of permissions, networking between storage and your VM or using private S3 endpoints for improved security.

Cleanup

After you finish the experimentation, it’s time to clean up the resources and configuration changes:

  • Close the VSCode Remote Window
  • Terminate the EC2 instance together with its volumes
  • Remove the entry from ~/.ssh/config
  • Optionally, remove volume snapshots

All cleaned up and ready for the next experiment!

Summary

You now have a secure and efficient way to experiment. The data stays in the cloud while it feels like working on a local machine.

You created an AMI for your future Data Science projects. Next time, your environment is only three clicks away.

There are multiple ways to customize your environment. Here are a few ideas:

  • Add more tools to your AMI using snaps, apt packages and conda environments
  • Add MLOps tools like Charmed MLflow for experiment tracking
  • Use a CI tool to build the AMI
  • Add scheduled power-on and power-off for the VM to optimise the cost
  • Set up an MLOps platform with Charmed Kubeflow

For more MLOps Hands-on guides, tutorials and code examples, follow me on Medium or contact me directly.
