Managing VMs like a Data Scientist
Managing virtual machines (VMs) as a data scientist can be tedious. If you are like me and work in a team that is not fortunate enough to have a data engineer cleaning, prepping and giving you your data on a plate with some garnish on the side, then you have to manage, extract and manipulate files sitting on various VMs. Logging into each of these VMs to check that all the necessary files were dumped, all the necessary packages are installed and all the cron jobs executed on time can be a time-consuming, inefficient and downright laborious task.
Luckily, Python comes to the rescue with a package called paramiko. This post explains how you can wrap your VMs in a DataFrame and execute the same command on all of them, saving the returned output in a pandas DataFrame.
The code for this post can be found on this git repo. Although this post relates to managing VMs, the underlying hack is to take your current knowledge of DataFrames, with all the great functionality we have come to know and love, and combine it with Python Classes to abstract away tedious tasks and make them more efficient.
Below is some background for those stumbling onto this post with no clue about any of the topics. Feel free to skip the sections you know a lot about; the snippets below give a high-level overview to ensure the post makes sense as a whole.
Because I work at a corporate and want to avoid disclosing its IPs, I’ve opted to spin up 4 VMs on Google Cloud Platform (GCP), but the methodology for any VM on any domain is the same. If you are using GCP and you have the gcloud SDK installed, spinning up a GCP VM via the command line can be achieved with a one-liner, as shown below.
gcloud compute instances create vm1 --custom-cpu 1 --custom-memory 1
gcloud compute instances create vm2 --custom-cpu 2 --custom-memory 2
gcloud compute instances create vm3 --custom-cpu 1 --custom-memory 1
gcloud compute instances create vm4 --custom-cpu 2 --custom-memory 2
The above bash commands use the gcloud SDK to spin up 4 VMs named vm1, vm2, vm3 and vm4, with either 1 vCPU and 1 GB of RAM or 2 vCPUs and 2 GB of RAM. If you log into your GCP console, you’ll see the VMs created, each with its own public IP address that we’ll use to ssh into it.
If you’ve never worked in a shell before, you should give it a bash… For those unfamiliar with what the shell even is, it’s the screen that looks like the matrix that all the techies use at work; I show an example below.
Similar to Windows 10 or macOS, the shell is just another way to interact with your machine, and it comes preinstalled with a plethora of programs like top and, of course, ssh. SSH, short for Secure Shell, comes installed on practically every Unix (macOS) or Linux (Ubuntu, Red Hat, Debian) system. The ssh program runs in a shell and starts an SSH client that opens a secure connection to an SSH server on a remote machine.
The ssh command can be used to log into a remote machine or to execute commands on it, and companion tools such as scp use the same protocol to transfer files between two machines. To log into a VM, all you need is the IP address of the VM, a username and, depending on your user configuration, either a set of SSH keys or a password. The syntax for the ssh command for the user root to log into a VM with IP ip then looks as follows:
ssh root@ip
In our example, for vm1 with IP 184.108.40.206 sitting on GCP, this translates into the same command with ip replaced by that address.
Running this command in a shell logs us into vm1. The shell we are working in is then no longer a local shell, but a remote shell on the VM identified by that public IP.
Python is an object-oriented programming (OOP) language. Almost everything in Python is an object, with its properties and methods.
A Class is like an object constructor, or a “blueprint” for creating objects. Classes provide a means of bundling data and functionality together. Creating a new class creates a new type of object, allowing new instances of that type to be made. A toy example is shown below where we create a Person class with 2 attributes: name and age.
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

p1 = Person("John", 36)
What makes the object-oriented paradigm amazing is that we only have to write a great blueprint (Class) once, then we can reuse the hard work we’ve done again and again. For our Person class above, we might want to construct a list containing many people. An example of this is shown below:
people = [Person("Jane", 29), Person("John", 36), Person("Blake", 10)]
As a side note, Class names in Python start with a capital letter (the CapWords style) as per the PEP 8 convention, but I digress. We can now iterate over the people list and access the attributes of each Person in the list. For example, if we wanted to print out the names of all the people in the list, we can do the following:
for person in people:
    print(person.name)
Isn’t that nice and modular? Using Classes to abstract away complicated logic from the end user is a critical pillar in OOP and a great mindset to adopt to write scalable, reproducible and maintainable code.
If you ever see a function in Python starting and ending with two underscores (__), like the __init__() function above, know that these functions are “special”. The __init__() function is usually called a method instead of a function, simply because it is a function inside a class, but that is just nomenclature. The __init__() method is used to initialise a new object of the class and is sometimes also called the constructor method.
Similarly to the “special” __init__() method in our Class, there is another “special” method, __str__(), which returns the friendly, human-readable representation of the object. The __str__() method is called by Python when you print an object of the class using the print() function. For example, say we code up print(Person('Jane', 29)): what should print? Just the name, just the age, or both? The __str__() method tells Python what it should print.
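To make the __str__() idea concrete, here is a minimal sketch that extends the toy Person class from above. The output format, the name followed by the age in brackets, is just a choice made for this illustration.

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        # Called by print() and str(); returns the friendly representation
        return f"{self.name} ({self.age})"


jane = Person("Jane", 29)
print(jane)  # prints: Jane (29)
```

Without a __str__() method, the same print() call would fall back to Python's default representation, which shows the class name and a memory address.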
Ok, the theory is almost done. Just one more topic — Pandas.
Pandas is a Python library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating tables of data called DataFrames. DataFrames are a data scientist’s bread and butter and are most likely the most used data type (Class) in the pandas package. For this post you need only know two things about the DataFrame Class, viz.
- what is a DataFrame (just a table) and,
- what does the apply() function do to a column of a DataFrame.
I show an example of the Iris dataset as a DataFrame below. To my dismay, pandas has no built-in datasets, so I’ve consulted another indispensable data science package: seaborn. Go check it out if you can.
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
So what is a DataFrame? It’s just a table — that’s it. However, it’s got some pretty cool built-in methods to make your data manipulation, interrogation and cleaning a much, much more pleasurable experience.
If we run the code below, which calls the apply() method on the sepal_length column, we get the output shown in the table. I hope the functionality of the apply() method is clear from the example… If not, stare at it a bit, then read on.
iris.sepal_length.apply(lambda row: 'tall' if row >= 5 else 'short')
There is a weird lambda keyword thrown into the example, which in short is just a “phantom” function, formally called an anonymous function. Basically it is a function that doesn’t have a name but runs some code. In our example, this anonymous lambda function checks each value in our column and, if the sepal length is greater than or equal to 5, returns tall; otherwise it returns short.
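If the lambda still looks odd, it may help to see it next to an ordinary named function that does the same thing. The sketch below uses a tiny made-up column instead of the real Iris data so it runs without downloading anything; the function name label_length is purely illustrative.

```python
import pandas as pd

# A tiny stand-in for the sepal_length column (values are made up)
df = pd.DataFrame({"sepal_length": [5.1, 4.9, 5.0, 4.6]})


def label_length(value):
    """Return 'tall' for lengths of at least 5, otherwise 'short'."""
    return "tall" if value >= 5 else "short"


# apply() calls the function once for every value in the column
with_lambda = df.sepal_length.apply(lambda value: "tall" if value >= 5 else "short")
with_named = df.sepal_length.apply(label_length)

print(with_named.tolist())  # ['tall', 'short', 'tall', 'short']
```

Both calls give the same result; the lambda simply saves you from naming a function you only use once.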
Bring it all together
I hear you saying: “OK cool Louwrens, nice background, but so what?” Well, we’ve covered all the theory needed to understand what is about to happen, which is:
- Create a VM Class which gets initialised with an IP and username,
- the init method then checks if we can connect to the remote VM and uses the ✅ and ⛔️ emoticons to show a successful or unsuccessful connection.
- I then create a DataFrame containing all the IPs of our 4 remote VMs on GCP,
- then we can use the apply() method to run bash commands on these VMs and return a DataFrame.
- I then display a summary DataFrame containing the specs for these 4 VMs sitting on GCP.
Below I create the VM Class.
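import os
from paramiko import SSHClient
from paramiko.auth_handler import AuthenticationException

class VM:
    def __init__(self, ip, username, pkey='~/.ssh/id_rsa'):
        self.hostname = ip
        self.username = username
        # paramiko expects the path to a private key, so expand "~" here
        self.pkey = os.path.expanduser(pkey)
        self.logged_in_emoj = '✅'
        self.logged_in = True
        try:
            ssh = SSHClient()
            ssh.load_system_host_keys()
            ssh.connect(hostname=self.hostname, username=self.username,
                        key_filename=self.pkey)
            ssh.close()
        except AuthenticationException:
            self.logged_in_emoj = '⛔️'
            self.logged_in = False

    def __str__(self):
        return f'{self.username}@{self.hostname} {self.logged_in_emoj}'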
I then create a DataFrame, VMs, which holds all the IPs for our 4 VMs on GCP.
VMs = pd.DataFrame(dict(IP=['220.127.116.11',
We can then call the apply() method on the DataFrame, which iterates through each host and creates a VM Class object for each VM, which gets stored in the VM column of the VMs DataFrame.
VMs['VM'] = VMs.apply(lambda row: VM(row.IP, USERNAME, PUB_KEY), axis=1)
Note that the __str__() method of our VM Class is used to represent the VM Class in a DataFrame, as seen below. Each VM is represented as username + ip + ✅, exactly how we defined it in the __str__() method.
Ok great, we’ve created a DataFrame with a bunch of connected VMs inside. What can we do with these?
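The pattern of storing objects in a DataFrame column can be tried without any real servers. The sketch below swaps the VM Class for a hypothetical FakeVM that never opens a connection, and uses addresses from the reserved documentation range 192.0.2.0/24 rather than real IPs. The username is a placeholder too.

```python
import pandas as pd


class FakeVM:
    """Hypothetical stand-in for the VM Class; it does not connect anywhere."""

    def __init__(self, ip, username):
        self.hostname = ip
        self.username = username

    def __str__(self):
        return f"{self.username}@{self.hostname}"


USERNAME = "demo"  # placeholder username
vms = pd.DataFrame({"IP": ["192.0.2.1", "192.0.2.2"]})

# axis=1 passes each row to the lambda, so row.IP is that row's address
vms["VM"] = vms.apply(lambda row: FakeVM(row.IP, USERNAME), axis=1)

# Each cell holds a full object, so its attributes and methods stay available
labels = vms.VM.apply(str).tolist()
print(labels)  # ['demo@192.0.2.1', 'demo@192.0.2.2']
```

Because every cell in the VM column is a real object, any method on the Class can be called for all machines at once with a single apply().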
For those of you who don’t know, there is a command called lscpu on Linux which displays information about the CPUs on a machine. Below is an example output.
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model name: Intel(R) Xeon(R) CPU @ 2.00GHz
CPU MHz: 2000.170
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 56320K
NUMA node0 CPU(s): 0
We are now looking to get the output of the lscpu command for each of our 4 VMs on GCP. We can pass the lscpu command to the exec_command() method (see the github repo) to return the output from each VM.
lscpu = VMs.VM.apply(lambda vm: vm.exec_command('lscpu'))
With which we can obtain a DataFrame like the one shown below.
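As a rough sketch of the idea (not the exact code from the repo), a helper along the following lines runs a command over an already-connected paramiko SSHClient and turns "key: value" output, such as that of lscpu, into a dictionary. The names run_remote and parse_colon_output are hypothetical, and the sample text is a trimmed copy of the output above.

```python
def run_remote(client, command):
    """Run a command on a connected paramiko SSHClient and return its stdout as text."""
    # SSHClient.exec_command returns three file-like objects
    _stdin, stdout, _stderr = client.exec_command(command)
    return stdout.read().decode()


def parse_colon_output(text):
    """Turn lines of the form 'key: value' into a dict, skipping other lines."""
    parsed = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            parsed[key.strip()] = value.strip()
    return parsed


sample = """Byte Order: Little Endian
Thread(s) per core: 1
Model name: Intel(R) Xeon(R) CPU @ 2.00GHz
"""
info = parse_colon_output(sample)
print(info["Model name"])  # Intel(R) Xeon(R) CPU @ 2.00GHz
```

One dictionary per VM can then be collected in a list and passed to pd.DataFrame() to get one row per machine.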
Another useful command is cat /proc/meminfo, shown below, which returns the current state of the RAM on a Linux machine.
louwjlabuschagne_gmail_com@my-vm1:~$ cat /proc/meminfo
MemTotal: 1020416 kB
MemFree: 871852 kB
MemAvailable: 835736 kB
Buffers: 10164 kB
Cached: 53504 kB
SwapCached: 0 kB
Active: 92012 kB
Inactive: 17816 kB
Active(anon): 46308 kB
Inactive(anon): 4060 kB
Active(file): 45704 kB
Inactive(file): 13756 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 28 kB
Writeback: 0 kB
AnonPages: 46176 kB
Mapped: 25736 kB
I’ve extracted the most relevant columns from the output of the cat /proc/meminfo commands and display an overview of our 4 VMs below. We can plot this information quickly with a library like plotly, which works great out of the box with DataFrame objects, or we can get summary statistics for all our VMs using the built-in methods pandas has.
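To give a feel for the extraction step, here is a small sketch that turns /proc/meminfo text into integers you can do arithmetic on. The function name parse_meminfo is hypothetical, and the sample is just the first three lines of the vm1 output shown earlier.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field to its size in kB as an integer."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines that look like "Name: <number> kB"
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


sample = """MemTotal: 1020416 kB
MemFree: 871852 kB
MemAvailable: 835736 kB
"""
mem = parse_meminfo(sample)

# Share of RAM that is not available to new processes
used_fraction = 1 - mem["MemAvailable"] / mem["MemTotal"]
print(mem["MemTotal"])  # 1020416
```

Running this per VM and stacking the resulting dictionaries into a DataFrame gives a numeric table that describe() or a plotting library can work with directly.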
This post has only scratched the surface of how using Classes and DataFrames in conjunction with each other can ease your life. Be sure to check out the jupyter notebook on the github repo to fill in some coding gaps I’ve alluded to in this post.
The next time you are doing data wrangling with pandas I encourage you to take a step back and consider wrapping some of the functionality you need in a Class and seeing how that could improve your workflow. Once written, you can always reuse the Class in your subsequent analysis or productionise it with your code. As the python mindset goes: “Don’t reinvent the wheel every time.”