Cloud Host Any Tool on AWS from Scratch
How I hosted a deep learning labelling tool on AWS by using only the most basic products.
Data labelling is THE largest pain in the ass when it comes to anything relating to deep learning. As a result, multiple startups that build tools to aid the data labelling process have received awesome funding rounds just this year! Full-blown tech companies with unique labelling requirements, such as Tesla, even build their entire labelling software from scratch. This seemingly innocent step in the deep learning pipeline is just that important.
As an ML practitioner, data labelling is nothing new to me. I have always used open-source labelling tools such as Label Studio and Diffgram, because I can customize and tweak the software in the backend. It is easy and straightforward to install the tool on the local machine that contains the data to be labelled! Local instances are great: access and security are simple to configure, latency is low, and data management is easy. However, when working on a larger project that required multiple collaborators in the labelling process, I faced the ultimate challenge of all: we needed to scale our solution.
Thankfully, I am located in Singapore, which is home to data centers from all the Big Boys. Cloud hosting seemed like the most obvious and quickest way to let multiple collaborators onto the data labelling platform. Here is my journey, as a complete beginner to the Cloud, in hosting a labelling platform from scratch on AWS!
Creating an AWS account is pretty easy. I already had one from playing around a few years back, so I won’t share much on this. AWS offers free tier pricing for nearly all of the basic products here, so feel free to try it yourself! There are insanely detailed ways you can manage your user accounts, privileges, and security access but that could take years to get through. (If you are a̶ ̶b̶o̶r̶i̶n̶g̶ ̶p̶e̶r̶s̶o̶n̶ bored, feel free to read about that here.)
Launch EC2 Instance
EC2 stands for Elastic Compute Cloud, and I have no idea what that means. All I know is that EC2 gives you access to a piece of a computer (called an instance) sitting somewhere in one of AWS’s data centers.
To do this, I performed the following main steps:
- Navigate to the EC2 tab and click ‘Launch Instance’. Pretty straightforward!
- Select Ubuntu 20.04 as the AMI (this is something like the OS for your instance). Ubuntu because it’s the only one I recognize.
- Choose an instance type. I went with a t3.xlarge with 4 vCPUs and 16 GB of RAM because I’m rich. Poor (PhD) students can go ahead with t2.micro, which is free-tier eligible, but results may vary.
- IMPORTANT: In the ‘security group’ tab, make sure you open port 22 for SSH access, either to the whole internet or only to your own IP address if you wanna be more secure. Also open port 8080 so you can reach the web tool later on.
- Click ‘Review and Launch’ without a care for the rest of the configuration, because I’m lazy.
- You’ll have to name and create something called a key pair, which is required for security. Download the .pem file and keep it somewhere safe on your computer; you need it to authenticate yourself when connecting to your instance. I know it’s just going onto your Desktop, but remember that anyone who gets hold of this file can log in to your instance.
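The article does everything through the web console. As a rough command-line equivalent of the steps above, here is a sketch that assumes the AWS CLI is installed and configured. The names `label-studio-sg` and `label-studio-key`, the AMI ID, and the IP range are placeholders rather than values from the article; the key pair is assumed to exist already, and `--group-name` only works in a default VPC. By default the script only prints the commands so they can be reviewed first.

```shell
# Sketch only: every name and ID below is a hypothetical placeholder.
DRY_RUN="${DRY_RUN:-1}"          # 1 = print the commands, 0 = really call AWS
SG_NAME="label-studio-sg"        # hypothetical security group name
KEY_NAME="label-studio-key"      # hypothetical key pair, assumed to exist already
AMI_ID="ami-REPLACE_ME"          # placeholder: look up the Ubuntu 20.04 AMI for your region
MY_CIDR="203.0.113.10/32"        # placeholder: your own IP address in CIDR form

# Print the command in dry-run mode (quoting is flattened for display),
# otherwise execute it unchanged.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

launch_plan() {
  # Security group with SSH (22) and the web tool (8080) open to one address range.
  run aws ec2 create-security-group --group-name "$SG_NAME" \
      --description "SSH and Label Studio" &&
  run aws ec2 authorize-security-group-ingress --group-name "$SG_NAME" \
      --protocol tcp --port 22 --cidr "$MY_CIDR" &&
  run aws ec2 authorize-security-group-ingress --group-name "$SG_NAME" \
      --protocol tcp --port 8080 --cidr "$MY_CIDR" &&
  # One instance of the type used in the article.
  run aws ec2 run-instances --image-id "$AMI_ID" --instance-type t3.xlarge \
      --key-name "$KEY_NAME" --security-groups "$SG_NAME" --count 1
}

launch_plan
```

Restricting both ports to a single address range is the stricter of the two options mentioned above; collaborators would need their own ranges added to the port 8080 rule.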
After that, your instance is officially up and running on the magical Cloud, great job! Next, we have to connect to our running instance. To do that, click on ‘View Instances’ and then click on the ID of the instance you just launched.
Next, click on ‘Connect’, which is found near the top right. You should then see instructions on how to connect to your personal, unique, amazing EC2 instance! I have to assume you know how to use the command line if you’re reading this, so run the example SSH command shown on that page. Make sure you’re in the same directory as the key-pair file you downloaded previously (or point to it with its full path) for this to work!
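For reference, the connection step usually boils down to two commands: tighten the permissions on the key file, then SSH in as the `ubuntu` user. The sketch below wraps them in two small helper functions. The key path and host address are placeholders, and the function names are made up for this example.

```shell
# Placeholders: use your own key file and the address shown on the Connect page.
KEY_FILE="${KEY_FILE:-$HOME/Desktop/label-studio-key.pem}"
EC2_HOST="${EC2_HOST:-203.0.113.10}"

# ssh refuses private keys that other users can read,
# so make the file readable by its owner only.
lock_down_key() {
  chmod 400 "$1"
}

# Log in as "ubuntu", the default user on the Ubuntu AMI.
connect_ec2() {
  lock_down_key "$KEY_FILE" && ssh -i "$KEY_FILE" "ubuntu@$EC2_HOST"
}
```

Calling `connect_ec2` after setting `KEY_FILE` and `EC2_HOST` should land you at the instance's prompt.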
Setting up Label Studio
If all went well, you should see a nice ubuntu@ip-xx-xx-xx:~$ prompt in your terminal, showing that you now have access to the instance! Following that, I simply ran a quick bash script that wraps the installation instructions for Label Studio, taken from the official docs at labelstud.io. The commands below should be sufficient to get Label Studio installed.
git clone https://github.com/yeeyangtee/setup-label-studio.git
chmod +x setup-label-studio/setup.sh
./setup-label-studio/setup.sh
To check the installed version and then run the labelling tool as a web server:
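A minimal sketch of that step, assuming the setup script has put the standard `label-studio` command on the PATH. The `version` and `start --port` invocations are my assumption about the Label Studio CLI, not commands quoted from the article, and the wrapper function name is hypothetical.

```shell
LS_PORT=8080   # must match the port opened in the security group

start_label_studio() {
  # Bail out early if the install did not finish or the PATH is not set up.
  if ! command -v label-studio >/dev/null 2>&1; then
    echo "label-studio not found on PATH; did the setup script finish?" >&2
    return 1
  fi
  label-studio version                   # print the installed version
  label-studio start --port "$LS_PORT"   # serve the web tool until interrupted
}
```

Run `start_label_studio` on the instance. The server stays in the foreground, so keep the SSH session open (or use something like tmux) while people are labelling.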
To view the web tool, grab the public IP address from the instance details page. You can then connect by typing the IP address followed by :8080 into your browser. For our example, the complete address would be: http://22.214.171.124:8080/.
Once you get that into your browser, you should see the login page. You’re golden at this point! Feel free to create an account and sign in; this account will be unique to your instance. You will be able to use all the features of the labelling tool, such as uploading and annotating data, all running on the Cloud!
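If the page does not load, it can help to build the address in a script and probe it from your own machine. The sketch below is only an illustration: the helper names and the IP address are placeholders, and the probe assumes `curl` is installed.

```shell
# Build the web tool address from a public IPv4 address and an optional port.
label_studio_url() {
  printf 'http://%s:%s/\n' "$1" "${2:-8080}"
}

# Print the HTTP status code for the address, giving up after 5 seconds.
# A timeout usually means port 8080 is not open in the security group.
check_label_studio() {
  curl -sS -o /dev/null -w '%{http_code}\n' --max-time 5 "$(label_studio_url "$@")"
}

PUBLIC_IP="203.0.113.10"        # placeholder: copy yours from the instance details page
label_studio_url "$PUBLIC_IP"   # prints the address to paste into the browser
```

Running `check_label_studio "$PUBLIC_IP"` from your laptop is a quick way to tell a networking problem apart from a browser problem.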
If you followed all the default steps in this article, it should be fairly straightforward to launch a fully functioning labelling tool (or any other open-source tool, for that matter) on AWS. Of course, many things were skipped over, since this is simply a hype article.
Some things that I promise to cover in subsequent parts are:
- Database systems for backups. Label Studio uses SQLite3 by default, so we can back it up by simply dumping one file to separate cloud storage. Label Studio also supports PostgreSQL, which is great for more serious work!
- Security. Not much to say here, it’s really obvious that everything explained in this article is not secure at all.
- Integration with Amazon S3 or other Cloud storage tools. We need a better place to store all our large datasets, and the measly SSD storage on the EC2 instance just won’t cut it!
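As a small taste of the backup idea, here is a sketch that copies a SQLite file to a timestamped backup. The function name, the database path in the usage note, and the bucket name are assumptions for illustration, not details from the article. It is safest to copy the file while Label Studio is stopped, so the copy is not taken mid-write.

```shell
# Copy a SQLite file into a backup directory under a UTC-timestamped name
# and print the path of the copy. Runs in a subshell so no variables leak.
backup_sqlite() (
  src="$1"
  dest_dir="$2"
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  dest="$dest_dir/$(basename "$src").$stamp"
  mkdir -p "$dest_dir" && cp "$src" "$dest" && printf '%s\n' "$dest"
)
```

A possible usage, assuming the database lives at `~/.local/share/label-studio/label_studio.sqlite3` and that a bucket called `my-hypothetical-bucket` exists: `copy="$(backup_sqlite ~/.local/share/label-studio/label_studio.sqlite3 ~/backups)" && aws s3 cp "$copy" s3://my-hypothetical-bucket/label-studio/`.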