I’m a Data Scientist, why should I use the Cloud?
People often think that the Cloud is just a service for renting computers, provided by companies like AWS (Amazon Web Services) or GCP (Google Cloud Platform). Someone is letting me use their computer rather than my own? So what? My machine has all the power I need to build and develop my Machine Learning algorithms, so why should I use the Cloud? Let me explain how the Cloud can make your life a whole lot easier.
What if you want to try out some new, amazing open-source technology (e.g. Apache Spark) but don’t know where to start when it comes to setting it up on your machine? Why not just ask a Cloud provider to give you a machine, within minutes, with it pre-installed (e.g. AWS EMR or GCP Dataproc)? Easy.
Another scenario could be that you have a database running locally (i.e. not in the Cloud). You keep having issues: upgrades that never get done, queries taking FOREVER to return any data and the continual threat of running out of storage. One option is to keep fighting this losing battle, spending all of your time (and your data or software engineers’ time) trying to stop the database from freezing up at any moment. Why not just ask a Cloud provider to give you access to a database (running on machines managed entirely by them) that is designed to keep queries fast as your data grows, with no storage limit for you to worry about (e.g. AWS DynamoDB or GCP BigQuery)? Simple.
Perhaps you want to run a process regularly, like a daily report. Locally, you would need to install an application on a machine that would trigger each process to run at the correct time. Your chosen machine is now a single point of failure. It will crash, it will need updating, it will get old and, sooner or later, you will accidentally run an application that exhausts all of its computing resources. Why not just ask a Cloud provider to manage the schedule for you, on infrastructure built to survive the failure of any one machine? They will even retry individual processes if an error occurs (e.g. if your daily report process crashes before completion), and can send a failure notification once a process has been retried 3 times, for example (e.g. AWS Data Pipeline). Beautiful.
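To make the retry-then-notify behaviour above concrete, here is a minimal local sketch in Python. It is not how AWS Data Pipeline is implemented; the names `run_with_retries` and `make_flaky_report` are hypothetical, and it assumes "retried 3 times" means three attempts in total.

```python
from typing import Callable, Optional


def run_with_retries(
    task: Callable[[], str],
    notify: Callable[[str], None],
    max_attempts: int = 3,  # assumption: three attempts in total
) -> Optional[str]:
    """Run task, retrying on any exception; notify once every attempt has failed."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as error:  # a real scheduler would be more selective
            last_error = error
    notify(f"Task failed after {max_attempts} attempts: {last_error}")
    return None


def make_flaky_report(failures_before_success: int) -> Callable[[], str]:
    """Hypothetical daily report that crashes a fixed number of times first."""
    state = {"calls": 0}

    def report() -> str:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("report crashed before completion")
        return "report complete"

    return report
```

Calling `run_with_retries(make_flaky_report(2), print)` succeeds on the third attempt, while a report that keeps crashing triggers a single notification. A managed scheduler gives you this behaviour as configuration rather than code you have to maintain.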
You’ve just built an amazing Data Science algorithm. You’ve wrapped the code in a Docker container and now want to deploy it. Hosting it on a single machine means that you will run into the single point of failure nightmare mentioned in the previous paragraph. You also won’t be able to scale easily when demand increases. Your other option is to build a cluster architecture that turns several machines into one computing entity. If one machine goes down, no problem: your model will automatically be moved to a different part of the cluster. Setting this up yourself will be a very painful and slow exercise. Why not just ask a Cloud provider to create a cluster to host your model (e.g. AWS Elastic Container Service or GCP Kubernetes Engine)? Amazing.
The ease of working in the Cloud is hard to beat, and it goes further still. Cloud providers now offer a service called Serverless computing (e.g. AWS Lambda or GCP Cloud Functions). It lets you run code without needing to provision or manage any servers at all. You just provide the code to be run, nothing more. The provider executes your code only when needed and scales it automatically. Game changing.
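As a rough illustration of "just provide the code", here is a minimal Python sketch of a function in the shape AWS Lambda's Python runtime expects: a handler that takes an `event` and a `context`. Everything else is an assumption made for illustration: the `features` key in the event, the averaging "model" and the response layout (a common convention when the function sits behind an HTTP endpoint).

```python
import json
from typing import Any, Dict


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Score one request; the platform calls this once per invocation."""
    # Assumption: the caller sends {"features": [numbers...]}.
    features = event.get("features", [])
    # Hypothetical stand-in for a real model: the mean of the inputs.
    score = sum(features) / len(features) if features else 0.0
    return {"statusCode": 200, "body": json.dumps({"score": score})}
```

There is no server set-up anywhere in this file. You can also call the function locally, for example `lambda_handler({"features": [1.0, 2.0, 3.0]}, None)`, which makes it easy to test before uploading.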
So the Cloud is amazing, which must mean it’s expensive, right? Often the opposite is true. Firstly, both AWS and GCP (for example) provide free tiers. The competition between the increasing number of Cloud providers also helps to keep prices low. But most importantly, it’s pay-as-you-go. You pay only for the amount of data in your Cloud database or the computing power you use. There is no need to overbuy expensive hardware that you may only use for 8 hours a day. You can scale quickly and easily in the Cloud, with no upfront costs. What more could you ask for?
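The pay-as-you-go argument is simple arithmetic, sketched below in Python. The function names are hypothetical and any figures you pass in are your own; nothing here reflects a real provider's prices, and the comparison ignores electricity, maintenance and resale value.

```python
def on_demand_cost(hourly_rate: float, hours_per_day: float, days: int) -> float:
    """Pay-as-you-go: billed only for the hours the machine actually runs."""
    return hourly_rate * hours_per_day * days


def break_even_days(
    hardware_price: float, hourly_rate: float, hours_per_day: float
) -> float:
    """Days of use after which renting has cost as much as buying outright."""
    return hardware_price / (hourly_rate * hours_per_day)
```

With purely illustrative numbers, a machine rented at 0.50 per hour for 8 hours a day costs `on_demand_cost(0.50, 8, 30)` = 120.0 over 30 days, against 360.0 if it had to run around the clock. Set against a hypothetical 2000 purchase, `break_even_days(2000, 0.50, 8)` gives 500 days of use before renting has cost as much as buying.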
The Cloud is about making your life easier. Your goal as a Data Scientist is to run the best algorithms, with state-of-the-art technologies, as reliably as possible, with the least maintenance, using all the data available, with as little set-up time as possible, paying only for what you use, with the capability to scale quickly and easily if and when required. The Cloud gets you there; working locally doesn’t. Case closed.