Automate your Script Using GCP VM Instance and Cron Job

Warda Rahim
Dec 2, 2022

Data scientists often have scripts that need to be run on a daily, monthly, or quarterly basis. For example, you might be pulling data from an on-premises server where the data is entered every week. Would it not be great if you could automate that data pull and its integration into your workflow pipeline, without having to run all your scripts manually every single week?

When it comes to automating scripts, doing so locally raises the issue of your laptop not being on all the time. That is where a cloud platform, such as Google Cloud, comes in handy. One way of automating scripts is to use Cloud Functions and Cloud Scheduler.

Cloud Functions (a Google Cloud service) allow us to run our scripts in the cloud without setting up a server. For them to work, you need to configure the cloud function, specify its requirements, and upload your script to it (i.e., define the function that you want to be executed). Another nice thing about cloud functions is that they can be triggered in different ways, for example by an HTTP endpoint or a Pub/Sub topic, and this is specified during the configuration step. Finally, we can automate our cloud function by creating a cron job in Cloud Scheduler. The main drawback is that they can run for a maximum of 540 seconds (the limit for 1st gen functions), so for any process running longer than that you would have to resort to an alternative option.

In this article, I will walk you through a straightforward way to automate your scripts on a GCP VM instance. I am assuming here that you are familiar with the concept of a GCP VM instance, which is simply a virtual machine running on Google’s cloud infrastructure. Refer to these instructions on how to create a VM instance.

Once you have created a VM, you can connect to it using SSH. I find it convenient to do this with the gcloud command line. You can install the Google Cloud CLI using the instructions here, and here are the guidelines on how to connect to your VM using the Google Cloud CLI. Alternatively, you can open an SSH session to the VM in your browser from the Google Cloud console.

The automation of the script involves three steps:

  1. Create a cron job in your VM instance
  2. Create an Instance Schedule
  3. Link your VM to the Instance Schedule

To make this concrete, let’s imagine a scenario where we have a script that pulls data from a SQL server and writes it to SharePoint on a weekly basis. Writing data directly to SharePoint can be very useful if SharePoint is a major collaboration tool in your organisation. You can refer to this article, which uses Office365-REST-Python-Client to write files or pandas dataframes directly to SharePoint. Note that this requires SharePoint integration to be working on GCP (your GCP project should be configured to connect to your SharePoint site).

1. Create a Cron Job:

This brings us to the very first step in automating the script. In the terminal of our VM (the SSH session we opened earlier), we type crontab -e. This opens your crontab file, which contains the instructions for running cron jobs. You can have as many cron jobs running on your VM as you like, and the instructions for all of them live in this one file.

Each instruction is defined in the Unix cron format. For example,

35 13 * * * /home/warda/automate/data_refresh.sh > /home/warda/automate/logs/data_refresh_`/bin/date +\%Y-\%m-\%d`.log

The first part specifies the minute, hour, day of the month, month, and day of the week at which we want the script to run. In the above example, it runs every day at 1:35 pm (13:35) in the VM's system time zone. You can use tools like https://crontab.guru/ to make sure your cron schedule is correct. The second part is the path to the shell script that will be run at that time. Finally, we write the output of the shell script to the logs folder, where a file with the date appended to its name is created. The percent signs in the date format are written as \% because cron treats an unescaped % in the command as a line break.
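To make the order of the fields concrete, here is a small shell sketch that splits a crontab line into its named parts. The function name describe_cron_line is made up for illustration, and it only handles lines that start with the five time fields.

```shell
# Print each part of a crontab line under the name cron gives it.
describe_cron_line() {
    # read puts the first five whitespace-separated words into the schedule
    # variables and everything that is left over into "command".
    read -r minute hour day_of_month month day_of_week command <<EOF
$1
EOF
    printf 'minute:       %s\n' "$minute"
    printf 'hour:         %s\n' "$hour"
    printf 'day of month: %s\n' "$day_of_month"
    printf 'month:        %s\n' "$month"
    printf 'day of week:  %s\n' "$day_of_week"
    printf 'command:      %s\n' "$command"
}

# The schedule and script path from the example in this article.
describe_cron_line '35 13 * * * /home/warda/automate/data_refresh.sh'
```

One thing worth knowing about the log redirection: a plain > only captures standard output. If you also want error messages in the same log file, you can append 2>&1 to the end of the crontab line.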

Instead of specifying a time in our cron job, we can use @reboot, which runs the job every time the VM boots. The Instance Schedule (step 2 below) starts the VM at the time we specify there, the script then runs at boot, and finally the instance is stopped according to the Stop Time in our Instance Schedule.

@reboot /home/warda/automate/data_refresh.sh > /home/warda/automate/logs/data_refresh_`/bin/date +\%Y-\%m-\%d`.log
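If you would rather not edit the crontab interactively with crontab -e, an entry like the ones above could also be appended from a script. The sketch below is only one way to do it: add_cron_entry is a hypothetical helper name, and other_job.sh is a made-up entry used for the demo.

```shell
# Read an existing crontab on stdin and print it with one entry appended,
# unless exactly the same line is already there.
add_cron_entry() {
    entry="$1"
    existing="$(cat)"
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    # -x matches whole lines only, -F treats the entry as plain text.
    if ! printf '%s\n' "$existing" | grep -qxF -- "$entry"; then
        printf '%s\n' "$entry"
    fi
}

# Demo on a made-up crontab; nothing is installed here.
printf '0 1 * * * /home/warda/automate/other_job.sh\n' |
    add_cron_entry '@reboot /home/warda/automate/data_refresh.sh'

# To install for real on the VM (this rewrites the current user's crontab):
# crontab -l 2>/dev/null | add_cron_entry '@reboot /home/warda/automate/data_refresh.sh' | crontab -
```

Because the helper skips an entry that is already present, running it twice should not create duplicate jobs. You can check the result at any time with crontab -l.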

(If Python is your preferred programming language, you can either run a Python script directly or run any number of Python scripts from the shell script (data_refresh.sh). For example, the shell script could run three Python scripts: the first to get raw data from the SQL server, the second to pre-process/transform the data, and the third to write the final output to BigQuery or SharePoint. It basically depends on how you structure your code or ETL pipeline.)
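As a rough illustration of that three-script layout, data_refresh.sh might look something like the sketch below. The three Python file names are placeholders I made up, not files from a real project, and PYTHON_BIN is an assumed variable for choosing the interpreter.

```shell
#!/bin/sh
# data_refresh.sh: hypothetical wrapper that runs three Python scripts in order.
# The script names below are placeholders for your own extract, transform,
# and load steps.

# Interpreter to use. Cron runs with a minimal PATH, so an absolute path
# (for example /usr/bin/python3) is usually the safer choice.
PYTHON_BIN="${PYTHON_BIN:-python3}"

run_pipeline() {
    script_dir="$1"
    for step in extract_from_sql.py transform_data.py write_output.py; do
        echo "$(date '+%Y-%m-%d %H:%M:%S') starting ${step}"
        # Stop at the first failing step so later steps never see bad input.
        "$PYTHON_BIN" "${script_dir}/${step}" || {
            echo "$(date '+%Y-%m-%d %H:%M:%S') ${step} failed" >&2
            return 1
        }
    done
    echo "$(date '+%Y-%m-%d %H:%M:%S') all steps finished"
}

# Run the pipeline only when a script directory is given, for example:
#   data_refresh.sh /home/warda/automate
if [ "$#" -ge 1 ]; then
    run_pipeline "$1"
fi
```

With this layout, the crontab line would need the directory as an argument after the script path. The timestamps end up in the log file that the cron job writes, which makes it easier to see which step a failed run stopped at.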

2. Create an Instance Schedule:

Next, we need to go to Google Cloud >> Compute Engine >> VM instances, and click on Instance Schedules.

Instance Schedules icon on GCP

Then click on Create Schedule in the top bar.

Create Schedule icon on GCP

Clicking on Create Schedule will open a page like below:

Details to be filled in for creating an instance schedule on GCP

The name field is the name you would like to give this instance schedule. Note that the name is permanent: once the schedule is created, you cannot edit it. The region needs to match the region of the VM instance. The other important fields are the Start and Stop times, which should be chosen based on how long your script takes to run. Frequency, as the name suggests, specifies how often the schedule runs: daily, weekly, or monthly. For example, you can have your script running at 06:00 AM every Monday so that the up-to-date data is ready before you wake up.

3. Link your VM to the Instance Schedule:

Once the instance schedule is created, all you need to do is link your VM instance with it. Click on Instance Schedules. Here you will be able to see the instance schedule you just created. Click on it.

Location of your instance schedule created in the above steps

Then click on Add Instance to Schedule and pick the VM instance that you want to link to this instance schedule.

Add Instance to Schedule icon

And that’s it. Now every Monday, at the time specified when creating the Instance Schedule, your VM will start, the cron job will run, and finally the VM will be stopped, all without you even waking up.
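The console steps in sections 2 and 3 also have a command-line equivalent in the Google Cloud CLI. The sketch below assumes a Monday 06:00 start and a 07:00 stop; the schedule name, VM name, region, zone, time zone, and stop time are all placeholders of mine, so swap in your own values. By default it only prints the commands, so you can check them before running anything against your project.

```shell
# Placeholder values (assumptions): replace with your own.
SCHEDULE_NAME="weekly-data-refresh"
VM_NAME="my-vm"
REGION="europe-west2"        # must match the region of the VM
ZONE="europe-west2-a"
TIME_ZONE="Europe/London"

# With DRY_RUN=1 (the default here) each command is printed, not executed.
# The printed form drops the quotes around the cron expressions.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 2: create the instance schedule (start Monday 06:00, stop Monday 07:00).
create_schedule() {
    run gcloud compute resource-policies create instance-schedule "$SCHEDULE_NAME" \
        --region="$REGION" \
        --vm-start-schedule="0 6 * * 1" \
        --vm-stop-schedule="0 7 * * 1" \
        --timezone="$TIME_ZONE"
}

# Step 3: attach the schedule to the VM.
attach_schedule() {
    run gcloud compute instances add-resource-policies "$VM_NAME" \
        --resource-policies="$SCHEDULE_NAME" \
        --zone="$ZONE"
}

create_schedule
attach_schedule
```

Setting DRY_RUN=0 runs the two gcloud commands for real. The start and stop times use the same five-field cron format as the crontab entry in step 1, so the gap between them should comfortably cover how long your script takes to run.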

Conclusion:

In this article, we discussed how to automate a script using a GCP VM instance schedule and a cron job. Whether it is the automation of ETL processes or of algorithms put into production, automating repetitive tasks should be one of the main focuses of a data scientist so that they can channel their energies into more fun data science stuff. I hope you find this article helpful and can use it to automate some of the repetitive tasks in your daily work.

References

  1. https://cloud.google.com/compute/docs/instances/schedule-instance-start-stop
  2. https://cloud.google.com/scheduler/docs/creating
  3. https://cloud.google.com/functions
