Source: Freepik

Optimizing Google’s Cloud Infrastructure for Data Engineering and Analytics

Paul Nwosu
12 min read · Jan 13, 2023

--

In this article, as part of a series titled “Building your First Google Cloud Analytics Project”, I aim to guide data analysts and engineers in establishing a foundational understanding of utilizing cloud technology through the configuration of a Google Cloud Compute Engine Instance for data-related projects. The objective of this installment is to provide a comprehensive introduction to the Google Cloud Compute Engine and prepare readers for the subsequent projects to be covered in the series.

Prerequisites

  • A working Visa or Mastercard to set up your billing account
  • VS Code on your local machine
  • Access to the $300 in free credits provided by Google Cloud
  • Google Cloud SDK on your local machine
  • Basic knowledge of Bash

Why Cloud Computing?

As technology advancements continue to evolve, so do the requirements for computing and storage resources. The traditional limitations of personal and on-premise servers may no longer suffice for the demands of certain projects. Cloud technology presents a solution for this by providing scalable and easily accessible computing and storage resources. In this series, we will explore utilizing Google Cloud services to set up a project. This installment will not focus on the benefits of using cloud technology or Google Cloud specifically, but rather on the steps to set up your first project utilizing the platform.

Creating a Project on Google Cloud

In order to access Google Cloud resources, you have to set up a project. A project allows you to group all your resources together; this is useful for billing purposes, and also allows you to control who has access to what. Resources are the services you use on the Google Cloud platform. These services fall into three broad service models.

  • Infrastructure as a Service (IaaS): This is a type of cloud computing service that provides users with access to a virtualized computing infrastructure, including servers, storage, and networking resources. Users can deploy and run their own applications and operating systems on the infrastructure and are responsible for managing and maintaining what they run on it, while the provider maintains the underlying physical hardware. An example is Google Compute Engine.
  • Platform as a Service (PaaS): This is a type of cloud computing service that provides users with a platform for developing, testing, and deploying applications. The provider manages the underlying infrastructure, including servers, storage, and networking resources, and users are only responsible for developing and managing their applications. Examples include App Engine, Google Kubernetes Engine, and Cloud SQL.
  • Software as a Service (SaaS): This is a type of cloud computing service that provides users with access to a software application or suite of applications. The provider manages the underlying infrastructure, including servers, storage, and networking resources, and users access the software over the internet. An example is Google Workspace.

There is a fourth one that should be mentioned:

  • Serverless: This is a type of cloud computing that allows users to run code without having to worry about the underlying infrastructure. Instead of running code on dedicated servers, users can run code in response to events or triggers, and the provider automatically allocates the necessary resources to execute the code. Examples include BigQuery, Cloud Run, Cloud Functions, and Cloud Scheduler.

The main difference between these types of cloud computing services is the level of control and responsibility that users have over the underlying infrastructure and applications. IaaS gives users the most control, while PaaS and SaaS provide more abstraction and relieve users of the responsibility of managing the underlying infrastructure. Serverless computing takes this one step further by automatically allocating resources to execute code as needed.

A step-by-step guide to creating a Google Cloud Project

Go to https://console.cloud.google.com/ and log in with your Google account

You will need to set up billing in order to activate your $300 in free credits. It is important that you activate these free credits because of the tasks we will be running.

In the upper left corner, click on the dropdown and create a new project. You can call this project anything you like. Take note of the project ID: it is usually generated automatically and can only be edited at creation time.

Once you have set up the project, ensure that you select the project you just created.
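For readers who prefer the terminal, the same steps can be sketched with the gcloud CLI. This is a minimal sketch, assuming the Google Cloud SDK is already installed and you are logged in; the function name create_project and the example project ID are hypothetical, and billing still has to be linked to the project separately.

```shell
# Create a project and make it the default for later gcloud commands.
# $1: project ID (must be globally unique), $2: display name
create_project() {
  gcloud projects create "$1" --name="$2" &&
    gcloud config set project "$1"
}

# Example (hypothetical ID and name):
#   create_project my-analytics-project-123 "My Analytics Project"
#   gcloud config get-value project   # prints the currently selected project
```

Setting the default project here plays the same role as selecting the project in the console dropdown.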

Selecting the Project you just Created

Now that you have created a project, we will go ahead and create a Virtual Machine using Google Compute Engine:

Google Compute Engine

In the previous discussion, we highlighted that IaaS, PaaS, SaaS, and Serverless differ in the level of control they offer to users. Among the services offered by Google Cloud, Compute Engine falls under the category of IaaS (Infrastructure as a Service). It lets users create virtual machines on Google’s network, running either public images provided by Google for Linux and Windows Server or private custom images that can be created or imported from existing systems. Each virtual machine is attached to a Virtual Private Cloud (VPC) network, whose firewall rules control which traffic can reach the machine.

A step-by-step guide to creating a Compute Engine instance

Open the Navigation menu, click on Compute Engine, and then click on Create Instance.

Set the following configurations as seen in the image below.

Creating a VM using Google Compute Engine

For the Boot disk configuration, click on Change and make the following adjustments.

Setting the Boot Disk Configuration

Allow HTTP and HTTPS traffic (not essential for our projects), and then click on CREATE.

Finalizing the Set up of the Virtual Machine

It will take a couple of minutes to set up. Once this is done, we need to SSH into the virtual machine (remote instance).
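The console steps above could also be scripted with gcloud. The sketch below is only an approximation: the function name create_instance is hypothetical, and the Ubuntu image family is an assumption (chosen because the later steps use apt), not necessarily the settings used in the screenshots. Name, zone, machine type, and disk size are left to the caller.

```shell
# Create a VM with an Ubuntu boot disk.
# $1: instance name, $2: zone, $3: machine type, $4: boot disk size (e.g. 30GB)
create_instance() {
  gcloud compute instances create "$1" \
    --zone="$2" \
    --machine-type="$3" \
    --image-family=ubuntu-2204-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size="$4" \
    --tags=http-server,https-server
}

# Example (all values are placeholders):
#   create_instance my-instance us-central1-a e2-standard-4 30GB
```

The two network tags are the ones the console's HTTP and HTTPS checkboxes attach; they only have an effect if matching firewall rules exist in the project.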

What is SSH, and How to Set up a Direct Connection from your Local Machine to the Remote Instance

SSH (Secure Shell) is a protocol for securely accessing a computer (in this case, the remote instance) over an unsecured network.

Before you proceed, please ensure that you have installed the Google Cloud SDK on your local machine and that gcloud commands work in your terminal. If you have not, follow this link: https://cloud.google.com/sdk/docs/install-sdk to help you get started.

On the VM instances page, you will see a list of the virtual machines that you have created. Click on the dropdown next to SSH in the Connect column and select View gcloud command. Copy the command, then paste it into the terminal of your local machine. Also, take note of the external IP address of the virtual machine.

Logging into the Virtual Machine (Remote Instance)

When logging into a remote instance from your local machine for the first time, the above command creates two SSH keys: a private key and a public key. It uses the local user to configure access and stores the keys in the .ssh directory of your home directory. It is unclear whether the keys are automatically stored in the .ssh directory for Windows users, but it is confirmed for Mac and Linux users. These keys will be used for any future SSH connections with that user.

From your home directory, move into the .ssh directory:

cd ~/.ssh

You should see two files named google_compute_engine and google_compute_engine.pub. The former is the private key, the latter is the public key.

Create and open a config file in the .ssh folder:

touch config

nano config

Add an entry of the following form to the config file (the notes in parentheses are explanations and should not be typed):

Host alias_name (pick anything that's easy to remember)
HostName external_ip_address_of_instance (this will change if you restart the instance)
User name_of_user (this is usually the username on your PC)
IdentityFile ~/.ssh/private_key_name (the path to the private SSH key)

Here is what your config file should look like:

Host gcp_instance
HostName 203.0.113.134
User macbook
IdentityFile ~/.ssh/google_compute_engine
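If you would rather not edit the file by hand, the entry can be appended from the shell. This is a minimal sketch; the function name add_ssh_host is hypothetical, and the values you pass in are your own alias, IP address, user, and key path. Note that the keyword OpenSSH expects is IdentityFile.

```shell
# Append a Host block to an SSH config file and restrict its permissions.
# $1: config path, $2: alias, $3: external IP, $4: remote user, $5: private key path
add_ssh_host() {
  mkdir -p "$(dirname "$1")"
  cat >> "$1" <<EOF
Host $2
    HostName $3
    User $4
    IdentityFile $5
EOF
  # Keep the config readable and writable by the owner only
  chmod 600 "$1"
}

# Example (EXTERNAL_IP is a placeholder for your instance's address):
#   add_ssh_host ~/.ssh/config gcp_instance EXTERNAL_IP macbook ~/.ssh/google_compute_engine
```

Because the function appends, running it once per instance keeps all your aliases in the same config file.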

The external IP address of a virtual machine is subject to change, for example upon a manual upgrade or when the instance is shut down and restarted. This is similar to how a local machine’s external IP address can change when it is shut down or restarted. It is possible to reserve a static IP address, but that is outside the scope of this guide.
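When the address does change, the HostName line in the config file has to be updated. One way to do that from the terminal is sketched below; the function names get_external_ip and update_hostname are hypothetical, and the sketch assumes the instance has a single network interface with an external address.

```shell
# Print the instance's current external IP address.
# $1: instance name, $2: zone
get_external_ip() {
  gcloud compute instances describe "$1" --zone="$2" \
    --format='get(networkInterfaces[0].accessConfigs[0].natIP)'
}

# Replace the HostName value inside one Host block of an SSH config file.
# $1: config path, $2: host alias, $3: new IP address
update_hostname() {
  awk -v target="$2" -v ip="$3" '
    $1 == "Host" { in_block = ($2 == target) }
    in_block && $1 == "HostName" { sub(/HostName[ \t]+[^ \t]+/, "HostName " ip) }
    { print }
  ' "$1" > "$1.tmp" && mv "$1.tmp" "$1" && chmod 600 "$1"
}

# Example (instance name and zone are placeholders):
#   update_hostname ~/.ssh/config gcp_instance "$(get_external_ip my-instance us-central1-a)"
```

Only the block whose alias matches is touched, so entries for other instances in the same file keep their addresses.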

Now, log into your instance from any directory on the command line using:

ssh gcp_instance

We have successfully logged into our remote instance. Now let's set it up for our data-related projects.

Installation of Anaconda, Docker, Google Chrome, and Chrome Driver on our Remote Instance

Anaconda Installation

Before installing any components, you are required to update the package index. On the terminal of your remote machine, type in the command:

sudo apt update
sudo apt upgrade

To install Anaconda, go to Anaconda's download page, look for the Linux version, and then copy the link address. On the terminal of the instance, you can then use wget ‘link address’ to download Anaconda, just as seen in the command below. Installing Anaconda also installs Python 3 and Jupyter Notebook.

Note: You can always install python 3 as a standalone application.

 wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh

The above command downloads the Anaconda installer to the virtual machine; the following command then runs the installer:

 bash Anaconda3-2022.10-Linux-x86_64.sh

After the installation, you will be asked if you want the installer to initialize Anaconda3; type yes. You might need to restart the shell session (or the instance) before the conda command is recognized. To test that Anaconda was installed properly, use conda --version

You can then try running Jupyter Notebook using the command jupyter notebook and Python using the command python3

Google Chrome and Chrome Driver Installation

Most of the projects we are going to be doing will involve web scraping, and for that, we need certain tools such as Selenium. Selenium WebDriver is a tool used for automating web interactions and testing.

With Selenium, automated actions such as clicks, hovers, and form fills can be performed on browsers by directly interacting with them. It supports multiple programming languages including Java, C#, PHP, Python, Perl, Go, and Ruby.

To use Selenium, one needs to choose from a variety of browser options such as Firefox, Chrome (Chromium), Edge, and Safari.

For our projects, we will be working with Chrome in headless mode (which does not display the user interface) because we are working on a remote instance and only have access to the terminal.

I will not delve extensively into the technical details, but rather provide a concise list of the necessary commands to install Chrome and the driver on your remote instance. For more information, refer to the following blog posts:

The following commands are used to set up Google Chrome:

# Add Google's signing key to apt's trusted keys
wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -

# Adding Google Chrome to the repositories
sudo sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'

# Updating apt to see and install Google Chrome
sudo apt-get -y update

# Install Chrome
sudo apt-get install -y google-chrome-stable

# Check that Google Chrome was installed correctly
google-chrome --version

We then need to install the chrome driver

# Installing Unzip
sudo apt-get install -yqq unzip

# Download the Chrome Driver
wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/$(curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE)/chromedriver_linux64.zip

# Unzip the Chrome Driver into /usr/local/bin directory
sudo unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/

# Set display port as an environment variable
export DISPLAY=:99

# Check that the chrome driver works
chromedriver --url-base=/wd/hub
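Because Chrome comes from apt while the driver comes from whatever LATEST_RELEASE reports, the two can end up on different major versions, and ChromeDriver expects a matching Chrome major version. A small check along these lines can catch that early; the function names major_version and same_major are hypothetical helpers, not part of either tool.

```shell
# Extract the major version (the digits before the first dot) from a version
# string such as "Google Chrome 109.0.5414.74" or "ChromeDriver 109.0.5414.74 (...)".
major_version() {
  printf '%s\n' "$1" | sed -E 's/^[^0-9]*([0-9]+)\..*$/\1/'
}

# Succeeds (exit status 0) when both version strings share the same major version.
same_major() {
  [ "$(major_version "$1")" = "$(major_version "$2")" ]
}

# Example:
#   same_major "$(google-chrome --version)" "$(chromedriver --version)" \
#     || echo "Chrome and ChromeDriver major versions differ"
```

If the check fails, downloading the driver release that matches the installed Chrome version is usually the fix.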

Installation of Docker

Docker will help us package our application and all its dependencies into a container which we can then deploy. To install docker use

sudo apt-get install docker.io

Check the version you installed using docker --version

In order to ensure that docker runs without sudo, check the link.
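One common approach, described in Docker's Linux post-installation documentation, is to add your user to the docker group. The sketch below wraps those steps in a function; the name allow_docker_without_sudo is hypothetical. Keep in mind that members of the docker group effectively have root-level access to the machine.

```shell
# Let a user run docker without sudo by adding them to the "docker" group.
# $1: user name (defaults to the current user)
allow_docker_without_sudo() {
  user="${1:-$(id -un)}"
  # -f makes groupadd succeed even if the group already exists
  sudo groupadd -f docker &&
    sudo usermod -aG docker "$user"
}

# Example:
#   allow_docker_without_sudo
# Then log out and back in (or run: newgrp docker) so the new group applies.
```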

Now try docker run hello-world. In this case, hello-world is a Docker image. If you don’t have the image locally, Docker pulls it from the Docker registry and creates the container on your machine before running it.

Given that we have completed the majority of our setup, we could choose to conclude here. However, in order to enhance the development of our application, we can take an additional step and connect our remote instance to Visual Studio Code (VSCode). This will facilitate a more efficient development process. Vscode has an extension “Remote - SSH”, a tool that allows users to log into their remote instance from vscode.

A step-by-step guide on how to connect Visual Studio Code to a remote instance

From the extensions tab on vscode, search for Remote SSH, download and allow it to install its dependencies on vscode

To get started with it, look for the >< icon at the lower left portion of the VS Code GUI and click on it, then select Connect to Host. This will list the alias names you set up for your remote instances in the config file.

Installing the Remote SSH extension on vscode

Select the one you want to connect to. As you can see from the image above, I have a couple of instances I can connect to. You can set up every remote instance connection from one config file located in the .ssh folder on your home directory.

With this, we are good to go, and now we can connect to our remote machine from vscode. This setup brings many benefits, one of which is what we call port forwarding.

If you have started using a remote instance actively, you might wonder how to use apps that need a Graphical User Interface like Jupyter Notebook. With port forwarding, we can connect these apps running on our remote machine to our local machine. We can use VSCode to do this. Let’s see an example to understand it better.

First of all, I will log into my remote instance from the terminal of my mac and run jupyter notebook by typing jupyter notebook

Launching Jupyter Notebook

As seen from the image, I am provided with a link: http://localhost:8888/?token=6e873595950e652f43f941940039854223c92b9cbf8e71df. Copy the link provided from the terminal of your remote machine.

Open VS Code and connect to the remote instance (yes, you can have multiple sessions of your instance running at once). Go to the Ports section and click on Forward a Port. Type in 8888, as this is the port Jupyter Notebook runs on (we will cover this in more detail in other projects). This creates a local address.

Port forwarding on vscode

Before pasting the link for the Jupyter notebook in a browser, ensure that the port number specified in the Jupyter notebook link is the same as the one generated on the local address link. This will ensure that the connection is established correctly and the notebook can be accessed on the browser.

In the above case, both ports are the same because there is no application running on port 8888 on your local machine. However, if a Jupyter notebook is already running on your local machine, the local address link will use a different port number. In this case, make sure to change the 8888 in the Jupyter notebook link to the port number generated on your local machine. This avoids any conflict with the notebook already running on your local machine.
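The same idea also works without VS Code, using SSH's own local port forwarding. The sketch below is an alternative under stated assumptions, not the VS Code mechanism itself: forward_port and with_local_port are hypothetical helper names, and gcp_instance is the alias from the SSH config file.

```shell
# Forward a local port to a port on the remote instance over SSH.
# -N opens the tunnel without starting a remote shell.
# $1: local port, $2: remote port, $3: host alias from ~/.ssh/config
forward_port() {
  ssh -N -L "$1:localhost:$2" "$3"
}

# Rewrite the port in a localhost URL, keeping the path and token intact.
# $1: URL printed by Jupyter, $2: local port
with_local_port() {
  printf '%s\n' "$1" | sed -E "s#^(https?://(localhost|127\.0\.0\.1)):[0-9]+#\1:$2#"
}

# Example: local port 8888 is busy, so use 8889 instead.
#   forward_port 8889 8888 gcp_instance        # keeps running while the tunnel is open
#   with_local_port "http://localhost:8888/?token=..." 8889
```

The second helper does by script what the paragraph above describes doing by hand: swapping 8888 in the link for the port that is actually forwarded on your local machine.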

Jupyter notebook on the remote instance

SUMMARY

And that’s it, we have successfully been able to connect the jupyter notebook running on our remote instance to our local machine.

Wow, we have done a lot in just one article. Here is a summary:

  • Set up and Connect to a Google Compute Engine Instance
  • Set up a Configuration file for direct SSH access
  • Installation of Anaconda, Chrome, Chrome Driver, and Docker
  • Set up of Remote SSH connection from vscode
  • Accessing Remote Applications from our Local Machine

Before we conclude this article, it is important to note the ability to create an image of an instance in Google Cloud. This feature is significant as it allows for the preservation of the current environment, in the event that the instance needs to be shut down or deleted for cost-savings purposes. In a subsequent article, I will demonstrate the steps for creating an image of an instance. Overall, this feature is an essential aspect to keep in mind when working with Google Cloud as it enables efficiency and flexibility in managing instances and costs.

I trust that this article will serve as a valuable resource for individuals seeking to begin their journey with Google Cloud. There are a plethora of exciting capabilities and features to explore within the platform. However, it is important to exercise caution when experimenting with these features, as costs can quickly accumulate. Nonetheless, with the proper guidance and understanding, Google Cloud can be a powerful tool for any organization or individual. With that said, this article marks the end of our journey. I hope it was informative and that you are now ready to embark on your own adventure with Google Cloud.

Next Post: Building an ETL pipeline leveraging Google Service Accounts
