Opinion — In the Cloud: Cloud Operations

Hairizuan Noorazman
Published in GDG Singapore
5 min read · Nov 6, 2020

Hey everyone, Hairizuan here. Hope you’re doing well and staying healthy during this COVID situation.

Google Developers Space recently hosted another webinar in their “In the Cloud” webinar series. This time they covered a pretty important (albeit less trendy) topic: Cloud Operations. The webinar covers the tooling that lets developers understand how their applications behave in production environments.

Before jumping into the opinion section of this article, let’s recap the livestream itself:

TL;DR: The webinar session essentially covers the importance of ensuring that your applications are “operationally ready” by utilising some of the tooling that Google Cloud Platform offers, namely:

  • Cloud Logging
  • Cloud Monitoring
  • Cloud Error Reporting
  • Cloud Trace
  • Cloud Profiler
  • Cloud Debugger

The main documentation for all these features can be found here:
https://cloud.google.com/stackdriver/docs
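
As a rough illustration of how little glue code is needed to get application logs into Cloud Logging, here is a minimal sketch (my own, not from the webinar) using the google-cloud-logging Python client. It assumes application default credentials and a project where the Cloud Logging API is enabled.

```python
# Minimal sketch: forward standard Python logging output to Cloud Logging.
# Assumes `pip install google-cloud-logging` and application default
# credentials for a project where the Cloud Logging API is enabled.
import logging

from google.cloud import logging as cloud_logging

client = cloud_logging.Client()
# Attach a Cloud Logging handler to the root logger so that ordinary
# logging calls below are shipped to Cloud Logging.
client.setup_logging()

logging.info("application started")
logging.error("could not connect to the database")
```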

Before proceeding further, we must first understand and define what it means for our applications to be “operationally ready”. To do so, we need to understand the situations that developers and application operators face when getting an application into production and keeping it alive and working properly.

In some organisations, there is still a practice where developers are not too heavily involved with applications once they go to production. When applications go down, an operations team on standby duty responds. The operations team needs to quickly understand the situation, debug it and find ways to get the applications back up and running. Only if all else fails are the developers contacted. One of the reasons for this split responsibility is to ensure that developers don’t get too bogged down with operating the software they built (at the end of the day, developers are still expected to churn out more features for the organisation).

Let’s assume that applications are deployed on virtual machines.

In the practice mentioned above, developers generally wouldn’t have access to production environments. Access to production environments is usually limited to operations teams (partly as a security measure), so developers cannot easily get onto the production machines to retrieve the data they would need to analyse issues with their applications. Examples of such data are application logs (maybe errors are printed within the logs?), system metrics (is there sufficient CPU/memory/disk space to run the application?) or application metrics (did the application serve its requests successfully, and how quickly?).

How can developers get the data they need to debug their applications? It would be quite depressing for developers if, every time they wanted application logs/metrics or system metrics, they had to file a “ticket” asking operations to run a specific command on a specific machine to retrieve the data. That would result in long development cycles, with developers testing their codebase ever more rigorously (to the point where it makes little economic sense), or even a reluctance to add new features for fear of introducing new bugs/issues into their applications.

One way this was solved was to have centralised logging and monitoring systems, where agents installed on the machines retrieve the logs and metrics and push them to a central location. Developers are given access to this central store of logs and metrics, which allows them to debug their applications somewhat more easily.

So with that in mind, we now realise the following:

  • Applications need to log in a format that the centralised logging system accepts (see the sketch after this list)
  • Applications should not spit out too many logs (or the team should negotiate exceptions with the logging system’s maintainers). Processing and storing too many logs can stress the systems quite a bit (or cost a lot on the cloud platform they are deployed on)
  • Applications should provide metrics in a common format that the monitoring system accepts
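
As a sketch of the first point above, “logging in a way the centralised system accepts” often just means writing one JSON object per line to stdout so that the log agent on the machine can parse individual fields. The field names below (timestamp, severity, logger, message) are illustrative assumptions on my part, not a format mandated by any particular system.

```python
# Minimal sketch: emit one JSON object per line so that a log-collection
# agent (e.g. fluentd or a cloud logging agent) can parse fields instead
# of treating each entry as an opaque string. Field names are illustrative.
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "severity": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("orders").info("order processed")
```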

So, if an application has logging and metrics capabilities in place, would that mean it is “operationally ready”? Unfortunately, this is where we would have to disagree.

“Operations” is a pretty loaded word. The way Cloud Operations is marketed seems to put the emphasis on observability tools such as logging and monitoring, although there are many other things to consider as well. If you think from the perspective of a member of the operations team, you would want to know the full lifecycle of the application: how to commission, upgrade and decommission it, plus any workarounds for known issues while the application is running.

If you happen to join one of the bigger organisations with more established engineering practices, you would realise that before any application team can get their application into a production environment, they need to go through a whole checklist of items and multiple rounds of reviews. Here are some example items that can be expected:

  • Does the application require a database? How is the database migration handled? How quickly can the migration be done?
  • Does the service require high availability?
  • What are the expected resource requirements for the application (e.g. how much CPU, memory, disk and network is needed)?
  • Performance tests; for example, running the application under a low load for an extended period of time to ensure it doesn’t show resource build-up issues (e.g. a memory leak)
  • Should the application be available during the upgrade phase? Is downtime acceptable during updates? If downtime is unacceptable, does the application allow a rolling update to happen (see the sketch after this list)?
  • In the case where issues happen (e.g. unexpected downtime), are there playbooks (a series of actions the operations team can follow) to quickly debug and resolve them (a hacky solution rather than a permanent one)? This is just to ensure that the application can stay up long enough for the developers to ship an updated version
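
To make the rolling-update item above a bit more concrete, here is a hypothetical sketch of a health endpoint that a load balancer or orchestrator could poll during an update, so that traffic only goes to instances that are ready. The /healthz path and port 8080 are my own assumptions, not anything prescribed in the webinar.

```python
# Hypothetical sketch: a /healthz endpoint that an orchestrator or load
# balancer can poll, so instances only receive traffic when they are ready.
# This is what makes a zero-downtime rolling update practical.
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # A real check would verify dependencies (database, caches, ...);
            # this sketch always reports healthy.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```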

Unfortunately, each organisation has its own set of practices that it adheres to, so it is pretty hard to find a generic list of things to consider before saying an application is really operationally ready. But then again, if we only deployed applications once they had fulfilled everything on the checklist, we would deploy nothing. Everything in life is a trade-off; if a feature is expected to bring a lot of value to the organisation, management might decide to bite the bullet and ask operations to spend more of their resources to maintain and run the feature. While the operations team spends its resources on keeping the application running in production, the application team can look into adding the code/configuration required to meet the remaining operational needs.

This article is an opinion piece, so take the advice/opinion with a grain of salt. If you have opinions on this as well, feel free to comment below.
