Cloud Service Management and Operations
This is the first blog in a series where I will focus on monitoring “the stuff” that makes up today’s dynamic IT environments. In this first blog, we’ll set a baseline on what monitoring is and expand from there. I have also tried to write this in a way that helps folks who may not have a background in service management or monitoring but want to know more. Throughout the series, there will also be “guest” authors to give us their thoughts.
Merriam-Webster defines monitoring as: to watch, keep track of, or check, usually for a special purpose. For monitoring used as a verb, a thesaurus gives us the synonyms audit, check, control, follow, observe, oversee, scan, supervise, survey, track, and my favorite: “keep an eye on”.
Whether you are running traditional enterprise, cloud native, or hybrid applications, monitoring touches many aspects of IT that are required building blocks for a robust enterprise management solution. Data generated by a monitoring solution can be leveraged by numerous stakeholders within an organization, and not just for alerting, as demonstrated by the following monitoring aspects:
- On-premises infrastructure and environment: Servers (CPU, Disk, Memory), Hardware (Fans, Boards, and Temperature), and the HVAC that keeps the Data Center at the proper temperature.
- Application Performance Management: Are the applications performing as expected and meeting customer needs? Are the runtimes healthy?
- Network: Much like infrastructure monitoring, the network and its associated hardware are monitored for availability. The ability to monitor activity on the network is vital to overall performance and to security.
- Logs: Servers, applications, network equipment, and security devices all produce logs, where problems and other information are constantly recorded and saved for analysis. Known log events/incidents should be alerted on, and these logs are the go-to place for diagnosing first-time anomalies.
- Cloud: With a public cloud, your business may not be directly responsible for the infrastructure, but it is still important to monitor the health of whatever you have deployed there. Private cloud presents an interesting scenario, where the supporting infrastructure (typically VMs) must be monitored as well.
- Security: Commonly referred to as Security Information and Event Management (SIEM) or Continuous Security Monitoring. These solutions monitor for cyber threats using logs (see above), analytics against the collected log data, and security devices with robust algorithms to detect threats.
- Dashboards and Topologies: Visualizations of monitoring and performance metrics in logical groupings.
- Alerting/Help Desk Integration: Notification of the proper first responder about actionable events generated by the monitoring tool(s). Ticketing for Incident Management.
- Synthetic Transactions: The playback of recorded application transactions from various points of presence to ensure application availability and response times.
- End user experience monitoring: What is the perception from our end users? Quite the challenge with mobile apps given all the different devices, OS(s), and network providers.
- Service Level Indicators: Monitoring of the Key Performance Indicators to ensure SLOs/SLAs are, first, attainable and, second, being met.
- Business processes: Tracking business performance as related to quality, cost, and time. Or put another way, applying analytics to the preceding monitoring aspects.
The ability to understand what is happening within each of the monitored aspects and to correlate the data from the various sources is essential for accurate alerting, incident management, trending, SLIs, and so on.
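To make the Service Level Indicator idea a little more concrete, here is a minimal Python sketch of how an SLI could be derived from monitoring data and compared against an SLO. The `Request` record, the function names, the sample data, and the 99% target are all assumptions made for illustration; they do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    ok: bool           # did the request succeed?
    latency_ms: float  # how long it took to complete


def availability_sli(requests):
    """Fraction of requests that succeeded, between 0.0 and 1.0."""
    if not requests:
        return 1.0  # assumption: no traffic is treated as "objective met"
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests, threshold_ms):
    """Fraction of requests that completed within threshold_ms."""
    if not requests:
        return 1.0
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


def meets_slo(sli, slo_target):
    """True when the measured indicator reaches the agreed objective."""
    return sli >= slo_target


# Hypothetical sample data: three successes and one failure.
sample = [
    Request(ok=True, latency_ms=120),
    Request(ok=True, latency_ms=340),
    Request(ok=False, latency_ms=900),
    Request(ok=True, latency_ms=95),
]

availability = availability_sli(sample)   # 3 of 4 requests succeeded
fast_enough = latency_sli(sample, 300)    # 2 of 4 finished within 300 ms
slo_met = meets_slo(availability, 0.99)   # compared against an assumed 99% SLO
```

In practice the records would come from whichever data source the organization already collects (logs, APM traces, or synthetic transactions), which is where correlating the sources becomes important.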
Building and maintaining a monitoring solution takes time, and I mean a lot of time! Much like owning a home, there is always something to do to continually improve the solution. Today, the monitoring software market is flooded with tools one can use to build a robust solution. An enterprise-strength solution typically comprises several tools; our CSMO customers usually run at least five different monitoring tools (not counting network and security) in production. The tools have generally been brought in by the traditional silos (apps, infrastructure, etc.) or acquired specifically to support the journey to cloud. Adding to the fun, depending on whom you talk to, the requirements for what is important in a monitoring solution vary greatly. For example:
- DevOps: Focused on runtime, transaction performance, and user experience.
- The Operator: Focused on availability and infrastructure.
- The Site Reliability Engineer: Focused on enhancing solutions/tools and processes, and on ensuring an available and reliable product for the application’s customers.
- The Line of Business: Focused on user experience, SLIs, transaction rates, and orders.
Businesses are adopting a DevOps journey, leveraging methods such as CI/CD, Blue-Green, and Canary deployments for the ever-evolving applications in order to meet business needs faster than ever before. Application changes are being deployed based on sprints: weekly, monthly, and even daily in some cases.
Just as speedy development and deployment have become standard practice for a business, monitoring needs to keep up with this velocity. Even today, monitoring for an application and its necessary integrations is generally not considered until the application is close to deployment or already deployed in production.
Operating in this fashion will:
- Create unnecessary incidents.
- Prevent you from having the necessary solution in place for the application.
- Cause churn and morale issues across teams.
- Reduce confidence in the monitoring solution.
Every component of an application presents different characteristics, and each should be monitored based on its uniqueness. It is uber important to understand what the “working as normal” state of the application is and to monitor for symptoms, not the cause.
For example, take something as simple as CPU performance monitoring in an application that does a lot of compute, such as credit card validation. In this environment, periods of high utilization are normal, should be tolerated, and are not alertable. Conversely, high CPU busy on an HTTPS server would have much less tolerance, and the cause of the CPU busy should be investigated.
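The CPU example can be sketched as per-component profiles rather than one global threshold. This is only an illustration in Python: the component names, the percentages, and the number of tolerated samples are assumed values, and a real profile would come from observing the application's normal working state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuProfile:
    """What "high CPU" means for one component (values are assumptions)."""
    busy_pct: float         # utilization at or above this counts as "high"
    tolerated_samples: int  # consecutive high samples tolerated before alerting


# Hypothetical profiles: the compute-heavy service tolerates long busy
# periods, while the HTTPS server alerts after a short one.
PROFILES = {
    "card-validation": CpuProfile(busy_pct=95.0, tolerated_samples=10),
    "https-server": CpuProfile(busy_pct=80.0, tolerated_samples=2),
}


def should_alert(component, samples):
    """Return True if high CPU lasted longer than the component tolerates."""
    profile = PROFILES[component]
    streak = 0  # length of the current run of consecutive high samples
    for pct in samples:
        streak = streak + 1 if pct >= profile.busy_pct else 0
        if streak > profile.tolerated_samples:
            return True
    return False


# The same burst of CPU readings, judged against each profile.
burst = [85, 90, 92, 88, 40]
card_alert = should_alert("card-validation", burst)  # never reaches 95%
https_alert = should_alert("https-server", burst)    # 4 samples in a row >= 80%
```

The same readings produce no alert for the compute-heavy component and an alert for the HTTPS server, which is the point: the threshold follows the component's characteristics.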
Rather than taking a broad stroke at monitoring applications, where every component is monitored with a pre-established set of thresholds, understanding the characteristics gives you the ability to monitor the environment properly and tune the solution as required by the application. In doing so, the “noise” from too many alerts is reduced. It is imperative for organizations to shift-left/shift-right and begin testing and monitoring applications as soon as there is a Minimum Viable Product (MVP). Doing so provides an understanding of the “working state” of the application, and ideally any tests used in the development cycle should follow the application into production and be leveraged as synthetic transaction monitoring.
Certainly, there are components in the IT landscape that are perhaps less prone to change and do not require 24x7 support. These components can be profiled for a certain set of monitoring. These environments are not the focus of this blog series.
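As a rough illustration of a development-time test that can follow the application into production, here is a small Python sketch of a synthetic check. The function names, the 2-second latency budget, and the `example.invalid` URL are assumptions; the `fetch` parameter is injectable so the same check can run against a stub in a test suite and against the real endpoint from a point of presence.

```python
import time
from urllib.request import urlopen


def http_fetch(url, timeout_s=5.0):
    """Fetch url over HTTP(S) and return the response status code."""
    with urlopen(url, timeout=timeout_s) as resp:
        return resp.status


def synthetic_check(url, fetch=http_fetch, max_latency_s=2.0):
    """Run one synthetic transaction and report availability and latency.

    The check is healthy only if the fetch returns status 200 within
    max_latency_s (an assumed budget). Any exception counts as unhealthy.
    """
    start = time.monotonic()
    try:
        status = fetch(url)
    except Exception as exc:  # network errors, timeouts, HTTP 4xx/5xx
        return {"url": url, "healthy": False, "error": str(exc)}
    elapsed = time.monotonic() - start
    return {
        "url": url,
        "healthy": status == 200 and elapsed <= max_latency_s,
        "status": status,
        "latency_s": elapsed,
    }


# In a development test, a stub stands in for the real endpoint.
def stub_ok(url):
    return 200


dev_result = synthetic_check("https://example.invalid/health", fetch=stub_ok)
```

In production, the same `synthetic_check` would be called with the default `fetch` on a schedule, and an unhealthy result would feed the alerting path.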
Which brings us to our next topic:
Where there’s smoke there’s fire.