An important error in Microsoft System Center Operations Manager (SCOM) that you should know about, and its solution
This article is intended for IT professionals, especially those who work with Microsoft's monitoring platform System Center Operations Manager, also known as SCOM.
The goal is to explain an error in the SCOM product that affects reports that show the availability of the systems it monitors and, therefore, prevents proper monitoring.
Let’s start with the following example:
We have the SQL Agent of a SQL Server instance, and we monitor it with SCOM, which shows the status of this service on each server by means of the following icon:
Here we can see that the status is red, CRITICAL. To verify what has caused that state, we have to open the Health Explorer of the service and check its monitors. The following image shows that the service status is not Running:
It can be seen that the Critical state has been triggered by the SQL Agent service and that it has not been running since 17/11/2016 19:44. You can also see that it had another outage on 17/11/2016 and recovered the next day.
In this way we can obtain the intervals during which the service has been down, but what happens if the dates of the state changes are not shown? For instance, the following Health Explorer does not display dates:
This is because the state change dates can be erased by a script developed by Microsoft field engineer Kevin Holman, which deletes old data to avoid overloading the system.
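As a rough illustration of how outage intervals follow from a list of state changes, here is a minimal Python sketch. The state-change data and the function name `downtime_intervals` are hypothetical; this is not SCOM code and it does not query SCOM. It only shows the reasoning applied when reading the Health Explorer history.

```python
from datetime import datetime

# Hypothetical state-change history, oldest first: (timestamp, new state).
# The timestamps are illustrative and are not taken from a real Health Explorer.
state_changes = [
    (datetime(2016, 11, 15, 8, 0), "Healthy"),
    (datetime(2016, 11, 16, 10, 30), "Critical"),
    (datetime(2016, 11, 16, 12, 0), "Healthy"),
    (datetime(2016, 11, 17, 19, 44), "Critical"),
]


def downtime_intervals(changes, now):
    """Return (start, end) pairs during which the state was Critical.

    An outage that is still open at `now` is closed at `now`.
    """
    intervals = []
    down_since = None
    for timestamp, state in changes:
        if state == "Critical" and down_since is None:
            down_since = timestamp  # outage begins
        elif state != "Critical" and down_since is not None:
            intervals.append((down_since, timestamp))  # outage ends
            down_since = None
    if down_since is not None:
        intervals.append((down_since, now))
    return intervals


outages = downtime_intervals(state_changes, now=datetime(2016, 11, 18, 9, 0))
for start, end in outages:
    print(f"{start:%d/%m/%Y %H:%M} -> {end:%d/%m/%Y %H:%M} ({end - start})")
```

Without the state change dates there is nothing to feed into a calculation like this, which is why the reports become the only source for the availability history.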
To obtain the availability history of any system monitored with SCOM, you have to access the Reporting Service, which includes several reports that provide monitoring information such as Performance, Alerts and Availability.
In the case of availability, we can use the “Availability Report”, which shows the hourly availability of the service.
In the next image, each column represents one hour, and the colors show the percentage of that hour spent in each state. Red indicates that the state has changed to Critical. Gray indicates the state called “Monitoring Unavailable”, because at that time the SCOM agent was down. Once the agent is restored, the report shows the state of the service as Critical, since the SQL Agent is still down.
Let’s propose the following situation: what would happen if, while the SQL Agent is not running, the SCOM agent is down during an interval that spans a change of hour and then resumes? For example, the SCOM agent is down from 23:12 to 00:09. When the SCOM agent is restarted and we check the availability of the service, it has to show that the service is still down, right?
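The percentages in each hourly column can be thought of as the share of that hour spent in each state. The following Python sketch shows that arithmetic under stated assumptions: the interval data and the function name `hourly_breakdown` are hypothetical, and this is not how SCOM itself aggregates data.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hourly_breakdown(intervals):
    """Split (start, end, state) intervals at hour boundaries.

    Returns {hour_start: {state: percentage of that hour}}.
    """
    hours = defaultdict(lambda: defaultdict(float))
    for start, end, state in intervals:
        cursor = start
        while cursor < end:
            hour_start = cursor.replace(minute=0, second=0, microsecond=0)
            # The slice ends at the interval end or at the next hour boundary.
            slice_end = min(end, hour_start + timedelta(hours=1))
            seconds = (slice_end - cursor).total_seconds()
            hours[hour_start][state] += seconds / 3600 * 100
            cursor = slice_end
    return {hour: dict(states) for hour, states in hours.items()}


# Hypothetical intervals for one evening (illustrative values only).
intervals = [
    (datetime(2016, 11, 17, 21, 0), datetime(2016, 11, 17, 21, 30), "Healthy"),
    (datetime(2016, 11, 17, 21, 30), datetime(2016, 11, 17, 22, 15), "Critical"),
    (datetime(2016, 11, 17, 22, 15), datetime(2016, 11, 17, 23, 0), "Monitoring Unavailable"),
]

breakdown = hourly_breakdown(intervals)
for hour, states in sorted(breakdown.items()):
    print(f"{hour:%H:%M}", states)
```

With these sample intervals, the 21:00 column is half Healthy and half Critical, and the 22:00 column is one quarter Critical and three quarters Monitoring Unavailable.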
The following report shows the time intervals in which the previous situation occurred.
It can be observed that in the 23:00 and 00:00 columns the service status is Critical and Monitoring Unavailable, but from the 01:00 column onwards the status is shown as Healthy, even though the service is still CRITICAL because it has not been restored.
This means that if we want to obtain the availability of a monitor from the SCOM reports, the results can be wrong whenever the agent stops reporting for a time, and therefore it is not possible to determine real figures such as availability percentages or SLA compliance.
The error lies in the export of monitoring data from the Operational DB to the Data Warehouse: SCOM does not calculate state changes properly when an agent stops reporting during certain time intervals. It can be observed in versions from SCOM 2007 R2 to SCOM 2012 R2 Update Rollup 11, the latest published at the time of writing.
But do not despair, because this error has a solution: develop new reports with Visual Studio Business Intelligence that correct it.
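To make the expected result concrete, here is a minimal Python sketch of an hourly calculation that carries the last known state forward into hours with no state change. It is an illustration only: the event list, the dates attached to the 23:12 and 00:09 times, and the function name `hourly_states` are assumptions, and the code represents neither SCOM's internal aggregation nor the custom report described in this article.

```python
from datetime import datetime, timedelta

# Hypothetical event log, oldest first: (timestamp, state from that moment on).
events = [
    (datetime(2016, 11, 17, 19, 44), "Critical"),                # SQL Agent stops
    (datetime(2016, 11, 17, 23, 12), "Monitoring Unavailable"),  # SCOM agent stops
    (datetime(2016, 11, 18, 0, 9), "Critical"),                  # SCOM agent back, SQL Agent still down
]


def hourly_states(events, first_hour, hour_count):
    """Return {hour_start: {state: minutes}}, carrying the last known state forward."""
    report = {}
    for i in range(hour_count):
        hour_start = first_hour + timedelta(hours=i)
        hour_end = hour_start + timedelta(hours=1)
        # State in force when the hour begins: latest event at or before hour_start.
        current = None
        for timestamp, state in events:
            if timestamp <= hour_start:
                current = state
        cursor = hour_start
        minutes = {}
        # Apply every state change that falls strictly inside this hour.
        for timestamp, state in events:
            if hour_start < timestamp < hour_end:
                if current is not None:
                    elapsed = (timestamp - cursor).total_seconds() / 60
                    minutes[current] = minutes.get(current, 0) + elapsed
                current, cursor = state, timestamp
        # The remainder of the hour belongs to the state still in force.
        if current is not None:
            elapsed = (hour_end - cursor).total_seconds() / 60
            minutes[current] = minutes.get(current, 0) + elapsed
        report[hour_start] = minutes
    return report


report = hourly_states(events, datetime(2016, 11, 17, 23, 0), 3)
for hour, minutes in report.items():
    print(f"{hour:%H:%M}", minutes)
```

In this sketch the 01:00 hour contains no events, so it inherits the Critical state from 00:09 and is reported as 60 minutes Critical, which is the result one would expect from the availability report in this scenario.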
The following image shows a report that I developed, in which you can see the total availability percentages of the SQL Agent, the availability by hour and the detail of the state changes.
As can be seen in the following image, which compares the hourly availability in the two reports, the calculation error of the SCOM reports is solved by the report developed with Visual Studio, without manipulating any data in the Data Warehouse.
I hope you have found this article interesting and that Microsoft solves this problem in future SCOM versions.
If you have any questions, contact me through my LinkedIn profile.