Awareness: Alerting for developers

Awareness is a module that makes alerting simple for developers to implement. At MyOperator we use multiple alerting systems, such as Nagios and New Relic. Both have limitations in functionality and pricing, so we also added custom alerts using ElastAlert. Even so, creating a simple alert still costs a developer a lot of effort and time. Our requirements were unique, so we needed something built around them. This project is not complete yet; for now it is just an idea.

Purpose

Our current architecture faces many issues, and reporting and alerting are not done properly, for the following reasons:

  1. Lack of alerting in applications: alerts are missing mainly because configuring one is a complex and time-consuming process.
  2. Status of other services: applications are not aware of the current status of the other services involved, so they cannot take action when something is not working.

Awareness will keep track of the current status of the services on all servers, and alerts can be triggered when there is an issue. Awareness will provide the following features:

  1. Application alerts: today, alerting is very complex to implement. In Awareness, to trigger an alert for an application error, the developer only has to add an entry to a config file and subscribe to the alert on the Awareness panel.
  2. Service status checks: a centralised system for maintaining status checks of services like apache, mysql, supervisor etc. If any service is down, an alert can be raised.
  3. Central monitoring panel: the status of all services on all servers in the project can be checked from the panel, with service health shown in colour (green, orange and red). New alerts can also be set up from the panel.
  4. Commands: send a command to a server when some action is required, for example shut down apache or turn offline. Commands will be generated by the alerting system and sent to the server through spiderbelly (the central module described in the next section).

Modules

Awareness has several parts, each handling its own responsibility.

Agent

The agent will be installed on every server that is to be monitored, and the same agent will handle both application and service monitoring. It will be a Python script running as a daemon under supervisor. Each agent will have its own config file. In version 1 the agent can check the following:

  • Service: whether a service is running or not. If an apache check is configured, it will check whether apache is running.
  • File update: monitors a file. An alert is generated if the file has been modified within the last N seconds. It can also be configured to alert when the file has not been modified in N seconds.
  • Network: checks the network against a provided URL, covering packet loss and unreachable cases.
  • Database: checks the database connection from the server.
  • API: checks the connection to an API and the time taken to connect.
  • More checks can be added easily, with a few lines of code.
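
To illustrate how a new check could be added with a few lines of code, here is a minimal sketch of a check registry covering the two file-update variants. The names `CHECKS`, `register`, `file_modified_within` and `file_stale_for` are hypothetical and not part of any existing Awareness code. Each function only reports whether its condition holds; deciding whether that should raise an alert is left to the agent's config.

```python
import os
import time
from typing import Callable, Dict, Optional

# Hypothetical registry: check type name -> function that evaluates it.
CHECKS: Dict[str, Callable[..., bool]] = {}


def register(name: str):
    """Decorator that adds a check function to the registry under `name`."""
    def wrap(func):
        CHECKS[name] = func
        return func
    return wrap


@register("file_modified")
def file_modified_within(path: str, seconds: float, now: Optional[float] = None) -> bool:
    """Return True if the file was modified within the last `seconds` seconds."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) <= seconds


@register("file_not_modified")
def file_stale_for(path: str, seconds: float, now: Optional[float] = None) -> bool:
    """Return True if the file has NOT been modified in the last `seconds` seconds."""
    return not file_modified_within(path, seconds, now)
```

With this layout, a config entry would only need to name a check type (for example `file_not_modified`) and its arguments, and the agent could look the function up in `CHECKS`.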

After running all the checks, the agent will send the data to the central server, where every log is kept. Each result can be one of three types:

  • Health: no problem, the status is fine.
  • Error: there is a problem that needs to be looked at.
  • Issue: the status could not be checked because some error occurred during the check.
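
The difference between Error and Issue can be sketched in code. This is only an illustration under assumptions: the `Status` enum, the `run_check` helper and the record fields (`check`, `status`, `detail`, `ts`) are hypothetical names, not a defined Awareness format.

```python
import time
from enum import Enum
from typing import Callable


class Status(str, Enum):
    HEALTH = "health"  # the check ran and found no problem
    ERROR = "error"    # the check ran and found a problem
    ISSUE = "issue"    # the check itself could not be completed


def run_check(name: str, check: Callable[[], bool]) -> dict:
    """Run one check and wrap its outcome in a record for the central server."""
    try:
        status = Status.HEALTH if check() else Status.ERROR
        detail = ""
    except Exception as exc:  # the check failed to run at all
        status = Status.ISSUE
        detail = str(exc)
    return {"check": name, "status": status.value, "detail": detail, "ts": time.time()}
```

A check that returns `False` produces an Error record, while a check that raises (for example because the agent lacks permission to read a file) produces an Issue record carrying the exception message.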

The agent will also run the commands sent by spiderbelly.
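
Since the agent executes commands it receives over the network, one cautious way to do that is an allowlist: the server sends only a command name, and the agent maps it to a fixed argument list. The sketch below assumes this design; `ALLOWED_COMMANDS`, `execute_command` and the `systemctl` invocations are illustrative assumptions, not part of the original plan.

```python
import subprocess
from typing import Callable, Dict, List, Sequence

# Hypothetical allowlist: command name -> fixed argument vector.
# The systemctl calls are assumptions about how apache is managed on the server.
ALLOWED_COMMANDS: Dict[str, List[str]] = {
    "start_apache": ["systemctl", "start", "apache2"],
    "stop_apache": ["systemctl", "stop", "apache2"],
}


def execute_command(
    name: str,
    runner: Callable[[Sequence[str]], int] = subprocess.call,
) -> int:
    """Run an allowlisted command by name and return its exit code."""
    if name not in ALLOWED_COMMANDS:
        # Anything not explicitly listed is refused, never passed to a shell.
        raise PermissionError(f"command not allowed: {name}")
    return runner(ALLOWED_COMMANDS[name])
```

Because the agent never runs a string received from the network, a compromised or buggy central server can at worst trigger one of the predefined actions.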

Spiderbelly

It will collect data from all servers and store it in a database (Elasticsearch would be a good fit). It will be a simple API exposing an add method through which data from all servers is stored. In its response it will say whether certain actions need to be carried out on the system, such as switching to offline mode when there is an issue. Spiderbelly's responsibilities include:

  • Collecting data and storing it in the database.
  • Sending the alerts (and resulting commands) generated by the alerting system to the right server.
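
The add-and-reply behaviour can be sketched with an in-memory stand-in. This is not the real service: the class below keeps records in a list instead of a database, and the method `queue_command` and the reply fields `stored` and `commands` are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List


class Spiderbelly:
    """In-memory stand-in for the central collector."""

    def __init__(self) -> None:
        self.records: List[dict] = []
        self.pending: Dict[str, List[str]] = defaultdict(list)

    def queue_command(self, server: str, command: str) -> None:
        """Called by the alerting system to schedule a command for a server."""
        self.pending[server].append(command)

    def add(self, server: str, record: dict) -> dict:
        """Store one status record and reply with any commands waiting for that server."""
        self.records.append({"server": server, **record})
        commands = self.pending.pop(server, [])
        return {"stored": True, "commands": commands}
```

For example, after `queue_command("A", "turn_offline")`, the next `add("A", ...)` call from server A returns `"commands": ["turn_offline"]`, and the call after that returns an empty list again.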

Alerting and command triggering system

The alerting system will be fed with rules for triggering alerts from the data collected by spiderbelly. For example, to alert on an apache-down event on server A, that rule has to be set in the alerting system. The alerting system will send the alert through different channels (Slack, email, SMS), and developers who have subscribed to that alert will receive it on their devices. A command can also be triggered based on the alert if required. For example, if an alert is generated for “apache down”, a command can be sent to server A to start the service (this mechanism needs to be highly secured).
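
The “apache down on server A” example could be expressed as a rule and matched against incoming records roughly like this. It is a sketch under assumptions: the `Rule` fields, the `evaluate` function and the record keys (`server`, `check`, `status`) are hypothetical, and actual delivery to Slack, email or SMS is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    server: str
    check: str
    status: str                    # status value that triggers the rule
    channels: Tuple[str, ...]      # e.g. ("slack", "email")
    command: Optional[str] = None  # optional command to send back to the server


def evaluate(record: dict, rules: List[Rule]) -> List[dict]:
    """Return one alert per rule that matches the incoming record."""
    alerts = []
    for rule in rules:
        key = (record["server"], record["check"], record["status"])
        if key == (rule.server, rule.check, rule.status):
            alerts.append({
                "message": f"{rule.check} {rule.status} on {rule.server}",
                "channels": list(rule.channels),
                "command": rule.command,
            })
    return alerts
```

A rule such as `Rule("A", "apache", "error", ("slack",), command="start_apache")` would then turn an error record for apache on server A into a Slack alert plus a command to start the service, while records from other servers or checks produce nothing.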
