Experiences building a status/incident management page: from private to open source

Marcel Caraciolo
genomics-healthcare-systems
16 min read · Feb 15, 2022

Introduction

Imagine the following situation: your company's infrastructure starts to grow, and you are suddenly dealing with all the scale-up issues that come with your software. That includes maintenance windows and downtime affecting your customers, which you forgot to communicate, or which you thought would last only 5 minutes and turned into one huge, long incident. You end up dealing with a constant flow of fragmented communication, lost conversations, website bounces and even negative feedback from visitors who are disappointed with your handling of incident management.

Typical scenario when you don't have an incident communication plan for your customers

That's why you need a clear way to communicate issues to your customers in real time. One of the solutions is a status page service.

What is a status page service?

A status page enables you to communicate incidents, scheduled maintenance and downtime to your clients. A status page can be public or private. Your incident management team can use these services to share outages, maintenance schedules and uptime statistics both publicly and internally.

The main features of a status page:

  • Share the current status and the incident history with real-time updates, which reduces the number of customer support queries and improves client loyalty and trust.
  • Track the state of critical components, which means you can learn about your incident frequency and track the time spent resolving incidents.
  • Have a historical view of past incidents and the overall uptime percentages.
Examples of statuspage product services.

The problem

Over the years at our laboratory, our infrastructure grew: many new systems started running and serving critical services such as the LMS, and the physical infrastructure started to overflow as more laboratory machines were added to our network, demanding more outbound and inbound traffic. With more components in place, our infrastructure team had to monitor all of this IT structure and check for any unplanned interruptions or scheduled maintenance affecting the quality of normal services. When an incident was declared, our team had to activate our incident response plan, with several steps to respond to the incident; however, everything was manual, and communication was based on fragmented e-mails, phone calls or WhatsApp messages. There was no dedicated status page showing the current status of all our systems, and our customers and internal staff started to overwhelm our IT team asking what had happened and when the systems would be back to normal operation. It was a really stressful situation!

Typical Incident Management Lifecycle

Our solution

We decided that we needed to improve our incident management plan, and we started studying solutions to help our IT team deal with these problems. We started looking at status page services, but one of our trainees at the time decided to adopt the problem and asked me to let him build a status page product for our lab. We could, of course, have acquired one of the several products available out there, such as FreshStatus, StatusHub, Uptime and many others. But he wanted to improve his web development skills from scratch, and I gave him the chance.

So he started with some basic features as a baseline, but with the potential to add more complex features in the future. Here are some of the key features that we kept in mind, which can be useful for anyone starting to develop or choose a status page product.

Create Incidents

When using status page software, you want the ability to create incidents manually from your status page's control panel. In this case, the control panel is the "inside" of the product, where you control everything on your status page. Our recommendation for the communication is to keep the message as simple as possible, so that you do not confuse end users.

Simple CRUD Control Page for creating the incidents
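
For illustration, here is a minimal sketch of what such an incident record could look like as a Django model. The model and field names are assumptions for this post, not the exact schema of our project:

```python
# Minimal sketch of an incident record, assuming a Django project.
# Model and field names are illustrative, not the exact schema we used.
from django.db import models


class Incident(models.Model):
    STATUS_CHOICES = [
        ("investigating", "Investigating"),
        ("identified", "Identified"),
        ("monitoring", "Monitoring"),
        ("resolved", "Resolved"),
    ]

    title = models.CharField(max_length=200)
    description = models.TextField(blank=True)
    status = models.CharField(
        max_length=20, choices=STATUS_CHOICES, default="investigating"
    )
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)

    def __str__(self):
        return f"{self.title} ({self.status})"
```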

Scheduled Maintenance

Scheduling regular maintenance updates for your infrastructure helps maintain trust with your users by supplying regular updates. It is important that both the internal team and your clients know about planned outages before they happen. Showing these events on your status page makes your users part of the operation and lets them prepare themselves. A calendar note and the date and time of the planned maintenance are features that can show up as updates on your status page. This is especially useful when the status of a service maintenance changes and you need to let your end users know about important changes before they miss them.
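
As a rough idea, a scheduled maintenance entry can be represented much like an incident, plus a start and end date. Again, this is a hypothetical sketch and not our exact schema:

```python
# Hypothetical sketch of a scheduled maintenance entry (Django).
from django.db import models


class ScheduledMaintenance(models.Model):
    title = models.CharField(max_length=200)
    description = models.TextField(blank=True)
    scheduled_start = models.DateTimeField()  # shown on the status page calendar
    scheduled_end = models.DateTimeField()
    notified_users = models.BooleanField(default=False)  # whether end users were warned
```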

Clear End-user communication

Clear and uncomplicated messaging is the best way to go for any status page. Your message must be accurate, reflect the status of the incident, and nothing else.

Usually, we don't put in images, videos or other media that distract users from the real purpose. The status page must be fast and reliable, so brand colors, logos and custom images should stay in the header at most.

Keep it simple, and use a product that allows you to reflect this.

To wrap this up, a clear status page should:

  • Communicate the incident (and update it in real time);
  • Use easily digestible web copy and microcopy (no sales or marketing jargon);
  • Be as simple as possible (no extra graphics, colors, etc.)
Our StatusPage Service product

Omni-Channel communication

Having multiple notification channels as options for end users also helps clarify communication. Several communication channels offer integration APIs, so your status page can notify all of these channels automatically. The idea is to spread the incident and any updates to your end users as fast as possible, and communicating on the platforms they already use makes the message accessible immediately.

Incident notification via the e-mail channel. The whole team in the organisation was notified in real time about any incident report updates
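
Here is a minimal sketch of how such a notification could be sent with Django's standard mail helper. The recipient list, addresses and subject format below are assumptions:

```python
# Sketch: notify a mailing list whenever an incident is created or updated.
# Uses Django's built-in send_mail; the EMAIL_* settings must be configured.
from django.core.mail import send_mail

# Hypothetical mailing list, not the real one from our project.
END_USERS_MAILING_LIST = ["status-updates@example.com"]


def notify_incident_update(incident):
    send_mail(
        subject=f"[Status] {incident.title} - {incident.status}",
        message=incident.description,
        from_email="statuspage@example.com",
        recipient_list=END_USERS_MAILING_LIST,
        fail_silently=False,
    )
```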

Integration with your tools and systems

Status pages should be flexible enough to integrate with your systems and tools. The idea is that, using API plugins or webhooks, your status page product can be automated to monitor services and create incidents automatically, or even notify your incident management team. You can also aggregate other status pages and compile them into a single status page that is shareable with your team. This flexibility helps the team automate many incident updates without entering them manually.

APIs for creating incidents and statuses reports from third-party applications
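
As an illustration, a third-party system could open an incident through such a REST endpoint. The URL, payload fields and token header in this sketch are hypothetical:

```python
# Sketch: creating an incident from another system via the status page REST API.
# Endpoint path, payload fields and auth header are hypothetical examples.
import requests

response = requests.post(
    "https://status.example.com/api/incidents/",
    json={
        "title": "LMS database outage",
        "description": "Connections to the primary database are timing out.",
        "status": "investigating",
    },
    headers={"Authorization": "Token <api-token>"},
    timeout=10,
)
response.raise_for_status()
print(response.json())
```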

Off-site hosting to separate your status page from your infrastructure

Your status page must be served from off-site hosting: if your infrastructure and your status page are hosted in the same place, they can be "down" at the same time, which isn't a good idea. So it is important to use a status page product that hosts it separately. This allows your team to focus on the jobs they are meant to do instead of scrambling to fix the status page and bring it back online whenever the infrastructure collapses.

To summarise: make sure you have a reliable status page product at your side when something goes wrong.

Why open-source it ?

We developed the status page service over three months with our IT team, focusing on solving our incident communication issues. Our MVP went online for our internal customers only. During the testing period (about six months) we collected the following metrics and feedback:

  • 50 incidents reported from creation to closure;
  • 12 systems and services from our laboratory monitored in real-time;
  • Overall feedback was quite positive, reducing the incident communication from hours to minutes;
  • Lots of feature requests from our laboratory quality sector;

Here is a simple recorded demo of our current system running online:

Screenshot with one incident reported. There is also an incident history timeline containing further details.

The main features that we had developed by the time of writing this post:

  • Creation of manual incidents using a dedicated CRUD website with authentication;
  • Status updates for incidents;
  • Real-time status for each service, with a better visualisation for our status page;
  • Integration with the e-mail channel, so that any reported incidents or updates are sent by the system to the end-users' mailing list;
  • REST API endpoints so we can integrate with other channels such as WhatsApp, Telegram, Slack, etc.

We had many requests for new features and adjustments but, unfortunately, our IT team was disbanded after our company was acquired by a large hospital, and this system wasn't in the migration plans. It became an experimental project with no one to support it or keep it updated.

With the system's development on hold, I started thinking about open-sourcing it. I explored similar open-source projects to see whether any solution was available, and I couldn't find one. With that in mind, I decided to open-source it and share the entire code base with the community. The reasons? First, we believe it could be an inspiration for anyone looking for interesting projects to engage with. Second, since this is a problem we experienced ourselves, our solution could also be useful for any laboratory or IT company that needs a minimal working status page. Finally, we could practice the development practices required to convert a private project into an open-source project.

Along the way, we learned some important principles about how to convert our private status page project into an open-source project. I want to share these ideas and experiences in the next sections.

Our experiences moving to open source

Documentation

Good project documentation describes the main points of the project. It gives developers a deep dive into implementation details and lets them find out by themselves how to use the tool. In our scenario, we focused on the README.md file, which is the first thing a user sees when visiting the repository. For our README we included the following sections:

  • A title and the mission of your project, explaining what the solution is aiming to solve. We placed it right after the title of the project.
  • A short description of the project and why you should care about using it. It doesn't need many technical details; just highlight the good parts.
  • An "Install" section describing how to get all the libraries needed to get the project running locally. If you have a link to detailed documentation, this is a good section to insert it.
  • Finally, you can have a "Usage" section containing the main features or the main available commands. We recommend using a list for an easier read.

Here are some examples available at this reference repository, RichardLitt/standard-readme, which provides a spec for building a standardised README file.

Geninfo README

Package management

Installing libraries can be really painful, especially when you run old projects and get stuck. I really recommend using a package management tool such as Pipenv, Poetry, Conda or Pip to handle all the dependencies and the corresponding required versions. In our scenario we use pipenv locally, which can install and select the Python version for our project, plus a lockfile that pins the dependency tree. The idea is to avoid any unexpected behaviour or crashes during installation.
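
For example, a typical pipenv workflow looks roughly like this (the package names are just placeholders):

```bash
# Create a virtualenv pinned to a specific Python version and install dependencies
pipenv install --python 3.9
pipenv install django requests        # runtime dependencies recorded in the Pipfile
pipenv install --dev pytest flake8    # development-only dependencies
pipenv lock                           # (re)generate Pipfile.lock, the lockfile
pipenv shell                          # activate the project environment
```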

Docker

Docker is a powerful tool for any project that needs to work right away, everywhere. Building an image with all of your project's dependencies and hosting it in a Docker registry such as DockerHub helps developers run it without building it manually. Remember to keep the images small (I generally use the -slim base images) and to provide the Dockerfile so people can contribute new changes.
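
A minimal Dockerfile sketch for a Python/Django project on a -slim base image; paths and commands are illustrative, not our exact setup:

```dockerfile
# Illustrative Dockerfile for a small Python/Django service, using a slim base image.
FROM python:3.9-slim

WORKDIR /app

# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000
CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]
```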

Finally, if your project requires a database, a caching service or integration with other services, consider making a docker-compose.yml available so other developers can run the whole required stack more easily. Don't forget to provide a .env file (environment variables with default values) so developers can test your software in different environments (test, production, development, etc.). More information at docs.docker.com.
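
A hypothetical docker-compose.yml bringing up the app plus a database, with secrets read from the .env file (service names and variables are examples only):

```yaml
# Illustrative docker-compose.yml: the web app plus a PostgreSQL database.
version: "3.8"

services:
  web:
    build: .
    ports:
      - "8000:8000"
    env_file: .env          # default values live in the committed example .env
    depends_on:
      - db

  db:
    image: postgres:13
    environment:
      POSTGRES_DB: statuspage
      POSTGRES_USER: statuspage
      POSTGRES_PASSWORD: ${DB_PASSWORD:-changeme}
```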

Remember that if you’re using Docker / Docker Compose, you’re aiming to save time and avoid the inherent complexities that come with working across platforms.

Linting

Code quality is an important concern for keeping the code consistent across different projects and teams in the company. You should consider adopting a coding style, so you can guarantee a standardised way of working: any developer contributing to your repo knows the rules about how the code should look.

There are a lot of options available out there, such as pycodestyle, pyflakes, mccabe, pylint, pylama and others. I am fond of a combination of black, flake8, prospector and isort. It brings a bunch of tools together and is easy to use and set up. black automatically fixes whatever it is able to; flake8 flags any files that violate the PEP 8 style guide for Python code; prospector analyses Python code and reports errors, potential problems, convention violations and the complexity of the program; finally, isort keeps the imports properly ordered.
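
As a rough idea of such a setup, black and isort can be configured to agree with each other in pyproject.toml, and flake8 in setup.cfg; the values below are just examples:

```toml
# pyproject.toml (illustrative)
[tool.black]
line-length = 88

[tool.isort]
profile = "black"   # keeps isort's import style compatible with black
```

```ini
# setup.cfg (illustrative)
[flake8]
max-line-length = 88
extend-ignore = E203   # black formats slices in a way that trips flake8's E203
```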

Make sure all these tools are properly set up, so that if you run any build process, such as tox or a git pre-commit hook, they run along with it.

Nice article to read more about it.

Security checks

Security checks help keep the project you are deploying as safe as possible. Besides careful planning while architecting and implementing your project, there are tools that can help you check your code for possible security issues. Here we use the bandit tool for Static Application Security Testing (SAST), which helps you find security issues in your code. The safety package checks your installed dependencies for known security vulnerabilities.
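
Both tools are run from the command line; a typical local run could look like this (the paths are examples):

```bash
# Static security analysis of the project sources
bandit -r statuspage/

# Check pinned dependencies for known vulnerabilities
safety check -r requirements.txt
```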

Depending on your project and company, it may be advisable to have penetration testing executed by a third party against your product to identify further possible threats. There are several vulnerability scanning tools that you can use in your project.

Git hooks

Setting up git hooks is a good way to have certain scripts triggered by git actions, like git commit or git push. A good tool to leverage is pre-commit. It is easy to install, and it lets you run commands on every commit/push, such as:

  • code formatting
  • isort
  • linting
  • security checks
  • unit tests

Git hooks help you catch and fix issues earlier in the development process.
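
A minimal .pre-commit-config.yaml covering formatting and linting could look like this; the rev values are only examples and should be pinned to whatever versions you actually use:

```yaml
# Illustrative .pre-commit-config.yaml
repos:
  - repo: https://github.com/psf/black
    rev: 22.1.0
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/isort
    rev: 5.10.1
    hooks:
      - id: isort
  - repo: https://github.com/PyCQA/flake8
    rev: 4.0.1
    hooks:
      - id: flake8
```

After adding the file, run pre-commit install once so the hooks are triggered on every commit.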

Continuous Integration (CI)

Continuous Integration (CI) tools are really helpful for any project that requires automation to identify issues, vulnerabilities, possible refactorings and more. Especially when your open-source project is collaborative: how do you avoid unexpected changes that break your product? Is there anything we can do to reduce how often this happens?

The CI integrated with your repository has hooks so that when you push or open a PR to your main branch, it invokes several scripts that check the code quality, perform automatic tests or even build a release to a testing website. There are plenty of tools that help with CI; some are Travis CI, Circle CI, GitHub Actions, Jenkins and many more.

There are other tools that help with specific tasks, such as measuring test coverage or checking for vulnerabilities, for example SonarCloud, Coveralls, CodeClimate, etc.

For any starting project, you can begin with a minimum agreed set of tools and expand as you see fit, depending on how well the tools are adopted. Don't forget to include the integration and tests in this CI as well; these checks can help you catch issues before you start the release process.

Check our CI example here.
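
For illustration only (not our exact pipeline), a minimal GitHub Actions workflow running the linter and tests on every push and pull request might look like this:

```yaml
# .github/workflows/ci.yml (illustrative)
name: CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.9"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Lint
        run: flake8 .
      - name: Run tests
        run: pytest --cov
```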

Documenting your API

If your project exposes APIs, it is important to have API documentation besides the README file. The documentation should explain all the usage aspects concisely and to the point, for instance: enumerate the available endpoints, HTTP verbs and expected response structures.

Create an additional page solely describing the API, based on the Consumer-Driven Contracts approach. There are two great tools that can assist us in documenting our endpoints: API Blueprint and OpenAPI (originally known as Swagger). OpenAPI is more broadly used, while API Blueprint seems easier to read and write. Both of them have a good set of tools available that can help you.

I’d recommend giving API Blueprint a chance, using dredd to validate your contracts against the actual endpoints (it's a good idea to have this in the CI process), snowboard or aglio to beautifully render your contracts, and drakov to spin up a mock server and test out your contracts even before coding the endpoint.
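
Here is a tiny API Blueprint sketch for a hypothetical "create incident" endpoint, just to show the flavour of the format; the host, fields and values are invented:

```
FORMAT: 1A
HOST: https://status.example.com/api

# Status Page API (illustrative)

## Incidents Collection [/incidents]

### Create a New Incident [POST]

+ Request (application/json)

        {
            "title": "LMS database outage",
            "status": "investigating"
        }

+ Response 201 (application/json)

        {
            "id": 42,
            "title": "LMS database outage",
            "status": "investigating"
        }
```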

Test Cases

Open-source code has a reputation for not always having the best quality. Developers create open-source code and neglect the testing component. It is frustrating for anyone who wants to contribute to your project but finds the code hard to understand, unstable and full of bugs. In this regard, a good way to increase trust and demonstrate the quality of your open-source project is to test it.

Consider using pytest and checking out the available plugins, like pytest-django and pytest-cov.

It’s great to have a coverage report of your tests, with pytest-cov, coverage.py or some other tool, that gives you a glance at how things are being covered, but it’s way more important to test your code in intelligent ways that ensure the functionality works as expected than to just watch the coverage go up.
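
A small example in the pytest-django style, testing a hypothetical incident creation endpoint (the URL and fields are assumptions, not our real API):

```python
# Illustrative pytest-django test for a hypothetical incident creation endpoint.
import pytest


@pytest.mark.django_db
def test_create_incident(client):
    response = client.post(
        "/api/incidents/",
        {"title": "LMS database outage", "status": "investigating"},
        content_type="application/json",
    )
    assert response.status_code == 201
    assert response.json()["status"] == "investigating"
```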

A few more tools that may help you in your testing:

  • Model Bakery (can help you with fixtures, testing data for your tests)
  • Model Mommy (can help you with mocking the Django models)

Badges

Adding some badges to your README file can help contributors easily identify certain aspects of your project.

Badges can advertise various aspects of a project, e.g. its license, the code coverage of its test suite, the adopted code style, etc. Some badges act as an incentive to maintain excellence in the qualities they display, lest a bad signal be sent to the project’s users and potential contributors. For instance, a code coverage badge creates an incentive to maintain high code coverage, as otherwise potential contributors can easily see that the project is poorly tested and, therefore, prone to have hard-to-detect bugs. There are several types of badges available for open-source projects, such as:

  • Build status
  • Code coverage
  • Latest released version
  • Project license
  • Code quality rating
  • Amount of vulnerabilities
  • Technical debt percentage
  • Social platforms
  • Dependencies/libraries status
  • + much more

A great repository to find badges is https://shields.io/
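
For example, a shields.io badge is just an image link in the README; the labels and values below are placeholders to swap for the URLs generated for your own project:

```markdown
<!-- Illustrative badges; replace with the URLs for your own project -->
![License](https://img.shields.io/badge/license-MIT-blue.svg)
![Coverage](https://img.shields.io/badge/coverage-90%25-brightgreen.svg)
![Version](https://img.shields.io/badge/version-1.0.0-informational.svg)
```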

Changelog

Make sure to maintain a CHANGELOG as well. It’s a simple Markdown file with a certain format where you write down changes made to the project, new features, fixes, things that have been deprecated, etc.

Following semantic versioning also tells everyone that you have stable releases, and for new updates it signals what kind of changes to expect. It helps both the business and developers.

A good example: imagine you wish to upgrade a package’s version; you may want to check what happened between versions, verifying whether there was any breaking change and whether there’s an upgrade guide you need to go through. Sometimes, if you only see a change in the PATCH (MAJOR.MINOR.PATCH), you may not even bother, because the packages you’re using are most likely following Semantic Versioning and you understand there was nothing major that would cause you headaches when updating.

We use bump-version to automate the version releasing process, and for the changelog format we follow the guidelines from https://keepachangelog.com/en/1.0.0/
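
A short excerpt in the Keep a Changelog style; the entries, versions and dates below are invented just to show the format:

```markdown
# Changelog

## [Unreleased]

## [1.1.0] - 2022-02-01
### Added
- E-mail notifications for incident updates.
### Fixed
- Ordering of the incident history timeline.

## [1.0.0] - 2021-11-15
### Added
- First open-source release.
```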

It may be worth taking a look at semantic-release to help your project even further.

Settings

We recommend separating the settings file per environment and changing the values of the variables using environment variables. There’s a great package called python-decouple that you can install to leverage a .env file for setting your environment variables. By using the package, you can force some environment variables to be required (with no default values), parse port numbers as integers, or use the Csv helper to easily cast comma-separated environment variables into lists.
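
In practice this looks roughly like the following in settings.py; the variable names are examples only:

```python
# settings.py sketch using python-decouple; variable names are examples.
from decouple import config, Csv

SECRET_KEY = config("SECRET_KEY")                  # required, no default: fails fast if missing
DEBUG = config("DEBUG", default=False, cast=bool)  # casts "true"/"false" strings to bool
ALLOWED_HOSTS = config("ALLOWED_HOSTS", default="", cast=Csv())  # comma-separated list
EMAIL_PORT = config("EMAIL_PORT", default=587, cast=int)
```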

There is a nice article with best practices when building your settings config.

.gitignore/.dockerignore

Don't forget to add a .gitignore and a .dockerignore to your repository. The .gitignore keeps unnecessary files out of your repository, and the .dockerignore keeps them out of your Docker build context. There are .gitignore templates for several platforms; you can find good ones for your project at gitignore.io.

As for the .dockerignore, I recommend always having one, and usually ignoring the .env, .git, tests, build and cache files so they don't end up in the container.
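
A small .dockerignore along those lines could be:

```
# Illustrative .dockerignore: keep secrets, VCS data and local artifacts out of the image
.env
.git
.gitignore
tests/
build/
__pycache__/
*.pyc
```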

Use it for free!

Since November 2021, we have published our Geninfo status page service as an open-source project under the MIT License, so anyone can obtain a copy of the software and modify it, as long as the copyright notice crediting the original contributors of the project is kept.

We also provide an integration with Heroku by adding the Heroku Button, which offers an easy way to get the app up and running quickly on the Heroku platform. Clicking a Heroku Button initiates the deployment of the app, provides an option to configure it, and delivers the running app on the web.

See it in action in the screenshot below.

Heroku Deploy Button for automating the deployment stack from our Geninfo WebSite
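
For reference, the button itself is just a Markdown snippet in the README pointing at Heroku's deploy flow, backed by an app.json manifest in the repository root that tells Heroku how to build and configure the app; the repository URL below is a placeholder:

```markdown
[![Deploy](https://www.herokucdn.com/deploy/button.svg)](https://heroku.com/deploy?template=https://github.com/your-org/your-repo)
```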

Next steps

There are several features that we want to add to our status page, such as integration with bots, historical reports, more visualisations, etc. Feel free to come up with nice features for it as well. We have the GitHub Issues tracker, where you can bring any suggestions, feature requests and bug fixes.

Conclusions and acknowledgements

I want to thank you for reading this post, and I hope that some of the experiences you found here were helpful to you and your team. We share all these guidelines based on our perspective over the years working with Python, Django and laboratory infrastructure, so any suggestions or recommendations are quite welcome!

I want to send special thanks to the Genomika team, an amazing laboratory team specialised in genomic sequencing tests, to our mentor João Bosco Oliveira, the ex-CEO who believed in these initiatives, and finally to our infrastructure team, with a special mention to alumni IT analyst Lucas Eduardo Carvalho, who took my idea and made it happen!


Marcel Caraciolo
genomics-healthcare-systems

Entrepreneur, Product Manager and Bioinformatics Specialist at Genomika Diagnósticos. Piano hobbyist, runner for passion and Lego Architecture lover.