Solution for Cloud & Application Maintenance, Quantyca I&O

QUANTYCA — INFRASTRUCTURE & OPERATIONS TEAM

Lorenzo Verardo

Published in

Quantyca

8 min read · Oct 13, 2020

Who we are

Hi,
my name is Lorenzo Verardo and I’m a member of the I&O Team at Quantyca.
My colleague Andrea Macchi and I wrote this post to share what we do and how we work.

Quantyca is an Information Technology consulting company that deals with every aspect of data, from integration and the definition of Big Data architectures to reporting and analysis.

This post covers the following topics:

Infrastructure & Operations (I&O)
Types of support
How we work?
Conclusion

1. Infrastructure & Operations (I&O)

1.1 OUR BUSINESS

Within our company, the I&O team is responsible for four main areas:

managing our customers’ cloud infrastructure
providing services and performing the necessary activities for the correct monitoring of both systems and applications
troubleshooting of malfunctions in applications and systems
maintaining our customers’ systems and keeping both infrastructures and applications always up to date

In this article, we will dive into the meaning of the “O” in “I&O Team”, so that you can get a glimpse of our day-to-day activities (don’t worry, we will also discuss what the “I” in our team’s name stands for later).

1.2 “O” FOR OPERATIONS

One of our pillars is certainly the operational side, in which we provide a support and assistance service to our customers through our second-level help desk.
The goal of our support services, or operations, is to ensure that our clients’ architectural and infrastructural components (the support perimeter) work correctly.
This perimeter includes the maintenance and administration of the applications, with a focus on keeping them working correctly and, as we go along, on establishing a set of best practices that will help our future operations.
The support service also includes the use of monitoring tools, which are essential for keeping services and applications healthy and for catching problems before they occur, so that we can guarantee a high level of both reliability and performance.

1.3 SKILLS

The team is made up of professionals with skills in different areas.
We invest heavily in periodic training on different projects and technologies, attending courses and earning certifications.
All this training is fundamental to guaranteeing a high level of preparation and a high-quality service.

1.4 COMMUNICATION

Our work relies heavily on a number of soft skills: communicating with both customers and coworkers, understanding the emotional state of the person we are talking to, and managing stressful situations, whether they come from external events (other people) or internal ones (e.g. heavy workloads).
Even if these aspects can seem obvious at times, we assure you that they are not!

From this point of view, we have equipped ourselves over time with organizational tools that let us compare and distribute workloads while taking into account the soft skills needed to manage particularly critical situations.
In addition, each of us constantly looks for personal growth, improving through continuous exchange with others and study.

1.5 ORGANIZATION

Thanks to our agile organization we can easily coordinate the activities of all our customers and carry out each one according to the correct priorities, while never missing an SLA thanks to proactive monitoring.
Today our team includes the following roles:

Service Manager
Tech Lead
Infrastructure and Automation Tech Lead
I&O Specialists

Two recurring meetings also help us organize our activities: “The Daily” and “The Weekly”.

The Daily is a meeting of about 30 to 60 minutes that anyone on the team can request at any time of day.
As a rule of thumb, we hold at least one Daily every two days even if nobody has requested it, simply to give the team a chance to meet.
In this meeting we discuss delicate issues that require specific attention, or issues whose priority has changed.

The Weekly, on the other hand, is called by the Service Manager every two weeks, and involves the CTO and CPO of Quantyca.
We hold this meeting to review our most recent work with each customer internally, and to share any improvement proposals that could help our work and, above all, our clients.

2. Types of support

The activities that make up the service we provide to our customers are determined during service setup and are divided into:

2.1 ORDINARY MAINTENANCE

This is the set of agreed operational activities that can be considered ordinary administration (administration of application components, manual batch execution, etc.). These activities are triggered through tickets in the ticketing system, opened either by the customer or independently by the team.

2.2 APPLICATION MAINTENANCE

Application maintenance covers all agreed operational activities relating to the monitoring and management of what was built during the design phase (subject to a knowledge transfer, KT) or developed by the customer or third parties and handed over to Quantyca support. Examples include verifying that batches or ESB services have run correctly and that reports and notifications have been dispatched.

It also includes a series of ad hoc checks, possibly covering infrastructure as well, called the “Checklist”. These checks are scheduled and performed periodically, usually daily and before normal business hours.

This type of service is used to guarantee the correct functioning of all the application and business components necessary for users to carry out normal daily activities.

2.3 INFRASTRUCTURE MAINTENANCE

Infrastructure maintenance includes the checks we develop and maintain to verify the servers, in order to guarantee that applications are reachable and services are in the correct state.

2.4 AUTOMATION

The Checklist, as described above, can include application and infrastructure checks. In many cases these checks are critical for our customers because they can anticipate many kinds of problems. Remediation procedures are associated with these checks; as we will see later in this article, they are defined in the Warm Up phase of the service. This is why automation and monitoring are topics in which we are directly involved.

The repetitive, standardized nature of these activities, together with the presence of remediation procedures in the event of failure, makes them an excellent candidate for introducing monitoring and automation into support activities.
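
To make the idea more concrete, here is a minimal Python sketch of a checklist in which each check can carry its own remediation step. It is only an illustration, not our actual tooling: the `Check` class, the `run_checklist` function and the two sample checks are hypothetical names and scenarios invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Check:
    """One checklist entry: a probe plus an optional remediation step."""
    name: str
    probe: Callable[[], bool]                        # returns True when healthy
    remediate: Optional[Callable[[], None]] = None   # hypothetical repair action


def run_checklist(checks: list[Check]) -> dict[str, str]:
    """Run every check once; on failure, try remediation and probe again."""
    results: dict[str, str] = {}
    for check in checks:
        if check.probe():
            results[check.name] = "ok"
        elif check.remediate is None:
            results[check.name] = "failed"
        else:
            check.remediate()
            results[check.name] = "remediated" if check.probe() else "failed"
    return results


# Simulated environment (assumption: real probes would query servers or jobs).
state = {"batch_done": False, "disk_free_pct": 5}


def rerun_batch() -> None:
    state["batch_done"] = True


checks = [
    Check("nightly batch completed", lambda: state["batch_done"], rerun_batch),
    Check("disk space above 10%", lambda: state["disk_free_pct"] > 10),
]
report = run_checklist(checks)
print(report)
```

In this sketch, a check without a remediation procedure simply reports a failure, which in practice would turn into an incident for the team to handle manually.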

3. How we work?

The support service follows a simple and effective process, which we will try to describe here. It starts with a first phase that we call Warm Up. This phase is not cyclical, but it is necessary when we start a new support contract or whenever the team’s scope of intervention changes within the customer’s ecosystem.

3.1 WARM UP

Before starting anything, everything must be defined. Nothing is perfect, of course, but it is important to take the time to define (almost) everything that matters. The Warm Up phase is that moment.

So, before the Application Maintenance Services start, the Warm Up phase is needed to set up the monitoring environment and define the procedures for managing incidents and service requests.

The Warm Up phase begins with the as-is analysis and knowledge transfer. Subsequently, the monitoring environment will be designed and the automatic repair rules will be implemented.

Finally, the operating procedures for managing the services (the playbook) will be written and shared.

The playbook defined during the Warm Up phase is then reviewed and adjusted throughout the service period.

In summary, the Warm Up phase consists of:

Support Perimeter definition
Documentation of operating procedures, integrating any that already exist
Handover and understanding of the architecture and infrastructure
Definition and configuration of the ticketing system
Setup of the manual and automatic checklist.
This is a collection of preventive checks run in a specific time slot to guarantee that applications, infrastructure and business processes work correctly. These tasks are usually high impact and should normally be completed before the working day starts.
Preparation of knowledge base (Confluence)

3.2 SERVICE DELIVERY

Once the perimeter has been defined, support services are delivered in a recurring cycle of three phases: Input, Output and Review.

INPUT
Support services can be activated in two ways: passively or actively.

The passive method consists of actions taken by the customer, for example opening a ticket (incident or service request), sending an email or calling the support phone number (if the ticketing system is down).

The active method consists of actions taken by the I&O team, for example checks based on periodic checklists, real-time automatic checks from the monitoring environment, and so on.

OUTPUT
When a ticket is solved, a number of follow-up activities are needed:

The knowledge base shared with the customer is enriched
If the issue and its resolution are not in the playbook, they are added as a point of attention/discussion for the playbook revision
If the incident has a “monitorable” technical cause, it is added as a point of attention to be discussed during the Review to evaluate possible introduction of checks in the checklist or for integration with the monitoring system
If the incident can be resolved definitively, the improvements deemed necessary are proposed
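
The rules above can be sketched as a small Python function that derives the follow-up actions from a few facts about the closed ticket. This is an illustrative sketch only: the `ResolvedTicket` fields, the `follow_up_actions` function and the ticket identifier are hypothetical and do not come from any real ticketing system.

```python
from dataclasses import dataclass


@dataclass
class ResolvedTicket:
    """Hypothetical summary of a ticket at closing time."""
    ticket_id: str
    in_playbook: bool           # issue and resolution already documented
    monitorable_cause: bool     # technical cause could be caught by a check
    permanent_fix_known: bool   # a definitive resolution exists


def follow_up_actions(ticket: ResolvedTicket) -> list[str]:
    """Return the follow-up actions triggered by closing this ticket."""
    actions = ["enrich the shared knowledge base"]  # always done
    if not ticket.in_playbook:
        actions.append("flag for playbook revision")
    if ticket.monitorable_cause:
        actions.append("discuss new check or monitoring integration at Review")
    if ticket.permanent_fix_known:
        actions.append("propose improvements to the customer")
    return actions


# Example: an undocumented incident with a monitorable cause.
ticket = ResolvedTicket("INC-001", in_playbook=False,
                        monitorable_cause=True, permanent_fix_known=False)
actions = follow_up_actions(ticket)
for action in actions:
    print(action)
```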

REVIEW
The Review is a periodic meeting between the Service Manager and the customer to review, analyze and evaluate the state of the services. Its first goal is to analyze tickets through reports based on the shared templates and KPI information defined during the Warm Up phase.
During the meeting, the two sides verify the SLA and the baseline, check that the procedures defined in the Playbook are complete, and evaluate the degree of automation and the validity of the checks.
It is also an opportunity to discuss structural weaknesses that could cause incidents and to propose flex activities to the customer, such as training, change requests, new checks, new monitoring KPIs and dashboards, or architectural improvements.
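
As an example of the kind of figure that can feed such a review, the Python sketch below computes the share of tickets resolved within an SLA target, per priority. The priorities, the target times in `SLA_TARGETS` and the sample tickets are all assumptions made up for illustration; real values are the ones agreed with each customer.

```python
from datetime import timedelta

# Hypothetical resolution targets per priority (real ones come from the contract).
SLA_TARGETS = {"high": timedelta(hours=4), "low": timedelta(hours=24)}


def sla_compliance(tickets: list[tuple[str, timedelta]]) -> dict[str, float]:
    """Return, per priority, the fraction of tickets resolved within target."""
    totals: dict[str, int] = {}
    met: dict[str, int] = {}
    for priority, resolution_time in tickets:
        totals[priority] = totals.get(priority, 0) + 1
        if resolution_time <= SLA_TARGETS[priority]:
            met[priority] = met.get(priority, 0) + 1
    return {p: met.get(p, 0) / totals[p] for p in totals}


# Sample data: (priority, time taken to resolve).
sample = [
    ("high", timedelta(hours=2)),
    ("high", timedelta(hours=6)),
    ("low", timedelta(hours=10)),
]
compliance = sla_compliance(sample)
print(compliance)  # {'high': 0.5, 'low': 1.0}
```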

4. Conclusion

In this article we wanted to share our approach to managing operations activities.
The service we offer is not only made up of processes, organizational procedures and tools; it also depends heavily on human relationships. Whoever does this job spends much of their time talking to customers, with whom they must communicate continuously.

“The real problem is usually two or three questions deep. If you want to go after someone’s problem, be aware that most people aren’t going to reveal what the real problem is after the first question.”
— Jim Rohn

Furthermore, having a strong propensity for problem determination, being quick in reasoning and not taking anything for granted are important qualities.

In conclusion, what characterizes our team is not only the technical competence but also the ability to relate effectively and proactively.

Thank you for reading my post!
For more content, visit our company website or follow our LinkedIn page!