Customers and merchants rely on Groupon to deliver a great experience all the time. This is particularly important and challenging during holiday periods including Black Friday and Cyber Monday. Every year, driven by these holiday periods, service and engineering teams across Groupon work to ensure that they are operationally ready to scale and meet the demands of our customers and merchants.
However, in 2017 we realized that the process for ensuring operational readiness had become inefficient. Every year leading up to that point, we had collected or re-verified the same information about the services within the company. Teams spent extra time answering questions they had already answered in previous years and filling out a variety of ad-hoc docs and spreadsheets, at the cost of product development time. In many cases, the system of record for the verified data was Service Portal, but because a single team was tasked with managing the data for all of Groupon’s services, and was doing so largely manually once a year, the data was often missing, incomplete, or out-of-date.
Having noticed this yearly pattern and its costs, we set out to enhance Service Portal, moving it from a repository of centrally and manually procured service metadata to one that increasingly decentralizes and automates the management and procurement of that data. We did this by leveraging regular service and engineering health checks, so that operational readiness became a constant focus rather than one driven only by holiday cycles. The approach proved successful in 2018 and 2019: the cost of preparation was vastly reduced without sacrificing stability.
We defined 3 key objectives for Service Portal:
- Increase development agility
- Track Operational Readiness
- Ensure regulatory compliance
Before we dive into these objectives, let’s first look at the life cycle phases of a service. Tracking a service through these phases is itself an overarching goal for Service Portal.
- After its inception, a service goes through an architecture review process, to ensure that nuts and bolts are in place, and to identify opportunities to reuse existing platform capabilities.
- After an architecture review, the service is considered to be in a “Preparing” phase and is required to go through a change management process, or Operational Readiness Review (ORR). Without this review, the service will not be allowed to be deployed to production. This may seem like a bottleneck in a bureaucratic process; however, it is very much needed. For example, a service deployed to production without the necessary review and controls has the potential to bring down systems, resulting in massive losses to the business.
- Once approved and deployed to production, the service is considered “Live” and will warrant stricter controls.
- If a service is no longer in use, it will fall under a “Sunset” category where the service is expected to be decommissioned in 3 to 6 months and all the dependencies removed.
- This is followed by “Decommissioned”, when the service is completely turned off in production, followed by the service team returning hardware resources that were allocated. Service Portal provides the platform to decentralize and streamline this process, ensuring faster turnaround while being nimble.
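The phases above can be read as a small state machine. The following is a minimal sketch, not Service Portal's actual implementation: the `Phase` enum, the `ARCHITECTURE_REVIEW` phase name, the strictly forward-only transition table, and the `advance` helper are all assumptions made for illustration. The only rule taken directly from the description is that a service cannot go to production without an approved ORR.

```python
from enum import Enum


class Phase(Enum):
    """Life cycle phases; ARCHITECTURE_REVIEW is an assumed name for the first step."""
    ARCHITECTURE_REVIEW = "Architecture Review"
    PREPARING = "Preparing"
    LIVE = "Live"
    SUNSET = "Sunset"
    DECOMMISSIONED = "Decommissioned"


# Assumed forward-only transitions; the real rules may allow other paths.
ALLOWED_TRANSITIONS = {
    Phase.ARCHITECTURE_REVIEW: {Phase.PREPARING},
    Phase.PREPARING: {Phase.LIVE},
    Phase.LIVE: {Phase.SUNSET},
    Phase.SUNSET: {Phase.DECOMMISSIONED},
    Phase.DECOMMISSIONED: set(),
}


def advance(current: Phase, target: Phase, orr_approved: bool = False) -> Phase:
    """Return the new phase, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    # A service may not be deployed to production without an approved ORR.
    if target is Phase.LIVE and not orr_approved:
        raise ValueError("Operational Readiness Review approval is required")
    return target
```

For example, `advance(Phase.PREPARING, Phase.LIVE, orr_approved=True)` returns `Phase.LIVE`, while the same call without approval raises `ValueError`.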
Now let’s walk through how Service Portal plays an important role in each of the phases.
Service Metadata: In a service’s early stages, it’s important to capture the metadata and make general information about the service available to the organization. This is where capturing service metadata in Service Portal becomes useful.
- Service Portal captures the ownership information for a service. This makes it possible to check whether the owner is an active employee at the company (via a back-office directory service) and, if not, to escalate to the owner’s manager for reassignment.
- It ensures proper documentation is available and accessible.
- It ensures a PagerDuty schedule is created with active employees, conforming to escalation standards.
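The ownership check described above could look roughly like the sketch below. It is illustrative only: the `Employee` record, the in-memory `directory` dictionary standing in for the back-office directory service, and the choice to keep walking up the management chain until an active employee is found are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Employee:
    """Hypothetical directory record."""
    username: str
    active: bool
    manager: Optional[str] = None


def resolve_contact(owner: str, directory: Dict[str, Employee]) -> Optional[str]:
    """Return the owner if active, otherwise the nearest active manager."""
    seen = set()  # guards against cycles in the reporting chain
    current = directory.get(owner)
    while current is not None and current.username not in seen:
        if current.active:
            return current.username
        seen.add(current.username)
        current = directory.get(current.manager) if current.manager else None
    return None  # nobody active found; needs manual follow-up
```

If the returned username differs from the recorded owner, the service is a candidate for reassignment.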
Service Ecosystem: As the service is being developed, tested, and getting ready to go live, it becomes imperative to capture information about its environment.
- Service Portal has built-in capability to read host configurations, whether in one of our own data centers or cloud, and associate that with a service.
- It also captures the resources associated with a service (databases, load balancers, Jira tickets/projects).
- Lastly, it records the entry points into a service (DNS), whether other services can integrate with it, and the environments it is available in.
API Schema Publication: If a service has a REST API, once live in production it becomes important to make its schema available to other stakeholders.
- Service Portal acts as an API repository for internal integrators.
- It ensures that API schema documentation published to Service Portal is always structurally valid, conforming to OpenAPI specifications.
- It validates that published API schema documentation conforms to company-specific standards and requirements, e.g. for providing regulatory classifications to API endpoints.
- It facilitates the automatic generation of REST API clients for languages that benefit from it.
- It integrates with service GitHub repositories to auto-detect schema documentation changes and auto-publish them to Service Portal.
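As an illustration of a company-specific check on a published schema, the sketch below scans an already-parsed OpenAPI document and reports operations that carry no regulatory classification. The extension name `x-regulatory-classification` and the function name are hypothetical; OpenAPI permits `x-` prefixed specification extensions, but the field Groupon actually uses is not stated here. Structural validation against the OpenAPI specification is a separate step and is not attempted.

```python
from typing import List

# Operation keys allowed in an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}
CLASSIFICATION_KEY = "x-regulatory-classification"  # hypothetical extension name


def missing_classifications(schema: dict) -> List[str]:
    """List operations in a parsed OpenAPI document that lack a classification."""
    missing = []
    for path, item in schema.get("paths", {}).items():
        for method, operation in item.items():
            # Path items also hold non-operation keys such as "parameters".
            if method not in HTTP_METHODS:
                continue
            if CLASSIFICATION_KEY not in operation:
                missing.append(f"{method.upper()} {path}")
    return sorted(missing)
```

A publication pipeline could reject a schema whenever this list is non-empty and return the offending operations to the service team.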
Regulatory Compliance: Service Portal is the source of truth within Groupon for service-level regulatory classifications and other legal or regulatory information about the service.
- To ensure that services meet regulatory requirements, Groupon follows a compliance review process, also known as Privacy & Security by Design Review, that determines whether a particular service falls within GDPR/SOX/PCI scope. Service Portal provides a platform not just for recording this information but also for capturing the approvals and sign-offs necessary for compliance.
- Service Portal has built-in automation to verify whether the necessary controls (e.g. user access to a host) have been applied to a service based on its regulatory classification.
Service Health: Service Portal also offers a “health” feature, in which it runs automated checks against a given service to ensure necessary controls are in place. Each check is assigned a score based on its result, and that score counts towards the aggregated score for the service. A service can have a score of Green, Yellow, or Red depending on the status of its checks. This score affects the service team’s ability to make changes in production.
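The aggregation rule is not spelled out here, so the sketch below assumes the simplest one: a service takes the worst score of any of its checks. The `Score` enum, the `aggregate` function, and the check names in the usage note are hypothetical.

```python
from enum import IntEnum
from typing import Mapping


class Score(IntEnum):
    """Ordered so that a higher value means a worse result."""
    GREEN = 0
    YELLOW = 1
    RED = 2


def aggregate(check_scores: Mapping[str, Score]) -> Score:
    """Assumed rule: the service score is the worst score among its checks."""
    # A service with no checks is treated as GREEN; that default is an assumption.
    return max(check_scores.values(), default=Score.GREEN)
```

Under this assumption, `aggregate({"owner_active": Score.GREEN, "pagerduty_schedule": Score.YELLOW})` yields `Score.YELLOW`, and a single Red check makes the whole service Red.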
Service Portal not only plays a crucial role in change management but also serves as the bedrock of a stable platform at Groupon by driving continuous operational readiness.
Building Block Support:
It may be clear by now that Service Portal plays a much larger role in the engineering organization than that of a mere tool for tracking services. It drives engineering processes and improves collaboration and communication.
Groupon’s engineering culture encourages innovation, and teams are generally geared to solve the next problem or build the next big thing using cutting-edge technologies. Consequently, Service Portal isn’t just iterating on incremental changes to existing features, but is also trying to identify and solve the next big challenge. One of the ideas on our whiteboard is to incorporate large building blocks of the engineering platform into Service Portal. An example of this, which we are currently ideating on, is supporting the database provisioning process. Some potential use cases for this include:
- Support infrastructure changes, such as migrating a service’s databases to the cloud
- Validate service decommissioning, to ensure that a service’s databases are decommissioned first and that those databases aren’t also used by another service
- Enhance regulatory classification capabilities, capturing such classifications at the database level rather than just at the service level
- Enhance service discovery/analysis, allowing people to identify which services are using a particular database engine, for example
Aside from acting as a repository for database information, be it the type of database (PostgreSQL, MySQL) or configuration information (ports, endpoints, replicas, etc.), Service Portal can also use this information to associate infrastructure components with services. A challenge that we face in a distributed, service-oriented architecture (SOA) is keeping track of dependencies. For instance, let’s say a service is being decommissioned and its owners have requested the removal of the associated hosts and databases. In this scenario, knowing which services are associated with a database ensures that live services that depend on it aren’t affected. Conversely, if no one requests the removal of an associated database, the association information makes it possible to initiate the necessary action. In keeping with the spirit of efficiency, all of this tracking can be automated, reducing process overhead. The same approach can extend beyond databases to other infrastructure components such as Kubernetes namespaces, load balancers, etc.
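The decommissioning scenario can be sketched as two lookups over a service-to-database association. This is an idea under discussion rather than an existing feature, so everything below is hypothetical: the `db_users` mapping from database name to the set of services using it, and both function names.

```python
from typing import Dict, Set


def blocking_dependents(service: str, db_users: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each database used by `service` to the other services still using it."""
    blockers = {}
    for database, users in db_users.items():
        if service in users:
            others = users - {service}
            if others:
                blockers[database] = others
    return blockers


def exclusive_databases(service: str, db_users: Dict[str, Set[str]]) -> Set[str]:
    """Databases used only by `service`, which should be removed along with it."""
    return {db for db, users in db_users.items() if users == {service}}
```

The first function covers the case where removing a database would affect a live service; the second covers the reverse case, flagging databases that would otherwise be left behind if nobody requested their removal.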
Historically, Groupon’s approach to addressing production issues has been more reactive than proactive. On-call and escalation procedures are in place to minimize the impact of any production outages, and they work very effectively. Service Portal plays an active role in this approach today by, among other things, providing up-to-date contact information and integrating with PagerDuty to notify the appropriate teams.
Service Portal integrates with tools like Wavefront, which Groupon uses to capture its standard per-service HTTP metrics. Those metrics allow Service Portal to automatically calculate a service’s availability, at least relative to its HTTP activity. Our goal is to work with service teams to adopt a common standard for defining a service’s availability, one that makes automated measurement accurate and comprehensive enough to identify services that may be under-investing in addressing technical debt. This aligns with our vision of being proactive in identifying production outages.
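One common way to express availability relative to HTTP activity is the share of requests that did not end in a server error. The sketch below assumes that definition; it is not necessarily the formula Service Portal uses, and it takes plain request counts as input rather than calling any metrics API. The function and parameter names are hypothetical.

```python
from typing import Optional


def http_availability(total_requests: int, server_errors: int) -> Optional[float]:
    """Fraction of requests not answered with a 5xx status (assumed definition)."""
    if total_requests < 0 or not 0 <= server_errors <= total_requests:
        raise ValueError("counts must satisfy 0 <= server_errors <= total_requests")
    if total_requests == 0:
        return None  # no traffic in the window, so availability is undefined
    return 1 - server_errors / total_requests
```

A definition like this says nothing about services with little or no HTTP traffic, which is one reason a shared, per-service standard for availability is still needed.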