Platform Engineering at Palo Alto Networks: Part-2

Ramesh Nampelly
9 min read · Jan 3, 2023

This is the second and final part of the Platform Engineering at Palo Alto Networks blog post. In the previous post, we talked about the inception and overview of the Palo Alto Networks Internal Developer Platform (IDP). Now we are going to discuss each category of the Palo Alto Networks IDP and its capabilities in the context of the 2022 Gartner published report.

In the diagram below, Palo Alto Networks IDP capabilities are categorized into the following three development lifecycle phases:

  • Discover and Create
  • Integrate and Deploy
  • Operate and Improve

Palo Alto Networks IDP components overview

Image 1: Palo Alto Networks IDP overview in the context of Gartner published report

Let’s now expand a bit on each of these phases and their IDP capabilities.

Discover and Create (Day 0)

The Discover and Create phase covers the initial part of the development lifecycle, including onboarding, training, bootstrapping and local development. As shown in Image 1 (Palo Alto Networks IDP overview in the context of Gartner published report), all these capabilities are part of the “Developer Portal” at Palo Alto Networks.

Developer Portal (Palo Alto Networks DevClues)

As I described in the previous post, DevClues is based on the open source project backstage.io and is a one-stop shop for all the internal services that the infra platform offers. Now we will take a closer look at each capability of DevClues (listed in Image 1) in the context of the Gartner published report.

Internal developer portals should make it easy for developers to perform Day 0, Day 1 and Day 2 activities throughout all phases of the software delivery life cycle (SDLC).

Image 2: Palo Alto Networks DevClues home page

Palo Alto Networks DevClues provides ready-to-use service templates for developers to create new software applications, services and infrastructure components with embedded best practices.

Service catalog

This capability enables developers to quickly and easily search for and find production services (i.e. those onboarded into DevClues), including:

  • Available services and their metadata (e.g. owner and on-call engineer)
  • Their API spec
  • Additional details such as documentation, CI/CD stats, code coverage and DORA metrics, as shown below (Image 3: service overview in service catalog).

Image 3: service overview in service catalog
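
To make the catalog idea concrete, here is a minimal Python sketch of a catalog lookup. It is an illustration only: the `ServiceEntry` fields, the `search` helper and the sample data are hypothetical and do not reflect the actual DevClues or Backstage data model.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    """Hypothetical catalog record; field names are assumptions for illustration."""
    name: str
    owner: str
    on_call: str
    api_spec_url: str = ""
    tags: list[str] = field(default_factory=list)


def search(catalog: list[ServiceEntry], term: str) -> list[ServiceEntry]:
    """Return entries whose name, owner or tags contain the term (case-insensitive)."""
    needle = term.lower()
    return [
        entry for entry in catalog
        if needle in entry.name.lower()
        or needle in entry.owner.lower()
        or any(needle in tag.lower() for tag in entry.tags)
    ]


# Made-up sample data.
catalog = [
    ServiceEntry("billing-api", "team-payments", "alice", tags=["go", "grpc"]),
    ServiceEntry("log-ingest", "team-observability", "bob", tags=["python"]),
]
print([entry.name for entry in search(catalog, "payments")])
```

A real portal would back this with an index rather than a linear scan; the point is only that owner and on-call metadata live next to the service entry and are searchable.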

Service generation templates

This capability allows developers to create new services, infrastructure components or applications based on predefined templates. DevClues currently provides service templates for “go”, “python” and “react” application development, and infrastructure provisioning templates for “GKE cluster bring up” and “GitOps onboarding” (with GitLab and Argo CD).

Documentation and training material

This helps developers understand how to get the most out of the platform in a self-service manner. It also fosters a community around the platform, where individuals from different service teams become SMEs and can help their teams achieve their goals without waiting on the platform team for information or help. This includes:

  • Platform usage guides
  • Training docs and videos
  • Community brown-bags and office hours

Automated Support & Search

Provides developers with a self-serve way to find information about the platform and its features, get answers to questions, and get automated help with problems. This includes:

  • Global search across all available docs, guides and examples
  • Slack or email links to contact platform team
  • Chatbots to assist with onboarding

Best Practices Guidance and Tools

Provides developers with best practices around architecture, including:

  • Templates and Bootstrapping solutions
  • Production readiness guidance
  • SRE best practices and standards

Custom Plugins

DevClues enables Palo Alto Networks developers to build portals for internal tools and processes. At Palo Alto Networks, we have plugins for:

  • Cloud Costs
  • Observability onboarding
  • Incident analytics
  • Infrastructure management
  • Certificate management
  • New cloud region buildout
  • Auto-remediation authoring
  • Production audit logs
  • Marketplace for internal projects suitable for “inner-sourcing”

Integrate and Deploy (Day 1)

The Integrate and Deploy phase covers application deployments into non-prod and prod environments, integration of distributed systems, configuration of resources and so on. It provides a single dashboard to manage distributed infrastructure across cloud and on-premises environments. The current capabilities listed below are spread between the infrastructure management and production management categories (as shown in Image 1: Palo Alto Networks IDP overview in the context of Gartner published report).

Infrastructure Provisioning & Orchestration

At Palo Alto Networks, we built a DevClues plugin called Uno (see Image 4 below) that helps developers provision and configure cloud resources and other infrastructure components for their service or app using GitOps. This includes:

  • On-demand resource provisioning in private or public clouds
  • Defining all required resources as code with best practices

Image 4: Uno — DevClues plugin for multi-cloud infrastructure management

Policy management

This covers enforcement of business, operational and best-practice policies on resources and running apps/services. We’ve implemented an OPA (Open Policy Agent) based “control plane” to help with:

  • Authorization (RBAC) for all internal portals and underlying APIs
  • Restricting resource configurations to allowed values (i.e. in CI/CD)
  • Enforcement of specific annotations/labels
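
As a rough illustration of the last two checks, the sketch below validates a resource definition against required labels and allowed values. OPA policies are normally written in Rego; this Python version only mirrors the logic, and the label names, the allowed regions and the `check_resource` helper are all assumptions made for the example.

```python
# Hypothetical policy data; label names and allowed values are assumptions.
REQUIRED_LABELS = {"owner", "cost-center"}
ALLOWED_VALUES = {"region": {"us-west1", "us-east1"}}


def check_resource(resource: dict) -> list[str]:
    """Return a list of policy violations for a resource definition."""
    violations = []
    labels = resource.get("labels", {})
    # Every required label must be present on the resource.
    for label in sorted(REQUIRED_LABELS - set(labels)):
        violations.append(f"missing required label: {label}")
    # Restricted settings, when set, must use one of the allowed values.
    for key, allowed in ALLOWED_VALUES.items():
        value = resource.get(key)
        if value is not None and value not in allowed:
            violations.append(f"{key}={value!r} is not an allowed value")
    return violations


print(check_resource({"region": "eu-central1", "labels": {"owner": "team-a"}}))
```

A pipeline step could refuse to apply any resource for which this list is non-empty.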

Environment management

The DevClues plugin we built to manage infrastructure resources has been extended to create, configure and manage service/app environments. This includes:

  • Easily adding a new environment/region, or removing an old one
  • Creating and removing ephemeral environments for development and testing

Secrets management

At Palo Alto Networks, this capability provides a service that manages certs, secrets and config sync in production so that teams can securely:

  • Integrate the secrets management system automatically with the deployment/delivery system
  • Store and rotate their configs, secrets and certificates in a centralized repo (e.g. Vault, GSM or ASM)
  • Reconfigure or reload their applications seamlessly when the corresponding secret, certificate or config changes, whether they run on K8s, Docker or as native Linux processes
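
The reload-on-change behavior can be sketched as a small polling watcher. This is a minimal illustration under our own assumptions, not the actual service: `ChangeWatcher`, `fetch` and `on_change` are hypothetical names, and a real implementation would fetch from the secrets backend rather than from an in-memory callable.

```python
import hashlib
from typing import Callable, Optional


class ChangeWatcher:
    """Calls `on_change` when fetched content differs from the last seen content."""

    def __init__(self, fetch: Callable[[], bytes], on_change: Callable[[bytes], None]):
        self._fetch = fetch          # hypothetical: reads the secret/config bytes
        self._on_change = on_change  # hypothetical: reloads the application
        self._last_digest: Optional[str] = None

    def poll(self) -> bool:
        """Fetch once; return True if a change was detected and handled."""
        content = self._fetch()
        digest = hashlib.sha256(content).hexdigest()
        if digest == self._last_digest:
            return False
        first_run = self._last_digest is None
        self._last_digest = digest
        if first_run:
            # The first poll only records a baseline; nothing to reload yet.
            return False
        self._on_change(content)
        return True
```

Comparing digests instead of raw values means the watcher never has to keep the previous secret in memory.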

Operate and Improve (Day 2)

This phase covers continuous operations by providing access to a toolbox, via DevClues plugins, for automation, monitoring, observability, incident management and more.

At Palo Alto Networks, the Operate and Improve phase capabilities are spread among three categories: production management, infrastructure management and resource management (as shown in Image 1: Palo Alto Networks IDP overview in the context of Gartner published report).

Monitoring and Observability

Monitoring and observability is one of the critical production engineering services at Palo Alto Networks. We’ve built an internal observability platform called “Garuda” using proven open source technologies such as Grafana, Grafana Mimir, Grafana Loki, Grafana Tempo and vector.dev. We will soon publish a separate blog post introducing Garuda and diving deep into its capabilities.

As of today, Garuda offers the following capabilities:

  • Logging and events
  • Tracing
  • Metrics
  • Alerting
  • Dashboards

We built a DevClues plugin for Garuda (see Image 5 below) to help engineering teams easily onboard their infrastructure and services/apps into the “observability platform”.

Image 5: Garuda onboarding plugin in DevClues

One of the major challenges with monitoring and observability is the complex onboarding process for different resource types: cloud, on-prem, K8s, VMs, serverless and bare metal. Through this plugin, we made the onboarding process as seamless and frictionless as possible.

Image 6: Garuda agent onboarding

Incident management

Effective incident management is a key aspect of our SRE best practices. It covers efficiently alerting and notifying engineers when business-impacting incidents or outages occur, as well as providing tools for managing those incidents. This includes:

  • Incident management dashboards
  • Incident analytics
  • Tooling to alert and create Slack channels

Our DevClues incident analytics plugin provides the following insights:

  • Incidents by day
  • Incidents by hour
  • Incidents by component
  • Incidents by team
  • Incident repair vs. occurrence
  • Repeat incidents by component/service
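
Several of these breakdowns are simple aggregations over incident records. The sketch below uses made-up sample data and hypothetical variable names; it is not the plugin's implementation, only an illustration of the kind of grouping involved.

```python
from collections import Counter
from datetime import datetime

# Made-up sample records: (start time, component).
incidents = [
    (datetime(2022, 11, 1, 9, 15), "auth"),
    (datetime(2022, 11, 1, 22, 40), "ingest"),
    (datetime(2022, 11, 2, 9, 5), "auth"),
]

# Count incidents per calendar day, per hour of day and per component.
by_day = Counter(ts.date().isoformat() for ts, _ in incidents)
by_hour = Counter(ts.hour for ts, _ in incidents)
by_component = Counter(component for _, component in incidents)

# Components with more than one incident are candidates for "repeat incidents".
repeat_components = sorted(c for c, n in by_component.items() if n > 1)

print(by_day, by_hour, repeat_components)
```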

Auto-remediation

According to a recent industry study, 30–50% of production incidents are repetitive and contribute to the majority of SRE toil. We wanted to tackle this problem through automation, so we built “Nutrix”, our internal auto-remediation platform based on the open source project StackStorm.

Image 7: Nutrix plugin in DevClues

Again, one of the major bottlenecks to adopting this automation was the lack of a robust authoring framework, which motivated us to build one (shown below) in the DevClues Nutrix plugin.

Image 8: Nutrix auto-remediation authoring in DevClues
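
Conceptually, auto-remediation maps a recurring alert to a scripted fix. The sketch below shows that idea with a plain Python registry. It does not use the StackStorm API, and the alert name, the `remediation` decorator and the `handle` function are hypothetical names chosen for illustration.

```python
from typing import Callable, Optional

# Hypothetical registry: alert name -> remediation function.
REMEDIATIONS: dict[str, Callable[[dict], str]] = {}


def remediation(alert_name: str):
    """Decorator that registers a function as the fix for a given alert."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        REMEDIATIONS[alert_name] = func
        return func
    return register


@remediation("DiskUsageHigh")
def clean_tmp(alert: dict) -> str:
    # A real action would run a cleanup job; here we only describe it.
    return f"cleaned temp files on {alert['host']}"


def handle(alert: dict) -> Optional[str]:
    """Run the registered remediation, or return None so a human is paged."""
    action = REMEDIATIONS.get(alert["name"])
    return action(alert) if action else None


print(handle({"name": "DiskUsageHigh", "host": "web-1"}))
```

Alerts without a registered fix fall through to the normal incident process, so coverage can grow one repetitive incident at a time.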

Insights dashboards

These are dashboards that use observability and monitoring data to diagnose issues and debug running systems in order to reduce MTTR (mean time to resolution). This includes:

  • Host 360
  • Certificate 360
  • Kubernetes 360 dashboards
  • Cost Insights

Image 9: Garuda canned Host 360 dashboard

Image 10: Garuda canned Certificate 360 dashboard

Image 11: Garuda canned Kubernetes 360 dashboard

Image 12: Garuda canned Kubernetes costs insights dashboard

IAM

This covers identity and access management for users and tools that need access to cloud resources and systems.

Examples include:

  • Managing K8s cluster access levels based on roles, for example, different access for service operators and cluster operators
  • Defining RBAC permissions to perform deployments, updates to the configuration of the service/app and resources
  • Bastion as a service for production infrastructure access
  • Just-in-time access management
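
Just-in-time access boils down to grants that expire on their own. Here is a minimal sketch of that idea, assuming a hypothetical `AccessGrant` record and `grant_access` helper. A real system would also need approval, audit logging and enforcement at the resource.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AccessGrant:
    """Hypothetical time-bound grant of a role to a user."""
    user: str
    role: str
    expires_at: datetime

    def is_active(self, now: datetime) -> bool:
        # The grant is valid strictly before its expiry time.
        return now < self.expires_at


def grant_access(user: str, role: str, now: datetime,
                 ttl: timedelta = timedelta(hours=1)) -> AccessGrant:
    """Create a grant that expires `ttl` after `now`."""
    return AccessGrant(user, role, now + ttl)


start = datetime(2023, 1, 3, 12, 0, tzinfo=timezone.utc)
grant = grant_access("alice", "cluster-operator", start, timedelta(minutes=30))
print(grant.is_active(start + timedelta(minutes=10)))
```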

Security and compliance management

The Palo Alto Networks infosec team is responsible for defining and enforcing security policies and for automatic validations and checks, while remediation is the responsibility of the engineering teams. This includes:

  • Security reviews and approvals (infosec)
  • Scanning and checks for vulnerabilities(infosec)
  • Remediations for vulnerabilities (engineering)
  • Framework for implementing measures for compliance with business policies (Platform engineering)
  • Framework for adhering to governance rules and regulations (Platform engineering)

Costs management

In the multi-cloud and hybrid cloud world, infrastructure costs have become a hot topic. There is an undisputed need to continuously monitor, report and optimize costs.

At Palo Alto Networks, we started attacking this problem along two dimensions, top-down and bottom-up:

  • Top-down: this is driven by FinOps and CloudOps for executive teams, to drive cost optimizations at the org and business unit level.
  • Bottom-up: this is driven by infra platform and engineering teams, where we provide granular cost insights at the individual cloud resource or user level, including anomaly detection and automations to optimize costs.

Image 13: DevClues costs insights at SKU level
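
To give a feel for the anomaly detection mentioned above, here is a minimal sketch that flags days whose cost is far from the average. The z-score approach, the threshold and the `find_cost_anomalies` name are our assumptions for illustration; the post does not describe the method DevClues actually uses.

```python
from statistics import mean, pstdev


def find_cost_anomalies(daily_costs: list[float], threshold: float = 2.0) -> list[int]:
    """Return indexes of days whose cost is more than `threshold`
    standard deviations away from the mean."""
    if len(daily_costs) < 2:
        return []
    avg = mean(daily_costs)
    spread = pstdev(daily_costs)
    if spread == 0:
        # All days cost the same, so nothing stands out.
        return []
    return [
        i for i, cost in enumerate(daily_costs)
        if abs(cost - avg) / spread > threshold
    ]


# Made-up daily spend for one SKU; the last day spikes.
print(find_cost_anomalies([100, 102, 98, 101, 99, 100, 300]))
```

A fixed z-score is a blunt tool for short or seasonal series, so treat it as a starting point rather than a recommendation.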

Configuration management

In the cloud world, there are two important aspects of infrastructure management: infrastructure provisioning and configuration. At Palo Alto Networks, we standardized infrastructure provisioning with Terraform and configuration management with Ansible.

Developers should manage application configuration in a scalable and reliable way, similar to how we manage and version source code or infrastructure as code (IaC).

To enable better management of Ansible code, we adopted AWX, an open source version of Ansible Tower, to help developers and SREs standardize how configuration is deployed, initiated, delegated and audited.

Resource management

Resource and infrastructure management go hand in hand at Palo Alto Networks. The DevClues “Uno” plugin is meant to provide simple, seamless infrastructure provisioning as well as end-to-end management. This includes:

  • Management of the Kubernetes cluster fleet and components as code, with best practices and continuous deployment
  • Management of virtual machines across cloud providers as code, with best practices
  • Management of cloud provider resources, for example Google BigQuery, Cloud SQL and Cloud Run functions

Summary

According to Gartner’s report, 75% of organizations with platform teams will provide self-service developer portals by 2025 to improve developer experience and accelerate product innovation.

The adoption of an IDP is directly proportional to the maturity of an organization’s DevOps, SRE and platform engineering practices: the higher the maturity index, the higher the likelihood of using a developer portal.

The platform engineering team at Palo Alto Networks is focused on and committed to continuously innovating on IDP capabilities by managing adoption and the roadmap, gathering feedback from our engineering teams and marketing its capabilities.

References and Additional Resources

Innovation Insight for Internal Developer Portals: https://www.gartner.com/en/documents/4010078
