Platform Engineering model for Product Engineering

Balakrishnan Sreenivasan
Nov 28, 2023

Authors: Balakrishnan Sreenivasan and Siddharth Sood

CIOs and CTOs are increasingly prioritising the transformation of their IT organization towards an engineering-focused approach that integrates product engineering and full-stack principles. We are observing pronounced cloud adoption, which helps with the shift towards engineering-centric teams. Concurrently, DevOps and Site Reliability Engineering (SRE) practices have reached significant maturity within numerous organisations. This convergence of hyperscaler evolution, service maturity, established patterns, automation, and DevOps advancements underscores platform engineering’s vital role in shaping an engineering-focused IT organization. Despite this evolution, many enterprises still operate infrastructure as a traditional service with only some degree of automation. Gartner names platform engineering among its top technology trends for 2024 and predicts that 80% of large software engineering organisations will have established platform engineering teams by 2026. It adds that the goal of these teams will be to optimize the developer experience and accelerate product teams’ delivery of customer value. The key takeaway is that platform engineering focuses on providing reusable services, components, automation, and tools internally, to be consumed in a self-service model. In essence, the platform engineering model is an evolution of the DevOps model. This blog explores how a well-developed platform engineering model, in its full breadth, can empower product and full-stack teams.

Foundational elements for full stack product teams to be effective

Product engineering squads are necessarily full-stack teams with deep product engineering skills, working within a product engineering construct. Refer to this article for more details about transforming an enterprise to a product engineering model. The focus of this article is the platform engineering model that helps product engineering squads work effectively and with autonomy. To realize the full potential of full-stack product engineering squads, certain foundational elements need to be established. These can be classified into a core set of platform services, guard-rails derived from enterprise policies and standards, landing zones by workload disposition aligned with the target operating model, and DevOps capabilities offered as a service. The foundational platform also needs to account for the fact that, during transformation, applications and services need a hybrid platform (some components remain on-prem while the rest are modernised to cloud). The diagram below articulates the model:

Platform Engineering Model — Application Squads versus Platform Engineering needs

In the above model, application teams (full-stack product squads) need a set of services from the platform, delivered through self-service and embedded automation. This is what platform engineering services offer them. Platform engineering services abstract a variety of services, including core platform and cloud services (including on-prem), security & compliance, backup & resiliency, network, containers, platform and application tooling (including DevOps tooling), integration services, and additional middleware services per the needs of the enterprise. They can be visualized as a catalog of product-led services offered by a set of product teams. Another key element of this model is a set of patterns, reference architectures and guidance (as-code) that helps application teams operate in a cloud-native way. Typically, this is provided by a cloud acceleration center or cloud center of expertise, which acts as a change agent helping application teams and other enterprise teams make better use of cloud capabilities.

Platform Engineering Model Overview

Enterprises have traditionally built several shared services to drive optimization in IT delivery and associated costs. Application teams have to deal with several of these shared services such as security, identity management, Quality Engineering, API CoE, middleware / database services, Infrastructure services, Service management, and DevOps tooling. These services, designed to accommodate a broad range of applications and data, often introduce rigidity, hindering IT teams’ agility and decision-making autonomy. Consequently, this impedes the teams from operating with the efficiency of a full-stack product team. Platform engineering emerges as a pivotal approach in this context, aiming to overcome the limitations of the conventional IT framework that restrict speed and agility. The accompanying illustration contrasts the traditional model with the platform engineering approach.

Evolution of Platform Engineering Model from Traditional to DevOps led Model

The platform engineering model is about how the suite of services needed for end-to-end application development, deployment and management is abstracted and offered as-a-service (in the form of guidance, self-service and code), with the necessary enterprise guard-rails. This includes a set of reference architectures (e.g. how to build a highly available serverless application on Azure or AWS) with reference implementations (code). This helps product teams make the right design decisions from a catalogue of services made available to them, which removes a significant level of dependency on shared services and moves teams to a “do-it-yourself” model. While this article focuses on autonomy from platform and shared services, product teams also need to follow domain-driven design principles to be truly empowered from a functional and data dependency perspective.

Elements of Platform Engineering Model

Platform engineering involves various elements that collectively aim to streamline development, operations, and delivery processes in software engineering. Key elements of this model include:

Foundation Services: Platform engineering helps build the right hybrid cloud landing/execution zone, which helps enterprise teams adopt cloud faster in a consistent, secure and optimized manner aligned to cloud provider best practices and recommendations. A landing zone usually helps platform teams take care of multiple concerns such as cloud access management, account management, workload segregation, workload security, cloud services adoption, centralized logging & monitoring, SIEM (Security Information and Event Management), control tower, and specific compliance and regulatory best practices. Landing zone considerations could also include specialized next-generation use cases such as Generative AI workloads that need a way to source external models, or to have external vendors or service providers innovate using anonymized enterprise data in a separate landing zone that is isolated from the rest of the enterprise.

Compute & PaaS services & Infrastructure as code (IAC): The speed and agility of teams come from their ability to build and operate environments, configure services, and deploy and manage applications in a low-touch/no-touch, automated manner, and here IAC plays a significant role. With IAC, everything can be provisioned and managed through DevOps pipelines and code. There are several options for IAC, such as AWS CloudFormation templates, ARM templates, Google Cloud Deployment Manager, Terraform, Ansible and so on. While platform teams leverage IAC to build the catalogue of services that application teams need, application / full-stack teams leverage IAC to automate most deployment and management activities (e.g. provisioning compute / cloud services, configuring them, configuring CI/CD pipelines, incorporating security elements, updating firewall rules and so on). Enterprises often need middleware services that are not natively offered by cloud providers (e.g. TIBCO, app / database servers); these are typically offered by platform services teams that have automation mechanisms (e.g. Ansible-based) to deploy and manage this middleware. This gets bundled into the platform offerings to address the breadth of enterprise needs, as cloud-native services alone are generally not sufficient. This also includes any 3rd party tooling needed by development teams. Given the Generative AI push across enterprises, one needs to enable a suite of compute capabilities for Generative AI, including HPC, TPUs and so on, in a hybrid way where models can be built centrally and deployed wherever they need to run based on the use case.
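
To make the declarative idea behind IAC concrete, here is a minimal Python sketch, not tied to any of the tools named above. It assumes a team describes the resources it wants, and a tool compares that description with what currently exists and derives the actions to apply. The `Resource` type, the `resource` and `plan` helpers and the resource names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A hypothetical, simplified description of one managed resource."""
    name: str
    kind: str
    settings: tuple  # sorted (key, value) pairs, so two resources compare by value


def resource(name, kind, **settings):
    """Build a Resource from keyword settings (illustrative helper)."""
    return Resource(name, kind, tuple(sorted(settings.items())))


def plan(desired, actual):
    """Return the actions needed to move the actual state to the desired state."""
    desired_by_name = {r.name: r for r in desired}
    actual_by_name = {r.name: r for r in actual}
    actions = []
    for name, res in desired_by_name.items():
        if name not in actual_by_name:
            actions.append(("create", name))
        elif actual_by_name[name] != res:
            actions.append(("update", name))
    for name in actual_by_name:
        if name not in desired_by_name:
            actions.append(("delete", name))
    return actions


if __name__ == "__main__":
    desired = [
        resource("orders-db", "database", size="small", encrypted=True),
        resource("orders-api", "container_app", replicas=3),
    ]
    actual = [
        resource("orders-db", "database", size="small", encrypted=False),
        resource("legacy-vm", "vm", cpus=2),
    ]
    for action, name in plan(desired, actual):
        print(action, name)
```

Because the plan is derived from the difference between two states, running it again once the states match yields no actions, which is what allows such definitions to be applied repeatedly from a pipeline.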

DevOps practices & Continuous Integration & Continuous Deployment (CI/CD): This integrates development and operations teams into a continuum in which teams continuously develop, test, and deploy to production, while enterprise lifecycle processes are integrated and automated into pipelines. Full-stack teams use IAC to provision and manage solutions, including the pipelines, which in turn integrate lifecycle automation elements (vulnerability scans, code quality, testing, deployments etc.). Platform engineering teams are also responsible for building and managing the shared tooling that provides common services such as application security, code quality, artifact lifecycle management, container image management, centralized logging and monitoring, service management etc. Generative AI brings the need for Model Ops and FM Ops, which enable LLMs and other AI models to be built (or tuned) elsewhere and deployed onto the right target environments per use case needs.
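
As an illustration of lifecycle checks embedded in a pipeline, the following Python sketch runs a build through a sequence of gates and stops at the first failure. It is a simplified model rather than the configuration of any real CI/CD product; the stage names, the fields of the `build` dictionary and the thresholds are assumptions made for the example.

```python
from typing import NamedTuple


class StageResult(NamedTuple):
    stage: str
    passed: bool


# Hypothetical gates and thresholds, chosen only for illustration.
STAGES = [
    ("vulnerability_scan", lambda build: build["critical_vulns"] == 0),
    ("code_quality", lambda build: build["coverage"] >= 0.80),
    ("tests", lambda build: build["failed_tests"] == 0),
]


def run_pipeline(build, stages=STAGES):
    """Run each gate in order and stop at the first one that fails."""
    results = []
    for name, gate in stages:
        passed = gate(build)
        results.append(StageResult(name, passed))
        if not passed:
            break  # fail fast: later stages never see a rejected build
    return results


def deployable(results, stages=STAGES):
    """A build may be deployed only if every stage ran and passed."""
    return len(results) == len(stages) and all(r.passed for r in results)


if __name__ == "__main__":
    build = {"critical_vulns": 0, "coverage": 0.85, "failed_tests": 0}
    print(deployable(run_pipeline(build)))
```

In this arrangement the platform team would own the gate definitions, while application teams only supply the build, which is one way to keep enterprise lifecycle processes consistent across pipelines.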

Zero Touch Operations Model, Monitoring and Observability: A true cloud-native development model is about application squads building all necessary monitoring and observability and automating any manual-touch activities through runbooks. This involves tracking application performance, collecting logs and metrics to understand system behaviour, and using this data to proactively detect and address issues to maintain system health and reliability. The entire suite of tools needed for the enterprise (a combination of native & 3rd party) is offered as-a-service in an automatable form, as a set of Ops patterns with the necessary enterprise ITSM integrations. Application squads can then leverage these patterns to ensure key metrics / KPIs are monitored, with the necessary alerts and runbooks incorporated into operations, while building application-specific logic into the runbooks, monitoring and correlation rules. Model Ops capabilities are critical for Generative AI use cases, given the concerns around the impact of AI as well as the need for continuous monitoring and improvement.
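
One way to picture an Ops pattern of this kind is as a set of alert rules, each bound to an automated runbook. The Python sketch below assumes a very simple model in which a rule fires when a metric exceeds a threshold; the metric names, thresholds and runbook functions are hypothetical, and a real runbook would call the platform's automation and ITSM integrations rather than return a string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlertRule:
    metric: str
    threshold: float
    runbook: Callable[[str], str]  # automated remediation for this alert


# Hypothetical runbooks: placeholders for real automation steps.
def restart_service(service):
    return f"restarted {service}"


def scale_out(service):
    return f"scaled out {service}"


# Hypothetical rules and thresholds, chosen only for illustration.
RULES = [
    AlertRule("error_rate", 0.05, restart_service),
    AlertRule("cpu_utilisation", 0.80, scale_out),
]


def evaluate(service, metrics, rules=RULES):
    """Run the runbook of every rule whose metric exceeds its threshold."""
    actions = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and value > rule.threshold:
            actions.append(rule.runbook(service))
    return actions


if __name__ == "__main__":
    print(evaluate("orders-api", {"error_rate": 0.07, "cpu_utilisation": 0.55}))
```

An application squad would extend such a rule set with its own application-specific metrics and runbooks, while the shared rules and integrations stay with the platform team.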

Security & Compliance: As the maturity of security & compliance offerings in an enterprise evolves, it helps shift-left several of the enterprise’s policies and requirements into guard-rails, either through platform-configured policies, purpose-built automation and rules, or through application patterns with embedded security (as-code). This includes security tools integrated via CI/CD and Ops patterns. This level of maturity needs a well-disciplined platform engineering model where all security & compliance requirements are embedded as guard-rails into the platform, and where security compliance can be automated and self-served. Generative AI brings additional security needs that extend beyond data security & privacy, such as data traceability (for training & tuning models) and addressing AI misuse and bias.
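
Guard-rails expressed as code can be as simple as a list of rules evaluated against a requested resource before it is provisioned. The following Python sketch shows that idea under stated assumptions: the three rules, the field names on the resource dictionary and the messages are invented for illustration and do not represent any particular policy engine.

```python
# Each guard-rail inspects a resource description and returns a violation
# message, or None when the resource complies. All rules are hypothetical.

def require_encryption(resource):
    if not resource.get("encrypted", False):
        return "data must be encrypted at rest"
    return None


def deny_public_access(resource):
    if resource.get("public", False):
        return "public network access is not allowed"
    return None


def require_owner_tag(resource):
    if "owner" not in resource.get("tags", {}):
        return "an 'owner' tag is required"
    return None


GUARDRAILS = [require_encryption, deny_public_access, require_owner_tag]


def violations(resource, guardrails=GUARDRAILS):
    """Collect the messages of all guard-rails the resource violates."""
    return [msg for rule in guardrails if (msg := rule(resource)) is not None]


if __name__ == "__main__":
    request = {"kind": "bucket", "encrypted": False, "public": True, "tags": {}}
    for message in violations(request):
        print("blocked:", message)
```

Running the same checks in the pipeline and in the self-service catalogue is what lets compliance be verified automatically rather than through a manual review.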

Application Patterns: The platform engineering model is most effective when several platform capabilities are abstracted into patterns-as-code (examples: public / private API gateway, serverless functions, service mesh, database patterns, storage patterns, integration patterns, co-existence patterns, performance and resiliency patterns). Each pattern combines reference architectures, IAC and reference implementations, including compute services as well as embedded security & compliance rules (like encryption, IAM integrations etc.). Typically, an acceleration center (like a Cloud Acceleration Center) incubates several of these patterns, which then evolve through use by application teams towards an inner-source model.
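
A pattern-as-code can be thought of as a template with sensible defaults, some of which application teams may tune and some of which are locked because they carry security or compliance rules. The Python sketch below illustrates this with a hypothetical private API pattern; the setting names, default values and the `instantiate` helper are assumptions made for the example.

```python
# A hypothetical pattern: defaults a team may tune, plus locked settings
# that embed security rules and cannot be overridden.
PRIVATE_API_PATTERN = {
    "defaults": {
        "runtime": "python3.12",
        "memory_mb": 256,
        "tls": True,
        "auth": "enterprise-iam",
    },
    "locked": {"tls", "auth"},
}


def instantiate(pattern, name, overrides=None):
    """Create a deployable configuration from a pattern and team overrides."""
    overrides = overrides or {}
    blocked = pattern["locked"].intersection(overrides)
    if blocked:
        raise ValueError(f"cannot override locked settings: {sorted(blocked)}")
    # Later entries win, so team overrides replace the tunable defaults.
    return {"name": name, **pattern["defaults"], **overrides}


if __name__ == "__main__":
    config = instantiate(PRIVATE_API_PATTERN, "orders-api", {"memory_mb": 512})
    print(config)
```

Keeping the locked settings inside the pattern, rather than in each team's code, means an improvement made by one team or by the acceleration center reaches every consumer of the pattern.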

Data Platform, Management & Tooling: To effectively harness Generative AI, enterprises require a sophisticated data management platform and tooling capable of agile and secure data handling, integration, and transport across various applications and environments. Central to this is a standardized suite of data services, essential for the consolidation and orchestration of data resources. This infrastructure must facilitate swift data movement and processing, which is critical for tuning and optimizing generative AI models and therefore intensifies the demand for advanced data management and integration tools. Every enterprise needs these capabilities as it not only modernizes its applications & data to hybrid cloud but also infuses Generative AI capabilities. This is another aspect that the platform engineering model should address.

Containerization and Orchestration: With the evolution of Kubernetes into enterprise-grade platforms, deploying and managing container clusters across cloud providers and on-prem platforms becomes a necessity. In addition to base container orchestration and management capabilities, development teams need key capabilities such as cluster creation, CI/CD and Ops services & tooling integrations, FinOps integrations, integrations for security information and event management (SIEM) and so on. Building and deploying AI models for Generative AI use cases likewise needs container orchestration, cluster management and monitoring capabilities across hybrid cloud environments.

It is also important to understand that with Generative AI becoming more mainstream, many of the above platform engineering elements will be enhanced or reimagined, bringing an additional level of productivity and optimisation. Examples could include automated IaC generation for a variety of use cases, identification of security gaps, vulnerability remediation, and natural language augmentation of platform capabilities.

Conclusion

CIOs and CTOs are moving their IT teams towards a platform engineering model that incorporates product engineering principles, DevOps, and SRE practices for a more engineering-centric approach. This shift leverages automation, mature services, and established patterns to transform traditional IT infrastructure into agile, product-focused platforms, empowering product and full-stack teams. Several teams, including security & compliance, platform services, resiliency services and so on, will have to transform themselves to offer their services in an as-a-service model. Generative AI infusion across enterprises brings a suite of additional requirements for platform engineering services, as called out above. Developers are the ultimate customers in such a model, and platform engineering helps them become autonomous, thereby bringing agility and speed to the business.

Balakrishnan Sreenivasan

IBM Distinguished Engineer and Subject Matter Expert in Application Modernization to Product Centric Models and Domain driven design