Enterprise Cloud Security: Foundation
Part II: Security considerations for the cloud foundation
This article is part of the Enterprise Cloud Security Series. Part I: Introduction introduces the space and how it differs from on-premise security. Part II covers the security considerations for building the cloud foundation. Part III: Developing secure applications focuses on the security considerations of the DevOps environment.
Enterprise cloud infrastructure (Ops in the architecture above) can primarily be split into two pieces: the foundation and the application landing zone. The foundation refers to the infrastructure setup typically performed by the cloud platform team within the enterprise. The application landing zone is the part of the infrastructure provided to developers, with appropriate guardrails in place, to deploy applications and associated infrastructure. Please note that depending on the size and maturity of the organization, its multi-cloud strategy and hybrid design, there may be business-unit, geolocation and/or privacy/sensitivity specific landing zones to enable the organization to meet its organizational or regulatory needs.
This article focuses on the role of security in the various components that form the cloud foundation.
An important part of any enterprise cloud deployment is the need to organize resources. Depending on the size of the organization, cloud service providers recommend different approaches.
There has been significant interest in using an AWS Account, an Azure Subscription or a GCP Project to create landing zones for applications or higher-level organizational components. This approach, if implemented correctly, can be an important part of a strategy to control the “blast radius” by providing micro-segmentation across identity and network, which is an important tenet of zero-trust architecture. Most standard cloud models use one or more of the following types of landing zones:
- Management: primarily concentrates all the components used for management of the environment.
- Shared services: contains shared services like DNS and NTP that are typically managed by the cloud platform or a similar infrastructure team.
- Network: most networking models (like the hub-spoke model) use a network-specific landing zone to terminate on-premise and/or inter-cloud connectivity, simplifying management and connectivity.
- Log aggregation: an important part of logging and monitoring, ensuring that a comprehensive view across all landing zones is available for analysis.
- Application: application or organizational-unit specific landing zones with guardrails that ensure application teams can deploy infrastructure within predefined parameters like network access, allowed services and associated configuration, and allowed access models.
Most Cloud Service Providers provide the capability to build a hierarchical model that enables an organization to structure a very large set of such landing zones into more manageable chunks. AWS Organizational Units (Control Tower), Azure Management Groups and GCP Folders can be used to structure landing zones based on the organizational structure, environments, and security and privacy considerations. In addition, the ability to apply specific sets of organizational policies (AWS Config rules and Service Control Policies, Azure Policy, GCP Organization Policies) based on specific criteria (e.g. non-prod vs prod, accounts with PII data) can also play a role in designing this model. I have seen the organizational hierarchy, where supported, used to grant access to a large number of resources (e.g. database administrators may need access to all databases across accounts), which may be one of the questionable uses of this functionality.
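As a minimal sketch of how such a hierarchy works, the snippet below models policies attached at different levels of a management hierarchy accumulating down to a landing zone. The node names and policy strings are hypothetical, and this is not any cloud provider's SDK, just an illustration of inheritance.

```python
# Hypothetical management hierarchy: each node carries its own policies
# and a list of child nodes (landing zones sit at the leaves).
HIERARCHY = {
    "root": {"policies": ["deny-public-buckets"], "children": ["prod", "non-prod"]},
    "prod": {"policies": ["require-encryption"], "children": ["app-prod-1"]},
    "non-prod": {"policies": ["allow-experimental-services"], "children": ["app-dev-1"]},
    "app-prod-1": {"policies": [], "children": []},
    "app-dev-1": {"policies": [], "children": []},
}

def effective_policies(node: str, hierarchy: dict) -> list:
    """Collect the policies inherited from every ancestor down to `node`."""
    # Build child -> parent links, then walk upward from the node.
    parent = {c: name for name, data in hierarchy.items() for c in data["children"]}
    policies = []
    current = node
    while current is not None:
        policies = hierarchy[current]["policies"] + policies  # ancestors first
        current = parent.get(current)
    return policies
```

A production landing zone like `app-prod-1` thus inherits both the root-level guardrail and the prod-specific policy without either being attached to it directly.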
Given that many of these features are still evolving, it will be interesting to see whether organizations go through multiple phases of re-organizing their management structure to better align with their requirements, or adopt new capabilities that Cloud Service Providers introduce in this space.
Tagging provides a mechanism orthogonal to the hierarchical model described above. Tagging is an important part of the overall resource organization strategy and, if used correctly, can play an important part in security classification. Tags can be used to apply security facets like privacy and criticality across resources instead of making these aspects part of the hierarchical model. However, the lack of consistent features like tag inheritance, disabling tag overriding, and referencing tags in policies and access rules has made this mechanism a challenge to use when designing management models within the enterprise.
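To make the classification idea concrete, here is a small sketch of validating that a resource carries the required security-facet tags before deployment. The tag keys and allowed values are hypothetical examples, not a provider-defined schema.

```python
# Hypothetical required tags and their allowed values; in practice these
# would come from the organization's classification standard.
REQUIRED_TAGS = {
    "data-classification": {"public", "internal", "confidential", "restricted"},
    "criticality": {"low", "medium", "high"},
}

def validate_tags(tags: dict) -> list:
    """Return a list of violations; an empty list means the resource is compliant."""
    violations = []
    for key, allowed in REQUIRED_TAGS.items():
        if key not in tags:
            violations.append(f"missing required tag: {key}")
        elif tags[key] not in allowed:
            violations.append(f"invalid value for {key}: {tags[key]}")
    return violations
```

A check like this is typically wired into the deployment pipeline or a policy engine so that untagged or mis-tagged resources are rejected early.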
Identity forms one of the core foundations of any cloud platform. It is very important to ensure that the identity layer is built to handle the different use cases and associated access models.
An important initial design decision for the cloud foundation is which user account store the cloud platform should use for each type of account. If the cloud’s internal account store (e.g. Azure AD Users, AWS IAM Users) is used, then each user account’s lifecycle must be managed within the cloud. The internal store should also support MFA and password management capabilities. Alternatively, by leveraging an external store integrated through federated SSO, a user can be mapped (for example, through a SAML assertion or JWT) during the authentication process to an access role or profile in the internal store, removing the need to manage the account in the cloud account store. Please note that some basic accounts, like the tenant owner, break-glass accounts and service accounts needed for technical reasons, should still be created in the cloud’s internal account store and managed through applicable privileged account management processes.
The cloud identity system should typically support user authentication against the internal store and through federated SSO mechanisms like SAML and OpenID Connect. It is important that the authentication process is flexible enough to support different types of authentication mechanisms for different types of accounts across all access interfaces, i.e. the portal/web console, command line interface (CLI) and REST API. Lack of support for these mechanisms across all access points has historically been one of the significant challenges to automation.
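The federated mapping step can be sketched as follows: attributes from an SSO assertion (e.g. SAML attributes or JWT claims) are mapped to an internal cloud role, so no long-lived user account is needed in the cloud's own store. The group and role names below are hypothetical.

```python
# Hypothetical mapping from directory groups (carried in the assertion)
# to roles defined in the cloud's internal store.
GROUP_TO_ROLE = {
    "cloud-platform-admins": "PlatformAdminRole",
    "app-developers": "DeveloperRole",
    "security-auditors": "ReadOnlyAuditRole",
}

def map_assertion_to_role(claims: dict) -> str:
    """Return the role for the first matching group; deny by default."""
    for group in claims.get("groups", []):
        if group in GROUP_TO_ROLE:
            return GROUP_TO_ROLE[group]
    raise PermissionError("no role mapping for user's groups")
```

The deny-by-default behavior matters: a federated user whose groups have no mapping should get no access at all, rather than some implicit fallback role.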
Types of accounts
The cloud should be designed to support access from the following types of users:
- End-users: represent the application users who access the environment through edge components (e.g. AWS Application Load Balancer, Azure API Management). As part of a zero-trust architecture, their identity may need to flow to downstream PaaS and SaaS components so that access can be managed and audited in the context of the end-user instead of a generic service account. In addition, the context of the end-user (e.g. whether the user is an employee, contractor, vendor, consultant, partner or part of an acquisition) may also be needed by downstream systems.
- Developers (privileged access): represent users with access to make changes to the infrastructure. Depending on regulatory guidance and other internal policies, some or all developer access may be deemed privileged. Privileged access typically triggers additional control requirements like access approval, session monitoring and recording, just-in-time access, and additional audit and monitoring across all components to provide comprehensive coverage of who tried to perform which action, in what context, and for what reason. It is important to evaluate the configured authentication mechanism and identity store to ensure that policy requirements are met, especially the availability of user identity details in audit and monitoring data for traceability purposes.
- Service accounts: represent accounts used by automation scripts and other software that connect to the cloud to perform various operations. These accounts have a lifecycle different from the user account lifecycle and may use different authentication mechanisms like API keys or certificates. Credential management may involve expiry and rotation of credentials at regular intervals.
- Machine identities: a special case of service account that represents the identity of the VM or service making the call. It allows establishing identity within the cloud infrastructure (and possibly outside) without any need for a stored username and password.
- Break-glass and owner accounts: represent cloud-native accounts that are expected to be used in case of a significant failure or lockout across the identity infrastructure. These accounts typically have very high criticality and are used as accounts of last resort for accessing the environment.
Most cloud services support a role-based access control (RBAC) model that associates permissions with roles based on the access model. The access model should follow the principle of least privilege, ensuring that roles are defined in alignment with specific use cases and then assigned to users through groups. It is important to go through access model development for each and every service being used, so that permissions are classified into at least a core operational set associated with the cloud foundation (e.g. creation of a VPC or VNet) and an application development specific set (e.g. creation of VMs within a specific subnet). Depending on specific use cases and the operational model, additional roles may be created for DevOps operations (e.g. continuous deployment).
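The permission → role → group → user chain described above can be sketched minimally as below. The role names, group names and "service:action" permission strings are hypothetical, not any provider's actual permission vocabulary.

```python
# Hypothetical roles holding permission sets, and groups holding roles.
ROLES = {
    "network-admin": {"vpc:create", "vpc:delete", "subnet:create"},
    "app-developer": {"vm:create-in-app-subnet", "vm:delete-own"},
}
GROUP_ROLES = {
    "platform-team": ["network-admin"],
    "team-payments": ["app-developer"],
}

def is_allowed(user_groups: list, permission: str) -> bool:
    """Least privilege: allow only if some role assigned via a group grants it."""
    return any(
        permission in ROLES[role]
        for group in user_groups
        for role in GROUP_ROLES.get(group, [])
    )
```

Keeping foundation permissions (like `vpc:create`) in roles that application groups never receive is exactly the core-operational vs application-specific split described above.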
In addition, the following practices should be evaluated while designing the access control model:
- Where possible, lock down security attributes (e.g. disk encryption should be enabled and not possible to disable). If not supported by the cloud platform, use a preventative policy or monitoring/remediation to overcome this platform deficiency.
- Ensure that machine identities associated with application infrastructure for non-management components have very limited access (like appending logs), so that in case of a breach there is no additional access available through these machine identities.
- Service accounts should have very limited access and/or should request only the limited scope needed for the work being performed by scripts, to reduce the chance of major outages.
- Use cloud-specific features (e.g. AWS IAM limits the scope of an authenticated session to a specific role in an account) where available to limit the scope of access during assignment and operations, avoiding potentially major impact due to human or automation errors.
- Consider the identity and access model within the service, where applicable (e.g. VMs or databases have internal identity and access models), as part of the overall service access model. In addition, either integrate the service’s internal identity and access model with the cloud platform identity, or put a privileged access management solution in place to ensure adequate controls for monitoring privileged access.
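The first practice above (locking down security attributes via preventative policy) amounts to an explicit-deny-wins evaluation: even when a role permits an action, a policy violation blocks it. A minimal sketch, with hypothetical attribute names:

```python
# Hypothetical preventative checks; each returns a violation message or None.
PREVENTATIVE_DENIES = [
    lambda cfg: "disk encryption must be enabled"
    if not cfg.get("disk_encryption", False) else None,
    lambda cfg: "public IP not allowed"
    if cfg.get("public_ip", False) else None,
]

def evaluate_deployment(cfg: dict, role_allows: bool) -> tuple:
    """Role permission is necessary but not sufficient; any deny overrides it.

    Returns (allowed, reasons) where reasons explains every blocking condition.
    """
    reasons = [msg for check in PREVENTATIVE_DENIES if (msg := check(cfg))]
    if not role_allows:
        reasons.append("role does not permit this action")
    return (not reasons, reasons)
```

Collecting every reason (rather than failing on the first) makes remediation easier for the application team reading the rejection.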
Two aspects have a very significant impact on how most enterprises design the cloud: network connectivity with on-premise datacenters through dedicated connections (AWS Direct Connect, Azure ExpressRoute, GCP Cloud Interconnect), and choke points for ingress and egress traffic to reduce the attack surface and cost and to enforce security controls like DLP and malware monitoring. Most enterprises use a hub-spoke model with defense in depth when designing the network architecture, even though that may not be the most appropriate model for a zero-trust architecture.
Most previous hub-spoke models were built with the hub located in a partner colocation facility or datacenter, so that all traffic could pass through on-premise security appliances. Over the past few years, there has been significant growth in the availability of cloud network appliances for routing, next-generation firewalls, intrusion detection systems (IDS), intrusion prevention systems (IPS), malware scanning, content monitoring and data loss prevention (DLP), which enables more efficient connectivity between regions or across clouds without routing traffic through the datacenter unless required by organizational policy.
In addition, use intermediate networks between the hub and application networks, with static routing, to isolate sensitive workloads.
Some on-premise approaches, like multi-homed VMs with dedicated NICs for administration and backup services, are typically not replicated in the cloud. Identify such practices and plan for alternate designs that can scale in the cloud.
Plan for the presence of shared services like DNS, NTP, vulnerability scanning and EDR in each cloud and region to reduce the need to communicate with on-premise infrastructure, and collect logs in local storage for analysis to reduce egress charges.
Ensure adequate sizing of IP networks for workloads of different sensitivity, to enable simple firewall and route rules and to avoid leaking traffic across sensitivity and criticality boundaries. Where available, use named IP collections or tags in rules to simplify rule updates across complex network architectures.
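The sizing and segmentation checks above can be automated with Python's standard `ipaddress` module, for example verifying that the ranges assigned to different sensitivity zones do not overlap and resolving which zone an address belongs to. The CIDR ranges below are hypothetical.

```python
import ipaddress

# Hypothetical address plan: one /16 per sensitivity zone, no overlaps.
ZONES = {
    "restricted": ipaddress.ip_network("10.10.0.0/16"),
    "internal": ipaddress.ip_network("10.20.0.0/16"),
    "dmz": ipaddress.ip_network("10.30.0.0/16"),
}

def check_no_overlap(zones: dict) -> list:
    """Return pairs of zones whose address ranges overlap (should be empty)."""
    names = sorted(zones)
    return [
        f"{a} overlaps {b}"
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if zones[a].overlaps(zones[b])
    ]

def zone_of(ip: str, zones: dict):
    """Find which sensitivity zone an address belongs to, if any."""
    addr = ipaddress.ip_address(ip)
    return next((name for name, net in zones.items() if addr in net), None)
```

Running a check like this whenever the address plan changes catches overlaps before they turn into ambiguous routes or over-broad firewall rules.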
Where possible, route calls to the cloud control plane and data plane over private networks (e.g. Azure Private Link, AWS VPC endpoints) to reduce the flow of data over the public network.
An important takeaway is the intricate mesh that foundation technologies like identity, network and management structures form with each other, and the way the security implementation percolates through these technologies. Beyond these foundational technologies, there are additional considerations to keep in mind while building the cloud foundation.
Resiliency is an important pillar of the foundation, ensuring that applications can build failover and disaster recovery on a platform that is itself resilient across identity, network and shared services. This is typically achieved through the right combination of cloud-native capabilities, like global services, paired regions and geo-replication, and designing the shared services to be resilient across all active regions and geographies.
Most cloud foundations are developed with a few shared services like DNS, NTP, etc. These services may be either built-in cloud capabilities, or deployed as infrastructure where integration with on-premise infrastructure is not possible with the built-in capability. It is very important to ensure that all the controls expected to secure any other application are applied to these services, including but not limited to:
- Privileged access: access to these services should be limited to a small team, and all changes should preferably be made through a change control workflow with adequate review and approval in place.
- Vulnerability and drift: services and platforms should be scanned and penetration tested to detect vulnerabilities, and patched at appropriate intervals. In addition, configuration should be monitored to identify any drift from the “desired state” defined by the hardening configuration.
- Restricted network access: reduce the attack surface by restricting network access to the specific ports needed, from trusted source networks only. Where possible, access over a secured channel (e.g. TLS) should be enforced.
- Data: data stored by the service should be secured at rest and in motion. All data should be backed up at regular intervals, and recovery should be tested periodically to ensure that the backup process is adequate. Use built-in or custom data integrity checks to ensure that data has not been tampered with.
- Logging: audit events and other operations performed should be stored for the duration prescribed by organizational policies and industry practice (e.g. leading practice for incident management suggests retaining log data for at least 180 days, which may not align with organizational policy or may exceed compliance requirements).
- Resiliency: resiliency of services should be planned at design time, and the implementation should be verified at regular intervals to ensure that the service is highly available within the parameters needed by dependent services.
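The vulnerability-and-drift control above reduces, at its core, to comparing a service's observed configuration against its hardened desired state. A minimal sketch, with hypothetical configuration keys:

```python
# Hypothetical hardened baseline for a shared service (e.g. an NTP or DNS host).
DESIRED_STATE = {
    "tls_min_version": "1.2",
    "root_login": False,
    "ntp_server": "ntp.internal",
}

def detect_drift(observed: dict) -> dict:
    """Return each key whose observed value differs from the desired state."""
    return {
        key: {"desired": want, "observed": observed.get(key)}
        for key, want in DESIRED_STATE.items()
        if observed.get(key) != want
    }
```

Reporting both the desired and observed value per key gives the remediation workflow (manual or automated) everything it needs to restore the baseline.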
One of the shared services that forms an important part of security operations is the log aggregation infrastructure. This capability enables the collection of logs across different landing zones, services, platforms and networks into one or more aggregation stores, like an AWS S3 bucket, Azure Log Analytics workspace or GCP Cloud Storage, for further analysis. The log aggregation platform should be able to handle the following requirements in addition to the regular shared-service security requirements identified above:
- Large volume of data: with growth in the services being used and the size of the cloud footprint, data ingestion can grow to 500 GB/day or more for very large implementations.
- Integrity: the data should typically be stored as write-once-read-many (WORM) to ensure that the integrity of sensitive audit data is maintained.
- Expiry: old data should automatically be removed from the platform to ensure that only the appropriate data is retained over time.
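The integrity and expiry requirements can be illustrated together: the sketch below is an append-only log where each record carries a hash chained to the previous one (so in-place tampering breaks verification), plus expiry of records older than a retention window. All names and the 180-day window are illustrative, not a specific product's design.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=180)  # hypothetical policy window

class AppendOnlyLog:
    """WORM-style store: records are only appended, never edited in place."""

    def __init__(self):
        self.records = []

    def append(self, timestamp, event):
        prev = self.records[-1]["hash"] if self.records else ""
        payload = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev + payload).encode()).hexdigest()
        self.records.append(
            {"timestamp": timestamp, "event": event, "prev": prev, "hash": digest}
        )

    def verify(self):
        """Recompute each record's hash; any in-place edit is detected."""
        for i, record in enumerate(self.records):
            payload = json.dumps(record["event"], sort_keys=True)
            expected = hashlib.sha256((record["prev"] + payload).encode()).hexdigest()
            if record["hash"] != expected:
                return False
            # Every record after the first kept one must link to its predecessor.
            if i > 0 and record["prev"] != self.records[i - 1]["hash"]:
                return False
        return True

    def expire(self, now):
        """Drop records older than the retention window; return count removed."""
        keep = [r for r in self.records if now - r["timestamp"] <= RETENTION]
        removed = len(self.records) - len(keep)
        self.records = keep
        return removed
```

Real platforms implement WORM with storage-level object locks and lifecycle policies rather than application code, but the invariant is the same: audit data can be appended and expired, never rewritten.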
This article covers various security considerations for building the cloud foundation within an enterprise. This is an ongoing exercise which I will try to continuously improve.
1. Updated hub-spoke model with security components