Stories by Juan Matías de la Cámara Beovide on Medium

Do you want light? Left the fire, embrace a dynamo

Juan Matías de la Cámara Beovide — Tue, 20 Jan 2026 14:07:25 GMT

TL;DR

Moving from Firebase to AWS DynamoDB often hits a roadblock: replicating Firebase’s granular, per-user security rules.

At binbash, we’ve developed a research project, research-dynamodb-access-rules, which demonstrates how to implement fine-grained access control in DynamoDB using IAM policy variables to match the ease of Firebase’s security model.

The Story: From “It Just Works” to “How Do We Scale?”

Meet Alex and Sam, two developers who started a small project on GCP using Firebase. For their MVP, Firebase was a dream; the declarative access rules allowed them to secure user data with just a few lines of code. It was the perfect environment for rapid prototyping.

However, as their project evolved into a full-scale application, their needs changed. They required the broader enterprise ecosystem, advanced networking, and cost-efficiency at scale that AWS provides. The decision was made: it was time to migrate to Amazon DynamoDB.

The transition seemed straightforward until they hit a wall. In Firebase, saying “only the owner can read this record” is simple. In AWS, managing individual permissions for thousands of users via IAM felt like a daunting architectural shift. They didn’t want to build a complex middleware proxy just to handle basic data security.

We, at binbash, partnered with them to find a solution. We didn’t want Alex and Sam — or any developer — to lose the simplicity of “rules-based” access just because they moved to a more robust cloud provider.

Our research focused on using IAM Policy Variables and Condition Keys to mimic the Firebase experience directly within AWS. By mapping Cognito identities to DynamoDB partition keys, we showed that you can achieve the same “owner-only” or “group-based” access without the overhead of a custom middleware layer.

Explore the Research

You can find the full implementation, Leverage (OpenTofu/Terraform) configurations, and access rule logic in our official repository:

Repository: binbashar/le-tf-infra-aws.

Step-by-Step: Reproducing the Access Rules

1. Provisioning the Infrastructure

Alex and Sam started by using binbash Leverage (OpenTofu/Terraform) to deploy the base environment. This included the DynamoDB table and the necessary Cognito Identity Pools.

2. Setting up Cognito Identities

To mimic Firebase Auth, they set up an Amazon Cognito Identity Pool. This allows users to sign in and receive temporary AWS credentials.

Ensure users are assigned a unique Cognito Identity ID. This ID is the “Source of Truth” for their data ownership.
Since they already had a user base in Firebase, which does not provide an "export user" method, we proposed that Alex and Sam use this user migration solution.

3. Crafting the “Firebase-style” IAM Policy

This is where the magic happened. Instead of writing a policy for every user, they wrote one dynamic policy using IAM Policy Variables.

They applied a condition to the DynamoDB GetItem and PutItem actions:

"dynamodb:LeadingKeys": ["${cognito-identity.amazonaws.com:sub}"]

This rule ensures the user can only access items where the Partition Key matches their own Cognito Sub ID.
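
As a rough illustration, the single dynamic policy described above could be assembled as follows. This is a minimal sketch, not the policy from the repository: the table ARN and the function name `build_owner_only_policy` are hypothetical, and the `ForAllValues:StringEquals` operator is assumed because it is the one AWS documents for `dynamodb:LeadingKeys`. The policy variable is emitted literally; IAM resolves it per request.

```python
import json

# Hypothetical table ARN, used only for illustration.
TABLE_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/UserData"


def build_owner_only_policy(table_arn: str) -> dict:
    """Return an IAM policy document limiting GetItem/PutItem to the caller's own partition key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],
                "Resource": table_arn,
                "Condition": {
                    # Assumed operator: every partition key in the request
                    # must match the caller's Cognito identity ID.
                    "ForAllValues:StringEquals": {
                        "dynamodb:LeadingKeys": [
                            "${cognito-identity.amazonaws.com:sub}"
                        ]
                    }
                },
            }
        ],
    }


policy = build_owner_only_policy(TABLE_ARN)
print(json.dumps(policy, indent=2))
```

The resulting document would be attached to the role that the Cognito Identity Pool hands out to authenticated users, so one policy covers every user.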

4. Data Partitioning

Sam adjusted the application logic to ensure that every time a record was created, the Partition Key (PK) was automatically set to the user’s unique Identity ID. This mirrored the way they used to structure collections in Firebase.
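
The application-side change can be sketched roughly like this. It assumes the table uses an attribute named `PK` as its partition key (as in the text) plus a hypothetical sort key `SK`; the helper name `build_item` and the identity ID shown are made up for illustration.

```python
def build_item(identity_id: str, record_id: str, payload: dict) -> dict:
    """Build an item whose partition key is always the caller's Cognito identity ID."""
    if not identity_id:
        raise ValueError("identity_id is required")
    item = dict(payload)
    # Overwrite any PK/SK found in the payload so the key always
    # reflects the authenticated caller, never client-supplied data.
    item["PK"] = identity_id
    item["SK"] = record_id  # hypothetical sort key
    return item


# Made-up identity ID in the region:uuid shape Cognito uses.
identity = "us-east-1:11111111-2222-3333-4444-555555555555"
item = build_item(identity, "note#1", {"title": "hello", "PK": "someone-else"})
print(item)
```

The IAM condition remains the actual enforcement point; this helper only makes sure well-behaved writes land under the key the policy allows.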

5. Verification and Testing

Finally, they used the AWS CLI and a small test script (included in the repo) to verify:

Success: User A can read/write their own data.
Failure: User A receives an AccessDeniedException when trying to access User B’s Partition Key.
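
The two expected outcomes can be modelled locally before running anything against AWS. The sketch below is not the test script from the repo; it is a simplified stand-in for how a `ForAllValues:StringEquals` condition on `dynamodb:LeadingKeys` is assumed to behave, with made-up identity IDs.

```python
from typing import Iterable


def leading_keys_allowed(caller_identity_id: str, requested_keys: Iterable[str]) -> bool:
    """Local model of a ForAllValues:StringEquals check on dynamodb:LeadingKeys."""
    # Every partition key in the request must equal the caller's identity ID.
    return all(key == caller_identity_id for key in requested_keys)


# Made-up identity IDs for two users.
USER_A = "us-east-1:aaaaaaaa-0000-0000-0000-000000000001"
USER_B = "us-east-1:bbbbbbbb-0000-0000-0000-000000000002"

own_data = leading_keys_allowed(USER_A, [USER_A])    # User A reads their own partition
other_data = leading_keys_allowed(USER_A, [USER_B])  # User A targets User B's partition
print(own_data, other_data)
```

A real verification still needs to go through temporary Cognito credentials, since only IAM can confirm that the deployed policy produces the AccessDeniedException.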

Conclusion

Migrating from Firebase to our DynamoDB-based solution is a strategic move for teams looking to grow.

Granular Security at Scale: You maintain the fine-grained control of Firebase while leveraging AWS’s global infrastructure.
Reduced Complexity: By using IAM policy variables, you eliminate the need for custom “security layers” in your application code.
Enterprise Integration: It places your data security within the AWS IAM framework, making it easier to pass security audits and integrate with other AWS services.
Cost Efficiency: DynamoDB’s pricing model, combined with direct client-to-database access patterns (secured by these rules), can significantly reduce operational costs for high-traffic apps.

By bridging the gap between Firebase’s simplicity and AWS’s power, we’ve ensured that scaling your infrastructure doesn’t mean compromising on your development velocity.

Do you want light? Left the fire, embrace a dynamo was originally published in binbash on Medium, where people are continuing the conversation by highlighting and responding to this story.

Staff Augmentation vs Managed Services

Juan Matías de la Cámara Beovide — Fri, 02 Jan 2026 13:46:04 GMT

Beyond “Hands on Deck”: Why Managed Services Outperforms Staff Augmentation

Why outcomes don’t come from adding people
You want to renovate your house.
Not because it’s falling apart.
Because it no longer fits the way you live.
Walls that once made sense now interrupt movement. Electrical decisions made years ago feel risky. The house still stands, but it’s clearly the product of assumptions that no longer hold.

At this point, you’re offered a choice.

You can rent excellent tools. Professional-grade. The same ones real builders use. They arrive on time, well maintained, ready to go. And the moment they’re dropped at your door, something subtle happens: the responsibility shifts. You’re now the architect, the foreman, the project manager, and — when things don’t go exactly as planned — the person explaining why.

If a wall ends up crooked, the tools won’t argue. They did exactly what they were asked to do.

Or you can hire a general contractor. Someone who brings tools, yes, but also plans, sequencing, experience, and accountability. Someone who understands that a renovation isn’t a collection of isolated tasks, but a system where early decisions quietly shape everything that follows. You explain what you want the space to become. They take responsibility for getting you there.

That difference is the cleanest way to understand Staff Augmentation versus Managed Services.

Staff Augmentation is renting the tools.
Managed Services is entering a partnership.

Staff Augmentation tends to be framed as speed. A fast way to fill a gap. A pragmatic response to pressure. “We just need another engineer.” The sentence sounds harmless. Almost obvious.
But what usually follows is less obvious.

That engineer doesn’t arrive into a vacuum. They arrive into a system that already has architecture, habits, blind spots, and unspoken assumptions. And suddenly, the client becomes responsible not only for what needs to be done, but for how, in what order, with which trade-offs, and toward what long-term direction.

The augmented resource executes. The burden of thinking, aligning, correcting, and absorbing risk stays on the client.
Managed Services starts from a different premise: that most technical problems aren’t caused by a lack of hands, but by a lack of shared responsibility.

In a Managed Services model, you don’t hire an individual. You form an alliance.
An alliance means the provider doesn’t show up asking, “What ticket should I work on?”
They show up asking, “What outcome are we accountable for?”

Execution is still there — good execution, disciplined execution — but it’s supported by technical leadership, delivery ownership, and a roadmap that exists to be challenged, refined, and defended. Architecture is part of the conversation from day one, not something you circle back to once things hurt.

This is where the difference between a partner and a provider becomes impossible to ignore.

A provider wears their own jersey. Their incentive is clear: maximize utilization, log hours, stay busy. If priorities are unclear, if decisions are short-sighted, if technical debt piles up, that’s unfortunate — but it’s not their problem.
A partner wears the project’s jersey.
In a real partnership, success and failure are shared. If something isn’t working, it gets surfaced early. If a decision feels wrong, it gets challenged. If the roadmap needs to change, it changes — not because hours are being sold, but because outcomes matter.
That shared accountability changes how risk behaves.
With Staff Augmentation, risk is quietly transferred to the client. If the engineer struggles, the client manages it. If knowledge concentrates in one person, the client absorbs it. If the system degrades over time, the client owns the consequences.
With Managed Services, risk is designed into the relationship and actively managed. Delivery managers and tech leads exist for one reason: to make sure progress doesn’t depend on heroics, guesswork, or institutional memory living in one head.
This is why Managed Services often feels less noisy.

There are fewer daily clarifications. Fewer tactical interruptions. Fewer moments where progress stalls because no one is quite sure who should decide. Instead of micromanaging tasks, clients participate in planning. Instead of tracking activity, they review direction.

And despite appearances, this model tends to be more flexible.
Because Managed Services is built around evolving roadmaps rather than fixed roles, it adapts to change without renegotiating reality every month.

Capacity shifts. Focus changes. New constraints appear and are incorporated instead of resisted.
Staff Augmentation, for all its apparent simplicity, often locks organizations into rigidity: fixed hours, fixed individuals, fixed expectations — until the business inevitably moves somewhere else.

In the end, the distinction is not about talent. Great engineers exist everywhere.

The distinction is about intent.

Staff Augmentation is transactional.
Managed Services is relational.

One gives you people.
The other gives you a team that commits to where you’re going.

If what you need is temporary capacity, renting tools might be enough.
If what you need is a system that holds as you grow, partnerships tend to age better.

Staff Augmentation vs Managed Services was originally published in binbash on Medium.

SOC2, ISO27001 and how to meet security

Juan Matías de la Cámara Beovide — Thu, 11 Sep 2025 14:57:42 GMT

At binbash, an Advanced AWS Partner, our mission is to empower organizations like yours to build secure, efficient, and compliant cloud infrastructures. Today, we’re going to dive into a critical topic for many of our clients: achieving and maintaining compliance with ISO 27001 and SOC 2 in your existing AWS environments, and how binbash Leverage can be your accelerator on this journey.

TL;DR

Fintechs and Healthtechs must adhere to standards such as SOC 2 and ISO/IEC 27001. With the AWS Well-Architected and binbash Leverage frameworks, tech teams can deploy well-architected, secure-by-design infrastructure that follows best practices and recommendations up to 10 times faster.

A Brief Overview

Before we detail how to meet these standards, and just in case you are not aware, let’s quickly summarize what they entail:

ISO/IEC 27001: This is an international information security standard that specifies the requirements for establishing, implementing, maintaining, and continually improving an Information Security Management System (ISMS). Its core purpose is to help organizations systematically examine their information security risks — considering threats, vulnerabilities, and impacts — and then design and implement a coherent suite of information security controls or other risk treatments. It ensures that security efforts are not disjointed but are part of an overarching management process that continually meets the organization’s needs. Organizations with an ISMS that meets the standard’s requirements can choose to have it certified by an accredited certification body.
SOC 2 (System and Organization Controls 2, originally “Service Organization Control 2”): Developed by the American Institute of CPAs (AICPA), SOC 2 is an auditing procedure that assesses whether your service providers securely manage your data to protect your organization’s interests and the privacy of its clients. For security-conscious businesses, SOC 2 compliance is a minimum requirement when considering a SaaS provider. It defines criteria for managing customer data based on five “Trust Service Principles” (now called Trust Services Criteria): Security, Availability, Processing Integrity, Confidentiality, and Privacy. Unlike more rigid standards, SOC 2 reports are unique to each organization, as each one designs its own controls to comply with one or more of these principles.

Disclaimer

It is crucial to understand that merely implementing well-architected infrastructure and following good technical practices, while foundational, is not sufficient on its own to achieve full organizational compliance, such as with ISO/IEC 27001 or SOC 2.

Achieving comprehensive compliance is a broader organizational endeavor that mandates high involvement across the entire organization. This includes:

Creating organization-wide procedures and policies that govern information security across all aspects of the business.
Adopting an overarching management process to ensure that information security controls continually meet the organization’s evolving needs.
Promoting a culture that consistently honors these standards and practices.

Key Compliance Items for ISO 27001 & SOC 2

To achieve compliance with these standards, organizations must address several specific areas. Based on our understanding of the standards and AWS best practices, here are some critical items:

Systematic Risk Management:

- ISO 27001 requires management to systematically examine information security risks by taking into account threats, vulnerabilities, and impacts, and then designing/implementing controls to address unacceptable risks.
- SOC 2’s Security principle inherently demands robust risk assessment to protect system resources against unauthorized access and potential abuse.

Robust Access Control & Identity Management:

- Both standards emphasize restricting access to information and systems to authorized personnel and processes.
- Specifics include: Implementing minimum privilege (least privilege) for all users and systems, enforcing Multi-Factor Authentication (MFA), especially for administrators and privileged accounts, defining roles based on departments or functions, and establishing conditional policies for access based on factors like location, time, or device.

Comprehensive Data Protection:

- ISO 27001 focuses on protecting the confidentiality and integrity of data, the same goal that drives the Security pillar of the AWS Well-Architected Framework.
- SOC 2’s Confidentiality and Privacy principles explicitly require restricting data access and disclosure to specified persons or organizations, as well as protecting Personally Identifiable Information (PII) from unauthorized access. Encryption is a crucial control for this.

Multi-Level Network and Application Security:

- Protecting against external and internal threats is paramount.
- Specifics include: Deploying Web Application Firewalls (WAFs) with OWASP-based rules to protect against common web attacks, implementing DDoS protection for critical services, utilizing VPC with private subnets for isolating critical resources, and configuring restrictive Security Groups to control traffic flow at the instance level.

Proactive Operational Monitoring & Event Detection:

- ISO 27001 highlights running and monitoring systems and establishing controls to detect security events.
- SOC 2’s Security and Availability principles involve continuous monitoring of network performance and availability, as well as robust security incident handling.

System Reliability & Availability:

- ISO 27001 addresses availability and continuity; in AWS terms this maps to the Well-Architected Reliability pillar, which focuses on ensuring workloads perform their intended functions and can recover quickly from failures.
- SOC 2’s Availability principle directly addresses the accessibility of systems, products, or services as stipulated by contracts or Service Level Agreements (SLAs).

Commitment to Continual Improvement:

- ISO 27001 promotes a culture of continual improvement in information security practices, including regular monitoring, performance evaluation, and periodic reviews to adapt to evolving threats and enhance ISMS effectiveness. This is an ongoing management process.

How AWS + binbash Leverage Address These Compliance Items

We combine the power of AWS’s robust services with our Open Source binbash Leverage framework to help you meet these stringent compliance requirements efficiently and effectively. Our approach is strongly based on and extends the AWS Well-Architected Framework, which provides best practices across its six pillars, including Security and Operational Excellence.

Here’s how we tackle the aforementioned compliance items:

Foundation: AWS Well-Architected Framework Integration:

- Every solution we deploy using binbash Leverage is designed to be Well-Architected out-of-the-box, ensuring that your infrastructure is secure, reliable, efficient, and cost-effective from day one. This built-in adherence to best practices significantly reduces the effort required for compliance audits.

Systematic Risk Management & Controls:

- AWS Well-Architected Tool: AWS provides a tool to regularly evaluate workloads, identify high-risk issues, and record improvements, directly supporting ISO 27001’s risk assessment and continuous improvement needs.
- binbash Leverage Reference Architecture: Our architecture defines opinionated conventions for organizing files and managing configurations, incorporating optimal, secure configurations for modern applications. This proactive approach helps mitigate risks by embedding security into the design.

Robust Access Control & Identity Management:

- AWS IAM and SSO with Leverage: We leverage AWS IAM and SSO (Identity Center) to implement fine-grained access controls, enabling the principle of minimum privilege. binbash Leverage automates the configuration of IAM and SSO with minimum privilege, implements mandatory MFA, and defines roles based on departments and conditional policies, directly addressing SOC 2 and ISO 27001 access control requirements.
- Multi-Account Strategy: Our framework promotes a multi-account approach within AWS Organizations, which inherently enhances security isolation and resource separation, crucial for managing access and blast radius.

Comprehensive Data Protection:

- AWS Security Pillar: This pillar guides our approach to protecting the confidentiality and integrity of your data, ensuring robust data protection measures are in place.
- Leverage’s IaC Library & Secret Management: binbash Leverage provides an Infrastructure as Code (IaC) Library of reusable, tested, production-ready solutions that include secure configurations for data storage and transmission. We also include features for secure secret management and handling security keys, facilitating encryption and protecting sensitive information.

Multi-Level Network and Application Security:

- Perimeter & Network Security (Level 1): binbash Leverage deploys robust perimeter security measures, including AWS WAF with OWASP rules and CloudFront with DDoS protection, providing multi-layered defense against web attacks and ensuring secure content delivery. We utilize VPC with private subnets for effective resource isolation and configure restrictive Security Groups to precisely control network traffic, aligning with best practices for network segmentation.
- Our solutions align with the principles of the AWS Security Reference Architecture (SRA), integrating various AWS security services for comprehensive protection of your workload.

Proactive Operational Monitoring & Event Detection:

- Operational Excellence Pillar: This pillar, integral to our framework, focuses on running and monitoring systems effectively and establishing mechanisms to respond to events.
- Features: binbash Leverage includes built-in components for monitoring, metrics, logs, tracing, and APM (Application Performance Monitoring). These tools are essential for continuous visibility into your environment, enabling timely detection of security events and ensuring overall system health.

System Reliability & Availability:

- Reliability Pillar: Leverage adheres to the AWS Reliability pillar, focusing on distributed system design, recovery planning, and adapting to changing requirements.
- Features: Our framework directly supports the implementation of high-availability and disaster recovery solutions, ensuring your systems remain accessible and resilient, meeting SOC 2’s Availability principle.

Commitment to Continual Improvement:

- Infrastructure as Code (IaC): A cornerstone of binbash Leverage is its IaC Library, built with OpenTofu/Terraform, Ansible, Helm charts, Dockerfiles, and Makefiles. This enables consistent, repeatable deployments, easy version control, and rapid updates, directly facilitating the “continual improvement” cycle mandated by ISO 27001.
- “Dev First” Approach: Our recommended “Dev First” approach during migration fosters knowledge transfer within your team, building confidence and a deeper understanding of AWS best practices. This continuous learning and empowerment are vital for ongoing maintenance and improvement.
- binbash Leverage helps you to secure your cloud assets and production workloads and achieve compliance in AWS. The best part? We can get your project up and running on AWS up to 10 times faster than traditional consulting methods.

From Complexity to Compliance

Achieving and maintaining compliance with stringent standards like SOC 2 and ISO/IEC 27001 can indeed feel like a monumental task. Embarking on this journey from scratch is often a long, complex path fraught with potential missteps and inefficiencies.

However, by building your project upon very well-known good practices and powerful frameworks, you can significantly accelerate and simplify this process. The AWS Well-Architected Framework provides a robust set of best practices and guiding principles for designing and running reliable, secure, efficient, cost-effective, and sustainable cloud systems, organized into six core pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability.

Extending these principles, binbash Leverage is our framework that condenses years of acquired knowledge and experience into an ecosystem of code, tools, and workflows. It’s specifically designed to help organizations build, provision, and manage their AWS infrastructure quickly and securely.

binbash Leverage puts you on track to secure your cloud assets, optimize costs by design, and achieve a compliant baseline in AWS up to 10x faster than traditional methods.

While binbash Leverage lays an incredibly strong and accelerated technical foundation for meeting these standards, it is essential to remember that full compliance with SOC 2 or ISO/IEC 27001 is a holistic organizational commitment. Our framework empowers you with the secure infrastructure and efficient practices, but the broader governance and continuous organizational adherence are mandatory for complete certification.

Ready to move faster? Talk to our team

SOC2, ISO27001 and how to meet security was originally published in binbash on Medium.

How a Fintech’s 30% Cost Optimization Redefined Its Future

Juan Matías de la Cámara Beovide — Wed, 08 Jan 2025 13:47:31 GMT

When business met growth

Scaling a business is often celebrated as a sign of success.

Yet, beneath the surface, growth brings its own set of challenges. Operational inefficiencies, rising costs, and the pressure to meet ever-changing customer expectations can quickly erode margins and stifle innovation.

For Flexibility, a key player in fintech and digital banking solutions based in Argentina, these growing pains had reached a tipping point.

It was clear that their growth would become unsustainable without a radical shift in approach.

Co-authored by Exequiel Barrirero and Juan Matias (KungFoo) De la Camara

As the co-founder and CEO of binbash, an AWS Advanced Partner specializing in startups, I’m proud to share this transformation story. Together with KungFoo, we crafted this article to highlight the power of cloud technology and strategic partnerships in driving sustainable growth. This case study showcases the profound impact of leveraging AWS solutions to address real-world business challenges.

Back to the Future

This is the story of how Flexibility transformed its operations, cutting costs by 30% and positioning itself for long-term success — all through the power of cloud technology and strategic partnership.

This was Flexibility’s infrastructure at the start of the project.

Standing on the Shoulders of Giants

The company’s infrastructure was a patchwork of static systems designed for a simpler time. Fixed capacity meant that during peak demand, the system struggled, leading to service interruptions. Conversely, during quieter periods, resources sat idle, wasting money. The lack of flexibility not only hindered performance but also left the team constantly firefighting.

A New Hope

binbash entered the scene with a clear plan: transform Flexibility’s infrastructure into a dynamic, cloud-native environment. The strategy focused on leveraging AWS tools to address the core issues at hand.

Environment Consolidation & Scaling

The first step was implementing AWS Auto Scaling with the Kubernetes Cluster Autoscaler for Amazon EKS, which enabled the infrastructure and containers to automatically adjust capacity based on real-time demand. This ensured optimal resource use, eliminating overprovisioning and costly downtime during peak periods. Complementing this, Amazon EC2 Spot Instances were introduced to further optimize costs by using spare EC2 capacity.

Subsequently, we consolidated multiple lower-tier environment accounts, including Dev, QA, and Stg, into a single multi-tenant (multi-environment) Kubernetes EKS cluster. This approach significantly reduced resource duplication and operational costs while streamlining management and improving efficiency.

Dev/Stg & QA Consolidation: Streamlining development, staging, and quality assurance environments improved efficiency and resource utilization, and left fewer clusters to administer.

Visibility

Flexibility’s previous lack of system visibility was another critical issue. binbash deployed Amazon CloudWatch, Prometheus, and Grafana to provide real-time monitoring and alerting. These tools gave the operations team comprehensive insights into system health, allowing them to proactively address potential problems before they affected users.

Automation & EKS Upgrades

To reduce manual intervention, binbash introduced binbash Leverage™ from the very beginning of the project. Tasks such as EKS infrastructure deployment and upgrades, which once consumed valuable time, were now automated. Additionally, Infrastructure as Code (IaC) using Terraform through our Leverage Reference Architecture for AWS was adopted, streamlining the deployment and management of AWS resources while ensuring consistency across environments.

Kubernetes Clusters Upgrade: Completing upgrades for all clusters to version >1.26 ensured the latest security patches and functionalities were in place.

Optimization 1 | Architecture w/ Spot EKS Nodes for Dev & QA

Cost optimization

Cost control was a central focus. By analyzing usage data through AWS Cost Explorer and implementing AWS Reserved Instances, Flexibility achieved significant savings. We also implemented AWS Budgets and Billing Alerts for cost monitoring. These tools provided detailed insights into spending patterns, allowing for smarter financial planning and resource allocation.

Cost Optimization: Implementing Reserved Instances resulted in a significant reduction of 30% in overall cloud expenditure.

Optimization 2 | This is Flexibility’s Architecture w/ DevStg & QA consolidated in a single account VPC + EKS + RDS engine

Security & Compliance

Security, an essential aspect of any fintech and digital banking platform, was strengthened with AWS IAM Identity Center (formerly AWS SSO) and AWS Identity and Access Management (IAM). binbash implemented granular access controls and multi-factor authentication to safeguard sensitive data, aligning with industry regulations and boosting customer trust.

To infinity and beyond

The transformation didn’t stop at technology. binbash worked closely with Flexibility’s teams to ensure a smooth transition, providing training and support to embed new processes. This partnership approach fostered a culture of continuous improvement and innovation.

The results were transformative. Flexibility not only reduced operational costs by 30% but also enhanced service reliability and speed.

Customers noticed the difference, and the company solidified its market position.

Internally, the shift from reactive to proactive operations empowered teams to focus on strategic growth initiatives.

Crucially, the project included a robust knowledge transfer component, empowering Flexibility’s team to maintain and operate the upgraded infrastructure autonomously.

A Walk in the Cloud

Flexibility’s journey is a testament to how embracing cloud technology and strategic partnerships can drive meaningful change. This transformation not only optimized costs and performance but also redefined the company’s trajectory toward sustainable growth.

For organizations navigating similar challenges, this case serves as a beacon: innovation and adaptability are not just options — they are necessities in today’s fast-paced digital landscape. With the right tools and a clear strategy, the future is not just scalable; it’s limitless.

How a Fintech’s 30% Cost Optimization Redefined Its Future was originally published in binbash on Medium.

How to update EKS node types?

Juan Matías de la Cámara Beovide — Mon, 11 Dec 2023 14:26:07 GMT

While using Leverage, creating an EKS cluster becomes a straightforward process. To guide you through this, please refer to this link. This scenario involves upgrading the node instance types.

Workloads considerations

Ideal scenario

In an optimal situation, all workloads should be scalable. This implies that all workloads managed through ArgoCD Application’s Rollouts, the default in binbash Leverage, should be capable of upscaling. This means creating multiple pods for each ArgoCD Application to handle increased demand or load.

Non-ideal scenario

However, in instances where certain applications can only accommodate a single instance (i.e., one pod), scaling becomes challenging. In such cases, there may be downtime as the existing pod needs to be terminated, and a new one must be created on the new nodes. This limitation poses challenges in maintaining continuous availability during scaling activities for these specific applications.

Steps

Create a new node group

Go to the `cluster` sublayer under the EKS layer at `///k8s-eks/cluster` . (e.g. here)

Edit the file `eks-managed-nodes.tf`. (e.g. here)

Under the EKS module:

module "cluster" {
 source = "github.com/binbashar/terraform-aws-eks.git?ref=v18.24.1"

there is a node group definition, e.g.

eks_managed_node_groups = {
  on-demand-t3 = {
    min_size       = 2
    max_size       = 10
    desired_size   = 6
    capacity_type  = "ON_DEMAND"
    instance_types = ["t3.large"]
  }
}

(note this is an ON_DEMAND group; it could also be SPOT)

Add a new node group, e.g.:

eks_managed_node_groups = {
  on-demand-t3 = {
    min_size       = 2
    max_size       = 10
    desired_size   = 6
    capacity_type  = "ON_DEMAND"
    instance_types = ["t3.large"]
  }
  on-demand-t3xlarge = {
    min_size       = 1
    max_size       = 6
    desired_size   = 3
    capacity_type  = "ON_DEMAND"
    instance_types = ["t3.xlarge"]
  }
}

(Adjust the values to your needs. Here I’m assuming a t3.xlarge instance has twice the vCPUs and memory of a t3.large.)
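As a rough illustration of that sizing reasoning, here is a minimal Python sketch. The helper `equivalent_sizes` and the `capacity_ratio` parameter are hypothetical names made up for this example; it simply divides the old node counts by how much bigger each new node is and rounds up.

```python
import math


def equivalent_sizes(old_sizes: dict, capacity_ratio: float) -> dict:
    """Scale node counts down by how much bigger each new node is (rounding up)."""
    return {
        key: max(1, math.ceil(count / capacity_ratio))
        for key, count in old_sizes.items()
    }


old_group = {"min_size": 2, "max_size": 10, "desired_size": 6}

# Assumption: one t3.xlarge replaces two t3.large nodes.
print(equivalent_sizes(old_group, capacity_ratio=2))
```

With these inputs the sketch yields a minimum of 1, a maximum of 5 and a desired size of 3. The node group in this article uses a maximum of 6 instead of 5, which leaves a little extra headroom.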

Apply the layer and wait for the group to be created.

Cordon the old group

Once the new group is created and the new nodes are in place, the old group has to be cordoned so that no new pods are scheduled on its nodes.

First, under the `cluster` layer, log into the EKS cluster:

❯ leverage kubectl configure

Get the node groups:

❯ leverage aws --profile --devops --region  eks list-nodegroups --cluster-name 
{
  "nodegroups": [
    "on-demand-t3-20220803130748876100000008",
    "on-demand-t3xlarge-20231130160430016700000001"
  ]
}

Note here there are two groups, the old one and the new one.
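If you would rather tell the old group from the new one programmatically, the generated suffix can help. The sketch below is only an illustration in Python: it assumes the suffix appended to each node group name starts with a `YYYYMMDDhhmmss` creation timestamp (which is what the names above look like), and the helper names `creation_stamp` and `oldest_first` are made up for this example.

```python
import json


def creation_stamp(nodegroup: str) -> str:
    """Return the first 14 digits of the generated suffix (assumed to be a timestamp)."""
    suffix = nodegroup.rsplit("-", 1)[-1]
    return suffix[:14]


def oldest_first(list_nodegroups_json: str) -> list:
    """Sort the names returned by `eks list-nodegroups` from oldest to newest."""
    groups = json.loads(list_nodegroups_json)["nodegroups"]
    return sorted(groups, key=creation_stamp)


sample = """
{
  "nodegroups": [
    "on-demand-t3xlarge-20231130160430016700000001",
    "on-demand-t3-20220803130748876100000008"
  ]
}
"""

print(oldest_first(sample)[0])  # the group to cordon and later remove
```

Always double check the result against the node group names in your Terraform code before cordoning anything.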

Now cordon the nodes in the old node group:

❯ for n in $(leverage kubectl get nodes --selector eks.amazonaws.com/nodegroup=on-demand-t3-20220803130748876100000008 | awk 'FNR>5 {print $1}'); do leverage kubectl cordon $n; done

You can check that the old nodes are cordoned with this command:

❯ leverage kubectl get nodes --selector eks.amazonaws.com/nodegroup=on-demand-t3-20220803130748876100000008
[16:58:55] INFO Attempting to get temporary credentials for apps-dev account.
[16:58:56] INFO Using already configured temporary credentials.
[16:58:56] INFO Attempting to get temporary credentials for shared account.
[16:58:57] INFO Using already configured temporary credentials.
NAME STATUS ROLES AGE VERSION
ip-10-0-13-118.ec2.internal Ready,SchedulingDisabled  13d v1.24.17-eks-43840fb
ip-10-0-40-140.ec2.internal Ready,SchedulingDisabled  20d v1.24.17-eks-43840fb
ip-10-0-52-96.ec2.internal Ready,SchedulingDisabled  20d v1.24.17-eks-43840fb
ip-10-0-6-43.ec2.internal Ready,SchedulingDisabled  13d v1.24.17-eks-43840fb

Names can vary, but note the SchedulingDisabled status: it means the nodes are cordoned, i.e. no new pods will be scheduled on them.
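The `awk 'FNR>5 {print $1}'` filter used in the cordon loop relies on the output starting with exactly four log lines plus the header. If the number of log lines changes, a filter based on content is more robust. Here is a minimal Python sketch of that idea; the function name `node_names` is hypothetical, and it assumes log lines start with `[` as in the output shown above.

```python
def node_names(output: str) -> list:
    """Extract node names from `leverage kubectl get nodes` output."""
    names = []
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, log lines such as "[16:58:55] INFO ...", and the header row.
        if not fields or line.startswith("[") or fields[0] == "NAME":
            continue
        names.append(fields[0])
    return names


sample_output = """\
[16:58:55] INFO Attempting to get temporary credentials for apps-dev account.
[16:58:56] INFO Using already configured temporary credentials.
NAME STATUS ROLES AGE VERSION
ip-10-0-13-118.ec2.internal Ready,SchedulingDisabled  13d v1.24.17-eks-43840fb
ip-10-0-40-140.ec2.internal Ready,SchedulingDisabled  20d v1.24.17-eks-43840fb
"""

for name in node_names(sample_output):
    print(name)
```

The same filter works no matter how many credential messages are printed before the table.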

Prevent the old group from scaling up

To prevent the old group from scaling up, do the following.

Check the number of nodes in the old group. The previous step (checking the cordoned nodes) shows there are 4 nodes.

So, edit the file again and set the max and desired size of the old group to 4.

eks_managed_node_groups = {
  on-demand-t3 = {
    min_size       = 2
    max_size       = 4
    desired_size   = 4
    capacity_type  = "ON_DEMAND"
    instance_types = ["t3.large"]
  }
  on-demand-t3xlarge = {
    min_size       = 1
    max_size       = 6
    desired_size   = 3
    capacity_type  = "ON_DEMAND"
    instance_types = ["t3.xlarge"]
  }
}

Apply the layer.

Migrate Applications

Scalable Apps
If Applications can scale up (i.e. run more than one pod), do the following for each app.

- Go to the ArgoCD web console
- Pick the Application
- Turn AutoSync off
- Scale up (double the number of pods, i.e. if there are 2 pods, set the new value to 4)
  - if there is no HPA, change the Replicas value in the Rollout definition
  - if there is an HPA, set the minimum number of replicas in the HPA
- Wait until the new pods are created

Pods being created in the new nodes can be checked with the following command:

for n in $(leverage kubectl get nodes --selector "eks.amazonaws.com/nodegroup!=on-demand-t3-20220803130748876100000008" | awk 'FNR>5 {print $1}'); do leverage kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=$n; done

- Once the new pods are created, old pods can be killed.
- Once all the pods for an Application are scheduled in the new nodes the Replicas values can be set back to the original value
- Turn on auto sync

Non-scalable Apps
If an app cannot be scaled:

- Set a Maintenance Window (MW) for the process
- During the MW do the following:
  - Go to the ArgoCD web console
  - Pick the Application
  - Kill the existing pod
  - Wait until the new pod is created

Once all the non-scalable apps are done you can continue with the next step.

Drain nodes

It is recommended to do this node by node manually to keep control over the process.

For each node in the old node group run:

❯ leverage kubectl drain node-name --ignore-daemonsets

If the message “cannot delete Pods with local storage” is shown during this process, it is probably because pods such as “argocd/argocd-image-updater” or “kube-system/coredns” on that node use local (emptyDir) storage.
In this case it is safe to add the flag:

--delete-emptydir-data

Delete the old node group

Edit the file one more time and delete the old node group, so it looks like this:

eks_managed_node_groups = {
  on-demand-t3xlarge = {
    min_size       = 1
    max_size       = 6
    desired_size   = 3
    capacity_type  = "ON_DEMAND"
    instance_types = ["t3.xlarge"]
  }
}

Apply the layer.

Conclusion

This process aims to change the EKS node types with no downtime (or with minimal, controlled downtime).

Note that it is recommended to have one node group per AZ, so that at any given moment there is at least one node in each AZ; this helps ensure HA.
This can be set in the file like this:

  on-demand-t3xlarge-a = {
    min_size       = 2
    max_size       = 10
    desired_size   = 2
    capacity_type  = "ON_DEMAND"
    instance_types = ["t3.xlarge"]
    subnet_ids     = [data.terraform_remote_state.eks-vpc.outputs.private_subnets[0]]
    placement = {
      availability_zone = data.terraform_remote_state.eks-vpc.outputs.availability_zones[0]
    }
  }
  on-demand-t3xlarge-b = {
    min_size       = 2
    max_size       = 10
    desired_size   = 2
    capacity_type  = "ON_DEMAND"
    instance_types = ["t3.xlarge"]
    subnet_ids     = [data.terraform_remote_state.eks-vpc.outputs.private_subnets[1]]
    placement = {
      availability_zone = data.terraform_remote_state.eks-vpc.outputs.availability_zones[1]
    }
  }

Note here we are using Terraform remote states from the EKS `network` sublayer (e.g. here), so something like this is needed in the `config.tf` file:

data "terraform_remote_state" "eks-vpc" {
  backend = "s3"
  config = {
    region  = var.region
    profile = var.profile
    bucket  = var.bucket
    key     = "/k8s-eks/network/terraform.tfstate"
  }
}

How to update EKS node types? was originally published in binbash on Medium, where people are continuing the conversation by highlighting and responding to this story.

AWS Lambda: the “two updating-points” problem

Juan Matías de la Cámara Beovide — Thu, 06 Jul 2023 14:36:18 GMT

Overview

In this article, we’ll discuss a solution for managing Lambda functions effectively using Terraform within the AWS Well-Architected Framework. It addresses the challenge of handling multiple sources for Lambda functions and ensures that updating the infrastructure doesn’t overwrite changes made by developer repositories. By leveraging lifecycle management and appropriate configuration, developers can update Lambda code independently while maintaining the integrity of the infrastructure.

In our architecture, we utilize an infrastructure repository powered by binbash Leverage, which operates Terraform in the background. The objective is to create a Lambda function and grant it access to specific resources such as SQS, SNS, and buckets, all provisioned through the Leverage Infrastructure repository. Additionally, developers should have the capability to update the Lambda code from their own repositories.

The problem

The challenge arises from having two different sources for the Lambda function. One source is the Leverage Infrastructure repository, responsible for creating the Lambda function and applying policies. The other source is the developers’ repository, where code updates are made. To prevent overwriting the Lambda function during subsequent application of the Leverage Infrastructure repository, a solution is needed.

Facts

So, summing up, these are the facts:

There is an Infrastructure repository (binbash Leverage)
There is a Lambda code repository
The Lambda should be created during the infrastructure creation so the permissions can be easily applied
The Lambda code should be updated when new code is pushed into the Lambda code repository
The Lambda code should not be updated if the Leverage Infrastructure repository is applied again

The solution

This is where Terraform’s lifecycle management comes in handy.

The approach involves creating the Lambda function in a way that fetches the code from an S3 bucket. Additionally, we instruct Terraform to ignore certain values that can be modified by developers, such as environment variables, memory size, and timeout. Because these attributes are left out when Terraform compares the configuration with the deployed function, changes made to them by developers do not show up as pending updates.

Here’s an example of the lifecycle block in Terraform:

  lifecycle {
    ignore_changes = [
      environment,
      memory_size,
      timeout
    ]
  }

Since the function pulls its code from a ZIP file in S3, and the infrastructure layer does not track the contents of that ZIP, there is no need to tell Terraform to ignore the code: it is not taken into account when evaluating the update. (To force Terraform to update the Lambda when the zipped code changes, we should use source_code_hash, which we’ll use later.)

update the lambda function from the dev repository
try to update from infra, it should show no changes to apply

Applying the solution

The complete solution is shown in this binbash Leverage library experimental branch.

Here there are just the needed parts for the demo.

All the files are under the layers apps-devstg/us-east-1/aws-lambda-poc and apps-devstg/us-east-1/aws-lambda-poc-lambda-update (in the binbash Leverage context, we call the leaves of this directory tree layers).

Creating the Lambda

For this, an S3 object is created, then the lambda should pull code from there.

Note the lambda has the lifecycle directive:

resource "aws_s3_object" "initial-lambda" {
  bucket = aws_s3_bucket.lambda.bucket
  key    = "bb-lambda-test-test"
  source = "initial-lambda.zip"
}

resource "aws_lambda_function" "func" {
  depends_on = [
    aws_s3_object.initial-lambda,
  ]

  function_name = "bb-lambda-test-test"
  role          = aws_iam_role.lambda.arn
  handler       = "lambda_function.lambda_handler"
  runtime       = "python3.10"
  memory_size   = 256
  timeout       = 600

  s3_bucket = aws_s3_bucket.lambda.bucket
  s3_key    = "bb-lambda-test-test"

  environment {
    variables = local.lamba_environment_variables
  }

  lifecycle {
    ignore_changes = [
      environment,
      memory_size,
      timeout
    ]
  }
}

initial-lambda.zip contains dummy code.

Apply this layer and a Lambda with dummy code is created.

Updating the lambda

This update can be done by a few means, in this case Terraform is being used (AWS CLI and manual changes can be used as well). Note that by using Terraform for doing this, first the lambda function has to be imported into the current state. It is as easy as:

leverage tf import aws_lambda_function.func "bb-lambda-test-test"

And this is the actual Terraform code; the file updated-lambda.zip contains the code updated by the developer:

resource "aws_s3_object" "initial-lambda" {

    bucket = "bb-lambda-test"
    key    = "bb-lambda-test-test-update/updated-lambda.zip"
    source = "updated-lambda.zip"
    # S3 ETags are MD5 digests for simple (non-multipart, non-KMS) uploads, so compare against filemd5
    etag   = filemd5("updated-lambda.zip")

}

resource "aws_lambda_function" "func" {

    depends_on = [ aws_s3_object.initial-lambda ]

    function_name = "bb-lambda-test-test"
    handler       = "lambda_function.lambda_handler"
    runtime       = "python3.10"

    role             = "arn:aws:iam::523857393444:role/bb-lambda-test-sts"
    source_code_hash = filebase64sha256("updated-lambda.zip")
    publish          = true
    timeout          = 60
    memory_size      = 128

    s3_bucket = "bb-lambda-test"
    s3_key    = "bb-lambda-test-test-update/updated-lambda.zip"


    environment {

      variables = local.lamba_environment_variables

    }

}

Note in these Terraform resources the following elements are being used:

etag for the S3 object
source_code_hash for the lambda function

In this code, the etag and source_code_hash attributes ensure that the resources will be updated when the code changes. After applying this update, the Lambda code will be successfully updated.
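To make these two values less magical, here is a small Python sketch of what they contain. It relies on two facts: Terraform’s `filebase64sha256` returns the Base64-encoded SHA-256 digest of a file, and for a simple (non-multipart, non-KMS) upload S3 reports the hex MD5 digest of the object as its ETag, which is what Terraform’s `filemd5` returns. The function names below are made up for this example.

```python
import base64
import hashlib


def base64_sha256(data: bytes) -> str:
    """Base64 of the raw SHA-256 digest: the format used by source_code_hash."""
    return base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest: what S3 reports as the ETag of a simple upload."""
    return hashlib.md5(data).hexdigest()


def file_hashes(path: str) -> dict:
    """Compute both values for a ZIP file on disk."""
    with open(path, "rb") as handle:
        data = handle.read()
    return {"source_code_hash": base64_sha256(data), "etag": md5_hex(data)}


# Any change in the bytes changes both values, which is what triggers the update.
print(base64_sha256(b"old code") != base64_sha256(b"new code"))
print(md5_hex(b"old code") != md5_hex(b"new code"))
```

Running `file_hashes("updated-lambda.zip")` before and after a code change is a quick way to confirm that Terraform will see a difference.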

Note this step can be easily run in a pipeline in the Lambda code repository.

What can be changed?

As per this directive:

lifecycle {
    ignore_changes = [
      environment,
      memory_size,
      timeout
    ]
  }

…the elements that can be changed safely (i.e. preventing the Leverage Infrastructure repository from updating the object) are:

the lambda code itself (using the ZIP file method)
the environment variables
the memory size
the timeout

By incorporating the lifecycle block mentioned earlier, these elements remain unchanged during the application of the Leverage Infrastructure repository. Therefore, subsequent re-application of the infrastructure layer will not impact the Lambda function.
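The effect of ignore_changes can be pictured with a toy model. The Python sketch below is not how Terraform is implemented; it only mimics the comparison between the configuration and the deployed function, and the function name `planned_changes` is hypothetical. The values come from the two Terraform snippets in this article.

```python
def planned_changes(config: dict, actual: dict, ignore_changes: list) -> dict:
    """Return {attribute: (actual, configured)} for attributes that would be updated."""
    return {
        name: (actual.get(name), wanted)
        for name, wanted in config.items()
        if name not in ignore_changes and actual.get(name) != wanted
    }


# Values set by the infrastructure layer.
infra_config = {"runtime": "python3.10", "memory_size": 256, "timeout": 600}
# Values after the developers' update.
deployed = {"runtime": "python3.10", "memory_size": 128, "timeout": 60}

# With the lifecycle block: nothing to change.
print(planned_changes(infra_config, deployed, ["environment", "memory_size", "timeout"]))

# Without it: the infrastructure layer would revert the developers' values.
print(planned_changes(infra_config, deployed, []))
```

The first call returns an empty result, while the second one reports memory_size and timeout as pending changes.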

Now you can re-apply the Leverage Infrastructure layer without panicking about the code!

Conclusion

By leveraging Terraform’s lifecycle management and appropriately configuring Lambda functions, developers can update Lambda code independently while ensuring the integrity of the infrastructure. This solution allows for efficient management of Lambda functions within the AWS Well-Architected Framework.

AWS Lambda: the “two updating-points” problem was originally published in binbash on Medium, where people are continuing the conversation by highlighting and responding to this story.

Someone changed your terraform!

Juan Matías de la Cámara Beovide — Tue, 10 Nov 2020 19:17:58 GMT

What?

So, someone deployed to production using a terraform file set. Then someone else modified something there and after a looooooong time, it’s your turn to fix some stuff.

Specifically what?

Ok, for this example, at first you had a Google service (here we are working with Cloud Functions) activated in the main file of this terraform structure:

.
├── env
│   └── main.tf
└── function
    ├── http_trigger.js
    └── http_trigger.zip

This means, the service address was: google_project_service.cloud-functions

Then someone had the great idea: using modules! :)

So the Google Service activation moved to the main.tf under modules:

├── env
│   └── main.tf
├── function
│   ├── http_trigger.js
│   └── http_trigger.zip
└── modules
    └── main
        └── main.tf

Here the service address is: module.main.google_project_service.cloud-functions

And for a while (while you were in development) this was ok. But then, one day, you went to production… where the tfstate still had the service under the old address… so terraform said: “hey, dude, I will destroy your service resource (disable) and then I will create it again (enable)”. To do this, all resources that depend on this one need to be destroyed as well.

So, what to do?

The solution

You have a real activated service. And in the tfstate you have a resource (google_project_service.cloud-functions) that is not in your tf files (so it will be destroyed) and in your tf files a new resource (module.main.google_project_service.cloud-functions) that needs to be created… but wait, that you already have in the Google Cloud infrastructure!

So, delete the old resource from the tfstate, import it again under the new address… and voilà, you can have a beer and a success!

First delete the old resource from the tfstate:

terraform state rm google_project_service.cloud-functions

Then import the actual service as a resource into your tfstate, under the new address:

terraform import module.main.google_project_service.cloud-functions your-project-id/cloudfunctions.googleapis.com

(change the project ID)

Ok, now you can run your terraform again and enjoy watching it not trying to delete your service!
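If it helps to see why this works, here is a toy model in Python. It is only an illustration, not Terraform’s real planning logic: the state and the configuration are treated as plain sets of resource addresses, and the `plan` function is made up for this example.

```python
def plan(state: set, config: set) -> dict:
    """Addresses only in the state get destroyed; addresses only in the config get created."""
    return {"destroy": sorted(state - config), "create": sorted(config - state)}


OLD = "google_project_service.cloud-functions"
NEW = "module.main.google_project_service.cloud-functions"

state = {OLD}
config = {NEW}

# Before the fix: terraform wants to destroy the old address and create the new one.
print(plan(state, config))

state.remove(OLD)  # what `terraform state rm` does to the state
state.add(NEW)     # what `terraform import` does to the state

# After the fix: nothing to destroy, nothing to create.
print(plan(state, config))
```

Terraform also offers `terraform state mv`, which renames an address inside the state in a single step and could be used here instead of the rm + import pair.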

Conclusion

Don’t panic and carry a towel…

Originally published at http://juanmatiasdelacamara.wordpress.com on November 10, 2020.

Someone changed your terraform! was originally published in tarmac on Medium, where people are continuing the conversation by highlighting and responding to this story.

Gitflow with Github and Cloud Build

Juan Matías de la Cámara Beovide — Mon, 05 Oct 2020 11:18:00 GMT

What?

I want to implement Gitflow using Github and Cloud Build.

Why?

When developing software in a team, it’s a good idea to set some sort of standard process. This keeps things clear, ordered and helps the team to avoid procedural mistakes.

Also, if a new dev/devops/lead (or whatever) is added to the team, having a well-defined procedure improves the on-boarding process.

Gitflow is not the only flow out there, and your choice will depend on your project’s nature. But if you are building software that is explicitly versioned, or if you need to support multiple versions of your software in the wild, then git-flow may be a good choice. (more on GitFlow here)

I did this research when a client using Github+GCP (specifically GCP’s CloudBuild) needed to organize its repositories and processes. So this is the “How to implement GitFlow using GitHub and GCP” post.

How?

I will create a template repo with several pipeline files (cloudbuild.yaml files), for each part of the workflow. Then I’ll use triggers from Github to Cloud Build.

What is needed?

A Github account
A GCP account
The will to do this 😉

Note

# 0

The pipelines here have dummy steps… meaning that some of them just print a string saying “Deploy steps”… but they are just templates, so you can then fill with your own actual test, check and deploy steps.

# 1

This is a WIP post, since I’m researching. I’ll improve this doc with new findings, so keep track of this and feel free to drop a message with recommendations or comments.

Steps

First we’ll set the repo. Then we’ll define a procedure to work with.

Setup the repo

Summary

These are the main steps:

Start the git-flow
Set Github repository (Settings)
Use as base the repo template or add files to the current repo
Create the triggers
Grant IAM permissions to CloudBuild SA
Take note on how to name your branches
Enforce branch naming with git hooks

Start the Gitflow

Ensure you have downloaded from here (and added to your master branch) the following files:

cloudbuild*yaml
.githooks/

We’re starting with repos that have only a master branch. So we can start with the action stated in Extra Actions below to create the new develop branch:

git checkout master && git pull --rebase && git checkout -b develop && git push -u origin develop

Settings

Things to do in your repo before you start working.

Set develop as the default branch.
Create an SSH key so pipelines can push, and add it to your secret manager. (as per this doc)
Protect develop and master branch so nobody except the pipeline can push there.

Repo template

We need to add the basic files to the current repo (done previously). Files can be found here.

In the template repo there are these Cloud Build basic pipelines:

cloudbuild-feature.yaml
cloudbuild-pr.yaml
cloudbuild-develop.yaml
cloudbuild-master.yaml
cloudbuild-master-deploy.yaml
cloudbuild-release.yaml

Each one performs a part of the whole Gitflow process.

https://medium.com/media/23f9500617a3d010eb35b00d413ab06c/href

Triggers

The repo also contains a triggers.yaml file. With it, the triggers can be created like this:

gcloud beta builds triggers import --source=./triggers.yaml --project

Specs about CloudBuild triggers.

https://medium.com/media/08ea41be874d1a7a267142e83a4b780b/href

* The master tag pattern must be defined; the proposed one fits this definition. Tagging must be a manual process, since it triggers the deploy to prod. This kind of tagging allows for more complex scenarios (e.g. v0.0.1 for prod, release/v0.0.1 for staging, etc.)

IAM Permissions

Google Cloud Build runs with a ServiceAccount named: @cloudbuild.gserviceaccount.com

This SA must have the following Roles:

Base: Cloud Build Service Account

If you need to deploy Cloud Functions in the project: Cloud Functions Admin, Service Account User

Access Secret Manager: Secret Manager Secret Accessor

How to name your branches

Prefix the name with the type, then a slash, the ticket ID and a brief description (it’s important to follow these conventions since the triggers rely on them), e.g.:

feature/TICKET-212-Add_new_provider 
fix/TICKET-312-Handle_None_case_in_var 
hotfix/TICKET-444-Fix_broken_endpoint 
release/v0.1
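The convention above can be expressed as a regular expression. The Python sketch below is my own guess at such a pattern, written only to illustrate the idea; it is not the regex used by the hooks in the template repo, and the names `BRANCH_PATTERN` and `is_valid_branch` are hypothetical.

```python
import re

# Assumed convention: <type>/<TICKET>-<number>-<description> or release/v<version>.
BRANCH_PATTERN = re.compile(
    r"^(?:(?:feature|fix|hotfix)/[A-Z]+-\d+-\w+|release/v\d+(?:\.\d+)*)$"
)


def is_valid_branch(name: str) -> bool:
    """Return True when the branch name follows the convention."""
    return BRANCH_PATTERN.match(name) is not None


for branch in [
    "feature/TICKET-212-Add_new_provider",
    "fix/TICKET-312-Handle_None_case_in_var",
    "hotfix/TICKET-444-Fix_broken_endpoint",
    "release/v0.1",
    "my-new-stuff",
]:
    print(branch, is_valid_branch(branch))
```

Only the last name is rejected. Note that this sketch also rejects develop and master, so a real hook would need to allow those long-lived branches explicitly.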

Enforce branch naming with git hooks

Option 1 — pre-commit

If the project uses pre-commit.

Get the file .pre-commit-config.yaml from the repo; if you already have one, append these lines at the end:

https://medium.com/media/aa86e2c2c48baa2bcc79419355340bbb/href

This hook will prevent commits to branches whose names do not match the regex.

To activate pre-commit hooks:

pre-commit install

Option 2 — plain git hooks

We will use git hooks to enforce branch naming.

The problem is… we can’t set local git hooks in an automated way. So, this will be a manual step that each dev must perform when cloning the repo.

In the same repo from which we got the pipeline there is a directory called .githooks/, there we stored the branch-naming-enforcement-hook... so each dev must run this in the local repo after cloning it:

git config core.hooksPath .githooks

The pre-commit and pre-push hook files contain this code:

https://medium.com/media/64acd3849d07b3beac596fb7fe9cd18c/href

The procedure

Here are descriptions of the most common use cases:

Clone a repo
Work in a new feature or fix
Release code to master
Work in a hotfix

Clone a repo

After cloning a repo, run this command in the repo’s root dir:

pre-commit install

# If repo has no .pre-commit-config.yaml file, use plain hooks

# git config core.hooksPath .githooks

If you still do not have pre-commit installed read this.

Work in a new feature or fix

Checkout latest develop version:

git checkout develop && git pull --rebase

Create the feature/fix branch:

git checkout -b feature_branch_name
git push -u origin feature_branch_name

Create a WIP PR into develop branch to work on (maybe you won’t be able to create a PR until a change is submitted)

Work as usual (modify/add/commit/push) (on every push feature pipeline must succeed)

When done ask for approval and review (DO NOT MERGE)

When ready to merge add the following comment to the PR

/gcbrun

The develop-pr pipeline must succeed (it merges and deletes the feature branch)

Release code to master

Checkout latest develop version:
git checkout develop && git pull --rebase

Create the release:

git checkout -b release/v0.1.0
git push -u origin release/v0.1.0

Create a WIP PR into master branch to work on

Test it, if needed work in a fix (modify/add/commit/push)

When done ask for approval and review (DO NOT MERGE)

When ready to merge add the following comment to the PR

/gcbrun

The master-pr pipeline must succeed (it merges the release branch into master and develop, and then deletes it)

MANUAL STEP: Go to master and create a tag. It can be done on the Github website or in the command line as follows:

git checkout master && git pull --rebase && git tag v0.1.0 && git push origin v0.1.0

The master-deploy pipeline must succeed (it deploys to prod)

Work in a hotfix

Checkout latest master version:

git checkout master && git pull --rebase

Create the hotfix branch:

git checkout -b hotfix_branch_name
git push -u origin hotfix_branch_name

Create a WIP PR into master branch to work on (maybe you won’t be able to create a PR until a change is submitted)

Work as usual (modify/add/commit/push) (on every push release-hotfix pipeline must succeed)

When done ask for approval and review (DO NOT MERGE)

When ready to merge add the following comment to the PR

/gcbrun

The master-pr pipeline must succeed (it merges the hotfix branch into master and develop, and then deletes it)

MANUAL STEP: Go to master and create a tag. It can be done on the Github website or in the command line as follows:

git checkout master && git pull --rebase && git tag v0.1.1 && git push origin v0.1.1

The master-deploy pipeline must succeed (it deploys to prod)

Extra actions

Add git-flow to an existing repo

git checkout master && git pull --rebase && git checkout -b develop && git push -u origin develop

To Do

My to do list:

hide “Merge” button on Github’s PRs
protect branches

References

GitFlow

https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow

Git — git-flow mapping

https://gist.github.com/JamesMGreene/cdd0ac49f90c987e45ac

Cloud Build envvars

https://cloud.google.com/cloud-build/docs/configuring-builds/substitute-variable-values

Cloud Build - Github authentication

https://cloud.google.com/cloud-build/docs/access-private-github-repos

Git Hooks

https://itnext.io/using-git-hooks-to-enforce-branch-naming-policy-ffd81fa01e5e

Originally published at http://juanmatiasdelacamara.wordpress.com on October 5, 2020.

Gitflow with Github and Cloud Build was originally published in tarmac on Medium, where people are continuing the conversation by highlighting and responding to this story.

To mesh or not to mesh

Juan Matías de la Cámara Beovide — Wed, 09 Sep 2020 15:44:06 GMT

Service Mesh

I was told that a Service Mesh such as Linkerd, Consul or Istio adds a lot of overhead to a cluster. By that reasoning, a Service Mesh is not suitable for a small deployment; you should only consider one when your client is big enough to deserve it.

But how big must a client be to deserve a Service Mesh?

And more importantly, how much overhead does a Service Mesh add to my cluster?

The answer is: I don’t know.

Because of this, I’m starting this POC, to answer this question.

Resources

Files here: https://gitlab.com/post-repos/to-mesh-or-not-to-mesh

Requirements

To run this test you will need:

a k8s cluster (we will use GCP)
kubectl
locust
docker (or any container engine)
git

We will use Linkerd, so you will need to download the CLI.

BFF

A simple Python app that exposes a simple API and calls the BACKEND. Once it gets the BACKEND’s response, it enriches it and sends it to the client.

BACKEND

The BACKEND just answers the request with its version number.

Service Mesh

I will use Linkerd for this test.

What will we measure?

We will measure two items:

WEB response times with Locust
K8s resources usage

Then compare these metrics on two scenarios:

using the Service Mesh
using the raw k8s cluster.

Set up the environment

Cluster

This terraform template can be used: https://gitlab.com/templates14/terraform-templates/-/tree/master/gke

Then log in to your cluster, e.g. for this case:

gcloud container clusters get-credentials kungfoo-test --region us-central1 --project speedy-league-274700

The tests

We’ll have two tests: one with and one without the service mesh.

No Service Mesh

Set env

Create the namespace to deploy the app into:

kubectl create ns kungfootest

Set the app and run tests

Go to Set up the app and then to Run the tests. After this, come back here.

Clean up

Delete deployments:

kubectl delete -n kungfootest -f deploy-backend.yaml -f deploy-bff.yaml -f ingress.yaml

Delete the namespace so we’re clean:

kubectl delete ns kungfootest

Service Mesh

Linkerd

First, the Linkerd CLI must be installed on your system (more here).

Download binary from here and add it to your PATH.

Since we will be using GKE, we need to run these extra steps: https://linkerd.io/2/reference/cluster-configuration/#private-clusters

Check cluster is ready for linkerd:

linkerd check --pre

I got:

pre-kubernetes-capability

-------------------------

!! has NET_ADMIN capability

found 1 PodSecurityPolicies, but none provide NET_ADMIN, proxy injection will fail if the PSP admission controller is running

see https://linkerd.io/checks/#pre-k8s-cluster-net-admin for hints

!! has NET_RAW capability

found 1 PodSecurityPolicies, but none provide NET_RAW, proxy injection will fail if the PSP admission controller is running

see https://linkerd.io/checks/#pre-k8s-cluster-net-raw for hints

The cluster lacks these capabilities. But probably when Linkerd is installed these will be installed as well. (https://github.com/linkerd/linkerd2/issues/3494)

Then install it:

linkerd install | kubectl apply -f -

…and wait until it’s installed:

linkerd check

Set env

Create the namespace to deploy the app into, this time we’ll need an annotation for linkerd:

kubectl create ns kungfootest

kubectl edit ns kungfootest

and then add the annotation:

  annotations:

    linkerd.io/inject: enabled

This will allow Linkerd to automagically inject the proxy in namespace’s pods.

Set the app and run tests

Now, go to Set up the app and then to Run the tests. After this, come back here.

Note this time the pods will have two containers, since Linkerd is injecting the proxy.

Compare the test results

For my tests:

No Mesh

Total average requests: 33% CPU, 8% memory.

Total average usage: 12% CPU, 26% memory.

Avg response time: 204ms

Mesh

Total average requests: 35% CPU, 9% memory.

Total average usage: 25% CPU, 38% memory.

Avg response time: 206ms

Conclusion

The mesh configuration we used is very basic, but it adds interesting services to our deploy with no need to modify code. (e.g. secure internal connections, metrics…)

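From the client’s point of view, the average response time was only about 1% higher in the meshed version (206 ms vs. 204 ms).

On the server side, the meshed version used roughly twice the CPU (25% vs. 12%, about 108% more) and about 46% more memory (38% vs. 26%).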
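For reference, these percentages are relative increases computed from the averages listed above. A minimal Python sketch of the arithmetic (the function name `percent_increase` is just for this example):

```python
def percent_increase(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100


# (no mesh, mesh) pairs taken from the results above.
results = {
    "avg response time (ms)": (204, 206),
    "CPU usage (%)": (12, 25),
    "memory usage (%)": (26, 38),
}

for metric, (no_mesh, mesh) in results.items():
    print(f"{metric}: {percent_increase(no_mesh, mesh):.0f}% more with the mesh")
```

Rounded, this gives about 1% for the response time, 108% for CPU usage and 46% for memory usage.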

Is it worth it?

As usual, it depends. Can you afford the increase in CPU and memory usage? If so, you get all the service mesh benefits at almost no cost on the client side. In any case, it deserves more testing if you are considering it.

But let me read your opinions on this: drop a message here.

Set up the app

Under the source directory there are two subdirs: one for the BFF and one for the BACKEND (inside the latter you will find two more dirs, versions 1 and 2… a.k.a. stable and canary; for now we will use just the stable version).

Build the app

Backend

In both cases you must proceed the same way, varying only the version number.

In the source/backend directory you will see the Dockerfile and the two version directories.

CD into your source/backend directory and run:

cd 1.0 && \
GOOS=linux GOARCH=amd64 go build -tags netgo -o app && \
docker build -t backendapp:1.0 . && \
cd ..

…and:

cd 2.0 && \
GOOS=linux GOARCH=amd64 go build -tags netgo -o app && \
docker build -t backendapp:2.0 . && \
cd ..

Bff

CD into the source/bff directory and run:

docker build -t bffapp .

Push them all

Ok, now you have the images… push them all to a repository of your choice and keep their names so we can set them in the k8s manifests.

Or use these already built images:

docker.io/juanmatias/canary-app:1.0
docker.io/juanmatias/canary-app:2.0
docker.io/juanmatias/canary-app:bff-1.0

Deploy the app

We will deploy all the elements into kungfootest namespace.

CD into the root project directory and then:

cd manifests

Deploy the backend apps:

kubectl apply -f deploy-backend.yaml -n kungfootest

Deploy the bff:

kubectl apply -f deploy-bff.yaml -n kungfootest

Deploy the ingress:

kubectl apply -f ingress.yaml -n kungfootest

Test the app

Get the public IP:

kubectl get ing -n kungfootest

You can test your app with this command:

curl http://$PUBLICIP/kungfutest/mytest

You should have an output like this one:

{"id": "mytest", "response": "Congratulations! Version 1.0 of your application is running on Kubernetes."}

Run the tests

We’ll run two tests, locust to know response times, and resources to know the used resources.

Locust

From the project root dir:

cd locust

If this is the first time, create a virtual env and install locust:

pip install locust

Now, run the locust server:

locust -f kungfootest.py

This will open Locust server listening on localhost:8089… open it with your browser.

There, you must add the host (e.g. http://$PUBLICIP), the max number of users and the users spawn rate. Then begin your tests.

I’ll test it with 100 users and a rate of 10 and let the test run for 2 minutes.

Resources

While the locust test is running, run the script resources.sh. When the test finishes, press CTRL+C and the script will show the average memory and CPU.

NOTE: It’s important to keep in mind that this script will get the resources requested for nodes, and the real use only under kungfootest namespace.

References

Monitoring Kubernetes cluster utilization and capacity (the poor man’s way) | Jeff Geerling

First version of this post was published in my blog here.

To mesh or not to mesh was originally published in tarmac on Medium, where people are continuing the conversation by highlighting and responding to this story.