Istio Service Mesh to Tetrate with Zack Butcher

Defense Unicorns · Published Aug 15, 2023 · 16 min read

Listen on iTunes, Spotify, or your favorite podcast app.

In this episode of Defense Unicorns, Zack shares his approach to innovating, starting new projects, and what it takes to reach success by finding what's broken and fixing it in a way that provides value. He is a Founding Engineer at Tetrate and has served in a variety of roles across the company; currently, he's Head of Product. Connect with Zack on LinkedIn today.

Defense Unicorns, a Podcast, is hosted by Robert Slaughter, Founder and CEO of Defense Unicorns.

Guests this week include:

Zack Butcher, Founding Engineer at Tetrate

Highlights from Defense Unicorns, a Podcast:

This interview has been lightly edited for length and clarity.

(01:53) Service Mesh Technologies

(06:30) Creating projects that change the world

(13:27) What is Tetrate

(19:20) Cybersecurity value proposition of Istio and Tetrate

(28:41) What is TID

(31:33) Big Bang

(38:25) Advice for entrepreneurs

ROB: Can you explain a little bit about service mesh technologies in general?

ZACK: I think it’s helpful to go back and look at the set of problems that we had at the time when we started to build this architecture; that helps motivate why we did what we did and what it does. There were two big moves happening. One was the adoption of microservices. Google has always been a pretty tech-forward company, and it adopted a service-oriented architecture long before others. At the time, it was starting to evolve that into a microservices-oriented architecture as well, which brought about connectivity concerns and a range of security concerns. A lot of the listeners here who are looking at Kubernetes and similar technologies are probably grappling with a very similar set of challenges.

Then there’s the other set, which is where the technical implementation of the service mesh came from: we had this monolithic API gateway, and every single Google application programming interface (API), whether you use maps, translate, or any of the cloud APIs, went through it. There were a slew of problems there that you might expect. If you have a shared API gateway in your organization, you probably have the same problems, right? There are shared-fate outages: someone misconfigures it and brings the entire system down, and now everybody’s out, not just the one service. There are resource attribution problems. You can imagine that when you’re serving, say, all of Google’s API traffic, that’s a huge amount, a large footprint, and it’s expensive. It becomes very important to be able to track who spent what where, so that we can do capacity planning and chargeback correctly.

Then we had a bunch of locality problems. You have a lot more services communicating as we’re load balancing and doing API gateway things, and it turns out that those API gateway functions, like rate limiting and authentication and authorization, are things you actually want in between all your services. If you have to go out and back in, it winds up inefficient and slow, so we wanted to avoid that. We had a communication proliferation problem as well as a monolithic API gateway problem. And so we said, “How do we start to solve this?”

So the idea came to use what’s called the sidecar pattern. It was not original to the service mesh; it’s been used for years in a variety of different places, most commonly for logs. If you’ve had a daemon that rotates log files and ships log data, that’s very similar to the sidecar pattern we use now, except we’re using it to manage network behavior instead of the file system. That was the key idea: we can push that proxy, that gateway, right next to the application itself. Originally, inside Google, it was about implementing that API gateway-style functionality. As we took the problem into open source with Istio, it became very clear that there was a related set of problems that needed to be solved around baseline communication. Inside Google, that was handled in remote procedure call (RPC) frameworks and similar things, but there wasn’t a good general solution in open source at the time. It was a phenomenal solution for encryption in transit and for application identity. You can actually write “the front end can talk to the back end,” and not “this virtual private cloud (VPC) can talk to that VPC,” because who knows what’s in those VPCs and what that means semantically. And it starts to facilitate fine-grained traffic control: not just connection-level load balancing, but decisions per request. How do I load balance? Let me apply security policy so that I can ultimately achieve what we’re talking about with a zero-trust architecture. It turns out that’s a perfect fit for the service mesh architectural pattern.
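As a concrete sketch of that “the front end can talk to the back end” idea (not shown in the episode; the `frontend`/`backend` names and `prod` namespace are hypothetical), an Istio AuthorizationPolicy expressing it might look like this:

```yaml
# Hypothetical example: only the "frontend" service account in the
# "prod" namespace may call workloads labeled app=backend.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: frontend-to-backend
  namespace: prod
spec:
  selector:
    matchLabels:
      app: backend        # enforced by the backend's sidecar proxy
  action: ALLOW
  rules:
  - from:
    - source:
        # SPIFFE-style workload identity, not an IP range or VPC
        principals: ["cluster.local/ns/prod/sa/frontend"]
```

Because the sidecar enforces this at the workload’s own network boundary, the rule follows the service identity rather than whatever IPs or VPCs the workloads happen to occupy.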

So that’s a little bit about how it kind of came to be and how we got there. And maybe a little bit about what it does as well. So hopefully, that’s a helpful overview.

ROB: To me, technologies like Istio and Kubernetes have fundamentally changed the entire planet. Walk us through what it takes to start a project that changes the world.

ZACK: The first point to make is that we did not expect it to change the world. We definitely hoped that it would do big things, but we had a very specific set of problems that we wanted to solve to start with. I think that’s a really important feature — if you’re going to go in and tackle some big problem, you need to have a specific pain point. You can’t go do a science project. As much as an engineer, I love it; I want to go play with cool technology. We have to keep it grounded in a real set of problems and a set of user pain that we can address.

So that was the first thing. As we looked around the service mesh space, we knew that networking and Kubernetes were doing a lot to solve compute. It was helping you realize that the idea of compute being fungible and really easy is awesome. As we were seeing that happen, we saw that networking was becoming the next big bottleneck for people. That was a sharp pain that we knew was present. We knew there were these inner connectivity problems as you adopt a service-oriented architecture. We knew that inside Google, we had teams of engineers investing huge amounts of money in solving this problem for the organization, which doesn’t scale out. So that’s the first bit: we had a pain point that we knew was large and that we knew was going to be hit by a large part of the community.

And so that’s what gives you the confidence to go spend the political capital to spin up a project like this. For us, inside Google, it was made easier because of the success of Kubernetes. We had this project that is continuing and, as you say, revolutionizing the industry. We have a lot of political capital to point to the set of problems around networking that Istio could solve and say, “Hey, I think we can go do this again”, “We’re pretty sure that this problem is big enough”, “We’re pretty sure that the industry is at a spot where we think we can start to do it”. So, in my opinion, the single most important trait is to be focused on the end-user and to have a specific set of pain points, or perhaps just one pain point — one big problem. If it’s only going to be one that you can solve, and you can stay laser-focused on that, you can then listen to that customer, that set of customers, and that set of users; they’re going to guide you to the right spot.

ROB: Talk to me a little bit about Tetrate; what made you want to leave the comfort of Google and start your own startup? Tell us a little bit about the big idea or big problem that Tetrate is out there trying to solve.

ZACK: Google is a very comfy job; it’s hard to justify leaving, and it took a good bit to convince me. What it came down to was identifying a key problem that was obvious to myself and JJ, who started it with me, and that had to be solved. That problem was this: when Istio was introduced, Varun, the CEO of Tetrate and at the time a product manager at Google for Istio, was assigned the task of finding the first users. For an infrastructure project, there’s that chicken-and-egg problem: are you going to be the first person to bet the farm, meaning your infrastructure, on this new, untested technology? It’s a hard pill to swallow. That’s a lot of risk. He was given the directive, “Hey, let’s go get a list of users; let’s go track them down and get them to use Istio.” It wasn’t really selling to most of the companies we were talking to; it was “Hey, use our open-source project, and we’ll make your life better.” Google wasn’t trying to get money out of it, just adoption.

I was the only engineer on the project who worked in San Francisco proper at the time; everyone else at Google was in the South Bay. Varun said, “Hey, come along; I need somebody to do some technical sales.” We went and talked to a whole bunch of different companies, 20 or 30 at least, and never in my life have I heard a more consistent articulation from so many different people using the same words. We’d say, “Hey, look, there’s this cool thing, Istio; you should use it.” The first thing they said was, “What is it? I’ve never heard of this before.” We’d respond with “It just came out” (this was 2017, in the early days). “You can secure your communication, control your communication, and observe it if you’re in Kubernetes,” we said. They all responded: “I have those problems. I need encryption in transit and fine-grained traffic control. I have outages; I want to be able to canary my services. I need to be able to get insight into what’s going on. Everything you’re saying checks out, but you said I can only use it in Kubernetes, and I don’t have anything that runs in Kubernetes right now. Where I’m at is that all my stuff is on virtual machines (VMs), which is the monolith, and I’m trying to figure out how I’m going to get to that spot. Once I’m there, I’ll definitely use it, but my real pain is: how do I get there? How do I bridge the gap between my legacy and modern infrastructure?” That idea of “bridging the gap” is a phrase we heard articulated a ton, and it’s in our main product today, because it’s all about bridging that gap in infrastructure. Then the third thing we heard from every single person in that set of calls was, “Who else is using it?” We said, “You can be the first,” and universally they said, “Hell no.” Although they didn’t say it quite so nicely.

We eventually got past that and got the initial users bought in. But that core insight was that the real pain people had was not, “Hey, I’m in a modern cloud-native infrastructure; how do I manage it?” It was, “How do I transition from a traditional legacy infrastructure focused on VMs in a data center with firewalls, VPCs, and demilitarized zones (DMZs) to a cloud-native architecture?” That’s the key pain we needed to help bridge. So that was the insight we were looking to act on. We spent nearly a year internally at Google pushing management to reorient the project to focus on that idea of adoption, of bridging legacy and modern infrastructure. It appeared to make a lot of sense for Google, but we never made much headway in convincing management or executives. We were sure it was a real problem, though, because we kept talking to new customers and new prospects, and they kept saying the same thing.

So that’s what gave us the confidence to leave that very cushy, very safe spot that was Google to really go do this. We had heard from so many people that this was a problem, and we couldn’t get the organization to solve it. So we said, “Look, let’s go prove it; the proof is in the pudding. Maybe the executives are correct, and there isn’t a problem, and we’ll just fail, but we’re pretty sure it’s a real problem. So let’s go solve that.” That was in early 2018: Varun left first, and I joined him about a month later. And here we are today, four years later, doing very well helping with service mesh adoption, in particular in that niche. The reason folks come to us versus some of our competitors is, A, we have deep expertise and a lot of experience because of the Google history I’ve talked about; and B, we meet them where they are: attempting to transition from a legacy system to a modern architecture. This is one of the big reasons we helped write some of the security standards. We’re helping some of the largest organizations in the world navigate that very sticky, muddy ball.

ROB: You briefly brought up some of your newest work and some of the cybersecurity benefits of adopting some of these technologies, both Istio and Tetrate Service Bridge. Do you want to recap for us some of the whys from a cybersecurity perspective? What unique value proposition do Istio and Tetrate Service Bridge bring to someone going through what I’ll refer to as their cloud-native transformation?

ZACK: I think the key thing here is that the set of features and functionality the service mesh brings into play isn’t unique; you must have it to have a successful distributed system. When I say that, I mean an identity for every workload that can be authenticated and used for authorization at runtime. It means controlling how traffic moves through your system so that you can do things like canary and upgrade applications gracefully, as well as implement policy. Traffic control is fundamentally a security policy. Availability is fundamentally a set of security trade-offs versus an absolutely safe system that nothing could touch; we have to make realistic decisions on availability versus security. The service mesh gives us the tools to do that. Then, finally, it gives us insight into everything that is happening, all of the traffic flowing in our system. That’s a really powerful set of capabilities to bring to bear on the applications and on security as a domain.
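To make the traffic-control point concrete (a sketch, not from the conversation; the `reviews` service and its `v1`/`v2` subsets are hypothetical), a per-request canary split in Istio might look like this:

```yaml
# Hypothetical "reviews" service: subsets v1/v2 are defined by pod labels,
# and the route splits requests 90/10 between them.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
  - name: v1
    labels: { version: v1 }
  - name: v2
    labels: { version: v2 }
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination: { host: reviews, subset: v1 }
      weight: 90
    - destination: { host: reviews, subset: v2 }
      weight: 10   # 10% of requests canary to v2
```

Because the sidecars make this decision per request rather than per connection, shifting the weights gradually rolls the new version out (or back) without touching the applications.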

The main point we emphasize in the National Institute of Standards and Technology (NIST) papers is that we need those capabilities if we are to begin to achieve a zero-trust architecture. We need encryption in transit, and every single request must be authenticated and authorized at both the workload and the end-user level. So how do we start to achieve that? What we have seen, based on the scale of Google and the adoption of the service mesh in open source as well, is that the service mesh as an architecture is a really effective way to achieve those kinds of necessary features.
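As a hedged illustration of the encryption-in-transit requirement (not part of the conversation, and assuming `istio-system` is the mesh’s root namespace), a single mesh-wide Istio PeerAuthentication resource can require mutual TLS for all workload-to-workload traffic:

```yaml
# Applied in the mesh root namespace, this requires mutual TLS
# between every pair of workloads in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # assumes istio-system is the root namespace
spec:
  mtls:
    mode: STRICT
```

One small, centrally managed resource like this is what turns “every request must be encrypted and authenticated” from an application-by-application project into a platform default.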

Why is that the case? We can usually drop that sidecar, that proxy, next to an application without having to change the application, because we’re interacting at the network layer and redirecting all traffic into and out of that application through the sidecar proxy. What that does more than anything is give us a policy enforcement point. To drop into some of the NIST access-control vocabulary, we have Policy Administration Points (PAPs), Policy Information Points (PIPs), and Policy Decision Points (PDPs). One of the most fundamentally important pieces is the policy enforcement point: it takes a verdict from the access control system and ensures it’s enforced at runtime. And because that sidecar proxy is intercepting all network communication into and out of the application, we have the opportunity to apply any policy we want there, whether that’s a security policy, a traffic shaping policy, or something similar.

This is why it’s such a compelling toolset: we basically sit next to the application in the network and can apply layer seven, application-layer policy to the traffic passing through. That isn’t revolutionary in and of itself; you could do it with a software development kit (SDK) or a serving framework. The other key thing the service mesh brings is a central control point, so that I can go to one place, change configuration, and have it pushed to the runtime system basically immediately. So now, for example, to mitigate a security problem, change the cipher suite I’m using, enable encryption in transit, or rotate certificates, rather than having to change my application and redeploy it, I can update the configuration, and that takes effect in a matter of seconds. What we end up with is a much more robust system that can provide a security baseline for entire sets of applications without changing the applications themselves. That gives us the substrate to start building our security model on top of. That’s a lot of what we’re trying to do in those NIST documents: outline what the security model for a microservice should be and how we can start to implement it. We talk about the service mesh specifically because we believe it is the most compelling way to implement this set of behaviors compared to some of the alternatives, like doing it in an SDK or at the node level, for example. There are a range of different implementations. We’re confident the service mesh is a good starting point. It’s not the right choice for everybody, but it’s a good first choice to look at.

ROB: One of your other products that I’d love to talk about a little bit is called Tetrate Istio Distro (TID). What is the background, and what problem is TID helping folks solve?

ZACK: The simple answer is that there are a couple of different things we aim to achieve with it. One: we’re not adding a bunch of functionality above and beyond the open-source project. It exists primarily for us to facilitate security patches. We have a bunch of customers that update at a variety of different cadences. For example, we have large financial institutions and places like the DOD, which are not well known for keeping things updated or updating rapidly. The open-source project only supports two releases at a time, and the reality is that a quarterly release cadence doesn’t jibe with what many organizations need. So we offer extended security patching, 14 months for each Istio version; that was one of the baseline pieces.

The second thing we do is a much more comprehensive set of testing than upstream does, basically emulating every single customer environment that we can to a reasonable degree. Maybe most importantly for some of the federal audience: one of the big benefits of the service mesh is that you can use it to do encryption in transit for your applications. As soon as we’re in that realm, the Federal Information Processing Standards (FIPS) become incredibly important. We actually have FIPS-validated builds of Istio and Envoy as well, which do quite a bit to smooth over your Federal Risk and Authorization Management Program (FedRAMP) process, because Istio makes it very easy to implement encryption in transit across your entire infrastructure. Having infrastructure components that are already validated makes the validation process for your overall infrastructure a lot faster and cheaper. Then, of course, there’s the support when you’re having a sad day, as well as some of the architectural planning and similar, so you can make sure you’re actually deploying and using it effectively.

ROB: One of the other products we’re going to quickly talk about is called Big Bang, and the way I describe it is: you take those Iron Bank containers and assemble them in a declarative, everything-as-code fashion to build a DevSecOps system that secure infrastructure providers can leverage. The reason I bring up Iron Bank and Big Bang is that there’s a rumor on the street that TID has gone through Iron Bank, and I’m hearing people talk about it actually being fully integrated with Big Bang.

ZACK: We are pretty excited about that because it’s been quite a while in the making. TID is in Iron Bank. For folks who are already in the public sector and starting to explore the space: if your authorizing official (AO) has already approved the use of Iron Bank, the TID images and the Istio containers are there for you to pull and play with. You can start doing that right away, including the FIPS builds. We’re super, super excited. Big Bang is basically an instantiation, a set of blueprints as code, for how you would stand up a Kubernetes stack with Istio in it, along with a whole bunch of other batteries-included systems, as a really effective and powerful baseline for a zero-trust runtime. And because FIPS is so important, we’re very excited that TID should actually become the default Istio out of the box for Big Bang as well.

As you’re playing with that batteries-included platform, for folks in the public sector and the private sector as well, I recommend checking out Big Bang as a really good set of blueprints, with a well-reasoned, pretty well-articulated set of decision points for how to assemble an architecture, even if you’re not going to use it directly. That’s one of the other areas where we help out: some of the TID pieces, security posture, Big Bang, and similar.

ROB: I definitely don’t want to encourage everybody to quit their jobs and create a startup, because I think there are a lot of opportunities to innovate within larger organizations. Whether it’s large companies, the Department of Defense, Medicaid, or Medicare, there are a lot of very large organizations that struggle with innovation. Within those organizations, there’s a lot of risk that you can sometimes take when starting new things; you can sometimes put your career on the line just to start a project, even internal to the company. With that context, what advice do you have for people who are considering making that jump?

ZACK: I think it goes back to what I talked about before: you need to have your eye on the pain that you’re going to solve. It doesn’t matter if it’s a startup or a new project inside your existing organization. You need to focus on the user and on the pain you’re going to solve. If you’re solving pain within the organization, you’re going to be showing value, and you’re going to be successful, because at the end of the day that’s the success criteria by which you’ll be judged. Whether you go off and do a startup or you’re pushing new programs inside some of the largest organizations on the planet, like the DOD, Medicare, Medicaid, CMS, and similar organizations, keep your eye on the value. You get to the value by focusing on where people are struggling and where there is pain today that’s stopping them from doing great things. As long as you stay focused on that, I believe you will be successful.



Want to hear more? Tune in to the podcast episode, where we cover lots more.

You can also subscribe to Defense Unicorns, a Podcast on iTunes, Spotify, or your favorite podcast app.

Thanks,

The Defense Unicorns Team
