‘DevOps, SRE & Platform Engineering’ with Ricardo Castro, Principal Engineer, SRE at FanDuel

Published in

The Typo Diaries

27 min readMay 15, 2024

This blog was originally published in the Typo blog.

In the recent episode of ‘groCTO: Originals’ (Formerly: Beyond the Code: Originals), host Kovid Batra welcomes Ricardo Castro, Principal Engineer, SRE at FanDuel. He is also a co-organizer of the DevOps Porto Conference and DevOps Day Meetup, as well as an ambassador of the Continuous Delivery Foundation. The discussion centers on ‘DevOps, SRE & Platform Engineering’.

The episode kicks off with Ricardo shedding light on his personal interests and transformative life experiences. He imparts valuable wisdom on differentiating between DevOps, SRE, and Platform Engineers. . He also advises adopting best practices for implementing CI/CD pipelines in startups, medium-sized enterprises, and large corporations and presenting a robust framework for cultivating effective DevOps teams.

Lastly, Ricardo provokes thoughtful reflection on whether deployment frequency truly encapsulates DevOps efficiency, or if the focus should shift towards the value delivered.

Time stamps

(0:06): Ricardo’s background
(0:41): Ricardo’s hobbies
(5:06): Key differences between DevOps, SRE, and Platform Engineers
(18:53): CI/CD implementation practices for different company sizes
(23:40): Strategies for building strong DevOps teams
(26:44): Is deployment frequency vital for measuring efficiency?

Links and mentions

Ricardo on Linkedin and Website
Kovid on Linkedin
Listen on Spotify, Amazon Music, and Apple Podcasts

Episode transcript

Kovid Batra: Hi, everyone. This is Kovid, back with another episode of Beyond the Code by Typo. Today with us, we have a special guest who is an expert in Platform Engineering. He’s a co-organizer of the DevOps Porto Conference. He’s a co-organizer of the DevOps Day Meetup. He’s an ambassador of the Continuous Delivery Foundation. He’s currently working as a Principal Engineer/SRE at FanDuel. With a total of 15+ years of experience in DevOps engineering and leadership, we are happy to welcome you, Ricardo, to the show. Great to have you here.

Ricardo Castro: Thank you very much for having me, Kovid. I’m really excited to have this conversation with you.

Kovid Batra: Perfect. Perfect. So, before we get started and deep dive into Platform Engineering, DevOps, and Site Reliability, I would love to know more about you, Ricardo. Our audience would be interested to know more about you. So, tell us about something that you like outside of work.

Ricardo Castro: Yeah, sure. The people who know me know that I’m an avid music listener, so I just love music, mostly metal music. That’s the thing that I listen to the most and everything around guitar, I love as well. I played guitar for many years and it’s something that I’m trying to pick up recently, so it’s something that probably in the next few years I’ll be ramping up my skills. Also, I enjoy sports, doing sports. I practised Taekwondo for more than a decade. And now I want to try something different. So, I’m doing CrossFit and I’ll probably try some martial arts in the next year, probably Jiu Jitsu. That’s what I’m going for.

Kovid Batra: Oh, that’s so cool, man. Music, Taekwondo, from where are you getting this energy and what is the motivation to learn all these things?

Ricardo Castro: I just usually like to challenge myself. If I want to learn something because I enjoyed it, I just try to do it, even if I don’t have a lot of time. And if I really suck at something, I try to at least get to some level of proficiency, not like going to a guru or on something, but try to at least get some understanding of that, if at the very least I have some interest on that topic. And that’s how I usually try to approach it.

Kovid Batra: That’s cool. That’s cool, man. All right, moving on to my next question. Anything that inspires you or defines you in your life, any, any life-changing moment or a life-defining moment that you would love to share with us?

Ricardo Castro: Sure. So something that I really like doing is to help other people or other teams. So, before I got into tech, I worked as trying to help kids in school to overcome challenges at school in terms of, so they probably, they were bad at math or something like that. I worked at a company for a few years like that. And it’s something that I always enjoyed to get to help other people achieve their goals. So for many years in the beginning of my career, I was a traditionally a software engineer, just building features, building product, and then eventually I migrated into a more of an operational role. And I think that now we talk about a lot of Platform Engineering, but it’s something that I believe that once I went into more of that operational role was something that I always tried to do, like to build tooling and to build platforms that other teams could use. And I don’t have to be in the middle. So I’m there to help, of course, but it’s something that my goal was always to, if you don’t need me because I built you the tool or you have something that you can progress on your own, that’s cool. I don’t want to be an impediment for that. So, it’s something that usually inspires me. So, how can I get out of the way of what people are trying to do. That could mean building documentation. That could mean building a CLI. That could mean building an API or a platform or whatever it is. And my goal is always to empower you to do whatever you need to do with the least impediments possible.

Kovid Batra: More like automate everything theory, right?

Ricardo Castro: Yeah, yeah, yeah, yeah. We always find new things to work on. So, I don’t, at least for now, I don’t have that fear of getting myself out of a job. Because as soon as I finish something and I deliver something, there’s 10 other new things that pop up. So, it’s always that, like, automate, give the tooling that people actually need. So if they don’t need me, I shouldn’t be there in the middle of just to click a button or to run some command because people are responsible. People are grownups. So, here are the tools. Progress on your own. If you need me, I’m here to help.

Kovid Batra: No, absolutely. I think, it’s a little counterintuitive for a lot of people. They try to create those dependencies so that they have the security probably, but I think the right way to grow, even I believe so, that you should make up time for yourself. And to make out time, you just have to get out of those situations where you are doing just the redundant job for people. So..

Ricardo Castro: Yeah, exactly.

Kovid Batra: Yeah. Yeah. Cool, man. I think this was really interesting. It was great knowing a little bit about you. Let’s move on to our next section which is more around DevOps, Site Reliability, Platform Engineering, your area of expertise, basically. So let’s start with the very basics. DevOps, Site Reliability Engineer, Platform Engineer, I would love the audience to know from you, what’s the fundamental difference? And with each of these roles, there are some responsibilities coming into the picture. So share here’s some thoughts about those responsibilities that you see as a DevOps Engineer or a Site Reliability Engineer or a Platform Engineer. And then, what are the challenges associated with it? So, it’s a big question. You can take your time.

Ricardo Castro: Yeah, sure. I’ll try to compartmentalize that into, like DevOps, SRE. So let’s start with probably the concept that people know for them, a longer period, which is about DevOps. So, if we go back and we start seeing the origin of DevOps, the movement started to form around 2007–2008. And it had one goal, right? It was to bridge the gap between dev and operations. So, we had devs on one side. We had ops, traditionally SysAdmins, back in the day. And there was a conflict between this group of people. So, engineers want to be, or product engineers, software engineers want to deliver features, want to deliver, put stuff in production in front of customers. SysAdmins or operations people, let’s call them operations people; their responsibility is to make sure that the systems are working as they should. So, if you introduce change to a system, there’s a very high likelihood of that system having some kind of failure. So there’s not an incentive alignment here.

So on the one hand, we have, you have people that want to introduce changes to a system. On the other hand, you want people that just want to make sure that the system needs to be stable. Don’t mess with it. Just, just leave it be. And this becomes a problem because to continue growing, companies need to continue to deliver features, but then you have probably two of the most important teams in the company that are not aligned in the same objective. So that’s where, where, what DevOps try to bridge that gap. It’s around, okay? How can we get these two areas aligned into common goals? Of course, over time, we realize that it’s not only Dev and Ops. There’s a lot more Quality Engineering, Security and so on and so forth. You have to involve Product. So, DevOps nowadays, I think it’s a lot more than just Dev and Ops. There’s more disciplines that need to be brought to the table around this. But DevOps in the beginning and when the term was coined was in 2009, was never very prescriptive in terms of, “Oh, this is what exactly what you should do.” But the problem with that is that on the one hand, it gives you a lot of freedom. So, the main goal is to bridge the gap between all of these disciplines. How do I align those goals? On the other hand, some, most companies are like, okay, but I need some, you need to tell me something that I need to do to actually do that. So, what we’d started seeing around DevOps was a lot of work around how do I deliver features faster? So, it was stuff around CI/CD. How do we automate all the things? So, DevOps in the last decade started to migrate a lot into, I can build some automation around, for example, how do I, how do I build resources with tools like Ansible, Terraform, whatever tool you, you decided to use and a lot about CI/CD. How do I automate the build, deploy, test process to get stuff into production?

Site Reliability Engineering, probably a lot of people don’t know, actually was created inside Google many years before DevOps, which is kind of weird because we only, like in the last three, four or five years started hearing about Site Reliability Engineering. But, it was around 2003 when Google was facing the same problem, right? Devs, Ops, Engineering, this doesn’t work. Google was growing like crazy. So they were like, “Okay, so we need a better approach to do this.” So they tasked a team. What would happen if we take a bunch of software engineers and put them in charge of operations? And that’s the, like the gist of what happened with Site Reliability Engineering. So Google created this practice. It was made popular by the book that they released in 2016 where Google revealed what for them is Site Reliability Engineering.

Something that we need to keep in mind is that that book came after, like more than a decade of experience. So what Site Reliability Engineering was in the beginning at Google and what was in 2016, it’s not the same thing. There are continuous learnings and continuous approaches, but it’s a lot about going from production backwards, right? So, something that people know a lot is about SLO. So, Google created something that would allow them to define what reliability means. It’s not like I think reliability is this. You think that reliability is that. No, we need a common language where, where we say, “Okay, this is what reliability is.” SLOs. That’s a framework. Okay. Once we have that, we can start working backwards. So, are we achieving our targets? Are we not achieving our targets? If we’re not, what do we need to do? Of course, these two disciplines can.. DevOps and SRE can somewhat meet in the middle, because if we start from production backwards, we will probably arrive at a point where we say, “Okay, so probably one of the biggest problems that we have is that my deployment process or my build process is not reliable enough. I’m introducing a lot of problems. So, let’s automate it. Let’s improve that.”

And of course, recently we started hearing a lot about Platform Engineering. Although again, although people think it might be a new thing, it is not. If you learn, if you read, like the ‘Team Topologies’ book, which is like four or five years old, they already talk about Platform Engineering. And if you worked at a big company, like 15 or 20 years ago, you already had internal platforms. So, the concept of Platform Engineering, it’s absolutely not, it’s not new. The thing is that the advancements of the cloud and all of these new technologies that we have started to democratize how platforms could be built, making it easier, making it that companies, smaller companies, or even startups can build their own platforms easier. So, that’s why we’re starting to see a lot of involvement in Platform Engineering, which is awesome, which is good. It’s about building some kind of abstraction where, for example, software engineers have a standardized way to deploy stuff, to observe their services, how to get them running in production. There’s a lot about golden paths. So this is like the preferred way for the company to actually build and deploy services.

One common pitfall that I see a lot with Platform Engineering and I’ve been writing about that recently a lot because I see this problem happening a lot in many companies. It’s about platform teams building stuff in isolation, right? So we, you create a platform team. They go into their room. They do their thing. And then they say, “Oh, here’s the platform.” If you don’t treat your platform as, let’s put it like, as a product where you have customers, which are other engineers in your company, you’ll probably have a lot of trouble because you probably built the wrong platform or at the very least, something that is not aligned with what people are expecting. You’re probably not solving their problems. You’re probably creating some roadblocks. You don’t have the vertical or the business knowledge, the business domain knowledge that those teams have. So you’re probably building, “Oh, so here’s a nice way to deploy this kind of application.” They’re like, “Okay, yeah, but I have this requirement and that requirement and I can’t do that with your platform.” So, that’s probably the most common pitfall.

So, just to summarize Platform Engineering, you should look at it as some kind of a product where you have customers, which are other engineers, and then you have to build, like any other product, the features that they need. Of course, every once in a while, you need to risk it. You need to try new things and you need to put new stuff in front of your engineers. Sometimes we don’t even know what we need, right? It’s like product development and the customer only knows what he wants when he sees it. But if you have customers like your fellow engineers and they have problems, the platform should be aimed at solving those problems. Maybe deployments are too slow. Maybe every time that I need to create a new project, it’s a hassle. So if I have an easier way to do that, it would be awesome. Maybe I don’t know how to observe my application. Okay, so my platform could build some automation and I get that for free. All those kinds of things. So you should be building iteratively and should be solving problems and making your teams faster, not trying to build a platform and then force it into your teams. And then at the end of the day, you will not be happy because adoption will not be great. Your teams will not be happy because they will be forced into doing something that is not helping them.

But Platform Engineering, I think it’s probably one of the biggest advancements that we’ll have in the operation side of the things because it allows you to automate a lot of stuff and it allows product teams to actually build faster because okay, I know exactly how to deploy my application, how to observe it. It’s easier. I need to focus or spend more time fixated on the problems that the business want to solve and not how I’m going to operate this in production.

Kovid Batra: Right. Makes sense. One question, regarding Platform Engineering and Site Reliability. So, the people who are actually on this team, who are taking care of, I would say developer experience, because ultimately you are impacting the developer experience with it.

Ricardo Castro: Yup.

Kovid Batra: How does a day-to-day basis, the role of a Site Reliability Engineer differs from somebody who is involved in DevOps. Like, what I could understand from your explanation is that both of them have the similar goal of taking care of the teams. There should be things running, it should be in line with automating or improving the overall acceleration or velocity of the team, right?

Ricardo Castro: Yup.

Kovid Batra: So, both seem to be similar levels. How does their role on a day-to-day basis differ from each other? And yeah, I think first is this.

Ricardo Castro: Yeah, sure. So, it is, of course, will be different from organization to organization. There is no silver bullet and you have to adapt to your own reality. What I see most often is that when you’re talking more about DevOps or traditional, I always struggle with the term DevOps Engineer, because again, I think DevOps is culture. You don’t need DevOps engineers, but let’s go with what the industry has come up with. What I usually see with DevOps teams is that they usually are responsible for some part of the automation. So probably they build, I don’t know, they build your servers or they build your Kubernetes cluster, whatever it is, they help you automate your CI/CD pipeline, for example. But then it’s probably up to you and your own responsibility to just, okay, so you have all this tooling, just go do whatever you need to do. Site Reliability Engineering usually starts to focus on production, right? So they start looking at production. You’re like, okay, what does reliability mean? One of the first things that site reliability teams do when starting to build this culture inside a company is, okay, we all have a different opinion of what reliability means. Okay, so let’s build a common understanding. Is it SLOs or whatever framework you want to do to make this definition? Once we have some nice definition, do we have the observability that we need to actually understand if you’re being reliable or not. So, site reliability engineers usually are very focused on observability. We have the metrics that we need. We have the logs that we need. We have the traces. We have whatever we require to actually inform that reliability and say, “Okay, it’s good.” “It isn’t good.” “Are we achieving it?” “Are we not being reliable as our customers need us to?” Usually only, only then they start to tackle more reliability issues in terms of unless there’s something glaring in your system. So, let’s say that you build an API or build a site and it’s constantly down, right? So, probably the first thing that the Site Reliability Engineer or engineering team will try to do is around, okay, let me like try to mitigate this or make it less painful, let’s put it this way. But if you’re getting, if you already have a, if you’re decent, you probably want to get from decent to actually meeting customer expectations. So that’s why you need a reliable framework. That’s why you need an observability platform, whatever it looks like in your organization to actually understand, okay, I’m meeting these goals or not.

Usually where these two DevOps engineers and SRE teams meet, it’s more or less right in this, in this, at this point, right? So you’ll probably realize once you have the observability and SLOs, that you could improve, for example, the way that you deploy applications. The way that you are running in production, you probably need, I don’t know, maybe you need a circuit breaker. Maybe you need to implement rate limiting. So, SREs can help with that. And the other way around. So once, DevOps engineers have a lot of, a lot of this in place, they can start then operating with teams. But I would say that these two approaches usually start at different spectrums. And then, eventually meet in the middle. So again, just to summarize, SREs are usually a lot involved, at least in the beginning of about defining what reliability is and what it looks like, a lot about observability. So they need to understand what is happening in your systems and as DevOps engineers are usually in the beginning focused a lot about automation, building CI/CD pipelines, and then eventually converging. Of course, just to finish on this, what SRE looks like in Google and in other organizations will be completely different. So, people need to take that into consideration because it’s tempting to just look at the SRE book and say, “Oh, I need to do this exactly as the book says.”

Kovid Batra: Yeah.

Ricardo Castro: We don’t, so the book is quite big. So there’s a lot there to learn from. And of course, companies like Google, Amazon, Meta, they have probably challenges that, I don’t know, a handful of companies, in the world have. So they need to tailor those solutions to their problems. But we can do the same, right? So we can see, oh, this is how Google does things. So, do I have a direct relationship between what they do and what I do? If not, are there similarities, are there differences and make that adaptation?

Kovid Batra: I think that makes a lot of sense. And the point that you mentioned here, you have to treat, if you are an SRE or let’s say, a DevOps Engineer, you have to treat other developers as your users and the platform as your product. So while you are building that, you have to have that ideology in mind to actually improve the overall experience, impact the overall velocity of the team, the quality of the work they’re producing. And I think most important part comes into the CI/CD pipeline itself. Like that is one critical portion of the whole delivery pipeline, I assume. So, you are an expert at it. What I would want to learn from you is what are those best practices or what is the ideal way to implement CI/CD for a startup, for a medium-sized company, let’s say who has 100, 200 developers, and then we move on to a larger size company where you have, let’s say thousands of developers.

Ricardo Castro: Yup.

Kovid Batra: Of course, For each stage, there should be a different set of practices. There’s a set of considerations that have to be taken while implementing that CI/CD pipeline. So, can you just throw some light over there?

Ricardo Castro: Yeah. So I think that overall, the problems are the same between your either a startup or a big company. So if you think about CI/CD, it’s the concept that we want to integrate code as fast as possible or integrate regularly. Let’s put it this way, regularly. And about continuous delivery. It’s about you’re getting stuff in front of your users. At least that, that’s my definition. And what we see a lot in the literature, some people, and this is a caveat, think about CI/CD as just automation. And what I would say to that is that if you have an automated integration step and an automated deployment step, but you’re not integrating regularly, it’s not continuous integration, because if I have a feature branch that spawns, I don’t know, a week, a month, and then eventually I integrate, that’s not continuous integration. That’s automated integration, but it’s not continuous. And the same thing with deployment. Yeah. For me, continuous deployment is putting stuff in front of your users, not just getting to the point, “Oh, this is in the process of being released.” And then it stays there for a week, a month, six months. It’s like, okay, I’m not putting it. I don’t even know, if it works or not. I like a lot of the philosophy of Charity Majors. She usually says that if you don’t put stuff in front of your users, you don’t really know if it works because probably we’ll discover problems with that.

So, what I would say is that independent of the size of your company, what you want to do is to make it faster, right? So, I’m a developer. I do a pull request. It is reviewed. I want to get feedback as fast as possible. Of course, that for smaller companies, they don’t have the budget in terms of either money or engineers to implement a lot of that themselves, so they can rely on SAS services if they have the budget to automate a lot of that stuff. And now, with the CNCF, we have a lot of tooling around that that can make it really easy.

As you start getting bigger and bigger, you probably will start to have more requirements. So, you’ll probably start having the need of more custom things. You’ll probably have more systems. You’ll probably have legacy systems where just using a tool out of the box doesn’t work. So, you’ll probably start to have to build some stuff internally. What I would say is to continue to leverage open-source tooling as much as possible. That’s always a business advantage because you won’t get stuck or you won’t get pulled in into a specific vendor. And then, yeah, all of your things are on top of that of that platform. It would be a strategic advantage. But the goal is the same, right? I want as a developer, I want to get feedback as fast as possible. So, I want to submit something and I want to, I don’t know, that my tests run faster, run fast. I want the deployment to run fast. So as an engineer, if I’m put into a place where I do a PR is merged. And then it takes one hour for the tests to run, one hour to deploy something. So I’m just sitting there like for a couple of hours like, “Okay.” And then to have something and I am like, “Oh! Actually, there’s a problem.” An automated test run or a user complaint. And it’s something very simple. And then I have to spend like two or three more hours just waiting for that whole process to finish, of course, when you’re probably a smaller company, you won’t have a lot of this, a lot of these problems because usually your systems are small. But again, that’s one of the differences between.. Usually, what I see with startups and bigger companies is that your system starts getting bigger and bigger and it’s easy to start having these delays, right? So I submit a new change and it takes a one hour, two hours, three hours for my change to get into production. So again, the goals are still the same. You’re just facing different problems in terms of size, scale. Bigger companies probably have more custom things that they need to do and they have bigger systems. So, they have to invest more in terms of time and money to actually fix those issues.

Kovid Batra: Yeah, Ricardo. I think that that’s really interesting. Now, looking at certain practices that you need to, like adopt before going ahead and solving those problems for the team. So, is there a specific framework that you follow when you do it in your teams? Any kind of philosophy that you adopt before approaching, like one you already mentioned on a broader level that you should treat them as your users and platform as your product. But is there anything else that goes into building some really good DevOps teams there?

Ricardo Castro: Yeah. So something that probably people know and have seen in the industry is a lot of talk around DORA metrics, right? So, for people who don’t know, it’s a set of metrics that allow us to understand the level at my organization or my teams are at. It’s a kind of a standardized way. It has been done, if I’m not mistaken, since 2014. There’s also a great book, which is called ‘Accelerate’, which explains the origin of where the metrics came from and the, the whole scientific approach around it. And it’s probably the best way that we have right now. Maybe we can come up with something better in the future, I don’t know, but it’s probably the best way that we have right now to actually understand at what level, in terms of proficiency, is my organization. And those metrics are deployment frequency, lead time for changes, time to restore a service and change failure rate. And now, I think last year, a new metric was introduced, which is reliability.

So essentially, it’s some centralized ways that I have. If I measure these things, I can understand if I am an elite type of team or I’m at a low level. So something that we’re doing in our organization is measuring these things. For example, that we were talking about CI/CD, if my lead, my lead time for change is more than six months, I’m at a low level, right? So it means that I have some change. I want to introduce that change. And it takes me six months to get that into production. Also, it allows me to compare between organizations. Of course, this needs to be taken with a grain of salt, right? So we can compare ourselves, like blindly. But it allows me to understand, “Okay, if my company takes six months to take a change to production, I can’t be at the same level as a company that deploys software continuously every single day. And the same thing for time to restore a service. If every time I have a problem, it takes me I don’t know, hours, days to fix it or to mitigate it, I’m not at the level that I want to be at. So it’s something that we’re doing internally. It’s something that I’ve done in the past and I think it’s a good idea to at least analyze the DORA metrics. See if it makes sense in your organization. Of course, they’re not the only metrics that you should look at, but it’s a standardized way. And it’s a good starting point because we have a lot of data from the past. I think the respondents in the last few years have been in the order of tens of thousands. So we have a lot of data that we can rely on and to be sure that these metrics are actually relevant. And these metrics are something that can allow me to understand, “Okay, if I’m at an elite level of deployment frequency, I’m in a good place. If not, it’s something that I probably need to work on and improve.”

Kovid Batra: Totally makes sense. I have one question related to this. One of the metrics, that we usually measure, which is deployment frequency, and you have just mentioned about it, a lot of engineering leaders to whom I’m talking, sometimes they challenge this thought that why is it even important? Like, we are doing it once in a week or once in 15 days. And if we’re rolling out features at the pace that we want, the deployment frequency doesn’t actually tie to that because ultimately it’s the value you want to deliver. If you are delivering it in small chunks or you’re delivering it in one big packet within 15 days, it’s the same. So measuring deployment frequency and understanding the efficiency of a DevOps team or a dev team on the basis of deployment frequency is probably not the right way. What’s your thought on that part?

Ricardo Castro: Yeah. So, I’ll start with the caveat and then I’ll give my opinion. So there are some cases, for example, there are services that I don’t know, don’t have a lot of development, right? So it’s a service that is old or is a service that is very stable. So then, there are no new features. So it’s normal for those services or that, that group of services to not have a lot of deployments. So, that’s the first thing. So, it’s not like a one size fits all. Maybe I have a service that I don’t know, deals with authorization and I’m, I’m done with authorization. I don’t have a lot of things to do that. So it’s perfectly normal that I don’t do deployments on that service every day.

That said, now there’s what you were talking about. If I’m delivering, for example, let’s use the timelines that you said. What’s the difference between me developing or delivering something like every day or delivering like the entire value once every two weeks or once a month. What we’ve seen and that’s why it’s important to have all the metrics and not one of the metrics in isolation is the fact that, for example, if you do a deployment every two weeks or a deployment every month, you will probably have more time with the other metrics. So you will probably be introducing more changes at once. So you will probably introduce more failures because it’s one of the, if not the most common thing to introduce a failure in a, let’s say in a system. It’s to introduce a change. And a deployment is a change. So the bigger the deployment, the bigger the likelihood of that problem of you introducing a problem in, in the system. And of course, time to restore will be impacted because if you introduce a small change, you have it in your mind. You introduce it every day. You just work on that code. You can understand it better. It’s like, “Okay, so I introduced the change and it was a failure. Okay. Let me look at it.” You probably get around much faster saying, “Oh, the problem is here. It’s fine. I’ll fix it. Deploy it again.” If it’s a big change, you’ll probably take a lot longer to get there because there was a lot of code that you put into production. What is happening? You have to pull in a lot of people. So you work on this piece, you work on that piece. What the hell is going on? Most likely your time to restore will be impacted because you will take longer.

And of course, lead time for changes will also be impacted, right? So if you introduce a small change every single day, deployments will be a lot faster. If you introduce a big change that is in a lot of services, deployments will be more complex. That’s another thing actually interesting to measure. It’s about the complexity of the deployment. If I need a coordination between services, if I’m introducing a big change, it’s something that I need to take into account in this consideration.

Of course, something that people usually tell is like, oh, okay. But for me, for actually to introduce smaller changes, I need to write more code. Oh, because maybe I need like some feature flag or something like that, or maybe I need to introduce somewhat of incomplete code just so that I can guarantee that it’s working in production. That’s true. But it’s usually at a very small scale, right? So maybe you need to, just to ensure that even if the feature is not complete, that the code that you’ve written doesn’t break anything that is there. Right? So maybe you need to put some defensive measures, like, for example, a feature flag just to make sure that no request passes through that code. That’s true, but it’s usually very small in terms of scale, and you can revert it back really fast. Reverting a big change is much, much harder.

So, just to summarize around that, I understand the concern, but it’s what we usually see. And when you use these metrics as a whole, it’s that when you introduce a bigger change, usually other metrics are impacted because you introduced a big change into a system. You’ll have more problems. You’ll take a lot more time to restore. And if deployment frequency is usually higher, it means that you’re not introducing big changes every single day. You’re introducing small changes, and you can recover much faster and deliver it at a higher speed. And what we see, actually, it’s something that there’s always a discussion about if you’re going faster, you break more things. If you look at the data from the past six or seven years from the DORA metrics, that’s actually not true, right? So, the teams that actually deliver faster and they’re deploying more frequently, take less time to restore service and have less failures.

I know that this might sound counterintuitive, but it’s what the data shows. It’s not like us saying, “Oh, this is like a utopia.” No, it’s what the data shows us is that if I deliver smaller changes more frequently, I actually introduce less failures into my system. And when I do, I’m much faster to recover.

Kovid Batra: Totally. I think it makes sense also, even though it might sound counterintuitive, as you said, but it definitely makes sense. If you’re bringing in a small piece of change, the error percentage, keeping it the same, the amount of absolute error would be much less. So, of course, it definitely makes more sense to deploy frequently.

I think the challenge comes in with the part where people want to set up, need to set up that pipeline for automatically deploying and doing it fast. That’s where the sluggishness comes in. But I think it is very important to have such a system in place if you have the bandwidth and the right team to do it.

Ricardo Castro: I understand that that’s an upfront investment, but it’s something that we can look at the data because again, it’s not me and you just having a conversation and sharing our opinion. It’s like we have data. We have data on that that actually can tell us, yeah, this actually works. So, although there’s an upfront cost and an upfront investment, the data tells us that at the very least at the medium term, you will start getting a lot of benefits. And if you’ve been in the industry for a few years and you have the chance to actually work in this kind of way, you start to realize that yeah actually, it is, right? So it’s not something that I heard someone talking, like someone from Netflix saying that in their organization, it’s awesome. No, I actually experience that on a day-to-day basis.

Kovid Batra: Yeah, yeah, yeah. Makes sense. Great, Ricardo. I am totally infected by the energy you have and the way you explain things. There’s definitely a lot of depth in what you are saying here, and maybe these 30 minutes are not sufficient. We might need another session for deep diving into certain issues like that. And I would love to have you for another show sometime again.

Ricardo Castro: Yeah, that sounds great. We can arrange something like choose a topic and go deeper into a topic. I’ll be very happy. I’ll be very happy to have that discussion with you. It was a very nice conversation that we had today.

Kovid Batra: Great, Ricardo. Thanks a lot. I hope our audience likes it too. Great. So I’ll see you soon and keep you posted.

Ricardo Castro: Thank you very much, Kovid. Thank you very much for having me. It was a really nice conversation. I think it’s something that is on top of everyone’s mind. People hear about DevOps, SRE, Platform Engineering, CI/CD. And it’s good that we see like-minded people just having discussions. We agree on some things. We don’t agree on something. But it’s out of these discussions and out of this brainstorming that we can actually, can start to get solutions with our organizations. So, I think it’s nice that we have a broad spectrum of opinions because again, my company will be different from yours and probably, what works for me might not work for anyone else. So if we have a broad spectrum, we can say, “Oh, actually what Ricardo was saying applies to my company.” “Oh, actually what Kovid was saying actually applies to me.” And people can, can make their own minds.

Kovid Batra: Absolutely. Absolutely. Perfect. Great, Ricardo. Thank you. Thank you so much once again. Great to have you on the show.

Ricardo Castro: Thank you very much for having me. Thank you.

‘DevOps, SRE & Platform Engineering’ with Ricardo Castro, Principal Engineer, SRE at FanDuel

Time stamps

Links and mentions

Episode transcript

Written by typo