Rob Hirschfeld is the CEO and Founder of RackN, a leader in physical and hybrid DevOps software. He has been in the cloud and infrastructure space for 20 years and has executive experience at both start-ups and big companies. In this episode of the CIO Exchange, Rob dives into how platform engineering can impact an organization’s productivity and improve the developer experience.
As we’ve known for some time, top developer talent is scarce and continues to see a rise in demand. As a result, tech leaders are increasingly focused on developer productivity and the developer experience, looking for strategies to improve both. One potential solution is platform engineering.
Rob has a background in Scale Computing, Mechanical and Systems Engineering, and is an advocate for applying lean and agile processes to software delivery.
In this episode, Rob dives into platform engineering, explaining how it can support developers and operations teams, improve productivity and enhance the developer experience. He stresses the importance of collaboration between internal teams and thinking critically about how to define productivity. He also outlines the challenges that he is seeing in automation reliability and automation half-life.
Key Quotes:
“What’s happening is, it's not that hard to write code. It's getting even easier with some of the new things that we have, but it's incredibly hard to take that code on a journey all the way into production and to keep it running in production. And that understanding, learning all those pieces having to do with all the work that has to go into building that infrastructure and making that work, that ends up being a tremendous hurdle for developers.”
“If you're looking at productivity as, I delivered a whole bunch of code. It doesn't make any difference. What you actually need to be doing is delivering the right code at the right time.”
“The way I see platform engineering is, it's a service-oriented operations infrastructure that consolidates the operations experiences across multiple disparate development teams. There's people out there who talk about it being an internal product. If you get too focused on a product definition of platform engineering, you miss what we're really doing, which is creating a service oriented approach. Meaning, it's an operations team that is looking at the developers as customers and figuring out how to better serve those internal teams as a shared resource.”
“What we see everywhere in the industry is there is a reliability crisis in automation. Meaning most people don't believe their automation is very reliable….They’re basing it on, when you run your automation, how often does it work? What we find is that a lot of people have under 80 percent automation success.”
1:00 Development team’s biggest challenges
2:46 Journey from ideation to production
4:46 Impact of improving the developer experience
11:48 Definition of platform engineering
13:06 Platform engineering case studies
18:51 Why developers and operations teams need to collaborate
24:30 How platform engineering impacts security
27:17 Importance of transparency and stakeholder buy-in
31:59 Automation reliability problems
35:00 Will generative AI improve automation?
0:00:01.3 Rob Hirschfeld: The fundamental thing in any good platform engineering effort is to improve collaboration between your teams. Period. That is it.
0:00:11.0 Yadin Porter De León: Welcome to the CIO Exchange podcast where we talk about what's working, what's not, and what's next. I'm Yadin Porter de León. Top developer talent is scarce, it continues to see a rise in demand. As a result, tech leaders are increasingly focused on developer productivity and the developer experience, looking for strategies to improve both. One potential solution is platform engineering. Rob Hirschfeld is the CEO and Founder of RackN, a leader in physical and hybrid DevOps software. He's been in the cloud and infrastructure space for 20 years and has executive experience at both startups and big companies. In this episode, Rob dives into platform engineering, explaining how it can support developers and operations teams, improving productivity and enhancing developer experience. He stresses the importance of collaboration between internal teams and thinking critically about how to define productivity.
0:01:00.6 Yadin Porter De León: Rob, developer experience has become extremely important to those who are trying to build experiences, build applications to deliver on those experiences, because the business outcomes that are required of the business are extremely linked to those efforts. You have an idea, you need to get to code, you need to get to production, you need to get out there in front of somebody who's gonna consume it and then create business value. But there's a lot of friction there. So, from a developer's standpoint, what's the biggest challenge that those development teams are having right now?
0:01:30.5 Rob Hirschfeld: What I think of is, it's a bumpy journey, that what a developer has to do on a day-to-day basis is actually really, really different than what that code's experience is running in production. And that path from keyboard to production, especially transforming it, building security around it, conformance, compliance... What's happening is, it's not that hard to write code. It's getting even easier with some of the new things that we have. But it's incredibly hard to take that code on a journey, all the way into production, and to keep it running in production. And that, understanding and learning all those pieces having to do with all the work that has to go into building that infrastructure and making that work, that ends up being a tremendous hurdle for developers. The proverbial full-stack developer, there is no... I know some people like to call themselves full-stack developers. The amount of skills that you have to have to walk that whole journey is immense.
0:02:21.0 Yadin Porter De León: Yeah. I like the way that you're talking, 'cause you're not just talking about creating code, 'cause I think we can get lost in the conversation of, like, how do we get the code out, how do we ship the code, when it's not just writing the code, like you said, writing the code, not necessarily... There's creativity, there's art, there's architecture, there's all sorts of things that go into crafting what that looks like. Maybe this is the code and the infrastructure, it's all part of the journey of the idea. The idea, the value that you're trying to create is on a journey and part of the journey is through code. Then, when it's in that code journey, the code then goes on its journey as well. It has to go through all these different stages that you talked about too. And there's friction in that journey. And really when developers are talking about, or a development team is talking about, getting something from ideation to production and getting it out there in the world, they're really talking about, what is that journey like? And right now you said it's fragmented, it's broken. What's causing that? Is it something that's always existed? Is it getting better? Is it getting worse?
0:03:17.7 Rob Hirschfeld: It's always existed to one extent or another. I think we constantly evolve the ways in which we architect code to try to improve velocity. In a lot of ways, and this is a little counterintuitive, we often architect things to reduce the communication burden within the teams that are doing that work. But, like so a lot of the microservices, the purpose of building microservices is to allow teams to have a smaller blast radius, a smaller interaction point between what they build and everything else. The problem with monoliths is not monoliths. Everybody thinks, "These old traditional applications that are huge, monoliths are bad." There is nothing wrong with the monolith. It is a very functional way to build an application stack. The thing that makes monoliths hard is the interconnectedness of the teams and people that go into building that monolith. And so it's a team statement, it's not a technical architecture statement at all.
0:04:14.4 Yadin Porter De León: And does that go back to the journey of the code? If I'm going on a code journey, maybe with the monolith there are different points of friction, different places where you're having a suboptimal developer experience that's impacting what you just mentioned too, which is velocity, which goes back to productivity. And I'll pull... Let's take a step back really quick too, as we're starting this conversation, what is the impact of improving that developer experience to improve productivity? How are CIOs, or not, how should they really look at, "Okay, I wanna improve the velocity at which I take an idea to market, that's gonna... " Of course, a big part of that journey is that developer journey and improving developer productivity. So, give me a sense of why that's so important in the developer experience. I know it might seem like an obvious question to some, but I think it's worth diving into: why do we really need to focus on that developer experience through that journey where they're taking the code through?
0:05:15.0 Rob Hirschfeld: Oh, boy.
0:05:17.4 Yadin Porter De León: Why is it so critical?
0:05:18.5 Rob Hirschfeld: Oh, that's a hugely important question and problem in that, when we talk about productivity, it's really easy for people to be like, "Oh, I want my developers to be writing more lines of code every day... "
0:05:27.8 Yadin Porter De León: Yes. How many KLOCs have you done today, Rob? That's what we need to know.
[laughter]
0:05:35.6 Rob Hirschfeld: That is a measure of productivity, but it's really not what enterprises should care about. Because, and I watched this happen, you can have people write a ton of code. If it's not the right code, you actually are creating technical debt. In some cases you have emotional ties to services and code and capabilities, the sunk cost fallacy of saying, "Oh, I just spent a month building this feature." And somebody shows up and says, "You know what? That actually isn't the feature I needed. I needed something different."
0:06:07.7 Yadin Porter De León: Yes, exactly.
0:06:08.0 Rob Hirschfeld: And then all of a sudden, now people are like "But I wrote this, we have to use this."
0:06:14.3 Yadin Porter De León: "I have to put it on my QBR slide and that slide has to go into the event, and then I need to be able to tie my bonus to that." And that's... So it's... There's... Yeah, I love that emotional connection, I think that's a great point.
0:06:23.1 Rob Hirschfeld: And so if you're looking at productivity as, "I delivered a whole bunch of code," it doesn't make any difference. What you actually need to be doing is delivering the right code at the right time. And then you want to not be too defensive about what you've built. And I'm a big fan of Eliyahu Goldratt's "The Goal" and the derivative works like what Gene Kim did with "The Phoenix Project" and things like that that are similar. But The Goal is really seminal in the idea that it is very easy to create an end-to-end process, where you optimize what you think is the most expensive part of that process, and then cause the whole system to fail. So, in The Goal, which I highly recommend from a reading perspective, they had this one piece of equipment, a very expensive piece of equipment, like a developer is in an organization. They're often some of the most highly-paid individuals on the team.
0:07:11.7 Rob Hirschfeld: And so what they did was, because they knew it was an expensive piece of equipment, they ran as much inventory as they could through that gear. Everybody scheduling in the factory said, "Our number one job is to keep the expensive equipment busy." But that meant that they piled up a huge amount of work in front of it, and they had a whole bunch of stuff behind it that was all starved for attention or managing the inventory, and it was creating a bottleneck in the system. And so that's one of the challenges you get with developers if you get very focused on productivity instead of how they collaborate and how they work with others. Are they conforming to systems so they're building deliverable code that can run in production, maintainable code? Are they accumulating technical debt in the name of productivity? All of these things actually create systemic problems around your teams.
0:08:02.5 Yadin Porter De León: Yeah, I like the point you made specifically with regards to creating the wrong thing, and writing the wrong code which creates new technical debt. And that's the exact opposite, of course, of what the business wants. And I had this great conversation with a CIO where he's like, "If I don't have my teams connected properly, why would I develop something or adopt something like Agile? Because that just means my backlog and my technical debt will pile up faster than it did before, because I haven't done a lot of the things that I need to do. I'm an Agile shop, but what if I'm not pointing that Agile shop in the right direction? Well, then I'll just do the wrong thing, but I'll iterate more quickly doing the wrong thing. I think there's some upstream stuff that I have to fix." I think the approach makes a really big difference. Setting your north star, making sure that you're pointing the team in the right direction, I think that's something that everyone tries to do. But sometimes the optimization lens ends up maybe corrupting the way in which people are seeking to do the right thing.
0:09:00.9 Rob Hirschfeld: Yeah, it's interesting that a lot of times people couple minimal viable product design and things like that. One of the things about Agile and minimal viable products is actually the idea that you're building a prototype, you're gonna destroy it and then build the right thing. And one of the things that you have to realize is that, in building out how all these systems work, there's times when you have to say, "We didn't build this right, we have to go fix it, repair it, improve how the system works. We learned something." That is where microservices become really powerful. This is where if you're doing design with good decoupling points in it, then you can actually come back and say, "You know what? I'm gonna refactor how this one thing works, and if I keep the APIs the same, I might completely rewrite the backend service, improve its efficiency, improve its performance, change out vendors, do things like that," and yet not have the impacts on other teams or other consumers of how that system works.
0:09:55.2 Rob Hirschfeld: That's where we're decoupling teams from having to make those decisions. And if you're looking at a monolith and you choose to refactor something, you might have dramatic consequences across other teams and then you have to do a whole bunch of coordination. That is very hard to optimize. Some of the architectural pieces that we've put together have encouraged this type of very segmented developer teams, from a productivity and coordination and architectural isolation standpoint. There's benefits, but now we're starting to reap some of the costs.
0:10:25.3 Yadin Porter De León: There's challenges there when you're measuring the wrong thing, or you're optimizing for the wrong thing. But I think, coming back to the idea of the developer experience, Forbes recently reported that top developer talent's scarce. That's not a surprise. It's kind of a well-known fact that top developer talent is scarce, and you can define that in different ways, and that means different things to different companies as far as top talent and what their focus is on. But the demand's there, and it's growing rapidly. And platform engineering is a way to support and nurture those developers and provide that good developer experience. And that's a path to creating that value that businesses are trying to create. Let's just level set to start for the listeners too: can you define what your idea of platform engineering means?
0:11:12.8 Rob Hirschfeld: I'd be happy to. Platform engineering is something that we've seen happen for multiple years, not as a consistent name but it's an enterprise trend. And that's important because I think this is actually something, that platform engineering has surfaced out of enterprises. It's not something that was brought in to enterprises. It's a definition of a need, inside of companies, as a consequence in some cases because of the microservices architecture but not entirely. And the way I see platform engineering is it's a service-oriented operations infrastructure, that consolidates the operations experiences across multiple disparate development teams. It's a little bit like there's people out there who talk about it being a product, an internal product, or things like that. I actually think those are, if you get too focused on a product definition of platform engineering, you miss what we're really doing which is creating a... It's a service-oriented approach, meaning it's an operations team that is looking at the developers as customers, and figuring out how to better serve those internal teams as a shared resource.
0:12:23.7 Yadin Porter De León: No, absolutely. I think that's an approach, it's a way of operating. And can you share, just to really focus on what that is in your own mind, can you share just a couple of use cases or examples so we can break it down?
0:12:37.5 Rob Hirschfeld: We, RackN and I, Rob Hirschfeld, are very infrastructure-focused. We tend to think of things in ops and infrastructure terms, so my examples tend to drive towards that. One of my favorites is a media production company that has a lot of different needs. They have animators, they have developers, they have CICD processes, but at the bottom of a lot of that what they have is a need for very specific short-term uses for infrastructure. As a platform team, what they needed to be able to do is have a CICD system that would run a rendering job, that would consume infrastructure during that time period and when it was done, it would give it back up. An API-driven infrastructure, very straightforward. What they did was, the infrastructure team, the platform engineering team, they didn't call it that, but their infrastructure team actually produced an API that would allow them to interface into their CICD workflows and then do that infrastructure work.
0:13:40.0 Rob Hirschfeld: The platform team kept ownership of that, not the CICD team. And then they were able to take that same infrastructure and those same APIs and same automation, and then build that into a developer portal. So if somebody needed to check out a cluster or a desktop, configured to the way the standards worked, they could then check out a machine, or a cluster of machines, for their use case and actually deliver working infrastructure. The way it became a platform is, those processes were actually using the same APIs, the same backend, to service different use cases out of the same infrastructure pool.
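To make the media-company example concrete, here is a minimal Python sketch of one shared check-out/release API serving both a CI/CD rendering job and a developer portal request from the same pool. Everything in it is hypothetical and for illustration only: `InfraPool`, `run_render_job`, `portal_request`, and the machine names are assumptions, not RackN's product or the company's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class InfraPool:
    """Hypothetical pool of machines owned by the platform team."""
    free: set[str]
    leased: dict[str, str] = field(default_factory=dict)  # machine -> consumer

    def check_out(self, consumer: str, count: int = 1) -> list[str]:
        """Lease `count` machines to `consumer`, or fail if the pool is short."""
        if count > len(self.free):
            raise RuntimeError("not enough free machines in the pool")
        machines = sorted(self.free)[:count]
        for machine in machines:
            self.free.remove(machine)
            self.leased[machine] = consumer
        return machines

    def release(self, machines: list[str]) -> None:
        """Return leased machines to the pool; unknown names are ignored."""
        for machine in machines:
            if self.leased.pop(machine, None) is not None:
                self.free.add(machine)


def run_render_job(pool: InfraPool, frames: int) -> str:
    """CI/CD use case: borrow machines for the job, always give them back."""
    machines = pool.check_out("ci-render", count=2)
    try:
        return f"rendered {frames} frames on {len(machines)} machines"
    finally:
        pool.release(machines)


def portal_request(pool: InfraPool, user: str) -> list[str]:
    """Portal use case: same API, different entry point, longer-lived lease."""
    return pool.check_out(f"portal:{user}", count=1)
```

The point of the sketch is that both entry points call the same `check_out` and `release` functions against the same pool, which is the property Rob describes as turning an infrastructure API into a platform.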
0:14:18.1 Yadin Porter De León: Okay. That's the distinction too, 'cause what I was thinking of is, what's the difference between what maybe you'd call a non-platform engineering model versus that platform engineering model? And that's what you're just describing right here where it's multiple use cases.
0:14:32.9 Rob Hirschfeld: This is one of the reasons why I think that we get confused when we get very focused on it being a dev portal, because what you're really doing is you're not creating the operations contracts, the operations collaboration. And this idea that not all development workloads come in through a portal, some of them come in through CICD or GitOps or... The goal here is the operations team can consolidate their operations. Operations doesn't wanna have a lot of ways to do things, it's bad practice. But developers and development teams have a lot of different ways they need to accomplish their goals. And so they don't want the operations team constraining what that interaction looks like. It's no better to say, "Oh, I'm gonna put a dev portal and you have to go through the dev portal." That might not work for every team. What you wanna be able to do is say, "Development teams, you go after your best way to interact with the infrastructure," and operations team wants to be able to support all of those interactions with the minimal amount of friction on their side, that means without having a lot of different processes or control points for it.
0:15:35.3 Yadin Porter De León: No, and that's what fascinates me because originally, when I say originally, in many different iterations of the developer experience, that developer has to do a lot of other things before they even get to start writing code. And walk through what that, 'cause you were talking about that journey of the code, what are some of the things that, let's say if someone's not in an environment where you've got a dedicated platform engineering team creating that kind of experience where you get to be able to have the underlying infrastructure support all these different ways of working, what would that workflow look like for a developer? What do they have to do before they can even start writing the code, that they wouldn't have to do under that platform engineering model?
0:16:08.3 Rob Hirschfeld: Oh my goodness. Well, platform engineering teams still do the work. That's why I focus on it being an emerging topic. It has to be done either way. You might have a platform engineering team per development team, but somebody's doing that operations work. Platform engineering is this emerging idea that you can have it be a shared role, but it's very concrete. One, nobody wants to write code that doesn't go into production. It is the purpose for writing code. In production, that code has to have data storage, a database, storage objects, some place to store the data and the state information. It has to have a place to actually run the code. Today that oftentimes looks like Kubernetes or some type of container management system, but it doesn't have to be, it could just be installing the application on a machine. Code has to run somewhere, you have to provide access to it, you have to have some type of networking topology with security, it has to have some type of refresh cycle.
0:17:03.2 Rob Hirschfeld: Just 'cause you got it deployed once, that doesn't mean that it's actually done, you have to have a system and process by which the code can be updated and changed and patched and fixed. And the underlying infrastructure that supports all of that stuff has to have the same type of lifecycle and model. All of those pieces and parts have to be considered. And even if you're doing serverless, event-driven architecture, those basic components are still present. Access inbound, URLs, some way to get the system to run, execute, store data and state, all of those pieces are gonna be consistently available. What the team has to do is they have to, in a lot of cases work backwards from what that architecture is, that they're gonna maintain in production and then figure out how they're gonna fill the pipeline of code and changes to that process.
0:17:53.2 Yadin Porter De León: Okay. And I guess, the focus though is, in this platform engineering's, emerging platform engineering model that we're talking about, is there a significant, a material difference in what that developer experiences, versus, let's say if you didn't have a dedicated platform engineering team and the burden of some of that activity then falls on the person who's actually writing the code, what is the difference between those experiences?
0:18:18.0 Rob Hirschfeld: That is a great question because there's a contract piece in this thought process that's worth worrying about. The idea here is that we want to take some burden off of the developer, so they're not worried about learning infrastructures, they're not worried about setting up the security, the networking, which OS they use. In a lot of cases, part of the challenge here is that, enterprises, back to the keyword here, this is an enterprise evolved process, they have a standard operating system, they have security, they have audits they have to pass, they have compliance checks that need to get done. They've already made a lot of these decisions. In some cases, that causes a huge amount of friction because the developer shows up and says, "I don't want to use that," or "That's out of date. I don't like it." And the developers will go and do what they think is the right thing, and that can be split or separate from what the operations team has to maintain and support. And that's important because what we've done in a lot of cases we've allowed developers to say, the CIOs have said, "Oh, don't worry about that developers, go do your thing. Your productivity is very important to me. Make your own choices and go."
0:19:23.9 Rob Hirschfeld: And then, what ends up happening is they get to a point where now they have to get the code into production and the operations team shows up and says, "Well, you didn't conform to any of my steps." And so either it's usually disruptive to the developer, or to the operator, 'cause they then have a new thing to support. Or, what unfortunately happens even more is you end up with a break in the process where the developer does stuff, they hand things to the operator, the operator now owns it, and you lose collaboration. The fundamental thing in any good platform engineering effort is to improve collaboration between your teams. Period. That is it. If you lose sight of that, you can be in a case where the developer productivity is over-optimized and they're doing things that create work for operators, or the operators can be overindexed on setting the environments and not being collaborative and then forcing everybody into a standard practice, the way it used to be a long time ago. And both of those are anti-patterns. Ideally, it's collaborative.
0:20:29.9 Yadin Porter De León: No, you're right. You can create a platform, in this platform engineering model, and you can have all the different modes in which developers want to operate. But if that platform either isn't communicated or doesn't support the way in which the developers wanna operate, or the evolving ways the developers want to operate, it doesn't matter if you built this dedicated platform engineering team, because the developers are just gonna go do what they want anyways, and you're creating these silos and you're creating this dysfunction. So as a technology leader in a business too, how am I creating that type of collaboration? And where have you seen it? 'Cause you've worked with a bunch of companies. How are you seeing them creating that collaboration so that there is both understanding, but also the experience, the function is there so that the platform engineering team is making the right choice the easy choice for developers, which is a popular security maxim: how do we make the right thing be the easy thing?
0:21:27.4 Rob Hirschfeld: Fundamentally, we're talking about building a catalog, and sometimes when people talk about product definitions inside of platform engineering, the way I really like to think about it is the operations team, or the platform engineering team is building, documenting, and making accessible, ideally via an API, the production environment that they wanna run. This is where infrastructure... I haven't even said infrastructure as code yet, but this is...
[chuckle]
0:21:53.6 Yadin Porter De León: We have to insert some more buzzwords so that we can attach them as hashtags.
0:21:56.7 Rob Hirschfeld: This is the... [laughter] We'll get there, we'll throw in some of those GPT...
0:21:58.6 Yadin Porter De León: We'll get there. Yeah.
0:22:00.7 Rob Hirschfeld: Or something like that. No, here's the idea. And it ends up being very simple. The operations team already knows what it wants to be, its operational environment. They don't always do a good job of defining what that is or making that accessible via an API. They're like, "Oh, I can set you up a production application environment and I know how to do it and I have a runbook for it and it's a whole bunch of tickets and all this stuff." That falls apart because developers can't easily access that, they have to ask for one, and it's really expensive to build. In a platform engineering environment, what you do is you say, "I'm gonna automate that build process and make it infrastructure as code." That's good practice. But then I'm gonna make it so that I can build those environments cheaply and quickly. And then, instead of taking a developer and saying, "Hey, we're gonna have to arrange a big contract for getting your code into production," you can start to say, "Developer, I'm gonna give you a production environment, or a production-like environment, that the operations team actually owns the contract for. And they can ensure the security and the compliance and the operating systems and all the versions," and they can actually set that and then hand it to a developer as a playground.
0:23:14.5 Yadin Porter De León: And so then you're getting that consistency that you need across from an experience standpoint, but also from a policy standpoint, like you said, across whether it's security compliance, all those pieces too. Especially, I think we're on that too, what kind of measurable improvements would an organization likely see in terms of things like security and compliance if they're adopting this platform engineering team?
0:23:35.3 Rob Hirschfeld: Security can be a hard thing to measure. [chuckle] But...
0:23:38.8 Yadin Porter De León: We're gonna solve it on this show. We're gonna solve... Yeah. We're solving...
0:23:40.2 Rob Hirschfeld: We're gonna solve it. We're gonna fix it. A lot of times the issue is not, "Can I make an environment more secure?" It's how much it costs me to secure that environment. "Do I have the security personnel? Do I have the time? Do I have to fix the conformance? Do I have exceptions that I have to deal with? How many people are trying to fight how many fires?" And so, what you actually get out of this is that those pristine correct environments that you have taken the time to secure and enforce, those are the ones that get replicated. And that makes a tremendous difference in security because it means that your team can focus on securing it, automating it, can get compliance, can get checks, and you can actually take that burden.
0:24:23.2 Rob Hirschfeld: If your development teams are following these patterns, then they do not then have to own security in that environment. So it takes security off of your developers, it means your security teams and all the compliance and all the other work that goes in with that, those teams can now focus on the higher level improvements and you can do more with less. It's hard to know if it's gonna make it more secure, but it can make... And this is where we're talking, it's not just developer productivity, it's my security team's productivity, it's my operations productivity, it's my compliance and reporting productivity, it's all of those things actually get enhanced by the collaboration of having those teams working together.
0:25:04.0 Yadin Porter De León: And I like one thing that you said too: the environment that they're creating and automating, which is consistent, they're giving to developers as a playground, and if that's a good experience, then like you said, that gets replicated. And going back to making the right thing the easy thing, the proliferation of good habits always has a profoundly positive effect on the business. If you say, "This is great," and everyone says, "That's the way I wanna go," because that's the easiest, best experience, and it's also what the platform engineering team has created underneath, so it's consistent from a policy perspective and a security perspective, and then that gets replicated throughout all the activities of the developer teams, then you've got something that is, at the very least, consistent. If you need to change policies, you can change them, that's a different thing, whether or not you're getting the policies right. But at least you're having the right behaviors being adopted by the ones who are creating the experiences, the value for the business that actually ends up being in the hands of the customers or the hands of the employees.
0:26:06.6 Rob Hirschfeld: And I would add something that's really important here that I've seen a lot of teams miss. The transparency of that build environment is also really important, because it's very hard, it's actually impossible, to get that environment to fit for everybody. So you're gonna end up with a couple. And a lot of companies find that they end up with an ever-increasing number of those environments. And one of the reasons that happens is because the developers who are using those environments have trouble seeing or influencing how they get built.
0:26:39.8 Yadin Porter De León: When you say transparency, you're talking about that's back to collaboration and, maybe stakeholder buy-in, is that what you're focusing on?
0:26:47.3 Rob Hirschfeld: That's exactly right. What you wanna be able to do is if you have a team that has advanced knowledge or wants to be able to influence the environment, you wanna build those environments in a way that has extension points into operations for the developers to collaborate with. That could be, "Oh, I can inject an Ansible playbook into here," as long as I see what it is, you can add in post configuration into my environment. Or you can add specific configuration. We call this infrastructure pipelining. And the idea here is that you can actually take a standard process and then extend it in transparent ways. To make that work, the teams consuming the resource, the infrastructure, have to be able to see the whole process and have to be able to watch it happen when they want to. Some of them don't care and they're just like, "Give me the system, and I'm done." Great. That's easy.
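As a rough illustration of the extension points Rob describes, the sketch below models a standard provisioning process that a consuming team can extend and inspect. It is not RackN's implementation; the `Pipeline` class, the step names, and the `custom_kernel` extension are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Machine = Dict[str, str]
Step = Callable[[Machine], None]


class Pipeline:
    """A standard provisioning process with named extension points (hypothetical)."""

    def __init__(self, standard_steps: List[Tuple[str, Step]]) -> None:
        self.standard_steps = standard_steps
        self.extensions: Dict[str, List[Tuple[str, Step]]] = {}

    def extend(self, after: str, name: str, step: Step) -> None:
        """Register a team-supplied step to run right after a standard step."""
        if after not in [n for n, _ in self.standard_steps]:
            raise KeyError(f"no standard step named {after!r}")
        self.extensions.setdefault(after, []).append((name, step))

    def plan(self) -> List[str]:
        """List every step in run order so consumers can see the whole process."""
        names: List[str] = []
        for name, _ in self.standard_steps:
            names.append(name)
            for ext_name, _ in self.extensions.get(name, []):
                names.append(f"{ext_name} (extension)")
        return names

    def run(self, machine: Machine) -> Machine:
        """Run each standard step, then any extensions attached to it."""
        for name, step in self.standard_steps:
            step(machine)
            for _, ext_step in self.extensions.get(name, []):
                ext_step(machine)
        return machine


# Hypothetical standard steps owned by operations.
def install_os(m: Machine) -> None:
    m["os"] = "linux"


def harden(m: Machine) -> None:
    m["hardened"] = "yes"


# Hypothetical extension supplied by a development team.
def custom_kernel(m: Machine) -> None:
    m["kernel"] = "team-custom"


pipeline = Pipeline([("install_os", install_os), ("harden", harden)])
pipeline.extend("install_os", "custom_kernel", custom_kernel)
print(pipeline.plan())
print(pipeline.run({}))
```

The `plan` method stands in for the transparency requirement: anyone consuming the environment can see the full process, including extensions, before it runs.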
0:27:34.3 Yadin Porter De León: Yes. "Here's my code running. I don't care how."
0:27:36.4 Rob Hirschfeld: But this falls apart on the hard cases. The easy cases are easy, but it falls apart on the places where some of the team is like, "Yeah, that's true, but I need a specialized BIOS configuration," or "I need to inject my own... I need a specialized kernel for this." That's where this stuff falls down, and then you end up, they don't conform and now you've got the hard cases not conforming, and doing their own thing and now owning their whole stack. That ends up being the default. We need to make it so that those teams can collaborate with operations in the standard practices in ways that don't disrupt, don't tip the whole tower over because they just need a couple of changes.
0:28:11.0 Yadin Porter De León: Yeah, 'cause then that stifles innovation. If I'm incented to do things that conform too rigidly, then I'm not gonna go ahead and try and be creative, try and be interesting, try things new, and then that then leads to the talent gap where how am I gonna attract the top talent if I'm just creating a shop where you're needing to do things that fit in a specific box, whether that's the infrastructure driving code rather than the other way around? Yeah. I guess, is that collaboration as code? Is that the collaborative code journey? What is that... Has your marketing team come up with something catchy, Rob?
0:28:44.9 Rob Hirschfeld: We call that infrastructure as code actually. This is the challenge for us. We see infrastructure as code as treating infrastructure with the same practices you want out of development: reuse, portability, modularity, composition. We do all those things as developers because we want collaborative environments. And so infrastructure as code to us means it's about process and taking an architectural approach. It's not, "Did I use Terraform and codify things in YAML?" That is absolutely not how we see infrastructure as code. That is what a lot of people end up describing it as.
0:29:21.9 Yadin Porter De León: Yeah.
0:29:22.4 Rob Hirschfeld: That's the challenge here is that you have to look at it as this transparent collaborative process, and then make sure that your tools are helping you with that. If they're not helping you with that, then you're asking the wrong questions of your tools.
0:29:34.0 Yadin Porter De León: No, that makes perfect sense. Let me move to what we call our ticket to the board section.
[music]
0:29:41.2 S?: In short, ladies and gentlemen of the board, costs are down, revenues are up, and our stock has never been higher.
0:29:47.5 Yadin Porter De León: And that I think really is, it comes down to whether or not your effort is being successful or measuring how successful that effort is. And I wanted to frame that in. If I'm having a conversation with the CEO or the COO, or if I'm trying to connect different lines of business like developers with operations, with security leaders, I need to be able to then communicate the efficacy of what I'm doing. I'm in an E-staff meeting, and they're like, "Okay, this platform engineering thing, I saw that on your PowerPoint, it looks really, really great," what are some of the tangible and intangible things that, let's say technology leaders should be looking at when they're measuring the success of their platform engineering effort? Is it how many pizzas the teams are sharing together? Is it the lines of, is it how many engineers are not going rogue? What are the, like I said, tangible and intangible ways that you can start to measure and then communicate how effective that platform engineering effort is?
0:30:45.1 Rob Hirschfeld: I've been working on a KPI document for this, and I started with the DORA Metrics, but they're very dev-focused, and they can be hard to measure. And if you're the CIO asking the platform engineering team what they're doing, if you wanna measure improvement, you in some ways wanna start with, you might be starting with things that are hard to measure, which is exactly what you're talking about. I could count pizzas, but that's not that helpful. I'm actually working on a KPI document, it's not entirely ready. People can follow up with me if they want to get access when that posts. But I do have a favorite from the list. I do have a favorite. And I call this "automation half-life." What we see is that a lot... Underpinning a lot of what we're talking about is good automation. And what we see everywhere in the industry is, there is a reliability crisis in automation. Meaning, most people don't believe their automation is very reliable.
0:31:35.4 Yadin Porter De León: And what are they basing it on?
0:31:37.4 Rob Hirschfeld: They're basing it on, when you run your automation, how often does it work? How often do you...
[overlapping conversation]
0:31:43.1 Yadin Porter De León: That's a good thing to base it off of. Does it actually do what you told it to do?
0:31:47.2 Rob Hirschfeld: It's very simple.
[laughter]
0:31:47.3 Rob Hirschfeld: I click the button, does it work? And what we find is that a lot of people have under 80% automation success rates.
0:31:54.7 Yadin Porter De León: That seems really low.
0:31:56.5 Rob Hirschfeld: And one of the KPIs I have is, how often does your automation succeed when you push the button? Are you running around fixing problems because of your automation? You can be like, "Oh, Rob, I'm doing great, I'm at 95%." I'm like, alright, that means that five times out of every hundred that you run something, you're going to be figuring out what's going wrong and troubleshooting a problem, or people are clicking buttons and 1 out of 20 times it's not working. That's not very good. We have an automation reliability problem, but it's exacerbated by this automation half-life. And what I mean by that is, if a script or automation or a process hasn't been run, how long before you will not be confident that it'll keep running? That's the half-life.
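The success-rate KPI Rob mentions can be computed from a simple log of run outcomes. This is a minimal sketch with made-up data; the function names and the 20-run sample are assumptions for illustration, not part of any KPI document.

```python
from typing import List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of automation runs that succeeded (True means success)."""
    if not outcomes:
        raise ValueError("no runs recorded")
    return sum(outcomes) / len(outcomes)


def expected_failures(rate: float, runs: int) -> float:
    """How many runs you should expect to troubleshoot at a given success rate."""
    return (1.0 - rate) * runs


# Hypothetical log: 19 successes and 1 failure, i.e. the "95%" case.
runs = [True] * 19 + [False]
rate = success_rate(runs)
print(f"success rate: {rate:.0%}")
print(f"expected failures per 100 runs: {expected_failures(rate, 100):.0f}")
```

Framing the number as failures per hundred runs, rather than as a percentage, makes the cost of a "good" 95% easier to see.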
0:32:40.1 Yadin Porter De León: No, I think that's really... 'Cause there has to be a lot of that looming around. I imagine, audits around that or review around that. It seems like a heavy lift and a tough thing to become a habit and programmatic within an organization and a team.
0:32:53.5 Rob Hirschfeld: It's a key metric. And the nice thing is you can actually go in as a CIO and you can ask. You can say, "Alright, if I ask you to run a six-month-old Terraform plan or Ansible playbook or a Bash script or a PowerShell script that you wrote, if it's six months old, are you gonna review the code and be nervous? If it's two months old, three months?" You can actually ask people how confident they are that the scripts that they wrote will keep working over a period of time and figure that out. And what's happening is, if your half-life is short, and a lot of people have a one-month half-life here.
0:33:27.5 Yadin Porter De León: Ooh, wow. Confidence is dropping rapidly.
0:33:30.7 Rob Hirschfeld: Well, it has to do with how fast all the things around the scripts change. And this is where... People ask me about this. It's not that the script itself is breaking; it's in an automation environment, especially with infrastructure, the environment around that script often changes and breaks. We have the same thing in a microservices architecture. Somebody changes a service, and they break you because their service changed and you have a cascading failure. Very hard to defend against that.
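One way to act on the half-life idea is to flag automation that has gone unexercised for longer than the team's stated confidence window. The sketch below assumes you record a last-run date per script; the file names, dates, and the 30-day window are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List


def stale_automation(
    last_run: Dict[str, date], half_life: timedelta, today: date
) -> List[str]:
    """Return names of automation idle for longer than the half-life, sorted."""
    return sorted(
        name for name, ran in last_run.items() if today - ran > half_life
    )


# Hypothetical last-run records for three pieces of automation.
last_run = {
    "provision.tf": date(2023, 1, 10),
    "patch.yml": date(2023, 5, 20),
    "backup.sh": date(2023, 3, 1),
}
today = date(2023, 6, 1)

# Assumes the team reports a one-month half-life.
print(stale_automation(last_run, timedelta(days=30), today))
```

Anything on the resulting list is a candidate for a scheduled test run, so that drift in the surrounding environment shows up before the script is needed.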
0:33:54.2 Yadin Porter De León: Yeah. And then, I think the reason why we're going down this road is because when I ask, "What's the elephant in the room right now?" which is the new generative AI... What do you wanna call? Revolution, boom, hype cycle. Is this now something that teams are looking at to increase that confidence in automation, or should technology leaders not even be looking at it that way?
0:34:22.2 Rob Hirschfeld: I'm actually doing some research, and we have some projects at RackN, on generative DevOps and the idea that I can use generative AI to improve the outcomes and durability of my automation. There's an idea of, "Can I write it more easily?" But there's also an idea of, "Can I scan things to see if they're still working or figure out best practices or improve the quality of my automation from that perspective?"
0:34:46.0 Yadin Porter De León: That half-life.
0:34:47.3 Rob Hirschfeld: Correct. And the number one issue, and this is what I say about complexity in general, is the way to combat complexity is through exercise, not through simplicity. You can't un-complex something, but you can exercise it. And so to reduce your half-life...
0:35:02.2 Yadin Porter De León: I like that. No, I like that too, is exercise. You can exercise it. Don't just simplify it. Exercise it. I like that.
0:35:07.7 Rob Hirschfeld: The issue with these scripts is actually if your half-life... A lot of times your half-life problem is the fact that if a script hasn't been exercised in six months, you're gonna have very little confidence it's gonna work. And this is why having a whole bunch of teams doing a whole bunch of different stuff is bad. Operations wants to consolidate. Ideally, you've got two or three platforms, and they're being constantly exercised and used. And your confidence in your automation is gonna be super high because if you hit issues, you're gonna fix 'em really quickly and keep going.
0:35:36.2 Yadin Porter De León: It's like one of my cars. If I don't start the engine every once in a while, the last thing I'm gonna do is pack it full of everything, check it and make sure the engine starts...
0:35:42.8 Rob Hirschfeld: That's a great analogy.
0:35:43.4 Yadin Porter De León: And maybe drive around the block a couple of times.
0:35:46.0 Rob Hirschfeld: It's the same thing with us as individuals. You're not gonna go run a marathon without training for that marathon. And you don't train for a marathon in one day. You work your way up to it. You keep it going. You maintain your fitness level.
0:35:58.8 Yadin Porter De León: To look at generative AI in this context too, it's providing that kind of exercise. It's not, let's say, writing the automation for somebody, or they might have a co-pilot. That's a different topic. But it's exercising the various operations that exist, or automation that hasn't maybe been used in a while so that you can attain the high level of confidence. And that's de-risking that whole deployment model. Now you're de-risking all of the different automation that you got, and it's been around for a long time. That's really interesting is, AI for exercise. Well, Rob, this has been a fascinating conversation. I really appreciate the insight. Tell us where can we find you online, in social or other places, find out what you're doing, more about what you're talking about, or where you're gonna be speaking, etcetera.
0:36:48.8 Rob Hirschfeld: The top of mind one is generative DevOps. I'm actually giving a talk at GlueCon in May, May 23rd, 24th. So I'll be in the Denver area for GlueCon, on that topic specifically. In general, you can find me, I am Zehicle, Z-E-H-I-C-L-E, a handle from my electric car days, on most platforms, and of course, through RackN, if you wanna get a hold of me or read some of the KPI docs and some of our platform engineering research, you can find that at RackN. As a matter of fact, we are very passionate about the operations side of platform engineering, and what we call the operations experience. And we've assembled a list of resources for people, if you wanna be like, "Alright, I'm interested in the ops side here," we've been pulling together interesting articles and thoughts and things like that, other podcasts, items like this, this'll make it on that list, where we can talk about operations in a platform engineering context, not just developers.
0:37:43.6 Yadin Porter De León: Oh, that's fantastic. Well, Rob, thank you for joining the CIO Exchange podcast.
0:37:47.3 Rob Hirschfeld: Thank you for having me. It's been a pleasure.
[music]
0:37:50.5 Yadin Porter De León: Thank you for listening to this latest episode. Please consider subscribing to the show on Apple Podcasts, Spotify, or wherever you get your podcasts. And for more insights from technology leaders as well as global research on key topics, visit vmware.com/cio.
[music]