Hello. I'm Rob Hirschfeld, CEO and co-founder of RackN and your host for this TechOps series of the Cloud 2030 podcast. In this episode, we deep dive into something seemingly very small that has a lot of repercussions for how you manage and run a data center: test scripts for servers. The idea is, as you're going through a production or provisioning cycle, how do you test? What do you test? This came from a Reddit thread that we answered and then had a whole hour of conversation about, just how important and impactful this type of script is. I know you are going to find it enlightening. Enjoy.
We always start with a little bit of tech news, if people have anything new that's worth sharing.
There was a big bug announcement, trying to fearmonger about some ESP32 Bluetooth-integrated device having a vulnerability that gave everyone remote code access. But it turns out the person publishing the CVE didn't understand that every microprocessor-based Bluetooth device has some way of updating the firmware, and they misconstrued it as some big vulnerability when really it's just standard for those devices.
And so that was going around, and the press picked it up as a panic, or...
Yes. Everyone's like, undocumented commands found in the Bluetooth chip used by a billion devices, and then the company that originally made the CVE shows that there's no actual back door. They had to retract their original post.
Much Ado About Nothing.
You know, I can see the current atmosphere, how little people actually think about it, and how fast people want to jump on that as a news item.
Yeah. The weird thing is, there are so many front doors, it's almost hard to worry about the back door sometimes. That's a good one, thank you, Isaac. Any additional thoughts on the Framework,
the new desktop? Yeah, I think my thoughts are that they went really hard into this AI image. A lot of the makers that were very pro-Framework were following their original ideology of, oh, we don't need to put AI in everything. Even the Framework co-founder was tweeting, oh, you'll find that our website doesn't have a single mention of AI in it. And now they've got all these laptops where you can get the Microsoft Copilot key in your Framework laptop, and all the new CPUs are AMD AI chips. So it's kind of an interesting flip-flop, where maybe you need the investors, or maybe that's just a side effect of the AMD partnership requiring AI be plastered all over everything.
I think I would suspect the latter, definitely. If you're partnering with a CPU provider, they're going to insist, and even Windows probably insists, that you plaster their Copilot messaging.
I sent a link from a little under a year ago from Framework on their Twitter, saying, are we the only laptop maker that doesn't have an AI landing page? Kind of ironic that now they do.
It's a hard thing to brag about, not being AI-enabled. It's funny, because it's not a good look either to be AI-washing everything. So it's a weird balance.
No, I won't talk about that.
Interesting. Other news? I haven't seen a lot of tech news. There was something I posted to random, but I don't think it was that big of a news item. Okay, I'll switch us to the topic of the day and see where we go with it. This was related to a Reddit post of somebody asking what I thought was a legitimate question, and they didn't get much of an answer, so I jumped in to answer it a little bit more. I figured it would be fun to expand on that here and talk about it. Let me introduce the topic. The question was hardware prep scripts: what do you do to condition servers when you bring them on board? And we could extend this in a couple of ways that might make it a more general conversation. What conditioning do you want to do for hardware, really for any OS? Because I would include potentially VMs in this case; there are some places, especially in cloud VMs, where there's testing that you could do to make sure that you have a well-resourced VM that doesn't have noisy neighbors, and concerns like that, or network topology, or just to test and validate things. So the idea is: I've got a machine of some sort, and I want to put it into production, or I'm recovering it to put it back into production. What operations should I consider doing in order to be well set up for that machine being in service? And I'll add the caveat that there are things you can do that are cheap time-wise, and things you can do that are expensive time-wise. I'd actually like to hear both, and then we can talk through what those are. And Klaus, and who else? Oh, and Josh, yeah, both. I'll put a link to the post I'm talking about in the chat
once I find it. Thank you. You're welcome.
Where'd it go? So what do people think, while I get the post?
That was a good post. One point that I might add is that you emphasize LLDP for network discovery, and I know that there are people out there still using CDP, even though it's not as good a protocol. Might be something to consider.
I'm not even familiar with CDP at all. What is it?
Cisco Discovery Protocol? It's used mostly for neighbor discovery with Cisco networking. And the Cisco engineers love it, of course, but nobody else still does.
Does it work like LLDP, where it requires you to have a broadcast message, or what are the prereqs?
Yes, that's right, it's a network broadcast.
Okay, and I'm assuming only for Cisco, or...
No, it's got fairly wide industry support, but it's also kind of a deprecated protocol. I think it came out in the early 2000s, and it was real popular up until, I don't know, maybe 2015 or 2016, and then LLDP has sort of taken over since then. But still.
The thing about LLDP, which we build in as a standard scan option, is that it's sort of slow. If you're always going to do an LLDP scan as part of a normal workflow, it can take a bit of time, and even worse, if the switches you're attached to aren't using LLDP and you're doing the LLDP sweep, you're waiting on a timeout. I'm trying to remember if we have it on by default; I thought we did, but it can add in that extra wait for the timeout. But if you're trying to troubleshoot which switches you're connected to, it is a really powerful diagnostic to know, hey, I'm attached to this switch on this port, and to be able to confirm that quickly. From a diagnostic perspective, it's probably one of the simplest diagnostics.
Yeah. It's also really good to be able to use for classification.
What do you mean?
Well, let's say, for instance, that you have a switch and it's broadcasting LLDP. Based on the information that you can gather on the host, you're going to know what the switch is and what switch port you're using. And there's a name field in the LLDP broadcast that carries the actual switch name, which will let you know exactly what piece of equipment you're connected to. And if you're laying out your data centers in a fairly consistent way, so this row has this layout and the row next to it has the exact same layout, then all you have to do is go, oh, this switch name is row four. So if I've discovered four rows of machines, I can use that to go, all right, I can set my name, I can set a couple of different things based on the LLDP response that I get back. That's one of the things we were doing at Microsoft, using that LLDP response to coordinate the host naming.
Oh, so you could actually build the host name from that. That would make a ton of sense as part of the sweep. So you could say, okay, when I get this information back, it tells me what the name of this host should be. And that's programmatic. That's awesome from a hardware setup perspective. And then you could actually come back and make part of your check: is the name of my machine correct? Does it match the naming convention? Then you know immediately if somebody had botched your switch config or rewired something.
Yeah, exactly.
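A minimal sketch of what that LLDP-driven classification could look like, assuming the discovery image carries lldpd so `lldpctl -f json` is available, and assuming a hypothetical switch naming convention like "row4-rack2-tor1" (the convention and derived hostname format are illustrative, not what any particular shop uses):

```python
#!/usr/bin/env python3
"""Derive a proposed host name from the LLDP neighbor announcement.

Sketch only: assumes `lldpctl -f json` (from lldpd) exists in the discovery
image and that switch names follow a hypothetical "rowN-rackM-..." scheme.
"""
import json
import re
import subprocess


def chassis_name(chassis: dict):
    # Newer lldpd versions key the chassis block by the advertised system
    # name; older ones emit {"name": {"value": ...}}. Handle both loosely.
    if "name" in chassis:
        name = chassis["name"]
        return name.get("value") if isinstance(name, dict) else name
    return next(iter(chassis), None)


def first_neighbor(data: dict):
    """Return (switch_name, port_id) from lldpctl JSON, or (None, None)."""
    ifaces = data.get("lldp", {}).get("interface", [])
    if isinstance(ifaces, dict):      # single-interface output is a dict
        ifaces = [ifaces]
    for entry in ifaces:
        for _ifname, body in entry.items():
            switch = chassis_name(body.get("chassis", {}))
            port = body.get("port", {}).get("id", {})
            port_id = port.get("value") if isinstance(port, dict) else port
            if switch and port_id:
                return switch, str(port_id)
    return None, None


def proposed_name(switch_name: str, port_id: str) -> str:
    # Hypothetical convention: switch "row4-rack2-tor1", port "Ethernet12"
    m = re.match(r"row(\d+)-rack(\d+)", switch_name)
    ru = re.search(r"(\d+)$", port_id)
    if not (m and ru):
        raise ValueError(f"unrecognized {switch_name}/{port_id}")
    return f"r{m.group(1)}-k{m.group(2)}-ru{int(ru.group(1)):02d}"


if __name__ == "__main__":
    raw = subprocess.run(["lldpctl", "-f", "json"],
                         capture_output=True, text=True, check=True).stdout
    switch, port = first_neighbor(json.loads(raw))
    if switch:
        print("proposed hostname:", proposed_name(switch, port))
    else:
        print("no LLDP neighbor seen (switch may not be sending LLDP)")
```

The same derived name can be re-checked later against the machine's actual hostname to catch a botched switch config or re-cabling, as Rob suggests.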
Would you also then fix the IP addresses, like use that as a mechanism to do static IPs, or is that too much? Is that a bridge too far?
No, actually, I can talk about this now, since I don't work there anymore. One of the things that we used...
Please don't give out any proprietary information.
I came up with this design, so I can talk about it. It's mine. The way that we had the systems laid out in AVS was that each rack was identical to the rack next to it, so we knew exactly what the RU was. RU 1 would be the first system, RU 2 would be the second system. And we could use the switch port, because we were consistent: the system in RU 1 was Ethernet 1, the system in RU 2 was Ethernet 2, and so on. So we could use that value to determine whether or not our cabling was correct. If the switch said it was supposed to be the system in RU 4, and we compared that to the host inventory that we pre-loaded, and it said, no, that system is RU 3, then we either had a cabling problem or a labeling problem. That helped us hone in very quickly when bringing up new data centers on whether or not our cabling was correct.
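A sketch of that cabling check, under assumptions: the pre-loaded inventory file, its format, and the "Ethernet N maps to RU N" convention are illustrative stand-ins for whatever a real provisioner would store as machine parameters.

```python
#!/usr/bin/env python3
"""Cabling check: does the switch port we are plugged into match the RU we
expect for this serial number?

Sketch: expected-ru.json and the port-to-RU convention are hypothetical.
"""
import json
import re
import sys

# Pre-loaded inventory: serial number -> expected rack unit, e.g. {"ABC123": 4}
EXPECTED = json.load(open("/etc/prep/expected-ru.json"))


def ru_from_port(port_id: str) -> int:
    # Convention from the discussion: system in RU1 is cabled to Ethernet1, etc.
    m = re.search(r"(\d+)$", port_id)
    if not m:
        raise ValueError(f"cannot parse port id {port_id!r}")
    return int(m.group(1))


def check(serial: str, lldp_port_id: str) -> bool:
    expected_ru = EXPECTED.get(serial)
    actual_ru = ru_from_port(lldp_port_id)
    if expected_ru is None:
        print(f"{serial}: not in inventory, flag for manual review")
        return False
    if expected_ru != actual_ru:
        print(f"{serial}: expected RU{expected_ru} but cabled as RU{actual_ru}, "
              "cabling or labeling problem")
        return False
    print(f"{serial}: cabling matches (RU{actual_ru})")
    return True


if __name__ == "__main__":
    # In practice the serial comes from dmidecode and the port from the LLDP scan
    sys.exit(0 if check(sys.argv[1], sys.argv[2]) else 1)
```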
That makes a ton of sense to me. That was one of the things we had worked to do on an early version of Digital Rebar. I don't think we ever recreated it; this is back in the Crowbar days. We actually built a rack view, assuming that LLDP would give us the type of information you're describing. We had a graphic that would show you the switches and the servers that were mapped into each port on the switch as a visualization. In practice, we just don't see people having that type of discipline, Tom, to make the visualization that helpful.
In smaller sites, I can probably see that, but when you're talking about a large-scale system installation, that consistency is key. Otherwise you're just shooting yourself in the foot.
No, and one of the things that we do see, back to system validation, is customers pre-populating machine information and inventory data, including expected subnets, all sorts of network topology, LLDP information, and then part of the bring-up process in the validation step is to actually compare the discovered values against the expected values and stop if they don't match. That has saved huge amounts of time and money, because if somebody hasn't connected your network right, or you can't reach the internet on a port, or a subnet mask is wrong, which is one that has burned people quite a bit, then you get immediate feedback, and the system will just stop and say, wait a second, something's misconfigured here. And it's cheap, comparatively, if you know what things are supposed to be. To your point about being so regimented, you could pre-populate the whole schema for the rack, and as things come online, if they don't match the schema, you know something shifted in transit, or somebody fat-fingered or misinstalled a network drop. For a design like that, Tom, could you just do a visual inspection and see that? I guess you couldn't easily tell that a wire was in the wrong place, unless it was like a bespoke wiring harness.
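A minimal sketch of that validation step, comparing what the booted node actually sees against pre-populated expectations and returning non-zero so the workflow stops. The expectations file, its fields, and the interface name are assumptions; a real provisioner would carry these as machine parameters.

```python
#!/usr/bin/env python3
"""Bring-up validation: compare discovered network facts against expected
values and stop the workflow on mismatch.

Sketch: /etc/prep/expected-network.json is a hypothetical stand-in,
e.g. {"subnet": "10.20.4.0/24", "gateway": "10.20.4.1"}.
"""
import ipaddress
import json
import socket
import subprocess
import sys

EXPECTED = json.load(open("/etc/prep/expected-network.json"))


def discovered_ipv4(iface: str = "eth0") -> str:
    # Parse `ip -j addr` (iproute2 JSON output) for the first global IPv4
    data = json.loads(subprocess.run(
        ["ip", "-j", "addr", "show", iface],
        capture_output=True, text=True, check=True).stdout)
    for entry in data:
        for addr in entry.get("addr_info", []):
            if addr.get("family") == "inet" and addr.get("scope") == "global":
                return f"{addr['local']}/{addr['prefixlen']}"
    raise RuntimeError(f"no IPv4 address on {iface}")


def main() -> int:
    errors = []
    ip = ipaddress.ip_interface(discovered_ipv4())
    if ip.network != ipaddress.ip_network(EXPECTED["subnet"]):
        errors.append(f"subnet mismatch: got {ip.network}, expected {EXPECTED['subnet']}")
    # Cheap reachability checks: one ping to the gateway, one DNS lookup
    if subprocess.run(["ping", "-c", "1", "-W", "2", EXPECTED["gateway"]],
                      capture_output=True).returncode != 0:
        errors.append(f"gateway {EXPECTED['gateway']} unreachable")
    try:
        socket.getaddrinfo("example.com", 443)
    except OSError:
        errors.append("DNS resolution failed")
    for e in errors:
        print("VALIDATION FAILED:", e)
    return 1 if errors else 0   # non-zero exit halts the bring-up here


if __name__ == "__main__":
    sys.exit(main())
```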
I mean, it was straightforward wiring. Microsoft was having all their racks built at WWT, where the initial burn-in was done before they got shipped to the data center, so it was dependent on them to do their own internal verification process before it came to us. Which is not to say that we didn't occasionally get racks that were totally misbuilt, first thing on a Monday morning or something.
Having that type of burn-in is one of the reasons why we have all these burn-in steps. One of the things I talked about in the post was the idea of using a memtest or a disk test, any of these burn-in scripts, because we have them, we've had them forever. But people typically don't want to add them; it takes hours, sometimes days, to run these burn-in tests. And unfortunately, you get very little feedback while they're running; they don't have a lot of incremental progress updates. So the systems are doing something, and you have no idea if they actually worked or failed. Here's the irony, right? You're doing a burn-in test, and you don't know if the system failed until it comes out the other end of the tunnel. That's been a frustration. At the end of the day, we haven't seen a lot of customers want to turn it on. They count on their integrator, like WWT, to do the burn-in before the rack is shipped.
Yeah, but there's an argument to be made that if you're going to be doing regular reprovisioning of a system, you should probably run some kind of burn-in every time you pass it through the provisioner.
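A sketch of what a short, time-boxed burn-in pass could look like, assuming `memtester` and `fio` are installed in the discovery image; the sizes and runtimes are illustrative knobs, not recommendations, and the disk portion is destructive, so it only belongs before the OS is (re)installed.

```python
#!/usr/bin/env python3
"""Time-boxed burn-in pass: a quick memory soak plus a short disk stress.

Sketch only: assumes memtester and fio exist in the discovery image.
"""
import json
import subprocess
import sys


def run(cmd) -> bool:
    print("+", " ".join(cmd), flush=True)
    return subprocess.run(cmd).returncode == 0


def memory_pass(mb: int = 1024, loops: int = 1) -> bool:
    # memtester <size> <loops>: exercises a slice of RAM, not all of it
    return run(["memtester", f"{mb}M", str(loops)])


def disk_pass(device: str, seconds: int = 300) -> bool:
    # Short random read/write stress. WARNING: writes to the raw device,
    # so only run this before the OS is installed.
    return run(["fio", "--name=burnin", f"--filename={device}",
                "--rw=randrw", "--bs=4k", "--iodepth=16", "--direct=1",
                "--time_based", f"--runtime={seconds}", "--group_reporting"])


def data_disks():
    # lsblk JSON output: pick whole, non-removable disks
    out = subprocess.run(["lsblk", "-J", "-d", "-o", "NAME,TYPE,RM"],
                         capture_output=True, text=True, check=True).stdout
    return [f"/dev/{d['name']}" for d in json.loads(out)["blockdevices"]
            if d["type"] == "disk" and d["rm"] in (False, 0, "0")]


if __name__ == "__main__":
    ok = memory_pass()
    for dev in data_disks():
        ok = disk_pass(dev) and ok
    sys.exit(0 if ok else 1)
```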
Like, what would you think would be normal? Because, right, you don't want to spend a day or two days; you don't want the system offline for a couple of days.
Oh no, no, but in my mind, what I see is that when you pull a system back from regular use and you want to run it back through the reprovisioner, you're going to run identification and classification on it, and you're going to compare that to the existing record, and if you find a discrepancy, then that's going to trigger your burn-in test.
Now go ahead
Today, we can't do that, because as soon as we boot a system into Sledgehammer, we overlay the existing machine record with what we find on the system. So there'd have to be something, I don't know if we'd do it in the database or prior to committing the machine record from Sledgehammer, to examine the existing record and determine if there were, in fact, discrepancies there. It's a thought, but I haven't really thought through the logistics of how that would work.
That's one of the dilemmas we find, actually. I'd be interested, and I don't know if we have all the right people here: how much time are you willing to take a system offline for during the cycle? From a test perspective, what would you expect to find from a burn-in on a regular test? Why do the burn-in?
Well, that's a good question.
I mean, right, what do you want to catch? What are you going to catch?
Yeah, based on hardware statistics, right? We're going to, at some point, power cycle the server, and the statistics show that about 85% of your hardware failures are going to occur during a power cycle or power-on, because of thermal stress, because of power surges in the circuits. So that's why we need to identify during that rediscovery phase whether or not there's a difference in the equipment, because it has happened, and if we just overlay the machine record with what we find in Sledgehammer, we don't really know that there's a problem, and we can't then do any further testing to uncover it.
So the test should happen after racking the machine.
Yeah, I get that on a new discovery, that's perfect, everything's fine. But I'm talking about during normal operation or routine: we're taking a system out of production and re-running the discovery phase because we have to reprovision it, because we want to install firmware updates, because we want to put a newer version of the OS on, whatever the reason. We want to rerun that discovery phase and then go, oh look, it's different from the way it was before. There's less memory showing up, we don't have as much disk, something's wrong. We need to go back and do some more diagnostic testing to figure out what's wrong.
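A sketch of the compare-before-commit idea Tom describes: diff a freshly discovered snapshot against the previously saved record and flag shrunken memory or missing disks before overwriting the record, so a discrepancy can route the machine into deeper burn-in. The saved-record path and field names are hypothetical stand-ins for whatever the provisioner actually stores.

```python
#!/usr/bin/env python3
"""Rediscovery drift check: compare a fresh inventory snapshot against the
previously saved record before committing the new one.

Sketch: the record path and fields are illustrative.
"""
import json
import subprocess

RECORD = "/var/lib/prep/last-inventory.json"


def snapshot() -> dict:
    mem_kb = 0
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                mem_kb = int(line.split()[1])
    disks = json.loads(subprocess.run(
        ["lsblk", "-J", "-b", "-d", "-o", "NAME,SIZE,TYPE"],
        capture_output=True, text=True, check=True).stdout)["blockdevices"]
    return {
        "mem_kb": mem_kb,
        "disks": {d["name"]: int(d["size"]) for d in disks if d["type"] == "disk"},
    }


def drift(old: dict, new: dict) -> list:
    problems = []
    if new["mem_kb"] < old["mem_kb"]:
        problems.append(f"memory shrank: {old['mem_kb']} -> {new['mem_kb']} kB")
    for name, size in old["disks"].items():
        if name not in new["disks"]:
            problems.append(f"disk {name} missing")
        elif new["disks"][name] < size:
            problems.append(f"disk {name} smaller than before")
    return problems


if __name__ == "__main__":
    current = snapshot()
    try:
        with open(RECORD) as f:
            issues = drift(json.load(f), current)
    except FileNotFoundError:
        issues = []        # first discovery: nothing to compare against
    if issues:
        print("DRIFT DETECTED, route to burn-in:", *issues, sep="\n  ")
    with open(RECORD, "w") as f:
        json.dump(current, f)   # only now commit the new record
```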
And I think that's critical, right? What you're talking about is you've taken a system out of service intentionally for a routine scan, check, patch, update. We've seen that's a very good practice for customers, and during that time, what you're saying is that's a good time to beat on it. Spend a little bit more time reviewing it, testing it, and scrubbing it, because you'd rather know that there's something wrong with the system before you put it back into service and put a production workload on it. That's fundamentally what we're saying, right?
Yeah, that's exactly right. I'd much rather be ahead of that problem than have a customer come to me with a production system and go, hey, this is not right. You sold me this allocation, and you're only giving me a smaller allocation, and I'm mad now.
Right. Well, for that reason, running a memtest, which you could probably do pretty quickly, might actually reveal that there's an issue. And I think when we look at some of the AI systems, where they're very sensitive to a bus outage or other types of failures, if you can nudge a system and find out that it's got a failing component before you put it back in service, especially in a larger cluster where you could break the whole cluster by having that failure in between checkpoints, that's a pretty valuable thing to include in your recycle. Even more interesting if you had a fleet of machines and you were testing them at the same time, since it's a parallel operation.
Are the systems' internal sensors, or self-diagnostic capabilities, (a) ubiquitous enough and (b) reliable enough to be complementary, or to in part obviate the need to do a more extensive burn-in test? Let's say you had a system that has been online for a year, so far everything is good, and you're collecting the onboard diagnostic data, so you have your baseline. You do your maintenance, let's say a firmware upgrade, which means you restart, you power cycle. Can you then compare the new onboard diagnostic data to the previous baseline to identify drift there?
That would make sense to me. I'll give you a concrete example that we've seen in the field with SSDs: we had a customer where there was a firmware bug in the SSDs, and they were not correctly recovering space. The wear cycle was being incorrectly calculated, so the total available SSD space was decreasing much faster than it was supposed to. In a case like that, comparing the expected values against the baseline would have flagged the problem faster than the systems in production running out of disk space, which was what actually happened.
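A sketch of that baseline comparison using smartctl's JSON output, along the lines Klaus suggests: store the wear and power-on counters at each maintenance window and flag drives that are aging unusually fast, or whose counters went backwards, which would suggest tampering. This assumes smartmontools 7+ for `--json`, shows NVMe field names, and the wear threshold is purely illustrative.

```python
#!/usr/bin/env python3
"""SMART drift check: compare wear/power counters against a stored baseline.

Sketch: assumes smartctl --json (smartmontools >= 7); NVMe log fields shown,
SATA attributes differ by vendor.
"""
import json
import subprocess
import sys

BASELINE = "/var/lib/prep/smart-baseline.json"


def smart(device: str) -> dict:
    out = subprocess.run(["smartctl", "--json", "-a", device],
                         capture_output=True, text=True).stdout
    log = json.loads(out).get("nvme_smart_health_information_log", {})
    return {k: log.get(k) for k in
            ("percentage_used", "power_on_hours", "power_cycles")}


def compare(device: str, old: dict, new: dict) -> list:
    issues = []
    for key in ("percentage_used", "power_on_hours", "power_cycles"):
        before, after = old.get(key), new.get(key)
        if before is not None and after is not None and after < before:
            issues.append(f"{device}: {key} went backwards ({before} -> {after})")
    # Illustrative wear threshold, not a vendor recommendation
    if (new.get("percentage_used") or 0) - (old.get("percentage_used") or 0) > 5:
        issues.append(f"{device}: wear increased unusually fast since baseline")
    return issues


if __name__ == "__main__":
    device = sys.argv[1] if len(sys.argv) > 1 else "/dev/nvme0n1"
    current = smart(device)
    try:
        with open(BASELINE) as f:
            baseline = json.load(f)
    except FileNotFoundError:
        baseline = {}
    problems = compare(device, baseline.get(device, {}), current)
    print("\n".join(problems) if problems else f"{device}: within baseline")
    baseline[device] = current
    with open(BASELINE, "w") as f:
        json.dump(baseline, f)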
Was that the issue Samsung had on one of their EVO drives, or wasn't it?
I don't think it was Samsung. It might have been another drive, but it was one of the main manufacturers. We had to go through and do a patch, and the patch is a mess: automatable, but a very slow sequence of events. And that, to me, is in the realm of a hardware test script, so I would include RAID and BIOS checks and patches in that, making sure that systems are at the patch levels you expect. It's not the same as a stress script, but it's definitely part of an operation you want to do while the system's offline, and then you actually have to test it. Sorry, I shouldn't laugh, this isn't funny, it's sad, but a lot of times we find that when you apply RAID and BIOS updates, you still have to do another reboot and a cycle to test that the BIOS applied correctly. It is incredibly common for a BIOS update or patch to not take because of existing settings or prerequisites that the BIOS only raised warnings for. So you might think you're happily applying a BIOS patch and it doesn't take, because it was too big a jump from where you were, or some setting you had in the system conflicted with that BIOS, and you have to walk through the changes in a more linear way. That is very common.
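A minimal sketch of the post-reboot check Rob describes: after flashing, read the running BIOS version and compare it to what the workflow intended to apply, failing loudly rather than continuing if the update silently didn't take. The target version would come from the update job's parameters.

```python
#!/usr/bin/env python3
"""Post-reboot check that a BIOS update actually applied.

Sketch: reads the running version with dmidecode and compares it to the
version the workflow intended to flash (passed in as an argument).
"""
import subprocess
import sys


def running_bios_version() -> str:
    # dmidecode -s bios-version prints just the version string
    return subprocess.run(["dmidecode", "-s", "bios-version"],
                          capture_output=True, text=True,
                          check=True).stdout.strip()


if __name__ == "__main__":
    target = sys.argv[1]                  # e.g. "2.19.1" from the update job
    actual = running_bios_version()
    if actual != target:
        print(f"BIOS update did not take: running {actual}, expected {target}")
        print("Likely causes: version jump too large, or a conflicting setting.")
        sys.exit(1)
    print(f"BIOS at expected version {actual}")
```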
To address your point, Klaus, one thing you have to remember is that a lot of the internal diagnostics are really good, but it's also highly dependent on the manufacturer. Dell does a really good job with their internal components, but Supermicro, not so much. So you have to consider that going in.
Yeah, that's why my first question was, are they ubiquitous and reliable enough? Because, again, from my experience, which is several years dated now, it wasn't back then. I was hoping that perhaps it might be better now, but from what you tell me, perhaps not. And that's not even getting into the situation where the self-diagnostic data might lie, like what Seagate is now experiencing with their hard drives, where some unscrupulous vendors are erasing the diagnostic data or resetting it in order to resell used drives as new.
Oh, I don't even know how you would detect that. They probably burn new serial numbers too.
Yeah, it's been going back and forth for a couple of weeks now. The SMART data is definitely unreliable. Seagate at one point claimed that their other diagnostics, like the deep diagnostic tools, were able to provide accurate data and could tell if the drive had gone through more cycles than what SMART was claiming. But looking at more recent articles, it seems like some of that data is also getting spoofed. So who knows?
Huh. So that's a third party basically reconditioning, trying to put reconditioned systems back into the mix. Yep. I could see, one of the things I was talking about from an AI automation perspective, and doing a type of scan like this on AI systems, is that you could be in a situation where somebody lifts cards out of a cluster and assumes that you just won't notice if you have nine cards instead of 10 cards in your AI training servers. So being able to inventory and check would be a big deal, including down to the model number. I tend to think about this question strictly as a stress and performance check, but you can also check to make sure that the model numbers and serial numbers of the components you have in the system are not being changed.
I mean, unless you put a fuse in the firmware that burns if the model number was changed, I don't know. I don't know what the solution is right now for them.
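A sketch of the accelerator inventory check Rob raises, confirming GPU count, model, and serials against an expected manifest with nvidia-smi's CSV query output. The manifest file is a hypothetical stand-in for wherever expected inventory lives, and as noted in the discussion, serials can still be spoofed at the firmware level, so this only catches the easy cases.

```python
#!/usr/bin/env python3
"""Accelerator inventory check: confirm GPU count, model, and serials match
the expected manifest (catching a lifted card or a silent substitution).

Sketch: expected-gpus.json is hypothetical, e.g.
{"count": 8, "model": "NVIDIA H100 80GB HBM3", "serials": ["165012...", "..."]}
"""
import csv
import io
import json
import subprocess
import sys

EXPECTED = json.load(open("/etc/prep/expected-gpus.json"))


def installed_gpus():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,serial", "--format=csv,noheader"],
        capture_output=True, text=True, check=True).stdout
    return [(name.strip(), serial.strip())
            for name, serial in csv.reader(io.StringIO(out))]


if __name__ == "__main__":
    gpus = installed_gpus()
    problems = []
    if len(gpus) != EXPECTED["count"]:
        problems.append(f"expected {EXPECTED['count']} GPUs, found {len(gpus)}")
    for name, serial in gpus:
        if name != EXPECTED["model"]:
            problems.append(f"unexpected model {name}")
        if serial not in EXPECTED["serials"]:
            problems.append(f"serial {serial} not in expected inventory")
    print("\n".join(problems) if problems else "GPU inventory matches")
    sys.exit(1 if problems else 0)
```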
Well, there's actually an interesting question to me about whether this stress test could include a performance test, where you could actually run RAM read/write or disk read/write tests against the different drives. I introduced this as stress, but you could do a performance check and make sure the system is within performance metrics too, and make sure that the vendor hasn't substituted something since you bought it, or that a piece of equipment isn't in a configuration where it's not performing. Sometimes BIOS patches and updates can impact overall system performance, and being able to test that could make a big difference in what you're actually doing, or tell you if the system is degrading. It's really fascinating. Usually I think of systems going into the data center and not being molested, not being changed, not degrading, pretty much. But I think the actual history is that once systems are in place, there are a lot of potential places where they could change. If you're talking about edge locations, I think everything we're talking about is magnified by ten, where somebody could rip off components of your system, could take out a DIMM, because those become much more accessible for people to walk in and modify or steal hardware components.
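A sketch of that kind of performance floor check: run a short, read-only fio job and compare the measured bandwidth against the expected floor for this hardware class. The 60-second job, the thresholds file, and the default device are illustrative; a real check would pull expectations from the machine's inventory class.

```python
#!/usr/bin/env python3
"""Performance regression check: short fio read test vs. an expected floor.

Sketch: /etc/prep/perf-floors.json is hypothetical, e.g. {"read_MBps_min": 2000}.
"""
import json
import subprocess
import sys

THRESHOLDS = json.load(open("/etc/prep/perf-floors.json"))


def measure_read_bandwidth(device: str, seconds: int = 60) -> float:
    out = subprocess.run(
        ["fio", "--name=perfcheck", f"--filename={device}", "--readonly",
         "--rw=randread", "--bs=128k", "--iodepth=32", "--direct=1",
         "--time_based", f"--runtime={seconds}", "--output-format=json"],
        capture_output=True, text=True, check=True).stdout
    job = json.loads(out)["jobs"][0]
    return job["read"]["bw"] / 1024.0      # fio reports bw in KiB/s


if __name__ == "__main__":
    device = sys.argv[1] if len(sys.argv) > 1 else "/dev/nvme0n1"
    mbps = measure_read_bandwidth(device)
    floor = THRESHOLDS["read_MBps_min"]
    if mbps < floor:
        print(f"{device}: {mbps:.0f} MB/s is below the {floor} MB/s floor, "
              "flag for deeper diagnostics")
        sys.exit(1)
    print(f"{device}: {mbps:.0f} MB/s, within expectations")
```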
Yeah, I would say an edge location is more chaotic or unpredictable. Maybe not so much now with SSDs, but again, I remember the time when you needed to pay special attention to the damping on your racks, otherwise the hard disk vibration in your rack could rattle everything apart. Oh yeah.
So, I mean, you have to think about the fact that even in the most secure location, bit rot is a thing. And there are a number of reasons why that happens, but maybe a regular performance test is something that's indicated, especially if you're using really high-performance equipment and depending on it.
So is it worth taking the system offline? We're saying systems coming offline on a routine basis is good ops. So as part of that routine scan, if you're running a high-performance system and you're actually validating that the system is still performing, that's probably a very good use of some downtime, because you're ensuring that the system's performance matches expectations. In one of these parallel calculations, the slowest performer determines the speed of all, for shared calculations.
True, but then it becomes a question of, how fast do you need that system back in production? What's your turnaround time? If it's something you really require to be short, then maybe this is something you only do every so often. But if you have the availability, then maybe you make it part of your regular process.
Well, you could do it in parallel across a cluster. So either you're taking one machine out at a time, or a small block of machines at a time, as a cycle, which should make sense, or you're doing the whole cluster in parallel, because a lot of the tests we're talking about are machine-local, so it's not as if there's a real bottleneck constraint on doing these tests in batch or bulk. I know Hadoop is not what it was, but I think some of the AI systems are similar. We had a faulty NIC, and this is a real story: we were building a Hadoop system, and in the field, one of the NICs was faulty, like the cable got crimped, or there was something that caused one network link to be bad, in a 200 to 300 node cluster, and the whole cluster's performance degraded by 50% because of that one NIC. They were all tied up waiting; they couldn't finish their batch until all components had reported in. So that one system degraded the whole cluster's performance, and the diagnosis of it took weeks.
Yeah, the chain is only as strong as its weakest link,
And even if you're just talking about a virtualization system, or virtualization hosts, one component that has a fault can degrade the whole system, especially if it's a transient issue. That really speaks to it. After this, I'm getting more and more in favor of more aggressive system tests as part of a bring-up and check process, because it really does save. Especially when it's automated, it's not a particularly time-intensive action, but it can save a lot. We're talking even potentially doing network performance tests on each NIC, right? That seems like a low-hanging-fruit item to me. If I were performance testing, each NIC should be able to do that in a pretty straightforward way. I'm looking to see if we have that in the list of tests we ship out of the box.
Here it is,
We have CPU burn-in. I don't know if we have an out-of-the-box network performance test. Interesting.
It wouldn't be hard to do, at least to tell you if any one of your links was misconfigured. Greg or Victor might know; we've got something somewhere that I'm just missing.
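A sketch of a per-NIC check in that spirit: verify each interface negotiated the expected link speed via ethtool, with an optional iperf3 throughput test against a test target. The expected-speed map, interface names, and the iperf3 server address are assumptions, and parsing ethtool's text output is version-dependent.

```python
#!/usr/bin/env python3
"""Per-NIC sanity check: confirm negotiated link speed, optionally push
traffic at an iperf3 target to spot a bad cable or misconfigured port.

Sketch: EXPECTED_MBPS and IPERF_TARGET are hypothetical.
"""
import json
import re
import subprocess
import sys

EXPECTED_MBPS = {"eth0": 25000, "eth1": 25000}   # hypothetical expectations
IPERF_TARGET = "10.0.0.250"                      # hypothetical test server


def link_speed(iface: str):
    out = subprocess.run(["ethtool", iface], capture_output=True, text=True).stdout
    m = re.search(r"Speed:\s*(\d+)Mb/s", out)    # e.g. "Speed: 25000Mb/s"
    return int(m.group(1)) if m else None


def iperf_gbps(bind_ip: str) -> float:
    # Optional throughput test, bound to the NIC's IP, when a target exists
    out = subprocess.run(["iperf3", "-c", IPERF_TARGET, "-B", bind_ip,
                          "-t", "10", "-J"], capture_output=True, text=True).stdout
    return json.loads(out)["end"]["sum_received"]["bits_per_second"] / 1e9


if __name__ == "__main__":
    failed = False
    for iface, expected in EXPECTED_MBPS.items():
        speed = link_speed(iface)
        if speed != expected:
            print(f"{iface}: negotiated {speed} Mb/s, expected {expected}")
            failed = True
        else:
            print(f"{iface}: link OK at {speed} Mb/s")
        # iperf_gbps(<this NIC's IP>) could be added here per interface
    sys.exit(1 if failed else 0)
```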
Cool. Other thoughts on stress testing or performance testing systems as part of the bring-up process? I feel like we're covering a lot of it.
I have a question. My assumption is that you're taking the system offline and performing the performance tests as a dedicated pass. Could we do these as part of a diagnostic cycle on the production systems, like just run this stuff inside of the production OS?
Are you talking about like a blueprint out of the runner?
Yeah. Well, it'd be instead of having to reboot. My assumption, in DRP parlance, would be that you would take the systems and cycle them back to Sledgehammer, which is usually a reboot. It doesn't always have to be, but assuming a reboot into Sledgehammer to run the diagnostics, you have a dedicated environment. Sledgehammer doesn't have to modify the disks at all; it can do all that work without breaking the installed system. So you would run the diagnostic tests, check the RAID and BIOS, do all the health checks, and then just reboot back into the production workload. But could you practically do that from inside the production OS, just kick off those blueprints?
Depending on the test, because some of them can be quite intrusive and you don't want to affect a running production machine to that level. But a test like this you could probably get away with fairly easily.
Yeah, some type of gohai inventory scan, and just check the space, RAID and BIOS, and things like that.
Go ahead. Yeah, if you had the runner occasionally do a system scan during operation and compare that against the machine record to see if there were any differences, and make a notification or alert based on that.
That's what I'm thinking. It would functionally end up feeling more like drift detection. But yeah, you could run a gohai scan, make sure you're within the specs, test the RAID and BIOS, get a reading on the firmware, which is useful just to see if there's been an intrusion. The RAID and BIOS piece, or the BIOS and firmware pieces, you can do out of band, which is even less intrusive. But there's an element where reboots are not bad for systems; it's not bad to take systems through a regular reboot cycle, assuming you have the automation and it's reliable.
You should be patching; OS patch cycles should be, I would think, quarterly if not faster, along with checking RAID and BIOS. And the nice thing about being in Sledgehammer is that if you have to apply RAID and BIOS updates, you're in situ to make that change right there, to patch whatever firmware gives you the tooling for it. That's what I mean: we were sort of moving towards a, hey, while I have the system, I might as well do a more intrusive scan, because it might shake something out that I want to know about while I'm in that operational mode, and then put it back into service. The idea of running that while the system's in production is potentially risky. Oh, that's an interesting point.
Yeah, potentially risky. I mean, if you, for instance, trigger bad memory while you're doing a memory scan with the system up and running, you're going to force a crash. Nobody wants that.
Yeah, there's that. And maybe that's part of what we're talking through: the delta between a scan check and putting stress on the system is pretty small, so it would be better to be intentional that this is a diagnostic moment. We have a customer who's trying to eliminate reboots, because reboots include a memory test, and they can be very slow, especially for big-memory systems. So they've been asking us to do all the patching using out-of-band tools while the system is operational, thus not needing a reboot. And we keep pushing back, saying, look, if we have a fully automated process, it's going to add a reboot into this, but we're going to have better control during that cycle to check, fix, and assert. There's a whole bunch of diagnostic work you can do that you can't do when you're just making an attempt out of band. So we've been going back and forth, where this team wants faster, faster, skip that reboot, and we're like, if you have a really automated process, you're optimizing the wrong thing. If you're trying to make these cycles as short as possible, spend the time, do a little bit more work, is where we keep coming back to. It's not a lot of time; that's the thing that makes me scratch my head. I mean, a reboot can be an hour just in the firmware diagnostic sweeps, so it can be a pretty expensive cycle. But you can get into a case where you have to do two reboots to do a firmware change. This is especially useful for all the RackN people: one of the things that makes going through Sledgehammer really powerful is that if you're applying a BIOS or firmware patch and it takes multiple reboots, you keep coming back to a system where you can check to see that it took. If it takes three reboots or three patches to do in a linear cycle, we can do that, because we stay in Sledgehammer until all the changes are done and the workflow is complete, and then you go back into the other system. The alternative of trying to do this strictly out of band means that you have to catch the fact that your first applied BIOS patch didn't finish, and then either not boot the system, or boot into Sledgehammer while you're doing that work, or go back to the system and then reboot it again. So you've got a lot more things to keep track of during this cycle where you're applying a patch, and a lot of patches do require more than one reboot to fully apply, so the savings of "I only apply it right before I reboot" is pretty minimal, all things considered. If you're really that worried about it, you should turn off the memory scan during reboots and take the risk. But does that make sense? Yeah, go ahead.
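A sketch of the "stay in Sledgehammer until the firmware sticks" control loop Rob describes. The idea is that the workflow engine re-runs one check step after every reboot; the step decides whether to apply again, finish, or give up. The vendor flash command, the attempt-counter file, and the exit-code convention are placeholders, since those are vendor- and workflow-specific.

```python
#!/usr/bin/env python3
"""One step of a multi-reboot firmware loop, run after each reboot.

Sketch: checks the running BIOS version and signals the workflow whether to
apply again, stop with success, or stop with an error.
"""
import json
import subprocess
import sys

TARGET = sys.argv[1]            # desired BIOS version, e.g. "2.19.1"
STATE = "/var/lib/prep/bios-attempts.json"
MAX_ATTEMPTS = 3                # some updates legitimately need several passes


def running_version() -> str:
    return subprocess.run(["dmidecode", "-s", "bios-version"],
                          capture_output=True, text=True,
                          check=True).stdout.strip()


def attempts() -> int:
    try:
        with open(STATE) as f:
            return json.load(f)["attempts"]
    except FileNotFoundError:
        return 0


if __name__ == "__main__":
    current = running_version()
    if current == TARGET:
        print(f"BIOS at {current}, workflow can leave Sledgehammer")
        sys.exit(0)                          # done: reboot into production
    n = attempts()
    if n >= MAX_ATTEMPTS:
        print(f"still at {current} after {n} attempts, needs a human")
        sys.exit(2)                          # hard stop, don't keep flashing
    with open(STATE, "w") as f:
        json.dump({"attempts": n + 1}, f)
    print(f"attempt {n + 1}: at {current}, applying update toward {TARGET}")
    # Placeholder for the vendor-specific flash tool invocation:
    # subprocess.run(["/opt/vendor/flash-bios", "--version", TARGET], check=True)
    sys.exit(10)   # placeholder convention: ask the workflow to reboot back
                   # into Sledgehammer and re-run this step
```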
With DDR5, I thought the longer reboots were actually more or less required, because they run at such a faster clock speed that if you don't have the memory training, it drastically reduces the reliability. So this might be something that we're going to be seeing more often, or people are going to start choosing not to use DDR5 if they don't need to, just so they can avoid those long reboot times.
I think the point I'm coming back to is, if you've got this well automated and you have a routine to do it, then just build in the capacity to handle the reboot time. It shouldn't matter. You see what I mean? Like saying, oh, I'm trying to shave 10 minutes off of my rebuild cycle because my reboots are really slow: a lot of times you're going to need two or three weeks anyway, so you're not helping yourself that much, and you might be adding more complexity. Better to just say, I have an automated process, I don't care as long as it finishes every time I start it, I don't need to optimize that. That's my point. Does that make sense, Isaac?
Yeah, I guess in the grand scheme of things, if you're running a system with hundreds of gigs of RAM... my system with DDR5 reboots in like two minutes, which is crazy slow, and it's only 64 gigs of RAM. I can't imagine systems with maybe five to ten times that amount of RAM; it might be 10 to 15 minutes, maybe 20 minutes, depending on how many times you're rebooting. I guess if you have enough compute, then it's okay, but it's still more time.
I mean, the thing that I keep coming back to in this conversation is that adding a couple more things into that cycle time, if you're diagnosing and fixing an issue, is probably worth making the cycle a little bit longer, because proportionally it's pretty small, and troubleshooting and fixing an issue before it hits you in production is really valuable, because production outages are expensive. You have to think about it as: I have an automated system, and as long as it's reliably putting stuff through, I don't really care. The human-scale problem is that people say, oh, I hate waiting for the system to reboot that extra time. But if you're doing this on a regular basis, you're just like, okay, it's just part of the cycle time.
So the balance, to me, is you want to default to very reliable automation. If you've got a process that's a little bit faster but is unreliable to any degree, then you are adding humans into the loop; you're ultimately going to have to have people troubleshooting, fixing, and recovering from whatever the problem is. It's a better system experience if the process is slower but more reliable; trading off the reliability is where the discussion goes.

Thank you for listening to the Cloud 2030 podcast. It is sponsored by RackN, where we are really working to build a community of people who are using and thinking about infrastructure differently, because that's what RackN does. We write software that helps put operators back in control of distributed infrastructure, thinking about how things should be run and building software that makes that possible. If this is interesting to you, please try out the software. We would love to get your opinion and hear how you think this could transform infrastructure more broadly, or just keep enjoying the podcast, coming to the discussions, and laying out your thoughts on how you see the future unfolding. It's all part of building a better infrastructure operations community. Thank you.

This comes back from the Crowbar days. One of the problems with building automation like we build, and the Crowbar days taught us this lesson going back almost 15 years now, is that any time Crowbar broke, it was our fault, regardless of what caused the break. And nine times out of ten it was DNS, or it was some type of network configuration problem. Sometimes it was burn-in issues or systems being misconfigured. But the thing that finds the problem is always the thing at fault; just as a rule of thumb, it's always the first culprit, because you have the murder weapon in your hand. I'm like, I just walked in and found it, I didn't commit the murder. But that's sort of how things are interpreted, right? If you're finding the error, or being blocked by the error, you caused the error, until proven otherwise. So one of the things that we've evolved over time is additional diagnostics that help you pinpoint what the error is and identify its cause, and those have a very high return on investment. This whole conversation, in a way, is reinforcing that idea: taking some time to run additional diagnostics, do additional tests, stress test the system, anything you can do to shake out problems in a narrowly defined way so you can say "that's a problem" and report it, will translate into much better system performance and a much better human experience, ultimately. In the field, a lot of times humans who are used to doing it a different way will resist the idea that those extra checks are creating long-term value. And the reason I wanted to tell the story is that the more we are able to do those diagnostics and checks as part of normal system operations, the less we end up with people coming back saying, the system broke, you did something that broke it. So a lot of the design we do, a lot of the pre-testing that we want to do, comes back to this: if we pre-test, discover things that are happening, and stop and tell people that's what happened, you're going to spend a lot less time defending the system. Oh, this turned into a little lesson.
So everybody, we're out of time, but a little Digital Rebar history for y'all.
Cool. All right, I figured this topic would be an hour topic. So thank you all, that's it, we're done. Thanks, Rob. All right, hey everybody. Thanks, Rob, that was helpful. See everybody later. Bye, see ya. Thanks. Bye.