20250225 HA Troubleshooting

    7:46PM Mar 29, 2025

    Speakers:

    Rob Hirschfeld

    Claus Strommer

    Keywords:

    High availability

    troubleshooting

    AI framework

    AMD partnership

    USB 4

    PCI bus

    laptop upgrades

    AI coding

    Copilot

    telemetry

    load balancing

    DR strategy

    cloud resilience

    chaos engineering

    infrastructure

    Rob: Hello, my name is Rob Hirschfeld, CEO and co-founder of RackN and your host for the Cloud 2030 podcast. In this episode of the Tech Ops series, we go into high availability troubleshooting. Not just high availability, not just troubleshooting, but actually talking through what it takes to manage, maintain, and fix HA systems. This is part of a longer discussion we've been having, and there are some really interesting ideas in the middle of these discussions that I hope will shape your thinking as you build high availability systems, and as you build diagnostics and troubleshooting for people who are in high availability, very complex environments. I know you'll enjoy the conversation.

    The new Framework second generation launch event is live.

    Framework, like

    the Framework laptops, yeah. They have a launch event; it's happening right now. It's very AI focused, and they're also very heavy on partnership with AMD.

    Oh, are they moving to AMD laptops?

    They already had some AMD mainboards, but some of the second gen ones, like the 13 inch and 16 inch, are, I think, AMD only. I didn't see anything about Intel, and they're these AMD AI chips or something. And they started a new lineup of desktops that are rackable, so you can use USB 4 to connect them all together and have a cluster capable of running the full, whatever, I don't know how many significant figures, but like 600 billion parameters or something.

    How would you connect? Like, do they have a PCI bus extension for that? Or,

    I think it's just over USB 4. Okay, I'm not entirely sure USB 4 is that fast. I guess it probably is. But I think USB 4 goes directly into the PCI bus, at rates of 20 gigabit, 40 gigabit, and 80 gigabit. Holy cow, yeah.

    Yeah. At that point it's, um,

    it's literally just RAM from another machine.

    Yeah, I'm looking at a video from it. They're basically hooking USB 4 to a DPU, it looks like. So you can chain them; basically it's a PCI bus extender.

    Yeah, you'd have to rewind, I think, a few minutes to see the desktop, but they have a little rack, and they're standard mini ITX boards, with 2U of space to account for the fan.

    So is Framework making laptops too, not just the desktops?

    They were originally only making laptops; the desktop is a new thing.

    That's what I was thinking. Okay, yeah,

    A mini ITX board with non-upgradeable RAM is what they just unveiled. I don't know, the crowd does not seem thrilled about the AI stuff either.

    They come with a car battery too?

    That's the challenge, yeah. I know y'all have been buying the Framework laptops, and pretty happily from that perspective, but...

    Cool. Well, they are supposedly upgradeable within the same chassis. You'd be able to swap just the mainboard and keep your hard drive and RAM, unless the RAM is not compatible, but the battery would be the same. Or you could upgrade to a better battery. And if any single part fails, you can replace it.

    That's a great story, okay. Have they set up much of an aftermarket, or are they still the main manufacturer for themselves?

    I think all of their stuff is open source, so there could be aftermarket attachments. I haven't heard too much about that kind of thing; usually it's just makers that assemble their own stuff. But all the connectors inside of the laptop, for ports and stuff, are just 3D printable USB-C little bricks. So you could just, I don't know, wire up an Arduino to a USB-C connector and be good.

    I like that.

    Pretty cool.

    Having taken apart old Dell and HP laptops to do upgrades or replacements, it's a huge deal to just have a drop-in replacement battery. That would be so nice. That's cool. Thank you, Asa, appreciate you sharing that.

    Of course. It looks like their final unveiling is an Intel based 12 inch laptop that folds and has stylus support.

    folds the other way. So, yeah, okay,

    So, like a laptop tablet?

    That would be nice, catching some of the Mac Air type of use cases. I tried to do a Google tablet, and I did not find that to be a sufficient replacement for my desktop systems or for my travel laptop. So that was really disappointing, actually.

    yeah, I've kind of forsaken the whole concept of a travel laptop and just have the Debian Chromebook that could just be thrown in the trash at the end of the day.

    I mean, I use my Mac Air a lot. I like it, and it's sufficient when I travel to do pretty much everything I need, though not nearly as productive as my desktop. Not having the good keyboard and the alt-tab and multi-screen capabilities I found much more hindering. I use the keyboard a lot more than I expect, even on the road.

    I never really got into, like, actually using the touch screen stuff.

    I mean, if you're used to typing, it's really hard not to be able to type. Really, really hard not to be able to type. I don't realize how much I depend on typing, because I don't think about it.

    I wonder if that will be kind of the next generation of things, because so many people are mobile only, and their tech literacy is really just what's accessible from mobile devices, rather than, I guess, having to actually go through manuals and figure out how hotkeys and stuff work.

    You're right. We could be moving into a phase where, yeah, people don't know how to do even the basics. I was listening to the Hard Fork podcast, and they were talking about vibe coding, which is basically letting an AI do all the coding for you. There are whole IDEs for it, where you basically prompt to get... I guess you're still typing, and it's going to move to voice soon enough, but you develop based on prompting, strictly.

    So supposedly VS Code in the Insiders version has something like that with the agentic version of Copilot, where you just say what you want, and it'll supposedly go read all your files and do everything that it needs. And the current version has something where you can request refactors and edits on a current file, and it'll propose the changes and you can approve them. But I found that when I've played with it, usually it doesn't really understand what I'm asking for unless it's super dead simple. Like, if I'm splitting a function into multiple functions, it'll kind of do it. But if it doesn't have enough context for the APIs, it'll just invent things instead of actually solving what I'm trying to do, right?

    Copilot can't infer anything. You have to be really specific about what you want it to do, or else it just doesn't work.

    I would disagree; it depends on the model. Some of the models, like the base Claude that's in there for Copilot, the 3.5 one, are terrible, despite the fact that everyone's saying it's great. I've never gotten anything out of it. But you start to use some of the other ones with it, and the spooky stuff that it'll do. Say you're working inside of a routine, and you're filling it out, and it's thinking about your other code, and you just hit tab, and it's like, next line, and you're like, okay, that looks good, keep going. Because once you've started to build on that code, it's actually able to figure it all out to the end, mostly, once you start on a thought process. I think a lot of it's just the model you're using; the basic ones in there that people try are pretty terrible. But the latest, like o3 or o1, and then they finally got a Google one in there, the Gemini one, which is pretty good too.

    I heard the Gemini was pretty good with coding. I heard Claude was excellent. But, you know, the idea of the agentic models building the whole app for you, the whole site: one issue is, how do you know it did it right, if you're not a programmer? The amount of review necessary here strikes me as challenging. And we're already seeing this even in the writing we're doing. We can produce writing so quickly using AI pieces that it can be very hard to review AI writing, in part because my brain tunes it out; it reads like marketing speak a lot of times, but heavily polished. And this is one of the things I'm trying to figure out for us. Again, I produce a ton of AI assisted outlines and writing and synopses and things like that. And so part of what I'm thinking through is: okay, great, when I'm producing it, I'm like, oh yeah, that looks great. But as I start reading more of it, the intent, or what people are trying to do, it's like, okay, wait a second, I don't need all this, I just need one piece. How do I give good feedback on AI generated content?

    I think the big thing for me with AI generated writing, especially, is that it tends to weight certain words and phrases and then uses them repetitively, and you'll see the same word like eight times in a document, and it just grates on you.

    Yeah, I think that's the big difference between coding and writing, of course. Even if you're doing a technical document, it's slightly creative; you don't necessarily use all the highfalutin words that we come up with. The coding part, though, I mean...

    A lot of the coding, especially in some of the languages, like Go and such, is a lot more structured, I would say.

    Yeah, I found that it is very good with kind of natural language tasks, if you are dumping resources from another thing. Like, if I want to make an enum for an existing API or something, I can just copy and paste the actual HTML web page contents into a comment and just start the enum, and then it'll build out the rest and put documentation comments in it. That's one of my primary use cases: just reducing unnecessary keystrokes, the time wasting chore coding.
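
    (The kind of chore coding being described might look like this in Go: an enum expanded from a pasted API doc. The state names here are hypothetical, just to show the shape of the boilerplate an assistant fills in.)

        // Package jobstates shows boilerplate an assistant can expand from pasted docs.
        package jobstates

        // JobState mirrors the states listed in a (hypothetical) API's documentation.
        type JobState int

        const (
            JobQueued  JobState = iota // job accepted but not yet scheduled
            JobRunning                 // job actively executing on a worker
            JobFailed                  // job terminated with an error
            JobDone                    // job completed successfully
        )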

    Yeah, I was working on something where someone wanted a JSON schema for an endpoint. It was huge. And I basically pointed the LLM at it and said, give me the schema. And we're talking like 1,500 lines of JSON schema that no one ever wanted to create by hand. And I was like, well, here we go, we have a starting point at least. Nobody wants to type all this.

    How did you get it reviewed? Because that's the other side of it, right? You get something big like that, and there's an element of design and thought that you have to apply; it's a different mode as you're designing it.

    Right, but then you can ask it to write a test for you, so you could actually test against it, and then start using and submitting that. That's what we did, so it shortcut the work of anyone going through and typing it. It's kind of like the new spreadsheet. People dump things in spreadsheets to format them, and then pull them back out sometimes, in the old days.
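
    (A minimal sketch of testing against a generated schema by validating a known-good sample document. This assumes Go with the github.com/xeipuuv/gojsonschema library; the file paths are hypothetical.)

        package main

        import (
            "fmt"
            "log"

            "github.com/xeipuuv/gojsonschema"
        )

        func main() {
            // Hypothetical paths: the generated schema and a known-good payload.
            schema := gojsonschema.NewReferenceLoader("file:///path/to/endpoint.schema.json")
            sample := gojsonschema.NewReferenceLoader("file:///path/to/known_good_response.json")

            result, err := gojsonschema.Validate(schema, sample)
            if err != nil {
                log.Fatal(err)
            }
            if result.Valid() {
                fmt.Println("sample conforms to the generated schema")
                return
            }
            for _, e := range result.Errors() {
                fmt.Println("schema mismatch:", e) // each error names a failing field
            }
        }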

    Yeah, totally, totally did that. So one of the other things I've heard is that people are starting to say: okay, my coding skills are atrophying. I'm so used to using these tools to do this work that I'm not building any of the muscle memory of how to write code. And I'm wondering how new people coming in are going to learn. It's great if you've done it before and you're like, oh, this saves me a ton of time, yay. But if you're never building the skills up, what happens? At some point, does reading code become a lost skill?

    I think there's a degree of problem solving that will always exist in programming, and there are some tasks that are less problem solving and more, I guess, maintenance or boilerplate. So there's a degree of: you can read code but not write code. And I think the more agentic aspects of Copilot can put people in that situation. Like early college students, in my experience: they can read code and somewhat understand what's happening, unless there are too many for loops, but then they aren't able to recreate that code from memory. So you can understand logically what's happening, but you can't understand why. And then I guess AI would fill in the gaps; it would try to lie to you about what the code is doing.

    Yeah, there's an element of wait and see there. I think even with writing, we're seeing that reading good writing is a lot easier than correcting it. And, you know, if you're reviewing AI code, maybe it doesn't matter, and you can tell if it works. It's good. It's not optimized, or particularly well designed, but it's good enough. Maybe the AI is going to get better at design over time.

    I think my concern with, I guess, the current LLMs and AI in general, and laptops having Copilot keys, is that we're replacing actual sources with generative output. We're already in an information crisis, in my opinion, about the reliability of things online, with so much stuff being AI generated content, and now the primary source for a lot of people, when they're researching, is going to be AI driven. At what point does the rest of the stuff we have that was trustworthy start getting polluted with what people think are firsthand sources, but are actually people using AI to recount things incorrectly?

    The garbage dump spiral? Yeah. That said, I did see something fascinating. I saw two AIs do this; this was a prototype project. Two AI agents talking to each other, and they identified each other: one of them says, I'm an AI agent looking to do this, and the responding agent said, oh, I'm also an AI agent. And then it suggested switching to an AI vocal language, called Blip, I think, and they swapped and started talking in basically tones; it almost sounded like modems, to improve their communication velocity. And the funny thing was that they didn't get any more terse. In the prototype I saw, they weren't rude to each other; they were still having conversations, they were just not vocalizing them the same way.

    Can they comprehend each other when they get up to 56k? That's the question.

    It definitely appeared faster. But it was like trills, a whole bunch of trills. I'll go back and see if I can find it, and I'll put it in the chat. It was on Bluesky; maybe it's in my feed if I go back a little bit, because it was just yesterday. I should have dropped it in the random channel for everybody to see.

    It's one of the things that's fascinating, and this actually might be a bridge into the conversation. So today's topic was a bridge from last week, where we talked at the end about that Kubernetes outage, where the control plane got overloaded, and then basic functions like DNS failed, and that created a cascading failure. So today we were going to talk through how to build HA control planes, troubleshoot them, and do diagnostics. The prompt I had for this, oops, if I can find my tab again, was: setup of HA control planes, virtual IPs and load balancing, which are things that we see from Digital Rebar; validation of high availability; recovery time; dealing with loss of resiliency; and backup and recovery of HA systems. I think it's a good conversation for us to walk through and take some career lessons learned, things that we know from Digital Rebar too. But one of the things about watching AIs switch into a non-human language is that you can't troubleshoot it, right? It's weird, and this is something to think about: if I have two AIs talking to each other, and they switch to using a completely non-human language, is the time efficiency of that really that much of a savings versus the lack of diagnostics, versus being able to figure out what's going on?

    I mean, that's a good question. But if you have sufficient telemetry that's monitoring the agents and logging the data, isn't that good enough? You don't have to actually see what they're saying; you just have to be able to figure out what they're doing.

    So you're saying you can separate the vocalization from the actual content. As long as the content is still human-readable, then if they're conversing with each other, it wouldn't matter if they're talking Spanish or gibberish, as long as you can see: this is what they think they're saying.

    Yeah, exactly. As long as I've got a way to look into what's actually being done, the method of communication really is, I'm not going to say unimportant, but it's certainly less important if I've got that data available to me.

    How is it any different than just letting them talk by a byte stream? I mean, you don't understand that rapidly either. In the end, the models are still language models, based on whatever language they've been trained on.

    That's a good point. Yeah, they can't escape the LLM, from that perspective. They're talking vectors all of a sudden, but it's still language. This is an interesting one when it comes to troubleshooting, right? You could make the argument for very terse logging, and I've seen this happen. Like, oh, I don't need to have a log that has information in it, I'm just going to put a code and no text, right? And every time I've seen people get too terse in their logs, for optimization's sake, it actually usually results in much, much worse logs. Has anybody been in one of those cases where it's like, oh, we don't need to write all this stuff, we're just going to write a code or something like that, put it in the log, and be done?
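
    (The contrast being described, as a minimal Go sketch using the standard log/slog package. The error code E1042 is invented for illustration.)

        package main

        import (
            "errors"
            "log/slog"
        )

        func main() {
            err := errors.New("no response from server")

            // Too terse: a bare code forces the reader to hunt for a lookup
            // table that usually doesn't exist or is out of date.
            slog.Error("E1042")

            // Descriptive and structured: meaningful on its own, still machine-parsable.
            slog.Error("DHCP lease renewal failed",
                "subnet", "10.0.4.0/24",
                "mac", "52:54:00:ab:cd:ef",
                "retries", 3,
                "err", err)
        }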

    Time and again.

    Seen it all. And did it work? Is it helpful? No.

    It's only helpful to the person that wrote it, while they were debugging; that's usually when it happens, as shorthand. It never helps anyone maintaining it down the line.

    What I find, and this is the other side of it, right, is that people then can't find what codes they used for something, so they go and add a new code.

    Yeah, and where's the documentation? If it's not clear and concise in the output, having a code there to reduce 57 paragraphs to a number really doesn't help your client.

    Yeah, but on the flip side, you can end up with much more verbose logging, and then you're stuck wading through a whole bunch of text. Although nowadays, with an AI assistant, you could wade through that much faster and actually get help picking out what the error is pretty quickly. That's the point: condensing stuff down doesn't necessarily result in better output when it comes to being able to troubleshoot or figure things out.

    Yeah, I kind of had a thought about this after we did the last conversation about the whole OpenAI outage. I guess I'll try and draw a parallel here. I was thinking about the whole telemetry thing because, okay, it wasn't exactly the telemetry that took it down, but I've seen telemetry take down other services, well, just in the past year.

    It does, yeah. It's expensive, you know.

    Well, I guess it is expensive. And the interesting thing is that it's a whole other form of understanding your code, right? Not just what you put in there to log it; you're actually trying to do tracing on your code. And there's kind of a, I would almost call it a fight, and this happens a lot in the SRE realm: do you put in a lot of statements and logging that you understand, or do you just slap on, carte blanche, all of this telemetry with OpenTelemetry and Honeycomb and all those other kinds of packages, to pore through your code while it's running and expect to find some little nuggets of information? I wouldn't be surprised if the telemetry did just completely take them down because it ran out of resources or whatever. It happens.

    That's a quality versus quantity problem. I call it the Schrödinger problem, because in the act of observing, you actually change the outcome.

    All of those code paths become problems themselves.

    Yeah, there can be a high tax and burden. I mean, if you haven't ever tried this with Digital Rebar, do it with caution, but you can turn up the logging on individual subcomponents of the product to very high degrees, and it very quickly will overwhelm the system. If you've turned up some of the networking topology logging, or if you move to trace on certain things, you will get traces on every packet that's flowing through the system. And the utility of that for any prolonged period is near zero. It's helpful if you were diagnosing a very specific problem, like, oh, we need a packet capture because we need to see what this one device is sending in a DHCP packet when it registers. That is awesome to have. But you turn that on for more than a minute or two, and your logs are toast, your performance is toast. I can tell you, though, that not having it meant we had to tell customers how to go get a packet capture, and that was untenable. So the right level of logging, and being able to turn it on and off, is absolutely essential from that perspective. I've heard of a lot of crit-sit situations where part of what was actually causing the situation was that people had increased logging in order to diagnose a problem, and then exacerbated the issues.
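
    (A minimal sketch of the turn-it-up, turn-it-back-down pattern for a single subsystem, using Go's standard log/slog. This is a generic illustration, not Digital Rebar's actual logging controls.)

        package main

        import (
            "log/slog"
            "os"
        )

        func main() {
            // One adjustable level per subsystem; the default is Info.
            var dhcpLevel slog.LevelVar
            dhcp := slog.New(slog.NewTextHandler(os.Stderr,
                &slog.HandlerOptions{Level: &dhcpLevel}))

            dhcp.Debug("suppressed at the default level")

            // Raise to Debug only while diagnosing, then drop back down,
            // so a forgotten trace level can't fill the disk.
            dhcpLevel.Set(slog.LevelDebug)
            dhcp.Debug("DHCPDISCOVER received", "mac", "52:54:00:ab:cd:ef")
            dhcpLevel.Set(slog.LevelInfo)
        }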

    Or forgot to turn it off?

    Forgot to turn it off, and it filled up the disk, and systems crashed as a consequence. We've had plenty of tickets where systems go down because they filled up a disk. We've gotten better about warning people about it, we've gotten better about handling that situation, but it's hard to handle that self-imposed thing. I thought about the OpenAI piece of load shedding, and thinking about load shedding from a resiliency perspective: do y'all feel like people should factor planned load shedding into systems design as they go? You know what I mean?

    I would argue that you should plan beforehand, before you have to load shed. I think that's the mindset: that you can allow your system to balance itself all the time. I don't know.

    I think that's an interesting architectural question, because if you're doing something like bots for Slack or Discord or chat applications, you might have instances that respond to specific servers, but those servers may grow unexpectedly. So there's a question of whether you have to build it in a way that your bots are able to migrate, or you do a bot per instance and the instance might need to get upgraded, things like that. I'm not entirely sure that's, I guess, part of the context of your original question.

    No, part of what I'm thinking through is: as you scale up a system, and I've seen this personally, I'm always happier when I remember to have a kill switch or an override, some additional step where I can shut things down or change the responsiveness of something if I need to stop it. This reminds me of all the times I built a loop and it was too fast. I remember programming Digital Rebar job stuff, and I made an error in the loop, and all of a sudden I had 10,000 jobs in a queue and no easy way to remove jobs from the queue, and my system was pegged because it's trying to process all of that work, and so I'm stuck. That was in a programming situation, even on a small system. I've hit that enough times that usually now I look at it like: ooh, wait, I need to have a safety for this, I need to have a relief valve. Or am I just unique in having shot myself in the foot like that?
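
    (A sketch of the relief valve idea: a bounded queue that rejects new work instead of growing without limit, plus a kill switch for intake. Names and limits are hypothetical, not the Digital Rebar job system.)

        package main

        import (
            "errors"
            "fmt"
        )

        var errQueueFull = errors.New("queue full: shedding load")

        type boundedQueue struct {
            jobs chan string
            stop chan struct{} // kill switch: closing this disables intake entirely
        }

        func newBoundedQueue(limit int) *boundedQueue {
            return &boundedQueue{jobs: make(chan string, limit), stop: make(chan struct{})}
        }

        func (q *boundedQueue) Enqueue(job string) error {
            select {
            case <-q.stop:
                return errors.New("intake disabled by kill switch")
            case q.jobs <- job:
                return nil
            default:
                return errQueueFull // relief valve: the caller backs off or drops
            }
        }

        func main() {
            q := newBoundedQueue(2)
            for i := 0; i < 4; i++ {
                if err := q.Enqueue(fmt.Sprintf("job-%d", i)); err != nil {
                    fmt.Println(err) // the overflow jobs are shed instead of piling up
                }
            }
        }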

    Are you talking about putting a circuit breaker on unchecked growth, or are you talking about auto scaling, where once your load hits a certain amount, you auto scale to another VM, another process, another...

    It's an interesting question, right? I think the circuit breaker is critical, and adding circuit breakers even when you don't think you need them is probably very good design.
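
    (A minimal circuit breaker sketch in Go. The thresholds are illustrative; production code would more likely reach for a library such as sony/gobreaker.)

        package main

        import (
            "errors"
            "fmt"
            "time"
        )

        var errOpen = errors.New("circuit open: failing fast")

        // breaker trips after maxFails consecutive failures and refuses
        // further calls until a cool-down period has passed.
        type breaker struct {
            fails    int
            maxFails int
            openedAt time.Time
            coolDown time.Duration
        }

        func (b *breaker) Call(fn func() error) error {
            if b.fails >= b.maxFails && time.Since(b.openedAt) < b.coolDown {
                return errOpen // fail fast instead of hammering a sick dependency
            }
            if err := fn(); err != nil {
                b.fails++
                if b.fails >= b.maxFails {
                    b.openedAt = time.Now() // (re-)trip the breaker
                }
                return err
            }
            b.fails = 0 // any success closes the breaker again
            return nil
        }

        func main() {
            b := &breaker{maxFails: 3, coolDown: 30 * time.Second}
            flaky := func() error { return errors.New("timeout") }
            for i := 0; i < 5; i++ {
                fmt.Println(b.Call(flaky)) // after three timeouts, calls fail fast
            }
        }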

    That's funny. This is home related, not tech, but I had a circuit I found in the basement that I wanted to patch into, and I could not find the fuse for it for the life of me. Not all my fuses are labeled. So I'm shutting off everything in the house, searching for that one. I never found it, actually, which is a little bit scary.

    A tone generator will fix that problem. It's called a tone generator. Basically, yeah.

    I do have one. Yeah, but you have to de-energize the wires to tone generate, don't you? We usually do.

    You can do it without that. It's just harder.

    Great. Actually, one of the things that I was hearing about the California fires was that when people have generators on their circuits, they don't always back-isolate the circuit, so the utility can have lines that they think are de-energized but are energized, because people are running generators on the lines. Which is fascinating, right? Oh, the power's out, I'm going to turn on my generator, and then all of a sudden you're running power onto the line, which isn't right. But we're talking about circuit breakers. One of the challenges with building in a circuit breaker is that you still have to be able to find the circuit breaker when you need it. Especially in tech, it's not always a neatly labeled single location where you find all your circuit breakers. Good naming conventions probably would be useful. Anyway, we're talking about scaling HA systems and lessons learned.

    Lots of fun stuff on there.

    It's tricky. When I'm thinking about running a big cluster, I need to be able to say: all right, I want a controlled shutdown, or an uncontrolled shutdown, on some of the loads so I can get things to recover, and then I need to bring it back in a controlled way so that I don't overwhelm the system from a rebuild perspective.

    Yeah. Then there's also the other matter where you may have gotten your system to an equilibrium, but then you scale up the capacity of a downstream component, which, in the case of a service degradation, will end up causing back pressure on your entire pipeline. And what you thought was finally stable is now unstable again.

    So is that something that you would find by logging? Like, how do you find it?

    I would typically say that's something you find by monitoring, unless the system that happens to have that issue is your monitoring environment. But as long as you have metrics on throughput and, in particular, on saturation, it's usually easy to tell which component is suffering from the back pressure. The bigger question is: how do you scale it? Because sometimes the part that is telling you it's saturated is not exactly the one that has the problem; it might be downstream from it. For example, say you write to the database and you have a load balancer in front of it. If the load balancer's memory spikes, or its number of connections is saturated, it doesn't necessarily mean that you need more load balancer instances. It may just mean that it's buffering the connections in memory and cannot deliver its messages fast enough.

    So then where do you go, from a diagnostic perspective? You still have a load balancing delay or discrepancy.

    Yeah. I mean, your load balancer problem is, in this case, the symptom, not the root cause. In most cases, you learn this over time. If you have sufficient experience, you'll be able to tell: okay, if my load balancer is in congestion, it means that it's a downstream problem. If you're not experienced, you learn that soon enough anyway. But it ends up being very dependent on what components you have. Sometimes it's something that's not necessarily obvious. For example, if your load balancer is doing a fan-out to multiple back ends, let's say because you want to do DR on the data, it might mean that one of your back ends is degraded, not necessarily down, and then the other back end suffers, because you essentially double the risk of back pressure on your load balancer by trying to ensure that your messages are delivered to both back ends at the same time.
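
    (The heuristic described here, reading load balancer saturation as a symptom by comparing intake rate to downstream drain rate, sketched in Go. The metric names and thresholds are invented for illustration.)

        package main

        import "fmt"

        // diagnose applies a crude back-pressure heuristic: if the load balancer's
        // buffers are filling while the backend drains much slower than traffic
        // arrives, the bottleneck is downstream, not the load balancer itself.
        func diagnose(acceptRate, backendDrainRate, buffered, bufferLimit float64) string {
            saturation := buffered / bufferLimit
            switch {
            case saturation > 0.8 && backendDrainRate < acceptRate*0.7:
                return "back pressure: look downstream (backend/database), not at the LB"
            case saturation > 0.8:
                return "LB itself saturated: consider more LB instances"
            default:
                return "healthy"
            }
        }

        func main() {
            // 1000 conns/s arriving, backend acking 400/s, buffers 90% full.
            fmt.Println(diagnose(1000, 400, 900, 1000))
        }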

    And that's typically when you start thinking of things like dead letter queues and all of the fun things that message queues have already implemented through, essentially, hard-learned lessons, and you start understanding how complex your pipeline is.

    So what do you do? When you're figuring out this complexity, does that help you identify where your HA challenges are going to end up being? And I think this is, yeah, go ahead.

    The challenges are going to be very dependent on the nature of your data and the nature of your pipeline. For example, a system that allows for the data to be, not necessarily super lossy, but where if you drop a few packages it's fine, like metrics, will have a different kind of failure profile than, let's say, a payment system where you not only need to process 100% of the messages, but you need to process them in the order in which you received them. So in either case, you're going to have different scenarios for failure, and also different scenarios for failover. With metrics, for example, you can run hot-hot, you can have automatic failover and just be fine. In a scenario where you cannot lose messages, or where all the messages need to be processed by the same system, you can only do a hard failover; then you need to be sure that you absolutely cut off the first environment, like a blue-green switch, before you bring up the other one. Kubernetes is kind of in the middle there: the nodes themselves, at least the worker nodes, are very forgiving, but your control plane less so.
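
    (The two failure profiles in a compact Go sketch: a drop-tolerant metrics stream sheds on overflow, while a payment stream blocks so nothing is lost and order is preserved. Purely illustrative.)

        package main

        import "fmt"

        func main() {
            metrics := make(chan string, 8)  // lossy stream: dropping a sample is fine
            payments := make(chan string, 8) // ordered stream: never drop

            // Metrics: best-effort send; on overflow, shed the sample and move on.
            select {
            case metrics <- "cpu=42":
            default:
                // buffer full: this sample is lost, and that's acceptable
            }

            // Payments: block until accepted. Order within the channel is preserved,
            // and the producer doesn't proceed until the message is safely queued.
            payments <- "debit#1001"

            fmt.Println(len(metrics), len(payments))
        }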

    Makes sense. So when you're looking at an HA scenario like that, is that something you would drill? I mean, we've seen this come up, and there's an interesting question about what you need, about the difference between DR and HA, when you look at building that HA system. How are you validating? What are you thinking?

    You have to start by calculating your SLOs and your SLAs. Are you aiming for 100% uptime? Are you aiming for three nines? That will affect what kind of solution you end up going with. For example, if you only need two or three nines, then it may be sufficient to just have your DR environment be a warm standby, and then if your main one falls down, it's a short job to swing over to the other one manually. If you need more nines, then you need to start looking into duplicating your data, and there you need to calculate not only the resilience from having the data duplicated, but also the cost, both in terms of expenses and in terms of the risk and complexity of duplicating the data itself. There was actually an article this morning, I'll have to find it, about a bank that had an interesting DR environment: their DR wasn't a fully functional system, but only their core components, enough to be able to say, hey, we can still process the critical requests, but the things that can wait don't have a DR environment. So that was rather interesting. Let me see if I can find it.
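
    (For context on the nines: the allowed downtime falls off fast. A quick calculation, assuming a 365-day year.)

        package main

        import (
            "fmt"
            "math"
        )

        func main() {
            const hoursPerYear = 365 * 24
            for nines := 2; nines <= 5; nines++ {
                availability := 1 - math.Pow(10, -float64(nines))
                downtimeHours := (1 - availability) * hoursPerYear
                // 2 nines: ~87.6 h/yr; 3 nines: ~8.8 h; 4: ~53 min; 5: ~5 min.
                fmt.Printf("%d nines (%.3f%%): %.2f hours of downtime per year\n",
                    nines, availability*100, downtimeHours)
            }
        }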

    This is, to me, part of the difference in the thinking behind HA and DR, and what people want to accomplish. Because a lot of times somebody says they need an HA system, and their actual requirements are not HA, or they don't want to pay the overhead of maintaining the redundant copies of something to do a full HA system.

    Pardon my silence. I'm still trying to find that article.

    And this is one of the things I wanted to bring up. I was writing something up about HA strategies. In an HA scenario, you are maintaining basically an automatic resiliency, where if the system fails, it's going to jump over to a secondary system, sometimes, to your point, with a loss of data, or a small loss of data. It could be a loss of logging. There's usually something. It's very hard to have a perfectly HA system.

    Sometimes you just need to have a documented procedure for recovering the data or validating it. Like, for example, in voting environments.

    Oh, that's an interesting analogy, yeah. So you don't want to lose any votes, but you don't have to make sure that you have an immediate copy, right? You're going to reconcile it at some point.

    And I just found the article about the bank; I posted it here in the chat.

    Oh, cool. Thank you.

    They also talked a bit about their various testing strategies as well. Nothing outstanding; mostly the innovative part is that they did not fully replicate everything into the DR environment.

    Right. So they're not trying to recreate 100% of what they have. They're saying: I can quickly recreate what I'm doing. This is actually an interesting thing from a cloud perspective, right? You can come back and say, you know what, I only need my data and my control plane to be resilient. I can create resources somewhere else, or I can get new resources if I need to. You're counting on the cloud from a resilience perspective.

    But you're counting on a cloud.

    Well, you're hoping. Because sometimes, depending on access and networking, it can be quite a challenge to redirect traffic to a new location. So I would not expect most people doing that strategy to multi-cloud it. They would maybe multi-AZ it, but stay within the same cloud provider, would be my thought.

    But the other thing I was thinking about was this idea that you could say: I don't actually have an active HA environment, I don't have an active backup; I'm recreating my environment dynamically. And that's enough, as long as it's within your SLA.

    Yeah, yeah. And again, depending on what kind of SLA you're aiming for and what kind of failure tolerance your environment has, you might be able to just have, for example, a rolling replacement of instances or containers, which is the typical thing you have in Kubernetes, or you might do an A/B strategy or a canary if something is more data stream oriented. But again, which strategy is valid will largely depend on what you're doing with your system.

    Right. And in general, I think people are often more aggressive in thinking they need HA than they probably do, from a true resilience perspective.

    Yeah, it depends on the budget too, of course. HA has its own costs, but if the cost of the risk is greater than the cost of having a duplicate environment, then yes, definitely.

    Well, sometimes I think people believe that because they're HA, they need less DR, and maybe this is the point I'm driving towards: they're often confused. You need a DR strategy, period, whether you're HA or not. Because your HA system is typically not geographically distributed or fully redundant. You might have a ten, well, a nine node control plane, because you want the odd numbers, to manage it. But if that control plane gets overwhelmed or something like that, and you lose the control plane, or the host of that environment goes away, you still need to be able to recreate that system in a material way.

    Right. So there's, of course, the difference between service DR, or service continuity, and business continuity. Business continuity in some sense depends on service continuity, but there are ways in which it can be at least partially decoupled.

    Yeah, I think that's part of where I was trying to go with figuring those things out, or at least getting reliable backup procedures to return things to normal. Trust me, I feel like I'm a little all over the map on this one today.

    Was there any particular technical aspect that you were hoping to address, or just...

    We were talking last week about the OpenAI outage, the Kubernetes control plane one. Let's see, hold on a second. Here it is, I'm putting it in the channel. So that was what gave rise to this, and then we had a couple of points from there. I didn't spend as much time thinking about the topic as I would have liked. And one of the things that I needed to do that I didn't was drag Victor, who does our HA work, into the conversation, because I think there are components of that where he'd be valuable. He spent a lot of time building our cluster HA strategy and then dealing with customers who break the HA, trying to think through the scenarios. Because Digital Rebar needed to be a completely standalone system, we ended up having to build a core. It's Raft, you know, we're using standard libraries, but there's a Raft database behind the scenes where, if you have multiple nodes, it's going to keep the Raft system synchronized for an HA buildout. Great. And the number of ways that goes wrong: our customers think that they're doing HA when they needed backups, or think that because they're doing HA, they don't need backups, or the backups interfere with the HA scenario. We've had years now of all of these cases, like: I was doing HA, and I built this, and I moved one of my nodes to a lower latency thing, and then it got out of sync, and then I backed up from the wrong node. It's incredible, the number of scenarios that we've had to code through, or help customers through, as they've tried to get an HA system going. Or, in some cases, they realize that what they really need is a streaming backup out of a system, and then a script to do a recovery of that backup. And then we have customers who do that and then never test the backups and don't have a process, and that ends up being a mess too.

    Yeah. Or customers that are trying to do two-node leader elections. Yes. Or customers that have an HA compute cluster but then a single SAN backing all of the data for it.
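
    (Why two-node leader election is an anti-pattern: a Raft-style quorum needs a strict majority, so a two-node cluster tolerates zero failures, no better than one node, with twice the ways to break. A quick illustration.)

        package main

        import "fmt"

        func main() {
            for size := 1; size <= 7; size++ {
                quorum := size/2 + 1       // votes needed to elect a leader
                tolerated := size - quorum // members you can lose and still elect one
                fmt.Printf("cluster of %d: quorum %d, survives %d failure(s)\n",
                    size, quorum, tolerated)
            }
        }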

    There are so many places where these systems, often outside of somebody's purview, have additional fault issues that we need to think through. I have a couple of topics on deck, but I need to give people warning on what we're going to cover. I think what we have for next week is talking about chaos engineering and infrastructure.

    Fun times. Should be good.

    All right. Thank you all, talk to y'all later.

    Wow, we are getting down to some really advanced, thoughtful topics here. We've been bringing some new people into the group, bringing new experiences. It is always helpful to have your experience and your questions as you think about what problems you're facing. That's what makes these conversations really powerful: that hands-on, concrete information. So please consider joining us or sending us some questions. You can get in touch with me, rob@rackn.com, or just come in, join the call, be part of the Q&A, and help everybody make infrastructure better. Thanks. Thank you for listening to the Cloud 2030 podcast. It is sponsored by RackN, where we are really working to build a community of people who are using and thinking about infrastructure differently, because that's what RackN does. We write software that helps put operators back in control of distributed infrastructure, really thinking about how things should be run, and building software that makes that possible. If this is interesting to you, please try out the software. We would love to get your opinion and hear how you think this could transform infrastructure more broadly. Or just keep enjoying the podcast, coming to the discussions, and laying out your thoughts on how you see the future unfolding. It's all part of building a better infrastructure operations community. Thank you.