20240611 TechOps HA

    9:18PM Jul 5, 2024

    Speakers:

    Rob Hirschfeld

    Claus Strommer

    Keywords:

    system

    cluster

    ha

    logs

    metrics

    high availability

    service

    sla

    failover

    data

    nines

    nodes

    pacemaker

    monitoring

    infrastructure

    raft

    sense

    disaster recovery

    downtime

    outage

Hello, I'm Rob Hirschfeld, CEO and co-founder of RackN and your host for the Cloud 2030 podcast. In this episode of our TechOps series, we dive into high availability, and we don't just treat high availability as a good thing. We take an operations perspective: we look at places where you may be over- or under-committing to high availability, where you may be confusing disaster recovery for high availability, and perhaps even protecting the wrong service or looking at it the wrong way. We cover all of these scenarios with practical, hands-on examples that I know you will get a lot out of. I'll switch us over to the topic of the day, because I'm actually really curious to talk about it. In a lot of ways, this is good prep for talking about HA clusters, because the idea of coordinating and monitoring systems, I think, is core to HA and HA clusters. So what I was hoping we would talk about today is building good HA systems, and what people need to know about them. Along those lines, one of the things in our journey with RackN is that a lot of customers thought they needed very aggressive HA systems, but once they were actually confronted with the overhead of setting up and maintaining an HA system, which we should define, you really have to ask if you needed it. I'm going to spend a minute framing this out, and then I'd love to turn it over to hear opinions and thoughts. We started with basically an active-passive HA implementation using a third-party monitor, I'm blanking on the name, it's on the tip of my tongue, that would watch for when the system failed and then spin up the second system. So you basically had a live streaming backup to the failover system, and when failure was detected, you would just spin up the new one. That required... Victor, what was the name of the monitoring tool?

    I have honestly forgotten it.

    I'm going to kick myself when it comes back in my brain and

actually read old documentation to find it. Okay.

Was it Pacemaker?

Pacemaker, and the associated suite of cluster management utilities,

and so Pacemaker would do the monitoring, and then fail over for us, which is awesome. And then the customers said, wait a second, we don't want any external dependency. So we rewrote the HA system to implement a consensus cluster behind the scenes, which meant you needed a minimum of three nodes, an odd number, to distribute the data. The software behind the scenes elects a leader and handles moving the VIP, and all those things became internal functions. But the overhead of having three running systems is much higher, and the complexity too. Go ahead.
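To make that leader-election-plus-VIP pattern concrete, here is a minimal Go sketch using HashiCorp's raft library (which comes up again later in this conversation). The VIP address, interface name, and the already-configured *raft.Raft node are illustrative assumptions, not RackN's actual implementation.

```go
package failover

import (
	"log"
	"os/exec"

	"github.com/hashicorp/raft"
)

// watchLeadership moves a virtual IP to whichever node wins the election.
// raft.LeaderCh delivers true when this node gains leadership and false
// when it loses it; the consensus layer makes the decision, so no external
// monitor (like Pacemaker) is needed.
func watchLeadership(r *raft.Raft, vip, iface string) {
	for isLeader := range r.LeaderCh() {
		if isLeader {
			// Claim the VIP so client traffic lands on the new leader.
			if err := exec.Command("ip", "addr", "add", vip, "dev", iface).Run(); err != nil {
				log.Printf("claiming VIP %s: %v", vip, err)
			}
		} else {
			// Release the VIP; the next elected leader will claim it.
			if err := exec.Command("ip", "addr", "del", vip, "dev", iface).Run(); err != nil {
				log.Printf("releasing VIP %s: %v", vip, err)
			}
		}
	}
}
```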

The thing is, it's not really higher complexity, because Pacemaker has all the inherent complexity that our current setup has. But people get really hung up on needing three nodes for voting and quorum-establishment purposes, instead of just having two nodes and figuring it out from there.

There's a good reason for it, because with two nodes you can run into a split-brain scenario, unless you have a third, partial node that acts

as a witness. But at that point you're still at three nodes,

yes, but the point is, your three nodes don't all need to be fully functional, as long as you can use some other system for fencing.

That's true. However, every single one of our customers that wants HA wants two nodes and two nodes only to do it with. And trying to explain the details, or not the details, but just the basic assumptions behind the CAP theorem, how we can't really use anything other than network connectivity to determine when something's offline, how we can't even tell the difference between a node that's offline and one that's just taking an especially long time to respond, and how to handle failover... it's an uphill challenge with those who don't have the background to understand why we've made the design decisions that we've made.
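That ambiguity is easy to show in code. Here is a minimal Go sketch of a heartbeat probe (the peer address and timeout are hypothetical); the point is what the timeout cannot tell you.

```go
package health

import (
	"net"
	"time"
)

// peerDown reports whether a peer failed a heartbeat probe. Note what it
// cannot report: a timeout may mean the peer is dead, partitioned, or merely
// slow. This is the CAP-flavored ambiguity that makes automatic failover
// decisions unsafe with only two nodes and no tiebreaker.
func peerDown(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return true // dead? partitioned? just slow? indistinguishable here
	}
	conn.Close()
	return false
}
```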

Sure. So there are two places where I go with this. One is that we're finding customers who are just saying, hey, we don't need this; if we can stream a backup within a certain failure window to another system, we're fine. It almost feels like we're back to a Pacemaker-style solution, where somebody says, okay, if there's a failure, I can hit a button and the system comes back online from this streamed backup.

That sounds like a scenario that would be better served with a primary-secondary setup with manual failover. Yep.

Which it is, and you can automate it, which we have here,

in which case you can definitely do that between two nodes, because you don't need the automatic decision-making that requires three nodes,

Right, yeah. And this was the thing to me. I think we get into scenarios where somebody, an architect, looks at this on paper and says, I want a zero-downtime HA system, and I don't want any external systems making decisions; I want it to be completely self-contained. And once you go down that path, you're like, well, I'm going to design the best thing I can get. And then you get down into the weeds, and people are like, oh, wait a second, I don't need that level of resilience or fault tolerance for a lot of systems. Or, in some cases, the other alternative from an HA perspective is, if you put a load balancer in front of a system, the load balancer has a degree of HA built into it. If the system is scaled horizontally enough, then you don't need the main front end to be HA, as long as there's something making a load decision there. That makes sense, because a load balancer should have a degree of health checking built into its design, at least "is the system up before I route traffic to it."
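As a sketch of that idea, here is a small health-checking load balancer in Go. The /healthz convention, probe interval, and round-robin policy are assumptions for illustration.

```go
package lb

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
	"time"
)

// backend wraps one upstream with a health flag maintained by a probe loop.
type backend struct {
	proxy   *httputil.ReverseProxy
	health  string // e.g. http://10.0.0.11:8080/healthz (hypothetical)
	healthy atomic.Bool
}

func newBackend(target, health string) (*backend, error) {
	u, err := url.Parse(target)
	if err != nil {
		return nil, err
	}
	return &backend{proxy: httputil.NewSingleHostReverseProxy(u), health: health}, nil
}

// probe keeps the health flag current: is the system up before I route to it?
func (b *backend) probe(interval time.Duration) {
	for {
		resp, err := http.Get(b.health)
		ok := err == nil && resp.StatusCode == http.StatusOK
		if resp != nil {
			resp.Body.Close()
		}
		b.healthy.Store(ok)
		time.Sleep(interval)
	}
}

// serve round-robins requests across healthy backends only.
func serve(backends []*backend) http.HandlerFunc {
	var next atomic.Uint64
	return func(w http.ResponseWriter, r *http.Request) {
		for range backends {
			b := backends[next.Add(1)%uint64(len(backends))]
			if b.healthy.Load() {
				b.proxy.ServeHTTP(w, r)
				return
			}
		}
		http.Error(w, "no healthy backend", http.StatusServiceUnavailable)
	}
}
```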

In a service-based architecture, yes. But more often than not the design falls short, because you may end up having, instead of service-oriented systems, just consumers. Say, two systems that could both consume from a queue, but you only want one of them consuming at any time. And then you have a coordination problem.

When you're designing that, does that mean the queuing system needs to have some HA component or an HA design, but the workers servicing the queue don't have to? That's part of what I'm getting at: there are places where you probably need redundancy or HA in the design, but there are a lot of places where, either by having multiple workers or bigger volume, or because there are delays built into the system, you could probably tolerate not having it. Heck, and I'm going to air-quote "HA" in that model: what does HA really mean in this case?
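One common answer to that coordination problem is a lease: the queue itself stays redundant, while the consumers coordinate so only one is active at a time. A minimal Go sketch, with an in-memory stand-in for what would really be etcd, Consul, or a database row:

```go
package lease

import (
	"sync"
	"time"
)

// Store is a stand-in for wherever the lease really lives (etcd, Consul,
// a SQL row with an expiry column). The pattern, not the store, is the
// point: both workers poll, but only the lease holder consumes the queue.
type Store struct {
	mu      sync.Mutex
	holder  string
	expires time.Time
}

// TryAcquire grants or renews the lease if it is free, expired, or already
// held by this worker. It returns true when this worker may consume.
func (s *Store) TryAcquire(worker string, ttl time.Duration) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now()
	if s.holder == "" || s.holder == worker || now.After(s.expires) {
		s.holder = worker
		s.expires = now.Add(ttl)
		return true
	}
	return false
}

// Run loops forever: consume while holding the lease, otherwise stand by.
// If the active worker crashes and stops renewing, the lease expires and
// the standby takes over, which is the coordination described above.
func Run(s *Store, worker string, consumeOne func()) {
	for {
		if s.TryAcquire(worker, 15*time.Second) {
			consumeOne()
		} else {
			time.Sleep(5 * time.Second)
		}
	}
}
```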

Yeah, I mean, that is actually a good question, because there are various degrees of HA. There's just reducing downtime, there's disaster recovery, and there's everything in between those two. For the most part, most of the demand for HA boils down to downtime reduction and risk mitigation. So for most scenarios, partial high availability is sufficient, as long as it keeps you within your SLA. Now, there certainly are some systems for which you want effectively 100% uptime, in which case, yeah, you also need disaster recovery. You likely want it to be distributed across multiple sites, in which case you also need networking-stack redundancy, whether that's stacked switches or, if it's global redundancy, then you may have anycast addresses or whatnot. So it's hurdles all the way down.

I mean, if you were using a system like Kubernetes, where it monitors for the service being up and migrates the service... does that count as HA,

or some definition of

resilient? It's resiliency: your service stays up, and the routing of traffic to that service stays up.

Yes. I mean, it will not save you from regional outages, because by and large Kubernetes clusters are zonal or at best regional. So if you want global coverage, then you also need a more fully distributed system, which may include using, say, Istio or some other kind of service mesh with east-west gateways so the clusters can replicate their data. Or you may end up having, again, a global load balancer in front of it. You may even decide, well, I don't want to put all of my eggs into one cloud provider, and do HA across, let's say, Amazon and Azure or Google. Again, it ends up being a business decision, largely: how much HA is sufficient HA? But yes, even if you distill it down to just having a Kubernetes Deployment with at least two replicas, that is technically HA for the simplest use cases.

Well, the issue there is that you're pushing the actual burden of HA off from your replicas to Kubernetes.

Yes, yes, absolutely. But Kubernetes is not going to

be just a two-replica thing. It's going to have, you know, at least three voting members to manage your etcd cluster and all that other fun stuff,

Right. But again, if you use a cloud provider, that is abstracted away for you. And for, let's say, a developer who is responsible for producing a service that maintains business continuity, in the sense that you can do an update on it and all of the requests keep getting served, a Deployment with rolling upgrades will give you that. And that's the minimum necessary scenario, in my opinion, to call something HA. Now, is it sufficiently HA? Again, that's a business decision, because, as you said, Victor, you're delegating some of those capabilities to the cluster itself. But, well, what happens if the cluster is out?

Hmm. When you say cluster here, are you thinking about the cluster manager, or the cluster itself?

The cluster itself, as a piece of infrastructure.

Oh, so what we're talking about is: what's the SLA for an individual service, or an individual...

Essentially, there are different layers. There's the service layer, which runs entirely in the cluster. There's the infrastructure layer, which is your cluster itself. There's the network layer, which is access to the cluster and, potentially, the data passing in and out of the cluster. So again, it depends on what your target for HA is. If you have an SLA that says all customer requests received by the cluster must have no more than, let's say, 0.5% non-200 response codes, well, then your HA scope is up to the cluster. So you might have a redundant cluster set up with a load balancer in front of it, but if the load balancer is down, then, well, you don't care, because that's not part of your SLA. On the other hand, if your HA SLA is essentially trying to meet the customer's perception of uptime, that's a different story. Then you would also want redundant networking and other capabilities; you would likely end up having a CDN or something to delegate that to, unless your private infrastructure has those capabilities.
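For the narrow version of that SLA, the measurement side is cheap to sketch. Here is a hypothetical Go middleware counting non-200 responses against the 0.5% budget from the example above (the wiring and threshold are assumptions):

```go
package sli

import (
	"net/http"
	"sync/atomic"
)

// Counters for the SLI described above: the share of non-200 responses.
var total, non200 atomic.Uint64

// statusRecorder captures the status code a handler writes.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// Measure wraps a handler and counts responses against the error budget.
func Measure(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)
		total.Add(1)
		if rec.status != http.StatusOK {
			non200.Add(1)
		}
	})
}

// WithinSLA reports whether the non-200 ratio is still under 0.5%.
func WithinSLA() bool {
	t := total.Load()
	return t == 0 || float64(non200.Load())/float64(t) <= 0.005
}
```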

How much of a factor is this: what you're describing can have a long chain of dependencies, and the more components you have in that dependency chain, the higher the likelihood of a failure interrupting the chain. When you look at an HA system, how much do you worry about all of the associated components in that chain? You know what I mean, the whole weakest-link thing.

It's definitely a risk. And I've seen situations where an attempt at an HA architecture actually weakened the perceived availability, because if your redundant system goes down, your response time goes up significantly. For example, say you have a system buffering messages because it's trying to fan out to both sides, and suddenly it's writing those to disk, which has a lot more latency. So yes, there's definitely a risk to it. I mean, you could probably do a whole academic course

on discussing this whole system. No, it's a very real problem. One of the things that we find all the time is that there are often unexpected components in an HA failure mode. Yeah.
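The weakest-link worry has simple arithmetic behind it: for components that all have to be up, availabilities multiply, so the chain is always worse than its weakest link. A sketch of the math:

```latex
% Availability of a serial dependency chain of n components:
\[
A_{\text{chain}} = \prod_{i=1}^{n} A_i
\]
% e.g. five components, each at 99.9%:
\[
0.999^{5} \approx 0.995 \quad (\text{about } 99.5\%,\ \text{roughly } 44\ \text{h/yr of downtime})
\]
```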

In my experience, the people demanding HA, or demanding an extremely high number of nines for availability, don't know how

much that costs. Exactly. It goes up by more than an order of magnitude for each nine you tack on the end,

yes, and they overestimate the risk of downtime. Like, yes, there's definitely a cost to losing business continuity if your system is down, but as you said, Victor, the cost of maintaining HA can easily escalate past the potential losses of not having HA, or at least a certain degree of HA. Yeah,
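For scale, the downtime budget each nine buys works out as follows; the budget shrinks tenfold per nine while, as noted above, the cost grows by roughly an order of magnitude:

```latex
% Allowed downtime per year at availability a (using 8766 h/yr):
\[
D = (1 - a) \times 8766\ \text{h/yr}
\]
\[
99.9\% \approx 8.8\ \text{h/yr}, \qquad
99.99\% \approx 53\ \text{min/yr}, \qquad
99.999\% \approx 5.3\ \text{min/yr}
\]
```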

I've got... I'm getting a call, so I'm going to pause for a second, but please keep talking. Just letting you know I'm muted.

Just going back a little bit: you mentioned Pacemaker before. Talking briefly about leader election, have you guys done anything with Corosync or similar systems?

Yes, but not in a very long time. The last time I messed with Corosync was when I had to know how it worked and do basic configuration to get my RHCE, but that was many, many moons ago. These days, our product, Digital Rebar Provision, basically bakes Raft right in at the lowest level of our database to handle leader election and high availability. So basically any update has to stream through Raft and get acknowledged by a quorum of the cluster members first, and all API traffic winds up being routed through Raft, to maintain consistent databases in a shared-nothing storage environment across your odd number of nodes.
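The shape of that write path can be sketched with the same HashiCorp raft library mentioned just below; the tiny state machine and the five-second timeout are illustrative, not dr-provision's real database layer.

```go
package store

import (
	"io"
	"time"

	"github.com/hashicorp/raft"
)

// lastFSM is a deliberately tiny state machine that just remembers the most
// recently committed update. A real product would apply updates to its
// database here; this only shows the shape of the API.
type lastFSM struct{ last []byte }

func (f *lastFSM) Apply(l *raft.Log) interface{}       { f.last = l.Data; return nil }
func (f *lastFSM) Snapshot() (raft.FSMSnapshot, error) { return nil, nil } // omitted in this sketch
func (f *lastFSM) Restore(rc io.ReadCloser) error      { return rc.Close() }

// write streams an update through Raft. Apply returns a future that resolves
// only after a quorum of cluster members has acknowledged the log entry,
// which is the "every update streams through raft" behavior described above.
func write(r *raft.Raft, update []byte) error {
	return r.Apply(update, 5*time.Second).Error()
}
```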

Yeah, I have experience with both of those: Corosync back when I was managing fleets of Proxmox clusters (to my knowledge, Proxmox still uses Corosync for the election), and Raft when setting up Consul clusters. Because, of course,

yeah, we actually use HashiCorp's Raft library as our back end for that. It's worked pretty well so far.

    Yeah, definitely,

The nice thing is, once you actually get it up and running, it tends to be pretty stable, modulo flaky networks, which occur more often than some of our customers would like. But a lot of them just can't get their heads wrapped around the fact that there's no such thing as a dedicated leader that we can force into the system, and that, due to circumstances outside of basically anyone's control, it can and will fail over to any cluster node to handle all the active traffic.

Yeah, I mean, I understand how people have a hard time wrapping their heads around that, but it is also frustrating how many times you have to explain it over and over again.

I'm assuming most of your HA concerns come up, again, with regard to cluster and hardware infrastructure management, rather than with, say, data ingestion pipelines and such.

Um, I mean, everything goes through Raft, so we don't have too much of a data ingestion pipeline per se, depending on how you're defining that. Every API interaction has to go through Raft, and that includes things like uploading new plugins or content packs or whatever.

If you mean something besides that, then I'm all ears.

The scenarios where I'm asked to implement HA are more on the data ingestion side, particularly with observability: setting up log servers or metric servers or tracing ingestion servers, and ensuring that these systems have optimized uptime, because SRE doesn't want to fly blind.

But those systems, in my mind, wouldn't meet the idea of an HA-required system. They're monitoring or logging. Although I guess if you're having an outage and they're out, then that's a significant problem, or if you missed an observability mark because those systems were out. I don't know. Do you normally think of those things as high nines, or the same nines as the systems that they're monitoring?

There's what I think, and there's what my users think. I personally think that the observability systems need at most an equal number of nines as the system being monitored, and in most cases three nines, 99.9%, is sufficient. However, the users of the system, which in most cases are SRE or other teams being on call, get very antsy about the idea of missing an SLI or SLO, so their demand is as many nines as possible.

In some ways, this is part of the topic to cover, right? Because adding nines is really expensive. Yeah,

    especially once you get north of four nines,

    expensive, yes, yeah.

And again, in my opinion, in most cases it doesn't make sense, because, for example, if you use Prometheus metrics, those are lossy anyway. So if you miss a couple of entries, that's not so bad; even if your system is down for an hour, your back end is still running. Now, again, if you expect to be down for an hour on your monitoring system, you should have a DR capability to fail over to another environment to continue monitoring. But yeah, there's that. Now, on the other hand, there are some scenarios where you do need to meet the full nines, like, let's say, your audit log storage. You don't want to have lossy audit logs.

Yeah, right. No, that would be bad, because that would imply that somebody could come in without being recorded. But does that then translate into an HA system for your logging? Oh, gosh. I mean, if it's an audit log, then...

Yeah. And those are the situations where I end up saying, you know, it's better to outsource this. This is where paying an exorbitant amount of money for Splunk to ingest these logs makes sense, because then it's not our problem anymore.

Okay. Oh, interesting. So from a SaaS perspective, "not your problem" and "actually being HA or not" are the same thing. Yeah,

that's funny. It's someone else's problem,

right? Which they're hopefully going to accomplish. You had mentioned disaster recovery versus HA, and I don't usually think of disaster recovery as an HA substitute, but maybe that's the better place to go, because this is where we started, right? Going from "I needed an HA system that was always up, with multiple tiers of redundancy in it" to "if you're down and we can push a button and spin something up within a five- or ten-minute window, then we'll manage that disaster." Does that count as disaster recovery?

I wouldn't say it's disaster recovery versus HA. I would say, more than that, DR strategies tend to overlap a lot with HA. Like you said: streaming replication, just so you can switch your metrics ingestion to another stack if necessary. Or even, if you have the budget for it, you end up just having fully redundant systems, so that if one of them goes down, the other one is still ingesting the data, and then you can more calmly work around the issues that caused the one to go down, and you can replicate the missing data back to the other replica whenever both are up again. Similarly, when it comes to meeting an SLO mark: let's say your system goes down hard. At least with regard to meeting your numbers, it may make sense to spin up another system as a drop-in replacement while you still investigate why the first one went down, because then you can restore business continuity to whatever service you're supporting without having to destroy... I mean, "evidence" is a bit of a strong word, since it may hint at malicious causes, but essentially, even if it was an accident or something completely non-malicious that caused the outage, you still want to preserve the evidence to review it.

Along those lines, then: putting in some type of splitter and being able to send data to two different places. We've been talking about HA from the service side, but if you had audit data that was really sensitive, you could avoid taking the risk on the service that you're sending it to, like Splunk. You could actually say, you know what, I'm going to copy the data to two sources, and then I'm going to have two different storage mechanisms. We haven't really explored the belt-and-suspenders story on the other side of this as an alternative.

Man, if it's within the budget, and if your SLA requires that many nines, then yes, you definitely want to do that. It's essentially not putting all of your eggs in one basket. Similarly with using cloud infrastructure: up to a certain size of business, it makes sense to use a single SaaS provider, but once you have a certain number of contracts that require high availability, you may end up thinking, well, maybe I should split my infrastructure across multiple providers. There are costs associated with that, and if your architecture doesn't support it, then it's a very difficult decision to make. But if your architecture does support it, then yeah, it absolutely makes sense, right?

No, the supply chain piece. And none of it's useful if you're not testing and exercising it. I was thinking back to writing your audit logs to multiple places. You could conceivably be writing, say, the last 48 hours of logs locally, so you're not overwhelming a disk, in addition to sending them on. It doesn't have to fully be a two-legged stool. It could be, hey, this data is so important, I'm caching it, so if I have an outage, I can go back and look at it. And then you might be able to get away with a lower SLA, dropping down to 99% or 97% on something, because you're saving data locally: if something happened, I can find the last week's logs, I've got them, and we can store them if our services are down. It seems like a defense-in-depth strategy. I don't hear that a lot.
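A sketch of that belt-and-suspenders write path in Go; the sink names, the single cache file, and the write ordering are illustrative assumptions rather than a specific product's behavior.

```go
package auditlog

import (
	"io"
	"os"
)

// Tee writes each audit event both to the remote sink (Splunk, S3, whatever
// you outsourced to) and to a local cache holding the recent window, so a
// remote outage doesn't leave an unrecoverable gap. A real system would
// rotate the cache by age (say, 48 hours) instead of using one file.
type Tee struct {
	remote io.Writer // e.g. a forwarder connection (hypothetical)
	local  *os.File  // e.g. /var/log/audit/recent.log (hypothetical)
}

// Write satisfies io.Writer. The local write must succeed; a remote failure
// is tolerated because the cached copy can be replayed later.
func (t *Tee) Write(p []byte) (int, error) {
	if n, err := t.local.Write(p); err != nil {
		return n, err // losing audit data locally is the one hard failure
	}
	if _, err := t.remote.Write(p); err != nil {
		_ = err // remote is down: keep going; replay from the cache later
	}
	return len(p), nil
}
```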

Go ahead. No, I would also say that most people involved in any of the components of observability, like logs, metrics, or traces, have a tendency to gravitate toward hoarding their data just in case. But the truth is, you do not need to retain all of your audit logs, just like you do not need to retain all of your metrics or traces all of the time. Logs should be consumed and evaluated as close as possible to the source, and then what you want to retain is your summary.

Right, that makes sense. Or you could parse the logs down pretty significantly, so that they only contain security events or something like that. Yeah.

So, for example, I wouldn't care about each single instance of Victor logging into the Digital Rebar systems over a week. But if, let's say, there was an outlier, then you would want to retain the logs around that outlier event.

Yes. So, yeah,

the challenge I get into is, as soon as you start pruning those logs, you're effectively tampering with the logs. So you have to have some mechanism... hey, we're crossing out of HA land into audit land here. But yes,

there are off-the-shelf systems that will do that for you. These are the new generation of log parsing systems: they ingest the logs as close to the source as possible, they create summaries that are compliant with whatever regulatory regime you're working in, and the summaries are what you then store. And then you have a likely 100x to 2,000x reduction in the volume that you're sending out of your system, which is a huge lift off of your network and your storage costs.
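In the spirit of that filter-near-the-source pattern, here is a toy Go version. The keyword rules and summary format are invented for illustration; commercial pipelines apply much richer, compliance-aware rules.

```go
package logfilter

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// Summarize scans a log stream close to the source, forwards only the
// interesting lines (a naive keyword match standing in for real rules),
// and emits one summary line with counts. The summary is what you retain
// long term; the bulk never leaves the node.
func Summarize(in io.Reader, out io.Writer) error {
	var total, kept int
	sc := bufio.NewScanner(in)
	for sc.Scan() {
		line := sc.Text()
		total++
		if strings.Contains(line, "auth.failure") || strings.Contains(line, "privilege") {
			kept++
			fmt.Fprintln(out, line) // retain outlier events verbatim
		}
	}
	ratio := float64(total)
	if kept > 0 {
		ratio = float64(total) / float64(kept)
	}
	fmt.Fprintf(out, "summary: scanned=%d retained=%d (~%.0fx reduction)\n",
		total, kept, ratio)
	return sc.Err()
}
```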

So you have to have a trusted system to do that,

and you also have to be producing a volume of logs that makes it worthwhile to purchase a license for these kinds of systems. So again, it's not a straight line: one number goes up, the other number goes down.

But yeah, OpenTelemetry is actually probably the easiest field in which to get comfortable working on these kinds of filters, because the process of getting only the interesting traces into your back end is very similar to the process of filtering your logs.

How much of a factor is it if you have a redundant system and it fails over? Is there a concern that the logging and tracing systems might not keep up with that failover? I hadn't thought about what it takes: you have a system that fails over, and all these downstream components we're talking about have to keep up with the system having moved. Also, I'm assuming that's just not that much of a concern, because, right, I think if we're pulling our Prometheus metrics out of, say, Digital Rebar, and it's an active with standbys and the active moves, we'd have to have Prometheus scraping metrics out of all three of them, which you'd want anyway, because you need to know that all three of them are up, and also the metrics are going to switch over to a different node.

Yes. Some of it you can automate. For example, if, instead of using an IP address, you use a DNS entry with a low TTL, then you can update that DNS entry when you do a failover, and the querying gets switched over pretty much transparently. Assuming the primary system is actually fully down, the client would likely have a connection error and retry anyway.

Right. And for us, the primary IP, the VIP, would migrate. But ideally you would monitor, and maybe this is a question for Victor: would we expect people to monitor all of the passive, redundant nodes in their system, or would they just attach to the VIP?

Um, I mean, it depends on what metrics you're looking for. So in our system, we don't move metrics from one system to the other, even in an HA cluster; metrics are gathered and displayed on a per-node basis. What we do is expose the metrics over the VIP and on the per-node IP addresses. So if you want to look at the cluster as a whole, you look at the VIP, with the understanding that if a failover event happens, some of the metrics are going to jump around, mostly because I don't think the cost of shipping metrics from one node to the other is really worth it.

No, it wouldn't make sense to me. What you're describing is really fascinating, though: understanding and monitoring the HA infrastructure is different from monitoring the service. Yeah,

yeah, there is an argument for shipping your metrics off-system, and that is that once your system is down, you likely want to go to the metrics in the first place to understand what led to the system being down. Yeah.

I mean, all of our endpoints are just exposed over a bog-standard Prometheus service endpoint,

    yeah,

so once you get there, you just scrape however often you scrape to get up-to-date metrics. Yeah,
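A minimal Go sketch of that exposure pattern, with hypothetical addresses. One detail worth noting: binding the VIP only succeeds on the node currently holding it, so the "cluster view" follows the active node after a failover.

```go
package metrics

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Serve exposes the same metrics registry on this node's own address and on
// the VIP. Scrape the per-node addresses to watch every member; scrape the
// VIP to see "the cluster," knowing values jump around on failover because
// metrics are not shipped between nodes.
func Serve(nodeAddr, vipAddr string) {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())

	// Per-node address, e.g. 10.0.0.11:9090 (hypothetical): always reachable.
	go func() { log.Println(http.ListenAndServe(nodeAddr, mux)) }()
	// VIP, e.g. 10.0.0.100:9090 (hypothetical): binds only on the active node.
	log.Fatal(http.ListenAndServe(vipAddr, mux))
}
```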

This is why I advocate for having Kubernetes Prometheus metrics sent out of the cluster, as opposed to having a per-cluster Prometheus instance, because if your environment is down, you don't want to be scrambling to regain access to those metrics.

Well, I think this is part of the disaster planning that's worth considering, right? Because this conversation is from an ops perspective, and part of the ops perspective is that if you have something going wrong, if your systems are failing, that's when it matters most. And we have stories about this going back years, right? The Facebook one is famous at this point: the single sign-on system failed, and they couldn't even get access to the gate at the data center, because it had dependencies on the single sign-on system. And we were talking yesterday even about SSH, where, if you were dependent on that system, then without having backup keys you would potentially be locked out of the boxes you needed to maintain that system, repair it, and get it back online. So it's not just HA systems; it's understanding the failure modes of the systems that you're using to maintain and monitor your systems, to help restore and check them, which is an important dimension from an operations perspective.

    there's also the difference of like doing this at scale versus, let's say, doing this on a startup, like, if you're a startup and you have, like, three services that you're running with maybe 100 users a month, by all means, just keep your metrics in the cluster, because it's not cost effective to ship them out, unless perhaps you shipped into like a, like a SaaS provider that has a free tier because your volume is so low, right?

And this is the classic challenge of picking your SLAs for anything, right? It's smart to have a strategy to get to a high SLA, but until a service actually has a real need for a high SLA, it doesn't make any sense. I have a story, actually. Let me tell it; it's a good startup story, since you brought up startups, although it's a little bit older, and then we need to wrap up because we're out of time. In my first startup, we were building a cloud service provider back in the early 2000s, and part of the idea for the service was that we would be running people's infrastructure for them in a data center, and therefore we needed to do a better job of running people's email and SQL Server and stuff like that. And so a significant amount of our first capital actually went to buying highly redundant gear. At the time, we bought a SQL cluster with a shared drive in it. This thing was a beast: it was half a rack large and incredibly expensive, over a $100,000 purchase. And my co-founder, rightly and wrongly, insisted that we needed this highly redundant SQL Server, even though we had no customers who had actually bought highly redundant SQL Server capabilities at all. To me, it's a really nice illustration of how expensive it can be to assume that you need high availability when you don't have a demonstrated need for it. You need to know how you're going to get there in the future, but we could have bought that device later, after we'd sold a high-availability SQL cluster, instead of committing a whole bunch of capital and time to a device that we hoped to need rather than actually needed. And that, to me, is a morality tale for high availability. It cost us a lot of money, and we replicated the same mistake several times in the startup and burned a lot of cash that I wish we hadn't spent.

On the other end of the spectrum, we also have those who don't think about high availability at all until it comes and bites them in the rear. Typically that's more about backups and not having a 3-2-1 strategy, but it also applies to HA in some cases.

And I think our conversation about DR was a really good example. In a lot of cases, having a good DR strategy and testing your DR strategy is going to be much more valuable than an HA strategy, or at least valuable before an HA strategy.

Given a sufficiently permissive SLA, your DR strategy can be your HA strategy. You can just say, okay, the system is down, I will create another one within an hour, and I

will restore it from backups. You were taking backups, right? Yep.

And I would much rather our customers be taking good backups, and have those and test those, than have HA, because downtime is bad, but data loss is really much more expensive. That, I think, is a good wrap. There's a lot to cover with this, so I appreciate everybody's time. On scheduling: we have the Cloud 2030 DevOps lunch coming up, so next Tuesday is out for TechOps. We will come back, but there are no meetings from June 18 to July 23, so basically back in August after that. Thursdays are still on, by the way. Talk to you soon. Thanks.

Wow. In my career, high availability is one of those topics that has really tripped people up, on both sides of the coin. There were stories here about RackN building some very sophisticated high availability and then having people realize they really just needed disaster recovery. And from my earlier startup career, I really love that story about just how much money we spent building a complex, sophisticated HA system that we didn't need. So I hope you take this advice to heart. The whole TechOps series, I think, is really powerful, and hopefully it has been helping you become a better operator. Let us know what you want to do next; we want to hear from you. You have some time to influence the series, and I'm looking forward to hearing what you think at The 2030 Cloud. Thanks. Thank you for listening to the Cloud 2030 podcast. It is sponsored by RackN, where we are really working to build a community of people who are using and thinking about infrastructure differently, because that's what RackN does. We write software that helps put operators back in control of distributed infrastructure, really thinking about how things should be run, and building software that makes that possible. If this is interesting to you, please try out the software. We would love to get your opinion and hear how you think this could transform infrastructure more broadly. Or just keep enjoying the podcast, coming to the discussions, and laying out your thoughts on how you see the future unfolding. It's all part of building a better infrastructure operations community. Thank you.