20250304 HA in DRP

    3:53PM Mar 21, 2025

    Speakers:

    Rob Hirschfeld

    Keywords:

    High availability

    Digital Rebar

    Raft consensus

    failover

    leader election

    VIP management

    synchronous replication

    content packs

    job log entries

    DHCP leases

    load balancing

    network latency

    backup system

    restore capabilities

    distributed transactions

    Rob: Hello, I'm Rob Hirschfeld, CEO and co-founder of RackN, and your host for the Cloud 2030 podcast. In this episode, we dive into RackN's high availability technology: what we did to build consensus-based Raft HA capabilities directly into Digital Rebar. This is one of those episodes where we are talking specifically and only about Digital Rebar, so it is a vendored conversation from that perspective. But if you are building HA systems, or interested in how HA systems work, this is one of those great sessions where you are going to learn firsthand from our experience, and I know you will enjoy the conversation.

    This is going to be a RackN discussion about RackN topics. I'm not trying to keep this neutral or pretend it's neutral. A lot of times with these lunches, I try to make them a little bit general industry; for this one, I am very happy for us to talk about our HA capabilities, product troubleshooting, and what we think about as part of the framework. There are a lot of general interest topics that we're going to cover, and of course there are specific interest topics for RackN. So please don't be shy or hold back about RackN-specific topics, because this is basically a RackN discussion.

    Any questions about that? Then I'll tee it up a little bit. Cool, yeah, so feel free to ask questions and dive in. What I was hoping to do today actually extends last week's topic a little bit, and it was sparked in part by conversations I saw about DRP's HA capabilities in the channel. So I was hoping, Victor, that you would help lead a conversation about how we've built HA. I'm happy to go through a little bit of the history, because that's probably useful for the team, talk about the challenges, answer questions about why we did it this way, why we didn't do it another way, and what we learned. One thing I would ask is, let's not mention customers directly, just because that would be the one thing that would be a pain to go back and fix for the discussion. But otherwise, that was my goal: to talk about what we've built from an HA infrastructure and why.

    Okay, I guess I can take over from here. So very early on in the company's days, we sort of did and sort of didn't have something that might be considered HA, by virtue of having sharded the entire product into a bunch of tiny microservices. It doesn't really have anything to do with how we have designed and implemented anything today, but in theory, because at that point in time we were a traditional three-layer corporate app, with your HTTP front end, some middle-layer stuff in the middle, and a database back end, you could do scaling and HA using traditional scaling and HA technologies for whatever layers were involved. For the back end, you could have done multi-master Postgres, or targeted clustered MySQL, or whatever. For the middle layer, it depended on the individual services, but that would have been handled via Docker and Docker's stuff, basically, because it was a bunch of coordinating Docker clusters. And for the front end, it was just a Rails app, so you would scale it and make it HA the same way you would any other Rails app. All that went out the window when we decided that that code base wasn't worth having anymore, and we rewrote the whole thing as a single Go binary. So for about a couple of years after that, we didn't have an HA story at all, given what we were targeting and what we were promoting in the marketplace. We had requests for HA, but we kept things incredibly simple. The database as such was literally just a directory full of JSON files.

    It was the open source era. Yeah, yeah, back when

    we were trying to run things as an open source company. And so for a couple of years, that was the status quo. After that, we reorged the company, took all the critical bits closed source, inverted our licensing model, and started to actually think about those customer requests for HA. So the first incarnation of what we would really consider HA is what I currently call basically synchronous replication: the leader in a cluster, which can be of any size (two nodes is fine in this scenario), basically broadcasts all of the writes to its database and the writes of the artifacts to the followers, and if the leader dies or goes offline, then you're supposed to have an external process come in and initiate failover to one of the followers. The thing that we had initially been targeting was that people would reuse whatever they had done with tools like Corosync to actually do semi-automated failover, or if they didn't need it, they could go in and run the appropriate commands to tell one of the followers that it's the leader now and take over. Needless to say, that only lasted for a couple of releases. The code that implemented it is still in use as the manager code, and the code for taking a backup still relies on the same protocol and uses the same sorts of log shipping that the initial synchronous-replication-with-failover protocol used. But these days, starting with 4.6, we reimplemented all of that stuff to go through Raft instead. So these days our consensus model is that every dr-provision is actually running on top of a Raft cluster. For the overwhelming majority of dr-provision nodes, that is a cluster of size one that doesn't talk to the network and has very generous failover times and very short election cycles. So every dr-provision instance that is running out there is effectively a Raft cluster of size one, and what that allows us to do is automatically create and scale up and scale down clusters as needed. Our method of creating a cluster is to tell the node that it is, in fact, a member of a real cluster now, give it a network address to listen on for consensus traffic, and have it restart itself internally. That doesn't actually interrupt API traffic at all; it just reconfigures Raft behind the scenes, and then, bam, you're a cluster of size one that other cluster members can introduce themselves to. And growing and shrinking is just a matter of taking a running dr-provision instance and saying, hey, you're a member of that cluster over there now, which will cause the node being introduced to join the cluster, replicate state, and then start participating in the voting process.
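    To make the "cluster of size one" idea concrete, here is a minimal sketch using the HashiCorp Raft library that gets named later in the conversation. The in-memory stores, the do-nothing state machine, and the node names are illustrative stand-ins, not DRP's actual bootstrap code; it just shows a single voter electing itself and how a later AddVoter call would grow the cluster.

```go
// Minimal sketch: bootstrap a single-node Raft "cluster of one" with
// github.com/hashicorp/raft, using in-memory stores and a placeholder FSM.
package main

import (
	"fmt"
	"io"
	"time"

	"github.com/hashicorp/raft"
)

// nopFSM stands in for DRP's database layer; it ignores every applied entry.
type nopFSM struct{}

func (f *nopFSM) Apply(l *raft.Log) interface{}       { return nil }
func (f *nopFSM) Snapshot() (raft.FSMSnapshot, error) { return nil, fmt.Errorf("not implemented") }
func (f *nopFSM) Restore(rc io.ReadCloser) error      { return rc.Close() }

func main() {
	cfg := raft.DefaultConfig()
	cfg.LocalID = raft.ServerID("node-1")

	logs := raft.NewInmemStore()
	stable := raft.NewInmemStore()
	snaps := raft.NewInmemSnapshotStore()
	addr, trans := raft.NewInmemTransport("")

	// Bootstrap a cluster whose only voter is this node: it elects itself
	// almost immediately, mirroring a standalone dr-provision instance that
	// runs as a Raft cluster of size one.
	boot := raft.Configuration{Servers: []raft.Server{{ID: cfg.LocalID, Address: addr}}}
	if err := raft.BootstrapCluster(cfg, logs, stable, snaps, trans, boot); err != nil {
		panic(err)
	}

	r, err := raft.NewRaft(cfg, &nopFSM{}, logs, stable, snaps, trans)
	if err != nil {
		panic(err)
	}

	for r.State() != raft.Leader { // a single voter wins its own election
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Println("state:", r.State())

	// Growing the cluster later is one call on the current leader, e.g.:
	//   r.AddVoter("node-2", "node-2-address", 0, 0)
	// after which the new member replicates state and starts voting.
}
```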

    That kind of catches us up to where we are now.

    The one thing, for an HA cluster, when the cluster leader moves: there's a couple of disruption points, right? One is that there's a main IP address, the VIP, that needs to move also, right? So what coordinates moving the IP for the cluster to the new leader?

    Okay, that can actually happen in one of two ways. The first method, which is baked into the product and is the easiest to use, is we use gratuitous ARPs to move a virtual IP address around, and we rely on internal scripts that essentially know which Ethernet interface we're running on top of and what the VIP should be. Whenever a leader gets elected, it adds the VIP to itself and then sends out a gratuitous ARP to say, hey, I have this IP address now, so, switches out there, direct all traffic to me. And when the leader loses its leadership for whatever reason, the first thing it does is release the VIP on the interface that used to have it, and then the ARP entry will eventually expire. As part of the election cycle, all of the nodes will also try to release the VIP if they have it for whatever reason, because misconfigurations have happened before. And then the system will elect a leader, decide who has the most updates, and once all that is done, it will add the VIP to the interface that you configured it to be added to and send out a gratuitous ARP, and then the system is back up again. Usually, an election cycle like that takes less than a second to go through, unless something is very wrong. The other method involves doing exactly the same thing, except using customer-provided scripts that should go out and poke whatever thing is acting as the IP address redirector to make sure that all traffic destined for the VIP winds up going to the current leader. This usually winds up being some sort of load balancer, although stupid BGP tricks with anycast addresses would also work. But then we're getting into stuff that the teams we usually work with would never be allowed to do.
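    A minimal sketch of the gratuitous-ARP step described above, assuming Linux iproute2 and the iputils arping utility; the interface, VIP, and function names are placeholders for illustration, not DRP's internal scripts.

```go
// Sketch: claim the VIP and announce it on election; release it on losing
// leadership. Assumes "ip" and "arping" (iputils) are on the PATH.
package main

import (
	"log"
	"os/exec"
)

const (
	iface = "eth0"         // hypothetical interface carrying the VIP
	vip   = "192.168.1.10" // hypothetical virtual IP
)

func run(name string, args ...string) {
	if out, err := exec.Command(name, args...).CombinedOutput(); err != nil {
		log.Printf("%s %v failed: %v (%s)", name, args, err, out)
	}
}

// onElected adds the VIP and sends unsolicited (gratuitous) ARPs so switches
// start directing traffic for that address to this node.
func onElected() {
	run("ip", "addr", "add", vip+"/32", "dev", iface)
	run("arping", "-U", "-I", iface, "-c", "3", vip)
}

// onLostLeadership releases the VIP right away; the other nodes also release
// it defensively during an election in case of a prior misconfiguration.
func onLostLeadership() {
	run("ip", "addr", "del", vip+"/32", "dev", iface)
}

func main() {
	onElected()
	onLostLeadership()
}
```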

    If they were going to use a load balancer to do that, they could go back to the WAL sync tool and not worry about Raft, couldn't they?

    They could, and we tried to make that work for several of our customers, and it never really cohered into a working example for anyone. Okay. Which is why we decided that we're just going to do automated failover as our base case instead, and everyone will have to live with the fact that their clusters need to have at least three nodes, and they should really be odd numbers of nodes, instead of, you know, two nodes with something else acting as the tiebreaker. Gotcha. That last part tends to be a sticking point for a lot of our users. Yes. The other thing with using modern failover techniques, using things like Paxos and Raft, is that by default there isn't such a thing as a designated leader. Whichever node got elected is the leader, and it will stay that way until some other node gets elected.
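    A small illustration of the odd-cluster-size point: quorum is a strict majority, so a fourth node does not buy any extra failure tolerance over a third.

```go
// Quorum math for Raft-style majority voting.
package main

import "fmt"

func main() {
	for n := 1; n <= 5; n++ {
		quorum := n/2 + 1
		fmt.Printf("nodes=%d quorum=%d tolerated failures=%d\n", n, quorum, n-quorum)
	}
	// Two nodes tolerate zero failures; three and four both tolerate one;
	// five tolerates two. Hence the recommendation for odd cluster sizes.
}
```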

    And right now, we don't really have any knobs to influence which node would be the leader in the event of a tie. Adding those kinds of knobs can do bad things to system stability, and I don't want to allow our customers to get themselves into a state where a leader can't be elected because of their preferences, versus just the natural leader election rules that Raft uses.

    We do have customers who are, like, trying to get a primary DRP that has more resources, and the backup ones are scaled down or under-resourced, just to save resources.

    Yes, and it's a constant battle to tell them that's a bad idea, okay?

    Because, yeah, if you can't define which one wins the election, then it's strictly order of bring-up. Then to force an election, you'd have to force it to a single-node cluster. Okay. What's the downside of trying to force it like that?

    The downside of trying to force it like that is that it adds extra steps in some pretty low-level bits of the code, which can potentially add instability to the leader election process, and it also forces the system to make an assumption that there is a preferred node that it should run on top of, and that's

    just bad. Which also means that you're potentially under-resourcing the others. So part of this design means that even if a DRP is not handling load, the followers, the follower DRPs, is that the right term? Yeah, followers. So the followers still need the same in-memory footprint from a data perspective, the same storage footprint as everything else. They probably don't even use that much less CPU; I guess that's the only thing you'd be saving if you tried to scale down one of the endpoints.

    As long as they're just a follower, they actually should be using significantly less CPU. One of the things that I've noticed causes the most CPU utilization on the system is either when the system gets really overwhelmed just due to the sheer amount of API traffic coming in, or plugins misbehaving. We have several customers that have just blindly loaded all of our plugins, and it's not uncommon to see things like several instances of the callback plugin each consuming a core of CPU, basically spinning their wheels trying to talk to an upstream provider that they have overwhelmed, or trying to execute poorly written scripts that are wasting a significant amount of time, or just trying to handle event traffic from dr-provision that has not been filtered out as well as it could have been.

    I was going to make a note about that for the new people. Plugins can subscribe to the event stream, and that's one of the most efficient ways to process events. But to Victor's point, if you are subscribing to too many events in a busy system, you're going to have a very active plugin, especially one looking at events that it doesn't have to look at or act on.

    Yes.

    One of the big improvements that I made in 4.14, which has not been backported to earlier releases because it actually rewrites a lot of the event and trigger handling code, is to allow plugins to have a lot more specificity in the events that they look for, by being able to more tightly filter events on the server side, to basically throw away the ones that they're not interested in. Instead of having to make those decisions on the client side and have a sort of coarse-grained event subscription, you can now subscribe to events that are pretty fine-grained using extra filters. That's kind of an aside from HA, though. Well, and followers don't even have any running plugins.

    Ah, that's an important distinction. Okay, why not?

    Because there's nothing for them to do. Part of the election cycle is that whenever a system gets elected as leader, it reads all of the plugin configurations and spins up all of the plugins that should be running, and whenever leadership is lost, it kills all the plugin providers that are running on the system.
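    A minimal sketch of that plugin lifecycle: providers only run on the leader, are started when leadership is gained, and are torn down when it is lost. The channel stands in for whatever leadership notification the consensus layer provides; none of this is DRP's actual code.

```go
// Sketch: start plugin providers on gaining leadership, stop them on losing it.
package main

import "fmt"

func managePlugins(leaderCh <-chan bool, start, stop func()) {
	for isLeader := range leaderCh {
		if isLeader {
			start() // read plugin configs, spin up providers
		} else {
			stop() // kill every running plugin provider
		}
	}
}

func main() {
	ch := make(chan bool, 2)
	ch <- true  // elected leader
	ch <- false // lost leadership
	close(ch)
	managePlugins(ch,
		func() { fmt.Println("leader: starting plugin providers") },
		func() { fmt.Println("follower: stopping plugin providers") })
}
```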

    So if you had a follower that was trying... you know, it's Raft, you can conceivably make writes from multiple locations, no?

    No, only the leader. Only the leader can accept writes. All writes have to go through the leader.

    Okay, so the followers are really just keeping their database synchronized.

    Yes. An architectural decision that we made earlier on, to kind of go along with using a VIP, was that the followers will still have their API running, but they will only handle API traffic that is destined for themselves: things like getting the info, getting some metrics, getting a couple of read-only items pertaining to cluster state, and that sort of thing. For everything else, they just redirect it back to the cluster leader to do its thing. We could have architected it to work more in a multi-IP environment where each of the cluster members was sort of co-equal, and add an additional layer in the network traffic to more cleanly separate the front-end and mid-layer code from the back-end and database code, so there'd be a network protocol for followers to send API updates to the leader. But that would have introduced a whole bunch of other locking overhead, and then I'd have to care about making transactions work across network boundaries, and that was just too much hassle to pull into the project at the time.
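    A minimal sketch of the follower behavior described above: answer a handful of local, read-only endpoints and redirect everything else to the leader. The paths, port, and leader lookup are illustrative assumptions, not DRP's actual routing.

```go
// Sketch: a follower serving only node-local endpoints and redirecting the
// rest of the API to the current cluster leader.
package main

import (
	"log"
	"net/http"
	"strings"
)

func main() {
	leaderURL := "https://10.0.0.10:8092" // hypothetical VIP / leader address

	localOnly := map[string]bool{ // endpoints a follower answers itself
		"/api/v3/info": true,
	}

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if localOnly[r.URL.Path] || strings.HasPrefix(r.URL.Path, "/metrics") {
			w.Write([]byte("served locally by this follower\n"))
			return
		}
		// Everything else goes back to the leader, so followers never have to
		// forward writes or coordinate transactions themselves.
		http.Redirect(w, r, leaderURL+r.URL.RequestURI(), http.StatusTemporaryRedirect)
	})

	log.Fatal(http.ListenAndServe(":8092", nil))
}
```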

    Makes sense.

    Because, as people who read CS research papers for fun will probably have noticed, distributed transactions are hard.

    Yeah, getting this to work effectively has been, in itself, with the constraints you've defined, a very tricky piece of software to build and maintain.

    Yeah, and one of the bigger constraints on our architecture is the fact that we are handling traffic other than just straight-up HTTP. That requires being able to do things like, well, TFTP and DHCP; they have their own stories for how they could be integrated into an HA cluster. DHCP sort of has protocol-level things for handling multiple servers and using DHCP relays to fail over to the leader for handling traffic, but TFTP is way too stupid for any of that sort of stuff. And so we decided that flipping the VIP around is the best way to handle that sort of thing, because there's no way to tell a system that's PXE booting, if talking to such-and-such server fails, try talking to this other guy over here instead. And we cannot rely on relays or forwarding for that sort of thing, because if what used to be the leader is offline, it can't transparently forward traffic to wherever it should be going instead, right?

    And DNS, well, it has redundancy built in, as we've talked about on previous sessions; it has a lot of redundancy. To me, the idea of having to troubleshoot, you know, "oh wait, my whatever is on one of N servers or locations," just strikes me as asking a lot of the operators for not that much gain, right? We've scaled endpoints to tens of thousands of servers simultaneously. There's no capacity requirement to distribute the load, from that perspective.

    Yeah, the question is whether or not the capacity requirements for doing that are more or less onerous to handle than the consistency requirements of trying to distribute the load would be. Because running 20,000 nodes against a machine, just the overhead of monitoring all the WebSockets becomes pretty significant after a while, and that's an area where I am actively looking at how I'm going to optimize this in the future, and what that means. But spreading the load out horizontally means that there has to be communication between the nodes that the load is being spread across to coordinate updates, and it just becomes the distributed transaction problem again, right?

    That makes sense. My point was sort of that, you know, there are times when this is a failover story more than it is a load distribution story.

    Correct. This is not a load balancing story at all. This is automated failover, in the case of one of the cluster members being fenced off or crashing or needing to restart, or whatever.

    Gotcha. Greg's putting something up on screen.

    Yeah, so this is what our UX shows when you're in HA mode. Normally, your cluster section for high availability will be off or not visible. When you're running, you actually tell the system what your high availability ID is, so this thing gets shared across all of the systems. Each of these are my three systems that are currently up in HA mode, and the VIP in this case is the .10 address, where the rest of these are .11, .12, and .13 in the lab setup. The UX will show the perspective from the leader: who's up and who's in sync and stuff like that. So that's the HA section. The replication system, like Victor was saying, is still present, and in this case we use it for a manager-to-managed-system kind of setup. In this case, I have an upstream system being a manager for this, and it shows its state here. And if you do a backup, you'll see the backup system attach, showing that it's saving data between those, and then it will disappear when the backup program detaches.

    Tom, you have a question?

    Yeah, what considerations have to be made for the amount of synchronization traffic between the nodes? Like, do you require a specific link speed, or do you have to worry about things like broadcast storms or anything like that?

    At the low level, we use the HashiCorp Raft library as the basis of our consensus code. So link speed isn't so much an issue, as long as it's sufficient to handle the replication traffic. Latency is more critical than that, because the Raft protocol requires that an update to the shared state has to be replicated to a majority of the nodes before the leader can proceed with the operation. And we've baked that assumption deeply into how we handle transactions, so if you add latency to the system by trying to spread the load across multiple data centers or multiple continents, that will directly impact how fast transactions can finish and negatively impact the overall speed of the system.

    Oh, interesting, okay,

    Yeah, because the minimum latency for committing a transaction boils down to the worst of the local storage latencies plus the worst of the network latency between the leader and a majority of the followers, times two. So if it's all in the same data center and all on the same subnet, network latencies are usually only on the order of a millisecond or two, so we don't generally have to worry about them, because the inter-system links are usually at least Gigabit Ethernet, more often 10 gigabit. So even with the largest kinds of transactions, which will commit anywhere between 500K and a megabyte of data to the cluster, the overhead of sending that across the network, having the followers write it to their disks, and having a majority acknowledge that they've written it isn't generally too bad, and it's usually actually bound by the storage subsystem these days.
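    A back-of-the-envelope sketch of that latency bound: the slowest fsync in play plus a round trip to the slowest follower still needed to reach a majority. The numbers and the simple model are illustrative only.

```go
// Rough commit-latency estimate for a leader-based consensus commit.
package main

import (
	"fmt"
	"sort"
	"time"
)

// commitLatency models: worst storage latency + 2 x the network latency to the
// slowest follower that is still required to form a majority (the leader
// counts toward the majority itself).
func commitLatency(storage, followerRTT []time.Duration, clusterSize int) time.Duration {
	worstStore := time.Duration(0)
	for _, s := range storage {
		if s > worstStore {
			worstStore = s
		}
	}
	needed := clusterSize / 2 // follower acks required for a majority
	if needed <= 0 || needed > len(followerRTT) {
		return worstStore
	}
	sort.Slice(followerRTT, func(i, j int) bool { return followerRTT[i] < followerRTT[j] })
	return worstStore + 2*followerRTT[needed-1]
}

func main() {
	storage := []time.Duration{2 * time.Millisecond, 3 * time.Millisecond, 2 * time.Millisecond}

	// Same rack: sub-millisecond RTTs, so storage dominates (about 4ms).
	fmt.Println(commitLatency(storage,
		[]time.Duration{500 * time.Microsecond, 800 * time.Microsecond}, 3))

	// Both followers a continent away: every commit now eats the 40ms RTT twice.
	fmt.Println(commitLatency(storage,
		[]time.Duration{40 * time.Millisecond, 40 * time.Millisecond}, 3))
}
```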

    But one of the things I remember doing early in this, even in the first versions, was we had to separate out some of the artifacts from the database. We're not stuffing ISOs into the Raft log.

    Oh God, no. So one of the design decisions for the consensus based off of Raft is that no one in their right mind would want to shove a Raft update that consists of a nine or ten gig ISO over the wire, because that would halt all other cluster operations while the followers are receiving that ISO and recording that it had been received, and all that other fun stuff. So one of the things we allow, and that is a potential source of problems, is artifact replication: when you upload an ISO, it uploads to the current cluster leader, and then once that ISO is finished uploading, the leader will send out an update through consensus saying, hey, this artifact has been updated, its new SHA sum is going to be blah, so go out and fetch it if you need to. Those updates are pretty small generally, but we rely on the followers to go out and fetch those artifacts and indicate that they have received them via a system of basically incrementing numbers. And what that means is that it's permitted for the followers to lag behind in terms of artifact replication compared to the leader; in fact, it's even expected. One of the things that I think has recently bitten a customer is that we do not force replication of job log entries to followers until the job is marked as finished on the leader. That's partially to cut down on network traffic for jobs that are still in progress, and also because whenever a failover happens, from the perspective of everything attached to the system, it looks like a temporary network connectivity loss. And so the behavior on the agent side is to notice that all of a sudden their HTTP connections have been reset, fail whatever job they were working on, and restart that job whenever they reestablish connectivity.
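    A minimal sketch of that artifact flow: only a small "this artifact changed, here is its checksum" record travels through consensus, and followers fetch the bytes out of band, verify them, and note how far they have caught up. The struct fields, URL scheme, and paths are illustrative assumptions, not DRP's replication code.

```go
// Sketch: follower-side handling of an artifact-update record that arrived
// through consensus, with the bulk download happening outside of Raft.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// ArtifactUpdate is the small record replicated through consensus.
type ArtifactUpdate struct {
	Path   string // e.g. "isos/example.iso"
	Sha256 string // expected checksum of the artifact bytes
	Seq    uint64 // monotonically increasing replication counter
}

// applyOnFollower fetches the artifact from the leader, verifies it, and
// records the sequence number reached. Followers are allowed to lag here;
// consensus itself never waits on the bulk transfer.
func applyOnFollower(leaderURL, destDir string, u ArtifactUpdate, caughtUpTo *uint64) error {
	resp, err := http.Get(leaderURL + "/" + u.Path) // hypothetical fetch endpoint
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	f, err := os.Create(filepath.Join(destDir, filepath.Base(u.Path)))
	if err != nil {
		return err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(io.MultiWriter(f, h), resp.Body); err != nil {
		return err
	}
	if hex.EncodeToString(h.Sum(nil)) != u.Sha256 {
		return fmt.Errorf("checksum mismatch for %s", u.Path)
	}
	*caughtUpTo = u.Seq
	return nil
}

func main() {
	var seq uint64
	err := applyOnFollower("http://leader.example:8091", os.TempDir(),
		ArtifactUpdate{Path: "isos/example.iso", Sha256: "expected-sha256-here", Seq: 42}, &seq)
	fmt.Println("result:", err, "caught up to:", seq)
}
```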

    Okay, can we dig down on that a little bit? Because this is a place where what we're building is not just a Raft database; the agents are part of a distributed story. Sorry, Greg's having internet issues. But, right, we have a Raft database that's maintaining state, and then all of the agents are attaching in and doing work, and the job is the unit of work for us. So can you describe what happens during a failover for the runner, and for any jobs that are in process? Because that's an important part of this whole equation: it's not just the UX that needs to move over, the job running is actually much more critical.

    It is much more critical, but it's also a problem that we already had to solve, going all the way back to the very beginning of jobs, of RackN the company even, even on the previous stuff that was based off of Ruby on Rails. We're going to need to restart the server, and that's going to result in the agents needing to reconnect to the system to do things; the agents are going to lose connectivity through no fault of the server, because a switch went down or whatever. So baked into the whole protocol that the agents use to pull jobs from the system is the notion that occasionally connectivity is going to be lost, and we have clearly defined boundaries upon which we can try to redo work, and that boundary is a job. So if the agent notices that, hey, all of my connections suddenly went away, it's just going to spin, try to reconnect to the Digital Rebar server, and try to pull the job that it was running to re-execute it. We leveraged the fact that the system already did that, and has done it since the very beginning of the current code base, to handle what happens when the VIP gets moved or when our connections go away as part of a consensus failover.

    That's part of the reason why you don't need an incremental job log. Because you're not picking up a job in a failover scenario, you're restarting that job. So whatever was interrupted is functionally... there's a partial log somewhere, but it's not really that much of a concern to replicate the partial log, because that whole job is toast, and it's gonna

    be retried anyways.

    That makes sense. So, I guess, when we're replicating, most objects in DRP are pretty small; machines and jobs are probably the biggest exceptions, from an object size perspective.

    From an object size perspective, it's going to be content and machines that are the biggest things. And machines are already very large at our customers, and we compress them to transmit them over the wire, which means that for most of our customers they wind up anywhere between 100 and 400K. So machine updates actually account for the vast majority of cluster synchronization traffic.

    Oh, yeah. So even though we might take a patch for an object change on a machine, so we're not rewriting a whole object from the API's perspective, we're still resynchronizing the whole object. Yes, because of the way Raft works. Makes sense, and then there's no risk of a partial update. Yes. Interesting.

    I'm going back through the questions I was jotting down. Sure, no worries. Other people may have other questions; I know I'm sort of running the gamut, because I know a lot of the depths. But I think one of the reasons I'm asking these questions is important: most objects in the system are very static, especially if they're in content packs, versus a handful that are much more dynamic in the system, and those are probably the ones that are most likely going to be problematic. So it's useful to name them for people to understand.

    I don't quite know what you mean by problematic there.

    I mean, if most DRP objects are delivered as part of a content pack, or as part of basically getting the system up and running, they're set-and-forget data from that perspective, right? So if you have a content pack, does the content pack get decomposed into objects from an HA perspective, or does it ride as a whole content pack?

    It rides as a whole content pack. Wow, okay, cool. Yeah, objects like that. So one of the other features of our database system is that it consists of a series of layers of objects. At the lowest level are the writable objects, and that's things like machines and leases and stuff that you just randomly create through the UX. Everything outside of that generally resides in a content pack, and those, from a database perspective, live in entirely separate layers, to eliminate the possibility of a bug in the code allowing something to write them when it shouldn't be able to. From a consensus standpoint, those objects don't exist individually; they're only distributed across consensus members as an entire content pack update, and content pack updates only get distributed to other nodes in consensus once they've passed all the validation checks that they need to pass on the leader.
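    A minimal sketch of that layering: a writable bottom layer for machines and leases, read-only layers for content packs that are swapped wholesale, and lookups that fall through the layers in order. Names and the map-of-strings simplification are illustrative, not DRP's storage code.

```go
// Sketch: layered object store with one writable layer and whole-content-pack
// read-only layers that get replaced atomically.
package main

import "fmt"

type Layer struct {
	Name     string
	ReadOnly bool
	Objects  map[string]string // key -> serialized object, simplified
}

type Store struct {
	Layers []Layer // ordered: writable layer first, then content packs
}

// Get falls through the layers in order; the first hit wins.
func (s *Store) Get(key string) (string, bool) {
	for _, l := range s.Layers {
		if v, ok := l.Objects[key]; ok {
			return v, true
		}
	}
	return "", false
}

// ReplaceContentPack swaps an entire read-only layer at once, mirroring content
// packs being validated on the leader and then replicated as a single object.
func (s *Store) ReplaceContentPack(name string, objects map[string]string) {
	for i, l := range s.Layers {
		if l.Name == name && l.ReadOnly {
			s.Layers[i].Objects = objects
			return
		}
	}
	s.Layers = append(s.Layers, Layer{Name: name, ReadOnly: true, Objects: objects})
}

func main() {
	s := &Store{Layers: []Layer{{Name: "writable", Objects: map[string]string{}}}}
	s.ReplaceContentPack("example-content", map[string]string{"tasks/example": "v1"})
	fmt.Println(s.Get("tasks/example")) // v1 true
}
```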

    Oh, that's cool. Okay, so they are atomic actions from that perspective anyway. So it's a single write into the database, and then it's gonna take some time to synchronize from that perspective. That's why content is one of the larger components.

    Well, that, and just the entire content pack is distributed as a single object.

    Does it make sense to... I'm asking this the wrong way. Should DRP accept content changes, or changes like that, in a degraded state? Would that be a problem?

    If there are still enough nodes in the cluster to have quorum, which means it can elect a leader and accept the API request in the first place, then no, it's not a problem. Okay. The system is set up so that no provisioning traffic will be accepted, or even can happen, until Raft itself has reached quorum and elected a leader, and we've done our little internal election to make sure that the node the system will wind up running on is also the one with the most updated artifacts, from the point of view of the artifact replication process, which I went over a little bit ago. So if you're able to make the API request for a content update and have it come back successfully, then it's safe. The content update will have been replicated to a majority of the nodes, and if all the nodes go down and come back up, they will elect someone that has that content update in its database.

    Makes sense. Does that also mean all the runners who are attached are gonna basically see that as a change too? Like, are the runners aware that they're attaching to a specific node in the cluster? Or are they just like, hey, I'm talking to the VIP?

    They're just talking to the VIP. And from a failover perspective, you know, the VIP stopped accepting traffic for a little bit, and then all of my connections got reset, and then I got back on and tried again. Okay, yeah, so a failover looks just like a temporary connectivity outage or someone restarting dr-provision, except a lot faster.

    That makes sense. We've talked about scaling in HA systems; is there any way to stop all those runners from reconnecting when I fail over? Like, how fast does the load come back onto this new system that's spinning up?

    So it kind of depends on how long it was down. The client that all of our runners use has a sort of Fibonacci backoff baked in. So if connectivity is lost, it'll try again immediately, then it'll try again in a second, then it'll try in two seconds, and three seconds, then five, and so on and so forth, until you get to about 30 seconds, and then it'll keep retrying at 30 seconds forever.
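    A minimal sketch of a Fibonacci-style reconnect backoff capped at 30 seconds, as described above; the exact sequence and cap in the shipped agent may differ.

```go
// Sketch: Fibonacci backoff delays, capped at 30 seconds.
package main

import (
	"fmt"
	"time"
)

func backoffDelays() func() time.Duration {
	a, b := 0, 1 // delays in seconds: 0, 1, 1, 2, 3, 5, ...
	const maxDelay = 30
	return func() time.Duration {
		d := a
		a, b = b, a+b
		if d > maxDelay {
			d = maxDelay
		}
		return time.Duration(d) * time.Second
	}
}

func main() {
	next := backoffDelays()
	for i := 0; i < 12; i++ {
		fmt.Print(next(), " ")
	}
	fmt.Println()
	// Prints: 0s 1s 1s 2s 3s 5s 8s 13s 21s 30s 30s 30s
}
```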

    Okay, so in a down-recovery state, you're not going to overwhelm the DRP system with all the agents immediately checking back in and then starting to put load on the system. Correct. Makes sense. Would there be a backlog of consensus processing to do in that case? Like, how heavy of a load is it if consensus gets behind?

    Consensus can't get behind. Okay? That's one of the nice things about using Raft with an elected leader: consensus can't get behind, because we don't commit transactions at a rate faster than they can be distributed through consensus. So in practice, that's not an issue, at least for systems that have been set up with all the consensus nodes in the same data center with minimal latency.

    If we tried to go the CockroachDB route, then that would turn into a thing we'd have to care about, and we'd have to add tunables to tune stuff like that. And that's just not something I really want to get into, because that involves a lot of black-magic knob tuning.

    Right. So... Tom, you wanted to

    say something too? Oh, yes,

    sir. I'm not watching the hands; go ahead, Tom. Oh, I was just going to comment

    on what about things like real-time broadcasting, in terms of DHCP leases that have expired and now have to be refreshed and caught back up?

    So the nice thing is that DHCP already has lease renewal. It has pretty sane lease renewal times baked straight into the protocol, and from the point of view of a failover event, the failover events are short enough compared to lease renewals that it doesn't really matter. Most of our systems these days use 10-minute lease times by default, which means that DHCP clients are going to try to renew initially at five minutes, and then seven and a half, and then ten. So that's three windows in which a lease can be renewed before you have to go through and do the whole rediscover-and-refetch process. The entire cluster would have to be offline for all of those events for a lease to actually expire due to failover, and if we miss one, it's not a big deal.
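    For reference, the standard DHCP timers (RFC 2131 defaults: renew at 50% of the lease, rebind at 87.5%, expire at 100%) give roughly the windows being described; the sketch below just computes them for a 10-minute lease.

```go
// DHCP renewal windows for a 10-minute lease under the usual T1/T2 defaults.
package main

import (
	"fmt"
	"time"
)

func main() {
	lease := 10 * time.Minute
	t1 := lease / 2                             // renew: unicast to the original server
	t2 := time.Duration(float64(lease) * 0.875) // rebind: broadcast to any server
	fmt.Println("renew at:", t1, "rebind at:", t2, "expire at:", lease)
	// renew at: 5m0s rebind at: 8m45s expire at: 10m0s
	// A failover measured in seconds fits comfortably inside any of these windows.
}
```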

    Right, that was what I was thinking. In an extended outage, where it's longer than that, what kind of headache are we talking about?

    potentially a big headache, but it's the same headache that everyone else running a DHCP server has to be aware of.

    It's one of the reasons why having an HA strategy or a DR strategy is important. Yeah. I do want to talk about the DR piece and the WAL tool in a minute, but before that: this has been a piece of software that's been part of DRP for a long time, but it's also something that has grown, been tweaked, tuned, and optimized. Do you have any war stories for us about things that took a long time to get right, or places we should watch out for in the field?

    Things that took a long time to get right... Well, I'm always discovering new and obscure bugs, sometimes very annoying ones.

    Or troubleshooting, yeah.

    Well, that, and troubleshooting consensus systems like this can be challenging, because I have to get logs and databases from three nodes instead of one and try to cross-correlate events and changes in the databases. And if a system is misconfigured even a little bit, like a lot of our customers who just turn DHCP logs to debug and leave them there forever, on a high-traffic system that can quickly mean that critical logs get drowned out and rotated away before we have a chance to actually look at a consensus issue. That's happened multiple times. That's not really a consensus war story, though; that's more of a logging war story. Unless people have gone in to specifically tweak their logging setups, most modern systems use systemd, and its journalctl stuff, for all of their base-level logging, and an active dr-provision system with any of its logs set to debug can blow right through the default log size settings that journalctl has in less than a day.

    Right. So logs are not written as part of the consensus system; they're going to be on the local machines, right? Correct. Okay. And so if you're dealing with a consensus system that's migrating, you're gonna have to find the logs for each of the members of the system, yep, and then correlate them.

    And that's part of why the log format includes transaction and sequence numbers to make some of that stuff easier.

    Is time sync important in our

    system? Not for consensus, because of the way that Raft does its leader election, by funneling all writes in the system through a known leader that...

    Oh, I lost audio.

    I think it was Victor.

    I think... good, you're confirming. I agree, yeah.

    I hear you. Okay,

    Yeah, the time sync piece. I know I'm dominating the questions. Do other people have questions they want to ask on this? I know it's a lot.

    Okay, my goal here... Victor might not even know we can't hear him. Hold on, I'll let him know.

    I'd like to see pictures. We lost him. Pictures of what? Everything that we've talked about; it would just kind of help me conceptualize it.

    I think we might have them. Let me see. Greg would know if we have HA pictures.

    What do you mean? Pictures of what?

    Like an architecture picture of how the...

    Sorry about that. Zoom decided that I needed to be signed out now and killed the meeting.

    Oh dear. No, we are still here. The meeting did not bounce, happily, but we lost you.

    Yes, um, yes,

    I don't think... we haven't done a lot of HA graphics that I'm aware of. Yeah,

    there are some spots in the documentation for it, but I haven't gotten to it again.

    It's in part because, you know, you set up multiple nodes, you elect the leader, and then the leader replicates. There's not a lot of flow, as far as what Victor was saying. Go ahead, Greg. All right.

    Um, yeah, there isn't. But usually what we have to overcome with the customers is certain mindsets, and their definition of high availability and availability zones and what that means to them often doesn't match what it means to a generic HA Raft highly available system. And so we get into terminology discussions with them and have to work through mindsets. That's part of where we need to spend more time describing things, not necessarily because the flow is complicated; it's fairly simple and straightforward, maybe not from an implementation perspective, but from a setup and usage perspective. The problem is, how does that overlay the customer's mindset of HA? What I mean by that is, okay, so we say, we'll provide an HA DRP for your data center, right? So you'll put three nodes in the data center, or whatever subsection of the data center you feel is appropriate. From an availability perspective, we're saying those three machines representing that HA cluster DRP mean that it's covering the availability of some number of racks, a data center section, whatever, and it's highly available as long as that unit is highly available. But from customers' perspectives, sometimes that doesn't align. So, for example, take a bank: they have to be able to make their data center again. So sometimes they say, well, HA means two of those nodes need to be in the data center and another one needs to be outside the data center, so that if the data center gets blown up, then we can rebuild it or whatever. Well, okay, that's not necessarily high availability, that's disaster recovery, but they often get conflated, and so they're saying, well, that way, when this goes down, I can just make that happen. And you're like, but that's not necessarily the high availability that matches what we're defining. Another use case turns into, well, I've got an HA DRP endpoint, but it's managing 15 remote sites, right? But if the data center goes down, I want the remote sites to still function, so my DRP endpoint needs to be spread across three data centers, so that if I ever lose a data center, I can still manage my remote sites. Well, okay, that's an availability zone definition. We often associate the machines we're managing with DRP to be within the same availability zone as the DRP endpoints themselves, so that if you lose an availability zone, the DRP endpoint died too, and you're not worrying about the things it manages. But in the case of a retail use case like the one I just described, from their mindset the HA is relative to the retail unit, which is not running a DRP endpoint but is relying on it as a separate service. And so in that mindset, you have to work through: okay, what does that mean? Do you have the latency requirements? Do you have the availability requirements? So the point is, it's aligning the HA with what our customers define as their needs for HA, if that makes sense.

    That's an important thing to be able to understand. You're bringing up, actually, a question I wanted to make sure we covered on the WAL tool, by the way, which is setting up the WAL tool not as a point-in-time backup but as a stream backup. Can we basically take a continuous backup using the WAL tool, or is it just a snapshot?

    Yeah, actually, you can't, using WALs, right now.

    Okay. You can't just leave it running; you have to restart it. But it does incremental still, right? Yeah, it takes incremental backups just fine, but continuous backup...

    Glad I asked.

    Well, continuous backup runs into the problem of, when do you stop backing up?

    Yes, and then, once you're in sync, what do you do? Do you force everybody to wait for you to get your backup before you unsync them, or let them go? So from that perspective, that's why, for backup purposes, you run the WAL tool: it syncs to a point, gets itself in sync, and then stops, because then it doesn't impose the overhead on the rest of the systems of waiting for it to be caught up before it frees transactions.

    Okay, that's an important note. So if you want to have live backups, that's an HA cluster; you're going to get a consensus system. And if you just want a backup, take a backup on a cadence, is what we're saying. Okay.

    And right, as we talk about future enhancements in this area, there are things like more metrics coming around the HA system, so that you can track the state of it, at least how it's operating, and things like that, from the Prometheus metrics. And then on top of that, there's feature work being done to enhance the backup system and restore capabilities, because some of the challenges of restoring an HA system are actually much harder than you think.

    Well, to be a lot more straightforward, right now we don't restore HA systems. Restoring an HA system right now just means you build an entirely new cluster using the last backup that you took from the leader of the old HA system. That's right. It may or may not live on the same hardware, but there's no taking a snapshot of an entire HA cluster and then restoring that snapshot; there's only taking a snapshot of the leader. Yeah, you take a snapshot of the leader, which, kind of by construction, is going to be the most updated version of everything in the database, and when you restore, you're building an entirely new cluster using that most recent snapshot. It contains all the same data that was there, but it's going to be an entirely new cluster from the point of view of our tooling.

    Right. So you need to make sure you turn off the old followers and do it in sequence at that point, because otherwise the followers could end up fighting with the restored backup, right?

    Am I overthinking it?

    If you are restoring... okay,

    that's more detail in the question than we need to go into at the moment.

    So yeah, at that point I'm not clear as to what you're getting at, because if you just take the leader down and you have an otherwise functioning cluster, then one of the other nodes is going to become the cluster leader, and it's going to continue working until you take a majority of the nodes down, in which case, at that point, the whole cluster is going to stop.

    Too much of an edge case, yeah. One note, by the way, for people who are not familiar with our history: if you're thinking, why didn't you just use a database and do HA clusters on the database, or let the database handle all this stuff, I did want to make sure we reminded people that part of the core use case of Digital Rebar is to be able to seed a new data center. One of the reasons the Corosync approach didn't work was because it had a prerequisite, Corosync. And so part of the strategy here is that even though we're talking about HA, which means bringing in multiple nodes, we still have a core "no prerequisites to bootstrap your data center" requirement, which means DRP has no prerequisites in operation, and part of doing this is constrained by that core design requirement. It's always useful to remind people why we didn't opt for just slapping this into a database and letting the database handle the backups. Cool. All right, Victor, thank you. This was a great brain dump; there's so much here. Hope it was helpful for everybody with RackN, to give you all some background on how the HA system works, and of course this will be transcribed and available. So thank you. Thank you, Victor, for getting this done on relatively late notice.

    No problem. Thanks, guys.

    Wow. There's really nothing better than going from what-ifs and theoretical generalizations about technologies to really getting down in the weeds and talking about exactly how you had to implement something, why you made those choices, what trade-offs you made, and how it was impacted. It's great to have the Digital Rebar engineering team as such a fantastic resource here to really share that experience. I hope you enjoyed this podcast. We need to know what type of topics you are interested in and how we can make your operations journey better. Shoot us a line, let us know, and thank you for listening to the podcast. Thank you for listening to the Cloud 2030 podcast. It is sponsored by RackN, where we are really working to build a community of people who are using and thinking about infrastructure differently, because that's what RackN does: we write software that helps put operators back in control of distributed infrastructure, really thinking about how things should be run, and building software that makes that possible. If this is interesting to you, please try out the software. We would love to get your opinion and hear how you think this could transform infrastructure more broadly, or just keep enjoying the podcast, coming to the discussions, and laying out your thoughts on how you see the future unfolding. It's all part of building a better infrastructure operations community. Thank you.