20250626 Using AI for Complex IT Problems

2:15AM Aug 8, 2025

Speakers:

Rob Hirschfeld

Jon Aykroyd

Keywords:

AI

agentic AI

log collection

log analysis

configuration history

security refactor

least privilege tokens

pre-cognitive bugs

vendor tooling

software bill of materials

anomaly detection

knowledge graphs

multi-variate linear regression

data retention

infrastructure operations

Hello, I'm Rob Hirschfeld, CEO and co-founder of RackN and your host for the Cloud 2030 podcast. In this episode, we talk about how to use AI, and agentic AI specifically, to rethink log collection and analysis, and how to use AI to find and solve really difficult problems, not specifically but generally. And I'm going to ask your patience a little bit on this one, because we get into it by talking about RackN, RAID and BIOS configuration history, and log history. So there's a bit of a build-up as we get to something that's really fascinating. Wait for it, it's coming. Understanding some of how we got there will help you appreciate where we got to in the conversation, because I know you're going to love it.

You are piling everything in as much as you can on the premise that July and August are going to be dead months. When has that happened? When has that happened in recent memory?

Definitely not. It'll probably slow down a little bit. We're getting a release out the door, and I think there'll be a lot of vacation for us.

So how often are you putting out major releases?

We end up on about a six-month cadence. I'd like to be a little faster, but six months seems to be about the pace.

I was going to say, customers get a little cranky if you start coming out with releases a little too often.

Yeah, they'd usually be happier if we spent more time testing. And the last one we did was very bumpy out of the gate. We did a major security refactor in it, and that had ripples that kept surfacing for a couple of weeks. So that was a big deal. The security thing's really cool, though. We generate tokens, obviously, for permissions to operate the system. When we built the system originally, we gave the machines a token that had very high access into the system to perform tasks. It's the agent, it's doing the thing. The challenge becomes, if you can grab that token, then you can impersonate the machine, and you have high access. There weren't a lot of alternatives for a long time, but about three years ago, maybe more, we added the ability to give every task a unique token. So when you run our system, it generates hundreds of tasks to accomplish, and each task can have bespoke security permissions. That way each task has the minimum access it's supposed to have in the system. But we hadn't gone back and said, okay, now we don't need the machine token to have as much permission, we can tighten things down quite a bit more. So we went back and made it so that we have least-privilege tokens in all operational contexts, which is a hugely powerful statement, and it can have profound implications if a task had made an assumption about having bigger permissions than it actually has. So we had a lot of work getting all those pieces right, and there were some other changes we made to backups. Anyway, more details than you probably wanted.

Right. Anytime you're messing with something as deep and as in the mix as security is, that's the gift that keeps on giving. You're never done.

You're never done. It's so hard. And the HA capability: we have a requirement for HA to be completely internal, so we can't just move HA into a database and say, set up an HA database.

How difficult must it be, and this is kind of one of those questions, but I often think about how difficult it must be for you guys to do testing, given that it's not like you have an enormous warehouse full of every piece of hardware that you're going to be dealing with, in all of the configurations that you must deal with. It's got to be crazy.

We're actually building a bigger hardware library to automate testing against a lot of those pieces, to try and catch more and more of them.
Even if we had it, our customers have their own configurations. So, you know, since we're just using the vendors' tools, ultimately it's the customer's configuration through the vendor tools, and we're just orchestrating that. There are places where we can help them or catch stuff, and we do, but there are also times when it's like those BIOS settings: the vendor has to tell them whether the BIOS settings will work or not. Does that make sense? So yeah, I want us to do a way better job of what I'm starting to call pre-cognitive bugs, borrowing the term from Minority Report. If we can catch something before it went into the field, something we should have been able to catch, I call that a precog bug. So I'm asking us to track things as a company: oh, we should have caught that, that shouldn't have surfaced in the field. And it's not like traditional bugs where we coded something wrong; fixing that, I think, is just fixing a bug. It's that Dell changed the, you know, this happens all the time, Supermicro is a top offender: their new version of the BIOS management tool has a behavior change from the last version. And that we should be able to catch, detect, and fix. So it's not a matter of, hey, you've got six models of Supermicro gear and one of them has a whatever. It's that we've got four or ten versions of the Supermicro BIOS management tools, and when they rev it, we should be pulling that in, testing it through a battery of tests, and then identifying that something changed in their tooling. That's what I mean. A pre-cognitive bug is: we know our customers are going to pull in this new whatever, you know, Linux distro, as soon as it's out. You know it's coming; they're going to hit this. If we can fix it before they hit it, that's a precognitive win. It's not a defect, it's an issue. You'll accept the shade-of-gray distinction between the two pieces. So I'm calling those precogs. Yeah, go ahead, Klaus.
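
As a side note on the least-privilege refactor Rob described a little earlier, here is a minimal sketch of the per-task token idea, with a hypothetical token issuer and made-up scope names. It is not RackN's actual API, just the shape of the idea: a stolen task token only exposes that task's narrow permissions, never the broad machine token.

```python
from dataclasses import dataclass, field
import secrets

@dataclass
class Task:
    name: str
    scopes: set          # the minimum permissions this task needs (hypothetical names)
    token: str = field(default="", repr=False)

def issue_token(task: Task) -> str:
    """Mint an opaque per-task token; a real system would sign the scope claims."""
    task.token = secrets.token_urlsafe(32)
    return task.token

# A workflow generates many tasks; each declares only the access it needs.
workflow = [
    Task("inventory-scan", {"machine:read"}),
    Task("bios-update", {"machine:read", "bios:write"}),
    Task("report-status", {"machine:read", "events:write"}),
]

for task in workflow:
    issue_token(task)
    # A grabbed token now exposes only that task's scopes, not the whole machine.
    print(task.name, sorted(task.scopes))
```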

So are you aiming to release a new build with support for the new firmware

before the clients begin using it? Or are you just aiming to patch as quickly as possible? Or release, like, make the announcement: hey, we've identified an issue with this new BIOS version, you might want to hold back until we have a new release.

Ideally we would have errata, if you will, go out on a regular basis. And what I'm hoping is we'll have enough intelligence about customers' environments that we could narrowcast it, to say: hey, we know you're using Supermicro gear, Supermicro updated this tooling, we've already accommodated the change, but you need to take a patch before you do an upgrade. And so that means there's a degree of, ideally, data collection that we're doing with our customers to know what their inventory is, so we can help them identify that. I mean, it happens on a weekly basis, right? The repos for the Linuxes get deprecated regularly, and the vendor tooling changes all the time.
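
As a rough, hypothetical illustration of that narrowcasting idea: match a vendor advisory against per-customer inventory data so only the affected customers get the notice. The data shapes, tool names, and versions below are invented.

```python
# Advisory and inventory records are simplified stand-ins for real data collection.
advisory = {
    "vendor": "supermicro",
    "tool": "bios-mgmt",                    # hypothetical tool name
    "affected_versions": {"2.10", "2.11"},
    "note": "Behavior change in attribute export; take the patch before upgrading.",
}

inventories = {
    "customer-a": [{"vendor": "supermicro", "tool": "bios-mgmt", "version": "2.11"}],
    "customer-b": [{"vendor": "dell", "tool": "racadm", "version": "9.4"}],
}

def affected_customers(advisory, inventories):
    """Return only the customers whose inventory matches the advisory."""
    hits = {}
    for customer, items in inventories.items():
        matches = [i for i in items
                   if i["vendor"] == advisory["vendor"]
                   and i["tool"] == advisory["tool"]
                   and i["version"] in advisory["affected_versions"]]
        if matches:
            hits[customer] = matches
    return hits

print(affected_customers(advisory, inventories))   # only customer-a gets the notice
```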

Yak shaving.

Oh my, I was going to say it's even faster than that. GPU patches come out, BIOS patches are coming out all the time, OS integrations. I was just watching a comment on our OpenShift work with Red Hat. Our CTO posted this: Red Hat gave him a notice that there was an upgrade available for his cluster, but not to take it, because the APIs had changed and he needed to make sure he was operationally ready before he did. Let me actually see if I can find it, where did that go? It was hilarious. It's like, don't take this patch. Here it is. It said: this cluster should not be updated to the next minor version. Kubernetes 1.32, and therefore OpenShift 4.19, removes several APIs which require admin consideration. Please see the knowledge base article for details. And that's a minor version.

Yes, but they do make backwards-incompatible changes in minor versions, although these are usually announced very far in advance. So if you only got this at the last minute, then someone dropped the ball.

Yeah. I mean, it's good for them to do it. This was the source of the Reddit outage, if I recall; they had something similar. They'd taken a patch that broke a file-naming change. It was the removal of the word master from the config files. And that had been, you know, implemented slowly over time, gone from "we're going to do this" to warnings to errors to "no longer works" over four years. But they had a config file that they didn't realize was getting ingested into Kubernetes, and that cascaded and broke things.

Yeah, there are metrics that will warn you, or cluster events that will warn you, if you're using a deprecated API. But if you don't know of their existence... that's hard. It's hard.

No, it is. I mean, we started the call with me talking about changing some security pieces where there was an assumption of access that we undid,

or the scripts in CI that assume results are in a certain path, and then the path changes, and

we get... and this happens in Redfish a bit: somebody implements the Redfish API incorrectly, so the behavior response for an API call is incorrect. They usually fix it in a future version, but now you've got two different behaviors on the same API, and you have to be aware of what you're going to do. This is why, when people ask us if Redfish is universal: it can't be, right? One vendor uses the API one way and another one does it a different way, and then you've got to account for that. Really tough. This is what we deal with. It's hard.

Well, yeah, would a software bill of materials help or hurt in that context? Or, let me ask a better question: would an AI-driven software bill of materials help?

Great question.

I mean, if the AI was creating the software bill of materials,

if the AI was constantly checking for those types of errors and omissions and, you know, oversights. It's a great question,

kind of. Even if it didn't do that kind of checking, if all it did was go back and run through all of the dependencies? Yes. If you could sit there and say, here's my bill of materials, I've looked at all of the relevant code and so forth, here are all of the dependencies, and it actually went through and said, all right, since the last time, this is what's changed in one of the dependencies, here's what we know about the new versions. If you had the notion of an active bill of materials, it would be a very, very interesting world. It would change the whole notion of what a software BOM would be.
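
As a sketch of that "active bill of materials" notion, the toy below diffs two dependency snapshots and reports what changed since the last check. The formats are simplified stand-ins; a real SBOM would be SPDX or CycloneDX, and an agent or a human would then chase release notes or advisories for each change.

```python
# Two hypothetical dependency snapshots: name -> version.
old_sbom = {"openssl": "3.0.13", "zlib": "1.3", "requests": "2.31.0"}
new_sbom = {"openssl": "3.0.14", "zlib": "1.3", "requests": "2.32.3", "idna": "3.7"}

def diff_sbom(old: dict, new: dict) -> dict:
    """Report additions, removals, and version changes between two snapshots."""
    return {
        "added":   sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": {name: (old[name], new[name])
                    for name in set(old) & set(new) if old[name] != new[name]},
    }

changes = diff_sbom(old_sbom, new_sbom)
# The "active" part is running this continuously and following up on every entry.
print(changes)
```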

I was thinking of it as, well, potentially interactive, but as a dynamic object, constantly looking. It's kind of like the notion of CDC, but in real time.

I do think it would be useful. The challenge is you need good data on both sides of this equation. You need a source of truth for the AI to talk through the challenges, right? It's not going to be in the model. And you need a source of truth from the machine. So, one of the things I was talking about is actually collecting a deep scan and knowing our customers' inventory better, so that we can then identify: hey, we have this change, you're likely going to be hit by it. I think there's definitely an application for AI to analyze those things and provide some human context for it. My first concern is actually less on the AI and more on the data,

I'm sorry, on what? The data on...

On knowing, accurately, what people have in their fleet, and then knowing what the universe of potential pieces is in the outside world. But I'm not sure what AI can do there. I mean, it could tell you if there's drift, it might be able to diagnose... yeah, it would have to read the release notes. Or, I don't know, I guess I'm thinking about: do I take a patch? Do I upgrade? Where are you? What are you thinking?

I'm thinking it wouldn't be broad-set generative, it would be agentic. And you could containerize it, you could give it programmatics for reasoning, or rules for reasoning and logic, and whatever. And you'd do the deep scan first, because on the deep scan side, it's not unlike a piece of equipment on the shop floor, right? You're going to know everything that's in that machine in terms of what its functionality is, what its metrics are, what its benchmarks are. You would do the same for the piece of hardware. That would give you your baseline. And then from that, the agents could go in terms of

what's changed. What's the baseline? What's performant, what's less than performant or less than optimal? What changes are there? What are the dependencies of those changes? What are the implications of those changes to everything else? One thing changes and everything else kind of dominoes, or doesn't domino. Recommended courses of action based on historical knowledge and/or real-time knowledge together. And have it constantly aware, because with open source software, even if something is not deprecated, changes are made by the populace that contributes to it. That would be on a constant basis.

It'd be interesting to think of an agentic system that could compare current results and start looking, component by component, at where patches are available and evaluating possible interactions with the patch. You need a fair bit of history, and then ideally you'd have a test. I mean, what we encourage our customers to do is test changes before they put them into production, so they have a test system. And conceivably you could have an agentic test system that would run through the scenarios pretty quickly. It's funny, this goes back ten years. I've had people looking for more AI-driven server management with the idea that you could just brute-force it: the AI could change, test, change, test, change, test, looking for performance improvements.

Well, if it's originally designed to run autonomously, but released as human-in-the-loop, so that you have human-AI collaboration until it's trusted and trustworthy, that would work.

The funny thing that we get with the BIOS pieces is that there are a lot of cul-de-sacs. And maybe the new generation of systems is smart enough to avoid some of the cul-de-sacs. Really, you want to collect data back and learn. It's actually pretty easy to break a system, okay? So you want to make pretty intentional tries, and then you need to be able to reset the systems to do the experimentation, and the apply-and-reset process ends up being hard. That's what we do.

Yeah. I mean, I just want to put on a skeptic hat here and try to pick these pieces apart, and also a little bit more of a pragmatic hat. From where I sit, in terms of implementation challenges: yes, getting the data is one, but as you said, Rob, doing the tests consistently and reliably is another. If I were to approach it, I wouldn't take an AI-first approach. I would take a regression-test-first approach, because it's not like the behavior of the system is variable; the inputs that it takes and the outputs that it gives are deterministic. So if you know which APIs work today, on the current, latest version of the BIOS, then you can run a standard regression test against the next version and see if it passes or fails. If it passes, great, and if it fails, then you need to figure out where it fails. And there may be a case there for an AI system to give you a hint toward what needs to be done to make it work, but in many cases, for someone familiar with the system, it will be pretty obvious, like: oh, this flag now flipped to true instead of false, or something like that. Again, this is me putting the skeptic hat on, not necessarily a universal statement.
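
Klaus's regression-first approach could look something like the sketch below: capture the current tool's behavior as a baseline, replay the same calls against the new version, and flag any drift. The tool invocation, flags, and attribute values are placeholders, not any specific vendor's CLI.

```python
import subprocess

# Baseline behavior recorded against the current tool version; calls and values
# are invented stand-ins for whatever the vendor utility actually exposes.
BASELINE = {
    ("--get-attribute", "BootMode"): "Uefi",
    ("--get-attribute", "SecureBoot"): "Enabled",
}

def run_tool(binary, *args):
    """Invoke the vendor tool and return its trimmed stdout."""
    result = subprocess.run([binary, *args], capture_output=True, text=True, check=False)
    return result.stdout.strip()

def regression_check(binary):
    """Replay the baseline calls against a new tool version and report any drift."""
    failures = []
    for args, expected in BASELINE.items():
        actual = run_tool(binary, *args)
        if actual != expected:
            failures.append(f"{' '.join(args)}: expected {expected!r}, got {actual!r}")
    return failures

# failures = regression_check("/opt/vendor/new-version/tool")
# An empty list means the new version still behaves like the old one for these calls.
```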

I think in the... I'm sorry, go ahead. Sorry, no, I jumped in front of you. Go ahead.

This is kind of why I was thinking that the use of AI in this particular case would really be more limited to agentic-based search into all of the precedents, all of the dependencies, and not so much trying to analyze them as being a smart notification system when something has changed. That strikes me as, at least on an immediate basis, a lot more important. And to Klaus's point, more often than not, when you're in the midst of it, you see a behavior, a human being sees a behavior, that says: that's a lot like something else I encountered way back when. I've been dealing with this very thing, building a much less complex system than what you're describing here, but I've been using Claude Code, which is really excellent. As much as you try to nudge it toward the idea of, hey, have you seen a behavior that's kind of like this before, it's hard to get them to go back and, at some point, click in and say: we're spending a lot of time on this, let's take a step back, have I seen something like this before, can I go through my histories and so forth? I find that more of my human-in-the-loop role here with them is to say, you know, we had a problem like this before with

some package manager that was in the midst of doing something, and it wasn't using the virtual environment that had originally been set up, or there was a lack of compatibility between what we were doing locally and in a Docker container, and they were at cross purposes. Those have been my more valuable inputs into the process. A lot of it has gone into the debugging. They're great debuggers; they just run through. But they hit certain kinds of problems where I'm better at it than they are.

No question. But that goes to autonomy versus human in the loop, and how much supervision, and capturing tribal knowledge, and I use the term loosely. But I have a question, aside from the grimacing that you were doing while I was elaborating on this: has anybody ever tried to take the logs of all of this stuff and throw them into something akin to a historian, the kind that's used for manufacturing? Because that would be the ideal way to do this. You could take all of the logs and all of the data that exists and, using AI, do the ingestion, then start doing some embeddings, and use the power of a historian-type database as your foundation, and then run against it.

Really expensive, though.

It would be very expensive, but there are some things that you would have to solve first to make it even close to economically viable. Part of it would be basically data compression, in the sense that I don't want to keep all of the "I am still here and the temperature is 72 degrees" logs. I want only the changes. So there's kind of a change issue there. There are lots of things you would have to solve, but yeah, I've actually often thought about exactly that, and one of the things you want to do is take the logs and, rather than just throwing them in, figure out how to build patterns that you can save, and then do the comparisons against those. So it's a great idea, and it's one that I think is absolutely coming sooner or later, but there's some real expense that has to be managed before you can get there.
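
The "keep only the changes" compression described here can be illustrated with a tiny filter that drops repeated heartbeat-style readings and retains a record only when a source's observed value changes. The example events are invented.

```python
def keep_changes(events):
    """events: iterable of (timestamp, source, value). Yields only state changes."""
    last_seen = {}
    for ts, source, value in events:
        if last_seen.get(source) != value:
            last_seen[source] = value
            yield ts, source, value

stream = [
    (1, "temp", "72F"), (2, "temp", "72F"), (3, "temp", "72F"),
    (4, "temp", "74F"), (5, "temp", "74F"), (6, "fan", "ok"), (7, "fan", "degraded"),
]
print(list(keep_changes(stream)))
# [(1, 'temp', '72F'), (4, 'temp', '74F'), (6, 'fan', 'ok'), (7, 'fan', 'degraded')]
```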

Well, here's the other question. If the action of the agent, or the execution of the agent, is to find the anomaly, I mean, this is classic anomaly detection in a lot of respects: take the anomaly and test it against the trend, to see what comes out of it, before it does anything else. That's part of its analysis. Put it into the sandbox and test it for what it is, and see where the real bugs actually come out. That would be 95% of the way there.

An approach that could perhaps be taken toward making the log retention, or the knowledge retention, more economical is something similar to what can be done with OpenTelemetry traces: the difference between head sampling and tail sampling. With head sampling you just reduce the volume of data, but your signal-to-noise ratio stays the same. With tail sampling, you can improve the signal-to-noise ratio so that you only retain the events that you care about, and then you combine the two so that you also retain a portion of the rest, because they're still statistically significant. Paige Cruz did a wonderful presentation on this, like two or three years ago. Anyway, bottom line is, what's missing here is identifying which logs, or which history, need to be kept, and which are either redundant or not significant enough.
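
A toy version of the tail-sampling idea Klaus references: the keep-or-drop decision happens after the event is complete, so errors and outliers always survive and only a small random slice of routine traffic does. Real deployments would do this in the telemetry collector; the thresholds here are arbitrary.

```python
import random

def tail_sample(trace: dict, keep_ratio: float = 0.05) -> bool:
    """Decide after the fact whether a finished trace/event is worth retaining."""
    if trace.get("error"):                    # always keep the interesting ones
        return True
    if trace.get("duration_ms", 0) > 1000:    # or anything anomalously slow
        return True
    return random.random() < keep_ratio       # plus a statistical sliver of the rest

traces = [
    {"id": 1, "error": False, "duration_ms": 40},
    {"id": 2, "error": True,  "duration_ms": 35},
    {"id": 3, "error": False, "duration_ms": 2400},
]
kept = [t["id"] for t in traces if tail_sample(t)]
print(kept)   # 2 and 3 always survive; 1 only occasionally
```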

Yeah. Actually, it's probably a combination of things, because for some logs you want to keep detailed histories, even though it's kind of expensive. And the reason is that the nature of the anomaly may not be parametric; it may be non-parametric. What you're looking for are, for example, time shifts: something happens here and you see a pattern. It doesn't trigger some sort of alarm, it hasn't gotten too hot or too cold or too fast, it's still in range, but you start seeing a pattern in one stream of log data which, just a little bit time-delayed, starts to get mirrored somewhere else. And then you kind of go: those shouldn't be talking to one another, but they're both being influenced by the same thing. The other is the time spectrum, both expanding it and contracting it. I think what probably happens in something like this is you go through some learning period where you make selections about the best way to keep the data, what the important data is, what you keep about that particular log file. It's investigatory, to the point where, if you're going to do it, it's got to be a pretty important system, and it's got to be one in which you're willing to pay that freight, either because it's mission-critical or because minor changes in it are going to result in very bad things. It's a criticality decision.
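
One cheap way to surface the time-shifted mirroring described above is lagged correlation between two metric streams pulled from the logs. The sketch below finds the delay at which one stream best tracks the other; the data is synthetic.

```python
def correlation(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def best_lag(a, b, max_lag=10):
    """Return the lag (in samples) at which stream b best mirrors stream a."""
    scores = {lag: correlation(a[:len(a) - lag], b[lag:]) for lag in range(max_lag + 1)}
    return max(scores, key=scores.get), scores

a = [0, 0, 1, 3, 7, 3, 1, 0, 0, 0, 0, 0]
b = [0, 0, 0, 0, 1, 3, 7, 3, 1, 0, 0, 0]   # same shape, delayed by two samples
lag, _ = best_lag(a, b)
print(lag)   # 2 -- a hint that the two streams are being driven by the same thing
```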

I would say also, to your point about the overhead of it: with the log files, if at frequent intervals you would do, and I know this is going to sound weird to you, multivariate linear regression, you would be able to spot the beginning of the trend line that you were just discussing and get ahead of it, so that you could keep that specific encounter, and that forms the beginning of a design pattern for what else you're going to look at. Now, you may not be able to use the same design pattern in every single case, but that starts your library of them. And then why wouldn't you just do the storing of...

Yeah, kind of. Are you saying capturing that, doing multivariate linear regression analysis more or less on the fly, and then over some period of time deciding what you're going to throw away, what you're going to compress in a certain way, and what you're going to keep with high fidelity?
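
A hedged sketch of that on-the-fly multivariate linear regression: fit a recent window of log-derived metrics, project the trend a few steps ahead, and use the projection to decide whether the window is worth keeping at full fidelity. The features, values, and threshold are all invented.

```python
import numpy as np

# rows are samples over time; columns are log-derived features
# (say, queue depth and error rate, both hypothetical here)
X = np.array([[10, 0.1], [12, 0.1], [15, 0.2], [19, 0.3], [24, 0.5]])
t = np.arange(len(X), dtype=float)
y = np.array([61.0, 63.0, 66.0, 71.0, 78.0])   # e.g. p95 latency in ms

# fit y ~ b0 + b1*t + b2*queue_depth + b3*error_rate
A = np.column_stack([np.ones_like(t), t, X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# naive projection three samples ahead, extrapolating the features linearly
steps = 3
x_next = X[-1] + (X[-1] - X[-2]) * steps
projected = coef @ np.concatenate(([1.0, t[-1] + steps], x_next))

print(f"projected metric: {projected:.1f}")
if projected > 100:    # hypothetical ceiling: this window becomes worth keeping
    print("trend is heading somewhere interesting; retain the detailed logs")
```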

Absolutely.

Again, taking the skeptical hat here, or maybe not skeptical, the pragmatic hat: it might be overkill to do that, because, again, this is coming out of the SRE field. When you have a problem, you have an incident; just keep the data that's related to the incident. But

don't you want to get ahead of having the incident happen?

Well, it's kind of your pre-cogs, you know. You want to go from analysis to prediction to prescription. Yes, obviously. But what I'm

trying to get at here is that the data that you get from the incident tells you what to look for in the pre-cog incident, or in the pre-cog approach, as opposed to trying to use linear regression or machine learning tools to

Are you saying trying to guess which data is from previous incidents? Are you looking for historical evidence from a prior incident?

Yes, yeah, that's your training data, absolutely,

But here's my question. If you were to do that, though, and keep the incident, which is a very good idea for lowering the amount of data and whatever else: how often is there a variation in the incident? You get the incident, but the incident specifics are not exactly the same every single time, very often.

But that's the point. So here's something that is fascinating to me. I'm drifting away from logs a little bit, into shared information. The interesting thing about the logs is that you're captive in your own piece. One of the things that becomes interesting to me is, if we're capturing information where the actual issue is not my individual system's performance but am I taking an upgrade, am I taking a patch, can I compare that to other systems? So we're working to build a database of good configs. What I hadn't thought to do was to capture a database of failed configs. And so it would be really fascinating if, when a customer tries a configuration and it doesn't work, they report that to us as a failed config with whatever data they get. That would be fascinating; it's incredibly important. But we don't even think to capture that. We're only capturing the good ones. It's like big science: too many times when you have a hypothesis and the results don't support it, you throw away the results, you throw out the hypothesis. But failed experiments are important, and they're probably more valuable. I need to think about how to encourage the team to think about that. The place where agentic AI would be interesting is if you had the failure: we tried this patch, it failed. You could then look up the BIOS notes, or even scan the code for what the deltas were between the versions. And then you could potentially learn, from code analysis, the likely or unlikely bugs, or what the failure modes would be. That's what's fascinating to me about the logs. Now, the question is, I could give you, potentially, a log of all the things that were tried, the failures and the non-failures and things like that, but the agentic capability to then analyze the underlying code or the release notes for that issue, to look at trends related back to success or failure, that could be just unlimited. Yeah. Well, yes, I can tell you right now that one of the things I've started to do with my use of coding agents, because I break things into tasks and don't let them just run willy-nilly through the weeds, is that every time a task ends, it ends with task completion: this is what happened, this is how I did it, in a fairly compressed form. I literally keep a set of memory files, basically release notes on each portion it's been working on, and they're all timestamped. And it can go back pretty easily and say: what happened? Where am I now? Most of the reason I do that is because I have to be concerned with context, how much I bring into the AI's context. This is quite useful for keeping track of where you are in a large project, but it is absolutely valuable for exactly the reasons you were just mentioning. You go through and look at: I tried this, I tried that, I tried this, this one worked, boom, boom, boom. Keeping that kind of data, even if you don't keep all of the results of the trials, is very important. Does that end up being agentic AI, or is that something else?
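
The "database of failed configs" could start as simply as the sketch below: record every attempted configuration with its outcome so a later run, human or agent, can ask whether a combination has already been tried and failed. The schema and field names are illustrative, not an actual RackN feature.

```python
import json, sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE attempts (
    ts TEXT DEFAULT CURRENT_TIMESTAMP,
    hardware TEXT, tool_version TEXT, config TEXT,
    outcome TEXT, detail TEXT)""")

def record_attempt(hardware, tool_version, config, outcome, detail=""):
    """Store both good and failed configuration attempts, not just the good ones."""
    db.execute("INSERT INTO attempts (hardware, tool_version, config, outcome, detail) "
               "VALUES (?, ?, ?, ?, ?)",
               (hardware, tool_version, json.dumps(config, sort_keys=True), outcome, detail))

def already_failed(hardware, tool_version, config):
    """Has this exact combination been tried and failed before?"""
    return db.execute("SELECT detail FROM attempts WHERE hardware=? AND tool_version=? "
                      "AND config=? AND outcome='failed'",
                      (hardware, tool_version, json.dumps(config, sort_keys=True))).fetchone()

record_attempt("smc-x12", "bios-mgmt-2.11", {"SecureBoot": "Enabled"}, "failed",
               "tool exits 0 but attribute not applied")
print(already_failed("smc-x12", "bios-mgmt-2.11", {"SecureBoot": "Enabled"}))
```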

It's machine learning. It's agentic AI. It's a lot of different

things, all of the above. My agents have a protocol. When I start a new session, when I start a new task, there's a thing they do to come up to speed: where am I? Where's the project? What just happened? What notes did my predecessor leave for me? Tell me what's high on the list. And that becomes very agentic, because you can have multiples working in parallel, and they're just picking, like agile programming, picking the issue off the list. What's the next one? Yeah, I've got it. What you're describing to me is a degree of data retention for learning behavior that we don't often think about. This is a really interesting insight into how AI needs to change our learning behavior, because a lot of times people don't want to keep track of all the failed attempts. But if we're dealing with an AI system, those failed attempts inform its next action as much as the prompt does. Precisely. It's fascinating. And every time you avoid going over an aspect that has already been tried and shown to be a failure, you start to save quite a bit in token consumption and production. There are lots of reasons for doing it.
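
A minimal sketch of that come-up-to-speed-then-claim-a-task protocol, with invented file names and fields. A real multi-agent setup would need locking or a proper queue, but the shape is: read the predecessor's notes, claim the next open task, and leave a timestamped note when done.

```python
import json, time
from pathlib import Path

NOTES = Path("release_notes.jsonl")   # timestamped task summaries, one JSON per line
TASKS = Path("tasks.json")            # [{"id": "...", "status": "open"|"claimed"|"done"}, ...]

def catch_up(limit: int = 5) -> list:
    """Where am I? Return the most recent notes left by predecessor sessions."""
    if not NOTES.exists():
        return []
    return [json.loads(line) for line in NOTES.read_text().splitlines()[-limit:]]

def claim_next(agent_id: str):
    """Pick the next open task off the list and mark it checked out."""
    tasks = json.loads(TASKS.read_text()) if TASKS.exists() else []
    for task in tasks:
        if task["status"] == "open":
            task.update(status="claimed", owner=agent_id, claimed_at=time.time())
            TASKS.write_text(json.dumps(tasks, indent=2))   # real systems need locking here
            return task
    return None

def leave_note(agent_id: str, task_id: str, summary: str) -> None:
    """Compressed 'this is what happened and how I did it' record for the next session."""
    entry = {"ts": time.time(), "agent": agent_id, "task": task_id, "summary": summary}
    with NOTES.open("a") as fh:
        fh.write(json.dumps(entry) + "\n")

# Typical session: notes = catch_up(); task = claim_next("agent-7"); ...; leave_note(...)
```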

And, you know, the other thing that comes out in the process is whether you use a knowledge base to start putting things together, or you use vector databases, or the combination of the two, like we are; it makes a huge difference. But the other thing that I've also found is knowledge bases and knowledge graphs. The idea that you can have multiple knowledge graphs all running at the same time means that, to Rich's point about tasking, you can sub-task, and you can also start saving a lot of tokens and optimizing by the way in which you designate certain tasks to run over and over again. Because remember, agents can collaborate with each other, right? A routing agent can call a knowledge agent, which can call a programmatic agent; as long as they have the right hooks to each other, this becomes like Lego. And so you can configure these, to a certain extent pragmatically, in a way that gets you to your ultimate goal much faster, because you come to the point with each knowledge graph of asking: was the failed configuration caused by the same type of error? And you make that closer to the original error by weightings and by distance, so to speak. Their GPS coordinates, so to speak: are they farther or closer? If it's very, very close, like within a millimeter or whatever,

whatever the similarity metric is, yeah,

right, yes, similarity, that's the word I was looking for. Thank you. Because the similarities and the synonymous kinds of errors can all be relegated to a certain pile, kind of like: if it's red, is it red-orange? Is it more orange, more coral? There are different shades. So that allows you to start segmenting even further and further. That reduces your token cost, and that reduces your complexity, because it doesn't matter how many agents you have, as long as you know what they're doing, what they're supposed to be doing, and how they're performing, you're good. There's no limit on how many agents you can have running that we've found. Have you?
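
To make the "how close is this failure to ones we've seen" idea concrete, the toy below scores a new error against known failure signatures with bag-of-words cosine similarity. A production version would use a real embedding model and a vector database; the error strings and labels here are made up.

```python
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    """Crude bag-of-words cosine similarity between two error messages."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(v * v for v in va.values()))
    nb = sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

known_failures = {
    "bios attribute SecureBoot not applied after update": "vendor tooling change",
    "repo metadata signature check failed during install": "upstream repo deprecated",
}

new_error = "attribute SecureBoot not applied, bios update reported success"
best = max(known_failures, key=lambda k: cosine(k, new_error))
print(best, "->", known_failures[best], f"(similarity {cosine(best, new_error):.2f})")
```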

I'm sorry, what's the question? Are there limits on the number of agents you can have running

all at once? Yeah,

Well, not really. There are some practicality issues, but no. Again, it's a matter of what they're working on, and so long as they don't, you know, start... like anything else, you almost have to check out a problem the way you do in GitHub. It's like: I'm checking this out and I'm working on it. You could work on it as well, on a separate branch, but recognize that I need to know somebody else has this checked out and is working on it, and maybe have an exchange of information about what we, as agents, are doing with the same part of it. It's more a matter of not tripping over one another. It tends to end up being choreography, especially when you try to merge back into the main tree, right? You've got to have agents that know they're waiting on answers upstream. So that's part of the choreography.

Yeah. But, I'll give you a for-instance: we tried to look at a particular problem, or a particular task, and apply different perspectives,

and which is what you do with the Knowledge Graph,

exactly, which is what you're doing with the knowledge graph. And we started with two, then we did four, and we actually got it up to eight different agents all looking at the same thing with a slight difference in their perspective. Okay, yeah, and that's the point that I'm trying to make to you, Rob.

What do they do with that? And then what happens when they're done? How do they come to a conclusion? Do they all get together for a beer after work, like, hey, what did you find? What's the mechanism?

The last step in each of their choreographies is to create an overall context for the human to look at.

Ah, so in the end it's human in the loop. But you still have to take all the results and then coalesce them into something that's digestible for a human being,

right. It also, I mean, it's a good test tool, because it makes it very simple to see which agents, even from what they've been given, are drifting a little or skewing the perspective a little.

There are ways in which you can, in some cases, for some of the issues, actually have the agents self-score the solution or the answer they've come up with. Some problems just aren't amenable to that, but there are enough of them that are fairly easily recognizable, especially if you have design patterns that are used often. So what happens is: I've got a design pattern, I have a way of scoring my result. If I'm dealing with one of these problems, we come back together and say, all right. Now, if our scoring mechanism isn't scoring the right things, then that's on us. But again, there are ways of doing that that don't necessarily call for a human to be the final adjudicator. That's still a black art, though.

human adjudication.

Yeah, well, adjudication altogether. Hey, folks, I'm sorry, I've got to jump. It was a great conversation. I learned something interesting. Thank you. Thank you. Have a good weekend. No meeting next week, right?

Oh, yes. Happy Fourth of July in advance, by the way. If you're not recording, and I'm hoping that you weren't...

Wow, there's so much going on with agentic AI and analysis. And the insight here, that we are going to be able to learn so much more from mistakes and errors if we keep them around than we've ever been able to do before, is a really powerful one. I'm interested to know what you think of that insight, and whether it's going to change your behavior: learning from negative consequences by keeping them around and feeding them into your AI systems. Thank you for listening to the Cloud 2030 podcast. It is sponsored by RackN, where we are really working to build a community of people who are using and thinking about infrastructure differently, because that's what RackN does. We write software that helps put operators back in control of distributed infrastructure, really thinking about how things should be run, and building software that makes that possible. If this is interesting to you, please try out the software. We would love to get your opinion and hear how you think this could transform infrastructure more broadly, or just keep enjoying the podcast, coming to the discussions, and laying out your thoughts on how you see the future unfolding. It's all part of building a better infrastructure operations community. Thank you.