Hello, I'm Rob Hirschfeld, CEO and co-founder of RackN and your host for the Cloud 2030 podcast. In this episode of the Tech Ops series, we dive deep into logging, tracing, metrics, and observability, with a specific filter for automation and systems and infrastructure, which I really find fascinating. There's a real challenge in here of: how do you capture information from a running system in a way that provides the right information at the right time? And that, fundamentally, is the question that we are working to answer throughout this really fascinating discussion about logging.
Topic of the day was to talk about improved logging. We started down this road a couple of sessions ago, and I wanted to go back to this idea of, how do we improve our outcomes from logging? I think there's an observability piece to this, but I wasn't going all the way to observability. The prompts I had to start the conversation were these ideas of structured versus unstructured logging, how much formatting and how consistent your logging needs to be to really make it most effective, and what to capture from an automation log. So, putting on strictly an automation hat, I know this is something that I struggle with as we write automation: what should the automation output? How much detail do you need? How much debug information? How do you make it so you can turn off debug information, so you're not flooding the logs or leaking sensitive information? How do you deal with failures, and how do you find a failure inside of a log? I have some other notes, but those are, I think, good ways to open the conversation: how do y'all think about logging and building good logs?
If you asked me 10 years ago, I was all over that. Back when Prometheus was not a big thing yet, the gold standard for me at that time was essentially structured logs, and then parsing those into Elasticsearch or Graylog or whatever tool you would use at the time. These days, I rely less on logs for structured data. But with JSON being so ubiquitous, it's hardly controversial to say that you should prefer structured logs versus, let's say, syslog-style formatting, or, even worse, Java-style multi-line logs, which I absolutely despise.
Multi-line?
Yes. For example, Java has the habit of not only dumping log messages but also stack traces into the logs, and each frame of those traces is another line in the log. So unless you preserve the exact order in which that was emitted, and don't mix it with other logs, it's just garbage.
So again, structured logs like JSON work around that by simply encoding the multi-line data into a single field, right?
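A minimal sketch of what structured logging buys you here, using Go's standard log/slog JSON handler (the package choice and the field names are assumptions for illustration; any structured logger works the same way): the multi-line stack trace becomes one escaped field inside one record, so it can never interleave with other log lines.

```go
package main

import (
	"errors"
	"log/slog"
	"os"
	"runtime/debug"
)

func main() {
	// JSON handler: every record is exactly one line of structured data.
	log := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	err := errors.New("connection reset by peer")

	// The stack trace is multi-line text, but it is encoded as a single
	// JSON field, so it cannot interleave with other log records.
	log.Error("provision step failed",
		"step", "install-bootloader", // hypothetical step name
		"error", err.Error(),
		"stack", string(debug.Stack()),
	)
}
```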
Yeah, really the only best practice that I think is widely agreed upon, coming from the twelve-factor app community, is treating logs as event streams. I tried to find something like white papers that would say something more than that, which I'd be interested in, but I haven't really found much.
Go ahead, Greg,
Yeah, so for us at RackN, we've kind of got multiple layers to think through when we talk about our logs, because we have, like, four different areas of logging. And so part of the discussion becomes: who's the audience? Why are they seeing it? What do they need to know and do about it, and what are they trying to solve with it? So we start with audience: is this the developer, is this the operator, is this the auditor, those kinds of things. And then, how are we going to provide that data usefully? Then there's: is this for failure, for performance tuning, those kinds of things? How do we drive it?
How would... I'm sorry, finish your question, go ahead. I was going to ask: if you're looking at logging from a failure versus a performance perspective, how would you treat those differently?
Well, so this is where Klaus was talking about, you know, now with Prometheus and stuff like that, a lot of the performance and other things that you might have done by kicking out logs at a certain interval, it's more efficient and effective to go through your code and not generate any logging at all, but provide counters and metrics and histograms and stuff like that to keep track of what you're actually doing. Because at a certain point you have to transition from operational to tuning or performance. Some people would classify those as problems, and of course they can be problems that you might want to log about, but from the standpoint of getting the appropriate information, getting over-time data, causality, those kinds of things, the Prometheus kind of tool set is a much better way to visualize, alert, and trigger on that, right?
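A sketch of the pattern Greg describes, using the Prometheus Go client (metric and task names are made up for illustration): counters and a histogram replace periodic log lines, and Prometheus pulls them from a /metrics endpoint rather than the application pushing log output.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// How many tasks ran, by outcome, instead of logging each success.
	tasksTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "automation_tasks_total", Help: "Completed tasks by outcome."},
		[]string{"task", "outcome"},
	)
	// How long tasks took, instead of logging timings at an interval.
	taskSeconds = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "automation_task_duration_seconds", Help: "Task duration."},
		[]string{"task"},
	)
)

// runTask wraps any piece of automation and records outcome plus duration.
func runTask(name string, fn func() error) error {
	start := time.Now()
	err := fn()
	taskSeconds.WithLabelValues(name).Observe(time.Since(start).Seconds())
	outcome := "success"
	if err != nil {
		outcome = "failure"
	}
	tasksTotal.WithLabelValues(name, outcome).Inc()
	return err
}

func main() {
	prometheus.MustRegister(tasksTotal, taskSeconds)

	// Example run (the task name is hypothetical).
	runTask("set-bios-config", func() error { time.Sleep(50 * time.Millisecond); return nil })

	// Pull model: Prometheus scrapes this endpoint on its own schedule.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2112", nil)
}
```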
So the language is getting a little muddled. It seems like in the cloud-native community there's metrics, there's tracing, and then there's logging, and it all goes under telemetry. And so when we're talking about logging, sometimes it's like everything...
We just lost you there, yeah.
Can you hear me now? Sorry, I switched my phone. So it's like, when we talk about logging, it's everything that's not metrics or tracing. Metrics are pretty well defined; there are good standards on what they are and how to implement them, push versus pull, those things. Tracing, there are some arguments about how much value it has, but we kind of know what that is. And then logging comes in, and in my mind it's everything that isn't those two. But I'd be interested in what other people think it means.
So, interestingly enough, for RackN, tracing is a knob on logging, if you will. Mostly because, if you're talking about tracing, you either need to integrate it into your logging system so it can go somewhere, or you have to create a separate application space for tracing, right? I'm not sure there's been a kind of canonical tracing and application methodology, right?
I know, yeah. I think there are, like, the OpenTracing standards, yeah.
Yeah. So actually, OpenTelemetry right now is the de facto standard. It has been adopted by practically all of the vendors, including Splunk and Elastic.
For reception of telemetry data, right?
Or for the protocol, which then also includes, yeah, reception. And then there's a significant number nowadays of standardized OpenTelemetry libraries. The nice thing about those libraries is that they not only produce traces, but they can also produce metrics and logs, and because they're all standardized, it means that you get uniform log formats and naming schemes for your metrics and traces across the board.
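A hedged sketch of the kind of library setup being described, using the OpenTelemetry Go SDK (package paths and the service name here are assumptions about a typical install, not anything the participants use): one provider exports spans over OTLP, and the shared resource attributes are what give uniform naming across signals.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// OTLP over gRPC: the vendor-neutral protocol most backends now accept.
	exp, err := otlptracegrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		// Shared resource attributes carry the same naming into every signal.
		sdktrace.WithResource(resource.NewSchemaless(
			attribute.String("service.name", "automation-runner"), // hypothetical name
		)),
	)
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// Emit one span, tagged like the jobs discussed here (attribute is illustrative).
	tracer := otel.Tracer("pipeline")
	_, span := tracer.Start(ctx, "provision-node")
	span.SetAttributes(attribute.String("job.type", "image-deploy"))
	span.End()
}
```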
So the challenge that I have with that is, this is back to your audience: does the audience that's consuming this have access to a system that's going to implement that API for reception? And what we've found in most of our customers, as we're dealing with this, is that they don't. They can go look at logs easily enough, but they can't go look at the telemetry, or they don't have a system to do that. So we could say, oh, go set that up, but that becomes another service, another thing that they have to manage. The other thing is, for tracing, a lot of times that's on demand to fix something, and, depending on how efficient the system is, most people don't recommend running tracing on all the time, right? It becomes a different kind of scoping issue, at least for us. So we haven't actively looked at dealing with telemetry. That leads to another aspect to consider, right? When you're talking about a twelve-factor application, in general that's a very cloud-native, single-focus thing. As we talk about automation and integration, one of the challenges that we have is that we're talking about certain automation tasks or automation steps; those are additively generated into the system, and those logs are meaningful in different ways at different points, and aren't necessarily driven into the overall telemetry system. So think of it as monitoring and doing telemetry on your automation system versus getting logging and information about the thing you're actually automating. Does that make sense?
I think automation is an interesting special case here. Sorry, go ahead, Watson.
Yeah, it's a special case, but you see this problem happen over and over. Take telecommunications, right? They have workloads and they have infrastructure, and the automation of the infrastructure is its own beast, and what have you. And it ends up being, you know, all bets are off when we're talking about that. But it seems like, yeah, if it's too much pain, you know, don't do it. Go ahead, yeah, go ahead.
No, I think the challenge that I see with automation is, historically, it's been really, really hard to log, to troubleshoot, to figure out what's going on. Performance metrics out of it have been really tricky to get, even though our experience has been that they're important, for a couple of reasons. And I think the challenge with automation is that you're transforming something, and so the state that you start in, and the background information you have about what you're doing, changes how you read the logs considerably, or changes what was done. It can be really tricky from an automation perspective compared to what I'm used to, building a log out of an application, or the telemetry of an application, where those actions are more like simple, transformative work. Automation, to me, seems like a different problem, which is sort of what I'm trying to capture. Yeah.
if we were to just some of the terms. So when we saying logging, we're saying metrics. So performance kind of goes under metrics. Then we have the tracing, talking about stacks and all this. If we say, let's just do performance, let's just do the metrics side and and so at that point, what do we run into? The automation where we're saying, Okay, it's too much of a problem to have some type of exporter to Prometheus back end that's too much of a hassle, or the audience doesn't want to install that, or something like that. Is that what we're saying here, or is it something else, or are we saying that metrics aren't when we say, you know, metrics are whatever you want to export. So
so we actually, we've had some interesting cases, and Klaus, I see your hand is raised, so let me answer, and then I'll yield. But where customers have asked for Prometheus metrics, which we're happy to provide out of the out of the system, but they want the Prometheus, the metrics, to indicate job performance. And then so you get into this interesting thing, as I'm running, you know, a piece of automation over and over again, but on different different equipment, and the variances in job performance for a certain job type are are have relevance, because that, you know, a slow job, if it's the same job running in a whole bunch of places, that same job might the variance might be very interesting or indicate a broader problem, but it could be different depending on the type of infrastructure or the operating system, and so the it's really hard to get a metric from this job took this long the way you would think of a normal metric to then do An analysis that would help you find something useful that ends up being for some of our customers, they'll toss that into elastic and then get a little bit more more metadata from it to be able to do some interesting analysis. But it's a hybrid,
Yeah, actually, you're right. Time metrics are kind of point-in-time, and not so much... I guess you could get how long a job has been running so far, but yeah.
That's what gets this sort of interesting. To interpret it, you have to go back and look at the context for what the metric means, but then potentially scan the log of that job to see if there's a retry loop or something like that, because that would actually indicate a health problem that you might be able to address.
It seems like if you did point-in-time, and you recorded how long this job has been running, and you have everything else in there, CPU and all these other things, that seems very useful to me as someone who cares about performance. Whereas when you get a message that says, yes, the job's been running this long, you don't really have a snapshot of all the rest of it at the same time.
You know, there's the dilemma.
Yeah, so there are a couple of topics I want to touch on, but let me try to put them in order. First, on the matter of metrics from jobs: I would argue that metrics for jobs are actually not useful, and that is because your jobs are not continuously running, and metrics, their primary demographic is continuously running processes or systems, things where you can have a time series with consistent data points at regular intervals. Jobs, on the other hand, are a great use case for traces, because they're essentially event based. You know when a job starts, you know how long it takes, and if it's part of a pipeline, it may have some downstream jobs that you can then put together into a whole tree of spans and say, okay, this pipeline took more than the usual amount of time in this particular job, so let's dig into that.

Tying this back to logs as well: the nice thing about the new generation of libraries, like the OpenTelemetry ones, is that because they're capable of emitting traces, metrics, and logs, they can also tag them accordingly so that you can cross-reference them. So, for example, say you have your application emit traces. And one thing that I should mention, going back to something Greg said about tracing not being on all the time: traces should be on all the time, but you shouldn't be saving all of the traces. In fact, in most cases your sampling rate should be something like 0.01%, and that should be sufficient. Most of the traces that your application is going to emit are just going to be noise. The way you should really do this is to filter traces down to what's interesting to you. For example, if you're interested in events where a job took longer than necessary, then you discard any traces, or really any spans, meaning the related parent or child spans, that don't contain a job that took longer than your threshold. Similarly, if that volume starts getting high for whatever reason, then you can do sampling. Typically you would do this as tail sampling: you process the entire span first, and then you look at it and say, okay, is this span interesting, yes or no? Once you're certain they're interesting, then you do your sampling and say, okay, let me just keep 10% of those, or whatever.

That can significantly reduce the volume of traces that you store, to the point where you can have a single system store traces for a global fleet of servers, because it's not a fire hose, it's just a trickle in the end. But it gives you enough information to say, well, I know that this particular job had a failure, this is its span ID, let me find the metrics and logs related to that span ID, and then you have your entire context there, so that you can effectively drill down into the root cause. And again, this is not something we haven't done before. It's just that before, we had to do it manually, by sifting through things and doing manual guessing and saying, okay, I had a spike in metrics at this particular time.
Let me find all of the logs during that time and then do a grep for things that are interesting to me. Nowadays, this new generation of tooling just automates that whole process for you.
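Klaus is describing tail sampling in the OpenTelemetry Collector, which is configured rather than coded; purely as an illustration of the decision logic, here is a simplified Go sketch (the Span struct, thresholds, and rates are made-up stand-ins, not Collector APIs): keep any trace containing an error or an unusually slow span, and keep only a trickle of the rest.

```go
package sampling

import (
	"math/rand"
	"time"
)

// Span is a simplified stand-in for a finished span as a tail sampler would
// see it (the real Collector operates on protobuf trace data, not this struct).
type Span struct {
	Name     string
	Duration time.Duration
	Err      bool
}

// KeepTrace mirrors the decision described above: look at the whole finished
// trace first (tail sampling), keep anything interesting, and keep only a
// small sample of the boring remainder.
func KeepTrace(spans []Span, slowThreshold time.Duration, boringRate float64) bool {
	for _, s := range spans {
		// Any error or unusually slow job makes the whole trace worth saving.
		if s.Err || s.Duration > slowThreshold {
			return true
		}
	}
	// Otherwise keep only a trickle, e.g. boringRate = 0.0001 (0.01%).
	return rand.Float64() < boringRate
}
```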
But would you say that the cross-referencing of tracing, metrics, and logging is a best practice?
Absolutely.
How do you, like... how do you know when to do the tracing? Because tracing is expensive, are you suggesting that you would turn it on or turn it off? Or...
The recommended practice, at least from the people I've talked with, is that you emit traces at all times, and then you put an OpenTelemetry Collector in front of it to filter the traces for you and decide which ones actually get saved. And because you have this collector essentially running either as a sidecar with your workload, or very close, like in the same local network, you do the filtering really early and mitigate the network or compute costs of handling a large volume of traces, because, again, it's really just local.
That's one of the things we allow people to do: you can raise the logging level of different services, right? Because for Digital Rebar, there's a whole bunch of different services that are combined together. And one of the things that was really effective was allowing you to dynamically set the log level of each service independently, so you could capture more or less detail depending on where things were or what you needed to do. It's still a problem of knowing when to turn on that tracing for something; those logs fill up really fast. And if you don't know how to duplicate the problem, or cause the event that you're trying to trace, you're going to be really unhappy. Or worse, you forget to turn it off,
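Greg's per-service, dynamically adjustable log level can be sketched generically with Go's log/slog package. This is not Digital Rebar's actual mechanism, just a hypothetical illustration where an admin endpoint flips a LevelVar at runtime, which also makes it easy to turn debug back off again.

```go
package main

import (
	"log/slog"
	"net/http"
	"os"
)

func main() {
	// One adjustable level per service, changeable while the process runs.
	var level slog.LevelVar
	level.Set(slog.LevelInfo)

	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: &level}))

	// A tiny admin endpoint (hypothetical) flips the level without a restart.
	http.HandleFunc("/loglevel", func(w http.ResponseWriter, r *http.Request) {
		switch r.URL.Query().Get("level") {
		case "debug":
			level.Set(slog.LevelDebug)
		case "info":
			level.Set(slog.LevelInfo)
		}
	})

	logger.Debug("dropped while the level is info")
	logger.Info("service started")

	http.ListenAndServe(":8080", nil)
}
```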
you're going to be really unhappy.
Yeah. Again, OpenTelemetry-based tracing doesn't mean that you have to increase your log volume. In fact, your log volume can stay the same. The only difference is that your logs are now going to carry an ID that your tracing UX can use to cross-reference the traces and metrics it's been receiving, so that you can have a single pane of glass where you can say, okay, I'm interested in this particular span because something unusual happened, and these are all of the logs and traces related to that span. Now, if at that point you decide you want to crank up your log levels, let's say from warning to debug, sure, you can do that, and then future spans will have all of those debug logs related to them, and you get a more detailed window into it. But in many cases, you might not need that anymore.

And I know it sounds like I'm selling snake oil here. Skepticism is healthy; in this field, particularly observability, there have been a lot of promises. But I have to say that I've seen firsthand how the availability of traces has reduced the MTTR on incidents significantly, because it cuts out all the time that we were spending manually correlating the data. The correlation is already there. So the only thing left to do, and I'm not going to minimize the work necessary for this, is making sure that your trace filters are letting the correct traces through, which means reducing noise and making sure that you have the correct types of traces for the events you care about. After that, you can turn your sampling rate down really, really low, and it's fine, because you don't need all of the traces; you only need representative traces. So, I mean, if you create a thousand events per minute that are each related to issues, but they're all caused by the same system, then you really only need one trace to tell you that this system is having a problem.
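The cross-referencing Klaus describes comes down to stamping each log record with the current trace and span IDs. A small Go sketch using the OpenTelemetry API and log/slog (field names like trace_id and the machine value are illustrative conventions, not a required schema):

```go
package main

import (
	"context"
	"log/slog"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// logWithSpan attaches the current trace and span IDs to a log record so a
// tracing UI (or even a grep) can cross-reference logs with their span.
func logWithSpan(ctx context.Context, logger *slog.Logger, msg string, args ...any) {
	sc := trace.SpanFromContext(ctx).SpanContext()
	if sc.IsValid() {
		args = append(args,
			"trace_id", sc.TraceID().String(),
			"span_id", sc.SpanID().String(),
		)
	}
	logger.InfoContext(ctx, msg, args...)
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	ctx, span := otel.Tracer("pipeline").Start(context.Background(), "image-deploy")
	defer span.End()

	// Same log volume as before; the record just gains two correlation IDs.
	logWithSpan(ctx, logger, "starting image deploy", "machine", "m-042") // hypothetical machine ID
}
```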
No, you're right, yeah. Well, I think the aspect of that is acknowledging that tracing gets used in multiple senses. In that regard, you're using tracing to find issues with a group of actions, right? And your sampling is saying, of a thousand actions, they're all behaving the same, so if I catch one trace from that group, I'll have the information to drive whatever correction I need. And that's awesome and quite useful. I think a lot of times, especially among our customers, they're using and needing tracing at more of a functional level, not an operational performance level. In that regard, the failure is a one-off event, or a set of one-off events, that needs better tracing, and it's really hard to make those filters handle that, right? So it's just a different aspect of what level and what your expectation is around the tracing. Because I think all the stuff you've mentioned, Klaus, is very powerful and quite useful, especially in tuning and driving high-performance systems where the event processing is the action. Whereas for our system, we'll be provisioning a machine, and that will generate 100,000 events that are all different. Tracing those, they have all sorts of different paths, and the failure in one of them is the actual thing they want to look at. So it's no longer a thousand common events; it's a thousand independent events that are linked together. And so for most of our customers, tracing, from their perspective, represents a more complex, domain-specific, in some regards, use of an event stream to represent their event models.
Right. So this collection of related events, in OpenTelemetry terminology, that's called a trace, and a trace is a collection of spans from potentially decoupled systems. Essentially, you have your root span (and maybe I have the terminology slightly off), the root span that says, okay, this is your starting event, which you would know because it likely would come from RackN saying, okay, I'm starting to provision this. And then each additional step, or each other job in your pipeline, would emit its own spans. When you put them all together, you have a tree, essentially the equivalent of what you see if you open your browser and look at the debug panel's Networking tab; that's the same kind of composition of spans you would see with OpenTelemetry. And yes, it can be thousands of events, but they're all linked to this one root span. So what OpenTelemetry does with its filtering and sampling capabilities is correlate all of that information. Let's say you have a leaf node that had some kind of exceptional event. Because it is correlated to all of your other spans, your collector can say, okay, this leaf event is something that did not get filtered out; therefore all of the other spans in the trace, all the way up to your root and down the tree to everything else, because they share the same trace ID, are all going to be saved. That is what goes into your back end, and that's how you review those traces. It took me a very long time to wrap my head around this, and I completely understand why your customers will likely be tempted to say, okay, let's open the fire hose and save everything. But the good thing is that it's not really needed.
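A minimal Go sketch of the root-span-plus-children shape Klaus describes, with each pipeline step as a child span (the step names are invented for illustration); passing the context is what links the tree together.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

// provisionNode builds the tree described above: one root span for the
// provision, and each pipeline step becomes a child span the backend can
// reassemble, much like a browser's network waterfall.
func provisionNode(ctx context.Context) {
	tracer := otel.Tracer("pipeline")

	ctx, root := tracer.Start(ctx, "provision-node")
	defer root.End()

	for _, step := range []string{"inventory", "partition-disks", "image-deploy", "configure-network"} {
		// Passing ctx links every child span to the root automatically.
		_, child := tracer.Start(ctx, step)
		// ... do the actual work for this step ...
		child.End()
	}
}

func main() {
	provisionNode(context.Background())
}
```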
Oh, I see. This is where I think I would disagree, but they have a different path for that today, in the sense that we have customers who have used the event stream, which is effectively a superset of the telemetry stuff, and have used Elasticsearch to do things like take the provisioning data of certain tasks, as they fail at certain times, see that as a trend across, say, twelve systems, and go, oh, that indicates the top-of-rack switch doesn't have jumbo frames turned on, right? Or to go, this machine isn't failing at all, but the event for image deploy went from two minutes to 26, and it's just this one machine, and they look, and the drive was beginning to fail. So the interesting thing is, the data is actually very useful in total, especially if you can do correlative analysis on it. And that's why I almost say, yeah, you do want to open the fire hose and catch it, right?
Actually, the OpenTelemetry Collector has a very clever solution to this, particularly when you use tail sampling, and that is that you can have the collector produce metrics from your spans. Because really, the scenario you're describing is: okay, there's now a spike in exceptional spans from this particular system because the drive is failing. So you don't need to look at each individual trace or span to get the big picture. You just need the aggregation, to be able to say, hey, there's an extraordinary number of these traces coming through, or the sum of the time it takes to write has suddenly spiked. And again, this is possible with the OpenTelemetry architecture.
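The span-to-metrics idea can be imitated by hand to see the principle: aggregate finished spans into a histogram so the spike stays visible in the aggregate even when most raw traces are sampled away. This Go sketch is an illustration of the concept, not the Collector feature itself (metric names and fields are made up).

```go
package spanmetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// spanSeconds aggregates every finished span into a histogram; individual
// traces can then be discarded while the drive-failure spike described above
// still shows up in the aggregate.
var spanSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "span_duration_seconds",
		Help: "Duration of finished spans by name and status.",
	},
	[]string{"span", "status"},
)

func init() {
	prometheus.MustRegister(spanSeconds)
}

// RecordSpan is called once per finished span (signature is illustrative).
func RecordSpan(name string, d time.Duration, failed bool) {
	status := "ok"
	if failed {
		status = "error"
	}
	spanSeconds.WithLabelValues(name, status).Observe(d.Seconds())
}
```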
Yeah, we'll need to look into that, because it's an interesting path, or an additional path, right? Right now we offer that as post-processing on an Elasticsearch kind of methodology, right?
Yeah. Again, I have Elasticsearch in my background; I used it heavily for these kinds of things, and I completely understand why it's very tempting to stick with Elasticsearch. But nowadays I consider Elasticsearch to be a very ham-fisted solution to this, because it definitely works, but it is a much more brute-force approach than it needs to be.
That's interesting, because I would classify it as a more AI-friendly approach, because you don't know what you don't know. By gathering the events and storing them and correlating them, you now have a set of data that you can potentially drive, especially with some of the machine learning techniques, to catch variances that your expert knowledge might not cover yet. But it's a trade-off, right? Are you building the telemetry rules based upon your expertise, which will trigger faster, or are you going to do post-processing to do discovery and such? And given the AI buzzword of the world right now, you could see both kinds of paths being quite useful.
I mean, I've used the machine learning capabilities in Elasticsearch, and I don't know how much they've changed over the past five years or so since I last used them. But one of the things that was always an issue with Elasticsearch's machine learning (and I mean just the machine learning, not the AI-based discovery that is perhaps available these days; again, I haven't used Elasticsearch in five years) is that for machine learning in particular, you still need to know which particular metrics or events, or I guess they call them indexes, which indexes you want to use, and you need to know what aggregation you want to use, like mean, average, sum, whatever, and then you need to run that aggregation. Which is not terrible, but it's also not great. It certainly works when you don't know what your data has, and it may well be that Elasticsearch is a good starting point. But once you know what your data has, the number of options you end up choosing, let's say aggregation sums or kinds of trend prediction, collapses into a very, very small set. The Pareto principle applies: 80 percent of the metrics you're collecting, you're hardly ever going to look at most of the time.
No, no, I agree completely. I think what you commented there is very true: use something like that until you have a clue, and then it's a much smaller footprint in your OpenTelemetry setup to get more specific triggers and faster responses on subsets, yeah.
And honestly, going back to traces and logs, the same approach should be taken. We haven't done it, and when I say we, I mean the community in general typically doesn't do it, because we are lazy. Once we start collecting things, we start hoarding things. But for metrics and logs as well, I would certainly recommend doing a yearly review of what you're storing and then deciding: do we really need this? You can cut your observability costs, which really should be around 25 percent of your entire cost, significantly.
There's an interesting balance that y'all are talking around here, which is deliberately adding things in, like, these are things I need to log, these are things I need to trace, intentionally adding that into the system because these are places where you might need to do that diagnostic, versus dumping a whole bunch of information, because you don't know what you don't know, feeding it through a system, and then doing analysis as the alternative. And so I think there's always this yin and yang of going back through and adding in the performance counters and the metrics and the traces and things like that that you need, so that you can quickly diagnose stuff. I don't think that gets you out of having to have the skills or the capability to do an ad hoc analysis too.
Well, and I think this is one of the topic sections I didn't get to in some of our log discussions: what's the purpose of the log, right? The challenge, real quick, that we've had to deal with is that a log sometimes is a comfort blanket.
There's the quote for the day.
And so, right, in general, on a thing that's successful, I don't want to log, right? I might want a metric that says, yep, that's good, but in general I don't care. I may care about how long it took, but I don't care that it was successful. But if I don't print anything out, we have users who will go, it didn't print anything out. And we're like, yeah, it was successful. And they'll be like, yeah, but what did it do, and why was it successful? And we're like, wait, why do you care? But there's a comfort blanket for an operator, in some regards, for that audience, to go, oh yeah, it's good, that's great, right? And this is where I think a lot of times we as a community, especially in the automation area, deal with the problem of developing versus operating versus debugging or dealing with failure, and aligning those in a log environment is kind of a challenge. Because you're like, well, I don't want to log anything on success, right? But the operator is going to occasionally spot-check these and go, oh wait, it didn't print anything, what did it do, did it do the right thing, right? And it's maybe an observation of a difference between a developer developing a twelve-factor, cloud-ready app versus I'm building something, I'm automating something, I'm driving this, and having different expectations about what they're seeing and what they're logging and auditing, right? So,
in a sense...
That's where I was thinking too. Greg, thank you for being specific on it.
Yeah, in a sense, I don't disagree with what Watson said about logs essentially being the things that you can't put into metrics or traces. I would, however, also add that the purpose of logs is to provide contextual information, and this applies to logs for error handling, logs for observability, as well as raw logs for auditing purposes. This is essentially the context that you're writing down on paper. In some cases, for example when you're auditing, your context is more like a journal of what your workload did. And Greg, when you were talking about the customers who were saying, well, it didn't print anything: they're using it as a journal. That doesn't mean it's the most efficient way to do it, because in those cases, to know what the application did, you could have done it with a trace. But I understand that logs are the lowest common denominator, and that, lacking anything else, the logs are still going to be there, so it's very tempting to fall back to logs. And when you don't need more than logs, because your volume is really small, it might just be fine. On that aspect, I am less prescriptive than perhaps others. If you're only dealing with, let's say, five transactions an hour, maybe your logs really are just fine; you don't need additional infrastructure if you can just read the logs with your own eyes and know what happened without any extra effort. Once you start scaling out, that's when things like OpenTelemetry really shine.
Yeah, I was going to add in some differences. Metrics are, in a sense, almost free, because they're pull based; well, the best practice is the pull, from the metrics server, so Prometheus pulls from whatever service, right, and not so much push. So it's lighter on the network, performance-wise, so it's better. Pulling all of that out of your log is just going to be better for performance, right? The other ones are more push; they're driven by events and stuff, so, like with traces and things, they're not free, or not as free, or what have you. So that leaves the question, and we're talking about automation, and you're giving examples of top-of-rack switches and things like that: are you talking about a thousand switches having this communication with a server, or just a few? As was said before, it's really about how many events you're talking about, how much scale-out. Because if you're not scaling out, then, yeah, put it all in one big box, one big log or something.
I wouldn't really characterize metrics as free. But then again, I've also dealt with metrics at a scale where things that were assumed to be free, or essentially no cost, end up compounding. I'd say that the biggest pitfall one can make with metrics is to try to put events into metrics. You don't really want IDs, or really any ephemeral information, in your metrics.
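Klaus's warning about IDs in metrics is a cardinality issue. A short, hedged Go sketch of the distinction (metric and label names are invented): bounded dimensions go in labels, while per-machine or per-job IDs belong in the log or span instead.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Labels should hold bounded, reusable dimensions, never per-machine or
// per-job IDs, because every unique label value creates a new time series.
var jobSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "job_duration_seconds",
		Help: "Job duration by job type and hardware class.",
	},
	// Good: a handful of values each, so a handful of time series.
	[]string{"job_type", "hardware_class"},
	// Bad (don't do this): []string{"job_id", "machine_id"}; put those IDs
	// in the log record or the span attributes instead.
)

func init() {
	prometheus.MustRegister(jobSeconds)
}
```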
Oh boy, we talked about events, and I'm going to put the eventing topic back on the calendar, but we need to wrap up, because we're way over. This was fascinating, because I think we talked directly into the tension here. I'm capturing audit, creating auditable systems, because that's clearly one of the next conversations out of this: the tension between logging and audit. I thought that was really fascinating. And then I think there's a question about events, handling events and eventing, that we will come back to. Just from a calendar perspective, next week I have SSH, which sounds like a really small topic, but I think there's a lot there, so we'll see. And then, because of travel and vacation schedules, I have us off from June 18 to July 23, so it ends up, I think, being a six-week break; I'm just trying to give you some forward-looking radar. Eventing and audit would be after that; it's updated in the doc. Thank you. I knew this would be a fascinating topic, and there were a lot of interesting details that came out and that we'll come back to. So thank you all, talk to you soon. Bye.

Wow, what a fun conversation. At the start of it, you think logging is going to be really boring, but it is so foundational to everything else we build, and it is so difficult to build the information so that people have what they need, when they need it, for all of the different audiences. That's really one of the challenges here, something we didn't dig into as much as I would have liked. But we're definitely going to come back to things like audit and things like events; there are a lot of loose ends for us to pull back in. If you like this type of content, and I hope you do, we're working to produce all of the Tech Ops podcasts as a series of blog posts so that we can use them as education materials. But they work because we get people to come in and ask questions and be part of the discussion. I hope you will choose to do that. You can find our schedule and the sequence of things we're talking about at the 2030 cloud. Thanks, and I'll see you there.

Thank you for listening to the Cloud 2030 podcast. It is sponsored by RackN, where we are really working to build a community of people who are using and thinking about infrastructure differently, because that's what RackN does. We write software that helps put operators back in control of distributed infrastructure, really thinking about how things should be run, and building software that makes that possible. If this is interesting to you, please try out the software. We would love to get your opinion and hear how you think this could transform infrastructure more broadly, or just keep enjoying the podcast, coming to the discussions, and laying out your thoughts on how you see the future unfolding. It's all part of building a better infrastructure operations community. Thank you.