Hello, I'm Rob Hirschfeld, CEO and co-founder of RackN and your host for the Cloud 2030 podcast. In this episode, we discuss the challenge of defining infrastructure to support inferencing and training. We really talk through what it's going to take to understand what to buy, what to build, how to put it together, and how hard it is to actually know what goes into the infrastructure behind an AI cluster, and why it's so difficult. I wish we had more answers for you in this podcast, but understanding why we don't have the answers is the first phase of understanding. So I know you will enjoy the conversation.
The topic for the day is different from what we've been discussing. It's me trying to bring back this idea of the ops discussion for AI, which is something I've been trying, very frustratingly, to have across the industry: I talk to people who are doing AI work, and then ask them how they're actually building the inference infrastructure behind it, or the training infrastructure behind it. And I universally get a shrug of, you know, yeah, how they're winging it.
Yeah, Rob, that's because you're, you're a minimum of five years ahead of the curve on that.
I mean, people are building it. I mean, yeah, I think people are building training.
And I don't mean in terms of thought leadership, like folks like us. I'm talking about, like, actual folks that are going to put something in the field.
You know, we're talking to enterprises who are actively — yeah — buying infrastructure that they know is going to be there. Inferencing and/or training infrastructure?
Oh, yeah. We're talking about something different. What I was thinking about was, like, using AI to manage that
infrastructure. Oh, no. Yeah, that's — I'm not even sure. I'm not even sure. That's a whole —
that journey has begun. I mean, if you watch what happened, the most interesting story that I've seen in operations in AI is devin.ai, and what Devin provoked. I mean, they poked a bear. I don't know if you saw what they did: Devin can do this, it can pass SWE-bench, it can do all these things. And within a day, there were like two or three competitors. Now, a month later, there are like ten competitors. There are Discord groups with thousands of people trying to come up with solutions. So the beauty is, there was a lot of latent technology that people have been building for the last couple of years that they're rolling up their sleeves and throwing at this, right? There are a couple of tools now that can claim to do as well — like 13, 14%? I mean, that's getting close to humans, right? Or maybe better. You know, I'm not going to say that we don't need — there are some people who say DevOps is dead, we don't need any sysadmins — but, like, God bless, God bless. The point is, the augmentation that is going on right now is just mind-blowing — what people are able to do with this.
I think what you're referring to, Rob, is, you know: if we're getting into industrial-class production for machine learning and AI — we have gen AI — what are people doing? Your question is, what are you doing to think it through? There needs
to be a reference architecture. There needs to be a standard, right? Like, all of these things have high east-west traffic; they use InfiniBand. I was actually looking up even the Ethernet stuff, which is called RoCE — RDMA over Converged Ethernet — so it's essentially InfiniBand-style RDMA over Ethernet. And, you know: how many GPUs do you stuff in a box? If you don't use NVIDIA, what should you be using? How much RAM do you need? What's the network topology? Like, all of this operational stuff.
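To make those open questions concrete, here is a minimal sketch in Python — with entirely hypothetical names and illustrative values, not vendor recommendations — of the parameters a GPU-cluster reference architecture would have to pin down:

```python
# A sketch of the decisions a reference architecture would standardize.
# Every field mirrors an open question from the discussion above.
from dataclasses import dataclass

@dataclass
class GPUNodeSpec:
    gpus_per_node: int    # how many GPUs to stuff in a box
    gpu_model: str        # NVIDIA or an alternative
    system_ram_gb: int    # how much RAM per node
    nics_per_node: int    # east-west bandwidth scales with NIC count

@dataclass
class FabricSpec:
    transport: str        # "InfiniBand" or "RoCEv2"
    topology: str         # e.g. "fat-tree", "rail-optimized"
    link_gbps: int

@dataclass
class ReferenceArchitecture:
    node: GPUNodeSpec
    fabric: FabricSpec
    node_count: int

# Illustrative instantiation only -- not a recommendation:
ra = ReferenceArchitecture(
    node=GPUNodeSpec(gpus_per_node=8, gpu_model="H100",
                     system_ram_gb=2048, nics_per_node=8),
    fabric=FabricSpec(transport="RoCEv2", topology="rail-optimized",
                      link_gbps=400),
    node_count=128,
)
print(ra)
```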
Hey, Rob — no, go ahead. Sorry.
Go ahead.
Well, the question is, where do you think the smartest source of requirements for that is likely to come from? That is, do you start with, for example, the management and governance of the data sources — the datasets that feed all of this? You know, that's a personal point of view. But without starting there — I was going to say — again,
do we even have to figure that out? Because you think about, like, unsupervised learning, where models are creating their own categorization — which is another way of saying constraints, or groupings, or requirements. Why not use AI to actually define that governance?
You know, there's nothing wrong with using AI to define the governance. I think we're talking at different levels, actually, Rob.
No — I mean the actual build.
Yeah. And when you say that, what are we talking about? Give me the top three or four — the top handful of questions that you want addressed. That'll point us to the part of what you're calling the infrastructure, because I can get lost in the layers here.
Well, specifically, what we're looking for is, you know, one or more — say three — reference architectures, so that people can know what to buy and how to wire the gear together in a repeatable way.
Hey, Rob, yeah? Have you seen Hedgehog? Have you talked to those guys? You should really — yeah, I need to hook you up with those guys. Mike Dvorkin is one of their scientists. He did ACI, right, for Cisco. Okay.
So, Rob, let me ask you a question. So RackN — you guys make heavy use of IaC, Infrastructure as Code. Sure. So if you're able to define the state of a complex system — i.e., infrastructure — with Infrastructure as Code, do you even need a reference architecture in the age of AI?
Do you have that? Totally.
Yeah, I mean, this is one of my points about what we do. And there's a weird aspect of what AI is good for and how you want to use it. There are times when, like with an LLM, you want creativity or synthesis. And there are times — for what we're describing on infrastructure — when what you want is reliable, well-proven patterns that work very generically. It's like good libraries: it's standard, right? You know, a lot of Infrastructure as Code is
more like the protein-folding machine learning, as opposed to large language models.
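As a rough illustration of that Infrastructure-as-Code point — declare the desired state, then converge with proven, repeatable steps rather than creative synthesis — here is a minimal sketch. It is not RackN's actual API; every name and value is hypothetical:

```python
# Declarative desired state for a node, expressed as plain data.
desired_state = {
    "bios": {"boot_mode": "uefi", "sriov": "enabled"},
    "raid": {"level": "raid10"},
    "gpu_firmware": "95.02.66",   # hypothetical version string
}

def converge(node_state: dict, desired: dict) -> list[str]:
    """Return the actions needed to move a node to the desired state."""
    actions = []
    for key, want in desired.items():
        if node_state.get(key) != want:
            actions.append(f"set {key} -> {want}")
    return actions

# A node that drifted from the declared state yields a deterministic plan:
print(converge({"bios": {"boot_mode": "bios"}}, desired_state))
```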
Yeah — I'll tell you about the value of iterating over an unlimited space, because I've had this conversation for years now: the idea that I'm going to improve my RAID and BIOS configuration by doing annealing over all the BIOS configuration options. The challenge with doing that is that you end up having to actually do an in-field test. So you can do an iterative piece, but then you have to run the workload, test it, and do all those pieces. And most of that stuff is not amenable to flipping bits to try and come up with a better pattern.
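A sketch of why that annealing loop is so expensive in practice: every candidate configuration implies a reboot, an applied config, and a real workload run, so the search is bounded by in-field test time, not by the optimizer. All option names here are hypothetical:

```python
# Exhaustive search over BIOS options; the benchmark call is the bottleneck.
import itertools
import random

BIOS_OPTIONS = {
    "numa_per_socket": ["nps1", "nps4"],
    "smt": ["on", "off"],
    "power_profile": ["performance", "efficiency"],
}

def run_workload_benchmark(config: dict) -> float:
    # Stand-in for the expensive part: reboot, apply config, run the real
    # workload, collect a score. Hours per call in reality; random here.
    return random.random()

best_score, best_config = -1.0, None
for values in itertools.product(*BIOS_OPTIONS.values()):
    config = dict(zip(BIOS_OPTIONS.keys(), values))
    score = run_workload_benchmark(config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)
```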
With a digital twin? I want to hear what Joanne says about that.
Yeah — my camera's my problem. I was saying something, and then — no, I was just going to say that there are two points, right? There's inference and training, and they're different. Most people conflate those two things. Most companies are probably not going to have to train their own large language models; most organizations will tune. Inference is still going to be a cost, and I think you're spot-on, Rob. Again, I've been working with these Hedgehog guys. They have this, you know, closed infrastructure based on GPUs. They're sort of going into this world, mostly around network config — sort of SDN stuff. But the thing is, I think you're right: we need an architecture. I'm writing an article right now for Techstrong, based on a conversation I had with Tim Crawford about all the stuff you're talking about — how electrical engineers are now the most interesting people, EE degrees, right? Because they can't get enough of them. Things we haven't talked about in, like, ten years are all coming back, because we've got this GPU starvation. So I think you're spot-on: GPU starvation is going to be one of the biggest problems for large corporations to solve. The Googles, the Microsofts — they're way ahead of the curve here. But anyway, so I think — and
I do — I do think that the enterprises are going to be looking at model refinement, and doing that on their own systems with their own data. And I think — we'll see — building a big inference farm is a very real use case. And being able to have a dynamic inference farm is going to be a very real use case, right? Because if I've got more and more AI-backed applications, meaning they have to do ad hoc inferencing as part of their workflow, then I don't want to build an inference system for each application. I actually want a pool — this is classic IT — a pool of inference capability that I can dynamically share across my different workloads. So that means these enterprises are going to be looking at how to build an AI inference cluster. But — and they're seeing it this way — that same capability, in off hours, can be used to do model tuning and refinement, to share infrastructure.
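A minimal sketch of that pooling idea — one shared set of GPU nodes serving ad hoc inference during the day and reassigned to model tuning off-hours. The node names and the time-of-day policy are stand-ins, not a recommendation:

```python
# Split a shared GPU pool between inference and tuning by time of day.
from datetime import datetime

POOL = [f"gpu-node-{i:02d}" for i in range(8)]

def assign_pool(now: datetime) -> dict:
    """Return which nodes serve inference vs. tuning right now."""
    off_hours = now.hour < 6 or now.hour >= 20
    # Keep a small inference floor at night; most nodes go to tuning.
    split = 2 if off_hours else 6
    return {"inference": POOL[:split], "tuning": POOL[split:]}

print(assign_pool(datetime.now()))
```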
Yeah — so here's the other part of that, though, that I haven't heard mentioned: memory. You can have all the GPUs in the world, but it doesn't matter if you don't have ample memory that you can equally software-define. And I see the architecture that you're talking about, in my little pea brain here, as IaC that is trained by machine learning on the optimization strategies you would need, based on what kind of AI it is — generative, regenerative, or an inference engine. And I think you're going to have to look at how you would build out that stack, because memory, heat, all of those wonderful things — it's not going to be a cloud cluster, I can put it that way. It's definitely going to be a distributed architecture, with a lot of edge. And this is what's actually going to promote edge more than anything else, because you have to start doing separation between storage and compute, and subsectioning compute to memory as well. You're looking at a quagmire.
The thing that we've seen — and why I go back to reference architectures — is that the people I talk to don't have opinions, or they just don't know enough. In some cases, they don't know the workloads well enough to know what they should put together to build their system. So they're struggling with that. They do know there's a different network topology — a very high-performance network topology — which is hard and confusing for them. They know they have to manage GPUs and GPU firmware updates and patching, right? There are a lot of operational concerns. Once we've identified the thing, we could do any of it. But mostly, right now, it's being built by hand. As far as we can tell, people are still figuring that out — whether there's a pattern that says: buy this type of gear, buy these things, then wire it all together and run the application. I mean, right now they're using, of all things, Kubernetes as the control plane for this stuff, which doesn't make that much sense to me, but that's what we see every single time we turn around — which breaks me.
That's where I believe it's going to be more useful now. Okay —
so I look at it from this point of view, if I can offer this, and then I'll talk to digital twins for a second. First of all, yes, a digital twin would be the best way to do this and to use it. But from the reference architecture point of view, I will try to dig out a link that I have. It was either in W3C or IEEE — and I'm pretty sure it was IEEE — there was a paper written a while ago on distributed computing architectures specifically for AI. It talked in great depth about the ops side. Yeah.
Yeah, there's going to be a lot of work in this space. I mean, we're only, what, six years in, really? Since GPT-1.
You know — yeah. And I did see a fascinating post — not academic, but on the Fediverse — about Microsoft having to do a stretched, geographically distributed interconnect between their training clusters for GPT-5. And the conversation was: why are you doing a geographically distributed cluster for training? Have you heard this? And the answer was: if we put it in one data center, we blow the grid. So we have to distribute the cluster geographically.
Okay, this is going to sound kind of geeky, but what we need to do is extend the cache-coherent non-uniform memory access model that I learned back in the supercomputer Unix days. CC-NUMA is really what we need to be thinking about for how we manage and optimize these flows of information between infrastructure nodes and the network.
Agreed. But the challenge there is that they're moving into physical requirements in which the speed of light becomes a factor. And for what you're describing right now, the assumption inside a cluster is that the speed of light is not a factor. No, that's
not true at all. CC-NUMA — I mean, when we were building Superdome servers 20 years ago, CC-NUMA took the speed-of-light limitations into account, because we were the first implementation of InfiniBand, where we moved from PCI to InfiniBand for all of our connections within the chipset, right. And the way that we implemented CC-NUMA was by using what we would today call metadata, or attributes about the system, in a tag directory. We'd have a directory that would contain all of the cache states of all the cache lines in the system — which processors held which cache lines, what had to be updated, all of those kinds of things. It was taking into account the latency within the system, because the whole system optimizes for minimizing latency, given the bandwidth and latency limitations of each connection.
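A toy illustration of that tag-directory mechanism — a directory tracking, per cache line, which processors hold it and in what state, so the system can invalidate or forward with minimal traffic. This is a simplified MSI-style model, not the actual Superdome implementation:

```python
# Directory-based cache coherence, radically simplified.
# States: "M" (modified), "S" (shared), "I" (invalid).
directory = {}   # cache_line_address -> {"state": str, "sharers": set[int]}

def read(line: int, cpu: int) -> None:
    entry = directory.setdefault(line, {"state": "I", "sharers": set()})
    if entry["state"] == "M":
        # Another CPU holds it modified: it must write back before sharing.
        entry["state"] = "S"
    elif entry["state"] == "I":
        entry["state"] = "S"
    entry["sharers"].add(cpu)

def write(line: int, cpu: int) -> None:
    entry = directory.setdefault(line, {"state": "I", "sharers": set()})
    # Invalidate every other sharer, then take exclusive ownership.
    entry["sharers"] = {cpu}
    entry["state"] = "M"

read(0x1000, cpu=0)
write(0x1000, cpu=1)
print(directory[0x1000])   # {'state': 'M', 'sharers': {1}}
```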
The systems I've seen like that typically assume that the latency is introduced by routing issues at the top of rack, not by having to send signals across geographic distances inside a cluster. No —
yeah, we were — we were designing this stuff at the nanosecond
timescale. That's what I'm trying to say. Yeah. So — but, you know, I think these are fascinating concerns. Most enterprises don't need a thousand-plus nodes, right? They can stay within a spanning tree — InfiniBand for these clusters. I mean, you will get into edge, to Joanne's point — edge components, where you have a lot of individual sites that you want on a reference architecture, because you don't want variation across a whole bunch of edge sites. My concern right now is that people don't know what to order, don't know what to buy. When they get it, they don't know how to put it together. They're still experimenting with that. So you're buying very expensive infrastructure while you're still in an exploratory phase, which means you're churning through a whole bunch of variation. When you're doing hand-built stuff, you don't have a reliable, consistent architecture, so you end up troubleshooting what's happening. And these are not small deployments — they're not two systems at a time; they're 100-node, 200-node clusters, which are still materially large clusters, at least historically. And so the whole operational pattern of getting them up to date, managing them, configuring them correctly, rolling out an application on them — I would love to see a paper where those things were considered. And I don't have the expertise to define what makes sense from an AI user's perspective. Well, here —
here's the question, here's the question: are we going to see a return to scale-up versus scale-out?
I mean, we definitely have. For some inferencing systems, you can scale up within a machine to do an inference. But
it's much easier to scale out.
Well, yeah, absolutely. But if you scale up, you can take millisecond latencies and turn them into nanosecond latencies.
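The rough arithmetic behind that claim, using order-of-magnitude placeholder figures rather than measured numbers:

```python
# In-box coherent hops are measured in hundreds of nanoseconds; a routed
# network round trip plus software stack runs microseconds to milliseconds.
# Both figures below are illustrative orders of magnitude, not benchmarks.
in_box_hop_ns = 500            # ~hundreds of ns for a coherent in-box hop
network_hop_ns = 1_000_000     # ~1 ms for a routed round trip with software

print(f"scale-up advantage: ~{network_hop_ns / in_box_hop_ns:,.0f}x "
      f"lower latency per hop")
```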
True. But the only —
the physical constraints of the system are problematic.
I'm not saying I agree, but that's where I think it will go in some edge use cases — not IoT edge, just corner cases in the market.
I disagree, for only one reason: you can buy the components for a high-performance computer as an edge server. And that may get you over the hurdles. In terms of HPC, I'm talking about, like, genomics, right — used in medical, used in pharma, used in whatever. You can buy those components and build your servers, and then your cluster doesn't necessarily have to be proximate; it could be a cluster of edges. Or —
Okay, so now you're basically saying there are contexts — there are preconceived kinds of approaches to the problems — that would actually recommend different reference architectures, as Rob is making use of the term. Yes. Right. Clearly, clearly — except the other thing is, we're still dealing with a couple of moving targets. You mentioned memory, for example. I'm going to throw out the whole issue of — we may be in danger of settling on a particular architecture, the transformer architecture, as the basis for a lot of AI, when in fact that may not be the way we utilize gen AI. That may not be the basis for gen AI for all that much longer. We're seeing things like —
it seems analogous, doesn't it? Exactly. And
it's like saying every computer is going to be a von Neumann architecture, and suddenly somebody walks in and says: no longer. So we're still, you know, very early on in the whole notion of what these engines are, what they're going to look like, and what their requirements are going to be for memory, for latency, in the data center, in the clusters. I think we're going to find ourselves building custom architectures for classes of problems. So perhaps, Rob, the right thing to do is to first start out with a good way of classifying the problem — to set the context — and then ask the question: given that, what are my best practices? What are my best ways of going about designing and building out the infrastructure for that class of problem? Yeah.
Look — on the edge, we're going to see a rise in analog implementations of neural networks using circuits. Non-digital, right? MIT is basically working on ASICs for that right now.
And it might also be similar to the progression of CPU architecture back in the '90s or early aughts, where you went from just plainly throwing as much power at it as you could, to predictable branching — branch prediction — and so on. And then you have the opposite approach, like what Microsoft is doing with their Stargate data center plan, where they just throw as much power at it as they can — like five gigawatts for a single data center.
I equate Stargate to the Reagan-era kind of defense approach that says: we are going to outspend everybody else as a way of putting ourselves ahead. It is absolutely a brutalist approach. And
it doesn't work in the long term, either. You know, we've got —
it's Star Wars — Star Wars once again, you know,
and power for time machines — but
when you can take out a $10 million satellite with a $3 laser burst, that economics doesn't work.
right. Yeah, exactly. Yeah. Well,
just a thought,
I've got to run, but just a thought about the analog computing thing. One of the challenges around analog control systems is the amount of human engineering brainpower that it takes to design those systems. Once we're able to design analog control systems using AI, that whole space is going to explode. Oh —
it's an interesting question. I'm going to put that on the May calendar. Does that —
does that imply, basically, kind of the return of ASICs? Yeah, because Klaus was just saying — I
know — beyond that, analog ASICs.
Yeah, exactly. But I mean, what's next? Yeah. Oh, man. Alright, so —
May 16th: I'm putting analog computing systems on the list, and — does AI make this more possible? Interesting. Yeah.
There's a big tie-in with quantum there, too. Yeah.
And I would also add there's a big tie-in to parallelism, and the return of the mainframe.
And if we want to go in the sci-fi direction: singularity. Like, is there a point — are we at the threshold of it — where AI-designed hardware is going to go beyond human comprehension?
Right. All right. You guys are giving me flashbacks, since my first real job was on an analog-digital hybrid — an analog computer hooked up to one of the first IBM minicomputers.
Okay, Rich, you're making me feel old now.
I am. I am an old guy. Yeah, yeah.
Well, so am I. But, you know, in my days of semaphore tuning and construction, I was in my teens and twenties. And I remember green screens of death — or black screens of death with green code. I used to go bloody colorblind. Yeah.
Okay, guys, I like your idea of time on the calendar.
Okay, well — cool idea.
Have a good day, all. Well, as an operations person, I really want some simple answers. I want to know what to buy, how to wire it, what the challenges are, and then be able to help people do that work. That's the core of what I do with RackN. And it is a frustrating exercise to take people who have answers — or often have answers — and have the conversation like I've been having across the industry, where we don't have all the information yet, where people are still learning and trying to figure things out, and yet wanting to be part of that iteration and exploration. A fascinating conversation for me, and I'm really looking forward to where we go next with some of the topics that we brought up for future discussion, which are on our calendar. If you want to be part of those conversations, please join us at the2030.cloud. Always informative and interesting, and the more people, the better the conversation. So I'll see you there. Thank you for listening to the Cloud 2030 podcast. It is sponsored by RackN, where we are really working to build a community of people who are using and thinking about infrastructure differently, because that's what RackN does. We write software that helps put operators back in control of distributed infrastructure, really thinking about how things should be run, and building software that makes that possible. If this is interesting to you, please try out the software. We would love to get your opinion and hear how you think this could transform infrastructure more broadly. Or just keep enjoying the podcast, coming to the discussions, and laying out your thoughts on how you see the future unfolding. It's all part of building a better infrastructure operations community. Thank you.