Paul's Meeting Notes

4:07PM Feb 27, 2025

Speakers:

Alan

John Lockwood

Sahas Munamala

Daniel May

David Lariviere

Med Belhadj

Hesham

Edmund Humenberger

Kevin Cameron

Keywords:

FPGA simulation

Ethernet network

chip communication

AXI bus

chiplet co-simulation

EDA tools

semantic communication

packet loss

queuing models

network on chip

RISC-V cores

memory hierarchy

AI workloads

floating point precision

high performance compute.

There are different compute use cases of simulation, but I'm an expert in none of those. So my intention was to at least document what the state of the art is and where open Atomic Ethernet could really help to heal the pain a little bit. As I understand it, no one wants to come forward and define himself as the expert, so I think we will be stuck for a longer time on the yellow area, and therefore let's go to general discussion.

So

I would say, myself, I'm sort of deep in how do we implement hardware things, and my interest in life is simulation. And in simulation, I'm looking at something where I've got an awful lot of communication channels that I need to prioritize, just like, if you look at hardware, there are usually critical paths. So one of the things I'm trying to get at is: what are the critical paths in a software application, and how do we prioritize those paths in the sort of Ethernet layer?

Well, I have spoken with Sandia. They are doing a lot of analog simulation, and they have huge clusters of thousands of machines doing analog circuitry simulation, which is kind of local. So that's a specific domain within the high performance compute domain, yeah.

So one thing I saw: Synopsys seems to be doing a lot of EDA tool support around these AXI buses. They have a whole product offering of simulation and setup for SoC and chip-to-chip, the chiplet co-simulation. That seems to me like something the big companies are using for building these SoCs and chips and chiplet communication.

You mean a chip-to-chip AXI bridge?

No, they have all the tools to simulate and build. What they provide are tools to synthesize and implement the chip communication, and for chip communication, what Synopsys is providing is the verification tools to simulate and test. I mean, they're just an EDA CAD company.

One of the things I was trying to get at is: can we guarantee the semantics of the communication? So, yeah, that's a good idea.

I really like your position of, you know, wouldn't it be nice if we had language support and proper, correct protocols, so that we don't need to go down to the low level and simulate every interface. Is that the idea? It seems like a good idea.

Yeah, the kind of thing I was trying to do in a simulation environment is short-circuit the communication, so you don't actually simulate it properly. You just say, okay, I've got a semantic where I know that this will be delivered in a particular order at a particular place, and I can then run a whole lot of things in parallel in simulation, and then synthesizing the network is something I can do analytically to verify.

This is where the simulator that Dugan is working on gets very interesting. We can start modeling different flows and see what the implications are for packet loss and other things. Yeah.

So in the simulation you want to, sort of, stretch things out and drop packets and see how the whole thing responds?

Exactly, and be able to model especially things like queuing in switches and routers, or if you're doing a mesh. Because then you can ask interesting questions like: okay, especially in these routers, often memory is absurdly scarce, in a way that has no bearing on economic reality other than that the market is cornered. So it's like, okay, what if we could bypass them and we just have, say, a 16 gig DIMM on one of these SmartNICs, which is actually not that expensive anymore? What does that enable us to do? And then, given the current data sizes, given the expected drop rates, what is the sweet spot? Does this make sense economically? And then what are the performance differences?

Yeah, and I think some of the other groups talked about doing some of these simulations symbolically. So you have actual gate-level simulators you can use, but you try to learn from the gate level what a symbolic-level thing is going to do, and then you can do quick simulations of what your network is going to be like when you deploy it a particular way. Yeah.

My expectation is, for the types of problems we're interested in trying to better understand and then hopefully solve, I don't think we need a gate-level or an analog-level simulation of Ethernet. If we trust the SerDes and we trust the protocol, and we assume we know the index of refraction of a fiber optic cable, most of this we should be able to simulate, how a Broadcom chip handles things.

So there's the sort of semantics of an Ethernet network in a compute farm, and then there's: okay, can I translate that semantic into a network-on-chip, if I'm building a chiplet system?

Also, for Kevin: do you mean like a Kleinrock-based description of expected wait times and queues, or do you mean something different?

That's probably right. I'm not too familiar. But if you can play with something in simulation and say, okay, I can tolerate a slow Ethernet channel here, because I'm just writing a file or something, I'm not looking at it, then you can sort of allocate that particular channel.

Yeah. I mean, Leonard Kleinrock's models are kind of based on: if you have links that have rates and arrival times and scheduling policies, then, as you said, you can describe it symbolically and come up with the statistical nature of what the packet delays are going to be.
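
As an aside, a minimal sketch of that kind of Kleinrock-style calculation in C, assuming a single M/M/1 queue with Poisson arrivals; the link rate, packet size, and load below are made-up illustrative numbers, not anything discussed here:

```c
/* M/M/1 sketch: mean time in system T = 1/(mu - lambda), valid for lambda < mu. */
#include <stdio.h>

int main(void) {
    double link_bps = 10e9;                    /* assumed 10 Gb/s link          */
    double pkt_bits = 1500.0 * 8;              /* assumed 1500-byte packets     */
    double mu       = link_bps / pkt_bits;     /* service rate in packets/s     */
    double lambda   = 0.7 * mu;                /* offered load at 70% utilization */

    double rho   = lambda / mu;                /* utilization                   */
    double t_sec = 1.0 / (mu - lambda);        /* mean delay through the queue  */
    double q_len = rho / (1.0 - rho);          /* mean packets in the system    */

    printf("utilization    : %.2f\n", rho);
    printf("mean delay     : %.2f us\n", t_sec * 1e6);
    printf("mean occupancy : %.2f packets\n", q_len);
    return 0;
}
```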

Yeah. I mean, they're talking about doing some of this in a thing called ASTRA-sim, in the AI hardware/software group. I'm not too familiar with what ASTRA-sim actually does.

Yeah, no matter how high you go in your abstraction, because when you do, let's say, an FPGA simulation, you don't simulate it at the transistor level; you really work with a model of the logical primitives and with their timing behavior. But what you really want to get right in a simulation is almost always the timing, so you'd better have a good timing engine which describes the behavior of the system, because you might run into metastability problems and interlocking problems, that kind of stuff. So there are different levels of abstraction. On the design level, with some languages, let's say like Amaranth, you can always simulate at higher abstractions. But the question is: is the timing behavior really correct?

Sorry, like Questa Sim or other open source event-based simulators? What's your tool of choice?

Well, in the open source world, there are two simulators. One is Verilator, which compiles your digital circuit down to C++ and then just compiles a program and runs the simulation at a certain time resolution. That's pretty okay for high abstraction levels, and it's very fast, about 1000 times faster than when you do really precise event-based simulation. Then there's Icarus Verilog, which is much more detailed but runs much, much slower. But Questa Sim is probably functionally very similar.

Yeah. How does that compare to, like, Xilinx ISim? I mean, Questa Sim, ModelSim, and ISim are all kind of peers, which are basically just digital circuit simulators.

Yes, with digital circuit simulators you have your logic levels; you don't work with analog voltages, you work with logic levels. And yes, you can use the Xilinx simulator; they are all pretty similar.

Anybody familiar with P4?

Yeah, we did some P4 at Stanford. The Barefoot group that Intel bought was all about P4.

That's sort of in the C++ line, but more for networking.

Yeah, it's like a very stripped-down kind of packet behavioral language. They actually implemented some switches with this, but the switches got discontinued at Intel. But the model was to be able to have packet events on arrival: look up, do something, drop packet, forward packet. And you can express that pretty concisely.
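
For flavor, a hedged sketch in C (not real P4) of that match-action model: on packet arrival, parse a field, look it up in a table, forward or drop. The table, offsets, and types here are assumptions for illustration only:

```c
/* Match-action sketch: one lookup table keyed on IPv4 destination address. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

enum action { ACT_DROP, ACT_FORWARD };

struct table_entry { uint32_t dst_ip; enum action act; uint16_t out_port; };

static struct table_entry table[] = {
    { 0x0A000001u, ACT_FORWARD, 1 },   /* 10.0.0.1 -> port 1 (byte order ignored for brevity) */
    { 0x0A000002u, ACT_FORWARD, 2 },   /* 10.0.0.2 -> port 2 */
};

/* Called once per received packet: returns output port, or -1 to drop. */
int on_packet(const uint8_t *pkt, size_t len) {
    if (len < 34) return -1;                   /* too short for Ethernet + IPv4     */
    uint32_t dst;
    memcpy(&dst, pkt + 30, 4);                 /* IPv4 dst at offset 30 (no VLAN)   */
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (table[i].dst_ip == dst && table[i].act == ACT_FORWARD)
            return table[i].out_port;          /* matched: forward */
    }
    return -1;                                  /* default action: drop */
}
```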

I'm just looking at somebody's paper on P4 and FPGAs.

Yeah, there have been full implementations, including P4 compilers that map it down to synthesizable RTL. Xilinx did a lot with this, where you can synthesize a P4 description into RTL, then synthesize that onto an FPGA and run it in hardware. So there are a lot of flows to make that happen, and because P4 is kind of like a simplified RTL, it's pretty easy to map with those tools.

Yeah, I just posted a link to the Xilinx Vitis Networking P4.

And I thought they had something else years ago. I don't remember it; I never got time to play around with it. But they used to have various types of IP, some of it in-house, some of it they were reselling from a third party, but specifically for different ways of doing FPGA-accelerated packet processing. And just off the top of my head, that could be an interesting question: are there cases where an FPGA could actually outperform an ASIC, in the context of dynamically changing the hardware itself, in relatively real time, for your packet flows? Because if you start using partial reconfiguration to update a small area, you can start doing some pretty cool stuff with that in some cases. I hadn't thought of that till just now.

That was kind of what the hot state machine stuff is about. So basically, the idea there was you're putting a microcode sort of engine in the FPGA, and then to change the behavior, you just change the microcode. You don't have to reprogram the FPGA.

But if you're changing the microcode, that could also be done in an ASIC. I'm thinking more at the actual gate level.

Yeah. I mean, the state machine thing is just about a fast way of updating the FPGA to get different behavior, running from like the C routine level, yeah.

Yeah. Makes sense, yeah.

I mean, the one thing with P4 was, P4 worked great as long as the protocol fit into what the microcoded P4 engine could do. But as soon as you had some bit operation that wasn't natively supported, performance suffered and you ended up going back to RTL. And so, Dave, I think that's what you're saying: where the FPGA can shine is when you need to reprogram the data path to do something custom. Then having a microcoded model may not work, but some hybrid of that might work.

Exactly. As a simple example, what if you want different dedicated physical hardware queues for certain types of flows, or even to reallocate them, similar to what Solarflare does with, I think they're called, ef_vi stacks. Yeah.

Right.

Or you spin up a new virtual machine and you want a new queue for that. Can you reconfigure it? Because often, with the ASIC versions, these tend to be preconfigured, I think, and it's like up to 8 vNICs or something. But with FPGAs you could do some interesting things, since the costs keep coming down, especially for 10 gig and the lower data rates.

Yeah, sorry, I just gotta say I was looking at a RISC-V and FPGA combination as being a good way forward to get more performance in computing, definitely, and then a big network of those things is gonna work for communication as well as compute.

And then this is something I had mentioned earlier, about the possibility of: can we make Ethernet a first-class citizen in the ISA itself? So can you have custom registers to store MAC addresses, and a custom operation as an assembly-level instruction? And then ideally, and John can vouch I've been trying to get vendors like John to build this for a decade, you have an ultra-low-latency MAC/PHY directly wired to some kind of soft-core processor.
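
To make the idea concrete, a purely hypothetical sketch of what such an ISA extension could look like on a RISC-V soft core. The custom opcode, the tightly coupled MAC TX FIFO, and the assumption that the toolchain supports the `.insn` directive are all illustrative choices, not an existing core or product:

```c
/* Hypothetical "Ethernet in the ISA": a custom-0 instruction that pushes a
 * 32-bit word straight into a tightly coupled MAC/PHY TX FIFO. */
#include <stdint.h>

static inline void eth_tx_word(uint32_t word)
{
    /* .insn r <opcode>, <funct3>, <funct7>, rd, rs1, rs2
     * Encoding choices (0x0b = custom-0, funct3=0, funct7=0) are arbitrary. */
    asm volatile(".insn r 0x0b, 0, 0, x0, %0, x0" : : "r"(word));
}

void eth_tx_frame(const uint32_t *payload, int nwords)
{
    for (int i = 0; i < nwords; i++)
        eth_tx_word(payload[i]);   /* stream the frame out with no bus crossing */
}
```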

Yeah, the kernel-bypass NICs were nice, but they go through the PCIe bus. Kernel bypass without PCIe would be pretty fast, right? And by the way, just speaking of RISC-V and FPGAs: in China they have adopted RISC-V all in, because it's royalty free and they can legally make the chips.

try to censor us now, yeah,

So it's unstoppable. They can even sell those in the US, maybe, unless they get tariffed. And for RISC-V, they have a complete open source implementation. At Hot Chips last year they had a couple of speakers, and they talked about the RISC-V tools, the full-blown RISC-V with the full cache controllers, everything you need to build a RISC-V system. Some of the Chinese researchers and government agencies have put it into open source and just said: here it is, anybody that wants to build this can do it. And of course they're the ones building most of them, so it's advantageous for them to have this open source implementation.

I would even say that ETH Zurich is the most advanced research group publishing open source RISC-V designs, AXI-based accelerators, multi-core. They are doing a lot of RISC-V cores and doing tape-outs on 18 nanometers, 23 nanometers. They're pretty advanced.

Yeah, I think they'll be here at the FPGA conference. ETH Zurich has historically been really strong at actually building working things, that's for sure.

So you're going to report all the good stuff you find, John?

That could be our topic of conversation next week: what did you learn at the conference?

I would say the thing I was chasing like a decade ago was getting the FPGAs in real tight with the processor core, because we were trying to do offload for some SystemVerilog stuff. But most of the FPGAs were on the other side of the L2 cache, and the latency of transferring stuff to the FPGA was too high to make it useful.

Exactly. That's why I was mentioning, well, I keep poking John to get, like, a RISC-V, or at the time I think I was asking for a Nios II, that has custom assembly instructions, where part of the ALU, almost the ALU itself, would essentially be an ultra-low-latency, high-frequency-trading PHY, so you can just start streaming out bits almost instantly. Yeah.

So for my stuff, like moving threads around, I was thinking: okay, if I can identify a particular routine call in the data path, I can then just lift the register set, throw it at the FPGA, and the FPGA will handle this.

Exactly. And that's one other thing, just as a cautionary tale, that I was very frustrated to learn. I think both Altera and Xilinx are now switching their own in-house soft cores away from the MicroBlaze and the Nios, so each of those is now actually based on RISC-V. That part's great. The downside is that right now Xilinx is planning on closed-sourcing their RISC-V implementation, which is really frustrating, because it means you can't add custom assembly instructions anymore.

No, my view is that you don't really want to add custom ISA stuff, because then you have a software compatibility issue. I'd rather do things as routine-level offload, because if you do it at the routine level, you don't have to change the code.

And it looks like Sahas' hand is up. Is that intentional, Sahas, or accidental?

Yeah, I guess on the conversation about the RISC-V cores: most of the interesting stuff is in the data pipeline, right? So writing it in C makes sense, but if all of the ports need to be handled separately, that means, can we fit eight soft cores on one FPGA, on the high end?

Yeah, somebody did 1000. Yeah. Go ahead, John.

Yeah, Kevin's right, some people have fit insane numbers of cores. Especially if you strip it down and just do integer math, you can put like 1000 small RISC-V cores on an FPGA now.

Yeah. And the nice thing with FPGAs is you can customize the exact footprint. So, as John was alluding to, there are a bunch of different optional instruction sets that you either do or don't include, and you control things like the cache hierarchy. So if you want to resize it, you just shrink your caches. Now, if you want a two-megabyte cache per core, then it's a different story.

Right. And typically these 1000-core RISC-V FPGA implementations do use a network-on-chip, because you need a very regular layout, so they just put an overlay network on top of the FPGA, sometimes using the built-in NoC like on Versal, or sometimes implementing their own data path.

So would it be easier to write C code for these than, like, HLS code that represents the data path? Yeah, I guess so, yeah.

So just one data point: a lot of network processors for large platforms used exactly that architecture for a long time, and they still do, with a bit of simplification. RISC-V runs the gamut, not just in the instructions but in cache levels and protection; you go from a microcontroller to something with several levels of cache that you could actually use as an accelerator. But for a lot of things, when it is data path, the two things that you want are to be able to write C code, and to have all the ins and the outs from the core handled at line rate by the ASIC or the FPGA. So there are RISC-V cores, like PicoRV or whatever, that are basically glorified state machines programmable in a proper RISC-V instruction set, and basically all your work is to push data in and push data out. There is a lot of stuff in the public domain, but I can tell you firsthand that very sophisticated implementations of that go into a lot of stuff that is actually running a lot of the network in the field. What it is, basically, is you have a network interface, say Ethernet, and you spray the packets to various networking elements, each of which is basically a glorified processor. And you can play around with that: you are targeting implementations that are heavy on data movement, so you can choose what kinds of things to leave out or treat as less important; you reduce the gate count and you increase the performance. A lot of those are very simple two- or three-stage pipelines, and they can basically push data in and out. And as a bonus, you write everything in C, so that's just a big point.
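
A minimal sketch of that run-to-completion "glorified state machine" style in C, with made-up FIFO addresses standing in for whatever the fabric actually provides; this is an illustration of the pattern, not any vendor's design:

```c
/* Data-path soft core: pull a word from an ingress FIFO, process it, push it
 * to an egress FIFO, forever. The register map below is hypothetical. */
#include <stdint.h>

#define INGRESS_FIFO   ((volatile uint32_t *)0x40000000u)  /* assumed address */
#define INGRESS_STATUS ((volatile uint32_t *)0x40000004u)  /* assumed address */
#define EGRESS_FIFO    ((volatile uint32_t *)0x40000008u)  /* assumed address */

static uint32_t process(uint32_t word)
{
    /* Placeholder for the real per-packet work: header rewrite, lookup, ... */
    return word ^ 0x1u;
}

void datapath_main(void)
{
    for (;;) {                                  /* run to completion, forever  */
        while ((*INGRESS_STATUS & 1u) == 0)     /* wait for data (bit 0 = ready) */
            ;
        uint32_t w = *INGRESS_FIFO;             /* pull one word in            */
        *EGRESS_FIFO = process(w);              /* push the result back out    */
    }
}
```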

Sounds like I need to read more about soft cores.

Well, that concept was used by Raspberry Pi, by their microcontroller, the RP2040. Each pin has a very tiny core which can execute any kind of protocol on the pin, and it's extensively used to implement protocols up to 20 megahertz. So you don't have an SPI core; you just load the program to behave like an SPI core.
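
As a generic illustration of that idea (not the RP2040 PIO API), here is a tiny core bit-banging an SPI-like protocol purely in software; `gpio_write` and `gpio_read` are assumed platform helpers, and mode 0, MSB-first is an arbitrary choice:

```c
/* Software "SPI core": clock out one byte, sample one byte back. */
#include <stdint.h>

#define PIN_SCK  0
#define PIN_MOSI 1
#define PIN_MISO 2

extern void gpio_write(int pin, int level);  /* assumed to exist on the platform */
extern int  gpio_read(int pin);              /* assumed to exist on the platform */

uint8_t spi_transfer_byte(uint8_t out)
{
    uint8_t in = 0;
    for (int bit = 7; bit >= 0; bit--) {
        gpio_write(PIN_MOSI, (out >> bit) & 1);   /* present next data bit        */
        gpio_write(PIN_SCK, 1);                   /* clock high: peer samples     */
        in = (uint8_t)((in << 1) | (gpio_read(PIN_MISO) & 1));
        gpio_write(PIN_SCK, 0);                   /* clock low                    */
    }
    return in;
}
```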

So is that supported on the current, like, Raspberry Pi 5 or 4, or is that something coming?

No, that's on the Raspberry Pi microcontroller, the RP2040. It's not the Arm, the big board; it's the small microcontroller, which you can buy for a dollar. Yeah.

Conceptually it's really cool. You could basically do programmable interfaces in software, to even do things, I think, like HDMI. I mean, it's really...

Yeah, VGA, HDMI, and even Ethernet, sending Ethernet on that small core.

You said it's 20 megahertz, though, that's the limit, right?

Well, one of the things is latency, if, if something, something could be slow, but you know, if it reacts very fast, it doesn't matter if it doesn't have much to do,

yeah, you usually program that in assembler, because you want to have a very tight loop, it's like, more like a powerful state machine.

Yeah, if you look at the call graph of a program, it's a whole lot of little compute contexts, which are state machines. And if you look at it that way, then you can say: okay, I can take this whole thing and spread it over a whole lot of small machines. It's just a question of how small can I get away with.

So one of the cool things Xilinx did: an old-timer, a guy named Ken Chapman at Xilinx, implemented something called the KCPSM, which was the Ken Chapman programmable state machine. It could almost be as small as you wanted, and it mapped down to a couple of flip-flops and a couple of lookup tables. Even on old FPGAs you could put 1000 KCPSMs onto the FPGA, and it did something similar. You could program it; they had a C-like design flow that mapped to this low-level assembly. It was like less than an eight-bit micro, but the point was you could implement protocols, offload them to these, and do it in microcode, which was cool.

Apparently that is now called the PicoBlaze. For those still here, I just pasted a link to it.

That's right. Thanks, Dave, yep.

But it was originally called, let's see, the Ken Chapman PSM, and then they renamed it to the constant(k) coded PSM, and now the PicoBlaze. And maybe this was the basis for what someone was just talking about a few seconds ago, the, what is it, PicoRV?

Yeah, that could have been what Raspberry Pi used, because it was open.

I mean, that makes sense. And we also have this pico RISC-V, the PicoRV, size-optimized.

The PicoRV was written by us, so I know a little bit about that. But the thing is, you have an FPGA with, let's say, 8000 LUTs; you can put 20 RISC-V cores there, 32-bit, so the size truly doesn't matter anymore. Yeah.

It's really a question of how fast your core can run, and when you have it in silicon, obviously it can run 10 times faster than when you have it in FPGA fabric. So these I/O cores do make sense, because they run at 350, 400 megahertz; they're running pretty fast. You can do a lot when you have that kind of speed on your microcontroller, which you do not achieve on an FPGA.

Well, I mean, the KCPSM basically has like a two-logic-level delay. Potentially these small microcontrollers, like a PicoBlaze, should be able to hit 500, 600, 700, 800 megahertz, if you just...

The PicoRV was originally designed to be included on a very fast FPGA, to run in the high-frequency clock domain of the rest of the circuitry it was intended for, for a lighter system. So it was really very inefficient, but very fast in terms of clock rate. That was the original intention.

Just for those who were curious, I was just asking ChatGPT what the Fmax numbers are right now in the latest line of FPGAs from Xilinx, and they're quoting like 500 to 800 megahertz for the Versal ACAPs that are taped out at seven nanometers. Yeah.

I mean, and we've seen that pushed to 900 megahertz. I think 1.1 gigahertz was a record for an unofficial, kind of overclocked, high-speed-grade device.

If I can, yeah, if I can afford to nuke one of these, I would love to do that as a semester-long project with some of the students, overclocking FPGA fabric.

Yeah, I was talking to some guys about doing quasi-adiabatic logic, and it looked like FPGAs might be a good application for that. It should possibly go faster and run at lower power and lower loss.

But hierarchically there are going to be a lot of soft cores, right? So the communication between them, would it slow down the FPGA to a crawl? I mean, I'm imagining like three or four layers of soft cores before it even attempts to put it into the main processor.

Sorry, designing the access to the memory, the ingress and the egress pipelines: if you build something like this with a bunch of processors to do data-path processing, it's the memory interface. For example, if you're trying to do something where you have context tables, the way routers have either TCAMs or external memory that implements some form of a lookup protocol, it's your memory interface and how you share it, and also how you lay down the CPUs, how the network between those CPUs is organized, and how they actually communicate with each other. Take, for example, a pipeline that has, say, 100 gig, and you're splitting those packets and spraying them to each one of the CPUs, and they're going to do some function. If you decide you're going to chop the packet into, say, 256 bytes or 64 bytes and pass it to each CPU that's going to do some computation, how do you actually aggregate the results? I can tell you from experience that designing the CPU cores, implementing them, and putting them together is about 5% of your end work. 95% is going to be designing the input and the output; the CPUs themselves are almost trivial compared to how you're going to design the memory hierarchy, the external interfaces you need, if any, the internal ones as well, how you're going to share that, and how they're going to communicate. But you can't disconnect that from the exact task, or at least a subset of the things that you want to do. So if you say, this is the task I want to do, then you can start actually laying down what type of network you need. Again, if you look at the documentation and research, and a lot of the products that were done from the late '90s to 2010, there's a lot of work.

Yeah, so it's like a giant shared memory pool with message passing, is kind of the idea?

It depends how you implement it, though. I mean, a lot of this is going to be implementation-specific, and depends on what the particular use cases are. Do you have to handle complex routing tables, or is it like a standard data center a lot of the time, where everything's routed out one gateway and you do that on the software side?

You got it, yeah, that's exactly it. There are implementations where people put stuff like that in for things that are, as I said, kind of glorified state machines, and it's really efficient and actually quite good. People even did things like taking the data path and having kind of a swap-in and swap-out, where they were aggressive in the tape-out of the ASICs, knowing that they had all the knobs so that if and when they had bugs, they could actually fix them by loading new software, and there were hardware bugs. So those are the kinds of things that can help quite a bit. But as David was alluding to, you can't disconnect the building of that network of communication until you basically put a stake in the ground in terms of what your problem is, or at least some of the problems you want to address, and what your resources are on the outside, what kind of memory hierarchy you have on the inside, etc. So that's kind of part of it.

Yeah, if you're trying to do routers and switches, this can be a lot more complicated. If you're doing NICs, which I think this group is somewhat proposing or focusing more on, it can be dramatically simpler. What do you want to accelerate?

So maybe as a data point: we did a 10 gigabit Ethernet switch, and just to make clear what the function of the CPU is, the CPU is just loading the DMA engine and doing management stuff. The important thing is that the CPU does not get on the wide bus, because to be able to really feed the 10 gig you need very fast access to memory. We achieved that with a twice-512-bit-wide bus, where we aggregated all the FIFOs from the 10 gigabit ports, so we have 40 gigabit here, and then we can read and write the DDR3 memory through the 512-bit-wide bus at six gigabytes per second. Everything else is on separate buses, and the CPU just doesn't handle any of that; it all happens by itself.

Depending on the platforms you're targeting, in some cases you don't even need external memory. If you do kernel bypass and have dedicated CPU cores on the x86 host side, then all you really need is very basic logic that's just forwarding packets between Ethernet and PCI Express, with some buffering in between.

Correct, yeah. So, David, just as a reference point, there are implementations that have some level of forwarding, but not a full-fledged router. If you limit it, a typical NIC is maybe a few tens of thousands, let's say less than 100,000 entries, even functions like doing lookups, etc., you can do it all internally and implement it with integrated memories. On a lot of the large FPGAs, actually, you will run out of routing resources and LUTs before you run out of memory. So it's actually a pretty efficient way to do it.

Exactly. We tend to use QDR SRAMs if we need very, very high bandwidth, very, very fast access.

So on the NetFPGA at Stanford, we implemented a 10 gig by four, so a 40 gig switch, a switch/router, but we did the whole data path in RTL, so that was all logic. And the lookup tables for doing IPv4 were a CAM, content-addressable memory. The open source project we did at Stanford had a small CAM implemented using FPGA logic. And then for the larger CAM that we did for other customers after Stanford, for Algo-Logic, we called it the exact match search engine, which was an associative lookup engine optimized for much, much larger tables. It used SRAM and block RAM to do the associative lookup, resolve IPv4 five-tuple matches, and do a lookup in a cycle. And that was kind of what got the interest of HFT: once we had this fully hardware data-path implementation, at the actual gate level, how long it took us to route a packet was like a couple of clock cycles. So it was like 20 nanoseconds to forward a packet, and that's with an IP lookup, as long as the tables were small enough to fit into on-chip memory or near memory.
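
For reference, a generic sketch of an exact-match five-tuple lookup in C, hashing into a small on-chip table; this is in the spirit of what was described, not the actual Algo-Logic engine, and the table size and hash function are arbitrary:

```c
/* Exact-match lookup: hash the five-tuple, probe one slot, compare fields. */
#include <stdint.h>

struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

struct match_entry { struct five_tuple key; int valid; uint16_t out_port; };

#define TABLE_SIZE 4096                        /* small enough for block RAM */
static struct match_entry table[TABLE_SIZE];

static uint32_t hash_tuple(const struct five_tuple *t)
{
    /* Simple mixing hash; real designs often use a CRC or similar. */
    uint32_t h = t->src_ip ^ (t->dst_ip * 2654435761u);
    h ^= ((uint32_t)t->src_port << 16) | t->dst_port;
    h ^= t->proto * 0x9E3779B9u;
    return h & (TABLE_SIZE - 1);
}

static int keys_equal(const struct five_tuple *a, const struct five_tuple *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->proto == b->proto;
}

/* Returns output port on hit, -1 on miss (single probe, i.e. one "cycle"). */
int lookup(const struct five_tuple *t)
{
    const struct match_entry *e = &table[hash_tuple(t)];
    if (e->valid && keys_equal(&e->key, t))
        return e->out_port;
    return -1;
}
```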

As a question for the researchers: does somebody have an idea of what the best ratio of beachfront to chip area is for doing some of these things?

That's use-case dependent; it varies so much. Like the simple thing we were just talking about: do you have external QDR, or are you trying to use the embedded block RAMs? Sometimes you get the massive chip just for the memory, but the vast majority of the LUTs you're not even using, unless you convert them to memory too.

Yeah, and by the way, the block RAM on FPGAs turned out to be just a spectacular feature, because there are like 1000 block RAMs, which are SRAM, single-clock-cycle, 600-700 megahertz, dual-ported memories, and you have 1000 of these on just a regular-size FPGA. Both Xilinx and Altera have these, and Lattice now has these programmable chips with thousands of block RAMs, so that was a real key enabling feature to put hardware features into the switch. We look at memory bandwidth and what Nvidia is doing with HBM, going off chip to access their tables; it's so inefficient, it uses so much beachfront, and it uses so much power that they're building these hundreds-of-kilowatt systems, because they're moving data on and off chip all the time. With FPGAs it all stays on chip, as long as you architect it to use the block RAMs. And they also have larger on-chip memories, things like LUTRAMs and UltraRAMs; there's a hierarchy of different memories that are on chip.

Yeah, I was trying to talk to Micron about, like, stacking their memory with processors rather than doing HBM blocks.

Well, that's the route AMD went. They have what they call, was it the 3D V-Cache? Yeah, it's exactly that: you get another layer of the die. I have one of these chips in the lab and it's like a gigabyte of cache, of L3 cache.

But yeah, they have HBM on Alveo cards, right? So you can get like eight gigabytes of HBM.

yeah, the lateral distance across these interposers is quite high. I mean, really, you just want to go vertically into the memory.

Yeah, vertical stack would be better, because

that's part of the OCP chiplet.

Yes, apparently there's both a Virtex UltraScale+ HBM series, and, ChatGPT is still typing, but there's also a Versal line as well.

Yeah, I think the thing I would be interested in getting at is: what's the optimal size of node when you're doing these sorts of big software applications? If you're spreading an application across multiple machines, and the network nodes are a bit smart, what do you actually program into them, and what's the optimal size?

Yep, that's where I personally get very interested for some use cases: can we do RDMA, whether it's something more like RoCE v2 or iWARP?

Yeah, for example, I applied for a research project where we tried a chip-to-chip AXI bus to access the block RAM on the other FPGA, just to have more access to block RAM, because the limitation is always the chip-to-chip communication. But even there, when we can access block RAM, it's faster than DDR3 RAM, for example.

Interesting,

Yeah, I mean, cache-to-cache architectures are always preferred, right? And that's really why people buy some of these big GPU systems: they're not buying them because they're actually using the FLOPS of the compute, because they're at like 4%, but because they can fit the whole data set into the fast memory and go cache to cache, RDMA, memory to memory, instead of ever needing to go out to a slower tier of memory or storage.

They're doing exactly what Intel used to do. I used to call it the Xeon tax: you want more than a certain amount of DRAM per motherboard, you've got to buy more Xeons. Very intentionally done, I think, I don't know.

There was some research way back at Inmos, in the Transputer days, where they worked out the best size of processor was 24-bit, and it's one of those things where that sort of seems like a better match for the cache size and general efficiency than 64-bit.

So I actually want to draw attention to this exact point, because this is precisely what is going to change, or is changing currently, in the demands of ML workloads in general. In some cases people got very lucky with how pre-training worked: the compute demands were relatively high per unit of low-latency memory or high-bandwidth memory. And that has already changed for training itself, not even just for inference. The reality is that current training looks like inference: it's the slow path of actually having policy rollouts and then doing some sort of RL over those policy rollouts, which means you're on the low-latency path. You can't generate tokens in parallel, or at least it's more complicated to generate them in parallel, and what that means is more memory per unit of compute.

that was, yeah, Nvidia was kind of lucky in the fact that they had, you know, 2k and 4k screens, and that the memory sizes that they were doing to do graphics, and having a couple of gigabytes of memory fit well with some of the AI workloads. But that luck, you know, unless they change the architecture, will run out, because as the workload size changes, the optimal will be a little different. So who's well positioned to fix that? Though,

Even if there are architectural changes, what has fundamentally shifted is that pre-training is basically over. I mean, OpenAI has stated that their last non-reasoning model will be 4.5, which translates to: we've run out of the ability to see progress from pre-training alone. Yeah.

GPT-4.5 is the end of the road.

But then again, we basically got told that a year ago at HPDs, directly from the folks. So this is actually somewhere between 12 and 16 months old as news at this point, yeah.

But also, to some degree: is that true, or is that just a position that they want to take? Is that strategic?

Is that a lie? Exactly.

I think in some ways it's common sense. Just take it from the point that eventually you start running out of the ability to have high-quality data to learn from, and if you actually want to produce superintelligence, or something that is competent in a specific enough domain where there's not enough data, you have to be generating it yourself. You basically need to run it and explore your actual policy. And policy rollouts are sequential; they don't have the same kind of data parallelism advantage as just reading a whole bunch of text. And even if there are architectural improvements to improve this ratio, they will be architectural improvements around the available hardware, which, looking at it the other way around, is saying that the tail is wagging the dog.

So how does that drive what OCP needs for the next compute device? What is the right architecture, what size memories, and what do they need? What is OCP going to need in order to implement these things that go beyond GPT-4.5?

I think it's partially about taking software workloads and trying to break them down into the smallest components running in parallel, analyzing what that needs, and then reassembling them in the best form, considering the network as well as the processors.

Honestly, what I would do is look at the type of models that look like this new generation. There's one open source one available, which is R1, and there are reasoning-model fine-tunes of these available. It's really just getting access to that workload and tracing it, looking at the traffic. I mean, we can talk about it theoretically, what it means to have 256-way sparsity on MoE, and what the overall vector-versus-matrix all-reduce traffic looks like. But this is a workload you can have direct access to, because if you have available hardware, you can just load it onto it and see what it does.

So is a sparse, distributed memory architecture needed for this new workload, R1?

So the short answer is yes, but I'll give the slightly longer answer, which is that sparsity is clearly necessary for these. Now, there are two types of sparsity. It used to be the case that you could break it up over the layers. Imagine you have a 50-layer neural network: you would load layers one through five on one piece of coherent memory, and then what you're doing is passing your input vectors through that, grabbing the results, and sending them over the network to the next machine. So this looks like point-to-point communication. The reality, though, is that means you have density within the layers. Now things have moved to MoE, mixture of experts, and the better way to describe this is that you get intra-layer sparsity. What happens is you have a small layer that decides which parts of a sparse version of your matrix actually need to contribute to the answer. The earlier models this year started doing MoE of 8 and 16, and DeepSeek has pushed all the way up to 256-way, where it's taking basically the best of 256 experts to pass the vectors through. So if you were to visually lay this out, you can break it up horizontally and you can break it up vertically. There are limitations to how granular you can get, but this will be a moving target, because people will get better and better at breaking this up more and more.
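
A minimal sketch of that intra-layer sparsity in C: a router scores all experts for a token and only the top-k run. The expert count, k, and the gate scores below are illustrative, not any particular model's:

```c
/* MoE routing sketch: pick the top-k gate scores; only those experts execute. */
#include <stdio.h>

#define N_EXPERTS 8   /* real models use 64-256 */
#define TOP_K     2

/* Indices of the k largest scores (O(n*k), fine for a sketch). */
static void top_k(const float *scores, int n, int k, int *idx)
{
    int used[N_EXPERTS] = {0};
    for (int j = 0; j < k; j++) {
        int best = -1;
        for (int i = 0; i < n; i++)
            if (!used[i] && (best < 0 || scores[i] > scores[best]))
                best = i;
        used[best] = 1;
        idx[j] = best;
    }
}

int main(void)
{
    float gate[N_EXPERTS] = {0.1f, 2.3f, -0.4f, 1.7f, 0.0f, 3.1f, 0.2f, 1.1f};
    int chosen[TOP_K];

    top_k(gate, N_EXPERTS, TOP_K, chosen);
    /* Only the chosen experts' weights are touched for this token, which is
     * what turns dense layer traffic into sparse, bursty traffic. */
    for (int j = 0; j < TOP_K; j++)
        printf("run expert %d (score %.1f)\n", chosen[j], gate[chosen[j]]);
    return 0;
}
```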

Is it advantageous to still use a matrix model and have, like, a sparse matrix, or does that not even use matrices, using other associative lookup techniques instead?

So I would argue that the vast majority of this is going to be with matrices for the foreseeable future, and saying otherwise is kind of like arguing about ternary computing: as long as the available hardware is more sympathetic to linear transformations, all of the operations will be linear transformations. There might be some exceptions to this for retrieval, where you actually start seeing really sparse retrieval, I don't know, maybe there is enough long-term memory that people are doing against disks, but...

it's kind of a power issue. My view is that you want to go to things like spiking neural networks, where you break it down into pieces that are essentially all sparse, but you only fire up the pieces you need. Yeah,

That would be a hardware change. Basically what I'm saying is that we will be on matrices as long as we're on this type of silicon.

Yeah, any big matrix can be broken down into lots of smaller matrices when you're doing the math. So what's the optimal size now?

What about RAG models? Because that's partly being used to augment these models. Is that what comes after GPT-4.5?

I would argue the majority of RAG is the industry band-aiding because they can't train any of their own stuff. And I actually just don't even think it's very good.

fixing the hallucinations.

But basically, RAG is a poor man's version of long-term memory, because doing semantic search relies on questions looking like answers. Basically there's a joint embedding space, which is all of your documents or the entire internet or what have you, but it relies on the idea that your queries and your values are being generated by the same function and that they will reside close to each other, and this just isn't true. This is precisely how transformers work, too: the attention layers have separate matrices for how keys and values are embedded out of each token, and they're learned; it's called multi-headed attention because there are multiple ways that keys are generated for every word you pass in, and multiple ways you query for words. That doesn't exist today for things like RAG; it's really just a vector DB without any ability to learn. There isn't learning and forgetting and reorganization. There are current attempts out of Google, like the latest Titans paper, that try to readdress this. I see no possible world where long-term memory isn't something that, in the next somewhere between one and five years, gets upended with a better way of implementing it, because RAG kind of violates the bitter lesson: it's not a learning solution, and it doesn't have a good way to get better.

Yeah,

So would a key-value store be useful in this new model? Being able to put long-term memory into a KVS and have it built on basically object stores, is that useful for constructing these systems, either on the current or new silicon? Probably, yeah.

Okay, does DeepSeek mean that we're going to see different algorithms coming along pretty soon?

Well, I think you're going to get new algorithms all the time. The question is how do you implement them efficiently with small, fixed processors. We can only make chips, yeah? So the question is what kind of chips do we need to make.

Yeah, I was, I was from, from what I've seen, you know, in the last several years, there have been tweaks on the algorithms, but not new classes of algorithms. And I'm wondering, I don't know enough about what deep seek did to know if it's a substantive difference or just a qualitative difference.

I think we're on the same hardware, so qualitative.

Okay, I just wondering, so accessories are different or whatever.

So what I found interesting from the high performance compute people is that they follow the capabilities of the hardware. It's not so much that the software people drive the hardware architecture, but more that the hardware people provide awesome hardware, and then the software people adapt to it.

From what I saw in HPC, it's a feedback loop. The software people say, gee, I wish I could, and then the hardware people sometimes figure out how to do it. So it's not one-way. Yes.

I mean, I kind of say that GPUs are like, not the best processes for anything apart from graphics. So I'm not sure how fast the the hardware guys are adapting.

I just heard a talk yesterday about analyzing 3D crystallography, crystal imagery, and they took it down from 12 hours to five minutes. It's a big deal.

What were they doing there?

They were doing X-ray crystallography, taking the data and then producing something, I've forgotten what. But this was an algorithms talk, and they were able to take the processing time from half a day to five minutes, using GPUs.

Yeah, but things like the GPU were a lucky accident, because the money was spent to build gaming graphics cards, and they were misused. The same thing happened with the 6502 too: that processor was designed to be a gaming console chip for TVs, and crazy people just made a home computer out of it. Yeah, programmable. Yeah.

Now, that was 10-15 years ago when the GPUs first came out and people figured out how to use them for HPC. But these days they've re-architected the interior, largely for AI but also for HPC, and they're doing lots of things, particularly in the memory subsystem, specifically for HPC that would not be done for gaming. For example, numerical precision: they do double precision, and gaming never needs double precision.

Yeah. Interesting, Nvidia's earnings: the gaming segment is declining, right? That was one of the low points of the earnings call, the fact that it's all data center now, right? And so for Nvidia...

That's partly hitting the limit of what humans need in graphics. Yeah.

But John, the details on that: are they actually selling less gaming, or is it just as a percentage of their revenue, because everyone's just buying as many of their new GPUs as they can?

Well, I think it's that everybody can do graphics at the level humans need.

Yeah, no, I think every device now is going to have several GPUs for graphics and gaming. But I think it's more that the other uses are increasing. All the major Top500, or top-50, machines for HPC are dominated by GPUs for doing heavy floating point work.

By the way, for the new workload that we're talking about, the post-4.5 work: do we need floating point? Do we need 64- or 80-bit IEEE floating point? Is it gonna be low precision? Is this gonna be reduced-bit-precision workloads?

Yeah, for AI, lower precision; for HPC, double or even quad, depending on the problem.

I mean, I did matrix math in college, like 40-plus years ago or something. You have to accumulate with accuracy; you don't necessarily need to do the multiplication with accuracy.
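
A minimal sketch of that point: narrow multiplies feeding a wider accumulator, here single-precision products summed into a double, purely for illustration:

```c
/* Mixed-precision dot product: multiply narrow, accumulate wide. */
double dot_mixed(const float *a, const float *b, int n)
{
    double acc = 0.0;                 /* wide accumulator             */
    for (int i = 0; i < n; i++) {
        float p = a[i] * b[i];        /* narrow multiply              */
        acc += (double)p;             /* accumulate with accuracy     */
    }
    return acc;
}
```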

Yeah, I'm an HPC guy myself. In fact, I was the chair of the Gordon Bell Prize Committee for 10 years. So there are problems where you can get away with less than double. There are problems where you need double. And these days, as the problems get bigger, some people are seeing a need for quad. So in the HPC world, it's quite different than in the AI, where, you know, they're talking about eight bit floating point in AI,

Yeah, and even four-bit and two-bit and one-bit.

Yeah. So posits look really promising for AI. I don't know if anybody's going to adopt them, but...

For big simulations and models that do accumulate error, now, does anybody have support for quad precision? That would be, what, 128-bit floating point, or is it just composed by doing some 64-bit?

Yeah. So I know people who are building libraries for multi-precision. There are multi-precision libraries out there, but they are aimed at hundreds or thousands of digits. When you get into quad, or double-double, quad arithmetic, you write the library differently, and I know people who are working on that. The higher precision is normally used in, I'm drawing a blank on the term, a correction procedure. So you do the basic calculation in double, but at certain places you use higher precision to reduce the round-off in the result.
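
A related correction-style trick, sketched here as Kahan compensated summation, which carries an explicit error term rather than switching the whole computation to higher precision; this is an illustration, not the specific procedure being described:

```c
/* Compensated summation: recover the low-order bits lost at each add. */
double kahan_sum(const double *x, int n)
{
    double sum = 0.0, c = 0.0;        /* c holds the lost low-order bits  */
    for (int i = 0; i < n; i++) {
        double y = x[i] - c;          /* apply the running correction     */
        double t = sum + y;           /* big + small: low bits may drop   */
        c = (t - sum) - y;            /* recover what was dropped         */
        sum = t;
    }
    return sum;
}
```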

So is that mostly, do you think, for HPC? Like, is this for simulating nuclear bombs, or is this for AI workloads?

This is HPC. AI is quite happy with 16- and even 8-bit float.

So is this a mainstream workload, or is this going to be something specific to the, well, our DOE centers that are now being defunded?

Well, you've got your nuclear stewardship, you've got your weather forecasting, you've got all kinds of HPC modeling going on. It's used in economics for modeling the economy. The rocket scientists on Wall Street need it.

Do you see a need on Wall Street for 128-bit, 256-bit floating point?

I doubt it, not much. I can't see them needing it often. When you start caring about that, you stop doing floating point and you do sort of infinite-precision libraries instead. Yeah.

So my, my understanding of the, you know, the Black Scholes kind of calculations that they do is they don't need quad they may need it, as I said, for the error correcting step, but, but that it's not the main use, and they can get away with software versions.

Yep. And part of this, the weird thing in finance, is that we know Black-Scholes is wrong. What they really want, when anyone does a new library, is to make sure it just agrees with what it's been doing for the last 30 years. And so often I've actually heard of vendors who have better implementations, and they go to the investment banks and they're like, well, it doesn't agree with Bloomberg, so we don't want it. You're sort of locked in.

they also, they also have the Monte Carlo simulations, which, again, don't need quad, but are typically done in double. I wonder why. But they're typically done in double.

That's just because that's what's natively supported on the hardware, so they're sort of stuck. But I'll be honest, a lot of people, when they do these things, do not do the proper numerical analysis to actually figure out what they need or what they should be using.

Well, I attend the seminars over at the math department at Stanford, and every year three or four of their graduates go straight to Wall Street. So there are people there who are doing the correct numerical analysis. I will agree with you that most don't have a clue.

Are you saying that they're doing it, or just that they know how to when they go?

No, they're hired because their knowledge of the theory gets better results.

Okay, perfect. That's reassuring to hear.

Yeah, but yeah, I'm a floating point guy. I spent 20 years doing HPC.

I did a lot of floating-point to fixed-point conversions in assembly decades ago, doing signal processing.

So, by the way, from an expert: what is the best high-precision floating point hardware device to use today? Is it AMD or Intel, or is it an Nvidia GPU, or what?

They all use IEEE floating point, so they're all the same.

Yeah. So for me, it was interesting to see the religious war between the IEEE people and the posit people. And what I learned is you'd better understand where your numbers end up, because you want to make sure that's where most of your precision is; if you misuse it, you just lose precision.

Well, innovation on exponent/mantissa bit trade-offs, right.

That's the advantage of posits: you get to choose. In IEEE you can say single or double, and sometimes quad, but in posits you get to pick how many bits in the exponent and how many bits in the mantissa, do your arithmetic that way, and keep changing it as you go.

So are posits arbitrary precision? Can you go as high as you want, or is it just defined up to quad?

Well, it's typically whatever word size the machine supports, but within that word size you get to adjust your exponent and mantissa. And it's new; I don't know when it first came out, but at least 10 years after the IEEE floating point standard.

Are there hardware systems that support the posit high precision?

There are. Now, I don't know specifically, but I've heard that there are. I think Arm has a version that supports it natively, but I don't remember the details; I haven't been following. I know John Gustafson, who started the whole thing.

Okay, so, awesome discussion. As we're now 10 minutes past the hour, I would suggest, unless anyone really wants to say something urgent, we close the session and get back to it next week.

Yeah, I have one comment, well, a question, actually. All of this discussion, interesting discussion, about AI and the R1 model and mathematics and all of this, what...

What is the question? We lost you.

Does all of this have anything to do with...

You dropped out. You have to repeat the question.

I can't hear anything.

What does all of this have to do with Atomic Ethernet?

Well, a good question. These are all parallel algorithms, so the memory access patterns when you're spanning nodes are important for the network. The floating point, probably not so much.

So the question is: what's the use case? Where is it relevant? What is the size of the packets, what is the latency you need, and what's the throughput you need to be competitive against current solutions? Because there is no point in making a standard if you can do everything on a GPU cluster today. So for me, it's really about identifying where the high-value problem is which is not yet addressed.

Exactly. Thank you very much.

Okay, anyone else? Okay, so thank you and see you next week. Yep,

bye,

See you next time. Thank you.