Either AI Is Happening to You, or You Are Making AI Happen
4:57PM May 30, 2024
Speakers:
Keywords: tokens, inference, model, groq, question, chip, working, deploying, gpu, production, built, gpus, speed, training, running, gpt, answer, hardware, performance, av
Good morning, everyone. I said, good morning. Awesome. All right, let's start off with a question, which is: why are we all here today? In 1906 there was a devastating earthquake here in San Francisco, and after that, the Panama Canal was opened, which actually provided greater access to San Francisco. And in order to show off San Francisco, the Palace of Fine Arts was built as a celebration of the rebirth of the city. And given that AI is having such a dramatic effect on business and technology, I think this is a very suitable place to have an event where we're talking about how technology is going to change and how it's going to emerge. So I'm going to start off with a really quick demo here. And I think everyone here knows that Groq is fast, but we've gotten a little faster, faster than the AV can handle. All right, maybe no live demo for you.
All right, well, we're probably so fast that we're, oh, technical difficulties. All right, let's do this. I'm just going to show you anyway, because it's so fast, all that matters is that it's going to be blurring the screen. All right, tell me about the Palace of Fine Arts. It's done. Got a bunch of whoas. All right. Tell me more. We broke it. It's so fast we broke the AV. All right, I think I'm doing the rest of this unplugged. I will admit that is the first time that we've ever broken the AV before. All right, let's just go straight back to the presentation, if we can. So I was actually doing this demo once, and the AV was glitching out so hard that there was only about a three-second window where it would play, and then it would completely glitch, and then the video would play again, and then it would completely glitch. And I'm like, no, this is perfect. Boom, hit enter, and it finished before it was done.
All right, well, what I'll do is I'll talk a little bit about what I was going to tell you. I normally don't actually do presentations, so this is going to be more like what I normally do. So in the last couple of months, I think three months, we've actually been able to get... actually, let me just do one quick thing. Sorry.
All right, what we're going to do, I'm just going to answer questions for the next couple of minutes while we try and recover the AV from Groq speed. So does anyone have any questions? All right, I see a hand over there. I'm just trying
to figure out how much equipment is required to use it. Yeah, should I be imagining a room full of servers?
So the question is, what is the footprint of Groq's hardware? How much hardware does it take in order to provide the speed? And you've been using it, you like the speed. But I think your real question is, can we actually scale, or is it...
So, we're actually denser than NVIDIA GPUs in terms of the number of LPUs that we fit per rack. However, we need fewer racks in total per token produced. And this is very counterintuitive. Think about it this way: a car factory is very large; however, you produce many more cars per square foot out of a factory than you do out of a workman's garage. So if you need roughly 6,000 GPUs to get enough throughput, then you're probably going to need roughly 6,000 Groq LPUs to get the same throughput, however, at a lower cost per chip and lower energy. So, what, about between one third and one tenth the energy per token produced? Does that answer your question?

Fewer racks per token produced?

We need fewer racks, because we fit more LPUs in a rack than you can fit GPUs. Because the LPUs use about 200 watts, in practice you can fit a lot more of them in a rack than when you need 1,000 to 2,000 watts for a GPU. But since we get about the same throughput per LPU as per GPU, we need a lot less power, and therefore we can fit a lot more in the same amount of space. All right, next question. Hello, okay, hey, over here on the left. Oh, you're to the left, cool. How did you start Groq? How did you come up with the idea?

Ah, so originally we didn't intend to build a chip. I had built a chip at Google, so, I started the Google TPU. And what ended up happening was I ended up in the Rapid Eval team, which is the group at Google X that comes up with all those crazy ideas of what to do next. And then what ended up happening was I started meeting with some VCs, and they asked, what would you do differently if you had this to do all over again? And my answer was, I would make it easier for the software, because when you program a GPU, you have to write what are called kernels, CUDA kernels, and it's a Herculean task. It's low level, and you're constantly tuning them and retuning them to get performance. It takes a lot of time. So we decided to spend the first six months working on the compiler, and a VC was willing to fund us to do this. We did it, and then we actually built the chip around the compiler. That's why, when new models come out, we actually have them in production the same day, even though no one is targeting our hardware when they train the models. In fact, when Llama 3 came out, within I think it was 12 hours, but the same day that it came out, we actually had it in production. All right, next question.
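To put rough numbers on the rack-density point above, here is a hedged back-of-envelope sketch. Only the per-chip wattages (about 200 W per LPU, 1,000 to 2,000 W per GPU), the similar per-chip throughput, and the roughly 6,000-chip count come from the talk; the per-rack power budget is an illustrative assumption.

```python
# Back-of-envelope sketch of the rack-density point above.
# Per-chip wattages and the ~6,000-chip figure come from the talk;
# the per-rack power budget is an assumed, illustrative number.

RACK_POWER_BUDGET_W = 40_000   # assumption: usable power per rack

LPU_WATTS = 200                # "about 200 watts" per LPU (from the talk)
GPU_WATTS = 1_500              # midpoint of the quoted 1,000-2,000 W range

lpus_per_rack = RACK_POWER_BUDGET_W // LPU_WATTS   # 200 chips per rack
gpus_per_rack = RACK_POWER_BUDGET_W // GPU_WATTS   # 26 chips per rack

chips_needed = 6_000           # roughly the same chip count either way

def racks(chips: int, per_rack: int) -> int:
    return -(-chips // per_rack)   # ceiling division

print("LPU racks:", racks(chips_needed, lpus_per_rack))  # 30
print("GPU racks:", racks(chips_needed, gpus_per_rack))  # 231
```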
I have a question also related to the difference between a GPU versus an LPU or TPU, right? So I understand this is super, super fast. It's really very impressive, considering right now everything is based on a transformer. Now what if, you know, maybe next year there will be a new foundation model architecture or anything else? Then what do you think, by that time, the advantages and disadvantages between the LPU versus, you know, GPUs would be? Like, basically you against Nvidia?
Excellent question. So really, how tailored are we to a specific model architecture? When we designed the LPU, transformers didn't exist, so we actually didn't design the LPU for transformers. We designed it in general to accelerate inference, and that was the difference: it was inference rather than training. We have over 800 models compiling at state-of-the-art performance and running on our hardware, everything from RNNs, LSTMs, you'll see some other fun stuff, convolutions, graph neural nets; if we can get the presentation working, you'll see some of these things. And also, we're really excited about state space models; we think they're going to run particularly well on our hardware. But what we did was we focused on inference as opposed to training. And the difference there was about the latency, because you can't produce the 100th token until you've produced the 99th, and this is very different from the parallel problem that GPUs solve really well. Does that answer your question?
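A minimal sketch of the sequential-dependence point: in autoregressive decoding, the 100th token cannot be computed until the 99th exists, so single-stream latency cannot be hidden the way it can in training or batch processing. The `next_token` function below is a dummy stand-in, not a real model.

```python
# Toy illustration of why decode latency is inherently serial:
# each new token depends on every token generated before it.

def next_token(context: list[int]) -> int:
    """Dummy stand-in for a model forward pass."""
    return (sum(context) * 31 + len(context)) % 50_000

def decode(prompt: list[int], n_new: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        # Token N+1 cannot be computed until token N exists, so the
        # per-token latency of this call bounds generation speed.
        tokens.append(next_token(tokens))
    return tokens

print(decode([1, 2, 3], n_new=5))
```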
Sorry, can you speak up? Partially. From a design perspective, in a way, inference is a subset of training, right? It's just one direction. Like, if the foundation model basically changes completely, how can you achieve best-in-class performance?
So if the foundation model changes completely, how can we get best-in-class performance? So we're getting best-in-class performance on all of those different model architectures that I mentioned. And as I mentioned, state space models, which seem to be the thing that's coming next, we do well on. We have things working that are convolution and attention combined. We have things working that are RNN and attention combined. And so we're pretty agnostic to that. It's really just that we focused on inference as opposed to training. Thanks. And I guess you're just going to have to listen to me for the rest of this. All right, let's keep going. Hello.
I have a question about Groq's speed in terms of time to first token. I know that you obviously are extremely fast in overall token production, but what about time to first token and the latency to get that first bit of the answer?
You're killing me, because I was going to show that off. All right, I'll spoil the punch line. We're actually now running at about 30,000 tokens per second input speed right now in our test deployments. We haven't released that to production yet; it's coming soon, but 30,000 tokens per second input speed on Llama 8 billion. And we expect to do a lot better than that throughout the year. And that is a record-breaking amount of speed for input, which is why I wanted to show you. But let's put a pin in that, because that's actually, I think, not the next slide, but if we get this working, you'll see it in just two slides. So we'll probably still get to that.
So the previous speaker talked about the scaling laws, and basically it seemed like GPT-4 was trained on what, 30,000 A100s, and then GPT-5 on 100,000. But I've seen a paper analysis from Epoch AI saying that there's also some scaling on the inference side: that if you held your training constant, you could increase your inference compute by 30 times and then perhaps get to those levels of performance, like, you know, get to a GPT-7 instead of a GPT-6. Is that scaling something where, if we hit a limit, where it's like, okay, you have a nuclear power plant running your training on your million chips, and then in order to continue to improve performance you scale massively on the inference side? Is that something you've seen, or a way to overcome the scaling limit?
So, a good example of this. The scaling laws were a paper published by OpenAI a couple of years ago, and what it showed was that the number of parameters you have per model allows you to get better perplexity as you train on more tokens. So effectively, the idea was you would be able to absorb more tokens in your training the larger the model. What we saw with people training past what's called Chinchilla optimal is that it didn't really end; you could actually just keep making the models better. Now, you don't get as much per token trained on if you don't make the model bigger, but when you're talking about putting something into production for inference, if you can just keep training, and if you have enough money to do that, it can actually pay off, because it'll bring your inference cost down. If you can get, as Llama 3 is doing, almost GPT-4 quality with a much smaller model, it makes a lot more sense for you to train that model really extensively and then put it in production. And so there's a trade-off. And back in 2013, when I was still at Google, we would see this sawtooth pattern where what kept happening was the best model would get really big, and then all of a sudden the best model would get really small, and then the best model would get really big, and then really small, really big and really small. And what was happening was people would make the model very big to get the best eye-popping quality that they could, and then they would start optimizing it. They would come up with new techniques, and they would do better, and then all of a sudden it would get smaller, and it'd still be higher quality. And then someone would go, hey, if we just make it bigger, then we'll get higher quality and we'll set new records. And so that sawtooth keeps happening. What's great about the really big models is that they set the benchmark for what is possible, and then the more production-minded folks end up training these smaller, better models. However, we actually really like the big models. As an example, we have a 180-billion-parameter model running at 260 tokens per second on our hardware. For comparison, the experts in GPT-4 are only 110 billion parameters each, which means that if we were running GPT-4, since the experts go on separate sets of hardware, we'd be running it much faster than 260 tokens per second. Hopefully, that answers the question.
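A hedged sketch of the train-longer-versus-serve-cheaper trade-off described above: extra training spend on a smaller model can pay off once enough inference tokens are served at a lower per-token cost. Every number below is invented purely for illustration.

```python
# Invented numbers only: when a model will serve a huge number of tokens,
# extra training spend that lowers per-token inference cost can win overall.

def total_cost(training_usd: float,
               serving_usd_per_million_tokens: float,
               tokens_served_millions: float) -> float:
    return training_usd + serving_usd_per_million_tokens * tokens_served_millions

TOKENS_SERVED_MILLIONS = 5_000_000   # hypothetical lifetime serving volume

big_model   = total_cost(10e6, 10.0, TOKENS_SERVED_MILLIONS)  # cheaper to train, costly to serve
small_model = total_cost(30e6,  1.0, TOKENS_SERVED_MILLIONS)  # over-trained, cheap to serve

print(f"big model:   ${big_model:,.0f}")    # $60,000,000
print(f"small model: ${small_model:,.0f}")  # $35,000,000
```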
Thank you. Hi, you've probably already answered my question partially. So, in your architecture, it's a lot of SRAM versus the older, cheaper options. So what do you envision the potential examples and use cases would be for inference that can justify the cost, or do you see a way that the cost could come down a little bit?
Sorry, can you repeat that one again? I had a little bit of echo.
So, in your chip architecture... okay, yeah,
Yeah, so, all right. There's a famous computer architect named Jeff Dean, and he published this thing on GitHub, which is numbers every engineer should know, and it has the speed and throughput of reading from DRAM, from disk, from internal memory on the chip. And what's funny is it's accurate, but it was published 10 years ago. This technology hasn't gotten much better in the last 10 years. So what we realized when we started Groq was that, since everyone performed pretty similarly when using external memory like that, external memory speed was the bottleneck. If you have the same amount of HBM as someone else, you get the same performance. If you have more HBM, you get more performance. If you have less HBM, you get less performance. We realized that there would be no way to differentiate if we relied on external memory, so we started with SRAM. Now, SRAM is dramatically faster. Think of the difference between running your programs off a hard drive and moving to DRAM; moving to SRAM is a little bit like what you get when you make that shift. The problem that that introduced was that we then needed to be able to scale across a very large number of chips, and there is no known networking technology that will allow you to do that for inference. For training it's easy, because you're not latency sensitive. And so what we did was we came up with a fully synchronous interconnect. There are no packets; we know, to the nanosecond during the execution of the program, when data is going to go from this chip to that chip and what that data is. It's all pre-scheduled. As a result, we're able to get very low latency. So when we run Llama 8 billion, we run that on 64 chips. When we run 70 billion, ah, here we go, when we run 70 billion, we run that on about 512 chips. And when we run that 180-billion-parameter model, we run it on 1,500 chips. There's no known way to do that with InfiniBand, Ethernet and so on. So we had to solve all these problems in order to make this work, and there were about six different major problems that we had to solve. All right, now I'm going to go and answer one of the questions that we just saw. I'm going to skip past some stuff, but someone asked about the input speed. Actually, first, really quickly, this is an amazing... oh, is it not? There we go. All right, so this is a great example of summarizing an article. Watch this: when you click submit, it immediately pops up. That was 172 milliseconds, at 29,944 tokens per second input: 70 milliseconds for the input, 172 milliseconds total. That's how much text was summarized in 172 milliseconds.
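One common back-of-envelope model behind the "external memory speed was the bottleneck" point: in single-stream decoding, each new token requires reading roughly all the model weights, so tokens per second is capped at memory bandwidth divided by weight bytes. The bandwidth and model-size figures below are generic illustrative values, not measurements of any particular chip.

```python
# Rough upper bound on single-stream decode speed when reading weights
# from memory dominates: tokens/s <= bandwidth / bytes_of_weights.
# Figures below are generic illustrative values, not vendor specs.

def max_tokens_per_second(bandwidth_gb_per_s: float,
                          params_billions: float,
                          bytes_per_param: float = 2.0) -> float:
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_per_s * 1e9 / weight_bytes

# An 8-billion-parameter model stored in 16-bit weights:
print(max_tokens_per_second(3_000, 8))    # external-DRAM-class bandwidth: ~188 tok/s
print(max_tokens_per_second(80_000, 8))   # on-chip-SRAM-class bandwidth: ~5,000 tok/s
```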
So now, yeah, we're fast. That's actually attracted a large number of developers to Groq. In fact, so many, and thank you so much, that we can barely keep track. These are just the folks who've pasted their applications built on top of Groq inside of our Discord channel. We want to give everyone a chance to see all of the amazing work, and I go in there every day and check this, and it just keeps growing and growing and growing. That's in 11 weeks. We now have 208,000 developers in 11 weeks, with 90,000 active API keys. This is developers, not users; we're at over a million people who've used our website. But this is developers. Now, I've been playing with this while talking, because by the end of this year we intend to have 25 million tokens per second. We measure in seconds, not minutes, because a minute is just too long. And so what is 25 million tokens per second? That is the equivalent of the hyperscaler with the most GPUs available right now, today. We're going to have that in production at the end of this year, running at Groq speed. And every single Groqster carries a challenge coin like this that has that goal on it, and whenever we have a question of what we're going to do next and it's not aligned with this, we put it down on the table and say, that's what we're doing. Now, we're deploying a lot of these very angry-sounding servers. You were asking about the footprint, so this is a direct answer to that. This right here, you're going to see, is about 10 Groq racks. To get to that 25 million tokens per second at the end of the year, we're deploying 100 times this much. And we're also deploying all around the world. In fact, this is the first data center we deployed in; it's actually here in the Bay. We have most of our servers now up in Spokane. There's a massive deployment, over 100 racks there today. That's our single largest deployment. But now we're also going to do a much larger deployment in the Midwest that's already, like, on its way. And then we've signed deals to deploy in Norway already, and we've also signed a deal to deploy a very large amount of compute in the Middle East as well, and all of that's happening by the end of next year. But all of these deals were in flight before Groq went viral; since then, we've gotten a lot more interest. But the thing is, tokens aren't enough, and so we were being asked, what about the models? Well, I'd like to show you a multimodal model running. This is just a toy example, but normally you wouldn't take a multimodal model and run it in real time on your drawing, because they're too expensive. This right here, we actually have running much faster now. So this is an old video, but it's live, guessing what this is. Now, it's not impressive what it's doing; what's impressive is we're using a fairly large model to do this, and we're running it in about 200 milliseconds for each guess. And then we also have this. I wanted to show you something cool.
I don't think the audio is coming through. This is something I just built; it grabs my voice. It uses the OpenAI Whisper model, the v3 large model. And as you can see, it's fast, really fast, and really accurate too. And this is not edited. It also works with longer audio, so I could say something longer, like, I don't know, I can't really think of anything right now, but okay, I think you get the point. It also supports translations, and I'll show you that another time. So how does this work? It's a little secret called Grok... no, G-R-O-Q, yeah, that's right. Anyway, I'm excited to show you what I'll be building with this and your consumer app. I'll show more details later, and you can DL for the test guide. Okay, that's it for now.
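For anyone who wants to try something like the transcription demo, here is a minimal sketch assuming the Groq Python SDK's OpenAI-style audio transcription endpoint and the `whisper-large-v3` model name; the file name is hypothetical and the exact client interface should be checked against current SDK documentation.

```python
# Hedged sketch: transcribing audio with Whisper large-v3 served on Groq.
# Assumes the Groq Python SDK exposes an OpenAI-style audio.transcriptions
# endpoint; verify against current SDK docs before relying on this interface.
import os

from groq import Groq  # pip install groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

# "voice_note.wav" is a hypothetical local recording.
with open("voice_note.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="whisper-large-v3",   # model family named in the demo
        file=audio_file,
    )

print(result.text)
```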
So we have a lot of different models running on us. But it's speed, it's tokens, and it's models; you need all three of these. Now, the number one feature request that we've had at Groq is billing. This is probably a first in the history of any startup. Okay, all right. And so this has been a problem, and while we've had a bit of token capacity, we're about to ramp that pretty significantly. And what I came here today, why I came, was to tell you that in 30 days we're going to enable our first large production customers. They've already developed on us, they already have their workloads running, and we're going to have enough capacity to bring on three to six of our largest customers in 30 days. And so the countdown has started. Thank you.