Advancements in Embodied AI: Key Technologies, Applications, and Future Trends
4:44PM May 29, 2024
Speakers:
Jim Fan
Keywords:
voyager
agent
robot
eureka
model
ai
minecraft
simulation
cat
transformer
humanoids
humans
era
train
called
humanoid robots
metamorph
data
task
video
Single architecture that does sequence modeling. And the next paper you probably never heard of: it came out in the same year as the transformer paper and was written by the same set of authors. So you see, they have this addictive habit of writing provocative papers. This one also has a very cool title. It's called "One Model To Learn Them All," and it's based on the transformer. It got the idea right, to learn all the problems using one model, but it got the implementation wrong. It uses all kinds of specialized adapters to adapt modality A to modality B. It turns out you don't need any of that; you just need one transformer. So this paper aged like milk.

So with the transformer, what do you do? You tokenize everything. You convert everything into sequences of tokens. You convert a cat into, let's say, image patches, and then you can map it to whatever you want using a transformer. What is a transformer good at? Mapping any modality to any other modality, as sequences. And there's one more extraordinary ability: it can even mix modalities. In one sequence you can have different types of data interleaving, and that will come in very useful later in the talk.

So that's one paper, and the second paper also has a very simple idea. You have an image of a cat, you apply noise to it, just sprinkle noise like salt and pepper until it's not recognizable, and then you reverse this process: you go from noise and you denoise until you get a clean image. This paper is called a diffusion model, a very simple idea, just make an image cleaner iteratively. And these two papers became the backbone of the generative AI era. What are transformers good at? Generating discrete values. The tokens are integers, and transformers map one sequence of integers to another sequence. And what are integers? They can be text, they can be symbols. So transformers are this sequential reasoning engine. And what are diffusion models? They're very good at generating continuous values, like pixels and sound waves, and they do it in parallel: you generate all the pixels at once. This is the rendering engine. So what is discrete plus continuous? That is everything. And we get the building blocks to build everything.
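To make the diffusion idea concrete, here is a minimal sketch of the two processes just described: a forward pass that sprinkles Gaussian noise onto an image, and a reverse pass that starts from pure noise and iteratively denoises. The noise schedule and the `predict_noise` network are placeholder assumptions for illustration, not the recipe of any particular paper.

```python
# Minimal sketch of the diffusion idea: corrupt an image step by step with noise,
# then learn to reverse the process. `predict_noise` is a hypothetical trained denoiser.
import torch

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # cumulative product for closed-form noising

def add_noise(x0, t):
    """Forward process: jump straight to step t by mixing the image with Gaussian noise."""
    noise = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * noise, noise

@torch.no_grad()
def denoise(predict_noise, shape):
    """Reverse process: start from pure noise and iteratively clean it up."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps = predict_noise(x, t)          # network guesses the noise present in x at step t
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # re-inject a little noise
    return x
```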
So now let's enter the generative AI era. This was the old AlexNet paper, a convolutional neural network that maps a cat to one label. But now, with the transformer, we can do open vocabulary. We can not just categorize "cat," we can describe it in natural language. Just see this big paragraph that I generated yesterday with GPT-4o: "This image shows a fluffy white cat," blah, blah, blah. It also tells you that there's a ball and a wall behind, and that the ball is transparent, on a light-colored surface. And what's more interesting is that it also explains why this image is humorous. It says it's humorous and adorable because, apparently, the cat can fit itself into any space smaller than its size. It explains the joke, explains the meme to you. And this is the 1D output from a transformer. How about we invert this and generate 2D images? So I put what GPT-4o generated back into DALL-E, and it's able to generate a pretty good cat. And how about we ramp up the game a bit? We can also generate 3D models of cats, or really anything. Here the prompt can be "an intricate wood carving of a cat wearing a kimono, sitting on a pottery wheel, shaping a clay ball." And what about videos?

Video is also 3D, but it's 2D plus one temporal dimension. And this is OpenAI's Sora: a giant cathedral is completely filled with cats, there are cats everywhere, and a man enters the cathedral and bows before the giant cat king sitting on a throne. Sora takes this combination of transformer and diffusion to the extreme and executes it to perfection. The transformer is the reasoning engine here: it thinks at a high level about what each movement in the video should be, like a movie director. And the diffusion is the renderer: it paints the high-resolution pixels, like a photographer. Sora combines these two into one giant model. And you can even go one dimension up. You can do 3-plus-1D, a 4D video, where you have both the 3D spatial dimensions and the 1D temporal dimension. You can import this kind of 4D video into things like AR and VR and have a magical experience. This work is called Make-A-Video3D, from Meta AI.

So in the generative AI era, we are basically climbing the dimensional ladder: from 0D, which was just outputting a label, to 1D text, to 2D images, 3D models or 2-plus-1D video, and then 4D videos. We climbed this ladder in only seven years, while we were stuck in the first stage, the 0D one, for almost 20 years. And what's next? I'm calling it the neural Unreal Engine. Basically, the next step is to add even more dimensions and generate worlds, fully interactive worlds, with audio, physics, animation, dynamic lighting, you name it, and that will be the pinnacle of gen AI. We're not there yet, and this neural Unreal Engine idea belongs to the next era.

So let's look back at this question, a very profound one: what makes a cat a cat? In 1977, that's how it started, and in 2024, that's how it's going: a model that generates all the complexity and the animated body of a cat in this rich setting. At the beginning of time, we had the classical era, using templates and graph matching. Then we entered the neural era, mapping any modality to any other, but each model was highly specialized. And finally, with a reasoning engine known as the transformer and a rendering engine known as diffusion models, we entered the generative era. Across these three eras, we see a steady increase in the scale of models and the training compute required, and someone's very happy about it. Thanks, boss. Look at how happy he is. I share his happiness. And on the other axis, the diversity of models goes down, because we don't need that many models; we just need two. The models are becoming simpler and more elegant. Basically, the history of AI is a history of unification, unifying all the training pipelines and different architectures into one. So diversity goes down.
And now, if we divide these three eras into four aspects, the data, the learning, the capability, and the model, we get this table. In the classical era, we use little to almost no data, and there's no learning: these are mostly engineered objects. The capability is very brittle; anything outside the templates, good luck. And the models are at most shallow models, or no models at all. Then in the neural era, we have these constrained benchmarks, these small-scale datasets, and maybe ImageNet, at 1.2 million images, is already the largest scale. The learning is mostly supervised, where you have labeled data, the capabilities are more like domain specialists, and the models are deep models. In the next era, the generative era, we open up the data to open vocabulary: any words, not just 1,000 categories, anything is fair game. We scale to internet scale using internet text; all of Wikipedia is too small, so all of YouTube, all of everything. The learning is now self-supervised: transformers predict the next word, and you don't need a human to label the next word, it's already in the sentence. For diffusion models, you just need a collection of images. The capability is that we can now do domain-general prompting; you can tell the model what you want in natural language. And we call them foundation models.

And now, what is the next era? I know you're attending a GenAI summit, but let's talk about what's beyond the summit, and that is the agentic era. On the data side, we want infinite data now, because the internet is too small for us; the internet is not enough data. We want infinite data, and we get it by using worlds: interactive environments, in simulation or in the real world. And we use self-exploration and self-bootstrapping methods to learn autonomously in these worlds. So we don't collect data; humans don't curate data and send it to GPT. The agent collects data itself. The core capability here is decision making. And what is the model? That's the next chapter of the talk. So this is the agentic era, and it's characterized by a world and an agent: the world sends the agent a bunch of tokens as sensory input, as perception, and the agent sends back a bunch of tokens as actions.
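As a mental model of that token exchange, here is a tiny sketch of the agentic loop, with `World` and `Agent` as hypothetical interfaces rather than any specific library: the world emits observation tokens, and the agent replies with action tokens.

```python
# A minimal sketch of the agentic loop: observations in, actions out, repeat.
from typing import Protocol, Sequence, Tuple

class World(Protocol):
    def reset(self) -> Sequence[int]: ...
    def step(self, action_tokens: Sequence[int]) -> Tuple[Sequence[int], bool]: ...

class Agent(Protocol):
    def act(self, obs_tokens: Sequence[int]) -> Sequence[int]: ...

def run_episode(world: World, agent: Agent, max_steps: int = 1000) -> None:
    obs = world.reset()                 # sensory tokens coming from the world
    for _ in range(max_steps):
        action = agent.act(obs)         # the agent replies with action tokens
        obs, done = world.step(action)  # the world advances and emits new tokens
        if done:
            break
```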
For me personally, my journey in the agentic era started in 2016, when I was taking a class at Columbia University. I wasn't paying attention to the lecture; instead, I was watching a board game on my laptop. And it wasn't just any game, it was a very, very special one. This screenshot right here is the moment when DeepMind's AlphaGo beat Lee Sedol, the reigning champion at the game of Go. The AI had just won three matches out of five, the first ever to make this achievement. Unlike chess, Go has a much larger search space, and it's more intuitive than rule-based, so it's impossible to use classical methods; AlphaGo is an end-to-end neural network that plays Go. I still remember the adrenaline that day, of seeing history unfold, of seeing AI agents finally coming to the mainstream. But when the excitement faded, I realized that, as mighty as AlphaGo was, it wasn't that different from Deep Blue 20 years earlier. It can only do one thing and one thing alone. It is an agent, but it's not able to play any other games like Minecraft or Dota, and it certainly cannot do your dirty laundry or dishes.

But what I truly want are AI agents as versatile as WALL-E, as diverse as all the robot forms, or what we call embodiments, in Star Wars, and working across infinite worlds, virtual or real, as in Ready Player One. So how do we get there, possibly in the near future? This is your hitchhiker's guide to the agentic era. Most of the ongoing research efforts can be laid out across these three axes: the number of skills an agent can do, the body forms or embodiments it can control, and the realities it can master. Here is where AlphaGo is, but the upper right corner is where we want to go. I've been thinking for most of my career about how to travel across this galaxy of challenges towards that upper right corner. Earlier this year, I had the great fortune to establish the GEAR lab with Jensen's blessing. I'm very proud of the name: GEAR stands for Generalist Embodied Agent Research. I also had the honor of co-founding the GEAR lab with my longtime friend and collaborator, Yuke Zhu. This is a picture we took almost eight years ago at Stanford, when we were still PhD students in Fei-Fei's lab. At that time we did hackathons all the time, especially before deadlines, when we were most productive. Just look at how young we were. What did the PhD do to me? The pursuit of AGI is all-consuming; it's pain and suffering.

All right, so let's go back to first principles: what essential features does a generalist agent have? First, it needs to be able to survive, navigate, and explore an open-ended world. AlphaGo has a singular goal, to beat the opponent, and it's not open-ended. Second, the agent needs to have a large amount of pre-trained knowledge, instead of knowing just a few things for a specific task. And third, a generalist agent, as the name implies, must be able to do more than a few tasks; ideally it should be infinitely multi-task. You give it a reasonable instruction, and the agent should be able to do whatever you want.

Now, what does it take? Correspondingly, there are three things that are required. First, the environment needs to be open-ended enough, because the agent's capability will ultimately be upper-bounded by the environment's complexity. The planet Earth we live on is a perfect example: Earth is so complex that it allowed an algorithm called natural evolution, over billions of years, to create all the humans in this room. So can we have a simulator that is essentially a lo-fi Earth, but one we can still run on our lab computers? Second, we need to provide the agent with massive data, because it's not possible to explore from scratch; you need some common sense to bootstrap the learning. And third, once we have the environment and the data, we need a foundation model powerful enough to learn from these sources. This train of thought lands us in Minecraft, the best-selling game of all time. For those who are unfamiliar, Minecraft is a procedurally generated 3D voxel world, and in this game you can do whatever your heart desires. What's special about this game is that Minecraft defines no score to optimize and no storyline to follow, and that makes it very well suited as a truly open-ended AI playground. As a result, we see some very impressive things, like this one: someone built the Hogwarts castle block by block in Minecraft and posted it on YouTube. And then someone else apparently had nothing better to do.
They built a functional neural network inside Minecraft, because Minecraft supports logic gates and is actually a Turing-complete game. And I want to highlight a number: Minecraft has 140 million active players. To put this number in perspective, that is more than twice the population of the UK. And it just so happens that gamers are generally happier than PhDs, so they really like to stream online and share, and this huge player base produces an enormous amount of online knowledge every day. So can we tap into this big treasure trove of data? We introduce MineDojo, a new open framework to help the community develop general-purpose agents, using Minecraft as a kind of primordial soup. MineDojo has three parts: a simulator, a database, and an agent. We designed the simulator API to unlock the full potential of the game for AI research: we support RGB and voxels as the sensory input, and you can do keyboard and mouse actions as the output. MineDojo can be customized in every detail; you can control the different terrains, weathers, and monster spawning, and it supports creative tasks that are free-form and open-ended. For example, we want the agent to build a house, but it's the same question: what makes a house a house? It's impossible to use simple Python code to define that you have achieved building a house, and the best way to do it is to learn from data, so that the actual concept of a house can be captured.
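For a flavor of how the simulator part might be driven, here is a rough sketch assuming a gym-style interface like MineDojo's; the task ID, observation contents, and action sampling below are illustrative, and the exact API may differ, so check the actual MineDojo documentation.

```python
# A rough sketch of driving an open-ended Minecraft simulator such as MineDojo
# through a gym-style loop. Details are illustrative, not the exact MineDojo API.
import minedojo

env = minedojo.make(
    task_id="harvest_wool_with_shears_and_sheep",  # illustrative free-form task
    image_size=(160, 256),                         # RGB frames as sensory input
)

obs = env.reset()
for _ in range(100):
    action = env.action_space.sample()             # stand-in for keyboard/mouse actions
    obs, reward, done, info = env.step(action)
    if done:
        break
env.close()
```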
Next, we collect an internet-scale knowledge base of Minecraft across three parts. The first is video: because Minecraft is one of the most streamed video games online, we're able to collect 300,000 hours of videos with more than 2 billion words of English transcripts. We also collected Minecraft wiki pages, more than 7,000 of them, and then the Minecraft forum, a subreddit where people go to ask for help, like a Stack Overflow for the game. Here's a peek at our MineDojo wiki dataset. Can you believe that someone listed all the thousands of recipes in Minecraft, page by page, on this wiki? They just have a lot of time to kill. But I'm not complaining, because it's all data for our models. So please, please, do more.

Now, how do we train this foundation model? How do we use the data that we collected? We have a very simple idea in the original MineDojo paper. For our YouTube dataset, we have time-aligned video clips and transcripts, and these are actual tutorial videos. Like here, in text prompt three: "as I raise my axe in front of this pig, there's only one thing you know is going to happen." Some YouTuber actually said this in a tutorial. Now, with this time-aligned video and text, we can first train a pair of encoders to map the video and the transcript into a vector space, and then we can train the embeddings by contrastive learning, which is essentially pulling together the video and text that match and pushing apart those that don't. This pair of encoders is what we call the MineCLIP model, based on the idea of CLIP from OpenAI. Intuitively, MineCLIP learns the association between a video and the transcript that describes the action in the video, so it outputs a score between zero and one, where one means a perfect description and zero means the text is irrelevant to the video. This essentially becomes a language-prompted foundation reward model.

So let's see MineCLIP in action. Here's an agent interacting with the simulator. The task is written in natural language: "shear sheep to obtain wool." As the agent explores, it generates a video snippet, which is then encoded and sent to the MineCLIP model to compute a score. The higher the association, the higher the score, and the more we know that the agent's actions in the video are aligned with the text instruction. That score becomes the reward function for any reinforcement learning algorithm. If this looks familiar, that's because it's essentially reinforcement learning from human feedback, RLHF, in Minecraft. RLHF is the cornerstone algorithm that made ChatGPT possible, and I believe it's also going to play a big role in building generalist agents. We also see that the model generalizes zero-shot to different weathers and lighting conditions, because it's been trained on internet data. And here are some demos of our learned agent behavior on various tasks.
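Here is a minimal sketch of that recipe: a contrastive, CLIP-style loss that pulls matching video and transcript embeddings together, and a reward function that maps the similarity between the agent's recent clip and the instruction to a score in [0, 1]. The encoders are placeholders, and this simplifies the actual MineCLIP training.

```python
# A minimal sketch of the MineCLIP idea: contrastive video-text training, then
# reuse the similarity score as a language-prompted reward. Encoders are placeholders.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (batch, dim) embeddings of aligned clip/transcript pairs."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature    # pairwise similarities
    targets = torch.arange(len(logits))                # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def language_reward(video_encoder, text_encoder, clip, instruction):
    """Score in [0, 1]: how well the agent's recent video matches the instruction."""
    with torch.no_grad():
        v = F.normalize(video_encoder(clip), dim=-1)
        t = F.normalize(text_encoder(instruction), dim=-1)
    return ((v * t).sum(-1) + 1) / 2                   # map cosine similarity to [0, 1]
```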
Now let's put MineCLIP on the map. It's able to do a lot of tasks, but one limitation is that you have to manually decide on a task prompt and then run training every time for a different prompt, and the agent isn't able to discover new things to do. So can we do better? We did do better in 2023, when GPT-4 came out, a language model that is so good at coding and planning, and so we built Voyager, an agent that massively scales up the number of skills. When we set Voyager loose in Minecraft, it's able to play the game for hours on end without any human intervention. This video is a single episode of Voyager, where it explores the terrain, mines all kinds of materials, fights monsters, crafts hundreds of recipes, and unlocks an ever-expanding tree of skills. How are we able to do this? The key insight is coding as action. We convert the world into a textual representation, thanks to a Minecraft JavaScript API open-sourced by the community, and Voyager invokes GPT-4 to generate a code snippet, where each code snippet is a skill. And yet, just like human engineers, Voyager isn't always able to get it right on the first try, so we have a self-reflection mechanism to help it improve. There are three sources for self-reflection: the JavaScript execution error, the agent state, like hunger and health, and the world state, like the landscape and enemies nearby. Once a skill becomes mature, Voyager stores the program in a skill library. You can think of the skill library as a code base authored entirely by GPT-4 through trial and error. In the future, when Voyager faces a similar situation, it doesn't have to redo the trial and error; it just retrieves the skill from the library and executes it. In this way, Voyager bootstraps its own capability recursively as it explores Minecraft.

One question still remains: how does Voyager keep exploring indefinitely? We give Voyager a high-level directive, which is to find as many unique items as possible, and then Voyager itself implements a curriculum to find progressively harder and novel challenges to solve. Putting all this together, Voyager is not only able to master skills but also to discover new ones along the way. For the skills I show here, we didn't program any of this; it's all Voyager's idea. What you see here is called lifelong learning, where an agent is forever curious and forever pursuing new adventures. Here are two bird's-eye views of the Minecraft map, and the biggest orange circles are the distances that Voyager travels compared to many baseline methods. It likes traveling so much, and that's why we gave it the name Voyager. Compared to MineCLIP, Voyager is able to pick up a lot more skills, but still, it only learns how to control one body.
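Here is a schematic sketch of the Voyager loop just described: an LLM proposes a task, writes a JavaScript skill, self-reflects on execution errors and on the agent and world state, and banks mature skills in a retrievable library. Every helper name below is hypothetical, not the actual Voyager codebase.

```python
# A schematic sketch of the Voyager loop; all helpers are hypothetical stand-ins.
def voyager_step(llm, env, skill_library, directive="find as many unique items as possible"):
    task = llm.propose_next_task(directive, env.agent_state(), skill_library.summary())
    similar_skills = skill_library.retrieve(task)            # reuse past trial-and-error
    code = llm.write_skill(task, similar_skills)              # a JavaScript snippet as the action
    for _ in range(4):                                        # bounded self-reflection rounds
        result = env.execute(code)
        if result.success:
            skill_library.add(task, code)                     # the skill graduates into the library
            return True
        feedback = {
            "execution_error": result.error,                  # e.g. a JavaScript stack trace
            "agent_state": env.agent_state(),                 # hunger, health, inventory
            "world_state": env.world_state(),                 # nearby blocks, enemies
        }
        code = llm.revise_skill(task, code, feedback)         # self-reflection on the three sources
    return False
```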
So can we scale along a different axis and have a single model that works across different body forms? Enter MetaMorph, a project that I co-developed with Stanford researchers. We created a foundation model that works on not just one but thousands of robots with different arm and leg configurations, and MetaMorph has no problem adapting to the extremely varied physical structures of these robots. Here's the intuition: we develop a vocabulary to describe the robot body parts, and each body then becomes a sentence written in this language of the robot body. More specifically, we can express each kinematic tree as tokens, and the resulting sequence, read as a sentence, describes the morphology and kinematic properties of the robot. Now you can have different robots with different numbers of joints and configurations, but the tokenizer doesn't care; it's all converted to tokens. And what's the knee-jerk reaction from AI researchers? We see sequences, we reach for a transformer, because attention is all you need.

So we apply a big fat transformer to control these robots, and this transformer is called MetaMorph. But unlike ChatGPT, which writes out text, MetaMorph writes out motor controls for each joint in the body. Because we want to learn a universal policy that works across morphologies, we batch together all the robot sentences and train a big multi-task network. Whatever the robot looks like, it's always compatible, because at the end of the day it's just a sequence written in a different language, and we train these in parallel using reinforcement learning. We teach the robots to walk through flat terrains and complex terrains, and also to navigate obstacles. In our experiments, we see that MetaMorph can control thousands of different robot forms, and more interestingly, it's able to generalize to robot forms never seen during training. Now let's extrapolate into the future: if we expand the robot vocabulary to include more complex configurations, then I envision that one day MetaMorph 2.0 will be able to generalize to robot arms, dogs, different types of humanoids, and even beyond. Compared to Voyager, MetaMorph takes a big stride towards multi-body control.
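As a rough sketch of that recipe, here is a toy morphology-conditioned policy: each limb becomes a token carrying its geometry and proprioception, a transformer encoder reads the whole robot sentence, and a head emits one motor command per joint. The feature sizes and layout are illustrative assumptions, not the exact MetaMorph architecture.

```python
# A toy sketch of a morphology-as-a-sentence policy: one token per limb,
# one transformer, one motor output per joint token.
import torch
import torch.nn as nn

class MorphologyPolicy(nn.Module):
    def __init__(self, limb_feat_dim=32, d_model=128, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(limb_feat_dim, d_model)        # limb token to embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.motor_head = nn.Linear(d_model, 1)               # one torque per joint token

    def forward(self, limb_tokens, pad_mask=None):
        # limb_tokens: (batch, num_limbs, limb_feat_dim) holding geometry, joint type,
        # and proprioception per limb; robots with different limb counts are simply
        # sequences of different lengths, padded and masked.
        h = self.encoder(self.embed(limb_tokens), src_key_padding_mask=pad_mask)
        return self.motor_head(h).squeeze(-1)                 # (batch, num_limbs) motor commands
```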
Now it's time to take things to the next level and transfer skills across the boundaries of realities. Enter Isaac Sim, NVIDIA's simulation initiative. Isaac Sim's greatest strength is running physics simulation at 1,000 times faster than real time, or even more. Here you see 1,000 robot hands training in parallel, and this character learns these impressive martial arts skills by going through ten years' worth of intense training in only three days of GPU compute time. This is very much like the sparring dojo, the virtual dojo, in the movie The Matrix, and we're doing it in reality. And this race car scene is where simulation has crossed the uncanny valley. Thanks to hardware-accelerated ray tracing, we can now render very complex worlds with breathtaking levels of detail, and the photorealism here can help train the computer vision models that will become the eyes of AI agents. What's more, in Isaac Sim we can procedurally generate infinite worlds, and no two worlds will look the same. And here is an interesting thought: if an agent is trained in 10,000 different simulations, it may as well just generalize to the 10,001st reality, which is our physical world. Just let that sink in.

Now, what new capabilities can this fantastic simulation tool unlock? This is Eureka, an agent that achieves robot dexterity at a superhuman level. Well, perhaps not better than all humans, but at least better than me. You see the pen spinning here; I had given up on this trick a long time ago, since my childhood, but apparently the robot I train can do it better. The idea is that Isaac Sim has a Python API to construct training environments, like this five-finger robot hand. We also assume that there is a success criterion; for example, we can check whether the pen has reached a specific orientation, so we know whether the task succeeded. With these, the first step of Eureka is to pass the environment code and the task description as context to GPT-4, where the task is written in natural language: make the hand spin the pen to target positions. Then Eureka samples a reward function, which is basically a very fine-grained signal that guides the neural network's behavior towards the good solution you want. Normally this reward function is engineered by humans, actually expert humans who are very familiar with physics simulation, and it's a very tedious and difficult process. Now Eureka can automate it. Once we have the reward function, we run reinforcement learning to maximize it through lots of trial and error, and it only takes 20 minutes to finish a full training run. Putting it all together, GPT-4 generates a bunch of reward function candidates, each one goes through a full reinforcement learning training run, and then Eureka passes the automated feedback back to the language model and asks it to self-reflect on the results, so that it can propose better reward functions to better solve the problem, and rinse and repeat. You can think of it as a kind of in-context evolutionary search. Compared to expert humans, we actually found that Eureka can even outperform some of the experienced NVIDIA engineers who wrote Isaac Sim. You see here, it can teach the robot hand to spin the pen along different axes, and it can even spin the pen in reverse, when gravity is against you. These are really hard; I've written reward functions before and wanted to pull my hair out, but Eureka is able to discover different reward functions for each of these tasks. And we did all of the first paper in simulation only.
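Here is a schematic sketch of that outer loop, with hypothetical helpers standing in for the LLM calls, the RL training run, and the success evaluation; it is meant to illustrate the in-context evolutionary search, not to reproduce the Eureka implementation.

```python
# A schematic sketch of the Eureka-style search over reward functions.
def eureka_search(llm, env_source_code, task_description, generations=5, samples_per_gen=8):
    best_reward_fn, best_score, feedback = None, float("-inf"), None
    for _ in range(generations):
        candidates = [
            llm.write_reward_function(env_source_code, task_description, feedback)
            for _ in range(samples_per_gen)                    # in-context "mutations"
        ]
        results = []
        for reward_fn in candidates:
            policy = train_with_rl(reward_fn)                  # full RL run, e.g. ~20 min in sim
            score = evaluate_success_rate(policy)              # task-level success criterion
            results.append((score, reward_fn, policy.training_stats()))
        results.sort(key=lambda r: r[0], reverse=True)
        top_score, top_fn, top_stats = results[0]
        if top_score > best_score:
            best_score, best_reward_fn = top_score, top_fn
        feedback = llm.reflect(top_fn, top_stats)              # self-reflection guides the next round
    return best_reward_fn, best_score
```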
The next question is: how do we transfer from sim to the real world? Here there's an idea called domain randomization, which is basically this: if you train in 10,000 different simulations with different gravity, friction, object weights, and sizes, it may very well generalize to our real world, because the real world now becomes just another part of your simulation distribution. And this is what we do: we take the original Eureka and simply ask GPT-4 to also sample the domain randomization configurations for us, in the hope that it transfers to the real world. And it does transfer. Here on the left-hand side is a simulation of the robot dog running forward, and we're able to zero-shot transfer to the real world: that's a real robot dog also running forward. This is a hand rotating cubes, and that's a hand rotating cubes in the real world. We did not do the pen spinning in the real world, not because of the algorithm; the algorithm is okay, but the hardware is the limiting factor. There is actually no real five-finger hardware hand in the world with enough force and agility to spin a pen, so we're still waiting for the hardware providers to catch up with Eureka. And this is the most impressive demo we showed: we train our robot dog to walk on a yoga ball in simulation. We cannot really simulate the bouncy, deformable yoga ball in sim, so we just randomize and randomize, and Eureka is able to find a very good solution that we transfer to the real world. This is my collaborator from UPenn walking the robot dog there on the street, an episode right out of Black Mirror. And because it's using domain randomization, DR, we call it DR Eureka.

It's worth noting that Eureka is a general-purpose method that bridges the gap between high-level reasoning and low-level motor control. Eureka uses a paradigm that I call the hybrid gradient architecture, where an LLM writes a reward function and does the high-level reasoning, and the reward function instructs another, smaller neural network through reinforcement learning. This is the dual-loop design we use to leverage the help of LLMs to train robots, and DR Eureka just changes the reward function and adds the sim-to-real configuration. I envision that someday Eureka++ will be able to program the task, the embodiment, and even entire simulations for me, so that I can go on a vacation while Eureka++ does all the development for me and automates the entire robotics pipeline, and when I come back, the robots are trained. Don't tell Jensen that. So in this sense, the Eureka work isn't really a point on the map; it's a force vector that can push the frontier along any axis, because it's a general-purpose algorithm that simply does coding. As we progress through the map, we'll eventually reach a single model that generalizes across all three axes, and I call that the foundation agent. That's the upper right corner, the ultimate destination of our hitchhiking across the galaxy. And I believe training foundation agents will be very similar to ChatGPT: all language tasks can be expressed as text in and text out, and ChatGPT simply trains on that by scaling up across lots and lots of text. Very similarly, the foundation agent takes as prompt an embodiment specification and a language instruction, and then it outputs actions.
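Purely as an illustration of that prompt-and-act interface, here is a tiny sketch; every class and method name below is an assumption made for this example, not a real API.

```python
# A tiny sketch of a foundation-agent interface: embodiment spec + language
# instruction as the prompt, actions as the output, in a closed loop.
from dataclasses import dataclass
from typing import Sequence

@dataclass
class EmbodimentSpec:
    name: str                       # e.g. "five_finger_hand", "quadruped", "humanoid"
    joint_names: Sequence[str]      # kinematic structure, to be serialized into tokens upstream

def rollout(foundation_agent, env, embodiment: EmbodimentSpec, instruction: str, steps: int = 500):
    """Prompt with an embodiment spec plus a language instruction, then act in a closed loop."""
    obs = env.reset()
    for _ in range(steps):
        action = foundation_agent.act(embodiment, instruction, obs)  # actions for this body
        obs, done = env.step(action)
        if done:
            break
```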
And we simply scale it up massively across lots and lots of realities, and the foundation agent is the next chapter for our GEAR lab. I know I'm running out of time here, so I will quickly go through one more thing: in March, Jensen announced Project GR00T at the NVIDIA GTC keynote, and it's a cornerstone initiative on our roadmap. The mission is to create a foundation model for humanoid robots. This is a scene from Jensen's GTC keynote where a lot of our humanoid robot partners showed up on stage. And why humanoids? Because it is the most general-purpose form factor. The world is designed around the human form: all our tools, equipment, houses, and restaurants are designed for humans. In principle, given good enough humanoid hardware, it should be able to do anything that a reasonably untrained human can do, so we just aim for the most general-purpose hardware. And why now? Because we see the manufacturing cost of humanoid robots dropping exponentially over time. In 2001, the NASA humanoid cost north of $1.5 million, and just two weeks ago, Unitree announced their G1 robot that will cost only $30,000, roughly the price of a car. The manufacturing cost of mature products tends to approach the cost of the raw materials, and humanoid robots take only about 4% of the metal required to build a car, so in principle the cost can go down even further.
And here I just want to show a snippet from Jensen's keynote. This is the view of the lab I work in, the GEAR lab, and these are the four humanoids that partnered with us for the GTC keynote. The fun story was that one of our partner companies drove a car here with a trunk full of body parts, because the humanoids break so often; they're still quite brittle. They changed something like three different hands and two heads during the demo, so it's just a bag of body parts. And computer scientists are the creepiest people on Earth. This is one of my favorite pictures from GTC, taken in front of NVIDIA's headquarters. By the way, the NVIDIA headquarters building is called Voyager; I wonder why. Here we see Apptronik, Fourier, Agility, and Unitree, the four companies, and just look at how happy these new species of humans are.

At a high level, we're building the GR00T model to take multimodal instructions as input and then control a humanoid robot in a loop. Here I'll quickly show some demos. This is a video instruction, and this is the GR-1 robot from Fourier Intelligence imitating the human from a raw video and dancing, and it's able to keep its balance. And this is the GR00T model learning from a human teleoperator's demonstration trajectory: we do cold-pressed juicing, and we did a lot of cleaning, a lot of cleaning, you wouldn't believe it. And of course, we'll use simulation. Before NVIDIA was an AI company, it was a graphics company, so simulation is actually our stronghold. We use Isaac Lab to run lots of different environments and tasks for the humanoids, and we train different kinds of humanoids in this massively parallel simulation, where three days of wall-clock time is equal to ten years. And we hope that what is trained in simulation will transfer to the real world.

So back to this question: what makes a cat a cat, the most profound question ever, and what does it mean in the agentic era? Well, we don't have any robots that even approach cats in agility. Cats are these embodied agents that are phenomenal in their sensorimotor loop; they are so reactive, even faster than humans. So can we have robots someday that are as good as cats, or even better? Can we have humanoid robots pet the cat for you, feel its fur through the tactile senses on their fingers, adapt to the cat's movements, and even sense its emotions through vision and sound, so that one day our humanoid robots will continue to serve our cat overlords on behalf of us humans? Now, zooming out, I believe in a future where everything that moves will eventually be autonomous, and Project GR00T and humanoid robots are only the first chapter. One day, we'll realize that the agents across WALL-E, Star Wars, and Ready Player One, no matter whether they're in the physical or the virtual world, will just be different prompts to the same foundation agent. And that, my friends, is the North Star of our quest for AGI, and I welcome you all to join me in the agentic era.
Thanks.
Thank you, Jim, thank you so much. Really insightful; we enjoyed your speech. I'm so happy today because I know you are the keynote speaker and also a Columbia graduate. So today we're going to ask you some questions, and if any of you have a question, you can ask as well, and also ask Grace. We're going to take two open questions before the next speaker, our mayor, London Breed. So now the question is: how is AI evolving in industry versus academia?
That is a great question. I think currently industry is leading AI, because all of these new advancements require a massive amount of compute, but I still believe academia can make very meaningful contributions, because academia is more open-minded and willing to take more risks. And actually, for things like robotics, there are so many principles that we haven't discovered yet. For LLMs, we have a recipe, and it's called ChatGPT or Llama, and we know how to train LLMs. But for foundation agents, this is very much an early stage; we're just entering the agentic era, and there are many ideas and many new inspirations that are still required, and I think academia is very well positioned to deliver those inspirations. So I believe in academia, and I hope all the students here will be able to participate in the agentic era, and also get your hands dirty, because many of the tools I mentioned, like Isaac Lab, the simulation and everything, are open access, so you can download them and run these experiments. And for the Voyager, MineDojo, and Eureka work that I just showed, all of it is open source.
Wonderful. So, any questions from the audience? We can give the microphone to Grace, and then we can take questions from the audience; we're going to take two questions. I know people are a little bit shy today. So, how's your stock? Congratulations on your stock, it has been doing very, very well. Before the audience questions, I have another question for you: we have a K-12 initiative, so what is your advice for students who are learning AI?
Yes, I think the most important thing is to get your hands dirty. Nothing replaces that; you need practice, and there are so many resources online, open-source code, and also very powerful models like Llama. Thanks to Meta open-sourcing it, you can use some of the best LLMs even on a very simple lab computer. So definitely get your hands dirty, and also learn as much as possible from all the tutorials and everything online, but nothing replaces practice. That would be my advice.
Thank you very much. Thank you. Okay, we have a guest right there. Please ask the question.
How's it going? My name is Dominic Russo, formerly an advanced robotics engineer at Amazon and also Tesla. Question for you: where do you see robotics going in the next five to ten years, as far as humanoid robotics? I know Figure AI, OpenAI, etc.
Where do you see that going?
Yeah, that's a great question. So the cost of humanoids will definitely go down, and as I mentioned in the talk, it's dropping exponentially, trending towards the cost of raw materials, so that's number one. But actually, I don't really believe that hardware is going to be the bottleneck. I think the hardware is not good enough right now, but it will be good enough very soon, and it's improving at an accelerating speed. I think for robotics, the hard part, the bottleneck, is actually the AI brain, because no one has really figured out the recipe for the foundation agent. I have some ideas on how to do it, but these are all very early explorations, and even if you have thousands of GPUs to scale up, what do you scale up, exactly? For ChatGPT, you scale up on text. But for the foundation agent, do you scale up on simulation? Do you scale up on lots of internet data? Or do you scale up on human-collected teleoperation data on the real robot? It's actually not clear how to do it. So I think the AI will become the limiting factor, and whoever figures out the AI, the first players to figure it out, will capture a massive market. Using Jensen's words, I really like his phrasing: he calls humanoid robotics a zero-trillion-dollar business. It's zero right now, but it will be a trillion dollars in the future, and NVIDIA is very, very eager to do zero-trillion-dollar businesses.
So thank you so much. Does anybody have more questions? No? Okay, thank you. Thank you very much for this great speech. If anyone has one more question, you can still ask.