Robotics and GenAI by Tony Zhao (ALOHA, Stanford), Zipeng Fu (ALOHA, Stanford), Cheng Chi (CS PhD, Stanford University)
10:33PM May 29, 2024
Speakers:
Jim Fan
Tony Zhao
Zipeng Fu
Cheng Chi
Keywords:
robot
robotics
task
work
data
hardware
ai
model
sensor
learning
years
human
humanoid
train
collect
tony
hard
solve
project
system
I'll pass it
Hello, hi everyone. I think we can get started on the panel about embodied AI and robotics. Thanks everyone for attending this panel. I am Jim Fan, currently a senior research manager at NVIDIA and the lead of embodied AI. Today I am very happy to have three young scientists here, all of them rising stars in the field, to offer us their unique perspectives. Let me get started by saying that all of us have seen LLMs, the rise of ChatGPT, and foundation models. LLMs can do coding, they can do all kinds of natural language tasks, and they can even generate videos, like OpenAI's Sora. So we can now generate videos and generate language almost at human level. Why can't we control robots at human level? I think the answer is pretty simple: you can download lots and lots of text and data from the internet, and if you want to train a video generation model you can just download the whole of YouTube, but you cannot download control data for robotics from the internet, and that's what makes it incredibly hard. There's this thing called Moravec's paradox, which says that the things humans find easy are extremely difficult for AI, and vice versa. I think robotics is right in the thick of Moravec's paradox: something a five-year-old can do is beyond our most advanced robotic systems, and in fact our most advanced robots cannot even perform at the level of a cat or dog or chimpanzee, let alone a human. So how do we tackle this problem? That is the topic today. Let's get started by welcoming our rising scientists, and maybe everyone can introduce themselves and also briefly talk about how they got into AI. So let's start with
Yeah, so let's start with Tony.
Yeah, thanks for the warm intro. My name is Tony, and I'm currently a third-year PhD student at Stanford. My research has always been about robotics, embodied AI, and machine learning in general. But I actually did not start out doing robotics at the beginning of my research career. I started as a mechanical engineering major in my undergrad, and I wrote my first line of Python in my sophomore year. Then I got really into control theory, which is basically how you control things like airplanes, quadrotors, even robot dogs. But I realized there are limitations to using model-based approaches without learning to control these very intelligent dynamic systems. So I ended up getting into reinforcement learning, which has been sort of the main driving force for humanoid and quadruped locomotion. After that, I came to realize that if we always do reinforcement learning from scratch, or even with some prior, sometimes the prior is the limiting factor instead of the exploration and adaptation. So I ended up digging deeper into a method called imitation learning for my PhD. What it means is that, essentially, you collect a ton of data, for example by teleoperating a robot or using some device like motion capture, and then given that dataset you train a huge generative model to capture what humans typically do, sometimes with text conditioning or image conditioning. That has led to works like ALOHA and, more recently, Mobile ALOHA and ALOHA Unleashed, where we basically just collect a ton of data, train a huge model, and see how far we can push the envelope for robotics. It turns out that tasks people thought were impossible are actually perfectly doable, as long as you scale up data collection. One example is tying shoelaces. People thought that maybe you need an extremely complicated humanoid hand, real-time 100 Hz control, and force feedback. It turns out that when you have a lot of demonstrations, you need none of those. You just need a huge transformer and a lot of data, and that's pretty much it. That's probably the biggest takeaway from my research career: don't underestimate data and scaling, and try not to do the smart thing when you can do the dumb thing of scaling up data and training a larger model. Yeah, so that's my quick intro.
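To make the recipe Tony describes concrete, here is a minimal, illustrative behavior-cloning sketch in PyTorch: a policy maps a camera image plus joint positions to a short "chunk" of future actions and is trained to imitate teleoperated demonstrations. The architecture, dimensions, and loss below are assumptions for illustration, not the actual ALOHA/ACT code.

```python
# Minimal behavior-cloning sketch (illustrative only, not the ALOHA/ACT codebase).
# A policy maps camera images + joint positions to a short chunk of future actions,
# and is trained to imitate teleoperated demonstrations.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ChunkedBCPolicy(nn.Module):
    def __init__(self, proprio_dim=14, action_dim=14, chunk=50, d_model=256):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()          # 512-d image features
        self.backbone = backbone
        self.proj = nn.Linear(512 + proprio_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, chunk * action_dim)
        self.chunk, self.action_dim = chunk, action_dim

    def forward(self, image, proprio):
        feat = self.backbone(image)                       # (B, 512)
        tok = self.proj(torch.cat([feat, proprio], -1))   # (B, d_model)
        tok = self.encoder(tok.unsqueeze(1)).squeeze(1)
        return self.head(tok).view(-1, self.chunk, self.action_dim)

def train_step(policy, optimizer, batch):
    # batch: image (B,3,H,W), proprio (B,14), actions (B,chunk,14) from teleop demos
    pred = policy(batch["image"], batch["proprio"])
    loss = nn.functional.l1_loss(pred, batch["actions"])  # imitation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of the sketch is how simple the learning problem becomes once the demonstrations exist: a single supervised loss on predicted action chunks, with all the difficulty pushed into data collection and data quality.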
Cool. Hi everyone. I'm currently a CS PhD student at Stanford, working with Chelsea Finn, and I'm in the same lab as Tony. My research also focuses on robotics, but more on using learning-based approaches, such as machine learning, reinforcement learning, and imitation learning, to train robots and to get deployable robot systems that can work in the real world. Similar to Tony, I'm not a roboticist or AI researcher by training. I started my undergrad in math, but after I saw the success of ResNet, specifically on object detection and ImageNet, that performance showed the potential of using data, especially data-driven AI, to solve real-world tasks. So I transferred from the math department to CS and then worked on AI, specifically more high-level AI. But as Jim mentioned, one of the biggest challenges in AI turns out to be the low-level intelligence of robots and agents: things like grasping, manipulation, locomotion, even moving around and getting onto this stage, are pretty hard for robots and agents to achieve. So I gradually looked more into how to tackle Moravec's paradox. I was lucky enough to get into the graduate program at the Robotics Institute at CMU, Carnegie Mellon University, where Hans Moravec, who proposed the Moravec paradox, is a professor, and I started working on robotics there, specifically using simulation as a source of data to generate diverse trial-and-error data for training robots. It turned out to be quite a surprise to me that even if the data is generated purely in simulation, purely synthetic data, it can let, for example, legged robots do all kinds of fancy tasks like parkour, jumping, or hopping. These tasks were pretty hard to achieve in the past. For example, Boston Dynamics usually maintains a large team of over 100 people to get a parkour skill working on legged robots, but with a learning-based approach and a massive amount of data, we can train overnight on a GPU, and a pretty low-cost GPU at that, just an RTX 4090. Even if we just use a single GPU to generate a massive amount of data and use a pretty simple trial-and-error technique to let the robot learn, it can let a robot dog achieve pretty impressive tasks in the real world without using real-world data. That was quite a surprise to me. Later I moved to Stanford, to the Bay Area, to start my PhD, and gradually there was another kind of shift in robotics. People started to think that if we're trying to tackle manipulation, specifically fine-grained manipulation and mobile manipulation, which are contact-rich and need to deal with soft objects and fluids that are pretty hard to simulate, then it turns out that even collecting a small amount of robot data and using advances in imitation learning can be pretty efficient at tackling, for example, fine-grained manipulation and mobile manipulation.

Currently I'm working on using both imitation learning and reinforcement learning, large amounts of human data, and large amounts of passive data, using all the data we can, to train robots. It's quite a surprise to me that we don't really need a lot of human background knowledge to engineer every rule and hard-code every program for the robot; letting the robot learn from data can be a pretty efficient way to train robots.
Hello. All right, cool. My name is Cheng. Thank you for coming to the panel today, and thanks, Dr. Jim Fan, for the kind introduction. I recently defended my PhD at Columbia, where I was fortunate to be advised by Professor Shuran Song, who recently moved to Stanford University, so as a result I'm also affiliated with Stanford now. I did my undergrad in Michigan, where I spent a lot of time on autonomous vehicle problems, working on SLAM and mapping and also large-scale data processing for autonomous vehicles. Then I did quite a lot of work in classical, well, it used to be modern, now it's classical, approaches where you separate robotics into different stages, perception, planning, and control, and tackle each of them separately. In the beginning of my PhD I was working on deformable object perception only, so I built models to track cloth, in the hope that they could be used by some robotic manipulation algorithm in the future. But after a few projects I became quite frustrated, because in this paradigm the code base and software system you build are usually very task-specific. Every project, which is like one paper for a PhD student, takes one PhD student one year to build an environment and solve that one task, and as soon as we move on to the next task we throw away the code base, because the algorithm is overfit to that specific task. I just felt it was really frustrating, and I felt like we were not moving forward. Something changed for me in 2022, when generative AI started to work for image generation. That's when DALL-E 2 came out and Stable Diffusion came out. After reading a paper on diffusion models, I realized that they have a very nice property: they can approximate multimodal continuous distributions. And this is exactly the kind of distribution that robotics needs in order to represent a trajectory of actions into the future. So I wrote one of the first papers that, I would say, made diffusion models work with real-time robotic inference and control, a paper called Diffusion Policy. That approach uses a similar paradigm to the imitation learning Tony talked about: pretty much, I teleoperated a robot to do a certain task around 100 to 200 times, which usually takes me two or three hours in an afternoon, then I can train the model overnight and deploy it on the robot the next morning. This paradigm is very different, because traditionally, at least in our lab, we had been trying to avoid collecting data manually, since that's considered cumbersome and inefficient. But after trying to engineer a system that makes the robot self-improve in the real world, and understanding how hard it is to build a robust system in the real world before you even have a robust policy to collect data with, I realized that collecting data manually is not that bad. And it turns out to be a surprisingly general way to express the intent of a human. For example, previously, when I wanted the robot to fold a cloth, I needed to write something like ten terms of a loss function to describe how the cloth should be folded, and it also needed to be efficient to compute and possible to compute automatically as a reward in the real world.

But with imitation learning, I just do the cloth-folding task 10 times, or like 100 times, and then I can get a policy that replicates the same behavior. Overall, I think this could lead to a paradigm shift in robotics, where now we have one shared paradigm: you have raw data, paired action and observation data, you train an end-to-end model, and you deploy the model on a fairly simple robotic system, and it can solve a variety of tasks. I think this is the intersection between robotics and AI, and I'm very positive about the future of this direction.
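For readers curious what "a diffusion model over action trajectories" could look like in code, here is a schematic sketch, assuming a toy MLP noise predictor and a standard DDPM noise schedule. It is not the authors' Diffusion Policy implementation, just an illustration of the idea: train a network to predict the noise added to demonstrated action chunks, conditioned on an observation embedding, then sample actions at test time by iterative denoising.

```python
# Schematic of the diffusion-over-actions idea (illustrative, not the Diffusion Policy code).
import torch
import torch.nn as nn

T = 100                                      # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)        # standard DDPM linear schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

class NoisePredictor(nn.Module):
    def __init__(self, action_dim=7, chunk=16, obs_dim=512, hidden=512):
        super().__init__()
        self.chunk, self.action_dim = chunk, action_dim
        self.net = nn.Sequential(
            nn.Linear(chunk * action_dim + obs_dim + 1, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, chunk * action_dim),
        )

    def forward(self, noisy_actions, obs_emb, t):
        # noisy_actions: (B, chunk, action_dim), obs_emb: (B, obs_dim), t: (B,) step index
        x = torch.cat([noisy_actions.flatten(1), obs_emb,
                       t.float().unsqueeze(-1) / T], dim=-1)
        return self.net(x).view(-1, self.chunk, self.action_dim)

def training_loss(model, actions, obs_emb):
    # Corrupt a demonstrated action chunk with noise and learn to predict that noise.
    B = actions.shape[0]
    t = torch.randint(0, T, (B,))
    ab = alpha_bar[t].view(B, 1, 1)
    noise = torch.randn_like(actions)
    noisy = ab.sqrt() * actions + (1 - ab).sqrt() * noise
    return nn.functional.mse_loss(model(noisy, obs_emb, t), noise)

@torch.no_grad()
def sample_actions(model, obs_emb):
    # Start from Gaussian noise and iteratively denoise into an action chunk (DDPM sampling).
    B = obs_emb.shape[0]
    x = torch.randn(B, model.chunk, model.action_dim)
    for i in reversed(range(T)):
        t = torch.full((B,), i, dtype=torch.long)
        eps = model(x, obs_emb, t)
        a, ab = alphas[i], alpha_bar[i]
        x = (x - (1 - a) / (1 - ab).sqrt() * eps) / a.sqrt()
        if i > 0:
            x = x + betas[i].sqrt() * torch.randn_like(x)
    return x
```

Because the sampler starts from noise, two runs on the same observation can land in different, equally valid modes of the demonstrated behavior, which is exactly the multimodality property Cheng highlights.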
Yeah, thanks to all the panelists for sharing their stories about how they got into robotics and into this fascinating field. It's a bit hard to hear, and I'm not sure if everyone picked everything up, so I just want to do a quick recap. The three young scientists here at this panel collectively did some of the best work in robotics recently. For example, Tony and Zipeng here briefly covered the ALOHA work, which is the famous Stanford cooking robot that you might have seen in the news. And Cheng over there proposed Diffusion Policy, which was the first work to take diffusion models, typically used for generating 2D images, and adapt them to generate actions. Diffusion Policy is now kind of the de facto method for robot learning. So I just wanted to quickly recap some of the essence. My next question is this: all of you have done super amazing work. What is the work that you are most proud of, and what is the main advancement over the closest prior work? Also, in robotics we all know that experiments are not smooth sailing; many times there are very painful experiences in setting up and actually getting these methods to work. So what was the most painful or interesting experience to share from backstage? Let's start with Tony.
Yeah, thanks. I think the project I'm most proud of is the recent release of ALOHA Unleashed. It was my internship project at Google DeepMind, where I spent around nine months working to scale up the ALOHA project. What that project does is that instead of having one robot collect all the data, we now have nine workstations, and we hire contractors to continuously collect data. Think about it as folding your shirt 1,000 times or cracking an egg 1,000 times. What we show is that after we collect this extremely diverse data, we can train a generative AI model, with diffusion and transformers, that can model the diverse ways humans conduct a certain task. Think about tying shoelaces: there are so many ways you can tie a shoelace. There's the bunny ears method; there's the way where you wrap around the other loop and then pull. So it's a really hard distribution to capture, and due to the data constraints the algorithm itself needs to be very efficient. So what we did there is to be very precise about how the data is generated. We have a manual and a video that tell the operators and contractors how to collect the data, and we make sure the data quality is extremely high. After that, we can train just a vanilla generative AI model on the data. What it does is that, given the image, it predicts what a human would do in the following frames, so it can capture all the possibilities of what a human might do, then choose one and commit. And what it shows is that we can do extremely, extremely difficult tasks with very low-cost hardware. One example is using one robot to fix another robot. What that means is, let's say a robot's gripper is broken; we can train another robot to swap the gripper for that robot. This is interesting because the precision required to swap the gripper, there's a very fine dovetail design that needs very precise insertion, actually exceeds the limit of the mechanical structure of the robot we are controlling. Typically this is thought to be just impossible, right? Your robot cannot reach sub-millimeter precision, so how can it do things that are so hard? But it turns out the intuition is very similar to how humans use our own hands. We don't actually know to the millimeter where our fingertip is; we know roughly where it is, but most importantly we have tactile sensors and vision feedback, so we observe where our hand is relative to the object we are manipulating, and we feel the object, and that lets us exceed the absolute precision of our own arm. What we are trying to do here is to see if we can do the same for robots. And it turns out that with enough data and enough randomization of the scene and the robot setup, this phenomenon just emerges. We don't need to do anything fancy: as long as the model has seen robots assembled in slightly different ways, with slightly different backgrounds, when it sees a new one it can actually zero-shot adapt to it. So what we ended up doing for that project is: we can fix robots, we can tie shoelaces, we can hang all kinds of T-shirts onto a clothes hanger and then put it on a rack, and we also did an industrial insertion task, for example inserting a gear, which also needs sub-millimeter precision.

I think probably the hardest part of the project was the operations around data collection, and I was lucky enough to be doing this project as part of my internship at Google, which already has some infrastructure set up to guide the data collectors and to control the data quality. Another difficulty is just hardware reliability. You've probably seen all these demos where the robot seems like it can do pretty much everything: cooking, playing soccer, all kinds of really hard tasks. But there's still a really large gap between a demo that you can make a cool video out of and a super reliable demo that you can show to, say, your peers, and there's another big gap from that to an actual product that you can ship and that actually brings value to people. For us, thanks to the infrastructure support at Google, we were able to spend a lot of resources and time on just perfecting the hardware, to make sure it can run for two days straight without any errors or bugs, and that supports the whole data collection and evaluation of the policy. Yeah.
Yeah. So in my mind I don't have a single favorite project. In my past few years in robotics I have worked on, I would say, two lines of projects. The most recent one is a collaboration with Tony and Chelsea, which leverages the ALOHA technology and applies it to a home robot that can move around, manipulate things, and interact with humans, called Mobile ALOHA. The second series of projects I've been working on is robot dogs, specifically quadrupeds: how to improve the capabilities of these robot dogs so they can achieve what biological dogs can achieve. I'll briefly talk about the first one and then the second one. For the first project, Mobile ALOHA, I think the biggest takeaway for Tony and me is that even with very low-cost hardware, and I mean low cost compared with other manufactured robots being sold on the market, not compared with mature consumer products, if we just assemble a bunch of low-cost hardware and couple it with pretty advanced learning algorithms, the robot can still learn to do complex things, such as holding something with the bimanual arms and putting it inside somewhere, using the whole body to do all kinds of maneuvers like holding a glass of wine, and navigating to an elevator and pressing a very small button. These are things that were pretty hard to imagine even two or three years ago; with the advances in generative AI, specifically transformers and imitation learning, and the high-quality data collection system brought by ALOHA, we can achieve a lot. I think the biggest pain point I had when working on Mobile ALOHA is that a human still needs to collect the data, and I was not as lucky as Tony, who had a bunch of people at Google DeepMind collecting a lot of data for him. For Mobile ALOHA, Tony and I still had to collect the data ourselves, which is quite time-consuming. For one task, we can spend one or two hours collecting 50 to 100 demonstrations to train that task, but using the robots, moving them around, and teleoperating them for one or two hours is still time-consuming. So this is one of the pain points of many current methods and projects that do robot learning or apply generative AI to robotics: grad students still need to do their own data collection. In the future, I hope to see novel or easier ways to collect data in the real world, for example remote teleoperation. If I could work from home, I could teleoperate the robot in the lab without actually staying in the lab for very long periods of time. That's one point where I'm pretty jealous of all the people who work on NLP or computer vision: they can basically work from home all the time, they just need to type on their computers. Robotics is very different. The second series of projects I've been working on uses quadrupeds. I think one of the biggest takeaways is how much we can leverage the current advances in AI and not hard-code all the rules and all the physics into the program.

Without them, we can still achieve pretty capable controllers for these agile robots, and we have been seeing a paradigm shift in the industry: more and more manufacturers, like Unitree, are starting to use reinforcement learning and synthetic data generated from simulation to train their robots and actually deploy the neural networks on the robots. If you buy a robot dog from Unitree, you can actually use the neural network that ships with the robot. That's quite a surprising thing to see, because in the past people thought neural networks were not robust, but it turns out they are actually more robust than all the classical methods. I think one of the main pain points I've been facing with this kind of technique is that there is still a lot of human engineering in how to bake what we want to achieve into the reward function. We need to specify a term that tells the robot to move at a certain speed, add another term that tells the robot to save energy, and add another term telling the robot it cannot do crazy things. This way of specifying the objectives is still, I would say, time-consuming. There are many works, for example from Jim Fan, such as Eureka, that use feedback from a large language model to correct the reward function and automatically adapt it, getting a better and better reward function for the robot to train on. Yeah. So those are the two projects I have been working on.
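As an illustration of the reward engineering Zipeng describes, here is a hypothetical hand-designed locomotion reward with the three kinds of terms he lists: track a commanded speed, save energy, and avoid crazy behavior. The term names, state keys, and weights are made up for illustration, not taken from any specific paper.

```python
# Illustrative sketch of a hand-designed locomotion reward (hypothetical terms and weights).
import numpy as np

def locomotion_reward(state, action, target_speed=1.0):
    # Track a commanded forward speed.
    speed_term = np.exp(-np.square(state["forward_velocity"] - target_speed))
    # Save energy: penalize large joint torques.
    energy_term = -1e-3 * np.sum(np.square(state["joint_torques"]))
    # Don't do crazy things: penalize body tilt and jerky action changes.
    stability_term = -0.5 * np.square(state["body_tilt"])
    smoothness_term = -0.1 * np.sum(np.square(action - state["previous_action"]))
    return speed_term + energy_term + stability_term + smoothness_term
```

Each term and weight here is the kind of manual tuning decision that LLM-in-the-loop approaches such as Eureka try to automate.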
Okay, cool. So I'm really proud of my past two projects, and the first one I want to talk about is Diffusion Policy. I think my biggest takeaway from that project is actually not about how good diffusion models are, but about which part of the robotic system is actually the bottleneck. Robotics has a rich history in imitation learning and behavior cloning going back 20 years, but in the past decade, when we look at end-to-end learning systems for robotics, they have a front end that ingests raw pixel information and a back end that outputs actions. Because of the rapid advancement of computer vision, most researchers believed that the front end, the information ingestion and perception system, was the bottleneck of the system, and therefore a huge amount of effort was put into improving the vision encoder and pre-training that part. But I feel that in the past decade the development of the back end, how the model actually outputs actions onto a robot, has been a little neglected. My observation from Diffusion Policy is that if your back end is not able to represent the distribution you want, then no matter how powerful your front end is, the performance of the entire system will be bottlenecked. I hope this project raises awareness in the robotics community to focus on the other parts of the system, like the back end, maybe IK, maybe some low-level control, the things that could be the bottleneck but are sometimes neglected. The second project I'm really proud of is my recent paper called Universal Manipulation Interface. It's really an open-source hardware project, but what I'm proudest of is that I was able to deploy the first end-to-end manipulation system actually in the wild. The way I achieved that is that I built a 3D-printed gripper with a GoPro attached on top, and because this device is so portable, and I don't need a robot, or to teleoperate a robot, to collect robot-compatible data, I was able to go to a random restaurant and, when I sit down, before the dinner comes, just collect data. Very quickly I was able to collect 1,400 demonstrations all over the Bay Area, in different locations and different settings. Using that fairly small dataset, collected by three people in a week, I can train a model that is able to do a table-setting task: we have a saucer and an espresso cup, and the robot is able to rotate the espresso cup and put it on the saucer, on any table at Stanford. So I was able to push the robot anywhere in Stanford and have it do this task reliably. I think the real unlock here is actually data collection, not the algorithm. Just like Zipeng said, data has been a bottleneck for robotics, and I do think that methods like UMI will unlock a new order of magnitude of data available for robots in the real world. I hope that this foundation dataset will actually lead to a foundation model.

Yeah, thanks to all the panelists for sharing their latest work and insights. My next question is about hardware. This year in particular we have seen a Cambrian explosion of new hardware, ranging from cheap, affordable, and reliable hardware like ALOHA all the way to very expensive hardware like humanoids. We have seen Tesla Optimus folding clothes.
Elon is tweeting all the time about how Optimus is doing. We have also seen the Boston Dynamics humanoids: five years ago they could do backflips, and today they are even weirder. If you have seen their latest robot, it's basically a contortionist; it can rotate its body 360 degrees, and it looks like an episode straight out of Black Mirror. And those robots cost a lot to manufacture and produce. Of course, anywhere in between, we have robot arms, we have robot dogs, and we have more exotic, specialized robot hardware. So my question to the panelists is: what's your view on the evolution of hardware up until now and moving into the future?

All right, Jim, this is a great question. On hardware, I think the development of robots cannot be detached from the advancement of hardware, but I also think the recent wave of end-to-end learning actually unlocks a different set of hardware for robotics. For example, previously, when we wanted to integrate a new type of sensor into a robot, we needed to build a very detailed model for that sensor, and sometimes we even needed to simulate the imperfections and noise of the sensor in the simulator in order to do sim-to-real transfer. But the recent wave of imitation learning and end-to-end learning allows us to use pretty much any type of raw sensor input, be it video, audio, tactile, or some other kind of sensor, without any explicit modeling, because the power of end-to-end learning is that it is able to automatically discover the optimal intermediate representation of the sensor, at least for the task. So what I hope to see is an explosion of cheap and diverse sensors. For example, in the human hand we have tactile sensing, force sensing, and temperature sensing, and beyond humans there is echolocation in bats and dolphins. Actually, a lot of these sensors are available for a couple of bucks on DigiKey. I would want to have a cheap and open-source data collection device that integrates all of these cheap sensors, each below $1, collect all of that data, and let the model figure out which sensors are best to learn from. I hope this kind of new sensor configuration can enable new capabilities for robotics.
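A toy sketch of the "just feed raw sensor streams to an end-to-end model" idea Cheng describes: one small encoder per modality, fused by simple concatenation, so the network itself learns which sensors matter for the task. All module shapes, names, and modalities below are illustrative assumptions, not a specific system's design.

```python
# Toy multi-sensor end-to-end policy (illustrative only).
import torch
import torch.nn as nn

class MultiSensorPolicy(nn.Module):
    def __init__(self, tactile_dim=32, audio_dim=128, action_dim=7):
        super().__init__()
        self.image_enc = nn.Sequential(                 # raw RGB frames
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # -> (B, 32)
        )
        self.tactile_enc = nn.Sequential(nn.Linear(tactile_dim, 32), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 32), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(96, 128), nn.ReLU(),
                                  nn.Linear(128, action_dim))

    def forward(self, image, tactile, audio):
        # Concatenate per-modality embeddings; no hand-built sensor model is needed,
        # the network learns which modalities are informative for the task.
        z = torch.cat([self.image_enc(image),
                       self.tactile_enc(tactile),
                       self.audio_enc(audio)], dim=-1)
        return self.head(z)
```

Adding a new cheap sensor in this setup mostly means adding one more small encoder branch and retraining, rather than writing an explicit physical model of the sensor.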
Yeah, I totally agree with what Cheng said: the cost of sensors is getting lower and lower, and electrical engineers have developed a lot of great sensors, so we can use them for multi-modality in the sensing, perception, and control of the robot. I don't have a strong opinion about whether we should be developing more general hardware platforms, like humanoids, or whether we should focus on more task-specific hardware designs. But given my past experiments and experience, my feeling is that if we want to achieve not just a 90% success rate but a 99% or 99.99% success rate on a specific task, then designing hardware tailored to that task can be very important. All of these foundation models in the digital world, like ChatGPT or image generators, don't really have the high accuracy that is needed for real-world robotics tasks, because the cost of failure in the real world is pretty high for robots. If ChatGPT makes a mistake, maybe we think it's funny, but if a robot makes a mistake, it can actually cause damage, or damage itself. So I would say applying generative AI in the physical world is quite different from applying AI in the digital world. Yeah, so that's my take.
I think for hardware, I'm probably more of a minimalist. I believe in using the fewest degrees of freedom to achieve the most dexterity. People tend to associate dexterity with how complicated the robot hand is, and I think that cannot be further from the truth, because imagine you only have two fingers on each of two hands: you can still do a lot of tasks, like cleaning your house, and even driving with two fingers is not impossible. So I think it's important to have good hardware, but we're currently definitely bottlenecked by the intelligence behind it. The example I just gave shows that because we are so smart, even with only two fingers, or one finger, we can still do a lot of things. Think about a bird: it can use its beak to do a lot of things. So while it is important to have a form that people find familiar, like a humanoid form, so that it is more relatable and can work around humans more easily, I don't think a humanoid form is a necessity for high intelligence or high dexterity. Another example is dexterous hands. You've probably seen Optimus, or the Figure robots, or other robots. I think it's surprising that a lot of those hands look like human hands, with five fingers, one thumb and four other fingers, but the way they function is completely different. One example: our palm can actually bend, but none of those robots has a palm that can deform; it's just a flat piece of metal or plastic. What that means is that if a human does a task by bending the palm, that is simply not replicable on the robot. And I would argue that if we were to replicate, one by one, a humanoid hand that can do everything a human hand can do, with tendons or motors, it would be extremely difficult. So while hardware is important, I think the focus should always be on intelligence and on how to control the hardware we already have.
Yes. So Tony just mentioned he's a minimalist on hardware, and I take a slightly different view: I'm a maximalist on hardware, and that's why we at NVIDIA are working on humanoid robots and building foundation models for humanoid robots. If you recall Jensen's keynote at GTC in March, he unveiled something called Project GR00T, which is what we are working on to build the AI brains for humanoids. I actually agree with most of what Tony said: humanoids are not the best form factor currently, and for many tasks there is much better, more specialized robot hardware. But the human form factor is the most familiar and also the most general-purpose, and we just go maximalist toward the most general-purpose, even though it's harder to control in the short run. So I think it's all about the time horizon. If you want to do business and capture some economic value in the short run, then I don't recommend humanoids; that's not the way to go. But in the long run, we believe in this kind of hardware iteration, and many labs are iterating very fast: Tesla, Unitree, Fourier Intelligence, they're doing this really fast. The cost of a mature product typically trends toward the cost of the raw material, and a humanoid robot only takes about 4% of the metal material that a car requires. A car costs maybe around 30k these days, so it's very possible that humanoids will someday cost even less than that. Not right now: right now they are bulky, expensive, and hard to control. But we believe in maybe some form of hardware scaling law moving into the future. And that gets to my last question to all the panelists: what does solving robotics mean to you? When do you think we are there? What would you need to see to say, okay, this is it, this is the generalist robot we've been chasing? And what's the timeline, what's your estimate, how many years are we away from that?

All right. I think the term "solved" is a very negative word in academia; people don't want things to be solved. But I would say robots can be useful without us solving robotics itself. I think making useful robots that can do the daily chores for you is a simpler task than AGI. It doesn't require that much; maybe you only need, like, five, you can count how many tasks it needs to do, so maybe it's not that hard a problem. Right now it is so hard because we don't have the data, but I think the new paradigm of imitation learning points us toward the direction of getting the right data. With the entire industry pushing in the same direction, it's very likely that within the next two to three years we'll have pretty capable and usable robots that can do more than one specific task, I would say two to three tasks, and I would call those generalist robots already. I mean, they would be very useful: you can deploy them in a vertical scenario, let's say a home, a smart factory, a hotel, or a hospital, and solve, you know, five tasks that pretty much cover the job description of a human. I think that is certainly very valuable, and I think it will be deployed very likely before a full humanoid that can do anything a human can do.
So I'm actually incredibly optimistic about the industry pushing in this direction. I think there will be many intermediate steps between now and when we reach embodied AGI, and I think every single step in that direction can establish a generation of companies that fulfill the needs of real-world problems.
So, yeah, I totally agree with Cheng, so I will be brief. Basically, I would say there are two levels of intelligence for robots. One level is letting the robot solve some specific task. I would say that maybe in one or two years there will be robots able to solve one specific task with a very high success rate without the help of humans, and we have already seen robots like Roombas being deployed and running in daily life, with people using them without any trouble. The second level of intelligence is more like AGI, or general intelligence for robots. I would say that's pretty hard, because robotics is quite different from natural language understanding or computer vision, in the sense that the dimensionality robotics needs to handle, the input dimensionality and the output dimensionality of the robot software, is much larger than in any other application of AI. Also, robots need to run at a certain frequency. It's not like asking the model to make one prediction and that's the end of the day; the robot actually needs to make predictions 10 or 50 times a second. So imagine the amount of data that is needed and that will be encountered by robots. Because of that, I would say that in order to achieve robot AGI, it will take a long time to get the amount of data that's needed and to get the algorithms mature enough to consume that amount of data.
I would say adaptability, the ability to solve a new task, will be one of the core capabilities of a real embodied AI, of solving robotics, because no matter how big the dataset is, there will always be new tasks in the world. For example, IKEA now has a new type of desk: how can the robot learn to assemble it? It will probably need to figure out how to use some new tools or new components, and look at the manual to figure out how to put together this new shelf or table. So I think robotics will not just be about dexterity at an atomic level, at small scales, but also about how to chain those skills together in a way that makes sense. It's almost like reasoning in the context of physical intelligence.
Great. So for me, I think generalist robots may be three years away, whatever that means. At least that's what I tell Jensen; don't hold me accountable. Three years later, if I have failed at solving robotics, you might not see me anymore. So I do believe that in the next three years we will see some major breakthroughs, and as the panelists mentioned, the ability to adapt quickly to new skills, to move from specialist robots to more general-purpose robots that can solve tasks based on, let's say, natural language instructions. I believe a breakthrough will come within the next three years, if not sooner. That being said, I don't think robots will just be deployed and become as ubiquitous as iPhones anytime soon, because you also have the production issue: anything that touches hardware is a very complicated supply chain and logistics problem, and there are also a lot of societal, ethical, and safety concerns. Are people going to accept robots into their daily lives? That is a very tough question to answer. When Steve Jobs first unveiled the iPhone on stage, we had no idea that one day everyone would have this device in their pocket; we had no idea people would accept this new thing. It's the same here: we have no idea whether people will accept robots, and if they do, what kinds of robots they will accept in which regions of the world. It's not clear, and for that I cannot really assign a timeline. But for the research part, I am confident we'll see something major in the next three years. So let's all, you know, look forward to that. And with that, thanks to all the panelists for this amazing discussion and very insightful analysis, and thanks, everyone, for joining us.