On July 29, a reporter for the New York Times got a first look at Google's latest RT-2 model-driven robot at Google's lab. A one-armed robot stood in front of a table on which sat three plastic figurines: a lion, a whale, and a dinosaur. An engineer gave the robot the command, "Pick up the extinct animals." The robot whirred for a moment, then its arm extended, its claw opened, and it picked up the dinosaur. It was a flash of intelligence.
"Until last week," the New York Times wrote, "this demonstration was impossible. Robots can't reliably manipulate objects they've never seen before, and they certainly can't make the logical leap from 'extinct animal' to 'plastic dinosaur.'"
Although this is still just a demonstration, and Google has no plans to release or commercialize it at a larger scale right away, it is enough to offer a glimpse of the opportunities that large models can bring to robots.
Before the era of large models, robots were typically trained and optimized for a single task. Teaching a robot to grasp a particular toy, for example, required enough data for it to recognize that toy from every angle and under all kinds of lighting before it could grasp it reliably. Even making the robot understand that its task was to grasp the toy required explicit programming. The intelligence and generalization ability of large models point to a way past these problems and toward general-purpose robots.
1. Putting the Transformer to work for robots
Google's new RT-2 model, short for Robotic Transformer 2, uses the Transformer architecture as its base. The Transformer architecture, proposed in 2017, underpins the large language models (LLMs) at the center of the current boom, but as an architecture, the Transformer can be used to train on more than just language. Back in March, Google released PaLM-E, at the time the world's largest visual language model (VLM).
In an LLM, language is encoded as vectors, and researchers feed the model a large corpus so that it can predict what a human would typically say next, which is how it generates language responses. In a visual language model, the model additionally encodes image information into vectors of the same kind as language, so it can "understand" text and "understand" images in the same way. Researchers feed the VLM a large corpus of text and images, enabling it to perform tasks such as visual question answering, image captioning, and object recognition.
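To make the idea concrete, here is a minimal sketch of text and images being mapped into a shared vector space. It uses the openly available CLIP model from the Hugging Face transformers library rather than PaLM-E or PaLI-X, and the image filename and captions are made up for illustration.

```python
# Minimal sketch: encode an image and several captions into the same vector
# space, then ask which caption best matches the image. Uses CLIP, not PaLM-E.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("banana.jpg")  # any local image; filename is illustrative
texts = ["a banana", "a plastic dinosaur", "a can of soda"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Both modalities come out as vectors of the same dimensionality.
print(outputs.image_embeds.shape, outputs.text_embeds.shape)

# Softmax over image-text similarity scores: which caption fits the image?
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```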
Both images and language are relatively easy to acquire in large quantities, so it is easy for such models to achieve stunning results. Using the Transformer architecture to generate robot behaviors, by contrast, runs into a major difficulty. "The data involved in robot actions is very expensive," Prof. Huazhe Xu, an assistant professor at Tsinghua University's Institute for Interdisciplinary Information Sciences, told Geek Park. "Visual and language data come from humans and are passive data, whereas robot action data all has to come from the robot actively acting. For example, if I want to study the action of a robot pouring coffee, whether I write code for the robot to execute or get the robot to perform it some other way, the robot has to actually carry out the action once to produce that data. So the scale and magnitude of robot data is completely different from that of language and images."
With RT-1, its first-generation Transformer model for robots, Google took a first crack at this challenge by trying to build a vision-language-action model. To build it, Google used 13 robots over 17 months in a purpose-built kitchen environment to assemble a dataset of active data covering more than 700 tasks. The dataset recorded three dimensions simultaneously (a rough sketch of one such record follows the list below):
Vision – camera data from the robots as they perform task operations;
Language – task text described in natural language;
and Robot motion – the robot's movements while performing the task, such as displacement along the x, y, and z axes and rotation angles.
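As a rough illustration of what one record combining these three dimensions might look like, here is a hypothetical sketch; the field names, shapes, and values are stand-ins, not Google's actual schema.

```python
# Hypothetical sketch of a single step in an RT-1-style dataset,
# pairing vision, language, and robot motion.
from dataclasses import dataclass
import numpy as np

@dataclass
class RobotEpisodeStep:
    camera_image: np.ndarray     # vision: RGB frame from the robot's camera
    instruction: str             # language: the task in natural language
    arm_translation: np.ndarray  # motion: (dx, dy, dz) of the end effector
    arm_rotation: np.ndarray     # motion: (droll, dpitch, dyaw)
    gripper_closed: bool         # motion: gripper open/closed state

step = RobotEpisodeStep(
    camera_image=np.zeros((256, 320, 3), dtype=np.uint8),
    instruction="pick up the plastic dinosaur",
    arm_translation=np.array([0.01, 0.00, -0.02]),
    arm_rotation=np.array([0.0, 0.0, 0.1]),
    gripper_closed=False,
)
print(step.instruction)
```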
Although good experimental results were obtained at the time, it was easy to see that growing the dataset much further would be very difficult. RT-2's innovation is to use the previously described visual language model PaLM-E, together with another VLM, PaLI-X, as its base: the VLM itself can be trained on web-scale data, where the sheer volume is enough to get good results, and in the fine-tuning phase the robot's action data is added in and everything is fine-tuned together (co-fine-tuning).
In this way, the robot effectively gains a common-sense system learned from a huge amount of data: it may not yet be able to grasp a banana, but it can recognize one, and it even knows that a banana is a kind of fruit that monkeys like to eat. In the fine-tuning stage, adding real-world data about how the robot grasps a banana once it has seen one gives the robot not only the ability to recognize bananas under various lighting conditions and angles, but also the ability to grasp them.
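A toy sketch of the co-fine-tuning idea described above: training batches are mixed from web-scale vision-language data and the much smaller robot action dataset, so the model keeps its web-learned "common sense" while learning to output actions. The data, function names, and mixing ratio are illustrative assumptions, not Google's actual recipe.

```python
# Toy sketch of co-fine-tuning: mix web vision-language samples with robot
# action samples in one training stream. All contents are placeholders.
import random
from itertools import cycle

web_data = cycle([
    ("image of a banana", "a yellow fruit"),
    ("image of a kitchen", "a room with a stove"),
])
robot_data = cycle([
    ("camera frame", "pick up the banana", "dx=0.01 dy=0.00 dz=-0.02 close_gripper"),
])

def co_finetune_steps(num_steps, robot_fraction=0.5):
    for _ in range(num_steps):
        if random.random() < robot_fraction:
            sample = next(robot_data)   # supervises action tokens
        else:
            sample = next(web_data)     # supervises ordinary text tokens
        # In the real system a single VLM would compute the same next-token
        # loss on either kind of sample; here we only show the data mixing.
        yield sample

for sample in co_finetune_steps(4):
    print(sample)
```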
In this way, the data required to train the robot with the Transformer architecture is significantly reduced.
In the fine-tuning phase, RT-2 directly reused the vision/language/robot-action dataset from RT-1's training phase. Data released by Google shows that RT-2 performs just as well as RT-1 when grasping items that appeared in the training data, and thanks to its "brain with common sense", its success rate on items it had never seen before rose from RT-1's 32% to 62%. "That's the beauty of large models," Xu said. "There's no way to break down whether the success rate rose because it recognized that two objects are made of similar materials, are similar in size, or something else. Once it has learned enough, some capabilities simply emerge."
2. The future of using natural language to interact with robots
Academically, the strong generalization RT-2 demonstrates may solve the problem of insufficient training data for robots. Beyond that, what makes RT-2 intuitively impressive is its intelligence. In experiments, when researchers asked it to pick up "something that could be used as a hammer", the robot picked a rock out of a pile of objects; when asked to pick up a drink to offer a tired person, it chose a Red Bull. These skills come from researchers introducing "chain of thought" reasoning when training the large model, a kind of multi-step semantic reasoning that is very hard to achieve in traditional robot imitation-learning research. That said, using natural language to interact with robots did not start with RT-2.
In the past, researchers always had to translate task requirements into code the robot could understand, and write more code to correct the robot's behavior when something went wrong, a process that takes many rounds of interaction and is inefficient. Now that we have highly intelligent conversational AI, the natural next step is to let robots interact with humans in natural language. "We started working on these language models about two years ago, and then we realized that they hold a wealth of knowledge, so we started connecting them to robots," says Karol Hausman, a research scientist at Google.
However, making large models the brains of robots brings its own challenges. One of the most important is the grounding problem: how to turn the often airy responses of a large model into commands that drive the robot's actions. In 2022, Google launched the SayCan model. As its name suggests, it uses two considerations to help the robot act. One is "say": combined with Google's large language model PaLM, it breaks down a task received through natural-language interaction with a human and finds the most suitable action for the current moment. The other is "can": an algorithm computes the probability that the robot can successfully carry out each candidate action. The robot then acts based on these two considerations.
For example, if we say to the robot, "I spilled my milk, can you help me?", the robot first does task planning through the language model; at this point the most sensible plans might be to find a cleaner, or to find a sponge and wipe it up itself. The robot then runs the "can" calculation: as a robot, it has a low probability of successfully finding a cleaner and a high probability of successfully finding a sponge and wiping up the milk itself. After this twofold consideration, the robot chooses to find a sponge and wipe up the milk. In this two-layer architecture, the actions the robot can successfully perform are all pre-designed, and the large language model only helps the robot choose the appropriate task plan. Even so, the robot already displays a strong sense of intelligence.
However, although the results look similar from the outside, RT-2 takes a different path. Because the model learns all three types of data, vision, language, and robot actions, at the same time during training, RT-2 does not decompose the task before acting; instead, natural-language input runs through the model and produces action output directly. "The two-layer structure is like this: when I want to do something, I think of step one and step two, and then execute those strategies one by one," Prof. Xu Huazhe said, "while the end-to-end structure is like this: I don't think especially hard about what step one or step two is, I just do the thing." An example of the latter is the way we type on our phones every day: we generally don't think about exactly how our muscles will move; we think of the word we want and type it.
"Neither route, neither approach, has yet proven itself to be the only correct one," Xu Huazhe said. But given RT-2's strong performance, a technical direction in which one model handles everything from input to output seems worth exploring. "Because of this change (RT-2's excellent performance), we've had to rethink our entire research program," said Vincent Vanhoucke, head of robotics at Google DeepMind. "A lot of what we were doing before has become completely useless."
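The two considerations can be summed up in a toy scoring sketch: a "say" score for how relevant each pre-defined skill is to the request is combined with a "can" score for how likely the robot is to pull it off, and the robot picks the best combination. The skill names and numbers below are invented for illustration.

```python
# Toy sketch of SayCan-style skill selection: combine the language model's
# relevance score ("say") with an affordance score ("can") and pick the best.
instruction = "I spilled my milk, can you help me?"

# "say": how useful the LLM judges each pre-defined skill to be for the request
say_scores = {
    "find a cleaner": 0.50,
    "find a sponge": 0.35,
    "pick up the milk carton": 0.15,
}

# "can": how likely the robot is to succeed at each skill in its current state
can_scores = {
    "find a cleaner": 0.05,
    "find a sponge": 0.80,
    "pick up the milk carton": 0.60,
}

combined = {skill: say_scores[skill] * can_scores[skill] for skill in say_scores}
best_skill = max(combined, key=combined.get)
print(best_skill)  # -> "find a sponge"
```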
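One way to make "direct action output" concrete, reportedly along the lines of what RT-2 does, is to treat robot actions as just another string of discrete tokens the model can emit, which is then decoded back into a continuous command. The bin count, value ranges, and example output below are assumptions for illustration, not RT-2's exact scheme.

```python
# Hypothetical sketch: decode a model's "action as text" output back into a
# continuous robot command. Ranges and the example string are illustrative.
NUM_BINS = 256

def decode_bin(token: int, low: float, high: float) -> float:
    """Map a discrete bin index back to a continuous value in [low, high]."""
    return low + (high - low) * token / (NUM_BINS - 1)

def decode_action(token_string: str) -> dict:
    terminate, dx, dy, dz, droll, dpitch, dyaw, gripper = (
        int(t) for t in token_string.split()
    )
    return {
        "terminate": bool(terminate),
        "translation_m": [decode_bin(t, -0.05, 0.05) for t in (dx, dy, dz)],
        "rotation_rad": [decode_bin(t, -0.25, 0.25) for t in (droll, dpitch, dyaw)],
        "gripper_open": decode_bin(gripper, 0.0, 1.0),
    }

# e.g. the model might emit this string directly as "text":
print(decode_action("0 132 114 128 5 25 156 255"))
```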
3. Is RT-2 the GPT-3 moment for robots?
Google's RT-2 robot is not perfect. In a live demonstration witnessed by a New York Times reporter, it misidentified the flavor of a can of lemon-flavored soda, calling it "orange". On another occasion, when asked what fruit was on the table, the robot answered "white" (it was actually a banana). A Google spokesperson explained that the robot had been answering from cached responses to previous testers' questions because its Wi-Fi had briefly cut out.
Beyond that, training robots with large models inevitably raises the question of cost. Currently, when Google's robots reason and make judgments, they must send data to the cloud, where multiple TPUs do the computation and send the results back to the robot, which then performs the action. Such computation, as one can imagine, is very expensive. Vincent Vanhoucke, head of robotics at Google DeepMind, believes the new research opens the door to robots being used in human environments: the researchers believe robots with built-in language models could go into warehouses, be used in healthcare, and even become home assistants that fold laundry, take items out of the dishwasher, and tidy up around the house.
"If you're running a factory and need to use robots, you'll demand a high success rate. You wouldn't want to buy a robot and then need a lot of people to maintain it and patch up the things it doesn't do well enough; that would be too costly," Prof. Xu Huazhe said. "Home scenarios may be a different situation, because the success-rate requirements for some household tasks aren't as strict. Take folding clothes: if it isn't folded that well, the task may count as a failure in your eyes, but the impact on you won't be very big."
Yann LeCun, one of the "Big Three" of artificial intelligence, has repeatedly asserted that AI is not smart enough: any child can quickly learn to clear the table and put the dishes in the dishwasher, but a robot cannot. That may be true of robotics research today, but just as the imperfect GPT-3 gave the industry a glimpse of where large models could go, perhaps today's imperfect RT-2 will usher in a future in which robots enter the home and become our assistants.