Google DeepMind Announces LLM-Based Robot Controller RT-2

Google DeepMind recently announced Robotics Transformer 2 (RT-2), a vision-language-action (VLA) AI model for controlling robots. RT-2 uses a fine-tuned LLM to output motion control commands. It can perform tasks not explicitly included in its training data and improves on baseline models by up to 3x on emergent skill evaluations.

DeepMind trained two variants of RT-2, using two different underlying vision-language foundation models: a 12B-parameter version based on PaLM-E and a 55B-parameter one based on PaLI-X. The LLM is co-fine-tuned on a mix of general vision-language datasets and robot-specific data. The model learns to output a vector of robot motion commands, which is treated simply as a string of integers: in effect, it is a new language that the model learns. The final model is able to accept an image of the robot's workspace and a user command such as "pick up the bag about to fall off the table," and from that generate motion commands to perform the task. According to DeepMind,

Not only does RT-2 show how advances in AI are cascading rapidly into robotics, it shows enormous promise for more general-purpose robots. While there is still a tremendous amount of work to be done to enable helpful robots in human-centered environments, RT-2 shows us an exciting future for robotics just within grasp.
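To make the action-as-text representation concrete, here is a minimal Python sketch of how a continuous robot action could be discretized and rendered as a string of integer tokens. The field layout, value ranges, and bin count below are illustrative assumptions for demonstration, not DeepMind's published RT-2 configuration.

def discretize(value: float, low: float, high: float, bins: int = 256) -> int:
    """Map one continuous action dimension to an integer bin index."""
    value = max(low, min(high, value))  # clamp to the valid range
    return round((value - low) / (high - low) * (bins - 1))

def action_to_token_string(action: dict) -> str:
    """Render one robot action as a space-separated string of integers,
    i.e. the 'new language' the model is trained to emit."""
    tokens = [
        discretize(action["dx"], -0.1, 0.1),      # end-effector translation (m)
        discretize(action["dy"], -0.1, 0.1),
        discretize(action["dz"], -0.1, 0.1),
        discretize(action["droll"], -0.5, 0.5),   # end-effector rotation (rad)
        discretize(action["dpitch"], -0.5, 0.5),
        discretize(action["dyaw"], -0.5, 0.5),
        discretize(action["gripper"], 0.0, 1.0),  # gripper open/close
        int(action["terminate"]),                 # episode termination flag
    ]
    return " ".join(str(t) for t in tokens)

example = {"dx": 0.02, "dy": -0.01, "dz": 0.0, "droll": 0.0,
           "dpitch": 0.1, "dyaw": 0.0, "gripper": 1.0, "terminate": 0}
print(action_to_token_string(example))  # prints "153 115 128 128 153 128 255 0"

During training, the model learns to emit such strings as ordinary text output; at inference time, the generated integers are mapped back into motor commands.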

Google Robotics and DeepMind have published several systems that use LLMs for robot control. In 2022, InfoQ covered Google's SayCan, which uses an LLM to generate a high-level action plan for a robot, and Code-as-Policies, which uses an LLM to generate Python code for executing robot control. Both of these use a text-only LLM to process user input, with the vision component handled by separate robot modules. Earlier this year, InfoQ covered Google's PaLM-E, which handles multimodal input data from robotic sensors and outputs a series of high-level action steps.

RT-2 builds on a previous implementation, RT-1. The key idea of the RT series is to train a model to directly output robot commands, in contrast to previous efforts, which output higher-level abstractions of motion. Both RT-2 and RT-1 accept as input an image and a text description of a task. However, while RT-1 used a pipeline of distinct vision modules to generate visual tokens to input to an LLM, RT-2 uses a single vision-language model such as PaLM-E.
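The difference between the two pipelines can be sketched in a few lines. The following toy Python example is a placeholder illustration of the contrast as described above, not the published RT-1 or RT-2 code; the component callables are hypothetical stand-ins.

from typing import Callable, List

Image = List[List[float]]   # stand-in for a camera frame
Action = List[int]          # discretized motor-command tokens

def rt1_style_step(image: Image, instruction: str,
                   vision_tokenizer: Callable[[Image, str], List[int]],
                   action_decoder: Callable[[List[int], str], Action]) -> Action:
    """RT-1 style: a dedicated vision stage produces visual tokens,
    then a separate model decodes discretized actions from them."""
    visual_tokens = vision_tokenizer(image, instruction)
    return action_decoder(visual_tokens, instruction)

def rt2_style_step(image: Image, instruction: str,
                   vision_language_model: Callable[[Image, str], str]) -> Action:
    """RT-2 style: a single co-fine-tuned vision-language model maps
    image plus text directly to a string of action tokens."""
    token_string = vision_language_model(image, instruction)
    return [int(tok) for tok in token_string.split()]

# Trivial stand-ins so the sketch runs end to end.
frame: Image = [[0.0] * 4 for _ in range(4)]
print(rt1_style_step(frame, "pick up the can",
                     vision_tokenizer=lambda img, txt: [1, 2, 3],
                     action_decoder=lambda toks, txt: [128, 128, 128, 255, 0]))
print(rt2_style_step(frame, "pick up the can",
                     vision_language_model=lambda img, txt: "128 128 128 255 0"))

Collapsing the pipeline into a single model is what lets RT-2 draw on the web-scale knowledge of its vision-language backbone rather than relying only on robot data.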

DeepMind evaluated RT-2 on over 6,000 trials. In particular, the researchers were interested in its emergent capabilities: that is, its ability to perform tasks not present in the robot-specific training data but that emerge from its vision-language pre-training. The team tested RT-2 on three task categories: symbol understanding, reasoning, and human recognition. Compared with the best baseline, RT-2 achieved a "more than 3x average success rate." However, the model did not acquire any physical skills that were not included in the robot training data.

In a Hacker News discussion about the work, one user commented:

It does seem like this work (and a lot of robot learning works) are still stuck on position/velocity control and not impedance control. Which is essentially output where to go, either closed-loop with a controller or open-loop with a motion planner. This seems to dramatically lower the data requirement but it feels like a fundamental limit to what task we can accomplish. The reason robot manipulation is hard is because we need to take into account not just what's happening in the world but also how our interaction alters it and how we need to react to that.

Although RT-2 has not been open sourced, the code and data for RT-1 have been.
