Google's PaLM-E Combines Vision and Language AI for Robot Control

Researchers from Google's Robotics team recently announced PaLM-E, a combination of their PaLM and Vision Transformer (ViT) models designed for controlling robots. PaLM-E handles multimodal input data from robotic sensors and outputs text commands to control the robot's actuators. Besides performing well on several robotics tasks, PaLM-E also outperforms other models on the OK-VQA benchmark.

PaLM-E addresses the problem of grounding or embodying a large language model (LLM) by including multimodal sensor data in the LLM's inputs. These inputs are first passed through encoders which project them into the embedding space the LLM uses for language input tokens, resulting in multimodal sentences of text and other data. PaLM-E then produces a textual output, such as an answer to an input question, or high-level instructions for a robot. According to Google:

"PaLM-E pushes the boundaries of how generally-capable models can be trained to simultaneously address vision, language and robotics while also being capable of transferring knowledge from vision and language to the robotics domain... PaLM-E not only provides a path towards building more capable robots that benefit from other data sources, but might also be a key enabler to other broader applications using multimodal learning, including the ability to unify tasks that have so far seemed separate."

Google's robotics researchers have open-sourced several previous LLM systems for robot control. In 2022, InfoQ covered both SayCan, which uses an LLM to output a high-level action plan, and Code-as-Policies, which uses an LLM to output low-level robot control code. 

PaLM-E is based on a pre-trained PaLM language model. Robot sensor data is injected into a textual input; for example, the model can handle input questions such as "What happened between <img_1> and <img_2>?", where "img_1" and "img_2" are images encoded by a ViT and mapped to the same embedding space as the text input tokens. The output of the model in this case would be an answer to the question. Google created encoders for several input modalities, including robot state vectors (e.g., 3D pose info), 3D scene representations, and entity references for objects in the robot environment.
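The placeholder mechanism described above can be illustrated with a small sketch. This is not the PaLM-E implementation: the names `embed_token`, `project_image`, and `build_multimodal_sentence`, the dimensions, and the random weights are all hypothetical stand-ins for a learned token embedding table, a ViT encoder, and a learned projection. It also assumes one vector per image for brevity, whereas a real ViT typically yields many vectors per image.

```python
import random
import re

EMBED_DIM = 8  # hypothetical width of the LLM's token embedding space
VIT_DIM = 4    # hypothetical width of the image encoder's output features

random.seed(0)


def random_vector(size):
    """Random values standing in for learned weights or encoder outputs."""
    return [random.uniform(-1.0, 1.0) for _ in range(size)]


# Toy token embedding table; a real LLM uses a learned table over a fixed vocabulary.
_token_table = {}


def embed_token(token):
    """Look up (or lazily create) the embedding vector for a text token."""
    if token not in _token_table:
        _token_table[token] = random_vector(EMBED_DIM)
    return _token_table[token]


# Toy linear projection (EMBED_DIM rows by VIT_DIM columns) that maps image
# features into the same space as the text token embeddings.
PROJECTION = [random_vector(VIT_DIM) for _ in range(EMBED_DIM)]


def project_image(features):
    """Apply the projection matrix to one image feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in PROJECTION]


PLACEHOLDER = re.compile(r"<(img_\d+)>")


def build_multimodal_sentence(prompt, image_features):
    """Interleave text embeddings and projected image embeddings in prompt order."""
    sequence = []
    for piece in re.split(r"(<img_\d+>)", prompt):
        match = PLACEHOLDER.fullmatch(piece)
        if match:
            vector = project_image(image_features[match.group(1)])
            sequence.append(("image", vector))
        else:
            # Whitespace splitting is a simplification of real tokenization.
            for token in piece.split():
                sequence.append(("text", embed_token(token)))
    return sequence


prompt = "What happened between <img_1> and <img_2>?"
features = {"img_1": random_vector(VIT_DIM), "img_2": random_vector(VIT_DIM)}
sequence = build_multimodal_sentence(prompt, features)
kinds = [kind for kind, _ in sequence]
print(kinds)
```

Every element of `sequence` has the same width regardless of modality, which is the property that lets a language model consume the interleaved sequence as if it were ordinary token embeddings.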

Figure: PaLM-E model architecture.

The researchers evaluated PaLM-E by using it to control simulated and real-world robots performing multiple tasks: grasping and stacking objects, pushing objects in a table-top environment, and manipulation by a mobile robot in a kitchen environment. PaLM-E was able to construct "long-horizon" plans for the robots, and in the table-top pushing task it was able to generalize to tasks involving objects not seen in training. In the kitchen, the robot was able to complete long-horizon tasks "even under adversarial disturbances."

Several users commented on the work in a Hacker News discussion. One user wondered how well the model performance scaled with the number of parameters. Another user replied:

"The performance does scale up with parameters, though it’s not linear. As discovered by Google in their work on their Chinchilla LLM, performance also scales up with the size of your training set. They were able to do work to define the optimal amount of training for a model of a given size to get the most out of your budget. So even if we don’t find any better model architectures, which we probably will, if we increase the size of our models, training corpus and budget, we should continue to get more performant models."

The PaLM-E site has several demo videos of robots performing tasks while controlled by the model.
