Google Open-Sources Natural Language Robot Control Method SayCan

Researchers from Google's Robotics team have open-sourced SayCan, a robot control method that uses a large language model (LLM) to plan a sequence of robotic actions to achieve a user-specified goal. In experiments, SayCan generated the correct action sequence 84% of the time.

The technique and experiments were described in a paper published on arXiv. SayCan uses prompt engineering to convert a user's input, such as "I spilled my drink, can you help?", into a dialog that asks the robot for the steps needed to bring the user a sponge. Each of the robot's skills has a textual description, which the LLM uses to compute the probability that the skill is a useful next step, as well as a value function that indicates how likely the skill is to succeed given the current state of the world. SayCan then combines these two probabilities to choose the next action. Because the method produces a series of text-based steps, the result is a human-interpretable plan. According to the Google team:

Our experiments on a number of real-world robotic tasks demonstrate the ability to plan and complete long-horizon, abstract, natural language instructions at a high success rate...As we explore future directions for this work, we hope to better understand how information gained via the robot's real-world experience could be leveraged to improve the language model and to what extent natural language is the right ontology for programming robots.

LLMs have been shown to exhibit general knowledge about many subjects and can solve a wide range of natural-language processing (NLP) tasks. However, they can also generate responses that, while logically sound, would not be helpful for controlling a robot. For example, in response to "I spilled my drink, can you help?" an LLM might respond "You could try using a vacuum cleaner."

To improve the LLM's ability to plan a sequence of actions, SayCan precedes the raw user input with a chain-of-thought prompt consisting of 17 example inputs and their associated plans. Because the LLM outputs a probability distribution over text tokens for the next item in a sequence, it can score the text description of each skill, giving "the probability that a skill is useful for the instruction." That score is then combined with the skill's value function output, "the probability of successfully executing said skill," to select the next best action in the plan sequence.
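The selection step can be illustrated with a minimal Python sketch. It is not code from the SayCan release: the function names, the two callable signatures, the "done" termination skill, and all of the numbers are assumptions made for illustration, and the list of steps taken so far stands in for the robot's real state.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical signatures for illustration; not taken from the SayCan codebase.
# LLMProb(instruction, steps_so_far, skill) -> probability the skill is a useful next step.
LLMProb = Callable[[str, Sequence[str], str], float]
# SuccessProb(skill, steps_so_far) -> probability the skill succeeds in the current state.
SuccessProb = Callable[[str, Sequence[str]], float]


def choose_skill(instruction: str, steps: Sequence[str], skills: Sequence[str],
                 llm_prob: LLMProb, success_prob: SuccessProb) -> str:
    """Pick the skill with the highest product of the two probabilities."""
    scores: Dict[str, float] = {
        skill: llm_prob(instruction, steps, skill) * success_prob(skill, steps)
        for skill in skills
    }
    return max(scores, key=scores.get)


def plan(instruction: str, skills: Sequence[str], llm_prob: LLMProb,
         success_prob: SuccessProb, max_steps: int = 10) -> List[str]:
    """Greedily build a plan one step at a time until "done" is chosen."""
    steps: List[str] = []
    for _ in range(max_steps):
        best = choose_skill(instruction, steps, skills, llm_prob, success_prob)
        if best == "done":  # assumed termination skill for this sketch
            break
        steps.append(best)
    return steps


# Toy stand-ins with made-up numbers; a real system would query an LLM
# and learned value functions instead.
SKILLS = ["find a sponge", "pick up the sponge", "bring it to you", "done"]
TOY_LLM_TABLE = [  # one row per number of steps already taken
    {"find a sponge": 0.50, "pick up the sponge": 0.40, "bring it to you": 0.05, "done": 0.05},
    {"find a sponge": 0.10, "pick up the sponge": 0.70, "bring it to you": 0.15, "done": 0.05},
    {"find a sponge": 0.05, "pick up the sponge": 0.10, "bring it to you": 0.80, "done": 0.05},
    {"find a sponge": 0.05, "pick up the sponge": 0.05, "bring it to you": 0.05, "done": 0.85},
]


def toy_llm_prob(instruction: str, steps: Sequence[str], skill: str) -> float:
    row = TOY_LLM_TABLE[min(len(steps), len(TOY_LLM_TABLE) - 1)]
    return row[skill]


def toy_success_prob(skill: str, steps: Sequence[str]) -> float:
    # A skill is unlikely to succeed before its prerequisite has been carried out.
    if skill == "pick up the sponge":
        return 0.9 if "find a sponge" in steps else 0.1
    if skill == "bring it to you":
        return 0.9 if "pick up the sponge" in steps else 0.1
    if skill == "done":
        return 1.0
    return 0.9


toy_plan = plan("I spilled my drink, can you help?", SKILLS, toy_llm_prob, toy_success_prob)
print(toy_plan)
```

In the first step of this toy run, the language score for "pick up the sponge" is nearly as high as the one for "find a sponge", but its low success estimate pushes the combined score down, so the plan begins by finding the sponge.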

How SayCan chooses a skill

Image source: https://say-can.github.io/

To evaluate SayCan, Google researchers compiled a set of 101 instructions for an Everyday robot to execute, varying in complexity and time horizon, from "bring me a fruit" to "I spilled my coke on the table, throw it away and bring me something to clean." Google integrated SayCan with a variety of LLMs, including PaLM and FLAN. PaLM-SayCan performed the best, achieving a planning success rate of 84% and an execution success rate of 74%, compared with FLAN-SayCan at 70% and 61% respectively. The team noted that PaLM-SayCan struggled with instructions containing a negative, such as "bring me a snack that isn’t an apple," but pointed out that this is a common failing of LLMs in general.

In a Twitter thread about the work, Google scientist Fei Xia noted:

Incorporating a new skill only needs 1) a skill policy 2) its natural language description and 3) an affordance function.
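As a rough illustration of those three ingredients, the sketch below bundles them into one record and registers it in a small library. The `Skill` and `SkillLibrary` classes, their method names, and the dictionary-based state are hypothetical and do not reflect the interface of the open-source release.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Skill:
    """Hypothetical bundle of the three pieces named in the quote."""
    description: str                      # natural language description scored by the LLM
    policy: Callable[[Any], Any]          # maps an observation to a robot action
    affordance: Callable[[Any], float]    # estimated success probability in a given state


class SkillLibrary:
    """Hypothetical registry keyed by skill description."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        if skill.description in self._skills:
            raise ValueError(f"skill already registered: {skill.description!r}")
        self._skills[skill.description] = skill

    def descriptions(self) -> List[str]:
        # These strings are the candidates an LLM would score.
        return list(self._skills)

    def affordances(self, state: Any) -> Dict[str, float]:
        # One success estimate per registered skill for the given state.
        return {name: skill.affordance(state) for name, skill in self._skills.items()}


# Toy usage with a made-up state dictionary and made-up probabilities.
library = SkillLibrary()
library.register(Skill(
    description="pick up the sponge",
    policy=lambda observation: "close gripper",
    affordance=lambda state: 0.9 if state.get("sponge_visible") else 0.1,
))
print(library.descriptions())
print(library.affordances({"sponge_visible": True}))
```

Under this layout, adding a skill does not touch the planner: the new description becomes one more candidate for the language model to score, and the new affordance supplies the matching success estimate.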

One open-source version of SayCan for a simulated desktop environment is available on GitHub.

About the Author

Anthony Alford
