Google’s PaLM-E robot gives a glimpse of what is to come with Optimus


Google just announced PaLM-E, an example of what happens when you combine large language and visual data sets with AI in a robotic body that can move through the physical environment and take in sensory data from it.

It turns out you can create a robot that no longer needs human input or further training to complete tasks. Even more incredible, the robot can also complete new tasks with inputs it has never been exposed to.

Considering Tesla has already been doing this type of thing for years, this Google robot gives a tantalising glimpse of what is to come with Optimus.

Google’s PaLM-E is a generalist robot brain that takes commands

On Monday, a group of AI researchers from Google and the Technical University of Berlin unveiled PaLM-E, a multimodal embodied visual-language model (VLM) with 562 billion parameters that integrates vision and language for robotic control.

They claim it is the largest VLM ever developed and that it can perform a variety of tasks without the need for retraining.

According to Google, when given a high-level command, such as "bring me the rice chips from the drawer," PaLM-E can generate a plan of action for a mobile robot platform with an arm (developed by Google Robotics) and execute the actions by itself.

PaLM-E does this by analyzing data from the robot's camera without needing a pre-processed scene representation. This eliminates the need for a human to pre-process or annotate the data and allows for more autonomous robotic control.

It's also resilient and can react to its environment. For example, the PaLM-E model can guide a robot to get a chip bag from a kitchen—and with PaLM-E integrated into the control loop, it becomes resistant to interruptions that might occur during the task.

In a video example, a researcher grabs the chips from the robot and moves them, but the robot locates the chips and grabs them again.
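As a rough illustration of why keeping the model in the control loop helps with interruptions, here is a toy sketch. It is not Google's code: the world state, the `choose_action` function and the action names are all made up for this example. The only point is that the action is chosen from the current observation at every step rather than from a fixed script, so a moved chip bag simply leads to a new approach and grasp.

```python
def choose_action(obs):
    """Hypothetical policy: decide from the current observation only."""
    if obs["holding_chips"]:
        return "deliver"
    if obs["robot_pos"] != obs["chips_pos"]:
        return "move_to_chips"
    return "grasp"


def run_episode(interrupt_at=None, max_steps=10):
    """Simulate a fetch task; optionally have a person move the chips."""
    world = {"robot_pos": 0, "chips_pos": 3, "holding_chips": False}
    log = []
    for t in range(max_steps):
        if t == interrupt_at:
            # A person takes the chips away and puts them somewhere else.
            world["holding_chips"] = False
            world["chips_pos"] = 5
        action = choose_action(world)  # re-observe and re-decide every step
        log.append(action)
        if action == "move_to_chips":
            world["robot_pos"] = world["chips_pos"]
        elif action == "grasp":
            world["holding_chips"] = True
        elif action == "deliver":
            break
    return log


print(run_episode())                # uninterrupted run
print(run_episode(interrupt_at=2))  # chips are moved after the first grasp
```

In the interrupted run the log shows a second "move_to_chips" and "grasp" before "deliver", which mirrors what happens in the video.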

In another example, the same PaLM-E model autonomously controls a robot through tasks with complex sequences that previously required human guidance. Google's research paper explains how PaLM-E turns instructions into actions:

“We demonstrate the performance of PaLM-E on challenging and diverse mobile manipulation tasks. We largely follow the setup in Ahn et al. (2022), where the robot needs to plan a sequence of navigation and manipulation actions based on an instruction by a human.

For example, given the instruction "I spilled my drink, can you bring me something to clean it up?", the robot needs to plan a sequence containing "1. Find a sponge, 2. Pick up the sponge, 3. Bring it to the user, 4. Put down the sponge." Inspired by these tasks, we develop 3 use cases to test the embodied reasoning abilities of PaLM-E: affordance prediction, failure detection, and long-horizon planning. The low-level policies are from RT-1 (Brohan et al., 2022), a transformer model that takes RGB image and natural language instruction, and outputs end-effector control commands.”
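The split described in the quote (a high-level planner that produces text steps, and low-level policies that carry them out) can be sketched roughly as below. This is only a toy under stated assumptions: `next_step`, `low_level_policy` and `run` are hypothetical names, the plan is hard-coded from the paper's sponge example rather than generated by a model, and the policy stub just reports success instead of producing end-effector commands the way RT-1 would.

```python
from typing import List, Optional

# Canned plan copied from the example in the quoted paper. A real planner
# would generate these steps from the instruction and the camera images.
CANNED_PLANS = {
    "I spilled my drink, can you bring me something to clean it up?": [
        "Find a sponge",
        "Pick up the sponge",
        "Bring it to the user",
        "Put down the sponge",
    ],
}


def next_step(instruction: str, done: List[str]) -> Optional[str]:
    """Hypothetical stand-in for the high-level planner."""
    plan = CANNED_PLANS.get(instruction, [])
    return plan[len(done)] if len(done) < len(plan) else None


def low_level_policy(step: str) -> bool:
    """Hypothetical stand-in for a low-level policy such as RT-1.

    A real policy would take an image plus this text and output
    end-effector commands; this stub only reports success.
    """
    return True


def run(instruction: str, max_steps: int = 20) -> List[str]:
    """Ask for one step at a time and hand it to the low-level policy."""
    done: List[str] = []
    for _ in range(max_steps):
        step = next_step(instruction, done)
        if step is None:
            break
        if low_level_policy(step):
            done.append(step)
    return done


for i, step in enumerate(
    run("I spilled my drink, can you bring me something to clean it up?"), 1
):
    print(f"{i}. {step}")
```

The `max_steps` cap is just there so the loop still ends if a step keeps failing.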

PaLM-E is a next-token predictor, and it's called "PaLM-E" because it's based on Google's existing large language model (LLM) called "PaLM" (which is similar to the technology behind ChatGPT). Google has made PaLM "embodied" by adding sensory information and robotic control.

Since it's based on a language model, PaLM-E takes continuous observations, like images or sensor data, and encodes them into a sequence of vectors that are the same size as language tokens. This allows the model to "understand" the sensory information in the same way it processes language.
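To make the "same size as language tokens" idea a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the tiny dimensions, the random "learned" projection, the five-word vocabulary and the `<img>` placeholder are invented here and are not taken from Google's implementation. It only shows that once image features are projected to the embedding width, they can sit in the same input sequence as word vectors.

```python
import random

EMBED_DIM = 8    # hypothetical token-embedding width (real models are far larger)
FEATURE_DIM = 4  # hypothetical size of one raw image-patch feature vector

random.seed(0)

# Hypothetical text-embedding table: one vector per known word.
VOCAB = {
    word: [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for word in ["bring", "me", "the", "chips", "<img>"]
}

# Hypothetical learned projection matrix (FEATURE_DIM x EMBED_DIM).
PROJECTION = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for _ in range(FEATURE_DIM)
]


def project(feature):
    """Map one continuous observation vector into token-embedding space."""
    return [
        sum(feature[i] * PROJECTION[i][j] for i in range(FEATURE_DIM))
        for j in range(EMBED_DIM)
    ]


def build_input(words, image_features):
    """Replace each "<img>" placeholder with the projected image vectors."""
    sequence = []
    for word in words:
        if word == "<img>":
            sequence.extend(project(f) for f in image_features)
        else:
            sequence.append(VOCAB[word])
    return sequence


patches = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
seq = build_input(["bring", "me", "the", "chips", "<img>"], patches)
print(len(seq), "vectors, each of width", len(seq[0]))
```

Four words plus two image patches give six vectors of equal width, which is all the language model needs in order to treat them as one sequence.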

Google Robotics isn't the only research group working on robotic control with neural networks. This particular work resembles Microsoft's recent "ChatGPT for Robotics" paper, which experimented with combining visual data and large language models for robotic control in a similar way.

Robotics aside, Google researchers observed several interesting effects that apparently come from using a large language model as the core of PaLM-E. For one, it exhibits "positive transfer," which means it can transfer the knowledge and skills it has learned from one task to another, resulting in "significantly higher performance" compared to single-task robot models.

Also, they observed a trend with model scale: "The larger the language model, the more it maintains its language capabilities when training on visual-language and robotics tasks—quantitatively, the 562B PaLM-E model nearly retains all of its language capabilities."

And the researchers claim that PaLM-E exhibits emergent capabilities like multimodal chain-of-thought reasoning (allowing the model to analyze a sequence of inputs that include both language and visual information) and multi-image inference (using multiple images as input to make an inference or prediction) despite being trained on only single-image prompts. In that sense, PaLM-E seems to continue the trend of surprises emerging as deep learning models get more complex over time.

Google researchers plan to explore more applications of PaLM-E for real-world scenarios such as home automation or industrial robotics. And they hope PaLM-E will inspire more research on multimodal reasoning and embodied AI.

"Multimodal" is a buzzword we'll be hearing more and more as companies reach for artificial general intelligence that will ostensibly be able to perform general tasks like a human.