# EmbodiedReasoner

**Repository Path**: agiros/EmbodiedReasoner

## Basic Information

- **Project Name**: EmbodiedReasoner
- **Description**: Embodied-Reasoner aims to extend deep-thinking capabilities to embodied interactive tasks. It not only handles multimodal inputs, but also generates diverse thinking processes (including analysis, planning, and reflection) at different stages of the interaction.
- **Primary Language**: Python
- **License**: MulanPSL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 7
- **Forks**: 2
- **Created**: 2025-03-20
- **Last Updated**: 2025-10-15

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Embodied-Reasoner

✨ This is the official implementation of the paper *Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks*.
🤗 Hugging Face | 🤖 ModelScope | 📑 Arxiv | 📑 WebPage
The core contributions of Embodied-Reasoner are as follows:
- **Data Engine**: We build a data engine that automatically synthesizes coherent **Observation-Thought-Action** trajectories. These trajectories interleave diverse embodiment-specific thinking processes, such as **situational analysis**, **spatial reasoning**, **self-reflection**, **task planning**, and **double verification**. The coherent, image-text interleaved trajectories guide the model to plan and reason over the interaction history and spatial layout, improving its spatial and temporal reasoning capabilities.
- **Iterative Training Pipeline**: We further design a three-stage iterative training framework for embodied models, consisting of **imitation**, **self-exploration**, and **self-correction**. It first develops basic interaction skills through imitation learning on the synthesized trajectories, then strengthens exploration through rejection-sampling fine-tuning, and finally enables self-correction through reflection tuning.
- **Interactive Evaluation Framework**: We construct 809 test cases across 12 novel scenarios that differ from the training scenes, with manually designed instructions and annotated key actions and final states: `<instruction, key actions, final state>`.
**Keywords:** 💫 Embodied Task | 💫 Deep Reasoning Model | 💫 Multimodal Scene | 💫 Long-horizon Decision | 💫 Multi-turn Interaction
## Performance 🌿🌿
We compare the performance of Embodied-Reasoner against advanced VLMs and visual reasoning models.
- Success Rate (%) measures whether a task is successfully completed.
- Search Efficiency (%) evaluates task efficiency—more steps indicate lower efficiency.
- Task Completeness (%) computes the proportion of predicted actions that belong to the set of key actions.
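The exact implementations live in the evaluation code; the snippet below is only a minimal sketch of how these three quantities could be computed per episode. The function names and the step-counting convention in `search_efficiency` are our assumptions, not the repository's definitions.

```python
from typing import List


def success_rate(episode_successes: List[bool]) -> float:
    """Percentage of episodes whose final state satisfies the task goal."""
    return 100.0 * sum(episode_successes) / len(episode_successes)


def task_completeness(predicted_actions: List[str], key_actions: List[str]) -> float:
    """Percentage of predicted actions that belong to the annotated key-action set."""
    key_set = set(key_actions)
    hits = sum(1 for action in predicted_actions if action in key_set)
    return 100.0 * hits / max(len(predicted_actions), 1)


def search_efficiency(num_executed_steps: int, num_key_actions: int) -> float:
    """Illustrative efficiency proxy: fewer extra steps -> higher efficiency.

    The paper's exact formula may differ; this ratio merely captures the idea
    that more steps for the same task indicate lower efficiency.
    """
    return 100.0 * num_key_actions / max(num_executed_steps, 1)
```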
## Examples 👀 👀
### Simulator Experiments
Embodied-Reasoner exhibits spontaneous thinking behaviors, e.g., analyzing environmental states (#1, 3), reflecting on missed details (#4), reasoning based on the latest observations (#5), and recalling cues for efficient planning (#9). These thoughts remain coherent and logically consistent despite spanning multiple rounds. In contrast, general VLMs lacking thinking abilities struggle with long-horizon interactive tasks and produce unreasonable actions, e.g., forgetting the task goal or searching repetitively.
### Real-World Experiments
To evaluate the generalization of our reasoning model, we design a real-world experiment. Our model rules out the countertop and dining table after two explorations (steps 1,2), ultimately locating the coffee (#7) in the cabinet and placing it in the microwave for heating (#11). However, we observe that OpenAI o3-mini fails to formulate a reasonable plan, heading to the microwave first instead of searching for the coffee.
## QuickStart 🎯🎯
### Training
#### Step 1. Install Requirements
```shell
conda create -n llama-factory python=3.11
conda activate llama-factory
git clone -b embodied-reasoner https://github.com/iGangao/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
pip install wandb accelerate deepspeed importlib-metadata
```
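As an optional sanity check that the core training dependencies resolved correctly (this assumes the installation above succeeded and is not part of the repository's scripts):

```python
# Optional sanity check: confirm the main training dependencies import cleanly.
import torch
import deepspeed
import transformers

print("torch:", torch.__version__)
print("deepspeed:", deepspeed.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```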
#### Step 2. Prepare the data
Please refer to `data/README.md` for details about the dataset file format.
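For orientation only (the authoritative schema is the one described in `data/README.md`), multimodal fine-tuning data for LLaMA-Factory is commonly organized as ShareGPT-style conversations paired with an `images` list. The record below is a hypothetical illustration; every field value is an invented placeholder rather than an example from the actual dataset.

```python
import json

# Hypothetical ShareGPT-style multimodal record; the real schema used by this
# repository is documented in data/README.md.
example_record = {
    "messages": [
        {"role": "user", "content": "<image>Where should I look for the apple first?"},
        {"role": "assistant", "content": "The countertop is the most likely place, so I will check it before opening any cabinets."},
    ],
    "images": ["images/FloorPlan1/init_observe.png"],
}

# Write a one-record dataset file for inspection.
with open("example_record.json", "w", encoding="utf-8") as f:
    json.dump([example_record], f, ensure_ascii=False, indent=2)
```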
#### Step 3. Run training scripts
Run the training scripts:
```shell
bash scripts/train.sh
```
### Evaluation
#### Step 1. Install Requirements
```shell
conda create -n embodied-reasoner python=3.9
conda activate embodied-reasoner
pip install -r requirements.txt
```
#### Step 2. Run evaluation scripts
Run the evaluation scripts:
```shell
bash scripts/eval.sh
```
## Task and Trajectory Engine ⛲⛲
You can navigate to the `data_engine` folder to synthesize tasks and trajectories. Below are the key files within `data_engine`:
```plaintext
data_engine/
├── taskgenerate/ # Item information and room metadata for task generation
│ ├── bathrooms/
│ ├── bedrooms/
│ ├── kitchens/
│ ├── living_rooms/
│ └── pick_up_and_put.json
├── TaskGenerate.py # Task synthesis script
├── o1StyleGenerate.py # Trajectory synthesis script
├── o1StyleGenerate_ordered.py # Complex task trajectory synthesis script
├── vlmCall.py # Script to call the VLM
└── vlmCallapi_keys.py # Set your API keys here
```
#### Step 1. Generate Task
`TaskGenerate.py` can synthesize task templates and corresponding key actions. The generated task-related data will be stored in a subfolder under the `data_engine` folder.
Below is an example of the JSON file contents:
```json
{
    "scene": "FloorPlan1",
    "tasktype": "...",
    "taskname": "Locate the Apple in the room.",
    "trajectory": [
        "<...>...",
        "<...>...",
        "..."
    ],
    "images": [
        ".../init_observe.png",
        "..."
    ],
    "flag": "",
    "time": "...",
    "task_metadata": {
        "..."
    }
}
```
- **scene:** the scene where the task is performed.
- **tasktype:** the type of the task.
- **taskname:** the name of the task.
- **trajectory:** reasoning and decision-making content of the trajectory.
- **images:** paths to the corresponding images (the first image represents the initial state; each subsequent image corresponds to the state after performing each action listed in `trajectory`).
- **time and flag:** record the generation timestamp and any exceptions encountered during trajectory generation.
- **task_metadata:** task information generated during Step 1.
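A small usage sketch for reading one generated trajectory file is shown below. The file path is a placeholder, and the image-to-step pairing simply follows the field descriptions above; treat it as an illustration rather than the repository's official loader.

```python
import json
from pathlib import Path

# Placeholder path: point this at one of the JSON files produced by the data engine.
trajectory_file = Path("data_engine/FloorPlan1_example.json")

with trajectory_file.open(encoding="utf-8") as f:
    record = json.load(f)

print("Scene:", record["scene"])
print("Task:", record["taskname"])

# Per the field descriptions: images[0] is the initial observation, and
# images[i + 1] shows the state after the i-th step in the trajectory.
print("Initial observation:", record["images"][0])
for step_idx, step in enumerate(record["trajectory"]):
    next_image = record["images"][step_idx + 1] if step_idx + 1 < len(record["images"]) else None
    print(f"Step {step_idx}: {step[:60]} -> image: {next_image}")
```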
To view our complete trajectory dataset, please visit our Hugging Face Page.
Please refer to `data_engine/README.md` for details about the data engine.
## Citation
If you find our work helpful, feel free to cite us:
```
@article{embodied-reasoner,
  title   = {Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks},
  author  = {Wenqi Zhang and Mengna Wang and Gangao Liu and Huixin Xu and Yiwei Jiang and Yongliang Shen and Guiyang Hou and Zhe Zheng and Hang Zhang and Xin Li and Weiming Lu and Peng Li and Yueting Zhuang},
  journal = {arXiv preprint arXiv:2503.21696},
  year    = {2025}
}
```
## License
The codebase is licensed under the [Mulan PSL v2](LICENSE) license.
## Contact Us
If you have any questions, please contact us by email:
zhangwenqi@zju.edu.cn, lipeng@iscas.ac.cn
## Acknowledgements
Our training code uses [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), and our simulator is built on [AI2-THOR](https://github.com/allenai/ai2thor). Thanks for their wonderful work.