# EmbodiedReasoner

**Repository Path**: agiros/EmbodiedReasoner

## Basic Information

- **Project Name**: EmbodiedReasoner
- **Description**: Embodied-Reasoner aims to extend deep-thinking capabilities to embodied interactive tasks. It not only handles multimodal inputs, but also generates diverse thinking processes (including analysis, planning, and reflection) at different stages of the interaction.
- **Primary Language**: Python
- **License**: MulanPSL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 7
- **Forks**: 2
- **Created**: 2025-03-20
- **Last Updated**: 2025-10-15

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Embodied-Reasoner

✨ This is the official implementation of the paper *Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks*.
🤗 Hugging Face | 🤖 ModelScope | 📑 Arxiv | 📑 WebPage
The core contributions of Embodied-Reasoner are as follows:
- **Data Engine**: We build a data engine that automatically synthesizes coherent **Observation-Thought-Action** trajectories. These trajectories interleave diverse embodiment-specific thinking processes, such as **situational analysis**, **spatial reasoning**, **self-reflection**, **task planning**, and **double verification**. The coherent, image-text interleaved trajectories guide the model to plan and reason over the interaction history and spatial layout, improving its spatial and temporal reasoning capabilities.
- **Iterative Training Pipeline**: We further design a three-stage iterative training framework for embodied models, consisting of **imitation**, **self-exploration**, and **self-correction**. It first develops basic interaction skills through imitation learning on the synthesized trajectories, then strengthens exploration through rejection-sampling fine-tuning, and finally enables self-correction through reflection tuning.
- **Interactive Evaluation Framework**: We construct 809 test cases across 12 novel scenarios that differ from the training scenes, with manually designed instructions and annotated key actions and final states: `<instruction, key actions, final state>`.
**Keywords:** 💫 Embodied Task | 💫 Deep Reasoning Model | 💫 Multimodal Scene | 💫 Long-horizon Decision | 💫 Multi-turn Interaction
## Performance 🌿🌿
We compare the performance of Embodied-Reasoner against advanced VLMs and visual reasoning models.
- Success Rate (%) measures whether a task is successfully completed.
- Search Efficiency (%) evaluates task efficiency—more steps indicate lower efficiency.
- Task Completeness (%) computes the proportion of predicted actions that belong to the set of key actions.
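The exact implementations live in the evaluation code; the snippet below is only a minimal sketch of how these three quantities could be computed per episode. The function names and the step-counting convention in `search_efficiency` are our assumptions, not the repository's definitions.

```python
from typing import List


def success_rate(episode_successes: List[bool]) -> float:
    """Percentage of episodes whose final state satisfies the task goal."""
    return 100.0 * sum(episode_successes) / len(episode_successes)


def task_completeness(predicted_actions: List[str], key_actions: List[str]) -> float:
    """Percentage of predicted actions that belong to the annotated key-action set."""
    key_set = set(key_actions)
    hits = sum(1 for action in predicted_actions if action in key_set)
    return 100.0 * hits / max(len(predicted_actions), 1)


def search_efficiency(num_executed_steps: int, num_key_actions: int) -> float:
    """Illustrative efficiency proxy: fewer extra steps -> higher efficiency.

    The paper's exact formula may differ; this ratio merely captures the idea
    that more steps for the same task indicate lower efficiency.
    """
    return 100.0 * num_key_actions / max(num_executed_steps, 1)
```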
## Examples 👀 👀
### Simulator Experiments
Embodied-Reasoner exhibits spontaneous thinking behaviors, e.g., analyzing environmental states (#1, 3), reflecting on missed details (#4), reasoning based on the latest observations (#5), and recalling cues for efficient planning (#9). These thoughts remain coherent and logically consistent despite spanning multiple rounds. In contrast, general VLMs lacking thinking abilities struggle with long-horizon interactive tasks and produce unreasonable actions, e.g., forgetting the task goal or searching repetitively.
### Real-World Experiments
To evaluate the generalization of our reasoning model, we design a real-world experiment. Our model rules out the countertop and dining table after two explorations (steps 1,2), ultimately locating the coffee (#7) in the cabinet and placing it in the microwave for heating (#11). However, we observe that OpenAI o3-mini fails to formulate a reasonable plan, heading to the microwave first instead of searching for the coffee.
## QuickStart 🎯🎯
### Training
#### Step 1. Install Requirements
```shell
conda create -n llama-factory python=3.11
conda activate llama-factory
git clone -b embodied-reasoner https://github.com/iGangao/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
pip install wandb accelerate deepspeed importlib-metadata
```
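As an optional sanity check that the core training dependencies resolved correctly (this assumes the installation above succeeded and is not part of the repository's scripts):

```python
# Optional sanity check: confirm the main training dependencies import cleanly.
import torch
import deepspeed
import transformers

print("torch:", torch.__version__)
print("deepspeed:", deepspeed.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```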
#### Step 2. Prepare the data
Please refer to `data/README.md` for details about the dataset file format.
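For orientation only (the authoritative schema is the one described in `data/README.md`), multimodal fine-tuning data for LLaMA-Factory is commonly organized as ShareGPT-style conversations paired with an `images` list. The record below is a hypothetical illustration; every field value is an invented placeholder rather than an example from the actual dataset.

```python
import json

# Hypothetical ShareGPT-style multimodal record; the real schema used by this
# repository is documented in data/README.md.
example_record = {
    "messages": [
        {"role": "user", "content": "<image>Where should I look for the apple first?"},
        {"role": "assistant", "content": "The countertop is the most likely place, so I will check it before opening any cabinets."},
    ],
    "images": ["images/FloorPlan1/init_observe.png"],
}

# Write a one-record dataset file for inspection.
with open("example_record.json", "w", encoding="utf-8") as f:
    json.dump([example_record], f, ensure_ascii=False, indent=2)
```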
#### Step 3. Run training scripts
Run the training scripts:
```shell
bash scripts/train.sh
```
### Evaluation
#### Step 1. Install Requirements
```shell
conda create -n embodied-reasoner python=3.9
conda activate embodied-reasoner
pip install -r requirements.txt
```
#### Step 2. Run evaluation scripts
Run the evaluation scripts:
```shell
bash scripts/eval.sh
```
## Task and Trajectory Engine ⛲⛲
You can navigate to the `data_engine` folder to synthesize tasks and trajectories. Below are the key files within `data_engine`:
```plaintext
data_engine/
├── taskgenerate/ # Item information and room metadata for task generation
│ ├── bathrooms/
│ ├── bedrooms/
│ ├── kitchens/
│ ├── living_rooms/
│ └── pick_up_and_put.json
├── TaskGenerate.py # Task synthesis script
├── o1StyleGenerate.py # Trajectory synthesis script
├── o1StyleGenerate_ordered.py # Complex task trajectory synthesis script
├── vlmCall.py # Script to call the VLM
└── vlmCallapi_keys.py # Set your API keys here
```
#### Step 1. Generate Task
`TaskGenerate.py` can synthesize task templates and corresponding key actions. The generated task-related data will be stored in a subfolder under the `data_engine` folder.
Below is an example of the JSON file contents:
```json
{
    "scene": "FloorPlan1",
    "tasktype": "...",
    "taskname": "Locate the Apple in the room.",
    "trajectory": [
        "<...>...",
        "<...>...",
        "..."
    ],
    "images": [
        ".../init_observe.png",
        "..."
    ],
    "flag": "",
    "time": "...",
    "task_metadata": {
        "..."
    }
}
```
- **scene:** the scene where the task is performed.
- **tasktype:** the type of the task.
- **taskname:** the name of the task.
- **trajectory:** reasoning and decision-making content of the trajectory.
- **images:** paths to the corresponding images (the first image represents the initial state; each subsequent image corresponds to the state after performing each action listed in `trajectory`).
- **time and flag:** record the generation timestamp and any exceptions encountered during trajectory generation.
- **task_metadata:** task information generated during Step 1.
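A small usage sketch for reading one generated trajectory file is shown below. The file path is a placeholder, and the image-to-step pairing simply follows the field descriptions above; treat it as an illustration rather than the repository's official loader.

```python
import json
from pathlib import Path

# Placeholder path: point this at one of the JSON files produced by the data engine.
trajectory_file = Path("data_engine/FloorPlan1_example.json")

with trajectory_file.open(encoding="utf-8") as f:
    record = json.load(f)

print("Scene:", record["scene"])
print("Task:", record["taskname"])

# Per the field descriptions: images[0] is the initial observation, and
# images[i + 1] shows the state after the i-th step in the trajectory.
print("Initial observation:", record["images"][0])
for step_idx, step in enumerate(record["trajectory"]):
    next_image = record["images"][step_idx + 1] if step_idx + 1 < len(record["images"]) else None
    print(f"Step {step_idx}: {step[:60]} -> image: {next_image}")
```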
To view our complete trajectory dataset, please visit our Hugging Face Page.
Please refer to `data_engine/README.md` for details about the data engine.
## Citation
If you find our work helpful, feel free to cite us:
```
@article{embodied-reasoner,
  title   = {Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks},
  author  = {Wenqi Zhang and Mengna Wang and Gangao Liu and Huixin Xu and Yiwei Jiang and Yongliang Shen and Guiyang Hou and Zhe Zheng and Hang Zhang and Xin Li and Weiming Lu and Peng Li and Yueting Zhuang},
  journal = {arXiv preprint arXiv:2503.21696},
  year    = {2025}
}
```
## License
The codebase is licensed under the [Mulan PSL v2](LICENSE) license.
## Contact Us
If you have any questions, please contact us by email:
zhangwenqi@zju.edu.cn, lipeng@iscas.ac.cn
## Acknowledgements
Our training code uses [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), and our simulator is built on [AI2-THOR](https://github.com/allenai/ai2thor). Thanks for their wonderful work.