# PixelCraft **Repository Path**: mirrors_microsoft/PixelCraft ## Basic Information - **Project Name**: PixelCraft - **Description**: High-Fidelity Visual Reasoning on Structured Images - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2025-10-01 - **Last Updated**: 2025-10-04 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images [๐Ÿ“„ Arxiv](https://arxiv.org/abs/2509.25185) | ๐Ÿšง Code: *Coming Soon* --- ## ๐Ÿ“Œ Overview **PixelCraft** is a novel multi-agent system designed to enable **high-fidelity visual reasoning** on **structured images** such as charts, scientific plots, and geometric diagrams. Unlike existing multimodal large language models (MLLMs) that often suffer from perceptual errors and rigid linear reasoning, PixelCraft introduces a **dynamic, collaborative agent framework** that integrates precise image processing with flexible, non-linear reasoning strategies. By combining pixel-level grounding with classical computer vision tools and multi-agent collaboration, PixelCraft brings notable gains in structured image understanding. --- ## ๐Ÿš€ Key Features * **๐Ÿ” High-Fidelity Image Processing:** Fine-tuned MLLM with pixel-level grounding provides precise localization of visual elements, enabling accurate data extraction and visual manipulation. * **๐Ÿง  Multi-Agent Collaboration:** A planner, reasoner, critics, and visual tool agents work together in a three-stage workflow โ€” **query-aware tool selection**, **role-driven reasoning**, and **iterative self-correction**. * **๐Ÿ“Š Image Memory & Non-Linear Reasoning:** A novel image memory mechanism allows agents to **revisit intermediate visual states**, **branch reasoning paths**, and **refine conclusions**, overcoming the limitations of linear chain-of-thought approaches. * **๐Ÿ”ง Specialized Visual Tools:** PixelCraft introduces a rich suite of visual tool agents for tasks like subfigure cropping, region magnification, legend masking, auxiliary line drawing, and geometric construction. --- ## ๐Ÿ“ˆ Performance Highlights PixelCraft achieves substantially higher performance in structured visual reasoning across multiple challenging datasets: | Benchmark | GPT-4o | GPT-4.1-mini | Claude 3.7 Sonnet | | -------------- | ------ | ------------ | ----------------- | | **ChartXiv** | 55.2 | 68.1 | 73.9 | | **ChartQAPro** | 58.83 | 65.56 | 69.82 | | **EvoChart** | 70.24 | 79.44 | 80.48 | --- ## ๐Ÿงช System Workflow The PixelCraft pipeline follows a **three-stage collaborative workflow**, enabling precise tool usage and flexible reasoning.

PixelCraft Pipeline

**Workflow Stages:** 1. **Query-Aware Agent Selection:** The dispatcher analyzes the query and selects only the most relevant visual tool agents. 2. **Role-Driven Agent Discussion:** The planner coordinates reasoning among agents, manages intermediate results, and dynamically recalls visual memory. 3. **Iterative Self-Correction:** Critics verify tool outputs and reasoning logic, enabling refined and more accurate answers. --- ## ๐Ÿ“š Citation If you find this work helpful in your research, please cite our paper: ```bibtex @misc{zhang2025pixelcraft, title={PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images}, author={Shuoshuo Zhang and Zijian Li and Yizhen Zhang and Jingjing Fu and Lei Song and Jiang Bian and Jun Zhang and Yujiu Yang and Rui Wang}, year={2025}, eprint={2509.25185}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2509.25185}, } ```