🎯 SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

Muzhi Zhu<sup>1,2</sup>,   Yuzhuo Tian<sup>1</sup>,   Hao Chen<sup>1*</sup>,   Chunluan Zhou<sup>2</sup>,   Qingpei Guo<sup>2*</sup>,   Yang Liu<sup>1</sup>,   Ming Yang<sup>2</sup>,   Chunhua Shen<sup>1*</sup>

<sup>1</sup>Zhejiang University,   <sup>2</sup>Ant Group

CVPR 2025

📄 [Paper](https://arxiv.org/abs/2503.08625)  |  🌐 Project Page

🚀 Overview

*Figure: overview of the SegAgent framework.*

📖 Description

Multimodal Large Language Models (MLLMs) demonstrate remarkable image-understanding capabilities but still struggle with pixel-level tasks such as segmentation. SegAgent addresses this gap by introducing a novel Human-Like Mask Annotation Task (HLMAT), in which MLLMs imitate the annotation trajectories that human experts produce with interactive segmentation tools.

SegAgent learns from these annotation trajectories without requiring architectural modifications or additional implicit tokens. This approach substantially improves MLLMs' segmentation and mask-refinement abilities, and establishes a new paradigm for assessing fine-grained visual understanding and multi-step reasoning.

🚩 Plan

- Release the weights.
- Release the inference code.
- Release the trajectory generation code and training scripts.

🛠️ Getting Started

🎫 License

For academic use, this project is licensed under the BSD 2-Clause License. For commercial inquiries, please contact Chunhua Shen.

🖊️ Citation

If you find this work helpful for your research, please cite:

```bibtex
@article{zhu2025segagent,
  title={SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories},
  author={Zhu, Muzhi and Tian, Yuzhuo and Chen, Hao and Zhou, Chunluan and Guo, Qingpei and Liu, Yang and Yang, Ming and Shen, Chunhua},
  journal={arXiv preprint arXiv:2503.08625},
  year={2025},
  url={https://arxiv.org/abs/2503.08625}
}
```
