The X-SAM image segmentation model, jointly developed by Sun Yat-sen University, Peng Cheng Laboratory, and Meituan, was officially released recently. This large multimodal model marks a significant step forward in image segmentation, extending the capability from "segment anything" to "any segmentation" and substantially broadening the model's adaptability and range of applications.

The traditional Segment Anything Model (SAM) is effective at generating dense segmentation masks, but it has a clear limitation: it accepts only a single type of visual prompt as input. To overcome this bottleneck, the research team proposed a Visual Grounded Segmentation (VGS) task, which uses interactive visual prompts to precisely segment all matching instance objects, thereby equipping the multimodal large language model with pixel-level understanding.
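
To make the idea of the VGS task concrete, the sketch below shows one way such an interactive query could be represented: a visual prompt (a point, box, or scribble) paired with a request to segment every matching instance in the image. All class and field names here are illustrative assumptions, not X-SAM's actual interface.

```python
# Hypothetical data structures for a visual grounded segmentation query.
# These are illustrative only and do not reflect X-SAM's real API.
from dataclasses import dataclass
from typing import List, Literal, Tuple

@dataclass
class VisualPrompt:
    """An interactive visual reference drawn on the image: a point, box, or scribble."""
    kind: Literal["point", "box", "scribble"]
    coords: List[Tuple[float, float]]      # pixel coordinates defining the prompt

@dataclass
class VGSRequest:
    """Ask the model to segment *all* instances that match the visual reference."""
    image_path: str
    prompt: VisualPrompt

@dataclass
class InstanceMask:
    """One predicted instance: a binary mask plus a confidence score."""
    mask: List[List[int]]                  # H x W binary mask (0/1)
    score: float

# Example: point at one dog and expect a mask for every dog instance in the image.
request = VGSRequest(
    image_path="park.jpg",
    prompt=VisualPrompt(kind="point", coords=[(412.0, 230.0)]),
)
```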

The technical design of X-SAM incorporates several innovations. The model adopts a unified input format and output representation and can process diverse visual and textual inputs. Its dual-encoder architecture provides deep understanding of both image content and segmentation features, while a segmentation connector fuses multi-scale information, markedly improving segmentation accuracy.
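
As a rough illustration of how a dual encoder and a multi-scale connector can fit together, here is a minimal PyTorch sketch. The module names, feature dimensions, and fusion scheme are assumptions made for clarity, not X-SAM's actual implementation.

```python
# Minimal sketch of a dual-encoder backbone with a multi-scale segmentation connector.
# All names, shapes, and dimensions are illustrative assumptions, not X-SAM's code.
import torch
import torch.nn as nn

class SegmentationConnector(nn.Module):
    """Projects multi-scale segmentation features into a shared token space."""
    def __init__(self, in_dims=(256, 512, 1024), out_dim=4096):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, out_dim) for d in in_dims)

    def forward(self, multi_scale_feats):
        # Project each scale separately, then concatenate along the token dimension.
        tokens = [p(f) for p, f in zip(self.proj, multi_scale_feats)]
        return torch.cat(tokens, dim=1)

class DualEncoderBackbone(nn.Module):
    """Runs a general image encoder and a segmentation encoder, then fuses their tokens."""
    def __init__(self, image_encoder, seg_encoder, connector):
        super().__init__()
        self.image_encoder = image_encoder   # e.g. a ViT for global image semantics
        self.seg_encoder = seg_encoder       # e.g. a SAM-style encoder for mask features
        self.connector = connector

    def forward(self, image):
        global_tokens = self.image_encoder(image)   # (B, N, D)
        seg_feats = self.seg_encoder(image)          # list of (B, Ni, Di) at several scales
        seg_tokens = self.connector(seg_feats)       # (B, M, D)
        return torch.cat([global_tokens, seg_tokens], dim=1)

# Toy usage with stand-in encoders (random features) just to show the tensor flow.
img_enc = lambda x: torch.randn(x.size(0), 196, 4096)
seg_enc = lambda x: [torch.randn(x.size(0), 64, d) for d in (256, 512, 1024)]
backbone = DualEncoderBackbone(img_enc, seg_enc, SegmentationConnector())
tokens = backbone(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 388, 4096])
```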


Most notably, X-SAM integrates a Mask2Former-style architecture as its segmentation decoder, allowing the model to segment multiple target objects in a single pass and removing the traditional limitation of SAM, which segments only one object per prompt. This improvement not only raises processing efficiency but also makes batch segmentation practical in complex scenes.
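
The reason a Mask2Former-style decoder can produce many masks in one pass is that it maintains a fixed set of learnable object queries, each of which attends to the pixel features and yields its own mask. The toy example below sketches this mechanism; the dimensions and layer choices are illustrative assumptions, not the model's real configuration.

```python
# Toy illustration of query-based mask decoding: N learnable queries each produce a mask,
# so all candidate instances come out of a single forward call. Shapes are illustrative.
import torch
import torch.nn as nn

class QueryBasedMaskHead(nn.Module):
    def __init__(self, num_queries=100, dim=256):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)   # one query per potential object
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=3,
        )

    def forward(self, pixel_feats):                     # pixel_feats: (B, HW, dim)
        B = pixel_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        q = self.decoder(q, pixel_feats)                # (B, num_queries, dim)
        # Dot each query embedding with every pixel feature -> one mask per query.
        masks = torch.einsum("bqd,bpd->bqp", q, pixel_feats)
        return masks.sigmoid()                          # (B, num_queries, HW)

head = QueryBasedMaskHead()
masks = head(torch.randn(1, 64 * 64, 256))              # 100 candidate masks at once
print(masks.shape)                                       # torch.Size([1, 100, 4096])
```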

For training, the research team adopted a three-stage progressive strategy, ensuring stable performance gains through gradual learning. Evaluated comprehensively on more than 20 major segmentation datasets, X-SAM achieved strong performance on segmentation-based dialogue generation and image-text understanding tasks, validating the effectiveness of its technical approach.
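
A progressive multi-stage recipe is typically expressed as a schedule in which each stage unfreezes more components and broadens the training data mix. The sketch below shows what such a three-stage schedule might look like; the stage names, trainable components, and data mixes are assumptions for illustration, not the paper's exact recipe.

```python
# Schematic three-stage progressive schedule (illustrative assumptions only).
stages = [
    {"name": "stage1_connector_alignment", "trainable": ["connector"],
     "data": ["image-text pairs"], "epochs": 1},
    {"name": "stage2_segmentation_pretrain", "trainable": ["connector", "seg_decoder"],
     "data": ["segmentation datasets"], "epochs": 2},
    {"name": "stage3_mixed_finetune", "trainable": ["connector", "seg_decoder", "llm"],
     "data": ["segmentation + dialogue mix"], "epochs": 1},
]

for stage in stages:
    # Each stage trains a larger subset of the model on a broader data mix.
    print(f"[{stage['name']}] train {stage['trainable']} on {stage['data']} "
          f"for {stage['epochs']} epoch(s)")
```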

The release of X-SAM points to a new direction for image segmentation technology and provides an important technical foundation for building more capable general visual understanding systems. The research team stated that the next step is to explore the technology in the video domain, promoting the unified treatment of image and video segmentation and further pushing the boundaries of machine visual understanding.

This work holds significant academic value, and its potential in practical applications such as autonomous driving, medical imaging, and industrial inspection is promising. With the release of the model and wider adoption of the technology, it is expected to accelerate progress across the computer vision field.

Paper address: https://arxiv.org/pdf/2508.04655

Code address: https://github.com/wanghao9610/X-SAM

Demo address: https://47.115.200.157:7861