Open-Vocabulary SAM is a vision foundation model built upon SAM and CLIP, targeting interactive segmentation and recognition. It unifies the two models in a single framework through two dedicated knowledge-transfer modules: SAM2CLIP, which adapts SAM's knowledge into CLIP, and CLIP2SAM, which transfers CLIP's knowledge into SAM. Extensive experiments on various datasets and detectors demonstrate the effectiveness of Open-Vocabulary SAM on segmentation and recognition tasks, where it significantly outperforms naive baselines that simply combine SAM and CLIP. Moreover, training with image classification data enables the model to segment and recognize approximately 22,000 categories.
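
To make the architecture concrete, below is a minimal PyTorch sketch of how the two transfer modules could be wired. Only the module names SAM2CLIP and CLIP2SAM come from the text above; the feature dimensions, the adapter internals, the pooling/fusion scheme, and the cosine-distillation objective are illustrative assumptions, not the actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SAM2CLIP(nn.Module):
    """Sketch of a SAM2CLIP transfer module: projects SAM image-encoder
    features into CLIP's visual embedding space so they can be aligned
    with frozen CLIP features. Internal design is assumed."""

    def __init__(self, sam_dim: int = 256, clip_dim: int = 1024):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(sam_dim, clip_dim),
            nn.GELU(),
            nn.Linear(clip_dim, clip_dim),
        )

    def forward(self, sam_feats: torch.Tensor) -> torch.Tensor:
        # sam_feats: (batch, tokens, sam_dim) -> (batch, tokens, clip_dim)
        return self.adapter(sam_feats)


class CLIP2SAM(nn.Module):
    """Sketch of a CLIP2SAM transfer module: fuses CLIP semantics into
    SAM's mask-decoder queries so each predicted mask can also be
    recognized against open-vocabulary labels. Design is assumed."""

    def __init__(self, clip_dim: int = 1024, sam_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(clip_dim, sam_dim)

    def forward(self, clip_feats: torch.Tensor,
                mask_queries: torch.Tensor) -> torch.Tensor:
        # Pool CLIP tokens to one semantic vector and add it to each query.
        semantic = self.proj(clip_feats.mean(dim=1, keepdim=True))
        return mask_queries + semantic


def distill_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """Cosine-distance distillation between adapted SAM features and
    frozen CLIP features (an assumed training objective)."""
    return (1.0 - F.cosine_similarity(student, teacher, dim=-1)).mean()


if __name__ == "__main__":
    B, N, Q = 2, 64, 8                        # batch, tokens, mask queries
    sam_feats = torch.randn(B, N, 256)        # from SAM's image encoder
    clip_feats = torch.randn(B, N, 1024)      # from a frozen CLIP encoder
    queries = torch.randn(B, Q, 256)          # SAM mask-decoder queries

    adapted = SAM2CLIP()(sam_feats)           # (B, N, 1024)
    loss = distill_loss(adapted, clip_feats)  # align SAM -> CLIP
    fused = CLIP2SAM()(clip_feats, queries)   # (B, Q, 256)
    print(loss.item(), fused.shape)
```

In a full pipeline, the adapted features would be distilled from a pretrained CLIP image encoder, and the fused queries would feed SAM's mask decoder alongside a classification head scored against text embeddings, which is what would allow recognition over an open vocabulary.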