SPARC is a simple method for pre-training on image-text pairs that aims to learn more fine-grained multimodal representations. It computes a sparse similarity metric between image patches and language tokens and uses it to group the image patches that correspond to each token. The resulting representations encode both global and local information, because training combines a fine-grained, sequence-wise contrastive loss with a global contrastive loss between image and text embeddings. SPARC improves performance on both coarse-grained image-level tasks and fine-grained region-level tasks, including classification, retrieval, object detection, and segmentation. It also improves model trustworthiness and image description capabilities.
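The grouping and fine-grained loss described above can be illustrated with a minimal sketch for a single image-text pair. This is not the authors' implementation: the function names, the sparsity threshold of 1/P (with P the number of patches), the temperature value, and the toy embeddings are all assumptions made for illustration. The global image-text contrastive loss, which is computed across a batch, is omitted here.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # avoid division by zero
    return [a / norm for a in v]


def sparse_alignment_weights(tokens, patches):
    """Return a T x P matrix of sparse token-to-patch weights (rows sum to 1)."""
    threshold = 1.0 / len(patches)  # assumed sparsity threshold
    weights = []
    for t in tokens:
        sims = [dot(t, p) for p in patches]
        lo, hi = min(sims), max(sims)
        span = (hi - lo) or 1.0
        scaled = [(s - lo) / span for s in sims]  # min-max scale to [0, 1]
        kept = [s if s >= threshold else 0.0 for s in scaled]  # sparsify
        total = sum(kept) or 1.0
        weights.append([k / total for k in kept])
    return weights


def group_patches(weights, patches):
    """For each token, form the weighted sum of its selected patch embeddings."""
    dim = len(patches[0])
    return [
        [sum(w * p[k] for w, p in zip(row, patches)) for k in range(dim)]
        for row in weights
    ]


def fine_grained_loss(tokens, grouped, temperature=0.1):
    """Symmetric contrastive loss over the tokens of one sequence.

    Token i should match its own grouped patch embedding i and not those
    belonging to the other tokens in the same sequence.
    """
    t = [l2_normalize(v) for v in tokens]
    g = [l2_normalize(v) for v in grouped]
    logits = [[dot(a, b) / temperature for b in g] for a in t]

    def mean_nll(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(matrix)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (mean_nll(logits) + mean_nll(transposed))


# Toy example: 4 patch embeddings and 2 token embeddings in 2 dimensions.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tokens = [[1.0, 0.0], [0.0, 1.0]]

weights = sparse_alignment_weights(tokens, patches)
grouped = group_patches(weights, patches)
loss = fine_grained_loss(tokens, grouped)
swapped_loss = fine_grained_loss(tokens, grouped[::-1])
```

In this toy setup each token keeps only the two patches that point in its direction, so the loss is small when tokens are paired with their own grouped embeddings and much larger when the pairing is swapped. In a full training setup this term would be added to the global contrastive loss, with relative weights chosen as hyperparameters.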