VocalText-Contrastive-Embedding
This repository features a CLIP-inspired contrastive model that aligns audio signals and their transcripts in a shared embedding space, enabling bi-directional retrieval (Audio→Text and Text→Audio). It uses a frozen Whisper-large encoder for audio (with a lightweight trainable adapter) and a trainable Nomic-embed-text-v1.5 encoder for text.
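
As a rough illustration of the setup described above (not the repository's actual code), the sketch below shows a CLIP-style symmetric contrastive loss over pooled audio and text embeddings, with a small trainable adapter projecting frozen audio-encoder features into the shared space. The dimensions, adapter design, and pooling choice are assumptions made for this example.

```python
# Minimal sketch of a CLIP-style audio-text contrastive objective.
# Dimensions, adapter design, and pooling are illustrative assumptions,
# not the repository's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioAdapter(nn.Module):
    """Lightweight trainable adapter that projects frozen audio-encoder
    features into the shared embedding space."""

    def __init__(self, audio_dim: int = 1280, embed_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, audio_dim) from the frozen audio encoder
        pooled = audio_feats.mean(dim=1)  # simple mean pooling over time
        return F.normalize(self.proj(pooled), dim=-1)


def clip_contrastive_loss(audio_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: matching (audio, text) pairs lie on the
    diagonal of the similarity matrix."""
    logits = audio_emb @ text_emb.t() / temperature   # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_a2t = F.cross_entropy(logits, targets)       # Audio -> Text
    loss_t2a = F.cross_entropy(logits.t(), targets)   # Text -> Audio
    return 0.5 * (loss_a2t + loss_t2a)


if __name__ == "__main__":
    batch, time, audio_dim, embed_dim = 8, 1500, 1280, 768
    adapter = AudioAdapter(audio_dim, embed_dim)

    # Stand-ins for the frozen audio encoder's output and the text
    # embeddings produced by the trainable text encoder.
    frozen_audio_feats = torch.randn(batch, time, audio_dim)
    text_emb = F.normalize(torch.randn(batch, embed_dim), dim=-1)

    audio_emb = adapter(frozen_audio_feats)
    loss = clip_contrastive_loss(audio_emb, text_emb)
    print(f"contrastive loss: {loss.item():.4f}")
```

At inference time, retrieval in either direction reduces to ranking the cosine similarities between a query embedding and the pre-computed embeddings on the other side.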