AIbase

VocalText-Contrastive-Embedding

Public

This repository features a CLIP-inspired contrastive model that aligns audio signals and their transcripts in a shared embedding space, enabling bidirectional retrieval (Audio→Text and Text→Audio). The audio branch uses a frozen Whisper-large encoder with a lightweight trainable adapter; the text branch uses a trainable Nomic-embed-text-v1.5 model.
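To make the alignment idea concrete, here is a minimal pure-Python sketch of a CLIP-style symmetric contrastive objective and the retrieval step it enables. It is not taken from the repository: the function names, the toy 2-D embeddings in the usage note, and the temperature of 0.07 are all assumptions for illustration, and the actual training code may differ. In the described setup the audio vectors would come from the frozen Whisper-large encoder plus adapter and the text vectors from Nomic-embed-text-v1.5.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(audio_embs, text_embs):
    """Cosine similarity between every audio embedding (rows) and text embedding (columns)."""
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(x * y for x, y in zip(ai, tj)) for tj in t] for ai in a]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of the Audio→Text and Text→Audio losses over a batch of matched pairs.

    temperature=0.07 is an assumed value, not one stated by the repository.
    """
    sim = similarity_matrix(audio_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for Text→Audio
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


def retrieve(query_emb, candidate_embs):
    """Return the index of the candidate most similar to the query."""
    sims = similarity_matrix([query_emb], candidate_embs)[0]
    return max(range(len(sims)), key=sims.__getitem__)
```

Pair i in the audio list is assumed to match pair i in the text list, so the loss pushes the diagonal of the similarity matrix up and the off-diagonal entries down. With toy inputs such as `audio = [[1.0, 0.0], [0.0, 1.0]]` and `text = [[0.9, 0.1], [0.2, 0.8]]`, the loss is lower for the matched ordering than for a shuffled one, and `retrieve(text[1], audio)` picks the second audio clip.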

Created: 2025-07-22T22:19:07
Updated: 2025-07-23T08:54:51
Stars: 1
Stars Increase: 0