Learns a united visual representation through alignment before projection.
A large-scale vision-language model built on the Vision Transformer architecture, supporting cross-modal understanding between visual inputs and text.
LanguageBind
Video-LLaVA is an open-source multimodal model trained by fine-tuning a large language model on multimodal instruction-following data, capable of visual reasoning over both images and videos.
Video-LLaVA is a multimodal model that unifies visual representations by learning alignment before projection, enabling it to handle visual reasoning tasks for both images and videos.
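The idea can be made concrete with a small sketch: because LanguageBind-style encoders align image and video features into a shared space before the projection step, a single shared projector can map both modalities into the language model's embedding space. The class name, layer sizes, and two-layer MLP shape below are illustrative assumptions, not the released Video-LLaVA implementation.

```python
# A minimal conceptual sketch of "alignment before projection"; names and
# dimensions are illustrative assumptions, not the official code.
import torch
import torch.nn as nn

class SharedVisualProjector(nn.Module):
    """Projects pre-aligned image or video features into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Because image and video features are aligned *before* this step,
        # one projector can serve both modalities.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_patches, vision_dim), from either the
        # image branch or the frame-wise video branch of the encoder.
        return self.proj(visual_features)

# Illustrative use: one projector handles both an image and an 8-frame clip.
projector = SharedVisualProjector()
image_feats = torch.randn(1, 256, 1024)      # one image, 256 patch features
video_feats = torch.randn(1, 8 * 256, 1024)  # eight frames, flattened
image_tokens = projector(image_feats)        # -> (1, 256, 4096)
video_tokens = projector(video_feats)        # -> (1, 2048, 4096)
# The projected visual tokens are combined with text embeddings and fed to the LLM.
```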