SmolVLM2 is a lightweight multimodal model focused on analyzing video content and generating text outputs.
SmolVLM-256M is the world's smallest multimodal model, capable of efficiently processing image and text inputs to generate text outputs.
SmolVLM-500M is a lightweight multimodal model capable of processing image and text inputs to generate text outputs.
An efficient open-source vision-language model
Mungert
SmolVLM is a compact open-source multimodal model that can accept image and text inputs and generate text outputs. It is designed for high efficiency and is suitable for device-side applications.
SmolVLM-500M-Instruct is a lightweight multimodal model in the SmolVLM series. It can process image and text inputs and generate text outputs. The model is designed for high efficiency and is suitable for device-side applications, maintaining strong performance in multimodal tasks.
Andres77872
A vision-language model specialized in describing anime-style images, fine-tuned from SmolVLM-500M-Base
mradermacher
SmolVLM2-2.2B-Instruct is a vision-language model with 2.2 billion parameters, focused on video-text-to-text tasks and supporting English.
SmolVLM2-2.2B-Instruct is a 2.2B parameter vision-language model focused on video-text-to-text tasks, supporting English.
A vision-language model specialized in describing anime-style images, fine-tuned from SmolVLM-500M-Base, trained on 180K synthetic image/caption pairs generated by large language models.
smdesai
SmolVLM2-2.2B-Instruct-4bit is a 4-bit quantized vision-language model converted to MLX format, focused on video-text-to-text tasks.
mlx-community
An MLX-format model converted from SmolVLM2-500M-Video-Instruct, supporting video-to-text tasks
This is a video-text-to-text model converted to the MLX framework, suitable for video understanding and instruction-following tasks.
This is a video-text-to-text model in MLX format, developed by HuggingFaceTB, with English language support.
HuggingFaceTB
A lightweight multimodal model designed for analyzing video content, capable of processing video, image, and text inputs to generate text outputs.
SmolVLM2-256M-Video is a lightweight multimodal model specifically designed for analyzing video content, capable of processing video, image, and text inputs to generate text outputs.
SmolVLM2-2.2B is a lightweight multimodal model designed for analyzing video content. It can process video, image, and text inputs and generate text outputs.
vidore
A visual retriever based on SmolVLM-Instruct-250M using the ColBERT strategy, capable of efficiently indexing documents through their visual features
A visual retrieval model based on SmolVLM-Instruct-500M and the ColBERT strategy, capable of efficiently indexing documents through visual features
mjschock
A vision-language model fine-tuned from HuggingFaceTB/SmolVLM-Instruct, with training accelerated using the Unsloth and TRL libraries
A visual retrieval model based on SmolVLM-Instruct and the ColBERT strategy, capable of efficiently indexing documents through visual features
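The instruct-tuned models listed above accept interleaved image and text inputs through a chat-style message format. As a rough illustration only, the sketch below assumes the message schema commonly used by Hugging Face multimodal chat templates; the exact format is defined by each model's processor, and the `build_user_turn` helper is hypothetical, not part of any model's API.

```python
# Hypothetical sketch of a chat-style user turn combining image placeholders
# with a text prompt. The dict schema here is an assumption modeled on common
# Hugging Face multimodal chat-template conventions; consult each model's
# processor documentation for the authoritative format.

def build_user_turn(question: str, num_images: int = 1) -> dict:
    """Build one user message: image placeholder(s) followed by the text prompt."""
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}

# A single-image question, as it might be passed to a processor's chat template.
messages = [build_user_turn("Describe this image.")]
```

In this layout, each `{"type": "image"}` entry marks where an image's tokens would be inserted, and the actual pixel data is supplied separately to the processor alongside the messages.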