SmolDocling is an ultra-compact vision-language model designed specifically for efficient document conversion. Built on the Idefics3 architecture, it delivers strong document understanding with only 256M parameters. It can extract a range of document elements from page images, including text, tables, formulas, and code, and is fully compatible with the Docling ecosystem.
Tags: Multimodal, Transformers, English