IBM has released Granite-Docling-258M, an open-source vision-language model designed for end-to-end document conversion. Unlike traditional OCR (optical character recognition) pipelines, Granite-Docling preserves document layout: it extracts elements such as tables, code, formulas, lists, and headings and emits a structured, machine-readable format rather than flattened Markdown. The model is available on Hugging Face, where users can try a live demo and download an MLX build optimized for Apple Silicon.

Granite-Docling is an improved version of SmolDocling-256M. IBM reworked the original architecture, adopting the Granite 165M language model and upgrading the vision encoder to SigLIP2 while keeping the Idefics3-style connector. These changes bring the parameter count to 258M and significantly improve performance on layout analysis, full-page OCR, code, formulas, and tables. IBM has also resolved instabilities seen in the preview model, such as repeated-token loops.

Granite-Docling uses an Idefics3-based architecture and was trained with the nanoVLM framework. Its output format, DocTags, is a markup language developed by IBM that explicitly represents document structure, including elements, their coordinates, and the relationships between them, so downstream tools can convert it to Markdown, HTML, or JSON. This structured output preserves table topology and keeps mathematical formulas, code blocks, and headings in their original order, which improves indexing quality and retrieval.
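To make the idea of a structure-preserving markup concrete, here is a minimal Python sketch of how a downstream tool might turn tagged output into Markdown or JSON. The tag names (`title`, `heading`, `text`, `code`), the four `<loc_N>` bounding-box tokens, and the helper functions are hypothetical simplifications chosen for illustration; they are not the actual DocTags vocabulary or IBM's converter.

```python
import json
import re

# Hypothetical, simplified tag set for illustration only.
# The real DocTags format defines its own tags and location tokens.
TAG_RE = re.compile(
    r"<(?P<tag>title|heading|text|code)>"
    r"(?P<loc>(?:<loc_\d+>){4})"
    r"(?P<body>.*?)"
    r"</(?P=tag)>",
    re.DOTALL,
)
LOC_RE = re.compile(r"<loc_(\d+)>")


def parse_elements(markup):
    """Return elements in document order with their type, bounding box, and text."""
    elements = []
    for match in TAG_RE.finditer(markup):
        bbox = [int(n) for n in LOC_RE.findall(match.group("loc"))]
        elements.append(
            {"type": match.group("tag"), "bbox": bbox, "text": match.group("body").strip()}
        )
    return elements


def to_markdown(elements):
    """Render parsed elements as Markdown, keeping their original order."""
    blocks = []
    for el in elements:
        if el["type"] == "title":
            blocks.append("# " + el["text"])
        elif el["type"] == "heading":
            blocks.append("## " + el["text"])
        elif el["type"] == "code":
            # Indented code block: four spaces in front of every line.
            blocks.append("    " + el["text"].replace("\n", "\n    "))
        else:
            blocks.append(el["text"])
    return "\n\n".join(blocks)


def to_json(elements):
    """Serialize the same elements, coordinates included, as JSON."""
    return json.dumps(elements, indent=2)


sample = (
    "<title><loc_10><loc_12><loc_490><loc_40>Quarterly Report</title>"
    "<text><loc_10><loc_50><loc_490><loc_90>Revenue grew in Q3.</text>"
    "<code><loc_10><loc_100><loc_490><loc_130>print('hello')</code>"
)

elements = parse_elements(sample)
markdown = to_markdown(elements)
as_json = to_json(elements)
```

Because each element carries its type and coordinates, the same parsed list can feed several output formats without re-running the model, which is the practical benefit of emitting structure instead of flat text.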

Granite-Docling also adds experimental support for Japanese, Arabic, and Chinese, although English remains the primary target. IBM recommends using the model together with Docling, whose CLI and SDK can automatically convert PDFs, office documents, and images into multiple formats. The model runs with Transformers, vLLM, ONNX, and MLX, with the MLX version specifically optimized for Apple Silicon.

The release is a further step forward for enterprise document AI. By combining IBM's Granite language model, the SigLIP2 vision encoder, and the nanoVLM training framework, Granite-Docling stays lightweight at 258M parameters while handling tables, formulas, code, and multilingual text. Overall, it offers a practical option for accurate document conversion and retrieval-oriented workflows.

Hugging Face: https://huggingface.co/collections/ibm-granite/granite-docling-682b8c766a565487bcb3ca00

Key Points:

🌟 Granite-Docling-258M converts documents end to end while preserving layout information such as tables, code, formulas, lists, and headings.

🔧 It pairs the Granite 165M language model with the SigLIP2 vision encoder and improves on its predecessor, SmolDocling-256M, in layout analysis, full-page OCR, code, formulas, and tables.

🌍 It adds experimental support for Japanese, Arabic, and Chinese, with English remaining the primary target.