Andrej Karpathy, former Director of AI at Tesla and a founding member of OpenAI, recently commented on the open-source DeepSeek-OCR paper on X, presenting a thought-provoking idea: compared to traditional text input, using images as input for large language models (LLMs) might be more efficient. This perspective has sparked discussion within the AI research community about the future direction of model input methods.
Karpathy argues that the widely used text-token input method may be wasteful and inefficient, and that future research should perhaps shift toward image input. He outlined several potential advantages of image input over text input.
Firstly, there is an improvement in information compression. When text is rendered into an image, the same content can be conveyed with fewer visual tokens than the equivalent text tokens, because one image patch can cover several characters, whereas traditional tokenization spends a token on every character or subword. This compression could significantly improve model efficiency and reduce computational cost when handling large contexts.
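As a rough illustration of the arithmetic, here is a back-of-the-envelope sketch in Python; the page size, character count, patch size, and compression factor are illustrative assumptions for a dense rendered page, not figures from the DeepSeek-OCR paper.

```python
# Back-of-the-envelope comparison of text tokens vs. visual tokens for one rendered page.
# All concrete numbers below are illustrative assumptions.

def text_token_estimate(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Typical BPE tokenizers average roughly 3-5 characters per token for English text."""
    return round(num_chars / chars_per_token)

def raw_patch_count(width_px: int, height_px: int, patch_size: int = 16) -> int:
    """A ViT-style encoder splits the image into patch_size x patch_size patches,
    one token per patch before any further compression."""
    return (width_px // patch_size) * (height_px // patch_size)

def compressed_visual_tokens(width_px: int, height_px: int,
                             patch_size: int = 16, downsample: int = 16) -> int:
    """OCR-oriented encoders typically compress patch features further (e.g. pooling or
    a learned projector); the downsample factor here is an assumption for illustration."""
    return raw_patch_count(width_px, height_px, patch_size) // downsample

chars = 3000                                   # a dense page of text (assumed)
print(text_token_estimate(chars))              # ~750 text tokens
print(raw_patch_count(896, 1280))              # 4480 raw patches
print(compressed_visual_tokens(896, 1280))     # 280 visual tokens after assumed 16x compression
```

Note that raw patches alone are not cheaper than text tokens; the claimed savings come from the compressed representation the visual encoder hands to the LLM.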
Secondly, there is richer information expression. Image input naturally captures bold, color, font size, layout, and other visual elements. These formatting details are either lost in plain-text input or must be represented through additional markup (such as Markdown), which increases token consumption. Feeding the model images directly lets it perceive a document's visual structure and emphasis without any extra markup.
Thirdly, there is room to optimize the attention mechanism. Image input can use bidirectional attention, whereas text generation usually relies on autoregressive causal attention. Bidirectional attention lets the model attend to every position in the context at once, which typically yields stronger comprehension than the strictly left-to-right view imposed by causal decoding, and it avoids some inherent limitations of autoregressive text processing.
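To make the contrast concrete, here is a minimal PyTorch sketch of the two mask patterns: an encoder over image patches can use the full (bidirectional) mask, while a text decoder is restricted to the causal one. The sequence length and scores are arbitrary.

```python
import torch

seq_len = 5

# Causal (autoregressive) mask: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# Bidirectional mask: every position may attend to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# In scaled dot-product attention, disallowed positions are set to -inf before softmax.
scores = torch.randn(seq_len, seq_len)
causal_attn = torch.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)
bidi_attn = torch.softmax(scores, dim=-1)

print(causal_mask.int())   # lower-triangular: each row only sees its prefix
print(causal_attn)         # rows normalized over the visible prefix
print(bidi_attn)           # every row is normalized over the full context
```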
Karpathy particularly criticized the complexity of tokenizers. He believes that tokenizers are a non-end-to-end legacy module that introduces unnecessary complexity. For example, visually identical characters may be mapped to different tokens due to different Unicode encodings, causing the model to have different interpretations of seemingly identical inputs. Removing the tokenizer and directly processing images would make the entire system more concise and unified.
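The Unicode point is easy to reproduce with nothing but the standard library: the two strings below render identically, yet they are different code-point sequences, so any tokenizer that operates on raw characters or bytes will treat them differently unless it normalizes first.

```python
import unicodedata

a = "café"         # precomposed: ends in U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
b = "cafe\u0301"   # decomposed:  ends in U+0065 + U+0301 (COMBINING ACUTE ACCENT)

print(a == b)                                  # False: different code-point sequences
print([hex(ord(c)) for c in a])
print([hex(ord(c)) for c in b])
print(unicodedata.normalize("NFC", b) == a)    # True only after explicit normalization

# A tokenizer built on these raw sequences maps the two visually identical strings to
# different token ids unless it normalizes; a pixel-level encoder sees the same glyphs.
```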
From a technical implementation perspective, Karpathy's view is based on the maturity of visual encoders. Architectures like Vision Transformers can already efficiently process image inputs, and models like DeepSeek-OCR have demonstrated that visual-to-text conversion can achieve high accuracy. Extending this capability to all text processing tasks is technically feasible.
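For reference, the front end of such a pipeline is conceptually simple: a ViT-style patch embedding that turns a rendered page into a sequence of "visual tokens" an LLM could consume. The sketch below is a generic illustration under assumed sizes, not DeepSeek-OCR's actual encoder.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal ViT-style patch embedding: split the image into non-overlapping patches
    and project each patch to the model dimension."""
    def __init__(self, patch_size: int = 16, in_channels: int = 3, dim: int = 768):
        super().__init__()
        # Standard trick: a convolution with kernel_size = stride = patch_size.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.proj(images)                  # (B, dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim): one token per patch

page = torch.randn(1, 3, 1280, 896)            # a rendered text page (illustrative size)
tokens = PatchEmbed()(page)
print(tokens.shape)                            # torch.Size([1, 4480, 768])
```

In a full system these patch tokens would pass through transformer layers (and, for OCR-style compression, further downsampling) before being fed to the language model.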
However, Karpathy also pointed out an asymmetry: although user input can be an image, model output still needs to remain text, since generating images as output is far less mature and practical than generating text. This means that even with image input, the model architecture still needs to support text generation and cannot completely abandon text processing capabilities.
The idea has prompted debate on several fronts. From an efficiency standpoint, if image input indeed improves information density, it would offer significant advantages for processing long documents and large contexts. From a unification perspective, image input could bring document understanding, OCR, and multimodal question answering under a single framework, simplifying model architecture.
However, image input also faces challenges. First, computational cost: although the information density is higher, the overhead of image encoding may offset some of the gains. Second, editability: plain text is easy to edit and manipulate, while "text" in image form loses this flexibility during subsequent processing. Third, ecosystem compatibility: most existing text data and toolchains are built around character/token representations, and fully transitioning to image input would require rebuilding much of that ecosystem.
From a research direction perspective, Karpathy's viewpoint suggests an interesting possibility: as visual model capabilities improve, traditional "language models" may evolve into more general "information processing models," where text is just one form of information presentation, not the only input representation. This transformation may blur the boundaries between language models and multimodal models.
The DeepSeek-OCR paper became the catalyst for this discussion, indicating that OCR tasks have evolved from simple character recognition to deeper document understanding. If OCR models can accurately understand various formats and layouts of text, it is conceptually reasonable to consider all text tasks as "visual understanding" tasks.
Karpathy's half-joking remark, "I need to control myself from immediately developing a chatbot that only supports image input," expresses interest in the idea while also hinting at the complexity of practical implementation. Such a radical architectural shift would require extensive experimental validation to prove its effectiveness across tasks, while addressing the practical challenges mentioned above.
From an industry application perspective, even if image input is ultimately proven superior, the transition will be gradual. A more likely path is a hybrid approach: image input in scenarios where visual formatting must be preserved, and text input where flexible editing and composition are required. This hybrid strategy balances the strengths of both representations.
In summary, Karpathy's perspective presents a research direction worth further exploration, challenging the conventional assumption that text tokens are the standard input for language models. Regardless of whether this vision is fully realized, it provides a new perspective for thinking about the optimization of model input representations, potentially giving rise to a new generation of more efficient and unified AI architectures.