Andrej Karpathy, former Director of AI at Tesla and a founding member of OpenAI, recently commented on the open-source DeepSeek-OCR paper on X, presenting a thought-provoking idea: compared to traditional text input, using images as input for large language models (LLMs) might be more efficient. This perspective has sparked discussion within the AI research community about the future direction of model input methods.
Karpathy argues that the widely used text-token input method may be wasteful and inefficient, and that future research should perhaps shift toward image input. He outlined several potential advantages of image input over text input.
Firstly, there is an improvement in information compression. When text is rendered into an image, the same content can be conveyed with fewer visual tokens than the equivalent text tokens, because one image patch can cover several characters, whereas traditional tokenization spends a token on every character or subword. This compression could significantly improve model efficiency and reduce computational cost when handling large contexts.
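As a rough illustration of the arithmetic, here is a back-of-the-envelope sketch in Python; the page size, character count, patch size, and compression factor are illustrative assumptions for a dense rendered page, not figures from the DeepSeek-OCR paper.

```python
# Back-of-the-envelope comparison of text tokens vs. visual tokens for one rendered page.
# All concrete numbers below are illustrative assumptions.

def text_token_estimate(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Typical BPE tokenizers average roughly 3-5 characters per token for English text."""
    return round(num_chars / chars_per_token)

def raw_patch_count(width_px: int, height_px: int, patch_size: int = 16) -> int:
    """A ViT-style encoder splits the image into patch_size x patch_size patches,
    one token per patch before any further compression."""
    return (width_px // patch_size) * (height_px // patch_size)

def compressed_visual_tokens(width_px: int, height_px: int,
                             patch_size: int = 16, downsample: int = 16) -> int:
    """OCR-oriented encoders typically compress patch features further (e.g. pooling or
    a learned projector); the downsample factor here is an assumption for illustration."""
    return raw_patch_count(width_px, height_px, patch_size) // downsample

chars = 3000                                   # a dense page of text (assumed)
print(text_token_estimate(chars))              # ~750 text tokens
print(raw_patch_count(896, 1280))              # 4480 raw patches
print(compressed_visual_tokens(896, 1280))     # 280 visual tokens after assumed 16x compression
```

Note that raw patches alone are not cheaper than text tokens; the claimed savings come from the compressed representation the visual encoder hands to the LLM.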
Secondly, there is richer information expression. Image input naturally captures bold, color, font size, layout, and other visual elements. These formatting details are either lost in plain-text input or must be represented through additional markup (such as Markdown), which increases token consumption. Feeding the model images directly lets it perceive a document's visual structure and emphasis without any extra markup.
Thirdly, there is room to optimize the attention mechanism. Image input can use bidirectional attention, whereas text generation usually relies on autoregressive causal attention. Bidirectional attention lets the model attend to every position in the context at once, which typically yields stronger comprehension than the strictly left-to-right view imposed by causal decoding, and it avoids some inherent limitations of autoregressive text processing.
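To make the contrast concrete, here is a minimal PyTorch sketch of the two mask patterns: an encoder over image patches can use the full (bidirectional) mask, while a text decoder is restricted to the causal one. The sequence length and scores are arbitrary.

```python
import torch

seq_len = 5

# Causal (autoregressive) mask: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# Bidirectional mask: every position may attend to every other position.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# In scaled dot-product attention, disallowed positions are set to -inf before softmax.
scores = torch.randn(seq_len, seq_len)
causal_attn = torch.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)
bidi_attn = torch.softmax(scores, dim=-1)

print(causal_mask.int())   # lower-triangular: each row only sees its prefix
print(causal_attn)         # rows normalized over the visible prefix
print(bidi_attn)           # every row is normalized over the full context
```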
Karpathy particularly criticized the complexity of tokenizers. He believes that tokenizers are a non-end-to-end legacy module that introduces unnecessary complexity. For example, visually identical characters may be mapped to different tokens due to different Unicode encodings, causing the model to have different interpretations of seemingly identical inputs. Removing the tokenizer and directly processing images would make the entire system more concise and unified.
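The Unicode point is easy to reproduce with nothing but the standard library: the two strings below render identically, yet they are different code-point sequences, so any tokenizer that operates on raw characters or bytes will treat them differently unless it normalizes first.

```python
import unicodedata

a = "café"         # precomposed: ends in U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
b = "cafe\u0301"   # decomposed:  ends in U+0065 + U+0301 (COMBINING ACUTE ACCENT)

print(a == b)                                  # False: different code-point sequences
print([hex(ord(c)) for c in a])
print([hex(ord(c)) for c in b])
print(unicodedata.normalize("NFC", b) == a)    # True only after explicit normalization

# A tokenizer built on these raw sequences maps the two visually identical strings to
# different token ids unless it normalizes; a pixel-level encoder sees the same glyphs.
```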
From a technical implementation perspective, Karpathy's view is based on the maturity of visual encoders. Architectures like Vision Transformers can already efficiently process image inputs, and models like DeepSeek-OCR have demonstrated that visual-to-text conversion can achieve high accuracy. Extending this capability to all text processing tasks is technically feasible.
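For reference, the front end of such a pipeline is conceptually simple: a ViT-style patch embedding that turns a rendered page into a sequence of "visual tokens" an LLM could consume. The sketch below is a generic illustration under assumed sizes, not DeepSeek-OCR's actual encoder.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal ViT-style patch embedding: split the image into non-overlapping patches
    and project each patch to the model dimension."""
    def __init__(self, patch_size: int = 16, in_channels: int = 3, dim: int = 768):
        super().__init__()
        # Standard trick: a convolution with kernel_size = stride = patch_size.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.proj(images)                  # (B, dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim): one token per patch

page = torch.randn(1, 3, 1280, 896)            # a rendered text page (illustrative size)
tokens = PatchEmbed()(page)
print(tokens.shape)                            # torch.Size([1, 4480, 768])
```

In a full system these patch tokens would pass through transformer layers (and, for OCR-style compression, further downsampling) before being fed to the language model.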
However, Karpathy also pointed out an asymmetry: although user input can be an image, model output still needs to remain text, since generating images as output is far less mature and practical than generating text. This means that even with image input, the model architecture still needs to support text generation and cannot completely abandon text processing capabilities.
The idea has prompted debate on several fronts. From an efficiency standpoint, if image input indeed improves information density, it would offer significant advantages for processing long documents and large contexts. From a unification perspective, image input could bring document understanding, OCR, and multimodal question answering under a single framework, simplifying model architecture.
However, image input also faces challenges. First, computational cost: although the information density is higher, the overhead of image encoding may offset some of the gains. Second, editability: plain text is easy to edit and manipulate, while "text" in image form loses this flexibility during subsequent processing. Third, ecosystem compatibility: most existing text data and toolchains are built around character/token representations, and fully transitioning to image input would require rebuilding much of that ecosystem.
From a research direction perspective, Karpathy's viewpoint suggests an interesting possibility: as visual model capabilities improve, traditional "language models" may evolve into more general "information processing models," where text is just one form of information presentation, not the only input representation. This transformation may blur the boundaries between language models and multimodal models.
The DeepSeek-OCR paper became the catalyst for this discussion, indicating that OCR tasks have evolved from simple character recognition to deeper document understanding. If OCR models can accurately understand various formats and layouts of text, it is conceptually reasonable to consider all text tasks as "visual understanding" tasks.
Karpathy's half-joking remark, "I need to control myself from immediately developing a chatbot that only supports image input," expresses interest in the idea while also hinting at the complexity of practical implementation. Such a radical architectural shift would require extensive experimental validation to prove its effectiveness across tasks, while addressing the practical challenges mentioned above.
From an industry application perspective, even if image input is ultimately proven superior, the transition will be gradual. A more likely path is a hybrid approach: image input in scenarios where visual formatting must be preserved, and text input where flexible editing and composition are required. This hybrid strategy balances the strengths of both representations.
In summary, Karpathy's perspective presents a research direction worth further exploration, challenging the conventional assumption that text tokens are the standard input for language models. Regardless of whether this vision is fully realized, it provides a new perspective for thinking about the optimization of model input representations, potentially giving rise to a new generation of more efficient and unified AI architectures.