Recently, Meta AI, in collaboration with the University of California, San Diego (UCSD), introduced a new technique called Deep Think with Confidence (DeepConf), which helps enterprises cut computing costs while maintaining high accuracy on complex reasoning tasks with large language models (LLMs).
Today, improving LLM reasoning often relies on the self-consistency strategy of sampling many reasoning paths and taking a majority vote. But compute costs grow rapidly with the number of samples, and low-quality reasoning paths can still drag the vote toward a wrong answer. DeepConf's innovation is that it no longer treats all reasoning paths equally: it filters paths and weights their votes using the model's own internal confidence signals.
DeepConf introduces several fine-grained confidence metrics (sketched in code after this list), such as:
Group Confidence: the average token confidence over a sliding window of consecutive tokens in the reasoning trace;
Tail Confidence: the average confidence over the final tokens, where the answer is produced;
Lowest Group Confidence: the confidence of the most "fragile" segment of the reasoning path;
Bottom-10% Confidence: the average confidence over the least confident 10% of groups.
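For illustration, here is a minimal Python sketch of how these metrics can be computed from per-token confidence scores (following the paper's definition, a token's confidence is the negative mean log-probability of the top-k candidate tokens at that step). The window size and tail length are assumptions chosen for readability, not the paper's exact settings:

```python
import numpy as np

def group_confidences(token_conf, window=256):
    """Sliding-window group confidence: mean token confidence over each
    window of `window` consecutive tokens (window size is illustrative)."""
    tc = np.asarray(token_conf, dtype=float)
    if len(tc) < window:
        return np.array([tc.mean()])
    cs = np.concatenate(([0.0], np.cumsum(tc)))   # O(n) sliding means
    return (cs[window:] - cs[:-window]) / window

def tail_confidence(token_conf, tail=128):
    """Average confidence over the last `tail` tokens, where the final
    answer is typically produced."""
    return float(np.mean(token_conf[-tail:]))

def lowest_group_confidence(token_conf, window=256):
    """Confidence of the single weakest segment of the reasoning path."""
    return float(group_confidences(token_conf, window).min())

def bottom_10pct_confidence(token_conf, window=256):
    """Average confidence over the least confident 10% of groups."""
    g = np.sort(group_confidences(token_conf, window))
    k = max(1, int(round(len(g) * 0.10)))
    return float(g[:k].mean())
```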
DeepConf supports two execution modes (both sketched below):
Offline Thinking: first generate multiple complete reasoning paths, then keep the most confident ones for plain or confidence-weighted voting;
Online Thinking: evaluate confidence in real time during generation; if the current path's confidence falls below a threshold, it is terminated immediately to save resources.
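A hedged sketch of both modes, assuming each completed trace has already been scored with one of the metrics above; the filtering fraction, window size, and the `next_token_confidence` generator are hypothetical placeholders, not the paper's exact mechanism:

```python
from collections import defaultdict

def offline_vote(traces, keep_frac=0.5):
    """Offline mode: keep the most confident fraction of finished traces,
    then run a confidence-weighted majority vote over their answers.
    `traces` is a list of (answer, confidence) pairs; `keep_frac` is an
    illustrative filtering ratio."""
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_frac))]
    votes = defaultdict(float)
    for answer, conf in kept:
        votes[answer] += conf            # each vote weighted by confidence
    return max(votes, key=votes.get)

def online_trace(next_token_confidence, threshold, window=256, budget=4096):
    """Online mode: stream per-token confidences from a (hypothetical)
    generator and abort as soon as the sliding-window confidence drops
    below `threshold`, saving the rest of the token budget."""
    recent, used = [], 0
    for conf in next_token_confidence:
        used += 1
        recent.append(conf)
        if len(recent) > window:
            recent.pop(0)
        if len(recent) == window and sum(recent) / window < threshold:
            return "terminated", used    # low-confidence path, stop early
        if used >= budget:
            break
    return "completed", used
```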
Across several open-source models (DeepSeek-8B, Qwen3-32B, GPT-OSS-120B) and hard mathematical and STEM reasoning benchmarks (AIME, HMMT, BRUMO25, GPQA-Diamond), DeepConf shows impressive results:
In offline mode with GPT-OSS-120B, accuracy on AIME2025 reached 99.9% while generating 84.7% fewer tokens than standard majority voting;
In online mode, DeepSeek-8B gained 5.8 percentage points of accuracy on AIME24 while using 77.9% fewer tokens.
Enterprises can choose between two presets based on their risk preferences (a threshold sketch follows this list):
DeepConf-high (conservative mode): typically cuts generation cost by about 50% with almost no loss in accuracy, suitable for high-stakes domains such as finance and law;
DeepConf-low (aggressive mode): saves 70%–85% of tokens, suitable for speed-sensitive scenarios with looser error tolerance, such as draft answers and knowledge retrieval.
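One plausible way to realize the two presets, assuming the online stopping threshold is calibrated on a small warm-up set of completed traces; the percentile mapping reflects our reading of the paper's settings and is used here purely illustratively:

```python
import numpy as np

def stopping_threshold(warmup_confs, mode="high"):
    """DeepConf-high terminates only clearly weak traces (roughly keeping
    the top 90% of the warm-up confidence distribution); DeepConf-low
    filters aggressively (roughly the top 10%). Values are illustrative."""
    keep_top = 0.90 if mode == "high" else 0.10
    return float(np.percentile(warmup_confs, 100 * (1 - keep_top)))
```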
Using DeepConf does not require retraining the model; it only adds a small amount of logic at inference time. It is also broadly compatible and integrates seamlessly with existing inference frameworks such as vLLM, SGLang, and TensorRT-LLM (a vLLM sketch follows). As the researchers put it, this offers a "plug-and-play" efficient solution for deploying LLM reasoning workloads in real-world enterprise settings.
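To make the "plug-and-play" claim concrete, here is a minimal offline-mode sketch against vLLM's public Python API (`LLM` and `SamplingParams` with `logprobs` are real API surface; the model choice, placeholder prompt, and scoring details are assumptions, and extracting the final answer from the generated text is task-specific and omitted):

```python
from collections import defaultdict
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-32B")    # one of the models from the article
params = SamplingParams(n=8, temperature=0.7, max_tokens=2048, logprobs=5)

out = llm.generate(["Solve: ..."], params)[0]   # placeholder prompt
votes = defaultdict(float)
for seq in out.outputs:
    # Token confidence = negative mean logprob of the top-k candidates at
    # each step; a whole-trace mean stands in for the group/tail variants
    # sketched earlier.
    conf = sum(
        -sum(lp.logprob for lp in step.values()) / len(step)
        for step in seq.logprobs
    ) / len(seq.logprobs)
    votes[seq.text.strip()] += conf   # confidence-weighted vote
print(max(votes, key=votes.get))
```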
Paper: https://arxiv.org/abs/2508.15260
Key Points:
🧠 Confidence-Guided Selection: DeepConf filters or ranks reasoning paths based on local confidence (group, tail, lowest point, etc.), rather than using a one-size-fits-all majority voting approach.
⏱ Significantly Improved Efficiency: Achieves up to 99.9% reasoning accuracy, while reducing the number of generated tokens by as much as 84.7%.
🎛️ Adjustable Strategy Modes: Enterprises can choose between "high security" and "high efficiency" modes based on their risk preferences, achieving strong results with minimal resources.