Joint research from Microsoft and Salesforce has found that even the most advanced AI language models suffer from severe reliability problems in long conversations. When users reveal their requirements progressively rather than all at once, average system performance drops by 39%, a serious warning for the practical deployment of AI assistants.

Simulating Real Conversations Reveals Performance Flaws

The research team devised a testing method called "sharding" to simulate how users gradually clarify their requirements over the course of a real conversation. Instead of supplying complete information in a single prompt, the method breaks each task into multiple steps, closely mirroring actual usage scenarios.
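To make the setup concrete, here is a minimal sketch of what such a sharded conversation loop could look like. The task text, the shard split, and the `send_to_model` stub are illustrative assumptions, not the study's actual test harness:

```python
# Hypothetical sketch of a "sharded" conversation: rather than sending the
# full task at once, each shard of the instruction is revealed in its own
# turn, and the model answers after every turn.

def send_to_model(messages):
    """Stand-in for any chat-completion API call; returns the reply text."""
    return f"[model reply given {len(messages)} messages]"  # replace with a real client

full_instruction = (
    "Write a Python function that deduplicates a list of user records, "
    "keeps the most recent entry per email, and returns them sorted by name."
)

# The same task, split into shards revealed one turn at a time.
shards = [
    "Write a Python function that deduplicates a list of user records.",
    "Records with the same email are duplicates; keep the most recent one.",
    "Return the result sorted by name.",
]

messages = []
for shard in shards:
    messages.append({"role": "user", "content": shard})
    reply = send_to_model(messages)  # the model commits to an answer early
    messages.append({"role": "assistant", "content": reply})

# Single-turn baseline for comparison: all information up front.
baseline = send_to_model([{"role": "user", "content": full_instruction}])
```

The single-turn baseline at the end corresponds to the condition where models scored around 90%; the sharded loop is where the decline described below appears.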

The results were striking: model accuracy plummeted from around 90% to just 51%. All 15 models tested, from the small open-source Llama-3.1-8B to large commercial systems such as GPT-4o, showed the same sharp decline.

Each experiment involved 90 to 120 instructions drawn from high-quality datasets, each broken down into smaller subtasks.

Top Models Also Affected

Even the top models in the study, Claude 3.7 Sonnet, Gemini 2.5 Pro, and GPT-4.1, performed 30% to 40% worse in multi-round dialogues than in single-round ones. More worryingly, their consistency dropped sharply, with a 50-point gap between their best and worst performances on the same task.

Four Key Issues Emerge

The research identified four core problems in AI models during multi-round dialogues:

  • Premature Conclusions: Attempting a full answer before obtaining all the necessary information.
  • Overreliance on History: Over-trusting their own previous responses, even when those contain errors.
  • Information Neglect: Overlooking critical details supplied mid-conversation.
  • Excessive Detail: Producing overly verbose answers that fill information gaps with unfounded assumptions.

Technical Optimizations Yield Little Effect

To improve reliability, the research team tried various technical tweaks, including lowering the model's temperature setting to reduce randomness and having the AI repeat the user's instructions. None of these optimizations produced a significant effect.
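For concreteness, here is a minimal sketch of what those two attempted mitigations might look like in code, written against an OpenAI-style chat API. The model name, recap wording, and helper functions are illustrative assumptions, not the study's implementation:

```python
# Two mitigations the study reports trying: (1) temperature 0 to remove
# sampling randomness, (2) asking the model to restate all user
# instructions before answering. Per the study, neither helped much.
from openai import OpenAI

client = OpenAI()

def ask(messages, temperature=0.0):
    """Query the model with low temperature to minimize randomness."""
    resp = client.chat.completions.create(
        model="gpt-4o",            # assumed model; any chat model fits
        messages=messages,
        temperature=temperature,   # 0.0 = near-deterministic decoding
    )
    return resp.choices[0].message.content

def ask_with_recap(messages):
    """Have the model restate every requirement so far, then answer."""
    recap = {
        "role": "user",
        "content": ("Before answering, restate all of my requirements "
                    "from this conversation, then solve the task."),
    }
    return ask(messages + [recap])
```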

The study found that changing the amount of detail provided at each step also made no difference. The only reliable solution is to provide all necessary information at the start of the conversation.

Large language models often "get lost" in multi-step, underspecified dialogues, leading to a significant drop in performance.

Divergence Between Capability and Reliability

The performance decline has two components: baseline capability falls by only about 16%, while unreliability surges by 112%. In single-round tasks, more capable models are generally more reliable; in multi-round dialogues, all models perform comparably poorly, regardless of their baseline skill level.
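One plausible way to separate the two components is to treat capability as a model's best-case score and unreliability as the spread between its best and worst runs on the same task. The sketch below uses 90th/10th percentiles for this split; those exact percentiles are an assumption for illustration, as are the sample scores:

```python
# Sketch of percentile-based capability/unreliability metrics.
from statistics import quantiles

def capability(scores):
    """Best-case skill: ~90th percentile of per-run scores."""
    return quantiles(scores, n=10)[8]

def unreliability(scores):
    """Spread between best and worst runs: ~90th minus ~10th percentile."""
    deciles = quantiles(scores, n=10)
    return deciles[8] - deciles[0]

# Made-up scores showing the qualitative pattern the study describes:
single_turn = [88, 90, 92, 91, 89, 90, 93, 88, 91, 90]  # high and tight
multi_turn  = [85, 40, 72, 55, 90, 33, 68, 47, 80, 51]  # similar peak, wide spread

print(capability(single_turn), unreliability(single_turn))
print(capability(multi_turn), unreliability(multi_turn))
```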

Practical Response Strategies

Based on the research findings, experts proposed two practical recommendations:

For Users: When a conversation drifts off track, it is better to start a fresh one than to try to steer it back. Before restarting, ask the AI to summarize all the requirements gathered so far and use that summary as the opening prompt of the new conversation.
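A minimal sketch of that summarize-then-restart strategy, reusing the hypothetical `ask` helper from the earlier sketch (the summary wording is an assumption):

```python
def summarize_and_restart(old_messages, ask):
    """Consolidate a drifting conversation, then reopen it in one turn."""
    summary_request = {
        "role": "user",
        "content": ("Summarize every requirement I have stated in this "
                    "conversation as one complete task description."),
    }
    consolidated = ask(old_messages + [summary_request])

    # Fresh conversation with all information in the first turn -- the one
    # condition under which the study saw no performance drop.
    return ask([{"role": "user", "content": consolidated}])
```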

For Developers: Reliability in multi-round dialogues deserves far more attention. Future models need to maintain stable performance even when instructions arrive incomplete, rather than depending on special prompting techniques or parameter adjustments.

Industry Impact and Outlook

This study highlights a major challenge facing AI assistants in real-world applications. Because users typically express their needs through progressive conversation, reliability problems of this kind can significantly undermine both user experience and the practical value of AI systems.

The researchers emphasized that reliability matters as much as raw capability, especially for real-world AI assistants handling complex, multi-step interactions. The finding points to an important direction for improvement across the AI industry.