• Philipp Schmid @_philschmid

    6 months ago

    Why Do Multi-Agent LLM Systems “still” Fail? A new study explores why multi-agent systems are not significantly outperforming single agents and identifies 14 failure modes. A multi-agent system (MAS) is a group of agents that interact, communicate, and collaborate to achieve a shared goal that would be difficult or unreliable for a single agent to accomplish.

    Benchmark:
    - Selected five popular, open-source MAS (MetaGPT, ChatDev, HyperAgent, AppWorld, AG2)
    - Chose tasks representative of each MAS's intended capabilities (software development, SWE-Bench Lite, utility service tasks, GSM-Plus), 150 tasks in total
    - Recorded the complete conversation logs, had human annotators review them (with Cohen's Kappa scores to ensure consistency and reliability), and added LLM-as-a-Judge validation

    Multi-agent failure modes:
    1. Disobey Task Spec: Ignores task rules and requirements, leading to wrong output.
    2. Disobey Role Spec: Agent acts outside its defined role and responsibilities.
    3. Step Repetition: Unnecessarily repeats steps already completed, causing delays.
    4. Loss of History: Forgets previous conversation context, causing incoherence.
    5. Unaware Stop: Fails to recognize task completion and continues unnecessarily.
    6. Conversation Reset: Dialogue unexpectedly restarts, losing context and progress.
    7. Fail to Clarify: Does not ask for needed information when instructions are unclear.
    8. Task Derailment: Gradually drifts away from the intended task objective.
    9. Withholding Info: Agent does not share important, relevant information.
    10. Ignore Input: Disregards or insufficiently considers input from others.
    11. Reasoning Mismatch: Actions do not logically follow from stated reasoning.
    12. Premature Stop: Ends the task too early, before completion or information exchange.
    13. No Verification: Lacks mechanisms to check or confirm task outcomes.
    14. Incorrect Verification: Verification process is flawed and misses critical errors.

    How to improve a multi-agent LLM system:
    - 📝 Define tasks and agent roles clearly and explicitly in prompts.
    - 🎯 Use examples in prompts to clarify expected task and role behavior.
    - 🗣️ Design structured conversation flows to guide agent interactions.
    - ✅ Implement self-verification steps in prompts for agents to check their reasoning.
    - 🧩 Design modular agents with specific, well-defined roles for simpler debugging.
    - 🔄 Redesign topology to incorporate verification roles and iterative refinement processes.
    - 🤝 Implement cross-verification mechanisms for agents to validate each other.
    - ❓ Design agents to proactively ask for clarification when needed.
    - 📜 Define structured conversation patterns and termination conditions.
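    A minimal sketch (not from the paper) of how a few of these recommendations could be wired together: explicit role specs in the prompts, a separate verifier agent, a proactive clarification hook, and a hard termination condition. The `call_llm` stub, the role prompts, and the `CLARIFY:`/`APPROVED` conventions are illustrative assumptions, not the study's implementation.

```python
# Illustrative orchestration loop: defined roles, a verifier pass, and a
# bounded number of turns. Requires Python 3.10+ for the type hints.
from dataclasses import dataclass, field


def call_llm(system_prompt: str, messages: list[dict]) -> str:
    """Placeholder: swap in whatever model client you actually use."""
    raise NotImplementedError("plug in a real model call here")


@dataclass
class Agent:
    name: str
    role_spec: str                          # explicit role + task spec in the prompt
    history: list[dict] = field(default_factory=list)

    def step(self, incoming: str) -> str:
        # The full history is passed on every turn to avoid "loss of history".
        self.history.append({"role": "user", "content": incoming})
        reply = call_llm(self.role_spec, self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply


def run_task(task: str, max_turns: int = 8) -> str | None:
    solver = Agent(
        "solver",
        "You are the SOLVER. Follow the task spec exactly. If anything is "
        "ambiguous, reply with a line starting 'CLARIFY:' instead of guessing.",
    )
    verifier = Agent(
        "verifier",
        "You are the VERIFIER. Check the solver's answer against the task "
        "spec. Reply 'APPROVED' or list the concrete problems.",
    )

    message = task
    for _ in range(max_turns):                  # hard termination condition
        answer = solver.step(message)
        if answer.startswith("CLARIFY:"):       # proactive clarification hook
            # In a real system this question would go back to the user/orchestrator.
            message = "Assume the most conservative interpretation and continue."
            continue
        verdict = verifier.step(f"Task: {task}\nAnswer: {answer}")
        if verdict.strip().startswith("APPROVED"):
            return answer                       # defined stopping point
        message = f"Your answer was rejected:\n{verdict}\nRevise it."
    return None                                 # give up instead of looping forever
```

    The fixed `max_turns` bound and the explicit `APPROVED` check are there to make "unaware stop" and "no verification" structurally impossible, rather than hoping the models avoid them on their own.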

  • Philipp Schmid @_philschmid

    6 months ago

    GitHub: github.com/multi-agent-sy… Paper: huggingface.co/papers/2503.13…

  • Julian Harris @julianharris

    6 months ago

    @_philschmid Amazing insights 🙏 Question: “Implement self-verification steps in prompts for agents to check their reasoning.” There are a few prompt methods that leap to mind: ReAct, Reflexion, Language Agent Tree Search. Does this insight mean employing one of these or similar? Or…?

  • Arpit Sharma @Arp_it1

    6 months ago

    @_philschmid We might be training these systems, but they still have a long way to go.

  • Ford Lascari @FJ000RD

    6 months ago

    @_philschmid Most people misunderstand multi-agent systems. From my experience building them, they're best viewed as tools to discover optimal processes or SOPs. Once identified, store and reuse these processes directly—no need to rerun agents each time.
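    A hedged sketch of the "discover once, then replay" pattern described above; `discover_process_with_agents`, the JSON step schema, and the cache path are made-up placeholders, not any particular framework's API.

```python
# Run the multi-agent discovery at most once, persist the resulting SOP,
# and replay the stored steps deterministically on later runs.
import json
from pathlib import Path

SOP_PATH = Path("sop_cache/weekly_report.json")


def discover_process_with_agents(goal: str) -> list[dict]:
    """Placeholder for an expensive multi-agent run that yields ordered steps."""
    return [
        {"tool": "fetch_data", "args": {"source": "warehouse"}},
        {"tool": "summarize", "args": {"max_words": 200}},
        {"tool": "send_email", "args": {"to": "team@example.com"}},
    ]


def get_sop(goal: str) -> list[dict]:
    if SOP_PATH.exists():                        # reuse the stored process
        return json.loads(SOP_PATH.read_text())
    steps = discover_process_with_agents(goal)   # run the agents only once
    SOP_PATH.parent.mkdir(parents=True, exist_ok=True)
    SOP_PATH.write_text(json.dumps(steps, indent=2))
    return steps


def replay(steps: list[dict]) -> None:
    for step in steps:                           # deterministic execution, no agents
        print(f"run {step['tool']} with {step['args']}")


if __name__ == "__main__":
    replay(get_sop("compile and send the weekly report"))
```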

  • Brian @rantlab

    6 months ago

    @_philschmid Now we need 14 more agents in the system checking for these failures.

  • François @fpaupier

    6 months ago

    @_philschmid Curious about the cross-verification approach. It seems like having a human expert in the loop (instead of an LLM) (in)validating intermediate results could steer the pipeline in a much better direction, pruning invalid "reasoning" early and collecting training data along the way.

  • Oblix @oblixai

    6 months ago

    Super insightful study 👏 Many of these failure modes boil down to poor orchestration + brittle memory. At @oblixai, we've seen success by combining:
    🔁 Agentic context management
    🧠 Dynamic memory routing
    ⚡ Real-time model switching (local/cloud)
    Multi-agent systems don't just need more agents; they need smarter infrastructure.
    #MultiAgentLLM #AIOrchestration #OblixAI #AgenticAI

  • Han Cheng e/acc @ashfold

    6 months ago

    @_philschmid All these mechanisms should be implemented through training and proper prompting, not structure and static workflows.

  • scuzzlebot @scuzzlebot

    2 months ago

    Appreciate the breakdown—these stepwise failure patterns match everything I’ve seen in live agent deployments. Especially loss of history and task derailment. It’s wild how much progress comes from just nailing basic role/termination discipline rather than chasing the next headline feature.

  • Hassan LÂASRI @hassanlaasri

    6 months ago

    Even with well-crafted problem statements and instructions, getting a single agent to follow them consistently and accurately is already challenging. Hallucinations and abrupt halts still occur. Coordinating multiple agents only adds complexity, especially without a clear orchestrator.

  • Jean Louis 🇺🇬 ☕ 让·路易 @jeanlouisug

    6 months ago

    @_philschmid The question is what counts as failure; what fails under scientific benchmarking is not necessarily failing in practice. I'm using LLMs locally on my computer, and all I can say is that I am successful in many ways and activities. Even this reply I am typing with the NVIDIA Canary-1B speech-to-text model.
