
Share My Research is Synced's column that welcomes scholars to share their own research breakthroughs with over 2M global AI enthusiasts. Beyond technological advances, Share My Research also calls for interesting stories behind the research and exciting research ideas.

Meet the author

Institutions: Penn State University, Duke University, Google DeepMind, University of Washington, Meta, Nanyang Technological University, and Oregon State University. The co-first authors are Shaokun Zhang of Penn State University and Ming Yin of Duke University.

In recent years, LLM Multi-Agent systems have garnered widespread attention for their collaborative approach to solving complex problems.
However, it is common for these systems to fail at a task despite a flurry of activity. This leaves developers with a critical question: which agent, at what point, was responsible for the failure? Sifting through vast interaction logs to pinpoint the root cause feels like finding a needle in a haystack: a time-consuming and labor-intensive effort.

This is a familiar frustration for developers.
In increasingly complex Multi-Agent systems, failures are not only common but also incredibly difficult to diagnose due to the autonomous nature of agent collaboration and long information chains.
Without a way to quickly identify the source of a failure, system iteration and optimization grind to a halt.

To address this challenge, researchers from Penn State University and Duke University, in collaboration with institutions including Google DeepMind, have introduced the novel research problem of Automated Failure Attribution.
They have constructed the first benchmark dataset for this task, Who&When, and have developed and evaluated several automated attribution methods.
This work not only highlights the complexity of the task but also paves a new path toward enhancing the reliability of LLM Multi-Agent systems.

The paper has been accepted as a Spotlight presentation at the top-tier machine learning conference, ICML 2025, and the code and dataset are now fully open-source.

Paper: https://arxiv.org/pdf/2505.00212
Code: https://github.com/mingyin1/Agents_Failure_Attribution
Dataset: https://huggingface.co/datasets/Kevin355/Who_and_When

Research Background and Challenges

LLM-driven Multi-Agent systems have demonstrated immense potential across many domains.
However, these systems are fragile; errors by a single agent, misunderstandings between agents, or mistakes in information transmission can lead to the failure of the entire task.

Currently, when a system fails, developers are often left with manual and inefficient methods for debugging:

Manual Log Archaeology: Developers must manually review lengthy interaction logs to find the source of the problem.
Reliance on Expertise: The debugging process is highly dependent on the developer's deep understanding of the system and the task at hand.

This "needle in a haystack" approach to debugging is not only inefficient but also severely hinders rapid system iteration and the improvement of system reliability.
There is an urgent need for an automated, systematic method to pinpoint the cause of failures, effectively bridging the gap between evaluation results and system improvement.

Core Contributions

This paper makes several contributions to address the challenges above:

1. Defining a New Problem: The paper is the first to formalize automated failure attribution as a specific research task. This task is defined by identifying the failure-responsible agent and the decisive error step that led to the task's failure.

2. Constructing the First Benchmark Dataset: Who&When: This dataset includes a wide range of failure logs collected from 127 LLM Multi-Agent systems, which were either algorithmically generated or hand-crafted by experts to ensure realism and diversity. Each failure log is accompanied by fine-grained human annotations for:

Who: The agent responsible for the failure.
When: The specific interaction step where the decisive error occurred.
Why: A natural language explanation of the cause of the failure.

3. Exploring Initial Automated Attribution Methods: Using the Who&When dataset, the paper designs and assesses three distinct methods for automated failure attribution:

All-at-Once: This method provides the LLM with the user query and the complete failure log, asking it to identify the responsible agent and the decisive error step in a single pass. While cost-effective, it may struggle to pinpoint precise errors in long contexts.
Step-by-Step: This approach mimics manual debugging by having the LLM review the interaction log sequentially, making a judgment at each step until the error is found. It is more precise at locating the error step but incurs higher costs and risks accumulating errors.
Binary Search: A compromise between the first two methods, this strategy repeatedly divides the log in half, using the LLM to determine which segment contains the error. It then recursively searches the identified segment, offering a balance of cost and performance.

Experimental Results and Key Findings

Experiments were conducted in two settings: one where the LLM knows the ground truth answer to the problem the Multi-Agent system is trying to solve (With Ground Truth) and one where it does not (Without Ground Truth).
The primary model used was GPT-4o, though other models were also tested.
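To make the three attribution strategies described above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the Step record and the three judge callables (locate, is_error, in_first_half) are hypothetical stand-ins for LLM prompts, and the demo replaces them with trivial stubs that look for a "BUG" marker. The paper's actual prompts and code are in the repository linked above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One entry in a failure log (hypothetical structure)."""
    agent: str    # name of the agent that acted at this step
    content: str  # what the agent said or did


# (responsible agent, index of the decisive error step), or None if not found
Attribution = Optional[Tuple[str, int]]


def all_at_once(query: str, log: List[Step],
                locate: Callable[[str, List[Step]], Attribution]) -> Attribution:
    """Show the judge the full log once and accept its single answer."""
    return locate(query, log)


def step_by_step(query: str, log: List[Step],
                 is_error: Callable[[str, List[Step]], bool]) -> Attribution:
    """Walk the log in order; stop at the first step the judge flags."""
    for i in range(len(log)):
        # The judge sees the history up to and including step i.
        if is_error(query, log[: i + 1]):
            return log[i].agent, i
    return None


def binary_search(query: str, log: List[Step],
                  in_first_half: Callable[[str, List[Step], List[Step]], bool]
                  ) -> Attribution:
    """Halve the candidate range until a single step remains."""
    if not log:
        return None
    lo, hi = 0, len(log) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        # Ask the judge which segment contains the decisive error.
        if in_first_half(query, log[lo: mid + 1], log[mid + 1: hi + 1]):
            hi = mid
        else:
            lo = mid + 1
    return log[lo].agent, lo


# Demo with stub judges (assumption: the faulty step is marked with "BUG").
log = [
    Step("planner", "plan the task"),
    Step("coder", "write code"),
    Step("coder", "BUG: wrong formula"),
    Step("tester", "run tests"),
    Step("planner", "report result"),
]
stub_locate = lambda q, steps: next(
    ((s.agent, i) for i, s in enumerate(steps) if "BUG" in s.content), None)
stub_is_error = lambda q, steps: "BUG" in steps[-1].content
stub_in_first_half = lambda q, first, second: any("BUG" in s.content for s in first)

print(all_at_once("demo query", log, stub_locate))           # ('coder', 2)
print(step_by_step("demo query", log, stub_is_error))        # ('coder', 2)
print(binary_search("demo query", log, stub_in_first_half))  # ('coder', 2)
```

In this sketch, All-at-Once makes one judge call, Step-by-Step makes up to one call per step, and Binary Search makes roughly log2(n) calls for a log of n steps, which mirrors the cost trade-off described above.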
The systematic evaluation of these methods on the Who&When dataset yielded several important insights:

A Long Way to Go: Current methods are far from perfect. Even the best-performing single method achieved an accuracy of only about 53.5% in identifying the responsible agent and a mere 14.2% in pinpointing the exact error step. Some methods performed even worse than random guessing, underscoring the difficulty of the task.

No All-in-One Solution: Different methods excel at different aspects of the problem. The All-at-Once method is better at identifying Who, while the Step-by-Step method is more effective at determining When. The Binary Search method provides middle-ground performance.

Hybrid Approaches Show Promise but at a High Cost: The researchers found that combining different methods, such as using the All-at-Once approach to identify a potential agent and then applying the Step-by-Step method to find the error, can improve overall performance. However, this comes with a significant increase in computational cost.

State-of-the-Art Models Struggle: Surprisingly, even the most advanced reasoning models, like OpenAI o1 and DeepSeek R1, find this task challenging. This highlights the inherent difficulty of automated failure attribution, which demands a higher level of reasoning than what is required for more conventional tasks.

The Importance of Explicit Reasoning: Providing explicit prompts that require the LLM to explain its reasoning in the All-at-Once and Step-by-Step methods was shown to improve performance.

Context Length is a Limiting Factor: The study also revealed that as the context length of the failure logs increases, the performance of all attribution methods tends to decrease, with a more pronounced impact on the accuracy of identifying the error step.
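For readers who want to see how the two reported figures and the hybrid strategy fit together, the following Python sketch may help. It rests on assumptions rather than the paper's code: agent-level and step-level accuracy are taken here as the fraction of logs whose predicted agent, or predicted step, matches the human annotation, and the hybrid function is one plausible reading of "All-at-Once for the agent, then Step-by-Step for the step". The pick_agent and is_error callables are hypothetical stand-ins for LLM prompts.

```python
from typing import Callable, List, Optional, Sequence, Tuple

LogEntry = Tuple[str, str]                            # (agent, content)
Label = Tuple[str, int]                               # annotated (Who, When)
Prediction = Optional[Tuple[str, Optional[int]]]      # (agent, step) or None


def agent_accuracy(preds: Sequence[Prediction], labels: Sequence[Label]) -> float:
    """Fraction of logs where the predicted agent matches the annotation."""
    if not labels:
        raise ValueError("labels must not be empty")
    hits = sum(p is not None and p[0] == y[0] for p, y in zip(preds, labels))
    return hits / len(labels)


def step_accuracy(preds: Sequence[Prediction], labels: Sequence[Label]) -> float:
    """Fraction of logs where the predicted step matches the annotation."""
    if not labels:
        raise ValueError("labels must not be empty")
    hits = sum(p is not None and p[1] == y[1] for p, y in zip(preds, labels))
    return hits / len(labels)


def hybrid(query: str, log: List[LogEntry],
           pick_agent: Callable[[str, List[LogEntry]], str],
           is_error: Callable[[str, List[LogEntry]], bool]) -> Prediction:
    """One whole-log call picks a suspect; only that agent's steps are
    then judged one by one (assumed reading of the hybrid approach)."""
    suspect = pick_agent(query, log)
    for i, (agent, _content) in enumerate(log):
        if agent == suspect and is_error(query, log[: i + 1]):
            return suspect, i
    return suspect, None  # agent guessed, but no step was flagged


# Demo with stub judges and made-up predictions (illustrative values only).
demo_log = [("planner", "plan"), ("coder", "BUG: off by one"), ("tester", "tests fail")]
result = hybrid("demo query", demo_log,
                pick_agent=lambda q, l: "coder",
                is_error=lambda q, steps: "BUG" in steps[-1][1])
print(result)  # ('coder', 1)

preds = [("coder", 1), ("coder", 3), None, ("tester", 4)]
labels = [("coder", 1), ("coder", 2), ("planner", 0), ("planner", 0)]
print(agent_accuracy(preds, labels))  # 0.5
print(step_accuracy(preds, labels))   # 0.25
```

Scoring the two accuracies separately is what allows a method to look reasonable on Who while still being weak on When, as in the results above.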




