How to Automatically Attribute Failures in LLM Multi-Agent Systems Using the Who&When Dataset

Introduction

When your LLM-powered multi-agent system fails on a task, you're not just left with a broken output — you're left with a headache. Which agent made the mistake? At what step did things go wrong? Manual log crawling feels like hunting for a single typo in a novel. Fortunately, researchers from Penn State University and Duke University, in collaboration with Google DeepMind, the University of Washington, Meta, Nanyang Technological University, and Oregon State University, have introduced a structured solution: automated failure attribution. Their work, accepted as a Spotlight presentation at ICML 2025, provides the first benchmark dataset (Who&When) and several evaluation methods to pinpoint the root cause of failures. This guide walks you through applying these tools to your own multi-agent systems, saving you hours of frustration.

What You Need

  * Python 3 and Git installed
  * pip for installing dependencies (a virtual environment is recommended)
  * An API key for the LLM you plan to use as the judge, since the attribution methods rely on a pre-trained LLM
  * The Hugging Face datasets library (pulled in via requirements.txt) to download Who&When

Step-by-Step Guide

Step 1: Understand the Task of Failure Attribution

Before diving into code, grasp the core concept. In LLM multi-agent systems, multiple agents collaborate (e.g., via conversation or tool use) to solve a problem. A failure occurs when the final output is incorrect or incomplete. Failure attribution answers two questions: which agent caused the failure and at which point in the interaction (i.e., which timestamp or turn). The Who&When dataset simulates such failures with ground-truth labels, so you can evaluate the accuracy of your attribution method.
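To make the who/when labels concrete, here is a toy failed trace. The field names are illustrative, not the exact Who&When schema:

```python
# A toy failed interaction trace with ground-truth attribution labels.
# Field names here are illustrative, not the dataset's exact schema.
failure_case = {
    "question": "What is the capital of Australia?",
    "history": [
        {"step": 0, "agent": "Planner", "content": "Ask the retriever for the capital."},
        {"step": 1, "agent": "Retriever", "content": "The capital of Australia is Sydney."},  # the error
        {"step": 2, "agent": "Writer", "content": "Final answer: Sydney."},
    ],
    "ground_truth": {"who": "Retriever", "when": 1},
}

def is_correct(prediction, case):
    """Check a (who, when) prediction against the ground-truth labels."""
    gt = case["ground_truth"]
    return prediction["who"] == gt["who"] and prediction["when"] == gt["when"]

print(is_correct({"who": "Retriever", "when": 1}, failure_case))  # True
```

An attribution method takes the question and history as input and must recover the ground-truth pair, which is exactly what the dataset lets you score.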

Step 2: Clone the Repository and Set Up the Environment

  1. Open a terminal and run:
    git clone https://github.com/mingyin1/Agents_Failure_Attribution.git
  2. Navigate to the directory:
    cd Agents_Failure_Attribution
  3. Create a virtual environment (recommended):
    python -m venv venv
    source venv/bin/activate # On Windows: venv\Scripts\activate
  4. Install dependencies:
    pip install -r requirements.txt

Step 3: Download the Who&When Dataset

The dataset is hosted on Hugging Face. Run the provided download script or use the Hugging Face datasets library:

from datasets import load_dataset
dataset = load_dataset("Kevin355/Who_and_When")

Alternatively, visit the dataset page and download the files manually. Place them in a data/ folder within the repository.

Step 4: Understand the Dataset Structure

The dataset contains multi-agent interaction logs, each labeled with:

  * Who: the agent responsible for the failure
  * When: the step at which the decisive error occurred
  * Why: a natural-language explanation of the failure

Familiarize yourself with the format by examining a sample: dataset['train'][0] in Python.
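A quick way to get oriented is to pretty-print one record's trace and mark the labeled error step. The key names below (step, agent, content, when) are placeholders; match them to the actual fields you see in dataset['train'][0]:

```python
# Illustrative sample record; swap in a real record once the dataset is loaded.
sample = {
    "question": "Find the release year of the film.",
    "history": [
        {"step": 0, "agent": "Orchestrator", "content": "Assign the search."},
        {"step": 1, "agent": "WebSurfer", "content": "The film came out in 2019."},
    ],
    "who": "WebSurfer",
    "when": 1,
}

# Build one printable line per turn, flagging the labeled error step.
lines = []
for turn in sample["history"]:
    marker = "  <-- decisive error" if turn["step"] == sample["when"] else ""
    lines.append(f"[{turn['step']}] {turn['agent']}: {turn['content']}{marker}")

print("\n".join(lines))
```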

Step 5: Choose an Attribution Method

The paper introduces several automated methods. Start with the trace-based (all-at-once) approach, which feeds the entire interaction trace to a pre-trained LLM and asks it to predict the responsible agent and step in a single pass. More advanced options include:

  * Step-by-step: the judging LLM walks through the trace turn by turn, deciding at each turn whether the decisive error has occurred yet
  * Binary search: the judging LLM repeatedly halves the trace, recursively narrowing down the segment that contains the decisive error

The repository includes scripts for each. For your first run, use the default trace-based approach.
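As a sketch of the trace-based idea, the snippet below assembles the whole trace into a single prompt for a judging LLM. The prompt wording is illustrative, not the repository's actual template, and the LLM call itself is left to your client of choice:

```python
def build_attribution_prompt(question, history):
    """Format a full interaction trace into one attribution prompt (illustrative wording)."""
    lines = [f"Step {m['step']} [{m['agent']}]: {m['content']}" for m in history]
    return (
        "A multi-agent system failed to solve this task:\n"
        f"Task: {question}\n\n"
        "Interaction trace:\n" + "\n".join(lines) + "\n\n"
        "Which agent made the decisive error, and at which step? "
        'Reply as JSON: {"who": "<agent>", "when": <step>}'
    )

history = [
    {"step": 0, "agent": "Planner", "content": "Delegate the lookup."},
    {"step": 1, "agent": "Retriever", "content": "The capital is Sydney."},
]
prompt = build_attribution_prompt("What is the capital of Australia?", history)
print(prompt)
# Send `prompt` to your LLM and parse the JSON reply into a (who, when) prediction.
```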

Step 6: Run Attribution on a Sample Failure

Execute the provided evaluation script:

python run_attribution.py --dataset_path ./data/Who_and_When --method trace_based --split test

This will analyze a batch of test cases and output predictions alongside the ground truth. The script logs the results, including separate accuracy figures for who and when.
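If you want to sanity-check the reported numbers, the two metrics can be computed along these lines (field names illustrative):

```python
def score(predictions, ground_truths):
    """Compute who and when accuracy separately over paired predictions and labels."""
    n = len(ground_truths)
    who_hits = sum(p["who"] == g["who"] for p, g in zip(predictions, ground_truths))
    when_hits = sum(p["when"] == g["when"] for p, g in zip(predictions, ground_truths))
    return {"who_acc": who_hits / n, "when_acc": when_hits / n}

preds = [{"who": "Retriever", "when": 1}, {"who": "Planner", "when": 0}]
gts   = [{"who": "Retriever", "when": 2}, {"who": "Planner", "when": 0}]
print(score(preds, gts))  # {'who_acc': 1.0, 'when_acc': 0.5}
```

Note that the two metrics can diverge sharply: here both agents are identified correctly while only one error step is.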

Step 7: Interpret the Results

Check the output summary. A high who accuracy (e.g., >80%) indicates the method reliably identifies the failing agent. A low when accuracy suggests the method struggles with pinpointing the exact moment. Examine false positives — does the model blame an agent too early or too late? The paper reports baseline metrics (e.g., random guessing gives ~25% accuracy for who in a 4-agent system), so compare accordingly.

Step 8: Apply to Your Own Multi-Agent System

To use this on your custom system, you must log interactions in the same format as the dataset: a JSON or dict with keys for agent names, message content, timestamps, and final success/failure. Modify the attribution scripts to accept your data. The trace_based method can be adapted by feeding your logs to the LLM with a similar prompt template.
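A minimal sketch of such a conversion, assuming hypothetical raw-log field names (name, message, ts) from your own system; the target keys mirror the description above rather than the dataset's exact schema:

```python
import json

def to_whowhen_record(task, raw_events, failed):
    """Convert raw agent-system events into a Who&When-style record (illustrative schema)."""
    history = [
        {"step": i, "agent": e["name"], "content": e["message"], "timestamp": e["ts"]}
        for i, e in enumerate(raw_events)
    ]
    return {"question": task, "history": history, "is_failure": failed}

raw = [
    {"name": "Planner", "message": "Search for the answer.", "ts": "2025-01-01T00:00:00"},
    {"name": "Executor", "message": "No results found.", "ts": "2025-01-01T00:00:05"},
]
record = to_whowhen_record("Find X", raw, failed=True)
print(json.dumps(record, indent=2))
```

Once your logs serialize into this shape, the attribution scripts only need their input-loading code pointed at your files.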

Tips for Success

  * Log your agents' interactions in a Who&When-style schema from the start, so attribution runs without a conversion step
  * Track who and when accuracy separately; a method can be strong at identifying the agent but weak at pinpointing the step
  * Compare every method against the random-guessing baseline before trusting its numbers
