Ruder AI Agents Score Higher on Complex Reasoning Tasks

By alex2404

AI agents that can interrupt each other mid-conversation outperform those that politely wait their turn, according to a new study from researchers at the University of Electro-Communications in Tokyo.

The finding challenges a basic assumption built into most multi-agent AI systems: that structured, orderly communication produces the best results. In these experiments, it did not.

The Problem With Polite AI

Standard large language models follow a rigid call-and-response pattern. One agent speaks, another waits, another replies. It mirrors a computer protocol more than a human conversation, and the researchers suspected that this formality was leaving accuracy on the table.

“Current multi-agent systems often feel artificial because they lack the messy, real-time dynamics of human conversation,” said Yuichi Sei, professor in the Department of Informatics at the University of Electro-Communications and co-author of the study. “We wanted to see if giving agents the social cues we take for granted, like the ability to interrupt or the choice to stay quiet, would improve their collective intelligence.”

To test this, the team built a framework where LLMs were no longer bound to take turns. Each model could be assigned a personality that allowed it to speak out of order, cut off another agent, or say nothing at all.

Personality Types and Urgency Scores

The researchers anchored each agent’s behavior in the “big five” personality traits from classical psychology: openness, conscientiousness, extraversion, agreeableness, and neuroticism. These traits shaped how each agent participated in discussion.
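The paper does not publish its trait-to-behavior mapping, but as an illustrative sketch, the five trait scores could be stored per agent and mapped to simple conversational tendencies, for example extraversion raising the propensity to speak and agreeableness raising the bar for interrupting. The class, field names, and formulas below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class BigFive:
    """Trait scores in [0, 1], following the classical 'big five' model."""
    openness: float
    conscientiousness: float
    extraversion: float
    agreeableness: float
    neuroticism: float

    def speak_propensity(self) -> float:
        # Hypothetical mapping: extraverted, open agents volunteer more often.
        return 0.5 * self.extraversion + 0.5 * self.openness

    def interrupt_threshold(self) -> float:
        # Hypothetical mapping: agreeable agents need higher urgency to cut in.
        return 0.4 + 0.5 * self.agreeableness

# A bold, disagreeable agent: quick to speak, quick to interrupt.
bold = BigFive(openness=0.8, conscientiousness=0.6,
               extraversion=0.9, agreeableness=0.2, neuroticism=0.3)
print(bold.speak_propensity())    # 0.85
print(bold.interrupt_threshold()) # 0.5
```

Any scheme like this only fixes each agent's conversational style; the next ingredient, the urgency score, decides moment to moment whether that style is exercised.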

The team also reprogrammed the models to process responses sentence by sentence rather than generating a complete reply before the next agent began. This gave the system finer control over conversational flow.

Central to the framework was an “urgency score.” When an agent detected an error or identified something it considered critical, the score spiked and the agent could interrupt immediately, regardless of whose turn it was. When the score was low, the agent stayed silent, cutting noise from the conversation rather than contributing just to fill space.
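Putting the two ideas together, the sentence-level flow can be sketched as a dispatch loop: after every sentence, each listening agent scores its urgency, and the floor passes to an interrupter only if that score clears the agent's own threshold. This is a minimal sketch under assumed names and thresholds, not the authors' implementation:

```python
# Minimal sketch of sentence-level turn-taking driven by an urgency score.
# All names, thresholds, and scoring rules here are illustrative.

def next_speaker(agents, sentence, current):
    """After each sentence, decide who talks next.

    agents: list of dicts with 'name', 'threshold', and an 'urgency'
    function mapping the latest sentence to a score in [0, 1].
    Returns the highest-urgency interrupter whose score clears its
    threshold; otherwise the current speaker keeps the floor, and
    low-urgency agents simply stay silent.
    """
    best, best_score = None, 0.0
    for agent in agents:
        if agent["name"] == current:
            continue
        score = agent["urgency"](sentence)
        if score >= agent["threshold"] and score > best_score:
            best, best_score = agent["name"], score
    return best or current

agents = [
    # Agent A's urgency spikes when it spots a mistake in the sentence.
    {"name": "A", "threshold": 0.7,
     "urgency": lambda s: 0.9 if "error" in s else 0.1},
    # Agent B rarely feels compelled to interject.
    {"name": "B", "threshold": 0.7, "urgency": lambda s: 0.2},
]
print(next_speaker(agents, "I think there is an error here.", "B"))  # A interrupts
print(next_speaker(agents, "The sky is blue.", "B"))                 # B keeps the floor
```

The key design choice mirrored from the paper's description is that interruption and silence fall out of the same score: a spike seizes the floor immediately, while a low score produces no contribution at all rather than filler.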

The Numbers

The team tested performance using 1,000 questions from the Massive Multitask Language Understanding (MMLU) benchmark, a standard AI reasoning test spanning science, humanities, and other domains.

Three conversational formats were compared: fixed speaking order, dynamic speaking order, and dynamic speaking order with interruption enabled. The results were unambiguous.

  • When one agent initially gave a wrong answer, accuracy was 68.7% with fixed-order discussion, 73.8% with dynamic order, and 79.2% when interruption was allowed.
  • In a harder scenario where two agents started with incorrect answers, accuracy came in at 37.2% with fixed order, 43.7% with dynamic order, and 49.5% with interruption enabled.

Each step toward less formal communication produced a measurable gain. The jump from fixed-order to interruption-enabled was the largest in both scenarios.

What It Suggests

The results point to a specific mechanism: timely correction. When an agent can flag an error the moment it spots one, rather than waiting for its designated turn, the group reaches a better answer faster. Silence, too, turns out to be productive. Agents that stay quiet when they have nothing concrete to add keep the conversation cleaner.

Human debate has always worked this way. The AI field, it appears, is only now starting to build systems that reflect it.


This article is a curated summary based on third-party sources.
