Emergent Coordination in Multi-Agent Language Models


Abstract

How do you test whether multi-agent systems show signs of higher-order structure? And can we improve this higher-order collaborative capability?

This paper tackles these fundamental questions by developing an information-theoretic framework to measure and steer emergence in multi-agent LLM systems.


Introduction

Despite the impressive performance of many multi-agent systems, we lack a principled understanding of when and how synergy emerges, what role agent differentiation plays, and how to steer these systems systematically [1]. Synergy here refers to information about a target that a collection of variables provides only jointly, not individually.

Research Questions

The paper asks three fundamental questions:

  1. Do multi-agent LLM systems possess the capacity for emergence?
  2. What functional advantages - such as synergistic coordination and higher performance - arise when multi-agent systems exhibit emergence?
  3. Can we design prompts, roles, or reasoning structures that steer the internal coordination of multi-agent systems to encourage positive, goal-oriented synergy?

Experimental Approach

The authors test three intervention conditions:

| Condition | Description |
| --- | --- |
| Plain | Control condition with basic task instructions only |
| Persona | Each agent assigned a distinct persona (name, traits, values) |
| Theory of Mind (ToM) | Personas + explicit instruction to model other agents’ behaviors |

Key Finding

The coordination style differs dramatically across interventions. While emergence is present in all conditions, only the ToM-prompt condition leads to groups with identity-linked differentiation and goal-directed complementarity - they operate as an integrated, goal-directed unit.

Main Contributions

  1. Framework to quantify emergent properties in multi-agent systems based on information decomposition
  2. Diagnostic approaches to localize where synergy resides and distinguish it from alternative explanations
  3. Demonstration of how to steer emergence with specific prompts
  4. Evidence that internal coordination is measurable and controllable with interventions

Method

The Task

The authors use a group guessing game [2] with no inter-agent communication. The game requires agents to propose integers whose sum must match a randomly generated hidden target number. Agents are unaware of each other’s guesses and of the size of the group. The only feedback they receive is a group-level “too high” or “too low”.

Why this task? It’s challenging because identical strategies induce oscillation (everyone increases/decreases their guess, overshooting the target) and only complementary strategies succeed.

The Framework

How would we know if a multi-agent system shows emergent properties that suggest the whole is more than the sum of its parts?

The authors identify emergence with information about a system’s temporal evolution - future states of the whole - that cannot be traced to the current state of the individual parts. They introduce three complementary metrics:

1. Practical Criterion

Goal: Test whether the macro-level signal contains predictive information beyond what individual parts provide.

Given the micro-state $X_t = (X_{1,t}, \ldots, X_{n, t})$ and macro $V_t = f(X_t)$, they align samples $(t, t + \tau)$ and compute:

\[S_\text{macro}(\tau) = I(V_t ; V_{t + \tau}) - \sum_{i=1}^{n} I(X_{i, t}; V_{t + \tau})\]

Interpretation: A positive $S_\text{macro}$ indicates that the macro’s self-predictability exceeds what the sum of its parts can explain - evidence of emergent dynamical synergy.

Strengths: A coarse, order-agnostic screen that is sensitive to multi-part synergy of any order $\geq 2$.

Limitations: Penalized by redundancy across parts; can be negative even when higher-order synergy exists.
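
To make the criterion concrete, here is a minimal sketch of how $S_\text{macro}$ could be estimated from already-discretized series. The `plugin_mi` helper is a bare frequency-count estimator (the paper's Jeffreys-smoothed version is discussed under Estimation Details below); the function names and array layout are our own assumptions.

```python
import numpy as np

def plugin_mi(x, y):
    """Plug-in mutual information (nats) between two discrete 1-D arrays."""
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (xi, yi), 1.0)          # count co-occurrences
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px * py)[nz])).sum())

def s_macro(X, V, tau=1):
    """Practical criterion: macro self-prediction minus per-part predictions.

    X: (T, n) array of discretized micro-states; V: (T,) discretized macro.
    """
    macro_term = plugin_mi(V[:-tau], V[tau:])
    micro_terms = sum(plugin_mi(X[:-tau, i], V[tau:]) for i in range(X.shape[1]))
    return macro_term - micro_terms  # positive => emergent dynamical synergy
```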

2. Emergence Capacity

Goal: Capture the ability of multi-agent systems to host any synergy, focusing on pairwise interactions.

For each pair of agents $(i, j)$ and time $t$, define:

  • Sources: Current states $X_{i, t}$ and $X_{j, t}$
  • Target: Next-step joint state $T_{ij, t + \tau} \equiv (X_{i, t+\tau}, X_{j,t+\tau})$

Using Partial Information Decomposition (PID), the authors decompose the predictive information:

\[I(\{X_{i,t}, X_{j,t}\}; T_{ij, t+\tau}) = UI_i + UI_j + \text{Redundancy}_{ij} + \text{Synergy}_{ij}\]

where:

  • $UI_i$, $UI_j$ = unique information from each agent
  • $\text{Redundancy}_{ij}$ = information both agents provide
  • $\text{Synergy}_{ij}$ = information only their combination reveals

Interpretation: A positive $\text{Synergy}_{ij}$ indicates predictive information about the joint future not recoverable from any single component. The median across all pairs gives group-level synergy capacity.

Comparison: Unlike the Practical Criterion, this is limited to order-2 synergy but doesn’t require defining a whole-system macro signal.
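
As a sketch of the group-level statistic, the capacity measure could be computed by looping over all agent pairs and taking the median pairwise synergy. Here `pid_synergy` is a placeholder for a PID routine such as the Williams-Beer sketch given later under Estimation Details; the joint-state encoding is our own assumption.

```python
import itertools
import numpy as np

def emergence_capacity(X, pid_synergy, tau=1):
    """Median pairwise synergy across all agent pairs.

    X: (T, n) discretized micro-states. pid_synergy(a, b, target) must return
    the synergy atom for two source arrays and a discrete target array.
    """
    _, n = X.shape
    synergies = []
    for i, j in itertools.combinations(range(n), 2):
        # Target: the pair's joint next-step state, encoded as one symbol.
        base = int(X[:, j].max()) + 1
        target = X[tau:, i] * base + X[tau:, j]
        synergies.append(pid_synergy(X[:-tau, i], X[:-tau, j], target))
    return float(np.median(synergies))
```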

3. Coalition Test

Goal: Determine whether agent coalitions show goal-directed, beyond-pairwise coordination.

Define two metrics for each triplet of agents:

$I_3$ (Goal Alignment):

\[I_3 = I(X_{i,t}, X_{j, t}, X_{k, t}; V_{t + \tau})\]

This measures how much three agents jointly predict the macro future. High $I_3$ indicates coherent organization toward the shared goal; low $I_3$ suggests weak alignment.

$G_3$ (Beyond-Pair Synergy):

\[G_3 = I_3 - \max(I_{2\{i,j\}}, I_{2\{i,k\}}, I_{2\{j, k\}})\]

This measures additional predictive information the full triplet provides over the best pair - a whole-minus-parts metric. Positive $G_3$ means no pair captures all the information the triplet contains about the macro signal.

Interpretation: This test answers whether joint information from agent coalitions is actually about the goal, localizing where macro predictability depends on beyond-pair structure.
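
A sketch of the triplet statistics, assuming a discrete mutual-information estimator `mi` (e.g. the `plugin_mi` defined in the earlier sketch); the encoding helper is our own.

```python
import itertools
import numpy as np

def joint_encode(cols):
    """Collapse several discrete columns into one symbol per time step."""
    code = np.zeros_like(cols[0])
    for c in cols:
        code = code * (int(c.max()) + 1) + c
    return code

def coalition_test(X, V, i, j, k, mi, tau=1):
    """Return (I_3, G_3) for the triplet (i, j, k) against the macro future."""
    fut = V[tau:]
    I3 = mi(joint_encode([X[:-tau, i], X[:-tau, j], X[:-tau, k]]), fut)
    I2 = [mi(joint_encode([X[:-tau, a], X[:-tau, b]]), fut)
          for a, b in itertools.combinations((i, j, k), 2)]
    return I3, I3 - max(I2)  # G_3 > 0 => beyond-pair structure
```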

Estimation Details

Signal Definitions

The deviation from the equal share contribution is used as the micro-state:

\begin{equation} \label{microsignal} X_t = (X_{1,t}, \ldots, X_{n,t}) \quad \text{where } X_{i,t} = R_{i,t} - Y/n \end{equation}

where $R_t = (R_{1,t}, \ldots, R_{n,t})$ are the raw guesses and $Y$ is the target.

The macro signal is the group-level deviation:

\begin{equation} \label{macrosignal} V_t = \left( \sum_{i=1}^{n} R_{i,t} \right) - Y = \sum_{i=1}^{n} X_{i,t} \end{equation}

Note that the macro is simply the sum of the micro signals.
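
In code, the two signals are one line each (the array shapes are our assumption):

```python
import numpy as np

def micro_macro(R, Y):
    """Micro deviations and macro deviation from raw guesses.

    R: (T, n) raw guesses, one row per round; Y: scalar hidden target.
    """
    n = R.shape[1]
    X = R - Y / n              # deviation from each agent's equal share
    V = R.sum(axis=1) - Y      # group deviation; note V equals X.sum(axis=1)
    return X, V
```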

Computing Information-Theoretic Measures

To calculate synergy and other measures, the authors must estimate mutual information from finite data:

Discretization: Since the micro-states $X_{i,t}$ are continuous-valued deviations, they need to be discretized before computing entropies. The authors use quantile binning with $K=2$ bins - essentially dividing each agent’s guesses into “high” (above the group median) and “low” (below the median). This might seem crude, but it captures the essential structure: whether an agent is guessing above or below their fair share. Think of it as asking “Is this agent contributing more or less than average?” rather than tracking exact numerical values.
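
A minimal sketch of the binning step; whether the median is taken per agent or pooled across the group is an implementation detail we are assuming here:

```python
import numpy as np

def binarize(X):
    """Quantile binning with K=2: above-median -> 1, at/below-median -> 0."""
    med = np.median(X, axis=0, keepdims=True)   # per-agent median over rounds
    return (X > med).astype(int)
```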

PID Implementation: For the Partial Information Decomposition (mentioned in the Emergence Capacity section), they use the Williams-Beer framework with $I_{\min}$ redundancy measure [3]. This particular choice is conservative - it tends to underestimate redundancy, which means any synergy detected is less likely to be a false positive. The decomposition splits the mutual information $I({X_{i,t}, X_{j,t}}; T_{ij, t+\tau})$ into four non-overlapping components: unique information from agent $i$, unique from agent $j$, redundant information (what both agents provide), and synergistic information (what only their combination reveals).
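
Here is a compact, self-contained sketch of the Williams-Beer $I_{\min}$ decomposition for two discrete sources. For brevity it uses unsmoothed plug-in probabilities (the paper applies Jeffreys smoothing, discussed below); the function names are ours, and this is the kind of routine the `pid_synergy` placeholder in the Emergence Capacity sketch stands in for.

```python
import numpy as np

def _joint(x, y):
    """Plug-in joint probability table for two discrete 1-D int arrays."""
    p = np.zeros((int(x.max()) + 1, int(y.max()) + 1))
    np.add.at(p, (x, y), 1.0)
    return p / p.sum()

def _mi(p):
    """Mutual information (nats) from a joint table."""
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = p / (px * py)
    return float((p[nz] * np.log(ratio[nz])).sum())

def _specific_info(p_ts):
    """Williams-Beer specific information I(T=t; S), one value per t.

    p_ts: joint table with the target on axis 0 and the source on axis 1.
    """
    pt = p_ts.sum(1, keepdims=True)   # p(t)
    ps = p_ts.sum(0, keepdims=True)   # p(s)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = (p_ts / pt) * np.log((p_ts / ps) / pt)
    return np.nansum(terms, axis=1)   # 0*log(0) terms drop out as NaN

def wb_pid(a, b, t):
    """I_min PID of I({a, b}; t) -> (unique_a, unique_b, redundancy, synergy)."""
    redundancy = float((_joint(t, a).sum(1) *                 # weight by p(t)
                        np.minimum(_specific_info(_joint(t, a)),
                                   _specific_info(_joint(t, b)))).sum())
    i_a, i_b = _mi(_joint(a, t)), _mi(_joint(b, t))
    ab = a * (int(b.max()) + 1) + b          # joint source as one symbol
    i_joint = _mi(_joint(ab, t))
    unique_a, unique_b = i_a - redundancy, i_b - redundancy
    synergy = i_joint - unique_a - unique_b - redundancy
    return unique_a, unique_b, redundancy, synergy
```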

Time Lag: They use $\tau = 1$, meaning they’re always predicting one step into the future. This makes sense for the guessing game - agents receive feedback after each round and adjust their next guess accordingly. The question becomes: “Does knowing the current state of multiple agents help predict where the system goes next?”

Entropy Estimation: Computing entropy from samples requires care. The authors use plug-in estimators (just counting frequencies) but apply Jeffreys smoothing to handle the sparse data problem. Without smoothing, zero-frequency events would cause issues in the calculations. Jeffreys smoothing adds a small pseudocount (0.5) to every bin, providing more stable estimates without significantly biasing the results.
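
A sketch of the smoothed estimator; mutual information then follows from the usual identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$. The bin-count interface is our assumption.

```python
import numpy as np

def jeffreys_entropy(counts, alpha=0.5):
    """Entropy (nats) from bin counts with a Jeffreys pseudocount per bin."""
    counts = np.asarray(counts, dtype=float) + alpha  # +0.5 avoids log(0)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def jeffreys_mi(x, y, kx, ky, alpha=0.5):
    """Smoothed I(X;Y) = H(X) + H(Y) - H(X,Y) for arrays in [0,kx), [0,ky)."""
    hx = jeffreys_entropy(np.bincount(x, minlength=kx), alpha)
    hy = jeffreys_entropy(np.bincount(y, minlength=ky), alpha)
    hxy = jeffreys_entropy(np.bincount(x * ky + y, minlength=kx * ky), alpha)
    return hx + hy - hxy
```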

Falsification Tests: Here’s where it gets clever. To verify that detected synergy is real and not an artifact, the authors perform two control experiments:

| Test | Method | What it breaks | What it preserves |
| --- | --- | --- | --- |
| Row-shuffle (identity) | Randomly permute which agent is which at each time step | Agent identity → behavior link | Temporal dynamics |
| Column-shuffle (temporal) | Randomly permute time steps for each agent independently | Temporal coordination across agents | Individual agent statistics |

By comparing original data to these shuffled controls, the authors test whether emergence depends on stable agent identities and temporal alignment - exactly what we’d expect from genuine coordination.
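
Both controls are a few lines of NumPy (a sketch; the paper's exact permutation scheme, e.g. the number of resamples, may differ):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def row_shuffle(X):
    """Identity control: re-deal agent labels independently at each round."""
    return np.array([rng.permutation(row) for row in X])

def column_shuffle(X):
    """Temporal control: shuffle each agent's own time series independently."""
    return np.column_stack([rng.permutation(X[:, i]) for i in range(X.shape[1])])
```

Recomputing synergy on many such shuffles yields a null distribution; genuine coordination should exceed it.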


Experimental Setup

Before diving into the main experiments, the authors first asked: What makes this task hard, and under what conditions should we test for emergence?

Preliminary Experiments

They ran 7,150 pilot experiments varying two key parameters:

  • Group size: $N \in \{3, 4, 5, 7, 10, 15\}$
  • Temperature: $T \in \{0, 0.3, 0.5, 0.7, 1.0\}$ (controlling LLM sampling randomness)

The results revealed two clear patterns:

  1. Smaller groups found it easier - success rates dropped as group size increased
  2. Higher temperature helped - more stochastic sampling led to better performance

Why? Smaller groups have fewer agents to coordinate, reducing complexity. Higher temperature introduces beneficial diversity in strategies, helping groups escape the “everyone does the same thing” trap that causes oscillation.

Armed with these insights, the authors made a strategic choice for the main experiments: use $N=10$ agents with temperature $T=1$. This combination is challenging enough to require genuine coordination while still being achievable, making it ideal for testing whether different prompts can steer emergence.

Main Experiments

Models: GPT-4.1-2025-04-14 (primary) and Llama-3.1-8B (comparison)

Scale:

  • 200 independent groups × 3 conditions = 600 total experiments
  • Each group: 10 agents, up to 20 rounds or until target reached

Three Intervention Conditions

The authors tested three ways of prompting agents to explore how instructions shape coordination:

| Condition | What agents receive | Key features |
| --- | --- | --- |
| Plain | Basic task instructions only | Game rules + group feedback (“too high”/“too low”) + guess request. No identity, no reasoning scaffolding. |
| Persona | Distinct individual personas | Name, age, occupation, personality traits (e.g., “analytical” vs. “creative”), values, decision-making style. Stable throughout the game. |
| ToM | Personas + Theory of Mind prompting | Everything from Persona, plus the explicit instruction: “Think about what other agents might do based on the group feedback, and adjust your guess accordingly.” |

Example personas: “Jordan, a 34-year-old data scientist who approaches problems methodically and values precision” vs “Maya, a 28-year-old artist who thinks creatively and embraces uncertainty.”
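
Assembling the three conditions amounts to stacking prompt fragments. The sketch below is illustrative: the base wording and persona strings are hypothetical, while the ToM sentence is the instruction quoted in the table above.

```python
BASE = (
    "You are playing a group guessing game. Propose an integer each round. "
    "All guesses are summed and compared to a hidden target; the only "
    "feedback is whether the group total was 'too high' or 'too low'."
)

TOM = ("Think about what other agents might do based on the group feedback, "
       "and adjust your guess accordingly.")

def build_prompt(condition: str, persona: str | None = None) -> str:
    """Assemble the system prompt for one agent under a given condition."""
    if condition == "plain":
        return BASE
    if condition == "persona":
        return f"{BASE}\n\nYour persona: {persona}"
    if condition == "tom":
        return f"{BASE}\n\nYour persona: {persona}\n\n{TOM}"
    raise ValueError(f"unknown condition: {condition}")

# Example:
# build_prompt("tom", "Jordan, a 34-year-old data scientist who values precision")
```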

Central question: Do these interventions change not just performance, but the type of coordination that emerges? Can we steer multi-agent systems toward more sophisticated collective intelligence?


Results

Let’s walk through what the experiments revealed, organized by the three research questions.

RQ1: Do Multi-Agent LLM Systems Have Capacity for Emergence?

Answer: Yes, decisively.

The Practical Criterion test came back positive across all conditions:

| Condition | p-value | Evidence strength |
| --- | --- | --- |
| Plain | $p < 10^{-16}$ | Extremely strong |
| Persona | $p < 10^{-7}$ | Very strong |
| ToM | $p = 0.02$ | Statistically significant |

The Emergence Capacity measure showed 32% of all groups exhibited statistically significant synergy ($p < 0.05$). Fisher’s combined probability test across all groups was highly significant.

Key insight: While emergence appears in all conditions, the character differs dramatically:

  • Plain: Temporal synergy (coordinating across time) but little identity-linked structure
  • Persona: Stable identities but weak goal alignment
  • ToM: Both differentiation and integration toward the shared objective

RQ2: What Functional Advantages Does Emergence Provide?

Agent Differentiation

Using the identity falsification test (row-shuffle control), the authors measured stable role differentiation:

| Condition | Groups with significant differentiation |
| --- | --- |
| Plain | ~35% |
| Persona | ~45% |
| ToM | ~70% (nearly double) |

Interpretation: Personas provide a scaffold for stable identities, but Theory of Mind reasoning amplifies this substantially. ToM agents aren’t just different - they’re systematically different in ways that persist across rounds.

Internal Organization and Goal Alignment

Using the Coalition Test, the authors measured triplet-level synergy ($G_3 > 0$) and goal alignment ($I_3$):

Triplet synergy: All conditions showed positive $G_3$ (beyond-pair organization)

Goal-directed organization ($I_3$):

| Condition | Groups with significant goal-aligned $I_3$ |
| --- | --- |
| Plain & Persona | ~5% |
| ToM | ~30% (six times higher) |

Crucial insight: It’s not enough for agents to coordinate with each other; they must coordinate toward the shared objective. Only ToM produces coherent, goal-directed triplet-level organization.

Performance: The Synergy-Redundancy Interaction

Surprisingly, success rates were similar across conditions - all around 50%. So if ToM doesn’t improve win rates, what’s the point?

The answer lies in how groups succeed. Regression analysis revealed:

Synergy × Redundancy interaction: $\beta = 0.24$, $p = 0.014$ (statistically significant)

Groups succeeded when they exhibited both:

  1. High synergy (complementary strategies)
  2. High redundancy (shared information about the goal)

Failure modes:

  • Synergy alone → incoherent behavior (differentiation without integration)
  • Redundancy alone → oscillation (everyone doing similar things)

Sports team analogy: You need players with different strengths (synergy) who are all playing the same game with shared situational awareness (redundancy). ToM systematically produces this balance.
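
One plausible way to reproduce the regression reported above; the paper's exact model specification isn't given in this summary, so the logistic form and the column names are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per group: per-group synergy/redundancy summaries plus a binary
# success flag. The file and column names here are hypothetical.
df = pd.read_csv("group_metrics.csv")

# `synergy * redundancy` expands to both main effects plus their interaction,
# which is the term reported as beta = 0.24, p = 0.014.
model = smf.logit("success ~ synergy * redundancy", data=df).fit()
print(model.summary())
```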

RQ3: Can Prompts Steer the Type of Emergence?

Answer: Yes, and the differences are qualitative, not just quantitative.

The ToM condition doesn’t just produce “more emergence” - it produces different emergence. Agent reasoning traces reveal the mechanism:

ToM agent (perspective-taking):

“Since the group total was too high, others probably increased their guesses. I should decrease mine more aggressively to compensate.”

Plain agent (independent reasoning):

“The total was too high, so I’ll decrease my guess.”

This perspective-taking creates complementary strategies: some agents specialize in guessing high, others in guessing low, with continuous mutual adjustment. When everyone reasons independently and decreases simultaneously, the group overshoots - oscillation ensues.

Falsification Test Results

| Test | Effect | Implication |
| --- | --- | --- |
| Identity shuffling (row-shuffle) | Eliminates the ToM advantage | Stable agent identities matter |
| Temporal shuffling (column-shuffle) | Eliminates synergy (all conditions) | Coordination requires temporal alignment |

Key takeaway: Prompts can steer not just whether emergence appears, but what kind of emergence - from loose temporal coupling to tightly integrated, goal-directed coordination.

Llama-3.1-8B Results: Model Capacity Matters

The authors tested the much smaller Llama-3.1-8B as a comparison to GPT-4.1:

| Metric | Llama-3.1-8B | GPT-4.1 |
| --- | --- | --- |
| Success rate (Plain/Persona) | ~10% | ~50% |
| ToM success rate | 5.5% (worse than its other conditions) | ~50% |
| Emergence capacity | Strong temporal coupling but weak cross-agent synergy | Strong temporal and cross-agent synergy |

Why does ToM hurt Llama? The added complexity of perspective-taking may exceed the model’s reasoning capacity, creating confusion rather than coordination.

Important constraint: Steering emergence requires models sophisticated enough to act on the steering.

Positive finding: Even Llama shows temporal synergy, suggesting emergence capacity isn’t limited to frontier models - but translating that capacity into goal-directed coordination requires more powerful reasoning abilities.


Discussion & Key Insights

So what does this all mean? Let’s connect the dots and explore what these findings reveal about multi-agent coordination.

The Integration-Differentiation Tension

The paper’s central insight revolves around a fundamental tension in collective intelligence: groups need agents to be different (to cover more ground and avoid redundant effort) yet integrated (to work toward shared goals). This shows up clearly across the three conditions:

Plain condition: Shows temporal synergy - agents coordinate across time - but lacks stable role differentiation. Without distinct identities, agents can’t maintain complementary strategies. They oscillate: everyone increases together, overshoots; everyone decreases together, overshoots again. It’s coordination without structure.

Persona condition: Provides stable identities, giving agents the scaffold for differentiation. But here’s the catch: differentiation alone isn’t enough. These agents maintain distinct personas but don’t consistently align their distinct behaviors toward the goal. They’re different, but not different in the right ways.

ToM condition: The sweet spot. Agents have stable identities (from personas) and explicitly reason about each other’s likely behaviors (from ToM prompting). This enables complementary differentiation - being different in ways that compensate for each other. Some agents specialize in guessing high, others in guessing low, with continuous mutual adjustment based on group feedback.

This mirrors the triangular framework hinted at in the paper’s conceptual model: you need redundancy (shared goal information), synergy (complementary strategies), and alignment (coordination toward the objective). Remove any vertex and the system degrades.

Why Theory of Mind Works

What makes ToM prompting so effective? The agent reasoning traces reveal the mechanism:

Plain agent reasoning:

“The group total was 80 and the target is 100. We’re too low. I’ll increase my guess from 10 to 12.”

ToM agent reasoning:

“The group total was 80, so we’re 20 below target. But with 10 agents, if everyone increases by 2, we’ll overshoot to 100. Others will likely increase theirs, so I should increase less - maybe to 11 - to avoid overshooting. Better yet, if I think some others will increase aggressively, I might stay at 10 or even decrease slightly to compensate.”

The ToM agent isn’t just responding to feedback - they’re modeling others’ responses and adjusting accordingly. This creates a feedback loop of mutual adaptation: agents develop complementary strategies because they’re explicitly trying to complement each other.

Computational requirements:

  1. Perspective-taking - Simulating other agents’ decision processes
  2. Strategic adjustment - Modifying behavior based on predicted others’ behaviors
  3. Maintained identity - Tracking one’s role across rounds

The Llama results show that not all models can handle this complexity. When ToM prompting exceeds a model’s reasoning capacity, it creates confusion rather than coordination.

Connections to Human Collective Intelligence

These findings resonate deeply with research on human groups. Studies of collective intelligence in human teams [4], [5] consistently find that:

  1. Diversity without integration fails: Groups with diverse skills but poor communication underperform
  2. Integration without diversity stagnates: Homogeneous, well-coordinated groups lack adaptability
  3. The magic is in the balance: High-performing teams combine diverse perspectives with strong shared mental models

The multi-agent LLM results mirror this exactly. Plain groups coordinate but lack diversity. Persona groups have diversity but lack goal-directed integration. ToM groups achieve both - and the synergy × redundancy interaction confirms that both are necessary for performance.

What’s remarkable is that these patterns emerge from prompts alone. We’re not training the models differently or giving them new architectures - we’re steering their coordination style through the initial instructions. This suggests that the capacity for sophisticated collective behavior may already be latent in capable LLMs, waiting to be unlocked by the right scaffolding.

Practical Implications

For Multi-Agent System Design

This work provides a concrete design recipe for goal-directed coordination:

| Component | Implementation |
| --- | --- |
| Distinct identities | Personas, roles, or specialized contexts |
| Theory of Mind prompting | Explicit instructions to model other agents |
| Sufficient model capacity | Complex reasoning requires capable models |

Measurement beyond performance: Instead of only tracking success rates, we can now measure internal coordination structure:

  • How much synergy?
  • How much redundancy?
  • Is coordination goal-directed?

Diagnostic power: Groups can fail for different reasons:

  • Too much redundancy → oscillation
  • Not enough synergy → lack of complementarity
  • Synergy misaligned with goals → incoherent behavior

The measurements reveal which.

For Understanding AI Collectives

Philosophical implication: Multi-agent LLM systems can exhibit genuine emergent properties - dynamics at the group level not reducible to individual behaviors. This isn’t just distributed computation; it’s collective intelligence with structure that transcends the parts.

The double-edged nature of emergence: We can steer it toward goal-directed coordination (ToM) or leave it unstructured (Plain). As we deploy multi-agent systems in real applications - collaborative software agents, AI research teams, autonomous systems - understanding and steering internal coordination becomes crucial.

Warning from Llama results: As multi-agent systems grow more complex, the gap between frontier and weaker models may widen not just in individual capability but in coordination capacity. This has implications for accessibility and safety as AI systems become increasingly agentic.


Conclusion

Let’s return to the opening questions: How do we test if multi-agent systems show signs of higher-order structure, and can we improve collaborative capability?

This paper delivers clear answers:

Yes - Multi-agent LLM systems possess the capacity for emergence (dynamics exceeding individual parts)

Yes - We can measure this rigorously using information-theoretic tools

Yes - We can steer the type of emergence through carefully designed prompts

Three Key Insights

  1. Emergence is measurable and steerable

    • Framework provides concrete metrics (practical criterion, emergence capacity, coalition tests)
    • Goes beyond performance measurement → reveals how groups coordinate internally
    • Different prompts produce qualitatively different coordination structures
  2. Both synergy and redundancy are essential

    • Synergy × redundancy interaction is significant
    • Differentiation without integration → incoherence
    • Integration without differentiation → oscillation
    • Success requires balancing both
  3. Theory of Mind capacity is crucial

    • Enables goal-directed complementarity (coordination toward shared objectives)
    • Requires sufficient model capacity to avoid confusion
    • Transforms coordination from temporal coupling to integrated, goal-directed behavior

Looking Forward

This work opens exciting research directions:

Generalization: Do these findings extend beyond the guessing game to other coordination tasks?

Design principles: What guides optimal persona design? How much diversity is too much?

Automated diagnosis: Can we detect and diagnose coordination failures in deployed systems?

Transformative potential: This framework could revolutionize how we design AI collectives - from multi-agent coding assistants to AI research teams to robot swarms. Rather than focusing solely on individual capabilities or aggregate performance, we can now engineer the coordination structure itself.

Human-AI collaboration: If LLM agents develop sophisticated collective intelligence through simple prompting, how do we design hybrid systems that leverage both human intuition and AI processing while maintaining coherent coordination?


Final thought: This paper demonstrates something profound - that the whole can indeed be more than the sum of its parts in multi-agent AI systems. More importantly, we’re beginning to understand how to measure, predict, and design for that emergence. As AI systems become increasingly multi-agent and agentic, this principled framework for understanding collective intelligence will become not just useful, but essential.


References

  1. Shakked Talebi, Suhas Kumar, David D. Nolte, et al. “Emergent Coordination in Multi-Agent Language Models.” arXiv preprint arXiv:2510.05174, 2025.
  2. Robert L. Goldstone. “Group Guessing Game.” 2020.
  3. Paul L. Williams and Randall D. Beer. “Nonnegative Decomposition of Multivariate Information.” arXiv preprint arXiv:1004.2515, 2010.
  4. Leslie A. DeChurch and Jessica R. Mesmer-Magnus. “The cognitive underpinnings of effective teamwork: A meta-analysis.” Journal of Applied Psychology, 2010.
  5. Robert L. Goldstone and Georg Theiner. “The multiple, interacting levels of cognitive systems (MILCS) perspective on group cognition.” Topics in Cognitive Science, 2017.


