Multi-agent AI systems exhibit emergent risks that no single agent produces in isolation. Existing safety frameworks rely on binary classifications of agent behavior, discarding the uncertainty inherent in proxy-based evaluation. We introduce SWARM (System-Wide Assessment of Risk in Multi-agent systems), a simulation framework that replaces binary good/bad labels with soft probabilistic labels p = P(v = +1) in [0, 1], enabling continuous-valued payoff computation, toxicity measurement, and governance intervention. SWARM implements a modular governance engine with configurable levers (transaction taxes, circuit breakers, reputation decay, and random audits) and quantifies their effects through probabilistic metrics including expected toxicity E[1 - p | accepted] and quality gap E[p | accepted] - E[p | rejected]. Across seven scenarios with five-seed replication, we observe that strict governance reduces welfare by over 40% without improving safety. In parallel, aggressively internalizing system externalities collapses total welfare from a baseline of +262 down to -67, while toxicity remains invariant. Similarly, circuit breakers require careful calibration; overly restrictive thresholds severely diminish system value, whereas an optimal threshold balances moderate welfare with minimized toxicity. In companion experiments, we demonstrate that soft metrics can detect proxy gaming by self-optimizing agents that pass conventional binary evaluations. Furthermore, we observe that this basic governance layer can be applied to live LLM-backed agents (Concordia entities, Claude, GPT-4o Mini) without architectural modification. These results demonstrate that distributional safety requires continuous risk metrics and that governance lever calibration involves quantifiable tradeoffs between safety and system welfare. The source code of the framework and all project resources are publicly available at swarm-ai.org.
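As an illustration of the two probabilistic metrics named above, the following minimal sketch computes expected toxicity E[1 - p | accepted] and quality gap E[p | accepted] - E[p | rejected] from a list of soft labels. It is not taken from the SWARM codebase: the function names, the `accepted` flag representation, and the sample values are all assumptions made for this example.

```python
from statistics import mean


def expected_toxicity(p_values, accepted):
    """Estimate E[1 - p | accepted] over the accepted interactions.

    p_values: soft labels p = P(v = +1), each in [0, 1] (hypothetical input format).
    accepted: booleans, True where the interaction was accepted.
    """
    acc = [p for p, a in zip(p_values, accepted) if a]
    if not acc:
        raise ValueError("no accepted interactions to average over")
    return mean(1.0 - p for p in acc)


def quality_gap(p_values, accepted):
    """Estimate E[p | accepted] - E[p | rejected]."""
    acc = [p for p, a in zip(p_values, accepted) if a]
    rej = [p for p, a in zip(p_values, accepted) if not a]
    if not acc or not rej:
        raise ValueError("need at least one accepted and one rejected interaction")
    return mean(acc) - mean(rej)


# Illustrative values only, not results from the paper.
sample_p = [0.9, 0.8, 0.3, 0.1]
sample_accepted = [True, True, True, False]

toxicity = expected_toxicity(sample_p, sample_accepted)
gap = quality_gap(sample_p, sample_accepted)
print(f"expected toxicity: {toxicity:.3f}")
print(f"quality gap: {gap:.3f}")
```

With these made-up inputs the expected toxicity is roughly 0.33 and the quality gap roughly 0.57. A positive gap indicates that accepted interactions carry higher soft labels on average than rejected ones, while a binary threshold at p = 0.5 would label the 0.3 interaction simply as bad and discard how uncertain that judgment is.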
@article{aiersilan2026soft,
title={Soft-Label Governance for Distributional Safety in Multi-Agent Systems},
author={Aiersilan, Aizierjiang and Savitt, Raeli},
journal={arXiv preprint arXiv:2604.19752},
year={2026}
}