Detecting and Deterring Manipulation in a Cognitive Hierarchy

Published in arXiv, 2024

Recommended citation: Alon et al. (2024). "Detecting and Deterring Manipulation in a Cognitive Hierarchy." arXiv. https://arxiv.org/pdf/2405.01870

This paper addresses the vulnerability of agents with limited Theory of Mind (ToM) depth to manipulation by more sophisticated agents. To mitigate this, we propose the ℵ-IPOMDP framework, which enhances model-based reinforcement learning agents with anomaly detection and out-of-belief policies. This allows agents to recognize deceptive behaviors, even without fully understanding them, and to adopt defensive strategies. The framework’s effectiveness is demonstrated in both mixed-motive and zero-sum games, leading to more equitable outcomes and reduced exploitation. The study’s implications span AI safety, cybersecurity, cognitive science, and psychiatry.