Motivation
High-stakes decisions often require agents to ask smart questions before acting, but we lack benchmarks that measure how agents navigate explore/exploit tradeoffs.
Task
Collaborative Battleship pits a question-asking Captain and an all-knowing Spotter against hidden ships, turning information search into gameplay.
Behavior gap
Compared to human players, LM agents generally ask less informative questions, produce inaccurate answers, and waste turns on low-probability moves.
Method
We develop novel Monte Carlo inference strategies for agents based on principles from Bayesian Experimental Design (BED), guiding what they ask, where they shoot, and when they choose to explore vs. act.
Results
Our approach enables weaker LMs, such as Llama-4-Scout, to outperform both humans (8% → 82% win rate) and frontier models (0% → 67% win rate vs. GPT-5) at ≈1% of GPT-5's cost.
Generality
We replicate these findings on "Guess Who?" where our methods significantly boost accuracy (+28.3–42.4 p.p.), illustrating the generality of our approach.
Impact
Our findings demonstrate how Bayesian inference at test time can improve information-seeking and decision-making in discovery settings.
Battleship encapsulates a full active learning loop: agents must form data-driven hypotheses, perform experiments, and make educated guesses. At the same time, the game rules are straightforward, and state and action representations are lightweight. This makes Battleship an ideal evaluation harness for agents, one that doesn't require extensive prompt engineering.
For Spotter agents, we measure QA accuracy across a variety of questions, which often require grounded reasoning about the board and dialogue context. For Captain agents, our Collaborative Battleship task provides a variety of metrics that measure both end-to-end game performance and more granular aspects of question-asking. For instance, to measure overall performance, Targeting Score (F1) captures the balance between precision and recall in taking shots. Meanwhile, to measure the utility of specific questions, we compute expected information gain (EIG), which quantifies how much information a question is likely to reveal about the board state.
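To make the Targeting Score concrete, here's a minimal sketch (the `shots` and `ship_tiles` sets are hypothetical names; the exact scoring in our evaluation harness may differ in detail):

```python
def targeting_score(shots: set[tuple[int, int]], ship_tiles: set[tuple[int, int]]) -> float:
    """F1 over fired shots: precision penalizes wasted misses, recall rewards sinking ships."""
    hits = shots & ship_tiles
    if not hits:
        return 0.0
    precision = len(hits) / len(shots)
    recall = len(hits) / len(ship_tiles)
    return 2 * precision * recall / (precision + recall)
```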
We equip agents with a "world model" -- think of this as a tool that agents can call to sample hypothetical board states. This lets agents maintain a hypothesis space over possible board configurations and perform Bayesian belief updates as new evidence arrives. The key to making this work is code generation, which translates questions expressed in natural language into executable code that interacts with the world model. Scroll down for examples of how this works in practice.
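For intuition, here's a rough sketch of what a generated translation might look like (the question, tile encoding, and `world_model.sample` interface are illustrative stand-ins, not our actual API):

```python
import numpy as np

RED = 2  # hypothetical tile code marking the red ship

def is_red_ship_horizontal(board: np.ndarray) -> bool:
    """LM-generated translation of: 'Is the red ship horizontal?'"""
    rows, _ = np.where(board == RED)
    return len(set(rows)) == 1  # all red tiles share one row -> horizontal

# Querying the world model (sampling interface is a stand-in for the tool described above):
# samples = [world_model.sample(evidence) for _ in range(1000)]
# answers = [is_red_ship_horizontal(board) for board in samples]
```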
Yes -- we extend our Bayesian strategies to a general family of information-seeking games from TextArena. As an initial test, we replicated our full experiments on "Guess Who?", a classic game where players ask yes/no questions to identify a hidden character. Here, we find that our approach boosts accuracy by 28 to 42 percentage points. The same strategies apply to other partially observed games and diagnostic workflows where agents must alternate between probing and acting.
Scan the full abstract for the quantitative takeaways, then head to the interactive Game Explorer to watch agents plan in real time.
Translating questions into Python code enables the Spotter to reason more effectively about the board state.
True board known to the Spotter
We equip agents with a "world model" that lets them sample possible boards, so they can represent uncertainty, form approximate beliefs, and make grounded inferences.
Shots so far constrain the hypothesis space.
Each sample is one plausible arrangement of ships consistent with the evidence.
We marginalize over the sample set to obtain a tile-level probability map.
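A minimal sketch of that marginalization step, assuming boards are NumPy arrays where nonzero values mark ship tiles (the real encoding may differ):

```python
import numpy as np

def tile_probability_map(samples: list[np.ndarray]) -> np.ndarray:
    """Fraction of posterior samples with a ship on each tile (nonzero = ship in this toy encoding)."""
    occupancy = np.stack([(board > 0).astype(float) for board in samples])
    return occupancy.mean(axis=0)  # e.g., an 8x8 grid of per-tile hit probabilities
```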
Executable questions let us score how much information each hypothesis-reducing split reveals.
Captain agents can simulate an "internal" Spotter agent that provides code translations for candidate questions.
Executing the code against each sample splits the hypothesis space into True/False partitions.
Expected information gain (EIG) is maximized when the question splits the hypothesis space evenly.
We use channel noise (ε) to model uncertainty in the predicted answer (either due to mis-translations or inherent ambiguity in the question).
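Putting these pieces together, here's a sketch of a noisy EIG estimate, assuming a simple binary symmetric channel with noise ε (our exact estimator may differ in detail). Note that it peaks when the question splits the samples 50/50:

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_info_gain(samples, question_program, epsilon: float = 0.1) -> float:
    """EIG of a yes/no question whose answer passes through a binary symmetric
    channel with noise epsilon: I(answer; board) = H(p_yes) - H(epsilon)."""
    p_true = sum(bool(question_program(board)) for board in samples) / len(samples)
    p_yes = p_true * (1 - epsilon) + (1 - p_true) * epsilon
    return binary_entropy(p_yes) - binary_entropy(epsilon)
```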
Belief-aware policies help agents decide what to ask, where to shoot, and when to explore vs. act.
Precision Moves
Bayes-M
InfoMax Questions
Bayes-Q
Explore vs. Act
Bayes-D
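As a rough illustration only (not the actual Bayes-D rule), a belief-aware controller might compare the best candidate question's EIG against a threshold before deciding whether to ask or fire, reusing the `expected_info_gain` sketch above:

```python
import numpy as np

def choose_action(samples, candidate_questions, prob_map, questions_left, eig_threshold=0.5):
    """Toy belief-aware controller (illustrative only): ask the most informative
    question if the budget allows and it clears a threshold; otherwise fire at
    the highest-probability tile (prob_map is assumed to zero out tiles already fired at)."""
    if questions_left > 0 and candidate_questions:
        best_q = max(candidate_questions, key=lambda q: expected_info_gain(samples, q))
        if expected_info_gain(samples, best_q) >= eig_threshold:
            return ("ask", best_q)
    row, col = np.unravel_index(np.argmax(prob_map), prob_map.shape)
    return ("fire", (int(row), int(col)))
```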
Press the Play button below to get started. Use the Game Browser to select scenarios based on Captain Strategy and LLM.
This work builds on our prior paper on modeling human-like question-asking in Battleship:
@article{grand2025battleship,
  title={Shoot First, Ask Questions Later? Building Rational Agents That Explore and Act Like People},
  author={Grand, Gabriel and Pepe, Valerio and Tenenbaum, Joshua B. and Andreas, Jacob},
  journal={arXiv preprint arXiv:2510.20886},
  year={2025},
  url={https://arxiv.org/abs/2510.20886}
}