Shoot First, Ask Questions Later?

Building Rational Agents That Explore and Act Like People

MIT CSAIL · MIT Brain and Cognitive Sciences · Harvard SEAS

We introduce Collaborative Battleship, a decision-oriented dialogue task for studying information-seeking behavior in AI agents and humans.

Abstract at a Glance

Key points

  1. Motivation

    High-stakes decisions often require agents to ask smart questions before acting, but we lack benchmarks that measure how agents navigate explore/exploit tradeoffs.

  2. Task

    Collaborative Battleship pits a question-asking Captain and an all-knowing Spotter against hidden ships, turning information search into gameplay.

  3. Behavior gap

    Compared to human players, LM agents generally ask less informative questions, produce inaccurate answers, and waste turns on low-probability moves.

  4. Method

    We develop novel Monte Carlo inference strategies for agents based on principles from Bayesian Experimental Design (BED), guiding what they ask, where they shoot, and when they choose to explore vs. act.

  5. Results

    Our approach enables weaker LMs, such as Llama-4-Scout, to outperform both humans (8% → 82% win rate) and frontier models (0% → 67% win rate vs. GPT-5) at ≈1% of GPT-5's cost.

  6. Generality

We replicate these findings on "Guess Who?", where our methods significantly boost accuracy (+28.3–42.4 p.p.), illustrating the generality of our approach.

  7. Impact

    Our findings demonstrate how Bayesian inference at test time can improve information-seeking and decision-making in discovery settings.

Dig deeper

Grounding questions by generating code

Translating questions into Python code enables the Spotter to reason more effectively about the board state.
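For illustration, here is the kind of translation involved. The board encoding, ship IDs, and question below are hypothetical examples, not the exact interface from our implementation:

import numpy as np

# Illustrative board encoding: 0 = water; 1, 2, 3 = ship IDs (e.g., red, green, purple).
true_board = np.array([
    [0, 0, 2, 2, 2, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 0, 3, 3, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
])

# Captain asks: "Is the red ship vertical?"
# The Spotter answers by translating the question into code and executing it on the board.
def is_red_ship_vertical(board: np.ndarray) -> bool:
    rows, cols = np.where(board == 1)   # tiles occupied by the red ship
    return len(set(rows)) > 1 and len(set(cols)) == 1

print(is_red_ship_vertical(true_board))   # True

Because the answer is computed from the board rather than guessed, the Spotter is less prone to the spatial-reasoning slips that direct answering invites.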


Across LMs, code generation boosts the question-answering accuracy of Spotter agents.

Spotter QA accuracy by language model: accuracy varies across the 15 LMs we evaluated; the dotted line indicates mean human performance (92.5%).

Spotter QA accuracy by strategy (Base LM, CoT, Code, CoT + Code): code generation consistently improves Spotter QA over direct answering and CoT (+14.7% absolute, CoT + Code vs. Base LM).

Monte Carlo inference with a probabilistic world model

We equip agents with a probabilistic "world model" that lets them sample possible boards, allowing them to represent uncertainty, form approximate beliefs, and make grounded inferences.

Observed partial board: shots so far constrain the hypothesis space.

Hypothesis Space: each sampled board is one plausible arrangement of ships consistent with the evidence.

Approximate belief state: we marginalize over the sample set to obtain a tile-level probability map.
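A minimal sketch of this pipeline, assuming a small grid, a toy fleet, and simple rejection sampling (the grid size, ship lengths, and function names are illustrative, not our actual sampler):

import numpy as np

SIZE = 6                     # illustrative grid size
SHIP_LENGTHS = [2, 3, 4]     # illustrative fleet

def random_board(rng):
    """Place each ship uniformly at random without overlap."""
    board = np.zeros((SIZE, SIZE), dtype=int)
    for ship_id, length in enumerate(SHIP_LENGTHS, start=1):
        while True:
            horizontal = rng.random() < 0.5
            r = rng.integers(0, SIZE - (0 if horizontal else length - 1))
            c = rng.integers(0, SIZE - (length - 1 if horizontal else 0))
            rr = slice(r, r + (1 if horizontal else length))
            cc = slice(c, c + (length if horizontal else 1))
            if np.all(board[rr, cc] == 0):
                board[rr, cc] = ship_id
                break
    return board

def consistent(board, shots):
    """A hypothesis must reproduce every observed hit and miss."""
    return all((board[r, c] > 0) == hit for (r, c), hit in shots.items())

def sample_posterior(shots, n_samples=200, seed=0):
    """Rejection sampling: keep random boards that agree with the evidence."""
    rng = np.random.default_rng(seed)
    samples = []
    while len(samples) < n_samples:
        board = random_board(rng)
        if consistent(board, shots):
            samples.append(board)
    return samples

# Observed shots so far: (row, col) -> True for a hit, False for a miss.
shots = {(1, 0): True, (4, 4): False}
samples = sample_posterior(shots)

# Marginalize over the sample set to get a tile-level probability map (the heatmap).
heatmap = np.mean([board > 0 for board in samples], axis=0)
print(np.round(heatmap, 2))

The resulting heatmap is the approximate belief state that the Bayesian strategies described below act on.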

Optimizing questions via expected information gain (EIG)

Because questions are executable, we can score how much information each candidate reveals by how it splits the hypothesis space.

Code translation enables mental simulation

Captain agents can simulate an "internal" Spotter agent that provides code translations for candidate questions.
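Sketched very roughly, the Captain's candidate-generation loop pairs each proposed question with an executable translation. The names propose_questions and translate_to_code below are hypothetical stand-ins for LM calls, not our actual interface, and the translations are hard-coded for illustration:

import numpy as np
from typing import Callable, List, Tuple

def propose_questions(n: int) -> List[str]:
    """Hypothetical stand-in for an LM call that drafts candidate questions."""
    return [
        "Is the red ship vertical?",
        "Is any ship touching the top row?",
    ][:n]

def translate_to_code(question: str) -> Callable[[np.ndarray], bool]:
    """Hypothetical stand-in for the simulated 'internal' Spotter; in the real
    system an LM writes this code."""
    translations = {
        "Is the red ship vertical?":
            lambda b: len(set(np.where(b == 1)[0])) > 1 and len(set(np.where(b == 1)[1])) == 1,
        "Is any ship touching the top row?":
            lambda b: bool((b[0, :] > 0).any()),
    }
    return translations[question]

def candidate_programs(n: int = 2) -> List[Tuple[str, Callable]]:
    """Pair each candidate question with its executable translation."""
    return [(q, translate_to_code(q)) for q in propose_questions(n)]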


Evaluating across the hypothesis space

Executing the code against each sample splits the hypothesis space into True/False partitions.

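A minimal sketch of this step (question_code is a hypothetical LM-translated program; `samples` refers to boards drawn as in the world-model sketch above):

def question_code(board):
    """A translated candidate question: 'Is any ship touching the top row?'"""
    return bool((board[0, :] > 0).any())

def partition(samples, program):
    """Split the sampled hypotheses by the answer the program returns on each."""
    true_set = [b for b in samples if program(b)]
    false_set = [b for b in samples if not program(b)]
    return true_set, false_set

# true_set, false_set = partition(samples, question_code)
# p_true = len(true_set) / len(samples)   # fraction of hypotheses answering "True"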

EIG peaks at balanced hypotheses

Expected information gain (EIG) is maximized when the question splits the hypothesis space evenly.

Expected information gain vs. p(True) under channel noise ε = 0.10. The curve peaks at p(True) = 0.5, where EIG reaches its maximum of 1 − H(ε).

We use channel noise (ε) to model uncertainty in the predicted answer (either due to mis-translations or inherent ambiguity in the question).
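Concretely, if a fraction p_true of sampled boards answer "True" and the channel flips an answer with probability ε, the EIG is the mutual information of a binary symmetric channel, which matches the curve described above. A small sketch:

import numpy as np

def binary_entropy(p: float) -> float:
    """H(p) in bits, with 0·log 0 taken as 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def expected_information_gain(p_true: float, eps: float = 0.10) -> float:
    """EIG of a yes/no question whose program returns True on a fraction p_true
    of sampled boards, observed through a channel that flips the answer with
    probability eps: EIG = H(p_yes) - H(eps)."""
    p_yes = p_true * (1 - eps) + (1 - p_true) * eps   # marginal P(answer = "True")
    return binary_entropy(p_yes) - binary_entropy(eps)

print(expected_information_gain(0.5))   # peak: 1 - H(0.10) ≈ 0.531 bits
print(expected_information_gain(0.9))   # lopsided split: ≈ 0.211 bits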

Bayesian strategies for building rational agents

Belief-aware policies help agents decide what to ask, where to shoot, and when to explore vs. act. A schematic sketch of all three policies follows the strategy cards below.

Precision Moves

Bayes-M

Maintain a weighted particle belief over feasible boards and fire at the tile with the highest posterior chance of containing a ship, yielding sharp MAP-driven targeting.

InfoMax Questions

Bayes-Q

Sample candidate questions, translate them to executable checks, and choose the one whose expected information gain (EIG) is highest under the current belief.

Explore vs. Act

Bayes-D

Compare the discounted post-question MAP hit rate against the current best shot; ask when the value of information exceeds immediate reward, otherwise exploit with a targeted move.
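Below is a schematic, self-contained sketch of all three strategies. The heatmap and samples are assumed to come from a world model like the one sketched earlier, candidates are (question, program) pairs, and the discount factor and hit-rate estimates are simplified stand-ins rather than our exact formulation:

import numpy as np

def bayes_m_move(heatmap: np.ndarray):
    """Bayes-M: fire at the tile with the highest posterior ship probability."""
    return np.unravel_index(np.argmax(heatmap), heatmap.shape)

def bayes_q_question(candidates, samples):
    """Bayes-Q: pick the question whose program splits the samples most evenly;
    for a fixed channel noise, the most balanced split has the highest EIG."""
    def imbalance(program):
        p_true = np.mean([bool(program(b)) for b in samples])
        return abs(p_true - 0.5)
    return min(candidates, key=lambda qc: imbalance(qc[1]))

def conditioned_heatmap(samples, program, answer):
    """Belief heatmap after conditioning the sample set on a predicted answer."""
    kept = [b for b in samples if bool(program(b)) == answer]
    if not kept:
        return np.zeros_like(samples[0], dtype=float)
    return np.mean([b > 0 for b in kept], axis=0)

def bayes_d_decide(heatmap, candidates, samples, gamma=0.9):
    """Bayes-D (schematic): ask a question only when the discounted MAP hit rate
    expected after the best question beats the best immediate shot."""
    act_now_value = heatmap.max()                         # best current shot
    question, program = bayes_q_question(candidates, samples)
    p_true = np.mean([bool(program(b)) for b in samples])
    post_map = (p_true * conditioned_heatmap(samples, program, True).max()
                + (1 - p_true) * conditioned_heatmap(samples, program, False).max())
    if gamma * post_map > act_now_value:                  # value of information wins
        return ("ask", question)
    return ("move", bayes_m_move(heatmap))

Note that the card above describes Bayes-Q as maximizing EIG directly; with a fixed ε across candidates, picking the most balanced split gives the same ordering.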

Targeting performance by strategy

Targeting scores (F1) by Captain strategy. Incorporating Bayesian strategies for questions (Bayes-Q), moves (Bayes-M), and decisions (Bayes-D) brings weaker LMs from near-random performance to super-human levels.

Expected information gain (EIG) as a function of the number of candidate questions sampled by Bayes-Q. Drawing up to 10 programs yields +0.227 bits per question (94.2% of the information-theoretic ceiling) while driving redundant questions nearly to zero for weaker models.

Models shown: Human Baseline, Llama-4-Scout, GPT-4o, GPT-5.

Game Explorer

Browse curated playthroughs by Captain strategy and language model.

Prior Work

This work builds on our prior paper on modeling human-like question-asking in Battleship.

BibTeX


@article{grand2025battleship,
  title={Shoot First, Ask Questions Later? Building Rational Agents That Explore and Act Like People},
  author={Gabriel Grand and Valerio Pepe and Joshua B. Tenenbaum and Jacob Andreas},
  journal={ArXiv},
  year={2025},
  volume={abs/2510.20886},
  url={https://arxiv.org/abs/2510.20886}
}