Motivation
High-stakes decisions often require agents to ask smart questions before acting, but we lack benchmarks that measure how agents navigate explore/exploit tradeoffs.
Task
Collaborative Battleship pits a question-asking Captain and an all-knowing Spotter against hidden ships, turning information search into gameplay.
Behavior gap
Compared to human players, LM agents generally ask less informative questions, produce inaccurate answers, and waste turns on low-probability moves.
Method
We develop novel Monte Carlo inference strategies for agents based on principles from Bayesian Experimental Design (BED), guiding what they ask, where they shoot, and when they choose to explore vs. act.
Results
Our approach enables weaker LMs, such as Llama-4-Scout, to outperform both humans (8% → 82% win rate) and frontier models (0% → 67% win rate vs. GPT-5) at ≈1% of GPT-5's cost.
Generality
We replicate these findings on "Guess Who?" where our methods significantly boost accuracy (+28.3–42.4 p.p.), illustrating the generality of our approach.
Impact
Our findings demonstrate how Bayesian inference at test time can improve information-seeking and decision-making in discovery settings.
Battleship encapsulates a full active learning loop: agents must form data-driven hypotheses, perform experiments, and make educated guesses. At the same time, the game rules are straightforward, and state and action representations are lightweight. This makes Battleship an ideal evaluation harness for agents, one that doesn't require extensive prompt engineering.
For Spotter agents, we measure QA accuracy across a variety of questions, which often require grounded reasoning about the board and dialogue context. For Captain agents, our Collaborative Battleship task provides a variety of metrics that measure both end-to-end game performance and more granular aspects of question-asking. For instance, to measure overall performance, Targeting Score (F1) captures the balance between precision and recall in taking shots. Meanwhile, to measure the utility of specific questions, we compute expected information gain (EIG), which quantifies how much information a question is likely to reveal about the board state.
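To make the Targeting Score concrete, here's a minimal sketch (the `shots` and `ship_tiles` sets are hypothetical names; the exact scoring in our evaluation harness may differ in detail):

```python
def targeting_score(shots: set[tuple[int, int]], ship_tiles: set[tuple[int, int]]) -> float:
    """F1 over fired shots: precision penalizes wasted misses, recall rewards sinking ships."""
    hits = shots & ship_tiles
    if not hits:
        return 0.0
    precision = len(hits) / len(shots)
    recall = len(hits) / len(ship_tiles)
    return 2 * precision * recall / (precision + recall)
```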
We equip agents with a "world model" -- think of this as a tool that agents can call to sample hypothetical board states. This lets agents maintain a hypothesis space over possible board configurations and perform Bayesian belief updates as new evidence arrives. The key to making this work is code generation, which translates questions expressed in natural language into executable code that interacts with the world model. Scroll down for examples of how this works in practice.
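For intuition, here's a rough sketch of what a generated translation might look like (the question, tile encoding, and `world_model.sample` interface are illustrative stand-ins, not our actual API):

```python
import numpy as np

RED = 2  # hypothetical tile code marking the red ship

def is_red_ship_horizontal(board: np.ndarray) -> bool:
    """LM-generated translation of: 'Is the red ship horizontal?'"""
    rows, _ = np.where(board == RED)
    return len(set(rows)) == 1  # all red tiles share one row -> horizontal

# Querying the world model (sampling interface is a stand-in for the tool described above):
# samples = [world_model.sample(evidence) for _ in range(1000)]
# answers = [is_red_ship_horizontal(board) for board in samples]
```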
Yes -- we extend our Bayesian strategies to a general family of information-seeking games from TextArena. As an initial test, we replicated our full experiments on "Guess Who?", a classic game where players ask yes/no questions to identify a hidden character. Here, we find that our approach boosts accuracy by 28 to 42 percentage points. The same strategies apply to other partially observed games and diagnostic workflows where agents must alternate between probing and acting.
Scan the full abstract for the quantitative takeaways, then head to the interactive Game Explorer to watch agents plan in real time.
Translating questions into Python code enables the Spotter to reason more effectively about the board state.
True board known to the Spotter
We equip agents with a "world model" that lets them sample possible boards, so they can represent uncertainty, form approximate beliefs, and make grounded inferences.
Shots so far constrain the hypothesis space.
Each sample is one plausible arrangement of ships consistent with the evidence.
We marginalize over the sample set to obtain a tile-level probability map.
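A minimal sketch of that marginalization step, assuming boards are NumPy arrays where nonzero values mark ship tiles (the real encoding may differ):

```python
import numpy as np

def tile_probability_map(samples: list[np.ndarray]) -> np.ndarray:
    """Fraction of posterior samples with a ship on each tile (nonzero = ship in this toy encoding)."""
    occupancy = np.stack([(board > 0).astype(float) for board in samples])
    return occupancy.mean(axis=0)  # e.g., an 8x8 grid of per-tile hit probabilities
```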
Executable questions let us score how much information each hypothesis-reducing split reveals.
Captain agents can simulate an "internal" Spotter agent that provides code translations for candidate questions.
Executing the code against each sample splits the hypothesis space into True/False partitions.
Expected information gain (EIG) is maximized when the question splits the hypothesis space evenly.
We use channel noise (ε) to model uncertainty in the predicted answer (either due to mis-translations or inherent ambiguity in the question).
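Putting these pieces together, here's a sketch of a noisy EIG estimate, assuming a simple binary symmetric channel with noise ε (our exact estimator may differ in detail). Note that it peaks when the question splits the samples 50/50:

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_info_gain(samples, question_program, epsilon: float = 0.1) -> float:
    """EIG of a yes/no question whose answer passes through a binary symmetric
    channel with noise epsilon: I(answer; board) = H(p_yes) - H(epsilon)."""
    p_true = sum(bool(question_program(board)) for board in samples) / len(samples)
    p_yes = p_true * (1 - epsilon) + (1 - p_true) * epsilon
    return binary_entropy(p_yes) - binary_entropy(epsilon)
```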
Belief-aware policies help agents decide what to ask, where to shoot, and when to explore vs. act.
Precision Moves
Bayes-M
InfoMax Questions
Bayes-Q
Explore vs. Act
Bayes-D
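As a rough illustration only (not the actual Bayes-D rule), a belief-aware controller might compare the best candidate question's EIG against a threshold before deciding whether to ask or fire, reusing the `expected_info_gain` sketch above:

```python
import numpy as np

def choose_action(samples, candidate_questions, prob_map, questions_left, eig_threshold=0.5):
    """Toy belief-aware controller (illustrative only): ask the most informative
    question if the budget allows and it clears a threshold; otherwise fire at
    the highest-probability tile (prob_map is assumed to zero out tiles already fired at)."""
    if questions_left > 0 and candidate_questions:
        best_q = max(candidate_questions, key=lambda q: expected_info_gain(samples, q))
        if expected_info_gain(samples, best_q) >= eig_threshold:
            return ("ask", best_q)
    row, col = np.unravel_index(np.argmax(prob_map), prob_map.shape)
    return ("fire", (int(row), int(col)))
```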
Press the Play button below to get started. Use the Game Browser to select scenarios based on Captain Strategy and LLM.
This work builds on our prior paper on modeling human-like question-asking in Battleship:
@article{grand2025battleship,
  title={Shoot First, Ask Questions Later? Building Rational Agents That Explore and Act Like People},
  author={Grand, Gabriel and Pepe, Valerio and Tenenbaum, Joshua B. and Andreas, Jacob},
  journal={arXiv preprint arXiv:2510.20886},
  year={2025},
  url={https://arxiv.org/abs/2510.20886}
}