MaD Physics: Evaluating Information-Seeking Under Constraints in Physical Environments

Abstract

Scientific discovery is fundamentally a resource-constrained process that requires navigating complex trade-offs between the quality and quantity of measurements due to physical and cost constraints. Existing benchmarks for evaluating agents for scientific discovery focus on either static knowledge-based reasoning or unconstrained experimental design tasks, and do not capture the ability to make measurements and plan under constraints.

To bridge this gap, we propose Measuring and Discovering Physics (MaD Physics), a benchmark that evaluates how well agents make informative measurements and draw conclusions when both the quality and the quantity of measurements are constrained. The benchmark consists of three environments, each based on a distinct physical law. To mitigate contamination from existing knowledge, MaD Physics includes altered physical laws. In each trial, the agent measures the system until it exhausts an allotted budget, and must then infer the underlying physical law and predict the future state of the system.

MaD Physics evaluates two fundamental capabilities of scientific agents: inferring models from data and planning under constraints. We also demonstrate how MaD Physics can be used to evaluate other capabilities such as multimodality and in-context learning. We benchmark agents on MaD Physics using four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash), identifying shortcomings in their structured exploration and data collection capabilities, and highlighting directions to improve their scientific reasoning.

What's inside

Benchmark Features

3 minimal physical domains. Classical, fluid, and quantum mechanics environments, each considering a minimal system governed by a physical law.
Altered physics. Modified laws make the benchmark test information-seeking rather than recall of memorized physics.
Measurement under budget. Each observation has a cost; agents must trade off quality versus quantity within a fixed measurement budget.
Prediction from collected data. After the measurement phase, agents must infer the underlying law and predict the system's future state.
Multiple evaluation modes. Numerical and image-based observations, in-context learning across episodes, and parameter inference under a known structural form.
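
To make the quality-versus-quantity trade-off concrete, here is a minimal Python sketch of a budgeted measurement loop. The `BudgetedSensor` class, the `estimate` helper, and the noise model (standard deviation shrinking as one over the square root of the cost) are illustrative assumptions, not the benchmark's actual interface or cost model.

```python
import random


class BudgetedSensor:
    """Hypothetical sensor: paying more per reading buys lower noise."""

    def __init__(self, true_value, budget, base_noise=1.0, seed=0):
        self.true_value = true_value
        self.remaining = budget
        self.base_noise = base_noise
        self.rng = random.Random(seed)

    def measure(self, cost):
        """Spend `cost` units of budget and return one noisy reading."""
        if cost <= 0:
            raise ValueError("cost must be positive")
        if cost > self.remaining:
            raise RuntimeError("measurement budget exhausted")
        self.remaining -= cost
        # Assumed noise model: sigma = base_noise / sqrt(cost).
        sigma = self.base_noise / cost ** 0.5
        return self.rng.gauss(self.true_value, sigma)


def estimate(sensor, cost_per_reading):
    """Take readings at a fixed cost until the budget runs out, then average."""
    readings = []
    while sensor.remaining >= cost_per_reading:
        readings.append(sensor.measure(cost_per_reading))
    if not readings:
        raise ValueError("budget too small for a single reading")
    return sum(readings) / len(readings), len(readings)


# Same budget, two strategies: many cheap readings or a few expensive ones.
many_cheap = estimate(BudgetedSensor(3.0, budget=10), cost_per_reading=1)
few_precise = estimate(BudgetedSensor(3.0, budget=10), cost_per_reading=5)
```

Under this assumed noise model, averaging a static quantity gives the same variance however the budget is split (when the cost divides the budget evenly), so the trade-off matters mainly when the system evolves between readings.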

Three physical domains

Environments

Domain 1

Classical Mechanics

A system of N spherical objects evolving under Newtonian dynamics, with optional anisotropic inertial mass (governed by a coupling constant κ) and modified gravity laws (1/r or rippled).

Normal. Standard Newtonian dynamics — no alteration.

1/r gravity. Modified gravity: gravitational force ∝ 1/r.

Ripple gravity. Modified gravity: rippled inverse-square law.

Anisotropic κ=10. Anisotropic inertia with κ = 10.

Anisotropic κ=20. Anisotropic inertia with κ = 20.

Combined. Anisotropic inertia (κ = 10) with 1/r gravity.
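
The difference between the standard and 1/r gravity configurations can be sketched as a single force-magnitude function with a tunable exponent. The function name `gravity_magnitude`, the unit constant `G = 1.0`, and the exact prefactor are assumptions for illustration. The rippled law is not sketched because its functional form is not given here.

```python
def gravity_magnitude(m1, m2, r, exponent=2.0, G=1.0):
    """Attractive force magnitude between two point masses.

    exponent=2.0 gives the standard inverse-square law;
    exponent=1.0 gives a force proportional to 1/r.
    G and the prefactor are illustrative assumptions.
    """
    if r <= 0:
        raise ValueError("separation r must be positive")
    return G * m1 * m2 / r ** exponent


newtonian = gravity_magnitude(1.0, 1.0, 2.0)               # inverse-square
altered = gravity_magnitude(1.0, 1.0, 2.0, exponent=1.0)   # 1/r law
```

Doubling the separation quarters the inverse-square force but only halves the 1/r force, so trajectories under the two laws diverge most clearly when measurements span a range of separations.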

Domain 2

Fluid Mechanics

2D incompressible viscous flow (Kelvin–Helmholtz instability), governed by the Navier–Stokes equations. Alterations introduce a state-dependent gyroscopic forcing that perturbs the velocity perpendicular to the flow, modulated by either local kinetic energy (velocity modulation) or vorticity (vorticity modulation).

Normal. Standard Navier–Stokes — no alteration.

Velocity mod. Gyroscopic forcing scaled by local kinetic energy.

Vorticity mod. Gyroscopic forcing scaled by local vorticity, producing opposing force layers within turbulent eddies.

Combined. A convex combination of velocity and vorticity modulation.
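
As a rough illustration of the forcing terms described above, the sketch below builds a force perpendicular to the local velocity, scales it by kinetic energy or by vorticity, and mixes the two with a convex weight. All function names, the strength parameters `gamma_vel` and `gamma_vort`, the mixing weight `alpha`, and the exact scalings are assumptions. The benchmark's actual forcing may differ.

```python
def perpendicular(u, v):
    """Rotate the velocity (u, v) by 90 degrees counter-clockwise."""
    return (-v, u)


def velocity_mod_forcing(u, v, gamma):
    """Perpendicular forcing scaled by local kinetic energy (assumed form)."""
    kinetic = 0.5 * (u * u + v * v)
    px, py = perpendicular(u, v)
    return (gamma * kinetic * px, gamma * kinetic * py)


def vorticity_mod_forcing(u, v, omega, gamma):
    """Perpendicular forcing scaled by local vorticity omega (assumed form)."""
    px, py = perpendicular(u, v)
    return (gamma * omega * px, gamma * omega * py)


def combined_forcing(u, v, omega, gamma_vel, gamma_vort, alpha):
    """Convex combination: alpha * velocity term + (1 - alpha) * vorticity term."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    f_vel = velocity_mod_forcing(u, v, gamma_vel)
    f_vort = vorticity_mod_forcing(u, v, omega, gamma_vort)
    return tuple(alpha * a + (1.0 - alpha) * b for a, b in zip(f_vel, f_vort))
```

Because each term is perpendicular to the velocity, its dot product with the velocity is zero: in this sketch the forcing deflects the flow without directly doing work on it.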

Domain 3

Quantum Mechanics

Two particles in a 2D box, evolving under the time-dependent Schrödinger equation with smoothed infinite-well potentials. Alterations include a generalized Born rule (probability density p-norm with p ≠ 2) and non-linear entanglement initialization (spatial-correlation factor parameterized by λ).

Normal (p=2). Standard quantum mechanics: separable initial state, p = 2 Born rule.

Born rule p=3. Generalized Born rule with p = 3 — modified measurement postulate.

Entangled init. Non-linear entanglement initialization with spatial correlation factor.
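
One way to read the generalized Born rule is that the outcome probability is proportional to |ψ|^p rather than |ψ|², normalized over all outcomes. The sketch below applies that reading to a discretized wavefunction. The function name `generalized_born` and this particular normalization are assumptions, not the benchmark's confirmed definition.

```python
def generalized_born(amplitudes, p=2.0):
    """Map complex amplitudes to probabilities proportional to |psi|**p.

    p=2.0 recovers the standard Born rule. Other values of p follow the
    assumed generalization and are normalized to sum to 1.
    """
    weights = [abs(a) ** p for a in amplitudes]
    total = sum(weights)
    if total == 0:
        raise ValueError("wavefunction is zero everywhere")
    return [w / total for w in weights]


standard = generalized_born([1 + 0j, 1j, 0j])   # p = 2
cubic = generalized_born([1.0, 2.0], p=3.0)     # p = 3
```

With p = 3 the larger amplitude takes a bigger share of the probability than under p = 2 (8/9 instead of 4/5 for amplitudes 1 and 2), which is the kind of deviation an agent would need enough measurements to detect.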

Main Results

We benchmark a minimal agent scaffold with code execution across Gemini 2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash. Performance generally improves with model capability, and a Strategy system prompt inspired by Bayesian experimental design tends to help. However, even the strongest models struggle to recover correct symbolic forms of the underlying laws, and smaller Gemini 2.5 models often produce out-of-bounds predictions in classical mechanics, indicating headroom for both better scaffolds and base capabilities.

Predictions are not clipped — entries far above 1 in the Classical table reflect runaway out-of-bounds predictions, separating models that produce stable predictions from those that don't.

Classical Mechanics: prediction error (lower is better). The combined config uses κ=10 with 1/r gravity.
Model	Prompt	Normal	Anisotropic inertia, κ=10	Anisotropic inertia, κ=20	Altered gravity, 1/r	Altered gravity, ripple	Combined
Gemini 2.5 Flash Lite	Base	6.61	13.78	831.97	638.25	253.44	1241.11
Gemini 2.5 Flash Lite	+ Strategy	48.32	115.02	74.62	463.83	1183.7	964.43
Gemini 2.5 Flash	Base	5.97	35.36	4.42	28.22	66.75	198.69
Gemini 2.5 Flash	+ Strategy	7.93	12.32	60.36	26.45	44.32	23.33
Gemini 2.5 Pro	Base	1.93	2.12	13.41	11.40	15.67	38.72
Gemini 2.5 Pro	+ Strategy	0.67	1.56	0.37	1.22	0.50	0.49
Gemini 3 Flash	Base	0.29	0.36	0.88	0.31	0.39	0.38
Gemini 3 Flash	+ Strategy	0.38	0.39	0.37	0.37	0.43	0.35

Fluid Mechanics: prediction error (lower is better). The combined config uses a convex combination of velocity and vorticity modulation.
Model	Prompt	Normal	Velocity modulation, γ=0.5	Velocity modulation, γ=0.7	Vorticity modulation, γ=5.0	Vorticity modulation, γ=10.0	Combined
Gemini 2.5 Flash Lite	Base	0.68	0.73	0.82	0.65	0.81	1.02
Gemini 2.5 Flash Lite	+ Strategy	0.41	3.26	0.81	0.59	0.73	0.81
Gemini 2.5 Flash	Base	0.47	0.51	0.66	0.82	0.85	0.86
Gemini 2.5 Flash	+ Strategy	0.39	0.47	0.22	0.75	0.50	0.71
Gemini 2.5 Pro	Base	0.21	0.17	0.18	0.73	2.51	0.71
Gemini 2.5 Pro	+ Strategy	0.23	0.10	0.33	0.44	0.70	0.69
Gemini 3 Flash	Base	0.26	0.49	0.50	0.36	0.79	0.23
Gemini 3 Flash	+ Strategy	0.17	0.14	0.41	0.27	0.19	0.31

Quantum Mechanics: prediction error (lower is better). The combined config uses λ=25 and p=1.
Model	Prompt	Normal	Measurement norm, p=1	Measurement norm, p=3	Entanglement, λ=5.0	Entanglement, λ=15.0	Combined
Gemini 2.5 Flash Lite	Base	0.17	0.66	0.31	0.11	0.25	0.54
Gemini 2.5 Flash Lite	+ Strategy	0.11	0.61	0.21	0.14	0.14	0.55
Gemini 2.5 Flash	Base	0.14	0.54	0.13	0.13	0.18	0.60
Gemini 2.5 Flash	+ Strategy	0.11	0.47	0.09	0.07	0.16	0.59
Gemini 2.5 Pro	Base	0.08	0.52	0.10	0.14	0.15	0.58
Gemini 2.5 Pro	+ Strategy	0.03	0.46	0.08	0.10	0.12	0.40
Gemini 3 Flash	Base	0.11	0.53	0.04	0.30	0.16	0.39
Gemini 3 Flash	+ Strategy	0.05	0.48	0.05	0.07	0.09	0.92

Variants

Beyond predictive error on the default environments, MaD Physics supports several variants that probe additional capabilities: image-based observations, in-context learning across episodes, and parameter inference under a known structural form.

Visual observations

An additional variant of the Classical environment provides only image renderings of the system instead of numerical state values. Trends across model capability and altered laws hold, but errors are noticeably larger.

Visual Classical Mechanics: prediction error (lower is better).
Model	Prompt	Normal	κ=10
Gemini 2.5 Flash	Base	6.81	1786.13
Gemini 2.5 Flash	+ Strategy	3.16	15.64
Gemini 2.5 Pro	Base	4.56	21.65
Gemini 2.5 Pro	+ Strategy	0.65	13.23

In-context learning

Prediction error on Classical Mechanics across episodes for Gemini 2.5 Pro and Gemini 3 Flash. Gemini 3 Flash starts with lower error and continues to improve over episodes, while 2.5 Pro fails to learn under altered physics.

Parameter inference

Active-sensing variant: estimate κ (the inertial-memory coupling) given the model’s structural form. Gemini 2.5 Pro consistently underestimates κ, indicating a bias toward standard physics.
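
Parameter inference under a known structural form can be sketched as a one-dimensional least-squares search: fix the model's shape and pick the parameter value that best explains the observations. The `fit_parameter` helper and the toy linear model below are hypothetical stand-ins. The benchmark's actual dynamics and estimation procedure are not reproduced here.

```python
def fit_parameter(model, xs, ys, candidates):
    """Return the candidate parameter minimizing the sum of squared errors.

    `model(x, k)` is a known structural form with one unknown parameter k.
    """
    def squared_error(k):
        return sum((model(x, k) - y) ** 2 for x, y in zip(xs, ys))

    return min(candidates, key=squared_error)


def toy_model(x, k):
    """Toy structural form (hypothetical): response scales linearly with k."""
    return k * x


xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]   # noiseless observations generated with k = 10
kappa_hat = fit_parameter(toy_model, xs, ys, candidates=range(0, 21))
```

A systematic underestimate of the kind reported above would show up here as `kappa_hat` falling below the generating value even though the data support the larger one.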
