Abstract
Scientific discovery is fundamentally a resource-constrained process that requires navigating complex trade-offs between the quality and quantity of measurements due to physical and cost constraints. Existing benchmarks for evaluating agents for scientific discovery focus on either static knowledge-based reasoning or unconstrained experimental design tasks, and do not capture the ability to make measurements and plan under constraints.
To bridge this gap, we propose Measuring and Discovering Physics (MaD Physics), a benchmark to evaluate the ability of agents to make informative measurements and draw conclusions subject to constraints on the quality and quantity of measurements. The benchmark consists of three environments, each based on a distinct physical law. To mitigate contamination from existing knowledge, MaD Physics includes altered physical laws. In each trial, the agent makes measurements of the system until it exhausts an allotted budget, and then must infer the underlying physical law to make predictions about the future state of the system.
MaD Physics evaluates two fundamental capabilities of scientific agents: inferring models from data and planning under constraints. We also demonstrate how MaD Physics can be used to evaluate other capabilities such as multimodality and in-context learning. We benchmark agents on MaD Physics using four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash), identifying shortcomings in their structured exploration and data collection capabilities, and highlighting directions to improve their scientific reasoning.
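The trial protocol described in the abstract (measure until the budget runs out, then infer the law and predict a future state) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not the benchmark's API: the hidden law `y = a * t**2`, the noise model in which noise shrinks as `1 / precision`, the linear cost model, and the names `make_system`, `run_trial`, and `fit_a`.

```python
import random

def make_system(a=1.7, seed=0):
    """Hypothetical hidden law y = a * t**2 observed with Gaussian noise."""
    rng = random.Random(seed)
    def measure(t, precision):
        # Assumed noise model: higher precision means lower noise.
        return a * t**2 + rng.gauss(0.0, 1.0 / precision)
    return measure

def cost(precision):
    # Assumed cost model: a more precise measurement costs more budget.
    return precision

def run_trial(measure, plan, budget):
    """Take measurements from `plan` until the next one is unaffordable."""
    data, remaining = [], budget
    for t, precision in plan:
        if cost(precision) > remaining:
            break
        remaining -= cost(precision)
        data.append((t, measure(t, precision), precision))
    return data, remaining

def fit_a(data):
    """Weighted least squares for y = a * t**2 with weights precision**2."""
    num = sum(p**2 * t**2 * y for t, y, p in data)
    den = sum(p**2 * t**4 for t, y, p in data)
    return num / den

measure = make_system()
# Two candidate plans under the same budget: quantity versus quality.
many_cheap = [(0.5 + 0.1 * i, 1.0) for i in range(100)]
few_precise = [(0.5 + 0.5 * i, 10.0) for i in range(100)]

for name, plan in [("many cheap", many_cheap), ("few precise", few_precise)]:
    data, left = run_trial(measure, plan, budget=20.0)
    a_hat = fit_a(data)
    # Predict the future state at t = 5.0 from the inferred law.
    print(name, "measurements:", len(data), "prediction:", a_hat * 5.0**2)
```

With the same budget, the first plan buys twenty noisy measurements and the second buys two precise ones. Choosing between such plans is the kind of quality-versus-quantity trade-off the benchmark is designed to probe.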
Main Results
We benchmark a minimal agent scaffold with code execution across Gemini 2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash. Performance generally improves with model capability, and a Strategy system prompt inspired by Bayesian experimental design tends to help. However, even the strongest models struggle to recover correct symbolic forms of the underlying laws, and smaller Gemini 2.5 models often produce out-of-bounds predictions in classical mechanics, indicating headroom for both better scaffolds and base capabilities.
Predictions are not clipped: entries far above 1 in the Classical table reflect runaway out-of-bounds predictions, which separates models that produce stable predictions from those that do not.
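To see why an unclipped error can land far above 1, consider the sketch below. The benchmark's exact error metric is not given in this section, so the relative L2 error used here is an assumption, and the trajectories are made-up values chosen only to contrast a stable prediction with a diverging one.

```python
import math

def relative_error(pred, truth):
    """Relative L2 error ||pred - truth|| / ||truth|| (assumed metric)."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)))
    den = math.sqrt(sum(t ** 2 for t in truth))
    return num / den

truth = [1.0, 0.5, -0.5, -1.0]      # bounded trajectory (illustrative values)
stable = [0.9, 0.6, -0.4, -1.1]     # prediction that stays in bounds
runaway = [1.0, 5.0, 50.0, 500.0]   # prediction that diverges

for name, pred in [("stable", stable), ("runaway", runaway)]:
    err = relative_error(pred, truth)
    # Clipping at 1 would hide how badly the runaway prediction diverges.
    print(name, "unclipped:", err, "clipped:", min(err, 1.0))
```

Under this assumed metric, clipping would map every diverging prediction to the same value, whereas the unclipped number preserves the difference between a mildly wrong model and one whose predictions leave the valid range.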
Classical Mechanics — combined config uses κ=10 with 1/r gravity
Prediction Error ↓ (lower is better)
| Model | Prompt | Normal | Anisotropic Inertia (κ=10) | Anisotropic Inertia (κ=20) | Altered Gravity (1/r) | Altered Gravity (Ripple) | Combined |
|---|---|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | Base | 6.61 | 13.78 | 831.97 | 638.25 | 253.44 | 1241.11 |
| | + Strategy | 48.32 | 115.02 | 74.62 | 463.83 | 1183.7 | 964.43 |
| Gemini 2.5 Flash | Base | 5.97 | 35.36 | 4.42 | 28.22 | 66.75 | 198.69 |
| | + Strategy | 7.93 | 12.32 | 60.36 | 26.45 | 44.32 | 23.33 |
| Gemini 2.5 Pro | Base | 1.93 | 2.12 | 13.41 | 11.40 | 15.67 | 38.72 |
| | + Strategy | 0.67 | 1.56 | 0.37 | 1.22 | 0.50 | 0.49 |
| Gemini 3 Flash | Base | 0.29 | 0.36 | 0.88 | 0.31 | 0.39 | 0.38 |
| | + Strategy | 0.38 | 0.39 | 0.37 | 0.37 | 0.43 | 0.35 |
Fluid Mechanics — combined config uses a convex combination of velocity and vorticity modulation
Prediction Error ↓ (lower is better)
| Model | Prompt | Normal | Velocity Modulation (γ=0.5) | Velocity Modulation (γ=0.7) | Vorticity Modulation (γ=5.0) | Vorticity Modulation (γ=10.0) | Combined |
|---|---|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | Base | 0.68 | 0.73 | 0.82 | 0.65 | 0.81 | 1.02 |
| | + Strategy | 0.41 | 3.26 | 0.81 | 0.59 | 0.73 | 0.81 |
| Gemini 2.5 Flash | Base | 0.47 | 0.51 | 0.66 | 0.82 | 0.85 | 0.86 |
| | + Strategy | 0.39 | 0.47 | 0.22 | 0.75 | 0.50 | 0.71 |
| Gemini 2.5 Pro | Base | 0.21 | 0.17 | 0.18 | 0.73 | 2.51 | 0.71 |
| | + Strategy | 0.23 | 0.10 | 0.33 | 0.44 | 0.70 | 0.69 |
| Gemini 3 Flash | Base | 0.26 | 0.49 | 0.50 | 0.36 | 0.79 | 0.23 |
| | + Strategy | 0.17 | 0.14 | 0.41 | 0.27 | 0.19 | 0.31 |
Quantum Mechanics — combined config uses λ=25 and p=1
Prediction Error ↓ (lower is better)
| Model | Prompt | Normal | Measurement Norm (p=1) | Measurement Norm (p=3) | Entanglement (λ=5.0) | Entanglement (λ=15.0) | Combined |
|---|---|---|---|---|---|---|---|
| Gemini 2.5 Flash Lite | Base | 0.17 | 0.66 | 0.31 | 0.11 | 0.25 | 0.54 |
| | + Strategy | 0.11 | 0.61 | 0.21 | 0.14 | 0.14 | 0.55 |
| Gemini 2.5 Flash | Base | 0.14 | 0.54 | 0.13 | 0.13 | 0.18 | 0.60 |
| | + Strategy | 0.11 | 0.47 | 0.09 | 0.07 | 0.16 | 0.59 |
| Gemini 2.5 Pro | Base | 0.08 | 0.52 | 0.10 | 0.14 | 0.15 | 0.58 |
| | + Strategy | 0.03 | 0.46 | 0.08 | 0.10 | 0.12 | 0.40 |
| Gemini 3 Flash | Base | 0.11 | 0.53 | 0.04 | 0.30 | 0.16 | 0.39 |
| | + Strategy | 0.05 | 0.48 | 0.05 | 0.07 | 0.09 | 0.92 |