Regime-Conditioned BO: Why Your Benchmarks Lie

New arXiv research shows that Bayesian Optimization acquisition function rankings reverse with budget and prior quality, and introduces the Portable Regime Score (PRS) to predict the optimal strategy.


New research published on arXiv challenges the conventional wisdom of Bayesian Optimization (BO) acquisition function benchmarking, demonstrating that the “best” strategy depends heavily on experimental conditions, or “regimes.” The paper introduces the Portable Regime Score (PRS), a metric that predicts which acquisition function will outperform others based on the budget-to-context ratio and the quality of prior knowledge. The result undermines the notion of a stable, universal leaderboard and offers a practical protocol for operators to select strategies conditionally.

  • Published benchmarks for Bayesian Optimization acquisition functions are often misleading because they average performance across diverse conditions, obscuring conditional effects.
  • The paper introduces the Portable Regime Score (PRS), calculated as (B/|A|)(1-rho), where B is the budget, |A| is the number of contexts, and rho is the prior rank correlation. PRS predicts which acquisition function will perform best under specific experimental regimes.
  • A new tool, RegimePlanner, dynamically estimates rho and switches acquisition functions, outperforming static strategies and even per-context oracles in tested scenarios.
  • The research audited 40 transfer-BO papers and found 98% failed to vary the budget-to-context ratio as a controlled variable, a critical oversight that leads to unstable rankings.

What changed

The core shift introduced by this arXiv paper is a fundamental re-evaluation of how Bayesian Optimization (BO) acquisition functions should be benchmarked and selected. Historically, the field has largely relied on “average treatment effect” comparisons, aiming to identify a universally superior acquisition function. This approach, as the paper's audit of 40 transfer-BO papers from top-tier ML conferences revealed, often overlooks critical experimental variables. Specifically, 98% of these papers failed to systematically vary the budget-to-context ratio (B/|A|) as a controlled axis, leading to potentially spurious conclusions.

The new research demonstrates that the optimal acquisition function is not static but “regime-conditioned.” For instance, on the GDSC2 benchmark, changing only the experimental budget completely reversed the ranking of two common acquisition functions: at a budget (B) of 50, Greedy outperformed UCB by 0.050 Hit@1, but at B=100, UCB surpassed Greedy by 0.035 Hit@1. This direct reversal under different budget conditions highlights the instability of unconditional rankings.
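
The article reports these results in Hit@1 without defining the metric. A common reading, and an assumption here rather than something stated in the paper summary, is the fraction of contexts in which the optimizer's top recommendation is the true best candidate. A minimal sketch under that reading:

```python
def hit_at_1(recommended: list, true_best: list) -> float:
    """Fraction of contexts where the top recommendation is the true optimum."""
    hits = sum(r == t for r, t in zip(recommended, true_best))
    return hits / len(recommended)

# A 0.050 Hit@1 gap then means the winner identified the true optimum
# in five more contexts per hundred than the runner-up.
print(hit_at_1(["d3", "d7", "d1"], ["d3", "d2", "d1"]))  # 0.666...
```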

To address this, the paper proposes the Portable Regime Score (PRS) as a predictive metric. PRS, defined as (B/|A|)(1-rho), incorporates the budget-to-context ratio (B/|A|) and the prior rank correlation (rho), which quantifies the quality of prior knowledge. This score allows practitioners to estimate the likely winner among acquisition functions based on pre-comparison observables, moving beyond a one-size-fits-all mentality. The “No-Free-Leaderboard” proposition explains this instability: when the Conditional Average Treatment Effect (CATE) changes sign across different regimes, any reported Average Treatment Effect (ATE) becomes a function of the specific benchmark mixture, making it unreliable for general application.
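
To make the “No-Free-Leaderboard” argument concrete: if the Greedy-minus-UCB effect is +0.050 Hit@1 in one regime and -0.035 in another (the paper's own GDSC2 numbers), then any averaged effect depends entirely on how many conditions of each regime the benchmark happens to include. A toy mixture calculation:

```python
# CATE of Greedy minus UCB in each regime, from the GDSC2 example
cate = {"B=50": +0.050, "B=100": -0.035}

def ate(weights: dict) -> float:
    """Average treatment effect under a given benchmark mixture."""
    return sum(weights[regime] * cate[regime] for regime in cate)

print(ate({"B=50": 0.7, "B=100": 0.3}))  # +0.0245: "Greedy wins"
print(ate({"B=50": 0.3, "B=100": 0.7}))  # -0.0095: "UCB wins"
```

Same two data points, opposite leaderboard; the only thing that changed is the benchmark mixture.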

How it works

The mechanism behind regime-conditioned evaluation centers on identifying and quantifying the experimental conditions that dictate acquisition function performance. The key insight is that the effectiveness of an acquisition function, which guides the exploration-exploitation trade-off in BO, is not absolute but relative to the available budget, the number of contexts, and the quality of prior information.

The Portable Regime Score (PRS) is the central operational component. It’s calculated as (B/|A|)(1-rho); a minimal code sketch follows the list. Here:

  • B represents the experimental budget, or the total number of evaluations allowed.
  • |A| is the number of distinct contexts or tasks being optimized.
  • rho is the prior rank correlation. This metric quantifies how well the initial, pre-experimental ranking of candidate solutions (based on prior knowledge or pilot data) correlates with their true, underlying performance. A high rho indicates strong prior knowledge, while a low rho suggests weak or misleading priors.
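
A minimal sketch of the score itself, plus one way to estimate rho from pilot data. The paper does not specify its rho estimator; using a Spearman rank correlation between prior scores and pilot outcomes is an assumption, and the inputs below are hypothetical:

```python
from scipy.stats import spearmanr

def portable_regime_score(budget: int, n_contexts: int, rho: float) -> float:
    """PRS = (B / |A|) * (1 - rho): budget per context, discounted by prior quality."""
    return (budget / n_contexts) * (1.0 - rho)

# Hypothetical prior ranking signal and pilot measurements for four candidates.
prior_scores = [0.9, 0.7, 0.5, 0.3]
pilot_outcomes = [0.8, 0.75, 0.4, 0.5]
rho, _ = spearmanr(prior_scores, pilot_outcomes)  # rho = 0.8 here

print(portable_regime_score(budget=100, n_contexts=16, rho=rho))  # 1.25
```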

The PRS essentially captures the interplay between the available resources (budget per context) and the uncertainty reduction potential (influenced by prior knowledge). A higher PRS might indicate a regime where more sophisticated exploration strategies are beneficial, while a lower PRS might favor simpler, more exploitative approaches. The research found a significant correlation (beta=0.50, p=1.1e-9) between PRS and acquisition function performance across 79 diverse conditions spanning chemistry, drug-response biology, and hyperparameter optimization (HPO).

Building on this, the paper introduces RegimePlanner. This system operates by dynamically estimating rho online during the initial phases of an experiment. Based on this real-time rho estimation and the known budget and context count, RegimePlanner uses the PRS to predict the optimal acquisition function and switches to it accordingly. This adaptive strategy allows it to outperform static choices. For example, RegimePlanner won all 16 HPO-B search spaces at B=100 and exceeded the matched {Greedy,UCB} per-context oracle on GDSC2 by 18%.
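
The article does not publish RegimePlanner’s internals, so the following is only a sketch of the decision logic as described: estimate rho online from pilot observations, compute the PRS, and switch between a greedy and a UCB-style acquisition function. The threshold value and the hard switch rule are illustrative assumptions, not the paper’s:

```python
from scipy.stats import spearmanr

def choose_acquisition(budget, n_contexts, prior_scores, pilot_outcomes,
                       prs_threshold=3.0):
    """Regime-conditioned switch between Greedy and UCB (illustrative).

    prs_threshold is a made-up constant; the paper derives its own rule.
    """
    rho, _ = spearmanr(prior_scores, pilot_outcomes)  # online rho estimate
    prs = (budget / n_contexts) * (1.0 - rho)
    # Direction follows the GDSC2 reversal: lower effective budget favored
    # Greedy's exploitation, higher effective budget favored UCB's exploration.
    return ("UCB" if prs >= prs_threshold else "Greedy"), prs

acq, prs = choose_acquisition(budget=100, n_contexts=16,
                              prior_scores=[0.9, 0.7, 0.5, 0.3],
                              pilot_outcomes=[0.8, 0.75, 0.4, 0.5])
print(acq, prs)  # "Greedy", 1.25 under these hypothetical pilot numbers
```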

This approach is conceptually similar to contextual bandit problems, where the optimal action (acquisition function in this case) depends on the observed context (the regime parameters) [1]. By explicitly modeling these regimes, the research moves beyond simplistic “probability matching strategies” to a more nuanced, predictive selection process.

Why it matters for operators

For any operator engaged in experimental optimization—whether it’s tuning hyperparameters for a deep learning model, discovering new drug candidates, or optimizing chemical processes—this research is a critical wake-up call. The long-held practice of picking a “best” Bayesian Optimization acquisition function based on generalized benchmarks is fundamentally flawed. You are likely leaving performance on the table, or worse, actively choosing a suboptimal strategy, because the optimal choice is highly conditional on your specific experimental setup.

The immediate implication is that operators must stop treating acquisition functions as universal tools. Instead, adopt a regime-conditioned mindset. Before embarking on a costly optimization run, assess your budget-to-context ratio (B/|A|) and, crucially, the quality of your prior knowledge (rho). This isn’t just academic; the paper shows that a simple budget change can flip the performance ranking of acquisition functions. If you’re running a high-budget, low-prior-knowledge experiment, your optimal strategy will differ significantly from a low-budget, high-prior-knowledge scenario.

The practical protocol outlined by the researchers is straightforward and actionable: always report B/|A|, rho, K (number of contexts), and the chosen metric alongside any claimed acquisition advantage. Beyond academic rigor, this is a necessary step for internal reproducibility and intelligent decision-making. Operators should run pilot experiments to estimate rho before committing to a full-scale optimization. Tools like RegimePlanner, or custom implementations of its logic, allow for dynamic adaptation rather than static selection: you’re no longer guessing which acquisition function is “best” but determining it from real-time experimental conditions, leading to more efficient resource allocation and faster discovery cycles.
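
In code, that reporting discipline is just a handful of fields logged next to any claimed win; the field names below are illustrative, with K read as the number of contexts per the article's gloss:

```python
def regime_report(budget, n_contexts, rho, metric, winner):
    """Bundle the regime descriptors the paper asks to be reported."""
    return {
        "B_over_A": budget / n_contexts,  # budget-to-context ratio
        "rho": rho,                       # prior rank correlation
        "K": n_contexts,                  # number of contexts
        "metric": metric,                 # e.g. "Hit@1"
        "claimed_winner": winner,
    }

print(regime_report(budget=100, n_contexts=16, rho=0.4,
                    metric="Hit@1", winner="UCB"))
```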

Benchmarks and evidence

The research provides concrete evidence of regime-dependent performance reversals and the predictive power of the Portable Regime Score (PRS):

  • Acquisition Function Ranking Reversal: On the GDSC2 benchmark, at a budget (B) of 50, the Greedy acquisition function outperformed UCB by 0.050 Hit@1. However, when the budget was increased to B=100, UCB outperformed Greedy by 0.035 Hit@1. This demonstrates a clear reversal of the optimal strategy based solely on budget.
  • PRS Predictive Power: Across 79 diverse experimental conditions, spanning chemistry, drug-response biology, and hyperparameter optimization (HPO), a hierarchical model showed a significant correlation between PRS and acquisition function performance, with a beta coefficient of 0.50 (p=1.1e-9). This indicates that PRS is a strong predictor of which acquisition function will perform better under specific regimes.
  • Equivalence Zones: The analysis revealed that 19% of the tested conditions fell into an “equivalence zone,” where the absolute advantage of one acquisition function over another was less than 0.01 Hit@1. This suggests that in nearly one-fifth of cases, the choice of acquisition function might not significantly impact the outcome, allowing for simpler choices.
  • RegimePlanner Performance: The proposed adaptive system, RegimePlanner, which estimates rho online and switches acquisition functions accordingly, achieved superior performance. It won all 16 HPO-B search spaces when tested at B=100. Furthermore, on the GDSC2 benchmark, RegimePlanner exceeded by 18% a matched per-context oracle that knows the best static choice between Greedy and UCB for each context.
  • Prediction Accuracy: Pre-registered predictions using the PRS framework achieved an overall accuracy of 27/40 (67.5%) in identifying the winning acquisition function. Within specific “EMA prior families,” the accuracy exceeded 90%. This indicates a robust ability to predict optimal strategies based on regime parameters.

Risks and open questions

  • Generalizability of PRS: While PRS showed strong predictive power across 79 conditions, its generalizability to entirely novel domains or extremely high-dimensional search spaces remains an open question. The specific functional form (B/|A|)(1-rho) might require calibration or adjustment for different problem classes.
  • Online rho Estimation Robustness: RegimePlanner’s success hinges on accurately estimating rho online from pilot contexts. The robustness of this estimation, especially in very noisy environments or with limited initial data, needs further investigation. Poor rho estimation could lead to suboptimal acquisition function switching.
  • Computational Overhead of RegimePlanner: While RegimePlanner offers performance gains, the computational cost of dynamically estimating rho and switching strategies should be evaluated against the benefits, particularly in scenarios where individual evaluations are very cheap but the optimization budget is tight.
  • Beyond Greedy/UCB: The paper primarily focuses on the reversal between Greedy and UCB. How PRS and RegimePlanner perform with a broader range of acquisition functions (e.g., Expected Improvement, Thompson Sampling [1]) and more complex multi-objective optimization scenarios is an area for future work.
  • Operator Adoption Curve: Implementing a regime-conditioned evaluation protocol requires a shift in operator mindset and potentially new tooling. The practical challenges of integrating B/|A| and rho considerations into existing MLOps or scientific discovery pipelines need to be addressed for widespread adoption.

Sources

  1. Multi-armed bandit – Wikipedia — https://en.wikipedia.org/wiki/Multi-armed_bandit

Author

  • Siegfried Kamgo

    Founder and editorial lead at FrontierWisdom. Engineer turned operator-analyst writing about AI systems, automation infrastructure, decentralised stacks, and the practical economics of frontier technology. Focus: turning fast-moving releases into durable, implementation-ready playbooks.
