Frontier Signal
SIREN Corrects LLM Evaluation’s Winner’s Curse
A new protocol, SIREN, addresses the "winner's curse" in LLM evaluation. By separating selection from evaluation, it provides more reliable performance estimates for models tuned on adaptive benchmarks.
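To make the winner's curse concrete, here is a minimal simulation sketch. It is a generic illustration of selection bias, not SIREN's actual procedure, which this summary does not spell out. The function name `simulate` and every parameter value (20 candidate models, a shared true accuracy of 0.70, 200 benchmark items) are hypothetical choices made for the example.

```python
import random


def simulate(num_models=20, true_acc=0.70, n_items=200, trials=500, seed=0):
    """Compare the winner's score on the selection set with its score on fresh items.

    Assumption for illustration: every candidate has the same true accuracy,
    so any gap between the two averages is pure selection bias.
    """
    rng = random.Random(seed)

    def score():
        # Fraction of n_items answered correctly by a model with accuracy true_acc.
        return sum(rng.random() < true_acc for _ in range(n_items)) / n_items

    selected_total = 0.0
    fresh_total = 0.0
    for _ in range(trials):
        # Selection: score every candidate on the same benchmark and keep the best score.
        selection_scores = [score() for _ in range(num_models)]
        selected_total += max(selection_scores)
        # Evaluation: re-score the winner on fresh items. Because all candidates
        # share true_acc, the winner's fresh score is simply a new independent draw.
        fresh_total += score()
    return selected_total / trials, fresh_total / trials


selection_estimate, fresh_estimate = simulate()
print(f"winner's score on the selection set: {selection_estimate:.3f}")
print(f"winner's score on fresh items:       {fresh_estimate:.3f}")
```

Under these assumptions the winner's selection-set score sits noticeably above 0.70, while the fresh-item score stays close to it. That gap is the optimism that reusing one benchmark for both picking and reporting a model bakes into the result.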