
AsymmetryZero: Semantic Evals Operationalize Human Expert Preferences

AsymmetryZero operationalizes human expert preferences as semantic evaluations for LLMs, offering a framework for consistent, auditable grading criteria and efficient model testing.


AsymmetryZero is a new framework designed to operationalize human expert preferences as semantic evaluations for AI models and agents, particularly Large Language Models (LLMs). It addresses the challenge of encoding subjective, procedural, and domain-specific requirements into consistent, auditable grading criteria. By defining tasks as “stable evaluation contracts” that explicitly detail what is graded, how each criterion is judged, and how decisions are aggregated, AsymmetryZero aims to provide a more robust and scalable method for evaluating AI performance against nuanced human expectations.

  • AsymmetryZero defines task evaluation as a “stable evaluation contract” specifying grading criteria, judgment methods, and aggregation rules.
  • The framework enables comparable, auditable scores across both model-only evaluations (using Inspect) and agentic evaluations (using Harbor Framework).
  • Research shows “compact juries” (smaller, less powerful models) can reduce per-criterion judging cost by 94.4-95.8% and latency by 72.9-78.3% compared to “frontier juries” (larger, more powerful models), while maintaining stable aggregated task-level outcomes.
  • Frontier juries exhibit significantly less internal dissent (6.1-11.5% 3-2 split rate) than compact juries (28.7-32.4%).

What changed

Historically, evaluating complex AI tasks, especially those involving subjective human preferences or domain-specific procedures, has been a bottleneck in reinforcement learning (RL) and agent development. Existing methods often rely on exact-match targets or broad preference judgments, which struggle to capture the nuance of real-world requirements. The core problem, as identified by the AsymmetryZero paper, is the “faithful encoding of expert requirements into the evaluation itself.”

AsymmetryZero introduces a structured approach to this encoding problem. Instead of ad-hoc rubrics or vague instructions, it proposes a “stable evaluation contract” for each task. This contract explicitly details the following (a minimal code sketch follows the list):

  • What is being graded: Clear identification of criteria.
  • How each criterion is judged: Specific methods and thresholds for evaluation.
  • How criterion-level decisions are aggregated: A defined process for combining individual judgments into a final task outcome.
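
To make this concrete, here is a minimal sketch of what such a contract could look like as a small, version-controllable Python structure. The paper does not publish a schema, so every name below (Criterion, judge_prompt, passing_threshold, aggregation, and the example task) is an illustrative assumption rather than AsymmetryZero's actual format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Criterion:
    """One graded requirement: what is judged and how it is judged."""
    name: str                  # e.g. "states_next_action"
    judge_prompt: str          # instructions given to each judge model
    passing_threshold: float   # fraction of jury votes required to pass

@dataclass
class EvaluationContract:
    """Explicit, stable specification of how a task's output is graded."""
    task_id: str
    criteria: List[Criterion]
    # How criterion-level decisions combine into a single task outcome.
    aggregation: str = "all_criteria_must_pass"

# Hypothetical example: grading a drafted client-update email against expert preferences.
contract = EvaluationContract(
    task_id="draft_client_update_email",
    criteria=[
        Criterion("states_next_action",
                  "Does the email state one concrete next action for the client?", 0.5),
        Criterion("matches_house_tone",
                  "Is the tone consistent with the firm's style guide?", 0.5),
    ],
)
```

Because the contract is plain data, it can be diffed, reviewed, and versioned like any other artifact, which is what makes the cross-setting comparisons described next possible.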

This approach allows the same evaluation contract to be executed across different settings—model-only evaluations using Inspect and agentic evaluations using the Harbor Framework—ensuring comparable scores and shared audit artifacts. This contrasts with previous methods where evaluation logic might be implicitly embedded in prompts or vary significantly between testing environments, making consistent benchmarking difficult. The framework leverages the concept of a semantic layer, as seen in tools like Semantica, to build context graphs and decision intelligence systems with explainability and provenance [1, 6].

How it works

AsymmetryZero operates on the principle of defining a task as an executable “evaluation contract.” This contract is a formal specification of how a task’s output should be assessed against human expert preferences. The framework integrates with existing evaluation platforms:

  • Inspect: Used for model-only evaluations, where the AI model directly produces an output that is then graded against the contract.
  • Harbor Framework: Employed for agentic evaluations, where an AI agent interacts with an environment, and its actions or final state are assessed according to the same contract. Harbor is part of a broader trend in “harness engineering” for AI agents, focusing on tools and patterns for evaluation, memory, and orchestration [2, 5].

The innovation lies in the contract’s explicit nature. It’s not just a set of instructions, but a structured, machine-readable definition that allows for automated or semi-automated grading. This structure helps to reduce ambiguity and bias, which are common challenges in human-in-the-loop (HITL) evaluation processes [8]. By making grading criteria explicit, AsymmetryZero facilitates the use of “juries” of smaller, less powerful LLMs (compact juries) to perform evaluations that traditionally required larger, more expensive frontier models (frontier juries) or direct human experts.

The framework’s effectiveness was demonstrated in a study using Harbor, where a fixed set of task contracts was evaluated by both frontier and compact juries. This allowed for a direct comparison of their agreement levels, internal dissent, and resource consumption. The ability to use compact juries for evaluation is a significant operational advantage, as it drastically reduces the cost and latency associated with obtaining evaluation feedback.
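
Below is a minimal sketch of that jury mechanic, reusing the EvaluationContract structure sketched earlier and assuming a five-member jury that votes pass/fail on each criterion. The query_judge callable stands in for whatever model call Inspect or Harbor would actually issue, and the 3-2 split check mirrors the internal-dissent metric reported in the study; none of this is AsymmetryZero's published code.

```python
from collections import Counter
from typing import Callable, List

# (model_id, judge_prompt, candidate_output) -> pass/fail vote
JudgeFn = Callable[[str, str, str], bool]

def judge_criterion(jury: List[str], query_judge: JudgeFn,
                    judge_prompt: str, candidate_output: str) -> dict:
    """Collect one vote per juror and decide the criterion by simple majority."""
    votes = [query_judge(model, judge_prompt, candidate_output) for model in jury]
    tally = Counter(votes)
    # A 3-2 split on a five-member jury is the "internal dissent" case the study tracks.
    narrow_split = len(jury) == 5 and min(tally[True], tally[False]) == 2
    return {"passed": tally[True] > tally[False], "narrow_split": narrow_split}

def evaluate_contract(contract, candidate_output: str,
                      jury: List[str], query_judge: JudgeFn) -> bool:
    """Aggregate criterion-level decisions into the task-level outcome (all must pass)."""
    results = [judge_criterion(jury, query_judge, c.judge_prompt, candidate_output)
               for c in contract.criteria]
    return all(r["passed"] for r in results)
```

Swapping the jury list from five frontier models to five compact models is the only change needed to reproduce the cost comparison reported below; the contract and aggregation logic stay fixed.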

Why it matters for operators

For engineers, founders, traders, and consultants working with LLMs and AI agents, AsymmetryZero offers a critical step towards more reliable and cost-effective evaluation. The current state of LLM evaluation is often a black box, with subjective human feedback or expensive frontier models driving iterative improvements. This framework provides a standardized, auditable mechanism to operationalize human expert preferences, transforming qualitative feedback into quantitative, actionable metrics.

Operators should recognize that the “stable evaluation contract” is not merely a technical detail; it’s a strategic asset. By explicitly defining what success looks like for complex, subjective tasks, organizations can build a shared understanding across product, engineering, and compliance teams. This clarity is crucial for AI governance, ensuring that AI systems align with business objectives and ethical guidelines [6]. Furthermore, the ability to use “compact juries” for evaluation is a game-changer for development cycles. Reducing per-criterion judging cost by over 94% and latency by over 70% means faster iteration, more frequent testing, and ultimately, quicker deployment of more robust AI solutions. This enables a continuous improvement loop where expert review becomes a structured data source for prompt and tool evolution, rather than a one-off checkpoint [2].

My take is that while the paper focuses on the technical aspects of evaluation, the true operational leverage comes from the shift in mindset it enables: treating evaluation criteria as first-class, version-controlled artifacts. This allows for rigorous A/B testing of different model architectures, prompting strategies, or agentic behaviors against a consistent, human-aligned benchmark. Founders should consider baking this contract-based evaluation into their product development from day one, as it will significantly de-risk deployment and accelerate market feedback loops. Engineers should explore integrating this framework with their CI/CD pipelines, treating evaluation contracts as critical code that evolves alongside their models.
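
One way to act on that CI/CD suggestion, sketched under assumptions: evaluation contracts live in the repository as versioned JSON files, an earlier pipeline step runs them (via Inspect or Harbor) and writes task-level pass rates, and a final gate fails the build on regression. The file layout, score format, and 2-point tolerance below are hypothetical, not prescribed by the paper.

```python
# ci_eval_gate.py -- illustrative CI gate; file layout, threshold, and score
# format are assumptions, not part of AsymmetryZero, Inspect, or Harbor.
import json
import pathlib
import sys

CONTRACTS = pathlib.Path("evals/contracts")    # contracts versioned alongside the code
BASELINE = json.loads(pathlib.Path("evals/baseline_scores.json").read_text())
CURRENT = json.loads(pathlib.Path("evals/current_scores.json").read_text())  # written by the eval run
MAX_REGRESSION = 0.02                           # tolerate a 2-point pass-rate drop

failures = []
for contract_path in sorted(CONTRACTS.glob("*.json")):
    task = contract_path.stem
    score, baseline = CURRENT.get(task, 0.0), BASELINE.get(task, 0.0)
    if score < baseline - MAX_REGRESSION:
        failures.append(f"{task}: {score:.2f} fell below baseline {baseline:.2f}")

if failures:
    print("Evaluation contract regressions:\n" + "\n".join(failures))
    sys.exit(1)
print("All evaluation contracts within tolerance.")
```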

Benchmarks and evidence

The study conducted using the AsymmetryZero framework and the Harbor Framework provided compelling evidence for the efficacy of compact juries:

  • Criterion-level agreement (Frontier vs. Compact Jury):
    • Ranges from 75.9% to 89.6% (strict common-subset agreement: 77.8% to 92.1%). This indicates a high degree of concordance between the judgments of powerful frontier models and smaller compact models at the granular criterion level.
  • Internal Dissent (3-2 split rate):
    • Compact Juries: Exhibited substantially higher internal dissent, with 3-2 split rates ranging from 28.7% to 32.4%.
    • Frontier Juries: Showed significantly lower internal dissent, with 3-2 split rates ranging from 6.1% to 11.5%. This suggests that while compact juries can agree with frontier juries, they are individually less decisive or consistent among themselves.
  • Resource Reduction (Compact vs. Frontier Jury):
    • Per-criterion Judging Cost: Compact juries reduced cost to roughly 4.2% to 5.6% of the frontier-jury cost, a 94.4% to 95.8% reduction.
    • Latency: Compact juries reduced latency to roughly 21.7% to 27.1% of the frontier-jury latency, a 72.9% to 78.3% reduction.

These verifiable traces demonstrate that while compact juries show more internal disagreement, their aggregated task-level outcomes remain comparatively stable, making the cost and latency savings a favorable trade-off for most evaluation workloads. The frontier-class solvers whose outputs were graded in the study included Claude Opus 4.6, GPT-5.4, Grok-4.20, and Gemini-3.1-Pro, indicating the calibre of model output both jury types were asked to judge.
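
For readers mapping these ratios onto their own budgets, the conversion is simple arithmetic. The ratios below are the ones reported in the study; the dollar and latency baselines are made-up placeholders.

```python
# Reported ranges: compact-jury cost is roughly 4.2-5.6% of frontier-jury cost,
# and compact-jury latency roughly 21.7-27.1% of frontier-jury latency.
frontier_cost_per_criterion = 0.50   # hypothetical: $0.50 per criterion judged
frontier_latency_seconds = 40.0      # hypothetical: 40 s per criterion judged

for cost_ratio, latency_ratio in [(0.042, 0.217), (0.056, 0.271)]:
    compact_cost = frontier_cost_per_criterion * cost_ratio
    compact_latency = frontier_latency_seconds * latency_ratio
    print(f"cost ${compact_cost:.3f} ({1 - cost_ratio:.1%} cheaper), "
          f"latency {compact_latency:.1f} s ({1 - latency_ratio:.1%} faster)")

# Prints 95.8% / 94.4% cost reductions and 78.3% / 72.9% latency reductions,
# matching the figures quoted above.
```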

Risks and open questions

  • Maintaining Fidelity of Compact Juries: While aggregated task outcomes remain stable, the higher internal dissent in compact juries suggests a potential for subtle biases or inconsistencies that might become problematic in highly sensitive applications. Operators must carefully monitor the types of tasks where compact juries are deployed and ensure that the “stable evaluation contract” is robust enough to absorb this internal variability without compromising overall quality.
  • Complexity of Contract Definition: Defining a “stable evaluation contract” for highly subjective or rapidly evolving tasks can be challenging. The initial investment in meticulously crafting these contracts, particularly for complex domain-specific knowledge, could be substantial. This requires deep collaboration between domain experts and AI engineers.
  • Generalizability Across Domains: The study focused on specific tasks. It remains an open question how well the AsymmetryZero framework and the observed performance of compact juries will generalize across a wider array of domains and task complexities, especially those requiring nuanced ethical or safety considerations.
  • Evolving Frontier Models: As frontier models continue to improve rapidly, the cost-benefit analysis of using compact juries might shift. Continuous re-evaluation of the optimal jury composition will be necessary.

Author

  • Siegfried Kamgo

    Founder and editorial lead at FrontierWisdom. Engineer turned operator-analyst writing about AI systems, automation infrastructure, decentralised stacks, and the practical economics of frontier technology. Focus: turning fast-moving releases into durable, implementation-ready playbooks.

