A new benchmark, ESARBench, has been introduced to evaluate Multimodal Large Language Model (MLLM)-driven Unmanned Aerial Vehicles (UAVs) on Embodied Search and Rescue (ESAR) tasks. Built on Unreal Engine 5 and AirSim, ESARBench simulates highly realistic, large-scale disaster environments with dynamic variables and 600 tasks modeled after real-world rescue cases. Its initial evaluations expose critical bottlenecks in current agentic AI systems: spatial memory, aerial adaptation, and the balance between search efficiency and flight safety.
- ESARBench introduces the first comprehensive benchmark for MLLM-driven UAVs in Embodied Search and Rescue (ESAR).
- The benchmark uses Unreal Engine 5 and AirSim to create four high-fidelity, large-scale environments based on real GIS data, including dynamic weather and stochastic clue placement.
- It includes 600 tasks modeled on real-world rescue scenarios and evaluates agents on metrics beyond simple object detection, focusing on autonomous exploration and informed decision-making.
- Initial evaluations reveal significant challenges for MLLM-driven UAVs in spatial memory, aerial adaptation, and balancing search efficiency with flight safety.
What changed
Historically, UAV Search and Rescue (SAR) research has predominantly relied on traditional computer vision and path-planning algorithms. While effective for specific tasks, these methods often lack the nuanced reasoning and adaptability required for complex, unstructured disaster environments. The advent of Multimodal Large Language Models (MLLMs) has begun to transform this landscape, offering UAVs enhanced capabilities in spatial reasoning, semantic understanding, and complex decision-making, making them inherently more suitable for SAR operations. However, a significant gap existed: the absence of a unified, comprehensive benchmark specifically designed for these MLLM-driven, embodied agents in SAR contexts.
The new ESARBench addresses this by proposing the novel task of Embodied Search and Rescue (ESAR). ESAR requires aerial agents not just to detect objects, but to autonomously explore complex environments, identify rescue clues, reason about potential victim locations, and execute informed decision-making under dynamic conditions. This moves beyond simpler “ObjectNav” tasks, which typically focus on navigating to a known object, by demanding a higher level of agentic intelligence and environmental understanding from the UAV. The benchmark itself is built using Unreal Engine 5 and AirSim, creating four large-scale, photorealistic open environments mapped directly from real-world Geographic Information System (GIS) data. This level of fidelity, combined with dynamic weather, time-of-day variations, and stochastic clue placement, represents a significant leap from prior simulated environments, which often lacked such realism and complexity.
Furthermore, ESARBench provides a dataset of 600 tasks, each modeled after real-world rescue cases, and introduces a robust set of evaluation metrics that go beyond simple task completion rates. This shift from traditional vision and path-planning metrics to those assessing embodied intelligence and decision-making is a critical change, pushing the boundaries of what is expected from autonomous SAR systems.
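The abstract does not publish the task schema, but each task plausibly bundles an environment, episode conditions, and ground-truth placements used for scoring. A minimal sketch of what such a specification might look like; every field name here is an assumption, not ESARBench's published format:

```python
from dataclasses import dataclass, field

@dataclass
class ESARTask:
    """Hypothetical schema for one ESARBench-style task (all fields assumed)."""
    env_name: str                 # one of the four GIS-derived environments
    weather: dict                 # e.g. {"rain": 0.4, "fog": 0.2}, intensities in [0, 1]
    start_time: str               # simulated time of day, e.g. "18:30"
    clue_positions: list[tuple]   # stochastic (x, y, z) placements of rescue clues
    victim_position: tuple        # ground truth, used only by the evaluator
    time_budget_s: float = 900.0  # mission time limit
    safety_limits: dict = field(default_factory=lambda: {"min_altitude_m": 3.0})
```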
How it works
ESARBench operates as a high-fidelity simulation environment designed to test the full embodied intelligence of MLLM-driven UAVs in SAR scenarios. At its core, it leverages two powerful platforms: Unreal Engine 5 for photorealistic rendering and physics, and AirSim for realistic drone dynamics and sensor simulation. This combination allows for the creation of immersive, large-scale open environments that closely mimic real-world disaster zones, derived from actual GIS data. These environments are not static; they incorporate dynamic variables such as changing weather conditions (e.g., fog, rain), varying times of day (affecting light and shadows), and stochastic placement of rescue clues, ensuring that each task presents unique challenges.
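AirSim exposes exactly these dynamic variables through its Python client. A minimal sketch of how a benchmark harness could randomize conditions per episode; the parameter ranges are illustrative, not ESARBench's actual configuration:

```python
import random
import airsim

client = airsim.MultirotorClient()
client.confirmConnection()

# Randomize weather: AirSim weather effects take intensities in [0, 1].
client.simEnableWeather(True)
client.simSetWeatherParameter(airsim.WeatherParameter.Rain, random.uniform(0.0, 0.8))
client.simSetWeatherParameter(airsim.WeatherParameter.Fog, random.uniform(0.0, 0.5))

# Randomize time of day, which drives sun position, light, and shadows in UE5.
hour = random.randint(6, 20)
client.simSetTimeOfDay(True, start_datetime=f"2024-06-01 {hour:02d}:00:00",
                       celestial_clock_speed=1.0, move_sun=True)

# Arm and take off, ready for the search episode.
client.enableApiControl(True)
client.armDisarm(True)
client.takeoffAsync().join()
```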
The “embodied” aspect is crucial. Unlike benchmarks that might test an MLLM’s reasoning in isolation, ESARBench requires the agent to physically navigate the environment, interpret sensory input (visual, potentially thermal or lidar), and make decisions that directly impact its trajectory and search strategy. The MLLM component acts as the “brain,” processing multimodal inputs from the UAV’s simulated sensors to understand the environment, identify potential clues (e.g., debris patterns, specific objects), and infer victim locations. This inference often requires complex spatial reasoning and semantic understanding, going beyond simple object recognition. For instance, a UAV might need to understand that a discarded backpack near a partially collapsed structure is a stronger clue than a random piece of rubble.
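The agent interface is not specified in the abstract, but the sense-reason-act loop described above maps naturally onto AirSim's image API plus a model call. A minimal sketch, where `query_mllm` is a hypothetical stand-in for whatever MLLM endpoint the agent uses:

```python
import airsim

def query_mllm(image_png: bytes, history: list[str]) -> dict:
    """Hypothetical MLLM call. Expected to return something like
    {"action": "move", "target": [x, y, z]} or {"action": "report_victim"}."""
    raise NotImplementedError  # swap in a real multimodal model endpoint

client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)

history: list[str] = []
for step in range(200):  # bounded episode length
    # Sense: grab a compressed PNG from the front-facing scene camera.
    png = client.simGetImage("0", airsim.ImageType.Scene)

    # Reason: the MLLM interprets the frame in context (e.g., a backpack near
    # a collapsed structure outranks a random piece of rubble as a clue).
    decision = query_mllm(png, history)
    history.append(str(decision))

    # Act: translate the decision into a flight command.
    if decision["action"] == "move":
        x, y, z = decision["target"]
        client.moveToPositionAsync(x, y, z, velocity=3.0).join()
    elif decision["action"] == "report_victim":
        break
```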
The benchmark’s 600 tasks are carefully curated from real-world rescue cases, providing a diverse set of objectives and complexities. Agents are evaluated not just on finding victims, but on the efficiency of their search path, their ability to avoid hazards, and the quality of their decision-making under uncertainty. This includes metrics related to search coverage, time to discovery, and flight safety. The interaction between the MLLM’s reasoning capabilities and the UAV’s physical embodiment in a dynamic, uncertain environment is what ESARBench aims to rigorously test.
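The abstract names these metric families (coverage, time to discovery, flight safety) without giving formulas, so the following is an illustration only; every definition in this snippet is an assumption rather than ESARBench's actual scoring:

```python
def score_episode(positions, t_found, t_budget, collisions,
                  area_cells, cell_size=10.0):
    """Illustrative ESAR-style metrics from a flight log (definitions assumed).

    positions: list of (x, y) waypoints visited during the episode
    t_found:   seconds until the victim was located, or None if never found
    """
    visited = {(int(x // cell_size), int(y // cell_size)) for x, y in positions}
    coverage = len(visited) / area_cells             # fraction of grid cells seen
    discovery = (1 - t_found / t_budget) if t_found is not None else 0.0
    safety = 1.0 if collisions == 0 else 0.0         # hard fail on any collision
    return {"coverage": coverage, "time_to_discovery": discovery, "safety": safety}
```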
Why it matters for operators
For operators in fields ranging from emergency services and disaster response to robotics development and AI research, ESARBench is more than just another academic benchmark; it’s a critical reality check for the promise of agentic AI in the physical world. The current hype around MLLMs and agentic systems often focuses on their impressive performance in digital domains, like coding or web browsing, where benchmarks such as WebArena and SWE-bench (tracked on aggregators like BenchLM.ai) show rapid progress. However, the transition to embodied agents, especially in high-stakes environments like search and rescue, introduces an entirely new class of challenges.
What ESARBench exposes is the significant gap between theoretical MLLM capabilities and practical, robust aerial autonomy. The identified bottlenecks—spatial memory, aerial adaptation, and the efficiency-safety trade-off—are not trivial. For a robotics engineer, this means that simply integrating a powerful MLLM like Kimi K2.5 or Claude Mythos Preview (which leads agentic scores on some benchmarks) into a drone platform won’t magically solve SAR problems. The MLLM needs to be deeply integrated with robust perception, navigation, and control systems that can handle the unpredictability of real-world physics, sensor noise, and dynamic environments. It highlights the need for specialized training data and architectures that prioritize persistent spatial reasoning and adaptive flight control over general-purpose language understanding.
For founders and product managers developing UAV solutions for emergency response, ESARBench provides a crucial framework for validating claims and guiding R&D. Instead of relying on anecdotal evidence or simplified demos, they can now rigorously test their agentic UAVs against a standardized, realistic set of scenarios. This benchmark will likely drive the development of more specialized MLLMs and hybrid AI architectures that combine the reasoning power of large models with classical robotics techniques for robustness and safety. Operators should view early “perfect scores” on general agentic benchmarks with skepticism when considering embodied applications. The true test lies in benchmarks like ESARBench, which simulate the messy, unpredictable reality of physical operations, forcing a more grounded approach to AI development.
Benchmarks and evidence
ESARBench provides a comprehensive framework for evaluating MLLM-driven UAVs, moving beyond traditional metrics to assess embodied intelligence in SAR. The benchmark evaluates diverse baselines, including traditional heuristics and advanced MLLM-based ObjectNav agents, revealing significant challenges. While specific numeric performance figures for each baseline are not detailed in the abstract, the qualitative findings highlight critical bottlenecks:
- Spatial Memory: Agents struggle with maintaining a persistent understanding of the environment as they explore. This is crucial for efficient search, avoiding re-exploration of already covered areas, and building a coherent mental map of the disaster zone.
- Aerial Adaptation: The ability of agents to dynamically adjust their flight patterns and search strategies based on environmental conditions (e.g., obstacles, weather, visibility) and the evolving search context is a major challenge. This requires real-time perception and decision-making under uncertainty.
- Search Efficiency vs. Flight Safety: A fundamental trade-off exists. Aggressive, fast search strategies might cover more ground quickly but increase the risk of collision or mission failure. Conversely, overly cautious approaches can lead to unacceptably long search times. Current agents struggle to optimally balance these competing objectives (a toy sketch of this trade-off follows this list).
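As a toy illustration of how spatial memory and the efficiency-safety dial interact (this is not ESARBench's planner, just a sketch of the failure surface the benchmark probes):

```python
def next_waypoint(candidates, visited, hazards, risk_weight=0.5):
    """Pick the next search cell by trading new coverage against hazard risk.

    candidates: iterable of (x, y) cells the UAV could fly to next
    visited:    set of already-searched cells (the agent's spatial memory)
    hazards:    set of cells flagged as obstacles or low-visibility zones
    """
    def score(cell):
        gain = 0.0 if cell in visited else 1.0       # re-exploration earns nothing
        # Risk grows as the cell approaches any known hazard (Manhattan distance).
        d = min((abs(cell[0] - h[0]) + abs(cell[1] - h[1]) for h in hazards),
                default=10)
        return gain - risk_weight / (1.0 + d)        # the efficiency-safety dial
    return max(candidates, key=score)
```

Raising `risk_weight` buys safety at the cost of slower coverage; the evaluations suggest current MLLM agents struggle to set that dial well.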
These findings contrast with the performance of MLLMs in purely digital or text-based agentic benchmarks. For instance, Claude Mythos Preview achieves a perfect agentic score on some benchmarks, with GPT-5.4 close behind, and Claude Opus 4.7 leads SWE-bench Verified at 87.6%. However, these benchmarks do not involve physical embodiment, real-time sensor processing, or navigation in dynamic 3D environments. The challenges identified by ESARBench underscore that performance in abstract agentic tasks does not directly translate to robust embodied intelligence in complex physical domains.
The benchmark’s construction, using Unreal Engine 5 and AirSim with GIS-derived environments, provides a level of photorealism and dynamic variability that is essential for meaningful evaluation. The 600 tasks, modeled after real-world rescue cases, ensure that the evaluation is relevant to actual SAR operations, pushing agents to perform complex reasoning and decision-making beyond simple object detection or path planning.
Risks and open questions
- Simulation-to-Real Transfer Gap: While ESARBench offers high fidelity with Unreal Engine 5 and AirSim, the perennial challenge of transferring learned policies from simulation to real-world UAVs remains. Factors like sensor noise, unmodeled physics, and dynamic environmental changes in the physical world can still degrade performance.
- Computational Cost of MLLMs: Running advanced MLLMs on-board UAVs, especially for real-time decision-making in resource-constrained environments, presents significant computational and power consumption challenges. The trade-off between model complexity and deployability needs further exploration.
- Ethical Considerations and Trust: As UAVs become more autonomous in SAR, ethical questions regarding decision-making in life-or-death situations, potential for false positives/negatives, and accountability for errors will become paramount. Building trust in these agentic systems is critical for adoption.
- Generalization to Novel Disaster Scenarios: The benchmark includes 600 tasks, but real-world disasters are infinitely varied. How well MLLM-driven UAVs generalize to entirely novel environments, unforeseen obstacles, or unprecedented types of clues remains an open question.
- Human-Agent Collaboration: While ESARBench focuses on autonomous agents, real-world SAR often involves human teams. The benchmark doesn’t explicitly address how these agentic UAVs would effectively collaborate with human operators, share information, or receive high-level directives.
Sources
- Vision-And-Language Navigation for Unmanned Systems: Progress and Perspectives (Springer) — https://link.springer.com/chapter/10.1007/978-981-95-7656-2_52
- Agentic Benchmarks 2026: Tool Use, Browsing, Computer Use (BenchLM.ai) — https://benchlm.ai/agentic
- masamasa59/ai-agent-papers: A collection of AI agent papers, updated biweekly (GitHub) — https://github.com/masamasa59/ai-agent-papers
- Kimi K2.5 Tech Blog: Visual Agentic Intelligence (Kimi) — https://www.kimi.com/blog/kimi-k2-5
- AI Agent Framework Scorecard 2026 Leaderboard (Rapid Claw) — https://rapidclaw.dev/blog/ai-agent-benchmarks-2026
- How to Build a Lightweight Vision-Language-Action-Inspired Embodied Agent with Latent World Modeling and Model Predictive Control (MarkTechPost) — https://www.marktechpost.com/2026/04/27/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control/
- AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery (arXiv) — https://arxiv.org/html/2604.25256v1
- Intelligent Unmanned Aerial Vehicle Swarm Control Under Electronic Warfare: A Cognitive–Intent Dual-Stream Reinforcement Learning Framework (MDPI Drones) — https://www.mdpi.com/2504-446X/10/5/342