A new framework called VANGUARD (Video Anomaly Understanding through Reasoning and Grounding) unifies video anomaly detection (VAD) with interpretable chain-of-thought reasoning and precise spatial grounding, addressing long-standing limitations of traditional VAD methods. VANGUARD leverages a vision-language model (VLM) not only to classify anomalies but also to explain why an event is anomalous and where it occurs within the video frame, achieving 94% ROC-AUC on the UCF-Crime dataset.
- VANGUARD integrates anomaly classification, spatial grounding, and chain-of-thought reasoning into a single VLM for video anomaly detection.
- It addresses VLM “hallucinations” in spatial grounding through a three-stage progressive training curriculum.
- The framework employs a teacher-student pipeline, using a VLM (Qwen3-VL-4B) to generate structured reasoning trajectories from sparse VAD annotations.
- VANGUARD achieved 94% ROC-AUC and 84% F1 on UCF-Crime, going beyond prior methods by also providing interpretable explanations and object localization.
- It demonstrates zero-shot transfer capabilities to new domains like XD-Violence and ShanghaiTech without needing target-specific adaptation.
What changed
Traditionally, video anomaly detection has been treated as a binary classification problem, determining if an event is “normal” or “abnormal” [4]. This approach, while functional, offers little insight into the nature of the anomaly or its precise location, making it difficult for operators to understand and respond effectively. Existing Vision-Language Models (VLMs) have shown promise in scene understanding but often struggle with reliable spatial grounding, frequently generating “hallucinated” or geometrically inaccurate bounding boxes when asked to localize objects [5].
VANGUARD fundamentally shifts this paradigm by integrating three critical capabilities into a single multimodal LLM: anomaly classification, precise spatial grounding, and interpretable chain-of-thought reasoning. This means that instead of just flagging an anomaly, VANGUARD can explain what makes it anomalous (e.g., “person running with a weapon”), where it is happening (with accurate bounding boxes), and provide a step-by-step reasoning process. This level of interpretability and precision was largely absent from prior VAD methods, which either focused solely on detection or produced unreliable localization. The paper highlights that this approach directly tackles the VLM hallucination problem in spatial grounding by explicitly training for it.
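For illustration, such an output could be carried in a small structured record; the field names and values below are hypothetical stand-ins, not taken from the paper:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AnomalyFinding:
    """Hypothetical structured output for one detected anomaly (fields are illustrative)."""
    label: str                      # e.g. "person running with a weapon"
    confidence: float               # anomaly score in [0, 1]
    bbox_xyxy: List[float]          # pixel coordinates [x1, y1, x2, y2] of the grounded region
    reasoning_steps: List[str] = field(default_factory=list)  # chain-of-thought explanation

finding = AnomalyFinding(
    label="person running with a weapon",
    confidence=0.91,
    bbox_xyxy=[412.0, 180.0, 505.0, 360.0],
    reasoning_steps=[
        "A person enters the frame moving much faster than surrounding pedestrians.",
        "An elongated object is held in a raised position, consistent with a weapon.",
        "This combination deviates from the normal walking patterns seen earlier in the clip.",
    ],
)
print(finding.label, finding.bbox_xyxy)
```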
How it works
VANGUARD employs a vision-language model (VLM) as its core, built upon a three-stage progressive training curriculum designed to mitigate common VLM weaknesses such as spatial hallucination.
The first stage involves a “classifier warmup,” where the VLM’s frozen backbone features are used for initial anomaly classification. This establishes a foundational understanding of normal versus anomalous patterns.
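As a rough illustration of this stage, the sketch below trains only a small classification head on top of pooled, frozen backbone features; the dimensions, pooling, and loss are assumptions rather than the paper's exact setup:

```python
import torch
import torch.nn as nn

# Assumed setup: the VLM visual encoder stays frozen; only this lightweight head is trained.
class AnomalyWarmupHead(nn.Module):
    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        self.head = nn.Sequential(
            nn.LayerNorm(feat_dim),
            nn.Linear(feat_dim, 1),  # single logit: normal vs. anomalous
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, num_frames, feat_dim), pooled over frames before classification
        pooled = clip_features.mean(dim=1)
        return self.head(pooled).squeeze(-1)

head = AnomalyWarmupHead()
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()

features = torch.randn(8, 16, 1024)        # stand-in for frozen backbone features
labels = torch.randint(0, 2, (8,)).float() # 1 = anomalous clip
loss = criterion(head(features), labels)
loss.backward()
optimizer.step()
```

In practice the pooled features would come from the frozen VLM encoder rather than random tensors; the point of the warmup is that no backbone weights are updated at this stage.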
The second stage introduces LoRA (Low-Rank Adaptation) for spatial grounding. This fine-tunes the VLM to accurately identify and draw bounding boxes around anomalous objects or regions. The researchers note that this stage is crucial for overcoming the VLM’s tendency to hallucinate bounding boxes, making the spatial localization reliable. GroundingDINO provides the necessary bounding box supervision during this phase.
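A minimal sketch of how LoRA adapters might be attached for this stage with the Hugging Face peft library is shown below; the base checkpoint, target modules, and ranks are illustrative assumptions (the paper's exact base model and hyperparameters are not reproduced here):

```python
from peft import LoraConfig, get_peft_model
from transformers import Qwen2VLForConditionalGeneration

# Stand-in base model; the paper's actual backbone and size are not specified here.
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# During training, each frame would be paired with a grounding target such as
# "<box>[412, 180, 505, 360]</box>", where the coordinates come from GroundingDINO
# pseudo-labels; the LoRA-adapted model learns to emit these boxes instead of hallucinating them.
```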
The final stage focuses on generating chain-of-thought reasoning. Here, the VLM is trained to produce step-by-step explanations for its anomaly classifications and localizations. To address the inherent sparsity of annotations in typical VAD datasets, VANGUARD uses a teacher-student annotation pipeline. A powerful VLM, specifically Qwen3-VL-4B, acts as a “teacher” to generate detailed, structured reasoning trajectories for sub-clips based on the limited manual annotations available from datasets like UCA. This synthetic data generation augments the training process, allowing the model to learn complex reasoning patterns. The paper also suggests that this structured reasoning acts as an implicit regularizer, leading to more balanced predictions compared to classification-only fine-tuning.
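A minimal sketch of the teacher side of such a pipeline, assuming the teacher VLM is served behind an OpenAI-compatible endpoint; the endpoint, prompt wording, and JSON schema are illustrative, not the paper's:

```python
import json
from openai import OpenAI

# Assumption: the teacher VLM is served locally behind an OpenAI-compatible API
# (e.g. a vLLM server); the URL and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def annotate_subclip(subclip_description: str, sparse_label: str) -> dict:
    """Ask the teacher to expand a sparse VAD label into a structured reasoning trajectory."""
    prompt = (
        "You are annotating a surveillance sub-clip for anomaly detection.\n"
        f"Sparse dataset label: {sparse_label}\n"
        f"Sub-clip content: {subclip_description}\n"
        "Return JSON with keys: 'anomaly' (bool), 'reasoning_steps' (list of strings), "
        "'grounded_objects' (list of objects with 'name' and 'bbox_xyxy')."
    )
    response = client.chat.completions.create(
        model="teacher-vlm",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return json.loads(response.choices[0].message.content)

# Example usage (requires a running endpoint); a real pipeline would pass frames, not text:
# trajectory = annotate_subclip("two people struggling near a parked car", "Assault")
```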
Why it matters for operators
For operators in fields ranging from industrial automation and quality control to public safety and surveillance, VANGUARD represents a significant leap forward. The ability to receive not just an alert, but an explanation and precise localization of an anomaly, transforms raw data into actionable intelligence. Consider a manufacturing plant: instead of a generic “anomaly detected on assembly line 3,” an operator could get “robot arm 4 is performing an out-of-sequence movement (reason: attempting to pick up a part already missing, potential software glitch), located at coordinates X,Y.” This level of detail drastically reduces investigation time and enables targeted intervention, minimizing downtime and potential safety hazards.
The zero-shot transfer capability demonstrated on XD-Violence and ShanghaiTech datasets is particularly compelling. It implies that VANGUARD, once trained, can be deployed in novel environments with different types of anomalies without requiring extensive, costly, and time-consuming re-training or adaptation for each new domain. This reduces the barrier to entry for adopting sophisticated anomaly detection in diverse operational contexts. However, operators should temper expectations regarding “perfect” zero-shot performance; while promising, domain shifts always introduce some performance degradation. The real advantage here is the vastly reduced effort for initial deployment and the ability to detect previously unseen anomaly types with a reasonable degree of accuracy, providing a strong baseline for further iterative refinement. The interpretability also aids in building trust in AI systems, a critical factor for operational adoption where human oversight remains paramount.
Benchmarks and evidence
VANGUARD’s performance was rigorously evaluated against established video anomaly detection benchmarks.
On the UCF-Crime dataset, VANGUARD achieved:
- 94% ROC-AUC (area under the receiver operating characteristic curve), a common metric for classification performance.
- 84% F1 score, which represents the harmonic mean of precision and recall.
These figures are reported alongside capabilities absent from prior VAD methods: simultaneously producing interpretable chain-of-thought explanations and spatial grounding of anomalous objects.
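For reference, both metrics can be computed from clip-level anomaly scores with scikit-learn; the labels and scores below are hypothetical stand-ins, not the paper's data:

```python
from sklearn.metrics import roc_auc_score, f1_score

# Hypothetical ground-truth labels (1 = anomalous) and model anomaly scores for a few clips.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.10, 0.25, 0.80, 0.65, 0.30, 0.90, 0.15, 0.55]

auc = roc_auc_score(y_true, y_score)               # ROC-AUC is computed from the raw scores
y_pred = [1 if s >= 0.5 else 0 for s in y_score]   # F1 requires a hard threshold (0.5 here)
f1 = f1_score(y_true, y_pred)

print(f"ROC-AUC = {auc:.2f}, F1 = {f1:.2f}")
```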
The research also highlights the success of its staged training approach. Ablation studies confirmed that this progressive curriculum, layering classifier warmup, LoRA-adapted spatial grounding, and chain-of-thought generation, significantly outperforms a monolithic optimization strategy. Furthermore, the structured reasoning component was shown to act as an implicit regularizer, leading to more balanced predictions than fine-tuning for classification alone.
Crucially, VANGUARD demonstrated strong zero-shot transfer capabilities. Without any target-domain adaptation, it successfully generalized to the XD-Violence and ShanghaiTech datasets, indicating its robustness and potential for deployment in diverse, unseen environments.
Risks and open questions
- Data Scarcity for True Novelty: While the teacher-student pipeline helps with sparse annotations, truly novel anomalies that deviate significantly from learned patterns might still pose a challenge. The effectiveness of the “teacher” VLM in generating diverse and representative reasoning trajectories for all potential anomalies is critical.
- Computational Overhead: Multimodal LLMs, especially those performing chain-of-thought reasoning, typically require more computational resources per query compared to traditional, simpler anomaly detection models [3]. Operators need to consider the latency and infrastructure costs for real-time deployment, particularly in high-throughput video environments.
- Grounding Reliability in Edge Cases: Although VANGUARD aims to reduce hallucination, the reliability of spatial grounding in highly ambiguous or rapidly changing scenes remains an open question. The precision of bounding boxes is paramount for actionable intelligence.
- Quantifying “Informational Value” of Reasoning: While explanations are provided, quantifying the real informational value and utility of these explanations for human operators in high-stakes scenarios is an area for further research. Frameworks like MEG-RAG aim to quantify multimodal evidence grounding, which could be relevant for validating VANGUARD’s explanations [1, 2].
Sources
- MEG-RAG: Quantifying Multi-modal Evidence Grounding for Evidence Selection in RAG — https://arxiv.org/html/2604.24564
- [2604.24564] MEG-RAG: Quantifying Multi-modal Evidence Grounding for Evidence Selection in RAG — https://arxiv.org/abs/2604.24564
- Large language model – Wikipedia — https://en.wikipedia.org/wiki/Large_language_model
- Machine learning – Wikipedia — https://en.wikipedia.org/wiki/Machine_learning
- Hallucination (artificial intelligence) – Wikipedia — https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)