Qwen3.5-Omni is a multimodal AI model that scales to hundreds of billions of parameters with a 256k context length, supports audio-visual understanding and speech generation across 10 languages, and achieves state-of-the-art results on 215 benchmarks.
| Released by | Qwen team |
|---|---|
| Release date | Not yet disclosed |
| What it is | Multimodal AI model with hundreds of billions of parameters |
| Who it’s for | Developers and researchers building audio-visual AI applications |
| Where to get it | Not yet disclosed |
| Price | Not yet disclosed |
- Qwen3.5-Omni scales to hundreds of billions of parameters with 256k context length support [1]
- The model achieves state-of-the-art results across 215 audio and audio-visual benchmarks, surpassing Gemini-3.1 Pro [1]
- Supports over 10 hours of audio understanding and 400 seconds of 720P video processing at 1 FPS [1]
- Introduces ARIA technology for dynamic text-speech alignment to improve conversational speech stability [1]
- Enables multilingual understanding and speech generation across 10 languages with emotional nuance [1]
- Qwen3.5-Omni represents the largest scale advancement in the Qwen-Omni family with hundreds of billions of parameters [1]
- The model processes heterogeneous text-vision pairs and over 100 million hours of audio-visual content during training [1]
- ARIA technology addresses streaming speech synthesis instability through dynamic text-speech unit alignment [1]
- Audio-Visual Vibe Coding emerges as a new capability for coding based on audio-visual instructions [1]
- The model supports sophisticated real-time interaction with semantic interruption and voice control features [2]
What is Qwen3.5-Omni
Qwen3.5-Omni is a multimodal AI model that processes text, audio, and video simultaneously. As the latest advancement in the Qwen-Omni family, it scales to hundreds of billions of parameters and supports a 256k context length [1]. Qwen3.5-Omni demonstrates robust omni-modality capabilities by leveraging a massive training dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content [1].
The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video processing at 1 FPS [1]. Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance [1].
What is new vs the previous version
Qwen3.5-Omni delivers three major new capabilities over Qwen3-Omni across interaction, captioning, and technical architecture. The model introduces controllable audio-visual captioning, capable of generating controllable, detailed, and structured captions as well as screenplay-level fine-grained descriptions [2]. This includes automatic segmentation, timestamp annotation, and detailed descriptions of characters and their relationship to audio [2].
| Feature | Qwen3-Omni | Qwen3.5-Omni |
|---|---|---|
| Parameters | Not specified | Hundreds of billions [1] |
| Context Length | Not specified | 256k tokens [1] |
| Audio Understanding | Limited duration | Over 10 hours [1] |
| Video Processing | Not specified | 400 seconds of 720P at 1 FPS [1] |
| Speech Synthesis | Basic | ARIA dynamic alignment [1] |
| Captioning | Basic | Controllable screenplay-level [2] |
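The report does not publish a caption schema, so the following is only a minimal sketch of what a screenplay-level caption with automatic segmentation and timestamp annotation might look like in practice; every field name here is a hypothetical illustration, not a documented output format.

```python
from dataclasses import dataclass, field

# Hypothetical structure for a screenplay-level audio-visual caption.
# Field names are illustrative only; the report does not specify a schema.
@dataclass
class CaptionSegment:
    start_s: float          # segment start time in seconds
    end_s: float            # segment end time in seconds
    scene: str              # short scene description
    characters: list[str]   # characters visible or audible in the segment
    audio_relation: str     # how on-screen characters relate to the audio track

@dataclass
class StructuredCaption:
    title: str
    segments: list[CaptionSegment] = field(default_factory=list)

caption = StructuredCaption(
    title="Kitchen scene",
    segments=[
        CaptionSegment(
            start_s=0.0, end_s=12.5,
            scene="Two people argue over a burnt dish",
            characters=["Chef", "Sous-chef"],
            audio_relation="Chef speaks over the sizzling pan; sous-chef sighs off-screen",
        )
    ],
)
```

A real deployment would presumably carry richer fields (shot boundaries, speaker diarization, emotion tags), but the segment-plus-timestamp shape above captures the core idea behind structured, screenplay-level captions.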
How does Qwen3.5-Omni work
Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both the Thinker and Talker components, enabling efficient long-sequence inference. The architecture processes each modality through specialized pathways, as sketched in the example after this list:
- Multimodal Input Processing: The model ingests text, audio, and video data through dedicated encoders that convert each modality into unified token representations [1]
- Hybrid Attention MoE: The Thinker component uses mixture-of-experts routing to efficiently process different types of content while maintaining computational efficiency [1]
- ARIA Speech Alignment: The system dynamically aligns text and speech units to address encoding efficiency discrepancies between text and speech tokenizers [1]
- Talker Generation: The output component generates responses across modalities with precise temporal synchronization and automated scene segmentation [1]
- Real-time Interaction: The model supports semantic interruption through native turn-taking intent recognition and end-to-end voice control [2]
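The report describes this pipeline only at a high level, so the sketch below is a toy illustration of the Thinker-Talker idea with a top-1 MoE routing layer. All class names, dimensions, and the routing scheme are assumptions, not the published Qwen3.5-Omni architecture, and the ARIA text-speech alignment step is not modeled here.

```python
import torch
import torch.nn as nn

# Conceptual sketch only: module names, sizes, and routing are assumptions.
class TinyMoELayer(nn.Module):
    """A toy mixture-of-experts layer: route each token to its top-1 expert."""
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim); pick one expert per token from the router logits
        expert_idx = self.router(x).argmax(dim=-1)
        out = torch.empty_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out

class OmniPipelineSketch(nn.Module):
    """Thinker reasons over fused multimodal tokens; Talker emits speech units."""
    def __init__(self, dim: int = 64, speech_units: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(dim, dim)    # stand-in for a text encoder
        self.audio_proj = nn.Linear(dim, dim)   # stand-in for an audio encoder
        self.video_proj = nn.Linear(dim, dim)   # stand-in for a video encoder
        self.thinker = TinyMoELayer(dim)        # "Thinker": reasoning over fused tokens
        self.talker = nn.Linear(dim, speech_units)  # "Talker": predicts speech units

    def forward(self, text, audio, video):
        # Fuse modality tokens into one sequence, reason, then emit speech-unit ids.
        fused = torch.cat([self.text_proj(text),
                           self.audio_proj(audio),
                           self.video_proj(video)], dim=0)
        hidden = self.thinker(fused)
        return self.talker(hidden).argmax(dim=-1)

model = OmniPipelineSketch()
units = model(torch.randn(8, 64), torch.randn(16, 64), torch.randn(4, 64))
print(units.shape)  # one speech-unit id per fused token in this toy setup
```

The usage lines at the bottom only show the data flow: three modality streams are projected into a shared token space, the "Thinker" MoE layer processes them, and the "Talker" head produces discrete speech-unit ids.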
Benchmarks and evidence
Qwen3.5-Omni-plus achieves state-of-the-art results across 215 audio and audio-visual understanding, reasoning, and interaction benchmark subtasks. The model surpasses Gemini-3.1 Pro in key audio tasks and matches it in comprehensive audio-visual understanding [1].
| Benchmark Category | Performance | Comparison | Source |
|---|---|---|---|
| Audio Tasks | State-of-the-art | Surpasses Gemini-3.1 Pro | [1] |
| Audio-Visual Understanding | State-of-the-art | Matches Gemini-3.1 Pro | [1] |
| Total Benchmarks | 215 subtasks | SOTA across all categories | [1] |
| Context Processing | 256k tokens | Extended context support | [1] |
| Video Processing | 400 seconds 720P | 1 FPS processing rate | [1] |
Who should care
Builders
Developers building multimodal applications can leverage Qwen3.5-Omni’s audio-visual capabilities for creating sophisticated conversational AI systems. The model’s support for over 10 hours of audio understanding and 400 seconds of video processing enables complex multimedia applications [1]. The ARIA technology provides stable speech synthesis for real-time conversational interfaces [1].
Enterprise
Companies requiring multilingual audio-visual processing can utilize Qwen3.5-Omni’s support for 10 languages with emotional nuance. The model’s controllable audio-visual captioning capabilities enable automated content analysis and screenplay-level descriptions for media companies [2]. Enterprise applications benefit from the model’s comprehensive real-time interaction features [2].
End users
Users seeking advanced AI assistants gain access to sophisticated audio-visual understanding and natural speech generation. The model’s ability to perform Audio-Visual Vibe Coding allows users to generate code based on audio-visual instructions [1]. Real-time interaction capabilities include semantic interruption and voice control over volume, speed, and emotion [2].
Investors
The advancement represents significant progress in omnimodal AI capabilities, with Qwen3.5-Omni achieving state-of-the-art performance across 215 benchmarks. The model’s emergence of Audio-Visual Vibe Coding indicates new market opportunities in multimodal programming interfaces [1].
How to use Qwen3.5-Omni today
Access methods and implementation details for Qwen3.5-Omni are not yet disclosed in the technical report. Based on the Qwen model family pattern, the model will likely be available through:
- API Access: Integration through Qwen’s API endpoints for developers building applications
- Model Downloads: Direct model weights for local deployment and fine-tuning
- Cloud Platforms: Hosted inference through major cloud providers
- Development Tools: SDKs and libraries for multimodal application development
Specific pricing, availability dates, and access requirements are not yet disclosed [1].
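If access ends up following earlier Qwen releases, which expose OpenAI-compatible chat endpoints, a call might look roughly like the sketch below; the base URL and model identifier are placeholders, not confirmed values.

```python
# Hypothetical example: assumes an OpenAI-compatible endpoint like earlier
# Qwen models; the base_url and model name below are placeholders only.
from openai import OpenAI

client = OpenAI(
    base_url="https://example-qwen-endpoint/v1",  # placeholder, not a real endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni",  # hypothetical model identifier
    messages=[
        {"role": "user", "content": "Summarize the attached clip in three sentences."}
    ],
)
print(response.choices[0].message.content)
```

Local deployment would presumably follow the usual Hugging Face transformers pattern instead, but no weight repository has been announced.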
Qwen3.5-Omni vs competitors
Qwen3.5-Omni competes directly with other large-scale multimodal models in the audio-visual AI space.
| Model | Parameters | Audio Performance | Video Support | Languages |
|---|---|---|---|---|
| Qwen3.5-Omni | Hundreds of billions [1] | Surpasses Gemini-3.1 Pro [1] | 400s 720P at 1 FPS [1] | 10 languages [1] |
| Gemini-3.1 Pro | Not disclosed | Baseline comparison [1] | Not specified | Not specified |
| GPT-4o | Not disclosed | Not compared | Not specified | Not specified |
| Claude-3.5 | Not disclosed | Not compared | Not specified | Not specified |
Risks, limits, and myths
- Computational Requirements: Hundreds of billions of parameters require significant computational resources for inference and deployment [1]
- Speech Synthesis Stability: While ARIA addresses instability, streaming speech synthesis remains challenging due to encoding discrepancies [1]
- Context Length Limitations: Despite 256k context support, processing extremely long sequences may impact performance [1]
- Training Data Bias: The model’s performance depends on the quality and diversity of 100+ million hours of training data [1]
- Real-time Processing: Audio-visual processing at scale may introduce latency in real-time applications [1]
- Language Coverage: Speech support is limited to 10 languages, which may exclude some regional language requirements [1]
- Availability Uncertainty: Release timeline and access methods remain undisclosed [1]
FAQ
What is Qwen3.5-Omni and how does it work?
Qwen3.5-Omni is a multimodal AI model with hundreds of billions of parameters that processes text, audio, and video simultaneously using a Hybrid Attention Mixture-of-Experts framework [1].
How many parameters does Qwen3.5-Omni have?
Qwen3.5-Omni scales to hundreds of billions of parameters, representing a significant evolution over its predecessor [1].
What is the context length of Qwen3.5-Omni?
Qwen3.5-Omni supports a 256k context length for processing long sequences of multimodal content [1].
How long can Qwen3.5-Omni process audio and video?
The model supports over 10 hours of audio understanding and 400 seconds of 720P video processing at 1 FPS [1].
What is ARIA in Qwen3.5-Omni?
ARIA is a technology that dynamically aligns text and speech units to enhance stability and prosody of conversational speech with minimal latency impact [1].
How many languages does Qwen3.5-Omni support?
Qwen3.5-Omni supports multilingual understanding and speech generation across 10 languages with human-like emotional nuance [1].
What is Audio-Visual Vibe Coding?
Audio-Visual Vibe Coding is a new capability that allows the model to perform coding directly based on audio-visual instructions [1].
How does Qwen3.5-Omni compare to Gemini-3.1 Pro?
Qwen3.5-Omni surpasses Gemini-3.1 Pro in key audio tasks and matches it in comprehensive audio-visual understanding across 215 benchmarks [1].
When will Qwen3.5-Omni be available?
The technical report has been published on arXiv, but specific availability dates are not yet disclosed [1].
What are the main improvements over Qwen3-Omni?
Qwen3.5-Omni adds controllable audio-visual captioning, comprehensive real-time interaction, and voice cloning capabilities over its predecessor [2].
Can Qwen3.5-Omni handle real-time conversations?
Yes, the model supports comprehensive real-time interaction including semantic interruption through native turn-taking intent recognition and end-to-end voice control [2].
What training data was used for Qwen3.5-Omni?
The model was trained on a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content [1].
Glossary
- ARIA: A technology that dynamically aligns text and speech units to improve conversational speech stability and prosody
- Audio-Visual Vibe Coding: A capability allowing AI models to generate code directly from audio-visual instructions
- Hybrid Attention MoE: A Mixture-of-Experts framework combining attention mechanisms for efficient processing of different content types
- Omni-modality: The ability to process and understand multiple input modalities including text, audio, and video simultaneously
- Talker: The output generation component of the model responsible for producing responses across different modalities
- Thinker: The reasoning component of the model that processes and analyzes multimodal inputs before generation
Sources
- [2604.15804] Qwen3.5-Omni Technical Report — https://arxiv.org/abs/2604.15804
- Qwen3.5-Omni Technical Report — https://arxiv.org/html/2604.15804v1
- Paper page – Qwen3.5-Omni Technical Report — https://huggingface.co/papers/2604.15804
- Qwen3.6-Max-Preview: Smarter, Sharper, Still Evolving — https://qwen.ai/research
- Qwen3.5 – How to Run Locally | Unsloth Documentation — https://unsloth.ai/docs/models/qwen3.5
- Qwen (Qwen) — https://huggingface.co/Qwen
- Qwen3.5 & Qwen3.6 Usage Guide – vLLM Recipes — https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html
- Qwen Models | OpenRouter — https://openrouter.ai/qwen