DeepSeek V4-Pro and V4-Flash have launched, featuring a hybrid attention architecture for million-token context inference. The models are available through NVIDIA GPU-accelerated endpoints and are optimized for Blackwell chips, with V4-Pro containing 1.6T total parameters and 49B active parameters.
| Released by | DeepSeek |
|---|---|
| Release date | Not yet disclosed |
| What it is | Fourth-generation flagship AI models with hybrid attention architecture |
| Who it is for | Developers building long-context AI applications |
| Where to get it | NVIDIA GPU endpoints, Hugging Face, local deployment |
| Price | Not yet disclosed |
- DeepSeek V4-Pro features 1.6T total parameters with 49B active parameters for maximum performance
- DeepSeek V4-Flash offers 284B parameters with 13B active parameters for higher-speed inference
- Both models use hybrid attention combining Compressed Sparse Attention and Heavily Compressed Attention
- Models support million-token context length for processing extensive documents and conversations
- Available through NVIDIA GPU endpoints on build.nvidia.com and for local deployment via vLLM
- DeepSeek V4 represents the fourth generation of flagship models optimized for million-token context processing
- The hybrid attention architecture dramatically improves long-context efficiency compared to traditional attention mechanisms
- NVIDIA Blackwell GPU optimization enables high-performance inference for both model variants
- Developers can access the models immediately through NVIDIA’s GPU-accelerated endpoints platform
- Local deployment requires vLLM version 0.9.0 or newer with MoE support for optimal performance
What is DeepSeek V4
DeepSeek V4 is the fourth-generation flagship AI model family from DeepSeek featuring hybrid attention architecture for efficient million-token context inference. The family includes two variants: DeepSeek V4-Pro with 1.6T total parameters and 49B active parameters, and DeepSeek V4-Flash with 284B parameters and 13B active parameters designed for higher-speed applications [1].
Both models implement a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to dramatically improve long-context efficiency [2]. The models are optimized for NVIDIA Blackwell GPUs and available through multiple deployment options including cloud endpoints and local installation.
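DeepSeek has not published the internals of CSA or HCA, but the underlying idea of sparse attention is that each query attends only to its top-scoring keys rather than the full context. A minimal NumPy sketch of top-k sparse attention (function name and parameters are illustrative, not DeepSeek's actual implementation):

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=8):
    """Single-query sparse attention: score all keys, but attend
    only to the top-k, masking the rest out before softmax."""
    scores = K @ q / np.sqrt(q.shape[0])      # (n,) scaled dot-product scores
    top = np.argpartition(scores, -k)[-k:]    # indices of the k best keys
    mask = np.full_like(scores, -np.inf)
    mask[top] = scores[top]                   # keep only top-k scores
    w = np.exp(mask - mask[top].max())        # exp(-inf) = 0 drops the rest
    w /= w.sum()                              # softmax over the k kept keys
    return w @ V                              # weighted sum of values

rng = np.random.default_rng(0)
n, d = 1024, 64
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out = topk_sparse_attention(q, K, V, k=8)
print(out.shape)  # (64,)
```

The payoff at million-token scale is that only k attention weights per query survive, so memory and compute stop scaling with the full context length for the softmax and value aggregation.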
What is new vs the previous version
DeepSeek V4 introduces hybrid attention architecture and million-token context support as major improvements over previous generations.
| Feature | DeepSeek V3 | DeepSeek V4 |
|---|---|---|
| Context Length | 128K tokens | 1 million tokens |
| Attention Mechanism | Multi-head Latent Attention (MLA) | Hybrid CSA + HCA attention |
| V4-Pro Parameters | Not applicable | 1.6T total, 49B active |
| V4-Flash Parameters | Not applicable | 284B total, 13B active |
| GPU Optimization | General GPU support | NVIDIA Blackwell optimized |
| Release Date | December 2024 | Not yet disclosed |
How does DeepSeek V4 work
DeepSeek V4 operates through a hybrid attention architecture that combines two specialized attention mechanisms for efficient long-context processing.
- Compressed Sparse Attention (CSA): Reduces computational complexity by focusing attention on the most relevant tokens within the context window
- Heavily Compressed Attention (HCA): Further compresses attention patterns to enable processing of million-token contexts with manageable memory requirements
- Mixture of Experts (MoE) Architecture: Activates only a subset of parameters (49B out of 1.6T for V4-Pro) during inference to maintain efficiency
- NVIDIA Blackwell Optimization: Leverages specialized GPU acceleration for improved inference performance and throughput
- Dynamic Context Management: Intelligently manages the million-token context window to maintain relevance and computational efficiency
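The exact router used by V4-Pro is not public, but the "49B active out of 1.6T total" behavior described above is the standard top-k expert gating pattern. A minimal sketch under that assumption (shapes and names are illustrative):

```python
import numpy as np

def moe_forward(x, expert_weights, gate, k=2):
    """Route a token through only the top-k experts.
    expert_weights: (E, d, d), one linear layer per expert;
    gate: (d, E), router producing one logit per expert."""
    logits = x @ gate                          # (E,) router scores
    top = np.argsort(logits)[-k:]              # indices of the k chosen experts
    g = np.exp(logits[top] - logits[top].max())
    g /= g.sum()                               # normalized gate weights
    # Only k of E expert matrices are touched -> the "active" parameters
    return sum(gi * (x @ expert_weights[i]) for gi, i in zip(g, top))

rng = np.random.default_rng(1)
E, d = 16, 32
x = rng.standard_normal(d)
experts = rng.standard_normal((E, d, d)) / np.sqrt(d)
gate = rng.standard_normal((d, E))
y = moe_forward(x, experts, gate, k=2)
print(y.shape)  # (32,)
```

Per-token compute scales with k experts rather than all E, which is how a 1.6T-parameter model can run inference at the cost of roughly 49B parameters.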
Benchmarks and evidence
DeepSeek V4 demonstrates competitive performance rivaling leading commercial models according to initial evaluations.
| Benchmark | Performance | Source |
|---|---|---|
| Overall Performance | Rivaling GPT-5.5 and Claude Opus 4.7 | [5] |
| Context Length | 1 million tokens | [5] |
| Architecture Type | Open-source hybrid architecture | [5] |
| Hardware Requirements | 4 DGX Spark systems recommended | [6] |
Who should care
Builders
Developers building applications requiring extensive context processing benefit from DeepSeek V4’s million-token capability. The models support complex document analysis, long-form content generation, and multi-turn conversations without context truncation. NVIDIA GPU endpoint integration simplifies deployment for rapid prototyping and production applications.
Enterprise
Organizations processing large documents, legal contracts, or technical manuals gain efficiency from DeepSeek V4’s extended context window. The hybrid attention architecture reduces computational costs while maintaining performance quality. Enterprise teams can deploy through NVIDIA’s managed endpoints or on-premises infrastructure.
End Users
Users requiring AI assistance with lengthy documents, research papers, or complex projects benefit from DeepSeek V4’s ability to maintain context across extensive interactions. The Flash variant provides faster responses for time-sensitive applications while preserving context awareness.
Investors
DeepSeek’s continued innovation in open-source AI models demonstrates competitive positioning against commercial alternatives. The NVIDIA partnership and Blackwell optimization indicate strong technical execution and market positioning in the AI infrastructure ecosystem.
How to use DeepSeek V4 today
DeepSeek V4 is immediately available through multiple deployment options for developers and organizations.
NVIDIA GPU Endpoints
Developers can access DeepSeek V4 through NVIDIA GPU-accelerated endpoints on build.nvidia.com as part of the NVIDIA Developer Program [1]. Registration provides immediate API access without local hardware requirements.
Local Deployment
For local installation, create a Python virtual environment and install vLLM with MoE support:
python -m venv v4flash-env
source v4flash-env/bin/activate
pip install --upgrade pip
pip install "vllm>=0.9.0"
This vLLM version includes official support for DeepSeek V4 Flash models [3].
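Once the model is served locally (e.g. `vllm serve deepseek-ai/DeepSeek-V4-Flash`), vLLM exposes an OpenAI-compatible HTTP API. A sketch of building a chat-completion request for it; the model id follows the Hugging Face listing, and the default port 8000 is an assumption about your local setup:

```python
import json

# Model id assumed from the Hugging Face listing; match your `vllm serve` argument.
MODEL = "deepseek-ai/DeepSeek-V4-Flash"

def chat_request(prompt, max_tokens=256):
    """Build an OpenAI-compatible /v1/chat/completions payload
    for a local vLLM server."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

payload = chat_request("Summarize this contract in three bullet points.")
print(json.dumps(payload, indent=2))

# To send it against a running server (vLLM default port 8000):
#   import requests
#   r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
#   print(r.json()["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, the same payload shape works with standard OpenAI client libraries pointed at the local base URL.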
Hugging Face Integration
Both DeepSeek V4-Pro and V4-Flash are available on Hugging Face for direct model access and fine-tuning workflows [2].
DeepSeek V4 vs competitors
DeepSeek V4 competes directly with leading commercial and open-source language models in the million-token context category.
| Model | Context Length | Parameters | Availability | Cost Model |
|---|---|---|---|---|
| DeepSeek V4-Pro | 1M tokens | 1.6T total, 49B active | Open source | Not yet disclosed |
| GPT-5.5 | Not yet disclosed | Not yet disclosed | Commercial API | Pay-per-token |
| Claude Opus 4.7 | Not yet disclosed | Not yet disclosed | Commercial API | Pay-per-token |
| Llama 3.1 | 128K tokens | 405B parameters | Open source | Free for research |
Risks, limits, and myths
- Hardware Requirements: Full V4-Pro deployment requires substantial GPU resources, with 4 DGX Spark systems recommended for optimal performance
- Quantization Dependency: Most users need quantized versions for practical deployment due to memory requirements
- Training Hardware Transparency: DeepSeek has not specified which GPUs were used for model training, amid ongoing regulatory concerns
- Performance Scaling: Million-token context processing may experience latency increases with maximum context utilization
- API Pricing: Commercial endpoint pricing structure remains undisclosed, affecting cost planning for production deployments
- Model Stability: As a newly released model, long-term stability and edge case performance require additional validation
FAQ
How much does DeepSeek V4 cost to use?
DeepSeek V4 pricing through NVIDIA GPU endpoints is not yet disclosed. The models are available open-source for local deployment with appropriate hardware.
What hardware do I need to run DeepSeek V4 locally?
DeepSeek V4-Pro requires approximately 4 DGX Spark systems for full deployment. Most users should wait for quantized versions to reduce hardware requirements.
Can I use DeepSeek V4 for commercial applications?
Yes, DeepSeek V4 is available as open-source models suitable for commercial use, subject to the specific license terms provided by DeepSeek.
How does DeepSeek V4 compare to GPT-4 for long documents?
DeepSeek V4 supports 1 million token context length, significantly exceeding GPT-4’s context window for processing extensive documents without truncation.
Is DeepSeek V4 available through OpenAI API?
No, DeepSeek V4 is available through NVIDIA GPU endpoints, Hugging Face, and local deployment, but not through OpenAI’s API platform.
What programming languages work with DeepSeek V4?
DeepSeek V4 supports standard API integration through HTTP requests, compatible with Python, JavaScript, and other languages supporting REST API calls.
How fast is DeepSeek V4-Flash compared to V4-Pro?
DeepSeek V4-Flash uses 13B active parameters compared to V4-Pro’s 49B active parameters, designed specifically for higher-speed inference applications.
Can I fine-tune DeepSeek V4 on my own data?
Yes, DeepSeek V4 models are available on Hugging Face and support fine-tuning workflows for domain-specific applications and customization.
What is the difference between CSA and HCA attention?
Compressed Sparse Attention (CSA) focuses on relevant tokens while Heavily Compressed Attention (HCA) further reduces memory requirements for million-token processing.
Does DeepSeek V4 support function calling and tool use?
Specific function calling capabilities are not yet disclosed in the available documentation. Check the model documentation for current feature support.
How do I get access to NVIDIA GPU endpoints for DeepSeek V4?
Join the NVIDIA Developer Program and access build.nvidia.com to use DeepSeek V4 through GPU-accelerated endpoints.
Is DeepSeek V4 better than Claude for coding tasks?
Specific coding benchmark comparisons are not yet disclosed. DeepSeek V4 performance rivals Claude Opus 4.7 according to initial evaluations, but task-specific comparisons require further testing.
Glossary
- Compressed Sparse Attention (CSA)
- Attention mechanism that reduces computational complexity by focusing on the most relevant tokens within the context window
- Heavily Compressed Attention (HCA)
- Advanced compression technique that enables processing of million-token contexts with manageable memory requirements
- Mixture of Experts (MoE)
- Architecture that activates only a subset of model parameters during inference to maintain efficiency while scaling total capacity
- Active Parameters
- The subset of total model parameters that are activated and used during inference for a specific input
- Context Window
- The maximum number of tokens a language model can process simultaneously while maintaining coherent understanding
- NVIDIA Blackwell
- Advanced GPU architecture optimized for AI workloads and large language model inference
- vLLM
- Open-source library for efficient large language model serving and inference optimization
- GPU-Accelerated Endpoints
- Cloud-based API services that use GPU hardware to provide fast inference for AI models
Sources
- Build with DeepSeek V4 Using NVIDIA Blackwell and GPU-Accelerated Endpoints | NVIDIA Technical Blog — https://developer.nvidia.com/blog/build-with-deepseek-v4-using-nvidia-blackwell-and-gpu-accelerated-endpoints/
- deepseek-ai/DeepSeek-V4-Pro · Hugging Face — https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
- Run DeepSeek V4 Flash Locally: Full 2026 Setup Guide — https://ghost.codersera.com/blog/run-deepseek-v4-flash-locally-full-2026-setup-guide/
- deepseek-ai/DeepSeek-V4-Flash · Hugging Face — https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash
- DeepSeek V4 Released: Everything You Need to Know (April 2026) — https://felloai.com/deepseek-v4/
- Deepseek V4 released – DGX Spark / GB10 – NVIDIA Developer Forums — https://forums.developer.nvidia.com/t/deepseek-v4-released/367696
- DeepSeek releases new flagship open source AI model V4 By Investing.com — https://www.investing.com/news/stock-market-news/deepseek-releases-new-flagship-open-source-ai-model-v4-4634548
- Deepseek V4 released – Page 2 – DGX Spark / GB10 – NVIDIA Developer Forums — https://forums.developer.nvidia.com/t/deepseek-v4-released/367696?page=2