Frontier Signal

RTPrune Boosts DeepSeek-OCR Inference Speed and Accuracy

RTPrune, a new two-stage token pruning method, delivers 1.23x faster prefill inference for DeepSeek-OCR and 99.47% accuracy on OmniDocBench by optimizing how visual tokens are processed.


RTPrune, a novel two-stage token pruning method, enhances both the efficiency and the accuracy of DeepSeek-OCR inference by intelligently managing visual tokens. Published on arXiv, the approach achieves a 1.23x faster prefill and 99.47% accuracy on the OmniDocBench benchmark while retaining 84.25% of visual tokens. It addresses the redundancy of visual information in long-text processing by first prioritizing high-norm tokens and then merging the remainder using optimal transport theory, yielding a more robust compression mechanism than previous methods.

  • Two-Stage Pruning: RTPrune employs a two-stage process for visual token pruning in DeepSeek-OCR, first prioritizing high-norm tokens and then merging remaining tokens.
  • Enhanced Efficiency: The method achieves a 1.23x faster prefill inference speed on OmniDocBench for DeepSeek-OCR-Large.
  • Improved Accuracy: RTPrune maintains high textual fidelity, reaching 99.47% accuracy on OmniDocBench.
  • Dynamic Adaptation: It uses a dynamic pruning ratio tailored to token similarity and textual density, optimizing the efficiency-accuracy trade-off for OCR tasks.
  • Targeted for DeepSeek-OCR: Unlike general VLM pruning methods, RTPrune is specifically designed based on the observed two-stage reading trajectory of DeepSeek-OCR.

What changed

DeepSeek-OCR, a model leveraging visual-text compression for long-text processing, has faced challenges with redundant textual and structural information within its visual tokens. Prior token pruning methods for conventional Vision-Language Models (VLMs) often compromised textual fidelity due to improper compression mechanisms [7]. The core innovation with RTPrune is a tailored approach that recognizes and exploits DeepSeek-OCR’s unique two-stage decoding process.

Specifically, the research observed that DeepSeek-OCR initially focuses on high-norm tokens before redistributing attention to others. RTPrune capitalizes on this by introducing a two-stage pruning strategy. The first stage prioritizes these “high-norm” visual tokens, which are crucial for salient textual and structural information. The second stage then intelligently pairs and merges the remaining tokens using optimal transport theory for efficient feature aggregation. This is a significant departure from methods that might simply discard tokens or use less sophisticated merging, which could lead to a loss of critical context for OCR tasks.

Furthermore, RTPrune introduces a dynamic pruning ratio. This ratio adapts based on token similarity and textual density, a capability not commonly found in general VLM pruning techniques. This dynamic adjustment allows for a more nuanced balance between efficiency and accuracy, which is particularly vital for the varied nature of OCR documents.

How it works

RTPrune operates on the understanding that DeepSeek-OCR processes visual information in two distinct phases during its decoding trajectory. This insight, derived from analyzing the model’s behavior, forms the foundation of the two-stage pruning mechanism.

  1. Stage 1: High-Norm Token Prioritization. The initial step involves identifying and prioritizing visual tokens with high “norms.” These tokens are empirically found to carry the most salient textual and structural information within a document. By retaining these high-norm tokens, RTPrune ensures that the most critical visual cues for OCR are preserved early in the process. This stage acts as a coarse filter, focusing on the most informative parts of the visual input.
  2. Stage 2: Optimal Transport-Based Merging. After the high-norm tokens are secured, the remaining visual tokens, which might contain less immediate but still valuable information, undergo a merging process. This merging is not arbitrary; it’s guided by optimal transport theory. This mathematical framework allows for the efficient aggregation of features by finding the most cost-effective way to “move” information from one set of tokens to another. In practice, this means tokens are paired and merged in a way that minimizes information loss while significantly reducing the total number of tokens. This intelligent merging helps to compress the visual input without sacrificing the textual fidelity required for accurate OCR.
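The two stages above can be sketched in plain Python. Everything in this snippet is an illustrative assumption rather than the paper's actual implementation: norm-based top-k selection stands in for the high-norm prioritization, and a greedy nearest-neighbor running-mean merge stands in for the optimal-transport pairing, which in practice would use a proper OT solver over DeepSeek-OCR's visual token features.

```python
import math

def rtprune(tokens, keep_ratio=0.5):
    """Two-stage pruning sketch: (1) keep the highest-norm tokens,
    (2) fold each remaining token into its most similar kept token."""
    n = len(tokens)
    k = max(1, int(n * keep_ratio))
    # Stage 1: rank tokens by L2 norm and keep the top-k.
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    order = sorted(range(n), key=lambda i: norms[i], reverse=True)
    keep_idx, rest_idx = order[:k], order[k:]
    kept = [list(tokens[i]) for i in keep_idx]
    counts = [1] * k

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Stage 2: merge each remaining token into the kept token it is
    # most similar to, via a running mean (a greedy stand-in for the
    # optimal-transport pairing described in the paper).
    for i in rest_idx:
        j = max(range(k), key=lambda c: dot(tokens[i], kept[c]))
        kept[j] = [(v * counts[j] + x) / (counts[j] + 1)
                   for v, x in zip(kept[j], tokens[i])]
        counts[j] += 1
    return kept

# Toy "visual tokens": 12 tokens of dimension 4.
vis = [[(i * 7 + d) % 5 - 2.0 for d in range(4)] for i in range(12)]
pruned = rtprune(vis, keep_ratio=0.5)
print(len(vis), "->", len(pruned))  # 12 -> 6
```

The key structural point the sketch preserves is ordering: salient (high-norm) tokens survive untouched, and only the residual tokens are compressed by merging.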

Beyond these two stages, RTPrune incorporates a dynamic pruning ratio. This ratio is not fixed but adapts based on two key factors: token similarity and textual density. For regions with high textual density or closely similar tokens, the pruning might be more aggressive, as redundancy is likely higher. Conversely, in sparser or more diverse visual areas, the pruning ratio might be lower to ensure no critical information is lost. This adaptive mechanism allows RTPrune to achieve a superior efficiency-accuracy trade-off compared to static pruning strategies, which are less responsive to the varying characteristics of document images.
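An adaptive ratio of this kind can be mocked up with a simple heuristic. The mapping below is hypothetical and invented for illustration (the paper's actual formula combining token similarity and textual density is not reproduced here): average pairwise cosine similarity is mapped linearly onto a keep-ratio range, so highly redundant inputs are pruned harder.

```python
import math

def dynamic_keep_ratio(tokens, lo=0.5, hi=0.95):
    """Map average pairwise cosine similarity onto [lo, hi]: the more
    redundant the tokens, the lower the keep ratio. (Hypothetical
    heuristic; the paper also weighs textual density.)"""
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        if not na or not nb:
            return 0.0
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    n = len(tokens)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    avg_sim = sum(cos(tokens[i], tokens[j]) for i, j in pairs) / len(pairs)
    s = min(max(avg_sim, 0.0), 1.0)  # clamp similarity to [0, 1]
    return hi - (hi - lo) * s

redundant = [[1.0, 1.0], [1.0, 1.01], [0.99, 1.0]]  # near-duplicates
diverse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]     # dissimilar
print(round(dynamic_keep_ratio(redundant), 2),
      round(dynamic_keep_ratio(diverse), 2))  # 0.5 0.95
```

The design intuition matches the article's description: dense, self-similar regions tolerate aggressive pruning, while sparse or diverse regions keep more tokens.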

Why it matters for operators

For operators leveraging DeepSeek-OCR in production environments, RTPrune represents a tangible improvement in the cost-efficiency and performance of document processing workflows. DeepSeek’s models, including DeepSeek-V3, are already known for their cost-effectiveness and efficiency gains through innovations like Compressed Sparse Attention [3, 5]. RTPrune extends this lineage specifically for OCR, a domain critical for automating tasks like invoice processing, legal document analysis, and data extraction from scanned records.

The reported 1.23x faster prefill on DeepSeek-OCR-Large means that initial processing of documents will be quicker, directly impacting throughput and reducing latency for applications that rely on rapid document ingestion. This is particularly valuable for high-volume scenarios where even marginal speed improvements can lead to significant operational savings and improved user experience. Furthermore, maintaining a 99.47% accuracy on OmniDocBench while achieving this speedup, with only 84.25% token retention, indicates that the efficiency gains do not come at the expense of reliability. Operators can trust the output without needing extensive post-processing or manual verification, which is often a hidden cost in OCR deployments.

From a FrontierWisdom perspective, the critical takeaway is that this isn’t just another generic pruning technique. RTPrune’s success stems from a deep understanding of DeepSeek-OCR’s internal mechanics—its “reading trajectory.” This highlights a broader trend: as models become more specialized and complex, optimizing them effectively requires model-specific insights rather than one-size-fits-all solutions. Operators should be wary of applying generic optimization techniques to highly specialized models like DeepSeek-OCR without understanding the underlying architectural nuances. Instead, look for optimizations that are purpose-built and validated against the specific model and task at hand, as RTPrune is for DeepSeek-OCR. This approach ensures that the claimed benefits translate directly into real-world operational advantages, rather than theoretical gains that fall apart in production.

Benchmarks and evidence

The effectiveness of RTPrune is substantiated by specific performance metrics on the OmniDocBench dataset, particularly when applied to the DeepSeek-OCR-Large model.

  • Accuracy: RTPrune achieved 99.47% accuracy on OmniDocBench. This indicates that the pruning process effectively retains the critical information necessary for highly accurate OCR, preventing degradation in output quality.
  • Inference Speed: The method demonstrated a 1.23x faster prefill speed. “Prefill” refers to the initial processing phase where the input sequence is encoded, and accelerating this directly translates to quicker document ingestion and lower latency for applications.
  • Token Retention: These performance gains were achieved while retaining only 84.25% of the original visual tokens. This significant reduction in token count is the direct mechanism behind the speedup, as the model has less data to process during inference.

These figures demonstrate a strong efficiency-accuracy trade-off, showing that RTPrune can significantly reduce computational load and accelerate inference without compromising the high accuracy expected from advanced OCR systems.

Risks and open questions

  • Generalizability to Other DeepSeek-OCR Variants: While the paper specifies DeepSeek-OCR-Large, it’s an open question whether RTPrune’s benefits and specific pruning ratios translate directly to other potential DeepSeek-OCR model sizes or architectures without re-tuning.
  • Impact on Edge Cases and Low-Quality Documents: The 99.47% accuracy is impressive on OmniDocBench, but real-world OCR often deals with severely degraded documents, unusual fonts, or complex layouts. The impact of aggressive token pruning on such edge cases, where every visual cue might be critical, needs further investigation.
  • Computational Overhead of Pruning: While the method aims to accelerate inference, the computational cost of the two-stage pruning process itself (e.g., calculating token norms, optimal transport for merging) needs to be fully quantified relative to the gains. Is the overhead negligible, or does it become a factor for very short documents?
  • Integration Complexity: For operators, integrating a custom pruning method like RTPrune into existing DeepSeek-OCR pipelines could introduce complexity. The ease of implementation and availability of pre-trained models or libraries incorporating RTPrune will be key to its adoption.

Author

  • Siegfried Kamgo

    Founder and editorial lead at FrontierWisdom. Engineer turned operator-analyst writing about AI systems, automation infrastructure, decentralised stacks, and the practical economics of frontier technology. Focus: turning fast-moving releases into durable, implementation-ready playbooks.

