LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification. It comprises 3,105 appellate proceedings from the Santa Catarina State Court spanning five legal areas, and its results show specialized fine-tuned models significantly outperforming commercial LLMs.
| Released by | Not yet disclosed |
|---|---|
| Release date | |
| What it is | First public benchmark for evaluating LLMs on Brazilian legal text classification |
| Who it is for | Legal AI researchers and Portuguese NLP developers |
| Where to get it | Full dataset and model released publicly |
| Price | Free |
- LegalBench-BR contains 3,105 appellate proceedings from Santa Catarina State Court collected via DataJud API
- BERTimbau-LoRA achieves 87.6% accuracy by updating only 0.3% of model parameters
- Commercial LLMs like GPT-4o mini and Claude 3.5 Haiku perform poorly on administrative law classification
- Fine-tuned models avoid the systematic bias toward civil law that affects general-purpose LLMs
- Dataset covers five legal areas with LLM-assisted labeling and heuristic validation
- Domain-adapted fine-tuning significantly outperforms general-purpose LLMs on Brazilian legal classification tasks
- Commercial LLMs exhibit systematic bias toward civil law categories in legal text classification
- LoRA fine-tuning provides efficient domain adaptation with minimal parameter updates
- Brazilian legal NLP requires specialized models rather than relying on general commercial solutions
- Public legal datasets enable reproducible research in Portuguese legal natural language processing
What is LegalBench-BR
LegalBench-BR is the first public benchmark specifically designed for evaluating large language models on Brazilian legal text classification tasks. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC) collected through the DataJud API provided by the National Council of Justice (CNJ). The benchmark covers five distinct legal areas and uses LLM-assisted labeling with heuristic validation to ensure data quality.
The benchmark addresses a critical gap in Portuguese legal natural language processing by providing standardized evaluation metrics for legal AI systems. Unlike existing legal benchmarks that focus primarily on English or other languages, LegalBench-BR specifically targets Brazilian legal terminology and procedural contexts.
What is new vs previous benchmarks
LegalBench-BR introduces the first Portuguese-language legal classification benchmark, filling a significant gap in multilingual legal AI evaluation.
| Feature | LegalBench-BR | Previous Legal Benchmarks |
|---|---|---|
| Language | Portuguese (Brazilian) | Primarily English |
| Legal System | Brazilian civil law | Common law systems |
| Data Source | Official court API (DataJud) | Various sources |
| Task Focus | 5-class legal area classification | Multiple legal tasks |
| Validation Method | LLM-assisted with heuristics | Human annotation primarily |
How does LegalBench-BR work
LegalBench-BR operates through a systematic data collection and evaluation pipeline designed for Brazilian legal text classification.
- Data Collection: Appellate proceedings are collected from Santa Catarina State Court via the official DataJud API
- Annotation Process: Legal texts are labeled across five categories using LLM-assisted annotation with heuristic validation
- Class Balancing: Test sets are balanced across legal categories to ensure fair evaluation metrics
- Model Evaluation: Performance is measured using accuracy and macro-F1 scores on classification tasks
- Benchmark Testing: Models are evaluated on their ability to distinguish between administrative, civil, criminal, tax, and labor law cases
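The two evaluation metrics named above can be sketched in plain Python. The label names and toy predictions below are illustrative only, not taken from the released dataset:

```python
LABELS = ["administrative", "civil", "criminal", "tax", "labor"]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels=LABELS):
    # Per-class F1, averaged with equal weight regardless of class frequency.
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: a model biased toward "civil" looks acceptable on accuracy
# but is punished by macro-F1, which weights rare classes equally.
y_true = ["administrative", "civil", "criminal", "tax", "labor", "civil"]
y_pred = ["civil", "civil", "criminal", "civil", "labor", "civil"]
print(round(accuracy(y_true, y_pred), 2))  # 0.67
print(round(macro_f1(y_true, y_pred), 2))  # 0.53
```

This gap between the two numbers is exactly why the paper reports macro-F1 alongside accuracy: it surfaces the civil-law bias that an accuracy figure alone would hide.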
Benchmarks and evidence
LegalBench-BR evaluation results demonstrate significant performance gaps between domain-adapted and general-purpose models on Brazilian legal classification.
| Model | Accuracy | Macro-F1 | Administrative Law F1 | Source |
|---|---|---|---|---|
| BERTimbau-LoRA | 87.6% | 0.87 | 0.91 | LegalBench-BR paper |
| Claude 3.5 Haiku | Not disclosed | 0.65 | 0.08 | LegalBench-BR paper |
| GPT-4o mini | Not disclosed | 0.59 | 0.00 | LegalBench-BR paper |
The fine-tuned BERTimbau-LoRA model scores 0.22 higher in macro-F1 than Claude 3.5 Haiku (22 percentage points) and 0.28 higher than GPT-4o mini. The gap is most pronounced in administrative law classification, where the commercial LLMs score at or near zero F1 while the specialized model reaches 0.91.
Who should care
Builders
AI developers working on legal technology applications for Brazilian markets need domain-specific benchmarks to evaluate model performance. LegalBench-BR provides standardized metrics for Portuguese legal NLP systems and demonstrates the necessity of fine-tuning for legal classification tasks.
Enterprise
Law firms and legal technology companies operating in Brazil require accurate automated document classification systems. The benchmark reveals that commercial LLMs cannot substitute for domain-adapted models in Brazilian legal contexts, informing technology procurement decisions.
End users
Legal professionals and researchers working with Brazilian court documents benefit from improved classification accuracy that specialized models provide. The benchmark enables better tooling for legal research and case management systems.
Investors
Venture capital and private equity firms evaluating legal AI startups can use LegalBench-BR results to assess technical capabilities and market positioning in Portuguese-speaking legal markets.
How to use LegalBench-BR today
LegalBench-BR is available as a complete research package including dataset, trained models, and evaluation pipeline.
- Download Dataset: Access the full 3,105 appellate proceedings dataset from the public release
- Load Pre-trained Model: Use the released BERTimbau-LoRA model for immediate Brazilian legal text classification
- Run Evaluation Pipeline: Execute the provided evaluation scripts to benchmark new models against established baselines
- Fine-tune Custom Models: Apply LoRA fine-tuning techniques to adapt models for specific legal domains
- Validate Results: Use the class-balanced test set to measure accuracy and macro-F1 performance
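The LoRA step above can be illustrated with a minimal numpy sketch of a single adapted layer. The hidden size matches a BERT-base-style projection, but the rank and scaling factor are assumed values; the 0.3% figure reported for BERTimbau-LoRA is a whole-model fraction that also depends on which modules were adapted, which the summary does not specify:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 4  # hidden size of a BERT-base-like projection; LoRA rank (assumed)

# Frozen pretrained weight plus two small trainable low-rank factors.
W = rng.standard_normal((d, d))          # frozen: d*d parameters
A = rng.standard_normal((r, d)) * 0.01   # trainable: r*d
B = np.zeros((d, r))                     # trainable: d*r, zero-init so W' == W at start

def lora_forward(x, alpha=8.0):
    # Adapted layer: W x + (alpha/r) * B (A x); only A and B receive gradients.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
assert np.allclose(lora_forward(x), W @ x)  # zero-init B: output unchanged before training

trainable = A.size + B.size
total = W.size + trainable
# Roughly 1% for this single layer; the whole-model fraction is smaller
# because most of the network (embeddings, MLPs) stays entirely frozen.
print(f"trainable fraction: {trainable / total:.4%}")
```

Merging the update back into `W` after training (`W + (alpha/r) * B @ A`) removes any inference-time overhead, which is one reason LoRA is the default choice for this kind of domain adaptation.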
LegalBench-BR vs competitors
LegalBench-BR addresses Portuguese legal classification while existing benchmarks focus on English legal tasks and broader evaluation scopes.
| Benchmark | Language | Task Type | Legal System | Dataset Size |
|---|---|---|---|---|
| LegalBench-BR | Portuguese | 5-class classification | Brazilian civil law | 3,105 cases |
| LegalBench | English | Multi-task evaluation | US common law | Not disclosed |
| LexGLUE | English | Multiple NLP tasks | EU/US law | Not disclosed |
Prior legal benchmarks typically evaluate classification and decision-making tasks with accuracy and F1, reserving F0.5 for imbalanced or precision-oriented tasks [3]. LegalBench-BR follows this convention, reporting accuracy and macro-F1 for its classification task.
Risks, limits, and myths
- Geographic Limitation: Dataset only covers Santa Catarina State Court, potentially limiting generalizability to other Brazilian jurisdictions
- Temporal Scope: Court proceedings represent a specific time period that may not reflect evolving legal language
- Class Imbalance: Real-world legal document distributions may differ from the balanced test set used for evaluation
- Annotation Quality: LLM-assisted labeling with heuristic validation may introduce systematic errors compared to expert human annotation
- Model Generalization: Fine-tuned models may overfit to specific legal document formats from the training court
- Commercial LLM Bias: Results show systematic bias toward civil law classification that may affect other legal domains
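The class-imbalance caveat above can be made concrete. Below is a minimal sketch, assuming labeled examples are available as `(text, label)` pairs, of drawing an equal number of test examples per legal area, the balancing strategy the benchmark is described as using; the function name and sizes are illustrative:

```python
import random

def balanced_test_split(examples, per_class, seed=42):
    """Draw `per_class` test examples from each label; the rest form the train pool.
    `examples` is a list of (text, label) pairs; the label set is inferred."""
    rng = random.Random(seed)
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex[1], []).append(ex)
    test, train = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        test.extend(items[:per_class])
        train.extend(items[per_class:])
    return train, test

# Illustrative: an imbalanced pool still yields an exactly balanced test set.
pool = [("doc", "civil")] * 60 + [("doc", "criminal")] * 25 + [("doc", "tax")] * 15
train, test = balanced_test_split(pool, per_class=10)
counts = {lbl: sum(1 for _, l in test if l == lbl) for lbl in {"civil", "criminal", "tax"}}
print(counts)  # each label appears exactly 10 times
```

Note the caveat this leaves in place: metrics computed on the balanced test set describe per-class capability, not expected accuracy on a real court docket whose label distribution is skewed.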
FAQ
What is LegalBench-BR and why was it created?
LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification, created to address the lack of Portuguese legal AI evaluation tools.
How many legal documents does LegalBench-BR contain?
LegalBench-BR contains 3,105 appellate proceedings from the Santa Catarina State Court collected through the official DataJud API.
Which legal areas does LegalBench-BR cover?
The benchmark covers five legal areas: administrative law, civil law, criminal law, tax law, and labor law classification tasks.
How does BERTimbau-LoRA perform compared to commercial LLMs?
BERTimbau-LoRA achieves 87.6% accuracy and 0.87 macro-F1, outperforming Claude 3.5 Haiku by 22 percentage points and GPT-4o mini by 28 percentage points in macro-F1.
Why do commercial LLMs perform poorly on administrative law classification?
GPT-4o mini scores F1 = 0.00 and Claude 3.5 Haiku scores F1 = 0.08 on administrative law, reflecting a systematic bias toward civil law categories in their predictions.
What is LoRA fine-tuning and how efficient is it?
LoRA (Low-Rank Adaptation) fine-tuning updates only 0.3% of model parameters while achieving significant performance improvements on domain-specific tasks.
Can general-purpose LLMs replace specialized legal models for Brazilian law?
No, the benchmark demonstrates that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification tasks.
How was the LegalBench-BR dataset annotated and validated?
Legal texts were labeled using LLM-assisted annotation combined with heuristic validation to ensure data quality across five legal categories.
Is LegalBench-BR available for public research use?
Yes, the full dataset, trained model, and evaluation pipeline are released publicly to enable reproducible research in Portuguese legal NLP.
What makes LegalBench-BR different from existing legal benchmarks?
LegalBench-BR is the first benchmark specifically designed for Portuguese Brazilian legal text classification, unlike existing English-focused legal evaluation datasets.
How does the benchmark handle class imbalance in legal document types?
LegalBench-BR uses a class-balanced test set to ensure fair evaluation metrics across all five legal area categories.
What are the implications for legal AI development in Brazil?
The benchmark shows that Brazilian legal AI applications require domain-specific fine-tuning rather than relying on general commercial LLM solutions.
Glossary
- BERTimbau
- Portuguese language version of BERT (Bidirectional Encoder Representations from Transformers) trained specifically for Brazilian Portuguese text processing
- DataJud API
- Official application programming interface provided by Brazil’s National Council of Justice (CNJ) for accessing court proceeding data
- LoRA (Low-Rank Adaptation)
- Parameter-efficient fine-tuning technique that updates only a small percentage of model parameters while maintaining performance
- Macro-F1
- Evaluation metric that calculates F1 score for each class separately then averages them, giving equal weight to all classes regardless of frequency
- TJSC
- Santa Catarina State Court (Tribunal de Justiça de Santa Catarina), the state-level judicial court system in Santa Catarina, Brazil
- Appellate Proceedings
- Legal cases that have been appealed from lower courts to higher courts for review of legal decisions
- Heuristic Validation
- Rule-based checking method used to verify the accuracy of automated annotations using predefined logical rules
- Domain Adaptation
- Process of modifying a general-purpose model to perform better on specific domain tasks through specialized training
Sources
- LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence — https://arxiv.org/html/2512.04578
- Benchmarking Vietnamese Legal Knowledge of Large Language Models — https://arxiv.org/html/2512.14554v5
- PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs — https://arxiv.org/html/2604.17543
- LLM Leaderboard 2026 — Compare Top AI Models – Vellum AI — https://www.vellum.ai/llm-leaderboard
- From Legal Text to Executable Decision Models: Evaluating Structured Representations for Legal Decision Model Generation — https://arxiv.org/html/2604.17153
- Professional Reasoning Benchmark – Legal – Scale Labs — https://labs.scale.com/leaderboard/prbench-legal
- Large language model – Wikipedia — https://en.wikipedia.org/wiki/Large_language_model