LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification. It comprises 3,105 appellate proceedings from the Santa Catarina State Court spanning five legal areas, and its results show specialized fine-tuned models significantly outperforming general-purpose LLMs.
| Released by | Not yet disclosed |
|---|---|
| Release date | |
| What it is | First public benchmark for Brazilian legal text classification |
| Who it is for | Legal AI researchers and developers |
| Where to get it | arXiv paper with dataset release |
| Price | Free |
- LegalBench-BR contains 3,105 appellate proceedings from Santa Catarina State Court collected via DataJud API
- BERTimbau-LoRA achieves 87.6% accuracy while updating only 0.3% of model parameters
- GPT-4o mini and Claude 3.5 Haiku show systematic bias toward civil law classification
- Fine-tuned models outperform commercial LLMs by 22-28 percentage points on macro-F1 scores
- Dataset covers five legal areas with LLM-assisted labeling and heuristic validation
- Domain-adapted fine-tuning eliminates systematic classification biases present in general-purpose LLMs
- LoRA fine-tuning achieves expert-level performance while updating minimal model parameters
- Commercial LLMs cannot substitute for specialized models in Brazilian legal classification tasks
- Administrative law classification proves most challenging for general-purpose models
- Fine-tuning on a consumer GPU closes the performance gap, and the resulting model runs at zero marginal inference cost
What is LegalBench-BR
LegalBench-BR is the first public benchmark specifically designed for evaluating large language models on Brazilian legal text classification tasks. The benchmark comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC) collected through the DataJud API provided by the National Council of Justice (CNJ). The dataset covers five distinct legal areas and uses LLM-assisted labeling combined with heuristic validation to ensure annotation quality.
The benchmark addresses a critical gap in Portuguese legal natural language processing research. Unlike existing legal benchmarks that focus primarily on English or other languages, LegalBench-BR provides researchers with a standardized evaluation framework for Brazilian legal AI applications. The dataset includes proceedings annotated across civil, criminal, administrative, labor, and tax law categories.
What is new vs previous benchmarks
LegalBench-BR introduces several novel features compared to existing legal AI benchmarks:
| Feature | LegalBench-BR | Previous Benchmarks |
|---|---|---|
| Language focus | Brazilian Portuguese | Primarily English |
| Data source | Real court proceedings via DataJud API | Various legal texts |
| Annotation method | LLM-assisted with heuristic validation | Manual or rule-based |
| Legal areas | 5 specific Brazilian law categories | General legal domains |
| Model evaluation | Includes LoRA fine-tuning analysis | Standard fine-tuning approaches |
How does LegalBench-BR work
LegalBench-BR operates through a systematic evaluation framework for Brazilian legal text classification:
- Data collection: Researchers gather 3,105 appellate proceedings from TJSC using the official DataJud API
- Annotation process: Legal texts receive labels across five categories using LLM-assisted annotation with heuristic validation
- Class balancing: The test set maintains balanced representation across all five legal areas
- Model evaluation: Benchmark measures accuracy and macro-F1 scores on classification tasks
- Performance analysis: Results compare domain-adapted models against general-purpose LLMs
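The evaluation step above boils down to two metrics, accuracy and macro-F1, sketched here in plain Python. The label set matches the benchmark's five areas, but the example predictions are made up for illustration and are not real benchmark data:

```python
# Minimal sketch of the benchmark's two reported metrics.
LABELS = ["civil", "criminal", "administrative", "labor", "tax"]

def accuracy(y_true, y_pred):
    """Fraction of cases classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Compute F1 per class, then average with equal weight, so a rare
    area (e.g. administrative law) counts as much as a common one."""
    f1s = []
    for label in LABELS:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Illustrative toy predictions: one administrative case mislabeled as civil.
y_true = ["civil", "criminal", "administrative", "labor", "tax", "civil"]
y_pred = ["civil", "criminal", "civil", "labor", "tax", "civil"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

On a class-balanced test set, accuracy and macro-F1 track each other closely; on imbalanced data, macro-F1 is the stricter of the two because missed minority classes drag the average down.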
Benchmarks and evidence
LegalBench-BR demonstrates significant performance gaps between specialized and general-purpose models:
| Model | Accuracy | Macro-F1 | Parameters Updated |
|---|---|---|---|
| BERTimbau-LoRA | 87.6% | 0.87 | 0.3% |
| Claude 3.5 Haiku | Not reported | 0.65 | N/A |
| GPT-4o mini | Not reported | 0.59 | N/A |
The performance gap proves most striking in administrative law classification. GPT-4o mini achieves F1 = 0.00 on administrative law cases, while Claude 3.5 Haiku scores F1 = 0.08. The fine-tuned BERTimbau model reaches F1 = 0.91 on the same category, demonstrating the critical importance of domain adaptation for Brazilian legal texts.
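The civil-law bias has a simple mechanical illustration: a classifier that defaults to "civil" whenever it is unsure still gets partial credit on accuracy-style metrics, but scores F1 = 0.00 on the class it never predicts, which per-class F1 exposes. A toy sketch with made-up predictions (not the benchmark's actual model outputs):

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, from its true/false positives and false negatives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Balanced toy test set: 4 cases per area (three areas shown for brevity).
y_true = ["civil"] * 4 + ["criminal"] * 4 + ["administrative"] * 4
# A biased model that labels every administrative case as "civil".
y_pred = ["civil"] * 4 + ["criminal"] * 4 + ["civil"] * 4

print(per_class_f1(y_true, y_pred, "administrative"))  # 0.0
print(per_class_f1(y_true, y_pred, "civil"))  # recall 1.0, precision 0.5 -> 2/3
```

The biased model is still right on 8 of 12 cases, yet its administrative-law F1 collapses to zero, mirroring the GPT-4o mini result reported above.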
Who should care
Builders
AI developers building legal technology for Brazilian markets need domain-specific benchmarks to validate their models. LegalBench-BR provides the first standardized evaluation framework for Portuguese legal NLP applications, enabling developers to measure performance against established baselines.
Enterprise
Law firms and legal service providers can use LegalBench-BR results to inform technology adoption decisions. The benchmark demonstrates that general-purpose LLMs cannot substitute for specialized models in Brazilian legal classification, guiding investment in domain-adapted solutions.
End users
Legal professionals working with Brazilian court systems benefit from improved AI tools validated against real court proceedings. The benchmark ensures legal AI applications meet professional standards for accuracy and reliability in Portuguese legal contexts.
Investors
Venture capital and legal tech investors can use LegalBench-BR performance metrics to evaluate startup claims about Brazilian legal AI capabilities. The benchmark provides objective performance standards for due diligence processes.
How to use LegalBench-BR today
Researchers and developers can access LegalBench-BR through the following steps:
- Download the dataset: Access the full dataset from the arXiv paper release
- Install dependencies: Set up the provided pipeline with required Python libraries
- Load the model: Use the released BERTimbau-LoRA model for baseline comparisons
- Run evaluation: Execute the benchmark pipeline on your models using the class-balanced test set
- Compare results: Measure accuracy and macro-F1 scores against established baselines
LegalBench-BR vs competitors
LegalBench-BR compares against other legal AI benchmarks in the following areas:
| Benchmark | Language | Task Type | Dataset Size | Legal Domain |
|---|---|---|---|---|
| LegalBench-BR | Portuguese | Classification | 3,105 cases | Brazilian law |
| LexGLUE | English | Multi-task | Various | General legal |
| LegalBench | English | Reasoning | Various | US law |
Risks, limits, and myths
- Geographic limitation: Dataset focuses solely on Santa Catarina State Court proceedings, potentially limiting generalizability
- Language specificity: Results apply specifically to Brazilian Portuguese legal texts, not other Portuguese variants
- Classification scope: Benchmark covers only five legal areas, excluding specialized domains like intellectual property
- Temporal constraints: Court proceedings reflect specific time periods and may not capture evolving legal language
- Annotation bias: LLM-assisted labeling may introduce systematic biases despite heuristic validation
- Model dependency: Performance results depend on specific model architectures and training procedures
FAQ
What is LegalBench-BR and why was it created?
LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification, created to address the lack of Portuguese legal AI evaluation frameworks.
How many legal cases does LegalBench-BR contain?
LegalBench-BR contains 3,105 appellate proceedings from the Santa Catarina State Court collected through the DataJud API.
Which legal areas does LegalBench-BR cover?
The benchmark covers five legal areas: civil law, criminal law, administrative law, labor law, and tax law.
How does BERTimbau-LoRA perform compared to commercial LLMs?
BERTimbau-LoRA achieves 87.6% accuracy and 0.87 macro-F1, outperforming Claude 3.5 Haiku by 22 and GPT-4o mini by 28 macro-F1 percentage points.
Why do commercial LLMs perform poorly on administrative law cases?
GPT-4o mini scores F1 = 0.00 and Claude 3.5 Haiku scores F1 = 0.08 on administrative law due to systematic bias toward civil law classification.
What percentage of model parameters does LoRA fine-tuning update?
LoRA fine-tuning updates only 0.3% of model parameters while achieving superior performance on Brazilian legal classification tasks.
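A back-of-the-envelope check makes the 0.3% figure plausible. For a weight matrix of shape d × k, LoRA trains two low-rank factors totaling r × (d + k) parameters instead of the full d × k. The dimensions, rank, and target modules below are assumed BERT-base-like and are not the paper's reported configuration:

```python
def lora_param_fraction(d, k, r, n_matrices, total_params):
    """Fraction of parameters trained when LoRA adapts n_matrices
    weight matrices of shape d x k with rank r: each adapter adds
    r * (d + k) trainable weights."""
    lora_params = n_matrices * r * (d + k)
    return lora_params / total_params

# Assumptions: 12 layers, 768-dim hidden size, adapting the query and
# value projections (2 matrices per layer) with rank r = 8, on a model
# of roughly 110M parameters.
fraction = lora_param_fraction(d=768, k=768, r=8, n_matrices=24,
                               total_params=110_000_000)
print(f"{fraction:.2%} of parameters trained")  # prints "0.27% of parameters trained"
```

Under these illustrative assumptions the trained fraction lands near 0.3%, consistent with the figure reported for BERTimbau-LoRA.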
Can general-purpose LLMs substitute for domain-adapted models in legal tasks?
No, the benchmark demonstrates that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification, even for simple 5-class problems.
How was the LegalBench-BR dataset annotated?
The dataset uses LLM-assisted labeling combined with heuristic validation to ensure annotation quality across the five legal categories.
What is the cost of using fine-tuned models versus commercial LLMs?
LoRA fine-tuning on consumer GPUs closes the performance gap at zero marginal inference cost compared to commercial LLM APIs.
Where can researchers access the LegalBench-BR dataset?
Researchers can access the full dataset, model, and pipeline through the arXiv paper release to enable reproducible research.
Glossary
- BERTimbau
- Portuguese language model based on BERT architecture, specifically trained on Brazilian Portuguese texts
- DataJud API
- Official API provided by Brazil’s National Council of Justice for accessing court proceeding data
- LoRA (Low-Rank Adaptation)
- Parameter-efficient fine-tuning technique that updates only a small percentage of model parameters
- Macro-F1
- Evaluation metric that calculates F1 score for each class separately then averages them, giving equal weight to all classes
- TJSC
- Santa Catarina State Court (Tribunal de Justiça de Santa Catarina) in Brazil
- Heuristic validation
- Rule-based verification process used to check the quality of automated annotations
- Appellate proceedings
- Legal cases that have been appealed to a higher court for review of lower court decisions