LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification, comprising 3,105 appellate proceedings from Santa Catarina State Court across five legal areas.
| Released by | Not yet disclosed |
|---|---|
| Release date | |
| What it is | First public benchmark for evaluating LLMs on Brazilian legal text classification |
| Who it’s for | AI researchers and legal technology developers |
| Where to get it | Full dataset and model released publicly |
| Price | Free |
- LegalBench-BR contains 3,105 Brazilian appellate proceedings classified across five legal areas
- BERTimbau-LoRA achieves 87.6% accuracy and 0.87 macro-F1, outperforming GPT-4o mini's 0.59 macro-F1 by 28 points while updating only 0.3% of model parameters
- Commercial LLMs show systematic bias toward civil law classification, with administrative law proving particularly challenging for general-purpose models
- Domain-adapted fine-tuning eliminates the classification failures that plague general-purpose LLMs
- The publicly released dataset and model enable reproducible research in Portuguese legal NLP
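The "0.3% of parameters" figure is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes a BERT-base-sized model (BERTimbau base: 12 layers, hidden size 768, roughly 110M parameters) and a rank-8 LoRA adapter on the query and value projections only; the paper's exact LoRA configuration is not stated here, so these settings are illustrative assumptions.

```python
# Back-of-envelope check of the "0.3% of parameters" claim.
# Assumptions (not from the source): BERT-base dimensions, rank-8
# adapters on the query and value projection matrices only.

layers = 12
hidden = 768
rank = 8
adapted_matrices = 2          # query and value projections per layer
total_params = 110_000_000    # approximate BERT-base parameter count

# Each adapted d x d weight gains two low-rank factors:
# A with shape (rank, d) and B with shape (d, rank).
lora_params = layers * adapted_matrices * (2 * hidden * rank)
fraction = lora_params / total_params

print(f"LoRA parameters: {lora_params:,}")   # 294,912
print(f"Fraction of model: {fraction:.2%}")  # 0.27%
```

Under these assumptions the trainable fraction lands at roughly 0.27%, consistent with the paper's reported ~0.3%.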
What is LegalBench-BR
LegalBench-BR is a benchmark dataset for evaluating large language models on Brazilian legal text classification tasks. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court collected via the DataJud API. Legal documents are annotated across five legal areas through LLM-assisted labeling with heuristic validation.
The benchmark addresses a gap in evaluation tools for Portuguese legal natural language processing. Its five classification categories cover major areas of Brazilian law, enabling systematic evaluation of model performance on domain-specific legal text understanding.
What is new vs previous benchmarks
LegalBench-BR introduces the first public benchmark specifically designed for Brazilian legal text classification.
| Feature | LegalBench-BR | Previous Legal Benchmarks |
|---|---|---|
| Language focus | Brazilian Portuguese | Primarily English |
| Legal system | Brazilian civil law | Common law systems |
| Data source | Santa Catarina State Court | Various international courts |
| Classification areas | 5 Brazilian legal domains | General legal categories |
| Annotation method | LLM-assisted with heuristic validation | Manual annotation |
How does LegalBench-BR work
LegalBench-BR evaluates models through a structured classification pipeline across five legal domains.
- Legal proceedings are collected from Santa Catarina State Court via DataJud API
- Documents undergo LLM-assisted annotation with heuristic validation for quality control
- Text is classified across five legal areas: administrative, civil, criminal, tax, and labor law
- Models are evaluated on a class-balanced test set using accuracy and macro-F1 metrics
- Performance comparison reveals domain adaptation effectiveness versus general-purpose models
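The two reported metrics can be sketched in a few lines of pure Python; the real evaluation presumably uses a standard library such as scikit-learn, and the label names and toy data below are illustrative.

```python
# Minimal accuracy and macro-F1, the two metrics LegalBench-BR reports.
# Macro-F1 averages per-class F1 with equal weight, so a class the
# model never predicts correctly drags the score down hard.

LABELS = ["administrative", "civil", "criminal", "tax", "labor"]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels=LABELS):
    scores = []
    for label in labels:
        tp = sum(t == p == label for t, p in zip(y_true, y_pred))
        fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)  # equal weight per class

# Toy example mimicking the civil-law bias described in the paper:
# the model over-predicts "civil" on ambiguous documents.
y_true = ["civil", "civil", "tax", "labor", "criminal", "administrative"]
y_pred = ["civil", "civil", "civil", "labor", "criminal", "civil"]
print(round(accuracy(y_true, y_pred), 3))  # 0.667
print(round(macro_f1(y_true, y_pred), 3))  # 0.533
```

Note how the toy classifier keeps a reasonable accuracy while macro-F1 collapses, because the administrative and tax classes each score an F1 of zero.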
Benchmarks and evidence
BERTimbau-LoRA demonstrates superior performance compared to commercial large language models on Brazilian legal classification.
| Model | Accuracy | Macro-F1 | Parameters Updated | Source |
|---|---|---|---|---|
| BERTimbau-LoRA | 87.6% | 0.87 | 0.3% | LegalBench-BR paper |
| Claude 3.5 Haiku | Not disclosed | 0.65 | N/A | LegalBench-BR paper |
| GPT-4o mini | Not disclosed | 0.59 | N/A | LegalBench-BR paper |
Administrative law classification reveals the largest performance gap between models. GPT-4o mini scores an F1 of 0.00 on administrative law, while BERTimbau-LoRA reaches 0.91. Commercial models exhibit a systematic bias toward civil law, absorbing ambiguous cases into that class rather than discriminating among categories.
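Because macro-F1 weights every class equally, a single failed class caps the overall score. The arithmetic below illustrates this with GPT-4o mini's reported administrative-law F1 of 0.00; the other per-class scores are invented for illustration, since only the administrative figure and the 0.59 aggregate are reported here.

```python
# How one failed class drags down macro-F1. Only the administrative
# score (0.00) is from the paper; the rest are illustrative numbers
# chosen to reproduce the reported 0.59 aggregate.

per_class_f1 = {
    "administrative": 0.00,  # reported for GPT-4o mini
    "civil": 0.75,           # illustrative
    "criminal": 0.70,        # illustrative
    "tax": 0.75,             # illustrative
    "labor": 0.75,           # illustrative
}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"macro-F1: {macro_f1:.2f}")  # 0.59
```

Even if the four remaining classes were classified perfectly, an administrative F1 of 0.00 would bound macro-F1 at 0.80.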
Who should care
Builders
AI developers building legal technology for Brazilian markets need domain-adapted models for accurate classification. LegalBench-BR provides an evaluation framework for Portuguese legal NLP applications and enables systematic comparison of model architectures on Brazilian legal text.
Enterprise
Law firms and legal technology companies require reliable classification systems for Brazilian legal documents. The benchmark demonstrates that general-purpose LLMs cannot substitute for domain-adapted models on legal classification tasks. Fine-tuning approaches offer superior performance at zero marginal inference cost.
End users
Legal professionals working with Brazilian court documents benefit from improved automated classification systems. The benchmark enables development of more accurate legal document processing tools. Users gain access to better legal technology through domain-specific model evaluation.
Investors
Legal technology investment decisions require understanding of model performance on domain-specific tasks. LegalBench-BR provides evidence that specialized fine-tuning outperforms general-purpose models significantly. The benchmark supports investment thesis for domain-adapted legal AI solutions.
How to use LegalBench-BR today
Researchers and developers can access LegalBench-BR through the publicly released dataset and model pipeline.
- Download the full dataset from the public repository containing 3,105 annotated legal proceedings
- Access the BERTimbau-LoRA model weights and training configuration files
- Install the evaluation pipeline using the provided Python scripts and dependencies
- Run benchmark evaluation on your models using the class-balanced test set
- Compare results against baseline performance metrics for accuracy and macro-F1 scores
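Step 4 above, scoring a model against the test set, can be sketched as follows. The file format, column names, and sample rows are assumptions for illustration; consult the released repository for the actual data layout, and replace the stub classifier with real model inference.

```python
import csv
import io

# Hypothetical sketch of benchmark evaluation. The CSV layout below
# (text,label columns) is an assumption, not the documented release
# format; the sample rows and the stub classifier are placeholders.

sample_csv = """text,label
Recurso sobre licitação pública,administrative
Ação de cobrança de dívida,civil
"""

def my_model(text):
    # Stand-in for your classifier; swap in real inference here.
    return "civil"

rows = list(csv.DictReader(io.StringIO(sample_csv)))
correct = sum(my_model(r["text"]) == r["label"] for r in rows)
print(f"accuracy: {correct / len(rows):.2f}")  # 0.50
```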
LegalBench-BR vs competitors
LegalBench-BR addresses Portuguese legal text classification while existing benchmarks focus on English legal tasks.
| Benchmark | Language | Legal System | Task Type | Dataset Size |
|---|---|---|---|---|
| LegalBench-BR | Portuguese | Brazilian civil law | Classification | 3,105 proceedings |
| LegalBench | English | US common law | Multi-task | Various sizes |
| LexGLUE | English | EU/US law | Multi-task | Various sizes |
Risks, limits, and myths
- The dataset is limited to Santa Catarina State Court proceedings and may not represent courts elsewhere in Brazil
- Five-class classification simplifies complex legal categorization that occurs in practice
- LLM-assisted annotation with heuristic validation may introduce systematic labeling biases
- Performance metrics focus on classification accuracy rather than legal reasoning quality
- Benchmark does not evaluate model performance on legal document generation or analysis
- Results may not generalize to other Portuguese-speaking legal systems outside Brazil
FAQ
What is LegalBench-BR and how does it work?
LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification, containing 3,105 court proceedings classified across five legal areas.
How does BERTimbau-LoRA compare to GPT-4o mini on Brazilian legal tasks?
BERTimbau-LoRA achieves 87.6% accuracy and 0.87 macro-F1, a 28-point macro-F1 lead over GPT-4o mini's 0.59 on Brazilian legal classification tasks.
Why do commercial LLMs perform poorly on Brazilian legal classification?
Commercial LLMs exhibit systematic bias toward civil law classification and fail to discriminate between legal categories, particularly struggling with administrative law classification.
What legal areas does LegalBench-BR cover?
LegalBench-BR covers five legal areas: administrative law, civil law, criminal law, tax law, and labor law from Brazilian court proceedings.
How much data does LegalBench-BR contain?
LegalBench-BR contains 3,105 appellate proceedings from the Santa Catarina State Court collected via the DataJud API with LLM-assisted annotation.
Can I access LegalBench-BR dataset and models?
Yes, the full dataset, BERTimbau-LoRA model, and evaluation pipeline are released publicly to enable reproducible research in Portuguese legal NLP.
What makes LoRA fine-tuning effective for legal classification?
LoRA fine-tuning updates only 0.3% of model parameters while achieving superior performance and eliminating classification failures at zero marginal inference cost.
How does LegalBench-BR compare to other legal AI benchmarks?
LegalBench-BR is the first benchmark specifically designed for Brazilian Portuguese legal text, while existing benchmarks like LegalBench and LexGLUE focus on English legal tasks.
What evaluation metrics does LegalBench-BR use?
LegalBench-BR evaluates models using accuracy and macro-F1 scores on a class-balanced test set across five Brazilian legal classification categories.
Who should use LegalBench-BR for AI development?
AI researchers, legal technology developers, and companies building Portuguese legal NLP applications should use LegalBench-BR for systematic model evaluation and comparison.
Glossary
- BERTimbau-LoRA
- Portuguese BERT model fine-tuned using Low-Rank Adaptation technique for Brazilian legal text classification
- DataJud API
- Brazilian National Council of Justice API for accessing court proceeding data from state courts
- Macro-F1
- Evaluation metric calculating F1 score for each class separately then averaging, giving equal weight to all classes
- LoRA (Low-Rank Adaptation)
- Parameter-efficient fine-tuning technique that updates only a small percentage of model parameters while maintaining performance
- Santa Catarina State Court (TJSC)
- Brazilian state court system serving Santa Catarina state, source of legal proceedings in LegalBench-BR dataset
- Heuristic validation
- Rule-based quality control method used to verify accuracy of LLM-assisted annotation in dataset creation
Sources
- LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification. arXiv:2604.18878v1.