LegalBench-BR is the first public benchmark for evaluating language models on Brazilian legal text classification. It comprises 3,105 appellate proceedings from the Santa Catarina State Court spanning five legal areas, on which BERTimbau-LoRA achieves 87.6% accuracy.
| Released by | Not yet disclosed |
|---|---|
| Release date | Not yet disclosed |
| What it is | First public benchmark for Brazilian legal text classification |
| Who it is for | Legal AI researchers and Portuguese NLP developers |
| Where to get it | Full dataset and model released publicly |
| Price | Free |
- LegalBench-BR contains 3,105 appellate proceedings from Santa Catarina State Court collected via DataJud API
- BERTimbau-LoRA achieves 87.6% accuracy with only 0.3% parameter updates, outperforming commercial LLMs by 22-28 percentage points
- GPT-4o mini and Claude 3.5 Haiku show systematic bias toward civil law classification
- Fine-tuned models eliminate commercial LLM failure modes in administrative law classification
- Complete dataset, model, and pipeline released for reproducible Portuguese legal NLP research
- Domain-adapted fine-tuning significantly outperforms general-purpose LLMs on Brazilian legal classification tasks
- Commercial LLMs exhibit systematic classification bias that fine-tuning eliminates
- LoRA fine-tuning provides efficient parameter updates with zero marginal inference cost
- Administrative law classification proves particularly challenging for general-purpose models
What is LegalBench-BR
LegalBench-BR is the first public benchmark specifically designed for evaluating language models on Brazilian legal text classification tasks. The benchmark comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC) collected through the DataJud API from Brazil’s National Council of Justice (CNJ). The dataset covers five distinct legal areas and uses LLM-assisted labeling with heuristic validation for annotation quality.
What is new vs previous benchmarks
LegalBench-BR introduces the first Portuguese-language legal classification benchmark, addressing a gap in existing legal AI evaluation tools.
| Feature | LegalBench-BR | Previous Legal Benchmarks |
|---|---|---|
| Language | Portuguese (Brazilian) | Primarily English |
| Legal System | Brazilian civil law | Common law systems |
| Data Source | Santa Catarina State Court | Various international courts |
| Classification Areas | 5 Brazilian legal domains | General legal categories |
| Annotation Method | LLM-assisted with heuristic validation | Manual expert annotation |
How does LegalBench-BR work
LegalBench-BR operates through a systematic evaluation framework for Brazilian legal text classification.
- Data collection from Santa Catarina State Court via DataJud API provides 3,105 appellate proceedings
- LLM-assisted labeling with heuristic validation annotates texts across five legal areas
- Class-balanced test set ensures fair evaluation across all legal domains
- BERTimbau-LoRA fine-tuning updates only 0.3% of model parameters for efficient adaptation
- Evaluation metrics include accuracy and macro-F1 scores for comprehensive performance assessment
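The "0.3% of parameters" figure can be reproduced with simple arithmetic under assumed settings: BERT-base-scale dimensions for BERTimbau (hidden size 768, 12 layers, roughly 110M parameters) and a hypothetical rank-8 LoRA adapter on the query and value projections. The paper's exact configuration may differ; this is a sketch of the parameter budget, not the authors' setup.

```python
# Hypothetical LoRA parameter budget for a BERT-base-scale model.
# All dimensions below are assumptions, not the paper's reported config.
hidden, rank = 768, 8              # hidden size and assumed LoRA rank
layers, adapted_matrices = 12, 2   # e.g. query and value projections per layer
total_params = 110_000_000         # approximate BERT-base parameter count

# Each adapted weight W (hidden x hidden) gains two trainable factors:
# B (hidden x rank) and A (rank x hidden), while W itself stays frozen.
per_matrix = 2 * hidden * rank
trainable = layers * adapted_matrices * per_matrix

print(f"trainable: {trainable:,} ({trainable / total_params:.2%} of the model)")
```

Under these assumptions the trainable share lands at roughly 0.27%, which is consistent with the reported "only 0.3% parameter updates".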
Benchmarks and evidence
Performance evaluations demonstrate significant advantages of domain-adapted models over general-purpose LLMs on Brazilian legal classification.
| Model | Accuracy | Macro-F1 | Administrative Law F1 | Source |
|---|---|---|---|---|
| BERTimbau-LoRA | 87.6% | 0.87 | 0.91 | LegalBench-BR paper |
| Claude 3.5 Haiku | Not disclosed | 0.65 | 0.08 | LegalBench-BR paper |
| GPT-4o mini | Not disclosed | 0.59 | 0.00 | LegalBench-BR paper |
Who should care
Builders
Legal AI developers building Portuguese-language applications need domain-specific benchmarks for accurate model evaluation. LegalBench-BR provides the first standardized evaluation framework for Brazilian legal text classification, enabling developers to measure model performance against established baselines.
Enterprise
Law firms and legal technology companies operating in Brazil require specialized AI models for document classification and case management. The benchmark demonstrates that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal contexts, informing technology investment decisions.
End users
Legal professionals and researchers working with Brazilian court documents benefit from improved AI classification accuracy. The benchmark’s findings show that fine-tuned models eliminate systematic biases present in commercial LLMs, leading to more reliable legal document processing.
Investors
Venture capital and legal tech investors can assess the competitive landscape for Portuguese legal AI solutions. The benchmark reveals significant performance gaps between general-purpose and specialized models, indicating market opportunities for domain-specific legal AI development.
How to use LegalBench-BR today
Researchers and developers can access LegalBench-BR through the publicly released dataset and evaluation pipeline.
- Download the complete dataset from the public repository containing 3,105 annotated legal proceedings
- Install the provided evaluation pipeline with preprocessing and classification scripts
- Load the BERTimbau-LoRA model weights for baseline comparison testing
- Run evaluation scripts on the class-balanced test set using accuracy and macro-F1 metrics
- Compare custom model performance against established BERTimbau-LoRA, Claude 3.5 Haiku, and GPT-4o mini baselines
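The comparison step above reduces to computing accuracy and macro-F1 over the test set. A minimal pure-Python sketch follows; the label names and toy predictions are illustrative, not the dataset's actual label set or the released pipeline's API:

```python
def macro_f1(y_true, y_pred, labels):
    """Per-class F1 averaged with equal weight for every class."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)

# Toy example with hypothetical area labels (not the benchmark's real classes):
areas = ["civil", "administrative", "criminal", "labor", "tax"]
y_true = ["civil", "administrative", "criminal", "labor", "tax", "civil"]
y_pred = ["civil", "civil", "criminal", "labor", "tax", "civil"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}  macro-F1={macro_f1(y_true, y_pred, areas):.2f}")
```

Note how the single "absorbed" administrative case drags macro-F1 (0.76) well below accuracy (0.83): macro-F1 penalizes exactly the civil-law-absorption bias the benchmark reports for commercial LLMs.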
LegalBench-BR vs competitors
LegalBench-BR addresses Portuguese legal classification while existing benchmarks focus on English-language legal tasks.
| Benchmark | Language | Task Type | Legal System | Dataset Size |
|---|---|---|---|---|
| LegalBench-BR | Portuguese | Classification | Brazilian civil law | 3,105 proceedings |
| LegalBench | English | Multi-task | US common law | Not disclosed |
| LexGLUE | English | Multi-task | EU/US law | Not disclosed |
| CaseHOLD | English | Classification | US common law | Not disclosed |
Risks, limits, and myths
- Dataset limited to Santa Catarina State Court proceedings may not generalize to other Brazilian jurisdictions
- Five-class classification scope excludes more granular legal subcategories and specialized domains
- LLM-assisted annotation with heuristic validation may introduce systematic labeling biases
- Performance metrics focus on classification accuracy without evaluating legal reasoning quality
- Benchmark does not address multilingual legal documents or cross-jurisdictional legal analysis
- Fine-tuning requirements may limit accessibility for researchers without computational resources
FAQ
What makes LegalBench-BR different from other legal AI benchmarks?
LegalBench-BR is the first public benchmark specifically designed for Brazilian legal text classification, using Portuguese-language appellate proceedings from Santa Catarina State Court across five legal areas.
How accurate is BERTimbau-LoRA compared to GPT-4o mini on Brazilian legal texts?
BERTimbau-LoRA achieves 87.6% accuracy and 0.87 macro-F1, outperforming GPT-4o mini by 28 percentage points while updating only 0.3% of model parameters.
Why do commercial LLMs perform poorly on administrative law classification?
GPT-4o mini scores F1 = 0.00 and Claude 3.5 Haiku scores F1 = 0.08 on administrative law because of a systematic bias toward civil law classification: the models absorb ambiguous cases into the civil law class rather than discriminating between classes.
What data source does LegalBench-BR use for legal proceedings?
LegalBench-BR uses 3,105 appellate proceedings from Santa Catarina State Court (TJSC) collected via the DataJud API from Brazil’s National Council of Justice (CNJ).
How does LoRA fine-tuning compare to full model training for legal classification?
LoRA fine-tuning updates only 0.3% of BERTimbau parameters while achieving 87.6% accuracy, providing efficient domain adaptation with zero marginal inference cost compared to full model retraining.
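The "zero marginal inference cost" claim follows from LoRA's structure: after training, the low-rank factors can be merged into the frozen weight once, so serving uses a single dense matmul just like the base model. A toy numeric check under made-up dimensions (not the paper's code):

```python
import numpy as np

# Toy illustration of merging a LoRA adapter into the base weight.
rng = np.random.default_rng(0)
d, r = 16, 2                             # toy hidden size and LoRA rank
W = rng.standard_normal((d, d))          # frozen pretrained weight
B = rng.standard_normal((d, r)) * 0.1    # trained up-projection factor
A = rng.standard_normal((r, d)) * 0.1    # trained down-projection factor
x = rng.standard_normal(d)               # an input activation

y_branch = W @ x + B @ (A @ x)           # training-time forward: extra branch
W_merged = W + B @ A                     # one-time merge for deployment
y_merged = W_merged @ x                  # deployment forward: no extra cost

assert np.allclose(y_branch, y_merged)
```

Because `W_merged` has the same shape as `W`, the merged model's latency and memory at inference are identical to the base model's.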
Can I use LegalBench-BR for legal systems outside Brazil?
LegalBench-BR is designed specifically for the Brazilian civil law system and the Portuguese language, limiting direct applicability to other legal systems without adaptation.
What annotation method does LegalBench-BR use for labeling legal texts?
LegalBench-BR employs LLM-assisted labeling with heuristic validation to annotate 3,105 legal proceedings across five legal areas for classification tasks.
Is the LegalBench-BR dataset available for commercial use?
The researchers release the full dataset, model, and pipeline publicly to enable reproducible research in Portuguese legal NLP, though specific licensing terms are not yet disclosed.
How many legal areas does LegalBench-BR cover for classification?
LegalBench-BR covers five distinct legal areas for classification tasks, including administrative law, civil law, and three additional legal domains from Brazilian jurisprudence.
What evaluation metrics does LegalBench-BR use for model comparison?
LegalBench-BR uses accuracy and macro-F1 scores on a class-balanced test set to evaluate model performance across five legal classification categories.
Glossary
- BERTimbau
- Portuguese-language BERT model specifically trained on Brazilian Portuguese texts for natural language processing tasks
- DataJud API
- Application programming interface provided by Brazil’s National Council of Justice (CNJ) for accessing court proceedings and legal documents
- LoRA (Low-Rank Adaptation)
- Parameter-efficient fine-tuning method that updates only a small percentage of model parameters while maintaining performance
- Macro-F1
- Evaluation metric that calculates F1 score for each class separately then averages them, giving equal weight to all classes regardless of frequency
- Santa Catarina State Court (TJSC)
- State-level judicial court in Santa Catarina, Brazil, providing appellate proceedings for the LegalBench-BR dataset
- Systematic bias
- Consistent tendency of a model to favor certain classifications over others, leading to predictable errors in specific categories
Sources
- LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification. arXiv:2604.18878v1.
- LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence. arXiv:2512.04578.
- Benchmarking Vietnamese Legal Knowledge of Large Language Models. arXiv:2512.14554v5.
- PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs. arXiv:2604.17543.
- Professional Reasoning Benchmark – Legal. Scale Labs.