LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification, comprising 3,105 appellate proceedings from Santa Catarina State Court across five legal areas.
| Released by | Not yet disclosed |
|---|---|
| Release date | |
| What it is | First public benchmark for evaluating LLMs on Brazilian legal text classification |
| Who it is for | Researchers developing Portuguese legal NLP models |
| Where to get it | Full dataset and model released publicly |
| Price | Free |
- LegalBench-BR contains 3,105 appellate proceedings from Santa Catarina State Court collected via DataJud API
- BERTimbau-LoRA achieves 87.6% accuracy with only 0.3% parameter updates, outperforming commercial LLMs by 22-28 percentage points
- GPT-4o mini and Claude 3.5 Haiku show systematic bias toward civil law classification, failing on administrative law cases
- Fine-tuned, domain-adapted models avoid the classification bias that general-purpose LLMs exhibit on Brazilian legal text
- Dataset covers five legal areas with LLM-assisted labeling and heuristic validation for quality assurance
- Domain-adapted fine-tuning significantly outperforms general-purpose LLMs on Brazilian legal text classification tasks
- Commercial LLMs exhibit systematic classification bias, particularly struggling with administrative law cases
- LoRA fine-tuning provides efficient parameter updates with zero marginal inference cost for legal domain adaptation
- Brazilian legal NLP requires specialized models rather than relying on general-purpose language models
- Public release enables reproducible research in Portuguese legal natural language processing
What is LegalBench-BR
LegalBench-BR is the first public benchmark specifically designed for evaluating large language models on Brazilian legal text classification tasks. The benchmark comprises 3,105 appellate proceedings collected from the Santa Catarina State Court (TJSC) through the DataJud API provided by the National Council of Justice (CNJ). The dataset covers five distinct legal areas and uses LLM-assisted labeling combined with heuristic validation to ensure annotation quality. This benchmark addresses the gap in Portuguese legal NLP evaluation tools, providing researchers with a standardized dataset for developing and testing legal AI models in the Brazilian context.
What is new vs the previous version
LegalBench-BR represents the first benchmark of its kind for Brazilian legal text, with no previous version existing.
| Aspect | Previous State | LegalBench-BR |
|---|---|---|
| Brazilian legal benchmarks | None available publicly | First public benchmark with 3,105 proceedings |
| Data source | No standardized collection | Santa Catarina State Court via DataJud API |
| Legal areas covered | No systematic coverage | Five legal areas with balanced classification |
| Annotation method | Manual annotation only | LLM-assisted labeling with heuristic validation |
| Model evaluation | Ad-hoc testing | Standardized benchmark with reproducible pipeline |
How does LegalBench-BR work
LegalBench-BR operates through a systematic data collection and evaluation pipeline for Brazilian legal text classification.
- Data Collection: Appellate proceedings are collected from Santa Catarina State Court using the DataJud API, ensuring standardized access to legal documents.
- Annotation Process: Legal documents are labeled across five legal areas using LLM-assisted annotation combined with heuristic validation rules for quality control.
- Dataset Preparation: The 3,105 proceedings are organized into a class-balanced test set to ensure fair evaluation across all legal categories.
- Model Evaluation: Language models are tested on classification accuracy and macro-F1 scores, with particular attention to performance across different legal domains.
- Fine-tuning Pipeline: The benchmark includes a complete pipeline for domain adaptation using LoRA (Low-Rank Adaptation) techniques on consumer hardware.
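The evaluation step above can be sketched in plain Python. Macro-F1 averages per-class F1 scores with equal weight, so a model that collapses one legal area into another is penalized even when overall accuracy looks respectable. The labels below are invented toy data for illustration, not the actual benchmark split.

```python
def macro_f1(y_true, y_pred, labels):
    """Average per-class F1, weighting every class equally."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy five-area example: every administrative case is misfiled as civil.
areas = ["civil", "administrative", "criminal", "consumer", "tax"]
y_true = ["civil", "administrative", "criminal", "consumer", "tax",
          "civil", "administrative", "criminal", "consumer", "tax"]
y_pred = ["civil", "civil", "criminal", "consumer", "tax",
          "civil", "civil", "criminal", "consumer", "tax"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}, macro-F1={macro_f1(y_true, y_pred, areas):.2f}")
# accuracy=0.80, macro-F1=0.73
```

Because the administrative class scores F1 = 0 in this toy run, macro-F1 (0.73) falls well below accuracy (0.80); this is exactly why the benchmark's macro-F1 column exposes the administrative-law failures that a headline accuracy number would hide.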
Benchmarks and evidence
LegalBench-BR evaluation reveals significant performance gaps between domain-adapted and general-purpose models on Brazilian legal classification tasks.
| Model | Accuracy | Macro-F1 | Administrative Law F1 | Parameters Updated |
|---|---|---|---|---|
| BERTimbau-LoRA | 87.6% | 0.87 | 0.91 | 0.3% |
| Claude 3.5 Haiku | 65.6% | 0.65 | 0.08 | N/A |
| GPT-4o mini | 59.6% | 0.59 | 0.00 | N/A |
Classification and decision-making tasks are evaluated using accuracy and F1 metrics, following established legal benchmark practice [3]. BERTimbau-LoRA achieves a macro-F1 22 percentage points higher than Claude 3.5 Haiku and 28 percentage points higher than GPT-4o mini. The gap is most pronounced in administrative law, where the commercial LLMs fail almost completely while the fine-tuned model reaches an F1 of 0.91.
Who should care
Builders
AI developers building legal technology solutions for Brazilian markets should prioritize domain-specific fine-tuning over general-purpose LLMs. LegalBench-BR provides the training data and evaluation framework necessary for developing accurate legal classification systems. The LoRA fine-tuning approach enables efficient model adaptation on consumer hardware, making legal AI development more accessible to smaller teams and startups.
Enterprise
Legal firms and corporate legal departments handling Brazilian cases can leverage LegalBench-BR to evaluate and improve their document classification systems. The benchmark reveals that commercial LLMs exhibit systematic bias toward civil law cases, potentially misclassifying administrative and other legal areas. Organizations should invest in domain-adapted models rather than relying solely on general-purpose AI services for legal document processing.
End users
Legal professionals and researchers working with Brazilian court documents benefit from more accurate automated classification systems developed using this benchmark. The improved accuracy in administrative law cases particularly helps practitioners who previously faced poor AI performance in this domain. Citizens accessing legal services may experience better document routing and case categorization as legal tech companies adopt benchmark-validated models.
Investors
Venture capital and legal tech investors should recognize the significant performance advantages of domain-adapted models over general-purpose LLMs in specialized legal markets. The 22-28 percentage point performance gap represents substantial commercial opportunity for companies developing Portuguese legal AI solutions. Investment in legal AI startups should prioritize teams with domain-specific training capabilities rather than those relying solely on commercial LLM APIs.
How to use LegalBench-BR today
Researchers and developers can immediately access LegalBench-BR through its public release, which includes the complete dataset, trained models, and evaluation pipeline.
- Download Dataset: Access the full 3,105 appellate proceedings dataset from the public repository with pre-processed text and annotations.
- Load Pre-trained Model: Use the released BERTimbau-LoRA model for immediate Brazilian legal text classification without additional training.
- Run Evaluation Pipeline: Execute the provided evaluation scripts to reproduce benchmark results and test custom models against the standardized test set.
- Fine-tune Custom Models: Adapt the LoRA training pipeline for specific legal domains or additional Portuguese legal datasets.
- Integrate Classification API: Deploy the trained model as a classification service for legal document processing applications.
LegalBench-BR vs competitors
LegalBench-BR addresses a unique gap in Portuguese legal NLP evaluation, with limited direct competitors in the Brazilian legal domain.
| Benchmark | Language | Legal Domain | Dataset Size | Task Type | Public Access |
|---|---|---|---|---|---|
| LegalBench-BR | Portuguese | Brazilian law | 3,105 proceedings | Classification | Yes |
| LegalBench | English | US law | Multiple tasks | Multi-task | Yes |
| LexGLUE | English | EU/US law | Multiple datasets | Multi-task | Yes |
Existing legal benchmarks focus primarily on English-language legal systems. Recently, a series of legal benchmarks has emerged for evaluating LLM performance, covering retrieval (STARD, LeCaRD), question answering (JEC-QA, Legal CQA), classification (LexGLUE), and reasoning (LegalBench, LexEval) [1]. LegalBench-BR uniquely addresses Portuguese legal text classification, filling a critical gap for Brazilian legal AI development.
Risks, limits, and myths
- Geographic Limitation: Dataset focuses solely on Santa Catarina State Court, potentially limiting generalizability to other Brazilian jurisdictions with different legal practices.
- Classification Scope: Benchmark covers only five legal areas, excluding specialized domains like tax law, environmental law, or intellectual property that may require different classification approaches.
- Temporal Bias: Legal proceedings reflect specific time periods and may not capture evolving legal language or recent legislative changes affecting classification accuracy.
- Annotation Quality: LLM-assisted labeling with heuristic validation may introduce systematic errors despite quality control measures, affecting benchmark reliability.
- Hardware Requirements: While LoRA enables consumer GPU training, achieving optimal performance still requires significant computational resources for large-scale deployment.
- Myth: General LLMs Sufficient: The benchmark's results counter the assumption that commercial LLMs can handle specialized legal classification without domain adaptation, at least on this dataset.
FAQ
What is LegalBench-BR and why is it important?
LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification, comprising 3,105 appellate proceedings from Santa Catarina State Court. It addresses the critical gap in Portuguese legal NLP evaluation tools, enabling researchers to develop and test legal AI models specifically for the Brazilian legal system.
How does BERTimbau-LoRA compare to GPT-4o mini on Brazilian legal text?
BERTimbau-LoRA achieves 87.6% accuracy and 0.87 macro-F1, outperforming GPT-4o mini by 28 percentage points while updating only 0.3% of model parameters. The performance gap is most striking in administrative law, where GPT-4o mini scores F1 = 0.00 compared to BERTimbau-LoRA’s F1 = 0.91.
What legal areas does LegalBench-BR cover?
LegalBench-BR covers five legal areas from Brazilian appellate proceedings, with particular focus on administrative law, civil law, and other major legal domains represented in Santa Catarina State Court cases. The dataset uses class-balanced testing to ensure fair evaluation across all categories.
Can I use LegalBench-BR for commercial legal AI applications?
Yes, LegalBench-BR is publicly released with the full dataset, trained models, and evaluation pipeline available for both research and commercial use. The benchmark enables development of accurate legal document classification systems for Brazilian legal technology applications.
Why do commercial LLMs perform poorly on Brazilian legal classification?
Commercial LLMs exhibit systematic bias toward civil law classification, defaulting ambiguous cases to civil law rather than discriminating between legal areas. They particularly struggle with administrative law, where GPT-4o mini and Claude 3.5 Haiku achieve near-zero F1 scores due to the lack of domain-specific training on Brazilian legal text.
What is LoRA fine-tuning and why is it effective for legal AI?
LoRA (Low-Rank Adaptation) fine-tuning updates only a small percentage of model parameters while achieving significant performance improvements. BERTimbau-LoRA updates just 0.3% of parameters yet achieves 87.6% accuracy on Brazilian legal classification, providing efficient domain adaptation with zero marginal inference cost.
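As a rough sanity check on the 0.3% figure, the arithmetic below assumes a BERT-base-sized encoder (roughly 110M parameters, hidden size 768, 12 layers) with rank-8 LoRA adapters on the query and value projections. All of these dimensions are assumptions for illustration, not the paper's reported configuration.

```python
# Hypothetical LoRA parameter budget for a BERT-base-sized encoder.
# Every dimension here is an assumption, not the paper's actual setup.
hidden = 768                 # hidden size of a BERT-base-style model
layers = 12                  # transformer layers
rank = 8                     # assumed LoRA rank
total_params = 110_000_000   # rough BERT-base parameter count

# A rank-r adapter on a (hidden x hidden) weight adds two factor
# matrices: A with shape (r, hidden) and B with shape (hidden, r).
per_matrix = 2 * hidden * rank

# Adapters on the query and value projections of every layer.
trainable = layers * 2 * per_matrix

print(f"trainable adapter params: {trainable:,}")
print(f"fraction of full model: {trainable / total_params:.2%}")
```

Under these assumptions the adapters amount to roughly 295K trainable parameters, on the order of 0.3% of the model, which is consistent with the figure quoted above. Because the low-rank update can be merged back into the frozen weights (W' = W + BA) before serving, the adapted model runs at the same speed as the base model, matching the "zero marginal inference cost" claim.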
How was the LegalBench-BR dataset created and validated?
The dataset comprises 3,105 appellate proceedings collected from Santa Catarina State Court via the DataJud API from the National Council of Justice. Legal documents were annotated using LLM-assisted labeling combined with heuristic validation rules to ensure annotation quality and consistency across five legal areas.
What hardware requirements are needed to use LegalBench-BR?
The benchmark includes a complete pipeline designed for consumer GPU training using LoRA fine-tuning techniques. While specific hardware requirements are not disclosed, the approach enables legal AI development on accessible hardware rather than requiring enterprise-grade computational resources.
How does LegalBench-BR compare to other legal AI benchmarks?
LegalBench-BR uniquely addresses Portuguese legal text classification, while existing benchmarks like LegalBench and LexGLUE focus on English-language legal systems. It fills a critical gap for Brazilian legal AI development, providing the first standardized evaluation framework for Portuguese legal NLP models.
What are the limitations of LegalBench-BR for legal AI development?
The benchmark focuses solely on Santa Catarina State Court proceedings, potentially limiting generalizability to other Brazilian jurisdictions. It covers only five legal areas and may not capture specialized domains like tax or environmental law that require different classification approaches.
Glossary
- BERTimbau
- Portuguese-language BERT model specifically trained on Brazilian Portuguese text, optimized for natural language processing tasks in Portuguese legal and general domains.
- DataJud API
- Application Programming Interface provided by Brazil’s National Council of Justice (CNJ) for accessing standardized legal data from Brazilian courts.
- LoRA (Low-Rank Adaptation)
- Parameter-efficient fine-tuning technique that updates only a small percentage of model parameters while achieving significant performance improvements on domain-specific tasks.
- Macro-F1
- Evaluation metric that calculates F1 score for each class independently and averages them, providing equal weight to all classes regardless of their frequency in the dataset.
- Santa Catarina State Court (TJSC)
- State-level judicial court in Santa Catarina, Brazil, serving as the data source for LegalBench-BR’s 3,105 appellate proceedings across five legal areas.
- Systematic Bias
- Consistent pattern where models incorrectly favor certain classifications, such as commercial LLMs absorbing ambiguous legal cases into civil law rather than discriminating between legal areas.
Sources
- LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence — https://arxiv.org/html/2512.04578
- Benchmarking Vietnamese Legal Knowledge of Large Language Models — https://arxiv.org/html/2512.14554v5
- PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs — https://arxiv.org/html/2604.17543
- LLM Leaderboard 2026 — Compare Top AI Models – Vellum AI — https://www.vellum.ai/llm-leaderboard
- From Legal Text to Executable Decision Models: Evaluating Structured Representations for Legal Decision Model Generation — https://arxiv.org/html/2604.17153
- Professional Reasoning Benchmark – Legal – Scale Labs — https://labs.scale.com/leaderboard/prbench-legal
- Large language model – Wikipedia — https://en.wikipedia.org/wiki/Large_language_model