LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification, comprising 3,105 appellate proceedings from Santa Catarina State Court across five legal areas.
| Attribute | Detail |
|---|---|
| Released by | Not yet disclosed |
| Release date | Not yet disclosed |
| What it is | First public benchmark for Brazilian legal text classification |
| Who it is for | Legal AI researchers and Portuguese NLP developers |
| Where to get it | Full dataset and model released publicly |
| Price | Free |
- LegalBench-BR contains 3,105 appellate proceedings from Santa Catarina State Court collected via DataJud API
- BERTimbau-LoRA achieves 87.6% accuracy by updating only 0.3% of model parameters
- Commercial LLMs like GPT-4o mini and Claude 3.5 Haiku perform poorly on administrative law classification
- Fine-tuned models eliminate systematic bias toward civil law that affects general-purpose LLMs
- Complete dataset, model, and pipeline are publicly released for reproducible research
- Domain-adapted fine-tuning significantly outperforms general-purpose LLMs on Brazilian legal classification tasks
- Commercial LLMs exhibit systematic bias toward civil law categories when classifying ambiguous legal texts
- LoRA fine-tuning provides substantial performance gains while updating minimal model parameters
- Administrative law classification represents the most challenging category for general-purpose models
- Brazilian legal NLP requires specialized datasets and models rather than relying on general-purpose solutions
What is LegalBench-BR
LegalBench-BR is the first public benchmark specifically designed for evaluating large language models on Brazilian legal text classification tasks. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC) collected through the DataJud API from Brazil’s National Council of Justice (CNJ).
The benchmark covers five distinct legal areas, with labels assigned through LLM-assisted annotation and heuristic validation. Each proceeding is classified into one of the five categories, each representing a different area of Brazilian law, creating a comprehensive evaluation framework for Portuguese legal natural language processing.
The dataset addresses a critical gap in legal AI evaluation for Portuguese-speaking jurisdictions. Unlike existing legal benchmarks that focus primarily on English or other languages, LegalBench-BR provides researchers with authentic Brazilian court documents for developing and testing legal AI systems.
What is new vs previous benchmarks
LegalBench-BR introduces the first Portuguese-language legal classification benchmark, filling a significant gap in multilingual legal AI evaluation.
| Feature | LegalBench-BR | Previous Legal Benchmarks |
|---|---|---|
| Language | Portuguese (Brazilian) | Primarily English |
| Data Source | Real court proceedings via DataJud API | Various legal texts |
| Task Focus | 5-class legal area classification | Multiple legal tasks |
| Dataset Size | 3,105 appellate proceedings | Varies by benchmark |
| Geographic Scope | Brazilian legal system | Various jurisdictions |
The benchmark specifically targets Brazilian legal classification, unlike general legal benchmarks such as LegalBench or LexGLUE that focus on broader legal intelligence tasks. LegalBench-BR provides domain-specific evaluation for Portuguese legal NLP applications.
How does LegalBench-BR work
LegalBench-BR operates through a systematic evaluation framework that tests model performance on Brazilian legal text classification across five categories.
- Data Collection: Appellate proceedings are gathered from Santa Catarina State Court through the official DataJud API maintained by Brazil’s National Council of Justice.
- Annotation Process: Legal texts undergo LLM-assisted labeling with heuristic validation to ensure accurate classification across five legal areas.
- Class Balancing: The test set maintains balanced representation across all five legal categories to prevent evaluation bias.
- Model Evaluation: Systems are tested using accuracy and macro-F1 scores on the class-balanced test set.
- Performance Analysis: Results reveal systematic biases and failure modes in different model types, particularly commercial LLMs versus fine-tuned models.
The evaluation methodology exposes critical differences between general-purpose and domain-adapted models. Commercial LLMs demonstrate systematic bias toward civil law categories, while fine-tuned models achieve more balanced classification across all legal areas.
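The scoring step above can be sketched with scikit-learn on toy predictions. The five label names below are illustrative assumptions: the sources confirm only civil and administrative law among the benchmark's categories.

```python
# Minimal sketch of the benchmark's two metrics, accuracy and macro-F1,
# computed on a class-balanced toy test set (two proceedings per area).
# Label names other than "civil" and "administrative" are assumptions.
from sklearn.metrics import accuracy_score, f1_score

labels = ["civil", "administrative", "criminal", "tax", "labor"]

y_true = labels * 2  # balanced ground truth: each area appears twice
y_pred = ["civil", "civil", "criminal", "tax", "labor",
          "civil", "administrative", "civil", "tax", "labor"]

acc = accuracy_score(y_true, y_pred)
macro = f1_score(y_true, y_pred, average="macro")
print(f"accuracy={acc:.2f}  macro-F1={macro:.2f}")  # → accuracy=0.80  macro-F1=0.80
```

Because macro-F1 averages per-class F1 scores with equal weight, it punishes a model that mislabels a whole category even when overall accuracy looks acceptable, which is exactly the failure mode the benchmark surfaces in commercial LLMs.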
Benchmarks and evidence
LegalBench-BR evaluation reveals significant performance gaps between domain-adapted and general-purpose language models on Brazilian legal classification tasks.
| Model | Accuracy | Macro-F1 | Administrative Law F1 | Parameters Updated |
|---|---|---|---|---|
| BERTimbau-LoRA | 87.6% | 0.87 | 0.91 | 0.3% |
| Claude 3.5 Haiku | Not yet disclosed | 0.65 | 0.08 | N/A |
| GPT-4o mini | Not yet disclosed | 0.59 | 0.00 | N/A |
BERTimbau-LoRA achieves 87.6% accuracy and 0.87 macro-F1 score while updating only 0.3% of model parameters. The fine-tuned model outperforms Claude 3.5 Haiku by 22 percentage points and GPT-4o mini by 28 percentage points in macro-F1 score.
Administrative law classification represents the most challenging category for commercial LLMs. GPT-4o mini scores F1 = 0.00 on administrative law cases, while Claude 3.5 Haiku achieves only F1 = 0.08, compared to BERTimbau-LoRA’s F1 = 0.91.
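The arithmetic behind that gap is straightforward: macro-F1 is the unweighted mean of per-class F1 scores, so a single collapsed class caps the overall score. In the sketch below, only the administrative-law value (0.00) is a reported figure; the other per-class scores are illustrative assumptions chosen to reproduce GPT-4o mini's reported 0.59 macro-F1.

```python
# Macro-F1 as an unweighted mean of per-class F1 scores. Only the
# administrative-law value (0.00) is a reported figure; the other
# per-class scores are illustrative assumptions.
def macro_f1(per_class):
    return sum(per_class.values()) / len(per_class)

illustrative = {"civil": 0.74, "administrative": 0.00,
                "criminal": 0.74, "tax": 0.74, "labor": 0.73}
print(f"macro-F1 = {macro_f1(illustrative):.2f}")  # → macro-F1 = 0.59
```

Even a model scoring roughly 0.74 on four of five classes cannot exceed 0.59 macro-F1 while one class sits at zero, which is why the administrative-law collapse dominates the commercial LLM results.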
Who should care
Builders
AI developers building legal technology for Brazilian markets need LegalBench-BR to evaluate Portuguese legal NLP systems. The benchmark provides essential performance metrics for legal classification tasks using authentic court documents.
Machine learning engineers can use the dataset to fine-tune models for Brazilian legal applications. The LoRA fine-tuning approach demonstrates how to achieve high performance while updating minimal parameters on consumer hardware.
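The parameter-efficiency idea can be sketched in plain PyTorch: freeze a pretrained weight matrix and train only a small low-rank update beside it. This is a generic illustration of the LoRA technique, not the paper's training code; the rank, scaling, and 768-dimensional projection are assumed values.

```python
# Generic LoRA sketch in plain PyTorch (not the paper's code): the
# pretrained layer stays frozen; only low-rank matrices A and B train,
# so the effective weight becomes W + (alpha/r) * B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B=0: adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# One BERT-sized projection (768x768 is an assumed dimension):
layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable}/{total} = {trainable/total:.1%}")
```

For a single layer the trainable fraction is about 2%; across a full model, where embeddings and most layers carry no adapters, the overall fraction drops much lower, consistent in spirit with the 0.3% reported for BERTimbau-LoRA.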
Enterprise
Legal technology companies serving Brazilian clients require domain-specific evaluation frameworks for their AI systems. LegalBench-BR enables rigorous testing of legal classification accuracy before deployment in production environments.
Law firms and legal service providers can benchmark AI tools against established performance standards. The evaluation reveals that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification tasks.
End users
Legal professionals working with Brazilian court documents benefit from improved AI classification systems trained and evaluated on LegalBench-BR. The benchmark ensures AI tools understand Portuguese legal terminology and classification requirements.
Researchers studying Portuguese legal texts gain access to a standardized evaluation framework. The publicly released dataset enables reproducible research in Brazilian legal natural language processing.
Investors
Venture capital firms evaluating legal AI startups can use LegalBench-BR performance as a technical due diligence metric. The benchmark provides objective comparison standards for Brazilian legal technology solutions.
Investment decisions in Portuguese legal AI can be informed by standardized performance metrics. The significant performance gaps between model types highlight the importance of domain-specific approaches in legal AI markets.
How to use LegalBench-BR today
LegalBench-BR provides immediate access to the dataset, pre-trained model, and evaluation pipeline for Brazilian legal AI development.
- Download Dataset: Access the complete dataset of 3,105 appellate proceedings with annotations across five legal categories from the public release.
- Load Pre-trained Model: Use the released BERTimbau-LoRA model that achieves 87.6% accuracy on the benchmark classification task.
- Run Evaluation Pipeline: Execute the provided evaluation scripts to test custom models against the class-balanced test set using accuracy and macro-F1 metrics.
- Fine-tune Custom Models: Apply LoRA fine-tuning techniques to adapt models for Brazilian legal classification using the training data and methodology.
- Compare Performance: Benchmark custom solutions against established baselines including BERTimbau-LoRA, Claude 3.5 Haiku, and GPT-4o mini results.
The complete pipeline enables researchers to reproduce results and develop improved models for Portuguese legal text classification. All components are publicly available for immediate use in legal AI research and development.
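When comparing models in the final step, the civil-law bias reported for commercial LLMs can be spotted with a simple sanity check: on a class-balanced test set, each label should receive roughly one fifth of all predictions. A hypothetical helper, with assumed label names:

```python
# Hypothetical bias check for a class-balanced test set: each label
# should receive roughly 1/len(LABELS) of the predictions. Label names
# are assumptions for illustration.
from collections import Counter

LABELS = ["civil", "administrative", "criminal", "tax", "labor"]

def prediction_shares(preds, labels=LABELS):
    counts = Counter(preds)
    return {lab: counts.get(lab, 0) / len(preds) for lab in labels}

# A biased model funnels ambiguous cases into "civil":
biased = ["civil"] * 6 + ["criminal", "tax", "labor", "civil"]
shares = prediction_shares(biased)
flagged = [lab for lab, s in shares.items() if s > 2 / len(LABELS)]
print("over-predicted:", flagged)  # → over-predicted: ['civil']
```

A flag on "civil" paired with a near-zero share for another label reproduces exactly the absorption pattern the benchmark reports for GPT-4o mini on administrative law.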
LegalBench-BR vs competitors
LegalBench-BR addresses Portuguese legal classification while existing benchmarks focus on different languages and broader legal tasks.
| Benchmark | Language | Task Type | Dataset Size | Geographic Focus |
|---|---|---|---|---|
| LegalBench-BR | Portuguese | 5-class legal classification | 3,105 proceedings | Brazilian courts |
| LegalBench | English | Multi-task legal reasoning | Various tasks | US legal system |
| LexGLUE | English | Multiple legal NLP tasks | Various datasets | Multiple jurisdictions |
| LexGenius | English | Expert-level legal intelligence | Not yet disclosed | General legal domains |
LegalBench-BR fills a critical gap in Portuguese legal AI evaluation that existing English-focused benchmarks cannot address. The benchmark provides authentic Brazilian court data rather than synthetic or translated legal texts.
Unlike multi-task benchmarks such as LegalBench and LexGLUE, LegalBench-BR focuses specifically on classification performance within Brazilian legal categories. This targeted approach enables precise evaluation of domain-specific legal AI systems.
Risks, limits, and myths
- Geographic Limitation: Dataset contains only Santa Catarina State Court proceedings, potentially limiting generalization to other Brazilian states or federal courts.
- Classification Scope: Benchmark covers only five legal areas, which may not represent the full complexity of Brazilian legal classification needs.
- Temporal Bias: Court proceedings reflect specific time periods and may not capture evolving legal language or recent legislative changes.
- Annotation Quality: LLM-assisted labeling with heuristic validation may introduce systematic errors or inconsistencies in ground truth labels.
- Model Generalization: Performance on LegalBench-BR may not predict success on other Portuguese legal tasks or different court systems.
- Commercial LLM Myth: the results debunk the assumption that general-purpose commercial LLMs can handle specialized legal classification effectively without domain adaptation.
- Full Retraining Myth: LoRA fine-tuning reaches high performance while updating only 0.3% of parameters, debunking the belief that domain adaptation requires full model retraining.
- Language Transfer Myth: English legal benchmarks cannot substitute for Portuguese-specific evaluation, despite surface similarities between legal domains across languages.
FAQ
- What is LegalBench-BR and why was it created?
- LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification, created to address the lack of Portuguese legal AI evaluation frameworks.
- How many legal documents does LegalBench-BR contain?
- LegalBench-BR contains 3,105 appellate proceedings from the Santa Catarina State Court collected through the DataJud API from Brazil’s National Council of Justice.
- Which model performs best on LegalBench-BR classification tasks?
- BERTimbau-LoRA achieves the highest performance with 87.6% accuracy and 0.87 macro-F1 score while updating only 0.3% of model parameters through fine-tuning.
- How do commercial LLMs perform on Brazilian legal classification?
- Commercial LLMs perform poorly, with Claude 3.5 Haiku achieving 0.65 macro-F1 and GPT-4o mini achieving 0.59 macro-F1, significantly below the fine-tuned model’s 0.87.
- What legal areas does LegalBench-BR cover for classification?
- LegalBench-BR covers five legal areas for classification, though the specific categories are not detailed in the available sources beyond administrative and civil law.
- Why do general-purpose LLMs struggle with administrative law classification?
- General-purpose LLMs exhibit systematic bias toward civil law categories, with GPT-4o mini scoring F1 = 0.00 and Claude 3.5 Haiku scoring F1 = 0.08 on administrative law.
- Is LegalBench-BR dataset publicly available for research?
- Yes, the complete LegalBench-BR dataset, model, and evaluation pipeline are publicly released to enable reproducible research in Portuguese legal natural language processing.
- Can LegalBench-BR results generalize to other Portuguese legal systems?
- Generalization may be limited since the dataset contains only Santa Catarina State Court proceedings, potentially not representing other Brazilian states or federal courts.
- What evaluation metrics does LegalBench-BR use for model performance?
- LegalBench-BR evaluates models using accuracy and macro-F1 scores on a class-balanced test set to ensure fair comparison across all five legal categories.
- How does LoRA fine-tuning achieve high performance with minimal parameter updates?
- LoRA fine-tuning updates only 0.3% of BERTimbau model parameters while achieving 87.6% accuracy, demonstrating parameter-efficient adaptation for domain-specific legal tasks.
- What systematic bias do commercial LLMs show in legal classification?
- Commercial LLMs exhibit systematic bias toward civil law categories, absorbing ambiguous classes rather than discriminating them, which domain-adapted fine-tuning eliminates.
- Why can’t general-purpose LLMs substitute for domain-adapted models in legal tasks?
- Results demonstrate that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification, even for simple 5-class problems, due to systematic biases and poor performance.
Glossary
- BERTimbau
- Portuguese-language BERT model specifically trained on Brazilian Portuguese texts for natural language processing tasks
- DataJud API
- Official application programming interface from Brazil’s National Council of Justice for accessing court proceeding data
- LoRA (Low-Rank Adaptation)
- Parameter-efficient fine-tuning technique that updates only a small percentage of model parameters while maintaining performance
- Macro-F1
- Evaluation metric that calculates F1 score for each class separately then averages them, giving equal weight to all classes regardless of frequency
- TJSC
- Santa Catarina State Court (Tribunal de Justiça de Santa Catarina), the source of legal proceedings in the LegalBench-BR dataset
- CNJ
- National Council of Justice (Conselho Nacional de Justiça), Brazil’s judicial oversight body that maintains the DataJud API
- Appellate Proceedings
- Legal cases that have been appealed to a higher court for review of a lower court’s decision
- Heuristic Validation
- Rule-based checking method used to verify the accuracy of LLM-assisted annotations in the dataset creation process
Sources
- LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence — https://arxiv.org/html/2512.04578
- Benchmarking Vietnamese Legal Knowledge of Large Language Models — https://arxiv.org/html/2512.14554v5
- PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs — https://arxiv.org/html/2604.17543
- From Legal Text to Executable Decision Models: Evaluating Structured Representations for Legal Decision Model Generation — https://arxiv.org/html/2604.17153
- Professional Reasoning Benchmark – Legal – Scale Labs — https://labs.scale.com/leaderboard/prbench-legal
- LLM Leaderboard 2026 — Compare Top AI Models – Vellum AI — https://www.vellum.ai/llm-leaderboard
- The Best Open-Source LLMs in 2026 — https://www.bentoml.com/blog/navigating-the-world-of-open-source-large-language-models