LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification. It comprises 3,105 appellate proceedings from the Santa Catarina State Court spanning five legal areas, and its results show specialized fine-tuned models significantly outperforming commercial LLMs.
| Released by | Not yet disclosed |
|---|---|
| Release date | |
| What it is | First public benchmark for evaluating LLMs on Brazilian legal text classification |
| Who it is for | Legal AI researchers and Portuguese NLP developers |
| Where to get it | Full dataset and model released publicly |
| Price | Free |
- LegalBench-BR contains 3,105 appellate proceedings from Santa Catarina State Court collected via DataJud API
- BERTimbau-LoRA achieves 87.6% accuracy by updating only 0.3% of model parameters
- Commercial LLMs like GPT-4o mini and Claude 3.5 Haiku perform poorly on administrative law classification
- Fine-tuned models avoid the systematic bias toward civil law that affects general-purpose LLMs
- Dataset covers five legal areas with LLM-assisted labeling and heuristic validation
- Domain-adapted fine-tuning significantly outperforms general-purpose LLMs on Brazilian legal classification tasks
- Commercial LLMs exhibit systematic bias toward civil law categories in legal text classification
- LoRA fine-tuning provides efficient domain adaptation with minimal parameter updates
- Brazilian legal NLP requires specialized models rather than relying on general commercial solutions
- Public legal datasets enable reproducible research in Portuguese legal natural language processing
What is LegalBench-BR
LegalBench-BR is the first public benchmark specifically designed for evaluating large language models on Brazilian legal text classification tasks. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC) collected through the DataJud API provided by the National Council of Justice (CNJ). The benchmark covers five distinct legal areas and uses LLM-assisted labeling with heuristic validation to ensure data quality.
The benchmark addresses a critical gap in Portuguese legal natural language processing by providing standardized evaluation metrics for legal AI systems. Unlike existing legal benchmarks that focus primarily on English or other languages, LegalBench-BR specifically targets Brazilian legal terminology and procedural contexts.
What is new vs previous benchmarks
LegalBench-BR introduces the first Portuguese-language legal classification benchmark, filling a significant gap in multilingual legal AI evaluation.
| Feature | LegalBench-BR | Previous Legal Benchmarks |
|---|---|---|
| Language | Portuguese (Brazilian) | Primarily English |
| Legal System | Brazilian civil law | Common law systems |
| Data Source | Official court API (DataJud) | Various sources |
| Task Focus | 5-class legal area classification | Multiple legal tasks |
| Validation Method | LLM-assisted with heuristics | Human annotation primarily |
How does LegalBench-BR work
LegalBench-BR operates through a systematic data collection and evaluation pipeline designed for Brazilian legal text classification.
- Data Collection: Appellate proceedings are collected from Santa Catarina State Court via the official DataJud API
- Annotation Process: Legal texts are labeled across five categories using LLM-assisted annotation with heuristic validation
- Class Balancing: Test sets are balanced across legal categories to ensure fair evaluation metrics
- Model Evaluation: Performance is measured using accuracy and macro-F1 scores on classification tasks
- Benchmark Testing: Models are evaluated on their ability to distinguish between administrative, civil, criminal, tax, and labor law cases
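The two evaluation metrics named above can be sketched in plain Python. The label names and toy predictions below are illustrative only, not taken from the released dataset:

```python
LABELS = ["administrative", "civil", "criminal", "tax", "labor"]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels=LABELS):
    # Per-class F1, averaged with equal weight regardless of class frequency.
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: a model biased toward "civil" looks acceptable on accuracy
# but is punished by macro-F1, which weights rare classes equally.
y_true = ["administrative", "civil", "criminal", "tax", "labor", "civil"]
y_pred = ["civil", "civil", "criminal", "civil", "labor", "civil"]
print(round(accuracy(y_true, y_pred), 2))  # 0.67
print(round(macro_f1(y_true, y_pred), 2))  # 0.53
```

This gap between the two numbers is exactly why the paper reports macro-F1 alongside accuracy: it surfaces the civil-law bias that an accuracy figure alone would hide.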
Benchmarks and evidence
LegalBench-BR evaluation results demonstrate significant performance gaps between domain-adapted and general-purpose models on Brazilian legal classification.
| Model | Accuracy | Macro-F1 | Administrative Law F1 | Source |
|---|---|---|---|---|
| BERTimbau-LoRA | 87.6% | 0.87 | 0.91 | LegalBench-BR paper |
| Claude 3.5 Haiku | Not disclosed | 0.65 | 0.08 | LegalBench-BR paper |
| GPT-4o mini | Not disclosed | 0.59 | 0.00 | LegalBench-BR paper |
The fine-tuned BERTimbau-LoRA model scores 0.22 higher in macro-F1 than Claude 3.5 Haiku (22 percentage points) and 0.28 higher than GPT-4o mini. The gap is most pronounced in administrative law classification, where the commercial LLMs score at or near zero F1 while the specialized model reaches 0.91.
Who should care
Builders
AI developers working on legal technology applications for Brazilian markets need domain-specific benchmarks to evaluate model performance. LegalBench-BR provides standardized metrics for Portuguese legal NLP systems and demonstrates the necessity of fine-tuning for legal classification tasks.
Enterprise
Law firms and legal technology companies operating in Brazil require accurate automated document classification systems. The benchmark reveals that commercial LLMs cannot substitute for domain-adapted models in Brazilian legal contexts, informing technology procurement decisions.
End users
Legal professionals and researchers working with Brazilian court documents benefit from improved classification accuracy that specialized models provide. The benchmark enables better tooling for legal research and case management systems.
Investors
Venture capital and private equity firms evaluating legal AI startups can use LegalBench-BR results to assess technical capabilities and market positioning in Portuguese-speaking legal markets.
How to use LegalBench-BR today
LegalBench-BR is available as a complete research package including dataset, trained models, and evaluation pipeline.
- Download Dataset: Access the full 3,105 appellate proceedings dataset from the public release
- Load Pre-trained Model: Use the released BERTimbau-LoRA model for immediate Brazilian legal text classification
- Run Evaluation Pipeline: Execute the provided evaluation scripts to benchmark new models against established baselines
- Fine-tune Custom Models: Apply LoRA fine-tuning techniques to adapt models for specific legal domains
- Validate Results: Use the class-balanced test set to measure accuracy and macro-F1 performance
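The LoRA step above can be illustrated with a minimal numpy sketch of a single adapted layer. The hidden size matches a BERT-base-style projection, but the rank and scaling factor are assumed values; the 0.3% figure reported for BERTimbau-LoRA is a whole-model fraction that also depends on which modules were adapted, which the summary does not specify:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 4  # hidden size of a BERT-base-like projection; LoRA rank (assumed)

# Frozen pretrained weight plus two small trainable low-rank factors.
W = rng.standard_normal((d, d))          # frozen: d*d parameters
A = rng.standard_normal((r, d)) * 0.01   # trainable: r*d
B = np.zeros((d, r))                     # trainable: d*r, zero-init so W' == W at start

def lora_forward(x, alpha=8.0):
    # Adapted layer: W x + (alpha/r) * B (A x); only A and B receive gradients.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
assert np.allclose(lora_forward(x), W @ x)  # zero-init B: output unchanged before training

trainable = A.size + B.size
total = W.size + trainable
# Roughly 1% for this single layer; the whole-model fraction is smaller
# because most of the network (embeddings, MLPs) stays entirely frozen.
print(f"trainable fraction: {trainable / total:.4%}")
```

Merging the update back into `W` after training (`W + (alpha/r) * B @ A`) removes any inference-time overhead, which is one reason LoRA is the default choice for this kind of domain adaptation.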
LegalBench-BR vs competitors
LegalBench-BR addresses Portuguese legal classification while existing benchmarks focus on English legal tasks and broader evaluation scopes.
| Benchmark | Language | Task Type | Legal System | Dataset Size |
|---|---|---|---|---|
| LegalBench-BR | Portuguese | 5-class classification | Brazilian civil law | 3,105 cases |
| LegalBench | English | Multi-task evaluation | US common law | Not disclosed |
| LexGLUE | English | Multiple NLP tasks | EU/US law | Not disclosed |
Prior legal benchmarks typically evaluate classification and decision-making tasks with accuracy and F1, reserving F0.5 for imbalanced or precision-oriented tasks [3]. LegalBench-BR follows this convention, reporting accuracy and macro-F1 for its classification task.
Risks, limits, and myths
- Geographic Limitation: Dataset only covers Santa Catarina State Court, potentially limiting generalizability to other Brazilian jurisdictions
- Temporal Scope: Court proceedings represent a specific time period that may not reflect evolving legal language
- Class Imbalance: Real-world legal document distributions may differ from the balanced test set used for evaluation
- Annotation Quality: LLM-assisted labeling with heuristic validation may introduce systematic errors compared to expert human annotation
- Model Generalization: Fine-tuned models may overfit to specific legal document formats from the training court
- Commercial LLM Bias: Results show systematic bias toward civil law classification that may affect other legal domains
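The class-imbalance caveat above can be made concrete. Below is a minimal sketch, assuming labeled examples are available as `(text, label)` pairs, of drawing an equal number of test examples per legal area, the balancing strategy the benchmark is described as using; the function name and sizes are illustrative:

```python
import random

def balanced_test_split(examples, per_class, seed=42):
    """Draw `per_class` test examples from each label; the rest form the train pool.
    `examples` is a list of (text, label) pairs; the label set is inferred."""
    rng = random.Random(seed)
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex[1], []).append(ex)
    test, train = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        test.extend(items[:per_class])
        train.extend(items[per_class:])
    return train, test

# Illustrative: an imbalanced pool still yields an exactly balanced test set.
pool = [("doc", "civil")] * 60 + [("doc", "criminal")] * 25 + [("doc", "tax")] * 15
train, test = balanced_test_split(pool, per_class=10)
counts = {lbl: sum(1 for _, l in test if l == lbl) for lbl in {"civil", "criminal", "tax"}}
print(counts)  # each label appears exactly 10 times
```

Note the caveat this leaves in place: metrics computed on the balanced test set describe per-class capability, not expected accuracy on a real court docket whose label distribution is skewed.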
FAQ
What is LegalBench-BR and why was it created?
LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification, created to address the lack of Portuguese legal AI evaluation tools.
How many legal documents does LegalBench-BR contain?
LegalBench-BR contains 3,105 appellate proceedings from the Santa Catarina State Court collected through the official DataJud API.
Which legal areas does LegalBench-BR cover?
The benchmark covers five legal areas: administrative law, civil law, criminal law, tax law, and labor law classification tasks.
How does BERTimbau-LoRA perform compared to commercial LLMs?
BERTimbau-LoRA achieves 87.6% accuracy and 0.87 macro-F1, outperforming Claude 3.5 Haiku by 22 percentage points and GPT-4o mini by 28 percentage points in macro-F1.
Why do commercial LLMs perform poorly on administrative law classification?
GPT-4o mini scores F1 = 0.00 and Claude 3.5 Haiku scores F1 = 0.08 on administrative law, reflecting a systematic bias toward civil law categories in their predictions.
What is LoRA fine-tuning and how efficient is it?
LoRA (Low-Rank Adaptation) fine-tuning updates only 0.3% of model parameters while achieving significant performance improvements on domain-specific tasks.
Can general-purpose LLMs replace specialized legal models for Brazilian law?
No, the benchmark demonstrates that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification tasks.
How was the LegalBench-BR dataset annotated and validated?
Legal texts were labeled using LLM-assisted annotation combined with heuristic validation to ensure data quality across five legal categories.
Is LegalBench-BR available for public research use?
Yes, the full dataset, trained model, and evaluation pipeline are released publicly to enable reproducible research in Portuguese legal NLP.
What makes LegalBench-BR different from existing legal benchmarks?
LegalBench-BR is the first benchmark specifically designed for Portuguese Brazilian legal text classification, unlike existing English-focused legal evaluation datasets.
How does the benchmark handle class imbalance in legal document types?
LegalBench-BR uses a class-balanced test set to ensure fair evaluation metrics across all five legal area categories.
What are the implications for legal AI development in Brazil?
The benchmark shows that Brazilian legal AI applications require domain-specific fine-tuning rather than relying on general commercial LLM solutions.
Glossary
- BERTimbau
- Portuguese language version of BERT (Bidirectional Encoder Representations from Transformers) trained specifically for Brazilian Portuguese text processing
- DataJud API
- Official application programming interface provided by Brazil’s National Council of Justice (CNJ) for accessing court proceeding data
- LoRA (Low-Rank Adaptation)
- Parameter-efficient fine-tuning technique that updates only a small percentage of model parameters while maintaining performance
- Macro-F1
- Evaluation metric that calculates F1 score for each class separately then averages them, giving equal weight to all classes regardless of frequency
- TJSC
- Santa Catarina State Court (Tribunal de Justiça de Santa Catarina), the state-level judicial court system in Santa Catarina, Brazil
- Appellate Proceedings
- Legal cases that have been appealed from lower courts to higher courts for review of legal decisions
- Heuristic Validation
- Rule-based checking method used to verify the accuracy of automated annotations using predefined logical rules
- Domain Adaptation
- Process of modifying a general-purpose model to perform better on specific domain tasks through specialized training
Sources
- LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence — https://arxiv.org/html/2512.04578
- Benchmarking Vietnamese Legal Knowledge of Large Language Models — https://arxiv.org/html/2512.14554v5
- PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs — https://arxiv.org/html/2604.17543
- LLM Leaderboard 2026 — Compare Top AI Models – Vellum AI — https://www.vellum.ai/llm-leaderboard
- From Legal Text to Executable Decision Models: Evaluating Structured Representations for Legal Decision Model Generation — https://arxiv.org/html/2604.17153
- Professional Reasoning Benchmark – Legal – Scale Labs — https://labs.scale.com/leaderboard/prbench-legal
- Large language model – Wikipedia — https://en.wikipedia.org/wiki/Large_language_model