LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification. It comprises 3,105 appellate proceedings from the Santa Catarina State Court spanning five legal areas, and its results show specialized fine-tuned models significantly outperforming general-purpose LLMs.
| Released by | Not yet disclosed |
|---|---|
| Release date | |
| What it is | First public benchmark for Brazilian legal text classification |
| Who it is for | Legal AI researchers and developers |
| Where to get it | arXiv paper with dataset release |
| Price | Free |
- LegalBench-BR contains 3,105 appellate proceedings from Santa Catarina State Court collected via DataJud API
- BERTimbau-LoRA achieves 87.6% accuracy while updating only 0.3% of model parameters
- GPT-4o mini and Claude 3.5 Haiku show systematic bias toward civil law classification
- Fine-tuned models outperform commercial LLMs by 22-28 percentage points on macro-F1 scores
- Dataset covers five legal areas with LLM-assisted labeling and heuristic validation
- Domain-adapted fine-tuning eliminates systematic classification biases present in general-purpose LLMs
- LoRA fine-tuning achieves expert-level performance while updating minimal model parameters
- Commercial LLMs cannot substitute for specialized models in Brazilian legal classification tasks
- Administrative law classification proves most challenging for general-purpose models
- Fine-tuning on a consumer GPU closes the performance gap, and the resulting model runs at zero marginal inference cost
What is LegalBench-BR
LegalBench-BR is the first public benchmark specifically designed for evaluating large language models on Brazilian legal text classification tasks. The benchmark comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC) collected through the DataJud API provided by the National Council of Justice (CNJ). The dataset covers five distinct legal areas and uses LLM-assisted labeling combined with heuristic validation to ensure annotation quality.
The benchmark addresses a critical gap in Portuguese legal natural language processing research. Unlike existing legal benchmarks that focus primarily on English or other languages, LegalBench-BR provides researchers with a standardized evaluation framework for Brazilian legal AI applications. The dataset includes proceedings annotated across civil, criminal, administrative, labor, and tax law categories.
What is new vs previous benchmarks
LegalBench-BR introduces several novel features compared to existing legal AI benchmarks:
| Feature | LegalBench-BR | Previous Benchmarks |
|---|---|---|
| Language focus | Brazilian Portuguese | Primarily English |
| Data source | Real court proceedings via DataJud API | Various legal texts |
| Annotation method | LLM-assisted with heuristic validation | Manual or rule-based |
| Legal areas | 5 specific Brazilian law categories | General legal domains |
| Model evaluation | Includes LoRA fine-tuning analysis | Standard fine-tuning approaches |
How does LegalBench-BR work
LegalBench-BR operates through a systematic evaluation framework for Brazilian legal text classification:
- Data collection: Researchers gather 3,105 appellate proceedings from TJSC using the official DataJud API
- Annotation process: Legal texts receive labels across five categories using LLM-assisted annotation with heuristic validation
- Class balancing: The test set maintains balanced representation across all five legal areas
- Model evaluation: Benchmark measures accuracy and macro-F1 scores on classification tasks
- Performance analysis: Results compare domain-adapted models against general-purpose LLMs
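The evaluation step above boils down to two metrics, accuracy and macro-F1, sketched here in plain Python. The label set matches the benchmark's five areas, but the example predictions are made up for illustration and are not real benchmark data:

```python
# Minimal sketch of the benchmark's two reported metrics.
LABELS = ["civil", "criminal", "administrative", "labor", "tax"]

def accuracy(y_true, y_pred):
    """Fraction of cases classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Compute F1 per class, then average with equal weight, so a rare
    area (e.g. administrative law) counts as much as a common one."""
    f1s = []
    for label in LABELS:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Illustrative toy predictions: one administrative case mislabeled as civil.
y_true = ["civil", "criminal", "administrative", "labor", "tax", "civil"]
y_pred = ["civil", "criminal", "civil", "labor", "tax", "civil"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

On a class-balanced test set, accuracy and macro-F1 track each other closely; on imbalanced data, macro-F1 is the stricter of the two because missed minority classes drag the average down.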
Benchmarks and evidence
LegalBench-BR demonstrates significant performance gaps between specialized and general-purpose models:
| Model | Accuracy | Macro-F1 | Parameters Updated |
|---|---|---|---|
| BERTimbau-LoRA | 87.6% | 0.87 | 0.3% |
| Claude 3.5 Haiku | Not reported | 0.65 | N/A |
| GPT-4o mini | Not reported | 0.59 | N/A |
The performance gap proves most striking in administrative law classification. GPT-4o mini achieves F1 = 0.00 on administrative law cases, while Claude 3.5 Haiku scores F1 = 0.08. The fine-tuned BERTimbau model reaches F1 = 0.91 on the same category, demonstrating the critical importance of domain adaptation for Brazilian legal texts.
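The civil-law bias has a simple mechanical illustration: a classifier that defaults to "civil" whenever it is unsure still gets partial credit on accuracy-style metrics, but scores F1 = 0.00 on the class it never predicts, which per-class F1 exposes. A toy sketch with made-up predictions (not the benchmark's actual model outputs):

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, from its true/false positives and false negatives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Balanced toy test set: 4 cases per area (three areas shown for brevity).
y_true = ["civil"] * 4 + ["criminal"] * 4 + ["administrative"] * 4
# A biased model that labels every administrative case as "civil".
y_pred = ["civil"] * 4 + ["criminal"] * 4 + ["civil"] * 4

print(per_class_f1(y_true, y_pred, "administrative"))  # 0.0
print(per_class_f1(y_true, y_pred, "civil"))  # recall 1.0, precision 0.5 -> 2/3
```

The biased model is still right on 8 of 12 cases, yet its administrative-law F1 collapses to zero, mirroring the GPT-4o mini result reported above.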
Who should care
Builders
AI developers building legal technology for Brazilian markets need domain-specific benchmarks to validate their models. LegalBench-BR provides the first standardized evaluation framework for Portuguese legal NLP applications, enabling developers to measure performance against established baselines.
Enterprise
Law firms and legal service providers can use LegalBench-BR results to inform technology adoption decisions. The benchmark demonstrates that general-purpose LLMs cannot substitute for specialized models in Brazilian legal classification, guiding investment in domain-adapted solutions.
End users
Legal professionals working with Brazilian court systems benefit from improved AI tools validated against real court proceedings. The benchmark ensures legal AI applications meet professional standards for accuracy and reliability in Portuguese legal contexts.
Investors
Venture capital and legal tech investors can use LegalBench-BR performance metrics to evaluate startup claims about Brazilian legal AI capabilities. The benchmark provides objective performance standards for due diligence processes.
How to use LegalBench-BR today
Researchers and developers can access LegalBench-BR through the following steps:
- Download the dataset: Access the full dataset from the arXiv paper release
- Install dependencies: Set up the provided pipeline with required Python libraries
- Load the model: Use the released BERTimbau-LoRA model for baseline comparisons
- Run evaluation: Execute the benchmark pipeline on your models using the class-balanced test set
- Compare results: Measure accuracy and macro-F1 scores against established baselines
LegalBench-BR vs competitors
LegalBench-BR compares against other legal AI benchmarks in the following areas:
| Benchmark | Language | Task Type | Dataset Size | Legal Domain |
|---|---|---|---|---|
| LegalBench-BR | Portuguese | Classification | 3,105 cases | Brazilian law |
| LexGLUE | English | Multi-task | Various | General legal |
| LegalBench | English | Reasoning | Various | US law |
Risks, limits, and myths
- Geographic limitation: Dataset focuses solely on Santa Catarina State Court proceedings, potentially limiting generalizability
- Language specificity: Results apply specifically to Brazilian Portuguese legal texts, not other Portuguese variants
- Classification scope: Benchmark covers only five legal areas, excluding specialized domains like intellectual property
- Temporal constraints: Court proceedings reflect specific time periods and may not capture evolving legal language
- Annotation bias: LLM-assisted labeling may introduce systematic biases despite heuristic validation
- Model dependency: Performance results depend on specific model architectures and training procedures
FAQ
What is LegalBench-BR and why was it created?
LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification, created to address the lack of Portuguese legal AI evaluation frameworks.
How many legal cases does LegalBench-BR contain?
LegalBench-BR contains 3,105 appellate proceedings from the Santa Catarina State Court collected through the DataJud API.
Which legal areas does LegalBench-BR cover?
The benchmark covers five legal areas: civil law, criminal law, administrative law, labor law, and tax law.
How does BERTimbau-LoRA perform compared to commercial LLMs?
BERTimbau-LoRA achieves 87.6% accuracy and 0.87 macro-F1, outperforming Claude 3.5 Haiku by 22 and GPT-4o mini by 28 macro-F1 percentage points.
Why do commercial LLMs perform poorly on administrative law cases?
GPT-4o mini scores F1 = 0.00 and Claude 3.5 Haiku scores F1 = 0.08 on administrative law due to systematic bias toward civil law classification.
What percentage of model parameters does LoRA fine-tuning update?
LoRA fine-tuning updates only 0.3% of model parameters while achieving superior performance on Brazilian legal classification tasks.
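A back-of-the-envelope check makes the 0.3% figure plausible. For a weight matrix of shape d × k, LoRA trains two low-rank factors totaling r × (d + k) parameters instead of the full d × k. The dimensions, rank, and target modules below are assumed BERT-base-like and are not the paper's reported configuration:

```python
def lora_param_fraction(d, k, r, n_matrices, total_params):
    """Fraction of parameters trained when LoRA adapts n_matrices
    weight matrices of shape d x k with rank r: each adapter adds
    r * (d + k) trainable weights."""
    lora_params = n_matrices * r * (d + k)
    return lora_params / total_params

# Assumptions: 12 layers, 768-dim hidden size, adapting the query and
# value projections (2 matrices per layer) with rank r = 8, on a model
# of roughly 110M parameters.
fraction = lora_param_fraction(d=768, k=768, r=8, n_matrices=24,
                               total_params=110_000_000)
print(f"{fraction:.2%} of parameters trained")  # prints "0.27% of parameters trained"
```

Under these illustrative assumptions the trained fraction lands near 0.3%, consistent with the figure reported for BERTimbau-LoRA.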
Can general-purpose LLMs substitute for domain-adapted models in legal tasks?
No, the benchmark demonstrates that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification, even for simple 5-class problems.
How was the LegalBench-BR dataset annotated?
The dataset uses LLM-assisted labeling combined with heuristic validation to ensure annotation quality across the five legal categories.
What is the cost of using fine-tuned models versus commercial LLMs?
LoRA fine-tuning on consumer GPUs closes the performance gap at zero marginal inference cost compared to commercial LLM APIs.
Where can researchers access the LegalBench-BR dataset?
Researchers can access the full dataset, model, and pipeline through the arXiv paper release to enable reproducible research.
Glossary
- BERTimbau
- Portuguese language model based on BERT architecture, specifically trained on Brazilian Portuguese texts
- DataJud API
- Official API provided by Brazil’s National Council of Justice for accessing court proceeding data
- LoRA (Low-Rank Adaptation)
- Parameter-efficient fine-tuning technique that updates only a small percentage of model parameters
- Macro-F1
- Evaluation metric that calculates F1 score for each class separately then averages them, giving equal weight to all classes
- TJSC
- Santa Catarina State Court (Tribunal de Justiça de Santa Catarina) in Brazil
- Heuristic validation
- Rule-based verification process used to check the quality of automated annotations
- Appellate proceedings
- Legal cases that have been appealed to a higher court for review of lower court decisions