
LegalBench-BR: First Brazilian Legal LLM Classification Benchmark

LegalBench-BR introduces the first public benchmark for evaluating large language models on Brazilian legal text classification with 3,105 court proceedings.


LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification. It comprises 3,105 appellate court proceedings across five legal areas, and domain-adapted models significantly outperform general-purpose LLMs on it.

Released by: Not yet disclosed
Release date: Not yet disclosed
What it is: First public benchmark for Brazilian legal text classification
Who it is for: Legal AI researchers and Portuguese NLP developers
Where to get it: Not yet disclosed
Price: Free (open dataset)
  • LegalBench-BR contains 3,105 appellate proceedings from the Santa Catarina State Court across five legal areas
  • BERTimbau-LoRA achieves 87.6% accuracy, outperforming GPT-4o mini by 28 percentage points in macro-F1
  • Commercial LLMs show a systematic bias toward civil law and fail on administrative law (GPT-4o mini scores 0.00 F1); fine-tuning eliminates this bias
  • LoRA fine-tuning updates only 0.3% of model parameters while achieving superior performance
  • The open dataset enables reproducible research in Portuguese legal natural language processing

What is LegalBench-BR

LegalBench-BR is the first public benchmark specifically designed for evaluating large language models on Brazilian legal text classification tasks. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC), collected via the DataJud API from Brazil’s National Council of Justice (CNJ). These proceedings are annotated across five distinct legal areas through LLM-assisted labeling with heuristic validation.

The benchmark addresses a critical gap in Portuguese legal natural language processing evaluation. Unlike existing legal benchmarks that focus primarily on English or other languages, LegalBench-BR provides researchers with a standardized dataset for assessing model performance on Brazilian legal documents. The dataset covers five key legal areas: civil law (civel), administrative law (administrativo), criminal law (criminal), tax law (tributario), and constitutional law (constitucional).
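For orientation, the five-area label space maps naturally onto integer class ids for a classification head. A minimal sketch (the dataset's actual field and label identifiers are not disclosed in this article, so the names below are illustrative):

```python
# Hypothetical label scheme for the five legal areas covered by LegalBench-BR.
# The actual identifiers in the released dataset may differ.
LABELS = ["civel", "administrativo", "criminal", "tributario", "constitucional"]
label2id = {name: i for i, name in enumerate(LABELS)}
id2label = {i: name for name, i in label2id.items()}

def encode(area: str) -> int:
    """Map a legal-area string to its integer class id."""
    return label2id[area]

print(encode("administrativo"))  # → 1
```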

What is new vs the previous version

LegalBench-BR represents the first benchmark of its kind for Brazilian legal text classification, making direct version comparisons impossible.

| Aspect | Previous state | LegalBench-BR |
|---|---|---|
| Brazilian legal benchmarks | None available | First public benchmark with 3,105 proceedings |
| Portuguese legal NLP | Limited evaluation resources | Standardized classification across 5 legal areas |
| Court data access | Fragmented, non-standardized | Systematic collection via DataJud API |
| Model evaluation | Ad-hoc testing methods | Class-balanced test set with macro-F1 metrics |
| Reproducibility | Closed datasets | Full dataset, model, and pipeline released |

How does LegalBench-BR work

LegalBench-BR operates through a systematic data collection and evaluation framework designed for Brazilian legal text classification.

  1. Data Collection: Researchers collect appellate proceedings from Santa Catarina State Court using the DataJud API, ensuring systematic access to official court documents.
  2. Annotation Process: Legal texts undergo LLM-assisted labeling across five legal areas, with heuristic validation ensuring annotation quality and consistency.
  3. Class Balancing: The test set maintains balanced representation across all five legal categories to prevent evaluation bias toward dominant classes.
  4. Model Training: Fine-tuning approaches like LoRA update minimal model parameters (0.3% for BERTimbau) while achieving domain adaptation.
  5. Evaluation Metrics: Performance assessment uses accuracy and macro-F1 scores on the class-balanced test set, enabling fair comparison across models.
  6. Bias Detection: The benchmark identifies systematic classification biases, particularly commercial LLMs’ tendency toward civil law categorization.
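Step 3 can be sketched generically: draw the same number of examples per class for the test split and leave the rest for training. The paper's exact sampling procedure is not described in this article; the code below is one standard approach on toy data.

```python
import random
from collections import defaultdict

def balanced_test_split(examples, n_per_class, seed=0):
    """Sample n_per_class test examples from each label; the rest form the train split."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    test_set, train_set = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        test_set.extend(items[:n_per_class])
        train_set.extend(items[n_per_class:])
    return train_set, test_set

# Toy data: an imbalanced corpus over the five legal areas.
areas = ["civel"] * 50 + ["administrativo"] * 10 + ["criminal"] * 20 \
      + ["tributario"] * 15 + ["constitucional"] * 12
corpus = [{"text": f"doc {i}", "label": a} for i, a in enumerate(areas)]
train_set, test_set = balanced_test_split(corpus, n_per_class=5)
```

Balancing the test split this way means a model cannot inflate its scores simply by favoring the dominant class, which is exactly the failure mode step 6 is designed to expose.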

Benchmarks and evidence

LegalBench-BR evaluation reveals significant performance gaps between domain-adapted and general-purpose models on Brazilian legal classification.

| Model | Accuracy | Macro-F1 | Administrative law F1 | Parameter updates |
|---|---|---|---|---|
| BERTimbau-LoRA | 87.6% | 0.87 | 0.91 | 0.3% of total parameters |
| Claude 3.5 Haiku | Not disclosed | 0.65 (22pp below) | 0.08 | Zero-shot inference |
| GPT-4o mini | Not disclosed | 0.59 (28pp below) | 0.00 | Zero-shot inference |

The benchmark demonstrates that commercial LLMs exhibit systematic bias toward civil law classification, absorbing ambiguous cases rather than discriminating between legal categories. This bias proves most problematic for administrative law, where GPT-4o mini achieves zero F1 score and Claude 3.5 Haiku reaches only 0.08 F1, while the fine-tuned BERTimbau model achieves 0.91 F1 on the same category.
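The gap between accuracy and macro-F1 is easy to reproduce with a toy example. A biased model that labels everything "civel" still posts decent accuracy on an imbalanced sample, but its F1 on the neglected class is zero, which macro-F1 punishes. Here is a pure-Python macro-F1 matching the usual definition (equivalent to scikit-learn's f1_score with average='macro'):

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for a single class, treating that class as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)

labels = ["civel", "administrativo"]
y_true = ["civel"] * 8 + ["administrativo"] * 2
y_pred = ["civel"] * 10  # biased model: every case becomes "civel"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.8 — looks respectable
print(macro_f1(y_true, y_pred, labels))  # collapses, because administrativo F1 is 0
```

This mirrors the pattern in the table above: GPT-4o mini's civil-law bias yields a 0.00 administrative-law F1 and drags its macro-F1 down even where headline accuracy might look plausible.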

Who should care

Builders

Legal AI developers building Portuguese language applications need LegalBench-BR for systematic model evaluation and comparison. The benchmark provides standardized metrics for assessing classification performance across Brazilian legal domains. Developers can use the dataset to fine-tune models for specific legal applications, achieving superior performance with minimal parameter updates through LoRA techniques.

Enterprise

Law firms and legal technology companies operating in Brazil require accurate document classification for case management and legal research systems. LegalBench-BR demonstrates that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal contexts. Enterprise users should prioritize fine-tuned models over commercial APIs for critical legal classification tasks.

End users

Legal professionals and researchers working with Brazilian court documents benefit from improved classification accuracy that domain-adapted models provide. The benchmark reveals significant limitations in commercial LLM performance on Portuguese legal texts. End users should expect better results from specialized legal AI tools rather than general-purpose language models.

Investors

Venture capital and legal technology investors should recognize the performance gap between general-purpose and domain-adapted models in legal AI. LegalBench-BR evidence suggests that specialized legal AI companies may have competitive advantages over general-purpose LLM providers in Brazilian markets. Investment decisions should consider the necessity of domain adaptation for legal applications.

How to use LegalBench-BR today

Researchers and developers can access LegalBench-BR through the released dataset, model, and evaluation pipeline.

  1. Download Dataset: Access the complete dataset of 3,105 annotated appellate proceedings from the official release repository.
  2. Install Dependencies: Set up the evaluation environment with required Python libraries for legal text processing and model training.
  3. Load Pretrained Model: Download the BERTimbau-LoRA checkpoint that achieves 87.6% accuracy on the benchmark test set.
  4. Run Evaluation: Execute the provided evaluation script to reproduce benchmark results and compare new models against established baselines.
  5. Fine-tune Models: Use the training pipeline to adapt new language models on the Brazilian legal classification task.
  6. Submit Results: Contribute new model evaluations to the benchmark leaderboard for community comparison and validation.
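The 0.3% figure cited for LoRA is consistent with a back-of-the-envelope count. Assuming rank-8 adapters on the query and value projections of a BERT-base-sized encoder (12 layers, hidden size 768, roughly 110M parameters), where the rank and target modules are my assumptions rather than details disclosed in the article:

```python
def lora_trainable_params(n_layers, hidden, rank, matrices_per_layer=2):
    """Each adapted d×d weight gets two low-rank factors: A (rank×d) and B (d×rank)."""
    per_matrix = 2 * rank * hidden
    return n_layers * matrices_per_layer * per_matrix

total_params = 110_000_000  # approximate size of a BERT-base-class model
trainable = lora_trainable_params(n_layers=12, hidden=768, rank=8)
fraction = trainable / total_params

print(trainable)          # 294912
print(f"{fraction:.2%}")  # 0.27%, in line with the reported ~0.3%
```

Because only the small adapter factors are trained, fine-tuning fits on modest hardware, and the adapters can be merged back into the base weights so inference costs nothing extra.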

LegalBench-BR vs competitors

LegalBench-BR addresses Brazilian legal classification while existing benchmarks focus on other languages and legal systems.

| Benchmark | Language | Legal system | Task type | Dataset size |
|---|---|---|---|---|
| LegalBench-BR | Portuguese | Brazilian | 5-class classification | 3,105 proceedings |
| LexGLUE | English | US/European | Multi-task evaluation | Various sizes |
| LegalBench | English | US common law | Multi-task reasoning | Various sizes |
| CaseHOLD | English | US federal | Statute classification | Not disclosed |

According to research sources, existing legal benchmarks like LexGLUE and LegalBench primarily evaluate English-language legal tasks, while LegalBench-BR specifically targets Portuguese legal text classification in the Brazilian civil law system. The benchmark fills a critical gap for Portuguese legal NLP evaluation that previous benchmarks did not address.

Risks, limits, and myths

  • Geographic Limitation: Dataset contains only Santa Catarina State Court proceedings, potentially limiting generalization to other Brazilian jurisdictions
  • Temporal Bias: Court proceedings reflect specific time periods that may not represent evolving legal language and precedents
  • Classification Scope: Five-class taxonomy may oversimplify complex legal categorizations that practitioners use in real-world scenarios
  • Annotation Quality: LLM-assisted labeling with heuristic validation may introduce systematic errors despite quality control measures
  • Model Generalization: Fine-tuned performance may not transfer to other Portuguese legal domains outside Brazilian appellate proceedings
  • Commercial Bias: Evaluation focuses on zero-shot commercial LLM performance without exploring few-shot or prompt engineering approaches
  • Resource Requirements: LoRA fine-tuning requires technical expertise and computational resources despite efficiency claims

FAQ

What makes LegalBench-BR different from other legal AI benchmarks?
LegalBench-BR is the first public benchmark specifically designed for Brazilian legal text classification, featuring Portuguese language court proceedings from Santa Catarina State Court across five legal areas.
How accurate are commercial LLMs on Brazilian legal classification?
Commercial LLMs perform significantly worse than domain-adapted models, with GPT-4o mini and Claude 3.5 Haiku achieving 28 and 22 percentage points lower macro-F1 scores respectively compared to BERTimbau-LoRA.
Why do general-purpose LLMs fail on administrative law classification?
Commercial LLMs exhibit systematic bias toward civil law classification, with GPT-4o mini achieving 0.00 F1 score and Claude 3.5 Haiku reaching only 0.08 F1 on administrative law cases.
What is LoRA fine-tuning and why is it effective for legal AI?
LoRA (Low-Rank Adaptation) updates only 0.3% of model parameters while achieving domain adaptation, enabling efficient fine-tuning on consumer GPUs with zero marginal inference cost.
How many legal documents are included in LegalBench-BR?
LegalBench-BR contains 3,105 appellate proceedings from the Santa Catarina State Court, collected via the DataJud API and annotated across five legal areas.
Can I use LegalBench-BR for commercial legal AI applications?
The benchmark is released as an open dataset to enable reproducible research, though specific licensing terms for commercial use are not yet disclosed.
What legal areas does LegalBench-BR cover?
The benchmark covers five legal areas: civil law (civel), administrative law (administrativo), criminal law (criminal), tax law (tributario), and constitutional law (constitucional).
How does LegalBench-BR ensure annotation quality?
The dataset uses LLM-assisted labeling with heuristic validation to ensure consistent and accurate annotation across the 3,105 legal proceedings.
What evaluation metrics does LegalBench-BR use?
The benchmark uses accuracy and macro-F1 scores on a class-balanced test set to ensure fair comparison across models and prevent bias toward dominant legal categories.
Is LegalBench-BR suitable for other Portuguese-speaking countries?
The benchmark focuses specifically on Brazilian legal system and terminology, which may limit direct applicability to other Portuguese-speaking jurisdictions with different legal frameworks.

Glossary

BERTimbau
Portuguese language version of BERT (Bidirectional Encoder Representations from Transformers) specifically trained on Brazilian Portuguese text
DataJud API
Application Programming Interface provided by Brazil’s National Council of Justice (CNJ) for accessing court proceeding data
LoRA (Low-Rank Adaptation)
Parameter-efficient fine-tuning technique that updates only a small percentage of model parameters while maintaining performance
Macro-F1
Evaluation metric that calculates F1 score for each class separately then averages them, giving equal weight to all classes regardless of frequency
TJSC
Santa Catarina State Court (Tribunal de Justiça de Santa Catarina), one of Brazil’s state-level judicial courts
Zero-shot inference
Model evaluation approach where the language model performs classification without any task-specific training examples
Class-balanced test set
Evaluation dataset where each legal category has equal representation to prevent bias toward more frequent classes
Heuristic validation
Quality control process using rule-based methods to verify the accuracy of automated annotations

Download the LegalBench-BR dataset and evaluation pipeline to benchmark your legal AI models on Brazilian Portuguese legal text classification.

Sources

  1. LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence — https://arxiv.org/html/2512.04578
  2. Benchmarking Vietnamese Legal Knowledge of Large Language Models — https://arxiv.org/html/2512.14554v5
  3. PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs — https://arxiv.org/html/2604.17543
  4. From Legal Text to Executable Decision Models: Evaluating Structured Representations for Legal Decision Model Generation — https://arxiv.org/html/2604.17153
  5. Professional Reasoning Benchmark – Legal – Scale Labs — https://labs.scale.com/leaderboard/prbench-legal
  6. LLM Leaderboard 2026 — Compare Top AI Models – Vellum AI — https://www.vellum.ai/llm-leaderboard
  7. The Best Open-Source LLMs in 2026 — https://www.bentoml.com/blog/navigating-the-world-of-open-source-large-language-models

Author

  • siego237

    Writes for FrontierWisdom on AI systems, automation, decentralized identity, and frontier infrastructure, with a focus on turning emerging technology into practical playbooks, implementation roadmaps, and monetization strategies for operators, builders, and consultants.

