LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification, comprising 3,105 appellate proceedings from Santa Catarina State Court across five legal areas.
| Released by | Not yet disclosed |
|---|---|
| Release date | |
| What it is | First public benchmark for evaluating LLMs on Brazilian legal text classification |
| Who it is for | Researchers developing Portuguese legal NLP models |
| Where to get it | Full dataset and model released publicly |
| Price | Free |
- LegalBench-BR contains 3,105 appellate proceedings from Santa Catarina State Court collected via DataJud API
- BERTimbau-LoRA achieves 87.6% accuracy with only 0.3% parameter updates, outperforming commercial LLMs by 22-28 percentage points
- GPT-4o mini and Claude 3.5 Haiku show systematic bias toward civil law classification, failing on administrative law cases
- Fine-tuned, domain-adapted models avoid the classification bias that general-purpose LLMs exhibit on Brazilian legal text
- Dataset covers five legal areas with LLM-assisted labeling and heuristic validation for quality assurance
- Domain-adapted fine-tuning significantly outperforms general-purpose LLMs on Brazilian legal text classification tasks
- Commercial LLMs exhibit systematic classification bias, particularly struggling with administrative law cases
- LoRA fine-tuning provides efficient parameter updates with zero marginal inference cost for legal domain adaptation
- Brazilian legal NLP requires specialized models rather than relying on general-purpose language models
- Public release enables reproducible research in Portuguese legal natural language processing
What is LegalBench-BR
LegalBench-BR is the first public benchmark specifically designed for evaluating large language models on Brazilian legal text classification tasks. The benchmark comprises 3,105 appellate proceedings collected from the Santa Catarina State Court (TJSC) through the DataJud API provided by the National Council of Justice (CNJ). The dataset covers five distinct legal areas and uses LLM-assisted labeling combined with heuristic validation to ensure annotation quality. This benchmark addresses the gap in Portuguese legal NLP evaluation tools, providing researchers with a standardized dataset for developing and testing legal AI models in the Brazilian context.
What is new vs the previous version
LegalBench-BR represents the first benchmark of its kind for Brazilian legal text, with no previous version existing.
| Aspect | Previous State | LegalBench-BR |
|---|---|---|
| Brazilian legal benchmarks | None available publicly | First public benchmark with 3,105 proceedings |
| Data source | No standardized collection | Santa Catarina State Court via DataJud API |
| Legal areas covered | No systematic coverage | Five legal areas with balanced classification |
| Annotation method | Manual annotation only | LLM-assisted labeling with heuristic validation |
| Model evaluation | Ad-hoc testing | Standardized benchmark with reproducible pipeline |
How does LegalBench-BR work
LegalBench-BR operates through a systematic data collection and evaluation pipeline for Brazilian legal text classification.
- Data Collection: Appellate proceedings are collected from Santa Catarina State Court using the DataJud API, ensuring standardized access to legal documents.
- Annotation Process: Legal documents are labeled across five legal areas using LLM-assisted annotation combined with heuristic validation rules for quality control.
- Dataset Preparation: The 3,105 proceedings are organized into a class-balanced test set to ensure fair evaluation across all legal categories.
- Model Evaluation: Language models are tested on classification accuracy and macro-F1 scores, with particular attention to performance across different legal domains.
- Fine-tuning Pipeline: The benchmark includes a complete pipeline for domain adaptation using LoRA (Low-Rank Adaptation) techniques on consumer hardware.
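The evaluation step above can be sketched in plain Python. Macro-F1 averages per-class F1 scores with equal weight, so a model that collapses one legal area into another is penalized even when overall accuracy looks respectable. The labels below are invented toy data for illustration, not the actual benchmark split.

```python
def macro_f1(y_true, y_pred, labels):
    """Average per-class F1, weighting every class equally."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy five-area example: every administrative case is misfiled as civil.
areas = ["civil", "administrative", "criminal", "consumer", "tax"]
y_true = ["civil", "administrative", "criminal", "consumer", "tax",
          "civil", "administrative", "criminal", "consumer", "tax"]
y_pred = ["civil", "civil", "criminal", "consumer", "tax",
          "civil", "civil", "criminal", "consumer", "tax"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}, macro-F1={macro_f1(y_true, y_pred, areas):.2f}")
# accuracy=0.80, macro-F1=0.73
```

Because the administrative class scores F1 = 0 in this toy run, macro-F1 (0.73) falls well below accuracy (0.80); this is exactly why the benchmark's macro-F1 column exposes the administrative-law failures that a headline accuracy number would hide.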
Benchmarks and evidence
LegalBench-BR evaluation reveals significant performance gaps between domain-adapted and general-purpose models on Brazilian legal classification tasks.
| Model | Accuracy | Macro-F1 | Administrative Law F1 | Parameters Updated |
|---|---|---|---|---|
| BERTimbau-LoRA | 87.6% | 0.87 | 0.91 | 0.3% |
| Claude 3.5 Haiku | 65.6% | 0.65 | 0.08 | N/A |
| GPT-4o mini | 59.6% | 0.59 | 0.00 | N/A |
Classification and decision-making tasks are evaluated using accuracy and F1 metrics, following established legal benchmark practice [3]. BERTimbau-LoRA achieves a macro-F1 22 percentage points higher than Claude 3.5 Haiku and 28 percentage points higher than GPT-4o mini. The gap is most pronounced in administrative law, where the commercial LLMs fail almost completely while the fine-tuned model reaches an F1 of 0.91.
Who should care
Builders
AI developers building legal technology solutions for Brazilian markets should prioritize domain-specific fine-tuning over general-purpose LLMs. LegalBench-BR provides the training data and evaluation framework necessary for developing accurate legal classification systems. The LoRA fine-tuning approach enables efficient model adaptation on consumer hardware, making legal AI development more accessible to smaller teams and startups.
Enterprise
Legal firms and corporate legal departments handling Brazilian cases can leverage LegalBench-BR to evaluate and improve their document classification systems. The benchmark reveals that commercial LLMs exhibit systematic bias toward civil law cases, potentially misclassifying administrative and other legal areas. Organizations should invest in domain-adapted models rather than relying solely on general-purpose AI services for legal document processing.
End users
Legal professionals and researchers working with Brazilian court documents benefit from more accurate automated classification systems developed using this benchmark. The improved accuracy in administrative law cases particularly helps practitioners who previously faced poor AI performance in this domain. Citizens accessing legal services may experience better document routing and case categorization as legal tech companies adopt benchmark-validated models.
Investors
Venture capital and legal tech investors should recognize the significant performance advantages of domain-adapted models over general-purpose LLMs in specialized legal markets. The 22-28 percentage point performance gap represents substantial commercial opportunity for companies developing Portuguese legal AI solutions. Investment in legal AI startups should prioritize teams with domain-specific training capabilities rather than those relying solely on commercial LLM APIs.
How to use LegalBench-BR today
Researchers and developers can immediately access LegalBench-BR through its public release, which includes the complete dataset, trained models, and evaluation pipeline.
- Download Dataset: Access the full 3,105 appellate proceedings dataset from the public repository with pre-processed text and annotations.
- Load Pre-trained Model: Use the released BERTimbau-LoRA model for immediate Brazilian legal text classification without additional training.
- Run Evaluation Pipeline: Execute the provided evaluation scripts to reproduce benchmark results and test custom models against the standardized test set.
- Fine-tune Custom Models: Adapt the LoRA training pipeline for specific legal domains or additional Portuguese legal datasets.
- Integrate Classification API: Deploy the trained model as a classification service for legal document processing applications.
LegalBench-BR vs competitors
LegalBench-BR addresses a unique gap in Portuguese legal NLP evaluation, with limited direct competitors in the Brazilian legal domain.
| Benchmark | Language | Legal Domain | Dataset Size | Task Type | Public Access |
|---|---|---|---|---|---|
| LegalBench-BR | Portuguese | Brazilian law | 3,105 proceedings | Classification | Yes |
| LegalBench | English | US law | Multiple tasks | Multi-task | Yes |
| LexGLUE | English | EU/US law | Multiple datasets | Multi-task | Yes |
Existing legal benchmarks focus primarily on English-language legal systems. Recently, a series of legal benchmarks has emerged for evaluating LLM performance, covering retrieval (STARD, LeCaRD), question answering (JEC-QA, Legal CQA), classification (LexGLUE), and reasoning (LegalBench, LexEval) [1]. LegalBench-BR uniquely addresses Portuguese legal text classification, filling a critical gap for Brazilian legal AI development.
Risks, limits, and myths
- Geographic Limitation: Dataset focuses solely on Santa Catarina State Court, potentially limiting generalizability to other Brazilian jurisdictions with different legal practices.
- Classification Scope: Benchmark covers only five legal areas, excluding specialized domains like tax law, environmental law, or intellectual property that may require different classification approaches.
- Temporal Bias: Legal proceedings reflect specific time periods and may not capture evolving legal language or recent legislative changes affecting classification accuracy.
- Annotation Quality: LLM-assisted labeling with heuristic validation may introduce systematic errors despite quality control measures, affecting benchmark reliability.
- Hardware Requirements: While LoRA enables consumer GPU training, achieving optimal performance still requires significant computational resources for large-scale deployment.
- Myth: General LLMs Sufficient: The benchmark's results counter the assumption that commercial LLMs can handle specialized legal classification without domain adaptation, at least on this dataset.
FAQ
What is LegalBench-BR and why is it important?
LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification, comprising 3,105 appellate proceedings from Santa Catarina State Court. It addresses the critical gap in Portuguese legal NLP evaluation tools, enabling researchers to develop and test legal AI models specifically for the Brazilian legal system.
How does BERTimbau-LoRA compare to GPT-4o mini on Brazilian legal text?
BERTimbau-LoRA achieves 87.6% accuracy and 0.87 macro-F1, outperforming GPT-4o mini by 28 percentage points while updating only 0.3% of model parameters. The performance gap is most striking in administrative law, where GPT-4o mini scores F1 = 0.00 compared to BERTimbau-LoRA’s F1 = 0.91.
What legal areas does LegalBench-BR cover?
LegalBench-BR covers five legal areas from Brazilian appellate proceedings, with particular focus on administrative law, civil law, and other major legal domains represented in Santa Catarina State Court cases. The dataset uses class-balanced testing to ensure fair evaluation across all categories.
Can I use LegalBench-BR for commercial legal AI applications?
Yes, LegalBench-BR is publicly released with the full dataset, trained models, and evaluation pipeline available for both research and commercial use. The benchmark enables development of accurate legal document classification systems for Brazilian legal technology applications.
Why do commercial LLMs perform poorly on Brazilian legal classification?
Commercial LLMs exhibit systematic bias toward civil law classification, defaulting ambiguous cases to civil law rather than discriminating between legal areas. They particularly struggle with administrative law, where GPT-4o mini and Claude 3.5 Haiku achieve near-zero F1 scores due to the lack of domain-specific training on Brazilian legal text.
What is LoRA fine-tuning and why is it effective for legal AI?
LoRA (Low-Rank Adaptation) fine-tuning updates only a small percentage of model parameters while achieving significant performance improvements. BERTimbau-LoRA updates just 0.3% of parameters yet achieves 87.6% accuracy on Brazilian legal classification, providing efficient domain adaptation with zero marginal inference cost.
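As a rough sanity check on the 0.3% figure, the arithmetic below assumes a BERT-base-sized encoder (roughly 110M parameters, hidden size 768, 12 layers) with rank-8 LoRA adapters on the query and value projections. All of these dimensions are assumptions for illustration, not the paper's reported configuration.

```python
# Hypothetical LoRA parameter budget for a BERT-base-sized encoder.
# Every dimension here is an assumption, not the paper's actual setup.
hidden = 768                 # hidden size of a BERT-base-style model
layers = 12                  # transformer layers
rank = 8                     # assumed LoRA rank
total_params = 110_000_000   # rough BERT-base parameter count

# A rank-r adapter on a (hidden x hidden) weight adds two factor
# matrices: A with shape (r, hidden) and B with shape (hidden, r).
per_matrix = 2 * hidden * rank

# Adapters on the query and value projections of every layer.
trainable = layers * 2 * per_matrix

print(f"trainable adapter params: {trainable:,}")
print(f"fraction of full model: {trainable / total_params:.2%}")
```

Under these assumptions the adapters amount to roughly 295K trainable parameters, on the order of 0.3% of the model, which is consistent with the figure quoted above. Because the low-rank update can be merged back into the frozen weights (W' = W + BA) before serving, the adapted model runs at the same speed as the base model, matching the "zero marginal inference cost" claim.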
How was the LegalBench-BR dataset created and validated?
The dataset comprises 3,105 appellate proceedings collected from Santa Catarina State Court via the DataJud API from the National Council of Justice. Legal documents were annotated using LLM-assisted labeling combined with heuristic validation rules to ensure annotation quality and consistency across five legal areas.
What hardware requirements are needed to use LegalBench-BR?
The benchmark includes a complete pipeline designed for consumer GPU training using LoRA fine-tuning techniques. While specific hardware requirements are not disclosed, the approach enables legal AI development on accessible hardware rather than requiring enterprise-grade computational resources.
How does LegalBench-BR compare to other legal AI benchmarks?
LegalBench-BR uniquely addresses Portuguese legal text classification, while existing benchmarks like LegalBench and LexGLUE focus on English-language legal systems. It fills a critical gap for Brazilian legal AI development, providing the first standardized evaluation framework for Portuguese legal NLP models.
What are the limitations of LegalBench-BR for legal AI development?
The benchmark focuses solely on Santa Catarina State Court proceedings, potentially limiting generalizability to other Brazilian jurisdictions. It covers only five legal areas and may not capture specialized domains like tax or environmental law that require different classification approaches.
Glossary
- BERTimbau
- Portuguese-language BERT model specifically trained on Brazilian Portuguese text, optimized for natural language processing tasks in Portuguese legal and general domains.
- DataJud API
- Application Programming Interface provided by Brazil’s National Council of Justice (CNJ) for accessing standardized legal data from Brazilian courts.
- LoRA (Low-Rank Adaptation)
- Parameter-efficient fine-tuning technique that updates only a small percentage of model parameters while achieving significant performance improvements on domain-specific tasks.
- Macro-F1
- Evaluation metric that calculates F1 score for each class independently and averages them, providing equal weight to all classes regardless of their frequency in the dataset.
- Santa Catarina State Court (TJSC)
- State-level judicial court in Santa Catarina, Brazil, serving as the data source for LegalBench-BR’s 3,105 appellate proceedings across five legal areas.
- Systematic Bias
- Consistent pattern where models incorrectly favor certain classifications, such as commercial LLMs absorbing ambiguous legal cases into civil law rather than discriminating between legal areas.
Sources
- LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence — https://arxiv.org/html/2512.04578
- Benchmarking Vietnamese Legal Knowledge of Large Language Models — https://arxiv.org/html/2512.14554v5
- PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs — https://arxiv.org/html/2604.17543
- LLM Leaderboard 2026 — Compare Top AI Models – Vellum AI — https://www.vellum.ai/llm-leaderboard
- From Legal Text to Executable Decision Models: Evaluating Structured Representations for Legal Decision Model Generation — https://arxiv.org/html/2604.17153
- Professional Reasoning Benchmark – Legal – Scale Labs — https://labs.scale.com/leaderboard/prbench-legal
- Large language model – Wikipedia — https://en.wikipedia.org/wiki/Large_language_model