LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification, comprising 3,105 appellate proceedings from Santa Catarina State Court across five legal areas.
| Attribute | Detail |
|---|---|
| Released by | Not yet disclosed |
| Release date | Not yet disclosed |
| What it is | First public benchmark for Brazilian legal text classification |
| Who it is for | Legal AI researchers and Portuguese NLP developers |
| Where to get it | Full dataset and model released publicly |
| Price | Free |
- LegalBench-BR contains 3,105 appellate proceedings from Santa Catarina State Court collected via DataJud API
- BERTimbau-LoRA achieves 87.6% accuracy by updating only 0.3% of model parameters
- Commercial LLMs like GPT-4o mini and Claude 3.5 Haiku perform poorly on administrative law classification
- Fine-tuned models eliminate systematic bias toward civil law that affects general-purpose LLMs
- Complete dataset, model, and pipeline are publicly released for reproducible research
- Domain-adapted fine-tuning significantly outperforms general-purpose LLMs on Brazilian legal classification tasks
- Commercial LLMs exhibit systematic bias toward civil law categories when classifying ambiguous legal texts
- LoRA fine-tuning provides substantial performance gains while updating minimal model parameters
- Administrative law classification represents the most challenging category for general-purpose models
- Brazilian legal NLP requires specialized datasets and models rather than relying on general-purpose solutions
What is LegalBench-BR
LegalBench-BR is the first public benchmark specifically designed for evaluating large language models on Brazilian legal text classification tasks. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC) collected through the DataJud API from Brazil’s National Council of Justice (CNJ).
The benchmark covers five distinct legal areas, with labels assigned through LLM-assisted annotation and heuristic validation. Each proceeding is classified into one of the five categories, each representing a different area of Brazilian law, creating a comprehensive evaluation framework for Portuguese legal natural language processing.
The dataset addresses a critical gap in legal AI evaluation for Portuguese-speaking jurisdictions. Unlike existing legal benchmarks that focus primarily on English or other languages, LegalBench-BR provides researchers with authentic Brazilian court documents for developing and testing legal AI systems.
What is new vs previous benchmarks
LegalBench-BR introduces the first Portuguese-language legal classification benchmark, filling a significant gap in multilingual legal AI evaluation.
| Feature | LegalBench-BR | Previous Legal Benchmarks |
|---|---|---|
| Language | Portuguese (Brazilian) | Primarily English |
| Data Source | Real court proceedings via DataJud API | Various legal texts |
| Task Focus | 5-class legal area classification | Multiple legal tasks |
| Dataset Size | 3,105 appellate proceedings | Varies by benchmark |
| Geographic Scope | Brazilian legal system | Various jurisdictions |
The benchmark specifically targets Brazilian legal classification, unlike general legal benchmarks such as LegalBench or LexGLUE that focus on broader legal intelligence tasks. LegalBench-BR provides domain-specific evaluation for Portuguese legal NLP applications.
How does LegalBench-BR work
LegalBench-BR operates through a systematic evaluation framework that tests model performance on Brazilian legal text classification across five categories.
- Data Collection: Appellate proceedings are gathered from Santa Catarina State Court through the official DataJud API maintained by Brazil’s National Council of Justice.
- Annotation Process: Legal texts undergo LLM-assisted labeling with heuristic validation to ensure accurate classification across five legal areas.
- Class Balancing: The test set maintains balanced representation across all five legal categories to prevent evaluation bias.
- Model Evaluation: Systems are tested using accuracy and macro-F1 scores on the class-balanced test set.
- Performance Analysis: Results reveal systematic biases and failure modes in different model types, particularly commercial LLMs versus fine-tuned models.
The evaluation methodology exposes critical differences between general-purpose and domain-adapted models. Commercial LLMs demonstrate systematic bias toward civil law categories, while fine-tuned models achieve more balanced classification across all legal areas.
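The scoring step above can be sketched with scikit-learn on toy predictions. The five label names below are illustrative assumptions: the sources confirm only civil and administrative law among the benchmark's categories.

```python
# Minimal sketch of the benchmark's two metrics, accuracy and macro-F1,
# computed on a class-balanced toy test set (two proceedings per area).
# Label names other than "civil" and "administrative" are assumptions.
from sklearn.metrics import accuracy_score, f1_score

labels = ["civil", "administrative", "criminal", "tax", "labor"]

y_true = labels * 2  # balanced ground truth: each area appears twice
y_pred = ["civil", "civil", "criminal", "tax", "labor",
          "civil", "administrative", "civil", "tax", "labor"]

acc = accuracy_score(y_true, y_pred)
macro = f1_score(y_true, y_pred, average="macro")
print(f"accuracy={acc:.2f}  macro-F1={macro:.2f}")  # → accuracy=0.80  macro-F1=0.80
```

Because macro-F1 averages per-class F1 scores with equal weight, it punishes a model that mislabels a whole category even when overall accuracy looks acceptable, which is exactly the failure mode the benchmark surfaces in commercial LLMs.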
Benchmarks and evidence
LegalBench-BR evaluation reveals significant performance gaps between domain-adapted and general-purpose language models on Brazilian legal classification tasks.
| Model | Accuracy | Macro-F1 | Administrative Law F1 | Parameters Updated |
|---|---|---|---|---|
| BERTimbau-LoRA | 87.6% | 0.87 | 0.91 | 0.3% |
| Claude 3.5 Haiku | Not yet disclosed | 0.65 | 0.08 | N/A |
| GPT-4o mini | Not yet disclosed | 0.59 | 0.00 | N/A |
BERTimbau-LoRA achieves 87.6% accuracy and 0.87 macro-F1 score while updating only 0.3% of model parameters. The fine-tuned model outperforms Claude 3.5 Haiku by 22 percentage points and GPT-4o mini by 28 percentage points in macro-F1 score.
Administrative law classification represents the most challenging category for commercial LLMs. GPT-4o mini scores F1 = 0.00 on administrative law cases, while Claude 3.5 Haiku achieves only F1 = 0.08, compared to BERTimbau-LoRA’s F1 = 0.91.
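The arithmetic behind that gap is straightforward: macro-F1 is the unweighted mean of per-class F1 scores, so a single collapsed class caps the overall score. In the sketch below, only the administrative-law value (0.00) is a reported figure; the other per-class scores are illustrative assumptions chosen to reproduce GPT-4o mini's reported 0.59 macro-F1.

```python
# Macro-F1 as an unweighted mean of per-class F1 scores. Only the
# administrative-law value (0.00) is a reported figure; the other
# per-class scores are illustrative assumptions.
def macro_f1(per_class):
    return sum(per_class.values()) / len(per_class)

illustrative = {"civil": 0.74, "administrative": 0.00,
                "criminal": 0.74, "tax": 0.74, "labor": 0.73}
print(f"macro-F1 = {macro_f1(illustrative):.2f}")  # → macro-F1 = 0.59
```

Even a model scoring roughly 0.74 on four of five classes cannot exceed 0.59 macro-F1 while one class sits at zero, which is why the administrative-law collapse dominates the commercial LLM results.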
Who should care
Builders
AI developers building legal technology for Brazilian markets need LegalBench-BR to evaluate Portuguese legal NLP systems. The benchmark provides essential performance metrics for legal classification tasks using authentic court documents.
Machine learning engineers can use the dataset to fine-tune models for Brazilian legal applications. The LoRA fine-tuning approach demonstrates how to achieve high performance while updating minimal parameters on consumer hardware.
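The parameter-efficiency idea can be sketched in plain PyTorch: freeze a pretrained weight matrix and train only a small low-rank update beside it. This is a generic illustration of the LoRA technique, not the paper's training code; the rank, scaling, and 768-dimensional projection are assumed values.

```python
# Generic LoRA sketch in plain PyTorch (not the paper's code): the
# pretrained layer stays frozen; only low-rank matrices A and B train,
# so the effective weight becomes W + (alpha/r) * B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B=0: adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# One BERT-sized projection (768x768 is an assumed dimension):
layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable}/{total} = {trainable/total:.1%}")
```

For a single layer the trainable fraction is about 2%; across a full model, where embeddings and most layers carry no adapters, the overall fraction drops much lower, consistent in spirit with the 0.3% reported for BERTimbau-LoRA.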
Enterprise
Legal technology companies serving Brazilian clients require domain-specific evaluation frameworks for their AI systems. LegalBench-BR enables rigorous testing of legal classification accuracy before deployment in production environments.
Law firms and legal service providers can benchmark AI tools against established performance standards. The evaluation reveals that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification tasks.
End users
Legal professionals working with Brazilian court documents benefit from improved AI classification systems trained and evaluated on LegalBench-BR. The benchmark ensures AI tools understand Portuguese legal terminology and classification requirements.
Researchers studying Portuguese legal texts gain access to a standardized evaluation framework. The publicly released dataset enables reproducible research in Brazilian legal natural language processing.
Investors
Venture capital firms evaluating legal AI startups can use LegalBench-BR performance as a technical due diligence metric. The benchmark provides objective comparison standards for Brazilian legal technology solutions.
Investment decisions in Portuguese legal AI can be informed by standardized performance metrics. The significant performance gaps between model types highlight the importance of domain-specific approaches in legal AI markets.
How to use LegalBench-BR today
LegalBench-BR provides immediate access to the dataset, pre-trained model, and evaluation pipeline for Brazilian legal AI development.
- Download Dataset: Access the complete dataset of 3,105 appellate proceedings with annotations across five legal categories from the public release.
- Load Pre-trained Model: Use the released BERTimbau-LoRA model that achieves 87.6% accuracy on the benchmark classification task.
- Run Evaluation Pipeline: Execute the provided evaluation scripts to test custom models against the class-balanced test set using accuracy and macro-F1 metrics.
- Fine-tune Custom Models: Apply LoRA fine-tuning techniques to adapt models for Brazilian legal classification using the training data and methodology.
- Compare Performance: Benchmark custom solutions against established baselines including BERTimbau-LoRA, Claude 3.5 Haiku, and GPT-4o mini results.
The complete pipeline enables researchers to reproduce results and develop improved models for Portuguese legal text classification. All components are publicly available for immediate use in legal AI research and development.
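When comparing models in the final step, the civil-law bias reported for commercial LLMs can be spotted with a simple sanity check: on a class-balanced test set, each label should receive roughly one fifth of all predictions. A hypothetical helper, with assumed label names:

```python
# Hypothetical bias check for a class-balanced test set: each label
# should receive roughly 1/len(LABELS) of the predictions. Label names
# are assumptions for illustration.
from collections import Counter

LABELS = ["civil", "administrative", "criminal", "tax", "labor"]

def prediction_shares(preds, labels=LABELS):
    counts = Counter(preds)
    return {lab: counts.get(lab, 0) / len(preds) for lab in labels}

# A biased model funnels ambiguous cases into "civil":
biased = ["civil"] * 6 + ["criminal", "tax", "labor", "civil"]
shares = prediction_shares(biased)
flagged = [lab for lab, s in shares.items() if s > 2 / len(LABELS)]
print("over-predicted:", flagged)  # → over-predicted: ['civil']
```

A flag on "civil" paired with a near-zero share for another label reproduces exactly the absorption pattern the benchmark reports for GPT-4o mini on administrative law.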
LegalBench-BR vs competitors
LegalBench-BR addresses Portuguese legal classification while existing benchmarks focus on different languages and broader legal tasks.
| Benchmark | Language | Task Type | Dataset Size | Geographic Focus |
|---|---|---|---|---|
| LegalBench-BR | Portuguese | 5-class legal classification | 3,105 proceedings | Brazilian courts |
| LegalBench | English | Multi-task legal reasoning | Various tasks | US legal system |
| LexGLUE | English | Multiple legal NLP tasks | Various datasets | Multiple jurisdictions |
| LexGenius | English | Expert-level legal intelligence | Not yet disclosed | General legal domains |
LegalBench-BR fills a critical gap in Portuguese legal AI evaluation that existing English-focused benchmarks cannot address. The benchmark provides authentic Brazilian court data rather than synthetic or translated legal texts.
Unlike multi-task benchmarks such as LegalBench and LexGLUE, LegalBench-BR focuses specifically on classification performance within Brazilian legal categories. This targeted approach enables precise evaluation of domain-specific legal AI systems.
Risks, limits, and myths
- Geographic Limitation: Dataset contains only Santa Catarina State Court proceedings, potentially limiting generalization to other Brazilian states or federal courts.
- Classification Scope: Benchmark covers only five legal areas, which may not represent the full complexity of Brazilian legal classification needs.
- Temporal Bias: Court proceedings reflect specific time periods and may not capture evolving legal language or recent legislative changes.
- Annotation Quality: LLM-assisted labeling with heuristic validation may introduce systematic errors or inconsistencies in ground truth labels.
- Model Generalization: Performance on LegalBench-BR may not predict success on other Portuguese legal tasks or different court systems.
- Commercial LLM Myth: the results debunk the assumption that general-purpose commercial LLMs can handle specialized legal classification effectively without domain adaptation.
- Full Retraining Myth: LoRA fine-tuning reaches high performance while updating only 0.3% of parameters, debunking the belief that domain adaptation requires full model retraining.
- Language Transfer Myth: English legal benchmarks cannot substitute for Portuguese-specific evaluation, despite surface similarities between legal domains across languages.
FAQ
- What is LegalBench-BR and why was it created?
- LegalBench-BR is the first public benchmark for evaluating large language models on Brazilian legal text classification, created to address the lack of Portuguese legal AI evaluation frameworks.
- How many legal documents does LegalBench-BR contain?
- LegalBench-BR contains 3,105 appellate proceedings from the Santa Catarina State Court collected through the DataJud API from Brazil’s National Council of Justice.
- Which model performs best on LegalBench-BR classification tasks?
- BERTimbau-LoRA achieves the highest performance with 87.6% accuracy and 0.87 macro-F1 score while updating only 0.3% of model parameters through fine-tuning.
- How do commercial LLMs perform on Brazilian legal classification?
- Commercial LLMs perform poorly, with Claude 3.5 Haiku achieving 0.65 macro-F1 and GPT-4o mini achieving 0.59 macro-F1, significantly below the fine-tuned model’s 0.87.
- What legal areas does LegalBench-BR cover for classification?
- LegalBench-BR covers five legal areas for classification, though the specific categories are not detailed in the available sources beyond administrative and civil law.
- Why do general-purpose LLMs struggle with administrative law classification?
- General-purpose LLMs exhibit systematic bias toward civil law categories, with GPT-4o mini scoring F1 = 0.00 and Claude 3.5 Haiku scoring F1 = 0.08 on administrative law.
- Is LegalBench-BR dataset publicly available for research?
- Yes, the complete LegalBench-BR dataset, model, and evaluation pipeline are publicly released to enable reproducible research in Portuguese legal natural language processing.
- Can LegalBench-BR results generalize to other Portuguese legal systems?
- Generalization may be limited since the dataset contains only Santa Catarina State Court proceedings, potentially not representing other Brazilian states or federal courts.
- What evaluation metrics does LegalBench-BR use for model performance?
- LegalBench-BR evaluates models using accuracy and macro-F1 scores on a class-balanced test set to ensure fair comparison across all five legal categories.
- How does LoRA fine-tuning achieve high performance with minimal parameter updates?
- LoRA fine-tuning updates only 0.3% of BERTimbau model parameters while achieving 87.6% accuracy, demonstrating parameter-efficient adaptation for domain-specific legal tasks.
- What systematic bias do commercial LLMs show in legal classification?
- Commercial LLMs exhibit systematic bias toward civil law categories, absorbing ambiguous classes rather than discriminating them, which domain-adapted fine-tuning eliminates.
- Why can’t general-purpose LLMs substitute for domain-adapted models in legal tasks?
- Results demonstrate that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification, even for simple 5-class problems, due to systematic biases and poor performance.
Glossary
- BERTimbau
- Portuguese-language BERT model specifically trained on Brazilian Portuguese texts for natural language processing tasks
- DataJud API
- Official application programming interface from Brazil’s National Council of Justice for accessing court proceeding data
- LoRA (Low-Rank Adaptation)
- Parameter-efficient fine-tuning technique that updates only a small percentage of model parameters while maintaining performance
- Macro-F1
- Evaluation metric that calculates F1 score for each class separately then averages them, giving equal weight to all classes regardless of frequency
- TJSC
- Santa Catarina State Court (Tribunal de Justiça de Santa Catarina), the source of legal proceedings in the LegalBench-BR dataset
- CNJ
- National Council of Justice (Conselho Nacional de Justiça), Brazil’s judicial oversight body that maintains the DataJud API
- Appellate Proceedings
- Legal cases that have been appealed to a higher court for review of a lower court’s decision
- Heuristic Validation
- Rule-based checking method used to verify the accuracy of LLM-assisted annotations in the dataset creation process
Sources
- LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence — https://arxiv.org/html/2512.04578
- Benchmarking Vietnamese Legal Knowledge of Large Language Models — https://arxiv.org/html/2512.14554v5
- PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs — https://arxiv.org/html/2604.17543
- From Legal Text to Executable Decision Models: Evaluating Structured Representations for Legal Decision Model Generation — https://arxiv.org/html/2604.17153
- Professional Reasoning Benchmark – Legal – Scale Labs — https://labs.scale.com/leaderboard/prbench-legal
- LLM Leaderboard 2026 — Compare Top AI Models – Vellum AI — https://www.vellum.ai/llm-leaderboard
- The Best Open-Source LLMs in 2026 — https://www.bentoml.com/blog/navigating-the-world-of-open-source-large-language-models