Frontier Signal

IndiaFinBench: New LLM Benchmark for Indian Financial Regulations

IndiaFinBench introduces 406 expert-annotated question-answer pairs from SEBI and RBI documents to evaluate large language model performance on Indian financial regulatory text.


IndiaFinBench is the first publicly available evaluation benchmark for assessing large language model performance on Indian financial regulatory text. The benchmark contains 406 expert-annotated question-answer pairs from SEBI and RBI documents across four task types.

Released by: Not yet disclosed
Release date: Not yet disclosed
What it is: Evaluation benchmark for LLM performance on Indian financial regulatory text
Who it's for: AI researchers and financial technology developers
Where to get it: https://github.com/rajveerpall/IndiaFinBench
Price: Free (open source)
  • IndiaFinBench addresses a gap in existing financial NLP benchmarks, which draw exclusively on Western financial corpora
  • The benchmark includes 406 expert-annotated question-answer pairs from 192 SEBI and RBI documents across four task types
  • Twelve models were evaluated under zero-shot conditions, with overall accuracy ranging from 70.4% to 89.7%
  • Numerical reasoning proved both the most challenging and the most discriminative task, with a 35.9 percentage-point spread across models
  • All evaluated models substantially outperformed a non-specialist human baseline of 60.0%
  • Annotation quality is validated through a model-based secondary pass and a human inter-annotator agreement evaluation
  • Bootstrap significance testing reveals three statistically distinct performance tiers among the evaluated models

What is IndiaFinBench

IndiaFinBench is an evaluation benchmark designed to assess large language model performance on Indian financial regulatory text. The benchmark addresses a significant gap in existing financial NLP evaluation tools that draw exclusively from Western financial corpora such as SEC filings and US earnings reports.

The benchmark contains 406 expert-annotated question-answer pairs sourced from 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI). These documents represent the primary regulatory frameworks governing India’s financial sector.

What is new vs previous benchmarks

IndiaFinBench introduces the first publicly available benchmark specifically for non-Western financial regulatory frameworks.

Compared feature by feature (previous financial benchmarks vs IndiaFinBench):

  • Geographic focus: Western financial corpora only vs the Indian regulatory framework
  • Document sources: SEC filings and US earnings reports vs SEBI and RBI regulatory documents
  • Task diversity: limited task types vs four distinct task types
  • Annotation validation: varies by benchmark vs a model-based secondary pass plus human agreement evaluation
  • Statistical rigor: basic accuracy reporting vs bootstrap significance testing with 10,000 resamples

How does IndiaFinBench work

IndiaFinBench evaluates models across four distinct task types that capture different aspects of financial regulatory comprehension.

  1. Regulatory interpretation: 174 items testing understanding of regulatory requirements and compliance frameworks
  2. Numerical reasoning: 92 items requiring mathematical calculations and quantitative analysis of financial data
  3. Contradiction detection: 62 items identifying inconsistencies within regulatory text
  4. Temporal reasoning: 78 items involving time-based regulatory requirements and deadlines

Annotation quality is validated through a model-based secondary pass achieving kappa=0.918 on contradiction detection tasks. A 60-item human inter-annotator agreement evaluation achieved kappa=0.611 with 76.7% overall agreement.
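For intuition, Cohen's kappa measures agreement between two annotators after discounting agreement expected by chance. The sketch below is not the benchmark's own code; the label values and function name are illustrative:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label lists of equal length."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items where the annotators match.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label distribution.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators labeling 10 contradiction-detection items.
ann1 = ["yes", "yes", "no", "no", "yes", "no", "no", "yes", "no", "no"]
ann2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "no", "no"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.583
```

Here the annotators agree on 8 of 10 items (0.8 raw agreement), but because both label "no" more often, chance alone predicts 0.52 agreement, so kappa lands well below the raw figure. The same correction is why IndiaFinBench's 76.7% raw agreement corresponds to a kappa of only 0.611.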

Benchmarks and evidence

Twelve models were evaluated under zero-shot conditions, revealing significant performance variations across different model architectures.

  • Gemini 2.5 Flash: 89.7% overall accuracy, Tier 1 (best evaluated model) [1]
  • Gemma 4 E4B: 70.4% overall accuracy, Tier 3 (weakest evaluated model) [1]
  • Non-specialist human baseline: 60.0%, below every evaluated model [1]
  • Numerical reasoning: 35.9 percentage-point spread, the most discriminative task [1]

Bootstrap significance testing with 10,000 resamples reveals three statistically distinct performance tiers among the evaluated models. Numerical reasoning emerged as the most discriminative task type with the largest performance spread.
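The paper's exact procedure is not reproduced in the sources, but a paired bootstrap over per-item correctness is the standard way to test whether one model's accuracy lead over another is statistically meaningful. A minimal sketch under that assumption (function name and data layout are illustrative):

```python
import random

def bootstrap_p(correct_a, correct_b, resamples=10_000, seed=0):
    """One-sided p-value for 'model A is more accurate than model B'.

    correct_a, correct_b: parallel lists of 0/1 per-item correctness,
    scored on the same benchmark items. Resamples items with replacement
    and counts how often A's lead disappears.
    """
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    losses = 0
    for _ in range(resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in idx)
        if diff <= 0:  # a resample where A fails to beat B
            losses += 1
    return losses / resamples

# Toy data: model A at 80% vs model B at 60% on 100 shared items.
correct_a = [1] * 80 + [0] * 20
correct_b = [1] * 60 + [0] * 40
print(bootstrap_p(correct_a, correct_b))  # near zero: the gap is unlikely to be chance
```

Grouping models into tiers then amounts to merging models whose pairwise p-values exceed the chosen significance threshold.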

Who should care

Builders

AI developers building financial applications for Indian markets need IndiaFinBench to evaluate model performance on regulatory compliance tasks. The benchmark provides standardized evaluation metrics for models processing SEBI and RBI documentation.

Enterprise

Financial institutions operating in India require models that accurately interpret regulatory requirements for compliance automation. IndiaFinBench enables systematic evaluation of LLM capabilities for regulatory text processing and risk assessment.

End users

Financial advisors and compliance professionals can use IndiaFinBench results to understand model limitations when processing Indian regulatory documents. The benchmark helps identify which models perform best for specific regulatory interpretation tasks.

Investors

Venture capital and fintech investors can use IndiaFinBench performance data to evaluate AI startups targeting Indian financial markets. The benchmark provides objective metrics for assessing regulatory AI capabilities.

How to use IndiaFinBench today

IndiaFinBench is immediately available through its open-source repository with complete evaluation code and model outputs.

  1. Clone the repository: git clone https://github.com/rajveerpall/IndiaFinBench
  2. Install dependencies from the requirements file
  3. Load your model using the provided evaluation framework
  4. Run evaluation across the four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
  5. Compare results against the baseline performance metrics for twelve evaluated models
  6. Use bootstrap significance testing code to determine statistical significance of performance differences
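Steps 4 and 5 can be sketched as follows, assuming the repository ships the QA pairs as a JSONL file. The file path, field names (task_type, question, answer), and exact-match scoring here are illustrative guesses, not the repository's documented schema:

```python
import json

def load_items(path):
    """Read benchmark items from a JSONL file, one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def accuracy_by_task(items, predict):
    """Score a predict(question) callable per task type via exact match."""
    totals, hits = {}, {}
    for item in items:
        task = item["task_type"]
        totals[task] = totals.get(task, 0) + 1
        if predict(item["question"]).strip() == item["answer"].strip():
            hits[task] = hits.get(task, 0) + 1
    return {task: hits.get(task, 0) / n for task, n in totals.items()}

# In-memory demo; a real run would call load_items() on the dataset file
# and wire predict() to an actual model API.
demo = [
    {"task_type": "numerical_reasoning", "question": "2 + 2?", "answer": "4"},
    {"task_type": "numerical_reasoning", "question": "3 + 3?", "answer": "6"},
    {"task_type": "temporal_reasoning", "question": "Filing deadline?", "answer": "30 days"},
]
answers = {"2 + 2?": "4", "3 + 3?": "7", "Filing deadline?": "30 days"}
print(accuracy_by_task(demo, lambda q: answers[q]))
```

Per-task breakdowns matter here because overall accuracy hides the numerical-reasoning spread that the benchmark found most discriminative.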

IndiaFinBench vs competitors

IndiaFinBench addresses geographic and regulatory gaps not covered by existing financial benchmarks.

  • IndiaFinBench: Indian regulatory framework; SEBI and RBI documents; four task types; model-based plus human validation
  • FinanceBench: US financial markets; SEC filings and earnings reports; question answering; validation method not yet disclosed
  • LawBench: general legal text; legal documents; multiple legal tasks; expert annotation

Risks, limits, and myths

  • Limited scope: Benchmark focuses only on SEBI and RBI documents, not covering other Indian financial regulators
  • Zero-shot evaluation: Models were not fine-tuned on Indian regulatory text, potentially underestimating specialized model performance
  • Language limitation: Evaluation conducted in English only, not covering regional language regulatory documents
  • Temporal coverage: Benchmark reflects regulatory frameworks as of the dataset creation date, not future regulatory changes
  • Human baseline: Non-specialist human performance may not reflect expert regulatory professional capabilities
  • Model selection: Twelve evaluated models may not represent the full spectrum of available LLM architectures

FAQ

What makes IndiaFinBench different from other financial AI benchmarks?

IndiaFinBench is the first publicly available benchmark specifically designed for Indian financial regulatory text, addressing gaps in existing benchmarks that focus exclusively on Western financial corpora.

How many question-answer pairs does IndiaFinBench contain?

IndiaFinBench contains 406 expert-annotated question-answer pairs sourced from 192 documents from SEBI and RBI.

What are the four task types in IndiaFinBench evaluation?

The four task types are regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items).

Which model performed best on IndiaFinBench?

Gemini 2.5 Flash achieved the highest accuracy at 89.7% under zero-shot evaluation conditions.

How was annotation quality validated in IndiaFinBench?

Annotation quality was validated through a model-based secondary pass (kappa=0.918 on contradiction detection) and a 60-item human inter-annotator agreement evaluation (kappa=0.611).

What was the human baseline performance on IndiaFinBench?

The non-specialist human baseline achieved 60.0% accuracy, which all twelve evaluated models substantially outperformed.

Which task type showed the largest performance variation across models?

Numerical reasoning was the most discriminative task, showing a 35.9 percentage-point spread across different models.

Is IndiaFinBench available for commercial use?

IndiaFinBench is available as open source through GitHub, though specific licensing terms are not yet disclosed in the provided sources.

How many statistical performance tiers were identified in the evaluation?

Bootstrap significance testing with 10,000 resamples revealed three statistically distinct performance tiers among the evaluated models.

What regulatory bodies are covered in IndiaFinBench documents?

IndiaFinBench draws from documents issued by the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).

Glossary

SEBI
Securities and Exchange Board of India, the regulatory authority for securities markets in India
RBI
Reserve Bank of India, the central banking institution and monetary authority of India
Zero-shot evaluation
Testing model performance on tasks without prior training or fine-tuning on similar examples
Bootstrap significance testing
Statistical method using resampling to determine if performance differences between models are statistically significant
Kappa coefficient
Statistical measure of inter-rater agreement that accounts for agreement occurring by chance
Contradiction detection
Task type requiring identification of inconsistent or conflicting statements within regulatory text
Temporal reasoning
Cognitive task involving understanding and processing time-based relationships and sequences

Visit the IndiaFinBench GitHub repository at https://github.com/rajveerpall/IndiaFinBench to access the dataset, evaluation code, and model outputs for your own LLM evaluation projects.

Sources

  1. IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text. arXiv:2604.19298v1.

Author

  • siego237

    Writes for FrontierWisdom on AI systems, automation, decentralized identity, and frontier infrastructure, with a focus on turning emerging technology into practical playbooks, implementation roadmaps, and monetization strategies for operators, builders, and consultants.

