Frontier Signal

IndiaFinBench: First LLM Benchmark for Indian Financial Text

IndiaFinBench introduces the first evaluation benchmark for large language models on Indian financial regulatory text, featuring 406 expert-annotated questions from SEBI and RBI documents.


IndiaFinBench is the first publicly available evaluation benchmark for assessing large language model performance on Indian financial regulatory text. The benchmark contains 406 expert-annotated question-answer pairs from SEBI and RBI documents across four task types.

Released by: Not yet disclosed
Release date: Not yet disclosed
What it is: First evaluation benchmark for LLM performance on Indian financial regulatory text
Who it is for: AI researchers and financial technology developers
Where to get it: https://github.com/rajveerpall/IndiaFinBench
Price: Free
  • IndiaFinBench addresses the gap in non-Western financial NLP benchmarks with 406 expert-annotated questions
  • The benchmark draws on 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI)
  • Four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
  • Twelve models were evaluated under zero-shot conditions, with accuracy ranging from 70.4% to 89.7%
  • All models substantially outperformed a non-specialist human baseline of 60.0% accuracy
  • Annotation quality is supported by kappa scores of 0.918 (model-based validation) and 0.611 (human inter-annotator agreement)
  • Numerical reasoning is the most discriminative task, with a 35.9 percentage-point performance spread
  • Bootstrap significance testing reveals three statistically distinct performance tiers among the twelve models
  • The complete dataset, evaluation code, and model outputs are publicly available for research use

What is IndiaFinBench

IndiaFinBench is the first publicly available evaluation benchmark specifically designed for assessing large language model performance on Indian financial regulatory text. Benchmark evaluations for LLMs attempt to measure model reasoning, factual accuracy, alignment, and safety [1].

The benchmark addresses a significant coverage gap: existing financial NLP benchmarks draw almost exclusively from Western corpora such as SEC filings, US earnings reports, and English-language financial news. IndiaFinBench instead provides an evaluation framework tailored to the characteristics of Indian financial regulation.

The benchmark contains 406 expert-annotated question-answer pairs sourced from 192 official documents from two primary Indian financial regulatory bodies: the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).

What is new vs the previous version

IndiaFinBench is a first release: no previous benchmark exists for Indian financial regulatory text evaluation. The comparison below therefore contrasts it with the prior state of financial NLP benchmarking.

| Aspect | Previous State | IndiaFinBench |
|---|---|---|
| Geographic coverage | Western financial corpora only | Indian regulatory frameworks |
| Document sources | SEC filings, US earnings reports | SEBI and RBI official documents |
| Task types | General financial reasoning | Regulatory interpretation, numerical reasoning, contradiction detection, temporal reasoning |
| Annotation quality | Variable validation methods | Dual validation: model-based (kappa = 0.918) and human inter-annotator (kappa = 0.611) |
| Public availability | Limited benchmark access | Complete dataset, code, and outputs publicly available |

How does IndiaFinBench work

IndiaFinBench operates through a structured evaluation framework that tests large language models across four distinct financial regulatory tasks.

  1. Document Collection: Researchers gathered 192 official documents from SEBI and RBI regulatory publications
  2. Question Generation: Expert annotators created 406 question-answer pairs distributed across four task categories
  3. Task Distribution: Regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items)
  4. Quality Validation: Annotation quality underwent dual validation through model-based secondary pass and human inter-annotator agreement evaluation
  5. Zero-Shot Evaluation: Twelve models were tested under zero-shot conditions without task-specific training
  6. Statistical Analysis: Bootstrap significance testing with 10,000 resamples identified statistically distinct performance tiers
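The evaluation loop behind steps 2 through 5 can be sketched as follows. This is a minimal illustration, not the paper's released code: the record fields (`task`, `question`, `answer`), the inline sample items, and the `ask_model` placeholder are all hypothetical, and the real benchmark may use a different schema and scoring rule than exact match.

```python
# Hypothetical record layout; the actual IndiaFinBench schema may differ.
DATASET = [
    {"task": "numerical_reasoning", "question": "What is 12% of 50,000 INR?", "answer": "6000"},
    {"task": "regulatory_interpretation", "question": "...", "answer": "..."},
]

def ask_model(question: str) -> str:
    """Placeholder for a zero-shot LLM call (no examples, no fine-tuning).

    A real harness would send the raw question to a model API here.
    """
    return "6000" if "12%" in question else ""

def evaluate(dataset):
    """Score exact-match accuracy overall and per task type."""
    totals, correct = {}, {}
    for item in dataset:
        task = item["task"]
        totals[task] = totals.get(task, 0) + 1
        if ask_model(item["question"]).strip() == item["answer"]:
            correct[task] = correct.get(task, 0) + 1
    per_task = {t: correct.get(t, 0) / n for t, n in totals.items()}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_task
```

Keeping per-task tallies alongside the overall score is what makes it possible to report results like the numerical-reasoning spread discussed below.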

Benchmarks and evidence

IndiaFinBench evaluation results demonstrate clear performance differentiation across twelve tested large language models.

| Model | Overall Accuracy | Performance Tier | Source |
|---|---|---|---|
| Gemini 2.5 Flash | 89.7% | Tier 1 | IndiaFinBench paper |
| Gemma 4 E4B | 70.4% | Tier 3 | IndiaFinBench paper |
| Non-specialist human | 60.0% | Baseline | IndiaFinBench paper |

Numerical reasoning shows the widest spread of any task, 35.9 percentage points across the twelve models, making it the most discriminative task (IndiaFinBench paper).

Researchers and practitioners look at qualities such as accuracy, efficiency, safety, fairness, and robustness to determine how well a model performs [4]. IndiaFinBench's validation achieved strong inter-annotator agreement, with kappa = 0.611 and 76.7% raw agreement across 60 evaluated items.
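For readers unfamiliar with the kappa statistic cited above, here is a self-contained Cohen's kappa computation. It illustrates the metric itself (raw agreement corrected for chance agreement), not the paper's actual validation code or labels.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Fraction of items where the two annotators agree outright.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)
```

On two annotators who agree on 3 of 4 binary labels, this yields kappa = 0.5, noticeably below the 0.75 raw agreement, which is why kappa is the stronger evidence of annotation quality.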

Bootstrap significance testing with 10,000 resamples revealed three statistically distinct performance tiers among the evaluated models, providing robust statistical evidence for performance differentiation.
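The tiering procedure can be sketched as a pairwise bootstrap over per-item correctness scores. This is an illustrative reconstruction under stated assumptions (paired resampling of 0/1 scores, one-sided comparison); the paper's exact resampling scheme and tier-assignment rule may differ.

```python
import random

def bootstrap_accuracy_diff(scores_a, scores_b, resamples=10_000, seed=0):
    """Bootstrap the accuracy gap between two models on the same items.

    scores_a / scores_b are per-item 0/1 correctness lists. Returns the
    fraction of resamples in which model A fails to beat model B; a small
    value suggests the observed gap is statistically robust.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(resamples):
        # Resample item indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if diff <= 0:
            not_better += 1
    return not_better / resamples
```

Grouping models into tiers then amounts to merging any pair whose bootstrap comparison is inconclusive and separating pairs where one model wins in nearly every resample.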

Who should care

Builders

AI researchers developing financial NLP systems need IndiaFinBench to evaluate model performance on non-Western regulatory frameworks. The benchmark provides standardized evaluation metrics for Indian financial text understanding capabilities.

Enterprise

Financial technology companies operating in Indian markets require IndiaFinBench to assess LLM suitability for regulatory compliance applications. Banks and fintech firms can use the benchmark to validate AI systems processing SEBI and RBI documentation.

End users

Financial professionals working with Indian regulatory documents benefit from IndiaFinBench-validated AI tools that demonstrate proven performance on regulatory interpretation and numerical reasoning tasks.

Investors

Venture capital and private equity investors evaluating AI companies serving Indian financial markets can reference IndiaFinBench results to assess technical capabilities and market readiness.

How to use IndiaFinBench today

IndiaFinBench is immediately accessible through its public GitHub repository with complete evaluation resources.

  1. Access the Repository: Visit https://github.com/rajveerpall/IndiaFinBench to download the complete dataset
  2. Download Components: Obtain the 406 question-answer pairs, evaluation code, and all model outputs
  3. Set Up Evaluation: Install the provided evaluation framework following repository documentation
  4. Run Baseline Tests: Execute zero-shot evaluation on your target LLM using the standardized protocol
  5. Compare Results: Benchmark your model performance against the twelve reference models and human baseline
  6. Analyze Task Performance: Examine results across the four task types to identify model strengths and weaknesses
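Step 5 above, comparing your results against the published numbers, can be as simple as slotting your accuracy into a reference leaderboard. The figures below are the three reference points reported in the article; treat them as a sketch, since the full twelve-model table lives in the repository.

```python
# Reference accuracies quoted in the article (fractions, not percents).
REFERENCE = {
    "Gemini 2.5 Flash": 0.897,
    "Gemma 4 E4B": 0.704,
    "non-specialist human": 0.600,
}

def rank_against_references(your_accuracy, references=REFERENCE):
    """Return (name, accuracy) pairs sorted best-first, with your model slotted in."""
    board = dict(references)
    board["your model"] = your_accuracy
    return sorted(board.items(), key=lambda kv: kv[1], reverse=True)
```

For example, a model scoring 75% overall would land between the two published model scores, above the human baseline.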

IndiaFinBench vs competitors

IndiaFinBench stands alone as the first benchmark specifically designed for Indian financial regulatory text evaluation.

| Benchmark | Geographic Focus | Document Sources | Task Types | Question Count |
|---|---|---|---|---|
| IndiaFinBench | Indian regulatory frameworks | SEBI, RBI documents | 4 specialized tasks | 406 |
| FinanceBench | Western markets | SEC filings, US earnings | General financial QA | Not yet disclosed |
| LawBench | General legal text | Legal documents | Legal reasoning | Not yet disclosed |

Specialized domain benchmarks are an established pattern: PoliLegalLM, for instance, is evaluated on LawBench, LexEval, and the real-world PoliLegal dataset [6]. IndiaFinBench brings the same approach to Indian financial regulation.

Risks, limits, and myths

  • Limited Scope: IndiaFinBench focuses exclusively on SEBI and RBI documents, potentially missing other Indian financial regulatory bodies
  • Zero-Shot Only: Current evaluation uses zero-shot conditions, which may not reflect fine-tuned model performance
  • Language Limitation: The benchmark evaluates English-language regulatory text, excluding regional language financial documents
  • Temporal Coverage: Document sources represent a specific time period, potentially missing recent regulatory changes
  • Human Baseline: The 60.0% non-specialist human baseline may not represent expert-level human performance
  • Task Balance: Uneven distribution across task types (174 regulatory interpretation vs 62 contradiction detection items)
  • Model Selection: Evaluation limited to twelve models, potentially missing other relevant LLM architectures

FAQ

What is IndiaFinBench and why was it created?

IndiaFinBench is the first publicly available evaluation benchmark for assessing large language model performance on Indian financial regulatory text, created to address the gap in non-Western financial NLP benchmarks.

How many questions does IndiaFinBench contain?

IndiaFinBench contains 406 expert-annotated question-answer pairs drawn from 192 documents sourced from SEBI and RBI regulatory publications.

What are the four task types in IndiaFinBench?

The four task types are regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items).

Which model performed best on IndiaFinBench?

Gemini 2.5 Flash achieved the highest accuracy at 89.7% under zero-shot evaluation conditions, while Gemma 4 E4B scored lowest at 70.4%.

How does IndiaFinBench validate annotation quality?

IndiaFinBench validates annotation quality through dual methods: model-based secondary pass (kappa=0.918 on contradiction detection) and human inter-annotator agreement evaluation (kappa=0.611).

What is the most challenging task type in IndiaFinBench?

Numerical reasoning is the most discriminative task, showing a 35.9 percentage-point performance spread across the twelve evaluated models.

How does model performance compare to human baseline?

All twelve evaluated models substantially outperformed the non-specialist human baseline of 60.0% accuracy, with the top model achieving 89.7%.

Where can I access IndiaFinBench dataset and code?

The complete IndiaFinBench dataset, evaluation code, and all model outputs are publicly available at https://github.com/rajveerpall/IndiaFinBench.

What statistical methods validate IndiaFinBench results?

Bootstrap significance testing with 10,000 resamples reveals three statistically distinct performance tiers among the evaluated models.

Which regulatory bodies provide source documents for IndiaFinBench?

IndiaFinBench draws from official documents from two primary Indian financial regulatory bodies: Securities and Exchange Board of India (SEBI) and Reserve Bank of India (RBI).

How does IndiaFinBench differ from existing financial benchmarks?

Unlike existing financial NLP benchmarks that draw exclusively from Western financial corpora, IndiaFinBench focuses specifically on Indian regulatory frameworks and documents.

What evaluation conditions were used for model testing?

All twelve models were evaluated under zero-shot conditions without task-specific training or fine-tuning on Indian financial regulatory text.

Glossary

Bootstrap Significance Testing
Statistical method using repeated resampling to determine if performance differences between models are statistically significant
Contradiction Detection
Task type requiring models to identify conflicting information within regulatory documents
Inter-annotator Agreement
Measure of consistency between different human annotators when labeling the same data, expressed as kappa coefficient
Kappa Coefficient
Statistical measure of inter-rater reliability that accounts for agreement occurring by chance
Numerical Reasoning
Task type requiring models to perform mathematical calculations and interpret numerical information in financial contexts
Regulatory Interpretation
Task type requiring models to understand and explain financial regulatory rules and requirements
SEBI
Securities and Exchange Board of India, the regulatory body for securities markets in India
Temporal Reasoning
Task type requiring models to understand time-based relationships and sequences in regulatory contexts
Zero-shot Evaluation
Testing method where models perform tasks without prior training or examples specific to that task

Visit https://github.com/rajveerpall/IndiaFinBench to download the complete dataset and begin evaluating your LLM on Indian financial regulatory text.

Sources

  1. Large language model – Wikipedia. https://en.wikipedia.org/wiki/Large_language_model
  2. IFEval | DeepEval by Confident AI – The LLM Evaluation Framework. https://deepeval.com/docs/benchmarks-ifeval
  3. FinanceBench: A New Benchmark for Financial Question Answering. https://wandb.ai/byyoung3/ml-news/reports/FinanceBench-A-New-Benchmark-for-Financial-Question-Answering--VmlldzoxMDE1OTM0Mw
  4. What Are Large Language Models (LLMs)? | IBM. https://www.ibm.com/think/topics/large-language-models
  5. Comparative Evaluation of Rule-Based and Large Language Models for Financial Transaction Extraction in Chatbots. https://jurnal.polgan.ac.id/index.php/sinkron/article/view/16020
  6. PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs. https://arxiv.org/html/2604.17543
  7. Large Language Models for Cybersecurity Intelligence: A Systematic Review. https://www.sciencedirect.com/org/science/article/pii/S1546221826003565
  8. A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains. https://arxiv.org/html/2508.15832

Author

  • siego237

    Writes for FrontierWisdom on AI systems, automation, decentralized identity, and frontier infrastructure, with a focus on turning emerging technology into practical playbooks, implementation roadmaps, and monetization strategies for operators, builders, and consultants.
