IndiaFinBench is the first publicly available evaluation benchmark for assessing large language model performance on Indian financial regulatory text, featuring 406 expert-annotated question-answer pairs from SEBI and RBI documents across four task types.
| Field | Detail |
|---|---|
| Released by | Not yet disclosed |
| Release date | Not yet disclosed |
| What it is | LLM evaluation benchmark for Indian financial regulatory text |
| Who it is for | AI researchers and financial technology developers |
| Where to get it | https://github.com/rajveerpall/IndiaFinBench |
| Price | Free |
- IndiaFinBench contains 406 expert-annotated question-answer pairs from 192 SEBI and RBI documents
- Four task types include regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
- Twelve models tested showed accuracy ranging from 70.4% to 89.7% under zero-shot conditions
- Numerical reasoning proved most discriminative with 35.9 percentage-point spread across models
- IndiaFinBench fills a critical gap in LLM evaluation for non-Western financial regulatory frameworks
- Expert annotation quality validated through model-based secondary pass with kappa=0.918 on contradiction detection
- All twelve tested models substantially outperformed non-specialist human baseline of 60.0% accuracy
- Bootstrap significance testing revealed three statistically distinct performance tiers across models
- Complete dataset, evaluation code, and model outputs available for reproducible research
What is IndiaFinBench
IndiaFinBench is an evaluation benchmark specifically designed to assess large language model performance on Indian financial regulatory text. The benchmark addresses a significant gap in existing financial NLP evaluation tools, which draw exclusively from Western financial corpora like SEC filings and US earnings reports.
The benchmark contains 406 expert-annotated question-answer pairs sourced from 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI). These documents represent the core regulatory framework governing India’s financial sector.
IndiaFinBench evaluates models across four distinct task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Each task type tests different aspects of language model comprehension and reasoning capabilities within the Indian financial regulatory context.
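The repository's exact record schema is not described here. As an illustration only, a plausible JSONL-style item layout and a per-task tally might look like the following; the field names (`id`, `task`, `question`, `answer`) and task identifiers are assumptions, not the benchmark's actual format:

```python
from collections import Counter

# Hypothetical record layout -- the repository's actual schema may differ.
sample_items = [
    {"id": "sebi-0001", "task": "regulatory_interpretation",
     "question": "...", "answer": "..."},
    {"id": "rbi-0042", "task": "numerical_reasoning",
     "question": "...", "answer": "..."},
    {"id": "rbi-0043", "task": "contradiction_detection",
     "question": "...", "answer": "..."},
]

def task_distribution(items):
    """Tally items per task type (the full benchmark splits 174/92/62/78)."""
    return Counter(item["task"] for item in items)

print(task_distribution(sample_items))
```

Counting items by task this way is how one would verify the published 174/92/62/78 split after downloading the dataset.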
What is new vs the previous version
IndiaFinBench represents the first benchmark of its kind rather than an update to existing tools.
| Aspect | Previous Financial Benchmarks | IndiaFinBench |
|---|---|---|
| Geographic Focus | Western financial markets only | Indian regulatory framework |
| Document Sources | SEC filings, US earnings reports | SEBI and RBI regulatory documents |
| Language Context | English-language financial news | Indian financial regulatory text |
| Task Diversity | Limited task types | Four specialized task categories |
| Annotation Quality | Varies by benchmark | Model-validated with kappa=0.918 |
How does IndiaFinBench work
IndiaFinBench operates through a structured evaluation framework that tests language models across four specialized financial regulatory tasks.
- Document Collection: Researchers gathered 192 regulatory documents from SEBI and RBI covering various aspects of Indian financial regulation
- Question Generation: Expert annotators created 406 question-answer pairs distributed across four task categories based on document content
- Quality Validation: Annotation quality underwent validation through model-based secondary pass and human inter-annotator agreement evaluation
- Model Testing: Twelve language models were evaluated under zero-shot conditions, with no task-specific training or in-context examples
- Statistical Analysis: Bootstrap significance testing with 10,000 resamples determined statistically distinct performance tiers
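The paper's exact bootstrap procedure is not reproduced here, but a paired bootstrap over per-item correctness, as a minimal sketch, works like this: resample the 406 items with replacement many times and count how often one model's accuracy advantage disappears.

```python
import random

def bootstrap_accuracy_diff(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Paired bootstrap: resample items with replacement and measure how often
    model A's accuracy fails to exceed model B's. The returned fraction is a
    one-sided p-value estimate for "A is better than B"."""
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    worse_or_equal = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # same items for both models
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        if acc_a <= acc_b:
            worse_or_equal += 1
    return worse_or_equal / n_resamples

# Toy example over 406 items: model A answers 85% correctly, model B 70%.
a = [1] * 345 + [0] * 61
b = [1] * 284 + [0] * 122
p = bootstrap_accuracy_diff(a, b)
print(f"estimated p-value: {p:.4f}")
```

A gap this large over 406 items survives essentially every resample, which is the kind of evidence behind the "statistically distinct performance tiers" claim.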
Benchmarks and evidence
IndiaFinBench evaluation results demonstrate significant performance variation across different language models and task types.
| Model Performance Metric | Result | Source |
|---|---|---|
| Highest accuracy achieved | 89.7% (Gemini 2.5 Flash) | IndiaFinBench evaluation |
| Lowest accuracy achieved | 70.4% (Gemma 3n E4B) | IndiaFinBench evaluation |
| Human baseline accuracy | 60.0% (non-specialist) | IndiaFinBench evaluation |
| Most discriminative task spread | 35.9 percentage points (numerical reasoning) | IndiaFinBench evaluation |
| Inter-annotator agreement kappa | 0.611 (76.7% overall agreement) | 60-item human evaluation |
| Contradiction detection validation kappa | 0.918 | Model-based secondary pass |
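Cohen's kappa, the agreement statistic reported above, corrects raw agreement for the agreement two raters would reach by chance. A minimal implementation on toy labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both raters pick the same class at random.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: two annotators labeling 10 contradiction-detection items.
a = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]
b = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes"]
print(round(cohens_kappa(a, b), 3))  # 0.583
```

This is why 76.7% raw agreement maps to a lower kappa of 0.611: part of that raw agreement is attributable to chance.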
Who should care
Builders
AI developers building financial applications for Indian markets can use IndiaFinBench to evaluate model performance on regulatory comprehension before deployment. The benchmark provides standardized testing for models that must read and interpret SEBI and RBI documentation.
Enterprise
Financial institutions operating in India require accurate LLM evaluation for regulatory compliance systems. IndiaFinBench enables enterprises to assess model capabilities for processing Indian financial regulations, supporting automated compliance monitoring and regulatory document analysis.
End users
Financial technology users benefit indirectly: applications tested against IndiaFinBench give their builders evidence of how accurately the underlying models handle Indian regulatory text, which supports better regulatory guidance and compliance assistance.
Investors
Investment firms focusing on Indian fintech companies can use IndiaFinBench results to evaluate the technical capabilities of AI-powered financial services. The benchmark provides objective performance metrics for assessing regulatory compliance technology investments.
How to use IndiaFinBench today
IndiaFinBench provides immediate access through its GitHub repository for researchers and developers.
- Access the repository: Visit https://github.com/rajveerpall/IndiaFinBench to download the complete dataset
- Install evaluation framework: Clone the repository and install required dependencies listed in requirements.txt
- Load your model: Configure your language model to work with the provided evaluation scripts
- Run evaluation: Execute the benchmark using the provided evaluation code across all four task types
- Analyze results: Compare your model’s performance against the published baseline results and statistical significance tests
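For the final step, a score can be situated against the published headline numbers. This sketch assumes you have already computed an overall accuracy yourself; the thresholds mirror the figures reported above, but the helper itself is illustrative and not part of the repository:

```python
# Published headline figures from the IndiaFinBench evaluation.
PUBLISHED = {"best": 0.897, "worst": 0.704, "human_baseline": 0.600}

def position_vs_published(overall_accuracy):
    """Situate an overall accuracy score relative to the reported range."""
    if overall_accuracy > PUBLISHED["best"]:
        return "above the best reported model (89.7%)"
    if overall_accuracy >= PUBLISHED["worst"]:
        return "within the reported 70.4%-89.7% model range"
    if overall_accuracy >= PUBLISHED["human_baseline"]:
        return "below tested models but above the 60.0% human baseline"
    return "below the 60.0% non-specialist human baseline"

print(position_vs_published(0.82))
```

For a fair comparison, per-task accuracies matter as much as the overall figure, since numerical reasoning alone spans 35.9 percentage points across the tested models.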
IndiaFinBench vs competitors
IndiaFinBench stands alone as the first benchmark specifically designed for Indian financial regulatory text evaluation.
| Benchmark | Geographic Focus | Document Sources | Task Types | Question Count |
|---|---|---|---|---|
| IndiaFinBench | Indian regulations | SEBI, RBI documents | 4 specialized tasks | 406 questions |
| FinanceBench | US markets | SEC filings, earnings | General financial QA | Not yet disclosed |
| LawBench | General legal | Various legal texts | Legal reasoning | Not yet disclosed |
| LexEval | Legal domains | Legal documents | Legal evaluation | Not yet disclosed |
Risks, limits, and myths
- Limited scope: IndiaFinBench focuses exclusively on SEBI and RBI documents, potentially missing other Indian financial regulatory bodies
- Zero-shot evaluation only: Current testing excludes few-shot or fine-tuned model performance assessment
- Language limitation: Benchmark covers English-language regulatory text, excluding regional language financial documents
- Temporal constraints: The benchmark reflects regulations as of its collection period, so results may not transfer to future regulatory amendments
- Expert annotation bias: Human annotators may introduce subjective interpretations despite validation measures
- Model selection bias: Twelve tested models may not represent complete landscape of available language models
FAQ
- What makes IndiaFinBench different from other financial AI benchmarks?
- IndiaFinBench specifically evaluates language models on Indian financial regulatory text from SEBI and RBI, addressing the gap left by Western-focused financial benchmarks that use SEC filings and US earnings reports.
- How many questions does IndiaFinBench contain for model evaluation?
- IndiaFinBench contains 406 expert-annotated question-answer pairs distributed across four task types: regulatory interpretation (174), numerical reasoning (92), contradiction detection (62), and temporal reasoning (78).
- Which language models performed best on IndiaFinBench testing?
- Gemini 2.5 Flash achieved the highest accuracy at 89.7%, while Gemma 3n E4B scored lowest at 70.4%. All twelve tested models outperformed the 60.0% human baseline.
- Where can researchers access IndiaFinBench dataset and evaluation tools?
- The complete IndiaFinBench dataset, evaluation code, and all model outputs are freely available at https://github.com/rajveerpall/IndiaFinBench for reproducible research.
- What validation methods ensure IndiaFinBench annotation quality?
- Annotation quality underwent validation through model-based secondary pass achieving kappa=0.918 on contradiction detection and 60-item human inter-annotator agreement evaluation with kappa=0.611.
- Which task type shows the largest performance differences between models?
- Numerical reasoning proved most discriminative with a 35.9 percentage-point spread across models, indicating significant variation in mathematical reasoning capabilities within financial contexts.
- How does IndiaFinBench handle statistical significance in model comparisons?
- Bootstrap significance testing with 10,000 resamples revealed three statistically distinct performance tiers, ensuring reliable model performance comparisons beyond simple accuracy scores.
- What regulatory documents form the foundation of IndiaFinBench questions?
- IndiaFinBench draws from 192 regulatory documents sourced from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).
- Can IndiaFinBench evaluate models trained specifically on financial data?
- Current IndiaFinBench evaluation focuses on zero-shot conditions without task-specific training, though the framework could potentially accommodate fine-tuned model assessment.
- What languages does IndiaFinBench support for regulatory text evaluation?
- IndiaFinBench currently evaluates English-language Indian financial regulatory text, with no disclosed plans for regional language document inclusion.
Glossary
- Bootstrap significance testing
- Statistical method using repeated random sampling to determine if performance differences between models are statistically meaningful rather than due to chance
- Contradiction detection
- Task type requiring models to identify conflicting information within regulatory documents or between different regulatory statements
- Inter-annotator agreement
- Measure of consistency between different human annotators when labeling the same data, typically expressed as kappa coefficient
- Kappa coefficient
- Statistical measure of inter-rater reliability accounting for agreement occurring by chance, with values closer to 1.0 indicating higher agreement
- Numerical reasoning
- Task type requiring models to perform mathematical calculations and quantitative analysis within financial regulatory contexts
- RBI
- Reserve Bank of India, the central banking institution responsible for monetary policy and banking regulation in India
- Regulatory interpretation
- Task type requiring models to understand and explain the meaning and implications of specific regulatory text passages
- SEBI
- Securities and Exchange Board of India, the regulatory authority for securities and commodity markets in India
- Temporal reasoning
- Task type requiring models to understand time-based relationships and chronological sequences within regulatory frameworks
- Zero-shot evaluation
- Testing methodology where models perform tasks without prior training or examples specific to those tasks