
IndiaFinBench: First LLM Benchmark for Indian Financial Regulation

IndiaFinBench introduces 406 expert-annotated question-answer pairs from SEBI and RBI documents to evaluate large language model performance on Indian financial regulatory text.


IndiaFinBench is the first publicly available benchmark for evaluating large language model performance on Indian financial regulatory text, featuring 406 expert-annotated question-answer pairs from SEBI and RBI documents across four task types.

Released by: Not yet disclosed
Release date: Not stated
What it is: First public benchmark for evaluating LLM performance on Indian financial regulatory text
Who it is for: AI researchers and financial technology developers
Where to get it: https://github.com/rajveerpall/IndiaFinBench
Price: Free
  • IndiaFinBench contains 406 expert-annotated question-answer pairs from 192 SEBI and RBI documents
  • The benchmark covers four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
  • Twelve models were evaluated with accuracy ranging from 70.4% to 89.7% under zero-shot conditions
  • Numerical reasoning proved most discriminative with a 35.9 percentage-point performance spread across models
  • All models substantially outperformed a non-specialist human baseline of 60.0% accuracy
  • IndiaFinBench addresses a significant gap in LLM evaluation by focusing on non-Western financial regulatory frameworks
  • The benchmark demonstrates three statistically distinct performance tiers among evaluated models through bootstrap significance testing
  • Annotation quality validation achieved kappa=0.918 on contradiction detection (model-based) and kappa=0.611 for human inter-annotator agreement
  • The complete dataset, evaluation code, and model outputs are publicly available for research use

What is IndiaFinBench

IndiaFinBench is an evaluation benchmark specifically designed to assess large language model performance on Indian financial regulatory text. Benchmark evaluations for LLMs attempt to measure model reasoning, factual accuracy, alignment, and safety [1]. The benchmark contains 406 expert-annotated question-answer pairs sourced from 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).

The benchmark addresses a critical gap in existing financial NLP evaluation tools, which draw exclusively from Western financial corpora including SEC filings, US earnings reports, and English-language financial news. IndiaFinBench provides the first comprehensive evaluation framework for models working with Indian regulatory frameworks.

What is new vs previous benchmarks

IndiaFinBench introduces several novel elements compared to existing financial benchmarks:

| Feature | IndiaFinBench | Existing financial benchmarks |
|---|---|---|
| Geographic focus | Indian regulatory framework (SEBI/RBI) | Western markets (SEC, US earnings) |
| Document sources | 192 Indian regulatory documents | US financial filings and news |
| Task diversity | 4 specialized task types | General financial QA |
| Annotation validation | Model-based + human inter-annotator (kappa=0.611) | Varies by benchmark |
| Performance tiers | 3 statistically distinct tiers via bootstrap testing | Not systematically established |

How does IndiaFinBench work

IndiaFinBench operates through a structured four-task evaluation framework:

  1. Regulatory interpretation tasks assess model understanding of Indian financial regulations with 174 question-answer pairs
  2. Numerical reasoning tasks evaluate quantitative analysis capabilities using 92 items focused on financial calculations
  3. Contradiction detection tasks test logical consistency identification across 62 regulatory statement pairs
  4. Temporal reasoning tasks measure understanding of time-dependent regulatory changes through 78 scenarios
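Because the four task types have different item counts, an overall score is naturally an item-weighted average. The sketch below uses the benchmark's published counts (174/92/62/78 = 406 items); the per-task accuracies are illustrative placeholders, not published results:

```python
# Item counts per task type, as reported for IndiaFinBench.
TASK_SIZES = {
    "regulatory_interpretation": 174,
    "numerical_reasoning": 92,
    "contradiction_detection": 62,
    "temporal_reasoning": 78,
}

def overall_accuracy(per_task_acc: dict) -> float:
    """Item-weighted average accuracy across the four task types."""
    total = sum(TASK_SIZES.values())  # 406
    return sum(per_task_acc[t] * n for t, n in TASK_SIZES.items()) / total

# Illustrative per-task accuracies (not real benchmark numbers).
example = {
    "regulatory_interpretation": 0.92,
    "numerical_reasoning": 0.75,
    "contradiction_detection": 0.88,
    "temporal_reasoning": 0.85,
}
print(round(overall_accuracy(example), 3))  # → 0.862
```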

The evaluation uses zero-shot conditions where models receive no task-specific training examples. Researchers and practitioners look at qualities like accuracy, efficiency, safety, fairness and robustness to determine how well a model performs [4]. Bootstrap significance testing with 10,000 resamples validates statistical differences between model performance levels.
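A minimal sketch of a paired bootstrap test of this kind, operating on per-item 0/1 correctness vectors for two models; the toy data and the exact resampling details are assumptions, not the paper's procedure:

```python
import random

def bootstrap_diff_ci(model_a, model_b, resamples=10_000, seed=0):
    """95% bootstrap CI for the accuracy difference (A - B) on paired items."""
    assert len(model_a) == len(model_b)
    rng = random.Random(seed)
    n = len(model_a)
    diffs = []
    for _ in range(resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        diffs.append(sum(model_a[i] - model_b[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * resamples)], diffs[int(0.975 * resamples)]

# Toy data: two models scored on the same 406 items.
a = [1] * 360 + [0] * 46   # ~88.7% accurate
b = [1] * 290 + [0] * 116  # ~71.4% accurate
lo, hi = bootstrap_diff_ci(a, b)
print(lo > 0)  # a CI excluding zero places the models in distinct tiers
```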

Benchmarks and evidence

IndiaFinBench evaluation results demonstrate clear performance hierarchies among twelve tested models:

| Model | Overall accuracy | Performance tier | Source |
|---|---|---|---|
| Gemini 2.5 Flash | 89.7% | Tier 1 | IndiaFinBench paper |
| Mid-range models | 75-85% | Tier 2 | IndiaFinBench paper |
| Gemma 4 E4B | 70.4% | Tier 3 | IndiaFinBench paper |
| Non-specialist human | 60.0% | Baseline | IndiaFinBench paper |

Numerical reasoning was the most discriminative task, with a 35.9 percentage-point spread across models (IndiaFinBench paper).

The annotation quality validation achieved kappa=0.918 on contradiction detection tasks, while human inter-annotator evaluation across 60 items reached kappa=0.611 with 76.7% overall agreement.
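Cohen's kappa, the agreement statistic quoted above, adjusts raw agreement for chance. A minimal implementation, with illustrative annotator labels rather than the benchmark's actual data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if each annotator labeled independently at their own rates.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative labels for a 60-item agreement study (not the benchmark's data).
a = ["yes"] * 40 + ["no"] * 20
b = ["yes"] * 35 + ["no"] * 5 + ["yes"] * 3 + ["no"] * 17
print(round(cohens_kappa(a, b), 3))  # → 0.707
```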

Who should care

Builders

AI developers building financial applications for Indian markets need IndiaFinBench to validate model performance on local regulatory requirements. The benchmark provides standardized evaluation metrics for SEBI and RBI compliance capabilities.

Enterprise

Financial institutions operating in India require models that understand local regulatory frameworks for compliance automation and risk assessment. IndiaFinBench enables objective comparison of model capabilities for regulatory interpretation tasks.

End users

Financial advisors and compliance professionals can use IndiaFinBench results to select appropriate AI tools for Indian regulatory analysis. The benchmark results indicate which models perform best on specific task types.

Investors

Investment firms focusing on Indian fintech can use IndiaFinBench performance data to evaluate AI-powered regulatory technology solutions. The benchmark provides objective metrics for due diligence on financial AI products.

How to use IndiaFinBench today

IndiaFinBench is immediately available for research and development use:

  1. Access the repository: Visit https://github.com/rajveerpall/IndiaFinBench to download the complete dataset
  2. Install evaluation framework: Clone the repository and install required dependencies for model evaluation
  3. Run baseline evaluation: Execute the provided evaluation scripts against your target language model
  4. Compare results: Use the included benchmark results to position your model performance relative to established baselines
  5. Analyze task-specific performance: Review detailed results across the four task types to identify model strengths and weaknesses

The repository includes the complete dataset, evaluation code, and all model outputs from the twelve evaluated systems for comprehensive analysis.

IndiaFinBench vs competitors

IndiaFinBench occupies a unique position among financial AI evaluation benchmarks:

| Benchmark | Geographic focus | Document count | Task types | Validation method |
|---|---|---|---|---|
| IndiaFinBench | Indian regulatory (SEBI/RBI) | 192 documents | 4 specialized tasks | Model + human validation |
| FinanceBench | US markets | Not yet disclosed | General financial QA | Not yet disclosed |
| LawBench | General legal | Not yet disclosed | Legal reasoning | Not yet disclosed |

The PoliLegalLM technical report, for example, evaluates its model on three representative benchmarks, LawBench, LexEval, and a real-world dataset, PoliLegal [6], illustrating the broader landscape of specialized evaluation frameworks for regulatory and legal domains.

Risks, limits, and myths

  • Limited scope: IndiaFinBench focuses exclusively on SEBI and RBI documents, not covering other Indian financial regulators
  • Zero-shot evaluation only: The benchmark does not assess few-shot or fine-tuned model performance capabilities
  • Static dataset: Regulatory frameworks evolve continuously, potentially dating benchmark content over time
  • Language limitation: The benchmark uses English-language regulatory documents, not covering regional Indian languages
  • Expert annotation dependency: Benchmark quality relies on the expertise and consistency of human annotators
  • Task type imbalance: Regulatory interpretation tasks (174 items) significantly outnumber contradiction detection tasks (62 items)

FAQ

What makes IndiaFinBench different from other financial AI benchmarks?
IndiaFinBench is the first benchmark specifically designed for Indian financial regulatory text, using SEBI and RBI documents instead of Western financial corpora.
How many question-answer pairs does IndiaFinBench contain?
IndiaFinBench contains 406 expert-annotated question-answer pairs sourced from 192 Indian regulatory documents.
Which AI models perform best on IndiaFinBench?
Gemini 2.5 Flash achieved the highest accuracy at 89.7%, while Gemma 4 E4B scored lowest at 70.4% among the twelve evaluated models.
What task types are included in IndiaFinBench evaluation?
IndiaFinBench includes four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items).
How reliable is IndiaFinBench annotation quality?
Annotation quality achieved kappa=0.918 on contradiction detection through model-based validation and kappa=0.611 with 76.7% agreement in human inter-annotator evaluation.
Can I access IndiaFinBench dataset for free?
Yes, IndiaFinBench dataset, evaluation code, and all model outputs are freely available at https://github.com/rajveerpall/IndiaFinBench.
Which task type shows the biggest performance differences between models?
Numerical reasoning tasks show the largest performance variation with a 35.9 percentage-point spread across evaluated models, making them most discriminative.
How does human performance compare to AI models on IndiaFinBench?
All evaluated AI models substantially outperformed the non-specialist human baseline of 60.0% accuracy, with the best model achieving 89.7%.
What statistical methods validate IndiaFinBench results?
Bootstrap significance testing with 10,000 resamples reveals three statistically distinct performance tiers among the evaluated models.
Does IndiaFinBench cover all Indian financial regulators?
No, IndiaFinBench focuses specifically on SEBI and RBI documents, not covering other Indian financial regulatory bodies.

Glossary

SEBI
Securities and Exchange Board of India, the primary regulator of Indian capital markets and securities trading
RBI
Reserve Bank of India, the central banking institution responsible for monetary policy and banking regulation in India
Zero-shot evaluation
Testing AI models on tasks without providing any task-specific training examples or demonstrations
Bootstrap significance testing
Statistical method using repeated random sampling to determine if performance differences between models are statistically meaningful
Inter-annotator agreement
Measure of consistency between different human experts when labeling the same data, typically expressed as kappa coefficient
Contradiction detection
AI task requiring identification of logical inconsistencies or conflicting statements within regulatory text
Temporal reasoning
AI capability to understand and process time-dependent relationships and sequences in regulatory changes
Numerical reasoning
AI ability to perform mathematical calculations and quantitative analysis on financial data and regulations

Download IndiaFinBench from https://github.com/rajveerpall/IndiaFinBench to evaluate your language model’s performance on Indian financial regulatory text.

Sources

  1. Large language model – Wikipedia. https://en.wikipedia.org/wiki/Large_language_model
  2. IFEval | DeepEval by Confident AI – The LLM Evaluation Framework. https://deepeval.com/docs/benchmarks-ifeval
  3. FinanceBench: A New Benchmark for Financial Question Answering. https://wandb.ai/byyoung3/ml-news/reports/FinanceBench-A-New-Benchmark-for-Financial-Question-Answering--VmlldzoxMDE1OTM0Mw
  4. What Are Large Language Models (LLMs)? | IBM. https://www.ibm.com/think/topics/large-language-models
  5. Comparative Evaluation of Rule-Based and Large Language Models for Financial Transaction Extraction in Chatbots. https://jurnal.polgan.ac.id/index.php/sinkron/article/view/16020
  6. PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs. https://arxiv.org/html/2604.17543
  7. Large Language Models for Cybersecurity Intelligence: A Systematic Review. https://www.sciencedirect.com/org/science/article/pii/S1546221826003565
  8. A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains. https://arxiv.org/html/2508.15832

Author

  • siego237

    Writes for FrontierWisdom on AI systems, automation, decentralized identity, and frontier infrastructure, with a focus on turning emerging technology into practical playbooks, implementation roadmaps, and monetization strategies for operators, builders, and consultants.

