IndiaFinBench is the first publicly available evaluation benchmark designed to assess large language model performance on Indian financial regulatory text, featuring 406 expert-annotated question-answer pairs from SEBI and RBI documents across four specialized task types.
| Released by | Not yet disclosed |
|---|---|
| Release date | |
| What it is | Evaluation benchmark for LLM performance on Indian financial regulatory text |
| Who it is for | AI researchers and financial technology developers |
| Where to get it | GitHub repository at rajveerpall/IndiaFinBench |
| Price | Free |
- IndiaFinBench addresses the gap in non-Western financial NLP benchmarks with 406 expert-annotated questions from Indian regulatory documents
- The benchmark covers four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
- Twelve models were evaluated with accuracy ranging from 70.4% to 89.7% under zero-shot conditions
- Numerical reasoning proved most discriminative with a 35.9 percentage-point performance spread across models
- All tested models substantially outperformed a non-specialist human baseline of 60.0% accuracy
- IndiaFinBench fills a critical gap in financial NLP evaluation by focusing on Indian regulatory frameworks rather than Western financial corpora
- The benchmark demonstrates significant performance variation across models, particularly in numerical reasoning tasks
- High annotation quality is validated through model-based secondary passes and human inter-annotator agreement evaluation
- Bootstrap significance testing reveals three statistically distinct performance tiers among evaluated models
What is IndiaFinBench
IndiaFinBench is an evaluation benchmark specifically designed to assess large language model performance on Indian financial regulatory documents. Benchmark evaluations for LLMs attempt to measure model reasoning, factual accuracy, alignment, and safety [1]. The benchmark contains 406 expert-annotated question-answer pairs sourced from 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).
The benchmark addresses a significant gap in existing financial NLP evaluation tools, which draw exclusively from Western financial corpora such as SEC filings, US earnings reports, and English-language financial news. IndiaFinBench provides the first comprehensive evaluation framework targeting a non-Western financial regulatory regime.
What is new vs previous benchmarks
IndiaFinBench introduces several novel elements compared to existing financial evaluation benchmarks:
| Feature | IndiaFinBench | Previous Financial Benchmarks |
|---|---|---|
| Geographic focus | Indian regulatory framework | Western financial corpora exclusively |
| Document sources | SEBI and RBI regulatory documents | SEC filings, US earnings reports |
| Task diversity | Four specialized task types | General financial question answering |
| Annotation validation | Model-based secondary pass plus human agreement | Standard human annotation only |
| Statistical analysis | Bootstrap significance testing with 10,000 resamples | Basic accuracy reporting |
How does IndiaFinBench work
IndiaFinBench operates through a structured evaluation process across four distinct task types:
- Regulatory interpretation tasks: 174 items testing model understanding of Indian financial regulations and compliance requirements
- Numerical reasoning tasks: 92 items evaluating mathematical computation and quantitative analysis capabilities
- Contradiction detection tasks: 62 items assessing ability to identify conflicting information within regulatory documents
- Temporal reasoning tasks: 78 items testing understanding of time-dependent regulatory changes and sequences
The evaluation methodology employs zero-shot conditions where models receive no task-specific training examples. Annotation quality is validated in two ways: a model-based secondary pass achieving kappa = 0.918 on contradiction detection tasks, and a 60-item human inter-annotator agreement evaluation yielding kappa = 0.611 with 76.7% overall agreement.
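The kappa figures above measure chance-corrected agreement between two label sequences. A minimal sketch of Cohen's kappa follows; the two annotator label lists are illustrative toy data, not drawn from the benchmark itself.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: inter-rater agreement corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: two annotators labelling 10 items as
# contradiction ("C") or no contradiction ("N").
a = ["C", "C", "N", "N", "C", "N", "C", "C", "N", "N"]
b = ["C", "C", "N", "C", "C", "N", "C", "N", "N", "N"]
print(round(cohens_kappa(a, b), 3))  # 0.6: 80% raw agreement, 50% expected by chance
```

A raw-agreement score of 80% shrinks to kappa = 0.6 once chance agreement is removed, which is why the paper reports kappa alongside the 76.7% overall agreement figure.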
Benchmarks and evidence
Researchers and practitioners look at qualities like accuracy, efficiency, safety, fairness, and robustness to determine how well a model performs [4]. IndiaFinBench evaluation results show significant performance variation across the twelve tested models:
| Model | Overall Accuracy | Performance Tier | Source |
|---|---|---|---|
| Gemini 2.5 Flash | 89.7% | Tier 1 | IndiaFinBench evaluation |
| Mid-range models | 75-85% | Tier 2 | IndiaFinBench evaluation |
| Gemma 4 E4B | 70.4% | Tier 3 | IndiaFinBench evaluation |
| Non-specialist human | 60.0% | Baseline | IndiaFinBench evaluation |
Numerical reasoning tasks showed the highest discriminative power with a 35.9 percentage-point spread across models. Bootstrap significance testing with 10,000 resamples confirmed three statistically distinct performance tiers among evaluated models.
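The tiering above rests on percentile-bootstrap confidence intervals over per-item correctness. The sketch below illustrates the idea under stated assumptions: the two correctness vectors are synthetic stand-ins shaped to roughly match the reported 89.7% and 70.4% accuracies on 406 items, not the published model outputs.

```python
import random

def bootstrap_accuracy_ci(correct_flags, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy over per-item 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct_flags)
    stats = []
    for _ in range(n_resamples):
        # Resample items with replacement and recompute accuracy.
        sample = [correct_flags[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-item results for a top-tier and a bottom-tier model.
model_a = [1] * 364 + [0] * 42   # ~89.7% correct (synthetic)
model_b = [1] * 286 + [0] * 120  # ~70.4% correct (synthetic)
lo_a, hi_a = bootstrap_accuracy_ci(model_a)
lo_b, hi_b = bootstrap_accuracy_ci(model_b)
print(f"A: [{lo_a:.3f}, {hi_a:.3f}]  B: [{lo_b:.3f}, {hi_b:.3f}]")
```

When two models' intervals do not overlap, as here, they can be placed in statistically distinct tiers; overlapping intervals leave the ordering unresolved.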
Who should care
Builders
AI researchers developing financial NLP systems need IndiaFinBench to evaluate model performance on non-Western regulatory frameworks. The benchmark provides standardized evaluation metrics for Indian financial document processing capabilities.
Enterprise
Financial institutions operating in India require accurate AI systems for regulatory compliance and document analysis. IndiaFinBench enables assessment of LLM capabilities for Indian regulatory interpretation and numerical reasoning tasks.
End users
Financial professionals working with Indian regulatory documents benefit from AI tools validated against IndiaFinBench. The benchmark gives a standard for assessing how reliably such tools handle SEBI and RBI document analysis.
Investors
Investment firms focusing on Indian markets need AI systems capable of processing local regulatory requirements. IndiaFinBench provides validation metrics for financial AI tools in the Indian regulatory context.
How to use IndiaFinBench today
IndiaFinBench is available for immediate use through its GitHub repository:
- Access the repository: Visit https://github.com/rajveerpall/IndiaFinBench to download the complete dataset
- Review evaluation code: The repository includes standardized evaluation scripts for consistent model assessment
- Load the dataset: Import the 406 question-answer pairs organized by task type for systematic evaluation
- Run zero-shot evaluation: Test your model without task-specific training examples following the established protocol
- Compare results: Use the provided baseline scores and statistical analysis framework for performance comparison
The repository also includes all model outputs from the original evaluation, enabling direct comparison with the established performance benchmarks.
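The evaluation steps above can be sketched as a small scoring loop. The item schema (`question`, `answer`, and `task_type` keys) and exact-match scoring are assumptions for illustration; the repository's actual file format and grading protocol may differ.

```python
def evaluate_zero_shot(items, predict):
    """Score a model per task type with no in-context examples.

    `predict` maps a question string to an answer string; `items` are dicts
    with (assumed) keys "question", "answer", and "task_type".
    """
    per_task = {}
    for item in items:
        task = item["task_type"]
        # Exact match after whitespace/case normalization (an assumed grader).
        correct = predict(item["question"]).strip().lower() == item["answer"].strip().lower()
        hits, total = per_task.get(task, (0, 0))
        per_task[task] = (hits + correct, total + 1)
    return {task: hits / total for task, (hits, total) in per_task.items()}

# Toy run with a trivial "model" that always answers "yes".
items = [
    {"question": "Q1", "answer": "yes", "task_type": "numerical_reasoning"},
    {"question": "Q2", "answer": "no", "task_type": "numerical_reasoning"},
]
print(evaluate_zero_shot(items, lambda q: "yes"))  # {'numerical_reasoning': 0.5}
```

In practice `predict` would wrap an LLM API call with only the question in the prompt, matching the zero-shot protocol.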
IndiaFinBench vs competitors
| Benchmark | Geographic Focus | Document Sources | Task Types | Validation Method |
|---|---|---|---|---|
| IndiaFinBench | Indian regulatory framework | SEBI, RBI documents | 4 specialized tasks | Model-based + human validation |
| FinanceBench | Western markets | SEC filings, earnings reports | General financial QA | Human annotation |
| LawBench | General legal domain | Legal documents | Legal reasoning | Expert annotation |
Risks, limits, and myths
- Limited scope: The benchmark focuses exclusively on Indian regulatory documents, limiting generalizability to other financial markets
- Language constraints: Evaluation is conducted in English, potentially missing nuances in regional Indian financial terminology
- Temporal limitations: Regulatory documents have specific time periods, requiring regular updates to maintain relevance
- Task type bias: Numerical reasoning shows highest discriminative power, potentially overweighting quantitative capabilities
- Human baseline limitations: The 60.0% non-specialist human baseline may not represent expert-level performance expectations
- Model selection bias: Twelve evaluated models may not represent the full spectrum of available LLM capabilities
FAQ
What makes IndiaFinBench different from other financial AI benchmarks?
IndiaFinBench is the first evaluation benchmark specifically designed for Indian financial regulatory documents, addressing the gap in non-Western financial NLP evaluation tools that previously focused exclusively on Western financial corpora.
How many questions are included in the IndiaFinBench dataset?
IndiaFinBench contains 406 expert-annotated question-answer pairs sourced from 192 documents from SEBI and RBI, distributed across four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning.
Which AI models perform best on IndiaFinBench tasks?
Gemini 2.5 Flash achieved the highest accuracy at 89.7%, while Gemma 4 E4B scored lowest at 70.4%. All twelve evaluated models substantially outperformed the non-specialist human baseline of 60.0%.
What is the most challenging task type in IndiaFinBench?
Numerical reasoning tasks proved most discriminative with a 35.9 percentage-point spread across models, indicating significant variation in quantitative analysis capabilities among different LLMs.
How is annotation quality validated in IndiaFinBench?
Annotation quality is validated through a model-based secondary pass achieving kappa=0.918 on contradiction detection and a 60-item human inter-annotator agreement evaluation with kappa=0.611 and 76.7% overall agreement.
Can I access IndiaFinBench for free?
Yes, IndiaFinBench is freely available through its GitHub repository at rajveerpall/IndiaFinBench, including the complete dataset, evaluation code, and all model outputs from the original study.
What regulatory bodies are covered in IndiaFinBench documents?
IndiaFinBench draws from documents issued by two primary Indian financial regulatory authorities: the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).
How does IndiaFinBench ensure statistical significance in results?
The benchmark employs bootstrap significance testing with 10,000 resamples to establish three statistically distinct performance tiers among evaluated models, ensuring robust statistical validation of results.
What evaluation methodology does IndiaFinBench use?
IndiaFinBench uses zero-shot evaluation conditions where models receive no task-specific training examples, providing a standardized assessment of inherent model capabilities on Indian financial regulatory text.
Who should use IndiaFinBench for model evaluation?
AI researchers, financial technology developers, regulatory compliance teams, and financial institutions operating in India should use IndiaFinBench to evaluate LLM performance on Indian regulatory document processing tasks.
Glossary
- SEBI: Securities and Exchange Board of India, the regulatory authority for securities markets in India
- RBI: Reserve Bank of India, the central banking institution and monetary authority of India
- Zero-shot evaluation: Testing model performance without providing task-specific training examples or fine-tuning
- Kappa coefficient: Statistical measure of inter-rater agreement that accounts for agreement occurring by chance
- Bootstrap significance testing: Statistical method using resampling to determine confidence intervals and significance of results
- Contradiction detection: Task type requiring identification of conflicting information within regulatory documents
- Temporal reasoning: Understanding of time-dependent relationships and sequences in regulatory changes
Sources
- Large language model – Wikipedia. https://en.wikipedia.org/wiki/Large_language_model
- IFEval | DeepEval by Confident AI – The LLM Evaluation Framework. https://deepeval.com/docs/benchmarks-ifeval
- FinanceBench: A New Benchmark for Financial Question Answering. https://wandb.ai/byyoung3/ml-news/reports/FinanceBench-A-New-Benchmark-for-Financial-Question-Answering--VmlldzoxMDE1OTM0Mw
- What Are Large Language Models (LLMs)? | IBM. https://www.ibm.com/think/topics/large-language-models
- Comparative Evaluation of Rule-Based and Large Language Models for Financial Transaction Extraction in Chatbots. https://jurnal.polgan.ac.id/index.php/sinkron/article/view/16020
- PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs. https://arxiv.org/html/2604.17543
- Large Language Models for Cybersecurity Intelligence: A Systematic Review. https://www.sciencedirect.com/org/science/article/pii/S1546221826003565
- A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains. https://arxiv.org/html/2508.15832