IndiaFinBench is the first publicly available evaluation benchmark for assessing large language model performance on Indian financial regulatory text. The benchmark contains 406 expert-annotated question-answer pairs from SEBI and RBI documents across four task types.
| Released by | Not yet disclosed |
|---|---|
| Release date | Not yet disclosed |
| What it is | Evaluation benchmark for LLM performance on Indian financial regulatory text |
| Who it’s for | AI researchers and financial technology developers |
| Where to get it | https://github.com/rajveerpall/IndiaFinBench |
| Price | Free |
- IndiaFinBench addresses the gap in non-Western financial regulatory evaluation benchmarks
- The dataset includes 406 questions from 192 SEBI and RBI documents
- Four task types cover regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
- Twelve models were evaluated, with overall accuracy ranging from 70.4% to 89.7%
- All twelve models substantially outperformed the non-specialist human baseline of 60.0%
- IndiaFinBench fills a critical gap in financial NLP evaluation by focusing on Indian regulatory frameworks
- The benchmark demonstrates significant performance variation across models, with numerical reasoning being most discriminative
- Expert annotation quality is validated through both model-based and human inter-annotator agreement measures
- Bootstrap significance testing reveals three statistically distinct performance tiers among evaluated models
What is IndiaFinBench
IndiaFinBench is an evaluation benchmark specifically designed to assess large language model performance on Indian financial regulatory text. Existing financial NLP benchmarks draw exclusively from Western financial corpora, leaving a significant gap in coverage of non-Western regulatory frameworks [1]. The benchmark addresses this limitation by providing the first publicly available dataset focused on Indian financial regulations.
The dataset comprises 406 expert-annotated question-answer pairs sourced from 192 official documents. These documents originate from two primary Indian financial regulatory bodies: the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI). The benchmark evaluates models across four distinct task types that reflect real-world financial regulatory analysis requirements.
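The public write-up does not fix a record schema, so the sketch below is only an illustration of how one benchmark entry might be laid out; every field name and value is an assumption, not the released format.

```python
# Hypothetical IndiaFinBench record layout. All field names and values
# here are illustrative assumptions, not the published schema.
example_record = {
    "question_id": "ifb-0001",            # illustrative identifier
    "task_type": "numerical_reasoning",   # one of the four task types
    "regulator": "RBI",                   # SEBI or RBI
    "document_id": "rbi-circular-042",    # illustrative source-document key
    "question": "What is the minimum net owned fund requirement for ...?",
    "answer": "Rs. 2 crore",              # illustrative gold answer
}
```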
What is new vs the previous version
IndiaFinBench is the first benchmark of its kind; no earlier version exists, so the comparison below contrasts it with the prior state of financial NLP evaluation.
| Aspect | Previous State | IndiaFinBench Innovation |
|---|---|---|
| Geographic Coverage | Western financial corpora only | First Indian regulatory framework focus |
| Document Sources | SEC filings, US earnings reports | SEBI and RBI official documents |
| Task Diversity | Limited to basic comprehension | Four specialized task types |
| Annotation Quality | Variable validation methods | Dual validation: model-based and human |
How does IndiaFinBench work
IndiaFinBench operates through a structured evaluation framework that tests large language models across four specialized financial regulatory tasks.
- Document Collection: Researchers gathered 192 documents from SEBI and RBI official publications
- Question Generation: Expert annotators created 406 question-answer pairs distributed across four task categories
- Quality Validation: A two-stage validation process ensures annotation accuracy through model-based secondary passes and human inter-annotator agreement
- Model Evaluation: Twelve models undergo zero-shot evaluation across all task types
- Statistical Analysis: Bootstrap significance testing with 10,000 resamples determines performance tier distinctions
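The published analysis specifies only the 10,000-resample figure, not the full recipe, so the following is a minimal sketch of a standard paired bootstrap over per-question correctness; the function name and 0/1 scoring format are assumptions.

```python
import numpy as np

def bootstrap_accuracy_gap(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Paired bootstrap for the accuracy gap between two models.

    correct_a and correct_b are 0/1 arrays scored over the same questions.
    This is a generic sketch, not the benchmark's released analysis code.
    """
    a, b = np.asarray(correct_a), np.asarray(correct_b)
    rng = np.random.default_rng(seed)
    n = len(a)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)   # resample questions with replacement
        diffs[i] = a[idx].mean() - b[idx].mean()
    # Two models fall in the same tier when the resampled gap straddles zero.
    p_value = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return a.mean() - b.mean(), p_value
```

Running such a test for every model pair and grouping models whose gaps are not significant yields a tiered ranking of the kind the benchmark reports.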
Benchmarks and evidence
IndiaFinBench evaluation results demonstrate significant performance variation across models and task types.
| Metric | Result | Source |
|---|---|---|
| Highest accuracy (Gemini 2.5 Flash) | 89.7% | IndiaFinBench evaluation |
| Lowest accuracy (Gemma 4 E4B) | 70.4% | IndiaFinBench evaluation |
| Human baseline performance | 60.0% | IndiaFinBench evaluation |
| Numerical reasoning performance spread | 35.9 percentage points | IndiaFinBench evaluation |
| Inter-annotator agreement (kappa) | 0.611 | 60-item human evaluation |
| Model-based validation (kappa) | 0.918 | Contradiction detection task |
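Both agreement figures are Cohen's kappa coefficients, which discount the agreement two raters would reach by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement. A minimal two-rater computation, assuming nothing about the benchmark's own tooling:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters match
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both raters independently pick the same label
    p_e = sum((count_a[k] / n) * (count_b[k] / n) for k in count_a)
    return (p_o - p_e) / (1 - p_e)
```

On the common Landis and Koch scale, 0.611 indicates substantial human agreement and 0.918 near-perfect agreement for the model-based pass.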
Who should care
Builders
AI researchers developing financial NLP systems need IndiaFinBench to evaluate model performance on non-Western regulatory frameworks. Benchmark evaluations for LLMs attempt to measure model reasoning, factual accuracy, alignment, and safety [1]. The dataset provides essential validation for models intended to process Indian financial regulations.
Enterprise
Financial institutions operating in India require accurate AI systems for regulatory compliance and document analysis. IndiaFinBench enables organizations to assess whether their chosen models can reliably interpret SEBI and RBI requirements. The benchmark’s four task types directly align with common regulatory analysis workflows.
End users
Financial professionals and compliance officers benefit from understanding AI model capabilities on Indian regulatory text. The benchmark results help users set appropriate expectations for AI-assisted regulatory analysis and identify tasks requiring human oversight.
Investors
Venture capital and institutional investors evaluating fintech companies can use IndiaFinBench results to assess technical capabilities. The benchmark provides objective performance metrics for AI systems targeting the Indian financial market.
How to use IndiaFinBench today
IndiaFinBench is immediately accessible through its GitHub repository for researchers and developers.
- Access the repository: Visit https://github.com/rajveerpall/IndiaFinBench to download the dataset
- Review the documentation: Examine the provided evaluation code and model output examples
- Load your model: Implement the evaluation framework with your chosen large language model
- Run zero-shot evaluation: Execute the benchmark across all four task types without fine-tuning
- Analyze results: Compare your model’s performance against the published baseline results
- Submit findings: Consider contributing your results to the research community
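The repository ships its own evaluation code, so the loop below is only an orientation sketch of what a zero-shot pass involves; the JSONL file path, record fields, and query_model helper are hypothetical placeholders, not the released interface.

```python
import json

def run_zero_shot(dataset_path, query_model):
    """Minimal zero-shot pass: one prompt per question, exact-match scoring.

    dataset_path, the record fields, and query_model(prompt) -> str are
    placeholders; substitute the repository's actual format and your model.
    """
    with open(dataset_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    per_task = {}
    for rec in records:
        prompt = (
            "Answer the question using Indian financial regulation.\n\n"
            f"Q: {rec['question']}\nA:"
        )
        prediction = query_model(prompt).strip()
        correct = prediction.lower() == rec["answer"].strip().lower()
        per_task.setdefault(rec["task_type"], []).append(correct)
    return {task: sum(hits) / len(hits) for task, hits in per_task.items()}
```

Exact-match scoring is the simplest choice; the released code may use a different scorer, so check the repository documentation before comparing against the published numbers.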
IndiaFinBench vs competitors
IndiaFinBench occupies a unique position in the financial NLP evaluation landscape by focusing specifically on Indian regulatory frameworks.
| Benchmark | Geographic Focus | Document Sources | Task Types | Question Count |
|---|---|---|---|---|
| IndiaFinBench | India | SEBI, RBI documents | 4 specialized tasks | 406 |
| FinanceBench | United States | SEC filings, earnings reports | General comprehension | Not yet disclosed |
| LawBench | Multiple jurisdictions | Legal documents | Legal reasoning | Not yet disclosed |
Risks, limits, and myths
- Limited scope: The benchmark focuses only on SEBI and RBI documents, excluding other Indian financial regulators
- Language constraint: All documents are in English, potentially missing regional language regulatory content
- Temporal coverage: The dataset represents a specific time period and may not reflect evolving regulations
- Task specificity: Four task types may not capture all real-world regulatory analysis requirements
- Model selection bias: Evaluation limited to twelve models may not represent full market performance range
- Zero-shot limitation: Results may not reflect performance after domain-specific fine-tuning
FAQ
What makes IndiaFinBench different from other financial AI benchmarks?
IndiaFinBench is the first evaluation benchmark specifically designed for Indian financial regulatory text, addressing the gap left by Western-focused financial NLP datasets.
How many questions does IndiaFinBench contain?
IndiaFinBench contains 406 expert-annotated question-answer pairs distributed across four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning.
Which Indian regulatory bodies are covered in IndiaFinBench?
The benchmark draws from documents issued by the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).
What was the best performing model on IndiaFinBench?
Gemini 2.5 Flash achieved the highest accuracy of 89.7% across all tasks in the zero-shot evaluation.
How does human performance compare to AI models on IndiaFinBench?
All evaluated AI models substantially outperformed the non-specialist human baseline of 60.0%, with the lowest model achieving 70.4% accuracy.
Which task type is most challenging for AI models?
Numerical reasoning proved most discriminative, showing a 35.9 percentage-point performance spread across the twelve evaluated models.
Is IndiaFinBench available for commercial use?
The dataset, evaluation code, and model outputs are freely available through the GitHub repository at https://github.com/rajveerpall/IndiaFinBench.
How was annotation quality validated in IndiaFinBench?
Quality validation used a dual approach: model-based secondary passes achieving kappa=0.918 on contradiction detection, and human inter-annotator agreement evaluation with kappa=0.611.
Can I evaluate my own model using IndiaFinBench?
Yes, the benchmark provides evaluation code and documentation enabling researchers to test their own large language models against the dataset.
What statistical methods validate IndiaFinBench results?
Bootstrap significance testing with 10,000 resamples was used to establish three statistically distinct performance tiers among the evaluated models.
Glossary
- SEBI
- Securities and Exchange Board of India, the primary regulator of securities markets in India
- RBI
- Reserve Bank of India, the central banking institution and monetary authority of India
- Zero-shot evaluation
- Testing a model on a task with no task-specific fine-tuning and no worked examples in the prompt; the model answers from its general pretraining alone
- Kappa coefficient
- Statistical measure of inter-rater agreement that accounts for agreement occurring by chance
- Bootstrap significance testing
- Statistical method using repeated random sampling to determine the reliability of results
- Numerical reasoning
- AI task involving mathematical calculations and quantitative analysis within text
- Contradiction detection
- AI task identifying conflicting information within or across documents
- Temporal reasoning
- AI task involving understanding and processing time-related information and sequences
Sources
- Large language model – Wikipedia. https://en.wikipedia.org/wiki/Large_language_model
- IFEval | DeepEval by Confident AI – The LLM Evaluation Framework. https://deepeval.com/docs/benchmarks-ifeval
- FinanceBench: A New Benchmark for Financial Question Answering. https://wandb.ai/byyoung3/ml-news/reports/FinanceBench-A-New-Benchmark-for-Financial-Question-Answering--VmlldzoxMDE1OTM0Mw
- What Are Large Language Models (LLMs)? | IBM. https://www.ibm.com/think/topics/large-language-models
- Comparative Evaluation of Rule-Based and Large Language Models for Financial Transaction Extraction in Chatbots. https://jurnal.polgan.ac.id/index.php/sinkron/article/view/16020
- PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs. https://arxiv.org/html/2604.17543
- Large Language Models for Cybersecurity Intelligence: A Systematic Review. https://www.sciencedirect.com/org/science/article/pii/S1546221826003565
- A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains. https://arxiv.org/html/2508.15832