IndiaFinBench is the first publicly available evaluation benchmark designed to assess large language model performance on Indian financial regulatory text, featuring 406 expert-annotated question-answer pairs from SEBI and RBI documents across four specialized task types.
| Released by | Not yet disclosed |
|---|---|
| Release date | |
| What it is | Evaluation benchmark for LLM performance on Indian financial regulatory text |
| Who it is for | AI researchers and financial technology developers |
| Where to get it | GitHub repository at rajveerpall/IndiaFinBench |
| Price | Free |
- IndiaFinBench addresses the gap in non-Western financial NLP benchmarks with 406 expert-annotated questions from Indian regulatory documents
- The benchmark covers four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
- Twelve models were evaluated with accuracy ranging from 70.4% to 89.7% under zero-shot conditions
- Numerical reasoning proved most discriminative with a 35.9 percentage-point performance spread across models
- All tested models substantially outperformed a non-specialist human baseline of 60.0% accuracy
- IndiaFinBench fills a critical gap in financial NLP evaluation by focusing on Indian regulatory frameworks rather than Western financial corpora
- The benchmark demonstrates significant performance variation across models, particularly in numerical reasoning tasks
- High annotation quality is validated through model-based secondary passes and human inter-annotator agreement evaluation
- Bootstrap significance testing reveals three statistically distinct performance tiers among evaluated models
What is IndiaFinBench
IndiaFinBench is an evaluation benchmark specifically designed to assess large language model performance on Indian financial regulatory documents. Benchmark evaluations for LLMs attempt to measure model reasoning, factual accuracy, alignment, and safety [1]. The benchmark contains 406 expert-annotated question-answer pairs sourced from 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).
The benchmark addresses a significant gap in existing financial NLP evaluation tools, which draw exclusively from Western financial corpora such as SEC filings, US earnings reports, and English-language financial news. IndiaFinBench provides the first comprehensive evaluation framework targeting a non-Western financial regulatory regime.
What is new vs previous benchmarks
IndiaFinBench introduces several novel elements compared to existing financial evaluation benchmarks:
| Feature | IndiaFinBench | Previous Financial Benchmarks |
|---|---|---|
| Geographic focus | Indian regulatory framework | Western financial corpora exclusively |
| Document sources | SEBI and RBI regulatory documents | SEC filings, US earnings reports |
| Task diversity | Four specialized task types | General financial question answering |
| Annotation validation | Model-based secondary pass plus human agreement | Standard human annotation only |
| Statistical analysis | Bootstrap significance testing with 10,000 resamples | Basic accuracy reporting |
How does IndiaFinBench work
IndiaFinBench operates through a structured evaluation process across four distinct task types:
- Regulatory interpretation tasks: 174 items testing model understanding of Indian financial regulations and compliance requirements
- Numerical reasoning tasks: 92 items evaluating mathematical computation and quantitative analysis capabilities
- Contradiction detection tasks: 62 items assessing ability to identify conflicting information within regulatory documents
- Temporal reasoning tasks: 78 items testing understanding of time-dependent regulatory changes and sequences
The evaluation methodology employs zero-shot conditions where models receive no task-specific training examples. Annotation quality is validated in two ways: a model-based secondary pass achieving kappa = 0.918 on contradiction detection tasks, and a 60-item human inter-annotator agreement evaluation yielding kappa = 0.611 with 76.7% overall agreement.
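The kappa figures above measure chance-corrected agreement between two label sequences. A minimal sketch of Cohen's kappa follows; the two annotator label lists are illustrative toy data, not drawn from the benchmark itself.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: inter-rater agreement corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: two annotators labelling 10 items as
# contradiction ("C") or no contradiction ("N").
a = ["C", "C", "N", "N", "C", "N", "C", "C", "N", "N"]
b = ["C", "C", "N", "C", "C", "N", "C", "N", "N", "N"]
print(round(cohens_kappa(a, b), 3))  # 0.6: 80% raw agreement, 50% expected by chance
```

A raw-agreement score of 80% shrinks to kappa = 0.6 once chance agreement is removed, which is why the paper reports kappa alongside the 76.7% overall agreement figure.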
Benchmarks and evidence
Researchers and practitioners look at qualities like accuracy, efficiency, safety, fairness, and robustness to determine how well a model performs [4]. IndiaFinBench evaluation results show significant performance variation across the twelve tested models:
| Model | Overall Accuracy | Performance Tier | Source |
|---|---|---|---|
| Gemini 2.5 Flash | 89.7% | Tier 1 | IndiaFinBench evaluation |
| Mid-range models | 75-85% | Tier 2 | IndiaFinBench evaluation |
| Gemma 4 E4B | 70.4% | Tier 3 | IndiaFinBench evaluation |
| Non-specialist human | 60.0% | Baseline | IndiaFinBench evaluation |
Numerical reasoning tasks showed the highest discriminative power with a 35.9 percentage-point spread across models. Bootstrap significance testing with 10,000 resamples confirmed three statistically distinct performance tiers among evaluated models.
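The tiering above rests on percentile-bootstrap confidence intervals over per-item correctness. The sketch below illustrates the idea under stated assumptions: the two correctness vectors are synthetic stand-ins shaped to roughly match the reported 89.7% and 70.4% accuracies on 406 items, not the published model outputs.

```python
import random

def bootstrap_accuracy_ci(correct_flags, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy over per-item 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct_flags)
    stats = []
    for _ in range(n_resamples):
        # Resample items with replacement and recompute accuracy.
        sample = [correct_flags[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-item results for a top-tier and a bottom-tier model.
model_a = [1] * 364 + [0] * 42   # ~89.7% correct (synthetic)
model_b = [1] * 286 + [0] * 120  # ~70.4% correct (synthetic)
lo_a, hi_a = bootstrap_accuracy_ci(model_a)
lo_b, hi_b = bootstrap_accuracy_ci(model_b)
print(f"A: [{lo_a:.3f}, {hi_a:.3f}]  B: [{lo_b:.3f}, {hi_b:.3f}]")
```

When two models' intervals do not overlap, as here, they can be placed in statistically distinct tiers; overlapping intervals leave the ordering unresolved.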
Who should care
Builders
AI researchers developing financial NLP systems need IndiaFinBench to evaluate model performance on non-Western regulatory frameworks. The benchmark provides standardized evaluation metrics for Indian financial document processing capabilities.
Enterprise
Financial institutions operating in India require accurate AI systems for regulatory compliance and document analysis. IndiaFinBench enables assessment of LLM capabilities for Indian regulatory interpretation and numerical reasoning tasks.
End users
Financial professionals working with Indian regulatory documents benefit from AI tools validated against IndiaFinBench. The benchmark gives a standard for assessing how reliably such tools handle SEBI and RBI document analysis.
Investors
Investment firms focusing on Indian markets need AI systems capable of processing local regulatory requirements. IndiaFinBench provides validation metrics for financial AI tools in the Indian regulatory context.
How to use IndiaFinBench today
IndiaFinBench is available for immediate use through its GitHub repository:
- Access the repository: Visit https://github.com/rajveerpall/IndiaFinBench to download the complete dataset
- Review evaluation code: The repository includes standardized evaluation scripts for consistent model assessment
- Load the dataset: Import the 406 question-answer pairs organized by task type for systematic evaluation
- Run zero-shot evaluation: Test your model without task-specific training examples following the established protocol
- Compare results: Use the provided baseline scores and statistical analysis framework for performance comparison
The repository also includes all model outputs from the original evaluation, enabling direct comparison with the established performance benchmarks.
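The evaluation steps above can be sketched as a small scoring loop. The item schema (`question`, `answer`, and `task_type` keys) and exact-match scoring are assumptions for illustration; the repository's actual file format and grading protocol may differ.

```python
def evaluate_zero_shot(items, predict):
    """Score a model per task type with no in-context examples.

    `predict` maps a question string to an answer string; `items` are dicts
    with (assumed) keys "question", "answer", and "task_type".
    """
    per_task = {}
    for item in items:
        task = item["task_type"]
        # Exact match after whitespace/case normalization (an assumed grader).
        correct = predict(item["question"]).strip().lower() == item["answer"].strip().lower()
        hits, total = per_task.get(task, (0, 0))
        per_task[task] = (hits + correct, total + 1)
    return {task: hits / total for task, (hits, total) in per_task.items()}

# Toy run with a trivial "model" that always answers "yes".
items = [
    {"question": "Q1", "answer": "yes", "task_type": "numerical_reasoning"},
    {"question": "Q2", "answer": "no", "task_type": "numerical_reasoning"},
]
print(evaluate_zero_shot(items, lambda q: "yes"))  # {'numerical_reasoning': 0.5}
```

In practice `predict` would wrap an LLM API call with only the question in the prompt, matching the zero-shot protocol.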
IndiaFinBench vs competitors
| Benchmark | Geographic Focus | Document Sources | Task Types | Validation Method |
|---|---|---|---|---|
| IndiaFinBench | Indian regulatory framework | SEBI, RBI documents | 4 specialized tasks | Model-based + human validation |
| FinanceBench | Western markets | SEC filings, earnings reports | General financial QA | Human annotation |
| LawBench | General legal domain | Legal documents | Legal reasoning | Expert annotation |
Risks, limits, and myths
- Limited scope: The benchmark focuses exclusively on Indian regulatory documents, limiting generalizability to other financial markets
- Language constraints: Evaluation is conducted in English, potentially missing nuances in regional Indian financial terminology
- Temporal limitations: Regulatory documents have specific time periods, requiring regular updates to maintain relevance
- Task type bias: Numerical reasoning shows highest discriminative power, potentially overweighting quantitative capabilities
- Human baseline limitations: The 60.0% non-specialist human baseline may not represent expert-level performance expectations
- Model selection bias: Twelve evaluated models may not represent the full spectrum of available LLM capabilities
FAQ
What makes IndiaFinBench different from other financial AI benchmarks?
IndiaFinBench is the first evaluation benchmark specifically designed for Indian financial regulatory documents, addressing the gap in non-Western financial NLP evaluation tools that previously focused exclusively on Western financial corpora.
How many questions are included in the IndiaFinBench dataset?
IndiaFinBench contains 406 expert-annotated question-answer pairs sourced from 192 documents from SEBI and RBI, distributed across four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning.
Which AI models perform best on IndiaFinBench tasks?
Gemini 2.5 Flash achieved the highest accuracy at 89.7%, while Gemma 4 E4B scored lowest at 70.4%. All twelve evaluated models substantially outperformed the non-specialist human baseline of 60.0%.
What is the most challenging task type in IndiaFinBench?
Numerical reasoning tasks proved most discriminative with a 35.9 percentage-point spread across models, indicating significant variation in quantitative analysis capabilities among different LLMs.
How is annotation quality validated in IndiaFinBench?
Annotation quality is validated through a model-based secondary pass achieving kappa=0.918 on contradiction detection and a 60-item human inter-annotator agreement evaluation with kappa=0.611 and 76.7% overall agreement.
Can I access IndiaFinBench for free?
Yes, IndiaFinBench is freely available through its GitHub repository at rajveerpall/IndiaFinBench, including the complete dataset, evaluation code, and all model outputs from the original study.
What regulatory bodies are covered in IndiaFinBench documents?
IndiaFinBench draws from documents issued by two primary Indian financial regulatory authorities: the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).
How does IndiaFinBench ensure statistical significance in results?
The benchmark employs bootstrap significance testing with 10,000 resamples to establish three statistically distinct performance tiers among evaluated models, ensuring robust statistical validation of results.
What evaluation methodology does IndiaFinBench use?
IndiaFinBench uses zero-shot evaluation conditions where models receive no task-specific training examples, providing a standardized assessment of inherent model capabilities on Indian financial regulatory text.
Who should use IndiaFinBench for model evaluation?
AI researchers, financial technology developers, regulatory compliance teams, and financial institutions operating in India should use IndiaFinBench to evaluate LLM performance on Indian regulatory document processing tasks.
Glossary
- SEBI: Securities and Exchange Board of India, the regulatory authority for securities markets in India
- RBI: Reserve Bank of India, the central banking institution and monetary authority of India
- Zero-shot evaluation: Testing model performance without providing task-specific training examples or fine-tuning
- Kappa coefficient: Statistical measure of inter-rater agreement that accounts for agreement occurring by chance
- Bootstrap significance testing: Statistical method using resampling to determine confidence intervals and significance of results
- Contradiction detection: Task type requiring identification of conflicting information within regulatory documents
- Temporal reasoning: Understanding of time-dependent relationships and sequences in regulatory changes
Sources
- Large language model – Wikipedia. https://en.wikipedia.org/wiki/Large_language_model
- IFEval | DeepEval by Confident AI – The LLM Evaluation Framework. https://deepeval.com/docs/benchmarks-ifeval
- FinanceBench: A New Benchmark for Financial Question Answering. https://wandb.ai/byyoung3/ml-news/reports/FinanceBench-A-New-Benchmark-for-Financial-Question-Answering--VmlldzoxMDE1OTM0Mw
- What Are Large Language Models (LLMs)? | IBM. https://www.ibm.com/think/topics/large-language-models
- Comparative Evaluation of Rule-Based and Large Language Models for Financial Transaction Extraction in Chatbots. https://jurnal.polgan.ac.id/index.php/sinkron/article/view/16020
- PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs. https://arxiv.org/html/2604.17543
- Large Language Models for Cybersecurity Intelligence: A Systematic Review. https://www.sciencedirect.com/org/science/article/pii/S1546221826003565
- A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains. https://arxiv.org/html/2508.15832