IndiaFinBench is the first publicly available evaluation benchmark for assessing large language model performance on Indian financial regulatory text. The benchmark contains 406 expert-annotated question-answer pairs from SEBI and RBI documents across four task types.
| Released by | Not yet disclosed |
|---|---|
| Release date | Not yet disclosed |
| What it is | Evaluation benchmark for LLM performance on Indian financial regulatory text |
| Who it’s for | AI researchers and financial technology developers |
| Where to get it | https://github.com/rajveerpall/IndiaFinBench |
| Price | Free |
- IndiaFinBench addresses the gap in non-Western financial regulatory evaluation benchmarks
- The dataset includes 406 questions from 192 SEBI and RBI documents
- Four task types cover regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
- Twelve models were evaluated, with overall accuracy ranging from 70.4% to 89.7%
- All twelve models substantially outperformed the non-specialist human baseline of 60.0%
- IndiaFinBench fills a critical gap in financial NLP evaluation by focusing on Indian regulatory frameworks
- The benchmark demonstrates significant performance variation across models, with numerical reasoning being most discriminative
- Expert annotation quality is validated through both model-based and human inter-annotator agreement measures
- Bootstrap significance testing reveals three statistically distinct performance tiers among evaluated models
What is IndiaFinBench
IndiaFinBench is an evaluation benchmark specifically designed to assess large language model performance on Indian financial regulatory text. Existing financial NLP benchmarks draw exclusively from Western financial corpora, leaving a significant gap in coverage of non-Western regulatory frameworks [1]. The benchmark addresses this limitation by providing the first publicly available dataset focused on Indian financial regulations.
The dataset comprises 406 expert-annotated question-answer pairs sourced from 192 official documents. These documents originate from two primary Indian financial regulatory bodies: the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI). The benchmark evaluates models across four distinct task types that reflect real-world financial regulatory analysis requirements.
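The public write-up does not fix a record schema, so the sketch below is only an illustration of how one benchmark entry might be laid out; every field name and value is an assumption, not the released format.

```python
# Hypothetical IndiaFinBench record layout. All field names and values
# here are illustrative assumptions, not the published schema.
example_record = {
    "question_id": "ifb-0001",            # illustrative identifier
    "task_type": "numerical_reasoning",   # one of the four task types
    "regulator": "RBI",                   # SEBI or RBI
    "document_id": "rbi-circular-042",    # illustrative source-document key
    "question": "What is the minimum net owned fund requirement for ...?",
    "answer": "Rs. 2 crore",              # illustrative gold answer
}
```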
What is new vs the previous version
IndiaFinBench is the first benchmark of its kind; no earlier version exists, so the comparison below contrasts it with the prior state of financial NLP evaluation.
| Aspect | Previous State | IndiaFinBench Innovation |
|---|---|---|
| Geographic Coverage | Western financial corpora only | First Indian regulatory framework focus |
| Document Sources | SEC filings, US earnings reports | SEBI and RBI official documents |
| Task Diversity | Limited to basic comprehension | Four specialized task types |
| Annotation Quality | Variable validation methods | Dual validation: model-based and human |
How does IndiaFinBench work
IndiaFinBench operates through a structured evaluation framework that tests large language models across four specialized financial regulatory tasks.
- Document Collection: Researchers gathered 192 documents from SEBI and RBI official publications
- Question Generation: Expert annotators created 406 question-answer pairs distributed across four task categories
- Quality Validation: A two-stage validation process ensures annotation accuracy through model-based secondary passes and human inter-annotator agreement
- Model Evaluation: Twelve models undergo zero-shot evaluation across all task types
- Statistical Analysis: Bootstrap significance testing with 10,000 resamples determines performance tier distinctions
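The published analysis specifies only the 10,000-resample figure, not the full recipe, so the following is a minimal sketch of a standard paired bootstrap over per-question correctness; the function name and 0/1 scoring format are assumptions.

```python
import numpy as np

def bootstrap_accuracy_gap(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Paired bootstrap for the accuracy gap between two models.

    correct_a and correct_b are 0/1 arrays scored over the same questions.
    This is a generic sketch, not the benchmark's released analysis code.
    """
    a, b = np.asarray(correct_a), np.asarray(correct_b)
    rng = np.random.default_rng(seed)
    n = len(a)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)   # resample questions with replacement
        diffs[i] = a[idx].mean() - b[idx].mean()
    # Two models fall in the same tier when the resampled gap straddles zero.
    p_value = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return a.mean() - b.mean(), p_value
```

Running such a test for every model pair and grouping models whose gaps are not significant yields a tiered ranking of the kind the benchmark reports.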
Benchmarks and evidence
IndiaFinBench evaluation results demonstrate significant performance variation across models and task types.
| Metric | Result | Source |
|---|---|---|
| Highest accuracy (Gemini 2.5 Flash) | 89.7% | IndiaFinBench evaluation |
| Lowest accuracy (Gemma 4 E4B) | 70.4% | IndiaFinBench evaluation |
| Human baseline performance | 60.0% | IndiaFinBench evaluation |
| Numerical reasoning performance spread | 35.9 percentage points | IndiaFinBench evaluation |
| Inter-annotator agreement (kappa) | 0.611 | 60-item human evaluation |
| Model-based validation (kappa) | 0.918 | Contradiction detection task |
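Both agreement figures are Cohen's kappa coefficients, which discount the agreement two raters would reach by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement. A minimal two-rater computation, assuming nothing about the benchmark's own tooling:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters match
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both raters independently pick the same label
    p_e = sum((count_a[k] / n) * (count_b[k] / n) for k in count_a)
    return (p_o - p_e) / (1 - p_e)
```

On the common Landis and Koch scale, 0.611 indicates substantial human agreement and 0.918 near-perfect agreement for the model-based pass.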
Who should care
Builders
AI researchers developing financial NLP systems need IndiaFinBench to evaluate model performance on non-Western regulatory frameworks. Benchmark evaluations for LLMs attempt to measure model reasoning, factual accuracy, alignment, and safety [1]. The dataset provides essential validation for models intended to process Indian financial regulations.
Enterprise
Financial institutions operating in India require accurate AI systems for regulatory compliance and document analysis. IndiaFinBench enables organizations to assess whether their chosen models can reliably interpret SEBI and RBI requirements. The benchmark’s four task types directly align with common regulatory analysis workflows.
End users
Financial professionals and compliance officers benefit from understanding AI model capabilities on Indian regulatory text. The benchmark results help users set appropriate expectations for AI-assisted regulatory analysis and identify tasks requiring human oversight.
Investors
Venture capital and institutional investors evaluating fintech companies can use IndiaFinBench results to assess technical capabilities. The benchmark provides objective performance metrics for AI systems targeting the Indian financial market.
How to use IndiaFinBench today
IndiaFinBench is immediately accessible through its GitHub repository for researchers and developers.
- Access the repository: Visit https://github.com/rajveerpall/IndiaFinBench to download the dataset
- Review the documentation: Examine the provided evaluation code and model output examples
- Load your model: Implement the evaluation framework with your chosen large language model
- Run zero-shot evaluation: Execute the benchmark across all four task types without fine-tuning
- Analyze results: Compare your model’s performance against the published baseline results
- Submit findings: Consider contributing your results to the research community
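The repository ships its own evaluation code, so the loop below is only an orientation sketch of what a zero-shot pass involves; the JSONL file path, record fields, and query_model helper are hypothetical placeholders, not the released interface.

```python
import json

def run_zero_shot(dataset_path, query_model):
    """Minimal zero-shot pass: one prompt per question, exact-match scoring.

    dataset_path, the record fields, and query_model(prompt) -> str are
    placeholders; substitute the repository's actual format and your model.
    """
    with open(dataset_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    per_task = {}
    for rec in records:
        prompt = (
            "Answer the question using Indian financial regulation.\n\n"
            f"Q: {rec['question']}\nA:"
        )
        prediction = query_model(prompt).strip()
        correct = prediction.lower() == rec["answer"].strip().lower()
        per_task.setdefault(rec["task_type"], []).append(correct)
    return {task: sum(hits) / len(hits) for task, hits in per_task.items()}
```

Exact-match scoring is the simplest choice; the released code may use a different scorer, so check the repository documentation before comparing against the published numbers.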
IndiaFinBench vs competitors
IndiaFinBench occupies a unique position in the financial NLP evaluation landscape by focusing specifically on Indian regulatory frameworks.
| Benchmark | Geographic Focus | Document Sources | Task Types | Question Count |
|---|---|---|---|---|
| IndiaFinBench | India | SEBI, RBI documents | 4 specialized tasks | 406 |
| FinanceBench | United States | SEC filings, earnings reports | General comprehension | Not yet disclosed |
| LawBench | Multiple jurisdictions | Legal documents | Legal reasoning | Not yet disclosed |
Risks, limits, and myths
- Limited scope: The benchmark focuses only on SEBI and RBI documents, excluding other Indian financial regulators
- Language constraint: All documents are in English, potentially missing regional language regulatory content
- Temporal coverage: The dataset represents a specific time period and may not reflect evolving regulations
- Task specificity: Four task types may not capture all real-world regulatory analysis requirements
- Model selection bias: Evaluation limited to twelve models may not represent full market performance range
- Zero-shot limitation: Results may not reflect performance after domain-specific fine-tuning
FAQ
What makes IndiaFinBench different from other financial AI benchmarks?
IndiaFinBench is the first evaluation benchmark specifically designed for Indian financial regulatory text, addressing the gap left by Western-focused financial NLP datasets.
How many questions does IndiaFinBench contain?
IndiaFinBench contains 406 expert-annotated question-answer pairs distributed across four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning.
Which Indian regulatory bodies are covered in IndiaFinBench?
The benchmark draws from documents issued by the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).
What was the best performing model on IndiaFinBench?
Gemini 2.5 Flash achieved the highest accuracy of 89.7% across all tasks in the zero-shot evaluation.
How does human performance compare to AI models on IndiaFinBench?
All evaluated AI models substantially outperformed the non-specialist human baseline of 60.0%, with the lowest model achieving 70.4% accuracy.
Which task type is most challenging for AI models?
Numerical reasoning proved most discriminative, showing a 35.9 percentage-point performance spread across the twelve evaluated models.
Is IndiaFinBench available for commercial use?
The dataset, evaluation code, and model outputs are freely available through the GitHub repository at https://github.com/rajveerpall/IndiaFinBench.
How was annotation quality validated in IndiaFinBench?
Quality validation used a dual approach: model-based secondary passes achieving kappa=0.918 on contradiction detection, and human inter-annotator agreement evaluation with kappa=0.611.
Can I evaluate my own model using IndiaFinBench?
Yes, the benchmark provides evaluation code and documentation enabling researchers to test their own large language models against the dataset.
What statistical methods validate IndiaFinBench results?
Bootstrap significance testing with 10,000 resamples was used to establish three statistically distinct performance tiers among the evaluated models.
Glossary
- SEBI
- Securities and Exchange Board of India, the primary regulator of securities markets in India
- RBI
- Reserve Bank of India, the central banking institution and monetary authority of India
- Zero-shot evaluation
- Testing a model on a task with no task-specific fine-tuning and no worked examples in the prompt; the model answers from its general pretraining alone
- Kappa coefficient
- Statistical measure of inter-rater agreement that accounts for agreement occurring by chance
- Bootstrap significance testing
- Statistical method using repeated random sampling to determine the reliability of results
- Numerical reasoning
- AI task involving mathematical calculations and quantitative analysis within text
- Contradiction detection
- AI task identifying conflicting information within or across documents
- Temporal reasoning
- AI task involving understanding and processing time-related information and sequences
Sources
- Large language model – Wikipedia. https://en.wikipedia.org/wiki/Large_language_model
- IFEval | DeepEval by Confident AI – The LLM Evaluation Framework. https://deepeval.com/docs/benchmarks-ifeval
- FinanceBench: A New Benchmark for Financial Question Answering. https://wandb.ai/byyoung3/ml-news/reports/FinanceBench-A-New-Benchmark-for-Financial-Question-Answering--VmlldzoxMDE1OTM0Mw
- What Are Large Language Models (LLMs)? | IBM. https://www.ibm.com/think/topics/large-language-models
- Comparative Evaluation of Rule-Based and Large Language Models for Financial Transaction Extraction in Chatbots. https://jurnal.polgan.ac.id/index.php/sinkron/article/view/16020
- PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs. https://arxiv.org/html/2604.17543
- Large Language Models for Cybersecurity Intelligence: A Systematic Review. https://www.sciencedirect.com/org/science/article/pii/S1546221826003565
- A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains. https://arxiv.org/html/2508.15832