IndiaFinBench is the first publicly available evaluation benchmark for assessing large language model performance on Indian financial regulatory text, featuring 406 expert-annotated question-answer pairs from SEBI and RBI documents across four task types.
| Field | Detail |
|---|---|
| Released by | Not yet disclosed |
| Release date | Not yet disclosed |
| What it is | LLM evaluation benchmark for Indian financial regulatory text |
| Who it is for | AI researchers and financial technology developers |
| Where to get it | https://github.com/rajveerpall/IndiaFinBench |
| Price | Free |
- IndiaFinBench contains 406 expert-annotated question-answer pairs from 192 SEBI and RBI documents
- Four task types include regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
- Twelve models tested showed accuracy ranging from 70.4% to 89.7% under zero-shot conditions
- Numerical reasoning proved most discriminative with 35.9 percentage-point spread across models
- IndiaFinBench fills a critical gap in LLM evaluation for non-Western financial regulatory frameworks
- Expert annotation quality validated through model-based secondary pass with kappa=0.918 on contradiction detection
- All twelve tested models substantially outperformed non-specialist human baseline of 60.0% accuracy
- Bootstrap significance testing revealed three statistically distinct performance tiers across models
- Complete dataset, evaluation code, and model outputs available for reproducible research
What is IndiaFinBench
IndiaFinBench is an evaluation benchmark specifically designed to assess large language model performance on Indian financial regulatory text. The benchmark addresses a significant gap in existing financial NLP evaluation tools, which draw exclusively from Western financial corpora like SEC filings and US earnings reports.
The benchmark contains 406 expert-annotated question-answer pairs sourced from 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI). These documents represent the core regulatory framework governing India’s financial sector.
IndiaFinBench evaluates models across four distinct task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Each task type tests different aspects of language model comprehension and reasoning capabilities within the Indian financial regulatory context.
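The repository's exact record schema is not described here. As an illustration only, a plausible JSONL-style item layout and a per-task tally might look like the following; the field names (`id`, `task`, `question`, `answer`) and task identifiers are assumptions, not the benchmark's actual format:

```python
from collections import Counter

# Hypothetical record layout -- the repository's actual schema may differ.
sample_items = [
    {"id": "sebi-0001", "task": "regulatory_interpretation",
     "question": "...", "answer": "..."},
    {"id": "rbi-0042", "task": "numerical_reasoning",
     "question": "...", "answer": "..."},
    {"id": "rbi-0043", "task": "contradiction_detection",
     "question": "...", "answer": "..."},
]

def task_distribution(items):
    """Tally items per task type (the full benchmark splits 174/92/62/78)."""
    return Counter(item["task"] for item in items)

print(task_distribution(sample_items))
```

Counting items by task this way is how one would verify the published 174/92/62/78 split after downloading the dataset.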
What is new vs the previous version
IndiaFinBench represents the first benchmark of its kind rather than an update to existing tools.
| Aspect | Previous Financial Benchmarks | IndiaFinBench |
|---|---|---|
| Geographic Focus | Western financial markets only | Indian regulatory framework |
| Document Sources | SEC filings, US earnings reports | SEBI and RBI regulatory documents |
| Language Context | English-language financial news | Indian financial regulatory text |
| Task Diversity | Limited task types | Four specialized task categories |
| Annotation Quality | Varies by benchmark | Model-validated with kappa=0.918 |
How does IndiaFinBench work
IndiaFinBench operates through a structured evaluation framework that tests language models across four specialized financial regulatory tasks.
- Document Collection: Researchers gathered 192 regulatory documents from SEBI and RBI covering various aspects of Indian financial regulation
- Question Generation: Expert annotators created 406 question-answer pairs distributed across four task categories based on document content
- Quality Validation: Annotation quality underwent validation through model-based secondary pass and human inter-annotator agreement evaluation
- Model Testing: Twelve language models were evaluated under zero-shot conditions, with no task-specific training or in-context examples
- Statistical Analysis: Bootstrap significance testing with 10,000 resamples determined statistically distinct performance tiers
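The paper's exact bootstrap procedure is not reproduced here, but a paired bootstrap over per-item correctness, as a minimal sketch, works like this: resample the 406 items with replacement many times and count how often one model's accuracy advantage disappears.

```python
import random

def bootstrap_accuracy_diff(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Paired bootstrap: resample items with replacement and measure how often
    model A's accuracy fails to exceed model B's. The returned fraction is a
    one-sided p-value estimate for "A is better than B"."""
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    worse_or_equal = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # same items for both models
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        if acc_a <= acc_b:
            worse_or_equal += 1
    return worse_or_equal / n_resamples

# Toy example over 406 items: model A answers 85% correctly, model B 70%.
a = [1] * 345 + [0] * 61
b = [1] * 284 + [0] * 122
p = bootstrap_accuracy_diff(a, b)
print(f"estimated p-value: {p:.4f}")
```

A gap this large over 406 items survives essentially every resample, which is the kind of evidence behind the "statistically distinct performance tiers" claim.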
Benchmarks and evidence
IndiaFinBench evaluation results demonstrate significant performance variation across different language models and task types.
| Model Performance Metric | Result | Source |
|---|---|---|
| Highest accuracy achieved | 89.7% (Gemini 2.5 Flash) | IndiaFinBench evaluation |
| Lowest accuracy achieved | 70.4% (Gemma 3n E4B) | IndiaFinBench evaluation |
| Human baseline accuracy | 60.0% (non-specialist) | IndiaFinBench evaluation |
| Most discriminative task spread | 35.9 percentage points (numerical reasoning) | IndiaFinBench evaluation |
| Inter-annotator agreement kappa | 0.611 (76.7% overall agreement) | 60-item human evaluation |
| Contradiction detection validation kappa | 0.918 | Model-based secondary pass |
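Cohen's kappa, the agreement statistic reported above, corrects raw agreement for the agreement two raters would reach by chance. A minimal implementation on toy labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both raters pick the same class at random.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: two annotators labeling 10 contradiction-detection items.
a = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]
b = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes"]
print(round(cohens_kappa(a, b), 3))  # 0.583
```

This is why 76.7% raw agreement maps to a lower kappa of 0.611: part of that raw agreement is attributable to chance.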
Who should care
Builders
AI developers building financial applications for Indian markets can use IndiaFinBench to evaluate model performance on regulatory comprehension before deployment. The benchmark provides standardized testing for models that must read and interpret SEBI and RBI documentation.
Enterprise
Financial institutions operating in India require accurate LLM evaluation for regulatory compliance systems. IndiaFinBench enables enterprises to assess model capabilities for processing Indian financial regulations, supporting automated compliance monitoring and regulatory document analysis.
End users
Financial technology users benefit indirectly: applications tested against IndiaFinBench give their builders evidence of how accurately the underlying models handle Indian regulatory text, which supports better regulatory guidance and compliance assistance.
Investors
Investment firms focusing on Indian fintech companies can use IndiaFinBench results to evaluate the technical capabilities of AI-powered financial services. The benchmark provides objective performance metrics for assessing regulatory compliance technology investments.
How to use IndiaFinBench today
IndiaFinBench provides immediate access through its GitHub repository for researchers and developers.
- Access the repository: Visit https://github.com/rajveerpall/IndiaFinBench to download the complete dataset
- Install evaluation framework: Clone the repository and install required dependencies listed in requirements.txt
- Load your model: Configure your language model to work with the provided evaluation scripts
- Run evaluation: Execute the benchmark using the provided evaluation code across all four task types
- Analyze results: Compare your model’s performance against the published baseline results and statistical significance tests
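For the final step, a score can be situated against the published headline numbers. This sketch assumes you have already computed an overall accuracy yourself; the thresholds mirror the figures reported above, but the helper itself is illustrative and not part of the repository:

```python
# Published headline figures from the IndiaFinBench evaluation.
PUBLISHED = {"best": 0.897, "worst": 0.704, "human_baseline": 0.600}

def position_vs_published(overall_accuracy):
    """Situate an overall accuracy score relative to the reported range."""
    if overall_accuracy > PUBLISHED["best"]:
        return "above the best reported model (89.7%)"
    if overall_accuracy >= PUBLISHED["worst"]:
        return "within the reported 70.4%-89.7% model range"
    if overall_accuracy >= PUBLISHED["human_baseline"]:
        return "below tested models but above the 60.0% human baseline"
    return "below the 60.0% non-specialist human baseline"

print(position_vs_published(0.82))
```

For a fair comparison, per-task accuracies matter as much as the overall figure, since numerical reasoning alone spans 35.9 percentage points across the tested models.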
IndiaFinBench vs competitors
IndiaFinBench stands alone as the first benchmark specifically designed for Indian financial regulatory text evaluation.
| Benchmark | Geographic Focus | Document Sources | Task Types | Question Count |
|---|---|---|---|---|
| IndiaFinBench | Indian regulations | SEBI, RBI documents | 4 specialized tasks | 406 questions |
| FinanceBench | US markets | SEC filings, earnings | General financial QA | Not yet disclosed |
| LawBench | General legal | Various legal texts | Legal reasoning | Not yet disclosed |
| LexEval | Legal domains | Legal documents | Legal evaluation | Not yet disclosed |
Risks, limits, and myths
- Limited scope: IndiaFinBench focuses exclusively on SEBI and RBI documents, potentially missing other Indian financial regulatory bodies
- Zero-shot evaluation only: Current testing excludes few-shot or fine-tuned model performance assessment
- Language limitation: Benchmark covers English-language regulatory text, excluding regional language financial documents
- Temporal constraints: The benchmark reflects regulations as of its collection period, so results may not transfer to future regulatory amendments
- Expert annotation bias: Human annotators may introduce subjective interpretations despite validation measures
- Model selection bias: Twelve tested models may not represent complete landscape of available language models
FAQ
- What makes IndiaFinBench different from other financial AI benchmarks?
- IndiaFinBench specifically evaluates language models on Indian financial regulatory text from SEBI and RBI, addressing the gap left by Western-focused financial benchmarks that use SEC filings and US earnings reports.
- How many questions does IndiaFinBench contain for model evaluation?
- IndiaFinBench contains 406 expert-annotated question-answer pairs distributed across four task types: regulatory interpretation (174), numerical reasoning (92), contradiction detection (62), and temporal reasoning (78).
- Which language models performed best on IndiaFinBench testing?
- Gemini 2.5 Flash achieved the highest accuracy at 89.7%, while Gemma 3n E4B scored lowest at 70.4%. All twelve tested models outperformed the 60.0% human baseline.
- Where can researchers access IndiaFinBench dataset and evaluation tools?
- The complete IndiaFinBench dataset, evaluation code, and all model outputs are freely available at https://github.com/rajveerpall/IndiaFinBench for reproducible research.
- What validation methods ensure IndiaFinBench annotation quality?
- Annotation quality underwent validation through model-based secondary pass achieving kappa=0.918 on contradiction detection and 60-item human inter-annotator agreement evaluation with kappa=0.611.
- Which task type shows the largest performance differences between models?
- Numerical reasoning proved most discriminative with a 35.9 percentage-point spread across models, indicating significant variation in mathematical reasoning capabilities within financial contexts.
- How does IndiaFinBench handle statistical significance in model comparisons?
- Bootstrap significance testing with 10,000 resamples revealed three statistically distinct performance tiers, ensuring reliable model performance comparisons beyond simple accuracy scores.
- What regulatory documents form the foundation of IndiaFinBench questions?
- IndiaFinBench draws from 192 regulatory documents sourced from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).
- Can IndiaFinBench evaluate models trained specifically on financial data?
- Current IndiaFinBench evaluation focuses on zero-shot conditions without task-specific training, though the framework could potentially accommodate fine-tuned model assessment.
- What languages does IndiaFinBench support for regulatory text evaluation?
- IndiaFinBench currently evaluates English-language Indian financial regulatory text, with no disclosed plans for regional language document inclusion.
Glossary
- Bootstrap significance testing
- Statistical method using repeated random sampling to determine if performance differences between models are statistically meaningful rather than due to chance
- Contradiction detection
- Task type requiring models to identify conflicting information within regulatory documents or between different regulatory statements
- Inter-annotator agreement
- Measure of consistency between different human annotators when labeling the same data, typically expressed as kappa coefficient
- Kappa coefficient
- Statistical measure of inter-rater reliability accounting for agreement occurring by chance, with values closer to 1.0 indicating higher agreement
- Numerical reasoning
- Task type requiring models to perform mathematical calculations and quantitative analysis within financial regulatory contexts
- RBI
- Reserve Bank of India, the central banking institution responsible for monetary policy and banking regulation in India
- Regulatory interpretation
- Task type requiring models to understand and explain the meaning and implications of specific regulatory text passages
- SEBI
- Securities and Exchange Board of India, the regulatory authority for securities and commodity markets in India
- Temporal reasoning
- Task type requiring models to understand time-based relationships and chronological sequences within regulatory frameworks
- Zero-shot evaluation
- Testing methodology where models perform tasks without prior training or examples specific to those tasks