IndiaFinBench is the first publicly available evaluation benchmark for assessing large language model performance on Indian financial regulatory text. The benchmark contains 406 expert-annotated question-answer pairs from SEBI and RBI documents across four task types.
| Released by | Not yet disclosed |
|---|---|
| Release date | |
| What it is | Evaluation benchmark for LLM performance on Indian financial regulatory text |
| Who it’s for | AI researchers and financial technology developers |
| Where to get it | https://github.com/rajveerpall/IndiaFinBench |
| Price | Free (open source) |
- IndiaFinBench addresses gaps in existing financial NLP benchmarks that focus exclusively on Western financial corpora
- The benchmark includes 406 question-answer pairs from 192 SEBI and RBI documents across four task types
- Twelve models were evaluated with accuracy ranging from 70.4% to 89.7% under zero-shot conditions
- Numerical reasoning proved most discriminative with a 35.9 percentage-point performance spread across models
- All evaluated models substantially outperformed a non-specialist human baseline of 60.0%
- IndiaFinBench fills a critical gap by providing the first evaluation benchmark specifically for Indian financial regulatory text
- The benchmark demonstrates significant performance variation across models, with numerical reasoning being the most challenging task
- High annotation quality is validated through model-based secondary passes and human inter-annotator agreement evaluation
- Bootstrap significance testing reveals three statistically distinct performance tiers among evaluated models
What is IndiaFinBench
IndiaFinBench is an evaluation benchmark designed to assess large language model performance on Indian financial regulatory text. The benchmark addresses a significant gap in existing financial NLP evaluation tools that draw exclusively from Western financial corpora such as SEC filings and US earnings reports.
The benchmark contains 406 expert-annotated question-answer pairs sourced from 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI). These documents represent the primary regulatory frameworks governing India’s financial sector.
What is new vs previous benchmarks
IndiaFinBench introduces the first publicly available benchmark specifically for non-Western financial regulatory frameworks.
| Feature | Previous Financial Benchmarks | IndiaFinBench |
|---|---|---|
| Geographic focus | Western financial corpora only | Indian regulatory framework |
| Document sources | SEC filings, US earnings reports | SEBI and RBI regulatory documents |
| Task diversity | Limited task types | Four distinct task types |
| Annotation validation | Varies by benchmark | Model-based secondary pass + human agreement |
| Statistical rigor | Basic accuracy reporting | Bootstrap significance testing with 10,000 resamples |
How does IndiaFinBench work
IndiaFinBench evaluates models across four distinct task types that capture different aspects of financial regulatory comprehension.
- Regulatory interpretation: 174 items testing understanding of regulatory requirements and compliance frameworks
- Numerical reasoning: 92 items requiring mathematical calculations and quantitative analysis of financial data
- Contradiction detection: 62 items identifying inconsistencies within regulatory text
- Temporal reasoning: 78 items involving time-based regulatory requirements and deadlines
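The task mix above can be tallied in a few lines to confirm the per-task counts sum to the reported 406 items and to see each task's share of the benchmark (a quick sanity check, not code from the IndiaFinBench repository):

```python
# Per-task item counts as reported for IndiaFinBench.
task_counts = {
    "regulatory_interpretation": 174,
    "numerical_reasoning": 92,
    "contradiction_detection": 62,
    "temporal_reasoning": 78,
}

total = sum(task_counts.values())
print(total)  # 406, matching the reported benchmark size

# Share of the benchmark each task represents.
for task, n in task_counts.items():
    print(f"{task}: {n / total:.1%}")
```

Regulatory interpretation dominates at roughly 43% of items, which means overall accuracy is weighted heavily toward that task.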
Annotation quality is validated through a model-based secondary pass, which achieved kappa = 0.918 on contradiction detection tasks. A separate 60-item human inter-annotator agreement evaluation yielded kappa = 0.611 with 76.7% overall agreement.
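Cohen's kappa, the agreement statistic cited above, measures how often two raters agree beyond what chance alone would produce. A minimal implementation (illustrative only, not the paper's validation code) looks like this:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement between two raters,
    corrected for the agreement expected by chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labelled the same.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled independently
    # at their own marginal rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)
```

By convention, kappa above 0.8 (the 0.918 secondary pass) indicates near-perfect agreement, while 0.61 to 0.80 (the 0.611 human evaluation) indicates substantial agreement.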
Benchmarks and evidence
Twelve models were evaluated under zero-shot conditions, revealing significant performance variations across different model architectures.
| Model | Overall Accuracy | Performance Tier | Source |
|---|---|---|---|
| Gemini 2.5 Flash | 89.7% | Tier 1 | [1] |
| Gemma 4 E4B | 70.4% | Tier 3 | [1] |
| Human baseline (non-specialist) | 60.0% | Below all models | [1] |
Bootstrap significance testing with 10,000 resamples reveals three statistically distinct performance tiers among the evaluated models. Numerical reasoning emerged as the most discriminative task type with the largest performance spread.
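A paired bootstrap over per-item correctness is the standard way to test whether one model's accuracy advantage is statistically meaningful. The sketch below shows the general technique under the same 10,000-resample setting; it is a generic illustration, not the repository's exact procedure:

```python
import random

def bootstrap_diff_pvalue(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Paired bootstrap test: resample benchmark items with replacement
    and estimate how often model A's accuracy lead over model B
    disappears (diff <= 0). correct_a/correct_b are parallel 0/1 lists,
    one entry per benchmark item."""
    rng = random.Random(seed)
    n = len(correct_a)
    leads_lost = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in idx) / n
        if diff <= 0:
            leads_lost += 1
    return leads_lost / n_resamples
```

Models whose pairwise p-values fall below the significance threshold land in distinct tiers; clustering those pairwise results is what yields the three tiers reported above.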
Who should care
Builders
AI developers building financial applications for Indian markets can use IndiaFinBench to evaluate model performance on regulatory compliance tasks. The benchmark provides standardized evaluation metrics for models processing SEBI and RBI documentation.
Enterprise
Financial institutions operating in India require models that accurately interpret regulatory requirements for compliance automation. IndiaFinBench enables systematic evaluation of LLM capabilities for regulatory text processing and risk assessment.
End users
Financial advisors and compliance professionals can use IndiaFinBench results to understand model limitations when processing Indian regulatory documents. The benchmark helps identify which models perform best for specific regulatory interpretation tasks.
Investors
Venture capital and fintech investors can use IndiaFinBench performance data to evaluate AI startups targeting Indian financial markets. The benchmark provides objective metrics for assessing regulatory AI capabilities.
How to use IndiaFinBench today
IndiaFinBench is immediately available through its open-source repository with complete evaluation code and model outputs.
- Clone the repository: `git clone https://github.com/rajveerpall/IndiaFinBench`
- Install dependencies from the requirements file
- Load your model using the provided evaluation framework
- Run evaluation across the four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
- Compare results against the baseline performance metrics for twelve evaluated models
- Use bootstrap significance testing code to determine statistical significance of performance differences
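The evaluation loop behind steps like these can be sketched as a small scoring harness. The repository's actual API and data schema are not documented in the sources, so the field names (`task`, `question`, `answer`) and the `predict` callback below are hypothetical placeholders:

```python
from collections import defaultdict

def score_by_task(items, predict):
    """Hypothetical scoring loop: `items` are dicts with 'task',
    'question', and gold 'answer' fields; `predict` maps a question
    string to the model's answer. Returns per-task accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["task"]] += 1
        # Exact-match scoring after whitespace normalization;
        # real harnesses often add task-specific answer parsing.
        if predict(item["question"]).strip() == item["answer"].strip():
            correct[item["task"]] += 1
    return {t: correct[t] / total[t] for t in total}
```

Reporting accuracy per task rather than in aggregate is what surfaces findings like the wide numerical-reasoning spread.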
IndiaFinBench vs competitors
IndiaFinBench addresses geographic and regulatory gaps not covered by existing financial benchmarks.
| Benchmark | Geographic Focus | Document Types | Task Variety | Validation Method |
|---|---|---|---|---|
| IndiaFinBench | Indian regulatory framework | SEBI and RBI documents | 4 task types | Model + human validation |
| FinanceBench | US financial markets | SEC filings, earnings reports | Question answering | Not yet disclosed |
| LawBench | General legal text | Legal documents | Multiple legal tasks | Expert annotation |
Risks, limits, and myths
- Limited scope: Benchmark focuses only on SEBI and RBI documents, not covering other Indian financial regulators
- Zero-shot evaluation: Models were not fine-tuned on Indian regulatory text, potentially underestimating specialized model performance
- Language limitation: Evaluation conducted in English only, not covering regional language regulatory documents
- Temporal coverage: Benchmark reflects regulatory frameworks as of the dataset creation date, not future regulatory changes
- Human baseline: Non-specialist human performance may not reflect expert regulatory professional capabilities
- Model selection: Twelve evaluated models may not represent the full spectrum of available LLM architectures
FAQ
What makes IndiaFinBench different from other financial AI benchmarks?
IndiaFinBench is the first publicly available benchmark specifically designed for Indian financial regulatory text, addressing gaps in existing benchmarks that focus exclusively on Western financial corpora.
How many question-answer pairs does IndiaFinBench contain?
IndiaFinBench contains 406 expert-annotated question-answer pairs sourced from 192 documents from SEBI and RBI.
What are the four task types in IndiaFinBench evaluation?
The four task types are regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items).
Which model performed best on IndiaFinBench?
Gemini 2.5 Flash achieved the highest accuracy at 89.7% under zero-shot evaluation conditions.
How was annotation quality validated in IndiaFinBench?
Annotation quality was validated through a model-based secondary pass (kappa=0.918 on contradiction detection) and a 60-item human inter-annotator agreement evaluation (kappa=0.611).
What was the human baseline performance on IndiaFinBench?
The non-specialist human baseline achieved 60.0% accuracy, which all twelve evaluated models substantially outperformed.
Which task type showed the largest performance variation across models?
Numerical reasoning was the most discriminative task, showing a 35.9 percentage-point spread across different models.
Is IndiaFinBench available for commercial use?
IndiaFinBench is available as open source through GitHub, though specific licensing terms are not yet disclosed in the provided sources.
How many statistical performance tiers were identified in the evaluation?
Bootstrap significance testing with 10,000 resamples revealed three statistically distinct performance tiers among the evaluated models.
What regulatory bodies are covered in IndiaFinBench documents?
IndiaFinBench draws from documents issued by the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).
Glossary
- SEBI
- Securities and Exchange Board of India, the regulatory authority for securities markets in India
- RBI
- Reserve Bank of India, the central banking institution and monetary authority of India
- Zero-shot evaluation
- Testing model performance on a task without task-specific fine-tuning or in-context examples
- Bootstrap significance testing
- Statistical method using resampling to determine if performance differences between models are statistically significant
- Kappa coefficient
- Statistical measure of inter-rater agreement that accounts for agreement occurring by chance
- Contradiction detection
- Task type requiring identification of inconsistent or conflicting statements within regulatory text
- Temporal reasoning
- Task type involving time-based regulatory requirements, deadlines, and sequences of events
Sources
- IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text. arXiv:2604.19298v1.