IndiaFinBench is the first publicly available evaluation benchmark designed to assess large language model performance on Indian financial regulatory text. It comprises 406 expert-annotated question-answer pairs drawn from SEBI and RBI documents, spanning four specialized task types.
| Attribute | Details |
|---|---|
| Released by | Not yet disclosed |
| Release date | Not yet disclosed |
| What it is | First LLM evaluation benchmark for Indian financial regulatory text |
| Who it is for | AI researchers and financial technology developers |
| Where to get it | GitHub repository at rajveerpall/IndiaFinBench |
| Price | Free |
- IndiaFinBench contains 406 expert-annotated question-answer pairs from 192 SEBI and RBI documents
- The benchmark covers four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
- Twelve models were evaluated with accuracy ranging from 70.4% to 89.7% under zero-shot conditions
- Numerical reasoning proved most discriminative with a 35.9 percentage-point performance spread across models
- All evaluated models substantially outperformed a non-specialist human baseline of 60.0%
- IndiaFinBench addresses a significant gap in financial NLP benchmarks by focusing on non-Western regulatory frameworks
- Annotation quality is high: a model-based secondary validation pass reached kappa = 0.918 on contradiction detection, and human inter-annotator agreement reached kappa = 0.611
- Gemini 2.5 Flash achieved the highest accuracy at 89.7%, while Gemma 4 E4B scored lowest at 70.4%
- Bootstrap significance testing revealed three statistically distinct performance tiers among the twelve evaluated models
- The complete dataset, evaluation code, and model outputs are publicly available for research use
What is IndiaFinBench
IndiaFinBench is the first publicly available evaluation benchmark specifically designed to assess large language model performance on Indian financial regulatory text. The benchmark addresses a critical gap in existing financial NLP evaluation tools, which have historically drawn exclusively from Western financial corpora including SEC filings, US earnings reports, and English-language financial news.
The benchmark contains 406 expert-annotated question-answer pairs sourced from 192 official documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI). These regulatory documents represent the primary financial oversight bodies in India and cover diverse aspects of financial regulation, compliance requirements, and policy frameworks.
What is new vs the previous version
IndiaFinBench represents an entirely new benchmark category, as no previous evaluation framework has focused on Indian financial regulatory text.
| Aspect | Previous Financial Benchmarks | IndiaFinBench |
|---|---|---|
| Geographic Focus | Exclusively Western financial corpora | Indian regulatory frameworks |
| Source Documents | SEC filings, US earnings reports | SEBI and RBI regulatory documents |
| Task Coverage | General financial understanding | Four specialized regulatory tasks |
| Annotation Quality | Variable validation methods | Model-based secondary pass with kappa=0.918 |
| Public Availability | Limited open access | Complete dataset and code on GitHub |
How does IndiaFinBench work
IndiaFinBench operates through a structured four-task evaluation framework designed to assess different aspects of regulatory text comprehension.
- Regulatory Interpretation (174 items): Models must demonstrate understanding of complex regulatory language and policy implications from SEBI and RBI documents.
- Numerical Reasoning (92 items): Tasks require mathematical computation and quantitative analysis of financial regulations and compliance requirements.
- Contradiction Detection (62 items): Models identify inconsistencies or conflicting statements within regulatory text passages.
- Temporal Reasoning (78 items): Evaluation focuses on understanding time-dependent regulatory changes and chronological policy relationships.
The benchmark employs zero-shot evaluation conditions, meaning models receive no task-specific training examples before assessment. Annotation quality validation includes both model-based secondary passes and human inter-annotator agreement evaluation across 60 items.
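The zero-shot protocol is straightforward to reproduce. Below is a minimal sketch of such an evaluation loop, assuming a JSON file of task/question/answer records and a placeholder `query_model` function; the repository's actual data schema and evaluation harness may differ.

```python
import json
from collections import defaultdict

def query_model(prompt: str) -> str:
    """Placeholder for a call to the LLM under evaluation."""
    raise NotImplementedError

def evaluate(path: str = "indiafinbench.json") -> dict:
    # Assumed layout: a list of {"task": ..., "question": ..., "answer": ...}.
    with open(path, encoding="utf-8") as f:
        items = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        # Zero-shot: the question is sent as-is, with no in-context examples.
        prediction = query_model(item["question"])
        total[item["task"]] += 1
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct[item["task"]] += 1

    # Per-task accuracy across the four categories.
    return {task: correct[task] / total[task] for task in total}
```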
Benchmarks and evidence
IndiaFinBench evaluation results demonstrate significant performance variation across twelve tested models under zero-shot conditions.
| Model | Overall Accuracy | Performance Tier | Source |
|---|---|---|---|
| Gemini 2.5 Flash | 89.7% | Tier 1 | IndiaFinBench paper |
| Gemma 4 E4B | 70.4% | Tier 3 | IndiaFinBench paper |
| Non-specialist human | 60.0% | Baseline | IndiaFinBench paper |

Bootstrap significance testing with 10,000 resamples revealed three statistically distinct performance tiers among the twelve evaluated models. Numerical reasoning was the most discriminative task category, with a 35.9 percentage-point spread between the top and bottom performers.
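To reproduce this kind of tier analysis on your own results, a paired bootstrap over per-item correctness is enough. The sketch below uses synthetic correctness vectors; only the 10,000-resample count and the 406-item size follow the paper.

```python
import numpy as np

def bootstrap_p(scores_a: np.ndarray, scores_b: np.ndarray,
                n_resamples: int = 10_000, seed: int = 0) -> float:
    """Fraction of resamples in which model A fails to beat model B.

    A small value (e.g. < 0.05) indicates the accuracy gap is statistically
    significant, placing the two models in different performance tiers.
    """
    rng = np.random.default_rng(seed)
    n = len(scores_a)
    # Resample item indices with replacement, keeping the pairing intact.
    idx = rng.integers(0, n, size=(n_resamples, n))
    gaps = scores_a[idx].mean(axis=1) - scores_b[idx].mean(axis=1)
    return float((gaps <= 0).mean())

# Synthetic per-item correctness (0/1) for two models over 406 items.
rng = np.random.default_rng(1)
model_a = (rng.random(406) < 0.897).astype(float)
model_b = (rng.random(406) < 0.704).astype(float)
print(bootstrap_p(model_a, model_b))  # near 0.0: clearly distinct tiers
```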
Who should care
Builders
AI developers working on financial applications can use IndiaFinBench to evaluate model performance on regulatory compliance tasks. The benchmark provides standardized metrics for assessing LLM capabilities in non-Western financial contexts, enabling more robust model selection and fine-tuning strategies.
Enterprise
Financial institutions operating in India can leverage IndiaFinBench to assess AI systems for regulatory compliance automation. Banks, investment firms, and fintech companies can evaluate whether their LLM implementations meet accuracy requirements for processing SEBI and RBI documentation.
End users
Researchers and academics studying financial NLP can access a comprehensive evaluation framework for Indian regulatory text. The benchmark enables comparative analysis of model performance across different regulatory interpretation tasks.
Investors
Investment firms can use IndiaFinBench results to evaluate AI-powered compliance and regulatory analysis tools. The benchmark provides objective performance metrics for assessing fintech solutions targeting Indian financial markets.
How to use IndiaFinBench today
IndiaFinBench is immediately accessible through its GitHub repository for research and evaluation purposes.
- Access the repository: Navigate to https://github.com/rajveerpall/IndiaFinBench to download the complete dataset and evaluation code.
- Load the benchmark data: The repository contains 406 question-answer pairs organized by task type with corresponding SEBI and RBI source documents.
- Implement evaluation protocol: Use the provided evaluation code to assess your model under zero-shot conditions across all four task categories.
- Compare results: Benchmark your model's performance against the twelve baseline models, whose accuracy scores range from 70.4% to 89.7% (see the comparison sketch after this list).
- Analyze task-specific performance: Focus on numerical reasoning tasks for the most discriminative evaluation of model capabilities.
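As a concrete version of the comparison step, the hypothetical snippet below weights per-task scores by item count to produce an overall accuracy and checks it against the published range. The `my_scores` values are invented; the item counts and the 70.4%-89.7% range come from the paper.

```python
# Item counts per task, as reported for IndiaFinBench (174+92+62+78 = 406).
ITEM_COUNTS = {
    "regulatory_interpretation": 174,
    "numerical_reasoning": 92,
    "contradiction_detection": 62,
    "temporal_reasoning": 78,
}

# Hypothetical per-task accuracies for the model being benchmarked.
my_scores = {
    "regulatory_interpretation": 0.81,
    "numerical_reasoning": 0.66,
    "contradiction_detection": 0.85,
    "temporal_reasoning": 0.78,
}

# Weight each task by its item count to get an overall accuracy.
overall = (sum(my_scores[t] * ITEM_COUNTS[t] for t in ITEM_COUNTS)
           / sum(ITEM_COUNTS.values()))
low, high = 0.704, 0.897  # published lowest and highest overall accuracy
print(f"Overall accuracy: {overall:.1%} (baselines span {low:.1%} to {high:.1%})")
```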
IndiaFinBench vs competitors
IndiaFinBench occupies a unique position in the financial NLP benchmark landscape by focusing specifically on Indian regulatory frameworks.
| Benchmark | Geographic Focus | Task Types | Document Sources | Question Count |
|---|---|---|---|---|
| IndiaFinBench | Indian regulatory frameworks | 4 specialized tasks | SEBI and RBI documents | 406 questions |
| FinanceBench | Western financial markets | General financial QA | SEC filings, earnings reports | Not yet disclosed |
| LawBench | General legal domains | Legal reasoning tasks | Various legal documents | Not yet disclosed |
Risks, limits, and myths
- Geographic limitation: The benchmark focuses exclusively on Indian regulatory frameworks and may not generalize to other financial jurisdictions.
- Language constraint: All evaluation materials are in English, potentially missing regional language regulatory documents used in Indian financial contexts.
- Temporal scope: The benchmark reflects regulatory frameworks as of the document collection date and may not capture recent policy changes.
- Task coverage: Four task types may not encompass all aspects of regulatory text comprehension required in real-world financial applications.
- Human baseline limitation: The 60.0% non-specialist human baseline may not represent expert-level human performance on these tasks.
- Model selection bias: Evaluation limited to twelve models may not represent the full spectrum of available LLM capabilities.
FAQ
What makes IndiaFinBench different from other financial AI benchmarks?
IndiaFinBench is the first evaluation benchmark specifically designed for Indian financial regulatory text, addressing a gap in existing benchmarks that focus exclusively on Western financial corpora like SEC filings and US earnings reports.
How many questions does IndiaFinBench contain?
IndiaFinBench contains 406 expert-annotated question-answer pairs sourced from 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).
What types of tasks does IndiaFinBench evaluate?
The benchmark evaluates four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items).
Which AI model performed best on IndiaFinBench?
Gemini 2.5 Flash achieved the highest accuracy at 89.7%, while Gemma 4 E4B scored lowest at 70.4% among the twelve evaluated models under zero-shot conditions.
How reliable is the annotation quality in IndiaFinBench?
Annotation quality is validated through model-based secondary passes achieving kappa=0.918 on contradiction detection and human inter-annotator agreement evaluation with kappa=0.611 and 76.7% overall agreement.
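For intuition about what these kappa values measure, here is a minimal Cohen's kappa computation, which corrects raw agreement for chance; the two label vectors are illustrative, not the benchmark's annotation data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum((counts_a[k] / n) * (counts_b[k] / n)
                   for k in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

# 60 items mirrors the paper's human-agreement subset; labels are made up.
annotator_a = [1] * 40 + [0] * 20
annotator_b = [1] * 36 + [0] * 4 + [0] * 16 + [1] * 4
print(round(cohens_kappa(annotator_a, annotator_b), 2))  # 0.7
```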
Can I access IndiaFinBench for free?
Yes, the complete dataset, evaluation code, and all model outputs are freely available at the GitHub repository https://github.com/rajveerpall/IndiaFinBench.
What is the most challenging task type in IndiaFinBench?
Numerical reasoning showed the widest performance spread, 35.9 percentage points across the twelve models, making it the most discriminative task category; that wide gap also suggests it is the hardest task for weaker models.
How does human performance compare to AI models on IndiaFinBench?
All twelve evaluated AI models substantially outperformed the non-specialist human baseline of 60.0%, with the lowest-performing model (Gemma 4 E4B) achieving 70.4% accuracy.
What statistical methods validate IndiaFinBench results?
Bootstrap significance testing with 10,000 resamples revealed three statistically distinct performance tiers among the evaluated models, ensuring robust statistical validation of results.
Which regulatory bodies provide source documents for IndiaFinBench?
IndiaFinBench draws from documents issued by the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), representing the primary financial regulatory authorities in India.
Glossary
- SEBI: Securities and Exchange Board of India, the primary regulatory authority for securities markets in India
- RBI: Reserve Bank of India, the central banking institution and monetary authority of India
- Zero-shot evaluation: Testing AI models on tasks without providing task-specific training examples or fine-tuning
- Kappa score: Statistical measure of inter-rater agreement that accounts for agreement occurring by chance
- Bootstrap significance testing: Statistical method using resampling to determine whether observed differences between groups are statistically significant
- Numerical reasoning: AI task requiring mathematical computation and quantitative analysis of text-based problems
- Contradiction detection: NLP task involving identification of inconsistent or conflicting statements within text passages
- Temporal reasoning: AI capability to understand time-dependent relationships and chronological sequences in text
- A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains. Advances in large language models (LLMs) have led to strong performance in reasoning and planning. https://arxiv.org/html/2508.15832