Frontier Signal

IndiaFinBench: First LLM Benchmark for Indian Financial Rules

IndiaFinBench evaluates large language models on Indian financial regulatory text with 406 expert-annotated questions from SEBI and RBI documents.

IndiaFinBench is the first publicly available evaluation benchmark for assessing large language model performance on Indian financial regulatory text, featuring 406 expert-annotated question-answer pairs from SEBI and RBI documents across four task types.

Released by: Not yet disclosed
Release date: Not yet disclosed
What it is: LLM evaluation benchmark for Indian financial regulatory text
Who it is for: AI researchers and financial technology developers
Where to get it: https://github.com/rajveerpall/IndiaFinBench
Price: Free
  • IndiaFinBench contains 406 expert-annotated question-answer pairs drawn from 192 SEBI and RBI documents
  • Four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
  • Twelve models were tested, with zero-shot accuracy ranging from 70.4% to 89.7%
  • Numerical reasoning proved most discriminative, with a 35.9 percentage-point spread across models
  • The dataset fills a gap in LLM evaluation for non-Western financial regulatory frameworks
  • Annotation quality was validated through a model-based secondary pass (kappa = 0.918 on contradiction detection)
  • All twelve tested models substantially outperformed the non-specialist human baseline of 60.0% accuracy
  • Bootstrap significance testing revealed three statistically distinct performance tiers across models
  • The complete dataset, evaluation code, and model outputs are available for reproducible research

What is IndiaFinBench

IndiaFinBench is an evaluation benchmark specifically designed to assess large language model performance on Indian financial regulatory text. The benchmark addresses a significant gap in existing financial NLP evaluation tools, which draw exclusively from Western financial corpora like SEC filings and US earnings reports.

The benchmark contains 406 expert-annotated question-answer pairs sourced from 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI). These documents represent the core regulatory framework governing India’s financial sector.

IndiaFinBench evaluates models across four distinct task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Each task type tests different aspects of language model comprehension and reasoning capabilities within the Indian financial regulatory context.

What is new vs the previous version

As the first benchmark of its kind, IndiaFinBench has no previous version; the comparison below is against existing financial benchmarks in general.

| Aspect | Previous Financial Benchmarks | IndiaFinBench |
| --- | --- | --- |
| Geographic focus | Western financial markets only | Indian regulatory framework |
| Document sources | SEC filings, US earnings reports | SEBI and RBI regulatory documents |
| Language context | English-language financial news | Indian financial regulatory text |
| Task diversity | Limited task types | Four specialized task categories |
| Annotation quality | Varies by benchmark | Model-validated (kappa = 0.918) |

How does IndiaFinBench work

IndiaFinBench operates through a structured evaluation framework that tests language models across four specialized financial regulatory tasks.

  1. Document Collection: Researchers gathered 192 regulatory documents from SEBI and RBI covering various aspects of Indian financial regulation
  2. Question Generation: Expert annotators created 406 question-answer pairs across four task categories based on document content
  3. Quality Validation: Annotations were validated through a model-based secondary pass and a human inter-annotator agreement study
  4. Model Testing: Twelve language models were evaluated under zero-shot conditions, without task-specific training or in-context examples
  5. Statistical Analysis: Bootstrap significance testing with 10,000 resamples identified statistically distinct performance tiers
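The evaluation loop in steps 2–4 can be sketched as follows. This is a minimal illustration, not the repository's actual code: the item fields ("task", "question", "answer"), the exact-match scoring rule, and the `model_fn` callable are all assumptions.

```python
import json
from collections import defaultdict

def load_items(path):
    """Load QA items; a flat JSON list of dicts is an assumed layout."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def evaluate(items, model_fn):
    """Score a model on IndiaFinBench-style items, grouped by task type.

    `model_fn` maps a question string to an answer string; zero-shot means
    the prompt carries no worked examples for the task.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        prediction = model_fn(item["question"])
        total[item["task"]] += 1
        # exact-match scoring after light normalization (an assumption;
        # the repository may use a different matching rule)
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct[item["task"]] += 1
    return {task: correct[task] / total[task] for task in total}
```

Per-task accuracies computed this way make it easy to reproduce breakdowns such as the numerical-reasoning spread reported below.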

Benchmarks and evidence

IndiaFinBench evaluation results demonstrate significant performance variation across different language models and task types.

| Metric | Result | Source |
| --- | --- | --- |
| Highest accuracy achieved | 89.7% (Gemini 2.5 Flash) | IndiaFinBench evaluation |
| Lowest accuracy achieved | 70.4% (Gemma 4 E4B) | IndiaFinBench evaluation |
| Human baseline accuracy | 60.0% (non-specialist) | IndiaFinBench evaluation |
| Most discriminative task spread | 35.9 percentage points (numerical reasoning) | IndiaFinBench evaluation |
| Inter-annotator agreement | kappa = 0.611 (76.7% overall agreement) | 60-item human evaluation |
| Contradiction detection validation | kappa = 0.918 | Model-based secondary pass |
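The three statistically distinct performance tiers come from bootstrap significance testing with 10,000 resamples. A minimal sketch of one paired bootstrap comparison over per-item correctness vectors, with illustrative function and variable names (the paper's exact procedure may differ):

```python
import random

def bootstrap_pvalue(scores_a, scores_b, resamples=10_000, seed=0):
    """Estimate how often model A fails to beat model B when the
    per-item correctness vectors (0/1) are resampled with replacement.
    A small value means the accuracy gap is unlikely to be chance."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(resamples):
        # resample the same item indices for both models (paired bootstrap)
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a <= mean_b:
            not_better += 1
    return not_better / resamples
```

Running such pairwise comparisons across all twelve models, and grouping models whose gaps are not significant, yields tiered rankings of the kind the paper reports.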

Who should care

Builders

AI developers building financial applications for Indian markets need IndiaFinBench to evaluate model performance on regulatory compliance tasks. The benchmark provides standardized testing for models handling SEBI and RBI documentation, ensuring applications meet regulatory interpretation requirements.

Enterprise

Financial institutions operating in India require accurate LLM evaluation for regulatory compliance systems. IndiaFinBench enables enterprises to assess model capabilities for processing Indian financial regulations, supporting automated compliance monitoring and regulatory document analysis.

End users

Financial technology users benefit from applications tested against IndiaFinBench standards, ensuring more accurate regulatory guidance and compliance assistance. The benchmark validates that AI-powered financial tools understand Indian regulatory nuances correctly.

Investors

Investment firms focusing on Indian fintech companies can use IndiaFinBench results to evaluate the technical capabilities of AI-powered financial services. The benchmark provides objective performance metrics for assessing regulatory compliance technology investments.

How to use IndiaFinBench today

IndiaFinBench provides immediate access through its GitHub repository for researchers and developers.

  1. Access the repository: Visit https://github.com/rajveerpall/IndiaFinBench to download the complete dataset
  2. Install evaluation framework: Clone the repository and install required dependencies listed in requirements.txt
  3. Load your model: Configure your language model to work with the provided evaluation scripts
  4. Run evaluation: Execute the benchmark using the provided evaluation code across all four task types
  5. Analyze results: Compare your model’s performance against the published baseline results and statistical significance tests
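For step 5, the published reference points give a quick sanity check on your own overall accuracy. The helper below is purely illustrative; `position` and the `REFERENCE` dictionary are not part of the repository.

```python
# Published reference points from the IndiaFinBench evaluation.
REFERENCE = {
    "best_model": 0.897,     # Gemini 2.5 Flash, highest tested accuracy
    "worst_model": 0.704,    # lowest-scoring tested model
    "human_baseline": 0.600, # non-specialist human annotators
}

def position(accuracy: float) -> str:
    """Place an overall accuracy relative to the published baselines."""
    if accuracy >= REFERENCE["best_model"]:
        return "at or above the best reported model"
    if accuracy >= REFERENCE["worst_model"]:
        return "within the reported model range"
    if accuracy >= REFERENCE["human_baseline"]:
        return "above the human baseline but below all tested models"
    return "below the non-specialist human baseline"
```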

IndiaFinBench vs competitors

IndiaFinBench stands alone as the first benchmark specifically designed for Indian financial regulatory text evaluation.

| Benchmark | Geographic Focus | Document Sources | Task Types | Question Count |
| --- | --- | --- | --- | --- |
| IndiaFinBench | Indian regulations | SEBI, RBI documents | 4 specialized tasks | 406 |
| FinanceBench | US markets | SEC filings, earnings | General financial QA | Not yet disclosed |
| LawBench | General legal | Various legal texts | Legal reasoning | Not yet disclosed |
| LexEval | Legal domains | Legal documents | Legal evaluation | Not yet disclosed |

Risks, limits, and myths

  • Limited scope: IndiaFinBench focuses exclusively on SEBI and RBI documents, potentially missing other Indian financial regulatory bodies
  • Zero-shot evaluation only: Current testing excludes few-shot or fine-tuned model performance assessment
  • Language limitation: Benchmark covers English-language regulatory text, excluding regional language financial documents
  • Temporal constraints: Regulatory documents have specific time periods, potentially limiting applicability to future regulatory changes
  • Expert annotation bias: Human annotators may introduce subjective interpretations despite validation measures
  • Model selection bias: Twelve tested models may not represent complete landscape of available language models

FAQ

What makes IndiaFinBench different from other financial AI benchmarks?
IndiaFinBench specifically evaluates language models on Indian financial regulatory text from SEBI and RBI, addressing the gap left by Western-focused financial benchmarks that use SEC filings and US earnings reports.
How many questions does IndiaFinBench contain for model evaluation?
IndiaFinBench contains 406 expert-annotated question-answer pairs distributed across four task types: regulatory interpretation (174), numerical reasoning (92), contradiction detection (62), and temporal reasoning (78).
Which language models performed best on IndiaFinBench testing?
Gemini 2.5 Flash achieved the highest accuracy at 89.7%, while Gemma 4 E4B scored lowest at 70.4%. All twelve tested models outperformed the 60.0% human baseline.
Where can researchers access IndiaFinBench dataset and evaluation tools?
The complete IndiaFinBench dataset, evaluation code, and all model outputs are freely available at https://github.com/rajveerpall/IndiaFinBench for reproducible research.
What validation methods ensure IndiaFinBench annotation quality?
Annotation quality underwent validation through model-based secondary pass achieving kappa=0.918 on contradiction detection and 60-item human inter-annotator agreement evaluation with kappa=0.611.
Which task type shows the largest performance differences between models?
Numerical reasoning proved most discriminative with a 35.9 percentage-point spread across models, indicating significant variation in mathematical reasoning capabilities within financial contexts.
How does IndiaFinBench handle statistical significance in model comparisons?
Bootstrap significance testing with 10,000 resamples revealed three statistically distinct performance tiers, ensuring reliable model performance comparisons beyond simple accuracy scores.
What regulatory documents form the foundation of IndiaFinBench questions?
IndiaFinBench draws from 192 regulatory documents sourced from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).
Can IndiaFinBench evaluate models trained specifically on financial data?
Current IndiaFinBench evaluation focuses on zero-shot conditions without task-specific training, though the framework could potentially accommodate fine-tuned model assessment.
What languages does IndiaFinBench support for regulatory text evaluation?
IndiaFinBench currently evaluates English-language Indian financial regulatory text, with no disclosed plans for regional language document inclusion.

Glossary

Bootstrap significance testing
Statistical method using repeated random sampling to determine if performance differences between models are statistically meaningful rather than due to chance
Contradiction detection
Task type requiring models to identify conflicting information within regulatory documents or between different regulatory statements
Inter-annotator agreement
Measure of consistency between different human annotators when labeling the same data, typically expressed as kappa coefficient
Kappa coefficient
Statistical measure of inter-rater reliability accounting for agreement occurring by chance, with values closer to 1.0 indicating higher agreement
Numerical reasoning
Task type requiring models to perform mathematical calculations and quantitative analysis within financial regulatory contexts
RBI
Reserve Bank of India, the central banking institution responsible for monetary policy and banking regulation in India
Regulatory interpretation
Task type requiring models to understand and explain the meaning and implications of specific regulatory text passages
SEBI
Securities and Exchange Board of India, the regulatory authority for securities and commodity markets in India
Temporal reasoning
Task type requiring models to understand time-based relationships and chronological sequences within regulatory frameworks
Zero-shot evaluation
Testing methodology where models perform tasks without prior training or examples specific to those tasks
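To make the kappa values cited in this article concrete, here is a minimal computation of Cohen's kappa for two annotators; the example labels in the usage test are invented, not drawn from the dataset.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Observed agreement between two annotators, corrected for the
    agreement expected by chance from each annotator's label frequencies."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # chance agreement: probability both annotators pick the same label
    # if each labels independently at their own marginal rates
    expected = sum(freq_a[lbl] * freq_b[lbl] for lbl in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: everyone always uses one label
    return (observed - expected) / (1 - expected)
```

Values near 1.0, such as the 0.918 reported for contradiction detection, indicate agreement far above what chance alone would produce.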

Download IndiaFinBench from https://github.com/rajveerpall/IndiaFinBench to evaluate your language model’s performance on Indian financial regulatory text.


Author

  • siego237

    Writes for FrontierWisdom on AI systems, automation, decentralized identity, and frontier infrastructure, with a focus on turning emerging technology into practical playbooks, implementation roadmaps, and monetization strategies for operators, builders, and consultants.

