
IndiaFinBench: First LLM Benchmark for Indian Financial Regulation

IndiaFinBench introduces 406 expert-annotated question-answer pairs from SEBI and RBI documents to evaluate large language model performance on Indian financial regulatory text.


IndiaFinBench is the first publicly available benchmark for evaluating large language model performance on Indian financial regulatory text, featuring 406 expert-annotated question-answer pairs from SEBI and RBI documents across four task types.

Released by: Not yet disclosed
Release date: Not stated
What it is: First public benchmark for evaluating LLM performance on Indian financial regulatory text
Who it is for: AI researchers and financial technology developers
Where to get it: https://github.com/rajveerpall/IndiaFinBench
Price: Free
  • IndiaFinBench contains 406 expert-annotated question-answer pairs from 192 SEBI and RBI documents
  • The benchmark covers four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
  • Twelve models were evaluated with accuracy ranging from 70.4% to 89.7% under zero-shot conditions
  • Numerical reasoning proved most discriminative with a 35.9 percentage-point performance spread across models
  • All models substantially outperformed a non-specialist human baseline of 60.0% accuracy
  • IndiaFinBench addresses a significant gap in LLM evaluation by focusing on non-Western financial regulatory frameworks
  • The benchmark demonstrates three statistically distinct performance tiers among evaluated models through bootstrap significance testing
  • Annotation quality validation achieved kappa=0.918 on contradiction detection (model-based) and kappa=0.611 for human inter-annotator agreement
  • The complete dataset, evaluation code, and model outputs are publicly available for research use

What is IndiaFinBench

IndiaFinBench is an evaluation benchmark specifically designed to assess large language model performance on Indian financial regulatory text. Benchmark evaluations for LLMs attempt to measure model reasoning, factual accuracy, alignment, and safety [1]. The benchmark contains 406 expert-annotated question-answer pairs sourced from 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).

The benchmark addresses a critical gap in existing financial NLP evaluation tools, which draw exclusively from Western financial corpora including SEC filings, US earnings reports, and English-language financial news. IndiaFinBench provides the first comprehensive evaluation framework for models working with Indian regulatory frameworks.

What is new vs previous benchmarks

IndiaFinBench introduces several novel elements compared to existing financial benchmarks:

| Feature | IndiaFinBench | Existing financial benchmarks |
|---|---|---|
| Geographic focus | Indian regulatory framework (SEBI/RBI) | Western markets (SEC, US earnings) |
| Document sources | 192 Indian regulatory documents | US financial filings and news |
| Task diversity | 4 specialized task types | General financial QA |
| Annotation validation | Model-based + human inter-annotator (kappa=0.611) | Varies by benchmark |
| Performance tiers | 3 statistically distinct tiers via bootstrap testing | Not systematically established |

How does IndiaFinBench work

IndiaFinBench operates through a structured four-task evaluation framework:

  1. Regulatory interpretation tasks assess model understanding of Indian financial regulations with 174 question-answer pairs
  2. Numerical reasoning tasks evaluate quantitative analysis capabilities using 92 items focused on financial calculations
  3. Contradiction detection tasks test logical consistency identification across 62 regulatory statement pairs
  4. Temporal reasoning tasks measure understanding of time-dependent regulatory changes through 78 scenarios
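Because the four task types have different item counts, an overall score is naturally an item-weighted average. The sketch below uses the benchmark's published counts (174/92/62/78 = 406 items); the per-task accuracies are illustrative placeholders, not published results:

```python
# Item counts per task type, as reported for IndiaFinBench.
TASK_SIZES = {
    "regulatory_interpretation": 174,
    "numerical_reasoning": 92,
    "contradiction_detection": 62,
    "temporal_reasoning": 78,
}

def overall_accuracy(per_task_acc: dict) -> float:
    """Item-weighted average accuracy across the four task types."""
    total = sum(TASK_SIZES.values())  # 406
    return sum(per_task_acc[t] * n for t, n in TASK_SIZES.items()) / total

# Illustrative per-task accuracies (not real benchmark numbers).
example = {
    "regulatory_interpretation": 0.92,
    "numerical_reasoning": 0.75,
    "contradiction_detection": 0.88,
    "temporal_reasoning": 0.85,
}
print(round(overall_accuracy(example), 3))  # → 0.862
```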

The evaluation uses zero-shot conditions where models receive no task-specific training examples. Researchers and practitioners look at qualities like accuracy, efficiency, safety, fairness and robustness to determine how well a model performs [4]. Bootstrap significance testing with 10,000 resamples validates statistical differences between model performance levels.
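A minimal sketch of a paired bootstrap test of this kind, operating on per-item 0/1 correctness vectors for two models; the toy data and the exact resampling details are assumptions, not the paper's procedure:

```python
import random

def bootstrap_diff_ci(model_a, model_b, resamples=10_000, seed=0):
    """95% bootstrap CI for the accuracy difference (A - B) on paired items."""
    assert len(model_a) == len(model_b)
    rng = random.Random(seed)
    n = len(model_a)
    diffs = []
    for _ in range(resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        diffs.append(sum(model_a[i] - model_b[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * resamples)], diffs[int(0.975 * resamples)]

# Toy data: two models scored on the same 406 items.
a = [1] * 360 + [0] * 46   # ~88.7% accurate
b = [1] * 290 + [0] * 116  # ~71.4% accurate
lo, hi = bootstrap_diff_ci(a, b)
print(lo > 0)  # a CI excluding zero places the models in distinct tiers
```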

Benchmarks and evidence

IndiaFinBench evaluation results demonstrate clear performance hierarchies among twelve tested models:

| Model | Overall accuracy | Performance tier | Source |
|---|---|---|---|
| Gemini 2.5 Flash | 89.7% | Tier 1 | IndiaFinBench paper |
| Mid-range models | 75-85% | Tier 2 | IndiaFinBench paper |
| Gemma 4 E4B | 70.4% | Tier 3 | IndiaFinBench paper |
| Non-specialist human | 60.0% | Baseline | IndiaFinBench paper |

Numerical reasoning was the most discriminative task, with a 35.9 percentage-point spread across models (IndiaFinBench paper).

The annotation quality validation achieved kappa=0.918 on contradiction detection tasks, while human inter-annotator evaluation across 60 items reached kappa=0.611 with 76.7% overall agreement.
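Cohen's kappa, the agreement statistic quoted above, adjusts raw agreement for chance. A minimal implementation, with illustrative annotator labels rather than the benchmark's actual data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if each annotator labeled independently at their own rates.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative labels for a 60-item agreement study (not the benchmark's data).
a = ["yes"] * 40 + ["no"] * 20
b = ["yes"] * 35 + ["no"] * 5 + ["yes"] * 3 + ["no"] * 17
print(round(cohens_kappa(a, b), 3))  # → 0.707
```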

Who should care

Builders

AI developers building financial applications for Indian markets need IndiaFinBench to validate model performance on local regulatory requirements. The benchmark provides standardized evaluation metrics for SEBI and RBI compliance capabilities.

Enterprise

Financial institutions operating in India require models that understand local regulatory frameworks for compliance automation and risk assessment. IndiaFinBench enables objective comparison of model capabilities for regulatory interpretation tasks.

End users

Financial advisors and compliance professionals can use IndiaFinBench results to select appropriate AI tools for Indian regulatory analysis. The benchmark results indicate which models perform best on specific task types.

Investors

Investment firms focusing on Indian fintech can use IndiaFinBench performance data to evaluate AI-powered regulatory technology solutions. The benchmark provides objective metrics for due diligence on financial AI products.

How to use IndiaFinBench today

IndiaFinBench is immediately available for research and development use:

  1. Access the repository: Visit https://github.com/rajveerpall/IndiaFinBench to download the complete dataset
  2. Install evaluation framework: Clone the repository and install required dependencies for model evaluation
  3. Run baseline evaluation: Execute the provided evaluation scripts against your target language model
  4. Compare results: Use the included benchmark results to position your model performance relative to established baselines
  5. Analyze task-specific performance: Review detailed results across the four task types to identify model strengths and weaknesses

The repository includes the complete dataset, evaluation code, and all model outputs from the twelve evaluated systems for comprehensive analysis.

IndiaFinBench vs competitors

IndiaFinBench occupies a unique position among financial AI evaluation benchmarks:

| Benchmark | Geographic focus | Document count | Task types | Validation method |
|---|---|---|---|---|
| IndiaFinBench | Indian regulatory (SEBI/RBI) | 192 documents | 4 specialized tasks | Model + human validation |
| FinanceBench | US markets | Not yet disclosed | General financial QA | Not yet disclosed |
| LawBench | General legal | Not yet disclosed | Legal reasoning | Not yet disclosed |

The PoliLegalLM technical report, for example, evaluates its model on three representative benchmarks, LawBench, LexEval, and a real-world dataset, PoliLegal [6], illustrating the broader landscape of specialized evaluation frameworks for regulatory and legal domains.

Risks, limits, and myths

  • Limited scope: IndiaFinBench focuses exclusively on SEBI and RBI documents, not covering other Indian financial regulators
  • Zero-shot evaluation only: The benchmark does not assess few-shot or fine-tuned model performance capabilities
  • Static dataset: Regulatory frameworks evolve continuously, potentially dating benchmark content over time
  • Language limitation: The benchmark uses English-language regulatory documents, not covering regional Indian languages
  • Expert annotation dependency: Benchmark quality relies on the expertise and consistency of human annotators
  • Task type imbalance: Regulatory interpretation tasks (174 items) significantly outnumber contradiction detection tasks (62 items)

FAQ

What makes IndiaFinBench different from other financial AI benchmarks?
IndiaFinBench is the first benchmark specifically designed for Indian financial regulatory text, using SEBI and RBI documents instead of Western financial corpora.
How many question-answer pairs does IndiaFinBench contain?
IndiaFinBench contains 406 expert-annotated question-answer pairs sourced from 192 Indian regulatory documents.
Which AI models perform best on IndiaFinBench?
Gemini 2.5 Flash achieved the highest accuracy at 89.7%, while Gemma 4 E4B scored lowest at 70.4% among the twelve evaluated models.
What task types are included in IndiaFinBench evaluation?
IndiaFinBench includes four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items).
How reliable is IndiaFinBench annotation quality?
Annotation quality achieved kappa=0.918 on contradiction detection through model-based validation and kappa=0.611 with 76.7% agreement in human inter-annotator evaluation.
Can I access IndiaFinBench dataset for free?
Yes, IndiaFinBench dataset, evaluation code, and all model outputs are freely available at https://github.com/rajveerpall/IndiaFinBench.
Which task type shows the biggest performance differences between models?
Numerical reasoning tasks show the largest performance variation with a 35.9 percentage-point spread across evaluated models, making them most discriminative.
How does human performance compare to AI models on IndiaFinBench?
All evaluated AI models substantially outperformed the non-specialist human baseline of 60.0% accuracy, with the best model achieving 89.7%.
What statistical methods validate IndiaFinBench results?
Bootstrap significance testing with 10,000 resamples reveals three statistically distinct performance tiers among the evaluated models.
Does IndiaFinBench cover all Indian financial regulators?
No, IndiaFinBench focuses specifically on SEBI and RBI documents, not covering other Indian financial regulatory bodies.

Glossary

SEBI
Securities and Exchange Board of India, the primary regulator of Indian capital markets and securities trading
RBI
Reserve Bank of India, the central banking institution responsible for monetary policy and banking regulation in India
Zero-shot evaluation
Testing AI models on tasks without providing any task-specific training examples or demonstrations
Bootstrap significance testing
Statistical method using repeated random sampling to determine if performance differences between models are statistically meaningful
Inter-annotator agreement
Measure of consistency between different human experts when labeling the same data, typically expressed as kappa coefficient
Contradiction detection
AI task requiring identification of logical inconsistencies or conflicting statements within regulatory text
Temporal reasoning
AI capability to understand and process time-dependent relationships and sequences in regulatory changes
Numerical reasoning
AI ability to perform mathematical calculations and quantitative analysis on financial data and regulations

Download IndiaFinBench from https://github.com/rajveerpall/IndiaFinBench to evaluate your language model’s performance on Indian financial regulatory text.

Sources

  1. Large language model – Wikipedia. https://en.wikipedia.org/wiki/Large_language_model
  2. IFEval | DeepEval by Confident AI – The LLM Evaluation Framework. https://deepeval.com/docs/benchmarks-ifeval
  3. FinanceBench: A New Benchmark for Financial Question Answering. https://wandb.ai/byyoung3/ml-news/reports/FinanceBench-A-New-Benchmark-for-Financial-Question-Answering--VmlldzoxMDE1OTM0Mw
  4. What Are Large Language Models (LLMs)? | IBM. https://www.ibm.com/think/topics/large-language-models
  5. Comparative Evaluation of Rule-Based and Large Language Models for Financial Transaction Extraction in Chatbots. https://jurnal.polgan.ac.id/index.php/sinkron/article/view/16020
  6. PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs. https://arxiv.org/html/2604.17543
  7. Large Language Models for Cybersecurity Intelligence: A Systematic Review. https://www.sciencedirect.com/org/science/article/pii/S1546221826003565
  8. A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains. https://arxiv.org/html/2508.15832

Author

  • siego237

    Writes for FrontierWisdom on AI systems, automation, decentralized identity, and frontier infrastructure, with a focus on turning emerging technology into practical playbooks, implementation roadmaps, and monetization strategies for operators, builders, and consultants.

