Frontier Signal

IndiaFinBench: First LLM Benchmark for Indian Financial Text

IndiaFinBench introduces the first evaluation benchmark for large language models on Indian financial regulatory text, featuring 406 expert-annotated questions from SEBI and RBI documents.


IndiaFinBench is the first publicly available evaluation benchmark for assessing large language model performance on Indian financial regulatory text. The benchmark contains 406 expert-annotated question-answer pairs from SEBI and RBI documents across four task types.

Released by: Not yet disclosed
Release date: Not yet disclosed
What it is: First evaluation benchmark for LLM performance on Indian financial regulatory text
Who it is for: AI researchers and financial technology developers
Where to get it: https://github.com/rajveerpall/IndiaFinBench
Price: Free
  • IndiaFinBench addresses the gap in non-Western financial NLP benchmarks with 406 expert-annotated questions
  • The benchmark draws on 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI)
  • Four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
  • Twelve models were evaluated under zero-shot conditions, with accuracy ranging from 70.4% to 89.7%
  • All models substantially outperformed a non-specialist human baseline of 60.0% accuracy
  • Annotation quality is supported by kappa scores of 0.918 (model-based validation) and 0.611 (human inter-annotator agreement)
  • Numerical reasoning is the most discriminative task, with a 35.9 percentage-point performance spread
  • Bootstrap significance testing reveals three statistically distinct performance tiers among the twelve models
  • The complete dataset, evaluation code, and model outputs are publicly available for research use

What is IndiaFinBench

IndiaFinBench is the first publicly available evaluation benchmark specifically designed for assessing large language model performance on Indian financial regulatory text. Benchmark evaluations for LLMs attempt to measure model reasoning, factual accuracy, alignment, and safety [1].

The benchmark addresses a significant coverage gap: existing financial NLP benchmarks draw almost exclusively from Western corpora such as SEC filings, US earnings reports, and English-language financial news. IndiaFinBench instead provides an evaluation framework tailored to the characteristics of Indian financial regulation.

The benchmark contains 406 expert-annotated question-answer pairs sourced from 192 official documents from two primary Indian financial regulatory bodies: the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).

What is new vs the previous version

IndiaFinBench is a first release: no previous benchmark exists for Indian financial regulatory text evaluation. The comparison below therefore contrasts it with the prior state of financial NLP benchmarking.

| Aspect | Previous State | IndiaFinBench |
|---|---|---|
| Geographic coverage | Western financial corpora only | Indian regulatory frameworks |
| Document sources | SEC filings, US earnings reports | SEBI and RBI official documents |
| Task types | General financial reasoning | Regulatory interpretation, numerical reasoning, contradiction detection, temporal reasoning |
| Annotation quality | Variable validation methods | Dual validation: model-based (kappa = 0.918) and human inter-annotator (kappa = 0.611) |
| Public availability | Limited benchmark access | Complete dataset, code, and outputs publicly available |

How does IndiaFinBench work

IndiaFinBench operates through a structured evaluation framework that tests large language models across four distinct financial regulatory tasks.

  1. Document Collection: Researchers gathered 192 official documents from SEBI and RBI regulatory publications
  2. Question Generation: Expert annotators created 406 question-answer pairs distributed across four task categories
  3. Task Distribution: Regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items)
  4. Quality Validation: Annotation quality underwent dual validation through model-based secondary pass and human inter-annotator agreement evaluation
  5. Zero-Shot Evaluation: Twelve models were tested under zero-shot conditions without task-specific training
  6. Statistical Analysis: Bootstrap significance testing with 10,000 resamples identified statistically distinct performance tiers
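The evaluation loop behind steps 2 through 5 can be sketched as follows. This is a minimal illustration, not the paper's released code: the record fields (`task`, `question`, `answer`), the inline sample items, and the `ask_model` placeholder are all hypothetical, and the real benchmark may use a different schema and scoring rule than exact match.

```python
# Hypothetical record layout; the actual IndiaFinBench schema may differ.
DATASET = [
    {"task": "numerical_reasoning", "question": "What is 12% of 50,000 INR?", "answer": "6000"},
    {"task": "regulatory_interpretation", "question": "...", "answer": "..."},
]

def ask_model(question: str) -> str:
    """Placeholder for a zero-shot LLM call (no examples, no fine-tuning).

    A real harness would send the raw question to a model API here.
    """
    return "6000" if "12%" in question else ""

def evaluate(dataset):
    """Score exact-match accuracy overall and per task type."""
    totals, correct = {}, {}
    for item in dataset:
        task = item["task"]
        totals[task] = totals.get(task, 0) + 1
        if ask_model(item["question"]).strip() == item["answer"]:
            correct[task] = correct.get(task, 0) + 1
    per_task = {t: correct.get(t, 0) / n for t, n in totals.items()}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_task
```

Keeping per-task tallies alongside the overall score is what makes it possible to report results like the numerical-reasoning spread discussed below.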

Benchmarks and evidence

IndiaFinBench evaluation results demonstrate clear performance differentiation across twelve tested large language models.

| Model | Overall Accuracy | Performance Tier | Source |
|---|---|---|---|
| Gemini 2.5 Flash | 89.7% | Tier 1 | IndiaFinBench paper |
| Gemma 4 E4B | 70.4% | Tier 3 | IndiaFinBench paper |
| Non-specialist human | 60.0% | Baseline | IndiaFinBench paper |

Numerical reasoning shows the widest spread of any task, 35.9 percentage points across the twelve models, making it the most discriminative task (IndiaFinBench paper).

Researchers and practitioners look at qualities such as accuracy, efficiency, safety, fairness, and robustness to determine how well a model performs [4]. IndiaFinBench's validation achieved strong inter-annotator agreement, with kappa = 0.611 and 76.7% raw agreement across 60 evaluated items.
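For readers unfamiliar with the kappa statistic cited above, here is a self-contained Cohen's kappa computation. It illustrates the metric itself (raw agreement corrected for chance agreement), not the paper's actual validation code or labels.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Fraction of items where the two annotators agree outright.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)
```

On two annotators who agree on 3 of 4 binary labels, this yields kappa = 0.5, noticeably below the 0.75 raw agreement, which is why kappa is the stronger evidence of annotation quality.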

Bootstrap significance testing with 10,000 resamples revealed three statistically distinct performance tiers among the evaluated models, providing robust statistical evidence for performance differentiation.
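The tiering procedure can be sketched as a pairwise bootstrap over per-item correctness scores. This is an illustrative reconstruction under stated assumptions (paired resampling of 0/1 scores, one-sided comparison); the paper's exact resampling scheme and tier-assignment rule may differ.

```python
import random

def bootstrap_accuracy_diff(scores_a, scores_b, resamples=10_000, seed=0):
    """Bootstrap the accuracy gap between two models on the same items.

    scores_a / scores_b are per-item 0/1 correctness lists. Returns the
    fraction of resamples in which model A fails to beat model B; a small
    value suggests the observed gap is statistically robust.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(resamples):
        # Resample item indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if diff <= 0:
            not_better += 1
    return not_better / resamples
```

Grouping models into tiers then amounts to merging any pair whose bootstrap comparison is inconclusive and separating pairs where one model wins in nearly every resample.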

Who should care

Builders

AI researchers developing financial NLP systems need IndiaFinBench to evaluate model performance on non-Western regulatory frameworks. The benchmark provides standardized evaluation metrics for Indian financial text understanding capabilities.

Enterprise

Financial technology companies operating in Indian markets require IndiaFinBench to assess LLM suitability for regulatory compliance applications. Banks and fintech firms can use the benchmark to validate AI systems processing SEBI and RBI documentation.

End users

Financial professionals working with Indian regulatory documents benefit from IndiaFinBench-validated AI tools that demonstrate proven performance on regulatory interpretation and numerical reasoning tasks.

Investors

Venture capital and private equity investors evaluating AI companies serving Indian financial markets can reference IndiaFinBench results to assess technical capabilities and market readiness.

How to use IndiaFinBench today

IndiaFinBench is immediately accessible through its public GitHub repository with complete evaluation resources.

  1. Access the Repository: Visit https://github.com/rajveerpall/IndiaFinBench to download the complete dataset
  2. Download Components: Obtain the 406 question-answer pairs, evaluation code, and all model outputs
  3. Set Up Evaluation: Install the provided evaluation framework following repository documentation
  4. Run Baseline Tests: Execute zero-shot evaluation on your target LLM using the standardized protocol
  5. Compare Results: Benchmark your model performance against the twelve reference models and human baseline
  6. Analyze Task Performance: Examine results across the four task types to identify model strengths and weaknesses
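Step 5 above, comparing your results against the published numbers, can be as simple as slotting your accuracy into a reference leaderboard. The figures below are the three reference points reported in the article; treat them as a sketch, since the full twelve-model table lives in the repository.

```python
# Reference accuracies quoted in the article (fractions, not percents).
REFERENCE = {
    "Gemini 2.5 Flash": 0.897,
    "Gemma 4 E4B": 0.704,
    "non-specialist human": 0.600,
}

def rank_against_references(your_accuracy, references=REFERENCE):
    """Return (name, accuracy) pairs sorted best-first, with your model slotted in."""
    board = dict(references)
    board["your model"] = your_accuracy
    return sorted(board.items(), key=lambda kv: kv[1], reverse=True)
```

For example, a model scoring 75% overall would land between the two published model scores, above the human baseline.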

IndiaFinBench vs competitors

IndiaFinBench stands alone as the first benchmark specifically designed for Indian financial regulatory text evaluation.

| Benchmark | Geographic Focus | Document Sources | Task Types | Question Count |
|---|---|---|---|---|
| IndiaFinBench | Indian regulatory frameworks | SEBI, RBI documents | 4 specialized tasks | 406 |
| FinanceBench | Western markets | SEC filings, US earnings | General financial QA | Not yet disclosed |
| LawBench | General legal text | Legal documents | Legal reasoning | Not yet disclosed |

Specialized domain benchmarks are an established pattern: PoliLegalLM, for instance, is evaluated on LawBench, LexEval, and the real-world PoliLegal dataset [6]. IndiaFinBench brings the same approach to Indian financial regulation.

Risks, limits, and myths

  • Limited Scope: IndiaFinBench focuses exclusively on SEBI and RBI documents, potentially missing other Indian financial regulatory bodies
  • Zero-Shot Only: Current evaluation uses zero-shot conditions, which may not reflect fine-tuned model performance
  • Language Limitation: The benchmark evaluates English-language regulatory text, excluding regional language financial documents
  • Temporal Coverage: Document sources represent a specific time period, potentially missing recent regulatory changes
  • Human Baseline: The 60.0% non-specialist human baseline may not represent expert-level human performance
  • Task Balance: Uneven distribution across task types (174 regulatory interpretation vs 62 contradiction detection items)
  • Model Selection: Evaluation limited to twelve models, potentially missing other relevant LLM architectures

FAQ

What is IndiaFinBench and why was it created?

IndiaFinBench is the first publicly available evaluation benchmark for assessing large language model performance on Indian financial regulatory text, created to address the gap in non-Western financial NLP benchmarks.

How many questions does IndiaFinBench contain?

IndiaFinBench contains 406 expert-annotated question-answer pairs drawn from 192 documents sourced from SEBI and RBI regulatory publications.

What are the four task types in IndiaFinBench?

The four task types are regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items).

Which model performed best on IndiaFinBench?

Gemini 2.5 Flash achieved the highest accuracy at 89.7% under zero-shot evaluation conditions, while Gemma 4 E4B scored lowest at 70.4%.

How does IndiaFinBench validate annotation quality?

IndiaFinBench validates annotation quality through dual methods: model-based secondary pass (kappa=0.918 on contradiction detection) and human inter-annotator agreement evaluation (kappa=0.611).

What is the most challenging task type in IndiaFinBench?

Numerical reasoning is the most discriminative task, showing a 35.9 percentage-point performance spread across the twelve evaluated models.

How does model performance compare to human baseline?

All twelve evaluated models substantially outperformed the non-specialist human baseline of 60.0% accuracy, with the top model achieving 89.7%.

Where can I access IndiaFinBench dataset and code?

The complete IndiaFinBench dataset, evaluation code, and all model outputs are publicly available at https://github.com/rajveerpall/IndiaFinBench.

What statistical methods validate IndiaFinBench results?

Bootstrap significance testing with 10,000 resamples reveals three statistically distinct performance tiers among the evaluated models.

Which regulatory bodies provide source documents for IndiaFinBench?

IndiaFinBench draws from official documents from two primary Indian financial regulatory bodies: Securities and Exchange Board of India (SEBI) and Reserve Bank of India (RBI).

How does IndiaFinBench differ from existing financial benchmarks?

Unlike existing financial NLP benchmarks that draw exclusively from Western financial corpora, IndiaFinBench focuses specifically on Indian regulatory frameworks and documents.

What evaluation conditions were used for model testing?

All twelve models were evaluated under zero-shot conditions without task-specific training or fine-tuning on Indian financial regulatory text.

Glossary

Bootstrap Significance Testing
Statistical method using repeated resampling to determine if performance differences between models are statistically significant
Contradiction Detection
Task type requiring models to identify conflicting information within regulatory documents
Inter-annotator Agreement
Measure of consistency between different human annotators when labeling the same data, expressed as kappa coefficient
Kappa Coefficient
Statistical measure of inter-rater reliability that accounts for agreement occurring by chance
Numerical Reasoning
Task type requiring models to perform mathematical calculations and interpret numerical information in financial contexts
Regulatory Interpretation
Task type requiring models to understand and explain financial regulatory rules and requirements
SEBI
Securities and Exchange Board of India, the regulatory body for securities markets in India
Temporal Reasoning
Task type requiring models to understand time-based relationships and sequences in regulatory contexts
Zero-shot Evaluation
Testing method where models perform tasks without prior training or examples specific to that task

Visit https://github.com/rajveerpall/IndiaFinBench to download the complete dataset and begin evaluating your LLM on Indian financial regulatory text.

Sources

  1. Large language model – Wikipedia. https://en.wikipedia.org/wiki/Large_language_model
  2. IFEval | DeepEval by Confident AI – The LLM Evaluation Framework. https://deepeval.com/docs/benchmarks-ifeval
  3. FinanceBench: A New Benchmark for Financial Question Answering. https://wandb.ai/byyoung3/ml-news/reports/FinanceBench-A-New-Benchmark-for-Financial-Question-Answering--VmlldzoxMDE1OTM0Mw
  4. What Are Large Language Models (LLMs)? | IBM. https://www.ibm.com/think/topics/large-language-models
  5. Comparative Evaluation of Rule-Based and Large Language Models for Financial Transaction Extraction in Chatbots. https://jurnal.polgan.ac.id/index.php/sinkron/article/view/16020
  6. PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs. https://arxiv.org/html/2604.17543
  7. Large Language Models for Cybersecurity Intelligence: A Systematic Review. https://www.sciencedirect.com/org/science/article/pii/S1546221826003565
  8. A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains. https://arxiv.org/html/2508.15832

Author

  • siego237

    Writes for FrontierWisdom on AI systems, automation, decentralized identity, and frontier infrastructure, with a focus on turning emerging technology into practical playbooks, implementation roadmaps, and monetization strategies for operators, builders, and consultants.
