Frontier Signal

IndiaFinBench: New LLM Benchmark for Indian Financial Regulation

IndiaFinBench introduces the first evaluation benchmark for large language models on Indian financial regulatory text, featuring 406 expert-annotated questions from SEBI and RBI documents.

IndiaFinBench is the first publicly available evaluation benchmark for assessing large language model performance on Indian financial regulatory text. The benchmark contains 406 expert-annotated question-answer pairs from SEBI and RBI documents across four task types.

Released by: Not yet disclosed
Release date:
What it is: Evaluation benchmark for LLM performance on Indian financial regulatory text
Who it's for: AI researchers and financial technology developers
Where to get it: https://github.com/rajveerpall/IndiaFinBench
Price: Free
  • IndiaFinBench fills a gap in financial NLP evaluation: existing benchmarks draw almost exclusively from Western corpora, leaving Indian regulatory frameworks uncovered
  • The dataset includes 406 expert-annotated questions drawn from 192 SEBI and RBI documents
  • Four task types cover regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
  • Twelve models were evaluated zero-shot, with accuracy ranging from 70.4% to 89.7%; all significantly outperformed the human baseline of 60.0%
  • Numerical reasoning is the most discriminative task, showing the widest performance spread across models
  • Annotation quality is validated two ways: model-based secondary passes and human inter-annotator agreement
  • Bootstrap significance testing separates the evaluated models into three statistically distinct performance tiers

What is IndiaFinBench

IndiaFinBench is an evaluation benchmark specifically designed to assess large language model performance on Indian financial regulatory text. Existing financial NLP benchmarks draw exclusively from Western financial corpora, leaving a significant gap in coverage of non-Western regulatory frameworks [1]. The benchmark addresses this limitation by providing the first publicly available dataset focused on Indian financial regulations.

The dataset comprises 406 expert-annotated question-answer pairs sourced from 192 official documents. These documents originate from two primary Indian financial regulatory bodies: the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI). The benchmark evaluates models across four distinct task types that reflect real-world financial regulatory analysis requirements.

What is new vs the previous version

IndiaFinBench represents the first benchmark of its kind, with no previous versions existing for Indian financial regulatory text evaluation.

| Aspect | Previous state | IndiaFinBench innovation |
|---|---|---|
| Geographic coverage | Western financial corpora only | First Indian regulatory framework focus |
| Document sources | SEC filings, US earnings reports | SEBI and RBI official documents |
| Task diversity | Limited to basic comprehension | Four specialized task types |
| Annotation quality | Variable validation methods | Dual validation: model-based and human |

How does IndiaFinBench work

IndiaFinBench operates through a structured evaluation framework that tests large language models across four specialized financial regulatory tasks.

  1. Document Collection: Researchers gathered 192 documents from SEBI and RBI official publications
  2. Question Generation: Expert annotators created 406 question-answer pairs distributed across four task categories
  3. Quality Validation: A two-stage validation process ensures annotation accuracy through model-based secondary passes and human inter-annotator agreement
  4. Model Evaluation: Twelve models undergo zero-shot evaluation across all task types
  5. Statistical Analysis: Bootstrap significance testing with 10,000 resamples determines performance tier distinctions
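The statistical step (5) can be sketched as a percentile bootstrap over per-question correctness. This is an illustrative reconstruction of the general technique, not the authors' released code; the function name and interface are assumptions.

```python
import random

def bootstrap_ci(correct, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    `correct` is a list of 0/1 outcomes, one per benchmark question.
    Each resample redraws the question set with replacement and
    recomputes accuracy; the middle (1 - alpha) mass of the resampled
    accuracies forms the interval.
    """
    rng = random.Random(seed)
    n = len(correct)
    accs = []
    for _ in range(n_resamples):
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        accs.append(sum(sample) / n)
    accs.sort()
    lo = accs[int((alpha / 2) * n_resamples)]
    hi = accs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Two models land in different tiers when intervals like these (or a paired comparison built the same way) do not overlap across resamples.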

Benchmarks and evidence

IndiaFinBench evaluation results demonstrate significant performance variation across models and task types.

| Metric | Result | Source |
|---|---|---|
| Highest accuracy (Gemini 2.5 Flash) | 89.7% | IndiaFinBench evaluation |
| Lowest accuracy (Gemma 4 E4B) | 70.4% | IndiaFinBench evaluation |
| Human baseline performance | 60.0% | IndiaFinBench evaluation |
| Numerical reasoning performance spread | 35.9 percentage points | IndiaFinBench evaluation |
| Inter-annotator agreement (kappa) | 0.611 | 60-item human evaluation |
| Model-based validation (kappa) | 0.918 | Contradiction detection task |

Who should care

Builders

AI researchers developing financial NLP systems need IndiaFinBench to evaluate model performance on non-Western regulatory frameworks. Benchmark evaluations for LLMs attempt to measure model reasoning, factual accuracy, alignment, and safety [1]. The dataset provides essential validation for models intended to process Indian financial regulations.

Enterprise

Financial institutions operating in India require accurate AI systems for regulatory compliance and document analysis. IndiaFinBench enables organizations to assess whether their chosen models can reliably interpret SEBI and RBI requirements. The benchmark’s four task types directly align with common regulatory analysis workflows.

End users

Financial professionals and compliance officers benefit from understanding AI model capabilities on Indian regulatory text. The benchmark results help users set appropriate expectations for AI-assisted regulatory analysis and identify tasks requiring human oversight.

Investors

Venture capital and institutional investors evaluating fintech companies can use IndiaFinBench results to assess technical capabilities. The benchmark provides objective performance metrics for AI systems targeting the Indian financial market.

How to use IndiaFinBench today

IndiaFinBench is immediately accessible through its GitHub repository for researchers and developers.

  1. Access the repository: Visit https://github.com/rajveerpall/IndiaFinBench to download the dataset
  2. Review the documentation: Examine the provided evaluation code and model output examples
  3. Load your model: Implement the evaluation framework with your chosen large language model
  4. Run zero-shot evaluation: Execute the benchmark across all four task types without fine-tuning
  5. Analyze results: Compare your model’s performance against the published baseline results
  6. Submit findings: Consider contributing your results to the research community
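A minimal harness for steps 3 through 5 might look like the sketch below. The JSONL layout and field names (`question`, `answer`, `task`) are assumptions about the repository's data format, and exact-match scoring stands in for whatever scoring the released evaluation code actually uses.

```python
import json
from collections import defaultdict

def evaluate(dataset_path, ask_model):
    """Zero-shot accuracy per task type.

    `ask_model` is any callable mapping a question string to an answer
    string (e.g. a wrapper around your model's API). Exact-match
    comparison after lowercasing is a deliberate simplification.
    """
    per_task = defaultdict(lambda: [0, 0])  # task -> [correct, total]
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            pred = ask_model(item["question"]).strip().lower()
            gold = item["answer"].strip().lower()
            per_task[item["task"]][0] += int(pred == gold)
            per_task[item["task"]][1] += 1
    return {task: c / n for task, (c, n) in per_task.items()}
```

Swapping in a different `ask_model` callable is all it takes to compare providers, which keeps the benchmark loop itself model-agnostic.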

IndiaFinBench vs competitors

IndiaFinBench occupies a unique position in the financial NLP evaluation landscape by focusing specifically on Indian regulatory frameworks.

| Benchmark | Geographic focus | Document sources | Task types | Question count |
|---|---|---|---|---|
| IndiaFinBench | India | SEBI, RBI documents | 4 specialized tasks | 406 |
| FinanceBench | United States | SEC filings, earnings reports | General comprehension | Not yet disclosed |
| LawBench | Multiple jurisdictions | Legal documents | Legal reasoning | Not yet disclosed |

Risks, limits, and myths

  • Limited scope: The benchmark focuses only on SEBI and RBI documents, excluding other Indian financial regulators
  • Language constraint: All documents are in English, potentially missing regional language regulatory content
  • Temporal coverage: The dataset represents a specific time period and may not reflect evolving regulations
  • Task specificity: Four task types may not capture all real-world regulatory analysis requirements
  • Model selection bias: Evaluation limited to twelve models may not represent full market performance range
  • Zero-shot limitation: Results may not reflect performance after domain-specific fine-tuning

FAQ

What makes IndiaFinBench different from other financial AI benchmarks?

IndiaFinBench is the first evaluation benchmark specifically designed for Indian financial regulatory text, addressing the gap left by Western-focused financial NLP datasets.

How many questions does IndiaFinBench contain?

IndiaFinBench contains 406 expert-annotated question-answer pairs distributed across four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning.

Which Indian regulatory bodies are covered in IndiaFinBench?

The benchmark draws from documents issued by the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).

What was the best performing model on IndiaFinBench?

Gemini 2.5 Flash achieved the highest accuracy of 89.7% across all tasks in the zero-shot evaluation.

How does human performance compare to AI models on IndiaFinBench?

All evaluated AI models substantially outperformed the non-specialist human baseline of 60.0%, with the lowest model achieving 70.4% accuracy.

Which task type is most challenging for AI models?

Numerical reasoning proved most discriminative, showing a 35.9 percentage-point performance spread across the twelve evaluated models.

Is IndiaFinBench available for commercial use?

The dataset, evaluation code, and model outputs are freely available through the GitHub repository at https://github.com/rajveerpall/IndiaFinBench.

How was annotation quality validated in IndiaFinBench?

Quality validation used a dual approach: model-based secondary passes achieving kappa=0.918 on contradiction detection, and human inter-annotator agreement evaluation with kappa=0.611.
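For reference, the kappa statistic reported here can be computed from two annotators' labels with the standard Cohen's kappa formula. This is textbook code, not code from the benchmark repository.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_obs - p_exp) / (1 - p_exp), where p_obs is the observed
    agreement rate and p_exp is the agreement expected by chance given
    each annotator's label frequencies.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_exp = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)
```

On this scale, the reported 0.611 indicates substantial human agreement, and 0.918 indicates near-perfect agreement for the model-based pass.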

Can I evaluate my own model using IndiaFinBench?

Yes, the benchmark provides evaluation code and documentation enabling researchers to test their own large language models against the dataset.

What statistical methods validate IndiaFinBench results?

Bootstrap significance testing with 10,000 resamples was used to establish three statistically distinct performance tiers among the evaluated models.
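A paired bootstrap comparison between two models, the kind of test used to separate performance tiers, can be sketched as follows. This illustrates the general technique; the authors' exact procedure may differ.

```python
import random

def paired_bootstrap_win_rate(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Fraction of bootstrap resamples in which model A beats model B.

    Both inputs are 0/1 per-question outcomes on the *same* questions;
    resampling question indices (not each list independently) preserves
    the pairing. A win rate near 1.0 or 0.0 indicates a statistically
    distinct tier; near 0.5 indicates no reliable difference.
    """
    rng = random.Random(seed)
    n = len(correct_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in idx) / n
        if diff > 0:
            wins += 1
    return wins / n_resamples
```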

Glossary

SEBI
Securities and Exchange Board of India, the primary regulator of securities markets in India
RBI
Reserve Bank of India, the central banking institution and monetary authority of India
Zero-shot evaluation
Testing AI models on tasks without prior training or fine-tuning on similar examples
Kappa coefficient
Statistical measure of inter-rater agreement that accounts for agreement occurring by chance
Bootstrap significance testing
Statistical method using repeated random sampling to determine the reliability of results
Numerical reasoning
AI task involving mathematical calculations and quantitative analysis within text
Contradiction detection
AI task identifying conflicting information within or across documents
Temporal reasoning
AI task involving understanding and processing time-related information and sequences

Visit the IndiaFinBench GitHub repository at https://github.com/rajveerpall/IndiaFinBench to download the dataset and begin evaluating your large language model on Indian financial regulatory text.

Sources

  1. Large language model – Wikipedia. https://en.wikipedia.org/wiki/Large_language_model
  2. IFEval | DeepEval by Confident AI – The LLM Evaluation Framework. https://deepeval.com/docs/benchmarks-ifeval
  3. FinanceBench: A New Benchmark for Financial Question Answering. https://wandb.ai/byyoung3/ml-news/reports/FinanceBench-A-New-Benchmark-for-Financial-Question-Answering–VmlldzoxMDE1OTM0Mw
  4. What Are Large Language Models (LLMs)? | IBM. https://www.ibm.com/think/topics/large-language-models
  5. Comparative Evaluation of Rule-Based and Large Language Models for Financial Transaction Extraction in Chatbots. https://jurnal.polgan.ac.id/index.php/sinkron/article/view/16020
  6. PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs. https://arxiv.org/html/2604.17543
  7. Large Language Models for Cybersecurity Intelligence: A Systematic Review. https://www.sciencedirect.com/org/science/article/pii/S1546221826003565
  8. A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains. https://arxiv.org/html/2508.15832

Author

  • siego237

