Frontier Signal

IndiaFinBench: First LLM Benchmark for Indian Financial Regulation



IndiaFinBench represents the first publicly available evaluation benchmark designed to assess large language model performance on Indian financial regulatory text, featuring 406 expert-annotated question-answer pairs from SEBI and RBI documents across four specialized task types.

Released by: Not yet disclosed
Release date: Not yet disclosed
What it is: First LLM evaluation benchmark for Indian financial regulatory text
Who it is for: AI researchers and financial technology developers
Where to get it: GitHub repository at rajveerpall/IndiaFinBench
Price: Free
  • IndiaFinBench contains 406 expert-annotated question-answer pairs from 192 SEBI and RBI documents
  • The benchmark covers four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
  • Twelve models were evaluated with accuracy ranging from 70.4% to 89.7% under zero-shot conditions
  • Numerical reasoning proved most discriminative with a 35.9 percentage-point performance spread across models
  • All evaluated models substantially outperformed a non-specialist human baseline of 60.0%
  • IndiaFinBench addresses a significant gap in financial NLP benchmarks by focusing on non-Western regulatory frameworks
  • Annotation quality checks report kappa = 0.918 for the model-based secondary validation pass on contradiction detection and kappa = 0.611 for human inter-annotator agreement
  • Gemini 2.5 Flash achieved the highest accuracy at 89.7%, while Gemma 4 E4B scored lowest at 70.4%
  • Bootstrap significance testing revealed three statistically distinct performance tiers among the twelve evaluated models
  • The complete dataset, evaluation code, and model outputs are publicly available for research use

What is IndiaFinBench

IndiaFinBench is the first publicly available evaluation benchmark specifically designed to assess large language model performance on Indian financial regulatory text. The benchmark addresses a critical gap in existing financial NLP evaluation tools, which have historically drawn exclusively from Western financial corpora including SEC filings, US earnings reports, and English-language financial news.

The benchmark contains 406 expert-annotated question-answer pairs sourced from 192 official documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI). These regulatory documents represent the primary financial oversight bodies in India and cover diverse aspects of financial regulation, compliance requirements, and policy frameworks.

What is new vs the previous version

IndiaFinBench represents an entirely new benchmark category, as no previous evaluation framework has focused on Indian financial regulatory text.

| Aspect | Previous financial benchmarks | IndiaFinBench |
| --- | --- | --- |
| Geographic focus | Exclusively Western financial corpora | Indian regulatory frameworks |
| Source documents | SEC filings, US earnings reports | SEBI and RBI regulatory documents |
| Task coverage | General financial understanding | Four specialized regulatory tasks |
| Annotation quality | Variable validation methods | Model-based secondary pass with kappa = 0.918 |
| Public availability | Limited open access | Complete dataset and code on GitHub |

How does IndiaFinBench work

IndiaFinBench operates through a structured four-task evaluation framework designed to assess different aspects of regulatory text comprehension.

  1. Regulatory Interpretation (174 items): Models must demonstrate understanding of complex regulatory language and policy implications from SEBI and RBI documents.
  2. Numerical Reasoning (92 items): Tasks require mathematical computation and quantitative analysis of financial regulations and compliance requirements.
  3. Contradiction Detection (62 items): Models identify inconsistencies or conflicting statements within regulatory text passages.
  4. Temporal Reasoning (78 items): Evaluation focuses on understanding time-dependent regulatory changes and chronological policy relationships.

The benchmark employs zero-shot evaluation conditions, meaning models receive no task-specific training examples before assessment. Annotation quality validation includes both model-based secondary passes and human inter-annotator agreement evaluation across 60 items.
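The zero-shot protocol described above can be sketched as a small evaluation loop. The item fields (`question`, `answer`, `task`) and the exact-match scoring rule are illustrative assumptions; the benchmark's released evaluation code defines the actual schema and metric.

```python
# Minimal sketch of a zero-shot evaluation loop. Field names and the
# exact-match metric are assumptions; consult the repository's code
# for the real protocol.
from typing import Callable

def evaluate_zero_shot(items: list[dict], model: Callable[[str], str]) -> dict:
    """Score a model on benchmark items with no in-context examples."""
    per_task: dict[str, list[int]] = {}
    for item in items:
        # Zero-shot: the prompt contains only the question, no solved examples.
        prediction = model(item["question"]).strip().lower()
        correct = int(prediction == item["answer"].strip().lower())
        per_task.setdefault(item["task"], []).append(correct)
    return {task: sum(s) / len(s) for task, s in per_task.items()}

# Toy run with a stub "model" that always answers "yes".
toy_items = [
    {"task": "contradiction_detection", "question": "Do clauses 3 and 7 conflict?", "answer": "Yes"},
    {"task": "contradiction_detection", "question": "Do clauses 1 and 2 conflict?", "answer": "No"},
]
print(evaluate_zero_shot(toy_items, lambda q: "yes"))
# → {'contradiction_detection': 0.5}
```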

Benchmarks and evidence

IndiaFinBench evaluation results demonstrate significant performance variation across twelve tested models under zero-shot conditions.

| Model | Overall accuracy | Performance tier | Source |
| --- | --- | --- | --- |
| Gemini 2.5 Flash | 89.7% | Tier 1 | IndiaFinBench paper |
| Gemma 4 E4B | 70.4% | Tier 3 | IndiaFinBench paper |
| Non-specialist human | 60.0% | Baseline | IndiaFinBench paper |
| Numerical reasoning spread | 35.9 percentage points | Most discriminative task | IndiaFinBench paper |

Bootstrap significance testing with 10,000 resamples revealed three statistically distinct performance tiers among evaluated models. The numerical reasoning task category showed the highest discrimination between model capabilities, with performance gaps exceeding 35 percentage points between top and bottom performers.
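The tiering procedure can be illustrated with a minimal paired bootstrap over per-item 0/1 correctness scores, assuming paired items for two models; the scores below are toy data, not the paper's results.

```python
# Hedged sketch of bootstrap comparison with 10,000 resamples, as the
# text describes. The per-item scores are illustrative toy data.
import random

def bootstrap_diff_ci(scores_a, scores_b, resamples=10_000, seed=0):
    """95% CI for the accuracy gap between two models on paired items."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * resamples)], diffs[int(0.975 * resamples)]

# Two models fall in distinct tiers when the 95% CI excludes zero.
a = [1] * 90 + [0] * 10   # toy model at 90% accuracy
b = [1] * 70 + [0] * 30   # toy model at 70% accuracy
lo, hi = bootstrap_diff_ci(a, b)
print(lo > 0)  # interval excludes zero, so the gap is significant
```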

Who should care

Builders

AI developers working on financial applications can use IndiaFinBench to evaluate model performance on regulatory compliance tasks. The benchmark provides standardized metrics for assessing LLM capabilities in non-Western financial contexts, enabling more robust model selection and fine-tuning strategies.

Enterprise

Financial institutions operating in India can leverage IndiaFinBench to assess AI systems for regulatory compliance automation. Banks, investment firms, and fintech companies can evaluate whether their LLM implementations meet accuracy requirements for processing SEBI and RBI documentation.

End users

Researchers and academics studying financial NLP can access a comprehensive evaluation framework for Indian regulatory text. The benchmark enables comparative analysis of model performance across different regulatory interpretation tasks.

Investors

Investment firms can use IndiaFinBench results to evaluate AI-powered compliance and regulatory analysis tools. The benchmark provides objective performance metrics for assessing fintech solutions targeting Indian financial markets.

How to use IndiaFinBench today

IndiaFinBench is immediately accessible through its GitHub repository for research and evaluation purposes.

  1. Access the repository: Navigate to https://github.com/rajveerpall/IndiaFinBench to download the complete dataset and evaluation code.
  2. Load the benchmark data: The repository contains 406 question-answer pairs organized by task type with corresponding SEBI and RBI source documents.
  3. Implement evaluation protocol: Use the provided evaluation code to assess your model under zero-shot conditions across all four task categories.
  4. Compare results: Benchmark your model performance against the twelve baseline models with accuracy scores ranging from 70.4% to 89.7%.
  5. Analyze task-specific performance: Focus on numerical reasoning tasks for the most discriminative evaluation of model capabilities.
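As a worked example for step 4, per-task accuracies can be combined into an item-weighted overall score using the task sizes the benchmark reports (174, 92, 62, 78). The per-task accuracies below are placeholders for your own model's results, not published numbers.

```python
# Sketch: item-weighted overall accuracy from per-task accuracies,
# using the task counts the article reports. Example accuracies are
# made-up placeholders.
task_counts = {
    "regulatory_interpretation": 174,
    "numerical_reasoning": 92,
    "contradiction_detection": 62,
    "temporal_reasoning": 78,
}
assert sum(task_counts.values()) == 406  # matches the benchmark total

def overall_accuracy(per_task_acc: dict[str, float]) -> float:
    """Weight each task's accuracy by its number of items."""
    total = sum(task_counts.values())
    return sum(per_task_acc[t] * n for t, n in task_counts.items()) / total

example = {
    "regulatory_interpretation": 0.85,
    "numerical_reasoning": 0.70,
    "contradiction_detection": 0.80,
    "temporal_reasoning": 0.75,
}
print(round(overall_accuracy(example), 3))
# → 0.789
```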

IndiaFinBench vs competitors

IndiaFinBench occupies a unique position in the financial NLP benchmark landscape by focusing specifically on Indian regulatory frameworks.

| Benchmark | Geographic focus | Task types | Document sources | Question count |
| --- | --- | --- | --- | --- |
| IndiaFinBench | Indian regulatory frameworks | 4 specialized tasks | SEBI and RBI documents | 406 questions |
| FinanceBench | Western financial markets | General financial QA | SEC filings, earnings reports | Not yet disclosed |
| LawBench | General legal domains | Legal reasoning tasks | Various legal documents | Not yet disclosed |

Risks, limits, and myths

  • Geographic limitation: The benchmark focuses exclusively on Indian regulatory frameworks and may not generalize to other financial jurisdictions.
  • Language constraint: All evaluation materials are in English, potentially missing regional language regulatory documents used in Indian financial contexts.
  • Temporal scope: The benchmark reflects regulatory frameworks as of the document collection date and may not capture recent policy changes.
  • Task coverage: Four task types may not encompass all aspects of regulatory text comprehension required in real-world financial applications.
  • Human baseline limitation: The 60.0% non-specialist human baseline may not represent expert-level human performance on these tasks.
  • Model selection bias: Evaluation limited to twelve models may not represent the full spectrum of available LLM capabilities.

FAQ

What makes IndiaFinBench different from other financial AI benchmarks?

IndiaFinBench is the first evaluation benchmark specifically designed for Indian financial regulatory text, addressing a gap in existing benchmarks that focus exclusively on Western financial corpora like SEC filings and US earnings reports.

How many questions does IndiaFinBench contain?

IndiaFinBench contains 406 expert-annotated question-answer pairs sourced from 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).

What types of tasks does IndiaFinBench evaluate?

The benchmark evaluates four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items).

Which AI model performed best on IndiaFinBench?

Gemini 2.5 Flash achieved the highest accuracy at 89.7%, while Gemma 4 E4B scored lowest at 70.4% among the twelve evaluated models under zero-shot conditions.

How reliable is the annotation quality in IndiaFinBench?

Annotation quality is validated through model-based secondary passes achieving kappa=0.918 on contradiction detection and human inter-annotator agreement evaluation with kappa=0.611 and 76.7% overall agreement.
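For readers unfamiliar with the metric, here is a minimal, illustrative computation of Cohen's kappa on toy binary labels; the labels are invented, not the benchmark's annotation data.

```python
# Illustrative Cohen's kappa for two annotators on a binary task.
def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    cats = set(labels_a) | set(labels_b)
    # Expected agreement if both annotators labeled independently at their own rates.
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)

a = ["yes", "yes", "no", "no", "yes", "no"]
b = ["yes", "no", "no", "no", "yes", "yes"]
print(round(cohens_kappa(a, b), 3))
# → 0.333
```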

Can I access IndiaFinBench for free?

Yes, the complete dataset, evaluation code, and all model outputs are freely available at the GitHub repository https://github.com/rajveerpall/IndiaFinBench.

What is the most challenging task type in IndiaFinBench?

Numerical reasoning showed the widest performance spread, 35.9 percentage points across models, making it the most discriminative task category: it is the one that best separates strong models from weak ones.

How does human performance compare to AI models on IndiaFinBench?

All twelve evaluated AI models substantially outperformed the non-specialist human baseline of 60.0%, with the lowest-performing model (Gemma 4 E4B) achieving 70.4% accuracy.

What statistical methods validate IndiaFinBench results?

Bootstrap significance testing with 10,000 resamples grouped the twelve evaluated models into three statistically distinct performance tiers, giving the reported ranking statistical support.

Which regulatory bodies provide source documents for IndiaFinBench?

IndiaFinBench draws from documents issued by the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), representing the primary financial regulatory authorities in India.

Glossary

SEBI
Securities and Exchange Board of India, the primary regulatory authority for securities markets in India
RBI
Reserve Bank of India, the central banking institution and monetary authority of India
Zero-shot evaluation
Testing AI models on tasks without providing task-specific training examples or fine-tuning
Kappa score
Statistical measure of inter-rater agreement that accounts for agreement occurring by chance
Bootstrap significance testing
Statistical method using resampling to determine if observed differences between groups are statistically significant
Numerical reasoning
AI task requiring mathematical computation and quantitative analysis of text-based problems
Contradiction detection
NLP task involving identification of inconsistent or conflicting statements within text passages
Temporal reasoning
AI capability to understand time-dependent relationships and chronological sequences in text

Download the IndiaFinBench dataset from GitHub at https://github.com/rajveerpall/IndiaFinBench to evaluate your LLM’s performance on Indian financial regulatory text.

Author

  • siego237

    Writes for FrontierWisdom on AI systems, automation, decentralized identity, and frontier infrastructure, with a focus on turning emerging technology into practical playbooks, implementation roadmaps, and monetization strategies for operators, builders, and consultants.

