Frontier Signal

IndiaFinBench: New LLM Benchmark for Indian Financial Regulations

IndiaFinBench introduces 406 expert-annotated question-answer pairs from SEBI and RBI documents to evaluate large language model performance on Indian financial regulatory text.


IndiaFinBench is the first publicly available evaluation benchmark for assessing large language model performance on Indian financial regulatory text. The benchmark contains 406 expert-annotated question-answer pairs from SEBI and RBI documents across four task types.

Released by: Not yet disclosed
Release date: Not yet disclosed
What it is: Evaluation benchmark for LLM performance on Indian financial regulatory text
Who it's for: AI researchers and financial technology developers
Where to get it: https://github.com/rajveerpall/IndiaFinBench
Price: Free (open source)
  • IndiaFinBench addresses a gap in existing financial NLP benchmarks, which draw exclusively on Western financial corpora
  • The benchmark includes 406 expert-annotated question-answer pairs from 192 SEBI and RBI documents across four task types
  • Twelve models were evaluated under zero-shot conditions, with overall accuracy ranging from 70.4% to 89.7%
  • Numerical reasoning proved both the most challenging and the most discriminative task, with a 35.9 percentage-point spread across models
  • All evaluated models substantially outperformed a non-specialist human baseline of 60.0%
  • Annotation quality is validated through a model-based secondary pass and a human inter-annotator agreement evaluation
  • Bootstrap significance testing reveals three statistically distinct performance tiers among the evaluated models

What is IndiaFinBench

IndiaFinBench is an evaluation benchmark designed to assess large language model performance on Indian financial regulatory text. The benchmark addresses a significant gap in existing financial NLP evaluation tools that draw exclusively from Western financial corpora such as SEC filings and US earnings reports.

The benchmark contains 406 expert-annotated question-answer pairs sourced from 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI). These documents represent the primary regulatory frameworks governing India’s financial sector.

What is new vs previous benchmarks

IndiaFinBench introduces the first publicly available benchmark specifically for non-Western financial regulatory frameworks.

Compared feature by feature (previous financial benchmarks vs IndiaFinBench):

  • Geographic focus: Western financial corpora only vs the Indian regulatory framework
  • Document sources: SEC filings and US earnings reports vs SEBI and RBI regulatory documents
  • Task diversity: limited task types vs four distinct task types
  • Annotation validation: varies by benchmark vs a model-based secondary pass plus human agreement evaluation
  • Statistical rigor: basic accuracy reporting vs bootstrap significance testing with 10,000 resamples

How does IndiaFinBench work

IndiaFinBench evaluates models across four distinct task types that capture different aspects of financial regulatory comprehension.

  1. Regulatory interpretation: 174 items testing understanding of regulatory requirements and compliance frameworks
  2. Numerical reasoning: 92 items requiring mathematical calculations and quantitative analysis of financial data
  3. Contradiction detection: 62 items identifying inconsistencies within regulatory text
  4. Temporal reasoning: 78 items involving time-based regulatory requirements and deadlines

Annotation quality is validated through a model-based secondary pass achieving kappa=0.918 on contradiction detection tasks. A 60-item human inter-annotator agreement evaluation achieved kappa=0.611 with 76.7% overall agreement.
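For intuition, Cohen's kappa measures agreement between two annotators after discounting agreement expected by chance. The sketch below is not the benchmark's own code; the label values and function name are illustrative:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label lists of equal length."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items where the annotators match.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label distribution.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators labeling 10 contradiction-detection items.
ann1 = ["yes", "yes", "no", "no", "yes", "no", "no", "yes", "no", "no"]
ann2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "no", "no"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.583
```

Here the annotators agree on 8 of 10 items (0.8 raw agreement), but because both label "no" more often, chance alone predicts 0.52 agreement, so kappa lands well below the raw figure. The same correction is why IndiaFinBench's 76.7% raw agreement corresponds to a kappa of only 0.611.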

Benchmarks and evidence

Twelve models were evaluated under zero-shot conditions, revealing significant performance variations across different model architectures.

  • Gemini 2.5 Flash: 89.7% overall accuracy, Tier 1 (best evaluated model) [1]
  • Gemma 4 E4B: 70.4% overall accuracy, Tier 3 (weakest evaluated model) [1]
  • Non-specialist human baseline: 60.0%, below every evaluated model [1]
  • Numerical reasoning: 35.9 percentage-point spread, the most discriminative task [1]

Bootstrap significance testing with 10,000 resamples reveals three statistically distinct performance tiers among the evaluated models. Numerical reasoning emerged as the most discriminative task type with the largest performance spread.
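The paper's exact procedure is not reproduced in the sources, but a paired bootstrap over per-item correctness is the standard way to test whether one model's accuracy lead over another is statistically meaningful. A minimal sketch under that assumption (function name and data layout are illustrative):

```python
import random

def bootstrap_p(correct_a, correct_b, resamples=10_000, seed=0):
    """One-sided p-value for 'model A is more accurate than model B'.

    correct_a, correct_b: parallel lists of 0/1 per-item correctness,
    scored on the same benchmark items. Resamples items with replacement
    and counts how often A's lead disappears.
    """
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    losses = 0
    for _ in range(resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in idx)
        if diff <= 0:  # a resample where A fails to beat B
            losses += 1
    return losses / resamples

# Toy data: model A at 80% vs model B at 60% on 100 shared items.
correct_a = [1] * 80 + [0] * 20
correct_b = [1] * 60 + [0] * 40
print(bootstrap_p(correct_a, correct_b))  # near zero: the gap is unlikely to be chance
```

Grouping models into tiers then amounts to merging models whose pairwise p-values exceed the chosen significance threshold.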

Who should care

Builders

AI developers building financial applications for Indian markets need IndiaFinBench to evaluate model performance on regulatory compliance tasks. The benchmark provides standardized evaluation metrics for models processing SEBI and RBI documentation.

Enterprise

Financial institutions operating in India require models that accurately interpret regulatory requirements for compliance automation. IndiaFinBench enables systematic evaluation of LLM capabilities for regulatory text processing and risk assessment.

End users

Financial advisors and compliance professionals can use IndiaFinBench results to understand model limitations when processing Indian regulatory documents. The benchmark helps identify which models perform best for specific regulatory interpretation tasks.

Investors

Venture capital and fintech investors can use IndiaFinBench performance data to evaluate AI startups targeting Indian financial markets. The benchmark provides objective metrics for assessing regulatory AI capabilities.

How to use IndiaFinBench today

IndiaFinBench is immediately available through its open-source repository with complete evaluation code and model outputs.

  1. Clone the repository: git clone https://github.com/rajveerpall/IndiaFinBench
  2. Install dependencies from the requirements file
  3. Load your model using the provided evaluation framework
  4. Run evaluation across the four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
  5. Compare results against the baseline performance metrics for twelve evaluated models
  6. Use bootstrap significance testing code to determine statistical significance of performance differences
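Steps 4 and 5 can be sketched as follows, assuming the repository ships the QA pairs as a JSONL file. The file path, field names (task_type, question, answer), and exact-match scoring here are illustrative guesses, not the repository's documented schema:

```python
import json

def load_items(path):
    """Read benchmark items from a JSONL file, one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def accuracy_by_task(items, predict):
    """Score a predict(question) callable per task type via exact match."""
    totals, hits = {}, {}
    for item in items:
        task = item["task_type"]
        totals[task] = totals.get(task, 0) + 1
        if predict(item["question"]).strip() == item["answer"].strip():
            hits[task] = hits.get(task, 0) + 1
    return {task: hits.get(task, 0) / n for task, n in totals.items()}

# In-memory demo; a real run would call load_items() on the dataset file
# and wire predict() to an actual model API.
demo = [
    {"task_type": "numerical_reasoning", "question": "2 + 2?", "answer": "4"},
    {"task_type": "numerical_reasoning", "question": "3 + 3?", "answer": "6"},
    {"task_type": "temporal_reasoning", "question": "Filing deadline?", "answer": "30 days"},
]
answers = {"2 + 2?": "4", "3 + 3?": "7", "Filing deadline?": "30 days"}
print(accuracy_by_task(demo, lambda q: answers[q]))
```

Per-task breakdowns matter here because overall accuracy hides the numerical-reasoning spread that the benchmark found most discriminative.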

IndiaFinBench vs competitors

IndiaFinBench addresses geographic and regulatory gaps not covered by existing financial benchmarks.

  • IndiaFinBench: Indian regulatory framework; SEBI and RBI documents; four task types; model-based plus human validation
  • FinanceBench: US financial markets; SEC filings and earnings reports; question answering; validation method not yet disclosed
  • LawBench: general legal text; legal documents; multiple legal tasks; expert annotation

Risks, limits, and myths

  • Limited scope: Benchmark focuses only on SEBI and RBI documents, not covering other Indian financial regulators
  • Zero-shot evaluation: Models were not fine-tuned on Indian regulatory text, potentially underestimating specialized model performance
  • Language limitation: Evaluation conducted in English only, not covering regional language regulatory documents
  • Temporal coverage: Benchmark reflects regulatory frameworks as of the dataset creation date, not future regulatory changes
  • Human baseline: Non-specialist human performance may not reflect expert regulatory professional capabilities
  • Model selection: Twelve evaluated models may not represent the full spectrum of available LLM architectures

FAQ

What makes IndiaFinBench different from other financial AI benchmarks?

IndiaFinBench is the first publicly available benchmark specifically designed for Indian financial regulatory text, addressing gaps in existing benchmarks that focus exclusively on Western financial corpora.

How many question-answer pairs does IndiaFinBench contain?

IndiaFinBench contains 406 expert-annotated question-answer pairs sourced from 192 documents from SEBI and RBI.

What are the four task types in IndiaFinBench evaluation?

The four task types are regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items).

Which model performed best on IndiaFinBench?

Gemini 2.5 Flash achieved the highest accuracy at 89.7% under zero-shot evaluation conditions.

How was annotation quality validated in IndiaFinBench?

Annotation quality was validated through a model-based secondary pass (kappa=0.918 on contradiction detection) and a 60-item human inter-annotator agreement evaluation (kappa=0.611).

What was the human baseline performance on IndiaFinBench?

The non-specialist human baseline achieved 60.0% accuracy, which all twelve evaluated models substantially outperformed.

Which task type showed the largest performance variation across models?

Numerical reasoning was the most discriminative task, showing a 35.9 percentage-point spread across different models.

Is IndiaFinBench available for commercial use?

IndiaFinBench is available as open source through GitHub, though specific licensing terms are not yet disclosed in the provided sources.

How many statistical performance tiers were identified in the evaluation?

Bootstrap significance testing with 10,000 resamples revealed three statistically distinct performance tiers among the evaluated models.

What regulatory bodies are covered in IndiaFinBench documents?

IndiaFinBench draws from documents issued by the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).

Glossary

SEBI
Securities and Exchange Board of India, the regulatory authority for securities markets in India
RBI
Reserve Bank of India, the central banking institution and monetary authority of India
Zero-shot evaluation
Testing model performance on tasks without prior training or fine-tuning on similar examples
Bootstrap significance testing
Statistical method using resampling to determine if performance differences between models are statistically significant
Kappa coefficient
Statistical measure of inter-rater agreement that accounts for agreement occurring by chance
Contradiction detection
Task type requiring identification of inconsistent or conflicting statements within regulatory text
Temporal reasoning
Cognitive task involving understanding and processing time-based relationships and sequences

Visit the IndiaFinBench GitHub repository at https://github.com/rajveerpall/IndiaFinBench to access the dataset, evaluation code, and model outputs for your own LLM evaluation projects.

Sources

  1. IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text. arXiv:2604.19298v1.

Author

  • siego237

    Writes for FrontierWisdom on AI systems, automation, decentralized identity, and frontier infrastructure, with a focus on turning emerging technology into practical playbooks, implementation roadmaps, and monetization strategies for operators, builders, and consultants.

