IndiaFinBench is the first publicly available evaluation benchmark designed to assess large language model performance on Indian financial regulatory text. It comprises 406 expert-annotated question-answer pairs drawn from SEBI and RBI documents, spanning four specialized task types.
| Attribute | Details |
|---|---|
| Released by | Not yet disclosed |
| Release date | Not yet disclosed |
| What it is | First LLM evaluation benchmark for Indian financial regulatory text |
| Who it is for | AI researchers and financial technology developers |
| Where to get it | GitHub repository at rajveerpall/IndiaFinBench |
| Price | Free |
- IndiaFinBench contains 406 expert-annotated question-answer pairs from 192 SEBI and RBI documents
- The benchmark covers four task types: regulatory interpretation, numerical reasoning, contradiction detection, and temporal reasoning
- Twelve models were evaluated with accuracy ranging from 70.4% to 89.7% under zero-shot conditions
- Numerical reasoning proved most discriminative with a 35.9 percentage-point performance spread across models
- All evaluated models substantially outperformed a non-specialist human baseline of 60.0%
- IndiaFinBench addresses a significant gap in financial NLP benchmarks by focusing on non-Western regulatory frameworks
- Annotation quality is high: a model-based secondary validation pass reached kappa = 0.918 on contradiction detection, and human inter-annotator agreement reached kappa = 0.611
- Gemini 2.5 Flash achieved the highest accuracy at 89.7%, while Gemma 4 E4B scored lowest at 70.4%
- Bootstrap significance testing revealed three statistically distinct performance tiers among the twelve evaluated models
- The complete dataset, evaluation code, and model outputs are publicly available for research use
What is IndiaFinBench
IndiaFinBench is the first publicly available evaluation benchmark specifically designed to assess large language model performance on Indian financial regulatory text. The benchmark addresses a critical gap in existing financial NLP evaluation tools, which have historically drawn exclusively from Western financial corpora including SEC filings, US earnings reports, and English-language financial news.
The benchmark contains 406 expert-annotated question-answer pairs sourced from 192 official documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI). These regulatory documents represent the primary financial oversight bodies in India and cover diverse aspects of financial regulation, compliance requirements, and policy frameworks.
What is new vs the previous version
IndiaFinBench represents an entirely new benchmark category, as no previous evaluation framework has focused on Indian financial regulatory text.
| Aspect | Previous Financial Benchmarks | IndiaFinBench |
|---|---|---|
| Geographic Focus | Exclusively Western financial corpora | Indian regulatory frameworks |
| Source Documents | SEC filings, US earnings reports | SEBI and RBI regulatory documents |
| Task Coverage | General financial understanding | Four specialized regulatory tasks |
| Annotation Quality | Variable validation methods | Model-based secondary pass with kappa=0.918 |
| Public Availability | Limited open access | Complete dataset and code on GitHub |
How does IndiaFinBench work
IndiaFinBench operates through a structured four-task evaluation framework designed to assess different aspects of regulatory text comprehension.
- Regulatory Interpretation (174 items): Models must demonstrate understanding of complex regulatory language and policy implications from SEBI and RBI documents.
- Numerical Reasoning (92 items): Tasks require mathematical computation and quantitative analysis of financial regulations and compliance requirements.
- Contradiction Detection (62 items): Models identify inconsistencies or conflicting statements within regulatory text passages.
- Temporal Reasoning (78 items): Evaluation focuses on understanding time-dependent regulatory changes and chronological policy relationships.
The benchmark employs zero-shot evaluation conditions, meaning models receive no task-specific training examples before assessment. Annotation quality validation includes both model-based secondary passes and human inter-annotator agreement evaluation across 60 items.
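The zero-shot protocol is straightforward to reproduce. Below is a minimal sketch of such an evaluation loop, assuming a JSON file of task/question/answer records and a placeholder `query_model` function; the repository's actual data schema and evaluation harness may differ.

```python
import json
from collections import defaultdict

def query_model(prompt: str) -> str:
    """Placeholder for a call to the LLM under evaluation."""
    raise NotImplementedError

def evaluate(path: str = "indiafinbench.json") -> dict:
    # Assumed layout: a list of {"task": ..., "question": ..., "answer": ...}.
    with open(path, encoding="utf-8") as f:
        items = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        # Zero-shot: the question is sent as-is, with no in-context examples.
        prediction = query_model(item["question"])
        total[item["task"]] += 1
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct[item["task"]] += 1

    # Per-task accuracy across the four categories.
    return {task: correct[task] / total[task] for task in total}
```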
Benchmarks and evidence
IndiaFinBench evaluation results demonstrate significant performance variation across twelve tested models under zero-shot conditions.
| Model | Overall Accuracy | Performance Tier | Source |
|---|---|---|---|
| Gemini 2.5 Flash | 89.7% | Tier 1 | IndiaFinBench paper |
| Gemma 4 E4B | 70.4% | Tier 3 | IndiaFinBench paper |
| Non-specialist human | 60.0% | Baseline | IndiaFinBench paper |

Bootstrap significance testing with 10,000 resamples revealed three statistically distinct performance tiers among the twelve evaluated models. Numerical reasoning was the most discriminative task category, with a 35.9 percentage-point spread between the top and bottom performers.
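To reproduce this kind of tier analysis on your own results, a paired bootstrap over per-item correctness is enough. The sketch below uses synthetic correctness vectors; only the 10,000-resample count and the 406-item size follow the paper.

```python
import numpy as np

def bootstrap_p(scores_a: np.ndarray, scores_b: np.ndarray,
                n_resamples: int = 10_000, seed: int = 0) -> float:
    """Fraction of resamples in which model A fails to beat model B.

    A small value (e.g. < 0.05) indicates the accuracy gap is statistically
    significant, placing the two models in different performance tiers.
    """
    rng = np.random.default_rng(seed)
    n = len(scores_a)
    # Resample item indices with replacement, keeping the pairing intact.
    idx = rng.integers(0, n, size=(n_resamples, n))
    gaps = scores_a[idx].mean(axis=1) - scores_b[idx].mean(axis=1)
    return float((gaps <= 0).mean())

# Synthetic per-item correctness (0/1) for two models over 406 items.
rng = np.random.default_rng(1)
model_a = (rng.random(406) < 0.897).astype(float)
model_b = (rng.random(406) < 0.704).astype(float)
print(bootstrap_p(model_a, model_b))  # near 0.0: clearly distinct tiers
```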
Who should care
Builders
AI developers working on financial applications can use IndiaFinBench to evaluate model performance on regulatory compliance tasks. The benchmark provides standardized metrics for assessing LLM capabilities in non-Western financial contexts, enabling more robust model selection and fine-tuning strategies.
Enterprise
Financial institutions operating in India can leverage IndiaFinBench to assess AI systems for regulatory compliance automation. Banks, investment firms, and fintech companies can evaluate whether their LLM implementations meet accuracy requirements for processing SEBI and RBI documentation.
End users
Researchers and academics studying financial NLP can access a comprehensive evaluation framework for Indian regulatory text. The benchmark enables comparative analysis of model performance across different regulatory interpretation tasks.
Investors
Investment firms can use IndiaFinBench results to evaluate AI-powered compliance and regulatory analysis tools. The benchmark provides objective performance metrics for assessing fintech solutions targeting Indian financial markets.
How to use IndiaFinBench today
IndiaFinBench is immediately accessible through its GitHub repository for research and evaluation purposes.
- Access the repository: Navigate to https://github.com/rajveerpall/IndiaFinBench to download the complete dataset and evaluation code.
- Load the benchmark data: The repository contains 406 question-answer pairs organized by task type with corresponding SEBI and RBI source documents.
- Implement evaluation protocol: Use the provided evaluation code to assess your model under zero-shot conditions across all four task categories.
- Compare results: Benchmark your model's performance against the twelve baseline models, whose accuracy scores range from 70.4% to 89.7% (see the comparison sketch after this list).
- Analyze task-specific performance: Focus on numerical reasoning tasks for the most discriminative evaluation of model capabilities.
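As a concrete version of the comparison step, the hypothetical snippet below weights per-task scores by item count to produce an overall accuracy and checks it against the published range. The `my_scores` values are invented; the item counts and the 70.4%-89.7% range come from the paper.

```python
# Item counts per task, as reported for IndiaFinBench (174+92+62+78 = 406).
ITEM_COUNTS = {
    "regulatory_interpretation": 174,
    "numerical_reasoning": 92,
    "contradiction_detection": 62,
    "temporal_reasoning": 78,
}

# Hypothetical per-task accuracies for the model being benchmarked.
my_scores = {
    "regulatory_interpretation": 0.81,
    "numerical_reasoning": 0.66,
    "contradiction_detection": 0.85,
    "temporal_reasoning": 0.78,
}

# Weight each task by its item count to get an overall accuracy.
overall = (sum(my_scores[t] * ITEM_COUNTS[t] for t in ITEM_COUNTS)
           / sum(ITEM_COUNTS.values()))
low, high = 0.704, 0.897  # published lowest and highest overall accuracy
print(f"Overall accuracy: {overall:.1%} (baselines span {low:.1%} to {high:.1%})")
```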
IndiaFinBench vs competitors
IndiaFinBench occupies a unique position in the financial NLP benchmark landscape by focusing specifically on Indian regulatory frameworks.
| Benchmark | Geographic Focus | Task Types | Document Sources | Question Count |
|---|---|---|---|---|
| IndiaFinBench | Indian regulatory frameworks | 4 specialized tasks | SEBI and RBI documents | 406 questions |
| FinanceBench | Western financial markets | General financial QA | SEC filings, earnings reports | Not yet disclosed |
| LawBench | General legal domains | Legal reasoning tasks | Various legal documents | Not yet disclosed |
Risks, limits, and myths
- Geographic limitation: The benchmark focuses exclusively on Indian regulatory frameworks and may not generalize to other financial jurisdictions.
- Language constraint: All evaluation materials are in English, potentially missing regional language regulatory documents used in Indian financial contexts.
- Temporal scope: The benchmark reflects regulatory frameworks as of the document collection date and may not capture recent policy changes.
- Task coverage: Four task types may not encompass all aspects of regulatory text comprehension required in real-world financial applications.
- Human baseline limitation: The 60.0% non-specialist human baseline may not represent expert-level human performance on these tasks.
- Model selection bias: Evaluation limited to twelve models may not represent the full spectrum of available LLM capabilities.
FAQ
What makes IndiaFinBench different from other financial AI benchmarks?
IndiaFinBench is the first evaluation benchmark specifically designed for Indian financial regulatory text, addressing a gap in existing benchmarks that focus exclusively on Western financial corpora like SEC filings and US earnings reports.
How many questions does IndiaFinBench contain?
IndiaFinBench contains 406 expert-annotated question-answer pairs sourced from 192 documents from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI).
What types of tasks does IndiaFinBench evaluate?
The benchmark evaluates four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items).
Which AI model performed best on IndiaFinBench?
Gemini 2.5 Flash achieved the highest accuracy at 89.7%, while Gemma 4 E4B scored lowest at 70.4% among the twelve evaluated models under zero-shot conditions.
How reliable is the annotation quality in IndiaFinBench?
Annotation quality is validated through model-based secondary passes achieving kappa=0.918 on contradiction detection and human inter-annotator agreement evaluation with kappa=0.611 and 76.7% overall agreement.
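For intuition about what these kappa values measure, here is a minimal Cohen's kappa computation, which corrects raw agreement for chance; the two label vectors are illustrative, not the benchmark's annotation data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum((counts_a[k] / n) * (counts_b[k] / n)
                   for k in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

# 60 items mirrors the paper's human-agreement subset; labels are made up.
annotator_a = [1] * 40 + [0] * 20
annotator_b = [1] * 36 + [0] * 4 + [0] * 16 + [1] * 4
print(round(cohens_kappa(annotator_a, annotator_b), 2))  # 0.7
```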
Can I access IndiaFinBench for free?
Yes, the complete dataset, evaluation code, and all model outputs are freely available at the GitHub repository https://github.com/rajveerpall/IndiaFinBench.
What is the most challenging task type in IndiaFinBench?
Numerical reasoning showed the widest performance spread, 35.9 percentage points across the twelve models, making it the most discriminative task category; that wide gap also suggests it is the hardest task for weaker models.
How does human performance compare to AI models on IndiaFinBench?
All twelve evaluated AI models substantially outperformed the non-specialist human baseline of 60.0%, with the lowest-performing model (Gemma 4 E4B) achieving 70.4% accuracy.
What statistical methods validate IndiaFinBench results?
Bootstrap significance testing with 10,000 resamples revealed three statistically distinct performance tiers among the evaluated models, ensuring robust statistical validation of results.
Which regulatory bodies provide source documents for IndiaFinBench?
IndiaFinBench draws from documents issued by the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), representing the primary financial regulatory authorities in India.
Glossary
- SEBI: Securities and Exchange Board of India, the primary regulatory authority for securities markets in India
- RBI: Reserve Bank of India, the central banking institution and monetary authority of India
- Zero-shot evaluation: Testing AI models on tasks without providing task-specific training examples or fine-tuning
- Kappa score: Statistical measure of inter-rater agreement that accounts for agreement occurring by chance
- Bootstrap significance testing: Statistical method using resampling to determine whether observed differences between groups are statistically significant
- Numerical reasoning: AI task requiring mathematical computation and quantitative analysis of text-based problems
- Contradiction detection: NLP task involving identification of inconsistent or conflicting statements within text passages
- Temporal reasoning: AI capability to understand time-dependent relationships and chronological sequences in text
- A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains. Advances in large language models (LLMs) have led to strong performance in reasoning and planning. https://arxiv.org/html/2508.15832