Nsanku Benchmark: LLM Zero-Shot Translation for Ghanaian Languages

The Nsanku benchmark reveals current LLMs struggle with zero-shot translation for Ghanaian languages, with Gemini 2.5 Flash leading but lacking reliability for production use.

The Nsanku benchmark systematically evaluates the zero-shot machine translation capabilities of 19 large language models (LLMs) for 43 Ghanaian languages paired with English. While Gemini 2.5 Flash achieved the highest overall score, the study concludes that no LLM currently offers reliable, consistent translation performance for these low-resource languages, highlighting a significant gap for operators aiming to deploy such solutions at scale.

  • Nsanku is the most comprehensive benchmark to date for LLM zero-shot translation of Ghanaian languages, evaluating 19 models across 43 languages.
  • Gemini 2.5 Flash scored highest overall with an average of 26.88 (BLEU: 24.60, chrF: 29.16), followed by Claude Sonnet 4.5 and GPT-4.1.
  • Open-weight models lagged, with Kimi K2 Instruct 0905 leading its category at 20.87 average score.
  • A key finding is the lack of simultaneous high performance and consistency across models and languages, indicating unreliability for production use.
  • The benchmark introduces a publicly available, extensible evaluation infrastructure for African language NLP research.

What changed

The Nsanku study introduces a comprehensive evaluation framework designed specifically for low-resource African languages, a domain where LLM performance has been “poorly understood and largely unevaluated” until now. Previous LLM evaluations have focused mostly on well-resourced languages, where models typically demonstrate strong multilingual capabilities. Nsanku’s contribution is a systematic benchmark of 300 English-to-Ghanaian sentence pairs for each of 43 Ghanaian languages, sourced from the YouVersion Bible platform. Scoring this dataset with both BLEU and chrF, and analyzing results along accuracy and cross-language consistency dimensions, gives a granular view of LLM translation performance in a previously under-examined linguistic landscape. The study’s release marks a significant step toward establishing a baseline for African language NLP research.
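To make that setup concrete, here is a hypothetical record layout for one benchmark item, inferred from the description above; the field names are illustrative, not taken from the Nsanku release itself.

```python
# Hypothetical record layout for one Nsanku evaluation item, inferred from
# the description: 300 English-to-target sentence pairs per language, for
# each of 43 Ghanaian languages, sourced from the YouVersion Bible platform.
from dataclasses import dataclass

@dataclass(frozen=True)
class NsankuPair:
    language: str    # one of the 43 Ghanaian languages, e.g. "Siwu"
    source_en: str   # English source sentence
    reference: str   # gold translation in the target language

# A full evaluation set under this layout would hold 43 * 300 = 12,900 pairs.
```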

Benchmarks and evidence

The Nsanku benchmark evaluated 19 LLMs, both open-weight and proprietary, for their zero-shot translation performance from English to 43 Ghanaian languages. The evaluation used 300 sentence pairs per language, employing Bilingual Evaluation Understudy (BLEU) and Character n-gram F-Score (chrF) as primary metrics, alongside an average accuracy score and a cross-language consistency dimension.
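The reported overall averages are consistent with a simple mean of BLEU and chrF, e.g. (24.60 + 29.16) / 2 = 26.88 for Gemini 2.5 Flash. Here is a minimal scoring sketch using the sacrebleu library, assuming that averaging convention; the paper’s exact aggregation may differ.

```python
# Sketch: corpus-level scoring of one model on one language pair.
# The "average" assumes the mean-of-BLEU-and-chrF convention implied by
# the reported numbers, e.g. (24.60 + 29.16) / 2 = 26.88.
import sacrebleu

def score_language(hypotheses: list[str], references: list[str]) -> dict[str, float]:
    """Score model translations (e.g. 300 sentences) against gold references."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])  # n-gram precision based
    chrf = sacrebleu.corpus_chrf(hypotheses, [references])  # character n-gram F-score
    return {
        "bleu": bleu.score,
        "chrf": chrf.score,
        "average": (bleu.score + chrf.score) / 2,
    }
```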

The top-performing models were:

  • Gemini 2.5 Flash: Achieved the highest overall average score of 26.88 (BLEU: 24.60, chrF: 29.16).
  • Claude Sonnet 4.5: Ranked second with an average score of 24.87 (BLEU: 22.46, chrF: 27.28).
  • GPT-4.1: Placed third among proprietary models, scoring 23.20 (BLEU: 21.15, chrF: 25.24).

Among open-weight models, Kimi K2 Instruct 0905 led its category with an average score of 20.87.

The consistency analysis revealed a critical limitation: no model and no language simultaneously reached the “Leaders quadrant” of high performance and high consistency. This indicates that even the best-performing models do not reliably translate across all Ghanaian languages. Per-language performance varied significantly, with Siwu achieving the highest average score at 25.73, while Nkonya scored lowest at 11.65. These scores, particularly for low-resource languages, are generally considered low for production-grade translation systems.
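The quadrant framing can be pictured as a two-axis classification: mean score across languages on one axis, score dispersion on the other. The sketch below assumes standard deviation as the consistency measure and picks arbitrary cutoffs; the paper’s exact definitions may differ.

```python
# Sketch of a performance-vs-consistency quadrant classification. Using
# standard deviation as the consistency measure, and both cutoff values,
# are assumptions for illustration, not the paper's exact definitions.
import statistics

def classify(per_language_scores: list[float],
             perf_cutoff: float = 20.0,           # assumed performance threshold
             spread_cutoff: float = 5.0) -> str:  # assumed consistency threshold
    mean_score = statistics.mean(per_language_scores)
    spread = statistics.stdev(per_language_scores)  # lower spread = more consistent
    if mean_score >= perf_cutoff and spread <= spread_cutoff:
        return "Leaders"  # high performance AND high consistency
    if mean_score >= perf_cutoff:
        return "High performance, inconsistent"
    if spread <= spread_cutoff:
        return "Consistent, low performance"
    return "Laggards"
```

The study’s headline finding, restated in these terms, is that no model’s per-language scores land in the Leaders branch.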

Why it matters for operators

For operators building or deploying language technology solutions in Africa, the Nsanku benchmark delivers a stark, yet crucial, reality check: current LLMs are not a silver bullet for low-resource language translation. The headline numbers, while showing Gemini 2.5 Flash as a frontrunner, mask a deeper issue of inconsistency. An average score of 26.88, even for the best model, is far from the quality required for reliable production systems, especially when compared to benchmarks for well-resourced languages. The finding that “no model and no language reached the Leaders quadrant of high performance and high consistency simultaneously” means that even if a model performs adequately for one Ghanaian language, it’s likely to fail for another, or even for different sentence structures within the same language.

This implies that a simple API call to a leading LLM for zero-shot translation into Ghanaian languages is not a viable strategy for any operator requiring dependable output. Instead, operators must anticipate significant post-processing, human-in-the-loop validation, and potentially substantial fine-tuning with domain-specific data. For founders eyeing African markets, this means allocating considerably more resources to data acquisition, linguistic expertise, and quality assurance than they might for English or European language applications.

Furthermore, the variability between languages (Siwu at 25.73 vs. Nkonya at 11.65) suggests that a one-size-fits-all approach is doomed to fail. Operators should prioritize targeted investment in data collection and model adaptation for the specific languages crucial to their target markets, rather than relying on generalized LLM capabilities. This benchmark underscores the need for localized, data-driven strategies over broad, off-the-shelf LLM deployments for these linguistic contexts.
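For reference, the zero-shot setup the benchmark evaluates amounts to a single prompt per sentence, with no examples and no fine-tuning. A minimal sketch using the OpenAI Python SDK follows; the model name and prompt wording are illustrative, not the paper’s exact protocol.

```python
# Minimal zero-shot translation sketch. The prompt wording and model choice
# are illustrative; Nsanku's exact prompting protocol may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def zero_shot_translate(sentence: str, target_language: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1",  # one of the proprietary models the benchmark covers
        messages=[{
            "role": "user",
            "content": (
                f"Translate the following English sentence into {target_language}. "
                f"Reply with the translation only.\n\n{sentence}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()
```

At the score levels Nsanku reports, every output of a call like this would still need human review before reaching users, which is exactly the operational overhead described above.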

Risks and open questions

  • Data Scarcity for Fine-tuning: While the benchmark highlights the need for fine-tuning, acquiring sufficient high-quality parallel data for 43 low-resource Ghanaian languages remains a significant challenge. Operators must consider the cost and feasibility of creating such datasets.
  • Generalization Beyond Religious Texts: The evaluation sentences were sourced from the YouVersion Bible platform. This raises questions about how well these models would perform on secular, technical, or conversational text, which often have different linguistic characteristics and vocabulary.
  • Ethical Implications of Low-Quality Translation: Deploying unreliable translation systems for critical applications (e.g., healthcare, legal, education) in low-resource language communities could lead to misinformation, misunderstanding, and exacerbate existing digital divides.
  • Cost of Deployment: Even if performance improves, the inference costs of large proprietary LLMs for extensive translation tasks across many languages could be prohibitive for organizations operating in regions with limited resources.
  • Community Engagement: The Nsanku benchmark is publicly available and extensible. An open question is how effectively the NLP research community, particularly within Africa, will adopt and contribute to this infrastructure to drive further improvements.

Author

  • Siegfried Kamgo

    Founder and editorial lead at FrontierWisdom. Engineer turned operator-analyst writing about AI systems, automation infrastructure, decentralised stacks, and the practical economics of frontier technology. Focus: turning fast-moving releases into durable, implementation-ready playbooks.
