Table of Contents
- AI Translation Quality: Why 60% of Machine Translations Have Hallucination Risks
- The 60% Hallucination Problem
- What Are Translation Hallucinations?
- Why 60% Is the Right Number
- Real-World Wrong Translation Examples
- UI Errors That Shipped to Production
- Semantic Drift in Marketing Content
- Human-in-the-Loop as the Solution
- Why Pure Automation Fails
- The HITL Workflow
- HITL Economics
- Better i18n's Approach: AI Suggest + Human Review + Quality Scoring
- AI-Assisted Translation Suggestions
- Quality Scoring Dashboard
- Review Workflow
- MCP Integration for Review
- Quality Metrics: BLEU, TER, and Human Evaluation
- BLEU (Bilingual Evaluation Understudy)
- TER (Translation Edit Rate)
- Human Evaluation: MQM (Multidimensional Quality Metrics)
- Choosing the Right Metric
- Best Practices for Safe AI Translation Usage
- 1. Never Auto-Publish AI Translations
- 2. Segment by Risk Level
- 3. Maintain a Translation Glossary
- 4. Use Quality Thresholds
- 5. Monitor Hallucination Patterns
- 6. Implement Regression Testing
- 7. Train Your Reviewers
- 8. Use MCP for Workflow Automation
- Try Better i18n's AI-Assisted Translation Workflow
- Start Improving Your Translation Quality
AI Translation Quality: Why 60% of Machine Translations Have Hallucination Risks
Artificial intelligence has transformed how we approach translation. Neural machine translation (NMT) systems like Google Translate, DeepL, and GPT-based models can now produce fluent, natural-sounding text in dozens of languages. Adoption is surging: the global machine translation market is projected to reach $4.2 billion by 2027, with AI-powered solutions leading the charge.
But there's a problem hiding beneath the surface. Despite impressive fluency, a significant portion of machine translations contain what researchers call "hallucinations" — outputs that sound correct but subtly distort, omit, or fabricate meaning. According to industry analyses and academic research, approximately 60% of machine-translated content carries some level of hallucination risk, ranging from minor semantic drift to dangerous factual errors.
For teams shipping software to global users, this isn't an academic concern. A hallucinated translation in your checkout flow, medical interface, or legal documentation can cost real money, erode user trust, or create liability. This article examines why AI translations hallucinate, how to detect these errors, and what workflows actually solve the problem.
The 60% Hallucination Problem
What Are Translation Hallucinations?
Translation hallucinations occur when a machine translation system generates output that is fluent but unfaithful to the source text. Unlike obvious errors (garbled syntax, untranslated words), hallucinations are dangerous precisely because they look correct.
Researchers categorize translation hallucinations into three types:
1. Semantic Drift: The translation gradually shifts meaning, producing a sentence that is grammatically correct but says something different from the source.
- Source (EN): "The update improves battery performance by 20%."
- Hallucinated (DE): "Das Update verbessert die Akkuleistung um 30%." (says 30% instead of 20%)
2. Omission Hallucinations: The model silently drops important information from the source text.
- Source (EN): "Cancel within 14 days for a full refund. Terms and conditions apply."
- Hallucinated (FR): "Annulez sous 14 jours pour un remboursement complet." (drops "Terms and conditions apply")
3. Fabrication Hallucinations: The model adds information that doesn't exist in the source text.
- Source (EN): "Our platform supports 40+ languages."
- Hallucinated (JA): "当社のプラットフォームは40以上の言語をサポートし、リアルタイム翻訳を提供します。" (adds "and provides real-time translation" — a feature that may not exist)
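Of the three types, numeric drift is the easiest to catch mechanically: digits should survive translation unchanged. A minimal TypeScript sketch of such a check (all names here are illustrative, not part of any real API):

```typescript
// Sketch: flag semantic drift on numbers by comparing the numeric
// tokens in source and translation. A mismatch (20% -> 30%) is one of
// the few hallucination types that is cheap to detect automatically.

interface NumberCheckResult {
  ok: boolean;
  sourceNumbers: string[];
  targetNumbers: string[];
}

function checkNumberConsistency(source: string, target: string): NumberCheckResult {
  // Extract digit sequences. Locale formatting (1,000 vs 1.000) is
  // ignored here; a production check would normalize it first.
  const extract = (text: string): string[] => text.match(/\d+(?:[.,]\d+)?/g) ?? [];
  const sourceNumbers = extract(source);
  const targetNumbers = extract(target);
  const ok =
    sourceNumbers.length === targetNumbers.length &&
    sourceNumbers.every((n, i) => n === targetNumbers[i]);
  return { ok, sourceNumbers, targetNumbers };
}
```

Run against the semantic-drift example above, the 20-versus-30 mismatch is flagged immediately, with no reference translation required.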
Why 60% Is the Right Number
The 60% figure comes from aggregating findings across multiple studies and industry reports:
- University of Maryland (2023): Found that up to 70% of translations from LLM-based systems contained at least one type of hallucination when translating low-resource language pairs.
- GALA Industry Report (2024): Surveyed 500+ localization professionals; 58% reported encountering AI hallucinations in production translations within the past 12 months.
- Meta AI Research (2023): Their study on hallucinations in NMT systems found that even high-resource language pairs (EN-DE, EN-FR) exhibited hallucination rates of 15-25%, while low-resource pairs exceeded 60%.
- Intento State of MT (2025): Benchmarked 15 MT engines across 30 language pairs; found that 62% of segments had at least one quality issue when evaluated by human linguists.
The rate varies significantly by:
- Language pair: High-resource pairs (EN-DE) have lower rates than low-resource pairs (EN-Swahili)
- Domain: General content has lower rates than specialized domains (legal, medical, technical)
- Content type: Structured UI strings hallucinate less than long-form prose
- Model architecture: Dedicated NMT models hallucinate less than general-purpose LLMs used for translation
Real-World Wrong Translation Examples
UI Errors That Shipped to Production
These examples illustrate how translation hallucinations create real user-facing problems:
Example 1: E-commerce Checkout (EN to PT-BR)
| Element | Source (EN) | Expected (PT-BR) | Hallucinated (PT-BR) |
|---|---|---|---|
| Button text | "Place Order" | "Finalizar Pedido" | "Fazer Pedido Agora" |
| Error message | "Card declined" | "Cartão recusado" | "Cartão não aceito neste momento" |
| Disclaimer | "Non-refundable after 24h" | "Não reembolsável após 24h" | "Reembolsável em até 24h" |
The last example is catastrophic: the hallucination reversed the meaning, telling users they can get a refund within 24 hours when the policy says they cannot.
Example 2: SaaS Settings Panel (EN to KO)
- Source: "Delete all data permanently"
- Expected: "모든 데이터를 영구적으로 삭제"
- Hallucinated: "모든 데이터를 초기화" (means "Reset all data" — a very different action)
Users clicking "reset" expect to restore defaults; users clicking "delete permanently" expect data destruction. Because the mislabeled button actually deletes data, users who merely wanted to reset their settings would suffer irreversible data loss.
Example 3: Healthcare App (EN to AR)
- Source: "Take 2 tablets every 8 hours"
- Expected: "تناول قرصين كل 8 ساعات"
- Hallucinated: "تناول قرصين كل 8 أيام" (says "every 8 days" instead of "every 8 hours")
In medical contexts, this type of hallucination is not just a bug — it's a safety hazard.
Semantic Drift in Marketing Content
Long-form content is especially susceptible to semantic drift, where the translation gradually diverges from the source meaning:
Source (EN):
"Our free plan includes 5,000 translation keys, unlimited languages, and community support. Upgrade to Pro for priority support and advanced analytics."
Hallucinated (DE):
"Unser kostenloser Plan umfasst 5.000 Übersetzungsschlüssel, unbegrenzte Sprachen und Premium-Support. Wechseln Sie zu Pro für erweiterte Analysen und API-Zugang."
Three hallucinations in one paragraph:
- "community support" became "Premium-Support" (upgrade)
- "priority support" was dropped entirely
- "API-Zugang" (API access) was fabricated
Human-in-the-Loop as the Solution
The most effective approach to AI translation quality isn't to abandon machine translation — it's to combine AI speed with human judgment. This is the human-in-the-loop (HITL) model.
Why Pure Automation Fails
Fully automated translation pipelines (source text in, published translation out) fail because:
- No quality gate: Hallucinations pass through undetected
- No context awareness: Machines can't verify translations against product knowledge, brand voice, or regulatory requirements
- No accountability: When errors reach production, there's no review trail to understand what went wrong
- Compounding errors: One hallucinated term in a glossary propagates to thousands of translations
The HITL Workflow
A well-designed human-in-the-loop workflow has four stages:
Source Text
|
v
[AI Translation] --- generates draft translations
|
v
[Quality Scoring] --- automated checks flag potential issues
|
v
[Human Review] --- linguists verify flagged segments
|
v
[Publication] --- approved translations go live
Stage 1: AI Translation. Machine translation generates initial drafts. This is where AI adds the most value — producing a first pass in seconds instead of hours.
Stage 2: Quality Scoring. Automated quality checks identify potential issues:
- Terminology consistency against glossaries
- Number and date format verification
- Length constraints for UI elements
- Formality level matching
- Known hallucination pattern detection
Stage 3: Human Review. Professional linguists or bilingual team members review flagged segments. The key insight: humans don't need to review everything — only the segments that automated checks flag as risky.
Stage 4: Publication. Only translations that pass both automated and human review gates are published to production.
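The four stages above can be sketched as a typed pipeline. This is a minimal illustration under assumed names, not Better i18n's actual data model:

```typescript
// Sketch of the four-stage HITL pipeline as a typed state machine.
// All names are hypothetical; a real TMS would persist these states.

type Stage = "draft" | "scored" | "in_review" | "published" | "rejected";

interface Segment {
  key: string;
  source: string;
  translation: string;
  stage: Stage;
  qualityScore?: number; // 0-100, set during scoring
}

// Stage 2: automated checks produce a score.
function score(seg: Segment, check: (s: Segment) => number): Segment {
  return { ...seg, qualityScore: check(seg), stage: "scored" };
}

// Stage 3: only segments below the threshold need human eyes.
function route(seg: Segment, threshold = 80): Segment {
  if (seg.qualityScore === undefined) throw new Error("score first");
  return seg.qualityScore < threshold ? { ...seg, stage: "in_review" } : seg;
}

// Stage 4: a segment routed to review may only go live with approval.
function publish(seg: Segment, humanApproved: boolean): Segment {
  if (seg.stage === "in_review" && !humanApproved) {
    return { ...seg, stage: "rejected" };
  }
  return { ...seg, stage: "published" };
}
```

The design point is that publication is a distinct, gated transition: nothing reaches "published" without passing through scoring, and flagged segments additionally require an explicit human approval flag.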
HITL Economics
The common objection to human-in-the-loop is cost. But the math actually favors HITL:
| Approach | Cost per 1000 words | Quality | Time |
|---|---|---|---|
| Pure human translation | $80-120 | High | 2-3 days |
| Pure machine translation | $0-5 | Variable (60% risk) | Minutes |
| AI + Human review (HITL) | $15-30 | High | 2-4 hours |
HITL achieves near-human quality at 70-80% lower cost than pure human translation, while being dramatically more reliable than pure automation.
Better i18n's Approach: AI Suggest + Human Review + Quality Scoring
Better i18n implements the HITL model as a first-class feature of the platform, not a bolt-on workflow. Here's how it works:
AI-Assisted Translation Suggestions
When you create a new translation key or request translations for missing languages, Better i18n can generate AI-powered suggestions:
Developer (via MCP): "Add translations for 'checkout.success_message'
in German, French, and Japanese"
Better i18n:
- EN (source): "Your order has been confirmed!"
- DE (suggestion): "Ihre Bestellung wurde bestätigt!"
- FR (suggestion): "Votre commande a été confirmée !"
- JA (suggestion): "ご注文が確定しました!"
Status: pending_review
The suggestions are saved as drafts, never published directly. They enter the review queue where human reviewers can approve, edit, or reject them.
Quality Scoring Dashboard
Every translation in Better i18n receives a quality score based on multiple automated checks:
- Terminology consistency: Does the translation use approved glossary terms?
- Placeholder integrity: Are all {variables} preserved from the source?
- Length ratio: Is the translation within expected length bounds for the language?
- Formality matching: Does the formality level match the project setting?
- Completeness: Are any source segments omitted?
- Known patterns: Does the translation match any known hallucination patterns?
Scores are displayed per-key and per-language, allowing reviewers to prioritize their time on the lowest-scoring translations.
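Of these checks, placeholder integrity is the most mechanical. A minimal sketch, assuming simple {variable} placeholders rather than full ICU MessageFormat (which needs a real parser):

```typescript
// Sketch of the placeholder-integrity check: every {variable} in the
// source must survive the translation untouched. The regex is
// illustrative and only handles flat, single-brace placeholders.

function placeholdersIntact(source: string, translation: string): boolean {
  const grab = (s: string) => (s.match(/\{[^}]+\}/g) ?? []).sort();
  const a = grab(source);
  const b = grab(translation);
  return a.length === b.length && a.every((p, i) => p === b[i]);
}
```

Note that sorting before comparison deliberately allows placeholders to be reordered, since many languages require a different word order than the source.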
Review Workflow
Better i18n's review workflow supports multiple levels of approval:
- Self-review: The translator marks their own work as reviewed
- Peer review: Another team member verifies the translation
- Expert review: A subject-matter expert validates domain-specific terminology
- Auto-approve: High-scoring translations from trusted sources can be auto-approved based on configurable thresholds
MCP Integration for Review
The entire review workflow is accessible via MCP tools, enabling AI agents to participate in (but not bypass) the review process:
Developer: "Show me pending German translations that scored below 80"
Agent (via MCP): Found 7 German translations below quality threshold:
1. checkout.terms_disclaimer (score: 45) — length mismatch
2. auth.mfa.setup_instructions (score: 62) — missing placeholder
3. settings.danger_zone.delete_warning (score: 58) — sentiment reversal detected
...
The agent surfaces issues, but a human makes the final decision.
Quality Metrics: BLEU, TER, and Human Evaluation
Understanding translation quality metrics is essential for teams implementing AI translation workflows. Here are the key metrics and when to use each:
BLEU (Bilingual Evaluation Understudy)
What it measures: How closely a machine translation matches one or more human reference translations, based on n-gram overlap.
Scale: 0 to 100 (higher is better)
| Score Range | Interpretation |
|---|---|
| 50+ | Excellent — often indistinguishable from human |
| 35-50 | Good — understandable, minor errors |
| 20-35 | Acceptable — meaning preserved, some issues |
| Below 20 | Poor — significant quality problems |
Strengths:
- Fast and automated
- Good for comparing MT systems against each other
- Well-established benchmark (used in MT research since 2002)
Weaknesses:
- Requires reference translations (expensive to create)
- Penalizes valid alternative translations
- Doesn't capture semantic accuracy well
- A hallucination that uses similar words to the reference can score high
When to use: Comparing MT engine performance across language pairs, tracking quality trends over time.
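To make the n-gram idea concrete, here is a deliberately simplified sentence-level BLEU: unigrams and bigrams only, clipped precision, brevity penalty. Real BLEU is corpus-level, uses up to 4-grams and smoothing; use an established implementation such as sacrebleu in practice, not this sketch:

```typescript
// Count n-grams of a token sequence.
function ngrams(tokens: string[], n: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i + n <= tokens.length; i++) {
    const g = tokens.slice(i, i + n).join(" ");
    counts.set(g, (counts.get(g) ?? 0) + 1);
  }
  return counts;
}

// Simplified sentence-level BLEU on a 0-100 scale.
function bleu(candidate: string, reference: string): number {
  const cand = candidate.toLowerCase().split(/\s+/).filter(Boolean);
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const precisions: number[] = [];
  for (const n of [1, 2]) {
    const c = ngrams(cand, n);
    const r = ngrams(ref, n);
    let matched = 0;
    let total = 0;
    for (const [g, count] of c) {
      total += count;
      matched += Math.min(count, r.get(g) ?? 0); // clipped: can't match more than the reference has
    }
    precisions.push(total === 0 ? 0 : matched / total);
  }
  if (precisions.some((p) => p === 0)) return 0;
  const geoMean = Math.exp(precisions.reduce((s, p) => s + Math.log(p), 0) / precisions.length);
  // Brevity penalty punishes candidates shorter than the reference.
  const bp = cand.length >= ref.length ? 1 : Math.exp(1 - ref.length / cand.length);
  return 100 * bp * geoMean;
}
```

Even this toy version exposes BLEU's blind spot: a hallucination that reuses the reference's vocabulary (say, swapping one number) still shares most n-grams and scores high.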
TER (Translation Edit Rate)
What it measures: The minimum number of edits (insertions, deletions, substitutions, shifts) needed to transform the MT output into the reference translation.
Scale: 0 to infinity (lower is better; 0 means perfect match)
| Score Range | Interpretation |
|---|---|
| Below 0.2 | Excellent — minimal editing needed |
| 0.2-0.4 | Good — light post-editing |
| 0.4-0.6 | Acceptable — moderate post-editing |
| Above 0.6 | Poor — may be faster to translate from scratch |
Strengths:
- Intuitive — directly measures editing effort
- Better at capturing word order issues than BLEU
- Useful for estimating post-editing cost
Weaknesses:
- Also requires reference translations
- Sensitive to reference translation style
- Doesn't account for severity of errors
When to use: Estimating post-editing costs, measuring translator productivity gains from MT.
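A simplified TER can be computed as word-level edit distance divided by reference length. The real metric also models block shifts, which this sketch over-counts as a deletion plus an insertion:

```typescript
// Simplified TER sketch: word-level Levenshtein distance over tokens,
// normalized by reference length. Lower is better; 0 is a perfect match.

function ter(hypothesis: string, reference: string): number {
  const hyp = hypothesis.split(/\s+/).filter(Boolean);
  const ref = reference.split(/\s+/).filter(Boolean);
  // Classic dynamic-programming edit distance over word tokens.
  const d: number[][] = Array.from({ length: hyp.length + 1 }, (_, i) =>
    Array.from({ length: ref.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= hyp.length; i++) {
    for (let j = 1; j <= ref.length; j++) {
      const sub = hyp[i - 1] === ref[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub);
    }
  }
  return ref.length === 0 ? 0 : d[hyp.length][ref.length] / ref.length;
}
```

The normalization is why TER maps so directly onto post-editing cost: a score of 0.33 on a three-word segment means one word had to change.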
Human Evaluation: MQM (Multidimensional Quality Metrics)
What it measures: Human evaluators annotate errors by type and severity using a standardized taxonomy.
Error categories:
- Accuracy: mistranslation, omission, addition, untranslated text
- Fluency: grammar, spelling, punctuation, register
- Terminology: incorrect terms, inconsistent terminology
- Style: awkward phrasing, unnatural word choice
- Locale conventions: date formats, number formats, currency
Severity levels:
- Critical: Changes meaning in a way that could cause harm (e.g., medical dosage error)
- Major: Changes meaning but unlikely to cause harm
- Minor: Noticeable issue that doesn't affect meaning
- Neutral: Stylistic preference, not an error
When to use: Final quality assurance before launch, evaluating whether an MT system is production-ready for a specific domain.
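An MQM annotation is naturally represented as a typed record. The penalty weights below (minor = 1, major = 5, critical = 10, normalized per 100 source words) follow one common scheme, but MQM profiles vary, so treat the numbers as placeholders:

```typescript
// Sketch of an MQM-style annotation and a derived score.
// Weights and the 0-100 normalization are illustrative assumptions.

type Severity = "neutral" | "minor" | "major" | "critical";
type Category = "accuracy" | "fluency" | "terminology" | "style" | "locale";

interface MqmError {
  category: Category;
  severity: Severity;
  note: string;
}

const PENALTY: Record<Severity, number> = { neutral: 0, minor: 1, major: 5, critical: 10 };

function mqmScore(errors: MqmError[], sourceWordCount: number): number {
  const penalty = errors.reduce((sum, e) => sum + PENALTY[e.severity], 0);
  // Normalize penalties per 100 source words; floor at 0.
  return Math.max(0, 100 - (penalty / sourceWordCount) * 100);
}
```

Structuring annotations this way also makes the "accuracy" subcategory queryable, which is exactly what hallucination monitoring needs.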
Choosing the Right Metric
| Scenario | Recommended Metric |
|---|---|
| Comparing MT engines | BLEU + TER (automated, fast) |
| Estimating post-editing cost | TER |
| Production quality gate | MQM (human evaluation) |
| Ongoing monitoring | BLEU trend tracking + periodic MQM sampling |
| Hallucination detection | MQM accuracy subcategory (requires human review) |
No single metric captures all aspects of translation quality. The most robust approach combines automated metrics for scaling with periodic human evaluation for accuracy.
Best Practices for Safe AI Translation Usage
Based on the research and real-world examples above, here are actionable practices for teams using AI translation in production:
1. Never Auto-Publish AI Translations
This is the single most important rule. Every AI-generated translation should go through at least one human review step before reaching users.
Implementation: Configure your TMS to save AI translations as "pending review" by default. Disable any "auto-approve" pipelines for AI-generated content.
2. Segment by Risk Level
Not all content carries equal risk. Categorize your translation content:
| Risk Level | Content Type | Review Requirement |
|---|---|---|
| Critical | Legal, medical, financial, safety | Expert human review required |
| High | Checkout, auth, settings, error messages | Peer review required |
| Medium | Marketing, blog posts, descriptions | Self-review acceptable |
| Low | Internal tools, dev-facing strings | Auto-approve with quality threshold |
3. Maintain a Translation Glossary
Glossaries are your strongest defense against terminology hallucinations. AI models are more likely to hallucinate when they encounter domain-specific terms without guidance.
Key glossary entries to maintain:
- Product-specific terms (feature names, plan names)
- Industry jargon with precise translations
- Terms that should NOT be translated (brand names, technical identifiers)
- Terms with multiple valid translations (choose one and enforce consistency)
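A do-not-translate check is the simplest of these rules to automate: terms present in the source must appear verbatim in the translation. A minimal sketch with illustrative names:

```typescript
// Sketch: report do-not-translate glossary terms (brand names,
// technical identifiers) that appear in the source segment but are
// missing or altered in the translation.

function dntViolations(source: string, translation: string, doNotTranslate: string[]): string[] {
  return doNotTranslate
    .filter((term) => source.includes(term)) // term actually occurs in this segment
    .filter((term) => !translation.includes(term)); // but not verbatim in the translation
}
```

Enforced-translation glossary terms (one approved target term per source term per language) need the inverse check, but the shape is the same.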
4. Use Quality Thresholds
Set automated quality thresholds that route translations to the appropriate review queue:
Score 90-100: Auto-approve (trusted source + high quality metrics)
Score 70-89: Self-review queue (translator confirms)
Score 50-69: Peer review queue (second reviewer required)
Score 0-49: Expert review queue (domain expert required)
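The bands above map directly onto a small routing function, with hypothetical queue names:

```typescript
// Route a translation to a review queue from its quality score.
// Bands mirror the thresholds listed above; adjust to your own policy.

type Queue = "auto_approve" | "self_review" | "peer_review" | "expert_review";

function routeByScore(score: number): Queue {
  if (score >= 90) return "auto_approve";
  if (score >= 70) return "self_review";
  if (score >= 50) return "peer_review";
  return "expert_review";
}
```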
5. Monitor Hallucination Patterns
Track which language pairs, content types, and MT engines produce the most hallucinations. Use this data to:
- Adjust quality thresholds per language pair
- Route high-risk pairs to more rigorous review
- Switch MT engines for problematic language pairs
- Build targeted test sets for regression testing
6. Implement Regression Testing
Maintain a set of "golden" translations — human-verified, high-quality reference translations for critical content. Periodically re-translate these using your MT pipeline and compare results:
- Are quality scores stable or declining?
- Are new hallucination patterns emerging?
- Has an MT engine update introduced regressions?
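A golden-set regression check can be as simple as re-translating and diffing, though in practice you would compare TER or quality scores rather than require exact matches. `translate` below is a stand-in for your MT engine call, and all names are illustrative:

```typescript
// Sketch: re-translate human-verified "golden" segments and report any
// that no longer match the stored reference. An exact-match diff is
// strict; a production version would tolerate benign variation.

interface GoldenCase {
  key: string;
  source: string;
  golden: string; // human-verified reference translation
}

function regressions(
  cases: GoldenCase[],
  translate: (source: string) => string
): { key: string; got: string; expected: string }[] {
  return cases
    .map((c) => ({ key: c.key, got: translate(c.source), expected: c.golden }))
    .filter((r) => r.got !== r.expected);
}
```

Run this on every MT engine or model update; a sudden jump in the regression count is the earliest signal that a vendor change has introduced new hallucination patterns.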
7. Train Your Reviewers
Human reviewers are the last line of defense. Invest in training them to:
- Recognize common hallucination patterns (especially fabrication and semantic reversal)
- Use quality scoring tools effectively
- Prioritize reviews based on risk level and quality score
- Document and report new hallucination patterns
8. Use MCP for Workflow Automation
MCP-enabled workflows can automate the mechanical parts of translation management while preserving human oversight:
AI Agent workflow:
1. Detect new keys in codebase (automated)
2. Generate translation suggestions (AI)
3. Run quality checks (automated)
4. Route to appropriate review queue (automated)
5. Human reviews and approves (human)
6. Publish approved translations (automated via MCP)
Steps 1-4 and 6 are automated. Step 5 is human. This is the sweet spot of efficiency and quality.
Try Better i18n's AI-Assisted Translation Workflow
The translation industry doesn't have a speed problem — modern MT engines are fast enough. It has a trust problem. Teams can't confidently ship AI-generated translations because they lack the tools to verify quality at scale.
Better i18n solves this by making human-in-the-loop workflows effortless:
- AI suggestions generate draft translations in seconds
- Quality scoring automatically flags potential hallucinations
- Review workflows ensure human oversight without bottlenecks
- MCP integration lets AI agents manage the mechanical work while humans focus on quality decisions
Start Improving Your Translation Quality
- Sign up for Better i18n — free tier includes quality scoring
- Set up MCP integration — automate your translation workflow
- Explore quality features — see the scoring dashboard and review tools
- Connect with the community — share best practices with other i18n teams
Your users deserve translations they can trust. With the right combination of AI speed and human judgment, you can deliver both quality and velocity — without compromise.
Concerned about translation quality in your app? Better i18n's quality scoring dashboard gives you visibility into every translation across every language. Get started for free and see your translation health score in minutes.