Thought Leadership

AI Translation Quality: Why 60% of Machine Translations Have Hallucination Risks

Eray Gündoğmuş

Artificial intelligence has transformed how we approach translation. Neural machine translation (NMT) systems like Google Translate, DeepL, and GPT-based models can now produce fluent, natural-sounding text in dozens of languages. Adoption is surging: the global machine translation market is projected to reach $4.2 billion by 2027, with AI-powered solutions leading the charge.

But there's a problem hiding beneath the surface. Despite impressive fluency, a significant portion of machine translations contain what researchers call "hallucinations" — outputs that sound correct but subtly distort, omit, or fabricate meaning. According to industry analyses and academic research, approximately 60% of machine-translated content carries some level of hallucination risk, ranging from minor semantic drift to dangerous factual errors.

For teams shipping software to global users, this isn't an academic concern. A hallucinated translation in your checkout flow, medical interface, or legal documentation can cost real money, erode user trust, or create liability. This article examines why AI translations hallucinate, how to detect these errors, and what workflows actually solve the problem.


The 60% Hallucination Problem

What Are Translation Hallucinations?

Translation hallucinations occur when a machine translation system generates output that is fluent but unfaithful to the source text. Unlike obvious errors (garbled syntax, untranslated words), hallucinations are dangerous precisely because they look correct.

Researchers categorize translation hallucinations into three types:

1. Semantic Drift: The translation gradually shifts meaning, producing a sentence that is grammatically correct but says something different from the source.

  • Source (EN): "The update improves battery performance by 20%."
  • Hallucinated (DE): "Das Update verbessert die Akkuleistung um 30%." (says 30% instead of 20%)

2. Omission Hallucinations: The model silently drops important information from the source text.

  • Source (EN): "Cancel within 14 days for a full refund. Terms and conditions apply."
  • Hallucinated (FR): "Annulez sous 14 jours pour un remboursement complet." (drops "Terms and conditions apply")

3. Fabrication Hallucinations: The model adds information that doesn't exist in the source text.

  • Source (EN): "Our platform supports 40+ languages."
  • Hallucinated (JA): "当社のプラットフォームは40以上の言語をサポートし、リアルタイム翻訳を提供します。" (adds "and provides real-time translation" — a feature that may not exist)
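Number mismatches like the semantic-drift example above are among the easiest hallucinations to catch automatically, because digits should survive translation unchanged. A minimal sketch of such a check in Python (illustrative only; locale-specific reformatting such as "1,000" vs "1.000" would need extra handling in practice):

```python
import re

def extract_numbers(text: str) -> list[str]:
    """Pull numeric tokens (integers, decimals) out of a string."""
    return re.findall(r"\d+(?:[.,]\d+)?", text)

def numbers_match(source: str, translation: str) -> bool:
    """Flag a potential hallucination when the numbers in the
    translation differ from those in the source (order-insensitive)."""
    return sorted(extract_numbers(source)) == sorted(extract_numbers(translation))

# The semantic-drift example above: 20% silently became 30%.
src = "The update improves battery performance by 20%."
bad = "Das Update verbessert die Akkuleistung um 30%."
good = "Das Update verbessert die Akkuleistung um 20%."
assert not numbers_match(src, bad)
assert numbers_match(src, good)
```

A check like this cannot prove a translation is faithful, but it reliably flags the segments a human should look at first.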

Where the 60% Figure Comes From

The 60% figure comes from aggregating findings across multiple studies and industry reports:

  • University of Maryland (2023): Found that up to 70% of translations from LLM-based systems contained at least one type of hallucination when translating low-resource language pairs.
  • GALA Industry Report (2024): Surveyed 500+ localization professionals; 58% reported encountering AI hallucinations in production translations within the past 12 months.
  • Meta AI Research (2023): Their study on hallucinations in NMT systems found that even high-resource language pairs (EN-DE, EN-FR) exhibited hallucination rates of 15-25%, while low-resource pairs exceeded 60%.
  • Intento State of MT (2025): Benchmarked 15 MT engines across 30 language pairs; found that 62% of segments had at least one quality issue when evaluated by human linguists.

The rate varies significantly by:

  • Language pair: High-resource pairs (EN-DE) have lower rates than low-resource pairs (EN-Swahili)
  • Domain: General content has lower rates than specialized domains (legal, medical, technical)
  • Content type: Structured UI strings hallucinate less than long-form prose
  • Model architecture: Dedicated NMT models hallucinate less than general-purpose LLMs used for translation

Real-World Wrong Translation Examples

UI Errors That Shipped to Production

These examples illustrate how translation hallucinations create real user-facing problems:

Example 1: E-commerce Checkout (EN to PT-BR)

Element | Source (EN) | Expected (PT-BR) | Hallucinated (PT-BR)
Button text | "Place Order" | "Finalizar Pedido" | "Fazer Pedido Agora"
Error message | "Card declined" | "Cartão recusado" | "Cartão não aceito neste momento"
Disclaimer | "Non-refundable after 24h" | "Não reembolsável após 24h" | "Reembolsável em até 24h"

The last example is catastrophic: the hallucination reversed the meaning, telling users they can get a refund within 24 hours when the policy says they cannot.

Example 2: SaaS Settings Panel (EN to KO)

  • Source: "Delete all data permanently"
  • Expected: "모든 데이터를 영구적으로 삭제"
  • Hallucinated: "모든 데이터를 초기화" (means "Reset all data" — a very different action)

Users clicking "reset" expect to restore defaults; users clicking "delete permanently" expect data destruction. This hallucination could lead to data loss or, conversely, to users thinking their data was deleted when it wasn't.

Example 3: Healthcare App (EN to AR)

  • Source: "Take 2 tablets every 8 hours"
  • Expected: "تناول قرصين كل 8 ساعات"
  • Hallucinated: "تناول قرصين كل 8 أيام" (says "every 8 days" instead of "every 8 hours")

In medical contexts, this type of hallucination is not just a bug — it's a safety hazard.

Semantic Drift in Marketing Content

Long-form content is especially susceptible to semantic drift, where the translation gradually diverges from the source meaning:

Source (EN):

"Our free plan includes 5,000 translation keys, unlimited languages, and community support. Upgrade to Pro for priority support and advanced analytics."

Hallucinated (DE):

"Unser kostenloser Plan umfasst 5.000 Übersetzungsschlüssel, unbegrenzte Sprachen und Premium-Support. Wechseln Sie zu Pro für erweiterte Analysen und API-Zugang."

Three hallucinations in one paragraph:

  1. "community support" became "Premium-Support" (upgrade)
  2. "priority support" was dropped entirely
  3. "API-Zugang" (API access) was fabricated

Human-in-the-Loop as the Solution

The most effective approach to AI translation quality isn't to abandon machine translation — it's to combine AI speed with human judgment. This is the human-in-the-loop (HITL) model.

Why Pure Automation Fails

Fully automated translation pipelines (source text in, published translation out) fail because:

  1. No quality gate: Hallucinations pass through undetected
  2. No context awareness: Machines can't verify translations against product knowledge, brand voice, or regulatory requirements
  3. No accountability: When errors reach production, there's no review trail to understand what went wrong
  4. Compounding errors: One hallucinated term in a glossary propagates to thousands of translations

The HITL Workflow

A well-designed human-in-the-loop workflow has four stages:

Source Text
    |
    v
[AI Translation] --- generates draft translations
    |
    v
[Quality Scoring] --- automated checks flag potential issues
    |
    v
[Human Review] --- linguists verify flagged segments
    |
    v
[Publication] --- approved translations go live

Stage 1: AI Translation
Machine translation generates initial drafts. This is where AI adds the most value — producing a first pass in seconds instead of hours.

Stage 2: Quality Scoring
Automated quality checks identify potential issues:

  • Terminology consistency against glossaries
  • Number and date format verification
  • Length constraints for UI elements
  • Formality level matching
  • Known hallucination pattern detection
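Two of the checks above, placeholder integrity and length ratio, can be sketched in a few lines. This is an illustrative Python sketch rather than any platform's actual implementation; the {variable} placeholder syntax and the 0.5-2.0 length bounds are assumptions you would tune per project and language:

```python
import re

def check_placeholders(source: str, translation: str) -> bool:
    """All {variables} present in the source must survive translation."""
    pattern = r"\{(\w+)\}"
    return set(re.findall(pattern, source)) == set(re.findall(pattern, translation))

def check_length_ratio(source: str, translation: str,
                       low: float = 0.5, high: float = 2.0) -> bool:
    """Translations far shorter or longer than the source often signal
    omission or fabrication. Bounds are illustrative and language-dependent
    (German expands, Japanese compresses)."""
    ratio = len(translation) / max(len(source), 1)
    return low <= ratio <= high

src = "Hello {name}, you have {count} new messages."
ok = "Hallo {name}, du hast {count} neue Nachrichten."
broken = "Hallo, du hast neue Nachrichten."
assert check_placeholders(src, ok)
assert not check_placeholders(src, broken)
assert check_length_ratio(src, ok)
```

Each failed check lowers a segment's quality score and pushes it up the review queue.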

Stage 3: Human Review
Professional linguists or bilingual team members review flagged segments. The key insight: humans don't need to review everything — only the segments that automated checks flag as risky.

Stage 4: Publication
Only translations that pass both automated and human review gates are published to production.

HITL Economics

The common objection to human-in-the-loop is cost. But the math actually favors HITL:

Approach | Cost per 1,000 words | Quality | Time
Pure human translation | $80-120 | High | 2-3 days
Pure machine translation | $0-5 | Variable (60% risk) | Minutes
AI + human review (HITL) | $15-30 | High | 2-4 hours

HITL achieves near-human quality at 70-80% lower cost than pure human translation, while being dramatically more reliable than pure automation.


Better i18n's Approach: AI Suggest + Human Review + Quality Scoring

Better i18n implements the HITL model as a first-class feature of the platform, not a bolt-on workflow. Here's how it works:

AI-Assisted Translation Suggestions

When you create a new translation key or request translations for missing languages, Better i18n can generate AI-powered suggestions:

Developer (via MCP): "Add translations for 'checkout.success_message'
                      in German, French, and Japanese"

Better i18n:
  - EN (source): "Your order has been confirmed!"
  - DE (suggestion): "Ihre Bestellung wurde bestätigt!"
  - FR (suggestion): "Votre commande a été confirmée !"
  - JA (suggestion): "ご注文が確定しました!"
  Status: pending_review

The suggestions are saved as drafts, never published directly. They enter the review queue where human reviewers can approve, edit, or reject them.

Quality Scoring Dashboard

Every translation in Better i18n receives a quality score based on multiple automated checks:

  • Terminology consistency: Does the translation use approved glossary terms?
  • Placeholder integrity: Are all {variables} preserved from the source?
  • Length ratio: Is the translation within expected length bounds for the language?
  • Formality matching: Does the formality level match the project setting?
  • Completeness: Are any source segments omitted?
  • Known patterns: Does the translation match any known hallucination patterns?

Scores are displayed per-key and per-language, allowing reviewers to prioritize their time on the lowest-scoring translations.

Review Workflow

Better i18n's review workflow supports multiple levels of approval:

  1. Self-review: The translator marks their own work as reviewed
  2. Peer review: Another team member verifies the translation
  3. Expert review: A subject-matter expert validates domain-specific terminology
  4. Auto-approve: High-scoring translations from trusted sources can be auto-approved based on configurable thresholds

MCP Integration for Review

The entire review workflow is accessible via MCP tools, enabling AI agents to participate in (but not bypass) the review process:

Developer: "Show me pending German translations that scored below 80"

Agent (via MCP): Found 7 German translations below quality threshold:
  1. checkout.terms_disclaimer (score: 45) — length mismatch
  2. auth.mfa.setup_instructions (score: 62) — missing placeholder
  3. settings.danger_zone.delete_warning (score: 58) — sentiment reversal detected
  ...

The agent surfaces issues, but a human makes the final decision.


Quality Metrics: BLEU, TER, and Human Evaluation

Understanding translation quality metrics is essential for teams implementing AI translation workflows. Here are the key metrics and when to use each:

BLEU (Bilingual Evaluation Understudy)

What it measures: How closely a machine translation matches one or more human reference translations, based on n-gram overlap.

Scale: 0 to 100 (higher is better)

Score Range | Interpretation
50+ | Excellent — often indistinguishable from human
35-50 | Good — understandable, minor errors
20-35 | Acceptable — meaning preserved, some issues
Below 20 | Poor — significant quality problems

Strengths:

  • Fast and automated
  • Good for comparing MT systems against each other
  • Well-established benchmark (used in MT research since 2002)

Weaknesses:

  • Requires reference translations (expensive to create)
  • Penalizes valid alternative translations
  • Doesn't capture semantic accuracy well
  • A hallucination that uses similar words to the reference can score high

When to use: Comparing MT engine performance across language pairs, tracking quality trends over time.
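For intuition, a simplified sentence-level BLEU can be written in pure Python: modified n-gram precisions combined by geometric mean, times a brevity penalty. Production pipelines should use an established implementation (e.g. sacreBLEU); the single-reference setup and add-one smoothing below are simplifying assumptions so short sentences don't collapse to zero:

```python
import math
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU on a 0-100 scale:
    geometric mean of smoothed n-gram precisions times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(count, r[gram]) for gram, count in h.items())
        total = max(sum(h.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * geo_mean

s = "the update improves battery performance by 20 percent"
assert bleu(s, s) == 100.0            # identical sentences score 100
assert bleu("the update helps", s) < bleu(s, s)
```

Note how the BLEU weakness described above shows up directly in the code: the metric only counts n-gram overlap, so a fluent hallucination that reuses the reference's vocabulary can still score well.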

TER (Translation Edit Rate)

What it measures: The minimum number of edits (insertions, deletions, substitutions, shifts) needed to transform the MT output into the reference translation.

Scale: 0 to infinity (lower is better; 0 means perfect match)

Score Range | Interpretation
Below 0.2 | Excellent — minimal editing needed
0.2-0.4 | Good — light post-editing
0.4-0.6 | Acceptable — moderate post-editing
Above 0.6 | Poor — may be faster to translate from scratch

Strengths:

  • Intuitive — directly measures editing effort
  • Better at capturing word order issues than BLEU
  • Useful for estimating post-editing cost

Weaknesses:

  • Also requires reference translations
  • Sensitive to reference translation style
  • Doesn't account for severity of errors

When to use: Estimating post-editing costs, measuring translator productivity gains from MT.
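A simplified TER can likewise be sketched as word-level edit distance divided by reference length. The real metric also counts block shifts as a single edit, which this sketch omits, so it slightly overestimates TER on reordered output:

```python
def ter(hypothesis: str, reference: str) -> float:
    """Simplified TER: word-level edit distance (insertions, deletions,
    substitutions) over reference length. Real TER also allows shifts."""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance, one row at a time.
    prev = list(range(len(ref) + 1))
    for i, h_word in enumerate(hyp, 1):
        curr = [i]
        for j, r_word in enumerate(ref, 1):
            cost = 0 if h_word == r_word else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

assert ter("cancel within 14 days", "cancel within 14 days") == 0.0
# One substituted word over four reference words: TER 0.25.
assert ter("cancel within 14 hours", "cancel within 14 days") == 0.25
```

The connection to the table above is direct: a TER of 0.25 means roughly one edit per four reference words of post-editing effort.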

Human Evaluation: MQM (Multidimensional Quality Metrics)

What it measures: Human evaluators annotate errors by type and severity using a standardized taxonomy.

Error categories:

  • Accuracy: mistranslation, omission, addition, untranslated text
  • Fluency: grammar, spelling, punctuation, register
  • Terminology: incorrect terms, inconsistent terminology
  • Style: awkward phrasing, unnatural word choice
  • Locale conventions: date formats, number formats, currency

Severity levels:

  • Critical: Changes meaning in a way that could cause harm (e.g., medical dosage error)
  • Major: Changes meaning but unlikely to cause harm
  • Minor: Noticeable issue that doesn't affect meaning
  • Neutral: Stylistic preference, not an error

When to use: Final quality assurance before launch, evaluating whether an MT system is production-ready for a specific domain.

Choosing the Right Metric

Scenario | Recommended Metric
Comparing MT engines | BLEU + TER (automated, fast)
Estimating post-editing cost | TER
Production quality gate | MQM (human evaluation)
Ongoing monitoring | BLEU trend tracking + periodic MQM sampling
Hallucination detection | MQM accuracy subcategory (requires human review)

No single metric captures all aspects of translation quality. The most robust approach combines automated metrics for scaling with periodic human evaluation for accuracy.


Best Practices for Safe AI Translation Usage

Based on the research and real-world examples above, here are actionable practices for teams using AI translation in production:

1. Never Auto-Publish AI Translations

This is the single most important rule. Every AI-generated translation should go through at least one human review step before reaching users.

Implementation: Configure your TMS to save AI translations as "pending review" by default. Disable any "auto-approve" pipelines for AI-generated content.

2. Segment by Risk Level

Not all content carries equal risk. Categorize your translation content:

Risk Level | Content Type | Review Requirement
Critical | Legal, medical, financial, safety | Expert human review required
High | Checkout, auth, settings, error messages | Peer review required
Medium | Marketing, blog posts, descriptions | Self-review acceptable
Low | Internal tools, dev-facing strings | Auto-approve with quality threshold

3. Maintain a Translation Glossary

Glossaries are your strongest defense against terminology hallucinations. AI models are more likely to hallucinate when they encounter domain-specific terms without guidance.

Key glossary entries to maintain:

  • Product-specific terms (feature names, plan names)
  • Industry jargon with precise translations
  • Terms that should NOT be translated (brand names, technical identifiers)
  • Terms with multiple valid translations (choose one and enforce consistency)
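A glossary check can be as simple as verifying that each source term's approved target rendering appears in the translation. A hedged sketch (the glossary entries and helper below are invented for illustration, not any platform's API; real matching would also need word-boundary handling and lemmatization):

```python
# Hypothetical glossary: source term -> required target rendering.
# Brand names map to themselves, enforcing "do not translate".
GLOSSARY_DE = {
    "translation key": "Übersetzungsschlüssel",
    "Better i18n": "Better i18n",
}

def glossary_violations(source: str, translation: str,
                        glossary: dict[str, str]) -> list[str]:
    """Return source terms whose required target rendering is missing
    from the translation."""
    return [term for term, target in glossary.items()
            if term.lower() in source.lower() and target not in translation]

src = "Create a translation key in Better i18n."
good = "Erstellen Sie einen Übersetzungsschlüssel in Better i18n."
bad = "Erstellen Sie einen Schlüssel in Better i18n."
assert glossary_violations(src, good, GLOSSARY_DE) == []
assert glossary_violations(src, bad, GLOSSARY_DE) == ["translation key"]
```

Because the glossary encodes decisions a human already made, a failed lookup is a high-precision hallucination signal.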

4. Use Quality Thresholds

Set automated quality thresholds that route translations to the appropriate review queue:

Score 90-100: Auto-approve (trusted source + high quality metrics)
Score 70-89:  Self-review queue (translator confirms)
Score 50-69:  Peer review queue (second reviewer required)
Score 0-49:   Expert review queue (domain expert required)
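The thresholds above translate directly into a routing function. A sketch (the queue names, and the rule that auto-approval requires a trusted source in addition to a high score, are assumptions based on the description above):

```python
def route_for_review(score: int, trusted_source: bool = False) -> str:
    """Map a 0-100 quality score to a review queue, mirroring the
    thresholds above. Auto-approval also requires a trusted source."""
    if score >= 90 and trusted_source:
        return "auto_approve"
    if score >= 70:
        return "self_review"
    if score >= 50:
        return "peer_review"
    return "expert_review"

assert route_for_review(95, trusted_source=True) == "auto_approve"
assert route_for_review(95) == "self_review"  # high score alone isn't enough
assert route_for_review(62) == "peer_review"
assert route_for_review(45) == "expert_review"
```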

5. Monitor Hallucination Patterns

Track which language pairs, content types, and MT engines produce the most hallucinations. Use this data to:

  • Adjust quality thresholds per language pair
  • Route high-risk pairs to more rigorous review
  • Switch MT engines for problematic language pairs
  • Build targeted test sets for regression testing

6. Implement Regression Testing

Maintain a set of "golden" translations — human-verified, high-quality reference translations for critical content. Periodically re-translate these using your MT pipeline and compare results:

  • Are quality scores stable or declining?
  • Are new hallucination patterns emerging?
  • Has an MT engine update introduced regressions?
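A golden-set regression check can be sketched as re-translating each verified source and diffing against the reference. Everything here is illustrative: the golden entries are invented, `translate` stands in for whatever MT callable your pipeline exposes, and exact-match comparison is the simplest possible diff (a real pipeline would fall back to a similarity score):

```python
# Hypothetical golden set: key -> (source, human-verified translation).
GOLDEN_DE = {
    "checkout.confirm": ("Your order has been confirmed!",
                         "Ihre Bestellung wurde bestätigt!"),
    "refund.policy": ("Non-refundable after 24h",
                      "Nicht erstattungsfähig nach 24h"),
}

def regression_report(golden: dict, translate) -> dict[str, bool]:
    """Re-translate each golden source and report which keys still match
    the human-verified reference exactly."""
    return {key: translate(src) == ref for key, (src, ref) in golden.items()}

# A stubbed engine that has regressed on one key:
stub = {
    "Your order has been confirmed!": "Ihre Bestellung wurde bestätigt!",
    "Non-refundable after 24h": "Erstattungsfähig nach 24h",  # meaning reversed!
}
report = regression_report(GOLDEN_DE, stub.get)
assert report["checkout.confirm"] is True
assert report["refund.policy"] is False
```

Run on a schedule and after every MT engine update, a report like this surfaces regressions before they reach users.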

7. Train Your Reviewers

Human reviewers are the last line of defense. Invest in training them to:

  • Recognize common hallucination patterns (especially fabrication and semantic reversal)
  • Use quality scoring tools effectively
  • Prioritize reviews based on risk level and quality score
  • Document and report new hallucination patterns

8. Use MCP for Workflow Automation

MCP-enabled workflows can automate the mechanical parts of translation management while preserving human oversight:

AI Agent workflow:
  1. Detect new keys in codebase (automated)
  2. Generate translation suggestions (AI)
  3. Run quality checks (automated)
  4. Route to appropriate review queue (automated)
  5. Human reviews and approves (human)
  6. Publish approved translations (automated via MCP)

Steps 1-4 and 6 are automated. Step 5 is human. This is the sweet spot of efficiency and quality.


Try Better i18n's AI-Assisted Translation Workflow

The translation industry doesn't have a speed problem — modern MT engines are fast enough. It has a trust problem. Teams can't confidently ship AI-generated translations because they lack the tools to verify quality at scale.

Better i18n solves this by making human-in-the-loop workflows effortless:

  • AI suggestions generate draft translations in seconds
  • Quality scoring automatically flags potential hallucinations
  • Review workflows ensure human oversight without bottlenecks
  • MCP integration lets AI agents manage the mechanical work while humans focus on quality decisions

Start Improving Your Translation Quality

  1. Sign up for Better i18n — free tier includes quality scoring
  2. Set up MCP integration — automate your translation workflow
  3. Explore quality features — see the scoring dashboard and review tools
  4. Connect with the community — share best practices with other i18n teams

Your users deserve translations they can trust. With the right combination of AI speed and human judgment, you can deliver both quality and velocity — without compromise.


Concerned about translation quality in your app? Better i18n's quality scoring dashboard gives you visibility into every translation across every language. Get started for free and see your translation health score in minutes.