Table of Contents
- AI Translation Quality: Why 60% of Machine Translations Have Hallucination Risks
- The 60% Hallucination Problem
- What Are Translation Hallucinations?
- Why 60% Is the Right Number
- Real-World Wrong Translation Examples
- UI Errors That Shipped to Production
- Semantic Drift in Marketing Content
- Human-in-the-Loop as the Solution
- Why Pure Automation Fails
- The HITL Workflow
- HITL Economics
- Better i18n's Approach: AI Suggest + Human Review + Quality Scoring
- AI-Assisted Translation Suggestions
- Quality Scoring Dashboard
- Review Workflow
- MCP Integration for Review
- Quality Metrics: BLEU, TER, and Human Evaluation
- BLEU (Bilingual Evaluation Understudy)
- TER (Translation Edit Rate)
- Human Evaluation: MQM (Multidimensional Quality Metrics)
- Choosing the Right Metric
- Best Practices for Safe AI Translation Usage
- 1. Never Auto-Publish AI Translations
- 2. Segment by Risk Level
- 3. Maintain a Translation Glossary
- 4. Use Quality Thresholds
- 5. Monitor Hallucination Patterns
- 6. Implement Regression Testing
- 7. Train Your Reviewers
- 8. Use MCP for Workflow Automation
- Try Better i18n's AI-Assisted Translation Workflow
- Start Improving Your Translation Quality
AI Translation Quality: Why 60% of Machine Translations Have Hallucination Risks
Artificial intelligence has transformed how we approach translation. Neural machine translation (NMT) systems like Google Translate, DeepL, and GPT-based models can now produce fluent, natural-sounding text in dozens of languages. Adoption is surging: the global machine translation market is projected to reach $4.2 billion by 2027, with AI-powered solutions leading the charge.
But there's a problem hiding beneath the surface. Despite impressive fluency, a significant portion of machine translations contain what researchers call "hallucinations" — outputs that sound correct but subtly distort, omit, or fabricate meaning. According to industry analyses and academic research, approximately 60% of machine-translated content carries some level of hallucination risk, ranging from minor semantic drift to dangerous factual errors.
For teams shipping software to global users, this isn't an academic concern. A hallucinated translation in your checkout flow, medical interface, or legal documentation can cost real money, erode user trust, or create liability. This article examines why AI translations hallucinate, how to detect these errors, and what workflows actually solve the problem.
The 60% Hallucination Problem
What Are Translation Hallucinations?
Translation hallucinations occur when a machine translation system generates output that is fluent but unfaithful to the source text. Unlike obvious errors (garbled syntax, untranslated words), hallucinations are dangerous precisely because they look correct.
Researchers categorize translation hallucinations into three types:
1. Semantic Drift: The translation gradually shifts meaning, producing a sentence that is grammatically correct but says something different from the source.
- Source (EN): "The update improves battery performance by 20%."
- Hallucinated (DE): "Das Update verbessert die Akkuleistung um 30%." (says 30% instead of 20%)
2. Omission Hallucinations: The model silently drops important information from the source text.
- Source (EN): "Cancel within 14 days for a full refund. Terms and conditions apply."
- Hallucinated (FR): "Annulez sous 14 jours pour un remboursement complet." (drops "Terms and conditions apply")
3. Fabrication Hallucinations: The model adds information that doesn't exist in the source text.
- Source (EN): "Our platform supports 40+ languages."
- Hallucinated (JA): "当社のプラットフォームは40以上の言語をサポートし、リアルタイム翻訳を提供します。" (adds "and provides real-time translation" — a feature that may not exist)
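Of the three types, numeric drift is the easiest to catch mechanically: digits should survive translation unchanged. A minimal TypeScript sketch of such a check (all names here are illustrative, not part of any real API):

```typescript
// Sketch: flag semantic drift on numbers by comparing the numeric
// tokens in source and translation. A mismatch (20% -> 30%) is one of
// the few hallucination types that is cheap to detect automatically.

interface NumberCheckResult {
  ok: boolean;
  sourceNumbers: string[];
  targetNumbers: string[];
}

function checkNumberConsistency(source: string, target: string): NumberCheckResult {
  // Extract digit sequences. Locale formatting (1,000 vs 1.000) is
  // ignored here; a production check would normalize it first.
  const extract = (text: string): string[] => text.match(/\d+(?:[.,]\d+)?/g) ?? [];
  const sourceNumbers = extract(source);
  const targetNumbers = extract(target);
  const ok =
    sourceNumbers.length === targetNumbers.length &&
    sourceNumbers.every((n, i) => n === targetNumbers[i]);
  return { ok, sourceNumbers, targetNumbers };
}
```

Run against the semantic-drift example above, the 20-versus-30 mismatch is flagged immediately, with no reference translation required.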
Why 60% Is the Right Number
The 60% figure comes from aggregating findings across multiple studies and industry reports:
- University of Maryland (2023): Found that up to 70% of translations from LLM-based systems contained at least one type of hallucination when translating low-resource language pairs.
- GALA Industry Report (2024): Surveyed 500+ localization professionals; 58% reported encountering AI hallucinations in production translations within the past 12 months.
- Meta AI Research (2023): Their study on hallucinations in NMT systems found that even high-resource language pairs (EN-DE, EN-FR) exhibited hallucination rates of 15-25%, while low-resource pairs exceeded 60%.
- Intento State of MT (2025): Benchmarked 15 MT engines across 30 language pairs; found that 62% of segments had at least one quality issue when evaluated by human linguists.
The rate varies significantly by:
- Language pair: High-resource pairs (EN-DE) have lower rates than low-resource pairs (EN-Swahili)
- Domain: General content has lower rates than specialized domains (legal, medical, technical)
- Content type: Structured UI strings hallucinate less than long-form prose
- Model architecture: Dedicated NMT models hallucinate less than general-purpose LLMs used for translation
Real-World Wrong Translation Examples
UI Errors That Shipped to Production
These examples illustrate how translation hallucinations create real user-facing problems:
Example 1: E-commerce Checkout (EN to PT-BR)
| Element | Source (EN) | Expected (PT-BR) | Hallucinated (PT-BR) |
|---|---|---|---|
| Button text | "Place Order" | "Finalizar Pedido" | "Fazer Pedido Agora" |
| Error message | "Card declined" | "Cartão recusado" | "Cartão não aceito neste momento" |
| Disclaimer | "Non-refundable after 24h" | "Não reembolsável após 24h" | "Reembolsável em até 24h" |
The last example is catastrophic: the hallucination reversed the meaning, telling users they can get a refund within 24 hours when the policy says they cannot.
Example 2: SaaS Settings Panel (EN to KO)
- Source: "Delete all data permanently"
- Expected: "모든 데이터를 영구적으로 삭제"
- Hallucinated: "모든 데이터를 초기화" (means "Reset all data" — a very different action)
Users clicking "reset" expect to restore defaults; users clicking "delete permanently" expect data destruction. Because the mislabeled button actually deletes data, users who merely wanted to reset their settings would suffer irreversible data loss.
Example 3: Healthcare App (EN to AR)
- Source: "Take 2 tablets every 8 hours"
- Expected: "تناول قرصين كل 8 ساعات"
- Hallucinated: "تناول قرصين كل 8 أيام" (says "every 8 days" instead of "every 8 hours")
In medical contexts, this type of hallucination is not just a bug — it's a safety hazard.
Semantic Drift in Marketing Content
Long-form content is especially susceptible to semantic drift, where the translation gradually diverges from the source meaning:
Source (EN):
"Our free plan includes 5,000 translation keys, unlimited languages, and community support. Upgrade to Pro for priority support and advanced analytics."
Hallucinated (DE):
"Unser kostenloser Plan umfasst 5.000 Übersetzungsschlüssel, unbegrenzte Sprachen und Premium-Support. Wechseln Sie zu Pro für erweiterte Analysen und API-Zugang."
Three hallucinations in one paragraph:
- "community support" became "Premium-Support" (upgrade)
- "priority support" was dropped entirely
- "API-Zugang" (API access) was fabricated
Human-in-the-Loop as the Solution
The most effective approach to AI translation quality isn't to abandon machine translation — it's to combine AI speed with human judgment. This is the human-in-the-loop (HITL) model.
Why Pure Automation Fails
Fully automated translation pipelines (source text in, published translation out) fail because:
- No quality gate: Hallucinations pass through undetected
- No context awareness: Machines can't verify translations against product knowledge, brand voice, or regulatory requirements
- No accountability: When errors reach production, there's no review trail to understand what went wrong
- Compounding errors: One hallucinated term in a glossary propagates to thousands of translations
The HITL Workflow
A well-designed human-in-the-loop workflow has four stages:
Source Text
|
v
[AI Translation] --- generates draft translations
|
v
[Quality Scoring] --- automated checks flag potential issues
|
v
[Human Review] --- linguists verify flagged segments
|
v
[Publication] --- approved translations go live
Stage 1: AI Translation. Machine translation generates initial drafts. This is where AI adds the most value — producing a first pass in seconds instead of hours.
Stage 2: Quality Scoring. Automated quality checks identify potential issues:
- Terminology consistency against glossaries
- Number and date format verification
- Length constraints for UI elements
- Formality level matching
- Known hallucination pattern detection
Stage 3: Human Review. Professional linguists or bilingual team members review flagged segments. The key insight: humans don't need to review everything — only the segments that automated checks flag as risky.
Stage 4: Publication. Only translations that pass both automated and human review gates are published to production.
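The four stages above can be sketched as a typed pipeline. This is a minimal illustration under assumed names, not Better i18n's actual data model:

```typescript
// Sketch of the four-stage HITL pipeline as a typed state machine.
// All names are hypothetical; a real TMS would persist these states.

type Stage = "draft" | "scored" | "in_review" | "published" | "rejected";

interface Segment {
  key: string;
  source: string;
  translation: string;
  stage: Stage;
  qualityScore?: number; // 0-100, set during scoring
}

// Stage 2: automated checks produce a score.
function score(seg: Segment, check: (s: Segment) => number): Segment {
  return { ...seg, qualityScore: check(seg), stage: "scored" };
}

// Stage 3: only segments below the threshold need human eyes.
function route(seg: Segment, threshold = 80): Segment {
  if (seg.qualityScore === undefined) throw new Error("score first");
  return seg.qualityScore < threshold ? { ...seg, stage: "in_review" } : seg;
}

// Stage 4: a segment routed to review may only go live with approval.
function publish(seg: Segment, humanApproved: boolean): Segment {
  if (seg.stage === "in_review" && !humanApproved) {
    return { ...seg, stage: "rejected" };
  }
  return { ...seg, stage: "published" };
}
```

The design point is that publication is a distinct, gated transition: nothing reaches "published" without passing through scoring, and flagged segments additionally require an explicit human approval flag.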
HITL Economics
The common objection to human-in-the-loop is cost. But the math actually favors HITL:
| Approach | Cost per 1000 words | Quality | Time |
|---|---|---|---|
| Pure human translation | $80-120 | High | 2-3 days |
| Pure machine translation | $0-5 | Variable (60% risk) | Minutes |
| AI + Human review (HITL) | $15-30 | High | 2-4 hours |
HITL achieves near-human quality at 70-80% lower cost than pure human translation, while being dramatically more reliable than pure automation.
Better i18n's Approach: AI Suggest + Human Review + Quality Scoring
Better i18n implements the HITL model as a first-class feature of the platform, not a bolt-on workflow. Here's how it works:
AI-Assisted Translation Suggestions
When you create a new translation key or request translations for missing languages, Better i18n can generate AI-powered suggestions:
Developer (via MCP): "Add translations for 'checkout.success_message'
in German, French, and Japanese"
Better i18n:
- EN (source): "Your order has been confirmed!"
- DE (suggestion): "Ihre Bestellung wurde bestätigt!"
- FR (suggestion): "Votre commande a été confirmée !"
- JA (suggestion): "ご注文が確定しました!"
Status: pending_review
The suggestions are saved as drafts, never published directly. They enter the review queue where human reviewers can approve, edit, or reject them.
Quality Scoring Dashboard
Every translation in Better i18n receives a quality score based on multiple automated checks:
- Terminology consistency: Does the translation use approved glossary terms?
- Placeholder integrity: Are all {variables} preserved from the source?
- Length ratio: Is the translation within expected length bounds for the language?
- Formality matching: Does the formality level match the project setting?
- Completeness: Are any source segments omitted?
- Known patterns: Does the translation match any known hallucination patterns?
Scores are displayed per-key and per-language, allowing reviewers to prioritize their time on the lowest-scoring translations.
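Of these checks, placeholder integrity is the most mechanical. A minimal sketch, assuming simple {variable} placeholders rather than full ICU MessageFormat (which needs a real parser):

```typescript
// Sketch of the placeholder-integrity check: every {variable} in the
// source must survive the translation untouched. The regex is
// illustrative and only handles flat, single-brace placeholders.

function placeholdersIntact(source: string, translation: string): boolean {
  const grab = (s: string) => (s.match(/\{[^}]+\}/g) ?? []).sort();
  const a = grab(source);
  const b = grab(translation);
  return a.length === b.length && a.every((p, i) => p === b[i]);
}
```

Note that sorting before comparison deliberately allows placeholders to be reordered, since many languages require a different word order than the source.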
Review Workflow
Better i18n's review workflow supports multiple levels of approval:
- Self-review: The translator marks their own work as reviewed
- Peer review: Another team member verifies the translation
- Expert review: A subject-matter expert validates domain-specific terminology
- Auto-approve: High-scoring translations from trusted sources can be auto-approved based on configurable thresholds
MCP Integration for Review
The entire review workflow is accessible via MCP tools, enabling AI agents to participate in (but not bypass) the review process:
Developer: "Show me pending German translations that scored below 80"
Agent (via MCP): Found 7 German translations below quality threshold:
1. checkout.terms_disclaimer (score: 45) — length mismatch
2. auth.mfa.setup_instructions (score: 62) — missing placeholder
3. settings.danger_zone.delete_warning (score: 58) — sentiment reversal detected
...
The agent surfaces issues, but a human makes the final decision.
Quality Metrics: BLEU, TER, and Human Evaluation
Understanding translation quality metrics is essential for teams implementing AI translation workflows. Here are the key metrics and when to use each:
BLEU (Bilingual Evaluation Understudy)
What it measures: How closely a machine translation matches one or more human reference translations, based on n-gram overlap.
Scale: 0 to 100 (higher is better)
| Score Range | Interpretation |
|---|---|
| 50+ | Excellent — often indistinguishable from human |
| 35-50 | Good — understandable, minor errors |
| 20-35 | Acceptable — meaning preserved, some issues |
| Below 20 | Poor — significant quality problems |
Strengths:
- Fast and automated
- Good for comparing MT systems against each other
- Well-established benchmark (used in MT research since 2002)
Weaknesses:
- Requires reference translations (expensive to create)
- Penalizes valid alternative translations
- Doesn't capture semantic accuracy well
- A hallucination that uses similar words to the reference can score high
When to use: Comparing MT engine performance across language pairs, tracking quality trends over time.
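To make the n-gram idea concrete, here is a deliberately simplified sentence-level BLEU: unigrams and bigrams only, clipped precision, brevity penalty. Real BLEU is corpus-level, uses up to 4-grams and smoothing; use an established implementation such as sacrebleu in practice, not this sketch:

```typescript
// Count n-grams of a token sequence.
function ngrams(tokens: string[], n: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i + n <= tokens.length; i++) {
    const g = tokens.slice(i, i + n).join(" ");
    counts.set(g, (counts.get(g) ?? 0) + 1);
  }
  return counts;
}

// Simplified sentence-level BLEU on a 0-100 scale.
function bleu(candidate: string, reference: string): number {
  const cand = candidate.toLowerCase().split(/\s+/).filter(Boolean);
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const precisions: number[] = [];
  for (const n of [1, 2]) {
    const c = ngrams(cand, n);
    const r = ngrams(ref, n);
    let matched = 0;
    let total = 0;
    for (const [g, count] of c) {
      total += count;
      matched += Math.min(count, r.get(g) ?? 0); // clipped: can't match more than the reference has
    }
    precisions.push(total === 0 ? 0 : matched / total);
  }
  if (precisions.some((p) => p === 0)) return 0;
  const geoMean = Math.exp(precisions.reduce((s, p) => s + Math.log(p), 0) / precisions.length);
  // Brevity penalty punishes candidates shorter than the reference.
  const bp = cand.length >= ref.length ? 1 : Math.exp(1 - ref.length / cand.length);
  return 100 * bp * geoMean;
}
```

Even this toy version exposes BLEU's blind spot: a hallucination that reuses the reference's vocabulary (say, swapping one number) still shares most n-grams and scores high.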
TER (Translation Edit Rate)
What it measures: The minimum number of edits (insertions, deletions, substitutions, shifts) needed to transform the MT output into the reference translation.
Scale: 0 to infinity (lower is better; 0 means perfect match)
| Score Range | Interpretation |
|---|---|
| Below 0.2 | Excellent — minimal editing needed |
| 0.2-0.4 | Good — light post-editing |
| 0.4-0.6 | Acceptable — moderate post-editing |
| Above 0.6 | Poor — may be faster to translate from scratch |
Strengths:
- Intuitive — directly measures editing effort
- Better at capturing word order issues than BLEU
- Useful for estimating post-editing cost
Weaknesses:
- Also requires reference translations
- Sensitive to reference translation style
- Doesn't account for severity of errors
When to use: Estimating post-editing costs, measuring translator productivity gains from MT.
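A simplified TER can be computed as word-level edit distance divided by reference length. The real metric also models block shifts, which this sketch over-counts as a deletion plus an insertion:

```typescript
// Simplified TER sketch: word-level Levenshtein distance over tokens,
// normalized by reference length. Lower is better; 0 is a perfect match.

function ter(hypothesis: string, reference: string): number {
  const hyp = hypothesis.split(/\s+/).filter(Boolean);
  const ref = reference.split(/\s+/).filter(Boolean);
  // Classic dynamic-programming edit distance over word tokens.
  const d: number[][] = Array.from({ length: hyp.length + 1 }, (_, i) =>
    Array.from({ length: ref.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= hyp.length; i++) {
    for (let j = 1; j <= ref.length; j++) {
      const sub = hyp[i - 1] === ref[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub);
    }
  }
  return ref.length === 0 ? 0 : d[hyp.length][ref.length] / ref.length;
}
```

The normalization is why TER maps so directly onto post-editing cost: a score of 0.33 on a three-word segment means one word had to change.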
Human Evaluation: MQM (Multidimensional Quality Metrics)
What it measures: Human evaluators annotate errors by type and severity using a standardized taxonomy.
Error categories:
- Accuracy: mistranslation, omission, addition, untranslated text
- Fluency: grammar, spelling, punctuation, register
- Terminology: incorrect terms, inconsistent terminology
- Style: awkward phrasing, unnatural word choice
- Locale conventions: date formats, number formats, currency
Severity levels:
- Critical: Changes meaning in a way that could cause harm (e.g., medical dosage error)
- Major: Changes meaning but unlikely to cause harm
- Minor: Noticeable issue that doesn't affect meaning
- Neutral: Stylistic preference, not an error
When to use: Final quality assurance before launch, evaluating whether an MT system is production-ready for a specific domain.
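An MQM annotation is naturally represented as a typed record. The penalty weights below (minor = 1, major = 5, critical = 10, normalized per 100 source words) follow one common scheme, but MQM profiles vary, so treat the numbers as placeholders:

```typescript
// Sketch of an MQM-style annotation and a derived score.
// Weights and the 0-100 normalization are illustrative assumptions.

type Severity = "neutral" | "minor" | "major" | "critical";
type Category = "accuracy" | "fluency" | "terminology" | "style" | "locale";

interface MqmError {
  category: Category;
  severity: Severity;
  note: string;
}

const PENALTY: Record<Severity, number> = { neutral: 0, minor: 1, major: 5, critical: 10 };

function mqmScore(errors: MqmError[], sourceWordCount: number): number {
  const penalty = errors.reduce((sum, e) => sum + PENALTY[e.severity], 0);
  // Normalize penalties per 100 source words; floor at 0.
  return Math.max(0, 100 - (penalty / sourceWordCount) * 100);
}
```

Structuring annotations this way also makes the "accuracy" subcategory queryable, which is exactly what hallucination monitoring needs.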
Choosing the Right Metric
| Scenario | Recommended Metric |
|---|---|
| Comparing MT engines | BLEU + TER (automated, fast) |
| Estimating post-editing cost | TER |
| Production quality gate | MQM (human evaluation) |
| Ongoing monitoring | BLEU trend tracking + periodic MQM sampling |
| Hallucination detection | MQM accuracy subcategory (requires human review) |
No single metric captures all aspects of translation quality. The most robust approach combines automated metrics for scaling with periodic human evaluation for accuracy.
Best Practices for Safe AI Translation Usage
Based on the research and real-world examples above, here are actionable practices for teams using AI translation in production:
1. Never Auto-Publish AI Translations
This is the single most important rule. Every AI-generated translation should go through at least one human review step before reaching users.
Implementation: Configure your TMS to save AI translations as "pending review" by default. Disable any "auto-approve" pipelines for AI-generated content.
2. Segment by Risk Level
Not all content carries equal risk. Categorize your translation content:
| Risk Level | Content Type | Review Requirement |
|---|---|---|
| Critical | Legal, medical, financial, safety | Expert human review required |
| High | Checkout, auth, settings, error messages | Peer review required |
| Medium | Marketing, blog posts, descriptions | Self-review acceptable |
| Low | Internal tools, dev-facing strings | Auto-approve with quality threshold |
3. Maintain a Translation Glossary
Glossaries are your strongest defense against terminology hallucinations. AI models are more likely to hallucinate when they encounter domain-specific terms without guidance.
Key glossary entries to maintain:
- Product-specific terms (feature names, plan names)
- Industry jargon with precise translations
- Terms that should NOT be translated (brand names, technical identifiers)
- Terms with multiple valid translations (choose one and enforce consistency)
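A do-not-translate check is the simplest of these rules to automate: terms present in the source must appear verbatim in the translation. A minimal sketch with illustrative names:

```typescript
// Sketch: report do-not-translate glossary terms (brand names,
// technical identifiers) that appear in the source segment but are
// missing or altered in the translation.

function dntViolations(source: string, translation: string, doNotTranslate: string[]): string[] {
  return doNotTranslate
    .filter((term) => source.includes(term)) // term actually occurs in this segment
    .filter((term) => !translation.includes(term)); // but not verbatim in the translation
}
```

Enforced-translation glossary terms (one approved target term per source term per language) need the inverse check, but the shape is the same.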
4. Use Quality Thresholds
Set automated quality thresholds that route translations to the appropriate review queue:
Score 90-100: Auto-approve (trusted source + high quality metrics)
Score 70-89: Self-review queue (translator confirms)
Score 50-69: Peer review queue (second reviewer required)
Score 0-49: Expert review queue (domain expert required)
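The bands above map directly onto a small routing function, with hypothetical queue names:

```typescript
// Route a translation to a review queue from its quality score.
// Bands mirror the thresholds listed above; adjust to your own policy.

type Queue = "auto_approve" | "self_review" | "peer_review" | "expert_review";

function routeByScore(score: number): Queue {
  if (score >= 90) return "auto_approve";
  if (score >= 70) return "self_review";
  if (score >= 50) return "peer_review";
  return "expert_review";
}
```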
5. Monitor Hallucination Patterns
Track which language pairs, content types, and MT engines produce the most hallucinations. Use this data to:
- Adjust quality thresholds per language pair
- Route high-risk pairs to more rigorous review
- Switch MT engines for problematic language pairs
- Build targeted test sets for regression testing
6. Implement Regression Testing
Maintain a set of "golden" translations — human-verified, high-quality reference translations for critical content. Periodically re-translate these using your MT pipeline and compare results:
- Are quality scores stable or declining?
- Are new hallucination patterns emerging?
- Has an MT engine update introduced regressions?
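A golden-set regression check can be as simple as re-translating and diffing, though in practice you would compare TER or quality scores rather than require exact matches. `translate` below is a stand-in for your MT engine call, and all names are illustrative:

```typescript
// Sketch: re-translate human-verified "golden" segments and report any
// that no longer match the stored reference. An exact-match diff is
// strict; a production version would tolerate benign variation.

interface GoldenCase {
  key: string;
  source: string;
  golden: string; // human-verified reference translation
}

function regressions(
  cases: GoldenCase[],
  translate: (source: string) => string
): { key: string; got: string; expected: string }[] {
  return cases
    .map((c) => ({ key: c.key, got: translate(c.source), expected: c.golden }))
    .filter((r) => r.got !== r.expected);
}
```

Run this on every MT engine or model update; a sudden jump in the regression count is the earliest signal that a vendor change has introduced new hallucination patterns.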
7. Train Your Reviewers
Human reviewers are the last line of defense. Invest in training them to:
- Recognize common hallucination patterns (especially fabrication and semantic reversal)
- Use quality scoring tools effectively
- Prioritize reviews based on risk level and quality score
- Document and report new hallucination patterns
8. Use MCP for Workflow Automation
MCP-enabled workflows can automate the mechanical parts of translation management while preserving human oversight:
AI Agent workflow:
1. Detect new keys in codebase (automated)
2. Generate translation suggestions (AI)
3. Run quality checks (automated)
4. Route to appropriate review queue (automated)
5. Human reviews and approves (human)
6. Publish approved translations (automated via MCP)
Steps 1-4 and 6 are automated. Step 5 is human. This is the sweet spot of efficiency and quality.
Try Better i18n's AI-Assisted Translation Workflow
The translation industry doesn't have a speed problem — modern MT engines are fast enough. It has a trust problem. Teams can't confidently ship AI-generated translations because they lack the tools to verify quality at scale.
Better i18n solves this by making human-in-the-loop workflows effortless:
- AI suggestions generate draft translations in seconds
- Quality scoring automatically flags potential hallucinations
- Review workflows ensure human oversight without bottlenecks
- MCP integration lets AI agents manage the mechanical work while humans focus on quality decisions
Start Improving Your Translation Quality
- Sign up for Better i18n — free tier includes quality scoring
- Set up MCP integration — automate your translation workflow
- Explore quality features — see the scoring dashboard and review tools
- Connect with the community — share best practices with other i18n teams
Your users deserve translations they can trust. With the right combination of AI speed and human judgment, you can deliver both quality and velocity — without compromise.
Concerned about translation quality in your app? Better i18n's quality scoring dashboard gives you visibility into every translation across every language. Get started for free and see your translation health score in minutes.