Table of Contents
- Machine Translation Quality: Common Issues, Evaluation Methods, and How to Improve Output
- Key Takeaways
- Common Machine Translation Errors
- Hallucinations
- Literal Translation
- Terminology Inconsistency
- Gender and Formality Errors
- Context Misinterpretation
- Number and Formatting Errors
- Evaluation Methods
- Automated Metrics
- Human Evaluation
- Quality Estimation (Reference-Free)
- Improving Machine Translation Quality
- 1. Write Translation-Friendly Source Text
- 2. Use Custom Glossaries
- 3. Leverage Translation Memory
- 4. Implement Structured Post-Editing
- 5. Provide Context to MT Engines
- 6. Monitor and Iterate
- FAQ
- What is an acceptable MT quality level for production content?
- How do BLEU scores translate to real-world quality?
- Should I invest in custom MT model training?
Machine Translation Quality: Common Issues, Evaluation Methods, and How to Improve Output
Key Takeaways
- Machine translation quality varies significantly by language pair, domain, and content type
- Common MT errors include hallucinations (fabricated content), literal translations, terminology inconsistency, and gender/formality mistakes
- Automated metrics (BLEU, COMET, chrF) provide rough quality estimates, but human evaluation remains the gold standard
- Post-editing machine translation (MTPE) is the standard workflow for production content — combining MT speed with human accuracy
- Improving MT quality requires a combination of better source text, custom glossaries, domain adaptation, and structured post-editing workflows
Common Machine Translation Errors
Understanding the types of errors MT systems produce helps teams build effective review workflows.
Hallucinations
MT models can generate content that doesn't exist in the source text. This is particularly dangerous because the output may look fluent and natural to non-speakers while being factually wrong.
Example: Source: "Click Save" → MT output: "Click Save to preserve your changes and exit the application" (additional meaning fabricated)
Hallucinations are more common in:
- Very short strings with limited context
- Low-resource language pairs
- Content that is ambiguous in the source language
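One cheap safeguard against fabricated additions is a length-ratio check. The sketch below flags translations that are suspiciously long relative to the source; the 2.0 threshold is an assumption and should be tuned per language pair, since legitimate expansion ratios vary.

```python
def flag_possible_hallucination(source: str, translation: str,
                                max_ratio: float = 2.0) -> bool:
    """Flag translations suspiciously long relative to the source.

    A crude length-ratio heuristic: the max_ratio threshold is an
    assumption to calibrate per language pair, not a universal rule.
    """
    src_len = max(len(source.split()), 1)
    tgt_len = len(translation.split())
    return tgt_len / src_len > max_ratio
```

The "Click Save" example above (2 source words, 9 output words) would be flagged; a normal-length translation would pass.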
Literal Translation
Literal translation renders the source word-by-word without adapting it to the target language's natural expression patterns.
Example: English "It's raining cats and dogs" → French literal translation rather than the natural French idiom "Il pleut des cordes" (it's raining ropes).
In software, literal translations often produce technically correct but unnatural UI text that makes the product feel poorly localized.
Terminology Inconsistency
MT engines don't maintain terminology consistency across strings unless explicitly configured with glossaries. The same source term may be translated differently in different strings.
Example: "Dashboard" translated as "Tableau de bord" in one string and "Panneau de contrôle" in another within the same project.
Gender and Formality Errors
MT systems often default to one gender or formality level and apply it inconsistently.
Example: German translation mixing formal "Sie" address with informal "du" across different strings of the same application.
Context Misinterpretation
Short strings without context are particularly error-prone. The English word "Open" could mean:
- Verb: "Open the file" (German: "Öffnen")
- Adjective: "The file is open" (German: "Geöffnet")
- Noun: "Open (status)" (German: "Offen")
MT systems must guess without context, and frequently guess wrong.
Number and Formatting Errors
MT can incorrectly modify numbers, dates, currencies, and other formatted content:
- Changing currency symbols inappropriately
- Reformatting dates incorrectly
- Modifying technical values (version numbers, measurements)
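These errors are mechanical enough to catch with a simple check: every digit sequence in the source should survive into the translation. This sketch compares raw digit sequences only; locale-aware separator conversion (1,000 vs. 1.000) would need normalization in a real pipeline.

```python
import re

NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

def check_numbers_preserved(source: str, translation: str) -> list:
    """Return numbers present in the source but missing from the translation.

    Compares raw digit sequences only; locale separator normalization
    is deliberately out of scope for this sketch.
    """
    src_nums = NUMBER_RE.findall(source)
    tgt_nums = NUMBER_RE.findall(translation)
    return [n for n in src_nums if n not in tgt_nums]
```

A version number silently changed by MT shows up immediately as a missing source number.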
Evaluation Methods
Automated Metrics
| Metric | What It Measures | Strengths | Limitations |
|---|---|---|---|
| BLEU | N-gram overlap with reference translation | Fast, reproducible, widely used | Penalizes valid alternative translations |
| COMET | Learned quality estimation using neural models | Better correlation with human judgment than BLEU | Requires model download, language-dependent |
| chrF | Character-level F-score | Works well for morphologically rich languages | Less interpretable than BLEU |
| TER | Edit distance to reference translation | Intuitive (lower = fewer edits needed) | Same reference-dependent limitation as BLEU |
Important: Automated metrics require reference translations (human-translated gold standards). They measure similarity to a reference, not absolute quality. A valid translation that differs stylistically from the reference will score lower even if it's perfectly correct.
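To make the reference-dependence concrete, here is a simplified chrF-style score: average character n-gram F1 against a reference. Real chrF (Popović, 2015) uses β=2 and different defaults, so treat this as an illustration of the mechanism, not a drop-in implementation.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified chrF: mean character n-gram F1 for n = 1..max_n.

    Illustrative only; the real metric uses beta=2 recall weighting.
    """
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())
        p = overlap / sum(hyp.values())
        r = overlap / sum(ref.values())
        scores.append(0.0 if p + r == 0 else 2 * p * r / (p + r))
    return sum(scores) / len(scores) if scores else 0.0
```

Note how an exact match scores 1.0 while a perfectly valid paraphrase of the reference would not: the metric rewards similarity, not correctness.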
Human Evaluation
Human evaluation remains the most reliable method. Common frameworks:
MQM (Multidimensional Quality Metrics): A structured framework that categorizes errors by:
- Accuracy: Mistranslation, omission, addition
- Fluency: Grammar, spelling, punctuation
- Terminology: Wrong term, inconsistent terminology
- Style: Register, formality, locale convention
Each error is weighted by severity (critical, major, minor). The total weighted error score gives a quality rating.
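The weighted scoring can be sketched in a few lines. The 25/5/1 severity weights below are an assumption for illustration; teams calibrate their own weights, and normalizing per 100 source words is a common convention.

```python
# Illustrative severity weights; calibrate these for your own MQM setup.
SEVERITY_WEIGHTS = {"critical": 25, "major": 5, "minor": 1}

def mqm_score(errors, word_count: int) -> float:
    """Weighted MQM-style error score, normalized per 100 source words.

    errors: list of (category, severity) tuples, e.g. ("accuracy", "major").
    Lower is better; 0.0 means no errors were found.
    """
    penalty = sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)
    return penalty * 100 / word_count
```

One major accuracy error plus one minor fluency error in a 200-word sample yields a score of 3.0 errors-weight per 100 words.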
Direct Assessment: Human evaluators rate translations on a continuous scale (0-100) for adequacy (does it convey the meaning?) and fluency (does it sound natural?).
Quality Estimation (Reference-Free)
Quality estimation models predict translation quality without a human reference. They're trained on human quality judgments and can:
- Flag low-quality translations for review
- Prioritize post-editing effort
- Provide real-time quality feedback in TMS interfaces
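In practice these uses reduce to routing: segments above a quality threshold skip review, the rest go to post-editing. A minimal sketch, assuming QE scores in [0, 1] and an illustrative 0.8 cutoff:

```python
def route_by_quality(segments, threshold: float = 0.8):
    """Split MT segments into auto-approve vs. needs-review buckets.

    segments: list of (text, qe_score) where qe_score in [0, 1] comes
    from a quality-estimation model; the 0.8 threshold is illustrative.
    """
    approved, review = [], []
    for text, score in segments:
        (approved if score >= threshold else review).append(text)
    return approved, review
```

The threshold is a business decision: raising it trades post-editing cost for risk reduction.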
Improving Machine Translation Quality
1. Write Translation-Friendly Source Text
MT quality starts with source text quality:
- Use simple, clear sentences: Avoid complex nested clauses
- Avoid ambiguity: "Right" (correct? or directional?) — be specific
- Minimize idioms and colloquialisms: "Heads up" → "Notice" or "Alert"
- Keep strings self-contained: Don't split sentences across multiple translation keys
- Provide context: Add descriptions or screenshots for translators (and for context-aware MT)
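Several of these rules can be enforced with a source-text linter run before strings reach the MT engine. The idiom list and length limit below are illustrative assumptions to extend from your own style guide:

```python
# Illustrative rules; extend the idiom list and limit per your style guide.
IDIOMS = {"heads up", "raining cats and dogs", "touch base"}
MAX_WORDS = 25

def lint_source_string(text: str) -> list:
    """Return a list of translation-friendliness issues in a source string."""
    issues = []
    lowered = text.lower()
    for idiom in IDIOMS:
        if idiom in lowered:
            issues.append(f"idiom: '{idiom}'")
    if len(text.split()) > MAX_WORDS:
        issues.append(f"too long: over {MAX_WORDS} words")
    return issues
```

Wiring this into CI catches MT-hostile source text at authoring time, before it multiplies across every target language.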
2. Use Custom Glossaries
Enforce consistent terminology by creating a glossary of product-specific terms with their approved translations per language. Most TMS platforms and MT APIs support glossary enforcement.
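Even when the MT API applies the glossary itself, a post-hoc check is useful. A minimal sketch with a simplified `{source_term: approved_translation}` shape (real TMS glossaries carry per-locale entries, casing rules, and part-of-speech metadata):

```python
def check_glossary(source: str, translation: str, glossary: dict) -> list:
    """Return glossary violations: source terms whose approved
    translation is missing from the output.

    glossary: {source_term: approved_translation} for one target
    language (a simplified shape for this sketch).
    """
    violations = []
    for term, approved in glossary.items():
        if term.lower() in source.lower() and approved.lower() not in translation.lower():
            violations.append((term, approved))
    return violations
```

Violations can then block the string from auto-approval or be surfaced to the post-editor.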
3. Leverage Translation Memory
Translation memory ensures previously approved translations are reused exactly. New MT suggestions are only generated for content not found in TM, reducing the overall error surface.
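The TM-first lookup is simple to express. In this sketch `mt_fn` is a hypothetical stand-in for any MT engine call, and the TM is modeled as an exact-match dictionary (real TMs also do fuzzy matching):

```python
def translate_with_tm(source: str, tm: dict, mt_fn) -> tuple:
    """Exact-match TM lookup before falling back to MT.

    tm: {source_string: approved_translation}; mt_fn is any callable
    producing an MT suggestion (hypothetical stand-in for an engine).
    Returns (translation, origin) where origin is "tm" or "mt".
    """
    if source in tm:
        return tm[source], "tm"
    return mt_fn(source), "mt"
```

Tracking the `origin` field per segment also gives you the TM leverage rate for free.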
4. Implement Structured Post-Editing
MTPE (Machine Translation Post-Editing) workflows come in two levels:
- Light post-editing: Fix errors that change meaning or are clearly unnatural. Accept "good enough" translations. Appropriate for internal content or lower-priority languages.
- Full post-editing: Edit the MT output to the quality of a professional human translation. Appropriate for customer-facing content in primary markets.
Define which level applies to each content type and language pair.
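That mapping works well as an explicit policy table rather than tribal knowledge. The content types, market tiers, and assignments below are illustrative assumptions; defaulting to the strictest level is a deliberate fail-safe choice:

```python
# Illustrative policy: (content type, market tier) -> post-editing level.
PE_POLICY = {
    ("ui", "primary"): "full",
    ("ui", "secondary"): "light",
    ("docs", "primary"): "light",
    ("internal", "primary"): "raw",
}

def pe_level(content_type: str, market_tier: str) -> str:
    """Look up the post-editing level, defaulting to the strictest."""
    return PE_POLICY.get((content_type, market_tier), "full")
```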
5. Provide Context to MT Engines
When available, send contextual information alongside source strings:
- File/key context: The filename or key prefix helps MT infer domain
- Previous/next strings: Surrounding strings help with consistency
- UI screenshots: Visual context reduces ambiguity
- String descriptions: Developer-provided notes about what a string does
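Bundling that context into each request might look like the sketch below. The field names are illustrative assumptions; map them onto whatever context parameters your MT provider actually accepts.

```python
def build_mt_request(key: str, source: str, prev: str = "", nxt: str = "",
                     description: str = "") -> dict:
    """Assemble a context-rich MT request payload (illustrative shape).

    Field names are hypothetical; adapt to your provider's API.
    """
    return {
        "text": source,
        "context": {
            "key": key,                  # key prefix hints at domain/screen
            "previous": prev,            # surrounding strings aid consistency
            "next": nxt,
            "description": description,  # developer note about usage
        },
    }
```

For the "Open" ambiguity above, a key like `file_menu.open_action` plus a description is often enough to steer the engine to the verb reading.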
6. Monitor and Iterate
Track MT quality over time:
- Calculate average post-editing distance per language pair
- Identify consistently problematic content patterns
- Update glossaries based on common corrections
- Consider domain adaptation for language pairs with persistent quality issues
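Post-editing distance, the first metric in the list above, can be computed as character-level edit distance normalized by the post-edited length. A minimal implementation:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def post_edit_distance(mt_output: str, post_edited: str) -> float:
    """Character-level edit distance normalized by post-edited length.

    0.0 means the MT output was accepted unchanged; values near 1.0
    mean it was effectively rewritten.
    """
    if not post_edited:
        return 0.0 if not mt_output else 1.0
    return edit_distance(mt_output, post_edited) / len(post_edited)
```

Averaging this per language pair over time shows where MT is pulling its weight and where domain adaptation is worth considering.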
FAQ
What is an acceptable MT quality level for production content?
It depends on the content type and audience. For customer-facing product UI, MT output typically needs full post-editing to reach production quality. For help documentation, light post-editing may suffice. For internal communications, raw MT may be acceptable. Define quality tiers by content type and apply the appropriate review level.
How do BLEU scores translate to real-world quality?
BLEU scores are relative, not absolute. A BLEU score of 30+ generally indicates understandable translations, while 50+ suggests high quality. However, these numbers vary significantly by language pair and domain. BLEU is best used for comparing systems or tracking quality changes over time, not for making absolute quality judgments about individual translations.
Should I invest in custom MT model training?
Custom model training is worthwhile when: (a) your domain has specialized vocabulary that generic MT handles poorly, (b) you have sufficient parallel training data (typically 10,000+ sentence pairs), and (c) the language pairs you need are high-volume enough to justify the investment. For most teams, glossaries and translation memory provide substantial quality improvements before custom model training becomes necessary.