Engineering

Machine Translation Quality: Common Issues, Evaluation Methods, and How to Improve Output

Eray Gündoğmuş
10 min read

Key Takeaways

  • Machine translation quality varies significantly by language pair, domain, and content type
  • Common MT errors include hallucinations (fabricated content), literal translations, terminology inconsistency, and gender/formality mistakes
  • Automated metrics (BLEU, COMET, chrF) provide rough quality estimates, but human evaluation remains the gold standard
  • Post-editing machine translation (MTPE) is the standard workflow for production content — combining MT speed with human accuracy
  • Improving MT quality requires a combination of better source text, custom glossaries, domain adaptation, and structured post-editing workflows

Common Machine Translation Errors

Understanding the types of errors MT systems produce helps teams build effective review workflows.

Hallucinations

MT models can generate content that doesn't exist in the source text. This is particularly dangerous because the output may look fluent and natural to non-speakers while being factually wrong.

Example: Source: "Click Save" → MT output: "Click Save to preserve your changes and exit the application" (additional meaning fabricated)

Hallucinations are more common in:

  • Very short strings with limited context
  • Low-resource language pairs
  • Content that is ambiguous in the source language
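
A cheap first-line defense against the worst hallucinations is a length-ratio check on short strings. This is a sketch, not a standard technique: the 2.5× word-ratio threshold is an illustrative assumption you would tune per language pair.

```python
def flag_possible_hallucination(source: str, mt_output: str,
                                max_ratio: float = 2.5) -> bool:
    """Flag MT output that is suspiciously longer than its source.

    Genuine translations rarely expand a short source string to many
    times its word count. The 2.5x threshold is an illustrative
    default, not a standard -- tune it against real data.
    """
    src_len = max(len(source.split()), 1)
    out_len = len(mt_output.split())
    return out_len / src_len > max_ratio

# "Click Save" (2 words) vs. a 9-word output: flagged for review
flag_possible_hallucination(
    "Click Save",
    "Click Save to preserve your changes and exit the application")
```

A check like this only catches additive hallucinations; omissions and fabrications of similar length still need human or QE-model review.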

Literal Translation

Literal translation renders the source word-for-word without adapting it to the target language's natural expression patterns.

Example: English "It's raining cats and dogs" → French literal translation rather than the natural French idiom "Il pleut des cordes" (it's raining ropes).

In software, literal translations often produce technically correct but unnatural UI text that makes the product feel poorly localized.

Terminology Inconsistency

MT engines don't maintain terminology consistency across strings unless explicitly configured with glossaries. The same source term may be translated differently in different strings.

Example: "Dashboard" translated as "Tableau de bord" in one string and "Panneau de contrôle" in another within the same project.

Gender and Formality Errors

MT systems must often guess the intended gender or formality level, and rarely apply one choice consistently across strings.

Example: German translation mixing formal "Sie" address with informal "du" across different strings of the same application.

Context Misinterpretation

Short strings without context are particularly error-prone. The English word "Open" could mean:

  • Verb: "Open the file" (German: "Öffnen")
  • Adjective: "The file is open" (German: "Geöffnet")
  • Noun: "Open (status)" (German: "Offen")

MT systems must guess without context, and frequently guess wrong.

Number and Formatting Errors

MT can incorrectly modify numbers, dates, currencies, and other formatted content:

  • Changing currency symbols inappropriately
  • Reformatting dates incorrectly
  • Modifying technical values (version numbers, measurements)
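
Number preservation is easy to verify mechanically. A minimal sketch (it ignores locale reformatting such as `1.5` vs. `1,5`, which a real check would need to normalize):

```python
import re

def numbers_preserved(source: str, translation: str) -> bool:
    """Check that the numbers in the source all survive into the
    translation, catching MT edits to versions, prices, and
    measurements. Order is ignored; locale-specific reformatting
    (1.5 vs 1,5) would need extra normalization not shown here."""
    pattern = r"\d+(?:[.,]\d+)*"
    src_nums = sorted(re.findall(pattern, source))
    tgt_nums = sorted(re.findall(pattern, translation))
    return src_nums == tgt_nums

numbers_preserved("Version 2.4.1 costs $19",
                  "La version 2.4.1 coûte 19 $")   # True
numbers_preserved("Version 2.4.1 costs $19",
                  "La version 2.4 coûte 19 $")     # False: 2.4.1 changed
```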

Evaluation Methods

Automated Metrics

| Metric | What It Measures | Strengths | Limitations |
| --- | --- | --- | --- |
| BLEU | N-gram overlap with reference translation | Fast, reproducible, widely used | Penalizes valid alternative translations |
| COMET | Learned quality estimation using neural models | Better correlation with human judgment than BLEU | Requires model download, language-dependent |
| chrF | Character-level F-score | Works well for morphologically rich languages | Less interpretable than BLEU |
| TER | Edit distance to reference translation | Intuitive (lower = fewer edits needed) | Same reference-dependent limitation as BLEU |

Important: Automated metrics require reference translations (human-translated gold standards). They measure similarity to a reference, not absolute quality. A valid translation that differs stylistically from the reference will score lower even if it's perfectly correct.
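
To make the reference-dependence concrete, here is a deliberately simplified BLEU-style score: clipped unigram and bigram precision with a brevity penalty. Real BLEU uses 4-grams, smoothing, and corpus-level counts, so use a library such as sacreBLEU for actual evaluation.

```python
import math
from collections import Counter

def simple_bleu(hypothesis: str, reference: str, max_n: int = 2) -> float:
    """Toy BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty. A teaching sketch --
    real BLEU uses 4-grams, smoothing, and corpus-level counts."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i+n]) for i in range(len(hyp)-n+1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref)-n+1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        precisions.append(overlap / max(sum(hyp_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# A valid paraphrase still scores well below 1.0 -- the metric measures
# overlap with *this* reference, not correctness.
simple_bleu("press the save button", "click the save button")
```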

Human Evaluation

Human evaluation remains the most reliable method. Common frameworks:

MQM (Multidimensional Quality Metrics): A structured framework that categorizes errors by:

  • Accuracy: Mistranslation, omission, addition
  • Fluency: Grammar, spelling, punctuation
  • Terminology: Wrong term, inconsistent terminology
  • Style: Register, formality, locale convention

Each error is weighted by severity (critical, major, minor). The total weighted error score gives a quality rating.
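
A minimal MQM-style scorer can make this concrete. The severity weights (10/5/1) and the per-100-words normalization below are common illustrative conventions, not fixed by the framework; real deployments tune both.

```python
# Illustrative severity weights; MQM deployments tune these.
SEVERITY_WEIGHTS = {"critical": 10, "major": 5, "minor": 1}

def mqm_score(errors: list[tuple[str, str]], word_count: int) -> float:
    """Weighted MQM error score, normalized per 100 source words.

    `errors` is a list of (category, severity) pairs, e.g.
    ("accuracy/omission", "major"). Lower is better; 0 is perfect.
    """
    penalty = sum(SEVERITY_WEIGHTS[sev] for _, sev in errors)
    return penalty / word_count * 100

errors = [("accuracy/mistranslation", "major"),
          ("fluency/punctuation", "minor"),
          ("terminology/inconsistent", "minor")]
mqm_score(errors, word_count=250)  # 7 penalty points / 250 words -> 2.8
```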

Direct Assessment: Human evaluators rate translations on a continuous scale (0-100) for adequacy (does it convey the meaning?) and fluency (does it sound natural?).

Quality Estimation (Reference-Free)

Quality estimation models predict translation quality without a human reference. They're trained on human quality judgments and can:

  • Flag low-quality translations for review
  • Prioritize post-editing effort
  • Provide real-time quality feedback in TMS interfaces
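
Once a QE model supplies per-segment scores, routing is straightforward. This sketch assumes scores in [0, 1] with higher meaning better, and the 0.7 review threshold is an arbitrary placeholder to tune per language pair:

```python
def route_segments(segments: list[str], qe_scores: list[float],
                   threshold: float = 0.7) -> tuple[list[str], list[str]]:
    """Split segments into auto-approve vs. needs-review by QE score.

    Assumes QE scores in [0, 1] (higher = better); the 0.7 threshold
    is illustrative -- calibrate it against post-editing data.
    """
    approved, review = [], []
    for seg, score in zip(segments, qe_scores):
        (approved if score >= threshold else review).append(seg)
    return approved, review

segs = ["Save", "Open file", "Raining cats and dogs"]
approved, review = route_segments(segs, [0.92, 0.81, 0.43])
# The low-scoring idiom lands in the review queue
```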

Improving Machine Translation Quality

1. Write Translation-Friendly Source Text

MT quality starts with source text quality:

  • Use simple, clear sentences: Avoid complex nested clauses
  • Avoid ambiguity: "Right" (correct? or directional?) — be specific
  • Minimize idioms and colloquialisms: "Heads up" → "Notice" or "Alert"
  • Keep strings self-contained: Don't split sentences across multiple translation keys
  • Provide context: Add descriptions or screenshots for translators (and for context-aware MT)
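
Parts of these guidelines can be linted automatically before strings reach the MT engine. A sketch of such a source-string linter; the idiom list and the 25-word limit are illustrative assumptions, not established rules:

```python
# Illustrative lint rules for translation-friendly source text.
IDIOMS = {"heads up": "notice/alert",
          "touch base": "contact",
          "raining cats and dogs": "raining heavily"}
MAX_WORDS = 25  # long sentences translate worse; the limit is illustrative

def lint_source_string(text: str) -> list[str]:
    """Return human-readable warnings for translation-unfriendly text."""
    warnings = []
    lowered = text.lower()
    for idiom, plain in IDIOMS.items():
        if idiom in lowered:
            warnings.append(f"idiom '{idiom}': consider '{plain}'")
    if len(text.split()) > MAX_WORDS:
        warnings.append(f"sentence longer than {MAX_WORDS} words")
    return warnings

lint_source_string("Heads up: the export may fail.")
# -> ["idiom 'heads up': consider 'notice/alert'"]
```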

2. Use Custom Glossaries

Enforce consistent terminology by creating a glossary of product-specific terms with their approved translations per language. Most TMS platforms and MT APIs support glossary enforcement.
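
Even when the MT API enforces the glossary, a post-hoc check catches regressions. A naive sketch using substring matching (real checks need tokenization, casing, and morphology handling; the terms shown are examples):

```python
# Example glossary for French; terms are illustrative.
GLOSSARY_FR = {"Dashboard": "Tableau de bord", "Settings": "Paramètres"}

def check_glossary(source: str, translation: str,
                   glossary: dict[str, str]) -> list[str]:
    """Report glossary terms present in the source whose approved
    translation is missing from the MT output. Naive substring
    matching -- real checks need tokenization and morphology."""
    violations = []
    for term, approved in glossary.items():
        if term in source and approved not in translation:
            violations.append(f"'{term}' should be '{approved}'")
    return violations

check_glossary("Open the Dashboard", "Ouvrez le Panneau de contrôle",
               GLOSSARY_FR)
# -> ["'Dashboard' should be 'Tableau de bord'"]
```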

3. Leverage Translation Memory

Translation memory ensures previously approved translations are reused exactly. New MT suggestions are only generated for content not found in TM, reducing the overall error surface.
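
The TM-first flow reduces to an exact-match lookup before any MT call. A minimal sketch; `mt_translate` here is a stand-in for a real MT API client, not an actual library function:

```python
def translate_batch(strings, tm, mt_translate):
    """Reuse exact TM matches; send only TM misses to the MT engine.

    `tm` maps source -> approved translation; `mt_translate` is any
    callable translating a single string (a stand-in for a real MT
    API call). Fuzzy TM matching is out of scope for this sketch.
    """
    results = {}
    for s in strings:
        results[s] = tm[s] if s in tm else mt_translate(s)
    return results

tm = {"Save": "Enregistrer"}
out = translate_batch(["Save", "Open"], tm,
                      mt_translate=lambda s: f"MT({s})")
# "Save" comes straight from TM; only "Open" reaches the MT engine
```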

4. Implement Structured Post-Editing

MTPE (Machine Translation Post-Editing) workflows come in two levels:

  • Light post-editing: Fix errors that change meaning or are clearly unnatural. Accept "good enough" translations. Appropriate for internal content or lower-priority languages.
  • Full post-editing: Edit the MT output to the quality of a professional human translation. Appropriate for customer-facing content in primary markets.

Define which level applies to each content type and language pair.

5. Provide Context to MT Engines

When available, send contextual information alongside source strings:

  • File/key context: The filename or key prefix helps MT infer domain
  • Previous/next strings: Surrounding strings help with consistency
  • UI screenshots: Visual context reduces ambiguity
  • String descriptions: Developer-provided notes about what a string does

6. Monitor and Iterate

Track MT quality over time:

  • Calculate average post-editing distance per language pair
  • Identify consistently problematic content patterns
  • Update glossaries based on common corrections
  • Consider domain adaptation for language pairs with persistent quality issues
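
Post-editing distance is simple to compute from the raw MT output and the final edited version. A word-level Levenshtein sketch (character-level variants and TER-style shifts are common refinements not shown here):

```python
def edit_distance(a: str, b: str) -> int:
    """Word-level Levenshtein distance via dynamic programming."""
    aw, bw = a.split(), b.split()
    prev = list(range(len(bw) + 1))
    for i, wa in enumerate(aw, 1):
        curr = [i]
        for j, wb in enumerate(bw, 1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def post_edit_distance(mt_output: str, post_edited: str) -> float:
    """Fraction of the post-edited text changed relative to raw MT
    (word-level, normalized) -- a rough per-segment effort signal."""
    return edit_distance(mt_output, post_edited) / max(
        len(post_edited.split()), 1)

# Two of six words were corrected by the post-editor
post_edit_distance("Cliquez sur le Panneau de contrôle",
                   "Cliquez sur le Tableau de bord")
```

Averaging this per language pair over time shows where MT is improving and where glossary or domain-adaptation work should be focused.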

FAQ

What is an acceptable MT quality level for production content?

It depends on the content type and audience. For customer-facing product UI, MT output typically needs full post-editing to reach production quality. For help documentation, light post-editing may suffice. For internal communications, raw MT may be acceptable. Define quality tiers by content type and apply the appropriate review level.

How do BLEU scores translate to real-world quality?

BLEU scores are relative, not absolute. A BLEU score of 30+ generally indicates understandable translations, while 50+ suggests high quality. However, these numbers vary significantly by language pair and domain. BLEU is best used for comparing systems or tracking quality changes over time, not for making absolute quality judgments about individual translations.

Should I invest in custom MT model training?

Custom model training is worthwhile when: (a) your domain has specialized vocabulary that generic MT handles poorly, (b) you have sufficient parallel training data (typically 10,000+ sentence pairs), and (c) the language pairs you need are high-volume enough to justify the investment. For most teams, glossaries and translation memory provide substantial quality improvements before custom model training becomes necessary.