Engineering

Machine Translation Quality: Common Issues, Evaluation Methods, and How to Improve Output

Eray Gündoğmuş
10 min read

Key Takeaways

  • Machine translation quality varies significantly by language pair, domain, and content type
  • Common MT errors include hallucinations (fabricated content), literal translations, terminology inconsistency, and gender/formality mistakes
  • Automated metrics (BLEU, COMET, chrF) provide rough quality estimates, but human evaluation remains the gold standard
  • Post-editing machine translation (MTPE) is the standard workflow for production content — combining MT speed with human accuracy
  • Improving MT quality requires a combination of better source text, custom glossaries, domain adaptation, and structured post-editing workflows

Common Machine Translation Errors

Understanding the types of errors MT systems produce helps teams build effective review workflows.

Hallucinations

MT models can generate content that doesn't exist in the source text. This is particularly dangerous because the output may look fluent and natural to non-speakers while being factually wrong.

Example: Source: "Click Save" → MT output: "Click Save to preserve your changes and exit the application" (additional meaning fabricated)

Hallucinations are more common in:

  • Very short strings with limited context
  • Low-resource language pairs
  • Content that is ambiguous in the source language
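
A cheap first-line defense against the worst hallucinations is a length-ratio check on short strings. This is a sketch, not a standard technique: the 2.5× word-ratio threshold is an illustrative assumption you would tune per language pair.

```python
def flag_possible_hallucination(source: str, mt_output: str,
                                max_ratio: float = 2.5) -> bool:
    """Flag MT output that is suspiciously longer than its source.

    Genuine translations rarely expand a short source string to many
    times its word count. The 2.5x threshold is an illustrative
    default, not a standard -- tune it against real data.
    """
    src_len = max(len(source.split()), 1)
    out_len = len(mt_output.split())
    return out_len / src_len > max_ratio

# "Click Save" (2 words) vs. a 9-word output: flagged for review
flag_possible_hallucination(
    "Click Save",
    "Click Save to preserve your changes and exit the application")
```

A check like this only catches additive hallucinations; omissions and fabrications of similar length still need human or QE-model review.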

Literal Translation

Literal translation renders the source word-for-word without adapting it to the target language's natural expression patterns.

Example: English "It's raining cats and dogs" → French literal translation rather than the natural French idiom "Il pleut des cordes" (it's raining ropes).

In software, literal translations often produce technically correct but unnatural UI text that makes the product feel poorly localized.

Terminology Inconsistency

MT engines don't maintain terminology consistency across strings unless explicitly configured with glossaries. The same source term may be translated differently in different strings.

Example: "Dashboard" translated as "Tableau de bord" in one string and "Panneau de contrôle" in another within the same project.

Gender and Formality Errors

MT systems must often guess the intended gender or formality level, and rarely apply one choice consistently across strings.

Example: German translation mixing formal "Sie" address with informal "du" across different strings of the same application.

Context Misinterpretation

Short strings without context are particularly error-prone. The English word "Open" could mean:

  • Verb: "Open the file" (German: "Öffnen")
  • Adjective: "The file is open" (German: "Geöffnet")
  • Noun: "Open (status)" (German: "Offen")

MT systems must guess without context, and frequently guess wrong.

Number and Formatting Errors

MT can incorrectly modify numbers, dates, currencies, and other formatted content:

  • Changing currency symbols inappropriately
  • Reformatting dates incorrectly
  • Modifying technical values (version numbers, measurements)
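
Number preservation is easy to verify mechanically. A minimal sketch (it ignores locale reformatting such as `1.5` vs. `1,5`, which a real check would need to normalize):

```python
import re

def numbers_preserved(source: str, translation: str) -> bool:
    """Check that the numbers in the source all survive into the
    translation, catching MT edits to versions, prices, and
    measurements. Order is ignored; locale-specific reformatting
    (1.5 vs 1,5) would need extra normalization not shown here."""
    pattern = r"\d+(?:[.,]\d+)*"
    src_nums = sorted(re.findall(pattern, source))
    tgt_nums = sorted(re.findall(pattern, translation))
    return src_nums == tgt_nums

numbers_preserved("Version 2.4.1 costs $19",
                  "La version 2.4.1 coûte 19 $")   # True
numbers_preserved("Version 2.4.1 costs $19",
                  "La version 2.4 coûte 19 $")     # False: 2.4.1 changed
```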

Evaluation Methods

Automated Metrics

| Metric | What It Measures | Strengths | Limitations |
| --- | --- | --- | --- |
| BLEU | N-gram overlap with reference translation | Fast, reproducible, widely used | Penalizes valid alternative translations |
| COMET | Learned quality estimation using neural models | Better correlation with human judgment than BLEU | Requires model download, language-dependent |
| chrF | Character-level F-score | Works well for morphologically rich languages | Less interpretable than BLEU |
| TER | Edit distance to reference translation | Intuitive (lower = fewer edits needed) | Same reference-dependent limitation as BLEU |

Important: Automated metrics require reference translations (human-translated gold standards). They measure similarity to a reference, not absolute quality. A valid translation that differs stylistically from the reference will score lower even if it's perfectly correct.
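
To make the reference-dependence concrete, here is a deliberately simplified BLEU-style score: clipped unigram and bigram precision with a brevity penalty. Real BLEU uses 4-grams, smoothing, and corpus-level counts, so use a library such as sacreBLEU for actual evaluation.

```python
import math
from collections import Counter

def simple_bleu(hypothesis: str, reference: str, max_n: int = 2) -> float:
    """Toy BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty. A teaching sketch --
    real BLEU uses 4-grams, smoothing, and corpus-level counts."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i+n]) for i in range(len(hyp)-n+1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref)-n+1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        precisions.append(overlap / max(sum(hyp_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# A valid paraphrase still scores well below 1.0 -- the metric measures
# overlap with *this* reference, not correctness.
simple_bleu("press the save button", "click the save button")
```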

Human Evaluation

Human evaluation remains the most reliable method. Common frameworks:

MQM (Multidimensional Quality Metrics): A structured framework that categorizes errors by:

  • Accuracy: Mistranslation, omission, addition
  • Fluency: Grammar, spelling, punctuation
  • Terminology: Wrong term, inconsistent terminology
  • Style: Register, formality, locale convention

Each error is weighted by severity (critical, major, minor). The total weighted error score gives a quality rating.
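
A minimal MQM-style scorer can make this concrete. The severity weights (10/5/1) and the per-100-words normalization below are common illustrative conventions, not fixed by the framework; real deployments tune both.

```python
# Illustrative severity weights; MQM deployments tune these.
SEVERITY_WEIGHTS = {"critical": 10, "major": 5, "minor": 1}

def mqm_score(errors: list[tuple[str, str]], word_count: int) -> float:
    """Weighted MQM error score, normalized per 100 source words.

    `errors` is a list of (category, severity) pairs, e.g.
    ("accuracy/omission", "major"). Lower is better; 0 is perfect.
    """
    penalty = sum(SEVERITY_WEIGHTS[sev] for _, sev in errors)
    return penalty / word_count * 100

errors = [("accuracy/mistranslation", "major"),
          ("fluency/punctuation", "minor"),
          ("terminology/inconsistent", "minor")]
mqm_score(errors, word_count=250)  # 7 penalty points / 250 words -> 2.8
```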

Direct Assessment: Human evaluators rate translations on a continuous scale (0-100) for adequacy (does it convey the meaning?) and fluency (does it sound natural?).

Quality Estimation (Reference-Free)

Quality estimation models predict translation quality without a human reference. They're trained on human quality judgments and can:

  • Flag low-quality translations for review
  • Prioritize post-editing effort
  • Provide real-time quality feedback in TMS interfaces
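
Once a QE model supplies per-segment scores, routing is straightforward. This sketch assumes scores in [0, 1] with higher meaning better, and the 0.7 review threshold is an arbitrary placeholder to tune per language pair:

```python
def route_segments(segments: list[str], qe_scores: list[float],
                   threshold: float = 0.7) -> tuple[list[str], list[str]]:
    """Split segments into auto-approve vs. needs-review by QE score.

    Assumes QE scores in [0, 1] (higher = better); the 0.7 threshold
    is illustrative -- calibrate it against post-editing data.
    """
    approved, review = [], []
    for seg, score in zip(segments, qe_scores):
        (approved if score >= threshold else review).append(seg)
    return approved, review

segs = ["Save", "Open file", "Raining cats and dogs"]
approved, review = route_segments(segs, [0.92, 0.81, 0.43])
# The low-scoring idiom lands in the review queue
```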

Improving Machine Translation Quality

1. Write Translation-Friendly Source Text

MT quality starts with source text quality:

  • Use simple, clear sentences: Avoid complex nested clauses
  • Avoid ambiguity: "Right" (correct? or directional?) — be specific
  • Minimize idioms and colloquialisms: "Heads up" → "Notice" or "Alert"
  • Keep strings self-contained: Don't split sentences across multiple translation keys
  • Provide context: Add descriptions or screenshots for translators (and for context-aware MT)
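
Parts of these guidelines can be linted automatically before strings reach the MT engine. A sketch of such a source-string linter; the idiom list and the 25-word limit are illustrative assumptions, not established rules:

```python
# Illustrative lint rules for translation-friendly source text.
IDIOMS = {"heads up": "notice/alert",
          "touch base": "contact",
          "raining cats and dogs": "raining heavily"}
MAX_WORDS = 25  # long sentences translate worse; the limit is illustrative

def lint_source_string(text: str) -> list[str]:
    """Return human-readable warnings for translation-unfriendly text."""
    warnings = []
    lowered = text.lower()
    for idiom, plain in IDIOMS.items():
        if idiom in lowered:
            warnings.append(f"idiom '{idiom}': consider '{plain}'")
    if len(text.split()) > MAX_WORDS:
        warnings.append(f"sentence longer than {MAX_WORDS} words")
    return warnings

lint_source_string("Heads up: the export may fail.")
# -> ["idiom 'heads up': consider 'notice/alert'"]
```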

2. Use Custom Glossaries

Enforce consistent terminology by creating a glossary of product-specific terms with their approved translations per language. Most TMS platforms and MT APIs support glossary enforcement.
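
Even when the MT API enforces the glossary, a post-hoc check catches regressions. A naive sketch using substring matching (real checks need tokenization, casing, and morphology handling; the terms shown are examples):

```python
# Example glossary for French; terms are illustrative.
GLOSSARY_FR = {"Dashboard": "Tableau de bord", "Settings": "Paramètres"}

def check_glossary(source: str, translation: str,
                   glossary: dict[str, str]) -> list[str]:
    """Report glossary terms present in the source whose approved
    translation is missing from the MT output. Naive substring
    matching -- real checks need tokenization and morphology."""
    violations = []
    for term, approved in glossary.items():
        if term in source and approved not in translation:
            violations.append(f"'{term}' should be '{approved}'")
    return violations

check_glossary("Open the Dashboard", "Ouvrez le Panneau de contrôle",
               GLOSSARY_FR)
# -> ["'Dashboard' should be 'Tableau de bord'"]
```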

3. Leverage Translation Memory

Translation memory ensures previously approved translations are reused exactly. New MT suggestions are only generated for content not found in TM, reducing the overall error surface.
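
The TM-first flow reduces to an exact-match lookup before any MT call. A minimal sketch; `mt_translate` here is a stand-in for a real MT API client, not an actual library function:

```python
def translate_batch(strings, tm, mt_translate):
    """Reuse exact TM matches; send only TM misses to the MT engine.

    `tm` maps source -> approved translation; `mt_translate` is any
    callable translating a single string (a stand-in for a real MT
    API call). Fuzzy TM matching is out of scope for this sketch.
    """
    results = {}
    for s in strings:
        results[s] = tm[s] if s in tm else mt_translate(s)
    return results

tm = {"Save": "Enregistrer"}
out = translate_batch(["Save", "Open"], tm,
                      mt_translate=lambda s: f"MT({s})")
# "Save" comes straight from TM; only "Open" reaches the MT engine
```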

4. Implement Structured Post-Editing

MTPE (Machine Translation Post-Editing) workflows come in two levels:

  • Light post-editing: Fix errors that change meaning or are clearly unnatural. Accept "good enough" translations. Appropriate for internal content or lower-priority languages.
  • Full post-editing: Edit the MT output to the quality of a professional human translation. Appropriate for customer-facing content in primary markets.

Define which level applies to each content type and language pair.

5. Provide Context to MT Engines

When available, send contextual information alongside source strings:

  • File/key context: The filename or key prefix helps MT infer domain
  • Previous/next strings: Surrounding strings help with consistency
  • UI screenshots: Visual context reduces ambiguity
  • String descriptions: Developer-provided notes about what a string does

6. Monitor and Iterate

Track MT quality over time:

  • Calculate average post-editing distance per language pair
  • Identify consistently problematic content patterns
  • Update glossaries based on common corrections
  • Consider domain adaptation for language pairs with persistent quality issues
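
Post-editing distance is simple to compute from the raw MT output and the final edited version. A word-level Levenshtein sketch (character-level variants and TER-style shifts are common refinements not shown here):

```python
def edit_distance(a: str, b: str) -> int:
    """Word-level Levenshtein distance via dynamic programming."""
    aw, bw = a.split(), b.split()
    prev = list(range(len(bw) + 1))
    for i, wa in enumerate(aw, 1):
        curr = [i]
        for j, wb in enumerate(bw, 1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def post_edit_distance(mt_output: str, post_edited: str) -> float:
    """Fraction of the post-edited text changed relative to raw MT
    (word-level, normalized) -- a rough per-segment effort signal."""
    return edit_distance(mt_output, post_edited) / max(
        len(post_edited.split()), 1)

# Two of six words were corrected by the post-editor
post_edit_distance("Cliquez sur le Panneau de contrôle",
                   "Cliquez sur le Tableau de bord")
```

Averaging this per language pair over time shows where MT is improving and where glossary or domain-adaptation work should be focused.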

FAQ

What is an acceptable MT quality level for production content?

It depends on the content type and audience. For customer-facing product UI, MT output typically needs full post-editing to reach production quality. For help documentation, light post-editing may suffice. For internal communications, raw MT may be acceptable. Define quality tiers by content type and apply the appropriate review level.

How do BLEU scores translate to real-world quality?

BLEU scores are relative, not absolute. A BLEU score of 30+ generally indicates understandable translations, while 50+ suggests high quality. However, these numbers vary significantly by language pair and domain. BLEU is best used for comparing systems or tracking quality changes over time, not for making absolute quality judgments about individual translations.

Should I invest in custom MT model training?

Custom model training is worthwhile when: (a) your domain has specialized vocabulary that generic MT handles poorly, (b) you have sufficient parallel training data (typically 10,000+ sentence pairs), and (c) the language pairs you need are high-volume enough to justify the investment. For most teams, glossaries and translation memory provide substantial quality improvements before custom model training becomes necessary.