Table of Contents
- Translation Quality Metrics: How to Measure and Improve
- Why Quality Measurement Matters
- The Quality Evaluation Framework Landscape
- MQM: Multidimensional Quality Metrics
- LISA QA Model
- SAE J2450
- TAUS Dynamic Quality Framework (DQF)
- Automated Quality Metrics
- BLEU (Bilingual Evaluation Understudy)
- COMET (Crosslingual Optimized Metric for Evaluation of Translation)
- TER (Translation Edit Rate)
- ChrF (Character F-score)
- Linguistic Quality Assurance (LQA) Process
- Designing an LQA Program
- LQA Report Structure
- Measuring Quality at Scale: Business Metrics
- Customer Support Volume by Language
- Conversion Rate by Locale
- User Retention by Language
- App Store Ratings and Reviews by Language
- Terminology Consistency Score
- Setting Quality Standards and SLAs
- Continuous Improvement Through Quality Data
- Quality Management for Different Translation Approaches
- Take your app global with better-i18n
Translation Quality Metrics: How to Measure and Improve
"Quality" in translation is notoriously difficult to define and measure. A translation can be accurate but stilted. Fluent but unfaithful. Terminologically correct but culturally tone-deaf. And what counts as "high quality" for a technical manual differs fundamentally from what counts as quality for a marketing campaign.
Despite this complexity, translation quality measurement is essential for any organization running a localization program at scale. Without metrics, you can't identify quality problems, improve vendor relationships, make data-driven tool decisions, or demonstrate ROI to stakeholders.
This guide covers the major frameworks, tools, and approaches for measuring translation quality—and how to use those measurements to drive continuous improvement.
Why Quality Measurement Matters
Organizations that don't measure translation quality systematically tend to discover quality problems through:
- Customer complaints about confusing or incorrect translations
- Support tickets from non-English speaking markets
- Legal issues from mistranslated compliance content
- Failed product launches in localized markets
- Expensive rework after content has already been published
Proactive quality measurement catches problems earlier, when they're cheaper to fix. It also creates accountability in vendor relationships and enables objective comparison of MT tools, translation vendors, and workflow changes.
The Quality Evaluation Framework Landscape
MQM: Multidimensional Quality Metrics
MQM (Multidimensional Quality Metrics) is the most comprehensive and widely adopted framework in professional localization. Developed by the QTLaunchPad project and adopted by ASTM International as F3131, MQM provides a hierarchical taxonomy of translation error types.
MQM error categories (top level):
| Category | Description |
|---|---|
| Accuracy | The translation does not faithfully represent the source |
| Fluency | The translation is not natural in the target language |
| Terminology | Terms do not match the approved glossary or domain conventions |
| Style | The translation violates style guidelines |
| Locale convention | Numbers, dates, addresses formatted incorrectly for locale |
| Verity | Claims in the translation are factually incorrect |
Each category has subcategories. For example, Accuracy includes: mistranslation, omission, addition, untranslated content, and structural errors.
MQM scoring: Each error is classified by type and severity (critical, major, minor). A weighted score is calculated:
MQM score = (critical × 25 + major × 5 + minor × 1) / word count × 1000
Lower scores are better. Industry benchmarks vary, but common thresholds:
- < 1.0: Excellent quality
- 1.0–3.0: Acceptable quality
- 3.0–5.0: Needs improvement
- > 5.0: Unacceptable quality
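To make the arithmetic concrete, here is a minimal sketch of the weighted score in Python (the function name and error counts are illustrative, not part of the MQM specification):

```python
def mqm_score(critical: int, major: int, minor: int, word_count: int) -> float:
    """Weighted MQM penalty score, normalized per 1,000 words (lower is better)."""
    penalty = critical * 25 + major * 5 + minor * 1
    return penalty / word_count * 1000

# Example: 1 critical, 3 major, and 8 minor errors in a 10,000-word sample
print(mqm_score(1, 3, 8, 10_000))  # 4.8 -> falls in the "needs improvement" band
```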
LISA QA Model
The LISA (Localization Industry Standards Association) QA model predates MQM and is simpler. It classifies errors as:
- Mistranslation
- Accuracy
- Terminology
- Language (grammar, spelling, punctuation)
- Style
- Country/locale standard
- Formatting
LISA QA is still used widely, particularly in older enterprise localization programs. It's less nuanced than MQM but simpler to implement.
SAE J2450
A simplified error taxonomy developed by the automotive industry. Seven error types: wrong term, syntactic error, omission, word structure or agreement error, misspelling, punctuation error, and miscellaneous error. Used in automotive and related industries.
TAUS Dynamic Quality Framework (DQF)
The TAUS (Translation Automation User Society) DQF provides simplified quality assessment tools designed for use at scale. It includes:
- Adequacy scale (1–4): Does the translation convey the meaning of the source?
- Fluency scale (1–4): How fluent is the language in the translation?
DQF tools are available in major CAT tools and TMS platforms, making it practical for high-volume assessment.
Automated Quality Metrics
Human evaluation is the gold standard but doesn't scale to millions of words. Automated metrics serve as proxies for human judgment at scale.
BLEU (Bilingual Evaluation Understudy)
BLEU measures the overlap between an MT output (or translated text) and one or more human reference translations. It calculates n-gram precision (how many word sequences in the translation appear in the references) with a brevity penalty for translations that are too short.
Interpretation: BLEU scores range from 0–100. Higher is better. But BLEU correlates poorly with human judgments at the segment level—it's a corpus-level metric only useful for comparing systems, not for evaluating individual translations.
Use case: Comparing MT engines or measuring improvement after engine retraining. Not useful for individual segment QA.
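As an illustration, comparing two MT engines at the corpus level might look like this sketch using the open-source sacrebleu library (the engine outputs and references below are invented):

```python
import sacrebleu  # pip install sacrebleu

# Hypothetical outputs from two MT engines for the same two source segments
engine_a = ["The cat sat on the mat.", "Open the settings menu."]
engine_b = ["The cat is sitting on a mat.", "Open the menu of settings."]
# One reference stream, aligned segment-by-segment with the hypotheses
references = [["The cat sat on the mat.", "Open the settings menu."]]

bleu_a = sacrebleu.corpus_bleu(engine_a, references)
bleu_b = sacrebleu.corpus_bleu(engine_b, references)
print(f"Engine A: {bleu_a.score:.1f}  Engine B: {bleu_b.score:.1f}")
```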
COMET (Crosslingual Optimized Metric for Evaluation of Translation)
COMET uses a neural network trained on human quality judgments to predict quality scores. It correlates significantly better with human evaluations than BLEU, particularly at the segment level.
Use case: Evaluating MT quality, comparing engines, predicting post-editing effort. Increasingly used in production MT quality estimation pipelines.
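A sketch using the open-source Unbabel comet package (the checkpoint name is one commonly used reference-based model, and the segment data is invented; newer checkpoints may exist):

```python
from comet import download_model, load_from_checkpoint  # pip install unbabel-comet

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {"src": "Öffnen Sie das Einstellungsmenü.",
     "mt": "Open the settings menu.",
     "ref": "Open the settings menu."},
]
# Returns segment-level scores plus a system-level average
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores, output.system_score)
```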
TER (Translation Edit Rate)
TER measures the number of edits (insertions, deletions, substitutions, and word shifts) needed to transform the MT output into the reference translation, normalized by reference length. Lower TER = fewer edits needed = better quality.
Use case: Estimating post-editing effort. Can be used to route segments: low-TER segments to post-editing, high-TER segments to human translation from scratch.
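One way to sketch that routing logic, assuming reference translations are available (for example, from a pilot post-editing pass) and using an illustrative threshold of 40:

```python
import sacrebleu  # pip install sacrebleu

def route_segment(mt_output: str, reference: str, threshold: float = 40.0) -> str:
    """Route by edit distance: low TER -> post-edit, high TER -> retranslate."""
    ter = sacrebleu.sentence_ter(mt_output, [reference]).score  # 0 = identical to reference
    return "post-edit" if ter <= threshold else "translate from scratch"

print(route_segment("Open the settings menu.", "Open the settings menu."))    # post-edit (TER = 0)
print(route_segment("Settings the menu opening.", "Open the settings menu."))  # likely routed to retranslation
```

In production, the same routing decision is more often driven by reference-free quality estimation, since references rarely exist for new content.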
ChrF (Character F-score)
ChrF works at the character level rather than word level. It performs better than BLEU for morphologically rich languages (German, Turkish, Finnish) where word-level matching misses many correct translations that use different morphological forms.
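A tiny sketch of the difference, again with sacrebleu, using invented German segments where only the inflection differs:

```python
import sacrebleu  # pip install sacrebleu

hyp = ["Die Katze saß auf den Matten."]   # inflectional variant of the reference
ref = [["Die Katze saß auf der Matte."]]

print(sacrebleu.corpus_bleu(hyp, ref).score)  # word-level: no credit for "den Matten"
print(sacrebleu.corpus_chrf(hyp, ref).score)  # character-level: partial credit for shared stems
```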
Linguistic Quality Assurance (LQA) Process
Automated metrics catch some errors but miss many quality dimensions—particularly style, cultural appropriateness, and terminology that isn't in the reference glossary. LQA is the human complement to automated metrics.
Designing an LQA Program
Sampling strategy: You can't evaluate every translated word. Determine your sampling approach:
- Random sampling: Evaluate X% of all translated content, randomly selected
- Risk-based sampling: Higher sampling rates for high-visibility or high-risk content (legal, medical, marketing)
- Stratified sampling: Ensure representation across content types, language pairs, and translators/vendors
A common sampling rate is 5–10% of word volume, with 100% evaluation for critical content types.
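One way to encode such a policy is a per-content-type rate table; the tiers, rates, and field names below are illustrative:

```python
import random

# Illustrative sampling rates by content type (1.0 = evaluate everything)
SAMPLING_RATES = {"legal": 1.0, "marketing": 0.10, "ui": 0.10, "help": 0.05}

def select_for_lqa(segments: list[dict], seed: int = 42) -> list[dict]:
    """Randomly sample segments for LQA at the rate set for their content type."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible and auditable
    return [s for s in segments
            if rng.random() < SAMPLING_RATES.get(s["content_type"], 0.05)]
```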
Evaluator qualifications: LQA evaluators must be:
- Native speakers of the target language
- Subject matter experts (for specialized content)
- Trained in the specific error taxonomy and scoring methodology
- Not the same person who translated the content
Calibration: Before beginning LQA, calibrate evaluators by having multiple evaluators score the same sample and comparing results. Unresolved disagreements become calibration discussions. Periodic re-calibration keeps evaluators aligned as guidelines evolve.
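A common way to quantify how well evaluators agree during calibration is an inter-rater agreement statistic such as Cohen's kappa; here is a sketch using scikit-learn (the severity labels and the 0.6 rule of thumb are illustrative):

```python
from sklearn.metrics import cohen_kappa_score  # pip install scikit-learn

# Severity ratings from two evaluators on the same ten segments
rater_a = ["minor", "major", "none", "minor", "none", "critical", "minor", "none", "major", "none"]
rater_b = ["minor", "major", "none", "major", "none", "critical", "minor", "minor", "major", "none"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values above ~0.6 are often treated as acceptable agreement
```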
LQA Report Structure
A useful LQA report includes:
- Overall quality score and error distribution by type and severity
- Top error types and frequency
- Examples of each error type with corrections
- Trend data (is quality improving or declining?)
- Actionable recommendations for the translator/vendor
Measuring Quality at Scale: Business Metrics
Linguistic quality metrics measure the translation itself. Business metrics measure the impact of translation quality on user behavior and business outcomes.
Customer Support Volume by Language
If translation quality is poor, non-English speakers generate more support tickets. Track support ticket volume per language, normalized by user population. Persistently higher rates in specific languages indicate quality or localization gaps.
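The normalization itself is simple; a sketch with invented numbers:

```python
# Monthly support tickets and active users by language (numbers are invented)
tickets = {"en": 4200, "de": 610, "ja": 890}
active_users = {"en": 1_200_000, "de": 150_000, "ja": 130_000}

for lang in tickets:
    rate = tickets[lang] / active_users[lang] * 1000
    print(f"{lang}: {rate:.1f} tickets per 1,000 users")
# ja runs at roughly double the en rate here -> investigate the Japanese localization
```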
Conversion Rate by Locale
For e-commerce, SaaS, or app downloads, track conversion rates by locale. Significant underperformance in specific locales often correlates with translation quality issues (but also UX, cultural fit, or pricing factors—triangulate with other data).
User Retention by Language
Track 30-day, 90-day, and annual retention rates by user language. Poor localization quality can manifest as churn rather than immediate complaints.
App Store Ratings and Reviews by Language
Monitor app store ratings broken down by language. Qualitative review mining (using MT to read reviews in other languages, ironically) can surface specific quality complaints.
Terminology Consistency Score
Track the percentage of approved glossary terms correctly applied in translated content. Automated glossary checking in your TMS or QA tool can generate this metric across all content.
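A naive version of that check might look like the sketch below; real TMS and QA tools handle inflection, casing, and tokenization far more robustly:

```python
def terminology_consistency(segments: list[tuple[str, str]],
                            glossary: dict[str, str]) -> float:
    """Share of glossary source terms whose approved target term appears in the translation."""
    hits = total = 0
    for source, target in segments:
        for src_term, tgt_term in glossary.items():
            if src_term.lower() in source.lower():
                total += 1
                hits += tgt_term.lower() in target.lower()
    return hits / total if total else 1.0

glossary = {"dashboard": "Dashboard", "sign in": "anmelden"}
segments = [("Open the dashboard.", "Öffnen Sie das Dashboard."),
            ("Sign in to continue.", "Loggen Sie sich ein, um fortzufahren.")]
print(f"{terminology_consistency(segments, glossary):.0%}")  # 50%: "sign in" missed the approved term
```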
Setting Quality Standards and SLAs
Quality metrics are only useful if they're tied to standards and accountability. When working with translation vendors:
Define quality tiers by content type:
- Tier 1 (mission-critical): Legal, compliance, product UI → MQM < 1.0
- Tier 2 (customer-facing): Marketing, help content → MQM < 2.5
- Tier 3 (internal/low-risk): Internal docs, drafts → MQM < 5.0
Establish LQA feedback loops: Share LQA results with vendors. Require vendors to analyze error patterns and submit quality improvement plans. Track improvement over time.
Penalty and remediation clauses: For critical content, include contractual remediation requirements (the vendor re-does content that fails the quality threshold at no charge) and penalty clauses for systemic quality failures.
See translation management systems for how TMS platforms support quality tracking and vendor management.
Continuous Improvement Through Quality Data
Quality measurement is most valuable when it drives continuous improvement:
Root cause analysis: When quality scores are poor, trace errors to their source:
- Source content quality (ambiguous, poorly written source → poor translation)
- Insufficient context (translator didn't have necessary reference materials)
- Terminology gaps (term not in glossary → inconsistent translation)
- Translator skill gap (specific error type from a specific translator/vendor)
- Process failure (insufficient time, inadequate review step)
Feedback loops: Return error analysis to translators and post-editors with specific, actionable feedback. Generic "quality was poor" feedback doesn't improve future output.
Glossary updates: Every terminology error is a signal that the glossary needs updating or better distribution. Build a process for translators to flag new terminology for glossary review. Learn more about translation glossary management.
Process experiments: Use quality metrics to evaluate process changes. Did adding a second review step improve quality? Did switching MT engines reduce post-editing effort? Quality data answers these questions objectively.
Training needs identification: Patterns of specific error types across translators often indicate training needs. If multiple translators are making the same type of error, the issue may be unclear guidelines rather than individual translator skill.
Quality Management for Different Translation Approaches
Quality standards and measurement approaches differ by translation method:
Human translation: Apply full MQM evaluation to LQA samples. Expect high scores but watch for terminology inconsistency and style deviations.
MT + post-editing: Track both MT raw quality (automated metrics) and post-edited quality (LQA). Also measure post-editing effort. See machine translation post-editing for workflow details.
AI translation: See AI translation vs. human translation for quality expectations by content type and how to measure AI translation quality effectively.
Take your app global with better-i18n
better-i18n combines AI-powered translations, git-native workflows, and global CDN delivery into one developer-first platform. Stop managing spreadsheets and start shipping in every language.