Table of Contents
- Translation Quality Metrics: How to Measure and Improve
- Why Quality Measurement Matters
- The Quality Evaluation Framework Landscape
- MQM: Multidimensional Quality Metrics
- LISA QA Model
- SAE J2450
- TAUS Dynamic Quality Framework (DQF)
- Automated Quality Metrics
- BLEU (Bilingual Evaluation Understudy)
- COMET (Crosslingual Optimized Metric for Evaluation of Translation)
- TER (Translation Edit Rate)
- ChrF (Character F-score)
- Linguistic Quality Assurance (LQA) Process
- Designing an LQA Program
- LQA Report Structure
- Measuring Quality at Scale: Business Metrics
- Customer Support Volume by Language
- Conversion Rate by Locale
- User Retention by Language
- App Store Ratings and Reviews by Language
- Terminology Consistency Score
- Setting Quality Standards and SLAs
- Continuous Improvement Through Quality Data
- Quality Management for Different Translation Approaches
- Take your app global with better-i18n
Translation Quality Metrics: How to Measure and Improve
"Quality" in translation is notoriously difficult to define and measure. A translation can be accurate but stilted. Fluent but unfaithful. Terminologically correct but culturally tone-deaf. And what counts as "high quality" for a technical manual differs fundamentally from what counts as quality for a marketing campaign.
Despite this complexity, translation quality measurement is essential for any organization running a localization program at scale. Without metrics, you can't identify quality problems, improve vendor relationships, make data-driven tool decisions, or demonstrate ROI to stakeholders.
This guide covers the major frameworks, tools, and approaches for measuring translation quality—and how to use those measurements to drive continuous improvement.
Why Quality Measurement Matters
Organizations that don't measure translation quality systematically tend to discover quality problems through:
- Customer complaints about confusing or incorrect translations
- Support tickets from non-English speaking markets
- Legal issues from mistranslated compliance content
- Failed product launches in localized markets
- Expensive rework after content has already been published
Proactive quality measurement catches problems earlier, when they're cheaper to fix. It also creates accountability in vendor relationships and enables objective comparison of MT tools, translation vendors, and workflow changes.
The Quality Evaluation Framework Landscape
MQM: Multidimensional Quality Metrics
MQM (Multidimensional Quality Metrics) is the most comprehensive and widely adopted framework in professional localization. Developed by the QTLaunchPad project and adopted by ASTM International as F3131, MQM provides a hierarchical taxonomy of translation error types.
MQM error categories (top level):
| Category | Description |
|---|---|
| Accuracy | The translation does not faithfully represent the source |
| Fluency | The translation is not natural in the target language |
| Terminology | Terms do not match the approved glossary or domain conventions |
| Style | The translation violates style guidelines |
| Locale convention | Numbers, dates, addresses formatted incorrectly for locale |
| Verity | Claims in the translation are factually incorrect |
Each category has subcategories. For example, Accuracy includes: mistranslation, omission, addition, untranslated content, and structural errors.
MQM scoring: Each error is classified by type and severity (critical, major, minor). A weighted score is calculated:
MQM score = (critical × 25 + major × 5 + minor × 1) / word count × 1000
Lower scores are better. Industry benchmarks vary, but common thresholds:
- < 1.0: Excellent quality
- 1.0–3.0: Acceptable quality
- 3.0–5.0: Needs improvement
- > 5.0: Unacceptable quality
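To make the arithmetic concrete, here is a minimal sketch of the weighted score in Python (the function name and error counts are illustrative, not part of the MQM specification):

```python
def mqm_score(critical: int, major: int, minor: int, word_count: int) -> float:
    """Weighted MQM penalty score, normalized per 1,000 words (lower is better)."""
    penalty = critical * 25 + major * 5 + minor * 1
    return penalty / word_count * 1000

# Example: 1 critical, 3 major, and 8 minor errors in a 10,000-word sample
print(mqm_score(1, 3, 8, 10_000))  # 4.8 -> falls in the "needs improvement" band
```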
LISA QA Model
The LISA (Localization Industry Standards Association) QA model predates MQM and is simpler. It classifies errors as:
- Mistranslation
- Accuracy
- Terminology
- Language (grammar, spelling, punctuation)
- Style
- Country/locale standard
- Formatting
LISA QA is still used widely, particularly in older enterprise localization programs. It's less nuanced than MQM but simpler to implement.
SAE J2450
A simplified error taxonomy developed by the automotive industry. Seven error types: wrong term, syntactic error, omission, word structure or agreement error, misspelling, punctuation error, and miscellaneous error. Used in automotive and related industries.
TAUS Dynamic Quality Framework (DQF)
The TAUS (Translation Automation User Society) DQF provides simplified quality assessment tools designed for use at scale. It includes:
- Adequacy scale (1–4): Does the translation convey the meaning of the source?
- Fluency scale (1–4): How fluent is the language in the translation?
DQF tools are available in major CAT tools and TMS platforms, making it practical for high-volume assessment.
Automated Quality Metrics
Human evaluation is the gold standard but doesn't scale to millions of words. Automated metrics serve as proxies for human judgment at scale.
BLEU (Bilingual Evaluation Understudy)
BLEU measures the overlap between an MT output (or translated text) and one or more human reference translations. It calculates n-gram precision (how many word sequences in the translation appear in the references) with a brevity penalty for translations that are too short.
Interpretation: BLEU scores range from 0–100. Higher is better. But BLEU correlates poorly with human judgments at the segment level—it's a corpus-level metric only useful for comparing systems, not for evaluating individual translations.
Use case: Comparing MT engines or measuring improvement after engine retraining. Not useful for individual segment QA.
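As an illustration, comparing two MT engines at the corpus level might look like this sketch using the open-source sacrebleu library (the engine outputs and references below are invented):

```python
import sacrebleu  # pip install sacrebleu

# Hypothetical outputs from two MT engines for the same two source segments
engine_a = ["The cat sat on the mat.", "Open the settings menu."]
engine_b = ["The cat is sitting on a mat.", "Open the menu of settings."]
# One reference stream, aligned segment-by-segment with the hypotheses
references = [["The cat sat on the mat.", "Open the settings menu."]]

bleu_a = sacrebleu.corpus_bleu(engine_a, references)
bleu_b = sacrebleu.corpus_bleu(engine_b, references)
print(f"Engine A: {bleu_a.score:.1f}  Engine B: {bleu_b.score:.1f}")
```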
COMET (Crosslingual Optimized Metric for Evaluation of Translation)
COMET uses a neural network trained on human quality judgments to predict quality scores. It correlates significantly better with human evaluations than BLEU, particularly at the segment level.
Use case: Evaluating MT quality, comparing engines, predicting post-editing effort. Increasingly used in production MT quality estimation pipelines.
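A sketch using the open-source Unbabel comet package (the checkpoint name is one commonly used reference-based model, and the segment data is invented; newer checkpoints may exist):

```python
from comet import download_model, load_from_checkpoint  # pip install unbabel-comet

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {"src": "Öffnen Sie das Einstellungsmenü.",
     "mt": "Open the settings menu.",
     "ref": "Open the settings menu."},
]
# Returns segment-level scores plus a system-level average
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores, output.system_score)
```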
TER (Translation Edit Rate)
TER measures the number of edits (insertions, deletions, substitutions, and word shifts) needed to transform the MT output into the reference translation, normalized by reference length. Lower TER = fewer edits needed = better quality.
Use case: Estimating post-editing effort. Can be used to route segments: low-TER segments to post-editing, high-TER segments to human translation from scratch.
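One way to sketch that routing logic, assuming reference translations are available (for example, from a pilot post-editing pass) and using an illustrative threshold of 40:

```python
import sacrebleu  # pip install sacrebleu

def route_segment(mt_output: str, reference: str, threshold: float = 40.0) -> str:
    """Route by edit distance: low TER -> post-edit, high TER -> retranslate."""
    ter = sacrebleu.sentence_ter(mt_output, [reference]).score  # 0 = identical to reference
    return "post-edit" if ter <= threshold else "translate from scratch"

print(route_segment("Open the settings menu.", "Open the settings menu."))    # post-edit (TER = 0)
print(route_segment("Settings the menu opening.", "Open the settings menu."))  # likely routed to retranslation
```

In production, the same routing decision is more often driven by reference-free quality estimation, since references rarely exist for new content.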
ChrF (Character F-score)
ChrF works at the character level rather than word level. It performs better than BLEU for morphologically rich languages (German, Turkish, Finnish) where word-level matching misses many correct translations that use different morphological forms.
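A tiny sketch of the difference, again with sacrebleu, using invented German segments where only the inflection differs:

```python
import sacrebleu  # pip install sacrebleu

hyp = ["Die Katze saß auf den Matten."]   # inflectional variant of the reference
ref = [["Die Katze saß auf der Matte."]]

print(sacrebleu.corpus_bleu(hyp, ref).score)  # word-level: no credit for "den Matten"
print(sacrebleu.corpus_chrf(hyp, ref).score)  # character-level: partial credit for shared stems
```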
Linguistic Quality Assurance (LQA) Process
Automated metrics catch some errors but miss many quality dimensions—particularly style, cultural appropriateness, and terminology that isn't in the reference glossary. LQA is the human complement to automated metrics.
Designing an LQA Program
Sampling strategy: You can't evaluate every translated word. Determine your sampling approach:
- Random sampling: Evaluate X% of all translated content, randomly selected
- Risk-based sampling: Higher sampling rates for high-visibility or high-risk content (legal, medical, marketing)
- Stratified sampling: Ensure representation across content types, language pairs, and translators/vendors
A common sampling rate is 5–10% of word volume, with 100% evaluation for critical content types.
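One way to encode such a policy is a per-content-type rate table; the tiers, rates, and field names below are illustrative:

```python
import random

# Illustrative sampling rates by content type (1.0 = evaluate everything)
SAMPLING_RATES = {"legal": 1.0, "marketing": 0.10, "ui": 0.10, "help": 0.05}

def select_for_lqa(segments: list[dict], seed: int = 42) -> list[dict]:
    """Randomly sample segments for LQA at the rate set for their content type."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible and auditable
    return [s for s in segments
            if rng.random() < SAMPLING_RATES.get(s["content_type"], 0.05)]
```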
Evaluator qualifications: LQA evaluators must be:
- Native speakers of the target language
- Subject matter experts (for specialized content)
- Trained in the specific error taxonomy and scoring methodology
- Not the same person who translated the content
Calibration: Before beginning LQA, calibrate evaluators by having multiple evaluators score the same sample and comparing results. Unresolved disagreements become calibration discussions. Periodic re-calibration keeps evaluators aligned as guidelines evolve.
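A common way to quantify how well evaluators agree during calibration is an inter-rater agreement statistic such as Cohen's kappa; here is a sketch using scikit-learn (the severity labels and the 0.6 rule of thumb are illustrative):

```python
from sklearn.metrics import cohen_kappa_score  # pip install scikit-learn

# Severity ratings from two evaluators on the same ten segments
rater_a = ["minor", "major", "none", "minor", "none", "critical", "minor", "none", "major", "none"]
rater_b = ["minor", "major", "none", "major", "none", "critical", "minor", "minor", "major", "none"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values above ~0.6 are often treated as acceptable agreement
```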
LQA Report Structure
A useful LQA report includes:
- Overall quality score and error distribution by type and severity
- Top error types and frequency
- Examples of each error type with corrections
- Trend data (is quality improving or declining?)
- Actionable recommendations for the translator/vendor
Measuring Quality at Scale: Business Metrics
Linguistic quality metrics measure the translation itself. Business metrics measure the impact of translation quality on user behavior and business outcomes.
Customer Support Volume by Language
If translation quality is poor, non-English speakers generate more support tickets. Track support ticket volume per language, normalized by user population. Persistently higher rates in specific languages indicate quality or localization gaps.
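The normalization itself is simple; a sketch with invented numbers:

```python
# Monthly support tickets and active users by language (numbers are invented)
tickets = {"en": 4200, "de": 610, "ja": 890}
active_users = {"en": 1_200_000, "de": 150_000, "ja": 130_000}

for lang in tickets:
    rate = tickets[lang] / active_users[lang] * 1000
    print(f"{lang}: {rate:.1f} tickets per 1,000 users")
# ja runs at roughly double the en rate here -> investigate the Japanese localization
```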
Conversion Rate by Locale
For e-commerce, SaaS, or app downloads, track conversion rates by locale. Significant underperformance in specific locales often correlates with translation quality issues (but also UX, cultural fit, or pricing factors—triangulate with other data).
User Retention by Language
Track 30-day, 90-day, and annual retention rates by user language. Poor localization quality can manifest as churn rather than immediate complaints.
App Store Ratings and Reviews by Language
Monitor app store ratings broken down by language. Qualitative review mining (using MT to read reviews in other languages, ironically) can surface specific quality complaints.
Terminology Consistency Score
Track the percentage of approved glossary terms correctly applied in translated content. Automated glossary checking in your TMS or QA tool can generate this metric across all content.
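A naive version of that check might look like the sketch below; real TMS and QA tools handle inflection, casing, and tokenization far more robustly:

```python
def terminology_consistency(segments: list[tuple[str, str]],
                            glossary: dict[str, str]) -> float:
    """Share of glossary source terms whose approved target term appears in the translation."""
    hits = total = 0
    for source, target in segments:
        for src_term, tgt_term in glossary.items():
            if src_term.lower() in source.lower():
                total += 1
                hits += tgt_term.lower() in target.lower()
    return hits / total if total else 1.0

glossary = {"dashboard": "Dashboard", "sign in": "anmelden"}
segments = [("Open the dashboard.", "Öffnen Sie das Dashboard."),
            ("Sign in to continue.", "Loggen Sie sich ein, um fortzufahren.")]
print(f"{terminology_consistency(segments, glossary):.0%}")  # 50%: "sign in" missed the approved term
```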
Setting Quality Standards and SLAs
Quality metrics are only useful if they're tied to standards and accountability. When working with translation vendors:
Define quality tiers by content type:
- Tier 1 (mission-critical): Legal, compliance, product UI → MQM < 1.0
- Tier 2 (customer-facing): Marketing, help content → MQM < 2.5
- Tier 3 (internal/low-risk): Internal docs, drafts → MQM < 5.0
Establish LQA feedback loops: Share LQA results with vendors. Require vendors to analyze error patterns and submit quality improvement plans. Track improvement over time.
Penalty and remediation clauses: For critical content, include contractual remediation requirements (the vendor re-does content that fails the quality threshold at no charge) and penalty clauses for systemic quality failures.
See translation management systems for how TMS platforms support quality tracking and vendor management.
Continuous Improvement Through Quality Data
Quality measurement is most valuable when it drives continuous improvement:
Root cause analysis: When quality scores are poor, trace errors to their source:
- Source content quality (ambiguous, poorly written source → poor translation)
- Insufficient context (translator didn't have necessary reference materials)
- Terminology gaps (term not in glossary → inconsistent translation)
- Translator skill gap (specific error type from a specific translator/vendor)
- Process failure (insufficient time, inadequate review step)
Feedback loops: Return error analysis to translators and post-editors with specific, actionable feedback. Generic "quality was poor" feedback doesn't improve future output.
Glossary updates: Every terminology error is a signal that the glossary needs updating or better distribution. Build a process for translators to flag new terminology for glossary review. Learn more about translation glossary management.
Process experiments: Use quality metrics to evaluate process changes. Did adding a second review step improve quality? Did switching MT engines reduce post-editing effort? Quality data answers these questions objectively.
Training needs identification: Patterns of specific error types across translators often indicate training needs. If multiple translators are making the same type of error, the issue may be unclear guidelines rather than individual translator skill.
Quality Management for Different Translation Approaches
Quality standards and measurement approaches differ by translation method:
Human translation: Apply full MQM evaluation to LQA samples. Expect high scores but watch for terminology inconsistency and style deviations.
MT + post-editing: Track both MT raw quality (automated metrics) and post-edited quality (LQA). Also measure post-editing effort. See machine translation post-editing for workflow details.
AI translation: See AI translation vs. human translation for quality expectations by content type and how to measure AI translation quality effectively.
Take your app global with better-i18n
better-i18n combines AI-powered translations, git-native workflows, and global CDN delivery into one developer-first platform. Stop managing spreadsheets and start shipping in every language.