Engineering

Large Language Models for Translation: How LLMs Compare to Traditional NMT

Eray Gündoğmuş

Key Takeaways

  • Large language models (LLMs) like GPT-4, Claude, and Gemini can perform translation tasks, but they differ fundamentally from dedicated neural machine translation (NMT) engines
  • LLMs excel at context-aware translation, handling ambiguity, and following style instructions — areas where traditional NMT struggles
  • Dedicated NMT engines (Google Translate, DeepL) are faster, cheaper per token, and more consistent for high-volume translation workloads
  • LLMs are particularly useful for creative content, marketing copy, and content that requires tone or style adaptation
  • The most effective approach for many teams combines NMT for bulk translation with LLM-based refinement for high-value content

How LLMs Approach Translation Differently

Traditional NMT engines are trained specifically on parallel corpora — pairs of sentences in source and target languages. They learn statistical patterns of how one language maps to another.

LLMs are trained on massive amounts of multilingual text from diverse sources. They learn language structure, meaning, and context at a deeper level. When asked to translate, they don't just pattern-match between languages — they understand the content and re-express it in the target language.

This fundamental difference has practical implications:

| Aspect | Traditional NMT | LLM-Based Translation |
| --- | --- | --- |
| Training | Parallel corpora (source ↔ target) | General multilingual text |
| Context window | Single sentence or paragraph | Thousands of tokens |
| Style control | Limited (glossaries, formality settings) | Instruction-following (prompts) |
| Speed | Very fast (milliseconds) | Slower (seconds) |
| Cost | Low ($10-20 per 1M characters) | Higher ($1-15 per 1M tokens) |
| Consistency | High for same input | May vary between calls |

Where LLMs Excel

Context-Aware Translation

LLMs can process entire documents or conversations, maintaining consistency and understanding references across paragraphs. A traditional NMT engine translating "It was cool" might not know whether "cool" means temperature or approval. An LLM processing the full document can infer the correct meaning.

Style and Tone Adaptation

LLMs can follow instructions like:

  • "Translate this marketing copy into French, maintaining an informal and energetic tone"
  • "Translate this legal document into German using formal register (Sie form)"
  • "Translate this UI string for a children's educational app — use simple, friendly language"

NMT engines have limited controls for style adaptation beyond basic formality settings.

Handling Ambiguity

When a source string like "Open" has multiple possible translations depending on context, LLMs can be prompted with additional context:

Translate the following UI button label to German.
Context: This button opens a file picker dialog.
Source: "Open"

This produces "Öffnen" (verb: to open) rather than "Offen" (adjective: open/available).
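Prompts like the one above are easy to assemble programmatically from string metadata. A minimal sketch in Python; the helper name and its parameters are illustrative, not any provider's API:

```python
def build_translation_prompt(source: str, target_lang: str, context: str) -> str:
    """Assemble a context-carrying prompt for a UI-string translation request.

    The wording mirrors the example above; in practice the context line
    might come from developer comments or screenshots in your TMS.
    """
    return (
        f"Translate the following UI button label to {target_lang}.\n"
        f"Context: {context}\n"
        f'Source: "{source}"'
    )

prompt = build_translation_prompt(
    "Open", "German", "This button opens a file picker dialog."
)
```

The same pattern extends to any disambiguating metadata: the string's screen, the surrounding copy, or the UI element type.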

Creative and Marketing Content

For content that requires transcreation — adapting the message rather than literally translating it — LLMs produce more natural results. Marketing slogans, taglines, and brand messaging often need cultural adaptation that goes beyond word-for-word translation.

Where Traditional NMT Is Better

Speed and Throughput

NMT engines process translations in milliseconds. LLMs require seconds per request. For applications that need real-time translation (chat, live content) or high-volume batch processing (millions of strings), dedicated NMT is significantly more efficient.

Cost at Scale

For high-volume translation workloads, NMT is substantially cheaper. Translating 1 million characters costs approximately $10-20 with most NMT APIs. The equivalent volume through an LLM API costs significantly more, depending on the model and provider.
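A back-of-the-envelope sketch shows where the gap comes from. Every number below is an assumption drawn from the ranges above (plus a rough ~4 characters per token for English); the key driver is that per-token billing counts the instruction prompt on every request:

```python
# Back-of-the-envelope comparison; all rates are assumptions, not real pricing.
NMT_USD_PER_M_CHARS = 15.0   # midpoint of the $10-20 range quoted above
LLM_USD_PER_M_TOKENS = 8.0   # within the $1-15 range quoted above
CHARS_PER_TOKEN = 4.0        # rough English average; varies by language

def nmt_cost(total_chars: int) -> float:
    """NMT APIs typically bill per character translated."""
    return total_chars / 1e6 * NMT_USD_PER_M_CHARS

def llm_cost(total_chars: int, avg_string_chars: int,
             prompt_overhead_tokens: int) -> float:
    """Per-token billing counts the instruction prompt on every request,
    which dominates when the strings themselves are short."""
    n_requests = total_chars / avg_string_chars
    content_tokens = total_chars / CHARS_PER_TOKEN
    overhead_tokens = n_requests * prompt_overhead_tokens
    return (content_tokens + overhead_tokens) / 1e6 * LLM_USD_PER_M_TOKENS
```

Under these assumptions, 1M characters of 40-character UI strings with a 500-token instruction prompt per request comes to $15 via NMT versus roughly $102 via the LLM, almost entirely prompt overhead. Batching many strings per request narrows the gap but rarely closes it.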

Deterministic Output

Given the same input, NMT engines produce the same output every time. LLMs may produce slightly different translations on repeated calls (unless temperature is set to 0, and even then minor variations can occur). For applications requiring strict reproducibility, this matters.

Language Coverage

Major NMT engines support 100-200+ languages. LLMs typically perform well on 20-40 high-resource languages but may produce lower-quality translations for less common languages.

Practical Use Cases

LLM-Based Translation Works Well For

  • Marketing and creative content: Taglines, ad copy, email campaigns
  • Context-dependent UI strings: Strings that are ambiguous without context
  • Style-specific content: Content requiring specific tone, formality, or brand voice
  • Small-volume, high-quality needs: When you need a few hundred strings translated with specific style requirements
  • Translation review and refinement: Using LLMs to improve or polish NMT output

NMT Works Well For

  • Bulk UI string translation: Thousands of application strings
  • Documentation: Help articles, knowledge base content
  • Real-time translation: Chat, live captioning, instant messaging
  • Pre-translation in TMS: Providing first drafts for human translators
  • Cost-sensitive workloads: When translation budget is limited relative to volume

Combining NMT and LLMs

A practical approach for many teams:

  1. Use NMT for initial translation: Fast, cheap, covers the majority of content
  2. Use LLM for high-value refinement: Marketing content, ambiguous strings, style adaptation
  3. Use human review for production content: Final quality check before shipping

Source strings
     ↓
NMT pre-translation (bulk, fast, cheap)
     ↓
LLM refinement (select strings: marketing, ambiguous, style-critical)
     ↓
Human review (all customer-facing content)
     ↓
Published translations
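The routing decision in the pipeline above can be expressed as a small function. A sketch, assuming each string carries content-type tags (the tag names here are made up; in practice they might come from the TMS or the string's location in the codebase):

```python
# Tags that send a string through LLM refinement (illustrative names).
STYLE_SENSITIVE = {"marketing", "ambiguous", "style-critical"}

def route_string(text: str, tags: set[str]) -> str:
    """Decide which translation path a string takes through the pipeline.

    Every string gets an NMT draft; only high-value content pays for
    LLM refinement. Human review follows on either path.
    """
    if tags & STYLE_SENSITIVE:
        return "nmt_then_llm_refine"
    return "nmt_only"
```

Keeping the routing rule explicit and centralized makes the cost/quality trade-off auditable: you can count exactly how many strings take the expensive path.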

Quality Comparison

Quality comparisons between LLMs and NMT depend heavily on the content type and language pair. General observations based on published research and industry experience:

| Content Type | NMT Quality | LLM Quality | Recommendation |
| --- | --- | --- | --- |
| Technical documentation | Good | Good | NMT (cheaper, sufficient quality) |
| UI strings (with context) | Good | Very good | LLM for ambiguous strings |
| Marketing copy | Fair | Very good | LLM |
| Legal/regulatory | Good | Good | Either + human review |
| Creative content | Fair | Good | LLM + human creative review |

Note: "Quality" here means the usefulness of the output as a starting point for human review. Neither approach eliminates the need for human review on production content.

Implementation Considerations

Prompt Engineering for Translation

Effective LLM translation requires well-structured prompts:

You are a professional translator. Translate the following text from English to French.

Requirements:
- Use formal register (vous, not tu)
- Preserve all placeholders like {name} and {count} exactly as-is
- Do not translate brand names
- Keep the translation concise — similar length to the source

Source: "Welcome back, {name}! You have {count} unread messages."
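One of the requirements above, preserving placeholders exactly, is cheap to verify mechanically on every LLM response. A small Python sketch that checks `{name}`-style placeholders:

```python
import re
from collections import Counter

# Matches {name}-style placeholders; adapt the pattern to your format
# (e.g. printf-style %s or ICU MessageFormat) as needed.
PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")

def placeholders_preserved(source: str, translation: str) -> bool:
    """Check that every {placeholder} in the source survives translation.

    Word order legitimately changes between languages, so compare the
    placeholders as multisets rather than as ordered sequences.
    """
    return Counter(PLACEHOLDER.findall(source)) == Counter(
        PLACEHOLDER.findall(translation)
    )
```

Translations that fail this check can be automatically retried or flagged for review instead of shipping with a broken interpolation.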

Rate Limiting and Batching

LLM APIs have rate limits and per-request overhead. For batch translation:

  • Group multiple strings into single requests where possible
  • Implement retry logic with exponential backoff
  • Cache translations to avoid re-translating unchanged content
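The batching and retry points above can be sketched in a few lines of Python. The bare `except Exception` is a placeholder: catch your provider SDK's specific rate-limit error type instead:

```python
import random
import time

def with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry a zero-argument API call with exponential backoff and jitter.

    Replace `except Exception` with your provider's rate-limit error class
    so that genuine bugs are not silently retried.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))

def batch_strings(strings: list[str], max_chars: int = 4000) -> list[list[str]]:
    """Greedily group strings so each batch fits in a single request."""
    batches, batch, size = [], [], 0
    for s in strings:
        if batch and size + len(s) > max_chars:
            batches.append(batch)
            batch, size = [], 0
        batch.append(s)
        size += len(s)
    if batch:
        batches.append(batch)
    return batches
```

The `max_chars` budget here is an assumption; size it to your model's context window minus the instruction prompt and expected output.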

Consistency Management

Since LLMs may produce varying outputs, enforce consistency through:

  • Glossaries included in the system prompt
  • Translation memory: reuse previous translations for identical or similar strings
  • Validation scripts: check that product terms are translated consistently
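A validation script for the last point can be as simple as a glossary compliance check. A sketch, assuming a glossary that maps source terms to approved target terms; it uses case-insensitive substring matching to stay short, while a production version would handle word boundaries and inflected forms:

```python
def glossary_violations(source: str, translation: str,
                        glossary: dict[str, str]) -> list[str]:
    """Return glossary source terms that occur in the source string but
    whose approved target rendering is missing from the translation.
    """
    return [
        src_term
        for src_term, tgt_term in glossary.items()
        if src_term.lower() in source.lower()
        and tgt_term.lower() not in translation.lower()
    ]
```

Strings with violations can be re-queued with the glossary entry injected directly into the prompt, which usually resolves the inconsistency on the second pass.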

FAQ

Should I replace my NMT integration with an LLM?

For most teams, no. NMT remains the better choice for bulk translation due to cost and speed advantages. Consider adding LLM-based translation as a complementary tool for content types where NMT falls short — marketing copy, ambiguous strings, and style-critical content.

How do I evaluate whether LLM translation quality justifies the higher cost?

Run a side-by-side comparison: translate a representative sample of your content with both NMT and an LLM, then have native speakers evaluate quality. If the LLM produces measurably better translations for certain content types, calculate whether the quality improvement justifies the cost difference for that content tier.

Can LLMs maintain terminology consistency across a large project?

Not natively — LLMs don't have memory between API calls. However, you can achieve consistency by including a glossary in the system prompt, using few-shot examples of approved translations, and implementing post-processing validation that checks for terminology compliance. A TMS with LLM integration handles this automatically.