Tutorials

Using GPT and LLMs for App Localization: A Practical Guide

Eray Gündoğmuş
11 min read

Large language models have fundamentally changed how development teams approach app localization. Instead of relying solely on traditional machine translation engines or expensive human-only workflows, you can now use GPT, Claude, and other LLMs to produce context-aware, tonally accurate translations — and integrate them directly into your i18n pipeline. This guide covers the practical steps: prompt design, quality control, integration patterns, and cost management.

Key Takeaways

  • LLMs outperform traditional MT for context-sensitive translations. At WMT24, frontier LLMs like Claude 3.5 Sonnet won 9 out of 11 language pairs, outperforming dedicated machine translation systems (WMT24 Findings).
  • Prompt engineering is the single largest lever for translation quality. Providing context, glossaries, and tone instructions in your prompts dramatically reduces post-editing effort.
  • LLMs are not a replacement for human translators on high-stakes content. Research shows professional translators still outperform GPT-4 in expert evaluation, with human translators winning roughly 64% of head-to-head comparisons (Jiao et al., 2024).
  • Cost optimization matters. Smaller models like GPT-4o mini and Claude Haiku can handle straightforward translations at a fraction of the cost, reserving larger models for nuanced content.
  • A hybrid workflow — LLM draft plus human review — delivers the best balance of speed, cost, and quality for production apps.

Why Use LLMs for App Localization?

LLMs bring context awareness, tone adaptability, and format preservation that traditional machine translation engines lack — making them well suited for translating UI strings, marketing copy, and in-app content where nuance matters more than raw throughput.

The Shift from Traditional MT

Traditional machine translation (Google Translate, DeepL) uses neural models trained specifically on parallel corpora. These systems excel at high-volume, general-purpose translation. However, they struggle with several challenges that app localization surfaces daily:

Context fragmentation. App strings are typically short, isolated fragments: "Save", "Cancel", "Your order is ready." Without surrounding context, traditional MT often picks the wrong sense of a word. The German translation for "Save" could be "Speichern" (save a file) or "Sparen" (save money) — and a traditional MT system processing strings in isolation has no way to distinguish them.

Tone and brand voice. A fintech app and a children's game have radically different voice requirements. Traditional MT produces a single neutral output with limited control over register or formality.

Format preservation. App strings contain variables ({count} items), HTML tags, plural forms, and ICU message syntax. Traditional MT engines frequently break these structures.

LLMs address all three. Because they process language generatively with large context windows, you can include surrounding strings, glossaries, and explicit style instructions in each translation request. The WMT24++ benchmark expansion to 55 languages confirmed that frontier LLMs outperform standard MT providers across the board according to automatic metrics (Kocmi et al., 2025).

That said, LLMs introduce their own challenges — cost, latency, and occasional hallucination — which this guide addresses in the sections that follow.

Prompt Engineering for i18n

Effective prompt engineering is the most important factor in LLM translation quality. A well-structured prompt with context, glossary terms, and formatting rules can close most of the gap between raw LLM output and professional human translation.

The Anatomy of a Translation Prompt

Every translation prompt should include five elements:

  1. Role and task definition — Tell the model what it is doing
  2. Source and target language — Be explicit
  3. Context — Describe where these strings appear
  4. Glossary / terminology — Enforce consistent term usage
  5. Format constraints — Preserve variables, HTML, plural syntax

Here is a practical template:

You are a professional translator for a SaaS application.
Translate the following UI strings from English to German.

Context: These strings appear in a project management dashboard.
The tone is professional but approachable. Use "Sie" (formal) for
user-facing text.

Glossary:
- "workspace" → "Arbeitsbereich" (never "Arbeitsplatz")
- "sprint" → "Sprint" (keep in English)
- "backlog" → "Backlog" (keep in English)

Format rules:
- Preserve all variables in {curly_braces} exactly as they appear
- Preserve HTML tags (<b>, <a>, etc.) without translating attributes
- Return translations in the same JSON structure as the input

Input:
{
  "dashboard.welcome": "Welcome back, {userName}",
  "dashboard.sprint_count": "{count, plural, one {# sprint} other {# sprints}} active",
  "dashboard.empty": "No items in your <b>backlog</b> yet"
}

Context Window Strategies

One of the biggest advantages LLMs have over traditional MT is the ability to process multiple strings together. This enables cross-string consistency — the same term gets translated the same way throughout your app.

Batch by feature area. Instead of translating strings one at a time, group related strings and send them together. All strings from your "settings" page should be translated in one request so the model sees the full picture.
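
The batching step above can be sketched as a small helper that groups flat i18n keys by their dot-delimited prefix, so each LLM request covers one feature area. A minimal sketch, assuming keys like "settings.title":

```typescript
// Group flat i18n keys ("settings.title", "checkout.pay") by their
// top-level prefix so each translation request covers one feature area.
function groupByFeature(
  strings: Record<string, string>
): Map<string, Record<string, string>> {
  const groups = new Map<string, Record<string, string>>();
  for (const [key, value] of Object.entries(strings)) {
    const feature = key.split(".")[0];
    const group = groups.get(feature) ?? {};
    group[key] = value;
    groups.set(feature, group);
  }
  return groups;
}
```

Each group then becomes one batched prompt, as in the template earlier in this section.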

Include reference translations. If you already have approved translations for some strings, include them as examples:

Previously approved translations (use these as style reference):
- "Save changes" → "Änderungen speichern"
- "Discard" → "Verwerfen"

Now translate these new strings in the same style:
...

Provide UI context descriptions. When a string is ambiguous, add a developer comment:

{
  "key": "actions.close",
  "source": "Close",
  "context": "Button label to close a modal dialog, not to close an account"
}

Research from Across Systems confirms that source text optimization and well-maintained terminology are even more essential with LLMs than with traditional MT, because the model's output is directly shaped by the input quality (Across, 2024).

Choosing the Right Model

Not every string needs a frontier model. Here is a practical breakdown:

Use Case            | Recommended Model                  | Why
--------------------|------------------------------------|------------------------------
Simple UI labels    | GPT-4o mini, Claude Haiku          | Low ambiguity, cost-efficient
Marketing copy      | GPT-4o, Claude Sonnet              | Needs creative adaptation
Legal / compliance  | Claude Opus, GPT-4o + human review | High stakes, nuance required
Batch string files  | GPT-4o mini, Claude Haiku          | Volume pricing matters
Cultural adaptation | GPT-4o, Claude Sonnet              | Requires cultural reasoning
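
In a pipeline, this kind of table can become a small routing function. The sketch below is illustrative only; the content-type labels and model IDs are assumptions, not an official mapping:

```typescript
type ContentType = "ui-label" | "marketing" | "legal" | "batch" | "cultural";

// Route each content type to a model tier. Model IDs are examples;
// substitute whichever providers and models your team has benchmarked.
function pickModel(contentType: ContentType): string {
  switch (contentType) {
    case "marketing":
    case "cultural":
      return "gpt-4o"; // needs creative or cultural adaptation
    case "legal":
      return "gpt-4o"; // plus mandatory human review downstream
    default:
      return "gpt-4o-mini"; // ui-label, batch: low ambiguity, cost-efficient
  }
}
```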

Lokalise's internal testing ranked Claude 3.5 as the best-performing model across many language pairs, leading them to integrate it deeply into their platform. However, model performance varies by language pair — there is no single "best" model for all scenarios.

Quality Control for LLM Translations

LLM translations require structured quality assurance. While they often produce fluent, natural-sounding text, fluency can mask errors — a confidently wrong translation is harder to catch than an awkwardly correct one.

Common LLM Translation Pitfalls

Hallucination. LLMs occasionally add information not present in the source text. A source string "3 items selected" might become "3 wichtige Elemente ausgewählt" ("3 important items selected") — the model injects "important" without justification.

Inconsistency across batches. If you translate strings in separate API calls, the same term may be rendered differently each time. "Dashboard" might appear as "Dashboard," "Übersicht," or "Instrumententafel" across different requests.

Format corruption. Despite instructions, models sometimes modify variables: {userName} becomes {Benutzername}, or ICU plural syntax gets restructured.

Over-localization. LLMs may translate brand names, product features, or technical terms that should remain in English.

Formality inconsistency. In languages with formal/informal registers (German Sie/du, French vous/tu, Japanese keigo), the model may switch registers mid-batch.
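
Some of these pitfalls, notably inconsistency and over-localization, can be caught mechanically with a glossary check. A naive substring-based sketch (real matching would also need tokenization and inflection handling):

```typescript
// Flag translations where a glossary source term appears in the source
// string but its required target term is missing from the translation.
function findGlossaryViolations(
  pairs: ReadonlyArray<{ key: string; source: string; translation: string }>,
  glossary: Record<string, string>
): string[] {
  const violations: string[] = [];
  for (const { key, source, translation } of pairs) {
    for (const [term, required] of Object.entries(glossary)) {
      const inSource = source.toLowerCase().includes(term.toLowerCase());
      const inTarget = translation.toLowerCase().includes(required.toLowerCase());
      if (inSource && !inTarget) {
        violations.push(`${key}: expected "${required}" for "${term}"`);
      }
    }
  }
  return violations;
}
```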

Automated Quality Checks

Build automated validation into your pipeline:

interface TranslationQAResult {
  readonly key: string;
  readonly issues: readonly string[];
  readonly passed: boolean;
}

function validateTranslation(
  sourceKey: string,
  source: string,
  translation: string
): TranslationQAResult {
  const issues: string[] = [];

  // Check variable preservation
  const sourceVars = source.match(/\{[^}]+\}/g) || [];
  const translationVars = translation.match(/\{[^}]+\}/g) || [];

  for (const v of sourceVars) {
    if (!translationVars.includes(v)) {
      issues.push(`Missing variable: ${v}`);
    }
  }

  // Check HTML tag preservation
  const sourceTags = source.match(/<[^>]+>/g) || [];
  const translationTags = translation.match(/<[^>]+>/g) || [];

  for (const tag of sourceTags) {
    if (!translationTags.includes(tag)) {
      issues.push(`Missing HTML tag: ${tag}`);
    }
  }

  // Check for untranslated content (exact match = suspicious)
  if (source === translation && source.length > 3) {
    issues.push("Translation identical to source — may be untranslated");
  }

  // Check length ratio (translations shouldn't be 3x longer/shorter)
  const ratio = translation.length / source.length;
  if (ratio > 3 || ratio < 0.3) {
    issues.push(`Suspicious length ratio: ${ratio.toFixed(2)}`);
  }

  return {
    key: sourceKey,
    issues,
    passed: issues.length === 0,
  };
}
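
The same validator can be extended to ICU plural syntax. The sketch below assumes the source and target locales share plural categories (true for English to German; locales with different plural rules, like Russian, would need per-locale expected categories from CLDR):

```typescript
// Extract plural category names ("one", "other", "=0", ...) from an
// ICU pattern like "{count, plural, one {# sprint} other {# sprints}}".
function pluralCategories(s: string): string[] {
  const match = s.match(/\{\s*\w+\s*,\s*plural\s*,([^]*)\}/);
  if (!match) return [];
  return [...match[1].matchAll(/(\w+|=\d+)\s*\{/g)].map((m) => m[1]);
}

// Report source plural categories missing from the translation.
function checkPluralForms(source: string, translation: string): string[] {
  const src = pluralCategories(source);
  const dst = new Set(pluralCategories(translation));
  return src
    .filter((cat) => !dst.has(cat))
    .map((cat) => `Missing plural category: ${cat}`);
}
```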

Human Review Workflow

For production apps, the most effective pattern is LLM draft + targeted human review:

  1. LLM translates all strings in a batch with full context
  2. Automated QA catches format issues, missing variables, length violations
  3. Human reviewer focuses on:
    • Strings flagged by automated QA
    • High-visibility strings (onboarding, checkout, error messages)
    • Strings with cultural sensitivity
  4. Approved translations feed back into the glossary and style reference for future batches

This workflow lets human translators spend their time where it matters most — on judgment calls that require cultural knowledge — rather than on straightforward strings an LLM handles well.
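
The triage in step 3 can be automated with a simple routing rule. A hypothetical sketch; the key prefixes are assumptions about your naming scheme:

```typescript
interface ReviewCandidate {
  readonly key: string;
  readonly qaPassed: boolean;
}

// Example high-visibility areas; adjust to your own key naming scheme.
const HIGH_VISIBILITY_PREFIXES = ["onboarding.", "checkout.", "errors."];

// Send a string to human review if automated QA flagged it,
// or if it lives in a high-visibility area of the app.
function needsHumanReview(candidate: ReviewCandidate): boolean {
  if (!candidate.qaPassed) return true;
  return HIGH_VISIBILITY_PREFIXES.some((p) => candidate.key.startsWith(p));
}
```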

Research comparing GPT-4 to professional translators found that GPT-4 performs comparably to junior-level translators in terms of total errors, but there remain clear performance gaps between the model and mid-level or senior human translators (Yan et al., 2024). The hybrid approach captures the speed of LLMs while maintaining the quality ceiling that human expertise provides.

Integrating LLMs Into Your Localization Workflow

Moving from ad-hoc ChatGPT usage to a production-grade LLM translation pipeline requires thoughtful architecture decisions around batching, API integration, and cost management.

Architecture Patterns

Batch translation (recommended for most apps). Collect new or changed strings, translate in batches during CI/CD, and commit results to your i18n files. This is the most cost-effective and predictable pattern.

Developer adds strings → CI detects changes → LLM batch translation
→ Automated QA → Human review queue → Merge to i18n files

Real-time translation (for dynamic content). If your app has user-generated content or CMS-driven pages that need translation on the fly, you can call LLM APIs at request time with caching:

import { createHash } from "node:crypto";

interface TranslationCacheEntry {
  readonly translation: string;
  readonly timestamp: number;
}

const translationCache = new Map<string, TranslationCacheEntry>();
const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // 24 hours

async function translateWithCache(
  text: string,
  sourceLang: string,
  targetLang: string,
  context: string
): Promise<string> {
  // Include the context in the key so the same text with a different
  // context is not served a stale cached translation
  const cacheKey = createHash("sha256")
    .update(`${sourceLang}:${targetLang}:${context}:${text}`)
    .digest("hex");

  const cached = translationCache.get(cacheKey);
  if (cached && Date.now() - cached.timestamp < CACHE_TTL_MS) {
    return cached.translation;
  }

  const translation = await callTranslationLLM(
    text,
    sourceLang,
    targetLang,
    context
  );

  translationCache.set(cacheKey, {
    translation,
    timestamp: Date.now(),
  });

  return translation;
}

Hybrid pattern. Use batch translation for your core app strings (static i18n files), and real-time translation with caching for dynamic content. This is what most production apps end up with.

API Integration Example

Here is a practical example calling the OpenAI API for batch translation:

interface TranslationRequest {
  readonly strings: Record<string, string>;
  readonly sourceLang: string;
  readonly targetLang: string;
  readonly glossary: Record<string, string>;
  readonly context: string;
}

interface TranslationResponse {
  readonly translations: Record<string, string>;
  readonly model: string;
  readonly tokensUsed: number;
}

async function translateBatch(
  request: TranslationRequest
): Promise<TranslationResponse> {
  const prompt = buildTranslationPrompt(request);

  const response = await fetch(
    "https://api.openai.com/v1/chat/completions",
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      },
      body: JSON.stringify({
        model: "gpt-4o-mini",
        messages: [
          {
            role: "system",
            content:
              "You are a professional software localizer. Return only valid JSON.",
          },
          { role: "user", content: prompt },
        ],
        temperature: 0.1, // Low temperature for consistency
        response_format: { type: "json_object" },
      }),
    }
  );

  if (!response.ok) {
    throw new Error(`Translation API request failed: ${response.status}`);
  }

  const data = await response.json();
  const parsed = JSON.parse(data.choices[0].message.content);

  return {
    translations: parsed,
    model: data.model,
    tokensUsed: data.usage.total_tokens,
  };
}

function buildTranslationPrompt(
  request: TranslationRequest
): string {
  const glossaryLines = Object.entries(request.glossary)
    .map(([en, target]) => `- "${en}" → "${target}"`)
    .join("\n");

  return `Translate these UI strings from ${request.sourceLang} to ${request.targetLang}.

Context: ${request.context}

Glossary (use these exact translations):
${glossaryLines}

Rules:
- Preserve all {variables} exactly
- Preserve HTML tags
- Do not translate brand names
- Return a JSON object with the same keys

Strings to translate:
${JSON.stringify(request.strings, null, 2)}`;
}

Cost Optimization Strategies

LLM translation costs add up quickly at scale. Here are proven strategies to keep them manageable:

1. Use tiered models. Route simple strings (buttons, labels) to cheaper models and reserve expensive models for complex content. GPT-4o mini costs $0.15 per million input tokens, compared with $2.50 for GPT-4o, making it roughly 17 times cheaper on input.

2. Cache aggressively. The same string translated to the same language should never be translated twice. Implement content-addressed caching as shown above.

3. Translate incrementally. Only translate strings that have actually changed. Use hash comparison against your previous i18n files to identify deltas.

4. Batch efficiently. Sending 50 strings in one API call is far cheaper and more consistent than 50 individual calls, because you amortize the system prompt and context tokens.

5. Set a temperature of 0.1–0.3. Higher temperatures increase creativity but also increase inconsistency and the chance of hallucination. For translation, you want deterministic output.

As a reference point, translating 10,000 UI strings (averaging 8 words each) from English to German using GPT-4o mini costs approximately $0.10–$0.30 in API fees — orders of magnitude cheaper than professional human translation at $0.10–$0.20 per word.
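
Strategy 3, incremental translation, comes down to hashing each source string and comparing against the hashes stored from the previous run. A minimal sketch using Node's crypto module:

```typescript
import { createHash } from "node:crypto";

// Compare current source strings against hashes stored from the last
// translation run; only new or changed keys go to the LLM.
function findChangedKeys(
  current: Record<string, string>,
  previousHashes: Record<string, string>
): string[] {
  const changed: string[] = [];
  for (const [key, value] of Object.entries(current)) {
    const hash = createHash("sha256").update(value).digest("hex");
    if (previousHashes[key] !== hash) changed.push(key);
  }
  return changed;
}
```

Persist the hash map alongside your i18n files (or in your localization platform) and regenerate it after each successful translation run.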

How better-i18n Works with AI Translation

better-i18n is designed to fit naturally into the AI-assisted localization workflow described above. Rather than replacing your translation process, it provides the infrastructure layer that makes LLM translation practical at scale.

Structured i18n management. better-i18n organizes your translation keys, tracks which strings have changed, and maintains version history — giving you the foundation to build incremental LLM translation on top of.

Context preservation. Each translation key in better-i18n can carry developer notes and context descriptions, which you can pass directly to your LLM prompts to improve translation accuracy.

Review workflows. When LLM translations come back, better-i18n's pending changes workflow lets your team review, approve, or edit translations before they go live — exactly the human-in-the-loop pattern that produces the best results.

SDK integration. The @better-i18n/sdk makes it straightforward to pull source strings, send them through your LLM translation pipeline, and push results back — all via API. See the better-i18n documentation for integration details.

For a broader view of how AI translation tools compare, see our guide to the Best AI Translation Tools in 2026.

FAQ

Which LLM is best for translation?

There is no single best LLM for all translation scenarios. At WMT24, Claude 3.5 Sonnet performed best overall, winning 9 of 11 evaluated language pairs. However, performance varies significantly by language pair and content type. For most app localization work, GPT-4o mini and Claude Haiku provide the best cost-to-quality ratio for standard UI strings, while GPT-4o and Claude Sonnet are better suited for marketing copy or content requiring cultural adaptation. Test with your specific language pairs and content types before committing to a single provider.

Can GPT replace professional translators?

Not entirely — at least not yet. Research benchmarking GPT-4 against human translators found that professional translators win roughly 64% of head-to-head comparisons, with the gap widening for senior-level translators and culturally nuanced content (Yan et al., 2024). However, for standard app UI strings, error messages, and straightforward content, LLMs produce translations that require minimal post-editing. The most effective approach for production apps is a hybrid workflow: LLM-generated first drafts with human review for high-visibility and culturally sensitive strings.

How do you handle context in LLM translations?

Context is the key differentiator between good and bad LLM translations. Three practical strategies: (1) Batch related strings together — translate all strings from a single screen or feature in one API call so the model sees the full picture. (2) Include developer comments — add descriptions like "Button in checkout flow" or "Error shown when payment fails" alongside each string. (3) Provide reference translations — include previously approved translations as style examples in your prompt. These three practices alone eliminate the majority of context-related translation errors. Tools like better-i18n make it easy to attach context metadata to each translation key and pass it through to your LLM pipeline.


This article is part of our series on AI-powered localization. For a comprehensive comparison of available tools, read Best AI Translation Tools in 2026.