Script and Writing System Considerations for Software Localization

Eray Gündoğmuş

Key Takeaways

  • The Unicode standard defines roughly 150,000 characters across 161 scripts (as of Unicode 15.1), and software should be built to store and render any of them
  • CJK (Chinese, Japanese, Korean) scripts have their own line-breaking rules: some punctuation marks must not start or end a line
  • Arabic and Hebrew require right-to-left (RTL) rendering with complex bidirectional text handling
  • Indic scripts like Devanagari use conjunct consonants and combining marks that affect text measurement and cursor positioning
  • Font stacking and fallback strategies ensure text renders correctly across all supported scripts

Why Writing Systems Matter for Software

When localizing software, developers often focus on translating strings but overlook the rendering and input challenges introduced by different writing systems. A button that works in English may truncate in German, display incorrectly in Arabic, or use the wrong line breaks in Japanese.

Understanding how different scripts work helps teams build software that handles multilingual content correctly from the start, rather than fixing rendering bugs after launch.

Latin-Based Scripts

Latin script is used by English, Spanish, French, German, and many other languages. It may seem straightforward, but there are still several things to get right:

  • Diacritics and accents: Characters like ñ, ü, ç, ø require proper Unicode support. Using ASCII approximations (replacing ü with u) is incorrect.
  • Text expansion: German and Finnish text can be 30-40% longer than English equivalents
  • Special characters: Languages like Vietnamese use stacked diacritics (e.g., ệ) that require sufficient line-height
  • Sorting/collation: Alphabetical order varies — Swedish places å, ä, ö at the end of the alphabet, not with a and o
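
The collation point can be checked with the built-in Intl.Collator. This is a minimal sketch, assuming a JavaScript runtime that ships full locale data (the default in current Node.js and browsers). The helper name sortForLocale is made up for illustration.

```javascript
// Sort strings using the alphabet order of a given locale.
// `sortForLocale` is an illustrative helper, not part of any library.
function sortForLocale(words, locale) {
  const collator = new Intl.Collator(locale);
  // Copy first so the caller's array is left untouched.
  return [...words].sort(collator.compare);
}

const letters = ['ö', 'z', 'ä', 'a', 'o'];
console.log(sortForLocale(letters, 'sv')); // Swedish: a, o, z, ä, ö
console.log(sortForLocale(letters, 'de')); // German:  a, ä, o, ö, z
```

The same input produces two different orders, which is why sorting should take the user's locale as a parameter rather than rely on code point order.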

CJK (Chinese, Japanese, Korean)

CJK scripts present unique technical challenges:

Character Sets

  • Chinese: Simplified Chinese (used in mainland China, Singapore) and Traditional Chinese (used in Taiwan, Hong Kong) use different character sets. They are not interchangeable.
  • Japanese: Uses three scripts simultaneously — Kanji (Chinese-derived characters), Hiragana (syllabary), and Katakana (syllabary for foreign words)
  • Korean: Uses Hangul, a featural alphabet with syllable blocks

Line Breaking

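Chinese and Japanese text does not put spaces between words (Korean does use spaces, although browsers by default still allow breaks between Hangul syllables). Line breaking follows the rules defined in the Unicode Line Breaking Algorithm (UAX #14):

  • Most CJK characters can serve as break points
  • Certain punctuation cannot appear at the start of a line (e.g., 。 、 ） 」)
  • Certain punctuation cannot appear at the end of a line (e.g., （ 「)
  • Browsers apply these rules by default (word-break: normal). word-break: break-all also allows breaks inside Latin words, so reserve it for cases where that is acceptable; overflow-wrap: break-word is gentler and only breaks a string that would otherwise overflow. For Korean, word-break: keep-all restricts breaks to spaces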
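
For canvas-based or otherwise custom layout code, the start-of-line and end-of-line restrictions can be sketched as a simple check. This is a deliberately simplified illustration, not an implementation of UAX #14: the two character sets are a tiny sample, and canBreakBefore is a hypothetical helper name.

```javascript
// A small sample set; real rule sets are much larger.
const NO_LINE_START = new Set(['。', '、', '）', '」']);
const NO_LINE_END = new Set(['（', '「']);

// Returns true if a line may be broken immediately before chars[index].
// Simplified: treats every position between characters as a candidate,
// which only holds for runs of CJK characters.
function canBreakBefore(text, index) {
  const chars = [...text];
  if (index <= 0 || index >= chars.length) return false;
  if (NO_LINE_START.has(chars[index])) return false;   // would start a line
  if (NO_LINE_END.has(chars[index - 1])) return false; // would end a line
  return true;
}

const sample = '彼は「はい」と言った。';
console.log(canBreakBefore(sample, 2)); // true:  break before 「 is fine
console.log(canBreakBefore(sample, 3)); // false: 「 would end the line
console.log(canBreakBefore(sample, 5)); // false: 」 would start the line
```

In a browser, the built-in line breaker already does this; a check like the one above is only relevant when text is laid out by hand.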

Font Considerations

CJK fonts are significantly larger than Latin fonts (tens of thousands of glyphs vs hundreds). Font loading strategies include:

  • System fonts first: font-family: -apple-system, "Hiragino Sans", "MS Gothic", sans-serif
  • Subset loading: Load only the character ranges needed using unicode-range in @font-face
  • Variable fonts: Reduce total font file size while supporting multiple weights
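
To decide what a subset has to cover, one option is to derive the unicode-range value from the strings the UI actually ships. The sketch below assumes that approach. unicodeRangeFor is a hypothetical helper, and its output is meant to be pasted into the unicode-range descriptor of an @font-face rule.

```javascript
// Build a CSS unicode-range value covering every character in `text`.
// Consecutive code points are merged into ranges.
function unicodeRangeFor(text) {
  const codePoints = [...new Set([...text].map((ch) => ch.codePointAt(0)))]
    .sort((a, b) => a - b);

  const ranges = [];
  for (const cp of codePoints) {
    const last = ranges[ranges.length - 1];
    if (last && cp === last[1] + 1) {
      last[1] = cp; // extend the current range
    } else {
      ranges.push([cp, cp]); // start a new range
    }
  }

  const hex = (n) => n.toString(16).toUpperCase();
  return ranges
    .map(([from, to]) => (from === to ? `U+${hex(from)}` : `U+${hex(from)}-${hex(to)}`))
    .join(', ');
}

console.log(unicodeRangeFor('ABC abc')); // "U+20, U+41-43, U+61-63"
console.log(unicodeRangeFor('こんにちは')); // the five kana used in this greeting
```

For large CJK sites, a build step would typically run this over all translation files rather than a single string.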

Arabic Script

Arabic script is used by Arabic, Persian (Farsi), Urdu, and other languages. Key considerations:

Right-to-Left (RTL) Rendering

  • Text flows right-to-left
  • UI elements should mirror: navigation, sidebars, icons with directionality
  • Use CSS logical properties (margin-inline-start instead of margin-left)
  • Set dir="rtl" on the <html> element or on specific containers
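
Choosing the dir value can be automated from the user's locale. The following is a rough sketch, assuming the runtime supports Intl.Locale. The script list is not exhaustive, and dirForLocale is an illustrative name.

```javascript
// ISO 15924 codes of common right-to-left scripts (not exhaustive).
const RTL_SCRIPTS = new Set(['Arab', 'Hebr', 'Thaa', 'Syrc', 'Nkoo', 'Adlm']);

function dirForLocale(tag) {
  // maximize() fills in the likely script, e.g. "fa" -> "fa-Arab-IR".
  const { script } = new Intl.Locale(tag).maximize();
  return RTL_SCRIPTS.has(script) ? 'rtl' : 'ltr';
}

console.log(dirForLocale('ar'));    // "rtl"
console.log(dirForLocale('he'));    // "rtl"
console.log(dirForLocale('ur-PK')); // "rtl"
console.log(dirForLocale('en-US')); // "ltr"

// In a browser: document.documentElement.dir = dirForLocale(userLocale);
```

Keying on the script rather than the language code means a tag with an explicit script subtag is handled without a special case.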

Contextual Shaping

Arabic letters change shape based on their position in a word:

| Position | Where it occurs | Example (ب) |
|---|---|---|
| Isolated | Stand-alone | ب |
| Initial | Start of word | بـ |
| Medial | Middle of word | ـبـ |
| Final | End of word | ـب |

Modern text rendering engines (HarfBuzz, CoreText, DirectWrite) handle this automatically, but custom text rendering or canvas-based UIs may need explicit support.
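
One consequence for custom renderers is that the stored text contains the same code point whatever shape ends up on screen. Picking the glyph is the shaping engine's job. Unicode also carries legacy "presentation form" characters that hard-code a single shape. The sketch below (constant names are illustrative) shows that they normalize back to the base letter, which may help when cleaning up text from older systems.

```javascript
const BEH = '\u0628'; // ARABIC LETTER BEH, the form normally stored in text

// Legacy presentation-form characters that each hard-code one shape.
const presentationForms = {
  isolated: '\uFE8F',
  final: '\uFE90',
  initial: '\uFE91',
  medial: '\uFE92',
};

// NFKC normalization folds each presentation form back to the base letter.
for (const [position, form] of Object.entries(presentationForms)) {
  console.log(position, form.normalize('NFKC') === BEH); // true for all four
}
```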

Bidirectional (Bidi) Text

When Arabic text contains embedded English words, numbers, or brand names, the Unicode Bidirectional Algorithm (UBA) determines display order. Developers should:

  • Use <bdi> HTML elements for user-generated content that may contain mixed-direction text
  • Apply unicode-bidi: isolate in CSS for inline mixed-direction elements
  • Test with real mixed-direction content, not just pure RTL text
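
Where markup is not available, for example in a plain-text notification string, the Unicode isolate characters give the same protection as <bdi>. This is a minimal sketch; bidiIsolate is a hypothetical helper name and the sample user name is made up.

```javascript
const FSI = '\u2068'; // FIRST STRONG ISOLATE: direction taken from the content
const PDI = '\u2069'; // POP DIRECTIONAL ISOLATE: ends the isolated run

// Wrap untrusted text so its direction cannot reorder the surrounding text.
function bidiIsolate(text) {
  return `${FSI}${text}${PDI}`;
}

const userName = 'إيان';
const message = `${bidiIsolate(userName)} liked 3 photos`;
console.log(message);
```

In HTML, <bdi> or dir="auto" remains the better choice, since invisible control characters are easy to lose when strings are copied or truncated.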

Indic Scripts

Devanagari (Hindi, Marathi, Nepali), Tamil, Bengali, Telugu, and other Indic scripts have complex rendering requirements:

Conjunct Consonants

Multiple consonants can combine into a single visual glyph (ligature). For example, in Devanagari, क + ् + ष = क्ष. This affects:

  • Text measurement: The visual width of a string doesn't correspond linearly to the number of Unicode code points
  • Cursor positioning: The cursor must move through conjuncts correctly, not split them
  • Text selection: Users should select conjuncts as single units
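
A custom text field can get its cursor stops from grapheme cluster boundaries rather than from code point offsets. The sketch below assumes Intl.Segmenter is available; cursorStops is an illustrative name. Whether a conjunct such as क्ष is reported as a single cluster depends on the Unicode version the runtime implements (the rule that keeps conjuncts together was added in Unicode 15.1), so no fixed output is shown for it.

```javascript
// Return the string offsets (in UTF-16 code units) where a cursor may rest.
function cursorStops(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  const stops = [...segmenter.segment(text)].map((s) => s.index);
  stops.push(text.length); // position after the last cluster
  return stops;
}

console.log(cursorStops('abc', 'en'));     // [0, 1, 2, 3]
console.log(cursorStops('e\u0301', 'en')); // [0, 2]: e + combining acute is one stop
console.log(cursorStops('क्ष', 'hi'));      // varies with the runtime's Unicode version
```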

Combining Marks

Vowel signs (matras) attach to consonants in various positions — above, below, before, or after the base consonant. CSS line-height must accommodate these marks without clipping.

Font Requirements

Not all fonts support the full range of conjuncts for a given Indic script. Use established fonts:

  • Devanagari: Noto Sans Devanagari, Mangal
  • Tamil: Noto Sans Tamil, Latha
  • Bengali: Noto Sans Bengali, Vrinda

Encoding Best Practices

UTF-8 Everywhere

UTF-8 should be the default encoding for all text in modern software:

  • Set <meta charset="UTF-8"> in HTML
  • Use UTF-8 in database columns (utf8mb4 in MySQL, because its legacy utf8 charset stores at most 3 bytes per character and cannot hold emoji; UTF8 in PostgreSQL)
  • Ensure file I/O uses UTF-8 encoding
  • Set Content-Type: text/html; charset=UTF-8 in HTTP headers
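
The byte sizes behind these settings are easy to inspect. This is a small sketch using the standard TextEncoder and TextDecoder, with illustrative helper names. It shows that UTF-8 needs between one and four bytes per code point, and how to reject invalid input instead of silently replacing it.

```javascript
const encoder = new TextEncoder(); // always encodes as UTF-8

function utf8ByteLength(text) {
  return encoder.encode(text).length;
}

console.log(utf8ByteLength('A'));         // 1
console.log(utf8ByteLength('\u00E9'));    // 2 (é, precomposed)
console.log(utf8ByteLength('中'));        // 3
console.log(utf8ByteLength('\u{1F44D}')); // 4 (thumbs-up emoji)

// Strict decoding: throws a TypeError on invalid UTF-8
// instead of inserting U+FFFD replacement characters.
function decodeUtf8Strict(bytes) {
  return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}
```

The four-byte case is the one a 3-byte database charset cannot store.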

String Length vs. Display Width

A single "character" as perceived by a user may consist of multiple Unicode code points:

| Concept | Example | Code points |
|---|---|---|
| Simple character | A | 1 |
| Accented character | é | 1 or 2 (precomposed or combining) |
| CJK character | 中 | 1 (but double-width) |
| Emoji | 👨‍👩‍👧‍👦 | 7 (with zero-width joiners) |
| Devanagari conjunct | क्ष | 3 |

Use grapheme cluster counting (available via Intl.Segmenter in JavaScript) instead of .length when you need to count user-visible characters.

// JavaScript: count grapheme clusters (user-perceived characters)
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const count = [...segmenter.segment('👨‍👩‍👧‍👦')].length; // 1 cluster (7 code points; .length would report 11 UTF-16 code units)
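
The same segmenter is useful for truncation, where slicing by .length can cut a character in half. The sketch below uses a hypothetical truncateGraphemes helper, and the emoji is written with escapes so the zero-width joiners are visible.

```javascript
// Keep at most `maxClusters` user-perceived characters.
function truncateGraphemes(text, maxClusters, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let result = '';
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (count === maxClusters) break;
    result += segment;
    count += 1;
  }
  return result;
}

// Family emoji: four people joined by three zero-width joiners (U+200D).
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

console.log(truncateGraphemes(family + 'abc', 2)); // the whole family emoji + "a"
console.log((family + 'abc').slice(0, 2));         // only the first person; the family is broken up
```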

Font Stacking Strategy

A font stack that names at least one font per target script lets the browser fall back character by character when an earlier font lacks a glyph:

body {
  font-family:
    /* Latin */
    "Inter", -apple-system, BlinkMacSystemFont,
    /* CJK (installed Noto CJK families carry a region suffix such as JP, SC, TC, or KR) */
    "Hiragino Sans", "Noto Sans CJK JP", "Microsoft YaHei",
    /* Arabic */
    "Noto Sans Arabic", "Segoe UI",
    /* Devanagari */
    "Noto Sans Devanagari",
    /* Fallback */
    sans-serif;
}

Google's Noto font family provides consistent coverage across scripts and is freely available.

FAQ

Do I need to support every writing system from the start?

No. Start with the scripts used by your target markets. However, ensure your technical foundation (UTF-8 encoding, flexible layouts, font stacking) can accommodate additional scripts later. Adding RTL support or CJK line-breaking rules after launch is significantly more work than building them in from the start.

How do I test my application with different scripts?

Use pseudo-localization tools to simulate text expansion and special characters. For script-specific testing, create test strings in each target script that include edge cases: long words, conjuncts, bidirectional text, and combining marks. Browser developer tools allow you to switch dir attributes and test RTL layouts without full translations.
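
A pseudo-localization pass can be very small. The sketch below is a toy version with a made-up character map and a 40% expansion factor, both of which are assumptions. Unlike real tools, it would also mangle placeholders such as {name}.

```javascript
// Illustrative mapping only; real tools cover the whole alphabet.
const ACCENTED = {
  a: 'á', e: 'é', i: 'í', o: 'ó', u: 'ü', c: 'ç', n: 'ñ',
  A: 'Á', E: 'É', I: 'Í', O: 'Ó', U: 'Ü',
};

function pseudoLocalize(text, expansion = 0.4) {
  const chars = [...text];
  const accented = chars.map((ch) => ACCENTED[ch] ?? ch).join('');
  // Pad to simulate longer translations.
  const padding = '~'.repeat(Math.ceil(chars.length * expansion));
  // Brackets make truncated strings easy to spot in the UI.
  return `[${accented}${padding}]`;
}

console.log(pseudoLocalize('Save'));   // "[Sávé~~]"
console.log(pseudoLocalize('Cancel')); // "[Cáñçél~~~]"
```

If a closing bracket is missing on screen, the layout is truncating text. If plain unaccented English shows up, that string bypassed the localization layer.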

Should I use web fonts or system fonts for multilingual applications?

Both approaches have trade-offs. System fonts render immediately with no download cost, but may not match your brand. Web fonts offer brand consistency but CJK web fonts can be very large (several megabytes). A common approach is web fonts for Latin text and system font fallbacks for CJK and other complex scripts, using unicode-range to control which characters trigger each font.