Script and Writing System Considerations for Software Localization

Eray Gündoğmuş

March 2, 2026·10 min read

Key Takeaways

As of Unicode 16.0, the standard defines more than 154,000 characters across 168 scripts, and software should be prepared to handle any of them
CJK (Chinese, Japanese, Korean) scripts have their own line-breaking rules: for example, closing punctuation such as 。 or 」 must not start a line
Arabic and Hebrew require right-to-left (RTL) rendering with complex bidirectional text handling
Indic scripts like Devanagari use conjunct consonants and combining marks that affect text measurement and cursor positioning
Font stacking and fallback strategies ensure text renders correctly across all supported scripts

Why Writing Systems Matter for Software

When localizing software, developers often focus on translating strings but overlook the rendering and input challenges introduced by different writing systems. A button that works in English may truncate in German, display incorrectly in Arabic, or use the wrong line breaks in Japanese.

Understanding how different scripts work helps teams build software that handles multilingual content correctly from the start, rather than fixing rendering bugs after launch.

Latin-Based Scripts

Latin script is used by English, Spanish, French, German, and many other languages. Although it may seem straightforward, several considerations apply:

Diacritics and accents: Characters like ñ, ü, ç, ø require proper Unicode support. Using ASCII approximations (replacing ü with u) is incorrect and can change the meaning of a word.
Text expansion: German and Finnish text is often 30-40% longer than the English equivalent, so layouts need room to grow
Special characters: Languages like Vietnamese use stacked diacritics (e.g., ệ) that require sufficient line-height
Sorting/collation: Alphabetical order varies by language. Swedish places å, ä, ö at the end of the alphabet, not with a and o
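
The collation point can be illustrated with the built-in Intl.Collator. This is a minimal sketch, assuming a runtime that ships full ICU locale data (the default in current Node.js releases and browsers); the sample words are only illustrative.

```javascript
// Sort the same words with Swedish rules and with the default sort().
const words = ['öl', 'zebra', 'apa', 'ärm', 'åka'];

// Locale-aware comparison: Swedish treats å, ä, ö as separate letters after z.
const swedish = [...words].sort(new Intl.Collator('sv').compare);

// Plain sort() compares UTF-16 code units, which matches no language's alphabet.
const naive = [...words].sort();

console.log(swedish); // [ 'apa', 'zebra', 'åka', 'ärm', 'öl' ]
console.log(naive);
```

Passing the user's locale to Intl.Collator, rather than hard-coding one, lets the same list sort correctly for each audience.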

CJK (Chinese, Japanese, Korean)

CJK scripts present unique technical challenges:

Character Sets

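Chinese: Simplified Chinese (used in mainland China, Singapore) and Traditional Chinese (used in Taiwan, Hong Kong, Macau) use different character forms and often different vocabulary. They are not interchangeable, so treat them as separate locales (for example, zh-Hans and zh-Hant).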
Japanese: Uses three scripts simultaneously — Kanji (Chinese-derived characters), Hiragana (syllabary), and Katakana (syllabary for foreign words)
Korean: Uses Hangul, a featural alphabet with syllable blocks
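
To see how scripts mix inside a single string, Unicode property escapes in regular expressions can report which scripts are present. The sketch below is illustrative: the function name scriptsIn and the list of scripts are assumptions, not part of any library.

```javascript
// Scripts to look for; extend this list for your own target markets.
const SCRIPTS = ['Latin', 'Han', 'Hiragana', 'Katakana', 'Hangul', 'Arabic', 'Devanagari'];

// Returns the names of the listed scripts that appear in the text.
function scriptsIn(text) {
  return SCRIPTS.filter((name) => new RegExp(`\\p{Script=${name}}`, 'u').test(text));
}

console.log(scriptsIn('東京タワーへようこそ')); // [ 'Han', 'Hiragana', 'Katakana' ]
console.log(scriptsIn('서울 Seoul'));           // [ 'Latin', 'Hangul' ]
```

A check like this can help decide which fonts or line-breaking settings a piece of user-generated content is likely to need.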

Line Breaking

Chinese and Japanese text does not use spaces between words, while modern Korean does. Line breaking follows the rules defined in the Unicode Line Breaking Algorithm (UAX #14):

Most CJK characters can serve as break points
Certain punctuation cannot appear at the start of a line (e.g., 。、）」)
Certain punctuation cannot appear at the end of a line (e.g., （「)
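Browsers apply these rules by default, so the CSS property word-break: break-all is rarely needed for CJK text, and it also forces breaks inside Latin words. Prefer overflow-wrap: break-word for long unbreakable strings, and consider word-break: keep-all for Korean so that lines break only at spaces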
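
For custom renderers (canvas, terminals) that cannot rely on the browser, the two punctuation rules above can be sketched as follows. This is a deliberately simplified illustration, not an implementation of UAX #14: the function name wrapCjk is hypothetical, the punctuation sets contain only the examples listed above, and width is counted in characters rather than measured.

```javascript
// Characters from the examples above (real rule sets are much longer).
const NO_LINE_START = new Set(['。', '、', '）', '」']);
const NO_LINE_END = new Set(['（', '「']);

// Splits text into lines of at most maxChars characters, moving a break
// earlier when it would violate one of the two punctuation rules.
function wrapCjk(text, maxChars) {
  const chars = Array.from(text);
  const lines = [];
  let start = 0;
  while (start < chars.length) {
    let end = Math.min(start + maxChars, chars.length);
    if (end < chars.length) {
      // chars[end] would start the next line; chars[end - 1] would end this one.
      while (end > start + 1 &&
             (NO_LINE_START.has(chars[end]) || NO_LINE_END.has(chars[end - 1]))) {
        end -= 1;
      }
    }
    lines.push(chars.slice(start, end).join(''));
    start = end;
  }
  return lines;
}

console.log(wrapCjk('今日は晴れです。明日は雨です。', 7));
// [ '今日は晴れで', 'す。明日は雨で', 'す。' ]
```

A fixed split every 7 characters would have started the second line with 。; the sketch avoids that by breaking one character earlier.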

Font Considerations

CJK fonts are significantly larger than Latin fonts (tens of thousands of glyphs vs hundreds). Font loading strategies include:

System fonts first: font-family: -apple-system, "Hiragino Sans", "MS Gothic", sans-serif
Subset loading: Load only the character ranges needed using unicode-range in @font-face
Variable fonts: Reduce total font file size while supporting multiple weights

Arabic Script

Arabic script is used by Arabic, Persian (Farsi), Urdu, and other languages. Key considerations:

Right-to-Left (RTL) Rendering

Text flows right-to-left
Directional UI elements should mirror: navigation, sidebars, and icons that imply direction (such as back arrows)
Use CSS logical properties (margin-inline-start instead of margin-left)
Set dir="rtl" on the HTML element or specific containers
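
One way to choose the dir value is to derive it from the locale's script. The sketch below assumes a runtime with Intl.Locale and full locale data; the function name dirForLocale and the partial list of right-to-left scripts are assumptions made for illustration.

```javascript
// Script subtags written right-to-left (partial list, an assumption for this sketch).
const RTL_SCRIPTS = new Set(['Arab', 'Hebr', 'Thaa', 'Syrc', 'Nkoo', 'Adlm']);

// Returns 'rtl' or 'ltr' for a BCP 47 locale tag such as 'ar-EG' or 'en-US'.
function dirForLocale(tag) {
  // maximize() fills in the likely script, e.g. 'ar' becomes 'ar-Arab-EG'.
  const { script } = new Intl.Locale(tag).maximize();
  return RTL_SCRIPTS.has(script) ? 'rtl' : 'ltr';
}

console.log(dirForLocale('ar'));    // rtl
console.log(dirForLocale('fa-IR')); // rtl
console.log(dirForLocale('en-US')); // ltr
```

In a browser, the result could be assigned to document.documentElement.dir when the user switches language.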

Contextual Shaping

Arabic letters change shape based on their position in a word:

Form	Position in word	Example (ب)
Isolated	Stand-alone	ب
Initial	Start of word	بـ
Medial	Middle of word	ـبـ
Final	End of word	ـب

Modern text rendering engines (HarfBuzz, CoreText, DirectWrite) handle this automatically, but custom text rendering or canvas-based UIs may need explicit support.
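
Because the shaping engine picks the form, stored text should contain only the base letter. Text from legacy sources sometimes contains Unicode "presentation form" code points that hard-code a shape instead. A minimal sketch, assuming NFKC normalization is acceptable for your data (it also folds other compatibility characters, such as full-width letters); the helper name toBaseLetters is hypothetical.

```javascript
// Legacy presentation-form code points for the four shapes of BEH (ب).
const presentationForms = {
  isolated: '\uFE8F',
  final: '\uFE90',
  initial: '\uFE91',
  medial: '\uFE92',
};

// NFKC normalization folds each presentation form back to the base letter U+0628.
function toBaseLetters(text) {
  return text.normalize('NFKC');
}

for (const [position, form] of Object.entries(presentationForms)) {
  console.log(position, toBaseLetters(form) === '\u0628'); // true for all four
}
```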

Bidirectional (Bidi) Text

When Arabic text contains embedded English words, numbers, or brand names, the Unicode Bidirectional Algorithm (UBA) determines display order. Developers should:

Use <bdi> HTML elements for user-generated content that may contain mixed-direction text
Apply unicode-bidi: isolate in CSS for inline mixed-direction elements
Test with real mixed-direction content, not just pure RTL text
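
In plain-text contexts where <bdi> is not available (attribute values, notifications, log lines), the Unicode isolate control characters serve a similar purpose. The following is a sketch under that assumption; the function names and the message template are hypothetical.

```javascript
const FSI = '\u2068'; // FIRST STRONG ISOLATE: direction is taken from the wrapped text
const PDI = '\u2069'; // POP DIRECTIONAL ISOLATE: ends the isolated run

// Wraps an interpolated value so its direction cannot reorder the surrounding text.
function isolate(value) {
  return `${FSI}${value}${PDI}`;
}

// Hypothetical message template with a user-supplied name.
function greeting(userName) {
  return `Welcome, ${isolate(userName)}! You have 3 new messages.`;
}

console.log(greeting('سارة'));
```

The isolate characters are invisible, so remember to strip them before comparing or storing the interpolated value.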

Indic Scripts

Devanagari (Hindi, Marathi, Nepali), Tamil, Bengali, Telugu, and other Indic scripts have complex rendering requirements:

Conjunct Consonants

Multiple consonants can combine into a single visual glyph (ligature). For example, in Devanagari, क + ् + ष = क्ष. This affects:

Text measurement: The visual width of a string doesn't correspond linearly to the number of Unicode code points
Cursor positioning: The cursor must move through conjuncts correctly, not split them
Text selection: Users should select conjuncts as single units
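
For cursor movement, one approach is to allow the cursor to rest only at grapheme cluster boundaries. The sketch below uses Intl.Segmenter; the function name cursorStops is hypothetical. Whether a whole conjunct is treated as one cluster depends on the Unicode version behind the runtime's ICU data: newer versions keep virama-linked consonants together, while older ones may place a boundary after the virama.

```javascript
// Returns the string indexes where a cursor may rest: the start of every
// grapheme cluster, plus the end of the string.
function cursorStops(text, locale = 'hi') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  const stops = [...segmenter.segment(text)].map((segment) => segment.index);
  stops.push(text.length);
  return stops;
}

const word = 'नमस्ते';
console.log(word.length);       // 6 UTF-16 code units
console.log(cursorStops(word)); // fewer stops than the 7 per-code-unit positions
```

Either way, the cursor never lands between a consonant and its vowel sign, which is the failure users notice first.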

Combining Marks

Vowel signs (matras) attach to consonants in various positions — above, below, before, or after the base consonant. CSS line-height must accommodate these marks without clipping.

Font Requirements

Not all fonts support the full range of conjuncts for a given Indic script. Use established fonts:

Devanagari: Noto Sans Devanagari, Mangal
Tamil: Noto Sans Tamil, Latha
Bengali: Noto Sans Bengali, Vrinda

Encoding Best Practices

UTF-8 Everywhere

UTF-8 should be the default encoding for all text in modern software:

Set <meta charset="UTF-8"> in HTML
Use UTF-8 in database columns (utf8mb4 in MySQL, because the legacy utf8 alias stores at most three bytes per character and cannot hold emoji; UTF8 in PostgreSQL)
Ensure file I/O uses UTF-8 encoding
Set Content-Type: text/html; charset=UTF-8 in HTTP headers
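
UTF-8 is a variable-width encoding, which matters when a storage limit is expressed in bytes. A small sketch using the standard TextEncoder API shows how byte counts differ by script; the helper name utf8ByteLength is an assumption.

```javascript
// Number of bytes a string occupies once encoded as UTF-8.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

console.log(utf8ByteLength('A'));      // 1
console.log(utf8ByteLength('\u00E9')); // 2 (precomposed é)
console.log(utf8ByteLength('漢'));     // 3
console.log(utf8ByteLength('👍'));     // 4
```

The four-byte case is the one that three-byte-only column encodings reject.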

String Length vs. Display Width

A single "character" as perceived by a user may consist of multiple Unicode code points:

Concept	Example	Code Points
Simple character	A	1
Accented character	é	1 or 2 (precomposed or combining)
CJK character	漢	1 (but double-width)
Emoji	👨‍👩‍👧‍👦	7 (with zero-width joiners)
Devanagari conjunct	क्ष	3

Use grapheme cluster counting (available via Intl.Segmenter in JavaScript) instead of .length when you need to count user-visible characters.

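// JavaScript: count user-perceived characters (grapheme clusters)
const family = '👨‍👩‍👧‍👦';
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const count = [...segmenter.segment(family)].length; // 1 grapheme cluster
// For comparison: [...family].length is 7 (code points), family.length is 11 (UTF-16 code units)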
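
The "1 or 2" entry for é in the table has a practical consequence: two strings that look identical can fail an equality check. A minimal sketch of normalizing before comparison, using the standard String.prototype.normalize; the helper name sameText is hypothetical.

```javascript
const precomposed = '\u00E9'; // é as one code point
const decomposed = 'e\u0301'; // e followed by COMBINING ACUTE ACCENT

console.log(precomposed === decomposed); // false: different code points

// Normalize both sides to the same form (NFC here) before comparing or storing.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(sameText(precomposed, decomposed)); // true
```

Normalizing once at the input boundary, for example when saving user input, tends to be simpler than normalizing at every comparison.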

Font Stacking Strategy

A layered font stack gives each script a font that covers it, as long as one of the listed fonts is installed or loaded:

body {
  font-family:
    /* Latin */
    "Inter", -apple-system, BlinkMacSystemFont,
    /* CJK: list order decides which regional glyph variants are used */
    "Hiragino Sans", "Noto Sans CJK JP", "Microsoft YaHei",
    /* Arabic */
    "Noto Sans Arabic", "Segoe UI",
    /* Devanagari */
    "Noto Sans Devanagari",
    /* Fallback */
    sans-serif;
}

Google's Noto font family provides consistent coverage across scripts and is freely available.

FAQ

Do I need to support every writing system from the start?

No. Start with the scripts used by your target markets. However, ensure your technical foundation (UTF-8 encoding, flexible layouts, font stacking) can accommodate additional scripts later. Adding RTL support or CJK line-breaking rules after launch is significantly more work than building them in from the start.

How do I test my application with different scripts?

Use pseudo-localization tools to simulate text expansion and special characters. For script-specific testing, create test strings in each target script that include edge cases: long words, conjuncts, bidirectional text, and combining marks. Browser developer tools allow you to switch dir attributes and test RTL layouts without full translations.
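
A pseudo-localization pass can be as small as the sketch below. It is an illustration rather than a replacement for a dedicated tool: the function name, the accent map, and the 35% default expansion are assumptions, and it does not protect placeholders such as {name}.

```javascript
// Maps plain vowels to accented look-alikes so untranslated strings stand out.
const ACCENTED = { a: 'á', e: 'é', i: 'í', o: 'ö', u: 'ü', A: 'Á', E: 'É', I: 'Í', O: 'Ö', U: 'Ü' };

// Accents the vowels, pads the string to simulate text expansion,
// and adds brackets so truncation is easy to spot.
function pseudoLocalize(text, expansion = 0.35) {
  const accented = Array.from(text, (ch) => ACCENTED[ch] ?? ch).join('');
  const padding = '~'.repeat(Math.ceil(text.length * expansion));
  return `[${accented}${padding}]`;
}

console.log(pseudoLocalize('Save changes')); // [Sávé chángés~~~~~]
```

If a closing bracket is missing in the UI, the layout is truncating the string; if a label shows no accents, it is probably hard-coded rather than loaded from the translation files.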

Should I use web fonts or system fonts for multilingual applications?

Both approaches have trade-offs. System fonts render immediately with no download cost, but may not match your brand. Web fonts offer brand consistency but CJK web fonts can be very large (several megabytes). A common approach is web fonts for Latin text and system font fallbacks for CJK and other complex scripts, using unicode-range to control which characters trigger each font.
