Table of Contents
- Script and Writing System Considerations for Software Localization
- Key Takeaways
- Why Writing Systems Matter for Software
- Latin-Based Scripts
- CJK (Chinese, Japanese, Korean)
- Character Sets
- Line Breaking
- Font Considerations
- Arabic Script
- Right-to-Left (RTL) Rendering
- Contextual Shaping
- Bidirectional (Bidi) Text
- Indic Scripts
- Conjunct Consonants
- Combining Marks
- Font Requirements
- Encoding Best Practices
- UTF-8 Everywhere
- String Length vs. Display Width
- Font Stacking Strategy
- FAQ
- Do I need to support every writing system from the start?
- How do I test my application with different scripts?
- Should I use web fonts or system fonts for multilingual applications?
Script and Writing System Considerations for Software Localization
Key Takeaways
- The Unicode Standard defines roughly 150,000 characters across more than 160 scripts, so software needs a foundation that can handle any of them correctly
- CJK (Chinese, Japanese, Korean) scripts have unique line-breaking rules — you cannot break in the middle of certain character sequences
- Arabic and Hebrew require right-to-left (RTL) rendering with complex bidirectional text handling
- Indic scripts like Devanagari use conjunct consonants and combining marks that affect text measurement and cursor positioning
- Font stacking and fallback strategies ensure text renders correctly across all supported scripts
Why Writing Systems Matter for Software
When localizing software, developers often focus on translating strings but overlook the rendering and input challenges introduced by different writing systems. A button that works in English may truncate in German, display incorrectly in Arabic, or use the wrong line breaks in Japanese.
Understanding how different scripts work helps teams build software that handles multilingual content correctly from the start, rather than fixing rendering bugs after launch.
Latin-Based Scripts
Latin script is used by English, Spanish, French, German, and many other languages. While it may seem straightforward, it still raises several considerations:
- Diacritics and accents: Characters like ñ, ü, ç, ø require proper Unicode support. Using ASCII approximations (replacing ü with u) is incorrect.
- Text expansion: German and Finnish text can be 30-40% longer than English equivalents
- Special characters: Languages like Vietnamese use stacked diacritics (e.g., ệ) that require sufficient line-height
- Sorting/collation: Alphabetical order varies — Swedish places å, ä, ö at the end of the alphabet, not with a and o
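The collation point can be checked with the built-in `Intl.Collator`. The following is a minimal sketch, assuming a JavaScript runtime that ships full ICU locale data; the `sortFor` helper and the word list are illustrative names, not part of any library.

```javascript
// Sort a copy of `words` using the collation rules of `locale`.
function sortFor(locale, words) {
  return [...words].sort(new Intl.Collator(locale).compare);
}

const words = ['zebra', 'äpple', 'apa'];

// Swedish treats ä as a separate letter that sorts after z.
const swedish = sortFor('sv', words);
// English treats ä as a variant of a.
const english = sortFor('en', words);

console.log(swedish.join(', ')); // apa, zebra, äpple
console.log(english.join(', ')); // apa, äpple, zebra
```

A plain `words.sort()` compares UTF-16 code units and matches neither convention reliably, which is why locale-aware comparison is worth wiring in early.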
CJK (Chinese, Japanese, Korean)
CJK scripts present unique technical challenges:
Character Sets
- Chinese: Simplified Chinese (used in mainland China, Singapore) and Traditional Chinese (used in Taiwan, Hong Kong) use different character sets. They are not interchangeable.
- Japanese: Uses three scripts simultaneously — Kanji (Chinese-derived characters), Hiragana (syllabary), and Katakana (syllabary for foreign words)
- Korean: Uses Hangul, a featural alphabet with syllable blocks
Line Breaking
Chinese and Japanese text does not put spaces between words (Korean does). Line breaking follows the rules defined in the Unicode Line Breaking Algorithm (UAX #14):
- Most CJK characters can serve as break points
- Certain punctuation cannot appear at the start of a line (e.g., 。、)」)
- Certain punctuation cannot appear at the end of a line (e.g., (「)
- The CSS property `word-break: break-all` may be needed, but use `overflow-wrap: break-word` as a more nuanced alternative
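The start-of-line and end-of-line restrictions above can be illustrated with a deliberately simplified check. This is a sketch only: browsers and text engines implement the full UAX #14 rules, and the two character sets and the `breakOpportunities` helper here are hypothetical, covering just the punctuation listed above (in its full-width form).

```javascript
// Characters that must not start a line, and characters that must not end one.
// Only the examples from the list above; real rule sets are much larger.
const NO_LINE_START = new Set(['。', '、', '）', '」']);
const NO_LINE_END = new Set(['（', '「']);

// Return the code point indexes before which a line break is allowed.
function breakOpportunities(text) {
  const chars = [...text]; // iterate by code point, not UTF-16 code unit
  const breaks = [];
  for (let i = 1; i < chars.length; i++) {
    if (NO_LINE_START.has(chars[i])) continue;   // next line would start with it
    if (NO_LINE_END.has(chars[i - 1])) continue; // current line would end with it
    breaks.push(i);
  }
  return breaks;
}

console.log(breakOpportunities('「日本」。')); // [ 2 ]
```

In the sample string the only permitted break falls between 日 and 本: the opening bracket stays attached to the character after it, and the closing bracket and full stop stay attached to the character before them.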
Font Considerations
CJK fonts are significantly larger than Latin fonts (tens of thousands of glyphs vs hundreds). Font loading strategies include:
- System fonts first: `font-family: -apple-system, "Hiragino Sans", "MS Gothic", sans-serif`
- Subset loading: Load only the character ranges needed using `unicode-range` in `@font-face`
- Variable fonts: Reduce total font file size while supporting multiple weights
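To make the subset-loading idea concrete, the sketch below mimics the decision a browser makes for a `unicode-range` descriptor: the font file is fetched only when the text contains at least one code point inside the declared ranges. The helpers `parseUnicodeRange` and `textTriggersFont` are hypothetical, the range list is an illustrative example, and wildcard ranges such as `U+4??` are not handled.

```javascript
// Parse a list such as "U+3040-30FF, U+4E00-9FFF" into [start, end] pairs.
function parseUnicodeRange(list) {
  return list.split(',').map((part) => {
    const [start, end = start] = part.trim().replace(/^U\+/i, '').split('-');
    return [parseInt(start, 16), parseInt(end, 16)];
  });
}

// True when any code point in `text` falls inside one of the ranges.
function textTriggersFont(text, ranges) {
  return [...text].some((ch) => {
    const cp = ch.codePointAt(0);
    return ranges.some(([start, end]) => cp >= start && cp <= end);
  });
}

// Hiragana and Katakana plus the main CJK ideograph block (illustrative).
const japaneseSubset = parseUnicodeRange('U+3040-30FF, U+4E00-9FFF');

console.log(textTriggersFont('Hello', japaneseSubset));      // false
console.log(textTriggersFont('こんにちは', japaneseSubset)); // true
```

A page that contains only Latin text would therefore never download the CJK subset under these assumptions.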
Arabic Script
Arabic script is used by Arabic, Persian (Farsi), Urdu, and other languages. Key considerations:
Right-to-Left (RTL) Rendering
- Text flows right-to-left
- UI elements should mirror: navigation, sidebars, icons with directionality
- Use CSS logical properties (`margin-inline-start` instead of `margin-left`)
- Set `dir="rtl"` on the HTML element or specific containers
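One way to decide which `dir` value to set is to infer the likely script of the user's locale. The sketch below assumes a runtime with `Intl.Locale` and full ICU data; `dirForLocale` and the `RTL_SCRIPTS` set are hypothetical names, and the script list is partial.

```javascript
// ISO 15924 codes of some right-to-left scripts (a partial list for illustration).
const RTL_SCRIPTS = new Set(['Arab', 'Hebr', 'Thaa', 'Syrc', 'Nkoo', 'Adlm']);

// Return 'rtl' or 'ltr' for a BCP 47 language tag by inferring its likely script.
function dirForLocale(tag) {
  const { script } = new Intl.Locale(tag).maximize();
  return RTL_SCRIPTS.has(script) ? 'rtl' : 'ltr';
}

console.log(dirForLocale('ar-EG')); // rtl
console.log(dirForLocale('ur'));    // rtl
console.log(dirForLocale('en-US')); // ltr
```

In a browser, the result could be assigned to `document.documentElement.dir` once the locale is known.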
Contextual Shaping
Arabic letters change shape based on their position in a word:
| Position | Form | Example (ب) |
|---|---|---|
| Isolated | Stand-alone | ب |
| Initial | Start of word | بـ |
| Medial | Middle of word | ـبـ |
| Final | End of word | ـب |
Modern text rendering engines (HarfBuzz, CoreText, DirectWrite) handle this automatically, but custom text rendering or canvas-based UIs may need explicit support.
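A practical consequence is that stored text should contain only the abstract letter (U+0628 for ب), with the positional form chosen at render time. Unicode does contain legacy positional code points (Arabic Presentation Forms-B), and text copied from older systems sometimes includes them. The sketch below, with the hypothetical helper `toAbstractLetter`, shows how NFKC normalization folds those forms back to the base letter.

```javascript
// Legacy Presentation Forms-B code points for the letter beh, one per position.
const presentationForms = {
  isolated: '\uFE8F',
  final: '\uFE90',
  initial: '\uFE91',
  medial: '\uFE92',
};

// NFKC normalization maps each positional form back to the abstract letter.
function toAbstractLetter(ch) {
  return ch.normalize('NFKC');
}

for (const [position, form] of Object.entries(presentationForms)) {
  const hex = toAbstractLetter(form).codePointAt(0).toString(16).toUpperCase();
  console.log(`${position}: U+${hex.padStart(4, '0')}`); // U+0628 for every position
}
```

Normalizing input this way keeps search and comparison consistent, and leaves the choice of glyph to the shaping engine.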
Bidirectional (Bidi) Text
When Arabic text contains embedded English words, numbers, or brand names, the Unicode Bidirectional Algorithm (UBA) determines display order. Developers should:
- Use `<bdi>` HTML elements for user-generated content that may contain mixed-direction text
- Apply `unicode-bidi: isolate` in CSS for inline mixed-direction elements
- Test with real mixed-direction content, not just pure RTL text
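Where markup is not available (plain-text notifications, log lines, strings passed to a native API), the same isolation can be approximated with the Unicode isolate control characters. This is a minimal sketch; `isolate` and `formatGreeting` are hypothetical helpers, and the greeting text is only an example.

```javascript
const FSI = '\u2068'; // FIRST STRONG ISOLATE: direction taken from the content itself
const PDI = '\u2069'; // POP DIRECTIONAL ISOLATE: ends the isolated run

// Wrap untrusted or user-supplied text so it cannot reorder its surroundings.
function isolate(userText) {
  return FSI + userText + PDI;
}

// Insert a user name of unknown direction into an Arabic sentence.
function formatGreeting(userName) {
  return `مرحبا ${isolate(userName)}!`;
}

console.log(formatGreeting('Alice (admin)'));
```

The wrapped name is laid out as a unit, so its parentheses and Latin letters do not get reordered by the surrounding right-to-left text.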
Indic Scripts
Devanagari (Hindi, Marathi, Nepali), Tamil, Bengali, Telugu, and other Indic scripts have complex rendering requirements:
Conjunct Consonants
Multiple consonants can combine into a single visual glyph (ligature). For example, in Devanagari, क + ् + ष = क्ष. This affects:
- Text measurement: The visual width of a string doesn't correspond linearly to the number of Unicode code points
- Cursor positioning: The cursor must move through conjuncts correctly, not split them
- Text selection: Users should select conjuncts as single units
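Cursor movement by user-perceived character can be sketched with `Intl.Segmenter`. The helper `nextCaretPosition` is hypothetical, and the sample word uses consonants with vowel signs, which are grouped into clusters by every recent runtime. Whether a conjunct such as क्ष is reported as a single cluster depends on the Unicode version the runtime implements, so test that case on your target platforms.

```javascript
const graphemes = new Intl.Segmenter('hi', { granularity: 'grapheme' });

// Return the code unit index where the caret should land after moving
// one user-perceived character forward from `index`.
function nextCaretPosition(text, index) {
  if (index >= text.length) return text.length;
  const cluster = graphemes.segment(text).containing(index);
  return cluster.index + cluster.segment.length;
}

// किताब ("book"): क + ि, त + ा, ब
const word = '\u0915\u093F\u0924\u093E\u092C';

console.log(nextCaretPosition(word, 0)); // 2, skips the consonant and its vowel sign
console.log(nextCaretPosition(word, 2)); // 4
console.log(nextCaretPosition(word, 4)); // 5
```

Advancing the caret by one code unit instead would leave it between a consonant and its vowel sign, which is the splitting behavior described above.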
Combining Marks
Vowel signs (matras) attach to consonants in various positions — above, below, before, or after the base consonant. CSS line-height must accommodate these marks without clipping.
Font Requirements
Not all fonts support the full range of conjuncts for a given Indic script. Use established fonts:
- Devanagari: Noto Sans Devanagari, Mangal
- Tamil: Noto Sans Tamil, Latha
- Bengali: Noto Sans Bengali, Vrinda
Encoding Best Practices
UTF-8 Everywhere
UTF-8 should be the default encoding for all text in modern software:
- Set `<meta charset="UTF-8">` in HTML
- Use UTF-8 in database columns (`utf8mb4` in MySQL, `UTF8` in PostgreSQL)
- Ensure file I/O uses UTF-8 encoding
- Set `Content-Type: text/html; charset=UTF-8` in HTTP headers
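For the file I/O and transport points, the sketch below uses the standard `TextEncoder` and `TextDecoder` APIs to show how many bytes a character occupies in UTF-8 and how strict decoding surfaces corrupt input. The helper names `utf8ByteLength` and `decodeUtf8OrNull` are hypothetical.

```javascript
const encoder = new TextEncoder(); // always encodes to UTF-8
const strictDecoder = new TextDecoder('utf-8', { fatal: true });

// Number of bytes the text occupies when stored or sent as UTF-8.
function utf8ByteLength(text) {
  return encoder.encode(text).length;
}

// Decode bytes as UTF-8, returning null instead of silently
// substituting U+FFFD when the input is malformed.
function decodeUtf8OrNull(bytes) {
  try {
    return strictDecoder.decode(bytes);
  } catch {
    return null;
  }
}

console.log(utf8ByteLength('A'));      // 1
console.log(utf8ByteLength('\u00E9')); // 2 (é)
console.log(utf8ByteLength('\u6F22')); // 3 (漢)
console.log(decodeUtf8OrNull(new Uint8Array([0xE6, 0xBC, 0xA2]))); // 漢
console.log(decodeUtf8OrNull(new Uint8Array([0xE6, 0xBC])));       // null (truncated)
```

Byte length matters wherever a limit is expressed in bytes rather than characters, such as database column sizes or protocol fields.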
String Length vs. Display Width
A single "character" as perceived by a user may consist of multiple Unicode code points:
| Concept | Example | Code Points |
|---|---|---|
| Simple character | A | 1 |
| Accented character | é | 1 or 2 (precomposed or combining) |
| CJK character | 漢 | 1 (but double-width) |
| Emoji | 👨👩👧👦 | 7 (with zero-width joiners) |
| Devanagari conjunct | क्ष | 3 |
Use grapheme cluster counting (available via Intl.Segmenter in JavaScript) instead of .length when you need to count user-visible characters.
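// JavaScript: Count grapheme clusters
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const family = '👨\u200D👩\u200D👧\u200D👦'; // four emoji joined by three zero-width joiners
const count = [...segmenter.segment(family)].length; // 1 grapheme (7 code points; family.length is 11)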
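The three ways of counting can be compared side by side. The sketch below uses a hypothetical `measure` helper to report UTF-16 code units, code points, and grapheme clusters, and also shows the "1 or 2" case for é from the table by normalizing a decomposed string to NFC.

```javascript
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Report the three common notions of "length" for a string.
function measure(text) {
  return {
    codeUnits: text.length,            // UTF-16 code units, what .length returns
    codePoints: [...text].length,      // Unicode code points
    graphemes: [...graphemeSegmenter.segment(text)].length, // user-visible characters
  };
}

const decomposed = 'e\u0301';                    // e + COMBINING ACUTE ACCENT
const precomposed = decomposed.normalize('NFC'); // single code point U+00E9

console.log(measure(decomposed));  // { codeUnits: 2, codePoints: 2, graphemes: 1 }
console.log(measure(precomposed)); // { codeUnits: 1, codePoints: 1, graphemes: 1 }
console.log(measure('\u{1F468}')); // { codeUnits: 2, codePoints: 1, graphemes: 1 }
```

Normalizing to one form (NFC is a common choice) before storing or comparing text avoids treating the two spellings of é as different strings.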
Font Stacking Strategy
A robust font stack ensures every script renders correctly:
body {
  font-family:
    /* Latin */
    "Inter", -apple-system, BlinkMacSystemFont,
    /* CJK (Noto CJK family names carry a region suffix; JP is shown, use SC, TC, or KR as needed) */
    "Hiragino Sans", "Noto Sans CJK JP", "Microsoft YaHei",
    /* Arabic */
    "Noto Sans Arabic", "Segoe UI",
    /* Devanagari */
    "Noto Sans Devanagari",
    /* Fallback */
    sans-serif;
}
Google's Noto font family provides consistent coverage across scripts and is freely available.
FAQ
Do I need to support every writing system from the start?
No. Start with the scripts used by your target markets. However, ensure your technical foundation (UTF-8 encoding, flexible layouts, font stacking) can accommodate additional scripts later. Adding RTL support or CJK line-breaking rules after launch is significantly more work than building them in from the start.
How do I test my application with different scripts?
Use pseudo-localization tools to simulate text expansion and special characters. For script-specific testing, create test strings in each target script that include edge cases: long words, conjuncts, bidirectional text, and combining marks. Browser developer tools allow you to switch dir attributes and test RTL layouts without full translations.
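A pseudo-localization pass can be very small. The sketch below is a hypothetical `pseudoLocalize` helper, not a specific tool: it swaps vowels for accented variants, pads the string by an assumed 40% to simulate expansion, and adds brackets so truncation is easy to spot. It does not protect placeholders such as `{name}`, which a real tool would need to skip.

```javascript
// Accented stand-ins for ASCII vowels (illustrative mapping).
const ACCENTED = {
  a: 'á', e: 'é', i: 'í', o: 'ó', u: 'ú',
  A: 'Á', E: 'É', I: 'Í', O: 'Ó', U: 'Ú',
};

// Produce a readable but obviously altered version of a UI string.
function pseudoLocalize(text, expansion = 0.4) {
  const accented = [...text].map((ch) => ACCENTED[ch] ?? ch).join('');
  const padding = '~'.repeat(Math.ceil(text.length * expansion));
  return `[${accented}${padding}]`;
}

console.log(pseudoLocalize('Save'));         // [Sávé~~]
console.log(pseudoLocalize('Open Settings'));
```

If a closing bracket is missing in the rendered UI, the container is truncating text; if an accented vowel shows up as a box or question mark, the font or encoding path has a gap.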
Should I use web fonts or system fonts for multilingual applications?
Both approaches have trade-offs. System fonts render immediately with no download cost, but may not match your brand. Web fonts offer brand consistency but CJK web fonts can be very large (several megabytes). A common approach is web fonts for Latin text and system font fallbacks for CJK and other complex scripts, using unicode-range to control which characters trigger each font.