Table of Contents
- Unicode and Character Encoding: A Developer's Guide to i18n
- The Pre-Unicode World: Why Encoding Matters
- Unicode: The Standard Explained
- Code Points vs. Characters
- Surrogate Pairs
- Grapheme Clusters
- UTF-8, UTF-16, and UTF-32: The Encodings
- UTF-8
- UTF-16
- UTF-32
- Common Encoding Bugs and How to Fix Them
- The "é" Appearing as "Ã©" Bug
- The MySQL utf8 vs. utf8mb4 Problem
- String Length Bugs
- Collation and Sorting
- Case Conversion and Turkish i
- Regular Expressions and Unicode
- Database Configuration for Unicode
- PostgreSQL
- MySQL / MariaDB
- SQLite
- Redis and Caches
- Summary: Unicode i18n Checklist
- Take your app global with better-i18n
Unicode and Character Encoding: A Developer's Guide to i18n
Before a single word can be translated, before a locale can be switched, before pluralization rules can be applied—text must be stored, transmitted, and rendered correctly. This is the domain of character encoding, and getting it wrong produces the broken-character gibberish that developers call "mojibake."
Unicode solved the fundamental problem of representing all the world's scripts in a single, universal standard. But understanding how Unicode works, and how its various encodings (UTF-8, UTF-16, UTF-32) interact with your software stack, is essential knowledge for any developer building international software.
The Pre-Unicode World: Why Encoding Matters
Before Unicode, every country and language had its own encoding standard. ASCII handled English (128 characters). Latin-1 (ISO 8859-1) added Western European characters. Windows-1252 was a Microsoft variant of Latin-1. Shift-JIS encoded Japanese. GB2312 encoded Chinese. KOI8-R encoded Russian Cyrillic.
The problem: these encodings were incompatible. A document encoded in Shift-JIS and displayed as Latin-1 produces garbage. A database that stored strings in Windows-1252 and displayed them in UTF-8 would mangle accented characters. Systems that moved data across encoding boundaries—especially email and the early web—produced mojibake constantly.
Unicode was designed to solve this by providing a single universal character set that encompasses all scripts.
Unicode: The Standard Explained
Unicode is a character set standard—it assigns a unique number (called a "code point") to every character in every script. The Unicode standard covers:
- Latin scripts (English, French, German, Spanish, etc.)
- Cyrillic (Russian, Ukrainian, Bulgarian, etc.)
- Arabic and Hebrew (RTL scripts)
- CJK (Chinese, Japanese, Korean) – over 90,000 ideographs
- Devanagari (Hindi, Sanskrit)
- Thai, Tibetan, Khmer, Myanmar
- Ethiopic (Amharic, Tigrinya)
- Mathematical and scientific symbols
- Emoji (yes, emoji are Unicode characters)
The total Unicode code space contains 1,114,112 possible code points (from U+0000 to U+10FFFF), of which approximately 150,000 are currently assigned.
Code Points vs. Characters
A Unicode code point is a number from U+0000 to U+10FFFF. The letter "A" is U+0041. The letter "é" is U+00E9. The Chinese character 中 is U+4E2D. The emoji 🌍 is U+1F30D.
But a "character" as users perceive it isn't always a single code point. Unicode has combining characters—separate code points that modify the preceding character:
- "é" can be represented as a single code point U+00E9 (precomposed)
- Or as "e" (U+0065) + combining acute accent (U+0301) = é (decomposed)
These two representations are visually identical but are different byte sequences. This matters for:
- String comparison (are these two strings equal?)
- String length (how many characters?)
- String indexing (what is the character at position 3?)
Unicode normalization standardizes these representations. NFC (canonical decomposition, then canonical composition) is the most common form for web use. NFD decomposes everything into base + combining sequences.
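In JavaScript, for example, String.prototype.normalize makes the difference between the two representations visible (a minimal sketch):

```javascript
// Two visually identical strings built from different code point sequences
const precomposed = '\u00E9';  // "é" as a single code point (NFC form)
const decomposed = 'e\u0301';  // "e" + U+0301 combining acute (NFD form)

console.log(precomposed === decomposed);                                   // false
console.log(precomposed.normalize('NFC') === decomposed.normalize('NFC')); // true
console.log(precomposed.normalize('NFD').length);                          // 2
```

Normalizing both sides to the same form (usually NFC) before comparing or hashing avoids this class of "equal but not equal" bugs.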
Surrogate Pairs
The original Unicode design targeted 65,536 code points (16-bit), covering the "Basic Multilingual Plane" (BMP). Characters outside the BMP—including many CJK ideographs, historic scripts, and emoji—require code points above U+FFFF.
In UTF-16, characters outside the BMP are encoded as "surrogate pairs"—two 16-bit units working together to encode one character. This is a frequent source of bugs in JavaScript, which uses UTF-16 internally:
const emoji = '🌍'; // U+1F30D, outside BMP
// Wrong: treats surrogate pairs as separate characters
emoji.length; // 2 (not 1!)
emoji[0]; // '\uD83C' (high surrogate, not the emoji)
emoji[1]; // '\uDF0D' (low surrogate, not the emoji)
// Correct: use Array.from or the string iterator
Array.from(emoji).length; // 1
[...emoji].length; // 1
// Correct: codePointAt for full code point
emoji.codePointAt(0); // 127757 (0x1F30D)
// Correct: for...of iterates by code point, not code unit
for (const char of emoji) {
console.log(char); // '🌍'
}
Grapheme Clusters
Even code points aren't always what users think of as "characters." A grapheme cluster is a sequence of code points that renders as a single visual unit:
- A base letter + combining diacritics: ê = e + ◌̂ (U+0302)
- An emoji with a modifier: 👍🏾 (thumbs up + skin tone modifier) = 2 code points, 1 grapheme
- A family emoji: 👨‍👩‍👧‍👦 = 4 person emoji joined by Zero Width Joiner (ZWJ) = 1 grapheme, 7 code points, 11 UTF-16 code units
For string operations where the user would perceive "one character," you want grapheme clusters:
// Intl.Segmenter (modern API) for grapheme segmentation
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const text = '👨‍👩‍👧‍👦'; // family emoji: 4 person emoji joined by ZWJs
const graphemes = [...segmenter.segment(text)];
graphemes.length; // 1 (one visual character)
UTF-8, UTF-16, and UTF-32: The Encodings
Unicode is a character set. UTF-8, UTF-16, and UTF-32 are encodings that specify how to represent Unicode code points as bytes.
UTF-8
UTF-8 is a variable-length encoding that uses 1-4 bytes per code point:
| Code points | Bytes | Note |
|---|---|---|
| U+0000 – U+007F | 1 byte | ASCII compatible |
| U+0080 – U+07FF | 2 bytes | Latin extended, IPA, Hebrew, Arabic |
| U+0800 – U+FFFF | 3 bytes | Most CJK |
| U+10000 – U+10FFFF | 4 bytes | Supplementary (emoji, rare CJK) |
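The byte counts in the table can be checked directly with TextEncoder, which always produces UTF-8:

```javascript
const enc = new TextEncoder(); // TextEncoder encodes to UTF-8, always

console.log(enc.encode('A').length);  // 1 byte  (U+0041, ASCII)
console.log(enc.encode('é').length);  // 2 bytes (U+00E9)
console.log(enc.encode('中').length); // 3 bytes (U+4E2D)
console.log(enc.encode('🌍').length); // 4 bytes (U+1F30D, supplementary)
```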
Advantages:
- ASCII compatible: ASCII files are valid UTF-8
- Storage efficient for English and other ASCII-heavy text (1 byte per ASCII character)
- Self-synchronizing: any byte can be identified as a start or continuation byte
- No byte order issues
Disadvantages:
- Variable-length makes O(1) random access by code point impossible (requires linear scan)
- CJK text takes 3 bytes per character (vs. 2 bytes in UTF-16)
UTF-8 is the dominant encoding on the web: HTTP headers, HTML files, JSON, and most APIs use UTF-8. If you're building web software, UTF-8 is your default.
UTF-16
UTF-16 uses 2 or 4 bytes per code point:
- BMP characters: 2 bytes
- Supplementary characters: 4 bytes (surrogate pairs)
Used by: Windows APIs, Java String type, JavaScript engines internally, .NET string type
Byte Order Mark (BOM): UTF-16 files often start with a BOM (U+FEFF) to indicate byte order (big-endian or little-endian). UTF-16BE vs. UTF-16LE are the two variants.
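After decoding, a BOM-prefixed file yields a string that starts with U+FEFF; if your code doesn't expect it, strip it explicitly. A small sketch (the helper name stripBOM is illustrative, not a standard API):

```javascript
// Remove a leading BOM (U+FEFF) left over after decoding a BOM-prefixed file
function stripBOM(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

console.log(stripBOM('\uFEFFhello')); // 'hello'
console.log(stripBOM('hello'));       // 'hello' (unchanged)
```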
UTF-32
UTF-32 uses exactly 4 bytes per code point—fixed width. Simple for code that needs O(1) random access by code point, but uses 4x more memory than ASCII text.
Used by: wchar_t on Linux/glibc and some Unix APIs. Python 3's internal representation (PEP 393) is flexible, but falls back to 4 bytes per code point for strings containing supplementary characters.
Common Encoding Bugs and How to Fix Them
The "é" Appearing as "Ã©" Bug
This is classic UTF-8 interpreted as Latin-1. The UTF-8 encoding of "é" (U+00E9) is the two bytes 0xC3 0xA9. Interpreted as Latin-1: Ã (0xC3) and © (0xA9).
Fix: Ensure the entire data pipeline uses UTF-8 consistently. Check your database connection charset (charset=utf8mb4 in MySQL), your HTTP response headers (Content-Type: text/html; charset=UTF-8), your file reading code, and any data exports/imports.
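In Node.js, Buffer makes the byte-level mismatch easy to reproduce (a sketch using Node's built-in Buffer; browser code would use TextEncoder/TextDecoder instead):

```javascript
// The UTF-8 encoding of "é" (U+00E9) is the two bytes 0xC3 0xA9
const bytes = Buffer.from('é', 'utf8');

console.log(bytes.toString('latin1')); // 'Ã©' — bytes misread as Latin-1 (mojibake)
console.log(bytes.toString('utf8'));   // 'é'  — correct round-trip
```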
The MySQL utf8 vs. utf8mb4 Problem
MySQL's utf8 character set only stores 3-byte UTF-8 sequences—it cannot store emoji or supplementary CJK characters, which require 4-byte UTF-8 sequences.
Fix: Always use utf8mb4 in MySQL for full Unicode support:
CREATE TABLE content (
body TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
);
-- Or set at connection level:
SET NAMES utf8mb4;
String Length Bugs
# Python 3: len() returns code point count, not byte count or grapheme count
text = "é" # one grapheme, one code point
len(text) # 1 ✓
text = "e\u0301" # one grapheme, two code points (e + combining acute)
len(text) # 2 ✗ (user sees one character, Python sees two)
# For grapheme count in Python, use the grapheme library:
import grapheme
grapheme.length("e\u0301") # 1 ✓
// JavaScript: length returns UTF-16 code unit count
"🌍".length // 2 (emoji takes 2 UTF-16 units)
// For code point count:
[..."🌍"].length // 1 ✓
// For grapheme count:
const s = new Intl.Segmenter();
[...s.segment("🌍")].length // 1 ✓
Collation and Sorting
Sorting strings in Unicode is not the same as sorting by byte value. "ä" sorts near "a" in German but after "z" in Swedish. Traditional Spanish sorting treats "ch" as a single unit. The Unicode Collation Algorithm (UCA), with locale-specific tailorings from CLDR, defines how strings should sort in each locale.
// Wrong: sorts by byte value
['ä', 'z', 'a'].sort()
// ['a', 'z', 'ä'] (wrong for most locales)
// Correct: locale-aware sort
['ä', 'z', 'a'].sort((a, b) => a.localeCompare(b, 'de'));
// ['a', 'ä', 'z'] ✓ (correct for German)
['ä', 'z', 'a'].sort((a, b) => a.localeCompare(b, 'sv'));
// ['a', 'z', 'ä'] ✓ (correct for Swedish)
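For sorting large arrays, a reusable Intl.Collator is the idiomatic choice: it resolves the locale's collation rules once, instead of on every localeCompare call. A minimal sketch:

```javascript
// Intl.Collator precompiles locale rules; its compare function plugs into sort
const german = new Intl.Collator('de');
const swedish = new Intl.Collator('sv');

console.log(['ä', 'z', 'a'].sort(german.compare));  // ['a', 'ä', 'z']
console.log(['ä', 'z', 'a'].sort(swedish.compare)); // ['a', 'z', 'ä']
```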
Case Conversion and Turkish i
The Turkish language has a dotted "i" (İ) and a dotless "ı". In Turkish, lowercase "I" is "ı" (not "i"), and uppercase "i" is "İ" (not "I"). Using locale-unaware case conversion breaks Turkish strings:
// Wrong: locale-unaware
"Istanbul".toLowerCase() // "istanbul" (English)
"istanbul".toUpperCase() // "ISTANBUL" (English)
// Correct: locale-aware
"Istanbul".toLocaleLowerCase('tr') // "ıstanbul" (dotless ı)
"istanbul".toLocaleUpperCase('tr') // "İSTANBUL" (dotted İ)
// Turkish I bug:
"I".toLowerCase() // "i" (wrong in Turkish)
"I".toLocaleLowerCase('tr') // "ı" (correct dotless ı)
Regular Expressions and Unicode
JavaScript regex by default works on UTF-16 code units, not code points:
// Wrong: . matches one UTF-16 code unit, not one code point
/^.$/.test('🌍') // false (emoji is 2 code units)
// Correct: use the u flag for Unicode-aware regex
/^.$/u.test('🌍') // true ✓
// Also: Unicode property escapes with u flag
/\p{Script=Arabic}/u.test('مرحبا') // true ✓
/\p{Emoji}/u.test('🌍') // true ✓
Database Configuration for Unicode
PostgreSQL
PostgreSQL natively supports Unicode in all modern versions. Prefer TEXT columns over VARCHAR(n): in PostgreSQL the two are stored identically and perform the same, so an arbitrary length cap adds constraint without benefit. Create new databases with UTF8 encoding (the default in most installations).
MySQL / MariaDB
As noted above, always use utf8mb4 with utf8mb4_unicode_ci collation:
ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
SQLite
SQLite stores text in UTF-8 by default. No special configuration needed.
Redis and Caches
Redis stores strings as bytes—it's encoding-agnostic. Ensure your client code consistently encodes and decodes UTF-8.
Summary: Unicode i18n Checklist
- All files, databases, and APIs use UTF-8 encoding
- MySQL databases use utf8mb4, not utf8
- HTTP responses declare charset=utf-8 in Content-Type
- JavaScript regular expressions use the u flag
- String length calculations account for multi-code-unit characters where user-visible
- Case conversion uses locale-aware methods where needed
- Sorting uses localeCompare (or Intl.Collator) with the appropriate locale
- BOM is stripped from UTF-8 files where not expected
- Unicode normalization is applied before string comparison
For more on how these technical foundations connect to real localization workflows, see localisation and internationalisation fundamentals and software localization.
Take your app global with better-i18n
better-i18n combines AI-powered translations, git-native workflows, and global CDN delivery into one developer-first platform. Stop managing spreadsheets and start shipping in every language.