
Unicode and Character Encoding: A Developer's Guide to i18n

Eray Gündoğmuş
14 min read

Before a single word can be translated, before a locale can be switched, before pluralization rules can be applied—text must be stored, transmitted, and rendered correctly. This is the domain of character encoding, and getting it wrong produces the broken-character gibberish that developers call "mojibake."

Unicode solved the fundamental problem of representing all the world's scripts in a single, universal standard. But understanding how Unicode works, and how its various encodings (UTF-8, UTF-16, UTF-32) interact with your software stack, is essential knowledge for any developer building international software.

The Pre-Unicode World: Why Encoding Matters

Before Unicode, every country and language had its own encoding standard. ASCII handled English (128 characters). Latin-1 (ISO 8859-1) added Western European characters. Windows-1252 was a Microsoft variant of Latin-1. Shift-JIS encoded Japanese. GB2312 encoded Chinese. KOI8-R encoded Russian Cyrillic.

The problem: these encodings were incompatible. A document encoded in Shift-JIS and displayed as Latin-1 produces garbage. A database that stored strings in Windows-1252 and displayed them in UTF-8 would mangle accented characters. Systems that moved data across encoding boundaries—especially email and the early web—produced mojibake constantly.

Unicode was designed to solve this by providing a single universal character set that encompasses all scripts.

Unicode: The Standard Explained

Unicode is a character set standard—it assigns a unique number (called a "code point") to every character in every script. The Unicode standard covers:

  • Latin scripts (English, French, German, Spanish, etc.)
  • Cyrillic (Russian, Ukrainian, Bulgarian, etc.)
  • Arabic and Hebrew (RTL scripts)
  • CJK (Chinese, Japanese, Korean) – over 90,000 ideographs
  • Devanagari (Hindi, Sanskrit)
  • Thai, Tibetan, Khmer, Myanmar
  • Ethiopic (Amharic, Tigrinya)
  • Mathematical and scientific symbols
  • Emoji (yes, emoji are Unicode characters)

The total Unicode code space contains 1,114,112 possible code points (from U+0000 to U+10FFFF), of which approximately 150,000 are currently assigned.

Code Points vs. Characters

A Unicode code point is a number from U+0000 to U+10FFFF. The letter "A" is U+0041. The letter "é" is U+00E9. The Chinese character 中 is U+4E2D. The emoji 🌍 is U+1F30D.

But a "character" as users perceive it isn't always a single code point. Unicode has combining characters—separate code points that modify the preceding character:

  • "é" can be represented as a single code point U+00E9 (precomposed)
  • Or as "e" (U+0065) + combining acute accent (U+0301) = é (decomposed)

These two representations are visually identical but are different byte sequences. This matters for:

  • String comparison (are these two strings equal?)
  • String length (how many characters?)
  • String indexing (what is the character at position 3?)

Unicode normalization standardizes these representations. NFC (canonical decomposition, then canonical composition) is the most common form for web use. NFD decomposes everything into base + combining sequences.
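The difference is easy to see in JavaScript, where String.prototype.normalize applies the standard normalization forms:

```javascript
// Two visually identical strings with different code point sequences
const precomposed = '\u00E9';   // 'é' as a single code point
const decomposed = 'e\u0301';   // 'e' + combining acute accent

precomposed === decomposed;                                     // false
precomposed.normalize('NFC') === decomposed.normalize('NFC');   // true
precomposed.normalize('NFD') === decomposed.normalize('NFD');   // true
```

This is why normalizing before comparison appears in the checklist at the end of this guide: without it, two byte-for-byte different strings that users consider identical will fail equality checks.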

Surrogate Pairs

The original Unicode design targeted 65,536 code points (16-bit), covering the "Basic Multilingual Plane" (BMP). Characters outside the BMP—including many CJK ideographs, historic scripts, and emoji—require code points above U+FFFF.

In UTF-16, characters outside the BMP are encoded as "surrogate pairs"—two 16-bit units working together to encode one character. This is a frequent source of bugs in JavaScript, which uses UTF-16 internally:

const emoji = '🌍'; // U+1F30D, outside BMP

// Wrong: treats surrogate pairs as separate characters
emoji.length;  // 2 (not 1!)
emoji[0];      // '\uD83C' (high surrogate, not the emoji)
emoji[1];      // '\uDF0D' (low surrogate, not the emoji)

// Correct: use Array.from or the string iterator
Array.from(emoji).length;  // 1
[...emoji].length;          // 1

// Correct: codePointAt for full code point
emoji.codePointAt(0);  // 127757 (0x1F30D)

// Correct: for...of iterates by code point, not code unit
for (const char of emoji) {
  console.log(char);  // '🌍'
}

Grapheme Clusters

Even code points aren't always what users think of as "characters." A grapheme cluster is a sequence of code points that renders as a single visual unit:

  • A base letter + combining diacritics: ê = e + ̂
  • An emoji with modifier: 👍🏾 (thumbs up + skin tone modifier) = 2 code points, 1 grapheme
  • A family emoji: 👨‍👩‍👧‍👦 = 4 person emojis joined by Zero Width Joiners (ZWJ) = 1 grapheme, 7 code points, 11 UTF-16 code units

For string operations where the user would perceive "one character," you want grapheme clusters:

// Intl.Segmenter (modern API) for grapheme segmentation
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const text = '👨‍👩‍👧‍👦';
const graphemes = [...segmenter.segment(text)];
graphemes.length;  // 1 (one visual character)

UTF-8, UTF-16, and UTF-32: The Encodings

Unicode is a character set. UTF-8, UTF-16, and UTF-32 are encodings that specify how to represent Unicode code points as bytes.

UTF-8

UTF-8 is a variable-length encoding that uses 1-4 bytes per code point:

  Code points           Bytes     Note
  U+0000 – U+007F       1 byte    ASCII compatible
  U+0080 – U+07FF       2 bytes   Latin extended, IPA, Hebrew, Arabic
  U+0800 – U+FFFF       3 bytes   Most CJK
  U+10000 – U+10FFFF    4 bytes   Supplementary (emoji, rare CJK)

Advantages:

  • ASCII compatible: ASCII files are valid UTF-8
  • Storage efficient for English/Latin text (1 byte per character)
  • Self-synchronizing: any byte can be identified as a start or continuation byte
  • No byte order issues

Disadvantages:

  • Variable-length makes O(1) random access by code point impossible (requires linear scan)
  • CJK text takes 3 bytes per character (vs. 2 bytes in UTF-16)

UTF-8 is the dominant encoding on the web: HTTP headers, HTML files, JSON, and most APIs use UTF-8. If you're building web software, UTF-8 is your default.
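The byte counts in the table above can be checked directly with TextEncoder, which always produces UTF-8:

```javascript
// TextEncoder always encodes to UTF-8; byte lengths match the table above
const enc = new TextEncoder();
enc.encode('A').length;    // 1 byte  (ASCII)
enc.encode('é').length;    // 2 bytes (Latin extended)
enc.encode('中').length;   // 3 bytes (BMP CJK)
enc.encode('🌍').length;   // 4 bytes (supplementary)
```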

UTF-16

UTF-16 uses 2 or 4 bytes per code point:

  • BMP characters: 2 bytes
  • Supplementary characters: 4 bytes (surrogate pairs)

Used by: Windows APIs, Java String type, JavaScript engines internally, .NET string type

Byte Order Mark (BOM): UTF-16 files often start with a BOM (U+FEFF) to indicate byte order (big-endian or little-endian). UTF-16BE vs. UTF-16LE are the two variants.
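A BOM check only needs the first two bytes of the data. As a sketch (the function name here is my own, not a standard API):

```javascript
// Detect UTF-16 byte order from a leading BOM (U+FEFF), if present
function detectUtf16ByteOrder(bytes) {
  if (bytes[0] === 0xFE && bytes[1] === 0xFF) return 'UTF-16BE';
  if (bytes[0] === 0xFF && bytes[1] === 0xFE) return 'UTF-16LE';
  return null; // no BOM: byte order must come from metadata or heuristics
}

detectUtf16ByteOrder(new Uint8Array([0xFE, 0xFF, 0x00, 0x41])); // 'UTF-16BE'
```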

UTF-32

UTF-32 uses exactly 4 bytes per code point—fixed width. Simple for code that needs O(1) random access by code point, but uses 4x more memory than ASCII text.

Used by: some Unix/Linux APIs (wchar_t is 32 bits on Linux). Python 3's flexible string representation (PEP 393) is related: each string is stored with 1, 2, or 4 bytes per code point depending on the widest character it contains, so strings with supplementary characters are effectively UTF-32 internally.

Common Encoding Bugs and How to Fix Them

The "é" Appearing as "Ã©" Bug

This is classic UTF-8 interpreted as Latin-1. The UTF-8 encoding of "é" (U+00E9) is the two bytes 0xC3 0xA9. Interpreted as Latin-1: Ã (0xC3) and © (0xA9).

Fix: Ensure the entire data pipeline uses UTF-8 consistently. Check your database connection charset (charset=utf8mb4 in MySQL), your HTTP response headers (Content-Type: text/html; charset=UTF-8), your file reading code, and any data exports/imports.
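The bug can be reproduced in a few lines: encode as UTF-8, then decode the same bytes with the wrong decoder. (Note that the WHATWG 'latin1' label actually maps to Windows-1252, which agrees with Latin-1 for these two bytes.)

```javascript
const utf8Bytes = new TextEncoder().encode('é');  // Uint8Array [0xC3, 0xA9]
new TextDecoder('latin1').decode(utf8Bytes);      // 'Ã©' — mojibake
new TextDecoder('utf-8').decode(utf8Bytes);       // 'é'  — correct
```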

The MySQL utf8 vs. utf8mb4 Problem

MySQL's utf8 character set only stores 3-byte UTF-8 sequences—it cannot store emoji or supplementary CJK characters, which require 4-byte UTF-8 sequences.

Fix: Always use utf8mb4 in MySQL for full Unicode support:

CREATE TABLE content (
  body TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
);

-- Or set at connection level:
SET NAMES utf8mb4;

String Length Bugs

# Python 3: len() returns code point count, not byte count or grapheme count
text = "é"    # one grapheme, one code point
len(text)     # 1 ✓

text = "e\u0301"  # one grapheme, two code points (e + combining acute)
len(text)     # 2 ✗ (user sees one character, Python sees two)

# For grapheme count in Python, use the grapheme library:
import grapheme
grapheme.length("e\u0301")  # 1 ✓

// JavaScript: length returns UTF-16 code unit count
"🌍".length  // 2 (emoji takes 2 UTF-16 units)

// For code point count:
[..."🌍"].length  // 1 ✓

// For grapheme count:
const s = new Intl.Segmenter();
[...s.segment("🌍")].length  // 1 ✓

Collation and Sorting

Sorting Unicode strings is not the same as sorting by code point or byte value. "ä" should sort near "a" in German but after "z" in Swedish. In traditional Spanish collation, "ch" was treated as a single letter. The Unicode Collation Algorithm (UCA), tailored per locale by CLDR data, defines these locale-specific sort rules.

// Wrong: default sort compares UTF-16 code unit values
['ä', 'z', 'a'].sort()
// ['a', 'z', 'ä'] (wrong for most locales)

// Correct: locale-aware sort
['ä', 'z', 'a'].sort((a, b) => a.localeCompare(b, 'de'));
// ['a', 'ä', 'z'] ✓ (correct for German)

['ä', 'z', 'a'].sort((a, b) => a.localeCompare(b, 'sv'));
// ['a', 'z', 'ä'] ✓ (correct for Swedish)
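For sorting large arrays, Intl.Collator is usually preferred over repeated localeCompare calls, since the locale is parsed once and the comparator is reused:

```javascript
// Reusable locale-aware comparator
const collator = new Intl.Collator('de');
['ä', 'z', 'a'].sort(collator.compare);      // ['a', 'ä', 'z']

// Numeric collation handles embedded numbers sensibly
const numeric = new Intl.Collator('en', { numeric: true });
['item10', 'item2'].sort(numeric.compare);   // ['item2', 'item10']
```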

Case Conversion and Turkish i

The Turkish language has a dotted "i" (İ) and a dotless "ı". In Turkish, lowercase "I" is "ı" (not "i"), and uppercase "i" is "İ" (not "I"). Using locale-unaware case conversion breaks Turkish strings:

// Wrong: locale-unaware
"Istanbul".toLowerCase()  // "istanbul" (English)
"istanbul".toUpperCase()  // "ISTANBUL" (English)

// Correct: locale-aware
"Istanbul".toLocaleLowerCase('tr')  // "ıstanbul" (dotless ı: I lowercases to ı)
"istanbul".toLocaleUpperCase('tr')  // "İSTANBUL" (dotted İ)

// Turkish I bug:
"I".toLowerCase()           // "i" (wrong in Turkish)
"I".toLocaleLowerCase('tr') // "ı" (correct dotless ı)

Regular Expressions and Unicode

JavaScript regex by default works on UTF-16 code units, not code points:

// Wrong: . matches one UTF-16 code unit, not one code point
/^.$/.test('🌍')   // false (emoji is 2 code units)

// Correct: use the u flag for Unicode-aware regex
/^.$/u.test('🌍')  // true ✓

// Also: Unicode property escapes with u flag
/\p{Script=Arabic}/u.test('مرحبا')  // true ✓
/\p{Emoji}/u.test('🌍')            // true ✓

Database Configuration for Unicode

PostgreSQL

PostgreSQL natively supports Unicode in all modern versions; the default encoding for new databases should be UTF8. Prefer TEXT columns for unbounded strings, and note that the n in VARCHAR(n) counts characters, not bytes, so a VARCHAR(10) can hold ten multi-byte characters.

MySQL / MariaDB

As noted above, always use utf8mb4 with utf8mb4_unicode_ci collation:

ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

SQLite

SQLite stores text in UTF-8 by default. No special configuration needed.

Redis and Caches

Redis stores strings as bytes—it's encoding-agnostic. Ensure your client code consistently encodes and decodes UTF-8.
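In Node.js, for example, the bytes read back from such a store must be decoded with the same encoding they were written with (Buffer stands in here for cache I/O):

```javascript
// What a byte-oriented store like Redis actually holds: raw bytes
const stored = Buffer.from('héllo 🌍', 'utf8');

stored.toString('utf8');    // 'héllo 🌍' — round-trips correctly
stored.toString('latin1');  // mojibake: same bytes, wrong decoder
```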

Summary: Unicode i18n Checklist

  • All files, databases, and APIs use UTF-8 encoding
  • MySQL databases use utf8mb4, not utf8
  • HTTP responses declare charset=utf-8 in Content-Type
  • JavaScript string operations use u flag in regex
  • String length calculations account for multi-code-unit characters when user-visible
  • Case conversion uses locale-aware methods where needed
  • Sorting uses localeCompare with the appropriate locale
  • BOM is stripped from UTF-8 files where not expected
  • Unicode normalization applied before string comparison

For more on how these technical foundations connect to real localization workflows, see localisation and internationalisation fundamentals and software localization.


Take your app global with better-i18n

better-i18n combines AI-powered translations, git-native workflows, and global CDN delivery into one developer-first platform. Stop managing spreadsheets and start shipping in every language.

Get started free → · Explore features · Read the docs