caduh

Text Beyond UTF‑8 — Unicode gotchas every dev should know

Normalization, grapheme clusters, collation, bidi, and emojis—why 'string length' and 'toLowerCase' can betray you, with practical recipes in JS/Python/SQL.

TL;DR

  • Bytes ≠ code points ≠ user‑perceived characters (grapheme clusters). Don’t count by bytes or code units.
  • Normalize input (usually NFC) before storing/uniqueness checks; use NFKC only for identifiers/slugs.
  • For case‑insensitive compare/search, prefer case folding (Python casefold()), or ICU/Intl.Collator with sensitivity: "base".
  • Sort/search with proper collation (ICU/locale), not raw code point order.
  • Handle emoji/ZWJ/skin tones and RTL/bidi safely; measure and truncate by grapheme clusters.
  • Databases: use real Unicode (e.g., utf8mb4 in MySQL), set a consistent collation, and be mindful of unique indexes + normalization.

1) The layers: bytes → code points → grapheme clusters

  • Encoding (UTF‑8/16/32): how code points map to bytes.
  • Code point: a number in the Unicode code space (e.g., U+00E9 is “é”; U+1F600 is 😀).
  • Grapheme cluster: what users see as one character (may be multiple code points) — e.g., e + combining acute = “é”; the family emoji 👨‍👩‍👧‍👦 uses ZWJ sequences.
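
To make the three layers concrete in Python, here is a small sketch that measures the same family emoji at each level. The variable names are illustrative, and the string is written with escapes so no editor or copy-paste step can alter it.

```python
# The family emoji: four people joined by three ZERO WIDTH JOINERs (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

utf8_bytes = len(family.encode("utf-8"))            # bytes on the wire
utf16_units = len(family.encode("utf-16-le")) // 2  # what JS .length reports
code_points = len(family)                           # Python str counts code points

print(utf8_bytes, utf16_units, code_points)  # 25 11 7
```

None of the three numbers is 1, which is what a user would count. Python's standard library has historically not shipped a grapheme segmenter, so that last layer usually needs ICU bindings or a third-party package.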

JS length surprises

"👨‍👩‍👧‍👦".length            // 11 code units (UTF‑16), not 1
Array.from("👨‍👩‍👧‍👦").length // 7 code points (still not 1)

Count graphemes, not code units:

// Modern JS
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
[...seg.segment("👨‍👩‍👧‍👦")].length // 1

2) Normalization: NFC, NFD, NFKC, NFKD

The same visual text can have different byte sequences (e.g., “é” as single code point U+00E9 vs “e” + U+0301 combining mark). Normalize to avoid duplicate/compare bugs.

Common practice

  • Store text as NFC.
  • For identifiers/slugs or structural comparison, consider NFKC/NFKD, which also fold compatibility characters (e.g., the “ﬁ” ligature becomes “fi”, fullwidth “Ａ” becomes “A”).

Examples

// JavaScript
const a = "\u00E9";  // precomposed U+00E9
const b = "e\u0301"; // 'e' + U+0301 combining acute
a === b                                   // false
a.normalize("NFC") === b.normalize("NFC") // true

# Python
import unicodedata as u
a, b = "\u00e9", "e\u0301"  # precomposed vs. 'e' + combining acute
a == b                                          # False
u.normalize("NFC", a) == u.normalize("NFC", b)  # True

Normalize before hashing, deduping, or enforcing unique constraints.
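
A minimal Python sketch of that rule, assuming NFC for stored text and NFKC for identifiers. The helper names (nfc, ident_key, digest) are hypothetical, not part of any library.

```python
import hashlib
import unicodedata


def nfc(text: str) -> str:
    """Canonical form used for storage, comparison, and hashing."""
    return unicodedata.normalize("NFC", text)


def ident_key(text: str) -> str:
    """Hypothetical identifier key: NFKC also folds compatibility characters."""
    return unicodedata.normalize("NFKC", text)


def digest(text: str) -> str:
    """Hash the NFC form so equivalent spellings share one digest."""
    return hashlib.sha256(nfc(text).encode("utf-8")).hexdigest()


composed = "caf\u00e9"     # precomposed U+00E9
decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

unique_names = {nfc(name) for name in [composed, decomposed]}
print(len(unique_names))                       # 1
print(digest(composed) == digest(decomposed))  # True
print(ident_key("\ufb01le") == "file")         # True: U+FB01 ligature becomes "fi"
```

Hashing the raw strings instead would give two different digests for what users see as the same word.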


3) Case‑insensitivity: lowercasing isn’t enough

  • .toLowerCase() ignores locale rules (e.g., Turkish dotted İ/i vs. dotless I/ı) and never equates “ß” with “ss”.
  • Prefer case folding for language‑agnostic comparisons.

# Python
"Straße".casefold() == "STRASSE".casefold()  # True

In JS, use locale‑aware comparison for search/equality:

const coll = new Intl.Collator("de", { sensitivity: "base" });
coll.compare("Straße", "STRASSE") === 0 // true (base sensitivity also ignores accents)
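
On the Python side, case folding and normalization can be combined into one comparison key. The sketch below follows the general shape of canonical caseless matching (decompose, fold, decompose again); caseless_key and caseless_equal are hypothetical helper names, and this is a starting point rather than a full collation.

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Decompose, case fold, then decompose again (folding can create new sequences)."""
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", nfd.casefold())


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case and canonical-equivalence differences."""
    return caseless_key(a) == caseless_key(b)


print("Stra\u00dfe".lower() == "STRASSE".lower())  # False: lower() keeps the sharp s
print(caseless_equal("Stra\u00dfe", "STRASSE"))    # True
print(caseless_equal("\u00c9", "e\u0301"))         # True: precomposed É vs. e + combining acute
```

Unlike the Intl.Collator call with sensitivity "base", this key still treats “é” and “e” as different; it only removes case and encoding-form differences.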

4) Collation & search (sort like humans)

  • Use ICU/Intl for ordering & accent‑insensitive search.
  • DBs: pick collations explicitly and consistently.
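
To see why raw code point order surprises users, the Python sketch below sorts a small list both ways. rough_sort_key is a hypothetical, deliberately crude stand-in (strip combining marks, then case fold); it is not a replacement for ICU collation, which handles language-specific rules this ignores.

```python
import unicodedata


def rough_sort_key(text: str) -> str:
    """Crude stand-in for a collation key: drop combining marks, then case fold."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


words = ["Z\u00fcrich", "apple", "\u00d6sterreich", "Zebra"]

by_code_point = sorted(words)                # Zebra, Zürich, apple, Österreich
by_rough_key = sorted(words, key=rough_sort_key)  # apple, Österreich, Zebra, Zürich

print(by_code_point)
print(by_rough_key)
```

The same key also works as a rough accent-insensitive search key, since rough_sort_key("résumé") and rough_sort_key("RESUME") come out equal.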

Postgres

-- ICU collation (requires ICU-enabled build)
CREATE COLLATION de_phonebook (provider = icu, locale = 'de@collation=phonebook');
SELECT * FROM contacts ORDER BY name COLLATE "de_phonebook";

MySQL

-- Use real Unicode in MySQL
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci; -- accent- and case-insensitive (MySQL 8.0+)
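
Before migrating, it can help to know which existing values actually depend on utf8mb4. MySQL's legacy utf8 stores at most three bytes per character, and any code point above U+FFFF takes four bytes in UTF‑8. A small Python check, with needs_utf8mb4 as a hypothetical helper name:

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


print(len("\U0001F600".encode("utf-8")))  # 4 bytes for one emoji
print(needs_utf8mb4("caf\u00e9"))         # False
print(needs_utf8mb4("hi \U0001F600"))     # True
```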

5) Emoji, ZWJ, and limits that don’t explode UX

  • Emoji can be sequences: base + skin‑tone modifier, or multiple joined by ZWJ (U+200D).
  • Limit by grapheme clusters, not bytes or .length.

// JavaScript
function graphemeCount(s) {
  const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...seg.segment(s)].length;
}

function truncateByGrapheme(s, max) {
  const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let out = "";
  let n = 0;
  for (const g of seg.segment(s)) {
    if (n >= max) break; // stop once max clusters have been copied
    out += g.segment;
    n += 1;
  }
  return out;
}
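
For Python code without ICU available, the idea can be approximated by hand. The sketch below is a rough heuristic, not an implementation of the Unicode segmentation rules: it glues combining marks, variation selectors, skin-tone modifiers, and anything next to a ZWJ onto the previous cluster, and it does not handle flags (regional indicators), Hangul jamo, or CRLF. The function names are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def _extends(ch: str) -> bool:
    """Heuristic: does this character attach to the previous cluster?"""
    return (
        unicodedata.category(ch) in ("Mn", "Mc", "Me")  # combining marks, variation selectors
        or 0x1F3FB <= ord(ch) <= 0x1F3FF                # emoji skin-tone modifiers
        or ch == ZWJ
    )


def rough_graphemes(text: str) -> list:
    """Split text into approximate user-perceived characters."""
    clusters = []
    for ch in text:
        if clusters and (_extends(ch) or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def truncate_graphemes(text: str, max_clusters: int) -> str:
    """Keep at most max_clusters approximate grapheme clusters."""
    return "".join(rough_graphemes(text)[:max_clusters])


family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(rough_graphemes(family)))                   # 1
print(len(rough_graphemes("\U0001F44D\U0001F3FD")))   # 1: thumbs up + skin tone
print(truncate_graphemes("e\u0301x" + family, 2) == "e\u0301x")  # True
```

For production limits, a real segmenter (ICU bindings or a maintained third-party package) is the safer choice; this sketch only shows why truncating on code points can cut an emoji or an accented letter in half.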

6) BiDi (right‑to‑left) and mixed‑direction text

  • In HTML, prefer semantic controls:
    • dir="auto" on user‑generated content.
    • <bdi> to isolate a user name within surrounding text.
    • <bdo dir="rtl"> only when you must override.
  • CSS: unicode-bidi: isolate; direction: rtl; where appropriate.
  • Avoid sprinkling invisible control characters (FSI/PDI/LRM/RLM) unless you know exactly how they behave.

<p>By <bdi>{{username}}</bdi></p>
<div dir="auto">{{user_bio}}</div>
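
If the markup is built server-side in Python, the same isolation could look like the sketch below. byline_html and isolate_plain are hypothetical helpers: the first escapes the name and wraps it in <bdi>; the second is a fallback for plain-text contexts (logs, text email) where no markup exists, using the FSI and PDI isolate characters.

```python
from html import escape

FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def byline_html(username: str) -> str:
    """Escape the name, then isolate it with <bdi> so it cannot reorder its neighbours."""
    return f"<p>By <bdi>{escape(username)}</bdi></p>"


def isolate_plain(text: str) -> str:
    """Plain-text fallback: wrap the text in FSI ... PDI."""
    return f"{FSI}{text}{PDI}"


print(byline_html("<b>x</b>"))  # <p>By <bdi>&lt;b&gt;x&lt;/b&gt;</bdi></p>
print(isolate_plain("abc") == "\u2068abc\u2069")  # True
```

Prefer the HTML form wherever markup is available; the control-character form is only for output channels that cannot carry tags.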

7) Databases & uniqueness (real‑world guardrails)

  • Use UTF‑8 everywhere; in MySQL ensure utf8mb4 (not utf8, which is the 3‑byte utf8mb3 and cannot store emoji).
  • Pick a single normalization (e.g., NFC) at ingest, then enforce uniqueness on the normalized value.
  • Be explicit with collation on columns and indexes; changing collations later can invalidate index order.

Postgres (email, case‑insensitive)

CREATE EXTENSION IF NOT EXISTS citext;
CREATE TABLE users (
  id bigserial PRIMARY KEY,
  email citext UNIQUE  -- case-insensitive compare/index
);
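
A case-insensitive column type does not by itself merge composed and decomposed spellings, so the normalized-key approach is worth sketching too. The Python example below uses an in-memory SQLite database only to stay self-contained; the table layout, unique_key, and add_user are hypothetical, and the key (NFC, then case fold) is one reasonable choice among several.

```python
import sqlite3
import unicodedata


def unique_key(text: str) -> str:
    """Hypothetical uniqueness key: NFC, then case fold."""
    return unicodedata.normalize("NFC", text).casefold()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT NOT NULL, name_key TEXT NOT NULL UNIQUE)")


def add_user(name: str) -> bool:
    """Insert a user; return False if an equivalent name already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO users (name, name_key) VALUES (?, ?)",
                (name, unique_key(name)),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(add_user("Ren\u00e9e"))   # True
print(add_user("RENE\u0301E"))  # False: same name after NFC + case folding
```

The original spelling is kept in name for display, while the unique index sits on name_key.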

8) Security: confusables & spoofing

  • Homoglyphs (e.g., Latin “a” vs Cyrillic “а”) can spoof identifiers.
  • For usernames/tenants, consider UTS #39 confusables mapping (skeletons), restrict allowed scripts, or require mixed‑script review.
  • For domains, use Punycode/IDNA rules; don’t roll your own.
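
As a first line of defence, a mixed-script check can be roughed out in Python. The standard unicodedata module does not expose the Script property, so the sketch below guesses the script from the first word of each letter's Unicode name. That is a crude heuristic and an assumption of this example, not the UTS #39 algorithm; it misfires on names such as "FULLWIDTH LATIN ...", and a real implementation should use proper script data. The function names are hypothetical.

```python
import unicodedata


def rough_scripts(text: str) -> set:
    """Approximate each letter's script by the first word of its Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")  # "" for unnamed characters
            if name:
                scripts.add(name.split()[0])
    return scripts


def looks_mixed_script(text: str) -> bool:
    """Flag identifiers whose letters appear to come from more than one script."""
    return len(rough_scripts(text)) > 1


print(rough_scripts("paypal"))             # {'LATIN'}
print(looks_mixed_script("p\u0430ypal"))   # True: U+0430 is CYRILLIC SMALL LETTER A
```

Flagged names can then go to review or be rejected, depending on how strict the identifier policy is.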

Pitfalls & fast fixes

| Pitfall | Why it hurts | Fix |
|---|---|---|
| Counting by bytes/length | UI truncation, broken limits | Count graphemes (Intl.Segmenter) |
| Raw compares on unnormalized text | Duplicate “same” strings | Normalize (NFC) before compare/store |
| .toLowerCase() for equality | Locale edge cases | Case fold (Python) or Intl.Collator |
| Sorting by code point | Weird order for users | Use ICU/Intl collations |
| MySQL utf8 | Can’t store full emoji | Use utf8mb4 |
| Unique index without normalization | Near duplicates | Normalize at ingest; enforce on normalized column |


Quick checklist

  • [ ] Normalize to NFC on input; consider NFKC for IDs/slugs.
  • [ ] Compare/search with case folding or Intl.Collator.
  • [ ] Limit/truncate by grapheme clusters.
  • [ ] Use utf8mb4 and explicit collations in DBs.
  • [ ] Add basic confusables protection for identifiers.
  • [ ] Handle RTL with dir="auto"/<bdi> and isolation.

One‑minute adoption plan

  1. Add an NFC normalize step to your input pipeline and uniqueness checks.
  2. Switch MySQL to utf8mb4 (and pick an explicit Unicode collation, e.g., utf8mb4_0900_ai_ci).
  3. Replace naive .length limits with grapheme‑aware counting.
  4. Use Intl.Collator / Python casefold() for case‑insensitive features.
  5. Wrap user‑provided inline text in <bdi> and set dir="auto" on blocks.