Text Beyond UTF‑8 — Unicode gotchas every dev should know

September 17th, 2025 · 4 min read · #dev #i18n #unicode #text #a11y #ca-duh

Normalization, grapheme clusters, collation, bidi, and emojis—why 'string length' and 'toLowerCase' can betray you, with practical recipes in JS/Python/SQL.

TL;DR

Bytes ≠ code points ≠ user‑perceived characters (grapheme clusters). Don’t count by bytes or code units.
Normalize input (usually NFC) before storing/uniqueness checks; use NFKC only for identifiers/slugs.
For case‑insensitive compare/search, prefer case folding (Python casefold()), or ICU/Intl.Collator with sensitivity: "base".
Sort/search with proper collation (ICU/locale), not raw code point order.
Handle emoji/ZWJ/skin tones and RTL/bidi safely; measure and truncate by grapheme clusters.
Databases: use real Unicode (e.g., utf8mb4 in MySQL), set a consistent collation, and be mindful of unique indexes + normalization.

1) The layers: bytes → code points → grapheme clusters

Encoding (UTF‑8/16/32): how code points map to bytes.
Code point: a number in the Unicode code space, U+0000 to U+10FFFF (e.g., U+00E9 is “é”; U+1F600 is 😀).
Grapheme cluster: what users see as one character (may be multiple code points) — e.g., e + combining acute = “é”; the family emoji 👨‍👩‍👧‍👦 uses ZWJ sequences.

JS length surprises

"👨‍👩‍👧‍👦".length            // 11 code units (UTF‑16), not 1
Array.from("👨‍👩‍👧‍👦").length // 7 code points (still not 1)

Count graphemes, not code units:

// Modern JS
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
[...seg.segment("👨‍👩‍👧‍👦")].length // 1
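
The same layering can be seen from Python, where len() counts code points rather than UTF‑16 code units. The sketch below is illustrative only; the helper name layer_sizes is made up for this example, and the emoji is written with escapes so the exact code points are unambiguous.

```python
# Family emoji spelled out: man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"


def layer_sizes(text: str) -> dict:
    """Return the size of `text` at each layer (hypothetical helper)."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        # UTF-16 uses 2 bytes per code unit; "-le" avoids counting a BOM.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),
    }


print(layer_sizes(FAMILY))
# {'utf8_bytes': 25, 'utf16_code_units': 11, 'code_points': 7}
print(layer_sizes("e\u0301"))
# {'utf8_bytes': 3, 'utf16_code_units': 2, 'code_points': 2}
```

None of the three numbers is the "1" a user would expect, which is why limits belong at the grapheme layer.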

2) Normalization: NFC, NFD, NFKC, NFKD

The same visual text can have different byte sequences (e.g., “é” as single code point U+00E9 vs “e” + U+0301 combining mark). Normalize to avoid duplicate/compare bugs.

Common practice

Store text as NFC.
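For identifiers/slugs or loose matching, consider NFKC/NFKD, which also fold compatibility characters (e.g., the “ﬁ” ligature to “fi”, full‑width letters to ASCII). These forms are lossy, so don’t apply them to text you store for display.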

Examples

// JavaScript
const a = "\u00E9";  // precomposed "é" (U+00E9)
const b = "e\u0301"; // 'e' + combining acute (U+0301)
a === b                                   // false
a.normalize("NFC") === b.normalize("NFC") // true

# Python
import unicodedata as u
a = "\u00E9"   # precomposed "é" (U+00E9)
b = "e\u0301"  # 'e' + combining acute (U+0301)
a == b                                          # False
u.normalize("NFC", a) == u.normalize("NFC", b)  # True

Normalize before hashing, deduping, or enforcing unique constraints.
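
To make the NFC vs NFKC difference and the dedupe advice concrete, here is a minimal sketch. The function names nfc, nfkc, and dedupe are invented for this example; only unicodedata.normalize comes from the standard library.

```python
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)


def dedupe(values):
    """Drop values that are equal after NFC normalization, keeping first-seen order."""
    seen, out = set(), []
    for value in values:
        key = nfc(value)
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out


# NFC leaves compatibility characters alone; NFKC folds them (lossy).
ligature, circled_one, fullwidth_a = "\uFB01", "\u2460", "\uFF21"
print(nfc(ligature) == ligature)                           # True
print(nfkc(ligature), nfkc(circled_one), nfkc(fullwidth_a))  # fi 1 A

# Precomposed and decomposed "é" collapse into one entry.
print(dedupe(["\u00E9", "e\u0301", "e"]))  # ['é', 'e']
```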

3) Case‑insensitivity: lowercasing isn’t enough

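Lowercasing (.toLowerCase() in JS, .lower() in Python) is not a safe equality key: it ignores locale rules (Turkish dotted/dotless I/i needs toLocaleLowerCase("tr")), and it leaves pairs like “ß” and “ss” unequal.
Prefer case folding for language‑agnostic comparisons.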

"Straße".casefold() == "STRASSE".casefold()  # True

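In JS, which has no built‑in case folding, use locale‑aware comparison for search/equality. Note that sensitivity: "base" ignores accents as well as case; use "accent" if accents should still count.

const coll = new Intl.Collator("de", { sensitivity: "base" });
coll.compare("Straße", "STRASSE") === 0 // true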
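
Case folding and normalization interact, so a comparison key usually needs both. The sketch below follows the NFD, casefold, NFD pattern that Unicode describes for canonical caseless matching; caseless_key and caseless_equal are hypothetical names for this example.

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Key for case-insensitive, normalization-insensitive equality."""
    decomposed = unicodedata.normalize("NFD", text)
    # Folding can produce unnormalized output, so normalize again afterwards.
    return unicodedata.normalize("NFD", decomposed.casefold())


def caseless_equal(a: str, b: str) -> bool:
    return caseless_key(a) == caseless_key(b)


print("Straße".lower() == "STRASSE".lower())       # False: "straße" vs "strasse"
print(caseless_equal("Straße", "STRASSE"))         # True
print(caseless_equal("\u00C9COLE", "e\u0301cole")) # True: precomposed vs decomposed
print(caseless_equal("resume", "r\u00E9sum\u00E9"))  # False: accents still matter
```

Unlike Intl.Collator with sensitivity: "base", this key keeps accents significant.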

4) Collation & search (sort like humans)

Use ICU/Intl for ordering & accent‑insensitive search.
DBs: pick collations explicitly and consistently.

Postgres

-- ICU collation (requires ICU-enabled build)
CREATE COLLATION de_phonebook (provider = icu, locale = 'de@collation=phonebook');
SELECT * FROM contacts ORDER BY name COLLATE "de_phonebook";

MySQL

-- Use real Unicode in MySQL
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci; -- accent- and case-insensitive (MySQL 8.0+)
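
When application code has no ICU binding at hand, an accent‑ and case‑insensitive search key can be roughly approximated with the standard library. This is a sketch, not real collation: it knows nothing about locale rules, and letters without a canonical decomposition (such as “ø”) pass through unchanged. The name search_key is made up for this example.

```python
import unicodedata


def search_key(text: str) -> str:
    """Rough accent- and case-insensitive key (NOT locale-aware collation)."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    # Drop combining marks (accents) left over after decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


words = ["Zebra", "apple", "\u00C4pfel"]

# Raw code point order: uppercase first, accented letters last.
print(sorted(words))                  # ['Zebra', 'apple', 'Äpfel']

# Sorting by the rough key is closer to what readers expect.
print(sorted(words, key=search_key))  # ['Äpfel', 'apple', 'Zebra']

print(search_key("Caf\u00E9") == search_key("CAFE"))  # True
```

For user‑facing ordering, a database collation or an ICU binding remains the better choice.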

5) Emoji, ZWJ, and limits that don’t explode UX

Emoji can be sequences: base + skin‑tone modifier, or multiple joined by ZWJ (U+200D).
Limit by grapheme clusters, not bytes or length.

function graphemeCount(s) {
  const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...seg.segment(s)].length;
}
function truncateByGrapheme(s, max) {
  const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let out = "", n = 0;
  for (const g of seg.segment(s)) { if (n++ >= max) break; out += g.segment; }
  return out;
}
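
Python's standard library has no grapheme segmenter, so the sketch below is a deliberately rough approximation, not an implementation of UAX #29. It only glues combining marks, variation selectors, skin‑tone modifiers, and ZWJ‑joined characters onto the preceding character; flags (regional indicator pairs), Hangul jamo, and other rules are not handled. All function names are hypothetical; for production, use a library that implements UAX #29.

```python
import unicodedata

ZWJ = "\u200D"


def _extends_cluster(ch: str) -> bool:
    """True if `ch` attaches to the previous character in this simplified model."""
    return (
        ch == ZWJ
        or unicodedata.category(ch) in ("Mn", "Mc", "Me")  # combining marks, variation selectors
        or 0x1F3FB <= ord(ch) <= 0x1F3FF                   # emoji skin-tone modifiers
    )


def rough_graphemes(text: str) -> list:
    clusters = []
    for ch in text:
        # Also join whatever follows a ZWJ onto the current cluster.
        if clusters and (_extends_cluster(ch) or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def truncate_rough(text: str, max_clusters: int) -> str:
    return "".join(rough_graphemes(text)[:max_clusters])


family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(rough_graphemes(family)))                    # 1
print(len(rough_graphemes("\U0001F44D\U0001F3FD")))    # 1 (thumbs up + skin tone)
print(truncate_rough("a" + family + "b", 2) == "a" + family)  # True
```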

6) BiDi (right‑to‑left) and mixed‑direction text

In HTML, prefer semantic controls:
- dir="auto" on user‑generated content.
- <bdi> to isolate a user name within surrounding text.
- <bdo dir="rtl"> only when you must override.
CSS: unicode-bidi: isolate; direction: rtl; where appropriate.
Avoid sprinkling invisible control chars unless you really know FSI/PDI/LRM/RLM.

<p>By <bdi>{{username}}</bdi></p>
<div dir="auto">{{user_bio}}</div>
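
Where markup isn't available (plain‑text emails, logs, push notifications), the closest equivalent of <bdi> is wrapping the inserted value in FSI (U+2068) and PDI (U+2069). A minimal sketch, assuming the surrounding renderer honors the Unicode bidi algorithm; isolate and byline are hypothetical helpers.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE: direction taken from the first strong character
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE: closes the isolate


def isolate(text: str) -> str:
    """Wrap user-provided text so its direction can't reorder its surroundings."""
    return f"{FSI}{text}{PDI}"


def byline(username: str) -> str:
    return f"By {isolate(username)}"


# A Hebrew username stays isolated from the surrounding LTR sentence.
print(repr(byline("\u05D3\u05E0\u05D4")))
```

Keep the pair balanced and add it only at the point of interpolation, so the control characters never end up in stored data.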

7) Databases & uniqueness (real‑world guardrails)

Use UTF‑8 everywhere; in MySQL ensure utf8mb4 (not utf8).
Pick a single normalization (e.g., NFC) at ingest, then enforce uniqueness on the normalized value.
Be explicit with collation on columns and indexes; changing a collation later changes the sort order, so every index that depends on it has to be rebuilt.

Postgres (email, case‑insensitive)

CREATE EXTENSION IF NOT EXISTS citext;
CREATE TABLE users (
  id bigserial PRIMARY KEY,
  email citext UNIQUE  -- case-insensitive compare/index; still normalize (NFC) before insert
);
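
The "normalize at ingest, enforce uniqueness on the normalized value" rule can be sketched as follows. SQLite is used only because it ships with Python; the table layout, column names, and helper functions are assumptions made for this example, not a recommended schema.

```python
import sqlite3
import unicodedata


def uniqueness_key(text: str) -> str:
    """NFC-normalize, case-fold, then re-normalize (folding can denormalize)."""
    folded = unicodedata.normalize("NFC", text).casefold()
    return unicodedata.normalize("NFC", folded)


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        " id INTEGER PRIMARY KEY,"
        " email TEXT NOT NULL,"             # display value, stored as NFC
        " email_key TEXT NOT NULL UNIQUE)"  # normalized + folded key
    )
    return conn


def insert_user(conn: sqlite3.Connection, email: str) -> bool:
    """Return False if an equivalent email already exists."""
    display = unicodedata.normalize("NFC", email)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO users (email, email_key) VALUES (?, ?)",
                (display, uniqueness_key(email)),
            )
    except sqlite3.IntegrityError:
        return False
    return True


conn = make_db()
print(insert_user(conn, "jos\u00E9@example.com"))   # True
print(insert_user(conn, "JOSE\u0301@example.com"))  # False: same key after NFC + casefold
```

Keeping the display value and the key in separate columns preserves what the user typed while the index only ever sees one canonical form.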

8) Security: confusables & spoofing

Homoglyphs (e.g., Latin “a” vs Cyrillic “а”) can spoof identifiers.
For usernames/tenants, consider UTS #39 confusables mapping (skeletons), restrict allowed scripts, or require mixed‑script review.
For domains, use Punycode/IDNA rules; don’t roll your own.
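
As a first line of defense, mixed‑script identifiers can be flagged with a simple heuristic. The sketch below is not UTS #39: the standard library exposes no script property, so it guesses the script from the first word of each letter's Unicode name. The function names are hypothetical.

```python
import unicodedata


def letter_scripts(identifier: str) -> set:
    """Approximate the set of scripts used by the letters in `identifier`."""
    scripts = set()
    for ch in identifier:
        # Only letters carry a script here; digits and punctuation are skipped.
        if unicodedata.category(ch).startswith("L"):
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(identifier: str) -> bool:
    return len(letter_scripts(identifier)) > 1


print(letter_scripts("paypal"))          # {'LATIN'}
print(is_mixed_script("p\u0430ypal"))    # True: U+0430 is Cyrillic "а"
print(is_mixed_script("user_42"))        # False
```

A flagged name is a candidate for review rather than proof of spoofing, and whole‑script lookalikes (an all‑Cyrillic string that resembles a Latin one) pass this check; confusable skeletons are needed to catch those.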

Pitfalls & fast fixes

| Pitfall | Why it hurts | Fix |
|---|---|---|
| Counting by bytes/length | UI truncation, broken limits | Count graphemes (Intl.Segmenter) |
| Raw compares on unnormalized text | Duplicate “same” strings | Normalize (NFC) before compare/store |
| .toLowerCase() for equality | Locale edge cases | Case fold (Python) or Intl.Collator |
| Sorting by code point | Weird order for users | Use ICU/Intl collations |
| MySQL utf8 | Can’t store full emoji | Use utf8mb4 |
| Unique index without normalization | Near duplicates | Normalize at ingest; enforce on normalized column |

Quick checklist

[ ] Normalize to NFC on input; consider NFKC for IDs/slugs.
[ ] Compare/search with case folding or Intl.Collator.
[ ] Limit/truncate by grapheme clusters.
[ ] Use utf8mb4 and explicit collations in DBs.
[ ] Add basic confusables protection for identifiers.
[ ] Handle RTL with dir="auto"/<bdi> and isolation.

One‑minute adoption plan

Add an NFC normalize step to your input pipeline and uniqueness checks.
Switch MySQL to utf8mb4 (and pick an explicit Unicode collation such as utf8mb4_0900_ai_ci).
Replace naive .length limits with grapheme‑aware counting.
Use Intl.Collator / Python casefold() for case‑insensitive features.
Wrap user‑provided inline text in <bdi> and set dir="auto" on blocks.