TL;DR
- Bytes ≠ code points ≠ user‑perceived characters (grapheme clusters). Don’t count by bytes or code units.
- Normalize input (usually NFC) before storing/uniqueness checks; use NFKC only for identifiers/slugs.
- For case‑insensitive compare/search, prefer case folding (Python casefold()), or ICU/Intl.Collator with sensitivity: "base".
- Sort/search with proper collation (ICU/locale), not raw code point order.
- Handle emoji/ZWJ/skin tones and RTL/bidi safely; measure and truncate by grapheme clusters.
- Databases: use real Unicode (e.g., utf8mb4 in MySQL), set a consistent collation, and be mindful of unique indexes + normalization.
1) The layers: bytes → code points → grapheme clusters
- Encoding (UTF‑8/16/32): how code points map to bytes.
- Code point: a number in the Unicode code space that identifies a character (e.g., U+00E9 is “é”; U+1F600 is 😀).
- Grapheme cluster: what users see as one character (may be multiple code points), e.g., e + combining acute = “é”; the family emoji 👨‍👩‍👧‍👦 is a ZWJ sequence.
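To make the three layers concrete, here is a minimal sketch that measures one string at each of them. The helper name `measure` is made up for illustration; `TextEncoder`, `String.prototype.length`, and `Array.from` are standard.

```javascript
// Measure one string at three layers: UTF-8 bytes, UTF-16 code units, code points.
function measure(s) {
  return {
    utf8Bytes: new TextEncoder().encode(s).length, // encoded size in UTF-8
    utf16Units: s.length,                          // what String.prototype.length reports
    codePoints: Array.from(s).length,              // string iteration walks code points
  };
}

// 'e' + U+0301 (combining acute) followed by U+1F600 (grinning face)
const sample = "e\u0301\u{1F600}";
console.log(measure(sample));
// { utf8Bytes: 7, utf16Units: 4, codePoints: 3 }, yet a reader sees 2 characters
```

None of the three numbers matches what the reader perceives, which is why limits and cursors should work on grapheme clusters.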
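JS length surprises
// Family emoji: four person emoji joined by three ZWJs (U+200D)
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
family.length // 11 code units (UTF-16), not 1
Array.from(family).length // 7 code points (still not 1)
Count graphemes, not code units:
// Modern JS (Intl.Segmenter)
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
[...seg.segment(family)].length // 1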
2) Normalization: NFC, NFD, NFKC, NFKD
The same visual text can have different byte sequences (e.g., “é” as single code point U+00E9 vs “e” + U+0301 combining mark). Normalize to avoid duplicate/compare bugs.
Common practice
- Store text as NFC.
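- For identifiers/slugs or loose matching, consider NFKC/NFKD, which also fold compatibility variants (ligatures, fullwidth forms). The mapping is lossy, so keep the original text for display.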
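As a rough illustration of what the compatibility forms change, the sketch below compares NFC and NFKC on three inputs. `toIdentifierForm` is a hypothetical helper name; `String.prototype.normalize` is the only API used.

```javascript
// Hypothetical helper: canonical form for identifiers/slugs (lossy by design).
function toIdentifierForm(s) {
  return s.normalize("NFKC");
}

const samples = [
  "\uFB01le",           // "file" written with the U+FB01 "fi" ligature
  "\uFF21\uFF22\uFF23", // fullwidth "ABC"
  "x\u00B2",            // "x" followed by SUPERSCRIPT TWO
];
for (const s of samples) {
  // NFC keeps compatibility characters; NFKC replaces them with plain equivalents.
  console.log(JSON.stringify(s.normalize("NFC")), "->", JSON.stringify(toIdentifierForm(s)));
}
```

The last sample shows the cost: the superscript becomes an ordinary "2", which is acceptable for a slug but would corrupt display text.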
Examples
// JavaScript
const a = "\u00E9"; // "é" as one code point, U+00E9
const b = "e\u0301"; // "é" as 'e' + U+0301 combining acute
a === b // false
a.normalize("NFC") === b.normalize("NFC") // true
# Python
import unicodedata as u
a, b = "\u00e9", "e\u0301"  # precomposed vs. 'e' + combining acute
u.normalize("NFC", a) == u.normalize("NFC", b) # True
Normalize before hashing, deduping, or enforcing unique constraints.
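A minimal sketch of deduplication on the normalized value, assuming NFC is the chosen form. `dedupeNormalized` is an illustrative name, not an existing library function.

```javascript
// Hypothetical helper: keep the first occurrence of each string, comparing in NFC.
function dedupeNormalized(values) {
  const seen = new Set();
  const out = [];
  for (const v of values) {
    const key = v.normalize("NFC"); // same key for precomposed and decomposed input
    if (!seen.has(key)) {
      seen.add(key);
      out.push(key); // store the normalized form, not the raw input
    }
  }
  return out;
}

// "café" twice (precomposed, then decomposed) plus plain "cafe"
const tags = ["caf\u00E9", "cafe\u0301", "cafe"];
console.log(dedupeNormalized(tags).length); // 2
```

The same key would be used as input to a hash function or as the value stored in a uniquely indexed column.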
3) Case‑insensitivity: lowercasing isn’t enough
- .toLowerCase() ignores locale‑specific rules (e.g., Turkish dotted/dotless I/i) and does not equate forms such as “ß” and “SS”.
- Prefer case folding for language‑agnostic comparisons.
"Straße".casefold() == "STRASSE".casefold() # True
In JS, use locale‑aware comparison for search/equality:
const coll = new Intl.Collator("de", { sensitivity: "base" });
coll.compare("Straße", "STRASSE") === 0 // true
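Intl.Collator answers "are these equal?" but does not produce a key you can store or index. JavaScript has no built-in equivalent of casefold(), so the sketch below uses a common approximation: upper-case, then lower-case. `approxFoldKey` is a hypothetical helper and is not full Unicode case folding; treat it as a starting point only.

```javascript
// Plain lowercasing does not make these equal: lowercase "ß" stays "ß".
const plainEqual = "Stra\u00DFe".toLowerCase() === "STRASSE".toLowerCase();
console.log(plainEqual); // false

// Hypothetical helper, NOT full Unicode case folding: upper-casing first expands
// "ß" to "SS", and the final lower-casing gives a stable comparison key.
function approxFoldKey(s) {
  return s.normalize("NFC").toUpperCase().toLowerCase().normalize("NFC");
}

const foldedEqual = approxFoldKey("Stra\u00DFe") === approxFoldKey("STRASSE");
console.log(foldedEqual); // true
```

Where an exact match to Unicode case folding matters, a library that ships the CaseFolding data is a safer choice than this approximation.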
4) Collation & search (sort like humans)
- Use ICU/Intl for ordering & accent‑insensitive search.
- DBs: pick collations explicitly and consistently.
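In application code, the first bullet might look like the following sketch. The sample names and the helper `findMatches` are made up for illustration; the locale "en" is an assumption, so substitute the locale of your users.

```javascript
const names = ["Zoe", "\u00C9mile", "eve", "\u00C4rger", "adam"]; // Émile, Ärger

// Default sort compares UTF-16 code units: uppercase first, accented letters last.
const byCodeUnit = [...names].sort();

// Collator-based sort follows language rules for the requested locale.
const collator = new Intl.Collator("en");
const byCollation = [...names].sort(collator.compare);

// Accent- and case-insensitive matching with sensitivity "base".
const loose = new Intl.Collator("en", { sensitivity: "base" });
function findMatches(list, query) {
  return list.filter((item) => loose.compare(item, query) === 0);
}

console.log(byCodeUnit);  // [ 'Zoe', 'adam', 'eve', 'Ärger', 'Émile' ]
console.log(byCollation); // [ 'adam', 'Ärger', 'Émile', 'eve', 'Zoe' ]
console.log(findMatches(["resume", "r\u00E9sum\u00E9", "Resume", "rest"], "resume"));
```

Creating the collator once and reusing its `compare` function is cheaper than calling `localeCompare` with options on every comparison.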
Postgres
-- ICU collation (requires ICU-enabled build)
CREATE COLLATION de_phonebook (provider = icu, locale = 'de@collation=phonebook');
SELECT * FROM contacts ORDER BY name COLLATE "de_phonebook";
MySQL
-- Use real Unicode in MySQL
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci; -- accent- and case-insensitive (MySQL 8.0+)
5) Emoji, ZWJ, and limits that don’t explode UX
- Emoji can be sequences: base + skin‑tone modifier, or multiple joined by ZWJ (U+200D).
- Limit by grapheme clusters, not bytes or length.
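A small sketch of why limits based on code units misbehave on emoji sequences. The variable names are illustrative; the code points are the standard ones for thumbs up, the medium skin tone modifier, and ZWJ.

```javascript
// Thumbs up (U+1F44D) + medium skin tone modifier (U+1F3FD): one visible emoji.
const thumbs = "\u{1F44D}\u{1F3FD}";
console.log(thumbs.length);             // 4 UTF-16 code units
console.log(Array.from(thumbs).length); // 2 code points

// Cutting by code units can drop the modifier or split a surrogate pair.
const lostTone = thumbs.slice(0, 2);   // plain thumbs up, skin tone gone
const brokenPair = thumbs.slice(0, 3); // thumbs up + a lone high surrogate

// Family emoji: four person emoji joined by ZWJ (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log(family.split("\u200D").length); // 4 separate people once the joiners are removed
```

The grapheme-aware helpers that follow avoid both failure modes because they never cut inside a cluster.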
// Count user-perceived characters (grapheme clusters) in s.
function graphemeCount(s) {
  const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...seg.segment(s)].length;
}
// Keep at most `max` grapheme clusters, never cutting inside one.
function truncateByGrapheme(s, max) {
  const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let out = "", n = 0;
  for (const g of seg.segment(s)) {
    if (n++ >= max) break;
    out += g.segment;
  }
  return out;
}
6) BiDi (right‑to‑left) and mixed‑direction text
- In HTML, prefer semantic controls:
  - dir="auto" on user‑generated content.
  - <bdi> to isolate a user name within surrounding text.
  - <bdo dir="rtl"> only when you must override.
- CSS: unicode-bidi: isolate; direction: rtl; where appropriate.
- Avoid sprinkling invisible control chars unless you really know FSI/PDI/LRM/RLM.
<p>By <bdi>{{username}}</bdi></p>
<div dir="auto">{{user_bio}}</div>
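When markup is not available (for example, a plain-text notification), the closest equivalent of <bdi> is wrapping the inserted value in FSI and PDI. The sketch below assumes that situation; `isolate` is a hypothetical helper, and in HTML the elements above remain the better tool.

```javascript
const FSI = "\u2068"; // FIRST STRONG ISOLATE
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

// Hypothetical helper for plain-text output only; it wraps one inserted value
// so its direction cannot reorder the text around it.
function isolate(text) {
  return FSI + text + PDI;
}

const username = "\u05D3\u05E0\u05D9"; // a Hebrew name, right-to-left
const message = `By ${isolate(username)} (3 posts)`;
console.log(JSON.stringify(message));
```

Always emit the pair together: an FSI without its matching PDI affects everything up to the end of the paragraph.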
7) Databases & uniqueness (real‑world guardrails)
- Use UTF‑8 everywhere; in MySQL ensure utf8mb4 (not utf8).
- Pick a single normalization (e.g., NFC) at ingest, then enforce uniqueness on the normalized value.
- Be explicit with collation on columns and indexes; changing collations later can invalidate index order.
Postgres (email, case‑insensitive)
CREATE EXTENSION IF NOT EXISTS citext;
CREATE TABLE users (
  id bigserial PRIMARY KEY,
  email citext UNIQUE -- case-insensitive compare/index
);
8) Security: confusables & spoofing
- Homoglyphs (e.g., Latin “a” vs Cyrillic “а”) can spoof identifiers.
- For usernames/tenants, consider UTS #39 confusables mapping (skeletons), restrict allowed scripts, or require mixed‑script review.
- For domains, use Punycode/IDNA rules; don’t roll your own.
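A rough sketch of two of these checks: flagging identifiers that mix scripts, and letting the platform's URL parser apply IDNA instead of hand-rolled code. The script list and the helper names `scriptsUsed` and `isMixedScript` are assumptions for illustration; this is not a UTS #39 skeleton implementation.

```javascript
// Assumption: the scripts this application wants to tell apart.
const SCRIPTS = ["Latin", "Cyrillic", "Greek"];
const SCRIPT_TESTS = SCRIPTS.map((name) => [name, new RegExp(`\\p{Script=${name}}`, "u")]);

// Hypothetical helper: which of the listed scripts occur in s?
function scriptsUsed(s) {
  return SCRIPT_TESTS.filter(([, re]) => re.test(s)).map(([name]) => name);
}

// Hypothetical helper: true when letters from more than one listed script appear.
function isMixedScript(s) {
  return scriptsUsed(s).length > 1;
}

console.log(isMixedScript("paypal"));      // false
console.log(isMixedScript("p\u0430ypal")); // true: U+0430 is CYRILLIC SMALL LETTER A

// For domains, the built-in URL parser converts to the ASCII (Punycode) form.
console.log(new URL("https://b\u00FCcher.example").hostname); // xn--bcher-kva.example
```

A mixed-script hit is a signal for review rather than proof of abuse, since some legitimate names combine scripts.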
Pitfalls & fast fixes
| Pitfall | Why it hurts | Fix |
|---|---|---|
| Counting by bytes/length | UI truncation, broken limits | Count graphemes (Intl.Segmenter) |
| Raw compares on unnormalized text | Duplicate “same” strings | Normalize (NFC) before compare/store |
| .toLowerCase() for equality | Locale edge cases | Case fold (Python) or Intl.Collator |
| Sorting by code point | Weird order for users | Use ICU/Intl collations |
| MySQL utf8 | Can’t store full emoji | Use utf8mb4 |
| Unique index without normalization | Near duplicates | Normalize at ingest; enforce on normalized column |
Quick checklist
- [ ] Normalize to NFC on input; consider NFKC for IDs/slugs.
- [ ] Compare/search with case folding or Intl.Collator.
- [ ] Limit/truncate by grapheme clusters.
- [ ] Use utf8mb4 and explicit collations in DBs.
- [ ] Add basic confusables protection for identifiers.
- [ ] Handle RTL with dir="auto"/<bdi> and isolation.
One‑minute adoption plan
- Add an NFC normalize step to your input pipeline and uniqueness checks.
- Switch MySQL to utf8mb4 (and pick an explicit collation, e.g., utf8mb4_0900_ai_ci).
- Replace naive .length limits with grapheme‑aware counting.
- Use Intl.Collator / Python casefold() for case‑insensitive features.
- Wrap user‑provided inline text in <bdi> and set dir="auto" on blocks.