TL;DR
- ASCII is a 7‑bit set of 128 characters (English letters, digits, control chars).
- UTF‑8 is a variable‑length Unicode encoding (1–4 bytes) that can represent every character in modern writing systems.
- UTF‑8 is a superset of ASCII: the first 128 code points encode to the same single byte.
- Most “weird characters” and “question mark boxes” are encoding mismatches or database limits (e.g., MySQL `utf8` vs `utf8mb4`).
- Always treat text as Unicode; specify UTF‑8 end‑to‑end (files, HTTP, DB, terminals).
The mental model (60 seconds)
- Code point: abstract number for a character (e.g., `U+0041` = “A”, `U+1F44B` = “👋”).
- Encoding: how those numbers become bytes on disk/wire.
- ASCII encodes only `U+0000–U+007F` as 1 byte each.
- UTF‑8 encodes:
  - `U+0000–007F` → `0xxxxxxx` (1 byte)
  - `U+0080–07FF` → `110xxxxx 10xxxxxx` (2 bytes)
  - `U+0800–FFFF` → `1110xxxx 10xxxxxx 10xxxxxx` (3 bytes)
  - `U+10000–10FFFF` → `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx` (4 bytes)
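The bit layouts above can be turned into a small hand-written encoder. This is a minimal sketch for illustration only (the function name `utf8_encode` is hypothetical; real code should call `str.encode("utf-8")`), and it assumes surrogates `U+D800–U+DFFF` are rejected, as they are not valid on their own in UTF‑8.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point using the four UTF-8 bit layouts."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {code_point:#x}")
    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the hand-rolled result with Python's built-in encoder.
for ch in "Aé€👋":
    manual = utf8_encode(ord(ch))
    assert manual == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", manual.hex(" ").upper())
```

The loop checks each result against the built-in codec, so any mistake in the bit shifting would surface as an assertion error.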
Examples
| Character | Code point | UTF‑8 bytes (hex) |
|---|---|---|
| A | U+0041 | 41 |
| é | U+00E9 | C3 A9 |
| € | U+20AC | E2 82 AC |
| 👋 | U+1F44B | F0 9F 91 8B |
ASCII vs UTF‑8 (quick compare)
| Aspect | ASCII | UTF‑8 |
|---|---|---|
| Coverage | 128 chars (English + control) | All Unicode (143k+ code points) |
| Bytes per char | 1 byte fixed | 1–4 bytes (variable) |
| Backward compatible with ASCII | n/a | Yes (first 128 are identical) |
| International text | ❌ | ✅ |
| Emojis, symbols | ❌ | ✅ |
| File/Protocol popularity | Legacy | Modern default (web, APIs, Linux) |
Real‑world gotchas (and fixes)
1) Mojibake (`Ã©` instead of `é`)
- Cause: reading UTF‑8 bytes as if they were Latin‑1/Windows‑1252, or vice‑versa.
- Fix: set encodings explicitly everywhere; re‑decode using the correct encoding.
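Mojibake is easy to reproduce, which also shows how to undo it. The sketch below assumes the wrong decoder was Windows‑1252 (`cp1252`); the helper name `repair_mojibake` is hypothetical, and the repair only works when the garbled text still contains every original byte.

```python
original = "é"
utf8_bytes = original.encode("utf-8")   # b'\xc3\xa9'
garbled = utf8_bytes.decode("cp1252")   # wrong decoder: two characters, 'Ã©'
print(garbled)


def repair_mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Reverse a UTF-8-read-as-cp1252 mix-up; return text unchanged if it does not apply."""
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repair_mojibake("CafÃ©"))  # Café
print(repair_mojibake("Café"))   # already correct, returned unchanged
```

A one-shot repair like this is for cleaning up existing data; the durable fix is still to declare the encoding at every boundary.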
2) MySQL utf8 ≠ full Unicode
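- Problem: `utf8` in MySQL is an alias for `utf8mb3`, a 3‑byte subset of UTF‑8 that cannot store emojis (code points above `U+FFFF`, e.g. `U+1F...`, need 4 bytes).
- Fix: use the `utf8mb4` charset + a matching collation (e.g., `utf8mb4_0900_ai_ci`, available from MySQL 8.0).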
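Before migrating, it can help to know which existing values would not fit in a 3‑byte column. A minimal sketch, assuming the text is already decoded into Python strings (the function name `chars_needing_utf8mb4` is hypothetical): any code point above `U+FFFF` takes 4 bytes in UTF‑8.

```python
def chars_needing_utf8mb4(text: str) -> list:
    """Return the characters in text whose UTF-8 form is 4 bytes long."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


sample = "Price: 5 € 👋"
for ch in chars_needing_utf8mb4(sample):
    print(f"U+{ord(ch):04X} needs {len(ch.encode('utf-8'))} bytes")
```

The euro sign is not reported because it sits below `U+FFFF` and encodes to 3 bytes.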
3) Byte Order Mark (BOM)
- UTF‑8 BOM (`EF BB BF`) is optional but can break scripts/headers.
- Fix: prefer UTF‑8 without BOM; configure editors and tools accordingly.
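When a file with a BOM arrives anyway, it can be detected and dropped on read. A minimal sketch using Python's standard `codecs` module; the helper name `strip_utf8_bom` is hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


raw = codecs.BOM_UTF8 + "#!/bin/sh\n".encode("utf-8")

# Plain "utf-8" keeps the BOM as an invisible U+FEFF at the start of the text.
print(repr(raw.decode("utf-8")[0]))
# "utf-8-sig" drops a leading BOM and behaves like "utf-8" when there is none.
print(repr(raw.decode("utf-8-sig")))
```

Decoding with `utf-8-sig` on input and writing with plain `utf-8` on output is one way to accept BOMs without producing them.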
4) Characters vs. what users see (graphemes)
- A single user‑perceived character may be multiple code points (e.g., 🇿🇦, or `e` + combining acute).
- Fix: for counting/slicing, use grapheme‑cluster–aware APIs (e.g., `Intl.Segmenter` in JS) when precision matters.
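The gap between "one visible character" and the underlying units is easy to measure. The sketch below (with a hypothetical helper `text_sizes`) only counts code points, UTF‑16 code units, and UTF‑8 bytes; Python's standard library does not segment grapheme clusters, so a dedicated library would be needed for that part.

```python
def text_sizes(s: str) -> dict:
    """Count the same string three ways; none of these is a grapheme count."""
    return {
        "code_points": len(s),
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "utf8_bytes": len(s.encode("utf-8")),
    }


flag = "\U0001F1FF\U0001F1E6"  # 🇿🇦: two regional-indicator code points
accented = "e\u0301"           # 'e' + combining acute accent

print(text_sizes(flag))      # one visible flag, two code points
print(text_sizes(accented))  # one visible letter, two code points
```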
5) Normalization (NFC/NFD)
- The same letter can have composed or decomposed forms.
- Fix: normalize before comparing strings or storing them as keys; NFC is the common choice.
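The effect on comparisons is worth seeing once. A minimal sketch using the standard `unicodedata` module; the helper `nfc_equal` is a hypothetical name.

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # 'e' followed by a combining acute accent


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # False: different code point sequences
print(nfc_equal(composed, decomposed))  # True: same text once normalized
```

Both strings render identically, which is why an un-normalized key lookup can miss a value that looks like an exact match.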
Set UTF‑8 end‑to‑end (copy‑paste)
HTML
<meta charset="utf-8">
HTTP headers
Content-Type: text/html; charset=utf-8
Content-Type: application/json; charset=utf-8
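On the receiving side, the declared charset can be read back out of a `Content-Type` value. This is a sketch that borrows the standard library's `email.message.Message` parser; the helper name `charset_of` and the UTF‑8 fallback are assumptions for illustration, not something the headers above mandate.

```python
from email.message import Message


def charset_of(content_type: str, default: str = "utf-8") -> str:
    """Return the lowercased charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


print(charset_of("text/html; charset=utf-8"))
print(charset_of("text/plain; charset=ISO-8859-1"))
print(charset_of("application/json"))  # no charset declared: falls back to default
```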
Node.js
import fs from "node:fs/promises";
const text = await fs.readFile("input.txt", "utf8");
await fs.writeFile("out.txt", text, "utf8");
// String length vs. byte length
const s = "👋🏽"; // 1 grapheme, 2 code points (wave + skin-tone modifier)
console.log(s.length); // 4: JS "length" counts UTF-16 code units
console.log(Buffer.byteLength(s, "utf8")); // 8: bytes on the wire
Browser fetch
const res = await fetch("/api");
const text = await res.text(); // text() always decodes the body as UTF-8, so the server must actually send UTF-8
Python 3 (default is Unicode)
from pathlib import Path
data = Path("in.txt").read_text(encoding="utf-8")
Path("out.txt").write_text(data, encoding="utf-8")
import unicodedata
s = "Cafe\u0301"  # 'e' + combining acute (decomposed form)
print(unicodedata.normalize("NFC", s))  # "Café" with a single precomposed é
PostgreSQL/MySQL
-- PostgreSQL: UTF-8 database
CREATE DATABASE app ENCODING 'UTF8';
-- MySQL: FULL Unicode
CREATE DATABASE app CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
Terminals
- Set terminal locale to UTF‑8 (e.g., `en_US.UTF-8`).
- In Docker: `ENV LANG=C.UTF-8 LC_ALL=C.UTF-8`.
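To confirm what a process actually inherited from those settings, the runtime can be asked directly. A minimal sketch (the function name `report_encodings` is hypothetical); the values depend entirely on the environment it runs in.

```python
import locale
import sys


def report_encodings() -> dict:
    """Collect the encodings Python picked up from the current environment."""
    return {
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
    }


for name, value in report_encodings().items():
    print(f"{name}: {value}")
```

Anything other than a UTF‑8 variant in this report is a hint that the locale was not applied to the process.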
Diagnosing encoding issues
- Inspect bytes (hex) and compare to known UTF‑8 sequences.
- Check HTTP headers: `Content-Type` and `charset`.
- Confirm DB column/table charset/collation.
- Verify editor/IDE and source file encodings.
- Normalize strings before equality tests or deduplication.
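The first step in the list, inspecting bytes, can be sketched as a small helper that dumps hex and attempts a strict UTF‑8 decode. The name `diagnose` and the output format are hypothetical.

```python
def diagnose(data: bytes) -> str:
    """Show the bytes in hex and report whether they form valid UTF-8."""
    hex_dump = data.hex(" ")
    try:
        text = data.decode("utf-8")  # strict: raises on any invalid sequence
    except UnicodeDecodeError as err:
        return f"{hex_dump} | invalid UTF-8 at byte {err.start}: {err.reason}"
    return f"{hex_dump} | valid UTF-8: {text!r}"


print(diagnose("café".encode("utf-8")))    # c3 a9 for é
print(diagnose("café".encode("latin-1")))  # a lone e9 is not valid UTF-8
```

A lone `e9` where `c3 a9` was expected usually means the data was written as Latin‑1 or Windows‑1252 rather than UTF‑8.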
Quick checklist
- [ ] Use UTF‑8 everywhere (files, HTTP, DB, terminal).
- [ ] Prefer `utf8mb4` on MySQL; set appropriate collations.
- [ ] Add `<meta charset="utf-8">` in HTML; set response charset.
- [ ] Normalize to NFC when comparing/storing keys.
- [ ] Count user‑perceived characters (graphemes), not code units, when length matters.
- [ ] Avoid BOM in UTF‑8 unless you know you need it.
One‑minute adoption plan
- Audit your stack for encodings; switch everything to UTF‑8 / `utf8mb4`.
- Add charset headers to all responses and HTML.
- Normalize critical identifiers to NFC on write; document it.
- Add tests that cover emoji, accents, and RTL text.
- Train the team: ASCII ⊂ UTF‑8, bytes ≠ characters.
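The testing step in the plan above could start from a round-trip check like the one below. It is a sketch only: the sample strings and the helper name `file_round_trip` are illustrative, and a real suite would push the same samples through the database and HTTP layers as well.

```python
import tempfile
from pathlib import Path

# Illustrative samples: emoji, accented Latin text, and right-to-left text.
SAMPLES = {
    "emoji": "👋🏽",
    "accents": "Café",
    "rtl": "שלום",
}


def file_round_trip(text: str) -> bool:
    """Write text as UTF-8, read it back, and report whether it survived."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.txt"
        path.write_text(text, encoding="utf-8")
        return path.read_text(encoding="utf-8") == text


for name, sample in SAMPLES.items():
    assert file_round_trip(sample), f"{name} did not survive the round trip"
print("all samples round-tripped")
```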