UTF-8 vs ASCII — What Every Developer Should Know

3 min read

A simple guide to character encoding: what ASCII is, how UTF‑8 works, why bytes ≠ characters, and the real-world gotchas (mojibake, emojis, normalization, MySQL utf8mb4).

TL;DR

  • ASCII is a 7‑bit set of 128 characters (English letters, digits, control chars).
  • UTF‑8 is a variable‑length Unicode encoding (1–4 bytes) that can represent every character in modern writing systems.
  • UTF‑8 is a superset of ASCII: the first 128 code points encode to the same single byte.
  • Most “weird characters” and “question mark boxes” are encoding mismatches or database limits (e.g., MySQL utf8 vs utf8mb4).
  • Always treat text as Unicode; specify UTF‑8 end‑to‑end (files, HTTP, DB, terminals).

The mental model (60 seconds)

  • Code point: abstract number for a character (e.g., U+0041 = “A”, U+1F44B = “👋”).
  • Encoding: how those numbers become bytes on disk/wire.
  • ASCII encodes only U+0000–U+007F as 1 byte each.
  • UTF‑8 encodes:
    • U+0000–U+007F → 0xxxxxxx (1 byte)
    • U+0080–U+07FF → 110xxxxx 10xxxxxx (2 bytes)
    • U+0800–U+FFFF → 1110xxxx 10xxxxxx 10xxxxxx (3 bytes)
    • U+10000–U+10FFFF → 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (4 bytes)
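These ranges are easy to verify yourself; a quick Python sketch (any Python 3):

```python
# UTF-8 byte length grows with the code point, matching the ranges above
for ch in "Aé€👋":  # U+0041, U+00E9, U+20AC, U+1F44B
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} byte(s))")
```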

Examples

| Character | Code point | UTF‑8 bytes (hex) |
|---|---|---|
| A | U+0041 | 41 |
| é | U+00E9 | C3 A9 |
| € | U+20AC | E2 82 AC |
| 👋 | U+1F44B | F0 9F 91 8B |


ASCII vs UTF‑8 (quick compare)

| Aspect | ASCII | UTF‑8 |
|---|---|---|
| Coverage | 128 chars (English + control) | All Unicode (143k+ code points) |
| Bytes per char | 1 byte fixed | 1–4 bytes (variable) |
| Backward compatible with ASCII | — | Yes (first 128 are identical) |
| International text | ❌ | ✅ |
| Emojis, symbols | ❌ | ✅ |
| File/Protocol popularity | Legacy | Modern default (web, APIs, Linux) |


Real‑world gotchas (and fixes)

1) Mojibake (Ã© instead of é)

  • Cause: reading UTF‑8 bytes as if they were Latin‑1/Windows‑1252, or vice‑versa.
  • Fix: set encodings explicitly everywhere; re‑decode using the correct encoding.
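The round trip is easy to reproduce in Python; a minimal sketch:

```python
original = "café"
# Bug: UTF-8 bytes decoded as Latin-1 produce mojibake
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # "cafÃ©"
# Fix: undo the wrong step, then decode with the correct encoding
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # "café"
```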

2) MySQL utf8 ≠ full Unicode

  • Problem: utf8 in older MySQL means 3‑byte UTF‑8, which cannot store emojis or any other code point above U+FFFF (those need 4 bytes).
  • Fix: use utf8mb4 charset + a matching collation (e.g., utf8mb4_0900_ai_ci).

3) Byte Order Mark (BOM)

  • UTF‑8 BOM (EF BB BF) is optional but can break scripts/headers.
  • Fix: prefer UTF‑8 without BOM; configure editors and tools accordingly.
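In Python, the `utf-8-sig` codec strips a leading BOM if one is present; a small sketch:

```python
data = b"\xef\xbb\xbfhello"      # UTF-8 BOM followed by "hello"
print(repr(data.decode("utf-8")))      # keeps U+FEFF at the front
print(repr(data.decode("utf-8-sig")))  # "hello" — BOM stripped
```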

4) Characters vs. what users see (graphemes)

  • A single user‑perceived character may be multiple code points (e.g., 🇿🇦, or e + combining acute).
  • Fix: for counting/slicing, use grapheme‑cluster–aware APIs (e.g., Intl.Segmenter in JS) when precision matters.
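Python's built-in `len` counts code points, not graphemes, which makes the mismatch easy to see (grapheme-aware counting in Python needs a third-party library such as `regex`):

```python
flag = "\U0001F1FF\U0001F1E6"  # 🇿🇦: two regional-indicator code points
accent = "e\u0301"             # 'e' + combining acute: one grapheme, two code points
print(len(flag), len(accent))  # code-point counts, not what users see
```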

5) Normalization (NFC/NFD)

  • The same letter can have composed or decomposed forms.
  • Fix: normalize before comparing or storing keys: NFC is common.

Set UTF‑8 end‑to‑end (copy‑paste)

HTML

<meta charset="utf-8">

HTTP headers

Content-Type: text/html; charset=utf-8
Content-Type: application/json; charset=utf-8

Node.js

import fs from "node:fs/promises";
const text = await fs.readFile("input.txt", "utf8");
await fs.writeFile("out.txt", text, "utf8");

// Buffer lengths: bytes vs chars
const s = "👋🏽";                  // 1 grapheme, multiple code points
console.log(s.length);            // JS "length" = UTF-16 code units
console.log(Buffer.byteLength(s, "utf8")); // bytes on wire

Browser fetch

const res = await fetch("/api");
const text = await res.text(); // assumes server sent utf-8; header matters

Python 3 (default is Unicode)

from pathlib import Path
data = Path("in.txt").read_text(encoding="utf-8")
Path("out.txt").write_text(data, encoding="utf-8")

import unicodedata
s = "Cafe\u0301"  # 'e' + combining acute (decomposed, NFD)
print(unicodedata.normalize("NFC", s))  # "Café" (composed form)

PostgreSQL/MySQL

-- PostgreSQL: UTF-8 database
CREATE DATABASE app ENCODING 'UTF8';

-- MySQL: FULL Unicode
CREATE DATABASE app CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

Terminals

  • Set terminal locale to UTF-8 (e.g., en_US.UTF-8).
  • In Docker: ENV LANG=C.UTF-8 LC_ALL=C.UTF-8.

Diagnosing encoding issues

  1. Inspect bytes (hex) and compare to known UTF‑8 sequences.
  2. Check HTTP headers, Content-Type and charset.
  3. Confirm DB column/table charset/collation.
  4. Verify editor/IDE and source file encodings.
  5. Normalize strings before equality tests or deduplication.
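Step 1 is a one-liner in Python; a sketch:

```python
text = "é"
print(text.encode("utf-8").hex(" "))  # c3 a9 — compare against the table above
# Mis-decoded text is often the Latin-1 reading of those same bytes:
print(bytes.fromhex("c3 a9").decode("latin-1"))  # Ã©
```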

Quick checklist

  • [ ] Use UTF‑8 everywhere (files, HTTP, DB, terminal).
  • [ ] Prefer utf8mb4 on MySQL; set appropriate collations.
  • [ ] Add <meta charset="utf-8"> in HTML; set response charset.
  • [ ] Normalize to NFC when comparing/storing keys.
  • [ ] Be careful counting user‑perceived characters (graphemes), not code units.
  • [ ] Avoid BOM in UTF‑8 unless you know you need it.

One‑minute adoption plan

  1. Audit your stack for encodings; switch everything to UTF‑8 / utf8mb4.
  2. Add charset headers to all responses and HTML.
  3. Normalize critical identifiers to NFC on write; document it.
  4. Add tests that cover emoji, accents, and RTL text.
  5. Train the team: ASCII ⊂ UTF‑8, bytes ≠ characters.