UTF-8 vs ASCII — What Every Developer Should Know

3 min read

A simple guide to character encoding: what ASCII is, how UTF‑8 works, why bytes ≠ characters, and the real-world gotchas (mojibake, emojis, normalization, MySQL utf8mb4).

TL;DR

  • ASCII is a 7‑bit set of 128 characters (English letters, digits, control chars).
  • UTF‑8 is a variable‑length Unicode encoding (1–4 bytes) that can represent every character in modern writing systems.
  • UTF‑8 is a superset of ASCII: the first 128 code points encode to the same single byte.
  • Most “weird characters” and “question mark boxes” are encoding mismatches or database limits (e.g., MySQL utf8 vs utf8mb4).
  • Always treat text as Unicode; specify UTF‑8 end‑to‑end (files, HTTP, DB, terminals).

The mental model (60 seconds)

  • Code point: abstract number for a character (e.g., U+0041 = “A”, U+1F44B = “👋”).
  • Encoding: how those numbers become bytes on disk/wire.
  • ASCII encodes only U+0000–U+007F as 1 byte each.
  • UTF‑8 encodes:
    • U+0000–U+007F → 0xxxxxxx (1 byte)
    • U+0080–U+07FF → 110xxxxx 10xxxxxx (2 bytes)
    • U+0800–U+FFFF → 1110xxxx 10xxxxxx 10xxxxxx (3 bytes)
    • U+10000–U+10FFFF → 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (4 bytes)
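These ranges are easy to verify yourself; a quick Python sketch (any Python 3):

```python
# UTF-8 byte length grows with the code point, matching the ranges above
for ch in "Aé€👋":  # U+0041, U+00E9, U+20AC, U+1F44B
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} byte(s))")
```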

Examples

| Character | Code point | UTF‑8 bytes (hex) |
|---|---|---|
| A | U+0041 | 41 |
| é | U+00E9 | C3 A9 |
| € | U+20AC | E2 82 AC |
| 👋 | U+1F44B | F0 9F 91 8B |


ASCII vs UTF‑8 (quick compare)

| Aspect | ASCII | UTF‑8 |
|---|---|---|
| Coverage | 128 chars (English + control) | All Unicode (143k+ code points) |
| Bytes per char | 1 byte fixed | 1–4 bytes (variable) |
| Backward compatible with ASCII | — | Yes (first 128 are identical) |
| International text | ❌ | ✅ |
| Emojis, symbols | ❌ | ✅ |
| File/Protocol popularity | Legacy | Modern default (web, APIs, Linux) |


Real‑world gotchas (and fixes)

1) Mojibake (Ã© instead of é)

  • Cause: reading UTF‑8 bytes as if they were Latin‑1/Windows‑1252, or vice‑versa.
  • Fix: set encodings explicitly everywhere; re‑decode using the correct encoding.
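The round trip is easy to reproduce in Python; a minimal sketch:

```python
original = "café"
# Bug: UTF-8 bytes decoded as Latin-1 produce mojibake
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # "cafÃ©"
# Fix: undo the wrong step, then decode with the correct encoding
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # "café"
```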

2) MySQL utf8 ≠ full Unicode

  • Problem: utf8 in older MySQL means 3‑byte UTF‑8, which cannot store emojis or any other code point above U+FFFF (those need 4 bytes).
  • Fix: use utf8mb4 charset + a matching collation (e.g., utf8mb4_0900_ai_ci).

3) Byte Order Mark (BOM)

  • UTF‑8 BOM (EF BB BF) is optional but can break scripts/headers.
  • Fix: prefer UTF‑8 without BOM; configure editors and tools accordingly.
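In Python, the `utf-8-sig` codec strips a leading BOM if one is present; a small sketch:

```python
data = b"\xef\xbb\xbfhello"      # UTF-8 BOM followed by "hello"
print(repr(data.decode("utf-8")))      # keeps U+FEFF at the front
print(repr(data.decode("utf-8-sig")))  # "hello" — BOM stripped
```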

4) Characters vs. what users see (graphemes)

  • A single user‑perceived character may be multiple code points (e.g., 🇿🇦, or e + combining acute).
  • Fix: for counting/slicing, use grapheme‑cluster–aware APIs (e.g., Intl.Segmenter in JS) when precision matters.
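Python's built-in `len` counts code points, not graphemes, which makes the mismatch easy to see (grapheme-aware counting in Python needs a third-party library such as `regex`):

```python
flag = "\U0001F1FF\U0001F1E6"  # 🇿🇦: two regional-indicator code points
accent = "e\u0301"             # 'e' + combining acute: one grapheme, two code points
print(len(flag), len(accent))  # code-point counts, not what users see
```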

5) Normalization (NFC/NFD)

  • The same letter can have composed or decomposed forms.
  • Fix: normalize before comparing or storing keys: NFC is common.

Set UTF‑8 end‑to‑end (copy‑paste)

HTML

<meta charset="utf-8">

HTTP headers

Content-Type: text/html; charset=utf-8
Content-Type: application/json; charset=utf-8

Node.js

import fs from "node:fs/promises";
const text = await fs.readFile("input.txt", "utf8");
await fs.writeFile("out.txt", text, "utf8");

// Buffer lengths: bytes vs chars
const s = "👋🏽";                  // 1 grapheme, multiple code points
console.log(s.length);            // JS "length" = UTF-16 code units
console.log(Buffer.byteLength(s, "utf8")); // bytes on wire

Browser fetch

const res = await fetch("/api");
const text = await res.text(); // assumes server sent utf-8; header matters

Python 3 (default is Unicode)

from pathlib import Path
data = Path("in.txt").read_text(encoding="utf-8")
Path("out.txt").write_text(data, encoding="utf-8")

import unicodedata
s = "Cafe\u0301"  # 'e' + combining acute (decomposed, NFD)
print(unicodedata.normalize("NFC", s))  # "Café" (composed form)

PostgreSQL/MySQL

-- PostgreSQL: UTF-8 database
CREATE DATABASE app ENCODING 'UTF8';

-- MySQL: FULL Unicode
CREATE DATABASE app CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

Terminals

  • Set terminal locale to UTF-8 (e.g., en_US.UTF-8).
  • In Docker: ENV LANG=C.UTF-8 LC_ALL=C.UTF-8.

Diagnosing encoding issues

  1. Inspect bytes (hex) and compare to known UTF‑8 sequences.
  2. Check HTTP headers, Content-Type and charset.
  3. Confirm DB column/table charset/collation.
  4. Verify editor/IDE and source file encodings.
  5. Normalize strings before equality tests or deduplication.
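Step 1 is a one-liner in Python; a sketch:

```python
text = "é"
print(text.encode("utf-8").hex(" "))  # c3 a9 — compare against the table above
# Mis-decoded text is often the Latin-1 reading of those same bytes:
print(bytes.fromhex("c3 a9").decode("latin-1"))  # Ã©
```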

Quick checklist

  • [ ] Use UTF‑8 everywhere (files, HTTP, DB, terminal).
  • [ ] Prefer utf8mb4 on MySQL; set appropriate collations.
  • [ ] Add <meta charset="utf-8"> in HTML; set response charset.
  • [ ] Normalize to NFC when comparing/storing keys.
  • [ ] Be careful counting user‑perceived characters (graphemes), not code units.
  • [ ] Avoid BOM in UTF‑8 unless you know you need it.

One‑minute adoption plan

  1. Audit your stack for encodings; switch everything to UTF‑8 / utf8mb4.
  2. Add charset headers to all responses and HTML.
  3. Normalize critical identifiers to NFC on write; document it.
  4. Add tests that cover emoji, accents, and RTL text.
  5. Train the team: ASCII ⊂ UTF‑8, bytes ≠ characters.