Safe File Uploads — accept untrusted files without getting owned

A battle‑tested checklist and patterns: allowlists, size limits, magic‑byte checks, streaming to object storage, virus scanning, image/PDF sanitization, signed URLs, and safe download headers.

TL;DR

  • Never trust filenames or MIME types from the client. Allowlist extensions and validate magic bytes.
  • Stream uploads directly to object storage (S3/R2/GCS) with pre‑signed URLs; don’t buffer huge files in RAM.
  • Put new files in quarantine, run antivirus + type checks + sanitizers, then mark safe (or reject).
  • Serve downloads with Content-Disposition: attachment and X-Content-Type-Options: nosniff; never execute or inline untrusted content.
  • SVG/HTML/JS are code → block or sanitize heavily. For images/PDFs, re‑encode (strip metadata, flatten).
  • Keep objects private; deliver via signed URLs or through an authorized proxy. Log every access.

1) The threat model (what can go wrong)

  • XSS via files: malicious SVG/HTML executes when viewed inline; an “image” served as image/svg+xml can carry script.
  • Content sniffing: browsers guess types; an uploaded .txt that’s actually HTML can run.
  • Path traversal: ../../etc/passwd or Windows ..\ in names.
  • Polyglots/zip bombs: archives that expand explosively or parse as multiple types.
  • Server strain: buffering big files, slow‑loris uploads, unbounded temp dirs.
  • Malware: macro office docs, trojans in archives.
  • PII leakage: EXIF GPS in photos, PDFs with embedded JS/attachments.

2) A sane architecture

Client ──(Auth)──► Backend ──(issue pre‑signed URL)──► Object Storage (private)
Client ──(PUT file to pre‑signed URL)──► Storage
Storage event ► Queue/Worker: { object_key, size, content_type? }
Worker:
  1) Fetch head/bytes → validate magic‑bytes/type/size
  2) Virus scan (ClamAV/Cloud AV)
  3) Sanitize (images/PDF), strip metadata
  4) Mark DB record status: safe | rejected | needs_review
App serves:
  - Signed download URLs or proxy with safe headers

Why this works: your app never holds large bodies, scanning happens asynchronously, and files stay private until marked safe.
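
The worker steps above amount to a small decision function over the upload record. The following is a minimal sketch, not a definitive implementation: the `safe | rejected | needs_review` statuses come from the diagram, while the `quarantined` starting state, the function names, and the "scanner failure means review" rule are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class Status(str, Enum):
    QUARANTINED = "quarantined"    # assumed initial state; not named in the diagram
    SAFE = "safe"
    REJECTED = "rejected"
    NEEDS_REVIEW = "needs_review"


def decide_status(type_ok: bool, av_clean: Optional[bool], sanitized: bool) -> Status:
    """Map worker results to a record status.

    av_clean is None when the scanner itself failed (timeout, crash); that is
    treated as "retry or ask a human", never as clean.
    """
    if not type_ok or av_clean is False:
        return Status.REJECTED
    if av_clean is None or not sanitized:
        return Status.NEEDS_REVIEW
    return Status.SAFE


def may_serve(status: Status) -> bool:
    # Only files explicitly marked safe are ever handed to users.
    return status is Status.SAFE
```

Keeping the decision in one pure function makes it easy to unit-test and keeps "fail closed" as the default path.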


3) Accept only what you mean to accept (allowlist)

  • Maintain an allowlist mapping: extension → expected MIME(s) → magic bytes (signatures) → max size.
  • Reject everything else (no wildcards). Treat SVG as code; block or sanitize with a strict tool.
  • Enforce per‑route limits (e.g., avatars ≤ 5 MB, PDFs ≤ 20 MB).

Magic‑byte hints

  • JPEG: FF D8 FF
  • PNG: 89 50 4E 47 0D 0A 1A 0A
  • PDF: %PDF- (25 50 44 46 2D)
  • ZIP/OOXML: 50 4B 03 04 (shared by DOCX/XLSX, JAR, APK, and many other formats) → the signature alone is not enough; inspect the contents.
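
The allowlist mapping and the signatures above can be combined into a single lookup. This is a sketch under stated assumptions: the table layout and the `check_upload` name are illustrative, the size caps reuse the per-route examples from this section, and a prefix match is only a first gate (container formats still need deeper inspection).

```python
from pathlib import PurePosixPath

# extension -> (expected MIME, magic-byte prefix, max size in bytes)
ALLOWLIST = {
    ".jpg": ("image/jpeg", bytes.fromhex("FFD8FF"), 5 * 1024 * 1024),
    ".png": ("image/png", bytes.fromhex("89504E470D0A1A0A"), 5 * 1024 * 1024),
    ".pdf": ("application/pdf", b"%PDF-", 20 * 1024 * 1024),
}


def check_upload(name: str, head: bytes, size: int):
    """Return (True, verified_mime) or (False, reason).

    name is the storage key or filename, head is the first bytes of the object,
    and size is the object's total size in bytes.
    """
    ext = PurePosixPath(name).suffix.lower()
    rule = ALLOWLIST.get(ext)
    if rule is None:
        return False, "extension_not_allowed"   # no wildcards: unknown means reject
    mime, magic, max_size = rule
    if size > max_size:
        return False, "too_large"
    if not head.startswith(magic):
        return False, "magic_mismatch"
    return True, mime
```

The MIME returned here is the one derived from the allowlist, so later steps never need to consult the client-supplied type.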

4) Backend: pre‑signed upload (S3 example)

Issue a pre‑signed URL (Node, AWS SDK v3)

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";
import { randomUUID } from "node:crypto";

const s3 = new S3Client({ region: "us-east-1" });
export async function getUploadUrl(userId: string, ext: "jpg"|"png"|"pdf") {
  const key = `uploads/${userId}/${randomUUID()}.${ext}`; // never use raw filename
  const cmd = new PutObjectCommand({
    Bucket: process.env.BUCKET,
    Key: key,
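    // "image/jpg" is not a registered type; map jpg → image/jpeg explicitly
    ContentType: ext === "pdf" ? "application/pdf" : ext === "jpg" ? "image/jpeg" : "image/png",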
    ACL: "private",
  });
  const url = await getSignedUrl(s3, cmd, { expiresIn: 15 * 60 }); // 15 min
  return { url, key };
}

Bucket policy tips

  • Block public ACLs, default encryption on.
  • Lifecycle: auto‑expire quarantine after N days.
  • Trigger a queue/worker (SQS/EventBridge) on ObjectCreated:* in the quarantine prefix.
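
On the worker side, the trigger arrives as a JSON notification. The sketch below assumes the standard S3 event notification shape delivered straight to a queue (EventBridge wraps the same information in a different envelope) and a hypothetical `quarantine/` prefix; `objects_to_scan` is an illustrative name, not part of any SDK.

```python
import json
from urllib.parse import unquote_plus

QUARANTINE_PREFIX = "quarantine/"   # hypothetical prefix name


def objects_to_scan(message_body: str):
    """Extract (bucket, key, size) jobs from an S3 event notification body."""
    event = json.loads(message_body)
    jobs = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue                                  # ignore deletes, restores, etc.
        s3 = record["s3"]
        key = unquote_plus(s3["object"]["key"])       # keys arrive URL-encoded
        if not key.startswith(QUARANTINE_PREFIX):
            continue                                  # only scan the quarantine prefix
        jobs.append({
            "bucket": s3["bucket"]["name"],
            "key": key,
            "size": s3["object"].get("size", 0),
        })
    return jobs
```

Filtering on the prefix in code as well as in the notification config is a cheap second check against misrouted events.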

5) Validate + scan + sanitize (worker)

Python (worker sketch)

import filetype, hashlib, subprocess, os
from PIL import Image

ALLOWED = {
  ".jpg": {"mime": ["image/jpeg"], "max": 5*1024*1024},
  ".png": {"mime": ["image/png"],  "max": 5*1024*1024},
  ".pdf": {"mime": ["application/pdf"], "max": 20*1024*1024},
}

def head_ok(path, ext):
  size = os.path.getsize(path)
  if size > ALLOWED[ext]["max"]: return False, "too_large"
  kind = filetype.guess(path)
  if not kind or kind.mime not in ALLOWED[ext]["mime"]:
    return False, f"bad_type:{kind and kind.mime}"
  return True, None

def clamav_ok(path):
  # clamscan exits 0 on clean, 1 on infected, 2 on scanner error;
  # anything other than 0 is treated as not OK (fail closed)
  return subprocess.call(["clamscan", "--no-summary", path]) == 0

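def sanitize_image_inplace(path, ext):
  # Re-encode in the file's own format so the stored extension still matches the bytes
  with Image.open(path) as img:
    if ext == ".jpg":
      clean, fmt, opts = img.convert("RGB"), "JPEG", {"quality": 85}
    else:
      clean, fmt, opts = img.convert("RGBA"), "PNG", {}   # keep transparency
  clean.info.clear()                    # drop EXIF/ICC/text data carried over from the source
  clean.save(path, format=fmt, **opts)  # re-encode without the original metadata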

# For PDFs: consider qpdf/ocrmypdf or safer: convert first page to image thumbnail

Key ideas

  • Compute and store sha256 for dedupe and integrity.
  • Re‑encode images (drop EXIF/GPS). For PDFs, avoid executing JS; consider flattening or serving via attachment only.
  • Archives: either disallow or inspect contents (limit depth/total expanded size).
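
If archives are allowed at all, the inspection mentioned above can start from the ZIP central directory before anything is extracted. This is a sketch using only the standard library; the limits and the `inspect_zip` name are illustrative assumptions, and because declared sizes can lie, a real extractor should also count bytes while streaming each entry.

```python
import zipfile

MAX_ENTRIES = 1000                       # illustrative limits; tune per product
MAX_TOTAL_BYTES = 200 * 1024 * 1024
MAX_RATIO = 100                          # declared size / compressed size


def inspect_zip(path_or_file):
    """Return (True, None) if the archive looks sane, else (False, reason)."""
    try:
        with zipfile.ZipFile(path_or_file) as zf:
            infos = zf.infolist()
            if len(infos) > MAX_ENTRIES:
                return False, "too_many_entries"
            total = 0
            for info in infos:
                name = info.filename.replace("\\", "/")
                if name.startswith("/") or ".." in name.split("/"):
                    return False, "path_traversal"
                if name.lower().endswith(".zip"):
                    return False, "nested_archive"   # depth limit of one level
                total += info.file_size
                if total > MAX_TOTAL_BYTES:
                    return False, "expanded_too_large"
                if info.compress_size and info.file_size / info.compress_size > MAX_RATIO:
                    return False, "suspicious_ratio"
    except zipfile.BadZipFile:
        return False, "not_a_zip"
    return True, None
```

Rejecting nested archives outright is the simplest way to enforce a depth limit; recursing with a counter is the alternative if nested archives are a real requirement.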

6) Uploading directly (browser snippet)

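async function upload(file) {
  // 1) Map the browser-reported type to an allowed extension (the worker re-verifies it later)
  const ext = { "image/jpeg": "jpg", "image/png": "png", "application/pdf": "pdf" }[file.type];
  if (!ext) throw new Error(`Unsupported file type: ${file.type}`);
  // 2) Ask backend for an upload URL minted for that extension
  const { url, key } = await fetch(`/uploads/url?ext=${ext}`).then(r => r.json());
  // 3) PUT directly to storage; the Content-Type must match the one the URL was signed for
  const put = await fetch(url, { method: "PUT", headers: { "Content-Type": file.type }, body: file });
  if (!put.ok) throw new Error(`Upload failed: ${put.status}`);
  // 4) Tell backend "done"; processing itself is driven by the storage event
  await fetch("/uploads/complete", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ key }),
  });
}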

Large files: use multipart upload (S3) or tus.io; support resumption and client‑side chunking.
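
For chunked uploads, the part boundaries have to respect the storage provider's limits. The sketch below assumes S3's documented multipart constraints (at most 10,000 parts, each at least 5 MiB except the last); `plan_parts` and the 8 MiB default are illustrative choices, and the returned ranges are what a client or server would map to individual part uploads.

```python
import math

MIN_PART = 5 * 1024 * 1024     # S3 minimum part size (the last part may be smaller)
MAX_PARTS = 10_000             # S3 maximum number of parts per upload


def plan_parts(total_size: int, preferred_part: int = 8 * 1024 * 1024):
    """Split [0, total_size) into half-open byte ranges, one per part."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    # Grow the part size when the file is too big for MAX_PARTS parts.
    part = max(preferred_part, MIN_PART, math.ceil(total_size / MAX_PARTS))
    ranges = []
    offset = 0
    while offset < total_size:
        end = min(offset + part, total_size)
        ranges.append((offset, end))
        offset = end
    return ranges
```

Recording which ranges have completed is what makes resumption possible: after a dropped connection the client re-sends only the missing parts.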


7) Serving files safely

  • Require auth before issuing signed download URLs (short TTL).
  • If proxying through your app, set headers:
Content-Type: application/octet-stream           # or the verified type
Content-Disposition: attachment; filename="safe-name.ext"
X-Content-Type-Options: nosniff
Cache-Control: private, max-age=0, no-store      # for sensitive
  • Never trust the original filename. Derive a safe name (strip control chars, keep [a-zA-Z0-9._-], clamp length).
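
The filename rule and the header set above can be captured in two small helpers. This is a minimal sketch: `safe_display_name` and `download_headers` are hypothetical names, the 100-character clamp is an arbitrary choice, and truncation here may cut off a very long name's extension.

```python
import re


def safe_display_name(original: str, fallback: str = "download", max_len: int = 100) -> str:
    """Derive a display-only filename; storage keys should still be UUIDs."""
    name = original.replace("\\", "/").rsplit("/", 1)[-1]   # drop any path components
    name = re.sub(r"[^a-zA-Z0-9._-]", "_", name)            # keep only [a-zA-Z0-9._-]
    name = name.lstrip(".")                                 # no hidden or dot-only names
    name = name[:max_len]                                   # clamp length
    return name or fallback


def download_headers(original_name: str, verified_type: str = "application/octet-stream"):
    """Headers for proxying an untrusted file as a download."""
    return {
        "Content-Type": verified_type,
        "Content-Disposition": f'attachment; filename="{safe_display_name(original_name)}"',
        "X-Content-Type-Options": "nosniff",
        "Cache-Control": "private, max-age=0, no-store",
    }
```

Because the sanitized name contains no quotes, control characters, or non-ASCII bytes, it can be placed in the quoted `filename` parameter without further escaping.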

8) Limits, timeouts, and DoS controls

  • NGINX/Ingress: set client_max_body_size 25m; and a reasonable client_body_timeout.
  • Backends: stream to disk/storage; set per‑file and per‑user quotas; rate‑limit upload endpoints.
  • Temp dirs: bound size and auto‑cleanup.
  • Reject files with too many pixels (image bombs). Example: width*height > 40 MP → reject.
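
The pixel cap can be enforced before any decoding by reading the dimensions from the file header. The sketch below handles PNG only, whose IHDR chunk sits at a fixed offset after the signature; the function names are illustrative, and the 40 MP limit is the example figure from the list above. For other formats, an image library that reads headers lazily (Pillow exposes `Image.MAX_IMAGE_PIXELS` as a related guard) is the more practical route.

```python
import struct

PNG_SIGNATURE = bytes.fromhex("89504E470D0A1A0A")
MAX_PIXELS = 40_000_000   # 40 MP, the example cap from the text


def png_dimensions(head: bytes):
    """Return (width, height) from a PNG's IHDR chunk, or None if not a PNG header."""
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR", then width and height
    # as big-endian unsigned 32-bit integers.
    if len(head) < 24 or not head.startswith(PNG_SIGNATURE) or head[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", head[16:24])
    return width, height


def pixels_ok(head: bytes) -> bool:
    """True only when the header parses and the declared area is within the cap."""
    dims = png_dimensions(head)
    if dims is None:
        return False
    width, height = dims
    return 0 < width * height <= MAX_PIXELS
```

Only the first 24 bytes are needed, so this check can run on a ranged read of the object before the full download.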

9) Special file types

  • SVG: treat as active content. Either block, or sanitize with a strict allowlist (remove scripts, external refs) and serve as attachment.
  • Office docs: block macros by policy; scan with AV; serve as attachment.
  • Audio/video: transcode server‑side to known codecs/containers (ffmpeg) to neutralize weird metadata.
  • Text/CSV: beware CSV injection (=HYPERLINK(...)). Serve as attachment; consider prefixing user‑generated cells that start with =, +, -, or @ with a single quote (').
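
For CSV exports built from user-generated data, the quote-prefix mitigation can be applied at write time. This is a sketch with hypothetical helper names; the trigger characters follow commonly cited guidance (formula starters plus tab and carriage return), and the trade-off is that legitimate values such as negative numbers get prefixed too.

```python
import csv
import io

# Characters that spreadsheet apps may interpret as the start of a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def neutralize_cell(value: str) -> str:
    """Prefix risky cells with a single quote so they are treated as text."""
    if value.startswith(FORMULA_PREFIXES):
        return "'" + value
    return value


def write_safe_csv(rows) -> str:
    """Serialize rows to CSV text with every cell neutralized."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([neutralize_cell(str(cell)) for cell in row])
    return buf.getvalue()
```

Applying this only at export time keeps the stored data unchanged, so other consumers of the same records are not affected.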

10) Observability & auditing

  • Log: user_id, object_key, sha256, verified content_type, size, scanner results, sanitize actions, download events.
  • Metrics: counts of files by status, scan durations, rejection reasons, bytes stored per user/tenant.
  • Alerts on scanner failures, quarantine backlog, and spikes in rejected files.
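
The log fields listed above fit naturally into one structured record per processed file. A minimal sketch, assuming JSON-lines logging; `sha256_stream` and `audit_record` are illustrative names, and the field names simply mirror the list.

```python
import hashlib
import json


def sha256_stream(fileobj, chunk_size: int = 1024 * 1024) -> str:
    """Hash a binary file object in chunks so large files never sit in RAM."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def audit_record(user_id, object_key, sha256, content_type, size, scan_result, actions) -> str:
    """Build one JSON log line for a processed upload."""
    return json.dumps({
        "user_id": user_id,
        "object_key": object_key,
        "sha256": sha256,
        "content_type": content_type,   # the verified type, not the client's claim
        "size": size,
        "scan_result": scan_result,
        "sanitize_actions": actions,
    }, sort_keys=True)
```

The same digest doubles as a dedupe key: two uploads with an identical sha256 and size can share one scan result.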

11) Pitfalls & fast fixes

| Pitfall | Why it hurts | Fix |
|---|---|---|
| Trusting Content-Type | Spoofed types run as code | Check magic bytes; maintain allowlist |
| Using raw filenames/paths | Path traversal, weird chars | UUID keys; sanitize name only for display |
| Public buckets | Data exposure | Keep private; use signed URLs |
| Inline rendering of untrusted files | XSS | attachment + nosniff |
| No size/pixel limits | DoS, image bombs | Cap bytes and pixels |
| Not stripping EXIF | PII leaks | Re‑encode images |
| Letting archives explode | Disk exhaustion | Disallow or limit entries/depth/expanded size |
| One‑box AV only | Blind spots | Use fresh signatures; consider cloud scanning fallback |


Quick checklist

  • [ ] Allowlist extensions/types; magic‑byte check; cap size/pixels.
  • [ ] Pre‑signed uploads to private storage; stream, don’t buffer.
  • [ ] Quarantine → scan → sanitize → mark safe workflow.
  • [ ] Re‑encode images; treat SVG/HTML as code (block/sanitize).
  • [ ] Signed downloads or proxy with attachment + nosniff.
  • [ ] Log sha256/type/size; monitor scanner health and rejection reasons.

One‑minute adoption plan

  1. Switch to pre‑signed uploads into a quarantine prefix (private).
  2. Add a worker that validates magic bytes, runs AV, and re‑encodes images; mark records safe|rejected.
  3. Serve files only via short‑lived signed URLs or proxy with attachment headers.
  4. Enforce allowlist + size/pixel caps and rate limits.
  5. Add metrics/logs for audits and alert on failures/backlogs.