TL;DR
- Never trust filenames or MIME types from the client. Allowlist extensions and validate magic bytes.
- Stream uploads directly to object storage (S3/R2/GCS) with pre‑signed URLs; don’t buffer huge files in RAM.
- Put new files in quarantine, run antivirus + type checks + sanitizers, then mark safe (or reject).
- Serve downloads with `Content-Disposition: attachment` and `X-Content-Type-Options: nosniff`; never execute or inline untrusted content.
- SVG/HTML/JS are code → block or sanitize heavily. For images/PDFs, re‑encode (strip metadata, flatten).
- Keep objects private; deliver via signed URLs or through an authorized proxy. Log every access.
1) The threat model (what can go wrong)
- XSS via files: malicious SVG/HTML is executed when viewed; an “image” served as `image/svg+xml` can contain script.
- Content sniffing: browsers guess types; an uploaded `.txt` that’s actually HTML can run.
- Path traversal: `../../etc/passwd` or Windows `..\` in names.
- Polyglots/zip bombs: archives that expand explosively or parse as multiple types.
- Server strain: buffering big files, slow‑loris uploads, unbounded temp dirs.
- Malware: macro office docs, trojans in archives.
- PII leakage: EXIF GPS in photos, PDFs with embedded JS/attachments.
2) A sane architecture
    Client ──(Auth)──► Backend ──(issue pre‑signed URL)──► Object Storage (private)
    Client ──(PUT file to pre‑signed URL)──► Storage
    Storage event ► Queue/Worker: { object_key, size, content_type? }
    Worker:
      1) Fetch head/bytes → validate magic‑bytes/type/size
      2) Virus scan (ClamAV/Cloud AV)
      3) Sanitize (images/PDF), strip metadata
      4) Mark DB record status: safe | rejected | needs_review
    App serves:
      - Signed download URLs or proxy with safe headers
Why this works: your app never holds large bodies, scanning happens asynchronously, and files stay private until marked safe.
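The worker's four steps boil down to a small decision function. The sketch below is one possible shape, not a prescribed implementation: `Verdict`, `Check`, and `process_quarantined` are hypothetical names, and the checks and sanitizer are passed in as plain callables so the flow can be tested without a real scanner.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    status: str                    # "safe" | "rejected" | "needs_review"
    reason: Optional[str] = None


# A check returns None when the file passes, or a short rejection reason.
Check = Callable[[str], Optional[str]]


def process_quarantined(path: str, checks: list[Check],
                        sanitize: Callable[[str], None]) -> Verdict:
    """Run every check, then sanitize; any failure keeps the file out of 'safe'."""
    for check in checks:
        try:
            reason = check(path)
        except Exception as exc:
            # A crashing scanner is not proof of a clean file: hold it for review.
            return Verdict("needs_review", f"check_error:{type(exc).__name__}")
        if reason is not None:
            return Verdict("rejected", reason)
    try:
        sanitize(path)
    except Exception as exc:
        return Verdict("needs_review", f"sanitize_error:{type(exc).__name__}")
    return Verdict("safe")
```

The design choice worth copying is the failure direction: a file only becomes `safe` when every step ran and passed, so an outage in the scanner fills the review queue instead of silently releasing files.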
3) Accept only what you mean to accept (allowlist)
- Maintain an allowlist mapping: extension → expected MIME(s) → magic bytes (signatures) → max size.
- Reject everything else (no wildcards). Treat SVG as code; block or sanitize with a strict tool.
- Enforce per‑route limits (e.g., avatars ≤ 5 MB, PDFs ≤ 20 MB).
Magic‑byte hints
- JPEG: `FF D8 FF`
- PNG: `89 50 4E 47 0D 0A 1A 0A`
- PDF: `%PDF-` (`25 50 44 46 2D`)
- ZIP/OOXML: `50 4B 03 04` (also many other formats) → need deeper check
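The allowlist mapping and the signatures above fit in one table. This is a minimal standard-library sketch; `ALLOWLIST` and `verify_upload` are illustrative names, and the size caps are the example limits from this section. It only checks leading bytes, so container formats such as ZIP/OOXML would still need the deeper inspection mentioned above.

```python
from typing import Optional

# extension -> (expected MIME type, leading signature bytes, max size in bytes)
ALLOWLIST = {
    ".jpg": ("image/jpeg", b"\xff\xd8\xff", 5 * 1024 * 1024),
    ".png": ("image/png", b"\x89PNG\r\n\x1a\n", 5 * 1024 * 1024),
    ".pdf": ("application/pdf", b"%PDF-", 20 * 1024 * 1024),
}


def verify_upload(ext: str, head: bytes, size: int) -> Optional[str]:
    """Return the verified MIME type, or None if the upload must be rejected.

    `head` is the first few bytes of the stored object, not anything the
    client claimed about it.
    """
    entry = ALLOWLIST.get(ext.lower())
    if entry is None:                      # not on the allowlist (e.g. .svg)
        return None
    mime, signature, max_size = entry
    if size > max_size or not head.startswith(signature):
        return None
    return mime
```

For example, `verify_upload(".jpg", b"<html>", 10)` returns `None` because the bytes do not start with the JPEG signature, whatever the filename says.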
4) Backend: pre‑signed upload (S3 example)
Issue a pre‑signed URL (Node, AWS SDK v3)
    import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
    import { getSignedUrl } from "@aws-sdk/s3-request-presigner";
    import { randomUUID } from "node:crypto";

    const s3 = new S3Client({ region: "us-east-1" });

    // "jpg" must map to the registered type image/jpeg, not "image/jpg"
    const MIME = { jpg: "image/jpeg", png: "image/png", pdf: "application/pdf" } as const;

    export async function getUploadUrl(userId: string, ext: keyof typeof MIME) {
      const key = `uploads/${userId}/${randomUUID()}.${ext}`; // never use raw filename
      const cmd = new PutObjectCommand({
        Bucket: process.env.BUCKET,
        Key: key,
        ContentType: MIME[ext],
        // ACL omitted: objects are private by default, and buckets with ACLs
        // disabled may reject requests that set one
      });
      const url = await getSignedUrl(s3, cmd, { expiresIn: 15 * 60 }); // 15 min
      return { url, key };
    }
Bucket policy tips
- Block public ACLs, default encryption on.
- Lifecycle: auto‑expire quarantine after N days.
- Trigger a queue/worker (SQS/EventBridge) on `ObjectCreated:*` in the quarantine prefix.
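On the consuming side, the worker's first job is to turn the event into scan jobs. The sketch below assumes the native S3 event notification JSON, already unwrapped from any SQS or SNS envelope, and a hypothetical `quarantine/` prefix; `objects_to_scan` is an illustrative name. EventBridge delivers a different shape and would need its own parser.

```python
from urllib.parse import unquote_plus

QUARANTINE_PREFIX = "quarantine/"  # assumption: adjust to your bucket layout


def objects_to_scan(event: dict) -> list[dict]:
    """Extract scan jobs from an S3 event, ignoring anything outside quarantine."""
    jobs = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        obj = record["s3"]["object"]
        # Keys arrive URL-encoded in the notification ("+" for spaces)
        key = unquote_plus(obj["key"])
        if not key.startswith(QUARANTINE_PREFIX):
            continue
        jobs.append({
            "bucket": record["s3"]["bucket"]["name"],
            "object_key": key,
            "size": obj.get("size", 0),
        })
    return jobs
```

Filtering on the prefix again in code is deliberate: it keeps the worker from acting on objects it was never meant to see if the notification filter is ever misconfigured.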
5) Validate + scan + sanitize (worker)
Python (FastAPI worker sketch)
    import os
    import subprocess

    import filetype
    from PIL import Image

    ALLOWED = {
        ".jpg": {"mime": ["image/jpeg"], "max": 5*1024*1024},
        ".png": {"mime": ["image/png"], "max": 5*1024*1024},
        ".pdf": {"mime": ["application/pdf"], "max": 20*1024*1024},
    }

    def head_ok(path, ext):
        """Check size and sniffed type against the allowlist entry for ext."""
        if ext not in ALLOWED:
            return False, "bad_extension"
        size = os.path.getsize(path)
        if size > ALLOWED[ext]["max"]:
            return False, "too_large"
        kind = filetype.guess(path)  # sniffs magic bytes, ignores the file name
        if not kind or kind.mime not in ALLOWED[ext]["mime"]:
            return False, f"bad_type:{kind and kind.mime}"
        return True, None

    def clamav_ok(path):
        # clamscan exits 0 when clean, 1 when infected, 2 on error; only 0 passes
        return subprocess.call(["clamscan", "--no-summary", path]) == 0

    def sanitize_image_inplace(path, ext):
        # Re-encode in the file's own format so the bytes keep matching the extension
        with Image.open(path) as img:
            if ext == ".png":
                clean, fmt, opts = img.convert("RGBA"), "PNG", {}
            else:
                clean, fmt, opts = img.convert("RGB"), "JPEG", {"quality": 85}
        clean.info.clear()  # drop metadata carried over from the source image
        clean.save(path, format=fmt, **opts)

    # For PDFs: consider qpdf/ocrmypdf or safer: convert first page to image thumbnail
Key ideas
- Compute and store sha256 for dedupe and integrity.
- Re‑encode images (drop EXIF/GPS). For PDFs, avoid executing JS; consider flattening or serving via attachment only.
- Archives: either disallow or inspect contents (limit depth/total expanded size).
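If archives are allowed at all, the inspection can start from the ZIP central directory before anything is extracted. The sketch below uses only the standard library; `inspect_zip` and the three limits are illustrative assumptions, and nested archives are simply rejected (a depth limit of one). The sizes in the directory are declared by the uploader, so a real extractor should also count bytes as it streams each entry.

```python
import io
import zipfile
from typing import Optional

MAX_ENTRIES = 1000                    # assumption: tune per use case
MAX_TOTAL_BYTES = 200 * 1024 * 1024   # assumption: cap on declared expanded size
MAX_RATIO = 100                       # assumption: max expansion ratio per entry
NESTED_SUFFIXES = (".zip", ".jar", ".7z", ".rar", ".gz", ".tar")


def inspect_zip(data: bytes) -> Optional[str]:
    """Return a rejection reason, or None if the declared contents look sane."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            infos = zf.infolist()
    except zipfile.BadZipFile:
        return "not_a_zip"
    if len(infos) > MAX_ENTRIES:
        return "too_many_entries"
    total = 0
    for info in infos:
        parts = info.filename.replace("\\", "/").split("/")
        if info.filename.startswith("/") or ".." in parts:
            return "path_traversal"
        if info.filename.lower().endswith(NESTED_SUFFIXES):
            return "nested_archive"
        if info.compress_size and info.file_size / info.compress_size > MAX_RATIO:
            return "suspicious_ratio"
        total += info.file_size
    if total > MAX_TOTAL_BYTES:
        return "expands_too_large"
    return None
```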
6) Uploading directly (browser snippet)
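    async function upload(file) {
      // 1) Ask backend for an upload URL matching the file's claimed type
      const ext = { "image/jpeg": "jpg", "image/png": "png", "application/pdf": "pdf" }[file.type];
      if (!ext) throw new Error(`unsupported type: ${file.type}`);
      const { url, key } = await fetch(`/uploads/url?ext=${ext}`).then(r => r.json());
      // 2) PUT directly to storage; include the Content-Type your URL was minted for
      const put = await fetch(url, { method: "PUT", headers: { "Content-Type": file.type }, body: file });
      if (!put.ok) throw new Error(`upload failed: ${put.status}`);
      // 3) Tell backend "done": it will process from the storage event
      await fetch("/uploads/complete", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ key }),
      });
    }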
Large files: use multipart upload (S3) or tus.io; support resumption and client‑side chunking.
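Chunking for a multipart upload is mostly arithmetic. The sketch below plans byte ranges under S3's documented limits (parts of at least 5 MiB except the last, at most 10,000 parts); `plan_parts` and the 8 MiB default are illustrative choices, not part of any SDK.

```python
MIN_PART = 5 * 1024 * 1024   # S3 minimum part size (the last part may be smaller)
MAX_PARTS = 10_000           # S3 maximum number of parts per upload


def plan_parts(total_size: int,
               part_size: int = 8 * 1024 * 1024) -> list[tuple[int, int, int]]:
    """Return (part_number, start, end_exclusive) ranges covering the file."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    part_size = max(part_size, MIN_PART)
    # Grow the part size if the file would otherwise need more than MAX_PARTS
    part_size = max(part_size, -(-total_size // MAX_PARTS))  # ceiling division
    parts = []
    start, number = 0, 1                                      # parts are 1-based
    while start < total_size:
        end = min(start + part_size, total_size)
        parts.append((number, start, end))
        start, number = end, number + 1
    return parts
```

Because the plan is deterministic for a given size and part size, a client that records which part numbers completed can resume by recomputing the plan and skipping those parts.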
7) Serving files safely
- Require auth before issuing signed download URLs (short TTL).
- If proxying through your app, set headers:
    Content-Type: application/octet-stream # or the verified type
    Content-Disposition: attachment; filename="safe-name.ext"
    X-Content-Type-Options: nosniff
    Cache-Control: private, max-age=0, no-store # for sensitive
- Never trust the original filename. Derive a safe name (strip control chars, keep `[a-zA-Z0-9._-]`, clamp length).
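Both rules above, the safe display name and the download headers, can live in two small helpers. This is a sketch with hypothetical names (`safe_filename`, `download_headers`); replacing disallowed characters with `_` and stripping leading dots are choices made here, not requirements.

```python
import re


def safe_filename(original: str, fallback: str = "download", max_len: int = 100) -> str:
    """Derive a display name from an untrusted filename."""
    # Keep only the last path component, treating both separators alike
    name = original.replace("\\", "/").split("/")[-1]
    # Anything outside [a-zA-Z0-9._-] (control chars, quotes, spaces) becomes "_"
    name = re.sub(r"[^a-zA-Z0-9._-]", "_", name)
    # No leading dots: avoids hidden files and leftover ".." fragments
    name = name.lstrip(".")
    return name[:max_len] or fallback


def download_headers(original_name: str,
                     verified_type: str = "application/octet-stream") -> dict:
    """Headers for proxying an untrusted file as a download."""
    return {
        "Content-Type": verified_type,
        "Content-Disposition": f'attachment; filename="{safe_filename(original_name)}"',
        "X-Content-Type-Options": "nosniff",
        "Cache-Control": "private, max-age=0, no-store",
    }
```

Since quotes and line breaks never survive `safe_filename`, the name can be interpolated into the quoted `filename` parameter without further escaping.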
8) Limits, timeouts, and DoS controls
- NGINX/Ingress: `client_max_body_size 25m;` plus a reasonable `client_body_timeout`.
- Backends: stream to disk/storage; set per‑file and per‑user quotas; rate‑limit upload endpoints.
- Temp dirs: bound size and auto‑cleanup.
- Reject files with too many pixels (image bombs). Example: `width*height > 40 MP → reject`.
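The pixel cap can be enforced before any decoder touches the image, because formats declare their dimensions in the header. The sketch below covers PNG only, using the standard library to read the IHDR chunk; `png_dimensions` and `within_pixel_cap` are illustrative names, and JPEG or other formats would need their own header parsing or an imaging library.

```python
import struct
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MAX_PIXELS = 40_000_000  # the 40 MP example cap from the list above


def png_dimensions(head: bytes) -> Optional[tuple[int, int]]:
    """Read (width, height) from the IHDR chunk; needs the first 24 bytes."""
    if len(head) < 24 or not head.startswith(PNG_SIGNATURE) or head[12:16] != b"IHDR":
        return None
    # Layout: 8-byte signature, 4-byte chunk length, "IHDR", then width and height
    width, height = struct.unpack(">II", head[16:24])
    return width, height


def within_pixel_cap(head: bytes) -> bool:
    """True only for a parseable PNG header whose pixel count is within the cap."""
    dims = png_dimensions(head)
    if dims is None:
        return False
    width, height = dims
    return 0 < width * height <= MAX_PIXELS
```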
9) Special file types
- SVG: treat as active content. Either block, or sanitize with a strict allowlist (remove scripts, external refs) and serve as attachment.
- Office docs: block macros by policy; scan with AV; serve as attachment.
- Audio/video: transcode server‑side to known codecs/containers (ffmpeg) to neutralize weird metadata.
- Text/CSV: beware CSV injection (`=HYPERLINK(...)`). Serve as attachment; consider prefixing user‑generated cells with `'`.
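For the CSV case, the prefixing can happen at export time. This is a minimal sketch, assuming the commonly cited set of leading characters that spreadsheets may treat as formula starts; `neutralize_cell` and `write_safe_csv` are hypothetical helpers.

```python
import csv
import io

# Leading characters that spreadsheet apps may interpret as the start of a formula
FORMULA_TRIGGERS = ("=", "+", "-", "@", "\t", "\r")


def neutralize_cell(value: str) -> str:
    """Prefix risky cells with an apostrophe so they are shown as plain text."""
    if value.startswith(FORMULA_TRIGGERS):
        return "'" + value
    return value


def write_safe_csv(rows) -> str:
    """Serialize rows of user-generated values with every cell neutralized."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([neutralize_cell(str(cell)) for cell in row])
    return buf.getvalue()
```

One trade-off to be aware of: legitimate values such as negative numbers (`-5`) are prefixed too, so apply this to user-generated text columns and leave trusted numeric columns alone.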
10) Observability & auditing
- Log: `user_id`, `object_key`, sha256, verified `content_type`, size, scanner results, sanitize actions, download events.
- Metrics: counts of files by status, scan durations, rejection reasons, bytes stored per user/tenant.
- Alerts on scanner failures, quarantine backlog, and spikes in rejected files.
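A structured log line makes those fields queryable. The sketch below hashes the object in chunks, so large files are never held in memory, and emits one JSON record; the field names, the `upload_processed` event label, and the function names are assumptions for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def sha256_of(stream, chunk_size: int = 1024 * 1024) -> str:
    """Hash a binary stream incrementally and return the hex digest."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def audit_record(user_id: str, object_key: str, sha256: str, content_type: str,
                 size: int, status: str, reason: Optional[str] = None) -> str:
    """Build one JSON log line for a processed upload."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": "upload_processed",      # hypothetical event name
        "user_id": user_id,
        "object_key": object_key,
        "sha256": sha256,
        "content_type": content_type,     # the verified type, not the client's claim
        "size": size,
        "status": status,                 # safe | rejected | needs_review
        "reason": reason,
    }, sort_keys=True)
```

The same digest doubles as the dedupe key: two uploads with an identical sha256 can share one stored object and one scan result.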
11) Pitfalls & fast fixes
| Pitfall | Why it hurts | Fix |
|---|---|---|
| Trusting Content-Type | Spoofed types run as code | Check magic bytes; maintain allowlist |
| Using raw filenames/paths | Path traversal, weird chars | UUID keys; sanitize name only for display |
| Public buckets | Data exposure | Keep private; use signed URLs |
| Inline rendering of untrusted files | XSS | attachment + nosniff |
| No size/pixel limits | DoS, image bombs | Cap bytes and pixels |
| Not stripping EXIF | PII leaks | Re‑encode images |
| Letting archives explode | Disk exhaustion | Disallow or limit entries/depth/expanded size |
| One‑box AV only | Blind spots | Use fresh signatures; consider cloud scanning fallback |
Quick checklist
- [ ] Allowlist extensions/types; magic‑byte check; cap size/pixels.
- [ ] Pre‑signed uploads to private storage; stream, don’t buffer.
- [ ] Quarantine → scan → sanitize → mark safe workflow.
- [ ] Re‑encode images; treat SVG/HTML as code (block/sanitize).
- [ ] Signed downloads or proxy with attachment + nosniff.
- [ ] Log sha256/type/size; monitor scanner health and rejection reasons.
One‑minute adoption plan
- Switch to pre‑signed uploads into a quarantine prefix (private).
- Add a worker that validates magic bytes, runs AV, and re‑encodes images; mark records `safe|rejected`.
- Serve files only via short‑lived signed URLs or proxy with attachment headers.
- Enforce allowlist + size/pixel caps and rate limits.
- Add metrics/logs for audits and alert on failures/backlogs.