DNS for Engineers

April 21st, 2026 · 10 min read · Updated April 21st, 2026 · #dev #dns #networking #ops #devops #ca-duh

How TTLs really affect changes, why propagation feels slow, where DNS failover helps and where it doesn’t, plus cutover playbooks that keep rollback easy.

TTLs, propagation, failover, split-horizon, and cutover playbooks

Goal: help you treat DNS like an operational control plane instead of a mysterious delay mechanism that only shows up when you are moving traffic and hoping nothing breaks.

TL;DR

TTL controls how long caches may reuse an answer. Longer TTLs reduce query load and speed up repeat lookups, but make changes take longer to show up.
DNS “propagation” is usually not a magical push. It is mostly cache expiry at recursive resolvers, browsers, and OS caches, plus your provider’s own internal replication.
Negative caching is real. If a name did not exist, resolvers may cache that non-existence too.
DNS failover only changes where new lookups go. It does not move existing TCP sessions, TLS sessions, WebSockets, or cached client connections.
Split-horizon DNS means the same name can return different answers to different clients, usually internal versus external. It can be useful, but it raises the bar for correctness.
Good cutovers are mostly boring because you lower TTLs in advance, keep old and new targets working during overlap, verify at the authoritative layer, and keep rollback simple.

1) The mental model

Application or browser cache -> OS cache -> Recursive resolver -> Authoritative DNS

Most clients do not ask your authoritative nameserver directly on every lookup.

Instead:

the client asks a recursive resolver
the resolver reuses cached answers until TTL expiry
only then does it ask the authoritative source again

That means:

your record change may be visible at the authoritative server immediately
some users can still get the old answer until their recursive resolver cache ages out
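To make the caching behavior concrete, here is a minimal Python sketch of a recursive resolver's cache. It is a toy model under stated assumptions: the `RecursiveResolver` class, the in-memory `zone` dictionary standing in for the authoritative server, and the simulated clock are all hypothetical, and real resolvers do far more.

```python
from dataclasses import dataclass


@dataclass
class CachedAnswer:
    value: str
    expires_at: float  # simulated clock time, in seconds


class RecursiveResolver:
    """Toy model of a recursive resolver's cache (not a real DNS implementation)."""

    def __init__(self, authoritative):
        # `authoritative` maps name -> (answer, ttl_seconds). It stands in for
        # the authoritative nameserver and can be edited at any time.
        self.authoritative = authoritative
        self.cache = {}

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached is not None and now < cached.expires_at:
            return cached.value  # still within TTL: authoritative is not asked
        answer, ttl = self.authoritative[name]
        self.cache[name] = CachedAnswer(answer, now + ttl)
        return answer


zone = {"app.example.com": ("192.0.2.10", 300)}
resolver = RecursiveResolver(zone)

first = resolver.resolve("app.example.com", now=0)
zone["app.example.com"] = ("192.0.2.20", 300)  # record is changed after the first lookup
during_ttl = resolver.resolve("app.example.com", now=200)  # cached old answer
after_ttl = resolver.resolve("app.example.com", now=301)   # cache expired, new answer
```

The authoritative data is correct the moment `zone` is edited, yet the resolver keeps serving the old address until its own cache entry expires.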

If you want the narrower version of this story, see Demystifying DNS Propagation.

2) TTLs: what they actually control

TTL is the cache lifetime of a DNS answer, in seconds.

Practical rule:

high TTL = lower DNS query volume, slower change visibility
low TTL = faster traffic shifts, more DNS traffic, more load on authoritative infrastructure

Cloudflare’s DNS docs summarize the tradeoff clearly: longer TTLs improve cache reuse, but they also make record updates take longer to reach users.
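A rough back-of-the-envelope sketch of the query-load side of that tradeoff. It assumes a hypothetical population of 1,000 resolver caches that each re-query the instant their TTL expires, which overstates real traffic; the function name and numbers are illustrative only.

```python
def max_daily_queries(resolver_caches: int, ttl_seconds: int) -> int:
    """Upper bound on authoritative queries per day for one record.

    Assumes every cache re-queries the moment its TTL expires and the name is
    always in demand; real traffic is usually lower.
    """
    seconds_per_day = 86_400
    return resolver_caches * (seconds_per_day // ttl_seconds)


for ttl in (60, 300, 3600):
    print(ttl, max_daily_queries(resolver_caches=1_000, ttl_seconds=ttl))
# 60 1440000
# 300 288000
# 3600 24000
```

Dropping from a 3600s TTL to 60s multiplies the worst-case query volume by 60 in this model, which is why very low TTLs are usually reserved for cutover windows and failover records.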

Typical engineering ranges

| Situation | Common TTL posture | Why |
|---|---|---|
| Stable public records | 300s to 3600s | Good cache efficiency without making changes painful |
| During a cutover window | 60s to 300s | Lets caches age out faster |
| Health-check-driven DNS failover | usually low, often around 60s | Reduces failover lag for new resolutions |
| Nameserver or delegation changes | plan for hours, not seconds | Parent-zone and resolver caching dominate |

What low TTL does not do

Low TTL does not guarantee:

instant traffic movement
immediate failover
existing clients dropping old connections
every recursive resolver behaving exactly the same way

It only gives caches permission to forget sooner.

Provider reality matters

Some providers clamp minimum TTLs or treat certain records specially. For example, Cloudflare documents fixed behavior for proxied records and different minimums for unproxied records depending on plan level.

3) Propagation: what people usually mean

When engineers say “DNS propagation,” they usually mean one or both of these:

the provider has replicated the new answer across its authoritative fleet
cached answers elsewhere have expired and been replaced

The second one is what usually hurts.

Three useful truths

Authoritative DNS may already be correct while users still see old answers.
Different resolvers can disagree temporarily because they cached at different times.
Your laptop cache is not the whole picture. Flushing local DNS does not invalidate public recursive caches.

Negative caching matters

If a record did not exist and returned NXDOMAIN or no data, resolvers may cache that negative answer too. RFC 2308 formalized this behavior. This is why “I just created the record and it still looks missing” can be entirely normal.
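Under RFC 2308, the lifetime of a negative answer is the lower of the zone's SOA record TTL and the SOA MINIMUM field. A small sketch of that arithmetic follows; the function names and sample values are hypothetical, and individual resolvers may apply their own caps on top.

```python
def negative_cache_ttl(soa_ttl: int, soa_minimum: int) -> int:
    """RFC 2308: negative answers are cached for min(SOA TTL, SOA MINIMUM)."""
    return min(soa_ttl, soa_minimum)


def seconds_until_record_visible(negative_ttl: int, seconds_since_miss: int) -> int:
    """How long a resolver that cached the miss may keep reporting it."""
    return max(0, negative_ttl - seconds_since_miss)


# Illustrative zone: SOA TTL of 3600s, SOA MINIMUM of 900s.
ttl = negative_cache_ttl(soa_ttl=3600, soa_minimum=900)
remaining = seconds_until_record_visible(ttl, seconds_since_miss=300)
print(ttl, remaining)  # 900 600
```

If someone queried the name five minutes before you created it, that resolver may keep answering "does not exist" for another ten minutes in this example.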

Record edits vs delegation changes

Not all DNS changes are equal:

A/AAAA/CNAME/TXT/MX record changes are usually straightforward cache-expiry problems
NS changes, glue changes, and DNSSEC DS changes are more sensitive because they affect delegation and trust chains

If you are changing nameservers or DNSSEC state, assume more risk and more elapsed time than a simple A record swap.

4) What to verify before you blame propagation

Ask the authoritative source

dig NS example.com +short
dig A example.com @ns1.yourdns.tld +nocmd +noall +answer

If the authoritative answer is wrong, you do not have a propagation problem. You have a configuration problem.

Ask multiple public recursive resolvers

dig A example.com @1.1.1.1 +nocmd +noall +answer
dig A example.com @8.8.8.8 +nocmd +noall +answer
dig A example.com @9.9.9.9 +nocmd +noall +answer
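Once you have answers from several resolvers, it helps to group them so disagreement is obvious at a glance. A minimal sketch, assuming you have already collected the addresses yourself; the `observed` data below is made up for illustration.

```python
def group_resolvers_by_answer(answers):
    """Group resolver IPs by the set of addresses each one returned."""
    groups = {}
    for resolver, addresses in answers.items():
        groups.setdefault(frozenset(addresses), []).append(resolver)
    return groups


# Illustrative observations during a cutover (not real query results).
observed = {
    "1.1.1.1": ["192.0.2.20"],
    "8.8.8.8": ["192.0.2.10"],
    "9.9.9.9": ["192.0.2.20"],
}

groups = group_resolvers_by_answer(observed)
for addresses, resolvers in groups.items():
    print(sorted(addresses), "seen by", resolvers)
```

More than one group during a transition usually just means the resolvers cached at different times.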

Trace the delegation path

dig +trace example.com

Use that when you suspect:

bad delegation
stale parent-zone NS
glue issues
DNSSEC sequencing problems

Look at the TTL in answers

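The TTL in a recursive resolver's answer counts down, so it tells you how much longer that resolver may reuse its cached answer; the authoritative server always returns the full configured TTL. This is often the fastest way to decide whether you should wait, roll back, or keep both targets alive longer.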
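If you want to script that check, the answer lines printed by `dig +noall +answer` are whitespace-separated: name, TTL, class, type, record data. A small parsing sketch, where the `sample` line is an illustrative string rather than captured output:

```python
def parse_answer_line(line: str) -> dict:
    """Parse one answer line: name, TTL, class, type, then record data."""
    name, ttl, rclass, rtype, rdata = line.split(None, 4)
    return {
        "name": name,
        "ttl": int(ttl),  # seconds this answer may still be reused
        "class": rclass,
        "type": rtype,
        "rdata": rdata,
    }


sample = "example.com.\t\t142\tIN\tA\t192.0.2.10"
record = parse_answer_line(sample)
print(record["ttl"], record["rdata"])  # 142 192.0.2.10
```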

5) DNS failover: what it can and cannot do

DNS failover is useful, but it is not magic.

What it does

It changes what future DNS queries resolve to.

What it does not do

It does not:

terminate existing connections
move open sessions
rewrite already-resolved client destinations
save you if clients or resolvers are holding long-lived cached answers

Practical failover rule

If you need traffic to move quickly for new lookups, use:

low TTLs
health-aware record selection
endpoints that can overlap during transition

Route 53’s docs are a good concrete example here:

active-passive failover uses primary and secondary records
health checks determine which answer is returned to new queries
AWS recommends low TTLs, often 60 seconds or less, when health checks are part of the routing decision
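A rough way to reason about failover lag for new lookups is detection time plus cache expiry. The sketch below is a simplified model: the parameter names are hypothetical, the sample values (30-second checks, three consecutive failures, 60-second TTL) are assumptions, and it ignores provider-internal replication and clients that hold connections open.

```python
def worst_case_failover_seconds(check_interval: int, failure_threshold: int, ttl: int) -> int:
    """Estimate: time to declare the primary unhealthy, plus one full TTL."""
    detection = check_interval * failure_threshold
    return detection + ttl


print(worst_case_failover_seconds(check_interval=30, failure_threshold=3, ttl=60))   # 150
print(worst_case_failover_seconds(check_interval=30, failure_threshold=3, ttl=3600)) # 3690
```

With these assumed values, the TTL quickly dominates: the same health checks behind a one-hour TTL leave some new lookups pointing at the failed target for over an hour.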

Two failover patterns

Active-passive

One primary target serves traffic normally. A secondary target is used when the primary becomes unhealthy.

Use this when:

standby capacity exists
you want simple failure behavior
the backup path is materially different from the primary

Active-active

Multiple healthy targets answer queries at the same time. Unhealthy ones are removed from future answers.

Use this when:

you want all capacity live
you can operate multiple healthy regions or clusters
your app tolerates being served from multiple places at once
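The two patterns differ only in how the answer set is chosen from health state. A conceptual sketch, not any provider's actual algorithm; the function names are hypothetical, and returning every target when none is healthy is one possible fail-open choice assumed here.

```python
def active_passive(primary: str, secondary: str, healthy: set) -> list:
    """Serve the primary while it is healthy, otherwise the secondary."""
    return [primary] if primary in healthy else [secondary]


def active_active(targets: list, healthy: set) -> list:
    """Serve all healthy targets; fail open if none are healthy (assumption)."""
    live = [t for t in targets if t in healthy]
    return live or list(targets)


targets = ["192.0.2.10", "192.0.2.20", "192.0.2.30"]

print(active_passive("192.0.2.10", "192.0.2.20", healthy={"192.0.2.20"}))
# ['192.0.2.20']
print(active_active(targets, healthy={"192.0.2.10", "192.0.2.30"}))
# ['192.0.2.10', '192.0.2.30']
```

Either way, the selection only affects answers to future queries; caches keep whatever they were given until the TTL runs out.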

DNS failover warning

If your app cannot tolerate split traffic during TTL overlap, DNS alone is not enough. You may need a proxy, load balancer, traffic manager, or app-level drain strategy instead.

6) Split-horizon DNS: when the same name returns different answers

Split-horizon DNS, often called split-brain DNS, means the same zone or hostname returns different answers depending on who is asking.

Typical example:

internal users ask for app.example.com and get a private IP
external users ask for app.example.com and get a public IP

Why teams do this

keep internal traffic on private networks
expose different internal and external service topologies
hide internal-only names or addresses
simplify internal access to the same hostname users already know

Why it gets risky

Split-horizon makes debugging harder because:

“works for me” and “broken for them” can both be true
internal and external TLS, routing, and firewall rules can drift
recursive and authoritative roles can accidentally mix, so a client ends up with an answer from the wrong view
SSO and identity systems can fail in strange ways if the internal and external name paths do not line up cleanly

Practical split-horizon rule

Use split-horizon when the operational benefit is real, not just because it feels neat.

If you use it:

keep the answer difference intentional and documented
make it obvious which resolvers see which view
test from both internal and external paths
avoid mixing “same name, different meaning” with unclear auth, TLS, or routing assumptions

Concrete implementation examples

BIND documents split DNS using views, where the same zone name can exist in multiple views and return different data depending on client matching.

Microsoft’s split-brain DNS guidance shows the same idea using DNS policy and zone scopes: one zone view for internal users, another for external users.
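Conceptually, both approaches match the client's source address against a policy and then answer from the matching copy of the zone. The Python sketch below illustrates only that idea; it is not how BIND or Windows DNS is implemented, and the networks, view names, and addresses are assumptions chosen for the example.

```python
import ipaddress

# Assumed definition of "internal": the RFC 1918 private ranges.
INTERNAL_NETWORKS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

# Two copies of the same zone, one per view (illustrative data).
VIEWS = {
    "internal": {"app.example.com": "10.20.0.15"},
    "external": {"app.example.com": "203.0.113.10"},
}


def select_view(client_ip: str) -> str:
    """Pick a view from the client's source address."""
    addr = ipaddress.ip_address(client_ip)
    return "internal" if any(addr in net for net in INTERNAL_NETWORKS) else "external"


def resolve(name: str, client_ip: str) -> str:
    return VIEWS[select_view(client_ip)][name]


print(resolve("app.example.com", "10.1.2.3"))      # 10.20.0.15
print(resolve("app.example.com", "198.51.100.7"))  # 203.0.113.10
```

Note that the "client" the server sees is usually a recursive resolver, not the end user's machine, which is why the resolver path matters so much in practice.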

7) Cutover playbooks

These are the practical playbooks that avoid most DNS drama.

7.1 App endpoint cutover

Use this for:

A record change
AAAA record change
CNAME target swap
moving an app between load balancers or CDNs

Playbook

1. Check the current TTL and current answers at the authoritative layer.
2. Lower the TTL in advance, at least one full previous-TTL window before the change.
3. Wait long enough for old high-TTL answers to age out.
4. Make sure both old and new targets can serve correctly during overlap.
5. Change the record.
6. Verify authoritative answers first.
7. Verify multiple public recursive resolvers.
8. Watch app traffic and errors, not just DNS answers.
9. Keep the old target alive until the overlap window has clearly passed.
10. Raise the TTL back to a sane steady-state value after the move is stable.
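The timing in this playbook can be sketched as simple date arithmetic. The function name, the five-minute safety margin, and the sample times below are assumptions for illustration; the retirement time is only the earliest point allowed by cache expiry, and real traffic should still confirm it.

```python
from datetime import datetime, timedelta


def cutover_schedule(change_at, old_ttl, cutover_ttl, margin=timedelta(minutes=5)):
    """Derive key timestamps for a record cutover from the two TTLs (seconds)."""
    return {
        # Old high-TTL answers need one full old-TTL window to age out.
        "lower_ttl_by": change_at - timedelta(seconds=old_ttl) - margin,
        "change_record_at": change_at,
        # After the change, caches may hold the old answer for one cutover TTL.
        "old_target_needed_until": change_at + timedelta(seconds=cutover_ttl) + margin,
    }


plan = cutover_schedule(datetime(2026, 4, 21, 14, 0), old_ttl=3600, cutover_ttl=60)
for step, when in plan.items():
    print(step, when.isoformat())
# lower_ttl_by 2026-04-21T12:55:00
# change_record_at 2026-04-21T14:00:00
# old_target_needed_until 2026-04-21T14:06:00
```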

Rollback

Rollback is the same process in reverse:

point the record back to the old target
keep the new target alive long enough to cover TTL overlap after the rollback

If you tear down the old environment too early, rollback becomes impossible even if DNS itself is trivial.

7.2 Nameserver migration playbook

Use this when moving from one DNS provider to another.

This is a higher-risk operation than editing a single record.

Playbook

1. Recreate the full zone on the new provider before touching delegation.
2. Verify all critical records on the new provider:
   - A/AAAA
   - CNAME
   - MX
   - TXT
   - SPF/DKIM/DMARC-related records if relevant
   - DNSSEC settings if used
3. Confirm the new provider is authoritative and correct for the whole zone.
4. If DNSSEC is involved, sequence DS and signing changes carefully.
5. Update registrar delegation to the new nameservers.
6. Monitor parent delegation and public resolver answers.
7. Keep the old zone intact until the delegation change has clearly settled.

Key warning

Do not treat NS migration like an A record swap. Parent-zone caches, glue, and DNSSEC mistakes can take a healthy app offline even when your app infrastructure itself is fine.
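The record verification step in this playbook is easy to get wrong by eye. A minimal sketch of a zone comparison, assuming you have already exported both zones into dictionaries keyed by `(name, type)`; the export step, the function name, and the sample data are all hypothetical.

```python
def diff_zones(old: dict, new: dict) -> dict:
    """Compare two zones given as {(name, type): set_of_record_data}."""
    missing = {k: v for k, v in old.items() if k not in new}
    extra = {k: v for k, v in new.items() if k not in old}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"missing": missing, "changed": changed, "extra": extra}


old_zone = {
    ("example.com.", "A"): {"192.0.2.10"},
    ("example.com.", "MX"): {"10 mail.example.com."},
    ("www.example.com.", "CNAME"): {"example.com."},
}
new_zone = {
    ("example.com.", "A"): {"192.0.2.10"},
    ("www.example.com.", "CNAME"): {"app.example.com."},
}

report = diff_zones(old_zone, new_zone)
print(sorted(report["missing"]))  # [('example.com.', 'MX')]
print(sorted(report["changed"]))  # [('www.example.com.', 'CNAME')]
```

NS and SOA records are expected to differ between providers, so a real comparison would likely exclude or special-case them.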

7.3 DNS failover playbook

Use this for active-passive DNS failover or planned failover drills.

Playbook

1. Decide whether you need active-passive or active-active.
2. Make sure failover targets are truly ready to serve traffic.
3. Keep TTL low enough that new resolutions move in a useful timeframe.
4. Validate health-check behavior before you need it.
5. Test failover in a planned window, not only during an outage.
6. Measure how long it takes for:
   - health detection
   - record selection change
   - resolver cache expiry
   - application traffic stabilization
7. Document the observed lag so nobody expects “instant” DNS failover next time.

Key warning

A failover target that returns an IP but cannot actually serve the app is worse than no failover. DNS failover should be tied to meaningful health signals, not wishful thinking.
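For the measurement step of the drill, it is enough to note when each phase completes and turn those timestamps into per-phase durations. A small sketch; the function name is hypothetical and the `drill` numbers are invented placeholders, not measurements.

```python
def phase_durations(events):
    """events: ordered (label, seconds since the failure was injected)."""
    durations = {}
    previous = 0.0
    for label, completed_at in events:
        durations[label] = completed_at - previous
        previous = completed_at
    return durations


# Placeholder timings for illustration only.
drill = [
    ("health detection", 90.0),
    ("record selection change", 95.0),
    ("resolver cache expiry", 155.0),
    ("application traffic stabilization", 200.0),
]

observed_lag = phase_durations(drill)
for label, seconds in observed_lag.items():
    print(f"{label}: {seconds:.0f}s")
print(f"total: {sum(observed_lag.values()):.0f}s")
```

Recording the breakdown, not just the total, shows whether the next improvement should come from faster health checks or a lower TTL.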

7.4 Split-horizon rollout playbook

Use this when introducing private and public answers for the same name.

Playbook

1. Define exactly which clients should see the internal answer.
2. Define exactly which resolvers enforce that policy.
3. Verify internal and external TLS both work for the same hostname.
4. Verify internal clients can still resolve public names they need.
5. Test from:
   - an internal client
   - an external client
   - the recursive resolvers used in each path
6. Document the resolver path so on-call engineers know which view they are looking at.

Key warning

If engineers do not know whether they are seeing the internal or external view, incident response becomes guesswork immediately.
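One cheap aid for on-call engineers is a helper that labels an answer by the kind of addresses it contains. This sketch assumes the internal view always returns RFC 1918 addresses and the external view never does, which may not hold in your environment; the function name and labels are hypothetical.

```python
import ipaddress

# Assumption: internal answers are always in the RFC 1918 private ranges.
PRIVATE_NETWORKS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def classify_answer(addresses) -> str:
    """Label a set of returned addresses as internal, external, or mixed."""
    if not addresses:
        return "no answer"
    flags = {
        any(ipaddress.ip_address(a) in net for net in PRIVATE_NETWORKS)
        for a in addresses
    }
    if flags == {True}:
        return "internal"
    if flags == {False}:
        return "external"
    return "mixed"  # both kinds at once usually signals a misconfigured view


print(classify_answer(["10.20.0.15"]))                  # internal
print(classify_answer(["203.0.113.10"]))                # external
print(classify_answer(["10.20.0.15", "203.0.113.10"]))  # mixed
```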

8) Common mistakes

| Mistake | Why it hurts | Better move |
|---|---|---|
| Lowering TTL right before the change | Existing caches already hold the old higher TTL | Lower TTL one old-TTL window in advance |
| Assuming low TTL means instant failover | Clients may keep existing connections and resolvers may still hold cached answers | Treat TTL as cache permission, not traffic teleportation |
| Changing NS before validating the full new zone | Delegation points at incomplete authority | Build and verify the new zone first |
| Using split-horizon without documentation | Debugging becomes ambiguous fast | Document resolver paths and intended answers |
| Tearing down the old target too early | Overlap traffic still exists | Keep both old and new environments alive during cache expiry |
| Blaming DNS for app or CDN caching | Wrong layer, wrong fix | Separate authoritative DNS checks from app and CDN verification |

9) FAQ

Does low TTL guarantee fast cutover?

No. It helps new lookups refresh sooner, but it does not move existing sessions or force every recursive resolver to forget immediately.

Why do different DNS checkers show different answers?

Because they are querying different recursive resolvers with different cache states. That is normal during a transition.

Is split-horizon the same as split-brain DNS?

In practice, yes. Different teams prefer different names, but the idea is the same: different answers for the same zone or hostname depending on the client path.

Why does a newly created record still look missing?

Negative caching. A resolver may have cached an earlier NXDOMAIN or no-data response.

Is DNS failover enough for zero downtime?

Not by itself. It helps steer new resolutions, but zero-downtime outcomes still depend on app readiness, overlap capacity, client behavior, and sensible TTLs.

10) Final recommendation

If you only keep a few DNS rules in your head, keep these:

use higher TTLs when things are stable
lower them before planned change
verify the authoritative answer first
assume failover is only as fast as health detection plus cache expiry
treat split-horizon as a power tool, not a default

That is still the cleanest DNS operating model for engineers in April 2026.