DNS for Engineers
TTLs, propagation, failover, split-horizon, and cutover playbooks
Goal: help you treat DNS like an operational control plane instead of a mysterious delay mechanism that only shows up when you are moving traffic and hoping nothing breaks.
TL;DR
- TTL controls how long caches may reuse an answer. Longer TTLs reduce query load and speed up repeat lookups, but make changes take longer to show up.
- DNS “propagation” is usually not a magical push. It is mostly cache expiry at recursive resolvers, browsers, and OS caches, plus your provider’s own internal replication.
- Negative caching is real. If a name did not exist, resolvers may cache that non-existence too.
- DNS failover only changes where new lookups go. It does not move existing TCP sessions, TLS sessions, WebSockets, or cached client connections.
- Split-horizon DNS means the same name can return different answers to different clients, usually internal versus external. It can be useful, but it raises the bar for correctness.
- Good cutovers are mostly boring because you lower TTLs in advance, keep old and new targets working during overlap, verify at the authoritative layer, and keep rollback simple.
1) The mental model
Application -> Browser cache -> OS cache -> Recursive resolver -> Authoritative DNS
Most clients do not ask your authoritative nameserver directly on every lookup.
Instead:
- the client asks a recursive resolver
- the resolver reuses cached answers until TTL expiry
- only then does it ask the authoritative source again
That means:
- your record change may be visible at the authoritative server immediately
- some users can still get the old answer until their recursive resolver cache ages out
If you want the narrower version of this story, see Demystifying DNS Propagation.
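The caching behavior described in this section can be sketched as a small simulation. This is a toy model, not a real resolver: the `RecursiveResolver` class, the zone dictionary, and the 192.0.2.x documentation addresses are all hypothetical, and the clock is just a number passed in by the caller.

```python
from dataclasses import dataclass


@dataclass
class CachedAnswer:
    value: str
    expires_at: float  # simulated clock time, in seconds


class RecursiveResolver:
    """Toy recursive resolver: reuses an answer until its TTL runs out."""

    def __init__(self, authoritative: dict):
        self.authoritative = authoritative  # name -> (value, ttl_seconds)
        self.cache = {}

    def resolve(self, name: str, now: float) -> str:
        cached = self.cache.get(name)
        if cached is not None and now < cached.expires_at:
            return cached.value  # cache hit: the authoritative side is not asked
        value, ttl = self.authoritative[name]
        self.cache[name] = CachedAnswer(value, now + ttl)
        return value


# Hypothetical zone data with a 300-second TTL.
zone = {"app.example.com": ("192.0.2.10", 300)}
resolver = RecursiveResolver(zone)

first = resolver.resolve("app.example.com", now=0)
zone["app.example.com"] = ("192.0.2.20", 300)  # authoritative change, visible at once
during_ttl = resolver.resolve("app.example.com", now=200)  # still the cached answer
after_ttl = resolver.resolve("app.example.com", now=300)   # cache expired, refetched

print(first, during_ttl, after_ttl)
```

The authoritative data is correct from the moment of the change, yet this resolver keeps serving the old address until its cached entry expires.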
2) TTLs: what they actually control
TTL is the cache lifetime of a DNS answer, in seconds.
Practical rule:
- high TTL = lower DNS query volume, slower change visibility
- low TTL = faster traffic shifts, more DNS traffic, more load on authoritative infrastructure
Cloudflare’s DNS docs summarize the tradeoff clearly: longer TTLs improve cache reuse, but they also make record updates take longer to reach users.
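A rough way to see the tradeoff is to put numbers on both sides. The sketch below assumes a single well-behaved cache that refetches exactly when the TTL expires; the function names are illustrative, and real resolvers may refetch more or less often.

```python
SECONDS_PER_DAY = 86_400


def max_refreshes_per_day(ttl_seconds: int) -> int:
    """Upper bound on how often one cache refetches a record per day."""
    return SECONDS_PER_DAY // ttl_seconds


def worst_case_staleness(ttl_seconds: int) -> int:
    """Longest a TTL-respecting cache may keep serving the old answer, in seconds."""
    return ttl_seconds


for ttl in (60, 300, 3600):
    print(f"TTL {ttl}s: up to {max_refreshes_per_day(ttl)} refetches/day per cache, "
          f"up to {worst_case_staleness(ttl)}s of stale answers after a change")
```

Multiply the refetch figure by the number of resolver caches that see your name to get a feel for the authoritative query load at each TTL.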
Typical engineering ranges
| Situation | Common TTL posture | Why |
|---|---|---|
| Stable public records | 300s to 3600s | Good cache efficiency without making changes painful |
| During a cutover window | 60s to 300s | Lets caches age out faster |
| Health-check-driven DNS failover | usually low, often around 60s | Reduces failover lag for new resolutions |
| Nameserver or delegation changes | plan for hours, not seconds | Parent-zone and resolver caching dominate |
What low TTL does not do
Low TTL does not guarantee:
- instant traffic movement
- immediate failover
- existing clients dropping old connections
- every recursive resolver behaving exactly the same way
It only gives caches permission to forget sooner.
Provider reality matters
Some providers clamp minimum TTLs or treat certain records specially. For example, Cloudflare documents fixed behavior for proxied records and different minimums for unproxied records depending on plan level.
3) Propagation: what people usually mean
When engineers say “DNS propagation,” they usually mean one or both of these:
- the provider has replicated the new answer across its authoritative fleet
- cached answers elsewhere have expired and been replaced
The second one is what usually hurts.
Three useful truths
- Authoritative DNS may already be correct while users still see old answers.
- Different resolvers can disagree temporarily because they cached at different times.
- Your laptop cache is not the whole picture. Flushing local DNS does not invalidate public recursive caches.
Negative caching matters
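If a name did not exist (NXDOMAIN) or existed without the requested record type (no data), resolvers may cache that negative answer too. RFC 2308 formalized this behavior: the negative answer is cached for the lesser of the zone's SOA record TTL and the SOA MINIMUM field. This is why “I just created the record and it still looks missing” can be entirely normal.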
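Under RFC 2308, the lifetime of a negative answer comes from the zone's SOA record: the lesser of the SOA record's own TTL and its MINIMUM field. A minimal sketch of that arithmetic follows; the function names and the example values are assumptions, and individual resolvers may apply their own caps on top.

```python
def negative_cache_ttl(soa_ttl: int, soa_minimum: int) -> int:
    """Negative-answer cache lifetime per RFC 2308: min(SOA TTL, SOA MINIMUM)."""
    return min(soa_ttl, soa_minimum)


def negative_answer_expires_at(cached_at: int, soa_ttl: int, soa_minimum: int) -> int:
    """Time (in seconds on the same clock as cached_at) when the cached NXDOMAIN ages out."""
    return cached_at + negative_cache_ttl(soa_ttl, soa_minimum)


# Hypothetical zone: SOA TTL of 3600s, SOA MINIMUM of 300s.
neg_ttl = negative_cache_ttl(soa_ttl=3600, soa_minimum=300)

# A resolver looked the name up at t=0, before the record existed.
expires = negative_answer_expires_at(cached_at=0, soa_ttl=3600, soa_minimum=300)
print(neg_ttl, expires)
```

In this example a record created at t=10 can still look missing from that resolver until t=300, even though the authoritative server already has it.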
Record edits vs delegation changes
Not all DNS changes are equal:
- A/AAAA/CNAME/TXT/MX record changes are usually straightforward cache-expiry problems
- NS changes, glue changes, and DNSSEC DS changes are more sensitive because they affect delegation and trust chains
If you are changing nameservers or DNSSEC state, assume more risk and more elapsed time than a simple A record swap.
4) What to verify before you blame propagation
Ask the authoritative source
dig NS example.com +short
dig A example.com @ns1.yourdns.tld +nocmd +noall +answer
If the authoritative answer is wrong, you do not have a propagation problem. You have a configuration problem.
Ask multiple public recursive resolvers
dig A example.com @1.1.1.1 +nocmd +noall +answer
dig A example.com @8.8.8.8 +nocmd +noall +answer
dig A example.com @9.9.9.9 +nocmd +noall +answer
Trace the delegation path
dig +trace example.com
Use that when you suspect:
- bad delegation
- stale parent-zone NS
- glue issues
- DNSSEC sequencing problems
Look at the TTL in answers
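On a recursive resolver, the TTL in the answer counts down and tells you how much longer that resolver may reuse its cached answer. On an authoritative server, it shows the configured TTL. This is often the fastest way to decide whether you should wait, roll back, or keep both targets alive longer.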
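The checks in this section can be combined into a small triage helper. The sketch below assumes you have captured the text printed by `dig ... +noall +answer` for the authoritative server and for one recursive resolver; `parse_answers`, `diagnose`, and the sample output are illustrative, not part of any tool.

```python
from typing import NamedTuple


class Answer(NamedTuple):
    name: str
    ttl: int
    rtype: str
    rdata: str


def parse_answers(dig_output: str) -> list:
    """Parse answer lines of the form: name TTL class type rdata."""
    answers = []
    for line in dig_output.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blanks and dig comment lines
        parts = line.split(None, 4)
        if len(parts) != 5:
            continue  # not an answer line
        name, ttl, _rclass, rtype, rdata = parts
        answers.append(Answer(name, int(ttl), rtype, rdata))
    return answers


def diagnose(authoritative: list, recursive: list, expected: set) -> str:
    """Decide whether this is a config problem or a cache that has not expired yet."""
    if {a.rdata for a in authoritative} != expected:
        return "configuration problem: authoritative answer is wrong"
    if {a.rdata for a in recursive} == expected:
        return "resolver already serves the new answer"
    remaining = max((a.ttl for a in recursive), default=0)
    return f"stale cache: old answer may be reused for up to {remaining}s"


# Hypothetical captured output (documentation addresses).
auth_text = "example.com.\t300\tIN\tA\t192.0.2.20\n"
rec_text = "example.com.\t143\tIN\tA\t192.0.2.10\n"

verdict = diagnose(parse_answers(auth_text), parse_answers(rec_text), {"192.0.2.20"})
print(verdict)
```

The ordering matters: the authoritative answer is checked first, so a wrong record is never misreported as a propagation delay.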
5) DNS failover: what it can and cannot do
DNS failover is useful, but it is not magic.
What it does
It changes what future DNS queries resolve to.
What it does not do
It does not:
- terminate existing connections
- move open sessions
- rewrite already-resolved client destinations
- save you if clients or resolvers have long cached answers
Practical failover rule
If you need traffic to move quickly for new lookups, use:
- low TTLs
- health-aware record selection
- endpoints that can overlap during transition
Route 53’s docs are a good concrete example here:
- active-passive failover uses primary and secondary records
- health checks determine which answer is returned to new queries
- AWS recommends low TTLs, often 60 seconds or less, when health checks are part of the routing decision
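A back-of-the-envelope budget helps set expectations for how fast new lookups can move. The sketch below is a simplification: the parameter names and example values are assumptions rather than any provider's documented defaults, and it ignores provider-internal propagation and clients that hold answers past the TTL.

```python
def worst_case_failover_lag(check_interval_s: int, failure_threshold: int, ttl_s: int) -> int:
    """Rough upper bound, in seconds, before new lookups reach the healthy target.

    detection time = consecutive failed checks needed * interval between checks
    cache time     = a resolver may have cached the old answer just before the switch
    """
    detection = check_interval_s * failure_threshold
    return detection + ttl_s


# Illustrative values: checks every 30s, 3 failures to mark unhealthy, 60s TTL.
lag = worst_case_failover_lag(check_interval_s=30, failure_threshold=3, ttl_s=60)
print(f"worst case for new lookups: about {lag}s")
```

Even with a 60-second TTL, the detection phase can dominate, which is why health-check tuning and TTL need to be considered together.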
Two failover patterns
Active-passive
One primary target serves traffic normally. A secondary target is used when the primary becomes unhealthy.
Use this when:
- standby capacity exists
- you want simple failure behavior
- the backup path is materially different from the primary
Active-active
Multiple healthy targets answer queries at the same time. Unhealthy ones are removed from future answers.
Use this when:
- you want all capacity live
- you can operate multiple healthy regions or clusters
- your app tolerates being served from multiple places at once
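The two patterns differ only in how the answer set is chosen from the health state. The sketch below is illustrative: target names are hypothetical, and the choice to "fail open" (keep answering with the normal targets when nothing is healthy) is an assumption of this example, so check how your provider behaves in that case.

```python
def active_passive(primary: str, secondary: str, healthy: set) -> list:
    """Answer with the primary while it is healthy, otherwise the secondary."""
    if primary in healthy:
        return [primary]
    if secondary in healthy:
        return [secondary]
    return [primary]  # assumption: fail open rather than return nothing


def active_active(targets: list, healthy: set) -> list:
    """Answer with every healthy target; unhealthy ones drop out of future answers."""
    live = [t for t in targets if t in healthy]
    return live or list(targets)  # assumption: fail open if all look unhealthy


# Hypothetical targets and health state.
print(active_passive("primary-lb", "standby-lb", healthy={"standby-lb"}))
print(active_active(["us-east", "eu-west", "ap-south"], healthy={"us-east", "ap-south"}))
```

Either way, the selection only affects answers to future queries; resolvers that cached the previous answer keep using it until the TTL expires.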
DNS failover warning
If your app cannot tolerate split traffic during TTL overlap, DNS alone is not enough. You may need a proxy, load balancer, traffic manager, or app-level drain strategy instead.
6) Split-horizon DNS: when the same name returns different answers
Split-horizon DNS, often called split-brain DNS, means the same zone or hostname returns different answers depending on who is asking.
Typical example:
- internal users ask for app.example.com and get a private IP
- external users ask for app.example.com and get a public IP
Why teams do this
- keep internal traffic on private networks
- expose different internal and external service topologies
- hide internal-only names or addresses
- simplify internal access to the same hostname users already know
Why it gets risky
Split-horizon makes debugging harder because:
- “works for me” and “broken for them” can both be true
- internal and external TLS, routing, and firewall rules can drift
- recursion and authoritative behavior can accidentally mix
- SSO and identity systems can fail in strange ways if the internal and external name paths do not line up cleanly
Practical split-horizon rule
Use split-horizon when the operational benefit is real, not just because it feels neat.
If you use it:
- keep the answer difference intentional and documented
- make it obvious which resolvers see which view
- test from both internal and external paths
- avoid mixing “same name, different meaning” with unclear auth, TLS, or routing assumptions
Concrete implementation examples
BIND documents split DNS using views, where the same zone name can exist in multiple views and return different data depending on client matching.
Microsoft’s split-brain DNS guidance shows the same idea using DNS policy and zone scopes: one zone view for internal users, another for external users.
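Both implementations come down to the same decision: match the client's source address against a list of networks, then answer from the matching view. The sketch below models that decision only and is not BIND or Windows DNS configuration; the networks, view names, and addresses are hypothetical.

```python
import ipaddress

# Hypothetical definition of "internal": clients coming from these networks.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
]

# Same zone name, different data per view.
VIEWS = {
    "internal": {"app.example.com": "10.20.0.15"},
    "external": {"app.example.com": "203.0.113.15"},
}


def select_view(client_ip: str) -> str:
    """Pick the view from the client's source address."""
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in INTERNAL_NETWORKS):
        return "internal"
    return "external"


def answer(name: str, client_ip: str) -> str:
    """Return the record for this name as seen from this client."""
    return VIEWS[select_view(client_ip)][name]


print(answer("app.example.com", "10.1.2.3"))      # internal client
print(answer("app.example.com", "198.51.100.7"))  # external client
```

Note that the address being matched is whatever the DNS server sees as the source, which is often a recursive resolver or NAT gateway rather than the end user's machine.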
7) Cutover playbooks
These are the practical playbooks that avoid most DNS drama.
7.1 App endpoint cutover
Use this for:
- A record change
- AAAA record change
- CNAME target swap
- moving an app between load balancers or CDNs
Playbook
- Check the current TTL and current answers at the authoritative layer.
- Lower TTL in advance, ideally one previous-TTL window before the change.
- Wait long enough for old high-TTL answers to age out.
- Make sure both old and new targets can serve correctly during overlap.
- Change the record.
- Verify authoritative answers first.
- Verify multiple public recursive resolvers.
- Watch app traffic and errors, not just DNS answers.
- Keep the old target alive until the overlap window has clearly passed.
- Raise TTL back to a sane steady-state value after the move is stable.
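The timing in this playbook can be worked out ahead of the change. The sketch below computes the minimum bounds only; the function name, the dictionary keys, and the example date are hypothetical, and in practice you would add margin for resolvers that hold answers longer and for long-lived client connections.

```python
from datetime import datetime, timedelta


def cutover_timeline(change_at: datetime, old_ttl_s: int, cutover_ttl_s: int) -> dict:
    """Minimum timing bounds for a record cutover.

    - The TTL must be lowered at least one old-TTL window before the change,
      so that answers cached with the old, higher TTL have aged out.
    - The old target must keep serving for at least one cutover-TTL window
      after the change, because caches may still hand out the old answer.
    """
    return {
        "lower_ttl_by": change_at - timedelta(seconds=old_ttl_s),
        "change_record_at": change_at,
        "old_target_needed_until": change_at + timedelta(seconds=cutover_ttl_s),
    }


# Hypothetical plan: change at 14:00, steady-state TTL 3600s, cutover TTL 60s.
plan = cutover_timeline(datetime(2026, 4, 15, 14, 0), old_ttl_s=3600, cutover_ttl_s=60)
for step, when in plan.items():
    print(f"{step}: {when:%Y-%m-%d %H:%M:%S}")
```

The same arithmetic applies to rollback: the new target has to stay up for a cutover-TTL window after the record is pointed back.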
Rollback
Rollback is the same process in reverse:
- point the record back
- keep the new target alive long enough to cover the TTL overlap after the rollback
If you tear down the old environment too early, rollback becomes impossible even if DNS itself is trivial.
7.2 Nameserver migration playbook
Use this when moving from one DNS provider to another.
This is a higher-risk operation than editing a single record.
Playbook
- Recreate the full zone on the new provider before touching delegation.
- Verify all critical records on the new provider:
  - A/AAAA
  - CNAME
  - MX
  - TXT
  - SPF/DKIM/DMARC-related records if relevant
  - DNSSEC settings if used
- Confirm the new provider is authoritative and correct for the whole zone.
- If DNSSEC is involved, sequence DS and signing changes carefully.
- Update registrar delegation to the new nameservers.
- Monitor parent delegation and public resolver answers.
- Keep the old zone intact until the delegation change has clearly settled.
Key warning
Do not treat NS migration like an A record swap. Parent-zone caches, glue, and DNSSEC mistakes can take a healthy app offline even when your app infrastructure itself is fine.
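Verifying the new zone before touching delegation is essentially a set comparison. The sketch below assumes you have already exported both zones as (name, type, rdata) tuples, for example from zone files or provider exports; the helper names and sample records are hypothetical. NS and SOA records are skipped because they legitimately differ between providers.

```python
def comparable(records: set, ignore_types: tuple = ("NS", "SOA")) -> set:
    """Normalize names and types so trivial formatting differences do not show up."""
    return {
        (name.lower().rstrip("."), rtype.upper(), rdata)
        for name, rtype, rdata in records
        if rtype.upper() not in ignore_types
    }


def zone_diff(old: set, new: set) -> dict:
    """Report records present on only one side."""
    old_c, new_c = comparable(old), comparable(new)
    return {
        "missing_on_new": sorted(old_c - new_c),
        "extra_on_new": sorted(new_c - old_c),
    }


# Hypothetical exports from the old and new providers.
old_zone = {
    ("example.com.", "A", "192.0.2.10"),
    ("example.com.", "MX", "10 mail.example.com."),
    ("example.com.", "TXT", "v=spf1 -all"),
    ("example.com.", "NS", "ns1.old-dns.example."),
}
new_zone = {
    ("example.com", "A", "192.0.2.10"),
    ("example.com", "TXT", "v=spf1 -all"),
    ("example.com", "NS", "ns1.new-dns.example."),
}

diff = zone_diff(old_zone, new_zone)
print(diff)
```

Here the MX record was never recreated, which would silently break mail after the delegation change. An empty `missing_on_new` list is a reasonable gate before updating the registrar.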
7.3 DNS failover playbook
Use this for active-passive DNS failover or planned failover drills.
Playbook
- Decide whether you need active-passive or active-active.
- Make sure failover targets are truly ready to serve traffic.
- Keep TTL low enough that new resolutions move in a useful timeframe.
- Validate health-check behavior before you need it.
- Test failover in a planned window, not only during an outage.
- Measure how long it takes for:
  - health detection
  - record selection change
  - resolver cache expiry
  - application traffic stabilization
- Document the observed lag so nobody expects “instant” DNS failover next time.
Key warning
A failover target that returns an IP but cannot actually serve the app is worse than no failover. DNS failover should be tied to meaningful health signals, not wishful thinking.
7.4 Split-horizon rollout playbook
Use this when introducing private and public answers for the same name.
Playbook
- Define exactly which clients should see the internal answer.
- Define exactly which resolvers enforce that policy.
- Verify internal and external TLS both work for the same hostname.
- Verify internal clients can still resolve public names they need.
- Test from:
  - an internal client
  - an external client
  - the recursive resolvers used in each path
- Document the resolver path so on-call engineers know which view they are looking at.
Key warning
If engineers do not know whether they are seeing the internal or external view, incident response becomes guesswork immediately.
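One way to keep the views honest is to record what each vantage point is supposed to see and compare that with what it actually got. The sketch below assumes you collect the observed answers yourself, for example by running `dig` from each location; the vantage names, expected addresses, and `check_views` helper are hypothetical.

```python
# Intended answer per view for app.example.com (hypothetical addresses).
EXPECTED = {"internal": "10.20.0.15", "external": "203.0.113.15"}


def check_views(observations: dict) -> list:
    """observations maps vantage point -> (intended view, answer observed there).

    Returns a list of human-readable mismatches; empty means every path
    saw the view it was supposed to see.
    """
    problems = []
    for vantage, (view, observed) in sorted(observations.items()):
        if observed != EXPECTED[view]:
            problems.append(
                f"{vantage}: expected {view} answer {EXPECTED[view]}, got {observed}"
            )
    return problems


observed = {
    "office-laptop": ("internal", "10.20.0.15"),
    "vpn-client": ("internal", "203.0.113.15"),  # leaking to the external view
    "public-resolver": ("external", "203.0.113.15"),
}

problems = check_views(observed)
print(problems)
```

A mismatch like the VPN client above usually points at which resolver that path uses, which is exactly the detail the playbook asks you to document.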
8) Common mistakes
| Mistake | Why it hurts | Better move |
|---|---|---|
| Lowering TTL right before the change | Existing caches already hold the old higher TTL | Lower TTL one old-TTL window in advance |
| Assuming low TTL means instant failover | Clients may keep existing connections and resolvers may still hold cached answers | Treat TTL as cache permission, not traffic teleportation |
| Changing NS before validating the full new zone | Delegation points at incomplete authority | Build and verify the new zone first |
| Using split-horizon without documentation | Debugging becomes ambiguous fast | Document resolver paths and intended answers |
| Tearing down the old target too early | Overlap traffic still exists | Keep both old and new environments alive during cache expiry |
| Blaming DNS for app or CDN caching | Wrong layer, wrong fix | Separate authoritative DNS checks from app and CDN verification |
9) FAQ
Does low TTL guarantee fast cutover?
No. It helps new lookups refresh sooner, but it does not move existing sessions or force every recursive resolver to forget immediately.
Why do different DNS checkers show different answers?
Because they are querying different recursive resolvers with different cache states. That is normal during a transition.
Is split-horizon the same as split-brain DNS?
In practice, yes. Different teams prefer different names, but the idea is the same: different answers for the same zone or hostname depending on the client path.
Why does a newly created record still look missing?
Negative caching. A resolver may have cached an earlier NXDOMAIN or no-data response.
Is DNS failover enough for zero downtime?
Not by itself. It helps steer new resolutions, but zero-downtime outcomes still depend on app readiness, overlap capacity, client behavior, and sensible TTLs.
10) Final recommendation
If you keep only one DNS operating model in your head, keep this one:
- use higher TTLs when things are stable
- lower them before planned change
- verify the authoritative answer first
- assume failover is only as fast as health detection plus cache expiry
- treat split-horizon as a power tool, not a default
That is still the cleanest DNS operating model for engineers in April 2026.