Why your DNS failover failed

TTL is a suggestion. Most of the systems between your DNS answer and a working browser request don't respect it.

The story is the same in every postmortem: origin goes down, health check catches it within a minute, DNS provider flips the A record from the broken target to the healthy one. And the on-call still spends the next hour explaining why support tickets keep coming in.

What people don't realize about DNS failover is that the TTL on your record is the floor on how long failover takes, not the ceiling. Once you stack up every cache layer between the authoritative answer and the customer's browser, "1-minute TTL" routinely produces 30-minute outages.

The five layers that ignore your TTL

1. The recursive resolver's minimum TTL

Many public recursive resolvers enforce a minimum TTL. Cloudflare's 1.1.1.1 has historically clamped some workloads to a 30s minimum. Google's 8.8.8.8 honors very low TTLs but caches negative answers (NXDOMAIN, empty result) for longer. Corporate resolvers vary wildly: Cisco Umbrella has a documented 30-second floor on negative caches, dnsmasq has its own minimums, and a poorly configured ISP resolver might just round everything up to 5 minutes.
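As a rough model of that clamping, a resolver's cache lifetime for a record is the published TTL pushed up to whatever floor the resolver enforces, and down to any cap it applies. The sketch below is illustrative only: `effective_ttl` and its parameters are hypothetical names, and the 300-second floor stands in for the ISP resolver that rounds everything up to 5 minutes.

```python
from typing import Optional


def effective_ttl(record_ttl_s: int,
                  resolver_min_s: int = 0,
                  resolver_max_s: Optional[int] = None) -> int:
    """How long a resolver may actually cache an answer, in seconds.

    record_ttl_s   -- the TTL you published on the record
    resolver_min_s -- the resolver's floor (hypothetical value, varies by operator)
    resolver_max_s -- the resolver's cap, if it has one
    """
    ttl = max(record_ttl_s, resolver_min_s)  # the floor wins over a low TTL
    if resolver_max_s is not None:
        ttl = min(ttl, resolver_max_s)       # the cap wins over a high TTL
    return ttl


# A 60s record behind a resolver with a 5-minute floor is cached for 5 minutes.
print(effective_ttl(60, resolver_min_s=300))  # 300
# The same record behind a resolver with a 30s floor keeps its 60s TTL.
print(effective_ttl(60, resolver_min_s=30))   # 60
```

The point of the model is that lowering your TTL below a resolver's floor buys nothing for the users behind that resolver.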

2. Operating-system DNS caches

macOS's mDNSResponder caches DNS for up to 5 minutes regardless of TTL. Windows DNS Client honors TTL on Windows 10+ but has its own quirks under low-memory pressure. Linux without nscd doesn't cache; with nscd it caches for the configured TTL, but the daemon doesn't always pick up changes immediately.

3. The browser's internal DNS cache

Chrome caches DNS in-process for ~60 seconds after the first lookup. Firefox is similar. Both caches are independent of the OS and the resolver. On a busy single-page app, the browser does one DNS lookup at page load and reuses that answer for every XHR for a full minute regardless of any external TTL change.

You can flush Chrome's DNS cache at chrome://net-internals/#dns. You cannot ask 50 million users to do that.
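To see how the first three layers stack, here is a back-of-the-envelope sketch. It assumes, pessimistically, that each layer fetched its answer just before the layer above it expired and then holds that answer for its own full lifetime, so the hold times add up. The layer values are illustrative picks from the figures above, not measurements, and `worst_case_staleness` is a hypothetical helper.

```python
def worst_case_staleness(hold_times_s: dict) -> int:
    """Pessimistic upper bound: every cache layer holds a stale answer
    for its full lifetime, one after another."""
    return sum(hold_times_s.values())


# Illustrative hold times in seconds (assumptions, taken from the text above).
layers = {
    "recursive resolver honoring a 60s TTL": 60,
    "macOS mDNSResponder, up to 5 minutes": 300,
    "browser in-process cache, about 60s": 60,
}

total = worst_case_staleness(layers)
print(f"{total}s, or {total / 60:.0f} minutes")  # 420s, or 7 minutes
```

Even with a well-behaved resolver, this unlucky client is several minutes behind a record that advertises 60 seconds.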

4. Long-lived connections

DNS resolution only happens at connection establishment. Once an HTTP/2 keep-alive connection, a websocket, or a gRPC stream is open, it stays open against the original IP for as long as both sides keep ACKing. Your TTL is irrelevant; the connection is bypassing DNS entirely. For websocket-heavy apps, a DNS flip without server-side connection draining means existing users stay on the dead box until they reconnect (sometimes hours).
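One common way to bound this is a maximum connection age: whichever side enforces it closes any connection older than a limit, which forces a reconnect and therefore a fresh trip through DNS. The sketch below only shows the bookkeeping. The function names, `max_age_s`, and the jitter fraction are assumptions for illustration; the jitter is there so connections opened at the same moment don't all reconnect at once.

```python
import random
from typing import Optional


def connection_deadline(opened_at: float,
                        max_age_s: float,
                        jitter_fraction: float = 0.1,
                        rng: Optional[random.Random] = None) -> float:
    """Time (same clock as opened_at) at which the connection should be
    closed and re-dialed, with +/- jitter_fraction of max_age_s as jitter."""
    rng = rng or random.Random()
    jitter = rng.uniform(-jitter_fraction, jitter_fraction) * max_age_s
    return opened_at + max_age_s + jitter


def should_recycle(now: float, deadline: float) -> bool:
    """True once the connection has outlived its deadline."""
    return now >= deadline


# A connection opened at t=0 with a 5-minute max age is recycled
# somewhere between 270s and 330s.
deadline = connection_deadline(opened_at=0.0, max_age_s=300.0)
print(should_recycle(now=100.0, deadline=deadline))  # False
print(should_recycle(now=331.0, deadline=deadline))  # True
```

With a policy like this, the longest a client can stay pinned to an old IP is roughly the max age plus whatever the DNS caches add, instead of "until the user closes the tab".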

5. Negative caching of "the new IP isn't responding yet"

Suppose your health check declares the primary down at second 30, your DNS engine writes a new A record at second 31, and the cutover-aware resolvers pick up the new answer at second 60. During seconds 31–60, some unlucky client gets the new IP and hits a TCP RST because the backup isn't fully warm. That client's browser may now cache "this hostname is broken" via its connection-error backoff. Meanwhile, clients whose local DNS cache hasn't expired yet keep trying the old (now-stale) IP.

What this means in real timelines

Let's say you have a 60-second DNS TTL and your health check fires within 10 seconds of an outage. Realistic recovery time for a 50-million-user workload:

~0–10s: outage starts, health check detects, DNS provider publishes new answer.
~10–70s: resolvers refresh. About 60% of traffic gets the new answer.
~70s–5 min: OS / resolver / browser caches drain. Another 30% of traffic moves.
5–30 min: stubborn resolvers with minimum-TTL clamping, long-lived HTTP/2 connections, and corporate caches finally let go. Remaining 10% recovers, slowly.
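To turn that timeline into a single number, the sketch below treats the four bullets as breakpoints on a recovery curve and interpolates linearly between them. The linear interpolation is an assumption made here for illustration; real recovery curves are lumpier. The function names are hypothetical.

```python
# (seconds since outage start, cumulative fraction of traffic recovered)
# Breakpoints come from the timeline above; straight lines between them are assumed.
BREAKPOINTS = [(0, 0.0), (10, 0.0), (70, 0.6), (300, 0.9), (1800, 1.0)]


def recovered_fraction(t: float) -> float:
    """Fraction of traffic on the healthy target t seconds into the outage."""
    if t <= BREAKPOINTS[0][0]:
        return 0.0
    for (t0, f0), (t1, f1) in zip(BREAKPOINTS, BREAKPOINTS[1:]):
        if t <= t1:
            return f0 + (f1 - f0) * (t - t0) / (t1 - t0)
    return 1.0


def mean_outage_seconds() -> float:
    """Average outage experienced per unit of traffic: the area under
    the 'still broken' curve, summed segment by segment (trapezoids)."""
    total = 0.0
    for (t0, f0), (t1, f1) in zip(BREAKPOINTS, BREAKPOINTS[1:]):
        total += (t1 - t0) * (1 - (f0 + f1) / 2)
    return total


print(round(recovered_fraction(40), 2))  # 0.3
print(round(mean_outage_seconds(), 1))   # 184.5
```

Under these assumptions the average user sees about three minutes of outage from a 60-second TTL, and the slowest 10% account for a large share of that total.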

DNS failover gets you most of the way back in single-digit minutes. It does not give you a sub-second RTO. Treating it as if it does, then publishing a "99.99% SLA" anyway, is how outages turn into refund-eligible incidents.
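For a sense of scale, here is the downtime a given availability target actually allows. The 30-day measurement window is an assumption made here; your SLA may define the period differently, and `downtime_budget_minutes` is a hypothetical helper.

```python
def downtime_budget_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period before the SLA is breached."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - sla_percent / 100)


print(round(downtime_budget_minutes(99.99), 2))  # 4.32
print(round(downtime_budget_minutes(99.9), 2))   # 43.2
```

A single failover that takes "single-digit minutes" can spend an entire month of 99.99% budget on its own.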

If you need real fast failover

Two approaches actually work for sub-30-second recovery:

Anycast. Run the same service IP at multiple sites and let BGP route to the closest healthy one. When a site goes down, BGP convergence is the failover, and it happens in seconds (typically 5–15 in the wild). The client never sees DNS change because the IP didn't change. This is how most large CDNs, the major public DNS resolvers (including aigw's nameservers), and most cloud load balancers handle it.

An L4 or L7 load balancer in front. The LB has a single IP / hostname that DNS points at. Behind it, the LB does its own health checking on a much tighter loop (1–5 seconds) and steers connections away from broken backends. Failover is server-side; DNS never changes. ALB, GLB, HAProxy, Caddy, you name it.
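The health-check half of that is simple enough to sketch. The code below is a minimal illustration, not how any of the named load balancers implement it: it does a plain TCP connect probe with a short timeout and keeps only the backends that answer. The function names and the backend addresses are hypothetical, and a real LB would also require several consecutive failures before ejecting a backend.

```python
import socket
from typing import Callable, List, Tuple

Backend = Tuple[str, int]  # (host, port)


def tcp_probe(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def healthy_backends(backends: List[Backend],
                     probe: Callable[[str, int], bool] = tcp_probe) -> List[Backend]:
    """Backends that passed the probe, in their original order."""
    return [b for b in backends if probe(*b)]


# Usage with a stand-in probe, so the example runs without a network.
# The addresses are from the documentation ranges and are placeholders.
pool = [("192.0.2.10", 8080), ("192.0.2.11", 8080)]
down = {("192.0.2.10", 8080)}
print(healthy_backends(pool, probe=lambda h, p: (h, p) not in down))
# [('192.0.2.11', 8080)]
```

Run on a 1–5 second loop, a check like this moves new connections off a broken backend without the client's DNS view changing at all.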

DNS failover earns its keep for cross-region disaster recovery: an entire AWS region falls over, you need traffic on your GCP standby within the hour, and you can survive a few minutes of recovery. For "one app server crashed", DNS is the wrong tool.

How aigw handles failover

aigw does failover with a health-checked pool set to active-passive selection: the pool serves its highest-priority healthy member. The healthchecker probes each member every 30 seconds (configurable). When the primary goes down, the next DNS query returns the next-priority member with the dead one excluded; when it recovers, queries shift back. The dashboard shows incident transitions so you can post-mortem against real timestamps.
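The selection rule itself fits in a few lines. This is a sketch of the behavior described above, not aigw's actual code or API: the `Member` type, the field names, and the convention that a lower number means higher priority are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Member:
    address: str
    priority: int   # assumption: lower number = higher priority
    healthy: bool   # last result from the health checker


def select_active(pool: List[Member]) -> Optional[Member]:
    """Active-passive selection: the highest-priority healthy member,
    or None if every member is down."""
    candidates = [m for m in pool if m.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m.priority)


# Placeholder addresses from the documentation ranges.
primary_down = [
    Member("192.0.2.1", priority=1, healthy=False),
    Member("198.51.100.1", priority=2, healthy=True),
]
print(select_active(primary_down).address)  # 198.51.100.1

# When the primary recovers, selection shifts back to it.
primary_back = [
    Member("192.0.2.1", priority=1, healthy=True),
    Member("198.51.100.1", priority=2, healthy=True),
]
print(select_active(primary_back).address)  # 192.0.2.1
```

Note that with a 30-second probe interval, detection alone can take up to 30 seconds before this rule returns a different answer, and every cache layer described earlier applies on top of that.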

We don't pretend this is sub-second. Used in the right place (geographic DR, multi-cloud standby, "this is our break-glass"), it's a clean way to express "if A is down, send people to B". For tighter SLOs, you want anycast or an LB in front.