Global Load Balancing and DNS-based Routing
An overview of Global Server Load Balancing (GSLB) techniques, using DNS to route traffic across multiple data centers.
The challenge of maintaining high availability and low latency at a global scale is one of the most significant hurdles in modern software architecture. When a service grows beyond a single data center, the complexity of directing users to the correct location increases dramatically. We have seen high-profile outages at companies like Meta and AWS where networking misconfigurations or failures in the control plane led to global downtime. These events highlight a fundamental truth in our industry: the network is not reliable, and our routing strategies must be resilient to regional failures.
A common architectural goal is to achieve an active-active multi-region setup. Netflix pioneered this approach by moving away from a single primary region to a model where traffic can be evacuated from one AWS region to another in minutes. The primary tool for achieving this level of traffic control is Global Server Load Balancing (GSLB) driven by the Domain Name System (DNS).
The thesis of this analysis is that while DNS-based GSLB is the most flexible and cost-effective method for global traffic management, its effectiveness is strictly limited by the behavior of recursive resolvers and the proper implementation of the EDNS Client Subnet (ECS) extension. Without a deep understanding of these underlying protocols, architects risk building systems that fail to failover during a crisis or route users to distant, high-latency regions.
Architectural Pattern Analysis: DNS vs. Anycast
To understand GSLB, we must first compare it to its primary alternative: Anycast routing. In an Anycast setup, multiple servers across the globe share the same IP address. The Border Gateway Protocol (BGP) directs traffic to the nearest instance based on network hops. This is the foundation of many Content Delivery Networks (CDNs) like Cloudflare.
However, Anycast has limitations. It provides very little control over which specific user hits which specific data center. If a data center is at capacity but still healthy from a BGP perspective, it will continue to attract traffic. DNS-based GSLB, on the other hand, allows for much finer control. By returning different IP addresses based on the user's location, current server load, or regional health, we can implement sophisticated traffic engineering.
The following table compares these two dominant approaches across critical architectural criteria.
| Criteria | DNS-based GSLB | Anycast (BGP) |
| --- | --- | --- |
| Failover Speed | Minutes (Limited by TTL) | Seconds (BGP Convergence) |
| Traffic Granularity | High (User, Region, Weight) | Low (Network Proximity) |
| Operational Complexity | Moderate | High (Requires BGP expertise) |
| Client Precision | High (With ECS support) | Natural (Network-based) |
| Infrastructure Cost | Lower (Managed Services) | Higher (IP Space and Hardware) |
The failure of simple round-robin DNS is well documented. In the early days of the web, many teams simply listed multiple A records for a single hostname. The hope was that clients would distribute themselves evenly. In reality, recursive resolvers at Internet Service Providers (ISPs) often cache these records and return them in a fixed order, or clients might only try the first IP and fail if it is unreachable. This lack of intelligence is why modern GSLB solutions act as a dynamic policy engine rather than a static list.
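The client-side mitigation this implies can be sketched as follows: shuffle the returned A records and try each one rather than pinning to the first. This is a minimal illustration, not a real library API; `connectWithFallback` and the injected `tryConnect` are hypothetical names.

```typescript
type ConnectFn = (ip: string) => boolean;

/** Shuffle a copy of the record list (Fisher-Yates) to avoid fixed-order bias. */
function shuffle<T>(items: T[]): T[] {
  const copy = [...items];
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy;
}

/**
 * Try every returned A record instead of failing on the first.
 * Returns the IP that accepted the connection, or null if all failed.
 */
function connectWithFallback(aRecords: string[], tryConnect: ConnectFn): string | null {
  for (const ip of shuffle(aRecords)) {
    if (tryConnect(ip)) {
      return ip;
    }
  }
  return null;
}
```

Real clients rarely do either of these things, which is exactly why a server-side policy engine is needed.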
The sequence diagram above illustrates the standard flow of a DNS-based GSLB request. The critical step is the GSLB nameserver evaluating health and proximity. If Region A is currently experiencing a 5% increase in error rates, the GSLB engine can immediately start shifting a percentage of traffic to Region B by updating the DNS responses it provides to the recursive resolvers.
The Role of EDNS Client Subnet (ECS)
A major pitfall in DNS-based routing is the location of the recursive resolver. If a user in Tokyo uses a DNS resolver located in New York, a standard DNS server will see the request coming from New York and route the user to a US-based data center. This results in terrible latency.
RFC 7871, which defines the Client Subnet in DNS Queries, solves this by allowing the resolver to include a portion of the user's IP address (the subnet) in the request to the authoritative nameserver. This allows the GSLB engine to see where the actual user is located. Companies like Google and OpenDNS were early adopters of this, and it is now a requirement for any high-performance global architecture.
However, not all ISPs support ECS. When ECS is missing, the GSLB has to fall back to the resolver's IP address. This is why many global companies still maintain a large number of "Edge" points of presence (PoPs) to ensure that even if DNS routing is slightly off, the initial TCP/TLS termination happens close to the user.
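Conceptually, the GSLB engine's decision about which address to geolocate looks like this: prefer the ECS subnet when the resolver sent one, otherwise fall back to the resolver's own IP. The `DnsQueryInfo` shape below is a hypothetical sketch, not a real DNS server API.

```typescript
interface DnsQueryInfo {
  resolverIp: string;
  /** ECS option from RFC 7871, e.g. "203.0.113.0/24", if the resolver supports it. */
  ecsSubnet?: string;
}

/** Returns the address or prefix the geo-routing engine should locate. */
function routingSource(query: DnsQueryInfo): string {
  // Prefer the client subnet: it reflects the real user, not the resolver.
  if (query.ecsSubnet) {
    return query.ecsSubnet;
  }
  // Fallback: the resolver IP, which may be thousands of miles from the user.
  return query.resolverIp;
}
```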
The Blueprint for Implementation
Building a robust GSLB system requires more than just a managed DNS service. It requires an integrated health checking system and a strategy for handling the "Thundering Herd" during failover. When you change a DNS record, you are at the mercy of the Time to Live (TTL) value. If you set a TTL of 60 seconds, you expect traffic to shift in a minute. In practice, many resolvers ignore low TTLs and cache for longer, leading to a long tail of traffic that persists on a failing region for 10 to 15 minutes.
Health Check Logic
Health checks must be more than a simple TCP ping. A service might be "up" but returning 500 errors or experiencing extreme database contention. A senior architect should implement "Deep Health Checks" that verify the entire request path.
The following TypeScript snippet demonstrates a conceptual health aggregator that a GSLB controller might use to determine regional weights.
```typescript
interface RegionalMetrics {
  regionId: string;
  successRate: number; // 0.0 to 1.0
  p99LatencyMs: number;
  cpuUtilization: number;
}

interface GSLBConfig {
  maxLatencyThreshold: number;
  minSuccessRate: number;
}

/**
 * Calculates a routing weight for a region based on its current health.
 * A weight of 0 indicates the region should be evacuated.
 */
function calculateRegionalWeight(
  metrics: RegionalMetrics,
  config: GSLBConfig
): number {
  // Hard failure: if the success rate is below threshold, stop routing traffic
  if (metrics.successRate < config.minSuccessRate) {
    return 0;
  }
  // Soft failure: if latency is too high, reduce the weight proportionally
  if (metrics.p99LatencyMs > config.maxLatencyThreshold) {
    const latencyPenalty = metrics.p99LatencyMs / config.maxLatencyThreshold;
    return Math.max(10, 100 / latencyPenalty);
  }
  // Load shedding: reduce the weight if CPU is saturated to prevent cascading failure
  if (metrics.cpuUtilization > 0.85) {
    return 50;
  }
  // Default healthy weight
  return 100;
}
```
This logic ensures that traffic shifting is not a binary switch but a gradual process. By reducing the weight of a degraded region, you can alleviate pressure without immediately overwhelming the remaining regions. This is a lesson learned from large-scale incidents where a sudden 100% failover caused a "domino effect" of failures across every data center.
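To show how such weights might actually drive DNS responses, here is a sketch of a weighted random pick over healthy regions, the kind of selection a GSLB engine could run when composing an answer. The region names and the injectable `rand` parameter are illustrative.

```typescript
interface WeightedRegion {
  regionId: string;
  weight: number; // 0 means the region has been evacuated
}

/**
 * Picks a region with probability proportional to its weight.
 * `rand` is injectable for deterministic testing; defaults to Math.random.
 */
function pickRegion(
  regions: WeightedRegion[],
  rand: () => number = Math.random
): string | null {
  const candidates = regions.filter(r => r.weight > 0);
  const total = candidates.reduce((sum, r) => sum + r.weight, 0);
  if (total === 0) return null; // every region evacuated: nothing to return
  let point = rand() * total;
  for (const region of candidates) {
    point -= region.weight;
    if (point <= 0) return region.regionId;
  }
  return candidates[candidates.length - 1].regionId;
}
```

Because an evacuated region simply carries weight 0, draining it requires no special-case logic in the selection path.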
The flowchart above demonstrates the relationship between the health monitor and the GSLB engine. The health monitor must be distributed. If you only monitor your EU region from the US, a transatlantic fiber cut might make the EU region look "down" to your monitor, even though it is perfectly healthy for local EU users. A mature architecture uses a "quorum" of monitors located in different parts of the world to determine regional health.
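A minimal sketch of that quorum rule: a region is considered healthy only if a strict majority of distributed monitors agree, so one monitor with a broken network path cannot trigger an evacuation on its own. The `MonitorReport` shape and the fail-safe choice for an empty report set are assumptions for illustration.

```typescript
interface MonitorReport {
  monitorLocation: string; // e.g. "us-east", "eu-west", "ap-northeast"
  regionHealthy: boolean;
}

/** True only if a strict majority of monitors report the region healthy. */
function regionHealthyByQuorum(reports: MonitorReport[]): boolean {
  if (reports.length === 0) return false; // no data: treat as unhealthy (assumption)
  const healthyVotes = reports.filter(r => r.regionHealthy).length;
  return healthyVotes * 2 > reports.length;
}
```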
Common Implementation Pitfalls
One of the most frequent mistakes I see is the "Sticky DNS" problem. Some client libraries and mobile operating systems perform DNS resolution once and cache the result for the lifetime of the application process. This completely bypasses your GSLB logic. If you evacuate a region, these "sticky" clients will continue to send traffic to the dead IP until the app is restarted. To mitigate this, your application layer must be aware of connection failures and force a DNS re-resolution or use a secondary endpoint.
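A hypothetical client-side mitigation looks like this: cache the resolved IP as usual, but on a connection failure, drop the cache and re-resolve once before retrying, so a GSLB-driven failover is actually picked up. The injected `resolve` and `connect` functions stand in for real DNS and socket calls.

```typescript
type Resolve = (hostname: string) => string;
type Connect = (ip: string) => boolean;

class ReResolvingClient {
  private cachedIp: string | null = null;

  constructor(
    private hostname: string,
    private resolve: Resolve,
    private connect: Connect,
  ) {}

  /** Connect using the cached IP; on failure, re-resolve once and retry. */
  request(): boolean {
    if (this.cachedIp === null) {
      this.cachedIp = this.resolve(this.hostname);
    }
    if (this.connect(this.cachedIp)) {
      return true;
    }
    // Connection failed: discard the stale IP and ask DNS again, instead of
    // hammering a dead address for the lifetime of the process.
    this.cachedIp = this.resolve(this.hostname);
    return this.connect(this.cachedIp);
  }
}
```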
Another pitfall is the lack of "Default" routing. If your GSLB logic is based on complex geo-fencing and a user arrives from an unknown or new IP range, what happens? I have seen systems return an empty response or an NXDOMAIN (the DNS equivalent of a 404). Always ensure a robust default region is configured.
Strategic Implications: The Cost of Global Resilience
Implementing GSLB is not just a networking task; it is a business decision that affects the entire stack. If you route a user to a different region, is their data there? This brings us to the CAP theorem. DNS-based routing is the "easy" part of global architecture. The "hard" part is data synchronization.
If you are using a database like Amazon Aurora Global Database, you have to account for replication lag. If a user is routed from US-East to US-West, they might experience "time travel" where a record they just created has not yet appeared in the new region. As an architect, you must decide if your application can handle eventual consistency or if you need to implement "Regional Sticky Sessions" at the application level to keep a user in a region as long as it is healthy.
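A sticky-session policy of this kind can be sketched as follows: once a user has been served from a region, keep them pinned there while it stays healthy, so they never observe replication lag from a mid-session region switch. The in-memory map and region names below are purely illustrative; a real system would store the pin in a session cookie or token.

```typescript
// Illustrative in-memory pin store (a real system would use a cookie/token).
const sessionRegion = new Map<string, string>();

function chooseRegion(
  userId: string,
  healthyRegions: Set<string>,
  defaultRegion: string,
): string {
  const pinned = sessionRegion.get(userId);
  // Honor the pin only while the pinned region is still healthy.
  if (pinned !== undefined && healthyRegions.has(pinned)) {
    return pinned;
  }
  // Otherwise (new user, or pinned region evacuated) pick a healthy region.
  const region = healthyRegions.has(defaultRegion)
    ? defaultRegion
    : Array.from(healthyRegions)[0] ?? defaultRegion;
  sessionRegion.set(userId, region);
  return region;
}
```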
LinkedIn, for example, has discussed their use of a "Traffic Shift" tool that allows engineers to move percentages of traffic between data centers during maintenance or incidents. This requires that every data center is capable of serving any user's request, which implies a massive investment in global data replication and service parity.
Managing the Long Tail of DNS Caching
As mentioned previously, the TTL is a suggestion, not a law. In a real-world failover scenario, you will observe a "long tail" of traffic. This is traffic from clients or recursive resolvers that ignore your 60-second TTL and keep the old IP for 30 minutes, an hour, or even longer.
To handle this, you cannot simply turn off the old region. You must "drain" it. This involves:
- Updating DNS to point to the new region.
- Monitoring the traffic decrease in the old region.
- Keeping a "skeleton" crew of services running in the old region to handle the remaining traffic.
- Optionally, using a proxy in the old region to forward requests to the new region over a private backbone.
This proxying approach is what many top-tier engineering teams use to achieve near-instant failover despite the limitations of DNS. The DNS handles the bulk of the shift, while the application-level proxy handles the cached "long tail."
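One way to sketch this lifecycle in code, assuming four states (Active, Draining, Proxying, Decommissioned) and purely illustrative traffic thresholds for the transitions:

```typescript
type RegionState = "Active" | "Draining" | "Proxying" | "Decommissioned";

interface DrainSignal {
  dnsUpdated: boolean;
  /** Requests per second still arriving directly at the old region. */
  residualRps: number;
}

function nextState(current: RegionState, signal: DrainSignal): RegionState {
  switch (current) {
    case "Active":
      // The drain begins once DNS points at the new region.
      return signal.dnsUpdated ? "Draining" : "Active";
    case "Draining":
      // Once the bulk has shifted, proxy the cached long tail over the backbone.
      return signal.residualRps < 100 ? "Proxying" : "Draining";
    case "Proxying":
      // Decommission only when the stragglers are effectively gone.
      return signal.residualRps === 0 ? "Decommissioned" : "Proxying";
    case "Decommissioned":
      return "Decommissioned";
  }
}
```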
The state diagram above shows the lifecycle of a region during a traffic shift. The "Proxying" state is critical for maintaining a 100% success rate during the transition. By logging the "Deprecated Clients," you can identify specific ISPs or client versions that are not respecting DNS TTLs and investigate further.
Strategic Considerations for Your Team
When designing or refining your global routing strategy, consider the following principles:
- Prioritize Simplicity Over Perfect Routing: It is better to have a slightly suboptimal route (e.g., routing a user from France to Germany instead of a local French PoP) than a highly complex GSLB configuration that is prone to human error.
- Automate the Failover: In the heat of an incident, humans make mistakes. Your GSLB should be capable of automatic "Circuit Breaking." If a region's health drops below a certain threshold, the system should automatically begin the drain process.
- Test Your "Drain" Regularly: If you never practice moving traffic, you will fail when a real emergency occurs. Netflix uses "Chaos Kong" to regularly simulate the failure of an entire AWS region. This ensures that their GSLB, data replication, and service capacity are always ready.
- Monitor from the Outside In: Use "Synthetic Monitoring" from various global locations to verify what your users are actually seeing. Your internal dashboards might show everything is green, but a DNS misconfiguration could be sending all of Australia to a data center in Brazil.
- Understand Your Client Behavior: If you control the client (e.g., a mobile app), implement smart retry logic. If a connection fails, do not just retry the same IP. Perform a fresh DNS lookup or have a hardcoded "Emergency IP" to reach a global gateway.
The Evolution of Global Routing
We are moving toward a world where the boundary between DNS and Anycast is blurring. Services like AWS Global Accelerator provide you with static Anycast IPs that then use the AWS private network to route traffic to the best regional endpoint. This combines the failover speed of Anycast with the fine-grained control of GSLB.
However, the fundamentals of DNS-based routing remain essential. Whether you are using a managed service or building your own, the ability to control traffic at the edge is the only way to achieve true global scale and resilience. As architects, our job is to embrace the limitations of the protocols we use and build systems that are robust in the face of the inevitable network failures.
TL;DR (Too Long; Didn't Read)
Global Server Load Balancing (GSLB) via DNS is the primary mechanism for directing global traffic, offering high granularity and control compared to Anycast. Its success relies on the EDNS Client Subnet (ECS) extension to accurately identify user locations and low TTL values for responsive failover. However, DNS caching at the ISP and client levels creates a "long tail" of traffic, requiring a "drain and proxy" strategy rather than a hard switch. High-availability architectures must combine DNS-based routing with deep health checks and global data replication to ensure that users are not only routed to a healthy region but also find their data consistent upon arrival. Regular "region evacuation" drills are mandatory to verify that the system can handle the sudden load shift of a real-world outage.