How Load Balancers Distribute Traffic
In 1999, a small online retailer experienced what many web engineers at the time considered an inevitability: their single application server, a beefy machine running Apache on a Sun Microsystems box, buckled under the weight of a holiday traffic surge. The server's CPU pegged at 100 percent, memory exhaustion caused the operating system to thrash, and within minutes the site returned nothing but connection timeouts. Customers left, revenue evaporated, and the engineering team scrambled to bring the machine back online. The fix, when it came days later, was not a bigger server -- it was a second server, and a mysterious new device sitting in front of both of them that decided which machine should handle each incoming request.
That device was a load balancer, and its introduction marked one of the most consequential architectural shifts in the history of web infrastructure. The fundamental insight behind load balancing is deceptively simple: instead of routing all traffic to a single server and hoping it can handle the load, you distribute incoming requests across a pool of servers so that no individual machine becomes a bottleneck. But the simplicity of that premise belies an extraordinary depth of engineering complexity beneath the surface. How does the load balancer decide which server gets the next request? What happens when a server fails mid-request? How do you ensure that a user's shopping cart persists across multiple requests if each one might land on a different server? How do you terminate encrypted connections efficiently? How do you load-balance the load balancer itself?
These questions have driven decades of innovation in networking, software architecture, and distributed systems design. Load balancers today sit at the heart of virtually every large-scale web application, from social media platforms handling billions of requests per day to enterprise APIs serving internal microservices. They exist as dedicated hardware appliances, as software running on commodity servers, as cloud-managed services, and as lightweight processes embedded directly into application sidecars. Understanding how they work -- not just at a surface level, but deeply, mechanically, with an appreciation for the tradeoffs involved in every design decision -- is essential knowledge for anyone who builds, operates, or architects systems that must be reliable, performant, and scalable.
This article is a comprehensive examination of load balancing from first principles. We will trace the history from hardware appliances to modern software and cloud implementations, dissect the algorithms that determine how traffic is distributed, explore the critical distinction between Layer 4 and Layer 7 load balancing, understand health checking mechanisms that keep pools healthy, grapple with the tradeoffs of session affinity, examine SSL/TLS handling strategies, and study how load balancers themselves achieve high availability. By the end, you will have the deep, practical understanding needed to design, evaluate, and operate load-balanced architectures in production.
The Problem Load Balancers Solve: Single Server Limitations and the Need to Scale
Every web application begins its life on a single server. A developer writes code, deploys it to a machine, points a domain name at that machine's IP address, and users begin making requests. For a personal blog or a small internal tool, this architecture works perfectly well. But as traffic grows, the single server encounters hard limits that no amount of hardware optimization can fully overcome.
Vertical scaling -- adding more CPU cores, more RAM, faster storage -- can extend the life of a single-server architecture, but it has diminishing returns and hard ceilings. There is a largest server you can buy, and its cost scales superlinearly: a machine with twice the capacity often costs four or five times as much. More fundamentally, a single server is a single point of failure. If that machine crashes, loses network connectivity, or needs to be rebooted for a kernel update, every user of your application experiences downtime.
Horizontal scaling -- adding more servers and distributing traffic across them -- addresses both limitations simultaneously. It allows you to scale capacity almost linearly by adding commodity machines, and it provides fault tolerance because the failure of any single server does not bring down the entire application. But horizontal scaling introduces a new problem: something needs to sit between the users and the pool of servers, accepting incoming connections and deciding which backend server should handle each one. That something is the load balancer.
"A load balancer is fundamentally a traffic cop standing at the intersection between clients and servers, directing each request to the most appropriate destination based on a set of rules, policies, and real-time observations about the health and capacity of the server pool."
The specific problems that load balancers solve include:
- Traffic distribution: Spreading requests evenly (or according to capacity) across multiple servers so that no single machine is overwhelmed
- High availability: Automatically detecting when a server has failed and routing traffic away from it, so users experience no downtime
- Horizontal scalability: Enabling the addition or removal of servers from the pool without disrupting active users
- Performance optimization: Routing requests to the server best positioned to handle them quickly, whether by geographic proximity, current load, or content specialization
- SSL/TLS offloading: Handling the computationally expensive work of encrypting and decrypting HTTPS traffic so that backend servers can focus on application logic
- Security: Acting as a choke point where DDoS mitigation, rate limiting, and request filtering can be applied before traffic reaches application servers
Without load balancers, the modern internet as we know it -- with its expectation of always-on availability and sub-second response times -- would simply not be possible.
A Brief History: From Hardware Appliances to Software and Cloud
The Hardware Era
The first commercial load balancers appeared in the mid-1990s as dedicated hardware appliances. Companies like Cisco (with its LocalDirector, later the Content Switching Module), F5 Networks (with the BIG-IP platform), and Citrix (with NetScaler) shipped purpose-built devices with custom ASICs designed to process network packets at wire speed. These appliances were expensive -- often tens or hundreds of thousands of dollars -- but they could handle enormous volumes of traffic with very low latency.
Hardware load balancers dominated the market through the 2000s, particularly in enterprise environments and large-scale web properties. They offered features that software alternatives of the time could not match: hardware-accelerated SSL processing, deep packet inspection at line rate, and sophisticated health checking with dedicated management interfaces.
The Software Revolution
The landscape began shifting in the late 2000s and accelerated through the 2010s as commodity server hardware became powerful enough to handle load balancing in software. Two open-source projects in particular transformed the field:
- HAProxy, first released in 2001 by Willy Tarreau, became the gold standard for high-performance software load balancing. Written in C and designed from the ground up for reliability and performance, HAProxy could handle hundreds of thousands of concurrent connections on a single commodity server.
- Nginx, originally written by Igor Sysoev in 2002 and publicly released in 2004, served double duty as both a web server and a reverse proxy / load balancer. Its event-driven architecture made it exceptionally efficient at handling large numbers of concurrent connections.
The economic argument was compelling: a $5,000 commodity server running HAProxy or Nginx could handle traffic volumes that previously required a $100,000 hardware appliance. As cloud computing emerged, the cost differential became even more stark.
The Cloud and Service Mesh Era
The rise of cloud computing in the 2010s introduced managed load balancing services that abstracted away the underlying infrastructure entirely. Amazon Web Services launched Elastic Load Balancing (ELB) in 2009, later splitting it into the Application Load Balancer (ALB) for Layer 7 and the Network Load Balancer (NLB) for Layer 4. Google Cloud Platform introduced Cloud Load Balancing, and Microsoft Azure offered its own Azure Load Balancer and Application Gateway.
Simultaneously, the rise of microservices architectures and containerized deployments created demand for a new category: service mesh load balancing. Projects like Envoy (originally built at Lyft, open-sourced in 2016) and Linkerd embedded load balancing logic into sidecar proxies running alongside every service instance, enabling sophisticated traffic management without centralized load balancer infrastructure.
Today, Traefik, an open-source edge router designed for cloud-native environments, has gained significant adoption in Kubernetes ecosystems, while Envoy serves as the data plane for service meshes like Istio. The evolution continues, but the fundamental principles remain remarkably consistent.
Load Balancing Algorithms: How the Decision Gets Made
The algorithm a load balancer uses to select a backend server for each incoming request is arguably its most important configuration decision. Different algorithms optimize for different goals -- fairness, performance, simplicity, or session consistency -- and choosing the wrong one can negate many of the benefits load balancing is supposed to provide.
Round Robin
Round robin is the simplest and most widely understood load balancing algorithm. The load balancer maintains an ordered list of backend servers and assigns each incoming request to the next server in the list, cycling back to the beginning after reaching the end.
If you have three servers -- A, B, and C -- requests are distributed as follows:
- Request 1 goes to Server A
- Request 2 goes to Server B
- Request 3 goes to Server C
- Request 4 goes to Server A
- Request 5 goes to Server B
- ...and so on
Advantages: Round robin is trivially simple to implement, requires no state tracking beyond a pointer to the current position in the list, and distributes requests with perfect mathematical equality over time.
Disadvantages: Round robin assumes all servers have equal capacity and all requests impose equal load. Neither assumption is typically true. A server with 4 CPU cores and 16 GB of RAM will receive the same number of requests as one with 32 cores and 128 GB. A request that triggers a complex database query will be treated identically to one that returns a cached static page. This can lead to uneven actual load despite even request distribution.
Round robin works best when backend servers are homogeneous (identical hardware and configuration) and request processing times are relatively uniform.
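As a concrete illustration, a round-robin selector needs nothing more than the server list and a request counter. The sketch below is minimal Python with placeholder server names, not a reproduction of any particular product:

from itertools import count

class RoundRobinBalancer:
    """Cycles through the server list in order, wrapping at the end."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._requests = count()  # monotonically increasing request counter

    def next_server(self):
        # Modulo arithmetic wraps the counter back to the start of the list.
        return self.servers[next(self._requests) % len(self.servers)]

lb = RoundRobinBalancer(["server-a", "server-b", "server-c"])
print([lb.next_server() for _ in range(5)])
# -> ['server-a', 'server-b', 'server-c', 'server-a', 'server-b']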
Weighted Round Robin
Weighted round robin extends the basic algorithm by assigning a numerical weight to each server, proportional to its capacity. A server with weight 3 receives three times as many requests as a server with weight 1.
For example, with Server A (weight 3), Server B (weight 2), and Server C (weight 1), the distribution over six requests would be:
- Requests 1, 2, 3 go to Server A
- Requests 4, 5 go to Server B
- Request 6 goes to Server C
In practice, most implementations interleave the distribution more smoothly (A, A, B, A, B, C) rather than sending them in blocks, to avoid bursty load patterns.
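The smoother interleaving is commonly achieved with a "smooth weighted round robin" scheme of the kind associated with Nginx's implementation: each server accumulates its weight on every pick, the current leader is chosen, and the chosen server is then penalized by the total weight. A minimal Python sketch follows; the exact order it produces differs from the parenthetical example above but preserves the 3:2:1 ratio:

class SmoothWeightedRoundRobin:
    """Interleaves picks so high-weight servers are not chosen in bursts."""

    def __init__(self, weights):
        # weights: dict of server name -> integer weight
        self.weights = dict(weights)
        self.current = {name: 0 for name in weights}

    def next_server(self):
        total = sum(self.weights.values())
        # Every server accumulates its weight, the leader is picked,
        # then the leader is penalized by the total weight.
        for name, weight in self.weights.items():
            self.current[name] += weight
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= total
        return chosen

lb = SmoothWeightedRoundRobin({"A": 3, "B": 2, "C": 1})
print([lb.next_server() for _ in range(6)])
# -> ['A', 'B', 'A', 'C', 'B', 'A'] on CPython 3.7+ (one valid 3:2:1 interleaving)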
Use case: Weighted round robin is essential when your server pool contains machines of varying capacity -- a common situation during hardware refresh cycles, when some servers are older and less powerful than others.
Least Connections
The least connections algorithm routes each new request to the server currently handling the fewest active connections. This is a dynamic algorithm that adapts to real-time conditions: if one server is processing a batch of slow requests, it will accumulate connections and the load balancer will automatically route new requests elsewhere.
Advantages: Least connections is significantly better than round robin at handling variable request processing times. It naturally adapts to heterogeneous workloads without explicit configuration.
Disadvantages: It requires the load balancer to track the number of active connections to each backend server, adding a small amount of state and overhead. It can also behave poorly at startup: when all servers have zero connections, the algorithm may send a burst of requests to the first server in the list before connection counts have time to diverge.
Weighted Least Connections
Weighted least connections combines the capacity awareness of weighted algorithms with the dynamic adaptability of least connections. The load balancer calculates a score for each server by dividing its current active connection count by its weight, then routes to the server with the lowest score.
For example, if Server A (weight 5) has 10 active connections and Server B (weight 2) has 3 active connections:
- Server A's score: 10 / 5 = 2.0
- Server B's score: 3 / 2 = 1.5
- The next request goes to Server B
This algorithm is often considered the best general-purpose choice for production deployments because it accounts for both server capacity and current load.
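Expressed as code, the selection rule is a single comparison. The sketch below assumes the balancer already tracks each server's active connection count and weight:

def pick_weighted_least_connections(servers):
    """servers: list of (name, active_connections, weight) tuples.
    Returns the name of the server with the lowest connections-per-weight score."""
    return min(servers, key=lambda s: s[1] / s[2])[0]

pool = [("server-a", 10, 5), ("server-b", 3, 2)]
print(pick_weighted_least_connections(pool))  # 'server-b' (score 1.5 beats 2.0)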
IP Hash
IP hash applies a hash function to the client's source IP address to deterministically map each client to a specific backend server. The same client IP will always be routed to the same server (as long as the server pool remains unchanged).
server_index = hash(client_ip) % number_of_servers
Advantages: IP hash provides a form of session persistence without requiring cookies or server-side session tracking. It is simple to implement and adds no state to the load balancer beyond the hash function itself.
Disadvantages: The distribution quality depends entirely on the hash function and the distribution of client IP addresses. In environments where many users share a single public IP (corporate NAT, university networks, mobile carrier gateways), IP hash can produce severely unbalanced distribution. Additionally, any change in the server pool -- adding or removing a server -- causes a significant portion of clients to be remapped to different servers, disrupting any in-progress sessions.
Consistent hashing is an important refinement that minimizes the disruption caused by pool changes. Rather than a simple modulo operation, consistent hashing maps both servers and client IPs onto a virtual ring, so that adding or removing a server only remaps the clients that were mapped to that specific server, leaving all others unaffected.
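A minimal consistent-hash ring can be sketched in a few dozen lines; the hash function and the number of virtual nodes per server below are arbitrary illustrative choices, not a prescription:

import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys onto a ring of virtual nodes so that adding or removing
    a server only remaps the keys that landed on that server."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (hash, server) points on the ring
        for server in servers:
            self.add(server)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server):
        # Each server contributes `replicas` virtual nodes to smooth the distribution.
        for i in range(self.replicas):
            self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()

    def remove(self, server):
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def get(self, client_ip):
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(client_ip)
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
print(ring.get("203.0.113.7"))  # the same client IP always maps to the same server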
Least Response Time
The least response time (or fastest response) algorithm routes requests to the server with the lowest average response time, as measured by the load balancer. Some implementations combine response time with active connection count to produce a composite score.
This algorithm is particularly effective when backend servers have varying performance characteristics -- for example, when some servers are located on faster storage, have warmer caches, or are running on newer hardware. It requires the load balancer to continuously measure and track response times, adding complexity but providing genuinely intelligent routing.
Random
The random algorithm selects a backend server at random for each request, typically using a uniform distribution. While it may seem naive, random selection has a solid theoretical basis: over a large number of requests, the per-server request counts even out toward a uniform split, and this happens without any state tracking or coordination.
The "power of two random choices" is a notable refinement: instead of selecting one random server, the algorithm selects two random servers and routes to whichever has fewer active connections. This approach, backed by research from Michael Mitzenmacher and others, achieves near-optimal distribution with minimal overhead and has been adopted in systems like Envoy.
Algorithm Comparison
| Algorithm | State Required | Handles Heterogeneous Servers | Adapts to Load | Session Affinity | Complexity |
|---|---|---|---|---|---|
| Round Robin | Minimal (position counter) | No | No | No | Very Low |
| Weighted Round Robin | Minimal (position + weights) | Yes | No | No | Low |
| Least Connections | Connection counts per server | Partially | Yes | No | Medium |
| Weighted Least Connections | Connection counts + weights | Yes | Yes | No | Medium |
| IP Hash | None (stateless computation) | No | No | Yes (by IP) | Low |
| Least Response Time | Response time tracking | Yes | Yes | No | High |
| Random | None | No | No | No | Very Low |
| Random Two Choices | Connection counts per server | Partially | Yes | No | Low |
Layer 4 vs. Layer 7 Load Balancing: A Critical Architectural Decision
The distinction between Layer 4 and Layer 7 load balancing refers to the layer of the OSI (Open Systems Interconnection) model at which the load balancer operates. This is not merely a technical distinction -- it fundamentally determines what information the load balancer can see, what routing decisions it can make, and the performance characteristics of the entire system.
Layer 4 (Transport Layer) Load Balancing
A Layer 4 load balancer operates at the transport layer of the OSI model, making routing decisions based on information available in the TCP (or UDP) packet headers: primarily the source IP address, destination IP address, source port, and destination port. It does not inspect the contents of the packets -- it does not know whether the traffic is HTTP, WebSocket, database protocol, or anything else.
When a client initiates a TCP connection to the load balancer's IP address, the L4 load balancer selects a backend server (using its configured algorithm) and forwards the entire TCP connection to that server. This forwarding can happen through several mechanisms:
- NAT (Network Address Translation): The load balancer rewrites the destination IP address in each packet to point to the selected backend server, and rewrites the source IP in response packets back to the load balancer's address.
- DSR (Direct Server Return): The load balancer forwards packets to the backend, but the backend responds directly to the client, bypassing the load balancer on the return path. This dramatically increases throughput because the load balancer only handles inbound traffic.
- IP tunneling: Similar to DSR, but the original packet is encapsulated in a new IP packet for delivery to the backend.
Advantages of L4 load balancing:
- Extremely high performance because packets are forwarded with minimal processing
- Protocol-agnostic: works with any TCP or UDP-based protocol
- Lower latency because there is no need to buffer or inspect application-layer data
- Simpler implementation with fewer potential failure modes
Disadvantages of L4 load balancing:
- Cannot make routing decisions based on HTTP headers, URL paths, cookies, or any application-layer information
- Cannot modify request or response content (no header injection, no URL rewriting)
- Cannot perform SSL/TLS termination (the load balancer never sees the decrypted content)
- Limited health checking: can verify that a TCP port is open, but cannot verify that the application is responding correctly to HTTP requests
Layer 7 (Application Layer) Load Balancing
A Layer 7 load balancer operates at the application layer, fully parsing and understanding the application protocol -- most commonly HTTP/HTTPS. It terminates the client's TCP connection, reads the full HTTP request, makes a routing decision based on the request's contents, and then opens a separate TCP connection to the selected backend server to forward the request.
Because the L7 load balancer understands HTTP, it can make routing decisions based on an extraordinary range of criteria:
- URL path: Route /api/* requests to API servers and /static/* to CDN origins
- HTTP method: Route GET requests to read replicas and POST/PUT/DELETE to write primaries
- Host header: Route api.example.com and www.example.com to different backend pools
- HTTP headers: Route based on Accept-Language, User-Agent, custom headers, or API version headers
- Cookies: Implement session affinity by reading a session cookie
- Query parameters: Route based on specific query parameter values
- Request body: In some implementations, route based on the content of the request body (though this is uncommon due to performance implications)
Advantages of L7 load balancing:
- Content-based routing enables sophisticated traffic management
- SSL/TLS termination offloads encryption work from backend servers
- Can modify requests and responses: inject headers (like X-Forwarded-For), rewrite URLs, add or remove cookies
- Rich health checking: can send actual HTTP requests and verify response codes and content
- Enables features like A/B testing, canary deployments, and blue-green deployments through header-based or percentage-based routing
- Can compress responses, cache content, and apply rate limiting at the HTTP level
Disadvantages of L7 load balancing:
- Higher latency because the load balancer must fully parse the application protocol
- Higher resource consumption (CPU and memory) on the load balancer
- More complex configuration and more potential failure modes
- Must understand each protocol it handles (adding WebSocket, gRPC, or HTTP/2 support requires explicit implementation)
When to Use Which
In practice, many production architectures use both layers. A common pattern is to deploy an L4 load balancer (such as AWS NLB or a Linux IPVS-based system) as the first layer, distributing traffic across a pool of L7 load balancers (such as Nginx or HAProxy instances), which then perform content-based routing to backend application servers. This two-tier architecture combines the raw throughput of L4 with the routing intelligence of L7.
"The choice between L4 and L7 is not either-or. The most resilient architectures layer them, using L4 for raw distribution and fault tolerance at the edge, and L7 for application-aware routing closer to the services."
Health Checking: Keeping the Server Pool Healthy
A load balancer is only as good as its ability to detect when a backend server has become unhealthy and to stop sending traffic to it. Health checks are the mechanism by which load balancers continuously monitor the status of each server in the pool.
Without health checks, a load balancer would blindly send requests to servers that have crashed, are overloaded, or are experiencing application errors -- resulting in failed requests, timeouts, and a degraded user experience that defeats the purpose of load balancing entirely.
Types of Health Checks
TCP Health Checks are the simplest form. The load balancer attempts to establish a TCP connection to the backend server's IP and port. If the three-way TCP handshake completes successfully, the server is considered healthy. If the connection is refused or times out, the server is marked unhealthy.
TCP checks are fast and lightweight, but they only verify that the server's operating system is running and the port is open. They cannot detect application-level failures -- a web server might accept TCP connections but return HTTP 500 errors for every request.
HTTP/HTTPS Health Checks send an actual HTTP request to a specific endpoint on the backend server and evaluate the response. A typical configuration might send a GET /health request and consider the server healthy only if it returns an HTTP 200 status code within a specified timeout.
More sophisticated HTTP health checks can also:
- Verify that the response body contains expected content (e.g., a JSON object with {"status": "ok"})
- Check specific response headers
- Verify that the response time is below a threshold
- Send requests with specific headers to test authentication or routing
Custom Script Health Checks execute a user-defined script or command to determine server health. This enables checks that go beyond network connectivity -- for example, verifying that the server can reach its database, that its disk usage is below a threshold, or that a background job processor is running.
gRPC Health Checks follow the gRPC health checking protocol defined in the grpc.health.v1 package, sending a Check RPC to a designated health service endpoint.
Health Check Configuration Parameters
The behavior of health checks is governed by several key parameters:
- Interval: How frequently the load balancer sends health check probes (e.g., every 10 seconds). Shorter intervals detect failures faster but generate more probe traffic.
- Timeout: How long the load balancer waits for a response to each probe before considering it failed (e.g., 5 seconds).
- Healthy threshold: How many consecutive successful checks are required before a previously unhealthy server is marked healthy again (e.g., 3 successes). This prevents a flapping server from being rapidly added and removed from the pool.
- Unhealthy threshold: How many consecutive failed checks are required before a healthy server is marked unhealthy (e.g., 2 failures). Setting this to more than 1 prevents a single dropped packet from unnecessarily removing a healthy server.
A typical production configuration might look like this:
health_check {
interval = 15s
timeout = 5s
healthy_threshold = 3
unhealthy_threshold = 2
path = "/health"
expected_status = 200
}
With these settings, the load balancer checks each server every 15 seconds. A server must fail 2 consecutive checks, so a failure is detected roughly 15 to 30 seconds after it occurs, and the server must then pass 3 consecutive checks (spanning at least 30 seconds) before it is re-added to the pool. These thresholds balance detection speed against stability.
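The threshold logic itself is a small state machine. The sketch below mirrors the parameter names from the example configuration, but it is illustrative rather than any vendor's implementation:

class HealthTracker:
    """Flips a server's status only after the configured number of
    consecutive successes or failures, which prevents flapping."""

    def __init__(self, healthy_threshold=3, unhealthy_threshold=2):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive probe results contradicting the current state

    def record(self, probe_succeeded):
        if probe_succeeded == self.healthy:
            self._streak = 0  # result agrees with the current state
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if not self.healthy else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = probe_succeeded
            self._streak = 0
        return self.healthy

tracker = HealthTracker()
for result in [False, False, True, True, True]:
    print(tracker.record(result))
# -> True, False, False, False, True: unhealthy after 2 failures, healthy again after 3 successes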
Active vs. Passive Health Checks
The health checks described above are active health checks: the load balancer proactively sends probes regardless of actual traffic patterns. Some load balancers also support passive health checks (also called outlier detection), which infer server health from the characteristics of real traffic.
For example, if a backend server starts returning HTTP 503 errors for a significant percentage of real requests, a passive health check system can mark it unhealthy without waiting for the next active probe cycle. Envoy's outlier detection is a particularly sophisticated example, tracking error rates, response times, and success rates per server and automatically ejecting outliers from the pool.
The most robust configurations use both: active health checks as a baseline, supplemented by passive checks for faster detection of degradation that manifests in real traffic patterns.
Session Affinity and Sticky Sessions: The Statefulness Challenge
One of the most persistent challenges in load-balanced architectures is handling stateful applications -- applications that store per-user state (such as shopping cart contents, authentication tokens, or multi-step form progress) on the application server itself, rather than in a shared external store.
When a user makes a request and that request creates state on Server A, subsequent requests from the same user must also go to Server A, or the state will be lost. Session affinity, also called sticky sessions, is the mechanism by which a load balancer ensures that all requests from a given user are consistently routed to the same backend server.
Why Session Affinity Exists
In an ideal world, applications would be completely stateless: all per-user state would be stored in a shared database, distributed cache (like Redis or Memcached), or client-side tokens (like JWTs). In such architectures, any request can be handled by any server, and load balancing is straightforward.
In practice, many applications maintain server-local session state for performance reasons, because of legacy architectures, or because certain operations (like WebSocket connections or long-running file uploads) are inherently stateful. Session affinity bridges the gap between stateful applications and distributed server pools.
Implementation Methods
Cookie-based affinity is the most common approach for HTTP traffic. The load balancer inserts a cookie into the HTTP response that identifies which backend server handled the initial request. On subsequent requests, the client sends this cookie back, and the load balancer uses it to route to the same server.
For example, an HAProxy configuration for cookie-based affinity might look like:
backend web_servers
balance roundrobin
cookie SERVERID insert indirect nocache
server web1 192.168.1.10:80 check cookie web1
server web2 192.168.1.11:80 check cookie web2
server web3 192.168.1.12:80 check cookie web3
Source IP affinity uses the client's IP address to determine the backend server, typically via IP hash. This works without modifying HTTP traffic but has the limitations discussed in the IP hash algorithm section -- particularly problems with shared IPs and pool changes.
Application-controlled affinity delegates the routing decision to the application itself. The application includes a header or token in its response that tells the load balancer where to route future requests. This gives the application maximum control but requires tight coupling between the application and the load balancer.
The Tradeoffs of Sticky Sessions
Session affinity solves a real problem, but it introduces significant tradeoffs:
- Uneven load distribution: If one user generates significantly more traffic than others, the server handling that user becomes disproportionately loaded, and the load balancer cannot redistribute that traffic.
- Reduced fault tolerance: If a server fails, all users whose sessions were pinned to that server lose their session state. The load balancer can redirect them to a healthy server, but any in-progress work is lost.
- Scaling complexity: Adding a new server to the pool does not immediately reduce load on existing servers, because existing sticky sessions continue routing to their original servers.
- Operational risk: Server maintenance requires draining sessions (discussed later) rather than simply removing the server from the pool.
The industry-wide trend is toward stateless architectures that externalize session state, precisely because of these tradeoffs. However, session affinity remains a necessary tool for legacy applications, WebSocket connections, and certain specialized workloads.
SSL/TLS Handling: Termination, Passthrough, and Bridging
Virtually all modern web traffic is encrypted with TLS (Transport Layer Security, the successor to SSL). How a load balancer handles this encryption is a critical architectural decision that affects security, performance, operational complexity, and the load balancer's ability to inspect and route traffic.
SSL/TLS Termination
In SSL termination (also called SSL offloading), the load balancer decrypts incoming HTTPS traffic, processes the plaintext HTTP request, makes a routing decision, and forwards the unencrypted request to the backend server over the internal network. Responses follow the reverse path: the backend sends plaintext HTTP to the load balancer, which encrypts it before sending it to the client.
Client <--HTTPS--> Load Balancer <--HTTP--> Backend Server
Advantages:
- The load balancer can inspect HTTP headers, URLs, cookies, and content for L7 routing decisions
- Backend servers are freed from the computational cost of TLS encryption, which can be significant at high traffic volumes
- Certificate management is centralized on the load balancer rather than distributed across all backend servers
- The load balancer can cache, compress, and modify responses
Disadvantages:
- Traffic between the load balancer and backend servers is unencrypted. If the internal network is not trusted (or for compliance reasons), this may be unacceptable
- The load balancer must have access to the private key for the TLS certificate, which is a security-sensitive asset
- The load balancer becomes a more attractive attack target because it holds the keys and processes all decrypted traffic
SSL termination is the most common approach in practice, particularly when the load balancer and backend servers communicate over a trusted internal network (e.g., within the same VPC in a cloud environment).
SSL Passthrough
In SSL passthrough, the load balancer forwards encrypted TLS traffic directly to the backend server without decrypting it. The backend server performs the TLS handshake with the client and handles all encryption and decryption.
Client <--HTTPS (tunneled through LB)--> Backend Server
Advantages:
- End-to-end encryption: the load balancer never sees plaintext traffic
- The load balancer never possesses the TLS private key, reducing the attack surface
- Meets strict compliance requirements that mandate end-to-end encryption
- Simpler load balancer configuration (no certificate management)
Disadvantages:
- The load balancer can only perform L4 routing (by IP address and port), because it cannot read the encrypted HTTP content. The one partial exception is the Server Name Indication (SNI) extension, sent in cleartext during the TLS handshake, which reveals the requested hostname and enables basic host-based routing even in passthrough mode
- Backend servers bear the full computational cost of TLS processing
- Certificate management must be performed on every backend server
- Cannot perform content-based caching, compression, or modification
SSL Bridging (Re-encryption)
SSL bridging is a hybrid approach: the load balancer terminates the client's TLS connection, inspects and routes the plaintext request, and then re-encrypts it using a new TLS connection to the backend server.
Client <--HTTPS--> Load Balancer <--HTTPS--> Backend Server
Advantages:
- The load balancer can perform full L7 routing decisions
- Traffic is encrypted at every point, satisfying end-to-end encryption requirements
- The client-facing and backend-facing TLS configurations can use different certificates and protocols
Disadvantages:
- Double encryption/decryption imposes a significant CPU cost on the load balancer
- Certificate management complexity: both the load balancer and backends need certificates
- Higher latency due to the additional TLS handshake between the load balancer and backend
| TLS Strategy | L7 Routing | End-to-End Encryption | LB Holds Private Key | Performance Impact | Certificate Management |
|---|---|---|---|---|---|
| SSL Termination | Yes | No (internal plaintext) | Yes | Low on backends | Centralized on LB |
| SSL Passthrough | No (L4 only + SNI) | Yes | No | Full load on backends | Distributed on backends |
| SSL Bridging | Yes | Yes | Yes (plus backend certs) | Double encryption cost | Both LB and backends |
High Availability for Load Balancers: Avoiding a New Single Point of Failure
A load balancer that sits in front of all your servers, receiving all traffic, is itself a single point of failure. If the load balancer crashes, all traffic stops -- exactly the scenario load balancing was supposed to prevent. Making the load balancer highly available is therefore essential.
Active-Passive (Failover)
In an active-passive configuration, two load balancer instances are deployed. The active instance handles all traffic, while the passive instance stands by, monitoring the active instance's health. If the active instance fails, the passive instance takes over its IP address (using a mechanism like a floating IP or Virtual IP / VIP) and begins handling traffic.
The takeover is typically facilitated by a protocol like VRRP (Virtual Router Redundancy Protocol) or a tool like keepalived on Linux. VRRP allows multiple routers (or load balancers) to share a virtual IP address. One is elected as the master, and if it fails, another member is promoted to master and claims the virtual IP.
Advantages: Simple to understand and implement. Only one load balancer handles traffic at a time, so there are no synchronization concerns.
Disadvantages: The passive instance sits idle, wasting resources. Failover is not instantaneous -- there is typically a brief period (often 1-5 seconds) during which the passive instance detects the failure and takes over. Active connections at the moment of failover are usually dropped.
Active-Active
In an active-active configuration, multiple load balancer instances are deployed simultaneously, all handling traffic. Incoming traffic is distributed across the active instances using DNS round robin, anycast routing, ECMP (Equal-Cost Multi-Path) routing, or an upstream L4 load balancer.
Advantages: All instances are utilized, providing better resource efficiency and higher total capacity. There is no failover delay because remaining instances continue handling traffic if one fails.
Disadvantages: More complex to implement. State synchronization between instances (for sticky sessions, connection tracking, or rate limiting) can be challenging. DNS-based distribution has TTL-related delays when removing a failed instance.
Cloud-Managed High Availability
Cloud load balancers (AWS ALB/NLB, GCP Cloud Load Balancing, Azure Load Balancer) handle high availability internally and transparently. The cloud provider manages redundant instances across multiple availability zones, performs health checks, and handles failover automatically. From the user's perspective, the load balancer is a single, always-available endpoint.
This is one of the most compelling reasons to use cloud-managed load balancers: the operational complexity of maintaining load balancer high availability is entirely offloaded to the cloud provider.
Global Server Load Balancing and GeoDNS
Everything discussed so far has focused on distributing traffic within a single datacenter or region. Global Server Load Balancing (GSLB) extends the concept across multiple geographic regions, directing users to the datacenter nearest to them (or best-suited to serve them) before local load balancing takes over within that datacenter.
DNS-Based GSLB
The most common GSLB mechanism is DNS-based routing, often called GeoDNS. When a user resolves your domain name, the DNS server returns different IP addresses based on the user's geographic location (inferred from the DNS resolver's IP address or, with EDNS Client Subnet, the user's actual subnet).
For example, a user in Tokyo might receive the IP address of your Tokyo datacenter, while a user in Frankfurt receives the IP address of your Frankfurt datacenter. Each datacenter has its own local load balancing infrastructure.
Services like AWS Route 53, Cloudflare DNS, NS1, and Google Cloud DNS offer GeoDNS capabilities with various routing policies: latency-based, geolocation-based, weighted, and failover.
Anycast
Anycast is a network addressing technique where the same IP address is advertised from multiple locations via BGP (Border Gateway Protocol). Routers on the internet automatically direct traffic to the nearest advertising location based on network topology. Cloudflare, Google, and other large CDN/infrastructure providers use anycast extensively.
Unlike DNS-based GSLB, anycast operates at the network layer and does not depend on DNS TTLs or resolver behavior. It provides near-instant failover: if one location goes down, BGP withdrawals automatically redirect traffic to the next nearest location.
Software Load Balancers in Depth
Nginx
Nginx is the world's most widely deployed web server and reverse proxy, used by over 30% of all websites. Its load balancing capabilities include round robin, least connections, IP hash, and (in the commercial Nginx Plus version) least time algorithms. Nginx handles HTTP, HTTPS, TCP, UDP, gRPC, and WebSocket traffic.
A typical Nginx load balancer configuration:
upstream backend {
least_conn;
server 10.0.1.10:8080 weight=3;
server 10.0.1.11:8080 weight=2;
server 10.0.1.12:8080 weight=1;
server 10.0.1.13:8080 backup;
}
server {
listen 443 ssl;
server_name example.com;
ssl_certificate /etc/ssl/certs/example.com.crt;
ssl_certificate_key /etc/ssl/private/example.com.key;
location / {
proxy_pass http://backend;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header Host $host;
}
}
HAProxy
HAProxy is purpose-built for load balancing and proxying, and it excels at raw performance and reliability. It is the default load balancer in many high-traffic environments and is used by organizations including GitHub, Reddit, Stack Overflow, and Tumblr. HAProxy supports all the algorithms discussed in this article, along with advanced features like connection rate limiting, request queuing, and detailed statistics reporting.
HAProxy's design philosophy prioritizes zero-downtime operations: configuration reloads can be performed without dropping a single connection, server additions and removals take effect immediately, and the software has an extraordinary track record of stability.
Envoy
Envoy was designed from the ground up for modern microservices architectures. Originally built at Lyft to solve the observability and reliability challenges of their service-oriented architecture, Envoy provides L4 and L7 load balancing with advanced features including automatic retries, circuit breaking, outlier detection, distributed tracing integration, and a powerful xDS API that enables dynamic configuration from a control plane.
Envoy is the data plane for service meshes like Istio and AWS App Mesh, and its sidecar proxy deployment model -- where an Envoy instance runs alongside every service instance -- has become a defining pattern of cloud-native architecture.
Traefik
Traefik is designed specifically for cloud-native environments and integrates natively with container orchestrators like Kubernetes, Docker Swarm, and HashiCorp Consul. It automatically discovers services and configures routing rules by reading labels, annotations, or service registries, eliminating the need for manual configuration files.
Traefik's automatic service discovery makes it particularly popular in dynamic environments where services are constantly being created, scaled, and destroyed.
Cloud Load Balancers
AWS Elastic Load Balancing
AWS offers three load balancer types:
- Application Load Balancer (ALB): Layer 7, HTTP/HTTPS. Supports path-based routing, host-based routing, weighted target groups for blue-green and canary deployments, WebSocket, gRPC, and integration with AWS WAF.
- Network Load Balancer (NLB): Layer 4, TCP/UDP/TLS. Designed for extreme performance (millions of requests per second with ultra-low latency). Supports static IP addresses, preserves source IP, and integrates with AWS PrivateLink.
- Gateway Load Balancer (GWLB): Layer 3/4, designed specifically for deploying and scaling third-party network virtual appliances (firewalls, intrusion detection systems) in a transparent manner.
GCP Cloud Load Balancing
Google Cloud's load balancing is notable for being a truly global service. The Global External Application Load Balancer uses Google's backbone network to route traffic to the nearest healthy backend, providing a single anycast IP that works worldwide. GCP also offers regional, internal, and network load balancers for different use cases.
Azure Load Balancer and Application Gateway
Azure provides Azure Load Balancer for Layer 4 and Azure Application Gateway for Layer 7. The Application Gateway includes a built-in Web Application Firewall (WAF) and supports URL-based routing, multi-site hosting, and cookie-based session affinity.
Connection Draining and Graceful Shutdown
When a server needs to be removed from the load balancer pool -- whether for maintenance, a deployment, or because it failed a health check -- simply cutting off all traffic immediately would terminate any in-progress requests, causing errors for users.
Connection draining (also called graceful shutdown or deregistration delay) is the process of allowing a server to finish processing its active requests before being fully removed from the pool. During the draining period:
- The load balancer stops sending new requests to the server
- Existing connections are allowed to continue until they complete naturally or a maximum drain timeout is reached
- Once all connections have closed (or the timeout expires), the server is fully removed
This mechanism is critical for zero-downtime deployments. During a rolling update, each server is drained, updated, health-checked, and then re-added to the pool. At no point do users experience dropped connections or errors.
A typical drain configuration in AWS ALB:
Deregistration delay: 300 seconds (5 minutes)
Setting the drain timeout appropriately requires understanding your application's request duration. For API servers where requests complete in milliseconds, 30 seconds is generous. For applications that handle long-running WebSocket connections or file uploads, you may need several minutes or longer.
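The drain sequence itself is conceptually simple: stop new traffic, wait for in-flight work, give up at the timeout. The callables in the sketch below are hypothetical stand-ins for whatever actually tracks connections and controls routing:

import time

def drain(get_active_connections, stop_new_traffic, drain_timeout=300, poll_interval=1.0):
    """Stop routing new requests to a server, then wait until its in-flight
    connections finish or the timeout expires. Both callables are supplied
    by whatever system actually tracks the server's state."""
    stop_new_traffic()
    deadline = time.monotonic() + drain_timeout
    while get_active_connections() > 0 and time.monotonic() < deadline:
        time.sleep(poll_interval)
    return get_active_connections() == 0  # True means the drain completed cleanly

# Toy usage with stand-in callables: nothing is in flight, so it returns True immediately.
print(drain(get_active_connections=lambda: 0, stop_new_traffic=lambda: None, drain_timeout=5))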
Load Balancing for Microservices: Service Mesh and Sidecar Patterns
The rise of microservices architecture -- where applications are decomposed into dozens or hundreds of small, independently deployable services -- has fundamentally changed how load balancing is approached.
The Challenge of East-West Traffic
In monolithic architectures, most traffic flows north-south: from external clients, through a load balancer, to backend servers. In microservices architectures, the dominant traffic pattern is east-west: service-to-service communication within the datacenter. A single user request might trigger dozens of internal service calls, each of which needs to be load-balanced.
Traditional centralized load balancers are poorly suited to east-west traffic. Routing every internal service call through a centralized load balancer adds latency, creates a bottleneck, and is operationally complex when hundreds of services are communicating.
Client-Side Load Balancing
One approach is client-side load balancing, where each service maintains its own list of available instances for each service it calls and performs load balancing locally. Libraries like Netflix Ribbon (now in maintenance mode) and gRPC's built-in load balancing implement this pattern.
The service discovers available instances through a service registry (like Consul, etcd, or Kubernetes service endpoints) and applies a load balancing algorithm locally when making outgoing requests. This eliminates the centralized bottleneck but distributes the complexity into every service.
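A client-side balancer is essentially a registry lookup plus a local selection rule. In the sketch below the registry call is faked with a static list, the endpoint addresses are placeholders, and selection uses the power-of-two-choices rule described earlier:

import random

class ClientSideBalancer:
    """Each service instance keeps its own view of healthy endpoints,
    refreshed from a service registry, and picks one per outgoing call."""

    def __init__(self, refresh_endpoints):
        # refresh_endpoints: callable returning the current endpoint list,
        # e.g. a lookup against Consul, etcd, or Kubernetes Endpoints.
        self._refresh = refresh_endpoints
        self._in_flight = {}

    def acquire(self):
        endpoints = self._refresh()
        self._in_flight = {e: self._in_flight.get(e, 0) for e in endpoints}
        # Power-of-two-choices: sample two endpoints, keep the less busy one.
        a, b = random.sample(endpoints, 2) if len(endpoints) > 1 else (endpoints[0],) * 2
        chosen = a if self._in_flight[a] <= self._in_flight[b] else b
        self._in_flight[chosen] += 1
        return chosen

    def release(self, endpoint):
        self._in_flight[endpoint] = max(0, self._in_flight[endpoint] - 1)

lb = ClientSideBalancer(lambda: ["10.0.1.10:8080", "10.0.1.11:8080", "10.0.1.12:8080"])
target = lb.acquire()  # pick an instance for this outgoing call
lb.release(target)     # report completion so the local counts stay accurate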
The Service Mesh Pattern
A service mesh addresses the challenges of microservices load balancing by deploying a sidecar proxy alongside every service instance. All inbound and outbound traffic for the service passes through its sidecar, which handles load balancing, health checking, retries, circuit breaking, mutual TLS, and observability -- without the service code needing to implement any of this logic.
The architecture consists of two planes:
- Data plane: The sidecar proxies (typically Envoy) that handle all service-to-service traffic
- Control plane: A centralized management layer (like Istio's istiod or Linkerd's control plane) that configures the data plane proxies, distributes service discovery information, and manages certificates
In a service mesh, load balancing happens at every hop. When Service A calls Service B, Service A's sidecar proxy selects a healthy instance of Service B using a configured algorithm (often least requests or the power-of-two-choices), handles the connection, applies retry policies if the call fails, and reports metrics. No centralized load balancer is involved in east-west traffic.
This pattern has become the dominant approach for sophisticated microservices deployments, with Istio, Linkerd, and Consul Connect as the leading implementations.
Monitoring and Observability for Load Balancers
A load balancer occupies a uniquely privileged position in the network -- all traffic flows through it -- making it an invaluable source of observability data. Effective monitoring of load balancers is essential for detecting problems, understanding performance characteristics, and capacity planning.
Key Metrics to Monitor
Request rate: The total number of requests per second flowing through the load balancer. This is the most fundamental capacity metric, and tracking it over time reveals traffic patterns, growth trends, and anomalies.
Error rate: The percentage of requests resulting in errors, broken down by error type: 4xx (client errors) vs. 5xx (server errors). A sudden spike in 5xx errors often indicates a backend failure, while a spike in 4xx errors might indicate a client-side issue or a bot attack.
Latency distribution: Not just average latency, but the full distribution -- especially p50 (median), p95, p99, and p99.9 percentiles. A load balancer that adds 1ms of latency at p50 but 500ms at p99 has a very different performance profile than one that adds 5ms consistently. Percentile tracking reveals long-tail latency issues that averages mask.
Active connections: The current number of active connections to the load balancer and to each backend server. This metric is essential for capacity planning and for detecting connection leaks or backends that are accumulating connections without closing them.
Backend health status: The current health status of each backend server, along with the history of health check transitions. Alert on any server becoming unhealthy, and especially on multiple servers becoming unhealthy simultaneously (which may indicate a systemic issue rather than an individual server problem).
Throughput: Bytes per second flowing through the load balancer, both inbound and outbound. This is particularly important for applications that serve large responses (file downloads, media streaming) where bandwidth, not request count, is the limiting factor.
Connection draining status: During deployments, monitoring the number of connections being drained and the drain duration helps verify that deployments are proceeding smoothly and that the drain timeout is appropriately configured.
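For the latency distribution in particular, percentiles must be computed from raw samples rather than averages. A simple nearest-rank sketch (with invented sample values) shows how the tail can look nothing like the median:

import math

def percentile(samples, p):
    """Nearest-rank percentile: the value below which roughly p percent
    of the sorted samples fall."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [4, 5, 5, 6, 7, 9, 12, 15, 48, 310]  # toy samples
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
# The median is single-digit milliseconds while the tail is hundreds -- the average hides this.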
Observability Best Practices
- Export metrics to a time-series database (Prometheus, Datadog, CloudWatch) for historical analysis and alerting
- Use distributed tracing (Jaeger, Zipkin, AWS X-Ray) to track requests end-to-end through the load balancer and into backend services, enabling precise identification of latency sources
- Log access events at the load balancer level for security auditing, troubleshooting, and compliance. Most load balancers can log the client IP, requested URL, selected backend, response code, and latency for every request
- Set alerts on error rate thresholds (e.g., 5xx rate exceeding 1%), latency thresholds (e.g., p99 exceeding 500ms), and backend health changes
- Dashboard the essentials: a single-pane view showing request rate, error rate, latency percentiles, and backend health is the most valuable operational tool for any team running load-balanced services
Real-World Architecture Examples
E-Commerce Platform
A large e-commerce site might deploy the following architecture:
- GeoDNS (Route 53 or Cloudflare) directs users to the nearest regional datacenter
- L4 load balancer (AWS NLB or IPVS) distributes connections across a pool of L7 load balancers
- L7 load balancer (Nginx or ALB) terminates SSL, performs content-based routing:
  - /api/* routes to API server pool (weighted least connections)
  - /static/* routes to CDN origin or static asset servers
  - /checkout/* routes to checkout service pool with session affinity (cookie-based)
- Internal service mesh (Envoy/Istio) handles east-west traffic between API servers, inventory service, payment service, recommendation engine, etc.
- Health checks at every layer: TCP checks on NLB, HTTP checks on ALB, and gRPC health checks within the service mesh
SaaS API Platform
A multi-tenant SaaS API platform might use:
- Anycast or GeoDNS for global entry point distribution
- L7 load balancer (ALB or Envoy) performing:
  - Host-based routing: tenant-a.api.example.com vs. tenant-b.api.example.com
  - Header-based routing: API version (Accept-Version: v2) to different backend pools
  - Rate limiting per tenant (enforced at the load balancer level)
- Weighted routing for canary deployments: 95% of traffic to the stable version, 5% to the canary
- Connection draining with a 60-second timeout during rolling deployments
Real-Time Messaging System
A chat or messaging system with persistent WebSocket connections might use:
- L4 load balancer (NLB) because WebSocket connections are long-lived and the load balancer does not need to inspect individual messages after the initial HTTP upgrade
- Session affinity based on IP hash or a connection token, because WebSocket connections are inherently stateful
- Aggressive health checking (5-second interval) because a failed server means disconnected users
- Extended drain timeout (30 minutes or more) because WebSocket connections may last for hours
Advanced Topics and Emerging Patterns
Rate Limiting and Traffic Shaping
Modern L7 load balancers can implement sophisticated rate limiting policies: per-client, per-endpoint, per-API-key, with configurable burst allowances and multiple time windows. This turns the load balancer into a first line of defense against abuse, DDoS attacks, and misbehaving clients, protecting backend servers from traffic they cannot handle.
Circuit Breaking
Borrowed from electrical engineering, circuit breaking is a pattern where the load balancer stops sending traffic to a backend that is failing, giving it time to recover rather than continuing to send requests that will inevitably fail. Envoy implements circuit breaking with configurable thresholds for maximum connections, pending requests, retries, and concurrent requests per backend.
Retry Budgets and Retry Storms
When a backend server fails, load balancers (and clients) often retry the request on a different server. While retries improve reliability for transient failures, aggressive retry policies can create retry storms that amplify load during outages -- exactly when servers can least afford additional traffic. Modern load balancers implement retry budgets that limit the total percentage of requests that can be retries, preventing cascading failures.
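A retry budget can be as simple as a ratio check. Production implementations typically measure the ratio over a sliding window of recent traffic, but the sketch below conveys the core idea; the 20 percent figure is only an example:

class RetryBudget:
    """Allows retries only while they stay under a fixed fraction of total
    requests, capping the extra load that retries can generate during an outage."""

    def __init__(self, max_retry_ratio=0.2):
        self.max_retry_ratio = max_retry_ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        if self.requests == 0:
            return False
        return (self.retries + 1) / self.requests <= self.max_retry_ratio

    def record_retry(self):
        self.retries += 1

budget = RetryBudget(max_retry_ratio=0.2)
for _ in range(100):
    budget.record_request()
print(budget.can_retry())  # True: one retry against 100 requests is within the 20% budget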
WebAssembly Extensions
An emerging trend in load balancer extensibility is the use of WebAssembly (Wasm) to implement custom logic. Envoy pioneered this approach, allowing operators to write custom filters in languages like Rust or Go, compile them to Wasm, and deploy them to the load balancer without modifying Envoy itself. This enables custom authentication, transformation, and routing logic with near-native performance and strong sandboxing.
HTTP/3 and QUIC
The adoption of HTTP/3, built on the QUIC protocol, introduces new challenges and opportunities for load balancing. QUIC operates over UDP rather than TCP, uses connection IDs rather than the traditional 4-tuple for connection identification, and supports connection migration (a client can change IP addresses without restarting the connection). Load balancers must understand QUIC connection IDs to correctly route packets for existing connections, and L4 load balancers must be updated to handle UDP-based HTTP traffic.
Common Pitfalls and Operational Wisdom
Even with a thorough understanding of load balancing concepts, production deployments frequently encounter pitfalls that are worth cataloging:
Forgetting to preserve the client IP address. When a load balancer forwards a request to a backend server, the backend sees the load balancer's IP address as the source, not the original client's IP. For logging, rate limiting, and geolocation, the original client IP must be preserved -- typically via the X-Forwarded-For HTTP header (for L7) or the PROXY protocol (for L4). Failing to configure this correctly is one of the most common load balancing mistakes.
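On the backend, recovering the original client address usually means parsing X-Forwarded-For and trusting only the entries appended by proxies you control, since clients can send the header themselves. A simplified sketch, assuming a known number of trusted proxy hops:

def client_ip_from_xff(xff_header, trusted_proxy_count=1):
    """X-Forwarded-For is a comma-separated list; each proxy appends the address
    it saw. Trust only the entries added by your own proxies and take the
    address immediately to their left."""
    hops = [h.strip() for h in xff_header.split(",") if h.strip()]
    if len(hops) <= trusted_proxy_count:
        return hops[0] if hops else None
    return hops[-(trusted_proxy_count + 1)]

print(client_ip_from_xff("203.0.113.7, 10.0.0.5"))  # '203.0.113.7' with one trusted proxy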
Setting health check intervals too aggressively. Health checks that run every second with a single-failure threshold will cause servers to flap in and out of the pool due to transient network issues. Conversely, checks that run every 60 seconds with a high failure threshold will take minutes to detect a failed server. Finding the right balance requires understanding your specific reliability requirements and network characteristics.
Ignoring the load balancer's own resource limits. A load balancer is a server too, and it has finite CPU, memory, and network capacity. Particularly for L7 load balancers performing SSL termination, CPU can become a bottleneck at high traffic volumes. Monitor the load balancer itself, not just the backends.
Using sticky sessions as a crutch instead of fixing the underlying architecture. Session affinity is a legitimate tool, but relying on it to work around a fundamentally stateful application architecture creates fragility. The long-term solution is usually to externalize state to a shared store, allowing any server to handle any request.
Neglecting connection draining during deployments. Removing a server from the pool without draining active connections results in dropped requests -- exactly the kind of user-facing error that load balancing is supposed to prevent. Always configure appropriate drain timeouts and verify they are working during deployment rehearsals.
Over-relying on DNS-based load balancing without understanding TTL behavior. DNS records have TTLs, and clients (and intermediate resolvers) cache DNS responses for the duration of the TTL. If you remove a server and update DNS, clients may continue sending traffic to the removed server until their cached DNS records expire. For this reason, DNS-based load balancing is best used for coarse-grained geographic routing, not for rapid failover.
References and Further Reading
"Introduction to Modern Network Load Balancing and Proxying" by Matt Klein (creator of Envoy) -- A widely cited blog post covering load balancing fundamentals: https://blog.envoyproxy.io/introduction-to-modern-network-load-balancing-and-proxying-a57f6ff80236
HAProxy Documentation -- The official documentation for HAProxy, with detailed coverage of configuration, algorithms, and operational best practices: https://docs.haproxy.org/
Nginx Load Balancing Guide -- Nginx's official guide to HTTP load balancing configuration: https://docs.nginx.com/nginx/admin-guide/load-balancer/http-load-balancer/
AWS Elastic Load Balancing Documentation -- Comprehensive documentation covering ALB, NLB, and GWLB: https://docs.aws.amazon.com/elasticloadbalancing/
"The Power of Two Random Choices" by Michael Mitzenmacher -- The foundational research paper on the two-random-choices load balancing technique: https://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf
Envoy Proxy Documentation -- Architecture overview, load balancing, health checking, and circuit breaking: https://www.envoyproxy.io/docs/envoy/latest/
Google Cloud Load Balancing Documentation -- Overview of GCP's global and regional load balancing options: https://cloud.google.com/load-balancing/docs/load-balancing-overview
"Maglev: A Fast and Reliable Software Network Load Balancer" -- Google's paper on their custom L4 load balancer, published at NSDI 2016: https://research.google/pubs/pub44824/
Traefik Documentation -- The official documentation for Traefik, with coverage of automatic service discovery and Kubernetes integration: https://doc.traefik.io/traefik/
Istio Service Mesh Documentation -- Architecture and traffic management for Istio, including load balancing within the mesh: https://istio.io/latest/docs/
"Building Secure and Reliable Systems" by Heather Adkins et al. (O'Reilly, Google SRE series) -- Chapters on load balancing, health checking, and graceful degradation in large-scale systems: https://sre.google/books/
RFC 7230 - HTTP/1.1 Message Syntax and Routing -- The specification that defines how HTTP proxies and intermediaries (including load balancers) should handle HTTP traffic: https://datatracker.ietf.org/doc/html/rfc7230
Azure Load Balancer Documentation -- Microsoft's documentation for Azure Load Balancer and Application Gateway: https://learn.microsoft.com/en-us/azure/load-balancer/
"Designing Data-Intensive Applications" by Martin Kleppmann (O'Reilly) -- Chapter 6 on partitioning and Chapter 9 on consistency, with extensive discussion of load distribution in distributed systems: https://dataintensive.net/