Gobalancer

Gobalancer is a distributed HTTP load-balancer, written in Go. Gobalancer uses a least-busy worker policy to route requests, and can re-assign workers on the fly based on the observed load.

Check it out!

Final Report

Summary

In this project, we explore the opportunity of using parallelism to speed up a common task: balancing load generated by many clients across worker machines. The main goal is to create a distributed load balancer that scales linearly as more nodes are added, while maintaining a least-busy-first routing policy.

Background

How it's done

Load balancing is a difficult but mostly solved problem. In the real world, companies use either hardware load balancers or software solutions. These days, commodity boxes are fast enough that hardware solutions are becoming less and less attractive, especially as more logic is pushed onto the load balancer. Common setups are:

Caching reverse proxy, lightweight web server

A caching reverse proxy (Varnish) that serves static and cached assets, in front of a web server (often Nginx) that manages a pool of upstream app servers to render dynamic content. By itself, Nginx doesn't have a very flexible load balancing mechanism: it's mostly round robin, with the possibility of assigning weights to upstream servers and setting conditions to determine when a server becomes unresponsive. As such, Nginx cannot handle elasticity, but it can be extended with embedded Lua scripts that could presumably be used to provide such functionality.

Dedicated software packages

Using HAProxy, a software package specifically designed for load balancing. It offers very high performance (conditioned on appropriate tuning of the underlying operating system), a wide variety of load balancing algorithms, SSL termination, and TCP or HTTP load balancing. HAProxy is based on the event loop model, whereby a single, highly optimized loop does all the processing on a single core. It is possible to run HAProxy in a multi-process fashion, but many features are unavailable in this mode, and additional cores are often used simply to handle SSL. Out of the box, HAProxy does not provide any elasticity features: it's up to the user to monitor the statistics HAProxy reports and use some other mechanism to scale in or out and update the HAProxy configuration.

SaaS solutions

Another very common option is to use the Elastic Load Balancing service provided by Amazon. This service provides both load balancing and elasticity, for a fee. One can define metrics and thresholds that trigger scaling events, which in turn launch or destroy EC2 virtual machines.

There are many variations based around these three setups, with more caching layers or other common techniques.

The Heroku blowout

Another interesting bit of information is what happened recently with the Heroku routing mesh. Heroku used to have a "smart" routing layer, where traffic would be directed to the least loaded machine in the pool. A re-engineering of the system changed this property, causing a major outcry and prompting Heroku to give much more detailed explanations of what goes on under the hood. As it turns out, the different machines in the routing/load balancing layer do not share their state, so while one router might be aware that a particular app server is bogged down and shouldn't be sent more traffic, other routers in the routing mesh are oblivious to that reality and keep sending it traffic, bogging the overloaded machine down further. To avoid that, routers in the routing layer would have to share state, which is a difficult engineering challenge.

Parallel vs Serial Balancing

The authors of HAProxy, the de facto standard load balancer, maintain that the reasons not to parallelize a load balancer are threefold:

  • It would be bogged down by lock contention
  • It would be slowed down by the logic of the system thread scheduler
  • It would be limited by the memory of the system
While all three of these do present challenges, the fact that machines gain more cores, better scheduling algorithms, and more memory every year, all while processor clock rates have remained static since hitting the Power Wall, makes changing that outlook and taking a parallel approach to this problem seem all the more appealing.

Data Structures and Contention

One of the biggest issues with implementing a parallel load balancer is keeping contention to a minimum while still maintaining reasonable speed. The load balancer needs an efficient data structure to store workers that lets it add new workers, remove workers that fail, and query which workers currently have the lowest workload, all in reasonable time. Unfortunately, many of the most common choices do not lend themselves to being shared by processes simultaneously attempting to modify the data structure. Luckily, some locked structures do not exhibit terrible slowdown under contention, and in recent years there have been many papers on lock-free data structures that have yet to see use in industry.

Memory usage

Another issue plaguing parallel load balancers is memory: even though handling HTTP requests is embarrassingly parallel, each thread spawned to handle a request requires its own memory, which can become a bottleneck. Additionally, a queue is needed somewhere in the system, either on the load balancer or on the workers, to ensure that as few messages as possible are dropped during bursts of HTTP traffic. Both queue locations have their advantages: if the message queue is on the load balancer, messages can be doled out to workers as they become available, ensuring that no message gets stuck behind one that takes abnormally long; having queues on the workers means they do not have to wait one round-trip time (RTT) between requests, letting them get through many small requests more quickly.

Approach

The following diagram summarizes our architecture:

[System diagram]

Load balancer

Gobalancer is essentially a modified version of the standard Go HTTP server. The Go HTTP server has a somewhat unusual way of doing things: each time a request is accepted, a new goroutine [1] is spawned to handle the processing. Each of these goroutines is in charge of the following steps (a short sketch follows the list):

  1. Picking the least busy worker from the queue
  2. Connecting to it and sending the request
  3. Getting the finished response from the worker
  4. Sending the data from the response back to the client
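
To make those four steps concrete, here is a minimal sketch of what each request-handling goroutine might look like. It is our illustration, not the actual Gobalancer code: the pickLeastBusy stand-in, the port numbers, and the use of http.Get for forwarding are simplifying assumptions (the real implementation pops a worker from the heap described below).

```go
package main

import (
	"io"
	"log"
	"net/http"
)

// pickLeastBusy is a stand-in for the heap-backed worker pool sketched in the
// next section; here it always returns the same (hypothetical) backend.
func pickLeastBusy() (addr string, release func()) {
	return "localhost:9000", func() {}
}

// handle runs in its own goroutine for every accepted request and mirrors
// the four steps listed above.
func handle(w http.ResponseWriter, r *http.Request) {
	// 1. Pick the least busy worker.
	addr, release := pickLeastBusy()
	defer release()

	// 2. Connect to it and forward the request.
	resp, err := http.Get("http://" + addr + r.URL.RequestURI())
	if err != nil {
		http.Error(w, "upstream error", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	// 3-4. Copy the worker's response headers and body back to the client.
	for k, vs := range resp.Header {
		for _, v := range vs {
			w.Header().Add(k, v)
		}
	}
	w.WriteHeader(resp.StatusCode)
	if _, err := io.Copy(w, resp.Body); err != nil {
		log.Println("copying response:", err)
	}
}

func main() {
	http.HandleFunc("/", handle)
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```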

The queue itself is implemented using a simple binary heap contained within a Go slice. To ensure thread-safety, we started by simply using a global mutex to lock the whole heap whenever it needed to be mutated. Our plan was to move on to fine-grained locking or even a lock-free data structure later on, but we came to realize that in practice, our approach was more than fast enough. On average, grabbing a worker from the heap, updating it, and putting it back takes on the order of 10 to 20 microseconds. To be more precise, we use the delete-min, insert, and remove operations on the heap, each of which is O(log n) (peeking at the minimum is O(1)). As it turns out, this is only a small fraction of the processing time for a request, which is mostly dominated by copying data from and back to the client.
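
For illustration, here is a minimal sketch of such a mutex-protected binary heap built on Go's container/heap package; this is the pool that the pickLeastBusy stand-in above alludes to. The Pool, Acquire, and Release names and the Load field are our assumptions for the example, not necessarily the actual Gobalancer structure.

```go
package pool

import (
	"container/heap"
	"sync"
)

// Worker is a backend together with its current number of in-flight requests.
type Worker struct {
	Addr  string
	Load  int // in-flight requests; the heap key
	index int // position in the heap, maintained by the heap package
}

// workerHeap implements heap.Interface, ordered by Load (least busy first).
type workerHeap []*Worker

func (h workerHeap) Len() int           { return len(h) }
func (h workerHeap) Less(i, j int) bool { return h[i].Load < h[j].Load }
func (h workerHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i]; h[i].index = i; h[j].index = j }

func (h *workerHeap) Push(x interface{}) {
	w := x.(*Worker)
	w.index = len(*h)
	*h = append(*h, w)
}

func (h *workerHeap) Pop() interface{} {
	old := *h
	n := len(old)
	w := old[n-1]
	*h = old[:n-1]
	return w
}

// Pool guards the heap with a single global mutex, as described above.
type Pool struct {
	mu      sync.Mutex
	workers workerHeap
}

// Add registers a new worker (O(log n)).
func (p *Pool) Add(w *Worker) {
	p.mu.Lock()
	heap.Push(&p.workers, w)
	p.mu.Unlock()
}

// Acquire pops the least busy worker, bumps its load, and re-inserts it
// (the delete-min + insert pair described above).
func (p *Pool) Acquire() *Worker {
	p.mu.Lock()
	defer p.mu.Unlock()
	w := heap.Pop(&p.workers).(*Worker)
	w.Load++
	heap.Push(&p.workers, w)
	return w
}

// Release records that a request to w has finished and restores heap order.
func (p *Pool) Release(w *Worker) {
	p.mu.Lock()
	defer p.mu.Unlock()
	w.Load--
	heap.Fix(&p.workers, w.index)
}
```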

Gobalancer itself consists of multiple nodes; each node is aware of the subset of workers it is in charge of, and each node spawns thousands of goroutines to do the above. Because work can become imbalanced amongst the Gobalancer nodes, we included a facility for nodes to pass workers amongst themselves based on their respective workloads, in an attempt to decrease the average response latency of the system as a whole. While the loads on the different nodes are not expected to differ much, since DNS round robin can distribute requests evenly, having the ability to account for such imbalances seemed necessary.
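
Below is a deliberately simplified sketch of what such a worker hand-off could look like, continuing the pool sketch above (it additionally imports "time"). The Peer interface, the 5-second period, and the 10% margin are purely illustrative assumptions; the report does not specify the actual protocol or thresholds.

```go
// Peer abstracts another Gobalancer node in the ring; Load and GiveWorker
// are assumed RPC-style methods, not Gobalancer's actual wire protocol.
type Peer interface {
	Load() int                   // the peer's total in-flight requests
	GiveWorker() (*Worker, bool) // ask the peer to hand over one worker
}

// TotalLoad sums in-flight requests across this node's workers.
func (p *Pool) TotalLoad() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	total := 0
	for _, w := range p.workers {
		total += w.Load
	}
	return total
}

// rebalance periodically compares this node's load with a peer's and pulls
// one worker over when this node is noticeably idler.
func (p *Pool) rebalance(peer Peer) {
	for range time.Tick(5 * time.Second) {
		if float64(p.TotalLoad()) < 0.9*float64(peer.Load()) {
			if w, ok := peer.GiveWorker(); ok {
				p.Add(w)
			}
		}
	}
}
```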

Workers

Although the workers were not the main focus of our project, they are an essential component, as they allow us to test Gobalancer. In the real world, the load balancer would be put in front of hundreds, possibly thousands of nodes. For obvious reasons, we simply do not have access to that many physical machines for our final project. On the other hand, the workers do not have to perform any useful work for the purpose of load testing Gobalancer; they merely have to simulate a real-world load.

To achieve that goal, we implemented workers that simulate a real-world application while using only a fraction of the resources, which allows us to run a large number of workers on a single machine. To simulate the load, the workers sample from a Weibull distribution (with parameters customizable on the command line) to determine both how long to sleep (simulating processing) and how much data to return (the assumption being that longer-running requests return more data). The Weibull distribution was selected because Rap Genius, the site that raised the flag on Heroku's scaling practices, noted that it very accurately matches their request response times and provided suggested parameters. It resembles a skewed normal distribution with a long tail.
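
As an illustration, here is a minimal sketch of such a simulated worker. Go's standard library has no Weibull sampler, so the sketch draws samples via inverse-transform sampling; the flag names, default parameters, and payload-size scaling are our assumptions, not the values suggested by Rap Genius.

```go
package main

import (
	"flag"
	"fmt"
	"log"
	"math"
	"math/rand"
	"net/http"
	"strings"
	"time"
)

// Command-line parameters of the Weibull distribution (defaults are placeholders).
var (
	shape = flag.Float64("shape", 0.8, "Weibull shape parameter k")
	scale = flag.Float64("scale", 50, "Weibull scale parameter lambda, in ms")
)

// weibull draws one sample by inverse-transform sampling:
// X = lambda * (-ln(1-U))^(1/k) for U uniform on [0, 1).
func weibull(shape, scale float64) float64 {
	return scale * math.Pow(-math.Log(1-rand.Float64()), 1/shape)
}

func handle(w http.ResponseWriter, r *http.Request) {
	ms := weibull(*shape, *scale)

	// Simulate processing time by sleeping for the sampled duration...
	time.Sleep(time.Duration(ms * float64(time.Millisecond)))

	// ...and return a payload whose size grows with that duration, on the
	// assumption that longer-running requests return more data.
	fmt.Fprint(w, strings.Repeat("x", int(ms*100)))
}

func main() {
	flag.Parse()
	http.HandleFunc("/", handle)
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```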

Clients

The clients are the second component of our testing rig. We surveyed a number of load-testing tools, including Twitter's Iago, but eventually decided to go with our own, simpler solution. We wrote a simple Go program that spawns a number of goroutines (specified on the command line), and each of those goroutines enters the following loop:

  1. Check if a stop signal was received, if so, return
  2. Send an HTTP GET request to the specified target
  3. Read the response
All the clients run for a specified amount of time. When that time has elapsed, the stop signal is sent to all the goroutines.
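
A minimal version of such a client might look like the sketch below; the flag names and defaults are ours, not necessarily those of the actual tool.

```go
package main

import (
	"flag"
	"io"
	"io/ioutil"
	"log"
	"net/http"
	"sync"
	"time"
)

var (
	target      = flag.String("target", "http://localhost:8000/", "URL to load test")
	concurrency = flag.Int("c", 100, "number of concurrent client goroutines")
	duration    = flag.Duration("d", 30*time.Second, "how long to run")
)

func main() {
	flag.Parse()

	stop := make(chan struct{})
	var wg sync.WaitGroup

	for i := 0; i < *concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				// 1. Check if a stop signal was received; if so, return.
				select {
				case <-stop:
					return
				default:
				}

				// 2. Send an HTTP GET request to the specified target.
				resp, err := http.Get(*target)
				if err != nil {
					log.Println("request failed:", err)
					continue
				}

				// 3. Read (and discard) the response.
				io.Copy(ioutil.Discard, resp.Body)
				resp.Body.Close()
			}
		}()
	}

	// Let the clients run for the specified duration, then signal them to stop.
	time.Sleep(*duration)
	close(stop)
	wg.Wait()
}
```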

Collecting statistics

Collecting statistics about our system with statsd is really easy. However, doing it in a manner that doesn't slow down execution is a lot trickier: in our case, the useful statistics had to be collected in the hot paths of the codebase. At first, we just used a statsd client implementation in Go that we found on GitHub. This slowed down the system significantly, mainly because that particular implementation takes a lock to protect writes to the UDP socket. It also makes heavy use of Sprintf, which is slow and, in our case, unnecessary. We solved this by getting rid of Sprintf and switching to a mix of string concatenation and lower-level primitives like Itoa. To work around the lock contention, we spawned a goroutine dedicated to collecting statistics and created a Go channel with a large (one million element) buffer. This goroutine continuously reads statistics from the channel and writes them periodically to statsd, while all the other goroutines simply post their statistics to the channel without blocking (unless the channel backs up to a million elements, which does not happen).
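
Here is a sketch of that pattern; the package layout, counter-only metric format, statsd address, and one-second flush interval are our simplifications, not the exact Gobalancer client.

```go
package stats

import (
	"log"
	"net"
	"strconv"
	"time"
)

// sample is one statsd counter increment; the wire line is built with plain
// concatenation and strconv instead of Sprintf.
type sample struct {
	name  string
	value int64
}

// queue is the large buffered channel (one million elements) that the
// request-handling goroutines post to.
var queue = make(chan sample, 1000000)

// Count records a sample without slowing the hot path; it only blocks if the
// buffer backs up to a million elements.
func Count(name string, value int64) {
	queue <- sample{name, value}
}

// flush runs as the single dedicated goroutine that owns the UDP socket, so
// no lock is needed around writes. It drains the channel and ships batches
// to statsd roughly once per second, e.g. go flush("statsd.example.com:8125").
func flush(addr string) {
	conn, err := net.Dial("udp", addr)
	if err != nil {
		log.Println("statsd unreachable:", err)
		return
	}
	buf := make([]byte, 0, 1400) // keep each datagram under a typical MTU
	tick := time.Tick(time.Second)
	for {
		select {
		case s := <-queue:
			buf = append(buf, s.name+":"+strconv.FormatInt(s.value, 10)+"|c\n"...)
			if len(buf) > 1200 {
				conn.Write(buf)
				buf = buf[:0]
			}
		case <-tick:
			if len(buf) > 0 {
				conn.Write(buf)
				buf = buf[:0]
			}
		}
	}
}
```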

Results

Experimental setup

To test Gobalancer under load, we set up the following experiment on the GHC machines at CMU:
Workers
We launched 2,500 workers across 30 machines. That amounts to ~80 workers per machine. This may sound like a lot, but remember that our workers only simulate a real-world load and in reality they spend most of their time sleeping.
Gobalancer
For this experiment, we set up Gobalancer with varying numbers of nodes, from 1 to 5. The goal was to show that we could increase throughput linearly by adding more nodes. Each Gobalancer node was set up on its own machine, separate from the worker machines. Ideally, each node would be selected in a round-robin fashion by the DNS server, but for this experiment we manually spread the load roughly equally across the nodes.
Clients
To generate load for Gobalancer, we used the remaining ~40 unused machines in the GHC cluster. Depending on the number of nodes in the Gobalancer cluster, we launched from 500 to 2000 concurrent clients on these machines. We would have liked to launch more, but because of limitations on the number of available file descriptors on the GHC machines, the load balancing nodes would run out and start refusing new connections.
Statistics collection server
We set up a virtual machine on EC2 with statsd [2] and Graphite [3] to collect statistics for the test runs. A sample graph is shown below. This allows us to monitor the state of the cluster in near real time (~10 s delay).

Results

A graph summarizing the throughput of Gobalancer is presented below:

[Scaling graph]

As we can see, the throughput of the system scales linearly with the number of nodes. This is the result we were hoping for.

Below are some sample graphs from our statsd+graphite setup:

[Stats graph]

On this graph, we can see that on average during this test run, the Gobalancer nodes were able to process each request in 200 to 300 microseconds. This includes selecting the least busy worker, preparing the request to be sent to the worker, and copying the headers and body of the response back to the client. Not shown on this graph is the time needed for the worker to process the request, which is in the tens of milliseconds.

[Gobalancer: 1 node] [Gobalancer: 2 nodes] [Gobalancer: 9 nodes]

Above are some of the graphs we use to monitor the state of Gobalancer during load testing (click for larger versions). From left to right, the graphs show:

  • Global throughput of the system (requests/second): how much traffic our load balancer can handle
  • Throughput for each node (requests/second): this shows that all nodes are handling similar amounts of traffic
  • Worker latency: how long, on average, a worker takes to respond to a request (ms)
  • Number of workers in each node's pool: useful for monitoring how the workers are redistributed amongst the nodes
As we can see, especially on the 9-node run, the worker latencies remain roughly constant, albeit at different levels. We believe this could be due to the physical locations of the machines on the network with respect to each other. Another explanation is that, since the GHC machines are shared by many people, the load balancer nodes showing higher latency might simply have been under heavier load at the time.

Raw performance ceiling

As seen in the graphs above, our single-box setup did not reach the 40,000 requests/s that HAProxy achieves. We believe this is due in part to being unable to fine-tune the underlying OS on the GHC machines (we experienced various errors such as "Too many open files" even at moderate loads), and in part to not having had time to perform extensive optimizations. There is also the fact that Go is a garbage-collected language, and garbage collection pauses hurt performance. Having gotten within a factor of 2 of HAProxy, we felt our time was best spent distributing the load balancer across machines.

Limitations

  • The ring topology will have limited ability to scale, but we believe this problem could be fairly easily overcome should the need arise
  • We didn't do any extra work to make the ring fault tolerant. As things stand, if a node crashes, the others will continue to operate without the ability to share workers, and the crashed node's workers will become unavailable
  • We didn't go out of our way to harden this system. There are likely a number of edge cases that are not covered; our tests consisted only of GET requests against a known backend. In other words, it's not production ready.

References

Notes

[1] Goroutine
From the Go documentation: A goroutine has a simple model: it is a function executing concurrently with other goroutines in the same address space. It is lightweight, costing little more than the allocation of stack space. And the stacks start small, so they are cheap, and grow by allocating (and freeing) heap storage as required.
[2] Statsd
From the statsd documentation: A network daemon that runs on the Node.js platform and listens for statistics, like counters and timers, sent over UDP, and sends aggregates to one or more pluggable backend services (e.g., Graphite).
[3] Graphite
From the Graphite documentation: Graphite is a highly scalable real-time graphing system. As a user, you write an application that collects numeric time-series data that you are interested in graphing, and send it to Graphite's processing backend, carbon, which stores the data in Graphite's specialized database. The data can then be visualized through Graphite's web interfaces.

List of work by each student

Equal work was performed by both team members.