Gobalancer is a distributed HTTP load balancer written in Go. It routes requests using a least-busy worker policy, and can reassign workers on the fly based on the observed load. Check it out!
In this project, we explore the opportunity of using parallelism to speed up a common task: balancing the load generated by many clients across worker machines. The main goal is to build a distributed load balancer that scales linearly as more nodes are added, while maintaining a least-busy-first routing policy.
Load balancing is a difficult but largely solved problem. In the real world, companies use either hardware load balancers or software solutions. These days, commodity machines are fast enough that hardware solutions are becoming less and less attractive, especially as more logic is pushed onto the load balancer. Common setups are:
A caching reverse proxy (Varnish) that serves static and cached assets, in front of a web server (often Nginx) that manages a pool of upstream app servers to render dynamic content. On its own, Nginx does not have a very flexible load-balancing mechanism: it is mostly round robin, with the ability to assign weights to upstream servers and to set conditions for marking a server unresponsive. As such, Nginx cannot handle elasticity, though it can be extended with embedded Lua scripts that could presumably provide such functionality.
Using HAProxy, a software package designed specifically for load balancing. It offers very high performance (provided the underlying operating system is tuned appropriately), a wide variety of load-balancing algorithms, SSL termination, and both TCP and HTTP load balancing. HAProxy is built on the event-loop model: a single, highly optimized loop does all the processing on a single core. It is possible to run HAProxy in a multi-process fashion, but many features are unavailable in this mode, and additional cores are often used simply to handle SSL. Out of the box, HAProxy does not provide any elasticity features; it is up to the user to monitor the statistics HAProxy reports and use some other mechanism to scale in or out and update the HAProxy configuration.
Another very common option is to use the Elastic Load Balancing service provided by Amazon. This service provides both load balancing and elasticity, for a fee. One can define metrics and thresholds that trigger scaling events, which in turn launch or destroy EC2 virtual machines.
There are many variations based around these three setups, with more caching layers or other common techniques.
Another interesting bit of information is what happened recently with the Heroku routing mesh. Heroku used to have a "smart" routing layer, in which traffic was directed to the least loaded machine in the pool. A re-engineering of the system changed this property, causing a major outcry and prompting Heroku to explain in much more detail what goes on under the hood. As it turns out, the different machines in the routing/load-balancing layer do not share their state: while one router might be aware that a particular app server is bogged down and should not be sent more traffic, other routers in the mesh are oblivious to that fact and keep sending traffic, bogging down the overloaded machine further. To avoid this, routers in the routing layer would have to share state, which is a difficult engineering challenge.
HAProxy, the de facto standard load balancer, maintains that the reasons not to parallelize a load balancer are threefold:
One of the biggest issues in implementing a parallel load balancer is keeping contention to a minimum while still maintaining reasonable speed. The load balancer needs an efficient data structure for storing workers, one that lets it add new workers, remove workers that fail, and query which workers currently have the lowest workload, all in reasonable time. Unfortunately, many of the most common choices do not lend themselves to being shared by processes simultaneously attempting to modify the structure. Luckily, some locked structures do not exhibit terrible slowdown under contention, and in recent years there have been many papers on lock-free data structures that have yet to see use in industry.
Another issue plaguing parallel load balancers is memory. Even though handling HTTP requests is embarrassingly parallel, each thread spawned to handle a request requires its own memory, which can become a bottleneck for the system. Additionally, a queue is needed somewhere in the system, either on the load balancer or on the workers, to ensure that as few messages as possible are dropped during bursts of HTTP traffic. Both queue locations have their advantages: if the queue sits on the load balancer, messages can be doled out to workers as they become available, ensuring that no message gets stuck behind one that takes abnormally long; if the queues sit on the workers, the workers do not have to wait one round-trip time (RTT) between requests, enabling them to get through many small requests more quickly.
The following diagram summarizes our architecture:
Gobalancer is essentially a modified version of the standard Go HTTP server. The Go HTTP server has a somewhat unusual way of doing things: each time a request is accepted, a new goroutine is spawned to handle the processing. Each of these goroutines is in charge of:
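The goroutine-per-request model is easy to observe directly. The sketch below (our own illustration using Go's httptest package, not Gobalancer's code) fires several concurrent requests at a server whose handler sleeps briefly while tracking how many copies of itself are running at once; because net/http serves each request in its own goroutine, the handlers overlap:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
	"time"
)

// measurePeak starts a test server, issues `requests` concurrent GETs,
// and returns the peak number of handler invocations that were running
// simultaneously. A peak above one demonstrates goroutine-per-request.
func measurePeak(requests int) int64 {
	var inFlight, peak int64
	handler := func(w http.ResponseWriter, r *http.Request) {
		n := atomic.AddInt64(&inFlight, 1)
		for { // record the high-water mark of concurrent handlers
			p := atomic.LoadInt64(&peak)
			if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
				break
			}
		}
		time.Sleep(50 * time.Millisecond) // keep the handler busy so requests overlap
		atomic.AddInt64(&inFlight, -1)
		fmt.Fprint(w, "done")
	}
	srv := httptest.NewServer(http.HandlerFunc(handler))
	defer srv.Close()

	var wg sync.WaitGroup
	for i := 0; i < requests; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Get(srv.URL)
			if err == nil {
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println("peak concurrent handlers:", measurePeak(4))
}
```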
The queue itself is implemented as a simple binary heap stored in a Go slice. To ensure thread safety, we started by using a global mutex to lock the whole heap whenever it needed to be mutated. Our plan was to move on to fine-grained locking or even a lock-free data structure later, but we came to realize that in practice our approach was plenty fast. On average, grabbing a worker from the heap, updating it, and putting it back takes on the order of 10 to 20 microseconds. To be more precise, we use the delete-min, insert, and remove operations on the heap, each of which runs in O(log n). As it turns out, this is only a small fraction of the processing time for a request, which is dominated by copying data to and from the client.
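A minimal sketch of this design, a binary heap of workers guarded by a single global mutex, might look like the following (the `Worker` fields and `Pool` API are illustrative, not Gobalancer's actual code; the standard `container/heap` package supplies the heap operations):

```go
package main

import (
	"container/heap"
	"fmt"
	"sync"
)

// Worker tracks the number of in-flight requests assigned to it.
type Worker struct {
	Addr  string
	Load  int // in-flight requests
	index int // position in the heap, maintained via Swap
}

// workerHeap is a min-heap ordered by Load, backed by a Go slice.
type workerHeap []*Worker

func (h workerHeap) Len() int           { return len(h) }
func (h workerHeap) Less(i, j int) bool { return h[i].Load < h[j].Load }
func (h workerHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index, h[j].index = i, j
}
func (h *workerHeap) Push(x interface{}) {
	w := x.(*Worker)
	w.index = len(*h)
	*h = append(*h, w)
}
func (h *workerHeap) Pop() interface{} {
	old := *h
	w := old[len(old)-1]
	*h = old[:len(old)-1]
	return w
}

// Pool guards the heap with one global mutex: coarse-grained,
// but fast enough in practice (10-20 microseconds per operation).
type Pool struct {
	mu sync.Mutex
	h  workerHeap
}

func (p *Pool) Add(w *Worker) {
	p.mu.Lock()
	defer p.mu.Unlock()
	heap.Push(&p.h, w)
}

// Acquire pops the least-busy worker (delete-min), bumps its load,
// and reinserts it (insert), all under the lock.
func (p *Pool) Acquire() *Worker {
	p.mu.Lock()
	defer p.mu.Unlock()
	w := heap.Pop(&p.h).(*Worker)
	w.Load++
	heap.Push(&p.h, w)
	return w
}

// Release decrements a worker's load and restores heap order in place.
func (p *Pool) Release(w *Worker) {
	p.mu.Lock()
	defer p.mu.Unlock()
	w.Load--
	heap.Fix(&p.h, w.index)
}

func main() {
	p := &Pool{}
	p.Add(&Worker{Addr: "worker-1", Load: 2})
	p.Add(&Worker{Addr: "worker-2", Load: 0})
	w := p.Acquire() // least busy: worker-2
	fmt.Println(w.Addr, w.Load)
	p.Release(w)
}
```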
Gobalancer itself consists of multiple nodes. Each node is aware of the subset of workers it is in charge of, and each node spawns thousands of goroutines to do the above. Because work may be imbalanced across the Gobalancer nodes, we included a facility for nodes to pass workers among themselves based on their respective workloads, in an attempt to decrease the average response latency of the system as a whole. While we do not expect loads across nodes to differ much, since DNS round robin can distribute requests evenly, having the ability to account for such imbalances seemed necessary.
Although the workers were not the main focus of our project, they are an essential component, as they allow us to test Gobalancer. In the real world, the load balancer would be put in front of hundreds, possibly thousands of nodes. For obvious reasons, we simply do not have access to that many physical machines for our final project. On the other hand, the workers do not have to perform any useful work for the purpose of load-testing Gobalancer; they merely have to simulate a real-world load.
To achieve that goal, we implemented workers that simulate a real-world application while using only a fraction of the resources, which allows us to run a large number of workers on a single machine. To simulate load, each worker samples from a Weibull distribution (with parameters customizable on the command line) to determine both how long to sleep (simulating processing) and how much data to return (the assumption being that longer-running requests return more data). We selected the Weibull distribution because Rap Genius, the site that raised the flag on Heroku's scaling practices, noted that it fits their requests' response times very accurately, and provided some suggested parameters. The distribution resembles a skewed normal distribution with a long tail.
The clients are the second component of our testing rig. We surveyed a number of load-testing tools, including Twitter's Iago, but eventually decided to go with our own, simpler solution: a small Go program that spawns a number of goroutines (specified on the command line), each of which enters the following loop:
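A rough sketch of such a load generator follows. The per-goroutine request count, the stub server, and the latency bookkeeping are illustrative assumptions; the actual tool reads its parameters from the command line and targets Gobalancer itself:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// client issues GET requests against target in a loop, recording each
// request's latency on results.
func client(target string, n int, results chan<- time.Duration, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := 0; i < n; i++ {
		start := time.Now()
		resp, err := http.Get(target)
		if err != nil {
			continue
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
		resp.Body.Close()
		results <- time.Since(start)
	}
}

// runLoadTest spins up a stub server, hammers it with several client
// goroutines, prints the mean latency, and returns the completed count.
func runLoadTest(goroutines, perClient int) int {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()

	results := make(chan time.Duration, goroutines*perClient)
	var wg sync.WaitGroup
	for i := 0; i < goroutines; i++ {
		wg.Add(1)
		go client(srv.URL, perClient, results, &wg)
	}
	wg.Wait()
	close(results)

	count, total := 0, time.Duration(0)
	for d := range results {
		count++
		total += d
	}
	if count > 0 {
		fmt.Printf("%d requests, mean latency %v\n", count, total/time.Duration(count))
	}
	return count
}

func main() {
	runLoadTest(4, 10)
}
```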
Collecting statistics about our system with statsd is really easy. Doing it in a manner that does not slow down execution is a lot trickier: in our case, useful statistics had to be collected in the hot paths of the codebase. At first, we used a statsd client implementation in Go that we found on GitHub. It slowed the system down significantly, mainly because that particular implementation took a lock to protect writes to the UDP socket. It also made heavy use of Sprintf, which is slow and was actually unnecessary; we got rid of Sprintf in favor of a mix of string concatenation and lower-level primitives like Itoa. To work around the lock contention, we spawned a goroutine dedicated to collecting statistics and created a Go channel with a large buffer (one million elements). This goroutine continuously reads statistics from the channel and periodically writes them to statsd, while all the other goroutines simply post their statistics to the channel without blocking (unless the channel backs up to a million elements, which does not happen).
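The non-blocking channel pattern looks roughly like this. The metric name and buffer size are illustrative, and the demo counts the lines instead of flushing them to statsd over UDP:

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// statsCh is a buffered channel: hot-path goroutines post metric lines
// here without blocking. The real buffer holds a million elements; a
// smaller one is used for this demo.
var statsCh = make(chan string, 1000)

// record builds a statsd counter line with string concatenation and
// strconv.Itoa rather than Sprintf, then posts it without blocking:
// if the channel is backed up, the sample is dropped instead of
// stalling the request path.
func record(metric string, value int) {
	line := metric + ":" + strconv.Itoa(value) + "|c"
	select {
	case statsCh <- line:
	default: // channel full: drop the sample rather than block
	}
}

// drain plays the role of the dedicated collector goroutine: it reads
// lines off the channel (the real collector flushes them to statsd
// over UDP) and returns how many it saw once the channel goes quiet.
func drain() int {
	count := 0
	for {
		select {
		case <-statsCh:
			count++
		case <-time.After(100 * time.Millisecond):
			return count
		}
	}
}

func main() {
	for i := 0; i < 500; i++ {
		record("gobalancer.requests", 1)
	}
	fmt.Println("collected", drain(), "samples")
}
```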
A graph summarizing the throughput of Gobalancer is presented below. As we can see, the throughput of the system scales linearly with the number of nodes, which is the result we were hoping for.
Below are some sample graphs from our statsd + graphite setup. On this graph, we can see that on average during this test run, the Gobalancer nodes were able to process each request in 200 to 300 microseconds. This includes selecting the least busy worker, preparing the request to be sent to the worker, and copying the headers and body of the response back to the client. Not shown on this graph is the time the worker needs to process the request, which is in the tens of milliseconds.
Above are some of the graphs we use to monitor the state of Gobalancer during load-testing (click for larger versions). From left to right, the graphs show:
As seen in the graphs above, our single-box solution did not reach the 40,000 requests/s that HAProxy achieves. We believe this is due in part to our being unable to fine-tune the underlying OS on the GHC machines (we saw errors such as "Too many open files" even at moderate loads), and in part to not having had time to perform extensive optimizations. There is also the fact that Go is a garbage-collected language, and garbage-collection pauses hurt performance. Having gotten within a factor of two of HAProxy's performance, we felt our time was best spent distributing the load balancer across machines.