Title:        Distributed Cooperative Web Server. 

Abstract:    This project will investigate techniques for building highly scaleable web servers by constructing the web server itself as a distributed cooperative application.

Team Members: David Ladiges

Project Web Site: Distributed Apache

Status: Completed

Published Results:

HTTP Load Balancing with the Scalable Distributed Apache System, J. Michael Meehan and David Ladiges, Information: An International Interdisciplinary Journal, Volume 4 Number 4, October 2001, .

Scaleable Distributed Apache WWW Server, J. Michael Meehan and David Ladiges, Proceedings of the 16th International Conference on Computers and their Applications (CATA-2001) March 28-30, 2001.


HTTP Load Balancing with the Scalable Distributed Apache System

J. Michael Meehan

David Ladiges

Computer Science Department

Western Washington University

Abstract

With the ever-increasing utilization of WWW based services, organizations are faced with the problem of providing highly scalable web servers. Attempts to create scalable web services have been made utilizing URL based load balancing, hardware load balancing and DNS based load balancing. We present a solution based on transforming the web server itself into a distributed application. This approach has the advantage that it is almost completely transparent to the content development process in that the web content being served does not need to be rewritten in order to achieve increased performance.

Keywords

Web servers, distributed systems, load balancing, Apache web server, Inverse Notification Frequency Algorithm, Scalable Distributed Apache System.

1. Introduction

In recent years the phenomenal growth in the utilization of the Internet in general and web-based technology in particular has created a situation in which organizations are dependent on web services in order to function. This has in turn created a demand for ever-greater performance from the web servers. To address the demand for ever-increasing performance various load balancing techniques have been applied to this problem domain. Software based load balancing is one of the most economical and flexible approaches to load balancing and is the method what we chose for implementation of the Scalable Distributed Apache (SDA) System.

2. Software Based Load Balancing

Several different methods of software based load balancing present viable options for load balancing. Most important to our research are URL Rewriting and DNS Redirection. In the following we will briefly discuss each method and detail the approach that the Scalable Distributed Apache System employs.

2.1 URL Rewriting

A group at the University of Arizona has developed a modification to Apache known as DCWS (Distributed Cooperative Web Server) [1]. This system is implemented as an Apache module. Under this approach the web server dynamically rewrites the URL to allow HTTP redirection to route traffic to one of a collection of coop web servers. Content is migrated amongst servers dynamically to attempt to distribute work throughout the system. Fault awareness is addressed through a separate server pinging mechanism to account for server crashes. Information concerning the relative load of the components of the system is exchanged through extensions to the http headers [1].

While this approach works for static content, it does not address delivery of other than static content. The distribution of dynamic web pages, CGI processing, and writing to persistent storage is outside the scope of this system. Unfortunately, the distribution of these high processor cost techniques, such as CGI, is the most significant need in current systems.

2.2 DNS Redirection

DNS redirection is the simplest approach to load balancing web servers and generally involves making only a few changes to a DNS server. The first step in deploying DNS based load balancing is to set up multiple servers to be used to balance the web server traffic. Normally, the DNS entry for a web server would be set to resolve to one IP address. In order to provide increase web services performance the DNS server is modified to permit multiple IP addresses to be associated with a single host name. The DNS server then resolves the hostname to one of the pool of IP addresses thus effectively distributing the workload. The decision as to which IP address to use for the hostname can be done in various ways. The simplest approach is to use a round-robin style DNS redirection. One advantage of the DNS redirection approach is that it requires no changes whatsoever to the actual web server itself. This approach in no way addresses any issues of fault tolerance. It also relies on some other mechanism, such as a network or distributed file system, to deal with resolving problems associated with operating multiple web servers with what must function as single content.

3. Apache as a Distributed Application

The current research deals with a web server implementation based on the open source Apache WWW server. The modifications to a standard Apache server via a module (Scalable Distributed Apache module or mod_SDA) employ the Module API approach provided by the Apache 1.3.xx implementation [2,3]. This allows the modifications to be easily added to standard Apache servers to quickly build a distributed processing server network. We utilize the popular PVM library to develop code to provide load balancing information distribution and distributed processing functionality. We observe that the dominant paradigm for e-commerce server implementation is still CGI scripting and other dynamic web page creation techniques. This means in order to effectively address scalability of web servers, a solution should provide for the transparent distribution of CGI requests to many servers operating in parallel. Furthermore, this should be done in a transparent fashion without requiring an unreasonable effort on the part of the content development or programming teams.

The additions to Apache we provide do not in themselves implement facilities for dealing with replication or consistency of files. We assume that the underlying capabilities of the filesytem itself or a transactional database is utilized to deal with these issues.

The module we developed interacts with a primary Apache server to intercept HTTP requests, rewrites them to point to another computer on the distributed network, and uses HTTP redirection to send the requests onward. The load balancing distribution is determined via a configuration file that is read on startup. One of several distribution methods is employed for load balancing. The different distribution mechanisms available for use from the SDA System include standard distributed processing algorithms including a baseline option for comparison, sequential (round robin) processing, random distribution, and a least utilization metric that comprises the core of the SDA System.

3.1 Baseline

The baseline method for the SDA System allows all HTTP requests to pass through to the host server without attempting any load balancing. As this is done with a minimum of operations, the baseline method simulates a web server running without any modifications made by the SDA System. All benchmarks of the different load balancing methods available from the SDA System include the baseline method for comparison.

3.2 Random

A simple approach to the distribution problem is a random distribution approach where the initial configuration of the module prepares a list of available servers in the distribution network and the module selects web servers at random from that list to receive the redirected requests. This method has a minimum amount of overhead other than to ensure that the list of web servers is up to date with all functioning web servers. A true random load-balancing algorithm should ensure that all web servers are equally used over a lengthy time span, but over a short time span there is no guarantee that all web servers will be equally used.

3.3 Sequential

Another simple distribution method is a simple round robin approach. In this method the list of available servers in the distribution network is cycled through sequentially to receive the redirected requests. This approach creates negligible overhead, although some maintenance is required to keep an accurate list of web servers. Round robin load balancing also ensures that all servers are equally used on an originating request basis but the size of the information requested may not be balanced.

3.4 Least Utilization

A more sophisticated method for distribution of the HTTP requests is based on web server load and availability. Our module bases its decision as to which web server receives the request on the basis of the relative CPU loads of the available servers and the availability of the web server.

The crux of the issue in implementing a load-balancing algorithm for a distributed system is how to deal with updating and distributing the information concerning the relative loads of the various machines. The fundamental problem is that one cannot consume all of the system resources supporting the distribution of load information while attempting to improve performance through load distribution. In other words, the overhead of the load-balancing algorithm cannot consume the very resources you were attempting to capture through load distribution. This problem manifests itself clearly if one considers a load-balancing approach in which the workers in the systems report to a central "scoreboard" each time a piece of work in accepted to be performed. The busier the workers become the more traffic they generate to report how busy they are. This is precisely what you don’t want to happen.

Our system uses a scheme we call an Inverse Notification Frequency Algorithm. Under this approach the busier a worker becomes the less often it reports its load to the "scoreboard." In this manner the overhead of the algorithm becomes less at precisely the times when the system has the most work to do and the overhead of the algorithm increases only at times when the system is more idle and it will have least impact. In this approach the absence of a notification actually has informational content, as the notification would have been received in a certain amount of time if the notifying system were not utilized above a certain threshold. The exceptions to this are that the notifying machine may have crashed or the notification message may have been lost in transmission. The second situation cannot occur if we base our communication on an underlying virtual circuit capability. If a load notification message was not received in the specified time interval due to the fact that the notifying machine is dead then we wouldn’t be able to send any work to it anyway. In other words, the dead machine is extremely busy from the point of view of the load-balancing algorithm. After a specified timeout period, we will notice that the notifying machine is in fact dead and remove it from consideration. If the notifying machine does come back online, it can rejoin the secondary serving network provided it is still running the load-measuring component of the SDA system.

In a system that utilizes notification of tasks accepted to indicate relative load the exact opposite situation would exist. The load-balancing algorithm would erroneously conclude that the notifying machine was idle in the absence of messages indicating tasks accepted. This would perhaps lead the algorithm to select the dead machine as the most attractive to forward new work to.

Our system utilizes a central "scoreboard" housed on the home server to keep track of load information. The initial copy of our modified version of Apache utilizes PVM library capabilities to create auxiliary processes on each machine in the secondary server network. The auxiliary processes on non-home server machines communicate to an auxiliary process on the home server machine to provide load information. The home server auxiliary process communicates with the modified Apache server on the home system to provide the system load information to it. The Apache server then utilizes this information in deciding which remote machine will be referenced in the redirected URL.

This system as written is asymmetrical and not truly distributed in the robust sense of that word. The system could be redesigned to be a truly distributed application but this would imply that we would need to create a new capability to distribute the initial URL requests to more than one "home" machine. In other words, there is not much point in making the system load information exchange portion of the system truly distributed and robust if we have to have a home server concept anyway.

4. Conclusions

The simple setup of servers and transparency of operation allow this load balancing method to compare favorably with the DC Apache application and help maintain a significant cost advantage compared to hardware based method. There is also significantly more control of distribution at the server level than with a DNS based load balancing method. The greatest advantage of our load balancing approach comes from the ability of any of the web servers in the distributed network to handle dynamic content and CGI applications.

Distributed forms of Apache do share one common flaw with regard to dynamic content and CGI execution in that file based IO must forced to access centralized network file storage or face the problems of dealing with distributed file storage. Even with a networked file system the file server may present a bottleneck as file reads and writes may have to block as they access the needed files sequentially. This is less of a problem with a centralized database server as most transaction oriented database systems (including Oracle and MS SQL Server) are prepared to deal with these types of multiple access issues.

The system as implemented is not a "true distributed solution" in that we still utilize a central "home" server. This is an artifact of the way in which DNS names are resolved to IP addresses and not due to the inability of the other aspects of the system to be written in a "truly distributed" fashion.

Testing the solution and performing a performance analysis as with any distributed application is a tricky business. In order evaluate the effectiveness of the three solutions and draw meaningful conclusions we need to be able to create a controllable load demand on the system. We used the HTTPerf Performance Measuring Tool as the load generator in our testing [2].

While many results were measured during the testing phase, the most informative results came from testing a set of 10,000 connections to the host web server at a rate of 500 requests per second. This was sufficient to overload the web server with all tested methods of load balancing. However, as the following chart details, the random, sequential, and intelligent load balancing methods all produce more requests and replies handled with a shorter connection time than the baseline method [4].

500 Req/S

Requests

Replies

Connection Time [S]

Baseline

3429.33

2843.00

826.33

Random

5616.67

4960.00

466.43

Sequential

5772.00

5096.67

466.27

Intelligent

5434.33

4748.33

491.47

 

When compared to the baseline method, the intelligent load balancing method is slightly slower than the random and sequential methods but still presents 58% improvement in the number of requests handled and a 68% increase in replies made. (The random method provided improvements of 64% and 75% while the sequential method provided improvements of 68% and 80% for requests and replies respectively.) Connection time for the intelligent method was only 59% of the baseline connection time compared to 56% for both the random and sequential methods. While the intelligent method definitely performs at a slightly lower rate than the random and sequential methods this difference is somewhat diminished by the intelligent decision making with regards to the chosen destination server.

Other limitations of this project include an incompatible module structure with the forthcoming Apache 2.0 module API. The module as implemented requires some additional work to port it to the new API structure.

With the current usage of the Internet, the applications of this project are significant. A network of computers can be assembled and configured with this software load balancing method for far less than the equivalent hardware load balancing solution and with far more traffic shaping control than with DNS or standard URL rewriting techniques.

References:

[1] Baker, S. M. and Moon, B. Technical Report 98-08: Scalable Web Server Design for Distributed Data Management, August 1998.
[2] Mosberger, D. and Jin, T. httperf – A Tool for Measuring Web Server Performance. Hewlett-Packard Research Labs, 1998.
[3] Stein, L. and MacEachern, D. Writing Apache Modules with Perl and C. O’Reilly and Associates, Inc., 1999.
[4] The Apache Server Foundation. About the Apache HTTP Server Project. http://httpd.apache.org/ABOUT_APACHE.html, February 2001.
[5] Ladiges, D. Scalable Distributed Apache: Load Balancing with PVM. Western Washington University, 2001.