The M5.2 event centered near Yountville in the Napa Valley on the morning of September 3, 2000, provided a good test of the Pasadena Office web service capabilities. Parts of the system performed very well, while others suffered somewhat under the onslaught.
Fig. 1: Pasadena Web Server
The story of how the current configuration came into existence is told in Web Servers, Earthquakes, and the Slashdot Effect.
The Squid server has proven itself in past events in the Los Angeles area, but this event was the first test of serving the Community Internet Intensity Map on a nationwide basis. This turned out to alter the mix of traffic considerably.
The CIIM uses an online form that users fill out for input. This input is then fed to a CGI script for processing on the web server. The script proved to be a major consumer of system resources after this event.
The Pasadena Office web server received a total of 390,231 HTTP requests on September 3rd, of which 118,053 were passed through for processing by Ehzsouth. Compounding the load, Ehzsouth is also one of three servers that serve the Menlo Park Office web pages. The Menlo Park site received approximately 2,200,000 requests during that day, of which 682,606 were sent to Ehzsouth. This traffic also consumed a significant proportion of the system resources.
Fig. 2: Web Server Requests
In addition, Ehzsouth serviced 57,352 requests for Menlo Park, including 2,601 CGI requests during this peak time period.
Fig. 3: CIIM Questionnaires Submitted
The trouble started when people began clicking the Submit Data for this Earthquake link. After completing the questionnaire, they submitted the form for processing. This calls ciim_update.pl, a Perl script that reads their input, calculates a Mercalli Intensity, and provides some other information. During the period from 01:50 to 02:30, 3,776 people submitted completed questionnaires. The logs on the Squid server show only 1,138 questionnaires successfully completed, while the log on Ehzsouth shows 2,172 completed. The difference represents 1,604 questionnaires that were simply lost by Ehzsouth, and 1,034 that were completed but took so long that the Squid server timed out and logged them as a 504 error. These users ended up disappointed.
Figure 3 shows the number of questionnaires submitted per minute for the day, as well as the number actually processed successfully. It appears that the server is not capable of processing more than about 25-30 questionnaires per minute.
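A per-minute tally like the one plotted in Figure 3 can be sketched from the Squid access log. The following is only an illustration: it assumes Squid's native access.log layout (timestamp, elapsed time, client, action/status, bytes, method, URL, ...), and the sample lines, host names, and addresses are hypothetical, not taken from the actual logs.

```python
from collections import Counter
from datetime import datetime, timezone

SCRIPT = "ciim_update.pl"

# Hypothetical sample lines in Squid's native access.log layout:
# time elapsed_ms client action/status bytes method URL ident hierarchy type
SAMPLE_LOG = """\
967971000.100 2300 192.0.2.10 TCP_MISS/200 5120 POST http://www.example.org/cgi-bin/ciim_update.pl - DIRECT/ehzsouth text/html
967971012.500 120000 192.0.2.11 TCP_MISS/504 1024 POST http://www.example.org/cgi-bin/ciim_update.pl - DIRECT/ehzsouth text/html
967971030.000 15 192.0.2.12 TCP_HIT/200 8192 GET http://www.example.org/index.html - NONE/- text/html
967971065.250 4100 192.0.2.13 TCP_MISS/200 5120 POST http://www.example.org/cgi-bin/ciim_update.pl - DIRECT/ehzsouth text/html
967971070.000 120000 192.0.2.14 TCP_MISS/504 1024 POST http://www.example.org/cgi-bin/ciim_update.pl - DIRECT/ehzsouth text/html
967971075.000 120000 192.0.2.15 TCP_MISS/504 1024 POST http://www.example.org/cgi-bin/ciim_update.pl - DIRECT/ehzsouth text/html
"""


def tally_per_minute(lines):
    """Return (submitted, succeeded) Counters keyed by 'HH:MM' in UTC."""
    submitted, succeeded = Counter(), Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 7 or SCRIPT not in fields[6]:
            continue  # skip malformed lines and requests for other URLs
        stamp = datetime.fromtimestamp(float(fields[0]), tz=timezone.utc)
        minute = stamp.strftime("%H:%M")
        status = fields[3].rsplit("/", 1)[-1]  # "TCP_MISS/504" -> "504"
        submitted[minute] += 1
        if status == "200":
            succeeded[minute] += 1
    return submitted, succeeded


submitted, succeeded = tally_per_minute(SAMPLE_LOG.splitlines())
for minute in sorted(submitted):
    print(minute, "submitted:", submitted[minute], "succeeded:", succeeded[minute])
```

Plotting the two counters against time would give the submitted and successfully processed curves; a flat ceiling on the second curve is what suggests a fixed processing capacity.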
Table 1: Traffic Summary, 01:50-02:30

| Squid Log | Request Type | Ehzsouth Log |
|---|---|---|
| 38,942 | Total Requests | 8,306 |
| 9,587 | Total CGI | 4,936 |
| 3,776 | ciim_update.pl | 2,172 |
| 1,138 | Status 200 | |
| 1,617 | Status 504 | |
Table 1 summarizes the logs for the Squid server and for Ehzsouth. The discrepancy between the total requests on the Squid and the total on Ehzsouth mostly reflects the fact that roughly 68% of the incoming requests (26,538 of 38,942) were served out of the Squid cache, so these totals will not match even under normal circumstances. The numbers for total CGI calls will also not match, as Squid is instructed to cache the output of certain CGI scripts. But in the case of the ciim_update.pl calls, the numbers should match. Unfortunately, Squid logged 3,776 of these calls and Ehzsouth only 2,172.

The Squid server logged a total of 3,483 status 504 errors for the day, all of them between 01:57 and 02:26. As the table shows, 1,617 of them were generated by people who submitted completed questionnaires. The remainder were a random mix of different HTTP requests that the server was unable to service in a timely manner.

The Squid logs show that of the 38,942 requests received, the Squid sent 12,404 through to Ehzsouth for processing. This number should agree with the 8,306 total requests reported by Ehzsouth, but it does not. The difference is 4,098 requests that were lost by the overloaded server.
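The bookkeeping behind these figures can be written out as a short calculation. This is a sketch using only the numbers quoted above; it assumes that every request Squid did not forward to Ehzsouth was answered from its cache, and the function and key names are chosen here for illustration.

```python
def summarize_traffic():
    """Reconcile the Squid and Ehzsouth counts for 01:50-02:30."""
    squid_total = 38_942       # requests received by Squid (Table 1)
    squid_forwarded = 12_404   # requests Squid passed to Ehzsouth
    ehzsouth_total = 8_306     # requests Ehzsouth actually logged (Table 1)
    errors_504_day = 3_483     # all 504 errors logged by Squid that day
    errors_504_ciim = 1_617    # 504 errors on questionnaire submissions

    # Assumption: anything not forwarded was served from the cache.
    served_from_cache = squid_total - squid_forwarded
    return {
        "served_from_cache": served_from_cache,
        "cache_share": served_from_cache / squid_total,
        "lost_by_ehzsouth": squid_forwarded - ehzsouth_total,
        "other_504_errors": errors_504_day - errors_504_ciim,
    }


summary = summarize_traffic()
print(f"served from cache: {summary['served_from_cache']:,} "
      f"({summary['cache_share']:.0%})")
print(f"forwarded but never logged by Ehzsouth: {summary['lost_by_ehzsouth']:,}")
print(f"504 errors on other requests: {summary['other_504_errors']:,}")
```

The same comparison, forwarded count against backend count, is a quick check for dropped requests after any future event.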
During this time, Big Brother was reporting that Ehzsouth was overloaded. The Big Brother logs show a load average of 22.5 for this period. This indicates that the machine was severely overloaded. The periods of high CPU load are indicated by the red in Figure 4. These correspond well with the times when the traffic was high as seen in Figure 2. The Squid server also showed a load peak around this time, but its load average never went above 0.5.
Fig. 4: Big Brother CPU History
As after other events, the initial surge of web traffic subsided after a short time, and the server was able to catch up. After dawn, a second, less-peaked swell began, and continued for the rest of the day. During this time, Ehzsouth reported load averages in the 4-5 range, indicating that it was heavily loaded, but it was still able to keep up with the workload.
The Pasadena Office has recently acquired a new server, which will provide somewhat more horsepower for processing CGI scripts. It is an AMD Athlon/800, and testing indicates that it can process about 360 questionnaires per minute. This is a big improvement over the 25-30 per minute the current server managed during this event. Testing was done with the Apache Benchmark program, using four Sun workstations. Completed questionnaires were fed to the server 20 and 40 at a time. During this testing, the server's load average went as high as 45, but it remained responsive at all times.
| Questionnaires | Concurrent Requests | Time to Process | Average Rate |
|---|---|---|---|
| 600 | 60 | 100 sec | 6/sec |
| 12,000 | 60 | 1,900 sec | 6.3/sec |
| 2,400 | 120 | 380 sec | 6.3/sec |

Table 2: Testing the New Server
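The average rates in Table 2 are simply questionnaires divided by elapsed time. The sketch below recomputes them from the table and, as a rough illustration, estimates how long the 3,776 questionnaires submitted between 01:50 and 02:30 would take at the slowest measured rate. It assumes the server sustains its benchmark rate under real traffic, which the tests alone cannot guarantee.

```python
# (questionnaires, concurrent requests, seconds to process), from Table 2
TEST_RUNS = [
    (600, 60, 100),
    (12_000, 60, 1_900),
    (2_400, 120, 380),
]

YOUNTVILLE_SUBMISSIONS = 3_776  # questionnaires submitted 01:50-02:30


def rates_per_second(runs):
    """Average processing rate (questionnaires per second) for each run."""
    return [count / seconds for count, _concurrency, seconds in runs]


rates = rates_per_second(TEST_RUNS)
for (count, concurrency, seconds), rate in zip(TEST_RUNS, rates):
    print(f"{count:>6,} at {concurrency:>3} concurrent: "
          f"{rate:.1f}/sec = {rate * 60:.0f}/min")

# Time to work through the Yountville burst at the slowest measured rate.
drain_minutes = YOUNTVILLE_SUBMISSIONS / min(rates) / 60
print(f"burst of {YOUNTVILLE_SUBMISSIONS:,} cleared in about "
      f"{drain_minutes:.1f} minutes")
```

The rate stays near 6 per second whether 60 or 120 requests are in flight, which suggests the server is limited by per-request processing cost rather than by concurrency.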
Another possibility to increase our capacity is to streamline the processing of the incoming questionnaires. Any small improvements in efficiency will reap big benefits under the heavy load after an event.
Still another possibility is to alter the questionnaire to use a Mailto: tag. Using email as the transport to send the data would reduce speed, but it would probably increase reliability. Unfortunately, this would also remove the immediate feedback that users get from the Mercalli Intensity that the CGI script calculates, as well as some other information that the script provides.
We also tested mod_perl with the Apache web server to speed up processing of the CIIM forms. Mod_perl links a Perl interpreter directly into the Apache web server executable, and provides for much faster processing of Perl scripts. This is largely because it avoids starting a separate process for each script run. The new server was able to process 20 completed questionnaires per second using this module.
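The cost that mod_perl avoids can be illustrated with a small timing comparison. The sketch below is an analogy in Python, not mod_perl itself and not the CIIM script: it runs the same trivial, hypothetical handler once inside the current process and once by starting a fresh interpreter for every call, which is roughly what plain CGI does for each request.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for a form handler: sums the numeric answers it is given.
HANDLER_SOURCE = "import sys; print(sum(int(x) for x in sys.argv[1:]))"


def handle_in_process(answers):
    """Run the handler inside the already-running interpreter."""
    return sum(answers)


def handle_in_new_process(answers):
    """Start a new interpreter for this one request, as plain CGI would."""
    result = subprocess.run(
        [sys.executable, "-c", HANDLER_SOURCE, *map(str, answers)],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


def seconds_per_call(handler, answers, repeats):
    """Average wall-clock time of one handler call."""
    start = time.perf_counter()
    for _ in range(repeats):
        handler(answers)
    return (time.perf_counter() - start) / repeats


answers = [3, 5, 4, 2]
in_process = seconds_per_call(handle_in_process, answers, 5)
new_process = seconds_per_call(handle_in_new_process, answers, 5)
print(f"in-process:  {in_process * 1000:.3f} ms per call")
print(f"new process: {new_process * 1000:.3f} ms per call")
```

The absolute numbers depend on the machine, but the per-call gap is mostly interpreter startup, and that fixed cost is what an embedded interpreter pays only once.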
The new server will help us, and there are also some other things we can do to increase our capacity. With a bit of luck, we won't have another large event until we have implemented a full solution.
The new server went live on Tuesday, September 19th. It is set up to use mod_perl for processing the CIIM questionnaires. In testing, this new configuration was able to process 20/second, which gives us capacity that is an order of magnitude higher than the peak rate experienced after the Yountville event. This should be sufficient for the time being, although we will continue to explore other ways to increase our capacity for the future.