M5.6 San Francisco Bay Area, October 30, 2007

8:04:54 - Event NC 40204628 origin time
8:05:33 - First QDDS notice from Menlo
M0.0 Mcd (ENS ignores this)
8:09:58 - First magnitude from Menlo
M5.6 Mcd (ENS also ignores this)
8:10:58 - Got Ml 5.6 from Menlo
This is when ENS finally sent the event out
ENS sends 84,174 messages
8:10 - Peak web traffic:
earthquake: 16,672 hits/sec, 1.056Gb/sec
quake.wr: 11,380 hits/sec, 566Mb/sec
pasadena.wr: 2,168 hits/sec, 269Mb/sec
neic: 696 hits/sec, 99Mb/sec
8:10 - Peak traffic on the EHP web site
Akamai serviced 16,672 hits/sec to the public. At peak, Akamai was requesting about 2,600 hits/sec from our back-end servers, but they delivered only 460 hits/sec. Horst, Graben, and Mesa all pegged at their configured maximum of 2048 httpd processes. This turned out to be too many, and all three servers went into a death spiral.
Judging from the graphs, our current server setup can comfortably handle about 115 hits/sec. Against a peak back-end demand of about 2,600 hits/sec, we were short by about a factor of 20.
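The shortfall can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in this entry; the variable names are made up for illustration.

```python
# Figures taken from the 8:10 entry of this log.
akamai_requested = 2600  # hits/sec Akamai requested from the back end at peak
backend_served = 460     # hits/sec the back-end servers actually delivered
comfortable = 115        # hits/sec the servers can comfortably handle (from the graphs)

# How far peak demand exceeded comfortable capacity.
shortfall_factor = akamai_requested / comfortable

# Share of Akamai's requests the back end managed to answer at peak.
served_fraction = backend_served / akamai_requested

print(f"shortfall factor: {shortfall_factor:.1f}x")          # 22.6x
print(f"fraction served at peak: {served_fraction:.0%}")     # 18%
```

The exact ratio is closer to 23, which is consistent with the rough "factor of 20" above.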
8:15 - The ITSOT Scan Mitigator
The Scan Mitigator apparently tries to automatically detect network attacks. It decides that the net traffic to and from Ens1 constitutes a denial-of-service attack and blocks it. The ENS web site goes dark.
9:15 - Logged on to Horst, Mesa, and Graben
I got on the web servers. The reported load averages were in the range of 60-70, and there were about 2,200 processes running. I edited the Apache configuration and lowered MaxClients from 2048 to 1024. This helped, but the servers were still having trouble.
8:19 - CIIM incoming questionnaires peak
The peak rate was 2882/min or 48/sec. No errors occurred in preliminary processing.
9:25 - Problems transferring CIIM questionnaires
Got a call from Dave Wald. Completed CIIM questionnaires were not being copied over from the Pasadena web servers. Looked at the move_entries.pl script and figured out that the list of files to copy was just too long, and scp was choking on it. Used tar to move about 60,000 entries over to Ciim and Ciim-pas.
9:30 - The web servers came back from the dead
Editing the Apache configuration and restarting the servers took about 15 minutes. The web servers came back to life and were now able to service about half of the traffic. Not good, but better than nothing.
9:35 - Lowered MaxClients to 700
The server loads were in the range of 20-25 with MaxClients set at 1024. I lowered it to 700. This brought the server load down to about 12-15, which is just at the alarm threshold. The servers were responsive and were able to run all right.
10:00 - Traffic on the EHP site about 1,200 hits/sec
Traffic came down to a level the servers could cope with. At this point, we were servicing all of the web traffic successfully.
10:01 - Switch the ENS primary
There were problems connecting to the ENS machines in Golden, so I switched the primary system from Ens3 in Golden to Ens2 in Pasadena so that processing could continue.
10:54 - Modified move_entries.pl
The script was building a list of files waiting to be transferred and using scp to move them. The transfer choked when the list grew past about 1,000 files. I modified the script so that it uses tar to transfer a single archive file of completed questionnaires for processing.
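The idea behind the fix can be sketched as follows. This is not the actual move_entries.pl (which is a Perl script and is not reproduced here). It is a minimal Python illustration, with a hypothetical function name and paths, of packing all waiting entries into one archive so that the copy step handles a single file regardless of how many questionnaires have piled up.

```python
import tarfile
from pathlib import Path


def bundle_entries(entry_dir, archive_path):
    """Pack every regular file in entry_dir into one gzipped tar archive.

    Hypothetical helper for illustration. Returns the number of files added.
    """
    entry_dir = Path(entry_dir).resolve()
    archive_path = Path(archive_path).resolve()
    count = 0
    with tarfile.open(archive_path, "w:gz") as archive:
        for entry in sorted(entry_dir.iterdir()):
            # Skip subdirectories, and the archive itself if it lives here.
            if entry.is_file() and entry != archive_path:
                # Store by bare file name so the archive unpacks flat.
                archive.add(entry, arcname=entry.name)
                count += 1
    return count
```

The resulting archive can then be copied in one transfer and unpacked on the processing host, so the length of the file list no longer matters.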
11:50 - ENS MySQL replication was broken
Replication was broken for all the Golden servers. I attempted to resync the databases. This is when I first figured out that there was a block on Ens1, but I had no idea who to call about it, or even whether anyone would be available at that time. ENS was able to run all right with Ens2 as primary. I did not realize then that the block meant the web site was dead.

11:42 Oct 31 - Talked to Lien La
She confirmed that the Scan Mitigator had blocked Ens1. She talked to ITSOT and got them to remove the block. The ENS web site came back up.


Stan Schwarz
Honeywell Technical Services
Southern California Seismic Network Contract
Pasadena, California
November 1, 2007