M5.6 San Francisco Bay Area, October 30, 2007
- 8:04:54 - Event NC 40204628 origin time
- 8:05:33 - First QDDS notice from Menlo
- M0.0 Mcd (ENS ignores this)
- 8:09:58 - First magnitude from Menlo
- M5.6 Mcd (ENS also ignores this)
- 8:10:58 - Got Ml 5.6 from Menlo
- This is when ENS finally sent the event out
- ENS sends 84,174 messages
- 8:10 - Peak web traffic:
- earthquake: 16,672 hits/sec, 1.056Gb/sec
- quake.wr: 11,380 hits/sec, 566Mb/sec
- pasadena.wr: 2,168 hits/sec, 269Mb/sec
- neic: 696 hits/sec, 99Mb/sec
- 8:10 - Peak traffic on the EHP web site
- Akamai serviced 16,672 hits/sec to the public. Peak on the
backend servers was 460 hits/sec, out of about 2,600 hits/sec
that Akamai was requesting from our servers.
Horst, Graben, and Mesa all peg at their configured
maximum of 2048 httpd processes. This turns out to be
too many, and all three servers go into a death spiral.
- Peak back-end traffic was actually about 2,600 hits/sec,
so we were short. Judging from the graphs, our
current server setup can comfortably handle about 115
hits/sec, so we were short by about a factor of 20.
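The shortfall quoted above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are made up for illustration, the figures are the ones given in this timeline, and the "edge offload" line assumes that any hit Akamai did not request from our servers was answered from its cache.

```python
# Figures taken from the 8:10 entries in this timeline.
edge_peak_hits_per_sec = 16_672      # served by Akamai to the public
origin_requested_per_sec = 2_600     # requested by Akamai from the back end
origin_served_per_sec = 460          # peak actually served by the back end
comfortable_capacity_per_sec = 115   # read off the graphs

# How far the requested load exceeded what the servers handle comfortably.
shortfall_factor = origin_requested_per_sec / comfortable_capacity_per_sec

# Fraction of Akamai's requests the back end managed to answer at peak.
origin_success_ratio = origin_served_per_sec / origin_requested_per_sec

# Assumption: hits not requested from the back end were served from cache.
edge_offload_ratio = 1 - origin_requested_per_sec / edge_peak_hits_per_sec

print(f"capacity shortfall: {shortfall_factor:.1f}x")
print(f"back-end requests answered at peak: {origin_success_ratio:.1%}")
print(f"traffic absorbed at the edge: {edge_offload_ratio:.1%}")
```

The first figure comes out a little over 22, consistent with "about a factor of 20".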
- 8:15 - The ITSOT Scan Mitigator
- The Scan Mitigator apparently tries to automatically detect
network attacks. It decides that the net traffic to and
from Ens1 constitutes a denial-of-service attack and blocks it.
The ENS web site goes dark.
- 9:15 - Logged on to Horst, Mesa, and Graben
- I got on the web servers. The reported load averages were
in the range of 60-70, and there were about 2,200 processes
running. I edited the Apache configuration and
lowered MaxClients from 2048 to 1024.
This helped, but the servers were still having trouble.
- 8:19 - CIIM incoming questionnaires peak
- The peak rate was 2882/min or 48/sec. No
errors occurred in preliminary processing.
- 9:25 - Problems transferring CIIM questionnaires
- Got a call from Dave Wald. Completed CIIM questionnaires were not being
copied over from the Pasadena web servers. Looked at the
move_entries.pl script and figured out that the list of files
to copy was just too long, and scp was choking on it. Used
tar to move about 60,000 entries over to Ciim and Ciim-pas.
- 9:30 - The web servers came back from the dead
- The process of editing the Apache configuration and restarting the
server took about 15 minutes. The web servers came back to life, and
were now able to service about half of the traffic. Not good, but
better than nothing.
- 9:35 - Lowered MaxClients to 700
- The server loads were in the range of 20-25 with MaxClients
set at 1024. I lowered it to 700. This brought the server load down
to about 12-15. This is just at the alarm
threshold. The servers were responsive and were able to
run all right.
- 10:00 - Traffic on the EHP site about 1,200 hits/sec
- Traffic came down to a level the servers
could cope with. At this point, we were servicing all of the
web traffic successfully.
- 10:01 - Switched the ENS primary
- There were problems connecting to the ENS machines in Golden, so I
switched the primary system from Ens3 in Golden to Ens2 in Pasadena
so that processing could continue.
- 10:54 - Modified move_entries.pl
- The script was building a list of files waiting to be transferred and
using scp to move them. This transfer was choking when the list got
over about 1,000 files. Modified the script so that it uses tar to
transfer a single archive file of completed questionnaires for
processing.
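The idea behind the change can be sketched as follows. This is not the actual move_entries.pl (which is Perl and is not reproduced here) but a minimal Python illustration of the same approach: pack the waiting files into one archive, then copy that single file. The function names, directory layout, and destination string are all hypothetical.

```python
import subprocess
import tarfile
from pathlib import Path


def bundle_entries(entry_dir: Path, archive_path: Path) -> list[str]:
    """Pack every regular file in entry_dir into one gzip-compressed tar archive.

    Returns the names of the files that were archived.
    """
    names = sorted(p.name for p in entry_dir.iterdir() if p.is_file())
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in names:
            # arcname keeps paths inside the archive relative to entry_dir.
            archive.add(entry_dir / name, arcname=name)
    return names


def transfer_archive(archive_path: Path, destination: str) -> None:
    """Copy the single archive with scp.

    destination is a hypothetical 'host:/path/' string; scp now receives
    one file argument no matter how many entries are waiting.
    """
    subprocess.run(["scp", str(archive_path), destination], check=True)
```

Because the file names are added to the archive one at a time and never passed on a command line, the number of waiting questionnaires no longer affects the length of the scp invocation.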
- 11:50 - ENS MySQL replication was broken
- Replication was broken for all the Golden
servers. Attempted to resync the databases. This is when I first
figured out that there was a block on Ens1. But I had no idea
who to call about this, or even if anyone would be available
at that time. ENS was able to run all right with Ens2 as primary.
I did not realize at the time that this meant the web site
was dead.
- 11:42 Oct 31 - Talked to Lien La
- She confirmed that the Scan Mitigator had blocked Ens1. She
talked to ITSOT and got them to
remove the block. The ENS web site came back up.