- Tested the disks from Galena in another Ultra-5. Still reported errors.
Frank Vernon said they would send up a replacement machine.
- Talked to Lighthouse about a quote for a 400GB RAID 0+1 array. They
quoted nearly $25,000. This is beyond our budget right now, so we are
falling back to software solutions to the RAID performance problem.
- Mounted the K2 waveform disks on Scree. This did not appear to improve
Jiggle's performance, but I found out later that I had mounted them in
the wrong place. Moved the mounts to the right place.
- Added '-u 113' to the ftpd line on Ehzmenlo to fix the problem on the
Program web pages where files that were placed with ftp did not allow group
access.
- The replacement for Galena came in and Mandy and I configured it. The
wavenet interface did not seem to work. Turned out the Sbus Ethernet card
was bad. Replaced the card, but it still did not work due to the Wavenet
hub not being dual-speed. We need a 100Mb card for the machine. Mandy
called Frank Vernon and they will decide what to do on Monday.
- M3.5 event near Alameda on Thursday evening made a peak of 190 hits/sec
on the Menlo Park servers, and 28/sec on the Pasadena server. Both servers
seemed to keep up with the load.
- Added threshold checking to the MRTG config for USGS web server traffic
to detect traffic surges after earthquakes.
- Enabled the Topozone.com URL for email sent to the duty@eqinfo mailing
list.
- There was a major CITnet network problem from 02:18-03:25 on Monday. The
router in S. Mudd failed, and the 65 and 61 nets were unreachable.
- Fixed the NFS mounts on Scree again. Turned out I had the wrong disks
mounted off of K2.
- The network problem came back from 14:30-15:50. ITS found that the RSM
in the S. Mudd Catalyst had failed. Routing is supposed to fail over to
routers in Booth and Fairchild, but the Booth router was incorrectly
configured for the 65 and 61 nets, so they were cut off again.
- Installed the Simpson Map 5.1 on Bort for testing.
- Added a time delay for process pages off of Magma. The Bakersfield
import program dies occasionally. When this happens, it is automatically
restarted. Big Brother will now not page for missing processes unless they
are missing on two consecutive checks.
- Removed the account 'jocey' from the systems.
- Moved disk partitions on Terra10 to the new 9GB disk.
- There was a weird problem with cnssm on Agent86. 'selyoung_cnssm' was
dumping core, and the catalog files were left blank. Copied over a new
copy of the 'catalog' directory to fix it.
- ITS fixed the configuration of the Booth router and did a software
upgrade to the S. Mudd RSM to prevent a recurrence of the Monday network
problem. They also proposed putting an RSM in the USGS Catalyst to provide
a third failover path.
- Removed Phil from more alias lists on the realtime systems.
- Iron had a disk problem on Thursday morning. The Andataco disks went
offline. Had to power-cycle and reboot to clear the problem.
- M5.1 event near Big Bear at 13:05 on Saturday. Web traffic peaked at 81
hits/sec, and ciim_update peaked at questionnaires/min. The servers
handled the spike well.
- Patrick was trying to reboot Hotspot on Monday morning. It hung in POST.
Took it apart and found that one of the PCI SCSI cards was falling out of
the slot. It did not have a screw holding it in. Put a screw in it. Then
POST was complaining about the memory. Reseated all the memory. Then the
system passed POST and rebooted.
- Did some analysis of the web traffic spike after the M5.1 event on
Saturday. The traffic showed the familiar pattern of a peak at 10 minutes
after the event, followed by an exponential decay.
- Got a disk back from nStor on Tuesday. It had been sent out the week of
March 10, 2000, under RMA number 20824.
- Researched flow control issues for the Central Data SCSI terminal ports
on Granite.
- Ran analog against the Menlo Park web site logs for the week of 1/21-1/28
so they could see how many non-US visits they were getting.
- Changed the Mapblast link on the Terra10 Simpson Maps to use Topozone
instead.
- Did some testing on the old trinet-squid. It was crashing whenever it
had to do a lot of disk access. Swapped out the SCSI controller for another
one, and this seemed to make it stable. Then it developed a further
instability that seemed to be connected to overheating of the 1000RPM disk.
Placing a fan on this disk seemed to fix this problem.
- Made an account on Agent86 for Greg Anderson to put up his web pages.
- Installed FreeBSD on the former trinet-squid and mirrored the Trinet web
pages on it. When the machine is stable and tested it will be used to
replace Flint as the Trinet back-end server.
- Changed the QDDS on Agent86 from java to kaffe.
- Got the four-disk array from Terra1 installed on Spring.
- Moved Aladdin to the trailer for Greg Anderson.
- Set up the wavepool links on the new disks on Spring.
- Swapped out Rust for Nickel for Alan Yong. After it was all done, he said
the new machine was even louder than the old one, and asked me to put the
old one back.
- Talked to Paul Friberg about TNW and the server.
- M4.4 near San Jose at 15:18 on Sunday. Pasadena server peaked at 61
hits/sec. Menlo Park server was completely overwhelmed.
- Did an analysis of the Menlo Park web traffic spike. Dave Oppenheimer
wants a meeting to figure out what to do about their web capacity.
- Upgraded the Simpson Maps on Agent86 to the new version to correct the
Shake Map links for Northern California.
- Helped Karen upgrade the Simpson Maps on SCEDC.
- Helped Stan Silverman install FreeBSD and Squid on their new server in
Menlo Park.
- Turned off rsync between Ehzsouth and Agent86 on Tuesday.
- Made the file-moving script for Spring to move files from the temporary
wavepool to the main wavepool. Set it up to run in cron on Wednesday morning.
- M6.8 near Seattle at 10:54 Wednesday. All-day web traffic surge.
Report at http://bort.gps.caltech.edu/spikes/28feb2001/
- Helped Stan Silverman turn on their new Squid server at about 11:55. It
came online doing 165 hits/sec, and their site was available again.
- Had to run the stats for Pasadena and the National site by hand on
Thursday. The logs were too big.
- All USGS earthquake web servers went down on Wednesday, except for
Pasadena and the National Page.
- Mail from Oppenheimer on Friday, proposing expanding their DNS
load-sharing to other sites. Did some research and got proof that it's a
bad idea: http://bort.gps.caltech.edu/spikes/ehzdns.html
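
The paging change for Magma noted above (no page unless a process is
missing on two consecutive checks) can be illustrated with a minimal sketch.
This is not Big Brother's actual code or configuration; the class name
MissingProcessPager, the process name "bk_import", and the choice to page
only once per outage are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set


class MissingProcessPager:
    """Hypothetical helper: page only after N consecutive missed checks."""

    def __init__(self, required_misses: int = 2) -> None:
        self.required_misses = required_misses
        # Consecutive-miss counter per process name.
        self._misses: Dict[str, int] = {}

    def check(self, expected: Iterable[str], running: Set[str]) -> List[str]:
        """Record one monitoring pass and return the processes to page for."""
        to_page: List[str] = []
        for name in expected:
            if name in running:
                # Process is back (e.g. auto-restarted): reset its counter.
                self._misses[name] = 0
                continue
            self._misses[name] = self._misses.get(name, 0) + 1
            # Page exactly when the threshold is reached, so a long outage
            # produces one page rather than one per check (an assumption).
            if self._misses[name] == self.required_misses:
                to_page.append(name)
        return to_page


if __name__ == "__main__":
    pager = MissingProcessPager()
    print(pager.check(["bk_import"], set()))          # first miss
    print(pager.check(["bk_import"], {"bk_import"}))  # restarted in time
    print(pager.check(["bk_import"], set()))          # first miss again
    print(pager.check(["bk_import"], set()))          # second miss in a row
```

A process that dies and is restarted before the next check resets its
counter, so only a process that stays down across two checks generates a
page.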
***
- Did some research on ways to fix the performance problems on the Trinet
online systems. Hardware solutions are too expensive right now, so we are
going to use a mixed hardware-software solution. We put a set of four small
disks on Spring for a test, and have started moving stations over to the
new disks for testing.
- Added a topographic map URL to the earthquake maps on the Trinet internal
web pages, as well as to the email sent to the duty seismologist, so they
can better see the location of events.
- Upgraded the Trinet Squid server to an Athlon-based machine. This is
faster than the old server, and will improve web access to Shake Map during
the large web traffic spikes after earthquakes.
- Experienced the largest sustained web traffic surge to date after the
M6.8 near Seattle on February 28. All USGS earthquake web servers went
down in the crush, except for the Pasadena and National Earthquake Hazards
pages. These are administered out of the Pasadena office, and both are
designed for high capacity to accommodate post-earthquake traffic surges.
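
For capacity planning, the post-earthquake traffic pattern described in the
log (a peak about 10 minutes after the event, followed by an exponential
decay) can be written as a small model. This is a rough sketch only: the
function name decay_rate and the decay time constant tau_minutes are
assumptions and were not measured from the logs. The 81 hits/sec peak is the
figure reported for the M5.1 Big Bear event.

```python
import math


def decay_rate(minutes_since_event: float,
               peak_rate: float,
               peak_minute: float = 10.0,
               tau_minutes: float = 30.0) -> float:
    """Estimated hits/sec at or after the traffic peak.

    tau_minutes is a hypothetical time constant, not a fitted value.
    The ramp-up before the peak is not modeled.
    """
    if minutes_since_event < peak_minute:
        raise ValueError("model only covers times at or after the peak")
    elapsed = minutes_since_event - peak_minute
    return peak_rate * math.exp(-elapsed / tau_minutes)


if __name__ == "__main__":
    # Peak of 81 hits/sec, as reported for the M5.1 event.
    for minute in (10, 20, 40, 70):
        print(minute, round(decay_rate(minute, 81.0), 1))
```

Fitting tau_minutes to the real per-minute hit counts from a past spike would
be needed before using a model like this to size servers.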