- Spring crashed on Sunday night. Fixed the file systems early Monday
morning so it could reboot.
- Installed two new CPUs in Spring. The vendor had said that they would
work as long as the two CPUs on a single board matched each other. Moved
the CPU from board 6 up to board 4, and installed the two new CPUs on board
6. The system passed self-test and booted correctly.
- Added expiration headers to the earthquake commentary directories on the
Trinet web server. This will allow the commentary files to be cached, but
only for a limited time. This is necessary because these files are subject
to frequent revision after significant events.
- Spring crashed again at 13:10 on Monday. Fixed the file systems so it
could reboot. Re-enabled logging of the console output using kermit on the
Trinet Squid server.
- Spring had another RAID disk failure on Tuesday. The disk at [1,8] failed.
Replaced with a cold spare. Called nStor and got RMA #23159 for the return.
- Fixed a problem with mail statistics collection on eqinfo. The former
mechanism had problems if the log file rolled over during a mass mailing.
The new mechanism takes the last two log files every time and checks for
relevant log entries.
- The Bigone system disk was out of space on Wednesday. Dug around and
found some big files in [SYS0.MSA.MSAP$SPOOL], [SOCAL] and
[CIT_RTCUSP.PAGER.PAGERINI].
- Registered anss.org with Network Solutions. Mailed Charlene about
setting up DNS for it.
- Talked to Sarala about disk space issues for Oracle. She requested that I
lower the panic level for db1-db5 to 90%.
- Installed MHonArc on Terra10 and rebuilt the station email archives.
This will allow for better searching and archiving of the mail.
- Ehzsouth was running hot after the M5.2 Napa Valley event on Sunday. The
CPU load average was running between 3 and 7 for most of the day. This was
due to the combination of CIIM map traffic and the portion of Menlo Park
traffic that was being sent to it.
- The USGS Squid server rebooted itself at 01:53:35 on Sunday morning.
This was during the traffic spike after the M5.2 Napa Valley event,
although it is not clear that there was any connection between the two.
- The Squid server was reporting a 75% cache hit rate after the Sunday
event. This turned out to have been due to the CIIM map gif not being
cached. Adding the following lines to /usr/local/etc/squid/squid.conf
raised the cache hit rate to 85-90%:
    acl CIIM-REFRESH urlpath_regex shake/cgi-bin/refresh\.pl
    no_cache allow CIIM-REFRESH
- Spring died at 14:55 on Tuesday. The console log indicated that it had
detected a hardware failure on Board 0. When it rebooted, it could not see
the 2GB of memory on that board.
- The Simpson Map on Terra10 was broken and had not updated in several days.
This turned out to be caused by cnssm. The cnssm_merge program was choking
and segfaulting on a 'stack full' error. Increased the declared size of
the stack in the source and recompiled.
- Talked to Will Prescott about a plan for relocating the National
Earthquake Hazards pages. We agreed to put master servers in Menlo Park
and Reston, mirrored with rsync. These in turn would feed Squid servers in
Menlo Park, Reston, and Pasadena. This should give us a capacity to serve
on the order of 1,200 hits/sec.
- Spring died again at 19:57 on Tuesday.
- Took Spring down for service. One of the heat sinks on Board 0 had
come off and fallen down onto the back of Board 2. This was likely the
cause of the detected failure. Reattached the heat sink and reseated the
memory on Board 0. All parts passed self-test and the system booted
properly.
- The DCP board on row 3 of the Spring RAID showed a red light. nStor said
that this indicated a failure in the right-side power supply. They advised
reseating the power supply, and this cleared the problem.
- Conference call with Lisa Wald and the Web Team to discuss the plan for
moving the National pages.
- Lisa forwarded a complaint about problems submitting a CIIM
questionnaire on Sunday. At first glance, the statistics reports suggested
that the system was functioning normally, but digging into the log files
showed that during the peak activity after the event, only about 1/3 of
submitted questionnaires were processed properly. Wrote up a report about
the problems we had. The report is online at:
http://bort.gps.caltech.edu/stan/sep03/
- Installed JBuilder on Iron for Doug.
- The USGS Squid server rebooted itself again at 22:59:16 on Saturday night,
9/9/2000. As with the other reboots, there was no indication in the log of
what happened.
- Discovered a problem with the mailing lists. A user did a reply-to-all
to one of the earthquake messages, and it was sent to quake-all@jet, which
replaced the sender field with 'quake-all-relay@jet'. This was then sent
to eqinfo, which checked the sender, found it on its list, and distributed
it to all 358 subscribers. Fixed this by removing the '-f' from the
quake-all alias on Jet and Spring so that they won't replace the sender
field.
- Analyzed the log files from the September 3rd event to find how many CIIM
questionnaires were lost. The peak activity was between 01:57 and 02:26 on
Sunday, when 3,776 questionnaires were submitted. Of these, only about 1/3
were processed successfully. The server appears to have a maximum capacity
of about 20/min.
- Tested submitting completed CIIM questionnaires to Agent86. It was able
to process 6.3/sec for sustained periods.
- Worked with Mandy to find the source of the input errors on the Goldfish
ethernet interface. Swapping the old Cabletron hub for a new cisco switch
eliminated the errors, which indicates that there was some incompatibility
between the Cabletron 10base-T and cisco.
- Installed mod_perl on Agent86 for testing. Preliminary testing showed
that this increased the CIIM processing speed to about 20/sec. Vince is
going to look at the CIIM scripts to see if they need any cleanup to be
able to run reliably under mod_perl.
- Talked to Keith Stark about getting space in their rack for Agent86.
- Set up two new control directories under tnw on Iron for Paul Friberg.
- Spring crashed at 12:49 on Thursday. I did not get the console output
from this crash.
- Fixed the ownership of files in the 99OCT04 directory for Kate. They
were owned by [1,1] instead of [CIT].
- Talked to Will Prescott and Stan Silverman about rsync. Gave them tar
files of compiled ssh and rsync to install on their systems.
- Added people to the eqpager mailing list for cube pages.
- Big Brother was sending alarms for high temperatures in the computer room.
Called Physical Plant at x4717. They said that the air conditioning was
working correctly, but was just overworked due to the extreme heat.
- The office took a brief power hit at 19:50 on Thursday evening. Most
computers that were not on UPS rebooted. The S. Mudd UPS seems to have
absorbed the hit successfully.
- Helped Mike Watkins with resetting a stuck terminal server.
- Tried to get the packet radio to Anaheim working, but it was not able to
even connect to the Mt. Wilson repeater.
- Set up RRDtool on Bort to graph the mail statistics from Eqinfo.
- Tejon crashed at 23:30 on Friday night. Had to reboot Ridge to clear up
the queue manager.
- Added Chuck, Bruce, and Jocey to the eqpager mailing list.
- Moved Agent86 into the rack next to the Squid server.
- Made Agent86 the live web server at 12:30 on Tuesday.
- Email sent to eqalarm_local on Jet for two events on Wednesday was
delivered 20 minutes late. Traced this to a problem with the primary
nameserver declared in /etc/resolv.conf. That machine is no longer a
nameserver. Removed it to fix the problem.
- Got a call from Joyce Costello in Reston offering to pay for building
three new Squid servers for the National Earthquake Hazards web server.
- Got a call from Dave Oppenheimer asking for a change in cnssm to allow
for merging events from UNR. The change was to add 'nn.cat' to the
definition of NETSTOMERGE in cnssm/bin/Settings.
- Made up at-a-glance status pages for the USGS, Trinet, and SCEDC web
servers, as well as the Eqinfo mailing list server.
- Helped Howard Bundock fix the access controls for tcp wrappers on his
machine so that Bruce could log in.
- The merged UNR events were not showing up on the Simpson Maps. Turned
out that we had to add 'nn' to the declaration of 'regionlist' in two files.
The files were Machine2.sh and req.Config, and both are stored in the
req2webdir/Config directory.
- Even with the config changes, the UNR events were not being plotted.
Found that this was caused by their sending the events with a blank version
field. This caused the Simpson software to reject them. Dave Oppenheimer
spoke with them and had them correct this.
- Went computer shopping with Bob to decide on a configuration for the
three Squid servers for the National web page.
- Added Jim Goltz to the eqpager mailing list.
- Set up a cgi-bin directory for Matt to use for testing.
- Set up SNMP monitoring of CPU usage on the USGS Squid and Agent86 servers.
Also set up monitoring of Squid memory usage.
- Built up two of the three new Squid servers for the National web page.
- Patrick said that snpp.airtouch.com was down. Tested, and it was up.
Found out it does not respond to pings. Called Bill Frank at Airtouch.
- Arranged with Patrick for the weekly CUBE test page to be sent to the
eqpager mailing list.
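
The mail statistics fix on eqinfo noted above (always scan the last two log
files, so a rollover in the middle of a mass mailing does not lose entries)
could be sketched roughly as follows. This is only an illustration, not the
script actually running on eqinfo: the "maillog*" file naming, the
"stat=Sent" marker, and the mailing identifier are all assumptions made for
the example.

```python
import re
from pathlib import Path


def newest_two_logs(log_dir, pattern="maillog*"):
    """Return the two most recently modified log files, oldest first.

    The 'maillog*' naming is an assumed convention, not taken from eqinfo.
    """
    files = sorted(Path(log_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    return files[-2:]


def count_deliveries(paths, mailing_id, marker=r"stat=Sent"):
    """Count lines in the given files that mention mailing_id and the marker.

    Both the marker and the idea of a per-mailing identifier are hypothetical;
    substitute whatever the real log entries contain.
    """
    wanted = re.compile(marker)
    total = 0
    for path in paths:
        with open(path, errors="replace") as fh:
            for line in fh:
                if mailing_id in line and wanted.search(line):
                    total += 1
    return total
```

Because both the current and the previous log file are read on every run, a
mailing that straddles a rollover is still counted in full. The cost is
rescanning one extra file each time.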
***
- The M5.2 quake on September 3rd provided a good stress test of the
Pasadena Office web server. Overall, the server performed well, except for
the peak period between 01:57 and 02:26, when it was unable to keep up with
the load. The major resource user was the processing of completed
questionnaires for the Community Internet Intensity Map. The server was
only able to process about 1/3 of the questionnaires submitted during this
time. A full report of the impact of this event on our web server is
available at http://bort.gps.caltech.edu/stan/sep03/
- The backup Trinet online system crashed with a hardware fault on Tuesday,
September 5th. Self-test indicated a fault on Board 0. Removed the board
and found that one of the heat sinks had fallen off, and the metal sink was
shorting contacts on the board. Reattached the heat sink, and the system
was able to reboot with no errors.
- Worked with people from USGS-Menlo Park to make a plan for relocating the
National Earthquake Hazards web servers from Denver to Menlo Park and
Reston. The network link to Denver is only about 3Mb/s, and this was
nearly saturated after the September 3rd event. Moving to Menlo Park and
Reston will give an aggregate bandwidth of nearly 200Mb/s. Built three
Squid reverse-proxy servers which will act as the front-end for this web
site. This will give it a much higher load capacity.
- The new office web server went live on September 19th. This is an AMD
Athlon 800 CPU, and it is much more powerful than the old web server. In
testing, it was able to process Community Internet Intensity Map
questionnaires 18 times faster than the old server.
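
As a rough cross-check of the speedup quoted above, the rates recorded in the
log entries (about 20 questionnaires/min on the old server, 6.3/sec sustained
on Agent86 before mod_perl, and 3,776 submissions between 01:57 and 02:26)
can be compared directly. This is only back-of-the-envelope arithmetic on
those reported figures. The 29-minute window length is derived from the two
timestamps, and the variable names are made up for the example.

```python
# Figures taken from the log entries; all are approximate.
OLD_RATE_PER_MIN = 20.0    # old web server, questionnaires per minute
NEW_RATE_PER_SEC = 6.3     # Agent86, questionnaires per second (no mod_perl)
SUBMITTED = 3776           # questionnaires submitted during the peak
WINDOW_MIN = 29            # 01:57 to 02:26


def per_minute(rate_per_sec):
    """Convert a per-second rate to a per-minute rate."""
    return rate_per_sec * 60.0


new_rate_per_min = per_minute(NEW_RATE_PER_SEC)
speedup = new_rate_per_min / OLD_RATE_PER_MIN
arrival_per_min = SUBMITTED / WINDOW_MIN
new_server_keeps_up = new_rate_per_min >= arrival_per_min

print(f"new server: {new_rate_per_min:.0f}/min, speedup about {speedup:.1f}x")
print(f"peak arrival rate: {arrival_per_min:.0f}/min")
print(f"new server keeps up with the peak: {new_server_keeps_up}")
```

On these numbers the ratio comes out a little under 19, in line with the
roughly 18-fold figure above. The average arrival rate during the peak is
well below the new server's sustained rate, although short bursts within the
window could still have been higher than the average.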