- Set up MRTG to query the Squid servers for network traffic and cache
activity information. Graphs are produced at 5-minute intervals, and are
displayed at http://bort.gps.caltech.edu/mrtg
- Did backup and restore to defragment the Hot and Redhot disks on Carizo.
- Swapped the extension cord connecting the GPS clock for Carizo to the
small UPS.
- Iron hung in the early hours of Tuesday morning. The cause turned out to
be Doug's cron job that builds the review pages. One event was hanging the
job, and cron kept starting additional copies. These consumed all the
virtual memory on the system and caused it to hang. The machine had to be
rebooted with a Stop-A and sync.
- Rtdev was reporting errors on the filesystem under /d20. The system was
rebooted in single-user mode and fsck was run to repair the filesystem.
- The AA RAID on Spring was running off of its hot spare on Wednesday.
There were no fault lights lit on the controllers. The console terminal
showed that disk 0:5 was marked as "DED". The disk was replaced with a
cold spare, and the RAID was rebuilt.
- Got a copy of Xvfb for Solaris. This is a dummy X server that any program
can attach to. It acts like an X display without actually displaying
anything on the screen. Doug's Java program that generates the review page
graphics insists on having an X server to connect to, so running Xvfb
means Doug no longer has to be logged in all the time. The Xvfb binary kit
was downloaded from
ftp://ferret.wrc.noaa.gov/special_request/xvfb/solaris
This was installed under /opt/X11R6 on Iron. A link from /usr/X11R6 to
/opt/X11R6 was made on Boron, because the Xvfb program insists on running
from the /usr/X11R6 directory. Finally, /etc/rc3.d/S164/Xvfb was set up to
start this at boot time.
- Added two old 2GB disks to Terra10 to alleviate the space crunch it was
experiencing.
- Added "jar" to the list of extensions mapped to "application/octet-stream"
in the mime.types file for the Apache web server on Iron. This will allow
people to download the Jiggle jar file for installation without their web
browsers trying to display it.
- Randy from Carnel came out and fixed the MSA0: 9-track drive on Mojave.
The power supply and head were bad. He replaced the power supply and put
the head from MSB0: on the drive to make it work. We can now read the old
800bpi tapes again.
- Looked at the Water Resources web site planning documents in Reston.
They have an ambitious plan to build a nationwide network of web servers,
but it appears that their needs are somewhat different from ours. They were
planning for a peak traffic rate that is far below what we expect after a
major earthquake.
- Assisted Dave Johnson and Mike Watkins with the T1-fiber connection in
the yellow house basement.
- Set up a secondary Big Brother network monitor and pager to act as a
failover in case the primary notification system fails.
- Rackmounted the disks for Hotspot and put it on the FDDI ring.
- Fixed the broken screws on the SCSI connector on Jet.
- Put a terminator on the SCSI bus on Rtdev. This appears to have fixed
the SCSI bus parity errors it was experiencing. Sun says that the internal
auto-termination on those disk boxes is problematic, and that an external
terminator is sometimes needed.
- Spring crashed on Monday. Changed the syslogd config file to set up
logging of kernel panic messages to the system messages file. Also set the
system up to save core dumps on reboot. This should help to track down the
cause of any further crashes.
- Set Jet up to save core dumps and console messages in case of future
crashes.
- The microwave link to Edison was fixed on Tuesday when they did something
to the router interface on their end. The two routers can see each other
now.
- Set up Big Brother to monitor the room and CPU temperatures recorded by
Jet, Spring, Hotspot, and Willow.
- Fixed the QDDS installation on Genie. It had an incorrect path for
storing incoming files. Making a link from /home/shake/quake/qdds to
/home/shake/qdds fixed it.
- Moved Ojai into the telemetry room rack with the other online systems.
Also recabled the FDDI in the Tan House and telemetry room.
- Fixed the USGS Squid server to auto-boot after a power failure. This
required setting jumper J11 on the main board, as well as tying together
pins 14 and 15 on the ATX power supply.
- The send-a-page facility on Terra10 had a problem with pages over 160
characters long. Added some logic to the Perl script to break long
messages up into no more than two 160-character pages.
- Attached Willow to the 63 net, but was unable to get a link light.
Different ports and cables all gave the same result. Called Sun for
warranty service.
- Mandy got Nathan Lee from ITS to come over and review our router
configuration for the Edison microwave link. Nathan figured out that the
problem was on the Edison end. Looking at their router configuration
confirmed that they had configured it with the assumption that the network
on our end would be 131.215.63.x, rather than 206.117.40.x.
- Installed Big Brother on Quake.
- The Pasadena Water and Power utility experienced problems at about 16:50
on Sunday. The Seismo Lab UPS failed to do its job, and almost all
machines crashed. Most machines rebooted automatically. Jet rebooted
after about 30 minutes. Spring was unable to reboot, as it had experienced
RAID disk failures from the power spike. Ridge failed to reboot until Bob
Busby cycled power on its CPU box and external disk. Several other
machines were left in strange states.
- Spoke at length with Tom at nStor about the problems with the Spring AA
RAID. Five disks had failed, but we were able to bring two of them back
online, which got the RAID functioning again. SDRV0 and SDRV1 were both
marked "CRITICAL" because each was missing a disk. Arranged to get some
replacements. The replacement disks arrived on Tuesday and were installed.
Two of the failed disks were sent back under RMA numbers 21878 and 21879.
The third failed disk was sent back under RMA 21943. The turnaround for
replacement disks is generally on the order of 2-3 months.
- Installed Big Brother on K2, Quakedc, and Scec.
- Willie from Sun came out and swapped out the bad ethernet card in Willow.
This brought up its connection to the 63 subnet.
- Set up server-level redirection for the new shake URL on Flint. This
involved setting "Redirect /shake.html http://www.trinet.org/shake/" in the
httpd.conf. Also made the same change to the shakemap installation on
Terra10.
- Enabled SNMP in the SCEDC Squid server and set MRTG on Bort to monitor
cache activity.
- Set up temperature monitoring on K2.
- Patched Water with the Solaris 2.7 recommended patch cluster. This is an
attempt to fix the Java error that refers to a necessary patch without
actually mentioning specifically _which_ patch is needed.
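
The page-splitting logic added to the send-a-page script on Terra10 can be
sketched roughly as follows. This is an illustration in Python, not the
actual Perl code. The function name and constants are hypothetical, and it
assumes that any text beyond the second page is simply dropped, which the
entry above does not state.

```python
PAGE_LIMIT = 160  # per-page character limit mentioned in the log entry
MAX_PAGES = 2     # the log says "no more than two" pages


def split_into_pages(message, page_limit=PAGE_LIMIT, max_pages=MAX_PAGES):
    """Split message into at most max_pages chunks of page_limit characters.

    Assumption: text past max_pages * page_limit characters is discarded.
    """
    pages = []
    for start in range(0, len(message), page_limit):
        if len(pages) == max_pages:
            break  # page budget used up; drop the remainder
        pages.append(message[start:start + page_limit])
    return pages
```

A 200-character message would come back as one 160-character page followed
by one 40-character page. A message of 160 characters or fewer comes back
as a single page.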
***
- Moved Ojai into the telemetry room so that it is in the same rack as the
other CUSP online systems.
- Dealt with a major power failure, including multiple disk failures on the
Spring AA RAID.
- Set up temperature monitoring for the telemetry room using Big Brother.
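
The Iron hang above came from cron starting new copies of a job while an
earlier copy was still stuck. One common guard against this, shown here
only as a sketch and not as something that was actually put in place, is
to have the job take a non-blocking exclusive lock and exit if another
copy already holds it. The function names and the lock path are
hypothetical. fcntl.flock is available on Unix-like systems only.

```python
import fcntl


def try_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    lock_file = open(lock_path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        lock_file.close()
        return None
    return lock_file


def run_exclusively(lock_path, job):
    """Run job() only if no other copy holds the lock; return True if it ran."""
    lock_file = try_lock(lock_path)
    if lock_file is None:
        return False  # an earlier copy is still running; skip this run
    try:
        job()
    finally:
        lock_file.close()  # closing the file releases the lock
    return True
```

A cron-driven script could wrap its main routine in a call such as
run_exclusively("/tmp/review_pages.lock", main) so that a hung run blocks
later runs instead of piling up behind it.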