- Fang was reporting SCSI controller errors and bus resets over the weekend.
Reseating the controller probably would have fixed it, but I swapped it for
a spare unit.
- Put netpbm and xemacs on CIIM. Also sent in the DNS registration request
for ciim.gps.caltech.edu.
- Bigone system disk was full. This caused the mail to be wedged.
- Put the 'deliver' package on Bigone. This allows an account to auto-reply
with a 'this account is going away' message.
- CISN meeting.
- Recompiled orb2shear for Mandy.
- Got quotes for the Arena RAID units for Jet.
- Put the full coastline data for GMT on CIIM.
- Installed QDDS and CNSSM on CIIM.
- The keyboard on Silver was wedged. Had to power-cycle the machine to
fix it.
- Worked with Carl and Mike Watkins on cleanup of the telemetry room.
- 1/2 day vacation on Friday.
- There was a spam incident over the weekend. Onyx was running an open
relay. This turned out to have been caused by Allan Walter installing
a patch cluster from Sun on it. This overwrote the sendmail executable
with an old version that allowed relaying. Had to clean up 7,000
'postmaster - no such user' messages that were delivered to postmaster
on Iron.
- Installed the 2GB replacement memory in Jet.
- Disabled cachefsd on all Sun systems after CERT advisory
http://www.cert.org/advisories/CA-2002-11.html
- Got the new case and board for CIIM. They were missing the ATX
bulkhead for the back of the case, and it needs a power supply cable
extension. Called CalPC and arranged to get these items.
- Made an area for Ellen to install Oracle files on Sangabriel.
- The rest of the parts for CIIM came in on Tuesday. Assembled the machine
and booted it around noon. The graphics card would not go in the PCI
riser because the card is 5v and the riser is 3.3v. Booted off the serial
console.
- Worked with Mike Watkins and Busby on telemetry room cleanup.
- Wrote a Big Brother external script to check up on Fort Scotty and its
stations.
- Remounted CIIM in the rack closest to the wall. Added a fourth fan to
the case. Also arranged to send back the 3.3v PCI riser to exchange for
a 5v riser so that the graphics card will work.
- Program web conference call.
- Talked to Will Prescott about NatWeb and AFS. We decided to not put the
Program site on the NatWeb machines, but to just co-locate our servers in
their area so we can have remote console access and power management through
their equipment. Also, we decided to try to stick with rsync for content
distribution, rather than moving our whole site to AFS.
- The SCEC web server was dead on Friday morning. It died at about 20:05
on Thursday evening. Squid segfaulted and died. Showed Vikki how to restart
the processes and added an http test to Big Brother for that machine.
- Timers' meeting.
- Fixed a problem in the station updates archive script. Station 'TIM' was
not being archived due to there being a string 'TIME' in the 'notstations'
file. Tightened up the regular expression to prevent this from matching.
- Contacted Tyan to get an RMA number for the bad S2460 board we bought at
Fry's.
- Made up a status page for CIIM. It is at
http://bort.gps.caltech.edu/mrtg/ciim.html
- Re-enabled the 'phil' account on Jet, Spring, and Iron. Apparently,
Phil is doing some consulting work for Quanterra and needs access to
our systems for testing.
- Did some testing to verify backups for Hotspot.
- Rebooted Lander to clear a DECwindows problem.
- Swapped some boards in Foreshock and helped Lisa set up the revamped
Earthquake Hazards Program web site there. The new site was put out on
the public servers at the end of the day on Monday.
- M4.9 near Gilroy at 22:00 PDT. Traffic peaks in hits/second were: Menlo/3000,
Program/680, Pasadena/360. Also, the rate of incoming CIIM questionnaires
peaked at 480/minute. The only problem was that the Menlo Park site was
slow to respond, and the origin servers were unresponsive. Talked to Will
Prescott about this at about 22:45, but we were not able to determine the
cause at the time.
- Postmortem on web service problems on Tuesday. Examined the logs from
the web servers. Ehzeast was reporting 'out of swap space' and 'cannot
fork' errors. Web server logs recorded several thousand calls to
'helicorder.pl', with a peak of 430/minute soon after the event. The
servers are only able to process about 60/minute between the two of them,
so they were quickly overwhelmed as hundreds of Perl processes consumed
all available system resources. The full story of this can be found at
http://bort.gps.caltech.edu/spikes/13may2002/
- There was a problem with CDMG emailing some data that they transfer to
our online systems. Checked the sendmail logs on Jet and Spring and
found nothing amiss.
- Refaxed the RMA request to Tyan and got back an RMA number. Sent the
bad board back to Tyan on Wednesday for replacement.
- Configured Sqehznorth for future use as an origin web server for Menlo
Park.
- Did some testing with the Apache Benchmark ('ab') program. Sent calls
to 'helicorder.pl' to Ehzeast to demonstrate that it was indeed the cause
of the problems on Monday night.
- Got the new graphics riser from CalPC and used it to put the graphics
card in CIIM.
- Dog Day and farewell party for John Galetzka on Wednesday.
- Team Meeting on Wednesday.
- Fixed the 'fixperms.sh' script on Ehzmenlo for Lisa.
- CIIM crashed for undetermined reasons at 13:25 on Thursday.
- Rsynced the Menlo Park web site to Sqehzeast and Sqehznorth.
- Did some investigation of the image maps used for the Recent Earthquakes
maps. They are indeed server-side image maps, and some people were pointing
to the hits generated by this as a possible contributor to the problems
on Monday night. Bob Simpson will implement a client-side image map for
the main index page, although this would not have helped on Monday night.
- Attended the CUBE users meeting on Friday morning.
- The new RAID unit for Jet arrived on Friday.
- 1/2 day vacation on Friday afternoon.
- Assembled the new RAID, using the disks from Pluton. The carcass of
Pluton will be given to the Data Center.
- Conference call with Menlo Park about issues pertaining to moving their
web site to the new servers.
- Put the new RAID in the rack and connected it to Jet. Made five partitions:
two 18GB for /home and /opt, two 73GB for the database, and the remainder for
the wavepool. Copied the contents of the current wavepool. Modified the
hourly file copy script to copy data files to both wavepool areas.
- OES said that the problems with the data transfers had to do with using
Kermit to transfer the files. Checked the serial lines on Jet and Spring
and compared configurations. Made them identical.
- Helped Bruce Worden copy the shakemap web pages to Horst and Graben so
that they can be served by the Program site.
- Scott Lydeen reported problems getting his GPS email through his new
Road Runner connection. Turned out to be caused by an improper DNS setup
in rr.com. His reverse lookup worked to resolve his IP address to a name,
but the name had no forward lookup. So the TCP wrappers on Dagalas were
rejecting his connection. Road Runner insisted that this was all our
fault. Ended up just having his mail forwarded to his rr.com account.
- Rebooted CIIM on Thursday to enable soft updates on the filesystems.
- Changed the virtual server for shakemap.org and shakemap.com to issue
a redirect to http://earthquake.usgs.gov/shakemap/
- Made a new small wavepool for Hotspot and returned the borrowed Arena
RAID to Mike Black.
- The T1 to Menlo Park failed at 18:40 on Sunday.
- Fort Scotty had a telemetry problem at about 23:00 on Sunday. Big
Brother detected this, but it did not send any pages until 08:00 on
Monday due to a typo in the bbwarnrules.cfg file. Fixed the typo. Also
found that Hugo was not in the users list for paging off of the online
systems. Added him and restarted paging.
- Pac Bell came out and fixed the T1 at 15:30 on Tuesday.
- Talked to Patrick about putting his paging software on some other systems
so that we will have alternate ways of generating pages if both online
systems were to crash. Mandy installed the paging software on Rift on
Wednesday.
- CIIM logged some ATA bus resets on Tuesday and Wednesday mornings. This
indicates a possible bad disk drive or IDE cable. Swapped the disk drive
cable on Thursday morning.
- CIIM crashed for undetermined reasons at 13:25 on Thursday. Reseated
all the cards in it in hopes of some sympathetic voodoo.
- Dave reported that CIIM was asking for an S/Key password at login
time. Disabled S/Key in the ssh configuration.
- Made an account for Richard Allen on Iron so he can log on and do some
work with Patrick related to early warning.
- The scignmail mailing list was discovered by spammers. Keith asked that
it be set to restrict posting. Changed the config to only allow posting
by subscribers.
- Looked up UPS options for Jet when it moves to the 525 computer room.
Total power consumption by the system and its RAIDs is about 2500W.
- Got the Lexmark Optra C printer fixed. The repair guy opened it up and
found a small torn corner of paper wedged inside the guts of the printer.
Framed the corner as a keepsake.
- Started work on a disk mounting script for the online systems so that
the big RAID area with the main wavepool does not have to be mounted
automatically at boot time. The temporary wavepool disks can hold up
to about 48 hours of data, so the system can come up and start taking
data before the main wavepool is available. Running the fsck while the
system is up and taking data could reduce crash recovery time from 3+
hours to about 20-30 minutes.
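
Note on the station archive fix logged earlier ('TIM' skipped because
'TIME' was in the 'notstations' file): the sketch below shows the general
difference between a loose substring match and a match anchored at both
ends. It is only an illustration. The real script and its regular
expression are not reproduced in this log, so the list contents, the
function names, and the use of Python are all assumptions.

```python
import re

# Hypothetical contents of a 'notstations' exclusion file, one name per line.
NOTSTATIONS = ["TIME", "TEST"]


def excluded_loose(station, lines):
    """Substring match: 'TIM' is wrongly excluded because 'TIME' contains it."""
    return any(re.search(re.escape(station), line) for line in lines)


def excluded_strict(station, lines):
    """Anchored match: only a line that is exactly the station name counts."""
    pattern = re.compile(r"^\s*" + re.escape(station) + r"\s*$")
    return any(pattern.match(line) for line in lines)


if __name__ == "__main__":
    for name in ("TIM", "TIME"):
        print(name, excluded_loose(name, NOTSTATIONS),
              excluded_strict(name, NOTSTATIONS))
```

With the anchored pattern, 'TIM' is no longer treated as excluded, while
'TIME' still is.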
***
- The new server for the Community Internet Intensity Maps came online this
month. This machine replaces an old Sun workstation that was severely
overloaded by this task. The new server is a dual-processor Athlon-based
PC running FreeBSD Unix.
- Set up the new RAID for one of the Trinet online systems. The new RAID
is a twelve-disk IDE-based RAID-5 array, with about 850GB of storage
capacity and a hot spare disk.
- Worked with the Shakemap group to put their pages up on the main
USGS Earthquake Hazards Program web site. Prior to this, their maps
were only available on the Trinet web site.
- M4.9 event near Gilroy on May 13 caused a large surge of traffic to the
USGS web servers. All the servers worked well, except for the Menlo Park
site. Worked with Menlo Park to identify and correct the problems there.
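
Back-of-the-envelope check on the Menlo Park overload, using the two
figures from the postmortem above (a peak of 430 'helicorder.pl' calls
per minute against a combined capacity of about 60 per minute). This is
a rough sketch, not a model of the actual servers: it assumes the peak
rate holds constant, that no requests time out or are dropped, and that
every waiting request ties up a process. The function name and the use
of Python are illustrative only.

```python
PEAK_ARRIVALS = 430  # helicorder.pl calls per minute at peak (from the logs)
CAPACITY = 60        # calls per minute the two servers could process together


def backlog_after(minutes, arrivals_per_min, capacity_per_min):
    """Requests still waiting after `minutes` of constant overload."""
    return max(0, (arrivals_per_min - capacity_per_min) * minutes)


if __name__ == "__main__":
    # Offered load relative to capacity (a little over 7x).
    print(round(PEAK_ARRIVALS / CAPACITY, 1))
    for m in (1, 5):
        print(m, backlog_after(m, PEAK_ARRIVALS, CAPACITY))
```

Under those assumptions the backlog grows by 370 requests per minute, so
a pile-up of hundreds of Perl processes within the first minute or two
is consistent with the 'out of swap space' and 'cannot fork' errors.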