- Changed Rift to 68.6 on Monday.
- Tested the FreeBSD Oracle client on Dawn. Got it to talk to the database
successfully.
- Resurrected Scarp. It had died on 11/20 when the temperature in the
computer room hit 80 degrees.
- Changed the Big Brother clients on all systems to reflect the new
address for Rift.
- Started building a duplicate of Rift on Scarp.
- Put Pluton in a rack in the 525 computer room.
- Met with Bob Sauer from Cal Climate, Philip Vaziri and Michael Raven
from Caltech, and a representative from Carrier to discuss the air
conditioning problems in the 525 computer room.
- Added fields to the pagers database to store information so that we
can use the database to generate configurations for tnpage.
- Tried to set up replication for MySQL from Rift to Scarp. It would
not work, giving an 'increase max_allowed_packet on master' error. Tried
increasing this to its maximum, but this did not fix it.
- M4.8 event in Mexico at 13:04. Felt in San Diego and the Imperial
Valley. Trinet web server traffic peaked at 14 hits/sec.
- Alarms problem with QDDS on Makalu. Turned out that the AA_Simpson_Map
script was generating QDDS messages and putting them in
/home/rtem/QDDS/polldir. This is where QDDS runs on the online systems.
But on the Data Center machines, it runs under /opt/util/share/QDDS.
Made a link so that /home/rtem/QDDS/polldir links to
/opt/util/share/QDDS/polldir.
- Put the new tnpage configs on Atlantic, Pacific, and Rift. These configs
are generated out of the pagers database with the 'maketnpage.pl' script
on Rift in /home/stan/Paging.
- Attached the modem on Pacific to its shelf with velcro on Tuesday.
- The Arena RAID on Pacific was reporting bad sectors on disk 3 on Wednesday
morning. Replaced it with a spare and sent the bad disk back to Maxtor
with RMA number 0201209120.
- Started installing the new SunFire V480 servers that Egill bought.
- The AA RAID on Atlantic failed again on Wednesday morning. The left-hand
power supply on the second chassis was bad and failed when the right-hand
power was turned off. All four RAID-5 arrays were disrupted. Got the
arrays back on line after replacing an apparent failure at 0,0. Brought
the system up. After fsck, the contents of all RAID filesystems were
bad. Did a newfs and created new wavepool directories. Ellen rebuilt
the database. The backups of /home and /opt were bad. After the last
failure, I forgot to update the backup script to reflect that /home and
/opt had moved to different logical devices. Copied /home and /opt from Pacific
to restore the system. Mandy made the necessary changes to rtem and got
the system running again about 19:00 on Wednesday.
- Analog data was bad starting about 05:30 on Thursday morning. The
data was just spikes. We ended up having to power-cycle Carizo before
it cleared up. The bad data confused the system and caused it to
generate about 80-100 bogus events in and around the L.A. basin.
- Menlo Park had a DNS problem starting about 18:40 on Thursday. Horst was
refusing ssh connections because it could not do a reverse lookup.
Added Rift's IP address to its hosts file to fix this. Also added
the address for Sqehzmenlo, but it turned out that I made a typo
in the IP address. Bob Simpson called on Friday morning to say that
ssh from Sqehzmenlo to Horst was not working. Fixed the typo at 09:05
on Friday.
- Analog data went missing at about 18:00 on Thursday night. Checked the
interface on Carizo with UCX and 'show interface cf0' to get the counters.
Saw packets being sent. Checked the interface on Magma with 'snoop -d le0'
and saw packets coming in from Carizo. Mandy restarted coaxtoring.
http://rift.gps.caltech.edu/trinetrt_updates/msg00137.html
- Got one of the new V480 servers mounted in the rack. When it was
powered on, it displayed a 'Please login' prompt. None of the documents
that came with it explained what to do at this point.
- The AA RAID on Atlantic failed again on Monday. The controllers
appeared to get very confused and were returning I/O errors to the
host. Attached Atlantic to the AA RAID from Hotspot and rebuilt the
system.
- Restored Hotspot without its RAID. Egill said that we can order a new
Arena RAID to replace the Andataco GigaRAID that failed.
- Helped Ruth Ludwin with setting up ssh access to Horst and Graben for
PNW shake map access.
- Installed Solaris on the new V480 on Wednesday. The system is called
Plume. Installed the patch set and Forte compilers.
- Timers meeting on Wednesday.
- Made a new group called 'DutyOp' for paging.
- Nextel came back to talk to us about their cell phone services.
- Helped Karen with some Simpson map problems on the Data Center.
- Helped Karen with starting pdaemon on the Data Center. The process
had died and left a dangling semaphore. The trick was to use 'ipcs'
to display the semaphores. To remove the dangling semaphore, use
'ipcrm -S 0x00005208'.
- Added the cubebelt2 and cubebelt3 pagers back into the database.
Apparently these are used for some outside agencies.
- Made a script to restart pdaemon on the online systems.
- Fang crashed mysteriously at 07:55 on Thursday. It had crashed one
other time on October 21. Took it apart and did some voodoo by reseating
all the boards.
- Jury duty all day on Friday.
- Hotspot was hung over the weekend. It appears that the small disk
array we put on it is faulty. Removed the disk array and rebuilt the
system on the remaining disks.
- Got the replacement for RMA number 0201209120 from Maxtor. This is
the replacement for the 80GB disk that failed on December 11.
- Hotspot was still having problems on Tuesday. The console had
messages about 'disk not responding to selection'. Replaced the
SCSI terminator and rebooted.
- The Menlo Park network connection went bad overnight on Christmas Day. The
outage ran from about 20:00 until 07:31 on the 26th.
- Added a 'sort by date' feature to the new PHP Station Updates web page at
http://rift.gps.caltech.edu/station_updates
- Vacation on Friday.
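
The dangling-semaphore cleanup noted for pdaemon above ('ipcs' to list,
'ipcrm -S <key>' to remove) can be sketched as a small helper. This is a
minimal sketch, not the restart script that was actually written. It
assumes the Solaris layout of 'ipcs -s' output (type letter, ID, key, mode,
owner, group); Linux prints different columns. The function names, the
sample text, and the choice of 'rtem' as the owner are assumptions made
for illustration. The helper only builds the commands and does not run
them.

```python
import subprocess


def find_semaphore_keys(ipcs_output, owner):
    """Return the keys of semaphores owned by `owner`.

    Assumes Solaris-style `ipcs -s` rows: T ID KEY MODE OWNER GROUP.
    """
    keys = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Semaphore rows start with the type letter 's'; this skips
        # the status line, the column header, and the section title.
        if len(fields) >= 5 and fields[0] == "s" and fields[4] == owner:
            keys.append(fields[2])
    return keys


def ipcrm_commands(keys):
    """Build (but do not run) one 'ipcrm -S <key>' command per key."""
    return ["ipcrm -S " + key for key in keys]


def read_ipcs():
    """Run 'ipcs -s' on the local host and return its text output."""
    result = subprocess.run(["ipcs", "-s"], capture_output=True,
                            text=True, check=True)
    return result.stdout


# Illustrative sample only; real output will differ from host to host.
SAMPLE = """IPC status (illustrative sample)
T         ID      KEY        MODE        OWNER    GROUP
Semaphores:
s          0   0x00005208 --ra-ra-ra-     rtem     rtem
s          1   0x00001234 --ra-------     root     root
"""

if __name__ == "__main__":
    for command in ipcrm_commands(find_semaphore_keys(SAMPLE, "rtem")):
        print(command)
```

Printing the commands rather than running them leaves a chance to confirm
that the owning process is really dead before the semaphore is removed.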
***
- Added service information to the pagers database so that it can be used
to generate the configurations for tnpage on the online systems.
- The Andataco GigaRAID AA on Atlantic suffered two total failures this
month. After the second failure we switched it off and rebuilt the system
on the other AA RAID. There are no plans to try to resurrect the failed
unit again.
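
The idea of generating the tnpage configurations from the pagers database
can be illustrated with a short sketch. It is not the 'maketnpage.pl'
script and does not reproduce the real tnpage file syntax: the table
layout, the column names, the pager and group names, the numbers, and the
output format are all hypothetical, and an in-memory SQLite database
stands in for the real one.

```python
import sqlite3


def make_config(conn):
    """Render one line per pager, grouped under its group name.

    The table layout and the output syntax are hypothetical.
    """
    rows = conn.execute(
        "SELECT grp, name, service, number FROM pagers "
        "ORDER BY grp, name"
    ).fetchall()
    lines = []
    current = None
    for grp, name, service, number in rows:
        if grp != current:
            # Start a new section each time the group changes.
            lines.append("[" + grp + "]")
            current = grp
        lines.append(name + " " + service + " " + number)
    return "".join(line + "\n" for line in lines)


def demo_database():
    """Build a throwaway database holding made-up example rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pagers "
        "(name TEXT, grp TEXT, service TEXT, number TEXT)"
    )
    conn.executemany(
        "INSERT INTO pagers VALUES (?, ?, ?, ?)",
        [
            ("example2", "GroupB", "service-x", "5550002"),
            ("example1", "GroupA", "service-y", "5550001"),
            ("example3", "GroupA", "service-x", "5550003"),
        ],
    )
    return conn


if __name__ == "__main__":
    print(make_config(demo_database()), end="")
```

Keeping the service information in the same table as the pager means each
configuration line can be produced from a single query, which is the
reason for storing it in the database in the first place.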