- Spring crashed at 03:30 on Monday. Took about 30 minutes to reboot.
There was no crash dump left, despite having savecore enabled.
- Moved my system administration notes and status reports archive web pages
to the web server at http://bort.gps.caltech.edu in order to take advantage
of the easier administration on the FreeBSD server.
- Called Sun about the problems with the Spring crashes. They said that we
may not be able to save a dump due to the fact that the physical memory on
Spring is larger than either of the two swap partitions. Spring has 4GB of
memory, and two 2GB swap partitions. The swap partitions cannot be made
larger, as Solaris uses 32-bit addressing and cannot deal with swap spaces
larger than 2GB.
- Set up kermit on the Trinet Squid server to use its serial port as
the console for Spring. This will allow us to capture the console output
from the next crash.
- Spring crashed at 13:00 on Thursday. Examining the console log and
searching the comp.unix.solaris archives indicated that a kernel patch
might fix the problem. Got patch 105181-21 from Sun and installed it on
Thursday afternoon.
- Joe reported that the 9-track drive could not read some of the Cedar
tapes that we were able to read before. Called Carnel and they said that
some mechanical alignment of the head might be necessary to get it to read
the 800bpi tapes. They will come out next week to do the service.
- Helped Doug set up Big Brother monitoring of the Trinet Post-Processing
processes on Iron. These processes send status reports to Big Brother
every time they run. If they hang, Doug will be paged after 30 minutes.
- Talked with ham radio operator Khalil Ladjevardi, who had volunteered to
help debug the packet radio to Anaheim. He did some tests and confirmed
the problems.
- Jet crashed at 09:34 on Saturday. Took 35 minutes to reboot. There was
no saved crash dump.
- Phil installed patch 105181-21 on Jet following the Monday crash. This
was done even though we had spoken and agreed to wait until the following
Monday to install it, since we did not know for sure whether it would help.
- Examining the Big Brother history for Jet and Spring showed that crashes
have occurred at the following times:
    Jet:            Spring:
    May 02 15:32    May 03 12:00
    May 08 13:25    May 06 22:02
    Jun 01 03:26    May 13 06:01
    Jun 04 09:34    May 15 11:40
                    May 26 09:07
                    May 29 03:37
                    Jun 01 12:55
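
As a rough check on how often the two machines have been going down, the gaps between consecutive crashes in the history above can be computed directly. This is only a sketch: the year is not recorded in the history, so the `YEAR` constant is an assumption used purely to make the dates parseable, and the function name is made up for illustration.

```python
from datetime import datetime

# Crash times copied from the Big Brother history above.
# ASSUMPTION: the history gives no year; 2000 is used only so the dates
# can be parsed. It does not affect the May-June gaps.
YEAR = 2000

CRASHES = {
    "Jet": ["May 02 15:32", "May 08 13:25", "Jun 01 03:26", "Jun 04 09:34"],
    "Spring": [
        "May 03 12:00", "May 06 22:02", "May 13 06:01", "May 15 11:40",
        "May 26 09:07", "May 29 03:37", "Jun 01 12:55",
    ],
}


def crash_gaps_in_days(stamps, year=YEAR):
    """Return the gaps, in days, between consecutive crash timestamps."""
    times = [datetime.strptime(f"{year} {s}", "%Y %b %d %H:%M") for s in stamps]
    return [(b - a).total_seconds() / 86400 for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for host, stamps in CRASHES.items():
        gaps = crash_gaps_in_days(stamps)
        print(host, ", ".join(f"{g:.1f}" for g in gaps))
```

The month abbreviations are parsed with `%b`, which assumes an English or C locale.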
- Hacked RPAGE on the VAXen to have it send pages as email to the Airtouch
email paging server. This is a workaround for the problems with the
dialout modem hanging.
- Turned disk space alarms back on for Hotspot and Rtdev. Mandy is
monitoring the health of these systems.
- Installed bison-1.25, ddd-3.0, flex-2.5.4a, gcc-2.95.2, gdb-4.17,
make-3.76.1, texinfo-3.9, and textutils-1.22 on Spring and Jet. These were
requested by Paul Friberg at ISTI.
- Randy from Carnel came out on Wednesday and realigned the head on the
9-track so that it can read the old 800bpi tapes.
- Bob requested that I write up my notes on system security. In the
process of writing these up, I discovered that Jet and Spring were
both allowing unrestricted incoming telnet, ftp, rlogin, and rshell access
from the entire Internet. It appears that someone may have deleted the
/etc/hosts.allow file that controls this access. Re-created this file to
only allow access from the gps subdomain and from the internal networks
that the dataloggers are on.
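
The access policy described above (allow the gps subdomain and the datalogger networks, deny everyone else) can be illustrated with a short sketch. This is not the hosts.allow syntax and not the actual file contents. The network ranges are hypothetical placeholders, since the real datalogger networks are not listed here, and the function name is invented for illustration.

```python
import ipaddress

# ASSUMPTION: placeholder values. The entry names the gps subdomain but does
# not list the datalogger networks, so these ranges are hypothetical
# documentation addresses, not the real ones.
ALLOWED_DOMAIN_SUFFIX = ".gps.caltech.edu"
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed(client_hostname, client_ip):
    """Return True if the client matches the allow list; deny by default."""
    # Rule 1: hosts inside the allowed subdomain.
    if client_hostname and client_hostname.lower().endswith(ALLOWED_DOMAIN_SUFFIX):
        return True
    # Rule 2: addresses on one of the allowed internal networks.
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)
```

The point of the sketch is the default: anything that matches neither rule is refused, which is the behavior that was lost when the file went missing.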
- System security notes are available online at
http://bort.gps.caltech.edu/stan/security
- Installed tripwire on Jet and Spring to detect any possible future
tampering with the system.
- Set up another tape drive for backups of the FreeBSD machines.
- Installed libthreads patch 106980-04 on Water to fix the problem with
Java runtime and the TrinetWatch applet.
- There was a network problem on the 65 subnet at 12:40 on Thursday. Turned
out a guy from Physical Plant was doing some wiring in the CITnet cage in
S. Mudd and accidentally knocked a plug loose. The outage was fixed by
12:53. Wrote to ITS to ask that they check that plugs on the network
equipment be physically secured to avoid this happening in an earthquake.
- Will Prescott made a change to the Ehz DNS scheme on Wednesday afternoon
which ended up rerouting all of our web traffic directly to our web server,
bypassing the Squid server. Called him and had it fixed. It left about a
24-hour hole in the Squid statistics.
- Made a minor change to Big Brother so that it will truncate outgoing
alphanumeric pages at 160 characters. There was a problem with some pages
not being delivered because they were so long that they overran the input
buffer on Patrick's pager daemon.
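
The truncation itself is a one-line change. A minimal sketch of the idea follows, written in Python for illustration only; it is not the actual Big Brother modification, and the function name is hypothetical.

```python
MAX_PAGE_CHARS = 160  # limit stated in the entry above


def truncate_page(message, limit=MAX_PAGE_CHARS):
    """Return the message cut to at most `limit` characters."""
    # Slicing leaves short messages unchanged and cuts long ones at the limit.
    return message[:limit]
```

Truncating on the sending side means the pager daemon never sees an oversized message, at the cost of losing the tail of very long pages.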
- Checked the mail logs to gauge the performance of the mail server. It
turned out to have taken 10:30 to send mail to all 163 quake-all
recipients after the Morongo Valley event on Sunday. Installed bulk_mailer
on Eqinfo to attempt to improve this.
- Installed Arc/Info binaries on Willow. There was a problem with the
license file. They had sent the license as a fax, and there were probably
typos in it after keying it in by hand. Called ESRI and they promised to
email a new license file on Thursday.
- The Simpson map on Terra10 was not updating properly. This turned out to
be a configuration problem with cnssm. It was generating the proper update
files, but putting them in the wrong place. It was putting them in
/home/cnssm/outputspool. The fix was to make this directory a link to
/home/quake/outputspool.
- Hooked up Aladdin for Dave Wald's summer student.
- The Imperial Valley had a swarm on Wednesday afternoon. This allowed for
extensive testing and tweaking of the mail server configuration. After
setting the 'maxdomains' variable to 2 for bulk_mailer, it was able to send
166 mails in 1:40 for a factor of six improvement. The mail was sent as 40
separate messages which could process in parallel.
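
The factor-of-six figure can be checked against the numbers in the log: 10:30 for 163 recipients before bulk_mailer, and 1:40 for 166 recipients after. The sketch below converts both durations to a common unit and compares them. The helper names are made up for illustration, and the ratio is the same whichever units the "a:b" figures are read in, as long as both use the same ones.

```python
def to_units(clock):
    """Convert an 'a:b' duration to a count of the smaller unit (a*60 + b)."""
    major, minor = clock.split(":")
    return int(major) * 60 + int(minor)


def per_recipient_speedup(before, after):
    """Each argument is a (duration 'a:b', recipient count) pair."""
    (t0, n0), (t1, n1) = before, after
    return (to_units(t0) / n0) / (to_units(t1) / n1)


if __name__ == "__main__":
    before = ("10:30", 163)  # before bulk_mailer
    after = ("1:40", 166)    # with bulk_mailer, maxdomains set to 2
    # Ratio of elapsed times: prints 6.3
    print(round(to_units(before[0]) / to_units(after[0]), 1))
    # Ratio of time per recipient: prints 6.4
    print(round(per_recipient_speedup(before, after), 1))
```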
- Built a new kernel for Eqinfo with a larger process table and more
network resources to allow for the increased load from parallel mail
processing.
- Spent some time with Mandy and Mike Watkins fighting with the frame relay
connection to the new router.
- Added Matt Silva to paging.
- Set up MRTG to monitor and graph the volume of mail going through the
Eqinfo mail server. The URL for the graphs is
http://bort.gps.caltech.edu/mrtg/eqinfo-mail.html
- Pasadena Water and Power came by on Thursday to tell us that they would
be shutting off our power from midnight to 6:30 AM on Tuesday, June 20th.
In preparation, we conducted a practice power outage at lunchtime on Friday
in order to inventory which outlets are on the generator, and to verify
that the web servers and network equipment would run through the power
failure.
- Moved the Eqinfo mailing list server to the SCIGN shop area and put it
on a small UPS.
- Got Frame Relay to work with the new Cisco router using test IP addresses
for the FRAD and datalogger.
- Made a script to count the number of subscribers to the mailing lists and
report the numbers back to MRTG on Bort for graphing. The graphs can be
seen at http://bort.gps.caltech.edu/cgi-bin/eqinfo-lists.pl
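
For reference, a counting script of this kind can be quite small. The sketch below is not the script that was installed. It assumes a subscriber file with one address per line and `#` comments, which is a guess at the format, and it prints the four-line layout that MRTG expects from an external script (two values, an uptime string, and a target name). The function names and the target name are hypothetical.

```python
def count_subscribers(lines):
    """Count non-blank, non-comment lines (assumed: one address per line)."""
    return sum(1 for ln in lines if ln.strip() and not ln.lstrip().startswith("#"))


def mrtg_report(first, second, uptime="", name="eqinfo lists"):
    """Format two values in MRTG's four-line external-script layout."""
    # Line 1: first value, line 2: second value,
    # line 3: uptime string (may be empty), line 4: target name.
    return f"{first}\n{second}\n{uptime}\n{name}\n"
```

A caller would open each list's subscriber file, pass the file object to `count_subscribers`, and write the result of `mrtg_report` to standard output for MRTG to collect.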
- Got two refurbished disks back from nStor. Put one in the AA RAID on
Spring to replace the failed disk in position 0,0 and put the second disk
in the drawer as a cold spare.
- Pasadena Water and Power turned off our power at 00:30 on Tuesday
morning. The generator carried critical machines through to morning.
- There was a problem with Big Brother paging for certain services. This
turned out to have been caused by an improper definition of BBPAGER in the
bb-hosts file.
- Got the Arc/Info licenses to work on Willow. The problem turned out to
be that they sent us the wrong licenses in the first place back in March.
- Ojai crashed with a 'machine check' error at 14:19 on Sunday. Fixed the
disks and Nick restarted the batch jobs.
- Did some more performance tuning on the mail server. The current kernel
will support a maximum of 4,116 processes, and it peaked at 58 processes
while sending out mail from the 4.6 and 3.7 events on Monday morning.
Based on this, the bulk_mailer was set to maxdomains=1. This should double
the number of processes generated, but should improve mail throughput
speeds.
- Found out that the mail server allowed people to get the subscriber lists
for our mailing lists. Disabled this feature in the list configurations. A
check of the logs showed that nobody had actually used this feature.
- There was a problem in Big Brother that prevented it from sending pages
for certain problems. This turned out to have been caused by a typo in the
BBPAGER definition in the bb-hosts file on some machines.
- Investigated a problem with redi email to paging. Certain redi messages
were failing and bouncing the email back to Berkeley. This turned out to
be caused by a message ending with a "{" character interacting with the
shell script that sends the page. The script was fixed to allow this
character in messages.
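
This class of bug comes from letting message text be interpreted by a shell. The sketch below shows two general ways to avoid it, written in Python for illustration; it is not the fix that was applied to the paging script. The function names are hypothetical, and the Python interpreter stands in for the pager command in the demo.

```python
import shlex
import subprocess
import sys


def send_page(pager_cmd, message):
    """Run pager_cmd with the message as one literal argument."""
    # Passing a list with the default shell=False means no shell ever parses
    # the message, so characters such as "{" carry no special meaning.
    return subprocess.run(
        [*pager_cmd, message], capture_output=True, text=True, check=True
    )


def quoted_for_shell(message):
    """Return the message quoted for inclusion in a POSIX shell command line."""
    return shlex.quote(message)


if __name__ == "__main__":
    # Stand-in "pager" that simply echoes its argument back.
    echo = [sys.executable, "-c", "import sys; print(sys.argv[1])"]
    print(send_page(echo, "M3.7 event {").stdout.strip())
    print(quoted_for_shell("M3.7 event {"))
```

Where the message must pass through a shell, quoting it as a single word keeps a trailing brace or similar character from being treated as shell syntax.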
- Installed the UdmSearch and ht://Dig search engines on Bort for testing.
After some testing, installed ht://Dig on Ehzsouth for more testing.
- Got ht://Dig 3.2.0b2 to build. This is a beta release of version 3.2, and
includes phrase matching capabilities. After testing and conferring with
Lisa, we decided that it was a good search engine, and I installed it on
Ehzsouth and Terra10.
***
- Jet and Spring were crashing every 3-10 days. Installed kernel patch
105181-21 and it appears to have fixed the problem.
- Got the 9-track tape drive fixed so that Kate can get a summer student to
assist with reading the old Cedar tapes.
- Made a number of security changes to the Solaris systems. Also wrote up
notes on system security measures at
http://bort.gps.caltech.edu/stan/security
- Installed and configured bulk_mailer on Eqinfo to increase its
performance in sending out earthquake mail messages.
- Installed and configured the ht://Dig 3.2.0b2 search engine on Ehzsouth
and Terra10. This is a full-featured open-source search engine available
from http://www.htdig.org.