- There was a problem over the weekend where the Simpson Map on the
Pasadena server did not update for over 24 hours. Checked, and found a
problem with cnssm that gave a corrupt catalog. The problem is detailed at
http://bort.gps.caltech.edu/stan/mail-archive/msg00011.html
- The UPS detected a 4-second power failure on Sunday morning. It sent an
SNMP trap about this, which was converted to a page. It did not send a
'problem has cleared' message because it is set to only send traps in
response to severe error conditions. The power coming back on is
considered to be an informational message.
- Removed Phil from the sendmail alias files on the Trinet online systems.
- The squid process died unexpectedly on Ehzsquidpas on Tuesday. Had to
restart it by hand. The daily run the next day showed the squid process
dying on signal 6 (SIGABRT).
- Changed the root password on Willow.
- There was a problem with the network on Wednesday afternoon. This turned
out to be Nathan Lee moving some connections around. Since he was there, we
moved the ITS network equipment on the second floor of S. Mudd to the UPS
circuit outlets.
- Stan Silverman installed Apache on Ehzsouth. I changed the configuration
to make ncweb-south a virtual server. This is so that the other servers on
that machine can also be virtual servers under a single instance of Apache.
- Ran checkbot against the earthquake.usgs.gov and www.trinet.org web
servers.
- Took some memory from Nickel to put in Sylmar.
- The USGS Holiday Party was held on Friday. Pictures are on the web at
http://pasadena.wr.usgs.gov/stans/xmas2000/
- Moved Stannum to Patrick's new office.
- Moved Nitya into Mandy's new office and renamed it Dacite.
- Turned off Nickel for the time being, as there is nobody using it.
- Enabled softupdates on Bort for testing. This is a method for speeding
up the FreeBSD file system. This is recommended for Squid servers and web
servers. It is described at
ftp://ftp.freebsd.org/pub/FreeBSD/FreeBSD-current/src/sys/ufs/ffs/README.softupdates
- Drew up a proposal for fixing our use of the RAID-5 arrays by setting up
a temporary wave pool area for the active file to be written to before
being moved to the main RAID storage. The proposal is at
http://bort.gps.caltech.edu/stan/raid-proposal.html
- The S. Mudd UPS reported a 4-second power outage at about 05:15 on
Wednesday.
- Put additional memory in Dacite.
- Investigated the '403 - Forbidden' errors reported by some users on the
National Web Site. It appears that some users' browsers are attempting to
use the Squid servers as regular proxy servers to get parts of our site.
Changed the httpd_accel_host to 'earthquake.usgs.gov' and added an entry to
the hosts file to point that to one of the back-end servers. This avoids
the looping that leads to a 403 error.
- Set up a password for Alex Bittenbinder to access the Trinet Internal web
pages.
- Talked with Egill and Ken about moving Ken's rack to the telemetry room.
- Monitor.pl on Spring was reporting 'high CPU usage' on Thursday evening.
There was no apparent problem, and Big Brother and the Spring status graphs
at http://bort.gps.caltech.edu/trinet/spring.html all looked normal. The
false alarm was caused by the fact that monitor.pl uses vmstat to check for
CPU usage, and there is a known bug in vmstat on Solaris that sometimes
causes it to return negative numbers. This caused monitor.pl to misread
the CPU usage as high.
- There were two events, M4.3 and M4.1, on Saturday evening. These events
set a record for web activity, with a peak of 134 hits/sec on the USGS
server. Then KCBS showed the shakemap URL on TV, and traffic on the Trinet
server jumped to 233 hits/sec for a time. There is a report on this at
http://bort.gps.caltech.edu/spikes/13jan2001/
- Moved Ehzsouth into my office so that Ken can move his rack to the
telemetry room.
- While the Pasadena and Trinet servers performed well after the weekend's
events, the Menlo Park web server was very slow, and many people were
served '403 Forbidden' errors. Got copies of the logs and analyzed them.
Their peak hit rate was around 70 hits/sec. The problem was that above
about 23 hits/sec, their servers slowed down, and so ehzdns started sending
the excess traffic to Ehzmenlo, which is not configured as a public server.
It then doled out the 403 errors. Sent them graphs of their traffic. They
are considering ways to increase their capacity for the future.
- Fixed the Send-A-Page on Ehzsouth to use SNPP. This was necessary due to
the machine having moved away from the phone line it used for dialup.
- Added a link to CIIM from the Earthquake Commentary pages on Trinet.
- Set up an access password for Raven to use to get to the Trinet Intranet
web pages.
- Changed the sed hack to remove the 'Fault' lines from public email to
also remove 'Fault Zone' lines.
- Experimented with a perl script to add a URL for topozone.com to the
internal event emails.
- Monitor.pl gave some more spurious CPU alerts on Wednesday night. These
were caused by the same problem with vmstat. Patrick disabled the CPU test.
- Eviscerated monitor.pl on Thursday. Disabled all disk space, process,
and network checking. The only checks it is still performing are the 'too
many stations timed out' and 'check_dwp' checks.
- Set up a script for Big Brother to change the list of processes monitored
depending on whether Jet or Spring is primary. The script is
/opt/bb/etc/setrole.
- Monitor.pl gave some more spurious warnings on Friday morning. 'Error
256 from check_dwp' errors were caused by a link in the QUG directory that
pointed back at the same directory. This confused the check_dwp procedure.
Removed the link to fix it.
- Helped Jim Goltz and Margaret adapt the Jan 13 web service report into
an article for the ERA newsletter.
- Received the new hardware from Ken Ou to build a new, faster web
server for Trinet. Also arranged to order three new PC workstations for
the Timers to use.
- Installed the new HP laser printer in the Yellow House at 131.215.66.102.
- Called Jack Popejoy at KFWB about their 'Quake Report' web page. It was
skimming the Los Angeles map off of the Menlo Park web server, and during
the crush after the Jan 13 events, the page was coming up blank. Told him
to have them skim it from our web server instead.
- Mandy reported problems on Galena. It looks like the disk may be going
bad. Opened it up and reseated all the disk connections. Also tried fsck
in single-user mode. Still reported errors.
- The Trinet Squid server developed an apparent hardware instability. It
crashed repeatedly on Monday, Tuesday, and Wednesday.
- Finished configuring the new Athlon machine to replace the Trinet Squid
server. Had to upgrade the BIOS to set it to reboot after a power failure.
Installed it at 13:30 on Wednesday.
- Fixed a problem with the web group on Ehzmenlo. Added it as a secondary
group for users 'lisa' and 'golden'.
- Hooked up the tape drive for Sue Hough.
- Attended the Trinet meeting and also a meeting to discuss the RAID
situation. Decided to look into a pure hardware solution and find out how
much it would cost.
- Got the three new Timer machines.
- Configured the first new Timer machine as node 'scree'.
- Performed USGS outreach at the Doubletree Hotel Earthquake Safety Fair on
Friday.
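
The spurious 'high CPU usage' alerts above came from monitor.pl acting on
negative numbers returned by vmstat. One way to guard against this is to
discard any sample whose CPU columns are out of range before comparing it to
a threshold. The following is only a sketch in Python, not the monitor.pl
code. It assumes that the last three columns of a vmstat data line are the
user, system, and idle percentages. The function names and the 90% threshold
are hypothetical.

```python
from typing import Iterable, Optional


def cpu_busy_percent(vmstat_line: str) -> Optional[int]:
    """Return user+system CPU % from one vmstat data line.

    Returns None when the line is not usable. Assumes the last three
    columns are user, system, and idle percentages (an assumption about
    the vmstat output layout, not taken from monitor.pl).
    """
    fields = vmstat_line.split()
    if len(fields) < 3:
        return None
    try:
        user, system, idle = (int(f) for f in fields[-3:])
    except ValueError:
        return None  # header line or other non-numeric output
    # Discard negative or out-of-range values instead of alerting on them.
    if any(v < 0 or v > 100 for v in (user, system, idle)):
        return None
    return user + system


def should_alert(samples: Iterable[str], threshold: int = 90) -> bool:
    """Alert only if there is at least one valid sample and all valid
    samples exceed the (hypothetical) threshold."""
    valid = [b for b in map(cpu_busy_percent, samples) if b is not None]
    return bool(valid) and all(b > threshold for b in valid)
```

With this check, a sample with a negative idle value is ignored rather than
reported, so a run of bad vmstat output alone cannot trigger a page.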
***
- The USGS Pasadena office web server set new records for web traffic after
two earthquakes, M4.3 and M4.1, were felt in Los Angeles on Saturday,
January 13. The peak activity was 233 requests/sec. This load is nearly
500 times the normal level of activity on the web site. The servers
performed well, and all the traffic was served successfully.
- Ordered new workstations for the Timers to use, and started building a
new, faster web server for Trinet. Even though the current servers
performed well after the January 13 events, the continued growth of the
Internet means that traffic levels will increase in the future.
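
Peak rates such as the ones quoted in this report can be estimated from a web
server access log by counting the requests that share each one-second
timestamp. The sketch below shows one way to do this in Python. It is not the
tool that was used for the analysis here, which is not recorded. It assumes
log lines in Common Log Format, with the timestamp in square brackets, and the
function name is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple

# Matches the bracketed timestamp of a Common Log Format line,
# for example [13/Jan/2001:18:26:14 -0800].
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def peak_hits_per_second(lines: Iterable[str]) -> Optional[Tuple[str, int]]:
    """Return (timestamp, hits) for the busiest second, or None if no
    line contains a timestamp."""
    counts: Counter = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:
            # Timestamps have one-second resolution, so each distinct
            # timestamp string is one one-second bucket.
            counts[match.group(1)] += 1
    if not counts:
        return None
    return counts.most_common(1)[0]
```

To use it, pass an open log file, for example
peak_hits_per_second(open("access_log")), where the file name is a
placeholder.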