We have just fixed rather a subtle bug in the reboot sequence for the Orions which I thought might be of interest.

When a Unix machine is in the air and thinking of going multi-user, it runs the shell script /etc/rc, which fscks the filesystems and starts up systems such as lpr, ftp and so on (go and have a look at it if you are not familiar with it). It contains lots of phrases like:

    if [ -f /etc/ftpd ]; then
        /etc/ftpd ; echo -n ' ftpd' >/dev/console
    fi
    if [ -f /etc/talkd ]; then
        /etc/talkd ; echo -n ' talkd' >/dev/console
    fi
    if [ -f /etc/syslog ]; then
        /etc/syslog ; echo -n ' syslog' >/dev/console
    fi

which test for the existence of a software system and start it up if it exists.

It was found that, quite often, the master daemons for the ftp system on 4.2BSD Orions were not running (producing the error message "Drat!" from fcp). When this was discovered in time, we looked at the output of "lastcomm" since the reboot and found that no p-go or q-go processes had died, so they had never been started up.

When I brought mars up with 4.2 for the first time, the list of things started up, as printed on the console during reboot, did not include ni-ftp. I then rebooted it immediately and it DID. Ugh? "Computers are deterministic" can go out of the window.

The relevant extract from /etc/rc:

    /etc/update; echo -n ' update' >/dev/console
    /etc/cron; echo -n ' cron' >/dev/console
    /etc/accton /usr/adm/pacct; echo -n ' accounting' >/dev/console
    #ukc ftp system
    if [ -f /etc/ni-ftp/P/p-go ]; then
        /etc/ni-ftp/P/p-go
        /etc/ni-ftp/Q/q-go
        echo -n ' niftp' >/dev/console
    fi
    /etc/mmdf.start; echo -n ' mail' >/dev/console

The output went straight from accounting to mail without starting ftp, but once the machine was up and running, ls -l /etc/ni-ftp/P/p-go showed that the file DID exist. I put an "else" clause into ftp's "if", and found that $?, which is set by the shell to the exit status of the last program run, was set to 9. Nine?!
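For anyone wanting to try the same trick, here is a rough sketch of that kind of diagnostic. The actual "else" clause was not kept, so this is a guess at its shape rather than a copy: the function name report_test is made up for illustration, and it writes to standard output instead of /dev/console so that it can be run anywhere.

```shell
# Hypothetical helper, not part of the original /etc/rc.
# It performs the same kind of existence test and, when the test fails,
# prints the exit status that the shell believes the test returned.
report_test() {
    if [ -f "$1" ]; then
        echo "present: $1"
    else
        # In the else branch, $? still holds the status of the failed test.
        echo "test for $1 failed, status $?"
    fi
}
```

On a well-behaved shell the failing case can only ever report status 1, which is what made the 9 so baffling.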
The source code for test (also known as "/bin/[") says that it can only exit with 0 or 1.

Some straws that got a good clutching:

- It sometimes happens on Lucy, mars, and merlin. They've all got Kennedy disk drives. I don't remember it happening on Gos, Falcon or kestrel, which have Fuji Eagle disks, but can't be sure. A faulty disk? Unlikely that it would have the same read error in the same place on three machines.
- Old, broken version of /bin/sh? Nope, it is the same on all Orions.
- Exit(9)? Is the "test" actually core dumping or something and leaving a random value as its exit code? 9? A dirt-encrusted bell clatters in the distance: mmdf uses exit 9 to indicate successful termination. Nope, mmdf is started *after* ftp.
- Is there a dangling else or a fi-less if somewhere? No, because it sometimes works.
- All the other comments in the script are preceded by a blank line. Really getting desperate now!
- Is the test actually being done, or is the shell failing to execute the [ and getting the exit code from something else? Seems unlikely, because all the other daemons are started by very similar code and they all work.

To cut a long story not quite so long, what was happening seems to be as follows.

Way, way back in the script, it tests for the existence of sendmail, an old mail system which we do not use here:

    if [ -f /usr/lib/sendmail ]; then
        (cd /usr/spool/mqueue; rm -f lf*)
        /usr/lib/sendmail -bd -q1h &
        echo -n ' sendmail' >/dev/console
    fi

/usr/lib/sendmail *does* exist, because it is one of the unused things which was never tidied up. Note the & which sets it off in the background. I tried executing the /usr/lib/sendmail -bd line by hand, and $status (the c-shell equivalent of $?) was 9!

The sendmail trundled off in the background, and /etc/rc carried on its way. The sendmail must have exited 9 just as the shell running /etc/rc set off the test for /etc/ni-ftp/P/p-go.
The shell waits for the test to complete, is informed of the completion of the sendmail instead, and thinks the test has failed.

Because the speed at which things happen during reboot depends only upon the processor speed and the speed of the disks, with no external random element such as network delays, the sendmail always ended at the same time as the test. That would explain why the Orions with Fuji disk drives didn't behave the same way as those with Kennedies: there, the sendmail exiting would coincide with the running of rc in a different place, either less harmful or less obvious.

BUGS: a combination of starting things off in the background in rc and the shell getting confused about who had just exited, made apparent by the mailer's strange ideas about exit stati and the omission to omit sendmail.

FIX: Don't start things off in the background in /etc/rc!

Sigh.
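As a footnote, here is a sketch of what one might do if something really had to be started in the background. This is not the fix we adopted (we simply stopped backgrounding things), and it assumes a shell whose "wait" accepts a process ID and reports that process's status reliably, which I have not checked on the 4.2BSD sh. The function odd_exit is a made-up stand-in for a program that finishes with status 9, as the sendmail did.

```shell
# Hypothetical stand-in for a program that exits with an odd status.
odd_exit() {
    sleep 1
    return 9
}

# Start it in the background, but remember its process ID ...
odd_exit &
bg_pid=$!

# ... so that its status is collected deliberately, by name,
# before any later test is run.
bg_status=0
wait "$bg_pid" || bg_status=$?

# No background child is left to exit at an awkward moment,
# so this test can only report its own status.
if [ -d / ]; then
    test_result=present
else
    test_result="absent, status $?"
fi
echo "background status: $bg_status, test: $test_result"
```

The point of the design is only that the odd exit status is picked up at a known place in the script, rather than whenever the child happens to finish.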