A Hermes "thermal event"

The few days before Christmas 2003 were slightly more exciting than usual for the Hermes admins. On Sunday 14 December, a disk failed in the old Hermes machine called green. The machine was tweaked so that it no longer used the disk, but otherwise it remained in service.

This turned out to be a bad idea, because a week later the failed disk started crashing the rest of the machine, which had to be hurriedly taken out of service. David fixed green by removing the disk, but didn't restore it to service because it was the Monday before Christmas and it could wait until after the holidays.

This turned out to be a good idea, because on the evening of Monday 22 December the old Hermes machine called orange died suddenly. David renamed green to take the place of orange, which kept Hermes working properly until the problem could be investigated the next morning.

On Tuesday 23rd we went into the machine room to discover the ominous smell of burnt electronics. Something had gone seriously wrong with orange: it was not a power supply or a disk failure, and there was an unusual panic message:

    Dec 22 21:20:01 orange.csi.cam.ac.uk unix: WARNING: [AFT0] Stickpanic: ptl1 trap reason 0x2
    TL=0x1 TT=0x68 TICK=0x80071d0045836cb8
            TPC=0x1002a4c8 TnPC=0x10134ce0 TSTATE=0x80001e03
    TL=0x2 TT=0x68 TICK=0x80071d0045836c4e
            TPC=0x10006aa4 TnPC=0x10006aa8 TSTATE=0x4480001504
    Softerror encountered on Memory Module 1703

    panic[cpu0]/thread=40033e40: Kernel panic at trap level 2

    10406180 unix:sys_tl1_panic+8 (200000, 33e80, 200000, 40032040, 1e, 14)
      %l0-7: 00000003 00001c00 80001e03 10006c34 00000001 00000000 0000000f 104061e0
    10406270 SUNW,UltraSPARC-II:cpu_ce_scrub_mem_err+4c (0, 6, 360fa640, 40032040, d813bf00, 0)
      %l0-7: 00000000 7ae321b0 80001e01 101329f0 00000001 00000000 0000000e 40032710
    40031fe0 SUNW,UltraSPARC-II:cpu_ce_error+1a4 (0, 0, 0, 0, 0, 0)

    RED State Exception

After the holidays, when we called out Sun's on-site support, we discovered how badly the machine had broken, and how lucky we had been that it didn't set off the fire alarm (taking the rest of the machine room with it). The Sun engineer said it was the worst "thermal event" he had seen.

In mid-April, I received a phone call from a Sun UK manager saying that Sun were upset by this web page and would like it to be taken down. Although it was down for a while, I have put it back since there is no reason to be embarrassed about a machine failing after five years of heavy use. And my boss likes this page better than he likes Sun.

Pictures of the failed machine:

Tony Finch <fanf2@cam.ac.uk>

