The Mac Pro uses Error Correcting Code (ECC) memory modules. ECC memory modules store extra check bits alongside the data: typically 8 check bits for every 64 bits of data, which works out to one extra bit for every byte (8 bits). These check bits allow a single-bit error within each 64-bit word to be detected and corrected. Multi-bit errors can generally be detected as well, but cannot be corrected, and cause a kernel panic (crash). See my Jan 24, 2006 blog entry for more on ECC memory.
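To make the detect-and-correct idea concrete, here is a minimal sketch of a textbook extended Hamming code: 8 check bits protecting a 64-bit word (one check bit per byte of data on average). This is an illustration only. The code actually used by the Mac Pro's memory controller is not described here and may well differ, and the names `encode` and `decode` are my own.

```python
# Textbook SECDED (single-error-correct, double-error-detect) sketch.
# Assumption: a (72,64) extended Hamming layout; real memory controllers
# may use a different code. All names here are illustrative.

PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]            # 7 Hamming check bits
DATA_POSITIONS = [p for p in range(1, 72) if p not in PARITY_POSITIONS]
# Position 0 holds an overall parity bit, giving 8 check bits in total.


def encode(word):
    """Encode a 64-bit integer into a list of 72 bits."""
    bits = [0] * 72
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (word >> i) & 1
    # Each check bit makes the parity of its group of positions even.
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, 72):
            if pos & p:
                parity ^= bits[pos]
        bits[p] = parity
    # The overall parity bit makes the total number of 1 bits even.
    bits[0] = sum(bits[1:]) % 2
    return bits


def decode(bits):
    """Return (word, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    # The syndrome is the XOR of the positions of all set bits;
    # it is zero for a valid codeword.
    syndrome = 0
    for pos in range(1, 72):
        if bits[pos]:
            syndrome ^= pos
    overall = sum(bits) % 2

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < 72:
        # Odd number of flips: assume one, located at the syndrome position
        # (a syndrome of 0 means the overall parity bit itself flipped).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips, or an impossible position: detect only.
        return None, "uncorrectable"

    word = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return word, status


original = 0xDEADBEEFCAFEF00D
stored = encode(original)

one_flip = list(stored)
one_flip[37] ^= 1                 # simulate a single-bit memory error
print(decode(one_flip)[1])

two_flips = list(stored)
two_flips[5] ^= 1                 # simulate a double-bit memory error
two_flips[60] ^= 1
print(decode(two_flips)[1])
```

With one flipped bit the word comes back intact and is reported as corrected; with two flipped bits the error is detected but the data cannot be recovered, which is the case that leads to a kernel panic.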
Memory errors can be caused by excessively high operating temperatures and/or cosmic rays (cosmic rays are considerably more of a problem at high altitudes, making one wonder why laptops don’t use ECC memory; they are often used on airplanes at 30,000 feet). Poor-quality memory chips are also more prone to random errors than high-quality memory.
Today, with diglloydTools run-stress-test running on 9GB of memory, I left for the office. I had intended an 8-hour test, but I inadvertently started it at the default run time of 1 hour. So it ran for an hour, then the Mac Pro idled (without sleeping) for another 9 hours in a closed room. The ambient temperature upon my return, as measured by the Mac Pro itself and displayed by Temperature Monitor, was 84° F. Apple alludes to ambient temperatures up to 95° F for FB-DIMM modules in Technical Note TN2156, and my test was 11° F below that temperature, hardly an extreme test.
Upon my return, I checked the memory status. This is what you don’t want to see: ECC memory errors (but at least they were all correctable).
Rebooting did not make the errors go away; after reboot DIMM A2 still reported 88339 “ECC Correctable Errors”. However, cooling the room down by about 10° F was sufficient to make all ECC Errors vanish. The wonderful thing about ECC memory is that instead of a mysterious system crash, the system just keeps working. No data corruption, and no system crash. Good stuff.
The 2GB module in question is one of the SATech modules previously discussed. I will be asking for a replacement of that module. It’s hard to say if the non-compliant heat sinks on the 2GB modules are part of the problem, but I don’t want to use memory that has any ECC errors.