Update: using an Apple Mac mini (so I don’t have to touch my production system), I’ve isolated the problem to Seagate 12TB enterprise drives in one particular OWC Thunderbay enclosure. They run fine for a while, but once the fault is triggered, it’s perma-failure in that enclosure unless it sits for many hours. According to a highly credible source, the cause is most likely a firmware issue on the Seagate drives (these are over a year old, so I have no idea whether current models are affected). The Toshiba 14TB drives have never failed me, even in the problem enclosure. However, I’ve only been using them a short while. Still, I could not provoke a failure when swapping them in for the Seagate drives.
I’ve gotten little done the past 3 days due to storage issues.
My primary RAID-5 and RAID-0 storage volumes kept going offline due to disk I/O errors. I thought it was a bad drive, since the error was always on the same drive. But after I replaced that drive, the failure simply moved to another drive (and then always that same drive, just as before). It was sporadic at first, but got progressively worse until I could provoke it within a few seconds.
Ultimately my main store was hosed badly enough for macOS to force it to read-only. At first SoftRAID kept rebuilding the RAID-5 successfully, but the faults eventually became so frequent that the rebuild could not complete (the whole bus was hosed). SMART status is/was OK for all drives and there are/were no remapped sectors.
At one point, the failures propagated to other devices, including six other (non-RAID) hard drives in the Thunderbay 6, and hosed a brand-new OWC Thunderblade so that it threw I/O errors when I tried to initialize it (I was able to fix it later on a 2016 MBP). Whatever hardware issue is going on is pretty darn scary, hosing the entire Thunderbolt bus. I strongly suspect that the whole problem is due to the firmware of the hard drives. It could also be the enclosure firmware, or a bad interaction between the two. More on that below.
The worst case (and a serious possibility) is Apple Core Rot, i.e., a bug in macOS. But it would have to be present in both 10.13 High Sierra and 10.14 Mojave.
Isolating the cause
A summary of just how much I did to isolate the issue:
- NOT a single bad drive: the array fails with the replacement drive installed too, and this time the failure is on one of the drives that was already there, not the replacement. That tells me it is not one defective unit.
- NOT computer specific (reproduced on 2017 iMac 5K and 2016 MacBook Pro)
- NOT cable specific (two 0.5m cables and one 2m cable tried).
- NOT unit specific (two different OWC Thunderbay 4 units and one OWC Thunderbay 6).
- NOT bay specific (swapped drive into another bay, error followed the drive).
- NOT macOS version specific; fails on both macOS 10.13 High Sierra and 10.14 Mojave (two different machines).
- NOT software specific: can provoke with a Finder copy or an "ic verify" (by sheer good luck, Carbon Copy Cloner did not provoke the issue, so I was able to make up-to-date backups).
- NOT a daisy chaining issue (direct connect, nothing else on that port).
- NOT an interaction with other peripherals (sole peripheral on the 2016 MacBook Pro)
- NOT a bad file system (Disk Utility gives clean bill of health, plus the errors are right off the drive).
- Could not reproduce the issue with a single drive, a 2-drive RAID-0 or a 3-drive RAID-0, only a 4-drive RAID-0 or RAID-5.
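The fault above was provoked by ordinary copy and verify workloads. As a rough illustration only (this is not the tool I used, and the function name and parameters are hypothetical), a write-then-verify load on a suspect volume might be sketched like this in Python:

```python
import hashlib
import os


def write_and_verify(target_dir, file_count=4, file_size=1 << 20, chunk=1 << 16):
    """Write random files into target_dir, read them back, compare SHA-256 digests.

    Hypothetical helper for illustration. Returns a list of (path, problem)
    tuples; an empty list means no fault was observed during this run.
    Caveat: small files may be read back from the OS cache rather than the
    drive, so a real test would use far more data than the machine has RAM.
    """
    failures = []
    for i in range(file_count):
        path = os.path.join(target_dir, f"stress_{i:04d}.bin")
        try:
            written = hashlib.sha256()
            with open(path, "wb") as f:
                remaining = file_size
                while remaining > 0:
                    block = os.urandom(min(chunk, remaining))
                    f.write(block)
                    written.update(block)
                    remaining -= len(block)
                f.flush()
                os.fsync(f.fileno())  # ask the OS to push data to the device
            read_back = hashlib.sha256()
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(chunk), b""):
                    read_back.update(block)
            if read_back.digest() != written.digest():
                failures.append((path, "checksum mismatch"))
        except OSError as exc:
            # Disk I/O errors surface here as OSError
            failures.append((path, f"I/O error: {exc}"))
        finally:
            # Best-effort cleanup of the test file
            try:
                os.remove(path)
            except OSError:
                pass
    return failures
```

Pointing it at a directory on the volume under test (for example a hypothetical `/Volumes/Test/stress`) and looping until the returned list is non-empty would approximate the "provoke it within a few seconds" behavior described earlier.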
I still don’t know for certain what the cause is. It might actually be the firmware on the hard drives, but I won’t name the suspect drives until I have some certainty; I would not want to blame unfairly. To be clear, they are NOT the Toshiba 14TB drives, which so far have performed flawlessly (and very quietly). Love 'em.
I expect to have a fresh set of Toshiba 14TB MG07ACA hard drives tomorrow, with which I can perform one more test to verify or disprove that theory: try to reproduce the problem with the Toshiba 14TB drives versus the problem drives. If I cannot reproduce the problem with the Toshiba drives, I will then try to reproduce it with the problem drives. I’ll do this several times, and if the Toshiba drives do not fail and the other ones do, then I will finally have an answer and a solution: get rid of the problem brand. If both fail, then I’ll have to blame the enclosure firmware.
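The decision logic of that swap test can be written down as a small sketch. The function name and the wording of the verdicts are hypothetical, and the last two branches (only the Toshiba drives fail, or nothing fails) are my own assumption about how to read outcomes the plan above does not spell out:

```python
def interpret_trials(toshiba_failed, suspect_failed):
    """Interpret repeated swap-test trials.

    Each argument is a list of booleans, one per trial, where True means an
    I/O failure was provoked with that drive set in the enclosure.
    """
    if not toshiba_failed or not suspect_failed:
        raise ValueError("need at least one trial per drive set")
    toshiba_bad = any(toshiba_failed)
    suspect_bad = any(suspect_failed)
    if suspect_bad and not toshiba_bad:
        # The outcome that would confirm the drive-firmware theory
        return "suspect drives at fault: get rid of the problem brand"
    if suspect_bad and toshiba_bad:
        # Both sets fail: blame shifts to the enclosure firmware
        return "both fail: blame the enclosure firmware"
    if toshiba_bad:
        # Assumption: not covered by the plan above
        return "only Toshiba failed: theory disproved, keep investigating"
    # Assumption: no failure provoked at all
    return "no failures: inconclusive, repeat the test"


# Example: three trials each, only the suspect drives ever fail
print(interpret_trials([False, False, False], [True, False, True]))
```

Using `any()` treats a single provoked failure as damning for that drive set, which matches how the fault behaves: once triggered, it stays triggered.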