Yet Another Server Outage

Ok, so the server went down in flames and smoke just as I had left for a presentation in Moscow, Russia. As I came back home and started troubleshooting, I realized things were worse than I thought.

It was yet another hard drive problem – I was seeing superblock I/O errors all over the logs, and the server had remounted the file system read-only. I tried a new SSD, and then another (which I had ordered to my home while in Russia), and both showed the same symptoms. That made me realize it was not the drives at all: the SATA circuits on the motherboard were fried and giving me disturbing intermittent failures.

So, rummaging through the spares, I found a PCI SATA adapter (an Adaptec 1210 card) which would bypass the motherboard’s SATA circuits. No luck. The operating system would install fine on the drive attached to it, but the BIOS would not boot from that drive.

Two reasonable options remained. The first was to get a PCIe SATA adapter, different from the PCI one, and hope that the BIOS would be able to transfer control to a drive attached to it. The second would be to get an IDE-to-SATA adapter to feed the SSD off the ATA interface on the motherboard, which hopefully had a different circuit path.

The third and fourth options were increasingly arcane (such as running the system SSD over USB2) and just aimed at getting something running, rather than getting it running in a halfway decent way.

Anyway, the shops opened today at 10:00 Stockholm time, and the PCIe SATA adapter that I got worked after some trial, error, and reconfiguration. The server is now up and running, and I hope this was the root cause of the errors that have been affecting the site off and on since February.

Rick Falkvinge

Rick is the founder of the first Pirate Party and a low-altitude motorcycle pilot. He lives on Alexanderplatz in Berlin, Germany, roasts his own coffee, and as of right now (2019-2020) is taking a little break.

Discussion

  1. Good Work

    😀

  2. Askarel

    You also have the option of putting /boot on a separate drive, like a USB stick, and keeping the non-bootable PCI adapter for the root filesystem.

    If you still want to use the good old ATA connector, you might be interested in one of these: http://linitx.com/viewcategory.php?catid=1005 (I use one in my car computer, and it runs flawlessly. :-))

    1. Jon Severinsson

      You don’t even need to put /boot on a separate device; all you need is to put the boot sectors on something the BIOS will detect (USB stick, SD card, floppy drive, or even an old HD on the partially broken SATA interface).

      On startup, the BIOS will load GRUB from that boot device, and then GRUB will boot the kernel from your working hard drive on the PCI SATA controller. GRUB is only ~30 kB (depending on RAID/LVM/FS for /boot; on my ext4-on-RAID1 system it is 29916 bytes), so it is usually possible to boot from hardware that breaks down under even a moderately light load (such as booting the kernel).

      1. Askarel

        Sadly, it’s a bit more complicated than that: it’s not enough to put the boot sectors on a drive reachable by the BIOS. The boot sector uses the BIOS disk routines to read the remaining sectors containing your boot loader, and your boot loader also uses the BIOS disk routines to load drivers/kernel/initrd. All the pieces need to be reachable by the BIOS at boot time.

        The purpose of the ROM module on those storage add-on cards is to install the needed hooks so the BIOS can (at least) access the attached drives.

        GRUB 2 comes with some direct-access drivers that can get around BIOS limitations, but it still needs to be loaded and reassembled in memory using good ol’ BIOS routines.

  3. Cesar

    Sorry for saying this, Rick, but it is best for you to replace the entire server, instead of jury-rigging a way of bypassing the damaged parts.

    When something on the motherboard fries, it is usually either caused by problems in the power circuits, or causing problems in the power circuits. The power circuits are global to the machine (everything is fed from a single power supply, or a set of them in more expensive servers), so when a problem happens to them, it can affect the whole machine. The SATA ports might have been the first to fail, but other parts of the server can start failing at any moment (if they aren’t already failing invisibly, for instance by causing bit flips – did you run memtest86 to check the memory?).

    1. Rick Falkvinge

      I am painfully aware of this, but can’t afford a new server right now. Thanks for taking the time to tell me, though. Appreciated.

Comments are closed.
