The Disk Is Dying - Arcaro Mosè Portfolio

I tried to run apt update and it failed. The disk was mounted as read-only. I remounted it read-write, updated, and thought nothing of it. Then I decided to investigate why it was read-only in the first place.

What I found changes everything.

The Discovery

Installed smartmontools to check the disk health:

sudo apt install smartmontools
sudo smartctl -a /dev/sda

The overall health assessment said PASSED. Cool, right? Except the SMART attributes told a very different story:

Attribute	Value	What it means
Current_Pending_Sector	16	16 sectors the disk can't read, waiting to be reallocated
Reallocated_Event_Count	77	77 reallocation events: failing sectors the disk moved (or tried to move) to spare areas
ATA Error Count	68	68 read/write errors logged by the disk
Raw_Read_Error_Rate	393216	Non-zero raw read errors; the raw encoding is vendor-specific, so this is a warning sign rather than a literal count
Power_On_Hours	18929	~2.16 years of total runtime

The disk has been silently dying for who knows how long. 77 sectors have already failed and been reallocated to spare areas. 16 more are failing right now and waiting in line. And 68 read/write errors are logged in the disk's error history.
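To keep an eye on just these numbers, the attribute table can be filtered. This is a minimal sketch, assuming the usual `smartctl -A` layout where the attribute name is the second column and the raw value is the tenth. The function name `watch_attrs` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print name and raw value for the sector-health attributes.
# Assumes the standard `smartctl -A` table: column 2 = name, column 10 = raw value.
watch_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable)$/ {
        print $2, $10
    }'
}

# Usage on the real disk (needs root):
#   sudo smartctl -A /dev/sda | watch_attrs
```

Saving that output with a date each day would show whether the counts are climbing, which matters more than any single reading.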

The Read-Only Mystery Solved

Remember the crashes from Chapter 11? The "power outage or kernel freeze" that left no trace in the logs? And the read-only filesystem I stumbled into today?

It was the disk. The whole time.

The journal from this boot tells the story:

EXT4-fs (sda1): orphan cleanup on readonly fs
EXT4-fs (sda1): mounted filesystem ... ro with ordered data mode

Here's what happens: the disk hits a bad sector while writing. The kernel gets an I/O error. ext4's safety mechanism kicks in: the errors=remount-ro mount option tells the filesystem, "if you encounter an error, go read-only to prevent data corruption." The server stays "running" but can't write anything: no logs, no temp files, and services start silently failing. Eventually enough things break that it looks like a crash.
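Both halves of that chain can be checked by hand. `findmnt -no OPTIONS /` shows whether the root filesystem is currently `ro` or `rw`, and `sudo tune2fs -l /dev/sda1 | grep -i 'errors behavior'` shows the default error policy stored in the superblock (an `errors=` option in /etc/fstab overrides it). The sketch below does the first check by reading a mounts table directly. The function name `mount_is_ro` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given mount point is mounted read-only.
# Reads a table in /proc/mounts format on stdin:
#   device mountpoint fstype options dump pass
mount_is_ro() {
    awk -v mp="$1" '
        $2 == mp { opts = $4; found = 1 }    # last matching entry wins
        END {
            if (!found) exit 2               # mount point not listed
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
            exit 1                           # listed, but not read-only
        }'
}

# Usage:
#   mount_is_ro / < /proc/mounts && echo "root filesystem is read-only"
```

The options are split on commas and compared exactly, so `errors=remount-ro` on a healthy read-write mount is not mistaken for `ro`.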

This also explains why the "crash experiment" with the battery gave inconclusive results. The battery DID help with power flickers (uptime went from 3 days to 7 days), but the disk errors were happening independently. Two different problems, overlapping symptoms.

The Hitachi HTS545050A7E380

The disk is a Hitachi Travelstar Z5K500, a 500GB, 5400RPM, 2.5" laptop drive. It's a spinning disk with physical platters and a read/write head that moves across them. After 18,929 hours of operation and nearly 30,000 start/stop cycles, it's wearing out.

The SMART status says "PASSED" because the threshold for failure hasn't been crossed yet. But the numbers are trending in one direction: more bad sectors, more errors, more reallocations. It's not a question of if it will fail, but when.
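Roughly speaking, PASSED means no attribute's normalized VALUE has dropped to its THRESH yet, which says nothing about how fast the raw counts are growing. This is a sketch for printing the remaining margin per attribute, assuming the standard `smartctl -A` columns (4 = VALUE, 6 = THRESH). The function name `smart_margin` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print each attribute's distance from its failure threshold.
# Assumes the standard `smartctl -A` table: column 4 = VALUE, column 6 = THRESH.
smart_margin() {
    awk '$1 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ {
        printf "%s %d\n", $2, $4 - $6    # normalized value minus threshold
    }'
}

# Usage on the real disk (needs root):
#   sudo smartctl -A /dev/sda | smart_margin
```

A large margin on a drive with pending sectors is exactly the "PASSED but dying" situation described here.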

"The disk says it's fine. The disk is lying. The disk has always been lying. The disk is a laptop hard drive with 19,000 hours on it and 77 reallocated sectors. The disk is not fine."

Immediate Action: Backup Everything

Before the disk gets worse, I backed up every configuration file, script, key, and certificate on the server:

sudo tar -cpzf /tmp/server-backup.tar.gz \
    /etc/nginx \
    /etc/openvpn \
    /etc/fail2ban \
    /etc/default/sslh \
    /etc/systemd/system \
    /etc/systemd/logind.conf \
    /etc/default/grub \
    /etc/network/interfaces \
    /etc/ssh/sshd_config \
    /etc/ufw \
    /etc/webhook.conf \
    /usr/local/bin \
    /usr/local/etc/xray \
    /var/www/mosearc.eu \
    /root/.ssh \
    /var/lib/deploy/.ssh \
    /home/mose/openvpn-install.sh \
    /etc/update-motd.d \
    /etc/pam.d/sshd

Pulled it off the server to my regular computer immediately. The website content is already on GitHub (thank you, auto-deploy webhook from Chapter 6), but the configs, keys, and certificates aren't. If the disk dies completely, I need to be able to rebuild.
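A sketch of the pull-and-verify step, assuming `sha256sum` is available on both machines. The host alias `server` and the helper name `verify_sha256` are placeholders, not part of the setup above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a file's SHA-256 matches the expected hex digest.
verify_sha256() {
    file=$1
    expected=$2
    actual=$(sha256sum "$file" | awk '{ print $1 }')
    [ "$actual" = "$expected" ]
}

# Usage from the regular computer (host alias "server" is a placeholder):
#   expected=$(ssh server sha256sum /tmp/server-backup.tar.gz | awk '{ print $1 }')
#   scp server:/tmp/server-backup.tar.gz .
#   verify_sha256 server-backup.tar.gz "$expected" && tar -tzf server-backup.tar.gz > /dev/null
```

Comparing checksums matters more than usual here, because the archive was written to the same failing disk it is meant to protect against. It also contains private keys, so it is worth deleting from /tmp once the copy is verified.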

The Plan

The fix is obvious: replace the disk. And this time, not with another spinning drive but with an SSD. A Samsung 870 EVO 250GB costs around €30 and would be a massive upgrade:

No moving parts: no mechanical failure, no bad sectors from head or platter wear (flash wears out too, but in a different way)
Faster read/write: the server would boot in seconds instead of minutes
Lower power consumption: matters on a laptop running 24/7
No vibration sensitivity: the loose Ethernet port won't cause I/O errors from desk vibrations
Silent: no more hard drive clicking sounds from the basement

Until I get the SSD, I'm monitoring the disk daily:

sudo smartctl -H /dev/sda

If it stops saying PASSED, the drive could fail completely at any moment. The 16 pending sectors will likely grow, and each one is a spot on the disk where data could be lost without warning.
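The daily check does not have to rely on reading the word PASSED. smartctl's exit status is a bitmask, documented in the EXIT STATUS section of its man page, and a few of its bits map directly onto the problems above. This is a sketch of decoding it. The function name `explain_smartctl_status` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn a smartctl exit status into readable warnings.
# Bit meanings follow the EXIT STATUS section of the smartctl man page.
explain_smartctl_status() {
    status=$1
    [ $((status & 8)) -ne 0 ]   && echo "SMART status: DISK FAILING"
    [ $((status & 16)) -ne 0 ]  && echo "prefail attribute at or below threshold"
    [ $((status & 64)) -ne 0 ]  && echo "error log contains errors"
    [ $((status & 128)) -ne 0 ] && echo "self-test log contains errors"
    return 0
}

# Usage (needs root; -q silent suppresses the normal report):
#   sudo smartctl -q silent -a /dev/sda; explain_smartctl_status $?
```

The usage line runs `-a` rather than `-H` because the log-related bits are only set when smartctl actually reads those logs.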

Looking Back

I've spent months debugging crashes on this server. Power outages, battery issues, C-State freezes, charger problems: I investigated every possible cause. And the whole time, the disk was slowly dying underneath everything else.

The read-only remount was actually the system protecting itself. ext4's errors=remount-ro is a safety mechanism that prevents a failing disk from corrupting data. The filesystem detected the I/O errors and went into lockdown mode. It's the reason I still have my data.

But it also means every "unexplained crash" might have been a disk error. The logs stop abruptly not because the power went out, but because the kernel couldn't write to disk anymore. The server was technically still running, just silently broken.

This is the most important lesson from the entire self-hosting journey so far: always check the hardware. I spent hours configuring software, hardening security, and building notification systems, and the thing that almost killed the server was a physical disk with bad sectors. smartctl should have been one of the first things I installed.

What's Next

Replace the disk with an SSD (URGENT)
UPS (the power/charger issue is still real, just not the only problem)
V2Ray mobile client setup
Containerization with Docker
Self-hosted drive with NAS
VPS reverse proxy
Mail server
Local LLM