I tried to run apt update and it failed. The disk was mounted as read-only. I remounted it read-write, updated, and thought nothing of it. Then I decided to investigate why it was read-only in the first place.
What I found changes everything.
The Discovery
Installed smartmontools to check the disk health:
sudo apt install smartmontools
sudo smartctl -a /dev/sda
The overall health assessment said PASSED. Cool, right? Except the SMART attributes told a very different story:
| Attribute | Value | What it means |
|---|---|---|
| Current_Pending_Sector | 16 | 16 sectors the disk can't read, waiting to be reallocated |
| Reallocated_Event_Count | 77 | 77 remap events: times the disk moved (or tried to move) data from a failing sector to a spare area |
| ATA Error Count | 68 | 68 read/write errors logged by the disk |
| Raw_Read_Error_Rate | 393216 | Vendor-encoded raw value, so not a literal count of errors; worth watching alongside the other attributes |
| Power_On_Hours | 18929 | ~2.16 years of total runtime |
The disk has been silently dying for who knows how long. It has already logged 77 reallocation events, remapping failing sectors to spare areas. 16 more sectors are failing right now and waiting in line. And 68 read/write errors are logged in the disk's error history.
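To pull just the sector-health numbers out of the full report, something like the following could work. It is a minimal sketch that assumes the usual ten-column layout of `smartctl -A` output, with the raw value in the last column; `smart_key_attrs` is a name made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `smartctl -A` output on stdin and prints the
# name and raw value of the attributes that track failing sectors.
# Assumes the standard column layout (ID, name, flag, value, worst,
# thresh, type, updated, when_failed, raw value).
smart_key_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable)$/ {
        print $2, $10
    }'
}

# Usage (needs root and smartmontools):
#   sudo smartctl -A /dev/sda | smart_key_attrs
```

Any of these raw values climbing over time is a stronger warning sign than the one-word overall health verdict.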
The Read-Only Mystery Solved
Remember the crashes from Chapter 11? The "power outage or kernel freeze" that left no trace in the logs? And the read-only filesystem I stumbled into today?
It was the disk. The whole time.
The journal from this boot tells the story:
EXT4-fs (sda1): orphan cleanup on readonly fs
EXT4-fs (sda1): mounted filesystem ... ro with ordered data mode
Here's what happens: the disk hits a bad sector while writing. The kernel gets an I/O error. ext4's safety mechanism kicks in: the errors=remount-ro mount option tells the filesystem, "if you encounter an error, go read-only to prevent data corruption." The server stays "running" but can't write anything: no logs, no temp files, and services start silently failing. Eventually enough things break that it looks like a crash.
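A quick way to confirm this state on a live system is to look at the mount options. The sketch below assumes the `/proc/mounts` format (device, mount point, type, comma-separated options); `is_mounted_ro` is a hypothetical helper name, not an existing tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given mount point is mounted read-only.
# $1 = mount point, $2 = mounts table to read (defaults to /proc/mounts).
is_mounted_ro() {
    local mountpoint=$1 table=${2:-/proc/mounts}
    awk -v mp="$mountpoint" '
        $2 == mp {
            ro = 0
            n = split($4, opts, ",")          # 4th field holds the mount options
            for (i = 1; i <= n; i++) if (opts[i] == "ro") ro = 1
        }
        END { exit (ro ? 0 : 1) }' "$table"
}

# Usage:
#   is_mounted_ro / && echo "root filesystem is read-only"
#   sudo tune2fs -l /dev/sda1 | grep -i 'errors behavior'   # error policy stored in the superblock
#   sudo dmesg | grep -i 'EXT4-fs'                          # kernel messages from ext4
```

Note that an `errors=` option in /etc/fstab takes precedence over the default stored in the superblock, so it is worth checking both.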
This also explains why the "crash experiment" with the battery gave inconclusive results. The battery DID help with power flickers (uptime went from 3 days to 7 days), but the disk errors were happening independently. Two different problems, overlapping symptoms.
The Hitachi HTS545050A7E380
The disk is a Hitachi Travelstar Z5K500, a 500GB, 5400RPM, 2.5" laptop drive. It's a spinning disk with physical platters and a read/write head that moves across them. After 18,929 hours of operation and nearly 30,000 start/stop cycles, it's wearing out.
The SMART status says "PASSED" because the threshold for failure hasn't been crossed yet. But the numbers are trending in one direction: more bad sectors, more errors, more reallocations. It's not a question of if it will fail, but when.
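To see how far each attribute sits from its failure threshold, the normalized VALUE and THRESH columns can be compared directly. This is a rough sketch under the assumption that `smartctl -A` prints the standard ten columns; `smart_margins` is a made-up name, and the drive's firmware, not this script, decides the actual PASSED/FAILED verdict.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `smartctl -A` output on stdin and prints, for
# every attribute row, the normalized value, the vendor threshold, and the
# gap between them. A margin near zero means the attribute is close to
# tripping the overall health check.
smart_margins() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        printf "%-26s value=%d thresh=%d margin=%d\n", $2, $4, $6, $4 - $6
    }'
}

# Usage (needs root and smartmontools):
#   sudo smartctl -A /dev/sda | smart_margins
```

A large margin next to a growing raw count is exactly the "PASSED but dying" situation described above: the normalized score has not dropped yet even though the raw numbers are getting worse.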
"The disk says it's fine. The disk is lying. The disk has always been lying. The disk is a laptop hard drive with 19,000 hours on it and 77 reallocated sectors. The disk is not fine."
Immediate Action: Backup Everything
Before the disk gets worse, I backed up every configuration file, script, key, and certificate on the server:
sudo tar -cpzf /tmp/server-backup.tar.gz \
/etc/nginx \
/etc/openvpn \
/etc/fail2ban \
/etc/default/sslh \
/etc/systemd/system \
/etc/systemd/logind.conf \
/etc/default/grub \
/etc/network/interfaces \
/etc/ssh/sshd_config \
/etc/ufw \
/etc/webhook.conf \
/usr/local/bin \
/usr/local/etc/xray \
/var/www/mosearc.eu \
/root/.ssh \
/var/lib/deploy/.ssh \
/home/mose/openvpn-install.sh \
/etc/update-motd.d \
/etc/pam.d/sshd
Pulled it off the server to my regular computer immediately. The website content is already on GitHub (thank you, auto-deploy webhook from Chapter 6), but the configs, keys, and certificates aren't. If the disk dies completely, I need to be able to rebuild.
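Since the archive was written by a disk that is actively failing, it seems worth verifying it before trusting it. A minimal sketch, assuming GNU tar, gzip, and sha256sum are available; `verify_backup` is a hypothetical helper, and the `user@server` host in the usage notes is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the archive is a readable gzip'd tar,
# then prints its SHA-256 so the copy on another machine can be compared.
verify_backup() {
    local archive=$1
    gzip -t "$archive" || return 1                 # compressed stream is intact
    tar -tzf "$archive" > /dev/null || return 1    # every tar entry can be listed
    sha256sum "$archive"
}

# Usage:
#   verify_backup /tmp/server-backup.tar.gz          # on the server
#   scp user@server:/tmp/server-backup.tar.gz .      # on the other machine (placeholder host)
#   sha256sum server-backup.tar.gz                   # hashes should match
```

Matching hashes only prove the copy is identical to what the server produced; the `gzip -t` and `tar -t` steps are what catch an archive that was already damaged when it was written.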
The Plan
The fix is obvious: replace the disk. And this time, not with another spinning drive, but with an SSD. A Samsung 870 EVO 250GB costs around €30 and would be a massive upgrade:
- No moving parts: no mechanical failure, no bad sectors from physical wear
- Faster read/write: the server would boot in seconds instead of minutes
- Lower power consumption: matters on a laptop running 24/7
- No vibration sensitivity: the loose Ethernet port won't cause I/O errors from desk vibrations
- Silent: no more hard drive clicking sounds from the basement
Until I get the SSD, I'm monitoring the disk daily:
sudo smartctl -H /dev/sda
If it stops saying PASSED, the drive could fail completely at any moment. The 16 pending sectors will likely grow, and each one is a spot on the disk where data could be lost without warning.
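Because the overall verdict can stay at PASSED while sectors keep failing, tracking the pending count from day to day may be more informative. The following is a sketch only: `check_pending_growth` and the state file path are hypothetical, and the usage line assumes the raw value sits in the tenth column of `smartctl -A` output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compares today's pending-sector count with the last
# recorded one. $1 = current count, $2 = state file chosen by the caller.
# Records the new value, and returns 1 with a warning if the count grew.
check_pending_growth() {
    local current=$1 state=$2 previous=$1   # first run: nothing to compare against
    [ -f "$state" ] && previous=$(cat "$state")
    echo "$current" > "$state"
    if [ "$current" -gt "$previous" ]; then
        echo "WARNING: pending sectors rose from $previous to $current"
        return 1
    fi
    echo "OK: pending sectors at $current (was $previous)"
}

# Usage (for example from a daily cron job running as root):
#   count=$(smartctl -A /dev/sda | awk '$2 == "Current_Pending_Sector" { print $10 }')
#   check_pending_growth "$count" /var/tmp/pending_sectors.last
```

Keeping the state file on a different disk or machine would be safer, since the failing drive may be read-only at the moment the check runs.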
Looking Back
I've spent months debugging crashes on this server. Power outages, battery issues, C-State freezes, charger problems: I investigated every possible cause. And the whole time, the disk was slowly dying underneath everything else.
The read-only remount was actually the system protecting itself. ext4's errors=remount-ro is a safety mechanism that prevents a failing disk from corrupting data. The filesystem detected the I/O errors and went into lockdown mode. It's the reason I still have my data.
But it also means every "unexplained crash" might have been a disk error. The logs stop abruptly not because the power went out, but because the kernel couldn't write to disk anymore. The server was technically still running, just silently broken.
This is the most important lesson from the entire self-hosting journey so far: always check the hardware. I spent hours configuring software, hardening security, and building notification systems, and the thing that almost killed the server was a physical disk with bad sectors. smartctl should have been one of the first things I installed.
What's Next
- Replace the disk with an SSD (URGENT)
- UPS (the power/charger issue is still real, just not the only problem)
- V2Ray mobile client setup
- Containerization with Docker
- Self-hosted drive with NAS
- VPS reverse proxy
- Mail server
- Local LLM