2020-03-06
Things Like SD Cards and Code Do Fail

A cautionary tale about failure upon failure, some of them my fault, some not, that brought the home automation system down and later made it hard to bring it back online.

Table of Contents

  1. How the Home Automation System Ended up on an Orange PI PC 2
  2. Catastrophic System Failure
  3. Failure upon Failure
  4. The Solution - Start Over
  5. Yet More SD Card Woes
  6. Lessons Learned

How the Home Automation System Ended up on an Orange PI PC 2

About 16 months ago, I attempted to convert an Amlogic S912 Android TV box into a Linux server, hoping to install my home automation system on it. Unfortunately, that was a partial failure because Alexa device discovery in conjunction with ha-bridge by bwssystems did not work. Unable to determine the source of the problem, I put the converted TV box aside even though everything else worked quite well. Indeed, it would have been possible to continue to run ha-bridge on the small Orange Pi Zero as before (see HA Bridge on Armbian Working with Domoticz and Alexa). However, I really wanted to run all the servers for the home automation system on one machine with a gigabit Ethernet connection to the local network. The solution was to use the Orange Pi PC 2, which is rather similar to a Raspberry Pi 3B except for its 1 Gbit/s Ethernet connection.

The home automation system has been running on an Orange Pi PC 2 (OPiPC2) with an unsupported Armbian 22.05.3 operating system, itself based on Ubuntu 22.04.1 Jammy. Not unexpectedly, the single-board computer has had no problem handling Domoticz, ha-bridge, WireGuard and nginx continuously since July 2022. Furthermore, Alexa plays nicely with the system. It was to be a temporary fix until either the problem with the Android TV box / Linux appliance was resolved or the whole thing was moved to a more robust x86_64 server I have slowly been constructing. Since it had been running without problems, the whole thing was on the back burner. Funny how temporary fixes turn into permanent fixtures.

Catastrophic System Failure

The system broke down a few days ago. Domoticz had been sending e-mail notifications every 10 minutes warning that the OPiPC2 was running hot, but I was not reading e-mails that afternoon, so I remained unaware. Then I noticed that a WiFi switch (built with a W600-PICO) was not working properly. On checking, the Domoticz Web interface could not be used to switch lights on or off reliably. I looked at the MQTT messages to and from Domoticz and saw that the garage door controller was flooding the broker with "garage door open" and "garage door closed" messages in very quick succession. The garage door had been opened. After manually closing the garage door, the "open" and "closed" messages kept on being delivered. Unplugging the IoT device stopped that. Then I went about rebooting the OPiPC2, fully expecting that the home automation system would function properly again, minus the garage door WiFi controller. Of course, it didn't, otherwise there would be no need for this post!
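A flood like this can be spotted by counting the messages from one device over a sliding time window. The sketch below is illustrative only and was not part of the system described here; the class name FloodDetector and the thresholds are assumptions, and the wiring to an MQTT client is left out.

```python
from collections import deque


class FloodDetector:
    """Flag a device that reports more than max_events within window_seconds.

    Hypothetical helper: the name and default thresholds are assumptions.
    """

    def __init__(self, max_events=10, window_seconds=60.0):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one message at timestamp (in seconds); return True if flooding."""
        self._timestamps.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_events
```

One detector per device would be fed the arrival time of each state message, for example from time.monotonic(), and a True result could trigger a notification or cause further updates from that device to be ignored.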

After some poking at the system, looking at journal entries for the most part, it became obvious that the µSD card holding the operating system and the home automation database was defective. There are many warnings about the unreliability of these storage devices but, frankly, I had not had significant failures with SD cards on the Raspberry Pi that hosted the home automation system for years before. That is true even though I often used no-name cards from unknown suppliers.

In retrospect, the failure was explainable. Each bogus change in the status of the garage door required that the database be updated since it contains a log of the state of each device. So Domoticz was attempting to update an approximately 2.5 MB database file (domoticz.db) many times per second. In all probability, the domoticz.db-wal file (a SQLite "write-ahead log"), which is just as big and often much bigger, was being updated even more frequently. No wonder that the SoC was getting hot. One can only guess at the number of read and write cycles that were imposed on the poor little 8 GB µSD card in the machine. The e-mail record shows that this had been going on for about three hours before I pulled the plug.
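To put a rough number on that guess, here is a back-of-the-envelope sketch. Only the three-hour duration comes from the e-mail record. The update rate of 5 per second and the 4096 bytes written per update (one typical SQLite page) are assumptions chosen for illustration, not measurements.

```python
def estimate_writes(updates_per_second, hours, bytes_per_update=4096):
    """Return (number_of_updates, megabytes_written) for a sustained write storm.

    bytes_per_update defaults to 4096, an assumed lower bound of one
    database page per update.
    """
    updates = int(updates_per_second * hours * 3600)
    megabytes = updates * bytes_per_update / 1_000_000
    return updates, megabytes


if __name__ == "__main__":
    # Assumed rate of 5 updates per second over the 3 hours in the e-mail record.
    updates, megabytes = estimate_writes(updates_per_second=5, hours=3)
    print(f"{updates} updates, at least {megabytes:.0f} MB written")
```

With these assumed values the result is 54000 updates and roughly 221 MB. The real wear on the card was probably higher, since flash storage erases and rewrites in blocks that are much larger than a single database page.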

Failure upon Failure

Mama didn't raise no fool; I had a backup µSD card, which I just inserted into the OPiPC2. Then I copied my last backup of the database from the desktop to the OPiPC2 and restarted the home automation system. It worked with almost the exact same setup as before except for changed IR codes for an IR blaster. No big deal, they are easily obtained. Besides, I had actually saved the new codes in my Tasmota configuration document. So after updating that bit and ascertaining that the system worked, the next order of business was to make a backup copy of the updated backup card, which was now the main card. The venerable dd program, used to make copies of SD cards, failed because of read errors. Since it was a micro SD card in an SD card reader, I tried with different adapters since these are known to wear out. No luck.
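When a card has unreadable blocks, a plain copy stops at the first error. The following sketch shows the general idea of copying whatever can be read and zero-filling the rest, similar in spirit to dd with conv=noerror,sync. It is not what was done here, the function name is hypothetical, and a dedicated tool such as GNU ddrescue is a better choice for real recovery work.

```python
import io


def copy_skipping_bad_blocks(src, dst, block_size=4096):
    """Copy src to dst block by block, zero-filling blocks that cannot be read.

    src must be a seekable binary file object. Returns the list of byte
    offsets of the blocks that raised OSError.
    """
    total_size = src.seek(0, io.SEEK_END)
    bad_offsets = []
    offset = 0
    while offset < total_size:
        want = min(block_size, total_size - offset)
        try:
            src.seek(offset)
            data = src.read(want)
        except OSError:
            data = b""
            bad_offsets.append(offset)
        # Pad failed or short reads so offsets in the copy match the source.
        dst.write(data.ljust(want, b"\0"))
        offset += want
    return bad_offsets


if __name__ == "__main__":
    # In-memory stand-in; a real source would be something like
    # open("/dev/sdX", "rb", buffering=0), where /dev/sdX is a placeholder.
    source = io.BytesIO(b"example data")
    target = io.BytesIO()
    print(copy_skipping_bad_blocks(source, target, block_size=4))
```

The returned offsets give an idea of how much of the image is damaged, which helps decide whether the copy is worth keeping.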

The Solution - Start Over

There was no way around this. I reinstalled Armbian, version 23.8 Bookworm CLI (Aug 31, 2023), on a new 16 GB SanDisk Ultra A1 card. This was a very smooth operation: all the services worked out of the box and systemctl reported that there were no failed services. Not bad at all considering that this is an up-to-date, community-supported-only OS for an orphaned single-board computer.

      ___  ____  _   ____   ____ ____
     / _ \|  _ \(_) |  _ \ / ___|___ \
    | | | | |_) | | | |_) | |     __) |
    | |_| |  __/| | |  __/| |___ / __/
     \___/|_|   |_| |_|    \____|_____|

    Welcome to Armbian 23.8.1 Bookworm with Linux 6.1.53-current-sunxi64

    No end-user support: community creations

    System load:   2%               Up time:       6 min
    Memory usage:  11% of 984M      IP:            192.168.1.22
    CPU temp:      39°C             Usage of /:    12% of 15G
    RX today:      159.2 MiB

    [ General system configuration (beta): armbian-config ]

It did not take long to install the minimum packages to get going.

One reason this was easy is that I was following my own instructions in Home Automation System on a Raspberry Pi. There were small problems to solve (a missing library, finding the latest version of a package and so on) but nothing major that required much time. However, the biggest time saver was that I had backups of the Domoticz database and the ha-bridge database and configuration file as well as all my scripts. It was not even necessary to update Alexa with the dreaded device discovery procedure; ha-bridge worked just as before.

To be fair, there were some missing pieces. While I had the scripts, there was no backup of the crontab jobs that ran a few of them. But the defective backup µSD card could still be read and I was able to recover almost everything I needed.
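Saving the crontab alongside the scripts closes that gap. This is a minimal sketch, assuming the jobs live in the current user's crontab; the function name and the destination path are hypothetical.

```python
import subprocess
from pathlib import Path


def backup_crontab(destination, command=("crontab", "-l")):
    """Save the output of `crontab -l` to destination.

    Returns the number of characters written. Raises
    subprocess.CalledProcessError if the command fails, which is what
    happens when the user has no crontab at all.
    """
    result = subprocess.run(list(command), capture_output=True, text=True, check=True)
    path = Path(destination)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path.write_text(result.stdout)
```

Calling backup_crontab("backups/crontab.txt") from the regular backup script would keep a copy that can later be reinstalled with crontab backups/crontab.txt.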

Some things still need to be added, but the home automation system was running again the same day it came crashing down, which reduced the inconvenience to others in the household. Of course, just before officially putting the system back online, I copied the µSD card to another SanDisk card and kept the newly created original card as a backup.

After that I did spend a couple of hours working out a better backup strategy based on some of the lessons learned. I hope to update an old post on that subject in a few weeks.

Yet More SD Card Woes

That is not the end of this tale. Why not attempt yet again to use the S912 TV box as a Linux appliance to run the home automation system? I got off on the wrong foot and, in frustration, reinstalled Android on the machine. In retrospect, that was not necessary and it was actually easy enough to get a very recent version of Armbian on the machine.

        _              _      ____  ___  _ ____
       / \   _ __ ___ | |    / ___|/ _ \/ |___ \
      / _ \ | '_ ` _ \| |____\___ \ (_) | | __) |
     / ___ \| | | | | | |_____|__) \__, | |/ __/
    /_/   \_\_| |_| |_|_|    |____/  /_/|_|_____|

    Welcome to Armbian 23.11.0-trunk Lunar with Linux 6.1.60-ophub

    No end-user support: unsupported (lunar) userspace!

    System load:   3%               Up time:       18:14
    Memory usage:  18% of 1.88G     IP:            192.168.1.22
    CPU temp:      45°C             Usage of /:    16% of 15G
    storage/:      1% of 29G        RX today:      363.6 MiB

    [ General system configuration (beta): armbian-config ]

According to systemctl, there was only one failed service. The solution was just to disable it because it was complaining that it could not find a drive with SMART technology... of course not. Unfortunately, while proceeding with the installation of the needed packages as done on the OPiPC2, the following error occurred:

    [ 1000.028355] mmc1: Card stuck being busy! __mmc_poll_for_busy
    [ 1001.272246] mmc1: card never left busy state
    [ 1001.275371] mmc1: tried to HW reset card, got error -110
    [ 1001.279241] mmcblk1: recovery failed!
    [ 1001.282875] I/O error, dev mmcblk1, sector 1294288 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 2
    [ 1001.292677] EXT4-fs error (device mmcblk1p2): __ext4_find_entry:1682: inode #40076: comm deb-systemd-hel: reading directory lblock 0
    [ 1001.293623] mmc_erase: group start error -110, status 0x0
    [ 1001.295895] Aborting journal on device mmcblk1p2-8.
    [ 1001.298754] I/O error, dev mmcblk1, sector 888832 op 0x3:(DISCARD) flags 0x0 phys_seg 1 prio class 2
    [ 1001.300065] mmc_erase: group start error -110, status 0x0
    [ 1001.337648] I/O error, dev mmcblk1, sector 1413160 op 0x3:(DISCARD) flags 0x0 phys_seg 1 prio class 2
    [ 1001.353364] mmc_erase: group start error -123, status 0x0
    [ 1001.355056] I/O error, dev mmcblk1, sector 3039128 op 0x3:(DISCARD) flags 0x0 phys_seg 1 prio class 2
    [ 1001.364049] mmc1: card 0007 removed
    [ 1001.364067] JBD2: I/O error when updating journal superblock for mmcblk1p2-8.
    [ 1001.374833] EXT4-fs (mmcblk1p2): I/O error while writing superblock
    [ 1001.381007] EXT4-fs (mmcblk1p2): Remounting filesystem read-only

From then on, the system was caught in a loop of file system errors and was unusable.

    [ 1008.292448] EXT4-fs error (device mmcblk1p2): __ext4_find_entry:1682: inode #40183: comm vnstatd: reading directory lblock 0
    ...
    [ 1008.988527] EXT4-fs error (device mmcblk1p2): __ext4_get_inode_loc_noinmem:4605: inode #21: block 354: comm gmain: unable to read itable block
    ...
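Kernel messages like the ones above can be tallied with a few lines of Python to see which device and which kind of operation is failing. This is only a sketch: the regular expression is written against the "I/O error, dev ..." lines quoted in this post and may need adjusting for other kernel versions, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# "I/O error, dev mmcblk1, sector 1294288 op 0x0:(READ) flags 0x3000 ..."
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+) op 0x[0-9a-f]+:\((\w+)\)")


def summarize_io_errors(log_text):
    """Count block-layer I/O errors per (device, operation) in kernel log text."""
    counts = Counter()
    for device, _sector, operation in IO_ERROR.findall(log_text):
        counts[(device, operation)] += 1
    return counts
```

The text passed in could be the saved output of dmesg or journalctl -k. A count that keeps growing for the device holding the root file system is a strong hint that the card, the reader or the driver is in trouble.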

Device mmcblk1 was the µSD card holding the OS, a Verbatim 16 GB card. I tried again with another Verbatim 16 GB card, this time a Premium one. Surprisingly, the same thing happened, albeit at a different time while reading a different sector. What are the odds that two µSD cards would be damaged? I did a quick search in ophub's amlogic-s9xxx GitHub repository thinking that the device tree might have been changed. The problem was raised in ext4-fs error #889. My understanding of the discussion between ophub and ermac500 is that the latter attempted to scan and repair the SD card (with either badblocks and fsck in Linux or chkdsk in Windows) but that this messed up the image. Not wanting to waste time, I tried a third µSD card, another SanDisk of the same type used with success on the OPiPC2. This third attempt worked and the home automation system has been running on the converted TV box for a few days without problem.

Lessons Learned