A cautionary tale about failure upon failure, some of them my fault, some not, that brought the home automation system down and later made it hard to bring it back online.
Table of Contents
- How the Home Automation System Ended up on an Orange PI PC 2
- Catastrophic System Failure
- Failure upon Failure
- The Solution - Start Over
- Yet More SD Card Woes
- Lessons Learned
How the Home Automation System Ended up on an Orange PI PC 2
About 16 months ago, I attempted to convert an Amlogic S912 Android TV box into a Linux server, hoping to install my home automation system on it. Unfortunately, that was a partial failure because Alexa device discovery in conjunction with ha-bridge by bwssystems did not work. Unable to determine the source of the problem, I put the converted TV box aside even though everything else worked quite well. Indeed, it would have been possible to continue to run ha-bridge on the small Orange Pi Zero as before (see HA Bridge on Armbian Working with Domoticz and Alexa). However, I really wanted to run all the servers for the home automation system on one machine with a gigabit Ethernet connection to the local network. The solution was to use the Orange Pi PC 2, which is rather similar to a Raspberry Pi 3B except for its 1 Gb Ethernet connection.
The home automation system has been running on an Orange Pi PC 2 (OPiPC2) with an unsupported Armbian 22.05.3 operating system, itself based on Ubuntu 22.04.1 Jammy. Not unexpectedly, the single-board computer has had no problem handling Domoticz, ha-bridge, WireGuard and nginx constantly since July 2022. Furthermore, Alexa plays nicely with the system. It was to be a temporary fix until either the problem with the Android TV box / Linux appliance was resolved or the whole thing was moved to a more robust X86_64 server I have slowly been constructing. Since it had been running without problems, the whole thing was on the back burner. Funny how temporary fixes turn into permanent fixtures.
Catastrophic System Failure
The system broke down a few days ago. Domoticz had been sending e-mail notifications every 10 minutes warning that the OPiPC2 was running hot, but I was not reading e-mails that afternoon so I remained unaware. Then I noticed that a WiFi switch (built with a W600-PICO) was not working properly. On checking, I found that the Domoticz Web interface could not be used to switch lights on or off reliably. I looked at the MQTT messages to and from Domoticz and saw that the garage door controller was flooding the broker with "garage door open" and "garage door closed" messages in very quick succession. The garage door had been opened. After I manually closed the garage door, the "open" and "closed" messages kept on being delivered. Unplugging the IoT device stopped that. Then I went about rebooting the OPiPC2, fully expecting that the home automation system would function properly again, minus the garage door WiFi controller. Of course, it didn't, otherwise there would be no need for this post!
After some poking at the system, looking at journal entries for the most part, it became obvious that the µSD card holding the operating system and the home automation database was defective. There are many warnings about the unreliability of these storage devices, but frankly, I had not had significant failures with SD cards on the Raspberry Pi that hosted the home automation system for years before. That is true even though I often used no-name cards from unknown suppliers.
In retrospect, the failure was explainable. Each bogus change in the status of the garage door required that the database be updated since it contains a log of the state of each device. So Domoticz was attempting to rewrite an approximately 2.5 MB database file (domoticz.db) many times per second. In all probability, the just as big, and often much bigger, domoticz.db-wal file (a SQLite "write-ahead log") was being updated even more frequently. No wonder that the SoC was getting hot. One can only guess at the number of read and write cycles that were imposed on the poor little 8 GB µSD card in the machine. The email record shows that this had been going on for about three hours before I pulled the plug.
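To get a feel for the scale of the abuse, here is a back-of-the-envelope sketch. It takes the premise above at face value, namely that the whole 2.5 MB file is rewritten on every update (SQLite normally writes individual pages, so this is an upper bound). The rate of 5 updates per second is purely hypothetical since the only thing known is "many times per second". The function name is made up for this illustration.

```python
def estimated_mb_written(file_size_mb: float, writes_per_second: float, hours: float) -> float:
    """Total megabytes written if the whole file is rewritten on every update."""
    seconds = hours * 3600
    return file_size_mb * writes_per_second * seconds


# Assumptions: 2.5 MB file and 3 hours come from the post,
# 5 writes per second is a hypothetical rate chosen for illustration.
total_mb = estimated_mb_written(file_size_mb=2.5, writes_per_second=5, hours=3)
card_size_gb = 8
print(f"about {total_mb / 1000:.0f} GB written, "
      f"or {total_mb / 1000 / card_size_gb:.1f} times the capacity of the card")
```

Whatever the real rate was, the point stands: a few hours of this kind of traffic can write many times the capacity of a small card, and that ignores the write-ahead log.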
Failure upon Failure
Mama didn't raise no fool; I had a backup µSD card which I just inserted into the OPiPC2. Then I copied my last backup of the database from the desktop to the OPiPC2 and restarted the home automation system. It worked with almost the exact same setup as before except for changed IR codes for an IR blaster. No big deal, they are easily obtained. Besides, I had actually saved the new codes in my Tasmota configuration document. So after updating that bit and ascertaining that the system worked, the next order of business was to create a backup SD card of the updated backup SD card that was now the main card. The venerable dd program used to make copies of SD cards failed because of read errors. Since it was a micro SD card in an SD card reader, I tried with different adapters since these are known to wear out. No luck.
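When dd gives up on read errors, it can be useful to know where on the card the unreadable regions are. The following is only a sketch of that idea, not what I ran at the time: it reads a device or image file chunk by chunk and records the offsets of chunks that raise an I/O error. The function name and the chunk size are choices made for this example, and reading a raw device such as a (hypothetical) /dev/sdX usually requires root.

```python
import os


def find_unreadable_chunks(path: str, chunk_size: int = 1024 * 1024) -> tuple[int, list[int]]:
    """Read `path` chunk by chunk; return (bytes_read, offsets_of_failed_chunks)."""
    bad_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a regular file or a block device on Linux.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            try:
                data = os.pread(fd, min(chunk_size, size - offset), offset)
                bytes_read += len(data)
            except OSError:
                # The kernel reported an I/O error for this chunk; note it and move on.
                bad_offsets.append(offset)
            offset += chunk_size
    finally:
        os.close(fd)
    return bytes_read, bad_offsets


# Hypothetical usage on a card in a USB reader:
#   bytes_read, bad = find_unreadable_chunks("/dev/sdX")
```

A list of bad offsets does not rescue the card, but it does show whether the damage is confined to one area or spread all over.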
The Solution - Start Over
There was no way around this. I reinstalled Armbian 23.8 Bookworm CLI (Aug 31, 2023) on a new 16 GB SanDisk Ultra A1 card. This was a very smooth operation: all the services worked out of the box and systemctl reported that there were no failed services. Not bad at all considering that this is an up-to-date, community-supported-only OS for an orphaned single-board computer.
It did not take long to install the minimum packages to get going.
- avahi-daemon - zero-configuration networking mDNS
- mosquitto - MQTT broker
- Python 3 virtual environments - for some Python scripts
- OpenJDK (jre) - Java run-time
- ha-bridge - Philips Hue Emulator
- domoticz - home automation server
One reason this was easy is that I was following my own instructions: Home Automation System on a Raspberry Pi. There were small problems to solve, a missing library, finding the latest version of a package and so on, but nothing major that required much time. However, the biggest time saver was the fact that I had backups of the domoticz database and the ha-bridge database and configuration file as well as all my scripts. It was not even necessary to update Alexa with the dreaded device discovery procedure; ha-bridge worked just as before.
To be fair, there were some missing pieces. While I had the scripts, there was no backup for the crontab jobs that ran a few scripts. But the defective backup µSD card could be read and I was able to recover almost everything I needed.
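Both of the items that mattered here, the SQLite database and the crontab, can be captured by a small script. The sketch below is one possible approach and not the backup procedure I actually use. It relies on the online backup API in Python's sqlite3 module, which takes a consistent copy even while the database is in use, and on the output of crontab -l. The function names and any paths are hypothetical.

```python
import sqlite3
import subprocess


def backup_sqlite(source_path: str, dest_path: str) -> None:
    """Copy a possibly live SQLite database to dest_path with the online backup API."""
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(dest_path)
    try:
        src.backup(dst)  # consistent snapshot, includes changes still in the WAL
    finally:
        dst.close()
        src.close()


def dump_crontab(dest_path: str) -> bool:
    """Save the current user's crontab to dest_path; return False if nothing was saved."""
    try:
        result = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
    except FileNotFoundError:
        return False  # no crontab command on this machine
    if result.returncode != 0:
        return False  # typically means the user has no crontab
    with open(dest_path, "w") as f:
        f.write(result.stdout)
    return True


# Hypothetical usage; the real paths depend on the installation:
#   backup_sqlite("/home/user/domoticz/domoticz.db", "/mnt/backup/domoticz.db")
#   dump_crontab("/mnt/backup/crontab.txt")
```

Copying domoticz.db with a plain file copy while Domoticz is running risks missing whatever is still sitting in the write-ahead log, which is why the backup API is used in this sketch.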
Some things still need to be added, but the home automation system was running again the same day it came crashing down, which reduced the inconvenience to others in the household. Of course, just before officially putting the system back online, I copied the µSD card to another SanDisk card and kept the newly created original card as a backup.
After that I did spend a couple of hours working out a better backup strategy based on some of the lessons learned. I hope to update an old post on that subject in a few weeks.
Yet More SD Card Woes
That is not the end of this tale. Why not attempt yet again to use the S912 TV box as a Linux appliance to run the home automation system? I got off on the wrong foot and in frustration reinstalled Android on the machine. In retrospect that was not necessary and it was actually easy enough to get a very recent version of Armbian on the machine.
According to systemctl there was only one failed service. The solution was just to disable it because it was complaining that it could not find a drive with SMART technology... of course not. Unfortunately, while proceeding with the installation of the needed packages as done on the OPiPC2, the following error occurred
From then on, the system was caught in a loop of file system errors and was unusable.
Device mmcblk1 was the µSD card on which the OS was stored, a Verbatim 16 GB card. I tried again with another Verbatim 16 GB card, but this time a Premium card. Surprisingly, the same thing happened, albeit at a different time while reading a different sector. What are the odds that two µSD cards would be damaged? I did a quick search in ophub's amlogic-s9xxx GitHub thinking that the device tree might have been changed. The problem was raised in ext4-fs error #889. My understanding of the discussion between ophub and ermac500 is that the latter attempted to scan and repair the SD card (with either badblocks and fsck in Linux or chkdsk in Windows) but that messed up the image. Not wanting to waste time, I tried a third µSD card, another SanDisk of the same type used with success on the OPiPC2. This third attempt worked and the home automation system has been running on the converted TV box for a few days without problem.
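A simple way to gain some confidence in a card before installing anything on it is a write-then-verify pass. The sketch below illustrates the principle and is not a replacement for dedicated tools such as badblocks or f3. It writes reproducible pseudo-random data to a file on the mounted card, reads it back and reports the offsets of chunks that do not match. The function name, the seed and the sizes are arbitrary choices for this example.

```python
import os
import random


def write_and_verify(path: str, total_bytes: int,
                     chunk_size: int = 1024 * 1024, seed: int = 1) -> list[int]:
    """Write pseudo-random data to `path`, read it back, return offsets of bad chunks."""
    rng = random.Random(seed)
    with open(path, "wb") as f:
        remaining = total_bytes
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(rng.randbytes(n))
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # push the data out of the write buffers

    # Re-create the same pseudo-random stream and compare it with what is on the card.
    rng = random.Random(seed)
    mismatches = []
    offset = 0
    with open(path, "rb") as f:
        while offset < total_bytes:
            n = min(chunk_size, total_bytes - offset)
            if f.read(n) != rng.randbytes(n):
                mismatches.append(offset)
            offset += n
    return mismatches


# Hypothetical usage with the card mounted at /mnt/sdcard:
#   bad = write_and_verify("/mnt/sdcard/test.bin", 4 * 1024**3)
```

One caveat: the read-back may be served from the operating system's cache instead of the card itself, so a serious test should unmount and remount the card (or test more data than there is RAM) between the write and the verification.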
Lessons Learned
- The quality of the storage medium on which the OS is installed does matter.
The Armbian Getting Started guide is very clear about that. The forum is full of responses that start with checking the SD card.
- The quality of SD cards is not easily ascertained beforehand.
The "no-name" BINFUL SD card bought from Aliexpress that was running on the OPiPC2 for 16 months and, perhaps, for many months before on a Raspberry Pi did very well. Can't say the same thing for the Verbatim SD cards purchased locally from a well-known North American chain store named after an office consumable.
- Purchase class A1 or A2 SD cards.
This is discussed in Getting Started - How to prepare a SD card. Read the more recent entries in the SD card performance forum post, especially the conclusion if in a hurry.
- Test any µSD card as soon as it is purchased and also before putting it into service.
If the card has been used before, restore it to factory default and then perform all tests. Once again this is covered in the Getting Started - How to prepare a SD card guide, although there is no help for those of us that use Linux.
- Resolve programming errors.
The immediate cause of this series of failures was my code on the ESP8266 module monitoring the status of the garage door. The code should be rewritten to assume that the sensor (i.e. a contact switch) is faulty if the state of the garage door alternates very quickly. It is easy to write code that works when everything functions as expected; it is another matter to anticipate and correctly handle possible errors and hardware failure.
- Redundancy can be good.
About a month ago, the garage monitor went wild as described above. At the time, I had repaired the source of the problem: a bad solder connection for the pull-up resistor on the signal line from the contact switch. A repair that was not well executed, clearly. Those failures highlight that the ESP8266 had no other way of determining the state of the door, which explains why it opened a closed door. That would not happen if the garage door opener had separate open and close connections, but it only has a toggle connection. There is an obvious need for a second sensor to confirm that the door is open before toggling the garage door close/open switch with the hope that the door will indeed close.
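The idea of treating a rapidly alternating contact switch as a faulty sensor can be sketched as follows. This is an illustration in Python and not the firmware running on the ESP8266; the class name, the threshold of 4 changes and the 10 second window are assumptions picked for the example. A garage door cannot physically open and close several times within a few seconds, so more state changes than that within the window most likely point to a bad sensor or connection.

```python
from collections import deque


class FlapDetector:
    """Flag a sensor as faulty when it changes state too often within a time window."""

    def __init__(self, max_changes: int = 4, window_s: float = 10.0):
        self.max_changes = max_changes
        self.window_s = window_s
        self._changes = deque()   # timestamps of recent state changes
        self._last_state = None

    def update(self, state: bool, now: float) -> bool:
        """Record a reading taken at time `now` (seconds); return True if the sensor looks faulty."""
        if self._last_state is not None and state != self._last_state:
            self._changes.append(now)
        self._last_state = state
        # Forget state changes that are older than the window.
        while self._changes and now - self._changes[0] > self.window_s:
            self._changes.popleft()
        return len(self._changes) > self.max_changes


# A door that "opens" and "closes" every 100 ms trips the detector.
detector = FlapDetector()
readings = [(True, 0.0), (False, 0.1), (True, 0.2), (False, 0.3), (True, 0.4), (False, 0.5)]
faulty = [detector.update(state, t) for state, t in readings]
print(faulty[-1])
```

Once the detector reports a fault, the firmware could stop publishing door status messages and stop acting on the sensor until it reads steadily again, which would have spared both the MQTT broker and the µSD card.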