This is the first of the "new and improved" hardware watchdogs for the Raspberry Pi single board computer. As the title hints, the watchdog is a Raspberry Pi also. Given the fact that Raspbian Buster will work on all models of the Raspberry Pi, any model could be the watchdog and any model could be the monitored server Pi. The plan was to use a Raspberry Pi Zero to keep the cost down. However, if the watchdog is to send an e-mail message when it reboots the server Pi, then it must be able to communicate over the network. In that case a Raspberry Pi Zero W is probably the most economical choice. Since I do not have a Pi Zero, an old model 1 Raspberry Pi is used as a proxy.
Wiring the Watchdog
To do better than the mining rig watchdog, our watchdog must be able to
properly shut down the server and perform a hard reset only if the shutdown
failed. That means that our watchdog will need two output connections to
initiate first a proper shutdown and then a reset if the shutdown failed. If
the shutdown is carried out, then the watchdog will need to trigger the `RUN`
signal of the server to complete the reboot. Of course, the
watchdog needs to be "fed" by the server. In other words, the watchdog needs
to monitor a heartbeat connection from the server. So, in addition to a
common ground between the two Raspberry Pi, three other connections must be
made as shown in the following schematic.
Connection | Watchdog | Server |
---|---|---|
heartbeat | GPIO17 (pin 11) - input | GPIO17 (pin 11) - output |
shutdown | GPIO27 (pin 13) - output | GPIO27 (pin 13) - input |
reset | GPIO25 (pin 22) - output | RUN - input |
ground | ground (pin 20) | ground (pin 20) |
The three connections should use general purpose input-output pins avoiding special purpose pins that could be used by either Pi for other peripherals. Since a real-time clock is connected via the hardware I2C bus to the server then GPIO2 and GPIO3 pins (I2C data and clock signals respectively) cannot be used for the hardware watchdog. Even with that proviso, there are many possible connections; the choices I made were dictated by aesthetic considerations: I thought the wiring diagram looked clean and simple and the resultant symmetry was pleasing.
There are many ground pins on the Raspberry Pi GPIO header. It does not
matter which is used to connect together the grounds of the two boards.
Except for the Raspberry Pi 3 A+ and B+, the little two-pin connector named
`P2`, `P6`, or `RUN`, or the three-pin connector named `RUN GLOBAL_EN` on the
Raspberry Pi 4, also has a ground pin that could be used. Do not, under any
circumstances, connect directly together pin 1 (3.3 volts) of the two
devices. Similarly, if the two Raspberry Pi are powered from different
sources, do not connect together pins 2 and 4 (5 volts) of the two devices.
A normally open push button is also placed across the shutdown and ground
connections. That way it is possible to manually initiate a shutdown of the
server Pi. The same is done across the reset and ground connections to
manually restart the server if everything else fails or to manually restart
the server Pi if it has been shut down with a `bash` command such as `halt` or `poweroff`.
Similarly, two normally open push buttons are connected to the watchdog Pi. The function of the first, labelled `power`, will be explained later. The push button connecting `RUN` to ground will reset the watchdog Pi if it becomes necessary. The `power` button is not used in the first, lean and mean watchdog, but it is a useful addition as explained below.
Server Setup
There are two parts to preparing the Raspberry Pi that will be monitored by the hardware watchdog. A service (daemon) must be created that regularly signals to the watchdog that everything is functioning correctly. Then the GPIO shutdown module has to be included in the kernel and must be bound to the shutdown GPIO pin. By activating that pin, the watchdog will be able to shut down the server in an orderly fashion.
Feeding the Watchdog
The server Pi has to "feed" the watchdog on a regular basis. It does this by toggling the state of its heartbeat output pin. I use a Python script to do this. First some requisite modules must be installed in the virtual Python environment for system utilities.
Virtual Python Environments: I prefer using virtual environments for Python development. There is more than one way of doing this; the one used here was described in an older post: Python 3 virtual environments. Lately, I have systematically created a virtual environment for "system" scripts as last described in the Working Directories section of a post on installing Raspbian. On `goldserver` that environment is called `.systempy`.
To activate the virtual environment, the command `ve` is used. It is a Bash alias for `source $1/bin/activate`, which means `ve .systempy` is the same as `source .systempy/bin/activate`.
It is very simple to create a `LED` object with `gpiozero` which nominally blinks a LED but which is in fact the required heartbeat.
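The original listing is not reproduced here, so what follows is only a minimal sketch of what such a script might look like. It assumes the `LED` class of `gpiozero` and its `blink` method, the heartbeat on GPIO17 as in the wiring table, and the two tenths of a second pulse every five seconds mentioned later in this post. The function name `feed_watchdog` and the constant names are my own.

```python
from signal import pause

HEARTBEAT_PIN = 17   # GPIO17 (header pin 11), as in the wiring table
ON_TIME = 0.2        # seconds the heartbeat line stays high
PERIOD = 5.0         # seconds from the start of one pulse to the next


def feed_watchdog(pin=HEARTBEAT_PIN, on_time=ON_TIME, period=PERIOD):
    """Pulse the heartbeat pin forever (needs gpiozero and a Raspberry Pi)."""
    # Imported here so that the file can be loaded on a machine without gpiozero.
    from gpiozero import LED

    heartbeat = LED(pin)
    # blink() does its work in a background thread by default.
    heartbeat.blink(on_time=on_time, off_time=period - on_time)
    pause()  # sleep until a signal such as Ctrl+C arrives
```

Calling `feed_watchdog()` at the end of the script starts the heartbeat. Apart from the imports, that really is three lines of actual code.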
Feeding the Watchdog the Hard Way
Disappointed that the script to feed the watchdog is so simple? If you are satisfied with just three lines of actual code, move on to the next section. If you want to do it the hard way, read on.
Real programmers don't use newfangled things like `gpiozero`. So, if you were following along in the previous section, just remove it from the virtual environment. Might as well remove `colorzero`, which was installed along with that module.
If you were not following along, then go ahead, start the virtual environment and install `RPi.GPIO` at this point.
No matter how you started, the virtual environment should now contain `RPi.GPIO`.
The idea is to create a timer that waits for a long while and then toggles the output pin to `HIGH` before returning it to `LOW`. Obviously, we want that cycle to repeat indefinitely. However, Python timers are simple one-shot things, simple compared to timers in Free Pascal which can be one-shot or can be made to repeat. Fortunately, right2clicky created an elegant `RepeatTimer` class, which is just what we need, by subclassing the `Timer` class (see: StackOverflow).
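For readers who do not want to chase the reference, here is a sketch of that idea: a subclass of `threading.Timer` whose `run` method loops until the timer is cancelled. The `toggleHeartbeat` body is my guess at the shape of the function, not the original listing, and `set_pin` is a hypothetical callable used so that the sketch can be tried without any GPIO hardware.

```python
import time
from threading import Timer


class RepeatTimer(Timer):
    """A Timer that calls its function every `interval` seconds until cancelled."""

    def run(self):
        # wait() returns False on time-out and True once cancel() sets the event.
        while not self.finished.wait(self.interval):
            self.function(*self.args, **self.kwargs)


def toggleHeartbeat(set_pin, pulse=0.2):
    """Raise the heartbeat line for `pulse` seconds, then lower it again."""
    set_pin(True)
    time.sleep(pulse)
    set_pin(False)
```

On the server Pi, `set_pin` could be something like `lambda level: GPIO.output(HEARTBEAT_PIN, level)` once the pin has been set up as an output with `RPi.GPIO`, and the heartbeat would be started with `RepeatTimer(5, toggleHeartbeat, args=(set_pin,)).start()`.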
Heartbeat Service
A LED could be connected to GPIO17 (header pin 11) to test either of these scripts. Don't forget the current limiting resistor and respect the polarity of the LED. With the second version, testing could be as simple as enabling the print statement in the `toggleHeartbeat` function. Make the script executable and then execute it and check that the LED flashes on for two tenths of a second every five seconds or that `toggleHeartbeat` is printed to the console every five seconds.
Press the Ctrl+C key combination to stop execution of the script. Notice how the virtual environment was deactivated, yet the script executed correctly even though the required Python modules are not installed in the default Python directory. That is because the "shebang" line, `#!/home/woopi/.systempy/bin/python`, at the start of the script informs the shell that the `.systempy` virtual environment Python interpreter in the `/home/woopi/.systempy/bin` directory is to be used, not the default Python interpreter in `/usr/bin`. It is important to adjust the shebang to the correct directory. If it is wrong, then `bash` will complain.
Do not forget to adjust the constant `HEARTBEAT_PIN = 17` to the correct GPIO pin if the watchdog and server were wired differently than how I did it. Those are about the only two possible errors, aside from some typo, of course. If the script is working correctly, you may want to write-protect it.
The script needs to be run automatically whenever the server Pi is booted up. One way is to create a `cron` task performed at each reboot.
And add the last line shown below.
While that is simple to put in place, it is preferable to run the script in the background as a daemon. Here is a basic `systemd` unit file for it.
Create that file as the super user and save it in the `/etc/systemd/system` directory. An easy way of doing this is by starting the `nano` editor, copying the file from above and pasting it in the editor.
Use the `systemctl` utility to start the daemon and then to enable it so that it will be automatically started when the server is rebooted.
The value of this approach is that it is just as simple to stop the daemon.
That will be a good way of testing the watchdog later on. Furthermore, it is easy to verify that the service is running properly.
Shutdown Module
The `gpio-shutdown` module has already been discussed at length in section 5 of a previous post: Warm and Cold restarts of the Raspberry. There is no need to rehash the subject. I made three changes to the configuration file `config.txt`.
Two changes, shown in blue, were optional. Nevertheless, it is comforting to see that they are compatible with the necessary addition of the `gpio-shutdown` module shown in red. The latter will monitor GPIO27 and will initiate an orderly shutdown whenever that pin is brought low either manually with the push button or by the watchdog. Optionally, the hardware I2C controller and the I2C driver for a hardware clock using the DS3231 chip are included in the device tree. Also optionally, the mini-UART is enabled. That makes it easier to see if an orderly shutdown is occurring or not. Once the watchdog is found to be working correctly, the UART will be disabled as it slightly slows down the system.
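The coloured listing is not available here. As a rough guide only, the three changes described above would typically be written as follows in `config.txt` on Raspbian Buster. This is a sketch based on the standard firmware overlays, not a copy of the original file, and the comments are mine.

```ini
# Required: orderly shutdown when GPIO27 is pulled low
dtoverlay=gpio-shutdown,gpio_pin=27

# Optional: hardware I2C bus and DS3231 real-time clock
dtparam=i2c_arm=on
dtoverlay=i2c-rtc,ds3231

# Optional: serial console, handy for watching a shutdown in progress
enable_uart=1
```

The `gpio-shutdown` overlay also accepts a `debounce` parameter, which is relevant to the minimum length of the shutdown pulse sent by the watchdog.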
That completes the changes that need to be made to the server.
Watchdog Setup
I will present three versions of the Python watchdog script. The first will emulate the hardware mining rig watchdog. This lean and mean version does take care of the problems associated with the mining rig watchdog without doing more. With the second version, the obedient watchdog, it will be possible to use its power button to shut down the server without the watchdog restarting it. In the final version the watchdog will bark, meaning it will log its actions and, when possible, send out e-mail notification when it reboots the server.
As mentioned in the introduction to this series of posts, an early Raspberry Pi is being used as a proxy for a Raspberry Pi Zero or Raspberry Pi Zero W which are better choices because of their size and price.
That was the last model 1 Raspberry Pi with only two USB ports and a 26 pin GPIO header. Like the Zero it has 512 Mbytes of RAM. Both have the same Broadcom system on a chip (BCM2835) with the same single-core ARM processor (ARM1176JZF-S). The Zero runs at 1 GHz while the model 1 has a lower clock speed of 700 MHz which can be overclocked.
The operating system is Raspbian Buster Lite (kernel 4.19) version 2019-09-26 to which only a few modifications have been made. The host name was changed to `wdog` while the default user remains `pi`. The virtual Python environment for system utilities such as the watchdog is named `.systempy`. For more details, see the post titled Installation and Configuration of Raspbian Buster Lite.
A Wi-Fi USB dongle makes life much easier because there is no simple way to connect to the local area network with Ethernet in the room where this experiment is being run. Happily, the dongle is based on the Realtek RTL8188CUS chip which is supported by Buster.
The Raspberry Pi Zero does not have conventional network capabilities, but I understand that it would be possible to open SSH sessions using a USB connection between the Raspberry Pi Zero and the desktop.
As always, the operating system was upgraded just before starting this project.
As more and more changes are made to the OS after the Raspberry Pi Foundation updates the download image, the update will take longer. Given that the image was four months old and the Raspberry Pi has a relatively under-powered processor, I had time for a quick lunch at this point. Do not forget the `-y` flag if you want this upgrade to proceed unattended.
Lean and Mean Watchdog
The aim here is to replicate the mining rig watchdog while overcoming its main drawbacks. To that end, this minimal watchdog will
- shut down the Raspberry Pi server properly if possible,
- minimize toggling of the `RUN` input, and
- let the Raspberry Pi server function properly even when it is itself off.
The watchdog will be implemented with a Python script. Again the `RPi.GPIO` module is a prerequisite that is added in the Python virtual environment for system utilities.
After deactivating the virtual environment, the script was created.
The script, renamed wdog_lm.py to distinguish the three versions, can be downloaded by clicking on the link, but here is a quick way to obtain the script, to rename it `wdog.py` and to make it executable.
If the virtual environment directory is not named `.systempy`, the above commands will have to be adjusted as well as the first "shebang" line of the script.
If the comments were omitted, it would be obvious that this is a short script with not much to it. Whenever the server sends a heartbeat, an interrupt occurs and its handler, `aliveCallback`, updates the time of reception of the signal. A timer regularly executes `checkAlive`, which will reboot the server if the last received heartbeat occurred too long ago.
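The interplay between the two handlers can be sketched without any GPIO code. The class below is an illustration of the logic only, not the downloadable script: the class name `HeartbeatMonitor` and the injectable `clock` are my own additions, and here `checkAlive` merely reports that a reboot is due instead of performing it.

```python
import time


class HeartbeatMonitor:
    """Remember when the last heartbeat arrived and report when it is overdue."""

    def __init__(self, timeout=45.0, clock=time.monotonic):
        self.timeout = timeout        # seconds of silence tolerated
        self.clock = clock            # replaceable for testing
        self.last_beat = clock()

    def aliveCallback(self, channel=None):
        """Interrupt handler: note the time at which the heartbeat was received."""
        self.last_beat = self.clock()

    def checkAlive(self):
        """Timer handler: return True when the server should be rebooted."""
        return self.clock() - self.last_beat > self.timeout
```

With `RPi.GPIO`, the first handler would be registered with something like `GPIO.add_event_detect(17, GPIO.RISING, callback=monitor.aliveCallback)`, which passes the channel number to the callback, while a repeating timer would call `checkAlive` at regular intervals.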
Sharp-eyed readers will have noticed that the cleanup code was not registered with `atexit` as done in the previous script. Instead the `pause` statement is encased in a `try`...`finally` block and the cleanup code is performed in the `finally` clause which is certain to be executed. The cleanup now includes cancelling the timer, otherwise the Ctrl+C keyboard combination will not halt the timer thread. And by the way, the `RepeatTimer` class introduced above is used again.
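The pattern can be tried on any machine with the following sketch. It is an assumption-laden stand-in for the real script: `run_until_interrupted` and `interrupted` are hypothetical names, an ordinary `threading.Timer` replaces the repeating timer, and a function that raises `KeyboardInterrupt` plays the part of `pause` being interrupted by Ctrl+C.

```python
import threading


def run_until_interrupted(timer, wait, cleanup):
    """Start `timer`, block in `wait()`, and always tidy up on the way out."""
    timer.start()
    try:
        wait()              # signal.pause() in a real script
    except KeyboardInterrupt:
        pass                # Ctrl+C is the normal way to stop
    finally:
        timer.cancel()      # otherwise the timer thread keeps the process alive
        cleanup()           # GPIO.cleanup() on the Pi


def interrupted():
    raise KeyboardInterrupt  # simulates Ctrl+C arriving during pause()


done = []
timer = threading.Timer(60, done.append, args=("fired",))
run_until_interrupted(timer, interrupted, lambda: done.append("cleanup"))
timer.join()                 # returns at once because the timer was cancelled
print(done)
```

Only the cleanup is recorded in `done`; the timer function never runs because the timer was cancelled in the `finally` clause.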
Rebooting the server is done in two steps. First it is shut down properly by activating the server GPIO pin bound to the `gpio-shutdown` kernel module. The watchdog then waits while the shutdown is performed. After an appropriate delay, the server is restarted by activating its `RUN` pin. This two-step approach provides a fail-safe mechanism. If the server had gone off the tracks to the extent that activating the GPIO pin bound to `gpio-shutdown` did not shut down the operating system properly, the second step will reset the system, albeit without a proper shutdown.
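The two steps can be summarized in a few lines. This is a sketch of the sequence only: `rebootServer`, `pulse` and the callables `activate_shutdown` and `activate_reset` are hypothetical names standing in for whatever drives the two output pins, and the `SHUTDOWN_DELAY` value is a placeholder that has to be measured on the actual server.

```python
import time

PULSE_TIME = 0.3        # seconds a line is held active
SHUTDOWN_DELAY = 25.0   # placeholder: seconds allowed for an orderly shutdown


def pulse(activate, duration=PULSE_TIME, sleep=time.sleep):
    """Activate a line for `duration` seconds, then release it."""
    activate(True)
    sleep(duration)
    activate(False)


def rebootServer(activate_shutdown, activate_reset, sleep=time.sleep):
    """Request an orderly shutdown, wait, then pulse RUN to restart the server."""
    pulse(activate_shutdown, sleep=sleep)  # step 1: gpio-shutdown halts the system
    sleep(SHUTDOWN_DELAY)                  # give the shutdown time to complete
    pulse(activate_reset, sleep=sleep)     # step 2: restart, or hard reset if step 1 failed
```

Because both lines are active when pulled to ground, the `activate` callables would bring the corresponding watchdog pin low when passed `True` and release it when passed `False`.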
Note that when started and when it has rebooted the server, the watchdog is not active and it will not reboot the server even in the absence of a heartbeat. The watchdog must be activated, which occurs when it has received a specific, user definable, number of heartbeats from the server. This is on purpose. It is possible to disable the feed service on the server and, after an initial reboot by the watchdog, the latter will no longer try to reboot the server. Similarly, it is possible to change the operating system on the server and the watchdog will not interfere as the OS is updated and services are installed. That startup feature means that it is not necessary to wait for the completion of the server Pi reboot process before restarting the watchdog. It will patiently wait for the heartbeat to resume before starting its job. I think this is a clever idea, but it is not mine. It can be found in the software watchdog (see Raspberry Pi and Domoticz Watchdog or the `man` page for `watchdog`).
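The activation rule amounts to a small counter. The sketch below is my own illustration of it: the class name `Arming` is invented, and the value given to `START_COUNT` is a placeholder since the constant is user definable.

```python
START_COUNT = 3   # placeholder value; user definable in the script


class Arming:
    """Keep the watchdog inactive until enough heartbeats have been seen."""

    def __init__(self, start_count=START_COUNT):
        self.start_count = start_count
        self.beats = 0

    @property
    def active(self):
        return self.beats >= self.start_count

    def heartbeat(self):
        """Count heartbeats until the watchdog is active."""
        if not self.active:
            self.beats += 1

    def reset(self):
        """Call after rebooting the server: wait for fresh heartbeats again."""
        self.beats = 0
```

The timer handler would check the `active` property first and do nothing while it is false, which is why a server that never resumes its heartbeat is left alone after the first reboot.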
One fortunate consequence of not starting the watchdog until it has received at least one signal from the system being monitored is that it does not matter which device is started first. Unfortunately, it does matter which is shut down first. If the Raspberry Pi server is shut down first and if the watchdog was started, then the latter will restart the server after the time-out delay. It does not matter if this is done with a command line utility or with the reboot or reset buttons. This was a problem with the mining rig watchdog also. The workaround is to first stop the `wdog.py` script or shut down the watchdog Pi.
Obedient Watchdog
There is a way to partially avoid the last problem. Instead of using either the shutdown or reset buttons of the server, the "power" button connected to the watchdog Pi will be used. Indeed, during this experimental phase, the power button will be able to perform four different functions depending on the number of times it is pressed in quick succession.
Press count | Action | Note |
---|---|---|
1 | Reboot the server | The watchdog continues to function |
2 | Shut down the server | The watchdog is disabled until the server restarts |
3 | Reboot the watchdog | Server unaffected |
4 | Shut down the watchdog | |
The `gpiozero` module is added to the virtual environment because it has a convenient button object.
The listing below only shows the additions made to the `~/.systempy/wdog.py` script. Because the number of times the button is pressed in quick succession must be counted, a timer (called `buttonTimer`, a global variable) is started whenever the button is released. If the button is pressed before time runs out, the button pressed count is incremented and the timer is restarted. When the timer does run out, then `doButton` is executed. Note that it will be necessary to run the script as `root`, which will be the case when the script is set up as a service.
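The counting mechanism can be sketched independently of the button hardware. The class below is an illustration under my own assumptions, not the additions made to the script: it uses an object instead of a global `buttonTimer`, the name `PressCounter` and the `BUTTON_TIMEOUT` value are invented, and `on_done` plays the role of `doButton`.

```python
from threading import Timer

BUTTON_TIMEOUT = 0.5   # placeholder: seconds allowed between two presses


class PressCounter:
    """Count presses that follow each other quickly, then report the total."""

    def __init__(self, on_done, timeout=BUTTON_TIMEOUT):
        self.on_done = on_done
        self.timeout = timeout
        self.count = 0
        self.timer = None

    def pressed(self):
        if self.timer is not None:
            self.timer.cancel()      # the sequence continues, stop the countdown
        self.count += 1

    def released(self):
        # Start the countdown again each time the button is released.
        self.timer = Timer(self.timeout, self._expired)
        self.timer.start()

    def _expired(self):
        count, self.count = self.count, 0
        self.on_done(count)          # act on the number of presses
```

With a `gpiozero` `Button`, the two methods would be attached with `button.when_pressed = counter.pressed` and `button.when_released = counter.released`. The sketch ignores the race between a press and a timer that is just expiring.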
To try this version, get the complete script and make it executable. You may want to preserve the older script as shown in the first line below. The script, wdog_o.py, can also be downloaded.
Of course this solution does not stop the watchdog from restarting the server Pi when the latter is shut down with a `bash` command. Here are some initial ideas about ways to take care of that problem:
- Use yet another GPIO connection from the server Pi to the watchdog Pi which when asserted turns off the watchdog.
- Use a serial communication protocol UART, SPI or OneWire to do something similar, but they all require one, two or even more GPIO connections.
- I2C was not included because Raspberry Pi are I2C masters only and I2C masters cannot talk to each other. But it may be possible to use an I2C EEPROM as a letter box where the server Pi leaves a warning in a specific memory address that it is shutting down and the watchdog Pi always looks at the address before rebooting the server.
- Use a more sophisticated heartbeat which transmits two types of messages: "I'm alive" and an "I'm about to shut down" signal. This will entail more complex scripts, but in principle it is quite possible.
It's fun to speculate about these solutions to what I judge to be a minor problem. Before spending time examining them any further, it would be best to verify just how effective the hardware watchdog will be.
The watchdog described above has the minimum capabilities that will be required of all the other devices to be evaluated as potential hardware watchdogs. This will be done in future posts as announced in the introduction to this series of posts.
Obedient and Barking Watchdog
So far the watchdog has been doing its job very quietly. This will be especially true when the watchdog is run as a service because the print statements in the `wdog.py` scripts, meant to help in the initial testing, will not be visible. But even the lowly Raspberry Pi Zero has logging capabilities. So I converted the print statements into logging statements. Here is an example.
The `log` function is merely a wrapper around the `syslog` function of the `syslog` Python module which optionally prints out the log message to the console as before except for the addition of a time stamp. The `sendNotification` function calls on a library function, `postmail`, to send an email whenever the server Pi is about to be rebooted or shut down.
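A wrapper of that kind might look like the following. This is a minimal sketch assuming the standard `syslog` module. The `VERBOSE` switch, the time stamp format and the returned string are my own choices, not necessarily those of the script.

```python
import syslog
import time

VERBOSE = True   # hypothetical switch: also echo messages to the console


def log(message, priority=syslog.LOG_INFO, verbose=VERBOSE):
    """Send `message` to the system log and optionally print it with a time stamp."""
    syslog.syslog(priority, message)
    stamped = time.strftime("%Y-%m-%d %H:%M:%S") + " " + message
    if verbose:
        print(stamped)
    return stamped
```

A call such as `log("Rebooting the server")` then replaces the corresponding print statement.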
Of course you will need the `pymail.py` module containing the `postmail` function. The module and a "secrets" file, `pymail_secrets.py`, are in an archive that can be obtained here: pymail_0-2-0.zip. Values in the secrets file and in `pymail.py` will have to be adjusted. If the watchdog does not have access to the Internet, then set the constant `SEND_NOTIFICATION` to `False`.
If an SSH session can be opened on the watchdog Pi, then it will be possible to see the logging messages in real time.
Finally, the power button function was simplified. One short click of the button will reboot both the watchdog and the server Pi. One long press of the power button will shut down the watchdog doing nothing to the server. Once the watchdog Pi is down, it will be possible to restart the Pi by pressing the button again because it is connected to GPIO3. I think it is much more likely that I will remember these two possible actions instead of the four. And activating the wanted action will be less finicky.
When both devices are down, it will be possible to restart them without toggling their power off and then back on. Pressing the power button of the watchdog will restart the latter; the server can be restarted by pressing its reset button.
To try this last version of the watchdog script to be presented in this post, download it and make it executable. Again you may want to save any previous version before downloading this version of the script.
It will be necessary to adapt some constants at the beginning of the script.
Unleashing the Watchdog
All that needs to be done now is to ensure that the watchdog script is executed automatically when the watchdog Pi is booted. This is done with a unit file that is almost identical to the one created for the heartbeat script on the server Pi.
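The unit file itself is not reproduced here. As a rough sketch only, assuming the script is saved as `/home/pi/.systempy/wdog.py`, is executable and starts with the proper shebang line, a minimal unit file saved as `/etc/systemd/system/piwdog.service` could look like this. The description text is mine.

```ini
[Unit]
Description=Raspberry Pi hardware watchdog

[Service]
Type=simple
ExecStart=/home/pi/.systempy/wdog.py

[Install]
WantedBy=multi-user.target
```

System services are run as `root` unless a `User=` directive says otherwise, which suits the versions of the script that handle the power button.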
As before, it is easy to perform the usual tasks.
- Start the service:
pi@wdog:~ $ sudo systemctl start piwdog.service
- Stop the service:
pi@wdog:~ $ sudo systemctl stop piwdog.service
- Verify the status of the service:
pi@wdog:~ $ sudo systemctl status piwdog.service
- Enable automatic starting of the service at boot:
pi@wdog:~ $ sudo systemctl enable piwdog.service
- Disable automatic starting of the service at boot:
pi@wdog:~ $ sudo systemctl disable piwdog.service
Timing
How quickly should the hardware watchdog reboot the server when it no longer receives the heartbeat signal? It would be preferable that the home automation system be on line all the time to execute scheduled tasks as planned. That could lead one to decide on a fast response from the watchdog. However, provisions have been made in the circuit shown above for manual reboots of the server. Furthermore, I have rebooted the server from outside the house using one of the functions of the home automation system Domoticz more than once. When rebooting, the server will not be sending out heartbeat signals and it would be unfortunate if the starved watchdog were to reboot the server while it is in the process of booting. It would not be a catastrophe because when the watchdog itself reboots the server it knows to wait long enough for the server to reboot before trying to restart it. Actually, the watchdog does not know much of anything, the script contains no less than seven timing constants that will probably need to be adjusted in actual use.
The script is basically event driven. One event is when the repeat timer expires. The `CHECK_INTERVAL` is the time between timeouts. When that happens, the event handler checks how long it has been since the last heartbeat was received from the server Pi. If that time exceeds `WATCHDOG_TIMEOUT`, then the watchdog tries to reboot the server. That time period should be greater than the time the server needs to reboot as explained above. Right now the timeout is set at 45 seconds, but the test server, a Raspberry Pi 3 B, is just a skeleton. It is important to actually time a few reboots and set the timeout in accordance with the measured time plus a safety margin just in case adding another piece of software adds to the boot time. The constant `SHUTDOWN_DELAY` is related, because it should correspond to the time needed by the server Pi to shut down, which should be approximately half the time needed for a complete reboot. It is important not to underestimate this time because when that interval is over the watchdog will activate the server `RUN` signal. If that happens too soon, the effect would be to cut the shutdown process short instead of restarting a halted machine. The `RESET_DELAY` may seem very short at 5 seconds, but this is not a very important value. After all, the watchdog is reset after initiating a reboot of the server and it will then wait however long it takes for the server to send a few initial heartbeats (as specified by the `START_COUNT` constant) before beginning to function.
If the pulse activating the server shutdown GPIO pin is too short, it will not work, no doubt because of the debounce delay in the `gpio-shutdown` module. A pulse of three tenths of a second seems OK, but if the shutdowns initiated by the watchdog do not seem to work, it may be worthwhile to increase the value of `PULSE_TIME`. The `BUTTON_BOUNCE` constant is not too critical. It mattered more in the previous version of the script when the number of consecutive button presses was being counted. In this version, all that needs to be distinguished is a short versus a long button press and the debounce delay could easily be 3 or 4 times greater without creating much difficulty.
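The advice about timing a few reboots can be turned into a small helper. This is only an illustration: the function name `suggest_timing`, the 1.5 safety factor and the sample measurements are assumptions of mine, not values taken from the script.

```python
import math


def suggest_timing(reboot_times, margin=1.5):
    """Suggest WATCHDOG_TIMEOUT and SHUTDOWN_DELAY from measured reboot times.

    reboot_times: measured durations of complete reboots, in seconds.
    margin: safety factor applied to the slowest measurement.
    """
    worst = max(reboot_times)
    # The timeout must exceed the time the server needs for a full reboot.
    watchdog_timeout = math.ceil(worst * margin)
    # A shutdown takes roughly half of a complete reboot; err on the long side.
    shutdown_delay = math.ceil(worst / 2 * margin)
    return watchdog_timeout, shutdown_delay
```

For example, `suggest_timing([28, 30, 29])` uses the slowest reboot, 30 seconds, to propose a timeout of 45 seconds and a shutdown delay of 23 seconds.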
Testing
Testing of the watchdog was done with two approaches. The simplest is to turn on both Raspberry Pi and, after the watchdog has started, to stop feeding it. To help with the timing, the current time will be obtained just before halting the `wdfeed` service.
If the 51-seconds timeout seems excessive, remember that the watchdog checks the last time it was fed once every 10 seconds. Only if it has been more than 45 seconds since the last heartbeat was received will the watchdog initiate a reboot. So the time-out could be anywhere between 45 and 55 seconds. The SSH session opened with the server Pi was closed and the following message was received.
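The arithmetic behind that range is simple enough to write down. The sketch below uses the two values quoted in this post; the function name `detection_window` is mine.

```python
CHECK_INTERVAL = 10     # seconds between two checks by the watchdog
WATCHDOG_TIMEOUT = 45   # seconds without a heartbeat before a reboot


def detection_window(timeout=WATCHDOG_TIMEOUT, interval=CHECK_INTERVAL):
    """Return the shortest and longest delay before a missing heartbeat is acted on."""
    # The silence is only noticed at the first check made after `timeout`
    # seconds, and that check can come up to one full interval later.
    return timeout, timeout + interval


earliest, latest = detection_window()
print(f"reboot between {earliest} and {latest} seconds after the last heartbeat")
```

A measured delay of 51 seconds falls comfortably inside that window.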
This confirmed that the watchdog performed as expected. I ran the test overnight by adding the following `cron` task.
The server stops feeding the watchdog every fifteen minutes, triggering the watchdog which reboots the server Pi. This was verified by looking at the incoming emails the following morning. To be pedantic, the task only happens once, 15 minutes after the server Pi boots up, but then the cycle repeats. That verifies the mechanics of the watchdog. But to see it in action, it was necessary to "crash" the server Pi. In the past I have used the `forkbomb.sh` script.
It has the disadvantage of taking a relatively long while to use up all the resources. Others have come up with a similar script, let's call it `crash.sh`.
So that is my arsenal for testing.
Do not forget to make the scripts executable, and remember that there is no point in enabling more than one of these tasks because the watchdog will reboot the server Pi when one of these tasks is first performed. (That's not exactly true: if `crash.sh` were started right after either of the other two, it would probably crash the Linux kernel before the previous task could be completed.)