2020-03-28
Raspberry Pi Hardware Watchdog: Two Pies Please

This is the first of the "new and improved" hardware watchdogs for the Raspberry Pi single board computer. As the title hints, the watchdog is a Raspberry Pi also. Given the fact that Raspbian Buster will work on all models of the Raspberry Pi, any model could be the watchdog and any model could be the monitored server Pi. The plan was to use a Raspberry Pi Zero to keep the cost down. However, if the watchdog is to send an e-mail message when it reboots the server Pi, then it must be able to communicate over the network. In that case a Raspberry Pi Zero W is probably the most economical choice. Since I do not have a Pi Zero, an old model 1 Raspberry Pi is used as a proxy.

Table of contents

  1. Wiring the Watchdog
  2. Server Setup
    1. Feeding the Watchdog
    2. Feeding the Watchdog the Hard Way
    3. Heartbeat Service
    4. Shutdown Module
  3. Watchdog Setup
    1. Lean and Mean Watchdog
    2. Obedient Watchdog
    3. Obedient and Barking Watchdog
    4. Unleashing the Watchdog
  4. Timing
  5. Testing

Wiring the Watchdog

To do better than the mining rig watchdog, our watchdog must be able to properly shut down the server and perform a hard reset only if the shutdown failed. That means that our watchdog will need two output connections to initiate first a proper shutdown and then a reset if the shutdown failed. If the shutdown is carried out, then the watchdog will still need to trigger the RUN signal of the server to complete the reboot. Of course, the watchdog needs to be "fed" by the server. In other words, the watchdog needs to monitor a heartbeat connection from the server. So, in addition to a common ground between the two Raspberry Pi, three other connections must be made as listed below.

Connection   Watchdog                    Server
heartbeat    GPIO17 (pin 11) - input     GPIO17 (pin 11) - output
shutdown     GPIO27 (pin 13) - output    GPIO27 (pin 13) - input
reset        GPIO25 (pin 22) - output    RUN - input
ground       ground (pin 20)             ground (pin 20)

The three connections should use general purpose input-output pins avoiding special purpose pins that could be used by either Pi for other peripherals. Since a real-time clock is connected via the hardware I2C bus to the server then GPIO2 and GPIO3 pins (I2C data and clock signals respectively) cannot be used for the hardware watchdog. Even with that proviso, there are many possible connections; the choices I made were dictated by aesthetic considerations: I thought the wiring diagram looked clean and simple and the resultant symmetry was pleasing.

There are many ground pins on the Raspberry Pi GPIO header. It does not matter which is used to connect together the grounds of the two boards. Except on the Raspberry Pi 3 A+ and B+, the little two-pin connector named P2, P6, or RUN, or the three-pin connector named RUN GLOBAL_EN on the Raspberry Pi 4, also has a ground pin that could be used. Do not, under any circumstances, directly connect pin 1 (3.3 volts) of one device to pin 1 of the other. Similarly, if the two Raspberry Pi are powered from different sources, do not connect together pins 2 and 4 (5 volts) of the two devices.

A normally open push button is also placed across the shutdown and ground connections. That way it is possible to manually initiate a shutdown of the server Pi. The same is done across the reset and ground connections to manually restart the server if everything else fails or to manually restart the server Pi if it has been shut down with a bash command such as halt or poweroff.

Similarly two normally open push buttons are connected to the watchdog Pi. The function of the first labelled power will be explained later. The push button connecting RUN to ground will reset the watchdog Pi if it becomes necessary. The power button is not used in the first, lean and mean watchdog, but it is a useful addition as explained below.

Server Setup

There are two parts to preparing the Raspberry Pi that will be monitored by the hardware watchdog. A service (daemon) must be created that regularly signals to the watchdog that everything is nominally functioning correctly. Then the GPIO shutdown module has to be included in the kernel and must be bound to the shutdown GPIO pin. By activating that pin, the watchdog will be able to shut down the server in an orderly fashion.

Feeding the Watchdog

The server Pi has to "feed" the watchdog on a regular basis. It does this by toggling the state of its heartbeat output pin. I use a Python script to do this. First some requisite modules must be installed in the virtual Python environment for system utilities.

woopi@goldserver:~ $ ve .systempy
(.systempy) woopi@goldserver:~ $ pip install --upgrade rpi.gpio gpiozero
Looking in indexes: https://pypi.org/simple, https://www.piwheels.org/simple
...
Successfully installed colorzero-1.1 gpiozero-1.5.1 rpi.gpio-0.7.0
(.systempy) woopi@goldserver:~ $
(.systempy) woopi@goldserver:~ $ pip freeze
colorzero==1.1
gpiozero==1.5.1
pkg-resources==0.0.0
RPi.GPIO==0.7.0

Virtual Python Environments:

I prefer using virtual environments for Python development. There is more than one way of doing this, the one used here was described in an older post: Python 3 virtual environments. Lately, I have systematically created a virtual environment for "system" scripts as last described in the Working Directories section of a post on installing Raspbian. On goldserver that environment is called .systempy.

To activate the virtual environment, the command ve is used. It is a Bash alias for source $1/bin/activate, which means ve .systempy is the same as source .systempy/bin/activate.

It is very simple to create a LED object with gpiozero which nominally blinks a LED but which is in fact the required heartbeat.

(.systempy) woopi@goldserver:~ $ nano .systempy/wdfeed.py

#!/home/woopi/.systempy/bin/python

# Python 3 script that toggles a GPIO output pin on at regular interval.
# This is the heartbeat signal meant to feed a hardware watchdog.

### User settable values ##########################################

HEARTBEAT_PIN = 17      # GPIO17 = header pin 11
HEARTBEAT_INTERVAL = 5  # seconds wait between heartbeats

###################################################################

from gpiozero import LED
from signal import pause

led = LED(HEARTBEAT_PIN, initial_value=False)
led.blink(on_time=0.2, off_time=HEARTBEAT_INTERVAL)
pause()

Feeding the Watchdog the Hard Way

Disappointed that the script to feed the watchdog is so simple? If you are satisfied with just three lines of actual code, move on to the next section. If you want to do it the hard way, read on.

Real programmers don't use newfangled things like gpiozero. So, if you were following along in the previous section just remove it from the virtual environment. Might as well remove colorzero which was installed along with that module.

(.systempy) woopi@goldserver:~ $ pip uninstall gpiozero colorzero
Looking in indexes: https://pypi.org/simple, https://www.piwheels.org/simple
...

If you were not following along, then go ahead, start the virtual environment and install RPi.GPIO at this point.

woopi@goldserver:~ $ ve .systempy
(.systempy) woopi@goldserver:~ $ pip install --upgrade rpi.gpio
...

No matter how you started, the virtual environment should now contain RPi.GPIO

(.systempy) woopi@goldserver:~ $ pip freeze
pkg-resources==0.0.0
RPi.GPIO==0.7.0

The idea is to create a timer that will wait for a set interval and then briefly set the output pin HIGH before returning it to LOW. Obviously, we want that cycle to repeat indefinitely. However, Python timers are simple one shot things. Simple compared to timers in Free Pascal which can be one shot or can be made to repeat. Fortunately, right2clicky created an elegant RepeatTimer class, which is just what we need, by subclassing the Timer class (see: StackOverflow).

(.systempy) woopi@goldserver:~ $ nano .systempy/wdfeed.py

#!/home/woopi/.systempy/bin/python

'''
Python 3 script that toggles a GPIO output pin on/off twice at regular interval.
This is the heartbeat signal meant to feed a hardware watchdog.
'''

### User settable values ##########################################

HEARTBEAT_PIN = 17      # GPIO17 = header pin 11
HEARTBEAT_INTERVAL = 5  # seconds wait between heartbeats

###################################################################

import atexit
from time import sleep
import RPi.GPIO as GPIO
from threading import Timer
from signal import pause

# Subclassed Timer that will restart itself after executing the function
# specified when created. It will execute the same function over and over
# at the specified interval. Reference:
#   right2clicky on StackOverflow: https://stackoverflow.com/a/48741004
#
class RepeatTimer(Timer):
    def run(self):
        while not self.finished.wait(self.interval):
            self.function(*self.args, **self.kwargs)

def gpioCleanup():
    #print("gpioCleanup")
    GPIO.cleanup()

def toggleHeartbeat():
    #print("toggleHeartbeat")
    GPIO.output(HEARTBEAT_PIN, GPIO.HIGH)
    sleep(0.2)
    GPIO.output(HEARTBEAT_PIN, GPIO.LOW)

atexit.register(gpioCleanup)
GPIO.setwarnings(False)
GPIO.setmode(GPIO.BCM)
GPIO.setup(HEARTBEAT_PIN, GPIO.OUT, initial=GPIO.LOW)
timer = RepeatTimer(HEARTBEAT_INTERVAL, toggleHeartbeat)
timer.start()
pause()
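
For readers without a Raspberry Pi at hand, the repeating behaviour of RepeatTimer can be tried on any computer. The following is only a sketch: the class is the one from the script above, while count_ticks and its parameters are names made up for this illustration and no GPIO pin is involved.

```python
import threading
from threading import Timer


class RepeatTimer(Timer):
    """Timer that calls its function again and again until cancel() is called."""

    def run(self):
        # self.finished is the Event set by Timer.cancel(). wait() returns
        # False when the interval elapses without a cancel, so the function
        # is called once per interval.
        while not self.finished.wait(self.interval):
            self.function(*self.args, **self.kwargs)


def count_ticks(interval, wanted):
    """Hypothetical helper: run a RepeatTimer until `wanted` ticks were seen."""
    ticks = []
    done = threading.Event()

    def tick():
        ticks.append(len(ticks) + 1)
        if len(ticks) >= wanted:
            done.set()

    timer = RepeatTimer(interval, tick)
    timer.daemon = True          # do not keep the interpreter alive
    timer.start()
    done.wait(timeout=interval * wanted * 20 + 1)
    timer.cancel()               # sets the finished event, which ends run()
    timer.join(timeout=1)
    return len(ticks)


if __name__ == "__main__":
    print("ticks seen:", count_ticks(0.05, 3))
```

Note that the interval is measured from the end of one call to the start of the next, so the time taken by the called function is added to each cycle.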

Heartbeat Service

A LED could be connected to GPIO17 (header pin 11) to test either of these scripts. Don't forget the current limiting resistor and respect the polarity of the LED. With the second version, testing could be as simple as enabling the print statement in the toggleHeartbeat function. Make the script executable and then execute it and check that the LED flashes on for two tenths of a second every five seconds or that toggleHeartbeat is printed to the console every five seconds.

(.systempy) woopi@goldserver:~ $ ev
woopi@goldserver:~ $ sudo chmod +x .systempy/wdfeed.py
woopi@goldserver:~ $ .systempy/wdfeed.py
toggleHeartbeat
toggleHeartbeat
...

Press the Ctrl+C key combination to stop execution of the script. Notice how the virtual environment was deactivated, yet the script executed correctly even though the required Python modules are not installed in the default Python directory. That is because the "shebang" line, #!/home/woopi/.systempy/bin/python, at the start of the script informs the shell that the .systempy virtual environment Python interpreter in the /home/woopi/.systempy/bin directory is to be used, not the default Python interpreter in /usr/bin. It is important to adjust the shebang to the correct directory. If it is wrong, then bash will complain.

woopi@goldserver:~ $ .systempy/wdfeed.py
-bash: .systempy/wdfeed.py: home/woopi/.systempy/bin/python: No such file or directory
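
The shebang can also be checked before the script is ever run. Here is a small sketch of such a check; shebang_interpreter and shebang_is_usable are hypothetical helper names, and insisting on an absolute path is my own choice for the illustration.

```python
import os
import sys
import tempfile


def shebang_interpreter(script_path):
    """Return the interpreter named on the first line of the script, or None."""
    with open(script_path, "r", encoding="utf-8") as script:
        first_line = script.readline().strip()
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    return parts[0] if parts else None


def shebang_is_usable(script_path):
    """True if the shebang names an absolute path to an executable file."""
    interpreter = shebang_interpreter(script_path)
    return (interpreter is not None
            and os.path.isabs(interpreter)
            and os.access(interpreter, os.X_OK))


if __name__ == "__main__":
    # Demonstration with a throwaway script whose shebang lacks the leading slash
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write("#!home/woopi/.systempy/bin/python\nprint('hello')\n")
    print(shebang_interpreter(tmp.name), shebang_is_usable(tmp.name))
    os.remove(tmp.name)
```

Passing the path of wdfeed.py to shebang_is_usable would flag the missing leading slash that produced the error shown above.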

Do not forget to adjust the constant HEARTBEAT_PIN = 17 to the correct GPIO pin if the watchdog and server were wired differently than how I did it. Those two things are about the only two possible errors, aside from some typo, of course. If the script is working correctly, you may want to write-protect it.

woopi@goldserver:~ $ ls -l .systempy/wdfeed.py
-rwxr-xr-x 1 woopi woopi 605 Feb  6 15:40 .systempy/wdfeed.py
woopi@goldserver:~ $ sudo chmod -w .systempy/wdfeed.py
woopi@goldserver:~ $ ls -l .systempy/wdfeed.py
-r-xr-xr-x 1 woopi woopi 605 Feb  6 15:40 .systempy/wdfeed.py

The script needs to be run automatically whenever the server Pi is booted up. One way is to create a cron task performed at each reboot.

woopi@goldserver:~ $ sudo crontab -e

And add the last line shown below.

...
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h  dom mon dow   command
@reboot /home/woopi/.systempy/bin/python /home/woopi/.systempy/wdfeed.py &

While that is simple to put in place, it is preferable to run the script in the background as a daemon. Here is a basic systemd unit file for it.

[Unit]
Description=Hardware watchdog feeding service
After=network.target

[Service]
Type=simple
Restart=always
RestartSec=1
User=root
ExecStart=/home/woopi/.systempy/wdfeed.py

[Install]
WantedBy=multi-user.target

Create that file as the super user and save it in the /etc/systemd/system directory. An easy way of doing this is by starting the nano editor, copying the file from above and pasting it in the editor.

woopi@goldserver:~ $ sudo nano /etc/systemd/system/wdfeed.service

Use the systemctl utility to start the daemon and then to enable it so that it will be automatically started when the server is rebooted.

woopi@goldserver:~ $ sudo systemctl start wdfeed.service
woopi@goldserver:~ $ sudo systemctl enable wdfeed.service

The value of this approach is that it is just as simple to stop the daemon.

woopi@goldserver:~ $ sudo systemctl stop wdfeed.service

That will be a good way of testing the watchdog later on. Furthermore, it is easy to verify that the service is running properly.

woopi@goldserver:~ $ sudo systemctl status wdfeed.service
● wdfeed.service - Hardware watchdog feeding service
   Loaded: loaded (/etc/systemd/system/wdfeed.service; disabled; vendor preset: enabled)
   Active: active (running) since Wed 2020-02-05 20:18:35 AST; 6s ago
 Main PID: 720 (wdfeed.py)
    Tasks: 2 (limit: 2319)
   Memory: 4.6M
   CGroup: /system.slice/wdfeed.service
           └─720 /home/woopi/.systempy/bin/python /home/woopi/.systempy/wdfeed.py

Feb 05 20:18:35 goldserver systemd[1]: Started Hardware watchdog feeding service.

Shutdown Module

The gpio-shutdown module has already been discussed at length in section 5 of a previous post: Warm and Cold restarts of the Raspberry. There is no need to rehash the subject. I made three changes to the configuration file config.txt.

woopi@goldserver:~ $ sudo nano /boot/config.txt
...
# Uncomment some or all of these to enable the optional hardware interfaces
#dtparam=i2c_arm=on
#dtparam=i2s=on
#dtparam=spi=on

# For access to I2C RTC and other I2C devices on hardware I2C bus (SDA on GPIO2, SCL on GPIO3)
dtoverlay=i2c-rtc,ds3231
## - compatible with gpio-shutdown as long as gpio_pin is not 2 or 3
dtoverlay=gpio-shutdown,gpio_pin=27
...
# Connect mini-UART to the GPIO header
# This implies core_freq=250, a performance hit so disable this if not needed
# Tx on BCM GPIO 14, Rx on BCM GPIO 15 [pins 8 and 10 on the GPIO header respectively].
# Reference: https://www.raspberrypi.org/documentation/configuration/uart.md
enable_uart=1
...

Two changes, the dtoverlay=i2c-rtc,ds3231 and enable_uart=1 lines, were optional. Nevertheless, it is comforting to see that they are compatible with the necessary addition, the dtoverlay=gpio-shutdown,gpio_pin=27 line that loads the gpio-shutdown module. The latter will monitor GPIO27 and will initiate an orderly shutdown whenever that pin is brought low either manually with the push button or by the watchdog. Optionally, the hardware I2C controller and the I2C driver for a hardware clock using the DS3231 chip are included in the device tree. Also optionally the mini-UART is enabled. That makes it easier to see if an orderly shutdown is occurring or not. Once the watchdog is found to be working correctly, the UART will be disabled as it slightly slows down the system.

That completes the changes that need to be made to the server.

Watchdog Setup

I will present three versions of the Python watchdog script. The first will emulate the hardware mining rig watchdog. This lean and mean version does take care of the problems associated with the mining rig watchdog without doing more. With the second version, the obedient watchdog, it will be possible to use its power button to shut down the server without the watchdog restarting it. In the final version the watchdog will bark, meaning it will log its actions and, when possible, send out e-mail notification when it reboots the server.

As mentioned in the introduction to this series of posts, an early Raspberry Pi is being used as a proxy for a Raspberry Pi Zero or Raspberry Pi Zero W which are better choices because of their size and price.

pi@wdog:~ $ cat /proc/device-tree/model
Raspberry Pi Model B Rev 2
pi@wdog:~ $ cat /proc/cpuinfo | grep Revision
Revision        : 000e

That was the last model 1 Raspberry Pi with only two USB ports and a 26 pin GPIO header. Like the Zero it has 512 Mbytes of RAM. Both have the same Broadcom system on a chip (BCM2835) with the same one core ARM processor (ARM1176JZF-S). The Zero runs at 1GHz while the model 1 has a lower clock speed of 700 MHz which can be overclocked.

The operating system is Raspbian Buster Lite (kernel 4.19) version 2019-09-26 to which only a few modifications have been made. The host name was changed to wdog while the default user remains pi. The virtual Python environment for system utilities such as the watchdog is named .systempy. For more details, see the post titled Installation and Configuration of Raspbian Buster Lite.

A Wi-Fi USB dongle makes life much easier because there is no simple way to connect to the local area network with Ethernet in the room where this experiment is being run. Happily, the dongle is based on the Realtek RTL8188CUS chip which is supported by Buster.

pi@wdog:~ $ lsusb Bus 001 Device 004: ID 0bda:8176 Realtek Semiconductor Corp. RTL8188CUS 802.11n WLAN Adapter

The Raspberry Pi Zero does not have conventional network capabilities, but I understand that it would be possible to open SSH sessions using a USB connection between the Raspberry Pi Zero and the desktop.

As always, the operating system was upgraded just before starting this project.

pi@wdog:~ $ sudo apt update; sudo apt upgrade -y

As more and more changes are made to the OS after the Raspberry Pi Foundation updates the download image, the update will take longer. Given that the image was four months old and the Raspberry Pi has a relatively under-powered processor, I had time for a quick lunch at this point. Do not forget the -y flag if you want this upgrade to proceed unattended.

Lean and Mean Watchdog

The aim here is to replicate the mining rig watchdog while overcoming its main drawbacks. To that extent, this minimal watchdog will

The watchdog will be implemented with a Python script. Again the RPi.GPIO module is a prerequisite that is added in the Python virtual environment for system utilities.

pi@wdog:~ $ ve .systempy
(.systempy) pi@wdog:~ $ pip install --upgrade rpi.gpio
Looking in indexes: https://pypi.org/simple, https://www.piwheels.org/simple
...
Successfully installed rpi.gpio-0.7.0
(.systempy) pi@wdog:~ $ pip freeze
pkg-resources==0.0.0
RPi.GPIO==0.7.0

After deactivating the virtual environment, the script was created.

(.systempy) pi@wdog:~ $ ev
pi@wdog:~ $ nano .systempy/wdog.py

#!/home/pi/.systempy/bin/python
# coding: utf-8

### User settable values #####################################################

# Timing constants

CHECK_INTERVAL = 10    # seconds between checks of the last alive signal
WATCHDOG_TIMEOUT = 45  # seconds without an alive signal before rebooting the server
PULSE_TIME = 0.3       # length (seconds) of pulses sent to the server shutdown and reset pins
SHUTDOWN_DELAY = 25    # time allowed (seconds) for the server to shut down
RESET_DELAY = 5        # time allowed (seconds) for the server to cold boot
START_COUNT = 4        # number of initial heartbeats to start watchdog

# Watchdog GPIO connections

HEARTBEAT_GPIO = 17        # watchdog input connected to server's alive pin
SERVER_SHUTDOWN_GPIO = 27  # watchdog output connected to server's shutdown pin
SERVER_RESET_GPIO = 22     # watchdog output connected to server's RUN pin

##############################################################################

## Global variables ##

startCount = 0          # Count of initial heartbeats
aliveTime = 0.0         # last time the alive signal received from server
watchdogActive = False  # True = watchdog was started by initial heartbeat

## Required modules ##

import RPi.GPIO as GPIO
from threading import Timer
from signal import pause
from subprocess import check_call
import os
import time

# Common routine to assert a normally HIGH GPIO pin LOW for a short
# period of time and to optionally sleep for a specified amount of time
#
def pulseServerPin(aPin, sleepTime=None):
    global watchdogActive
    watchdogActive = False
    #print("pulseServerPin {} low".format(aPin))
    GPIO.output(aPin, GPIO.LOW)
    time.sleep(PULSE_TIME)
    GPIO.output(aPin, GPIO.HIGH)
    if sleepTime:
        #print("waiting {} seconds after pulse".format(sleepTime))
        time.sleep(sleepTime)

# Restarts the watchdog so that it resumes waiting for an initial
# feeding before starting
#
def initWatchdog():
    global watchdogActive
    global startCount
    print("Watchdog - resetting watchdog")
    watchdogActive = False
    startCount = 0

# Shuts down the server by activating its shutdown pin and,
# after a delay to allow a proper shutdown of the OS, it
# restarts the server by activating its RUN pin. Sleeps
# to allow the boot process to complete on the server
#
def rebootServer():
    global aliveTime
    global watchdogActive
    global startCount
    print('Watchdog - rebooting server')
    if not watchdogActive:
        print('Watchdog - already rebooting')
        return
    # Shut down the server properly and then wake it up
    pulseServerPin(SERVER_SHUTDOWN_GPIO, sleepTime=SHUTDOWN_DELAY)
    pulseServerPin(SERVER_RESET_GPIO, sleepTime=RESET_DELAY)
    # Reset the watchdog
    initWatchdog()

# Subclassed Timer that will restart itself after executing the function
# specified when created. It will execute the same function over and over
# at the specified intervals.
# Reference:
#   right2clicky on StackOverflow: https://stackoverflow.com/a/48741004
#
class RepeatTimer(Timer):
    def run(self):
        while not self.finished.wait(self.interval):
            self.function(*self.args, **self.kwargs)

# Routine called by the timer at regular intervals (CHECK_INTERVAL)
# to check last time the server sent heartbeat. Reboots the server if
# the alive signal has not been received for too long a period
#
def checkAlive():
    if watchdogActive and (time.time() - aliveTime > WATCHDOG_TIMEOUT):
        print("Watchdog - watchdog timed out after {0:.2f} seconds".format(time.time() - aliveTime))
        rebootServer()

# This is the call back routine for the interrupt generated by the
# server heartbeat signal. It updates the time at which the signal was
# received. If the watchdog has not been started then it increments
# the number of times a heartbeat has been detected and if it is
# now large enough, the watchdog is started.
#
def aliveCallback(channel):
    global watchdogActive
    global startCount
    global aliveTime
    aliveTime = time.time()
    if not watchdogActive:
        startCount += 1
        #print("Watchdog - startCount: ", startCount)
        if startCount > START_COUNT:
            print("Watchdog - watchdog started")
            watchdogActive = True

# Setup the GPIO pins
GPIO.setwarnings(False)
GPIO.setmode(GPIO.BCM)
GPIO.setup(SERVER_SHUTDOWN_GPIO, GPIO.OUT, initial=GPIO.HIGH)
GPIO.setup(SERVER_RESET_GPIO, GPIO.OUT, initial=GPIO.HIGH)
GPIO.setup(HEARTBEAT_GPIO, GPIO.IN, pull_up_down=GPIO.PUD_UP)
GPIO.add_event_detect(HEARTBEAT_GPIO, GPIO.FALLING, callback=aliveCallback)

# Setup the timer
aliveTime = time.time()
timer = RepeatTimer(CHECK_INTERVAL, checkAlive)

# Run the watchdog
timer.start()
print("Watchdog - watchdog loaded")
try:
    pause()
finally:
    GPIO.cleanup()
    if timer.is_alive():
        timer.cancel()
    print("Watchdog - watchdog terminated")

The script, renamed wdog_lm.py to distinguish the three versions, can be downloaded by clicking on the link, but here is a quick way to obtain the script, to rename it wdog.py and to make it executable.

pi@wdog:~ $ wget -O .systempy/wdog.py https://sigmdel.ca/michel/ha/rpi/dnld/wdog_lm.py
pi@wdog:~ $ sudo chmod +x .systempy/wdog.py

If the virtual environment directory is not named .systempy the above commands will have to be adjusted as well as the first "shebang" line of the script.

If the comments were omitted, it would be obvious that this is a short script with not much to it. Whenever the server sends a heartbeat, an interrupt occurs and its handler, aliveCallback, updates the time of reception of the signal. A timer regularly executes checkAlive which will reboot the server if the last received heartbeat occurred too long ago.
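
The time-out test performed by checkAlive can be isolated from the hardware and tried with plain numbers. This is only a sketch using the constants from the script; timed_out and first_detection_time are names invented for the illustration.

```python
CHECK_INTERVAL = 10    # seconds between checks, as in wdog.py
WATCHDOG_TIMEOUT = 45  # seconds without a heartbeat before a reboot, as in wdog.py


def timed_out(now, alive_time, active, timeout=WATCHDOG_TIMEOUT):
    """Same condition as in checkAlive: an active watchdog that was not fed in time."""
    return active and (now - alive_time > timeout)


def first_detection_time(last_heartbeat, first_check,
                         interval=CHECK_INTERVAL, timeout=WATCHDOG_TIMEOUT):
    """Hypothetical helper: time of the first periodic check that sees the time-out."""
    check_time = first_check
    while not timed_out(check_time, last_heartbeat, True, timeout):
        check_time += interval
    return check_time


if __name__ == "__main__":
    # Checks occur at t = 10, 20, 30, ... and the last heartbeat arrived at t = 3
    detected = first_detection_time(last_heartbeat=3, first_check=10)
    print("detected at", detected, "after", detected - 3, "seconds of silence")
```

Because the test is only made every CHECK_INTERVAL seconds, the silence reported when the watchdog finally acts lies somewhere between WATCHDOG_TIMEOUT and WATCHDOG_TIMEOUT + CHECK_INTERVAL, that is between 45 and 55 seconds with the values used here.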

Sharp-eyed readers will have noticed that the cleanup code was not registered with atexit as done in the previous script. Instead the pause statement is encased in a try...finally block and the cleanup code is performed in the finally clause which is certain to be executed. The cleanup now includes cancelling the timer, otherwise the Ctrl+C keyboard combination will not halt the timer thread. And by the way, the RepeatTimer class introduced above is used again.

Rebooting the server is done in two steps. First it is shut down properly by activating the server GPIO pin bound to the gpio-shutdown kernel module. The watchdog then waits while the shutdown is performed. After an appropriate delay, the server is restarted by activating its RUN pin. This two-step approach provides a fail-safe mechanism. If the server had gone off the tracks to the extent that activating the GPIO pin bound to gpio-shutdown did not shut down the operating system properly, the second step will reset the system, albeit without a proper shutdown.
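
The order of operations in the two-step reboot can be traced without any hardware by replacing the GPIO and sleep calls with functions that merely record what would be done. This is a sketch, not the script itself: reboot_sequence, set_pin and sleep are stand-in names, and the default values are those of the constants in wdog.py.

```python
def reboot_sequence(set_pin, sleep, shutdown_pin=27, reset_pin=22,
                    pulse_time=0.3, shutdown_delay=25, reset_delay=5):
    """Pulse the shutdown pin LOW, wait, then pulse the reset pin LOW and wait."""
    for pin, delay in ((shutdown_pin, shutdown_delay), (reset_pin, reset_delay)):
        set_pin(pin, 0)      # assert the normally HIGH pin LOW
        sleep(pulse_time)
        set_pin(pin, 1)      # release it
        sleep(delay)         # give the server time to shut down or to boot


# Record the sequence instead of driving real pins
events = []
reboot_sequence(lambda pin, level: events.append(("pin", pin, level)),
                lambda seconds: events.append(("sleep", seconds)))

if __name__ == "__main__":
    for event in events:
        print(event)
```

The reset pulse is sent whether or not the shutdown succeeded, which is what makes the second step a fail-safe.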

Note that when started and when it has rebooted the server, the watchdog is not active and it will not reboot the server even in the absence of a heartbeat. The watchdog must be activated which occurs when it has received a specific, user definable, number of heartbeats from the server. This is on purpose. It is possible to disable the feed service on the server and after an initial reboot by the watchdog, the latter will no longer try to reboot the server. Similarly, it is possible to change the operating system on the server and the watchdog will not interfere as the OS is updated and services are installed. That startup feature entails that it is not necessary to wait for the completion of the server Pi reboot process before restarting the watchdog. It will patiently wait for the heartbeat to resume before starting its job. I think this is a clever idea, but it is not mine. It can be found in the software watchdog (see Raspberry Pi and Domoticz Watchdog or the man page for watchdog).
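
The activation logic is easy to try in isolation. In the sketch below, the Arming class is a name made up for the illustration; its heartbeat method repeats the test found in aliveCallback and its reset method does what initWatchdog does.

```python
START_COUNT = 4  # same value as in wdog.py


class Arming:
    """Hypothetical stand-in for the startCount / watchdogActive globals."""

    def __init__(self, start_count=START_COUNT):
        self.start_count = start_count
        self.count = 0
        self.active = False

    def heartbeat(self):
        """Called for each heartbeat; returns True once the watchdog is active."""
        if not self.active:
            self.count += 1
            if self.count > self.start_count:
                self.active = True
        return self.active

    def reset(self):
        """Back to waiting for the initial heartbeats, as after a reboot."""
        self.count = 0
        self.active = False


if __name__ == "__main__":
    arming = Arming()
    print([arming.heartbeat() for _ in range(6)])
```

Since the comparison is startCount > START_COUNT, the watchdog becomes active on heartbeat number START_COUNT + 1, the fifth one with the value used here.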

One fortunate consequence of not starting the watchdog until it has received the initial heartbeats from the system being monitored is that it does not matter which device is started first. Unfortunately, it does matter which is shut down first. If the server Pi is shut down first while the watchdog is active, then the latter will restart the server after the time-out delay. It does not matter whether the shutdown is done with a command line utility or with the shutdown or reset buttons. This was a problem with the mining rig watchdog also. The workaround is to first stop the wdog.py script or shut down the watchdog Pi.

Obedient Watchdog

There is a way to partially avoid the last problem. Instead of using either the shutdown or reset buttons of the server, the "power" button connected to the watchdog Pi will be used. Indeed, during this experimental phase, the power button will be able to perform four different functions depending on the number of times it is pressed in quick succession.

Press count   Action                  Note
1             Reboot the server       The watchdog continues to function
2             Shut down the server    The watchdog is disabled until the server restarts
3             Reboot the watchdog     Server unaffected
4             Shut down the watchdog

The gpiozero module is added to the virtual environment because it has a convenient button object.

pi@wdog:~ $ ve .systempy
(.systempy) pi@wdog:~ $ pip install gpiozero
...
Successfully installed colorzero-1.1 gpiozero-1.5.1
(.systempy) pi@wdog:~ $ pip freeze
colorzero==1.1
gpiozero==1.5.1
pkg-resources==0.0.0
RPi.GPIO==0.7.0

The listing below only shows the additions made to the ~/.systempy/wdog.py script. Because button presses made in quick succession must be counted, a timer (called buttonTimer, a global variable) is started whenever the button is released. If the button is pressed and released again before time runs out, the button count is incremented and the timer is restarted.

#!/home/pi/.systempy/bin/python
# coding: utf-8

### User settable values ###

# Timing constants
...
BUTTON_WAIT = 0.5     # seconds to wait for a repeat power button press
BUTTON_BOUNCE = 0.08  # seconds of debounce time for power button

# Watchdog GPIO connections
POWER_BUTTON_GPIO = 3  # watchdog input connected to watchdog power button
...

## Global variables ##
...
buttonCount = 0
buttonTimer = None

## Required modules ##
from gpiozero import Button
...

# Routine performed when the power button timer times out
# What is done depends on the number of times the power button was
# pressed. The button count is reset.
#
def doButton():
    global buttonCount
    count = buttonCount
    buttonCount = 0
    if count < 1:
        return
    elif count == 1:
        rebootServer()
    elif count == 2:
        print("Watchdog - shutdown server")
        pulseServerPin(SERVER_SHUTDOWN_GPIO, sleepTime=5)
        initWatchdog()
    elif count == 3:
        print("Watchdog - reboot watchdog")
        check_call(['/sbin/reboot'])    # must be root for this to work
    else:
        print("halt watchdog")
        check_call(['/sbin/poweroff'])  # must be root for this to work

# Button released callback. It increments the power button release
# count and starts a one shot timer that will call on doButton when it
# times out. If a timer was already running, it is cancelled before
# being restarted. This is the mechanism to take care of multiple button
# presses.
#
def buttonUpCallback():
    global buttonCount
    global buttonDownTime
    global buttonTimer
    buttonCount += 1
    if buttonTimer is not None:
        buttonTimer.cancel()
    buttonTimer = Timer(BUTTON_WAIT, doButton)
    buttonTimer.start()

# Setup the power button
button = Button(POWER_BUTTON_GPIO, bounce_time=BUTTON_BOUNCE)
button.when_released = buttonUpCallback

# Setup the GPIO pins
...

When the timer does run out, then doButton is executed. Note that it will be necessary to run the script as root, which will be the case when the script is set up as a service.
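
The counting mechanism can be checked on paper, or with a few lines of Python that work on a list of button release times instead of a real button. This is only a sketch under the assumption that releases less than BUTTON_WAIT seconds apart belong to the same burst; group_presses and action_for_presses are hypothetical names, and the actions are those of the table above.

```python
BUTTON_WAIT = 0.5  # seconds, as in the script


def group_presses(release_times, wait=BUTTON_WAIT):
    """Split sorted release times into bursts; return the press count of each burst."""
    counts = []
    previous = None
    for moment in release_times:
        if previous is not None and moment - previous < wait:
            counts[-1] += 1      # timer had not run out: same burst
        else:
            counts.append(1)     # timer ran out (or first press): new burst
        previous = moment
    return counts


def action_for_presses(count):
    """Same decisions as doButton, returned as text instead of being executed."""
    if count < 1:
        return "nothing"
    if count == 1:
        return "reboot the server"
    if count == 2:
        return "shut down the server"
    if count == 3:
        return "reboot the watchdog"
    return "shut down the watchdog"


if __name__ == "__main__":
    for presses in group_presses([0.0, 0.3, 0.6, 2.0]):
        print(presses, "->", action_for_presses(presses))
```

Three quick presses followed by a single one more than half a second later are thus treated as two separate requests.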

To try this version, get the complete script and make it executable. You may want to preserve the older script as shown in the first line below. The script, wdog_o.py, can also be downloaded.

pi@wdog:~ $ mv .systempy/wdog.py wdog_lm.py
pi@wdog:~ $ wget -O .systempy/wdog.py https://sigmdel.ca/michel/ha/rpi/dnld/wdog_o.py
pi@wdog:~ $ sudo chmod +x .systempy/wdog.py

Of course this solution does not stop the watchdog from restarting the server Pi when the latter is shut down with a bash command. Here are some initial ideas about ways to take care of that problem:

It's fun to speculate about these solutions to what I judge to be a minor problem. Before spending time examining them any further, it would be best to verify just how effective the hardware watchdog will be.

The watchdog described above has the minimum capabilities that will be required of all the other devices to be evaluated as potential hardware watchdogs. This will be done in future posts as announced in the introduction to this series of posts.

Obedient and Barking Watchdog

So far the watchdog has been doing its job very quietly. This will be especially true when the watchdog is run as a service because the print statements in the wdog.py scripts, meant to help in the initial testing, will not be visible. But even the lowly Raspberry Pi Zero has logging capabilities. So I converted the print statements into logging statements. Here is an example.

# Shuts down the server by activating its shutdown pin and,
# after a delay to allow a proper shutdown of the OS, it
# restarts the server by activating its RUN pin. Sleeps
# to allow the boot process to complete on the server
#
def rebootServer():
    global aliveTime
    global watchdogActive
    global startCount
    log(LOG_INFO, "Rebooting server")
    if not watchdogActive:
        log(LOG_INFO, "Already rebooting")
        return
    sendNotification(REBOOT_MSG)
    # Shut down the server properly and then wake it up
    pulseServerPin(SERVER_SHUTDOWN_GPIO, sleepTime=SHUTDOWN_DELAY)
    pulseServerPin(SERVER_RESET_GPIO, sleepTime=RESET_DELAY)
    # Reset the watchdog
    initWatchdog()

The log function is merely a wrapper around the syslog function of the syslog Python module. It optionally prints the log message to the console as before, except for the addition of a time stamp. The sendNotification function calls a library function, postmail, to send an e-mail whenever the server Pi is about to be rebooted or shut down.

    # Routine to send messages to syslog and echo it to the console
    #
    def log(level, msg):
        syslog(level, msg)
        if (VERBOSE) and (level <= CONSOLELOG_LEVEL):
            print(time.strftime('%Y-%m-%d %H:%M:%S ', time.localtime()) + msg)

    # Routine to send a notification (e-mail)
    #
    def sendNotification(msg):
        if SEND_NOTIFICATION:
            try:
                log(LOG_INFO, 'Sending e-mail notification')
                postmail(EMAIL_SUBJECT, msg.format(time.strftime('%Y-%m-%d %H:%M:%S ', time.localtime())), EMAIL_DESTINATION)
                log(LOG_INFO, 'E-mail notification sent')
            except BaseException as error:
                log(LOG_ERR, 'An exception occurred in postmail: {}'.format(str(error)))
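The test `level <= CONSOLELOG_LEVEL` in the log wrapper works because syslog priorities are numbered from most severe (LOG_EMERG is 0) to least severe (LOG_DEBUG is 7), so a smaller number means a more important message. Here is a minimal sketch of that filter in isolation. The helper names should_echo and console_line are hypothetical, and the value chosen for CONSOLELOG_LEVEL is only an assumption for the example.

```python
import syslog
import time

VERBOSE = True
CONSOLELOG_LEVEL = syslog.LOG_INFO   # assumed setting: echo INFO and anything more severe

def should_echo(level):
    """True when a message of this priority is also printed to the console."""
    return VERBOSE and level <= CONSOLELOG_LEVEL

def console_line(msg, when=None):
    """Prefix msg with a local time stamp, in the same format as the log wrapper."""
    return time.strftime('%Y-%m-%d %H:%M:%S ', time.localtime(when)) + msg

# Show the numeric value of a few priorities and whether they would be echoed
for name in ('LOG_ERR', 'LOG_INFO', 'LOG_DEBUG'):
    level = getattr(syslog, name)
    print(name, level, should_echo(level))

print(console_line('Watchdog started'))
```

With this setting, the LOG_DEBUG message about the length of a button press would go to syslog only, while LOG_INFO and LOG_ERR messages would also appear on the console.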

Of course you will need the pymail.py module containing the postmail function. The module and a "secrets" file, pymail_secrets.py, are in an archive that can be obtained here: pymail_0-2-0.zip. Values in the secrets file and in pymail.py will have to be adjusted. If the watchdog does not have access to the Internet, then set the constant SEND_NOTIFICATION to False.

If an SSH session can be opened on the watchdog Pi, then it will be possible to see the logging messages in real time.

    pi@wdog:~ $ journalctl -f SYSLOG_IDENTIFIER=PiWatchdog
    -- Logs begin at Fri 2020-02-07 16:10:03 AST. --
    Feb 08 02:28:25 wdog PiWatchdog[3975]: Watchdog loaded
    Feb 08 02:28:49 wdog PiWatchdog[3975]: Watchdog started
    Feb 08 02:31:55 wdog PiWatchdog[3975]: Watchdog timed out after 51.29 seconds
    Feb 08 02:31:55 wdog PiWatchdog[3975]: Rebooting server
    Feb 08 02:32:28 wdog PiWatchdog[3975]: Resetting watchdog
    Feb 08 02:33:01 wdog PiWatchdog[3975]: Watchdog started

Finally, the power button function was simplified. One short click of the button will reboot both the watchdog and the server Pi. One long press of the power button (more than 3 seconds) will shut down both the server Pi and the watchdog, as the code below shows. Once the watchdog Pi is down, it will be possible to restart it by pressing the button again because the button is connected to GPIO3. I think it is much more likely that I will remember these two possible actions instead of the previous four, and activating the wanted action will be less finicky.

    buttonPressedTime = None  # Time when power button was pressed

    # Callback routine when power button is pressed
    #
    def buttonPressed():
        global buttonPressedTime
        buttonPressedTime = time.time()

    # Callback routine when power button is released
    #
    def buttonReleased():
        elapsed = time.time()-buttonPressedTime
        log(LOG_DEBUG, "Power button pressed for {0:.2f}".format(elapsed))
        if elapsed > 3:
            log(LOG_INFO, "Power button pressed to shut down the server and watchdog")
            sendNotification(SHUTDOWN_MSG)
            pulseServerPin(SERVER_SHUTDOWN_GPIO)
            check_call(['/sbin/poweroff'])  # must be root for this to work
        else:
            log(LOG_INFO, "Rebooting the server and watchdog")
            rebootServer()
            check_call(['/sbin/reboot'])  # must be root for this to work

    ...

    # Setup the power button
    button = Button(POWER_BUTTON_GPIO, bounce_time=BUTTON_BOUNCE)
    button.when_pressed = buttonPressed
    button.when_released = buttonReleased
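The short-versus-long decision made in buttonReleased can be pulled out into a pure function, which makes it possible to try the logic without any hardware. This is only a sketch: classify_press and LONG_PRESS are hypothetical names that do not appear in the script, and the 3-second threshold is copied from the `elapsed > 3` test above. Unlike the callback, the sketch also tolerates a release that was never preceded by a press.

```python
LONG_PRESS = 3.0  # seconds; same threshold as the `elapsed > 3` test in buttonReleased

def classify_press(pressed_at, released_at, long_press=LONG_PRESS):
    """Return 'shutdown' for a long press, 'reboot' for a short one.

    Returns None when no press time was recorded, a case in which the
    subtraction in the original callback would raise a TypeError.
    """
    if pressed_at is None:
        return None
    elapsed = released_at - pressed_at
    return 'shutdown' if elapsed > long_press else 'reboot'

print(classify_press(100.0, 100.4))   # reboot
print(classify_press(100.0, 104.2))   # shutdown
print(classify_press(None, 104.2))    # None
```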

When both devices are down, it will be possible to restart them without toggling their power off and then back on. Pressing the power button of the watchdog will restart the latter; the server can be restarted by pressing its reset button.

To try this last version of the watchdog script to be presented in this post, download it and make it executable. Again you may want to save any previous version before downloading this version of the script.

    pi@wdog:~ $ mv .systempy/wdog.py wdog_bak.py
    pi@wdog:~ $ wget -O .systempy/wdog.py https://sigmdel.ca/michel/ha/rpi/dnld/wdog_ob.py
    pi@wdog:~ $ sudo chmod +x .systempy/wdog.py

It will be necessary to adapt some constants at the beginning of the script.

Unleashing the Watchdog

All that needs to be done now is to ensure that the watchdog script is executed automatically when the watchdog Pi is booted. This is done with a unit file that is almost identical to the one created for the heartbeat script on the server Pi.

pi@wdog:~ $ sudo nano /etc/systemd/system/piwdog.service

    [Unit]
    Description=Raspberry Pi Server Watchdog
    After=network.target

    [Service]
    Type=simple
    Restart=always
    RestartSec=1
    User=root
    ExecStart=/home/pi/.systempy/wdog.py

    [Install]
    WantedBy=multi-user.target

As before, it is easy to perform the usual tasks.

  1. Start the service:
    pi@wdog:~ $ sudo systemctl start piwdog.service
  2. Stop the service:
    pi@wdog:~ $ sudo systemctl stop piwdog.service
  3. Verify the status of the service:
    pi@wdog:~ $ sudo systemctl status piwdog.service
  4. Enable automatic starting of the service at boot:
    pi@wdog:~ $ sudo systemctl enable piwdog.service
  5. Disable automatic starting of the service at boot:
    pi@wdog:~ $ sudo systemctl disable piwdog.service

Timing

How quickly should the hardware watchdog reboot the server when it no longer receives the heartbeat signal? It would be preferable that the home automation system be online all the time to execute scheduled tasks as planned. That could lead one to decide on a fast response from the watchdog. However, provisions have been made in the circuit shown above for manual reboots of the server. Furthermore, I have more than once rebooted the server from outside the house using one of the functions of the home automation system Domoticz. When rebooting, the server will not be sending out heartbeat signals and it would be unfortunate if the starved watchdog were to reboot the server while it is in the process of booting. It would not be a catastrophe because when the watchdog itself reboots the server it knows to wait long enough for the server to reboot before trying to restart it. Actually, the watchdog does not know much of anything; the script contains no fewer than seven timing constants that will probably need to be adjusted in actual use.

    # Timing constants
    CHECK_INTERVAL = 10    # seconds between checks of the last alive signal
    WATCHDOG_TIMEOUT = 45  # seconds without an alive signal before rebooting the server
    PULSE_TIME = 0.3       # length (seconds) of pulses sent to the server shutdown and reset pins
    SHUTDOWN_DELAY = 25    # time allowed (seconds) for the server to shut down
    RESET_DELAY = 5        # time allowed (seconds) for the server to cold boot
    START_COUNT = 4        # number of initial heartbeats to start watchdog
    BUTTON_BOUNCE = 0.08   # seconds of debounce time for power button

The script is basically event driven. One event is the expiry of a repeating timer, whose period is CHECK_INTERVAL. Each time the timer fires, the event handler checks how long it has been since the last heartbeat was received from the server Pi. If that time exceeds WATCHDOG_TIMEOUT, then the watchdog tries to reboot the server. That time period should be greater than the time the server needs to reboot, as explained above. Right now the timeout is set at 45 seconds, but the test server, a Raspberry Pi 3 B, is just a skeleton. It is important to actually time a few reboots and set the timeout to the measured time plus a safety margin, in case adding another piece of software lengthens the boot time. The constant SHUTDOWN_DELAY is related, because it should correspond to the time needed by the server Pi to shut down, which should be approximately half the time needed for a complete reboot. It is important not to underestimate this time because when that interval is over the watchdog will activate the server RUN signal. If that happens too soon, the effect would be to interrupt the shutdown process instead of restarting a halted machine. The RESET_DELAY may seem very short at 5 seconds, but this is not a very important value. After all, the watchdog is reset after initiating a reboot of the server and it will then wait however long it takes for the server to send a few initial heartbeats (as specified by the START_COUNT constant) before beginning to function.
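The interplay between START_COUNT and WATCHDOG_TIMEOUT can be modelled without timers or GPIO pins. What follows is a sketch and not the code of wdog.py: the WatchdogState class and its method names are hypothetical, and time is passed in explicitly so that the behaviour can be checked by hand.

```python
WATCHDOG_TIMEOUT = 45   # seconds without a heartbeat before rebooting the server
START_COUNT = 4         # number of initial heartbeats needed to arm the watchdog

class WatchdogState:
    """Hypothetical model of the arming and timeout logic described above."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Disarm the watchdog, as is done after a reboot has been initiated."""
        self.active = False
        self.count = 0
        self.alive_time = None   # time of the last heartbeat

    def heartbeat(self, now):
        """Record a heartbeat; arm the watchdog after START_COUNT of them."""
        self.alive_time = now
        if not self.active:
            self.count += 1
            if self.count >= START_COUNT:
                self.active = True

    def check(self, now):
        """Run at every CHECK_INTERVAL; True means the server must be rebooted."""
        if not self.active:
            return False         # never bite while the server may still be booting
        return now - self.alive_time > WATCHDOG_TIMEOUT

wd = WatchdogState()
for t in (0, 5, 10):
    wd.heartbeat(t)
print(wd.check(100))   # False: only three heartbeats so far, not armed
wd.heartbeat(15)       # fourth heartbeat arms the watchdog
print(wd.check(50))    # False: 35 seconds since the last heartbeat
print(wd.check(70))    # True: 55 seconds since the last heartbeat
```

Because an unarmed watchdog never reports a timeout, the exact value of RESET_DELAY matters little: after a reset the watchdog simply waits for fresh heartbeats, however long the server takes to boot.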

If the pulse activating the server shutdown GPIO pin is too short, it will not work, no doubt because of the debounce delay in the gpio-shutdown module. A pulse of three tenths of a second seems OK, but if the shutdowns initiated by the watchdog do not seem to work, it may be worthwhile to increase the value of PULSE_TIME. The BUTTON_BOUNCE constant is not too critical. It mattered more in the previous version of the script, when the number of consecutive button presses was being counted. In this version, all that needs to be distinguished is a short versus a long button press, and the debounce delay could easily be 3 or 4 times greater without creating much difficulty.
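To make the role of PULSE_TIME concrete, here is a sketch of a pulse routine. It is not the pulseServerPin function of the script, whose body is not shown in this post. The name pulse_pin, the FakePin class and the injectable sleep parameter are all assumptions made for the example. The device argument is any object with on() and off() methods, which is the case of a gpiozero DigitalOutputDevice; whether "on" drives the line low or high depends on the wiring and is not addressed here.

```python
import time

PULSE_TIME = 0.3  # length (seconds) of the pulse, as in the timing constants

def pulse_pin(device, pulse_time=PULSE_TIME, sleep_time=0, sleep=time.sleep):
    """Activate device for pulse_time seconds, then wait sleep_time seconds.

    The sleep function can be replaced to try the routine without waiting.
    """
    device.on()
    try:
        sleep(pulse_time)
    finally:
        device.off()          # always release the line, even if interrupted
    if sleep_time > 0:
        sleep(sleep_time)     # e.g. SHUTDOWN_DELAY or RESET_DELAY

class FakePin:
    """Stand-in for a GPIO output; records what was done to it."""
    def __init__(self):
        self.events = []
    def on(self):
        self.events.append('on')
    def off(self):
        self.events.append('off')

pin = FakePin()
waits = []
pulse_pin(pin, sleep_time=25, sleep=waits.append)  # record the waits instead of sleeping
print(pin.events, waits)   # ['on', 'off'] [0.3, 25]
```

If shutdowns are not triggered reliably, only the pulse_time argument needs to grow; the delay that follows the pulse is independent of it.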

Testing

Testing of the watchdog was done with two approaches. The simplest is to turn on both Raspberry Pi and, after the watchdog has started, to stop feeding it.

    pi@wdog:~ $ journalctl -f SYSLOG_IDENTIFIER=PiWatchdog
    -- Logs begin at Tue 2020-02-25 18:49:05 AST. --
    Feb 25 18:49:37 wdog PiWatchdog[386]: Watchdog loaded
    Feb 25 18:50:14 wdog PiWatchdog[386]: Watchdog started

To help with the timing, the current time will be obtained just before halting the wdfeed service.

    woopi@goldserver:~ $ date; sudo systemctl stop wdfeed.service
    Tue 25 Feb 18:58:24 AST 2020

    Feb 25 18:59:13 wdog PiWatchdog[386]: Watchdog timed out after 51.26 seconds
    Feb 25 18:59:13 wdog PiWatchdog[386]: Rebooting server
    Feb 25 18:59:13 wdog PiWatchdog[386]: Sending e-mail notification
    Feb 25 18:59:16 wdog PiWatchdog[386]: E-mail notification sent
    Feb 25 18:59:46 wdog PiWatchdog[386]: Resetting watchdog
    Feb 25 19:00:19 wdog PiWatchdog[386]: Watchdog started

If the 51-second timeout seems excessive, remember that the watchdog checks the last time it was fed once every 10 seconds. Only if it has been more than 45 seconds since the last heartbeat was received will the watchdog initiate a reboot. So the time-out could be anywhere between 45 and 55 seconds. The SSH session opened with the server Pi was closed and the following message was received.
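The 45 to 55 second window can be verified with a little arithmetic. The sketch below assumes that checks happen exactly every CHECK_INTERVAL seconds; the function name reported_timeout and the offset parameter (how long after a check the last heartbeat arrived) are hypothetical. The offset of 8.74 seconds is simply a value that happens to reproduce the 51.26 seconds seen in the log.

```python
CHECK_INTERVAL = 10     # seconds between checks of the last alive signal
WATCHDOG_TIMEOUT = 45   # seconds without an alive signal before rebooting

def reported_timeout(offset, interval=CHECK_INTERVAL, timeout=WATCHDOG_TIMEOUT):
    """Elapsed time seen by the first check that exceeds the timeout.

    offset is the delay between a check and the last heartbeat that
    followed it, with 0 <= offset < interval. Successive checks then see
    elapsed times of interval - offset, 2*interval - offset, and so on.
    """
    elapsed = interval - offset
    while elapsed <= timeout:      # the script reboots only when the timeout is exceeded
        elapsed += interval
    return elapsed

for offset in (0, 4.9, 5, 8.74):
    print(offset, round(reported_timeout(offset), 2))
```

Whatever the offset, the reported value is greater than 45 and at most 55 seconds; an elapsed time of exactly 45 seconds does not count as exceeded, so that case waits for the next check.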

Rebooting the home automation server at 2020-02-25 18:59:13 local time.

This confirmed that the watchdog performed as expected. I ran the test overnight by adding the following cron task.

woopi@goldserver:~ $ crontab -e

    ...
    # For more information see the manual pages of crontab(5) and cron(8)
    #
    # m h dom mon dow command
    */15 * * * * sudo systemctl stop wdfeed.service

The server stops feeding the watchdog every fifteen minutes, triggering the watchdog, which reboots the server Pi. This was verified by looking at the incoming e-mails the following morning. To be pedantic, the task only happens once after each boot, at the first quarter-hour mark after the server Pi comes back up, but then the cycle repeats. That verifies the mechanics of the watchdog. But to see it in action, it was necessary to "crash" the server Pi. In the past I have used the forkbomb.sh script.

    #!/bin/bash
    # forkbomb
    swapoff -a
    :(){ :|:& };:

It has the disadvantage of taking a relatively long while to use up all the resources. Others have come up with a similar script; let's call it crash.sh.

    #!/usr/bin/env bash
    # crash
    echo c | sudo tee /proc/sysrq-trigger
    # Reference: How to cause kernel panic with a single command?
    # Answer by artyom and Stephen Kitt
    # https://unix.stackexchange.com/a/66205

So that is my arsenal for testing.

woopi@goldserver:~ $ crontab -e

    ...
    # For more information see the manual pages of crontab(5) and cron(8)
    #
    # m h dom mon dow command
    */15 * * * * /home/pi/crash.sh
    #*/15 * * * * /home/pi/forkbomb.sh
    #*/15 * * * * sudo systemctl stop wdfeed.service

Do not forget to make the scripts executable, and remember that there is no point in enabling more than one of these tasks because the watchdog will reboot the server Pi when one of these tasks is first performed. (That's not exactly true: if crash.sh were started right after either of the other two, it would probably crash the Linux kernel before the previous task could be completed.)
