Updated: April 3, 2018
The ESP8266 contains two watchdog timers: a hardware timer with a timeout of 7 to 8 seconds and a software timer with a shorter timeout of slightly over 3 seconds. These watchdogs are the subject of a previous post. Like others, I have decided that a third watchdog is a good idea for some of my ESP8266/Arduino based projects. That's because it is very easy to write code that feeds the watchdogs but nevertheless goes off the deep end.
A variant of this third watchdog will be added to two projects based on the ESP8266. The first is a Sonoff with custom firmware that, interestingly enough, will be used as an external watchdog for our ISP-provided modem, which has not been reliable lately. I have had to trek down to the basement on numerous occasions to power cycle that device. While we were away for a couple of weeks, the modem was connected to a mechanical timer which turned it off for about a quarter of an hour at 4 am every day. Having an assurance that the home automation system which runs on the modem's WiFi network would not be out of commission for more than 24 hours was comforting. However, the timing was thrown off when power to the house was lost for a while. The Sonoff will replace the mechanical timer and toggle the power to the modem only when necessary and for much shorter intervals.
The second project is based on the Wemos D1 mini, which will act as a garage door monitor. I have mentioned this project before and it is coming close to completion.
Table of Contents
- Need for a Third Watchdog Timer?
- Adding a Loop Watchdog
- Setting an Appropriate lwdt timeout interval
- Set the lwdt timeout period long enough to accomplish all the operations in the main program loop. Be careful when estimating the time required for blocking operations (network, serial, file), which are unpredictable in length.
- Don't feed the lwdt except for the required single feeding in each main program loop (usually at the top of the loop).
- Improved lwdt Watchdog
All that is needed to freeze the microcontroller is an endless loop that feeds the ESP watchdogs. This is what happens in case '3' of the following sketch (available for download: esp_3rd_watchdog_01.ino).
The only way to break the ESP out of its endless loop is to press its reset button or to power it off. For the garage door monitor that would mean climbing about 12 feet up a ladder to reach the device or toggling off the correct circuit breaker at the electrical panel. Neither of these two methods is particularly appealing.
There is a better way.
Markus (Links2004) added a sketch-managed watchdog timer to the built-in ESP8266 watchdogs which will prevent the type of endless loop created in the previous example. Its operation is illustrated in a modified version of the previous sketch.
Available for download: esp_3rd_watchdog_02.ino.
The loop watchdog timer (lwdt) is implemented using a Ticker object, lwdTicker, which will invoke a callback function attached to it every 15 seconds. The callback routine, lwdtcb(), checks the elapsed time since the last time the watchdog was fed and, if it is greater than the timeout period, restarts the ESP8266.
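The decision taken in the callback can be isolated in a small function and tried on a desktop compiler. What follows is only a sketch under assumptions, not the code from the downloadable file: lwdtExpired() and lwdtExpiredDemo() are made-up names, and the test is assumed to be written as an unsigned difference between the current time and the time of the last feeding.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t LWD_TIMEOUT = 15000;  // timeout in ms (15 s, as in the post)

// True when more than LWD_TIMEOUT ms separate `now` from the last feeding.
// The unsigned difference stays correct when the millisecond counter wraps.
bool lwdtExpired(uint32_t now, uint32_t lastFed) {
  return static_cast<uint32_t>(now - lastFed) > LWD_TIMEOUT;
}

// On the ESP8266 the Ticker callback would then do something like:
//   if (lwdtExpired(millis(), lwdTime)) ESP.restart();

// Hypothetical self-check showing the boundary behaviour.
void lwdtExpiredDemo() {
  assert(!lwdtExpired(15000, 0));  // exactly at the limit: no bite yet
  assert(lwdtExpired(15001, 0));   // one millisecond later: bite
}
```

Because the comparison is strict and the callback itself only runs every 15 seconds, a stalled loop may go unnoticed for up to two callback periods.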
In principle, the lwdt should be fed at one point only: at the start of the program loop code. Thus the lwdt is a fail-safe mechanism that ensures that the ESP8266 is continuously executing the main program loop. Clearly, the timeout period of the lwdt should be greater than that of the built-in ESP8266 watchdogs and greater than the worst case scenario for executing the program loop. It is good practice to add some extra time above that. In our case 15 seconds seemed a reasonable timeout period.
Even this simple example shows that some care must be exercised when feeding the lwdt. The program waits for user input in the function cleverly named user_input(). This could take longer than 15 seconds. Accordingly, all watchdogs (the built-in software and hardware watchdogs and the loop watchdog) are fed every 10 milliseconds while waiting for user input:
while (!Serial.available()) { lwdTime = millis(); delay(10); }
But the last case ('4') shows that feeding the lwdt elsewhere than at the top of the main program loop can be a problem. That loop feeding all watchdogs is effectively disabling them. Hardly a fail-safe mechanism!
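The difference between the two situations can be reproduced away from the hardware. The following is a desktop simulation, not code from the downloadable sketches: simulateStuckLoop() and its parameters are made-up names, each 10 ms step stands in for a delay(10) that keeps the built-in watchdogs happy, and the Ticker callback is assumed to run every LWD_TIMEOUT milliseconds with a strict "greater than" test.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t LWD_TIMEOUT = 15000;  // ms, also used as the callback period

// Simulates a program stuck in an inner loop right after loop() fed the lwdt
// at time 0. Returns the simulated time (ms) at which the loop watchdog bites,
// or 0 if it has not bitten by horizonMs.
uint32_t simulateStuckLoop(bool feedLwdtInsideLoop, uint32_t horizonMs) {
  uint32_t now = 0;
  uint32_t lwdTime = 0;  // last feeding of the loop watchdog
  while (now < horizonMs) {
    now += 10;  // stands in for delay(10), which feeds the built-in watchdogs
    if (feedLwdtInsideLoop) {
      lwdTime = now;  // the tempting extra feeding
    }
    bool callbackDue = (now % LWD_TIMEOUT == 0);  // periodic Ticker callback
    if (callbackDue && static_cast<uint32_t>(now - lwdTime) > LWD_TIMEOUT) {
      return now;  // on the device: ESP.restart()
    }
  }
  return 0;
}

// Hypothetical self-check of the two scenarios over one simulated minute.
void simulateStuckLoopDemo() {
  assert(simulateStuckLoop(false, 60000) == 30000);  // bites at the 2nd callback
  assert(simulateStuckLoop(true, 60000) == 0);       // never bites
}
```

In this simulation the stuck loop that leaves the lwdt alone is caught at the second callback (30 seconds), while the one that keeps feeding it is never caught.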
It's best to stick to the two recommendations listed above under "Setting an Appropriate lwdt timeout interval".
What happens if the newly flashed firmware goes rogue and starts clobbering memory? As has been pointed out, it could overwrite the lwdTime variable with an arbitrary value. This is one instance where an incrementing timer is nominally better than a decrementing one. If a decrementing watchdog timer is constantly set to the value 2345688799 by a runaway program, it will never bite since it will never reach 0. Presumably, at some point millis() will be greater than 2345688799 + LWD_TIMEOUT so the loop watchdog timer will bite. But that can be cold comfort since the millis() rollover period is more than 49 days.
A simple improvement would be to use two global variables for the timer: lwdTime, which as before holds the time the watchdog was last fed, and lwdTimeout, which will contain the value in lwdTime plus a constant, which I have chosen to be LWD_TIMEOUT. Then the Ticker callback routine checks that the watchdog has been fed in the last LWD_TIMEOUT period as before and also checks that the difference between the two global variables is still LWD_TIMEOUT. What are the chances that a rogue program changing the value of lwdTime would change the value of lwdTimeout correctly at the same time?
Here are the changes to make.
The complete sketch is available for download: esp_3rd_watchdog_03.ino.
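For readers who want to experiment without the hardware, the double bookkeeping can be sketched as follows. This is an illustration under assumptions and not the content of the downloadable sketch: the LoopWatchdog struct and its member functions feed() and mustRestart() are made-up names (the post uses two global variables), and the current time is passed in as a parameter instead of being read from millis().

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t LWD_TIMEOUT = 15000;  // timeout in ms (15 s, as in the post)

// Hypothetical wrapper around the two variables described in the post.
struct LoopWatchdog {
  uint32_t lwdTime = 0;               // time of the last feeding
  uint32_t lwdTimeout = LWD_TIMEOUT;  // always lwdTime + LWD_TIMEOUT (mod 2^32)

  // Feeding updates both variables together.
  void feed(uint32_t now) {
    lwdTime = now;
    lwdTimeout = now + LWD_TIMEOUT;  // may wrap around, which is harmless here
  }

  // The test the Ticker callback would perform before restarting.
  bool mustRestart(uint32_t now) const {
    bool starved = static_cast<uint32_t>(now - lwdTime) > LWD_TIMEOUT;
    bool corrupted =
        static_cast<uint32_t>(lwdTimeout - lwdTime) != LWD_TIMEOUT;
    return starved || corrupted;
  }
};

// Hypothetical self-check: a clobbered lwdTime is detected at once.
void loopWatchdogDemo() {
  LoopWatchdog w;
  w.feed(1000);
  assert(!w.mustRestart(2000));         // recently fed, variables consistent
  w.lwdTime = 2345688799u;              // rogue write to one variable only
  assert(w.mustRestart(2345688800u));   // the pair no longer matches
}
```

On the device, feed(millis()) would be called once at the top of loop() and the callback would call ESP.restart() when mustRestart(millis()) returns true.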
Now that updating the lwdt watchdog timer is more complex, an lwdtFeed() function has been added. When adding this to a sketch it may be tempting to replace the test in the lwdt callback function with the similar
if ((millis() > lwdTimeout) || (lwdTimeout - lwdTime != LWD_TIMEOUT))
but that would be a bad idea. The test, millis() > lwdTimeout, will not give the expected result when lwdTime > 0xFFFFFFFF - LWD_TIMEOUT. At that point lwdTimeout (= lwdTime + LWD_TIMEOUT, which would be greater than 0xFFFFFFFF) will roll over and become a small value less than LWD_TIMEOUT. The inequality will be true and the watchdog will bite even though the timeout period has not been reached.
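The rollover problem is easy to check with 32-bit unsigned arithmetic on any computer. In the sketch below the function names deadlineFor(), naiveExpired(), elapsedExpired() and rolloverDemo() are made up for the illustration, and the sample times are arbitrary values chosen to sit just below the rollover point.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t LWD_TIMEOUT = 15000;  // timeout in ms (15 s, as in the post)

// Absolute deadline as stored in lwdTimeout; wraps around past 0xFFFFFFFF.
uint32_t deadlineFor(uint32_t lwdTime) {
  return static_cast<uint32_t>(lwdTime + LWD_TIMEOUT);
}

// The tempting test: compare the current time with the stored deadline.
bool naiveExpired(uint32_t now, uint32_t lwdTimeout) {
  return now > lwdTimeout;
}

// The elapsed-time test: unsigned difference, immune to the rollover.
bool elapsedExpired(uint32_t now, uint32_t lwdTime) {
  return static_cast<uint32_t>(now - lwdTime) > LWD_TIMEOUT;
}

// Hypothetical self-check: fed 5 s before the rollover, checked 1 s later.
void rolloverDemo() {
  const uint32_t lwdTime = 0xFFFFFFFFu - 5000u;
  const uint32_t lwdTimeout = deadlineFor(lwdTime);
  const uint32_t now = lwdTime + 1000u;

  assert(lwdTimeout == 9999u);             // deadline wrapped to a small value
  assert(naiveExpired(now, lwdTimeout));   // false alarm after only 1 second
  assert(!elapsedExpired(now, lwdTime));   // correct: 1 s is within 15 s
}
```

Only one second has elapsed in this example, yet the comparison against the wrapped deadline already reports a timeout, while the elapsed-time test does not.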
This "improvement" may be over the top. Markus did suggest that his simpler version was useful. On the other hand, this change is based on only one measure proposed by Niall Murphy and Jack Ganssle for increasing the reliability of software watchdogs. I may come back to this subject, if I find there is a need to follow their advice.
References:
Markus (Links2004) (2016), Watchdog like functionality for the Arduino loop.
Murphy, Niall (2000), Watchdog Timers.
Ganssle, Jack (2016), Great Watchdog Timers for Embedded Systems.