September 11, 2017
Updated: September 17, 2017
Arduino Sketch Managed ESP8266 Watchdog

Presumably, if a watchdog times out, there is a flaw in the latest firmware being executed. Thus, there is every chance the problem will arise again. Indeed, the ESP will probably fall into an endless loop, with the same watchdog timing out over and over again. In the same vein, what if the problem is an exception, say an unforeseen division by 0 that causes an endless sequence of restarts without ever causing any of the watchdogs to bite?

In other words, we have to return to the subject of recovery briefly mentioned in the first post of this series. In this post I will show how to break out of an endless loop of resets by automatically reloading a "known good" version of the firmware over the air. At the same time, I will incorporate even more ideas from Niall Murphy and Jack Ganssle about a better watchdog into the loop watchdog introduced in the previous post.

Table of Contents

  1. Possible Remedies
  2. Non-Volatile Memory
  3. The restart Header File
  4. An Improved Loop Watchdog
  5. An Improved Recovery
  6. Conclusion

  1. Possible Remedies
    Here is the scenario that I dread. After performing an over-the-air (OTA) upgrade of the firmware on my garage door monitor, the system breaks down. It will no longer automatically close the door when I forget to do it. Worse, it is impossible to reload a previous version of the firmware because neither the web server nor the MQTT service is working. As mentioned above, that is what would happen if the ESP8266 were locked in an endless loop of restarts.

    Since OTA upgrades are implemented, it is tempting to think that two versions of the firmware are stored in the ESP flash memory: the current malfunctioning version and the previous version. So why not switch back to the previous version and from there upgrade to a corrected new version? It would be an elegant solution if the presence of a "good enough" older version of the firmware could be guaranteed. Unfortunately, that is not how it works. In the end, an OTA-uploaded upgrade is copied over the previous sketch, which will no longer be resident in the flash memory (see the Updater class).

    A "hands off" solution to recover from an endless loop of restarts is needed for an embedded ESP8266 based device. In a previous post entitled Over the Air Sonoff Flashing I discussed how my home automation system was hosted by a Raspberry Pi that remains on at all times. That system also runs a little web server from which upgrades to various IoT devices can be downloaded over the radio. Why not leave a working version of the garage monitor firmware on that web server and automatically use the over the air update capabilities of the ESP8266 Arduino Core at startup to load that version if needed? Actually, a previous version of the monitor need not be used. All that would be needed is a minimal program capable of OTA uploading to then get a working version of the firmware back onto the garage door monitor.

    There are three complications that I can think of. The first is that when the loop watchdog (lwdt) times out, it restarts the device using the ESP.restart() method. Of course the standard method to find out why the ESP was rebooted does not know about the lwdt and it reports that a software or system reset occurred. So we have to find a way to differentiate software restarts from a lwdt timeout and from a user invoked restart. Of course this distinction would not be necessary if the EspClass methods restart() and reset() were never used. However, they are used in the sketches to which I want to add this recovery method.

    The second complication is that a watchdog may well have timed out because of a transient spike in the power line or some cosmic ray causing a bit to flip in RAM, without affecting the flashed firmware itself. It would be foolish to replace the firmware on a single occurrence of a problem. So we need to keep track of the previous reason for a restart of the ESP as well as the consecutive count of that reason.

    Finally, RAM cannot be used for storing this information because it will not be persistent across system restarts. Each time the ESP is restarted, no matter the reason, all RAM variables will be restored to their initial programmed values. Non-volatile memory has to be used to store the information needed.

  2. Non-Volatile Memory
    Like many other microcontrollers, the ESP8266 has non-volatile memory, or rather, its embedded memory controller can access up to 16 megabytes of external serial flash memory using the Serial Peripheral Interface Bus (SPI). On the Sonoff WiFi switch, there is 1 megabyte of flash memory; on a Wemos D1 mini there are 4 megabytes. This is where the firmware is stored. But some of that memory can also be set aside to store data which can be accessed with the EEPROM library.

    Since my sketches already use this library to save configuration data, it is not too difficult to add the information needed for recovery purposes. However, this is probably not the best idea because of a limitation of flash memory. Most flash memory can perform a limited number of write operations: from a hundred thousand to one million. That is a lot of changes to a configuration, but not for data that will be written each time the ESP8266 restarts. Remember that exception caused restarts will come quickly and pile up a significant number of flash memory writes before the problem is noticed and corrected.
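
To put rough numbers on that concern, here is a back-of-the-envelope calculation. The helper hoursToWearOut is hypothetical and not part of the sketch, and the restart interval and rated write count used below are assumptions chosen for illustration, not measurements.

```cpp
#include <cassert>

// Hypothetical helper, not part of the sketch: hours until a flash sector
// reaches its rated number of writes when every restart costs one write.
double hoursToWearOut(double ratedWrites, double secondsPerRestart) {
  assert(ratedWrites > 0 && secondsPerRestart > 0);
  return ratedWrites * secondsPerRestart / 3600.0;
}
```

With a rating of 100,000 writes and an assumed restart every 5 seconds, hoursToWearOut(100000, 5) comes to roughly 139 hours, a little under six days. That is short enough for an unattended device to hit the limit before anyone notices.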

    Luckily, the ESP8266 has a small amount of non-volatile memory incorporated in its real-time clock (RTC). This is static random access memory (SRAM) just like the working RAM, but it remains powered even when the chip is put into deep sleep. The downside is that there are only 512 bytes of RTC memory available to us.

    In this post, I will show how to use RTC memory. However, the sketch which can be downloaded can use either RTC or EEPROM memory. This is controlled by a preprocessor directive.

  3. The restart Header File
    This is the content of the file restart.h.

    #ifndef __RESTART__
    #define __RESTART__

    extern "C" {
    #include "user_interface.h"
    }

    enum restartReason_t {
      RR_POWER_ON     = REASON_DEFAULT_RST,      /* = 0, normal startup by power on */
      RR_HARD_WDT     = REASON_WDT_RST,          /* = 1, hardware watch dog reset */
      RR_EXCEPTION    = REASON_EXCEPTION_RST,    /* = 2, exception reset, GPIO status won't change */
      RR_SOFT_WDT     = REASON_SOFT_WDT_RST,     /* = 3, software watch dog reset, GPIO status won't change */
      RR_SOFT_RESTART = REASON_SOFT_RESTART,     /* = 4, software restart, system_restart, GPIO status won't change */
      RR_DEEP_SLEEP   = REASON_DEEP_SLEEP_AWAKE, /* = 5, wake up from deep-sleep */
      RR_RESET        = REASON_EXT_SYS_RST,      /* = 6, external system reset */
      RR_LOOP_WDT                                /* loop watchdog reset */
    };

    /*
     * This routine must be called before using any of the following
     * functions.
     *
     * The parameter addr is the RTC memory bucket address at which
     * the restart data is to be saved.
     */
    boolean restartBegin(uint32 addr);

    /*
     * The loop watchdog calls this method to restart the ESP.
     */
    void lwdtRestart(unsigned long where);

    /*
     * Returns the reason for the restart and the number of consecutive times
     * the ESP has been restarted for that same reason.
     *
     * If the reason returned is an exception, data returns the exception number.
     * If the reason returned is a loop watchdog timeout, data returns where.
     * If the reason returned is any other cause, data is meaningless.
     */
    restartReason_t getRestartReason(int &count, unsigned long &data);

    #endif

    The enumerated type restartReason_t extends the rst_reason found in user_interface.h to include the new loop watchdog restart. There is probably a better way of doing this in C++ but I am not familiar with that language.

    As the comment says, before using the other functions, restartBegin must be invoked with the address in RTC memory where the restart information will be saved. The parameter can be any "bucket address" from 0 to 126. RTC memory is divided into 4-byte buckets. Because the restart data occupies two buckets, it would overflow if stored at bucket 127. Hence, if an addr greater than 126 is specified, the function returns false.
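
The bucket arithmetic can be tried out on a desktop compiler. What follows is only a sketch under stated assumptions: fakeRtc is a plain array standing in for the 512 bytes of RTC memory, and RestartRecord, recordFits, saveRecord and loadRecord are hypothetical names, not the layout or functions used in the downloadable sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Host-side stand-in for RTC memory: 512 bytes = 128 buckets of 4 bytes.
constexpr std::uint32_t RTC_BUCKETS = 128;
static std::uint32_t fakeRtc[RTC_BUCKETS];

// Hypothetical two-bucket record; the real layout may well differ.
struct RestartRecord {
  std::uint32_t reasonAndCount;  // stored in bucket addr
  std::uint32_t data;            // stored in bucket addr + 1
};
static_assert(sizeof(RestartRecord) == 8, "record must span exactly two buckets");

constexpr std::uint32_t RECORD_BUCKETS = sizeof(RestartRecord) / 4;

// True when a record starting at bucket addr ends inside RTC memory (addr 0..126).
bool recordFits(std::uint32_t addr) {
  return addr <= RTC_BUCKETS - RECORD_BUCKETS;
}

// Copies the record into the simulated RTC memory; refuses out-of-range addresses.
bool saveRecord(std::uint32_t addr, const RestartRecord &rec) {
  if (!recordFits(addr)) return false;
  std::memcpy(&fakeRtc[addr], &rec, sizeof rec);
  return true;
}

// Copies the record back out of the simulated RTC memory.
bool loadRecord(std::uint32_t addr, RestartRecord &rec) {
  if (!recordFits(addr)) return false;
  std::memcpy(&rec, &fakeRtc[addr], sizeof rec);
  return true;
}
```

On the device, the two memcpy calls would presumably be replaced by the RTC user memory read and write functions of the ESP8266 Arduino core; the range check stays the same.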

    The loop watchdog does not restart the ESP directly. It must call the lwdtRestart function which stores information in RTC memory before calling ESP.restart(). That is how it will be possible to discriminate restarts caused by the loop watchdog.

    The comment for the last function is self-explanatory I hope. Note that this function should only be called once. There is no real mechanism to enforce that rule except that it will systematically return RR_POWER_ON with a count of -1 after the first call.

    I decided to treat different exceptions as different reasons for restarting the ESP. So if an exception 3 follows an exception 0, the value of count returned with the second exception is 1.
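
The counting rule just described might be sketched as follows. This is not the code from the restart module: RestartRecord, sameCause and nextRecord are hypothetical names, and treating two loop watchdog bites in different modules as the same cause is an assumption on my part.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record kept between restarts (names are illustrative only).
struct RestartRecord {
  std::uint32_t reason;  // a restartReason_t value
  std::uint32_t data;    // exception number or loop watchdog location
  int count;             // consecutive restarts with the same cause
};

constexpr std::uint32_t RR_EXCEPTION = 2;  // same value as in restart.h

// Two restarts share a cause when the reasons match and, for exceptions,
// when the exception numbers match as well.
bool sameCause(const RestartRecord &prev, std::uint32_t reason, std::uint32_t data) {
  if (prev.reason != reason) return false;
  return reason != RR_EXCEPTION || prev.data == data;
}

// Builds the record to store for the current restart from the previous one.
RestartRecord nextRecord(const RestartRecord &prev, std::uint32_t reason, std::uint32_t data) {
  int count = sameCause(prev, reason, data) ? prev.count + 1 : 1;
  return RestartRecord{reason, data, count};
}
```

Starting from a record for exception 0, calling nextRecord with exception 3 yields a count of 1, and a further exception 3 raises it to 2.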

    The details of the implementation are in the restart.ino file. I will not discuss these here. The next section looks at how all this is used. It also explains what is meant by where a loop watchdog bite occurs.

  4. An Improved Loop Watchdog
    I tend to create very short modular Arduino program loops. Here is an example:

    void loop() {
      lwdtFeed();
      buttonModule();
      inputModule();
      ledModule();
      netModule();
      lwdWhere = LOOP_START;
    }

    All the work is done in four "modules" (xxxxModule()). As before, the program loop starts by feeding the loop watchdog. The last task is setting the value of the lwdWhere variable to LOOP_START. This is part of the improvements brought to the loop watchdog.

    Each module begins by setting lwdWhere to a unique value identifying the start of the module:

    void inputModule() {
      lwdWhere = INPUT_MODULE;
      // ... the actual work of the module goes here
    }

    That way the loop watchdog can report which module was being executed if it bites. This module identifier is what the loop watchdog passes on to the lwdtRestart function discussed above.

    There are a couple of special values that are not associated with an individual module. Recall that two values are constantly fed to the watchdog. The difference between the two values is monitored by the watchdog, and if it changes then the watchdog bites because presumably the firmware has gone rogue and is overwriting memory. In that case the loop watchdog will set the "location" of the timeout to LWD_OVERWRITTEN.

    void ICACHE_RAM_ATTR lwdtcb(void) {
      if (lwdTimeout - lwdTime != LWD_TIMEOUT)
        lwdWhere = LWD_OVERWRITTEN;   // lwdTimeout and lwdTime out of phase
      else if (millis() - lwdTime < LWD_TIMEOUT)
        return;                       // fed recently enough, nothing to do
      lwdtRestart(lwdWhere);          // lwd timed out
    }
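
The decision made by the callback can be checked on a desktop by pulling it out into a pure function. This is a sketch only: lwdCheck and LwdVerdict are hypothetical names, the clock is passed in as a parameter instead of calling millis(), and the 5000 ms value of LWD_TIMEOUT is assumed for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t LWD_TIMEOUT = 5000;  // assumed value in ms, illustration only

enum class LwdVerdict { Ok, TimedOut, Overwritten };

// Same two tests as the ticker callback, in the same order.
LwdVerdict lwdCheck(std::uint32_t now, std::uint32_t lwdTime, std::uint32_t lwdTimeout) {
  if (lwdTimeout - lwdTime != LWD_TIMEOUT) return LwdVerdict::Overwritten;
  if (now - lwdTime < LWD_TIMEOUT) return LwdVerdict::Ok;
  return LwdVerdict::TimedOut;
}
```

The sketch uses 32-bit unsigned values because unsigned long may be 64 bits wide on a desktop. With 32-bit arithmetic both subtractions wrap around, so the checks keep working when the millisecond counter rolls over.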

    The LOOP_START value actually identifies the behind-the-scenes code performed at the top of the program loop. The function feeding the watchdog checks that lwdWhere is still equal to LOOP_START. If not, then either the behind-the-scenes code modified the content of lwdWhere or, somehow, there was a short circuit so that the complete sequence of modules was not executed before the program loop restarted. This will cause the watchdog to bite, and the situation is signalled, albeit in a rather arcane way.

    #define LOOP_START        0xAB001001
    #define BUTTON_MODULE     0xAB002002
    #define LED_MODULE        0xAB003003
    #define INPUT_MODULE      0xAB004004
    #define NET_MODULE        0xAB005005
    #define LWD_OVERWRITTEN   0xAB006006
    #define INCOMPLETE_LOOP   0xBA000000
    #define AND_MASK          0x00FFFFFF
    #define OR_MASK           0xAB000000

    void lwdtFeed(void) {
      lwdTime = millis();
      lwdTimeout = lwdTime + LWD_TIMEOUT;
      if (lwdWhere != LOOP_START) {
        // the previous pass through loop() did not run to completion
        lwdtRestart( ((lwdWhere & AND_MASK) | INCOMPLETE_LOOP) );
      }
    }

    Notice how the top 16 bits of all module identifiers are 0xAB00. When lwdtFeed restarts the ESP, it replaces the top 16 bits with the value 0xBA00.
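
The masking can be verified at compile time on any C++11 compiler. The constants below are copied from the defines above; markIncomplete and isIncompleteLoop are hypothetical helper names added for the illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t INPUT_MODULE    = 0xAB004004;
constexpr std::uint32_t INCOMPLETE_LOOP = 0xBA000000;
constexpr std::uint32_t AND_MASK        = 0x00FFFFFF;

// What lwdtFeed passes to lwdtRestart: clear the top byte, then tag it with 0xBA.
constexpr std::uint32_t markIncomplete(std::uint32_t where) {
  return (where & AND_MASK) | INCOMPLETE_LOOP;
}

// Hypothetical decoder for the reporting side: was the loop cut short?
constexpr bool isIncompleteLoop(std::uint32_t where) {
  return (where & ~AND_MASK) == INCOMPLETE_LOOP;
}

static_assert(markIncomplete(INPUT_MODULE) == 0xBA004004, "top 16 bits become 0xBA00");
static_assert(isIncompleteLoop(markIncomplete(INPUT_MODULE)), "tagged value is recognized");
static_assert(!isIncompleteLoop(INPUT_MODULE), "plain module identifier is not tagged");
```

The low 24 bits survive the operation, so the reported value still says which module was the last one to start.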

    References:


    Murphy, Niall (2000), Watchdog Timers.
    Ganssle, Jack (2016), Great Watchdog Timers for Embedded Systems.

  5. An Improved Recovery
    Recovery is done in the setup() function of the sketch. Here is a stripped-down version of what the code could be.

    #include <errno.h>
    #include <EEPROM.h>
    #include <Ticker.h>
    #include <ESP8266WiFi.h>
    #include <ESP8266httpUpdate.h>
    #include "restart.h"

    #define AUTO_UPDATE_COUNT 3    // 0 to disable auto update of firmware
    #define AUTO_SSID "your_ssid"
    #define AUTO_PSK  "your_password"
    #define AUTO_URL  "http://192.168.0.22/myprog/myprog.good.bin"

    void setup() {
      Serial.begin(115200);

      // guard against the restart after flashing bug (see the first post)
      if (getBootDevice() == 1) {
        Serial.println("\nPress the reset button or power device off and on now!");
        while (1) {
          yield();
        }
      }

      restartBegin(0);
      int restartCount;
      unsigned long restartData;

      restartReason_t reason = getRestartReason(restartCount, restartData);
      boolean updateNeeded = (reason == RR_HARD_WDT) || (reason == RR_EXCEPTION)
                          || (reason == RR_SOFT_WDT) || (reason == RR_LOOP_WDT);

      if ( updateNeeded && (restartCount >= AUTO_UPDATE_COUNT) && (AUTO_UPDATE_COUNT > 0) ) {
        Serial.printf("\nConnecting to %s to get firmware %s\n", AUTO_SSID, AUTO_URL);
        WiFi.begin(AUTO_SSID, AUTO_PSK);
        int count = 0;
        while (!WiFi.isConnected()) {
          Serial.print(".");
          count++;
          if (count > 50) {   // start a new line of dots now and then
            Serial.println();
            count = 0;
          }
          delay(100);
        }
        Serial.println();
        updateOta(AUTO_URL);
      }

      //...

      lwdTime = millis();
      lwdTicker.attach_ms(LWD_TIMEOUT, lwdtcb); // attach lwdt interrupt service routine to ticker
      Serial.println("setup() completed");
    }

    void updateOta(const char *url) {
      ESPhttpUpdate.rebootOnUpdate(false);
      t_httpUpdate_return ret = ESPhttpUpdate.update(url);
      switch (ret) {
        case HTTP_UPDATE_FAILED:
          Serial.printf("OTA update failed. Error (%d): %s\n", ESPhttpUpdate.getLastError(), ESPhttpUpdate.getLastErrorString().c_str());
          break;
        case HTTP_UPDATE_NO_UPDATES:
          Serial.println("OTA update failed. Error: No updates");
          break;
        case HTTP_UPDATE_OK:
          Serial.println("OTA update successful. Restarting...");
          delay(1000);
          ESP.restart();
          break;
      }
    }

    This is not too complicated. There is some initial housekeeping, including opening the serial port and making sure that we are not trapped by the restart after flashing bug (that was covered in the first post on this subject). Then the restart module is initialized and the reason for the latest restart is obtained. There is a potential need for automatic updating of the firmware if the restart was caused by a watchdog timeout or an exception. Updating will be done if the number of consecutive restarts for that reason is at least AUTO_UPDATE_COUNT and if the latter is greater than 0.
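
The decision itself can be isolated in a small function and exercised on a desktop. The sketch below copies the enumeration values from restart.h so that it compiles on its own; isFault and shouldAutoUpdate are hypothetical names, and the threshold is passed as a parameter instead of using the AUTO_UPDATE_COUNT define.

```cpp
#include <cassert>

// Values copied from restart.h so the sketch is self-contained.
enum restartReason_t {
  RR_POWER_ON = 0, RR_HARD_WDT, RR_EXCEPTION, RR_SOFT_WDT,
  RR_SOFT_RESTART, RR_DEEP_SLEEP, RR_RESET, RR_LOOP_WDT
};

// Watchdog timeouts and exceptions point to a possible firmware fault.
bool isFault(restartReason_t reason) {
  return reason == RR_HARD_WDT || reason == RR_EXCEPTION ||
         reason == RR_SOFT_WDT || reason == RR_LOOP_WDT;
}

// Same test as in setup(): reload the known good firmware once the same
// fault has occurred autoUpdateCount times in a row; 0 disables the feature.
bool shouldAutoUpdate(restartReason_t reason, int restartCount, int autoUpdateCount) {
  return autoUpdateCount > 0 && isFault(reason) && restartCount >= autoUpdateCount;
}
```

With a threshold of 3, a third consecutive loop watchdog bite triggers the reload while the first two do not, and a user-invoked software restart never does.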

  6. Conclusion
    In the case of a rogue program thrashing RAM and flash memory because of a catastrophic programming error or because of a particularly disruptive cosmic ray, I am not convinced that the technique will save the day. Would it not be amazing if it spared all the code presented here and the ESP WiFi code and the HTTP update code and so on?

    On the other hand, I do think that the technique will be useful as protection from self-inflicted wounds. I am an optimist and likely to do OTA uploads of new, probably buggy, versions of the firmware to embedded devices. The errors introduced at such times will probably not be immense, and reloading a known good version may very well work as expected.

    You can download the complete example. It is a more sophisticated "blinky" that serves as a test bed with a bunch of defines at the beginning to create watchdog timeouts or exceptions in a particular module.

    In that example, there is a configuration module using the EEPROM library to save data that presumably could be changed at runtime. It shows how the restart module can play nice with the EEPROM configuration module even when the former is also implemented using EEPROM memory.

    The setup() function reports the cause of the system restart in much more detail. This could be useful for diagnostic purposes if there were a way to get at it remotely. It will be useful during the development stage if the Arduino serial window is open.

    As I said, I am not familiar with C and C++. I taught myself procedural programming in Pascal and then object oriented programming with a then new language: Java. With the advent of Delphi 2, I returned to Pascal which I have used almost exclusively since then. All that to say that while I recognize that what I called the "restart module" should be redone as a class, I will not do it any time soon.

    Clearly then, the example does not contain well-formed C++ code. If you find particularly egregious bits, you can inform me by clicking on my name at the bottom of the page. Be aware that this is not a public forum: snide remarks will not be published, and if you do not provide a better way to do something or point to a helpful resource, then your message will end up in the round file cabinet. But a big thank you ahead of time to all those who do provide useful criticism.
