A Better ESP8266 Loop Watchdog with Better Recovery
September 11, 2017
Updated: June 24, 2018
<-Arduino Sketch Managed ESP8266 Watchdog A Third ESP8266 Watchdog, Final Version->

Presumably, if a watchdog times out, there is a flaw in the latest firmware being executed. Thus, there is every chance the problem will arise again. Indeed, the ESP will probably fall into an endless loop, with the same watchdog timing out over and over again. In the same vein, what if the problem is an exception, say an unforeseen division by 0 that causes an endless sequence of restarts without ever causing any of the watchdogs to bite?

In other words, we have to return to the subject of recovery briefly mentioned in the first post of this series. In this post I will show how to break out of an endless loop of resets by automatically reloading a "known good" version of the firmware over the air. At the same time, I will incorporate even more ideas from Niall Murphy and Jack Ganssle about a better watchdog into the loop watchdog introduced in the previous post.

The general discussion about watchdogs found in this post remains valid. However, the latest version of the loop watchdog, which is now a library, is, in my opinion, much better. Unfortunately, I have yet to translate to English the detailed presentation that is available in French: Un troisième temporisateur de surveillance du ESP8266, version finale. Do look at A Third ESP8266 Watchdog, Final Version for links to the newer version and a summary presentation in English.
June 24, 2018

Table of Contents

  1. Possible Remedies
  2. Non-Volatile Memory
  3. The restart Header File
  4. An Improved Loop Watchdog
  5. An Improved Recovery
  6. Conclusion

  1. Possible Remedies

    Here is the scenario that I dread. After performing an over the air (OTA) upgrade of the firmware on my garage door monitor, the system breaks down. It will no longer automatically close the door when I forget to do it. Worse, it is impossible to reload a previous version of the firmware because neither the web server nor the mqtt service is working. As mentioned above, that is what would happen if the ESP8266 were locked in an endless loop of restarts.

    Since OTA upgrades are implemented, it is tempting to think that two versions of the firmware are stored in the ESP flash memory: the current malfunctioning version and the previous version. So why not switch back to the previous version and from there upgrade to a corrected new version? It would be an elegant solution if the presence of a "good enough" older version of the firmware could be guaranteed. Unfortunately, that is not how it works. In the end, an OTA uploaded upgrade is copied over the previous sketch, which will no longer be resident in the flash memory (see the Updater class).

    A "hands off" solution to recover from an endless loop of restarts is needed for an embedded ESP8266 based device. In a previous post entitled Over the Air Sonoff Flashing I discussed how my home automation system was hosted by a Raspberry Pi that remains on at all times. That system also runs a little web server from which upgrades to various IoT devices can be downloaded over the radio. Why not leave a working version of the garage monitor firmware on that web server and automatically use the over the air update capabilities of the ESP8266 Arduino Core at startup to load that version if needed? Actually, a previous version of the monitor need not be used. All that would be needed is a minimal program capable of OTA uploading to then get a working version of the firmware back onto the garage door monitor.

    There are three complications that I can think of. The first is that when the loop watchdog (lwdt) times out, it restarts the device using the ESP.restart() method. Of course the standard method to find out why the ESP was rebooted does not know about the lwdt and it reports that a software or system reset occurred. So we have to find a way to differentiate software restarts from lwdt timeouts and from user invoked restarts. This distinction would not be necessary if the EspClass methods restart() and reset() were never used. However, they are used in the sketches to which I want to add this recovery method.

    The second complication is that it may well be that a watchdog timed out because of a transient spike in the power line or some cosmic ray causing a bit to flip in RAM but without affecting the flashed firmware itself. It would be foolish to replace the firmware on a single occurrence of a problem. So we need to keep track of the previous reason for a restart of the ESP as well as the consecutive count of that reason.

    Finally, RAM cannot be used for storing this information because it will not be persistent across system restarts. Each time the ESP is restarted, no matter the reason, all RAM variables will be restored to their initial programmed values. Non-volatile memory has to be used to store the information needed.

  2. Non-Volatile Memory

    Like many other microcontrollers, the ESP8266 has non-volatile memory or rather its embedded memory controller can access up to 16 megabytes of external serial flash memory using the Serial Peripheral Interface Bus (SPI). On the Sonoff WiFi switch, there is 1 megabyte of flash memory, on a Wemos D1 mini there are 4 megabytes. This is where the firmware is stored. But some of that memory can also be set aside to store data which can be accessed with the EEPROM library.

    Since my sketches already use this library to save configuration data, it is not too difficult to add the information needed for recovery purposes. However, this is probably not the best idea because of a limitation of flash memory. Most flash memory can perform a limited number of write operations: from a hundred thousand to one million. That may appear to be a non-binding limit on the number of changes to the configuration data. However, the sketch will write data to flash memory each time the ESP8266 restarts, so it could become a problem. Remember that restarts caused by an exception will typically become loops that pile up a significant number of flash memory writes before the problem is noticed and corrected.

    Luckily, the ESP8266 has a small amount of non-volatile memory incorporated in its real-time clock (RTC). This is static random access memory (SRAM) just like the working RAM, but it is always powered even when the chip is put into deep sleep. The downside is that there are only 512 bytes of RTC memory available to us.

    In this post, I will show how to use RTC memory. However, the sketch which can be downloaded can use either RTC or EEPROM memory. This is controlled by a preprocessor directive.

  3. The restart Header File

    This is the content of the restart header file, restart.h.

    #ifndef __RESTART__
    #define __RESTART__

    extern "C" {
    #include "user_interface.h"
    }

    enum restartReason_t {
      RR_POWER_ON = REASON_DEFAULT_RST,         /* = 0, normal startup by power on */
      RR_HARD_WDT = REASON_WDT_RST,             /* = 1, hardware watch dog reset */
      RR_EXCEPTION = REASON_EXCEPTION_RST,      /* = 2, exception reset, GPIO status won't change */
      RR_SOFT_WDT = REASON_SOFT_WDT_RST,        /* = 3, software watch dog reset, GPIO status won't change */
      RR_SOFT_RESTART = REASON_SOFT_RESTART,    /* = 4, software restart, system_restart, GPIO status won't change */
      RR_DEEP_SLEEP = REASON_DEEP_SLEEP_AWAKE,  /* = 5, wake up from deep-sleep */
      RR_RESET = REASON_EXT_SYS_RST,            /* = 6, external system reset */
      RR_LOOP_WDT                               /* loop watchdog reset */
    };

    /*
     * This routine must be called before using any of the following
     * functions.
     *
     * The parameter addr is the RTC memory bucket address at which
     * the restart data is to be saved.
     */
    boolean restartBegin(uint32 addr);

    /*
     * The loop watchdog calls this method to restart the ESP.
     */
    void lwdtRestart(unsigned long where);

    /*
     * Returns the reason for the restart and the number of consecutive times
     * the ESP has been restarted for that same reason.
     *
     * If the reason returned is an exception, data returns the exception number.
     * If the reason returned is a loop watchdog timeout, data returns where.
     * If the reason returned is any other cause, data is meaningless.
     */
    restartReason_t getRestartReason(int &count, unsigned long &data);

    #endif

    The enumerated type restartReason_t extends the rst_reason found in user_interface.h to include the new loop watchdog restart. There is probably a better way of doing this in C++ but I am not familiar with that language.

    As the comment says, before using the other functions, restartBegin must be invoked with the address in RTC memory where the restart information will be saved. The parameter can be any "bucket address" from 0 to 126. RTC memory is divided into 4-byte buckets. Because the restart data occupies two buckets, it would overflow if stored at bucket 127. Hence, if an addr greater than 126 is specified, the function returns false.
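    The bucket arithmetic can be checked on a desktop compiler. The following is only a sketch of the range test described above: the helper isValidRestartAddr and the constant names are made up for illustration and are not taken from restart.ino.

```cpp
#include <cassert>
#include <cstdint>

// Figures taken from the text: 512 bytes of RTC memory in 4-byte buckets.
const uint32_t RTC_BUCKET_COUNT = 512 / 4;  // 128 buckets, numbered 0..127
const uint32_t RESTART_BUCKETS = 2;         // the restart data needs two buckets

// Hypothetical helper: true when the restart data fits starting at addr.
bool isValidRestartAddr(uint32_t addr) {
  // Compare against the last usable start (126) instead of adding to addr,
  // so that a huge addr cannot wrap around and pass the test.
  return addr <= RTC_BUCKET_COUNT - RESTART_BUCKETS;
}

// Usage example: the boundary cases mentioned in the text.
void restartAddrExamples() {
  assert(isValidRestartAddr(126));   // buckets 126 and 127 are used
  assert(!isValidRestartAddr(127));  // the second bucket would be out of range
}
```

    Starting at bucket 126 uses the last two buckets, which is why 126 is the largest accepted value.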

    The loop watchdog does not restart the ESP directly. It must call the lwdtRestart function which stores information in RTC memory before calling ESP.restart(). That is how it will be possible to discriminate restarts caused by the loop watchdog.

    The comment for the last function is self-explanatory I hope. Note that this function should only be called once. There is no real mechanism to enforce that rule except that it will systematically return RR_POWER_ON with a count of -1 after the first call.

    I decided to treat different exceptions as different reasons for restarting the ESP. So if an exception 3 follows an exception 0, the value of count returned with the second exception is 1.
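    The counting rule can be sketched as a pure function that runs on a desktop compiler. This is an illustration under assumptions, not the code in restart.ino: the RestartRecord layout, the name nextRecord and the choice to ignore the location when comparing two loop watchdog restarts are all mine.

```cpp
#include <cassert>
#include <cstdint>

const uint16_t SKETCH_RR_EXCEPTION = 2;  // same value as REASON_EXCEPTION_RST

// Hypothetical 8-byte record; the real layout in restart.ino may differ.
struct RestartRecord {
  uint16_t reason;  // restart reason code
  uint16_t count;   // consecutive restarts for this same cause
  uint32_t data;    // exception number or loop watchdog location
};

// Returns the record to store after a restart with the given reason and data.
RestartRecord nextRecord(const RestartRecord &prev, uint16_t reason, uint32_t data) {
  bool sameCause = (prev.reason == reason);
  // Two exceptions only count as the same cause if the exception number matches.
  if (sameCause && reason == SKETCH_RR_EXCEPTION) {
    sameCause = (prev.data == data);
  }
  RestartRecord next;
  next.reason = reason;
  next.count = sameCause ? static_cast<uint16_t>(prev.count + 1) : 1;
  next.data = data;
  return next;
}
```

    With a previous record holding exception 0, a new exception 3 starts again at a count of 1, while another exception 0 raises the count to 2.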

    The details of the implementation are in the restart.ino file. I will not discuss these here. The next section looks at how all this is used. It also explains what is meant by where a loop watchdog bite occurs.

  4. An Improved Loop Watchdog

    I tend to create very short modular Arduino program loops. Here is an example:

    void loop() {
      lwdtFeed();
      buttonModule();
      inputModule();
      ledModule();
      netModule();
      lwdWhere = LOOP_START;
    }

    All the work is done in four "modules" (xxxxModule()). As before, the program loop starts by feeding the loop watchdog. The last task is setting the value of the lwdWhere variable to LOOP_START. This is part of the improvements brought to the loop watchdog.

    Each module begins by setting lwdWhere to a unique value identifying the start of the module:

    void inputModule() {
      lwdWhere = INPUT_MODULE;
      // ... rest of the module
    }

    That way the loop watchdog can report which module was being executed if it bites. This module identifier is what the loop watchdog passes on to the lwdtRestart function discussed above.

    There are a couple of special values that are not associated with an individual module. Recall that two values are constantly fed to the watchdog. The difference between the two values is monitored by the watchdog and if it changes then the watchdog bites because presumably the firmware has gone rogue and is overwriting memory. In that case the loop watchdog will set the "location" of the timeout at LWD_OVERWRITTEN.

    void ICACHE_RAM_ATTR lwdtcb(void) {
      if (lwdTimeout - lwdTime != LWD_TIMEOUT)
        lwdWhere = LWD_OVERWRITTEN;                // lwdTimeout and lwdTime out of phase
      else if (millis() - lwdTime < LWD_TIMEOUT)
        return;
      lwdtRestart(lwdWhere); // lwd timedout
    }
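    To try the two checks made by the callback without an ESP8266, they can be restated as a pure function. This is a sketch only: checkLoopWatchdog and the LwdVerdict names are hypothetical, the parameter now stands in for millis(), and the timeout is passed in because its actual value is not given here.

```cpp
#include <cassert>
#include <cstdint>

enum LwdVerdict { LWD_FED, LWD_CORRUPTED, LWD_EXPIRED };

// Hypothetical pure version of the checks in lwdtcb(); now stands in for millis().
LwdVerdict checkLoopWatchdog(uint32_t now, uint32_t lwdTime,
                             uint32_t lwdTimeout, uint32_t timeout) {
  // The two fed values must stay exactly timeout apart.
  if (static_cast<uint32_t>(lwdTimeout - lwdTime) != timeout) return LWD_CORRUPTED;
  // Unsigned subtraction gives the elapsed time even after millis() rolls over.
  if (static_cast<uint32_t>(now - lwdTime) < timeout) return LWD_FED;
  return LWD_EXPIRED;
}
```

    Because all the arithmetic is unsigned and 32 bits wide, the comparison keeps working when the millisecond counter wraps around to zero.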

    The LOOP_START value actually identifies the behind-the-scenes code performed at the top of the program loop. The function feeding the watchdog checks that lwdWhere is still equal to LOOP_START. If not, then either the behind-the-scenes code modified the content of lwdWhere or, somehow, there was a short circuit so that the complete sequence of modules was not executed before the program loop restarted. This will cause the watchdog to bite and the situation is signalled, albeit in a rather arcane way.

    #define LOOP_START        0xAB001001
    #define BUTTON_MODULE     0xAB002002
    #define LED_MODULE        0xAB003003
    #define INPUT_MODULE      0xAB004004
    #define NET_MODULE        0xAB005005
    #define LWD_OVERWRITTEN   0xAB006006
    #define INCOMPLETE_LOOP   0xBA000000
    #define AND_MASK          0x00FFFFFF
    #define OR_MASK           0xAB000000

    void lwdtFeed(void) {
      lwdTime = millis();
      lwdTimeout = lwdTime + LWD_TIMEOUT;
      if (lwdWhere != LOOP_START) {
        lwdtRestart( ((lwdWhere & AND_MASK) | INCOMPLETE_LOOP) );
      }
    }

    Notice how the top 16 bits of all module identifiers are 0xAB00. When lwdtFeed restarts the ESP, it replaces the top 16 bits with the value 0xBA00.
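    As a worked example of the masking, here is what happens to INPUT_MODULE. The constants are copied from the defines above; markIncomplete only wraps the expression used in lwdtFeed, and isIncompleteLoop is a hypothetical helper for decoding the value after a restart.

```cpp
#include <cassert>
#include <cstdint>

const uint32_t INPUT_MODULE    = 0xAB004004;
const uint32_t INCOMPLETE_LOOP = 0xBA000000;
const uint32_t AND_MASK        = 0x00FFFFFF;

// Same expression as in lwdtFeed(): keep the low 24 bits, stamp 0xBA on top.
uint32_t markIncomplete(uint32_t where) {
  return (where & AND_MASK) | INCOMPLETE_LOOP;
}

// Hypothetical helper for decoding the value reported after a restart.
bool isIncompleteLoop(uint32_t where) {
  return (where >> 24) == 0xBA;
}
```

    So 0xAB004004 becomes 0xBA004004: the low bits still name the module last recorded in lwdWhere, and the 0xBA prefix says that the loop restarted before the sequence of modules was completed.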

    Thanks to Kyle Fleming, co-founder of Black Prism, for pointing out that the content of lwdWhere might be overwritten and not lwdtRestart. Of course, the lwdtRestart code could be clobbered, but then all bets are off as discussed in the conclusion.
    April 3, 2018

    References:


    Murphy, Niall (2000), Watchdog Timers.
    Ganssle, Jack (2016), Great Watchdog Timers for Embedded Systems.

  5. An Improved Recovery

    Recovery is done in the setup() function of the sketch. Here is a stripped-down version of what the code could be.

    #include <errno.h>
    #include <EEPROM.h>
    #include <Ticker.h>
    #include <ESP8266WiFi.h>
    #include <ESP8266httpUpdate.h>
    #include "restart.h"

    #define AUTO_UPDATE_COUNT 3    // 0 to disable auto update of firmware
    #define AUTO_SSID "your_ssid"
    #define AUTO_PSK  "your_password"
    #define AUTO_URL  "http://192.168.0.22/myprog/myprog.good.bin"

    void setup() {
      Serial.begin(115200);

      // guard against the restart after flashing bug (see the first post)
      if (getBootDevice() == 1) {
        Serial.println("\nPress the reset button or power device off and on now!");
        while (1) {
          yield();
        }
      }

      restartBegin(0);
      int restartCount;
      unsigned long restartData;

      restartReason_t reason = getRestartReason(restartCount, restartData);
      boolean updateNeeded = (reason == RR_HARD_WDT) || (reason == RR_EXCEPTION)
                          || (reason == RR_SOFT_WDT) || (reason == RR_LOOP_WDT);

      if ( updateNeeded && (restartCount >= AUTO_UPDATE_COUNT) && (AUTO_UPDATE_COUNT > 0) ) {
        Serial.printf("\nConnecting to %s to get firmware %s\n", AUTO_SSID, AUTO_URL);
        WiFi.begin(AUTO_SSID, AUTO_PSK);
        int count = 0;
        while (!WiFi.isConnected()) {
          Serial.print(".");
          count++;
          if (count > 50) {
            Serial.println();
            count = 0;
          }
          delay(100);
        }
        Serial.println();
        updateOta(AUTO_URL);
      }

      //...

      lwdTime = millis();
      lwdTicker.attach_ms(LWD_TIMEOUT, lwdtcb); // attach lwdt interrupt service routine to ticker
      Serial.println("setup() completed");
    }

    void updateOta(const char *url) {
      ESPhttpUpdate.rebootOnUpdate(false);
      t_httpUpdate_return ret = ESPhttpUpdate.update(url);
      switch (ret) {
        case HTTP_UPDATE_FAILED:
          Serial.printf("OTA update failed. Error (%d): %s\n", ESPhttpUpdate.getLastError(), ESPhttpUpdate.getLastErrorString().c_str());
          break;
        case HTTP_UPDATE_NO_UPDATES:
          Serial.println("OTA update failed. Error: No updates");
          break;
        case HTTP_UPDATE_OK:
          Serial.println("OTA update successful. Restarting...");
          delay(1000);
          ESP.restart();
          break;
      }
    }

    This is not too complicated. There is some initial housekeeping including opening the serial port and making sure that we are not trapped by the restart after flashing bug (that was covered in the first post on this subject). Then the restart module is initialized and the reason for the latest restart is obtained. There is a potential need for automatic updating of the firmware if the restart was caused by a watchdog timeout or an exception. Updating will be done if the number of consecutive restarts for that reason is at least AUTO_UPDATE_COUNT and if the latter is greater than 0.
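    The decision taken in setup() can be isolated in a small predicate, which makes it easy to test away from the hardware. This is a sketch under assumptions: shouldAutoUpdate is a made-up name, and the enumeration is a plain copy of restartReason_t so that the example builds without the SDK header. It follows the comparison used in the code, restartCount >= AUTO_UPDATE_COUNT.

```cpp
#include <cassert>

// Copy of restartReason_t from restart.h, written out so that this sketch
// builds without user_interface.h; the values 0 to 7 match the header.
enum restartReason_t {
  RR_POWER_ON, RR_HARD_WDT, RR_EXCEPTION, RR_SOFT_WDT,
  RR_SOFT_RESTART, RR_DEEP_SLEEP, RR_RESET, RR_LOOP_WDT
};

// Hypothetical helper: true when a known good firmware should be reloaded.
bool shouldAutoUpdate(restartReason_t reason, int restartCount, int autoUpdateCount) {
  // Only watchdog timeouts and exceptions point to faulty firmware.
  bool faulty = (reason == RR_HARD_WDT) || (reason == RR_EXCEPTION)
             || (reason == RR_SOFT_WDT) || (reason == RR_LOOP_WDT);
  // An autoUpdateCount of 0 disables the automatic update altogether.
  return faulty && (autoUpdateCount > 0) && (restartCount >= autoUpdateCount);
}
```

    With AUTO_UPDATE_COUNT set to 3, the third consecutive loop watchdog restart triggers the reload, while any number of user invoked software restarts never does.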

  6. Conclusion

    In the case of a rogue program thrashing RAM and flash memory because of a catastrophic programming error or because of particularly disruptive cosmic rays, I am not convinced that the technique will save the day. Would it not be amazing that it spared all the code presented here and the ESP WiFi code and the HTTP update code and so on?

    On the other hand, I do think that the technique will be useful as protection from self-inflicted wounds. I am an optimist and likely to do OTA uploads of new, probably buggy, versions of the firmware to embedded devices. The errors introduced at such times will probably not be immense, and reloading a known good version may very well work as expected.

    You can download the complete example (***). It is a more sophisticated "blinky" that serves as a test bed with a bunch of defines at the beginning to create watchdog timeouts or exceptions in a particular module.

    *** This watchdog should not be used. Look at those found in A Third ESP8266 Watchdog, Final Version instead. This archive is still available only because it shows how EEPROM memory could be used instead of RTC memory which is exclusively used in the newer watchdog.
    June 24, 2018
    There was a nasty bug in the previous version of restart.ino. It did not report the name of the "module" in which the loop watchdog was biting. At first I believed that the cause of the error in the implementation of the loop watchdog was that sizeof(RESTART) returned 4 bytes, which I assumed was the size of a pointer, and not 8 bytes, which is the size of the RESTART structure. In Pascal, my preferred programming language, the size of the record would have been returned. That would have been an understandable error on my part but, in all humility, it was inexcusable not to have noticed the problem in the first place. Maybe I had done all the tests with sizeof(restart) and then, when it was time to create the archive, I just did a search and replace because I thought sizeof(RESTART) was "better". Malarkey! The sizeof operator returns the same value whether its argument is the name of the struct or an instance of the struct.

    The real problem was that the restart structure was not aligned on a 32 bit boundary. It appears that reading and writing to the RTC memory can only be done when the destination or source address is aligned properly.
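    Both points can be checked at compile time. The following is only a sketch: RestartSketch is a stand-in for the RESTART structure with invented field names, and alignas(4) is one possible way of forcing 32 bit alignment, not necessarily the fix used in restart.ino.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the RESTART structure; the field names are made up.
struct alignas(4) RestartSketch {
  uint16_t reason;
  uint16_t count;
  uint32_t data;
};

// Compilation fails if the structure stops matching what RTC memory expects.
static_assert(sizeof(RestartSketch) == 8, "restart data should fill two 4-byte buckets");
static_assert(alignof(RestartSketch) % 4 == 0, "restart data should be 32-bit aligned");

// True when the address can be used as a source or destination for RTC memory.
bool isWordAligned(const void *p) {
  return reinterpret_cast<uintptr_t>(p) % 4 == 0;
}
```

    An instance declared as RestartSketch r; has sizeof r equal to sizeof(RestartSketch), and isWordAligned(&r) holds because of the alignas specifier.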

    In that example, there is a configuration module using the EEPROM library to save data that presumably could be changed at runtime. It shows how the restart module can play nice with the EEPROM configuration module even when the former is also implemented using EEPROM memory.

    The setup() function reports in much more detail the cause of the system restart. This could be useful for diagnostic purposes if there were a way to get at it remotely. It will be useful during the development stage if the Arduino serial window is open.

    As I said, I am not familiar with C and C++. I taught myself procedural programming in Pascal and then object oriented programming with a then new language: Java. With the advent of Delphi 2, I returned to Pascal which I have used almost exclusively since then. All that to say that while I recognize that what I called the "restart module" should be redone as a class, I will not do it any time soon.

    Clearly then, the example does not contain well-formed C++ code. If you find particularly egregious bits, you can inform me by clicking on my name at the bottom of the page. A big thank you ahead of time to all who send in corrections and suggestions or useful criticism.
