An ESP8266 Based Router Watchdog
September 25, 2019
<-A Better ESP8266 Loop Watchdog with Better Recovery --

In a previous post (Arduino Sketch Managed ESP8266 Watchdog) in this series, I talked about using a Sonoff WiFi switch as a router watchdog. It was working in a fashion, but I decided that I had to improve the firmware and hardware.

Some way of suspending monitoring had to be added. I found that the Sonoff would power cycle the router while I was trying to change some of the latter's settings. Each time that happened, I had to trudge up and down two flights of stairs to physically remove the Sonoff. After a few rounds of this dance, I removed the switch completely which defeated the whole purpose, of course.

Secondly, the Sonoff/watchdog restarted the router only when the local Wi-Fi network went down. There was no check for loss of connection to the Internet. While the former is a necessary element of the home automation system, which runs mostly on the local wireless network, the latter is also important for personal and professional reasons.

Table of Contents

  1. The Router Monitor
  2. What Others Have Done
  3. Is the Network Down?
  4. Conflicting Goals
  5. Network Monitoring State Machine
  6. Support Capabilities
  7. Hardware Interface
  8. Commands
  9. Installing Router Monitor

The Router Monitor

I will not go into the details about the hardware. The case of a Sonoff Basic was shortened by sawing off the ends meant to clamp the input and output wires. It was then hot glued to the bottom plate of a plastic wall-mounted device box. Although not visible in the photographs, holes were drilled in the bottom plate to give access to the Sonoff tactile switch and the LED. A push button switch was installed along the top of the box. A standard North American duplex receptacle was installed inside the case. One receptacle is controlled by the Sonoff Basic and that is where the router's AC adapter is plugged in. Power is always available at the other receptacle. The following photographs should make all this clear.

What Others Have Done

Not long after my experiment with a Sonoff Basic, Charlie Romer (It Kinda Works) produced a couple of videos about rebooting a router when the Internet connection is lost. He used the same technique of turning the router off and on as I had, but it was Internet access which was monitored instead of just the Wi-Fi connection. I would suggest watching the Charlie Romer videos before reading on: Router Booter - Never reboot your router again! and the update Router Booter - What went wrong? He responded to critics who suggested that he should use a Sonoff to do this by saying that he preferred building the hardware using a Wemos D1 mini, a relay board and a power supply because more is learned that way. Fair enough. I happen to think the software is the more interesting aspect of this project, so I concentrated my efforts on creating more sophisticated firmware for the Sonoff. If you want, there is nothing to prevent you from using my firmware with Charlie Romer type hardware. As a matter of fact, I developed the firmware on a Wemos D1 mini emulating a Sonoff.

A crowdfunding project back in 2016, WiReboot has a slightly different take. It is an ESP8266 based gadget that sits between the router power supply and the router and interrupts the power when the connection is lost.

My original router monitor and these two devices are rather dumb brutes that know only one way of dealing with a perceived network problem. They turn the power off to the router, wait a short while and then turn the power back on.

Some router makers provide a software watchdog based on ICMP requests. See ICMP Watchdog in the Ubiquiti Networks devices and Manual:System/Watchdog on the MikroTik Wiki. There is a page about Hardware Watchdog on the OpenWrt site. It goes on to discuss using USB watchdogs, which could be of interest if you can tinker with both the software and hardware of the router. If more information is needed about these cheap devices, look at How to create your own usb watchdog script by David Gouveia. Such devices would have the advantage of performing system restarts before going to the drastic step of pulling the plug.

Is the Network Down?

Let's start by discussing how to test if the router is performing its job correctly. Here I propose to test the typical box supplied by an Internet service provider (ISP). These are typically multifunction devices acting as a bridge to the ISP servers and thence on to the rest of the Internet, as Ethernet switches, as Wi-Fi access points, DHCP server, firewall, and so on. Since I have decided to use an ESP8266 based device to act as the router watchdog, lots of things done by the router will not be tested. Actually only two things are tested: the connection to the 2.4 GHz Wi-Fi network and access to the Internet.

Testing the 2.4 GHz Wi-Fi network is particularly important because it is the backbone of the home automation system. I am keen on having this system up and running dependably especially when I am away from the house. There is no need to explain why Internet access is important anymore. I will add that I have a vested interest in the proper functioning of the network because others in the household need the Internet for personal and professional reasons.

It may seem incorrect, but I will argue that the Wi-Fi network must be running to get access to the Internet. It is true that when the radio is down, the Ethernet switch part of the router could be functioning, or the 5 GHz radio could still be working. However, there are only a couple of older computers that are wired into the router; all others, plus all tablets, smart phones, etc. use Wi-Fi. And they also use the 2.4 GHz band because the range of the 5 GHz band does not quite cover all areas of the house.

That means that the two tested services can be in one of three states:

  1. The Wi-Fi network is up and the Internet is reachable,
  2. The Wi-Fi network is up and the Internet is unreachable,
  3. The Wi-Fi network is down (and consequently the Internet is unreachable).
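These three states can be written down as a small helper. This is only an illustration: the names ServiceState and classifyNetwork are hypothetical and do not appear in the sketch, which uses the testNetwork() function instead.

```cpp
// Hypothetical helper, not part of the router monitor sketch.
// The numeric values match the three states listed above.
enum class ServiceState {
  WifiUpInternetUp = 1,
  WifiUpInternetDown = 2,
  WifiDown = 3
};

// Internet reachability only matters when Wi-Fi is up; when Wi-Fi is
// down the Internet is considered unreachable whatever the flag says.
ServiceState classifyNetwork(bool wifiUp, bool internetReachable) {
  if (!wifiUp) return ServiceState::WifiDown;
  return internetReachable ? ServiceState::WifiUpInternetUp
                           : ServiceState::WifiUpInternetDown;
}
```

Note that there is no fourth state: a Wi-Fi failure hides whatever the Internet test would have reported.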

How can it be determined that the Wi-Fi network is up or down? On an ESP device, this is easily verified.

    if (WiFi.status() != WL_CONNECTED) Serial.println("Wi-Fi not connected");

or

    if (WiFi.status() == WL_CONNECTED) Serial.println("Wi-Fi connected");

The usual test for determining if the Internet is unreachable is a ping (ICMP request) to a major site known to be (almost) always up. While I have no solid data to support this assertion, I believe a router monitor relying on such a cursory test will often power cycle the router for no good reason when the service is in state 2. A host on the Internet could be unreachable because of all sorts of problems that have nothing to do with the router or the immediate connection between it and the ISP. Perhaps the site chosen as a target for an ICMP request is down. Perhaps it is under a denial of service attack. Perhaps there is a problem with the domain name system and the IP address of the target site cannot be obtained. Perhaps there is a major power outage and the only backbone that can be used to get to my isolated location is not available. Turning the router off and on will not do anything to fix these problems.

Given the amount of time needed for the router to power up and reboot and for all wireless devices to reconnect, it is clearly desirable to ensure that the router monitor acts only when it is certain that it must. Accordingly, Wi-Fi must not be connected for at least 30 seconds before state 3 is declared. In the same vein, I have chosen to ping three different Internet hosts instead of only one. Contact with all of them must be lost for a certain amount of time before the Internet is deemed unreachable. During that grace period, the Internet hosts are regularly pinged.

In other words, two watchdogs are set up. Each time the loop() function in the sketch is executed, the Wi-Fi watchdog will be fed if the Wi-Fi network is connected. Similarly the Internet watchdog is fed if a ping to one of the targets was successful. Here is the function that tests if the wireless network is functioning and if the Internet can be reached.

    // result for testNetwork() function
    enum netResult_t { NET_OK, WIFI_DOWN, INTERNET_UNREACHABLE };

    // keep track of why network is down
    enum netResult_t netDownReason;  // one of WIFI_DOWN, INTERNET_UNREACHABLE

    // timers
    unsigned long lastTimeWifiUp;
    unsigned long lastTimeInternetReached;
    unsigned long lastPingTime;

    int pingTarget = 0;

    netResult_t testNetwork(void) {
      if (WiFi.status() != WL_CONNECTED) {
        if (millis() - lastTimeWifiUp > config.wifiDownInterval) {
          sendToLog(LOG_ERR, "*** WiFi DOWN ***");
          return WIFI_DOWN;
        } else {
          return NET_OK;  // wifi may be down but wait before declaring it so
        }
      } else {
        lastTimeWifiUp = millis();  // WiFi is up, reset the timer
      }

      // too soon for another ping, only check the Internet watchdog
      if (millis() - lastPingTime < config.intervalBetweenPings) {
        if (millis() - lastTimeInternetReached > config.internetLostInterval) {
          sendToLog(LOG_DEBUG, "*** INTERNET LOST ***");
          return INTERNET_UNREACHABLE;
        }
        return NET_OK;
      }

      lastPingTime = millis();
      if (Ping.ping(config.targets[pingTarget], 1)) {
        sendToLogf(LOG_INFO, "Pinged %s with success", config.targets[pingTarget]);
        lastTimeInternetReached = millis();
      } else {
        sendToLogf(LOG_INFO, "Failed pinging %s", config.targets[pingTarget]);
      }
      // move on to the next target (avoid modifying pingTarget twice in one expression)
      pingTarget = (pingTarget + 1) % PING_TARGET_COUNT;
      return NET_OK;
    }

The two variables lastTimeWifiUp and lastTimeInternetReached record the clock tick count when the Wi-Fi and Internet watchdogs were last fed by the testNetwork() function. If the time elapsed since the last feeding is greater than config.wifiDownInterval or config.internetLostInterval then the routine returns a WIFI_DOWN or INTERNET_UNREACHABLE value; otherwise a NET_OK value is returned. The only complications are the use of multiple ping targets as explained before and the minimum delay of config.intervalBetweenPings milliseconds between successive pings. That is implemented with the lastPingTime timer. This is to ensure that the router monitor does not overburden the network with ping requests.
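The elapsed-time test used by both watchdogs relies on unsigned subtraction, which remains correct when the 32-bit millisecond counter returned by millis() wraps around after roughly 49.7 days. The following is a minimal sketch of that idea, written so that it can be exercised on a desktop: the SoftWatchdog struct is hypothetical (the sketch uses plain unsigned long variables) and the clock value is passed in as a parameter instead of being read from millis().

```cpp
#include <cstdint>

// Hypothetical software watchdog, for illustration only.
struct SoftWatchdog {
  uint32_t lastFed;  // tick count (ms) when the watchdog was last fed
  uint32_t timeout;  // grace period (ms) before the watchdog expires

  void feed(uint32_t now) { lastFed = now; }

  // Unsigned subtraction yields the true elapsed time even if the tick
  // counter wrapped around between lastFed and now.
  bool expired(uint32_t now) const {
    return static_cast<uint32_t>(now - lastFed) > timeout;
  }
};
```

With a 30 second timeout, a watchdog fed at tick 0xFFFFFF00 is not expired at tick 0x00000100 (only 512 ms later) even though the counter has wrapped in between.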

Conflicting Goals

For the sake of the home automation system, the 2.4 GHz Wi-Fi network needs to be running without interruption. Essential parts of this system do not need access to the Internet. In fact, the only Internet-based services that do not have a local backup are weather and tide updates. Clearly loss of Internet access is just an inconvenience for the home automation system. But for humans about the house having a functional Wi-Fi network without access to the Web is not that useful. They would prefer that the router be restarted as often as possible when the Internet is unreachable, which interferes with the home automation system. Such are the conflicting goals associated with having a single Wi-Fi network handle both the Internet of Things and normal Internet usage.

A compromise is implemented. The spinning or cool-down period after turning the router off and then on is shorter when trying to reestablish the Wi-Fi connection and longer when trying to reach the Internet. That way the home automation system will be able to operate more or less unimpeded for longer periods of time when only access to the Internet is lost.

The better solution would be to operate two distinct Wi-Fi networks. That has been in the plans for quite a long while but I am not sure I can justify the expense of setting up an edge router, intelligent switches and separate Wi-Fi networks.

Network Monitoring State Machine

It seems a bit pompous to talk about a state machine in this case as the router monitor can be in one of only four states. In my defence, the testNetwork function described in the previous section was actually implemented in the state machine in the first version of this sketch.

Hopefully, the state machine will spend most of its time in the monitoring state in which all it does is check if the network is down. If it should happen to be down, the time-out period will be set according to the reason for the loss of network access and then the machine will move on into the cycling state. This is a short period when the power to the router is turned off. Once power is restored the state machine will be idle for quite a while, giving the router ample time to boot and some respite before potential follow up power cycles.

    // states of the monitorState machine
    enum monitorState_t { DISABLED, MONITORING, CYCLING, SPINNING };
    enum monitorState_t monitorState = MONITORING;

    void monitorUpdate(void) {
      if (monitorState == DISABLED) return;

      if (monitorState == MONITORING) {
        switch (testNetwork()) {
          case WIFI_DOWN:
            netDownReason = WIFI_DOWN;
            monitorState = CYCLING;
            break;
          case INTERNET_UNREACHABLE:
            netDownReason = INTERNET_UNREACHABLE;
            monitorState = CYCLING;
            break;
          default:
            break;  // NET_OK, keep monitoring
        }
        if (monitorState == CYCLING) {
          // network is down: cut power and pick the cool-down period
          setRouterOff();
          setBlinkyPattern(ROUTER_CYCLING_PATTERN);
          spinInterval = (netDownReason == WIFI_DOWN) ? config.waitAfterWifiDown : config.waitAfterInternetLost;
          cycleStartTime = millis();
        }
        return;
      }

      if (monitorState == CYCLING) {
        if (millis() - cycleStartTime >= config.routerOffInterval) {
          // router has been off long enough: restore power
          setRouterOn();
          setBlinkyPattern(WAIT_PATTERN);
          monitorState = SPINNING;
          spinStartTime = millis();
        }
        return;
      }

      if (monitorState == SPINNING) {
        if (millis() - spinStartTime >= spinInterval) {
          enableMonitor();
        }
      }
    }

As can be seen, the state machine will never reach the DISABLED state on its own. That state can only be entered and left as a result of a command from the user.

This is a very simple routine. Amazingly, given the size of the sketch, this is all there is to the core function of the router monitor. The rest of the code provides support routines.
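To try out the transitions on a desktop, the state machine can be rewritten as a function that receives the network test result and the clock as arguments. What follows is a sketch under that assumption and not the code found in the firmware: Monitor, Timing and step are hypothetical names, the intervals used to exercise it are placeholders and not the defaults of the sketch, and I assume that enableMonitor() simply returns the machine to the MONITORING state.

```cpp
#include <cstdint>

enum NetResult { NET_OK, WIFI_DOWN, INTERNET_UNREACHABLE };
enum MonitorState { DISABLED, MONITORING, CYCLING, SPINNING };

// Hypothetical container for the timing settings (all in milliseconds)
struct Timing {
  uint32_t routerOff;              // how long the router stays powered off
  uint32_t waitAfterWifiDown;      // short cool-down after a Wi-Fi failure
  uint32_t waitAfterInternetLost;  // long cool-down after an Internet failure
};

// Hypothetical state record standing in for the globals of the sketch
struct Monitor {
  MonitorState state = MONITORING;
  uint32_t stamp = 0;         // tick count when CYCLING or SPINNING started
  uint32_t spinInterval = 0;  // cool-down chosen when the failure was seen
  bool routerPowered = true;  // stands in for the relay
};

// One pass of the state machine; net is ignored unless monitoring.
void step(Monitor &m, NetResult net, uint32_t now, const Timing &t) {
  switch (m.state) {
    case DISABLED:
      break;  // only a user command can leave this state
    case MONITORING:
      if (net == NET_OK) break;
      m.spinInterval = (net == WIFI_DOWN) ? t.waitAfterWifiDown
                                          : t.waitAfterInternetLost;
      m.routerPowered = false;
      m.stamp = now;
      m.state = CYCLING;
      break;
    case CYCLING:
      if (now - m.stamp >= t.routerOff) {
        m.routerPowered = true;
        m.stamp = now;
        m.state = SPINNING;
      }
      break;
    case SPINNING:
      if (now - m.stamp >= m.spinInterval) m.state = MONITORING;
      break;
  }
}
```

With placeholder intervals of 10 seconds off and a 300 second cool-down, reporting INTERNET_UNREACHABLE at tick 1000 moves the machine to CYCLING, then to SPINNING at tick 11000 and back to MONITORING at tick 311000.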

Support Capabilities

There are a number of auxiliary functions in the sketch some of which will be briefly described below.

Hardware Facilities

The Sonoff Basic LED is used to report the current state of the monitor. The Sonoff tactile button and another push button can be used to control the relay, enable or disable the monitoring function, initiate an over-the-air update of the firmware and restart the device. Details are provided in the next section.

MQTT Functionality

The primary method to interact with the router monitor is through an MQTT broker. The ESP subscribes to the routermon-1/command topic and responds by publishing messages to the routermon-1/response topic.

Here is an example of how this works. First open a terminal and subscribe to all topics related to the device.

michel@hp:~$ mosquitto_sub -h 192.168.1.22 -v -t "routermon-1/#"

Then open a second terminal and publish a message, in this case the help command.

michel@hp:~$ mosquitto_pub -h 192.168.1.22 -t "routermon-1/command" -m "help"

The command and the response will be displayed in the first terminal.

    michel@hp:~$ mosquitto_sub -h 192.168.1.22 -v -t "routermon-test/#"
    routermon-test/command help
    routermon-test/response 00:05:18.0089 mqtt> help
    routermon-test/response 00:05:18.0091 commands: clientip config cycle help log monitor mqtt name net ota ping reach restart router syslog time topic update url version

Some details about all the commands are given in a section further down.

Logging Facilities

The source code contains numerous logging messages. Some of these are used for debugging purposes but most are informational messages as will be explained. Logging messages are sent to four destinations.

The files logging.h and logging.ino contain the code that performs the logging functions. All logging messages are accompanied by a logging level parameter which determines if the message is actually sent on to each of the logging destinations. Please note that this is only a partial implementation of the usual Syslog protocol. It is not possible to pick and choose so that, say, only "alert" and "warning" messages are displayed. Each destination has a threshold level and all messages with that priority or higher will be sent to the destination.

It is important to realize that all messages sent to the MQTT broker are in fact sent through the logging function. In order to use MQTT to control the router monitor as explained above, the threshold log level for MQTT logging should thus be set at "info" or "debug".
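The threshold rule comes down to a single comparison because, as in Syslog, a lower number means a higher priority. Here is a minimal sketch of such a filter. The function passesThreshold is hypothetical and only assumed to behave like the test done in logging.ino; the level names other than LOG_ERR, LOG_INFO and LOG_DEBUG are taken from the usual Syslog severities.

```cpp
// Syslog severities: 0 is the most severe, 7 the most verbose
enum LogLevel {
  LOG_EMERG, LOG_ALERT, LOG_CRIT, LOG_ERR,
  LOG_WARNING, LOG_NOTICE, LOG_INFO, LOG_DEBUG
};

// Hypothetical filter: a message is forwarded to a destination when its
// level is numerically less than or equal to the destination threshold.
bool passesThreshold(int level, int threshold) {
  return level <= threshold;
}
```

With the MQTT threshold at "err", a message logged at "info" is dropped, which is why the threshold has to be raised to "info" or "debug" before command responses show up on the broker.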

Command Processing

The commands.h and commands.ino files contain the command interpreter. My apologies for the quality of the code. It more or less grew in size as commands were added without much regard for an overall design. It is clearly in need of refactoring. Indeed, I suspect there are better ways of implementing the interpreter altogether. Nevertheless it does work and it does incorporate a minimal error reporting mechanism which hopefully will help the user understand why a command was not executed.

OTA Updates

This firmware is a work in progress, so it was important to include a mechanism to update the firmware without needing to take the device apart. An update command exists which will download a new firmware file from a web server. The network on which the web server can be found can be specified with the ota command, and the URL of the firmware file can be set using the url command with the ota option.

I have included my own ESP8266 watchdog routines to avoid infinite restart loops that could be introduced by a wayward firmware update. A known "good" version of the firmware will be reloaded over the air if a restart loop is detected. It uses the same web server as that used for the update command. The URL of the good version of the firmware can be specified with the url command using the auto option this time.

Persistent Settings

If the name or password of the monitored Wi-Fi network needs to be changed, then in all likelihood, this should be permanent. This can be accomplished by saving all the important settings in persistent memory on the ESP8266. This is done with the command config save.

All the settings are in a structure called config which is defined in the config.h file. The default values for all the settings are also defined in the same file. The code implementing the functions that save the settings to persistent memory, load them from it, erase the saved copy and reload the default values is in the file named config.ino.
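The start-up log shown in the Commands section reports a stored configuration with version 65535 followed by a fallback to the default configuration (version 1). The value 65535 (0xFFFF) is what erased flash memory reads as, so a version check is one plausible way to reject it. The sketch below illustrates that idea only. I am guessing at the mechanism: the Settings struct, its two fields and the chooseConfig function are hypothetical and much simpler than the real config structure.

```cpp
#include <cstdint>

// Assumed version number of the current settings layout
const uint16_t CONFIG_VERSION = 1;

// Hypothetical, heavily reduced settings structure
struct Settings {
  uint16_t version;
  uint32_t wifiDownInterval;  // ms
};

// Default values; 30000 ms is the 30 second Wi-Fi grace period
Settings defaultSettings() {
  return {CONFIG_VERSION, 30000};
}

// Keep the stored settings only if their version is the expected one;
// erased flash (all bits set) fails the test and defaults are used.
Settings chooseConfig(const Settings &stored) {
  return (stored.version == CONFIG_VERSION) ? stored : defaultSettings();
}
```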

Hardware Interface

The Sonoff Basic LED can be observed through a hole in the bottom of the device. It displays various flashing patterns depending on the state of the monitor.

Heartbeat: two 2/10th second flashes with a short off time between repeated every two seconds.
The router monitor is in MONITORING state, which means that the Wi-Fi 2.4 GHz network is up and the Internet can be reached or any loss of these functions has been for too short a period to confirm the loss.
50% duty cycle: 1/2 second on, 1/2 second off.
The router monitor is turning the power off to the router for 10 seconds.
Double heartbeat: four 1/10th second flashes with a short off time between repeated every three seconds.
The router monitor is in SPINNING state, which gives the router and wireless devices time to recover from the restart of the router.
Almost always on: very short 2/100th second interruptions every two seconds.
The router is powered up, and the router monitor is disabled.
Almost always off: very short 2/100th second flashes every two seconds.
The router is not powered, and the router monitor is disabled.
Always off.
The firmware is being updated.

Either the Sonoff push button or another push button wired across the ESP8266 GPIO14 pin and ground can be used to physically control the device to some extent.

Single button click.
This toggles the state of the Sonoff Basic relay (i.e. power to the router). When power to the router is controlled manually in this fashion, the router monitor is disabled.
Two button clicks.
This toggles the state of the router monitor. If it was disabled, the router monitor is put in MONITORING mode. If the state machine was enabled (no matter if it was in MONITORING, CYCLING or SPINNING state), it is disabled. This does not affect the Sonoff relay.
Four or more button clicks.
This launches an over-the-air update of the router monitor firmware.
Long button press.
Restarts the device. This is more or less the equivalent of removing power from the Sonoff device and then powering it up again.
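The button handling described above amounts to a mapping from the kind of press to an action. The following sketch shows that mapping in isolation. The enum ButtonAction and the function actionFor are hypothetical names, and since nothing is said about three clicks I assume they are ignored.

```cpp
// Hypothetical action codes, for illustration only
enum ButtonAction {
  NO_ACTION,
  TOGGLE_RELAY,    // also disables the monitor
  TOGGLE_MONITOR,  // does not touch the relay
  START_OTA,
  RESTART_DEVICE
};

// clicks is the number of short clicks detected; longPress is true when
// the button was held down instead.
ButtonAction actionFor(int clicks, bool longPress) {
  if (longPress) return RESTART_DEVICE;
  if (clicks == 1) return TOGGLE_RELAY;
  if (clicks == 2) return TOGGLE_MONITOR;
  if (clicks >= 4) return START_OTA;
  return NO_ACTION;  // three clicks (assumption) or no click at all
}
```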

Commands

Much finer control can be achieved with commands that can be transmitted to the ESP8266 by a serial connection or through an MQTT broker. Of course the serial connection with the ESP8266 UART is not very practical and its main purpose is to facilitate software development. Here is the typical output displayed in the serial monitor of the Arduino IDE as the ESP is powered up.

    00:00:00.0058 Network Monitor (version 0.2.4)
    00:00:00.0059 Loaded current configuration from flash memory (version 65535)
    00:00:00.0060 Using default configuration (version 1)
    00:00:00.0063 WiFi.mode set to STA
    00:00:00.0066 Hostname set to routermon-1
    00:00:00.0070 Client IP, gateway and subnet assigned dynamically by monitored network (DHCP)
    00:00:04.0069 Connected to WiFi network sonoffDiy as routermon-1 at 192.168.1.128
    00:00:04.0070 Device turned on
    00:00:04.0070 Count of successive restarts for that reason: 1
    00:00:04.0074 Enable monitoring
    00:00:04.0077 Setup completed
    00:00:04.0079 Reconnecting to mqtt broker
    00:00:04.0325 Connected to mqtt broker as routermon-1
    00:00:04.0327 Subscribing to mqtt topic "routermon-1/command/#"
    00:00:46.0829 uart> help
    00:00:46.0831 commands: clientip config cycle help log monitor mqtt name net ota ping reach restart router syslog time topic update url version
    00:00:47.0416 uart> help clientip
    00:00:47.0418 clientip [auto | <ip>, <gateway>, <mask>]

Everything up to the last four lines is output by the firmware at start-up. The first two of the last four lines are in response to a help command sent via the serial connection. First the firmware echoes the command, preceding it with the source, which could be uart, as in this case, or mqtt if the command had been sent as an MQTT message. The help command is then executed, which in this case amounts to listing all the commands known to the monitor. Details about each command can be obtained by entering the command after help. The last two lines of the output correspond to such a command.

I used a simplified Backus-Naur form akin to the Wirth syntax notation to describe the options of each command. The principal parts of the notation are:

There can be more than two elements in the ( ) or [ ] list if necessary.

clientip [-clear | <ip>, <gateway>, <mask>]
Reports (no options) or sets the client IP address, gateway and subnet mask.
These must be valid IPv4 addresses such as 192.168.0.99. -clear will let the monitored network assign the IP, gateway and subnet mask (DHCP).
config (save|load|default|erase)
Manages the configuration.
  • save saves the current configuration to persistent memory.
  • load replaces the current configuration with the saved configuration in persistent memory.
  • default replaces the current configuration with default values (does not change any saved configuration in persistent memory).
  • erase removes any saved configuration in persistent memory. Default values will be used on the next restart.
Some settings modified by load or default will not take effect. It may be necessary to save the configuration to persistent memory and then restart the ESP8266. (This is not the best approach and needs at a minimum better documentation.)
cycle [<ms>]
Turns router off and then back on after ms milliseconds. If ms is not specified or set to 0, the wait will be the same as when Wi-Fi is down (10 seconds by default).
help [<command>]
Displays succinct help messages. help with no command displays a list of commands. help with a command displays the command options.
log (uart|mqtt|syslog) [<level>]
Reports (no level option) or sets the level of the specified log output. The level can be specified numerically or by name:
  • 0 - emerg
  • 1 - alert
  • 2 - crit
  • 3 - err
  • 4 - warning
  • 5 - notice
  • 6 - info
  • 7 - debug
Logging levels are listed from the most concise to the most verbose. By default the serial monitor shows all log messages with level 6 or less, while the log messages with level 3 or less are sent to the MQTT broker.
monitor [on|off]
Reports (no options) or sets Wi-Fi/Internet monitoring.
mqtt [<ip>|<port>]
Reports (no options) or sets the MQTT broker IP address and port number.
name
Not yet implemented.
net [<ssid> [<password>]]
Reports (no options) or sets the monitored Wi-Fi network credentials.
If the password is not specified, the Wi-Fi network must be open to all. If the password is specified, it must contain at least 8 characters and no spaces. The password is never reported.
ota [-clear|<ssid> [<password>]]
Reports (no option) or sets the credentials of the Wi-Fi network used for over-the-air update of this device firmware.
If the password is not specified, the OTA Wi-Fi network must be open to all. If the password is specified, it must contain at least 8 characters and no spaces. The password is never reported.
ping <host>
Pings a specified host.
reach [(1|2|3) (<host>)]
Reports (no option) or sets the hosts that are pinged to verify if the Internet can be reached.
restart
Restarts this device.
router [(on|off|toggle)]
Reports (no option) or sets the router power outlet on or off.
Toggling the power outlet on or off can also be done with a single button press. If the state is changed, monitoring is turned off. It can be turned back on with two button presses or the monitor command. Be careful when giving this command through an MQTT broker as the Wi-Fi connection will be lost. The cycle command might be more appropriate in that case.
syslog [<ip>|<port>]
Reports (no options) or sets the Syslog server IP address and port number.
time (wifi|internet|ping|longwait|shortwait|cycling|connect|mqtt [<ms>])
Reports or sets time intervals in milliseconds.
topic (in|out) [<topic>]
Reports or sets MQTT topics. in is the topic to which the device is subscribed, out is the topic used to publish to the broker. Send commands to the in topic and subscribe to the out topic to see the result.
update [<url>]
Flashes the device firmware. If the url is not given, uses the ota url.
url (ota|auto) [<url>]
Reports (no url given) or sets the firmware url. The ota url is the default url for the update command; the auto url is used when the device boots if it is trapped in a boot cycle.
version
Reports the current firmware version.
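As an illustration of how an option such as the level of the log command can be accepted either numerically or by name, here is a minimal sketch. The function parseLogLevel is hypothetical and is not taken from commands.ino; it returns -1 for anything it does not recognize.

```cpp
#include <cstring>

// Hypothetical parser for the <level> option of the log command.
// Accepts a name ("emerg" to "debug") or a single digit ("0" to "7").
// Returns the numeric level, or -1 if arg is not a valid level.
int parseLogLevel(const char *arg) {
  static const char *names[] = {"emerg", "alert", "crit", "err",
                                "warning", "notice", "info", "debug"};
  for (int i = 0; i < 8; i++) {
    if (std::strcmp(arg, names[i]) == 0) return i;
  }
  if (arg[0] >= '0' && arg[0] <= '7' && arg[1] == '\0') return arg[0] - '0';
  return -1;
}
```

Returning a distinct error value makes it easy for the interpreter to report why a command such as log mqtt verbose was not executed.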

Installing Router Monitor

The firmware routermon_1 is an Arduino sketch for ESP8266 based devices with at least 1 MB of flash memory. ESP8285 devices should also work.

Before flashing the firmware on a device, some defines in config.h should be modified.

HOST_NAME
The host name is used as the prefix for MQTT in and out topics. The same name is assigned to the ESP8266 Wi-Fi module. Valid host names are comprised of letters (upper and lower case) and digits. The hyphen "-" may also be used but must not be at the start or end of the name. It is best to limit the length of the name to 31 characters.
NET_SSID
The name of the 2.4 GHz Wi-Fi network to be monitored.
NET_PSK
The password of the monitored Wi-Fi network.
MQTT_HOST
The IP address or domain name of the MQTT broker. The default MQTT port can be changed if necessary. For the time being a secure connection to the MQTT broker is not implemented.
OTA_URL
The URL of the binary file to be downloaded and flashed on the ESP8266 when an update command is given. It is possible to bypass this URL by specifying an optional URL in the update command.
AUTO_URL
The URL of the binary file to be downloaded and flashed on the ESP8266 when a restart loop is detected by the ESP8266 loop watchdog.
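The rules given above for HOST_NAME are easy to check in code. The sketch below is a hypothetical helper, not something found in the firmware, and it treats the recommended limit of 31 characters as a hard limit.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical validator for the HOST_NAME rules: ASCII letters, digits
// and hyphens only, no hyphen at the start or end, 1 to 31 characters.
bool isValidHostName(const char *name) {
  std::size_t len = std::strlen(name);
  if (len == 0 || len > 31) return false;
  if (name[0] == '-' || name[len - 1] == '-') return false;
  for (std::size_t i = 0; i < len; i++) {
    char c = name[i];
    bool letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    bool digit = (c >= '0' && c <= '9');
    if (!letter && !digit && c != '-') return false;
  }
  return true;
}
```

The default name routermon-1 passes the test, while router_mon and -routermon are rejected.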

You may also want to change other default values. For example, instead of using google.com and other well-known web sites as targets to check if the Internet is reachable, I prefer using major DNS servers with fixed IP addresses. Pinging these is not noticeably faster, but it does bypass the domain name system, which could be at fault even if technically the Internet is reachable. A web search will quickly yield good targets.

It is also possible to set a static IP address to be used instead of relying on a dynamically assigned IP address by the DHCP server on the monitored network. This is not that significant in this current version of the router monitor, but there should be a Web server in a future version. In that case, it would be nice to have a static address to reach the router monitor web page. And even in this version it could be helpful to be able to ping the device at a known address.

The Arduino sketch can be downloaded by clicking on the following link: routermon_1 (v 0.2.4).

Of course ESP8266 libraries, such as the Wi-Fi, UDP and HTTP Update libraries are used. These will have been installed in the Arduino IDE when the ESP8266 Core was added with the Boards manager. I am using three libraries of my own and these will have to be downloaded from here and installed in the Arduino IDE. There is a tutorial on how to install additional Arduino libraries.

  1. mdEspRestart
  2. mdBlinky
  3. mdButton

Finally, third party libraries are used.

  1. PubSubClient by Nick O'Leary
  2. ESP8266Ping by Daniele Colanardi

PubSubClient can be installed with the Arduino IDE Library Manager. ESP8266Ping will have to be downloaded and installed manually in the same way as my own libraries have to be added to the Arduino IDE.
