2021-11-25
Cleaning a Web Site
Wherein There Is a Confirmation of the Overconfidence Effect

When I first started this site, I made an effort to check each web page to ensure that its links were valid and that the HTML syntax was correct. Laziness soon settled in and I never seemed to get around to verifying new web pages in the rush to post them and get on with new interests. Having just renewed the contract with the web hosting provider for this site, it seems like a good time to go back to better work habits. Unfortunately, some of the tools that I used long ago are no longer maintained or even available. Luckily, I did find useful tools that I hope to use in the future and retroactively on files already on the site.

This choice of tools is admittedly idiosyncratic, reflecting constraints imposed by some design decisions and particular requirements. Among the latter was the wish to use local tools instead of web-based applications. I also did not wish to install any Python scripts that required the use of Python version 2.7.x which has been deprecated for a number of years. I am sure that there are other tools and it could be worthwhile to seek them out.

Table of Contents

  1. Counting and Listing Files
  2. Nu Html Checker
    1. Web-Based Checking
    2. Local Installation of Nu Html Checker
    3. Local Checking
    4. Checking from the Command Line
  3. Other Syntax Checkers
    1. HTML Tidy
    2. Dr. Watson
    3. Checking CSS Files
    4. W3C Markup Validator
  4. Hyperlink Checkers
    1. LinkChecker
    2. Linklint
  5. Strategy

Counting and Listing Files toc

Over the years, numerous files have been added to this site. I was curious to get a handle on their number. A short bash script can do that.

michel@hp:~$ sitestats
Location of site: /var/www/html/michel/
Number of files: 1805
Number of HTML files: 329
Number of English HTML files: 215
Number of French HTML files: 110
Number of JPEG image files: 1013
Number of PNG image files: 223
Number of downloads: 216
Number of ZIP archives: 63
Number of bash scripts: 9
Number of Pascal files: 3
Number of Arduino sketches: 14
Number of C files: 2
Number of PDF files: 4

Here is the script, which could easily be modified to meet other needs as will be shown later.

#!/bin/bash
# Site statistics

# local copy of site
local=/var/www/html/michel/

echo "Location of site: $local"
echo -n "Number of files: "; find $local -type f | wc -l
echo -n "Number of HTML files: "; find $local -name '*html' | wc -l
echo -n "Number of English HTML files: "; find $local -name '*en\.html' | wc -l
echo -n "Number of French HTML files: "; find $local -name '*fr\.html' | wc -l
echo -n "Number of JPEG image files: "; find $local -name '*jpg' | wc -l
echo -n "Number of PNG image files: "; find $local -name '*png' | wc -l
echo -n "Number of downloads: "; find $local -wholename '*dnld/*' | wc -l
echo -n "Number of ZIP archives: "; find $local -name '*zip' | wc -l
echo -n "Number of bash scripts: "; find $local -name '*sh' | wc -l
echo -n "Number of Pascal files: "; find $local -name '*pas' | wc -l
echo -n "Number of Arduino sketches: "; find $local -name '*ino' | wc -l
echo -n "Number of C files: "; find $local -name '*c' | wc -l
echo -n "Number of PDF files: "; find $local -name '*pdf' | wc -l

Among the changes that could be made, listing all the HTML files in the site along with their relative path could be useful. While the find utility could be used, its alphabetical sorting of files is not the best. So I installed the tree package and used that utility because it sorts directories and files separately, which is what I want.

michel@hp:~$ sudo apt install tree
...
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 43.0 kB of archives.
...

The utility's help message provides a list of command line options, and by using a few of these, the output is almost what I want.

michel@hp:~$ tree --help
...
  -f            Print the full path prefix for each file.
  -P pattern    List only those files that match the pattern given.
  --matchdirs   Include directory names in -P pattern matching.
  --noreport    Turn off file/directory count at end of tree listing.
  -i            Don't print indentation lines.

michel@hp:~$ tree -f -i --noreport --dirsfirst -P '*.html' /var/www/html/michel/
/var/www/html/michel
/var/www/html/michel/3d
/var/www/html/michel/3d/dnld
/var/www/html/michel/3d/img
/var/www/html/michel/3d/first_3d_prints_en.html
/var/www/html/michel/3d/intro_openscad_01_en.html
/var/www/html/michel/css
/var/www/html/michel/ha
/var/www/html/michel/ha/ahsdk
/var/www/html/michel/ha/ahsdk/ahsdk-downloads_en.html
...
/var/www/html/michel/program/python/tide_cdn_fr.html
/var/www/html/michel/about_en.html
/var/www/html/michel/about_fr.html
/var/www/html/michel/archives_en.html
/var/www/html/michel/archives_fr.html
/var/www/html/michel/downloads_en.html
/var/www/html/michel/downloads_fr.html
/var/www/html/michel/index_en.html
/var/www/html/michel/index_fr.html
/var/www/html/michel/major_incident_en.html
/var/www/html/michel/major_incident_fr.html

Unfortunately, unneeded bare directory names are in the list. There is also unnecessary repetition of the root directory, /var/www/html/michel/, at the start of each file name. On the other hand, it can be useful to show the list of files as URLs. This is what is done by default in this final version of the script.

#!/bin/bash
# Web Site Statistics

#############################################
# Defaults - adjust these
#--------------------------------------------
# directory containing local copy of the site
local=/var/www/html/michel/
#--------------------------------------------
# output path prefix
prefix=http://localhost/michel/
#--------------------------------------------
# do not list html files
list="stat"
#############################################

output=""

usage() {
  echo "$(basename $0) [-h] [-a | -l] [-d srcdir] [-p prefix]"
  echo "  -h         this help message"
  echo "  -a         list all (site statistics and html files)"
  echo "  -l         list html files only"
  echo "  -d srcdir  web site directory (default: $local)"
  echo "  -p prefix  concatenate file path prefix (default: $prefix)"
  echo "             -p '' will remove default path prefix"
}

while getopts ':halp:d:' OPTION; do
  case "$OPTION" in
    h)
      usage
      exit 0
      ;;
    a)
      list="all"
      ;;
    l)
      list="files"
      ;;
    p)
      prefix="$OPTARG"
      ;;
    d)
      local="$OPTARG"
      ;;
    *)
      usage
      exit 1
      ;;
  esac
done

if [ ! -d "$local" ]; then
  echo "$local does not exist"
  exit 2
fi

# length of local
len=${#local}

if [ $list != "files" ]; then
  echo "Location of site: $local"
  echo -n "Number of files: "; find $local -type f | wc -l
  echo -n "Number of HTML files: "; find $local -type f -name '*html' | wc -l
  echo -n "Number of English HTML files: "; find $local -type f -name '*en\.html' | wc -l
  echo -n "Number of French HTML files: "; find $local -type f -name '*fr\.html' | wc -l
  echo -n "Number of JPEG image files: "; find $local -name '*jpg' | wc -l
  echo -n "Number of PNG image files: "; find $local -name '*png' | wc -l
  echo -n "Number of downloads: "; find $local -wholename '*dnld/*' | wc -l
  echo -n "Number of ZIP archives: "; find $local -name '*zip' | wc -l
  echo -n "Number of bash scripts: "; find $local -name '*sh' | wc -l
  echo -n "Number of Pascal files: "; find $local -name '*pas' | wc -l
  echo -n "Number of Arduino sketches: "; find $local -name '*ino' | wc -l
  echo -n "Number of C files: "; find $local -name '*c' | wc -l
  echo -n "Number of PDF files: "; find $local -name '*pdf' | wc -l
fi

if [ $list != "stat" ]; then
  # list files if list == all or list == files
  # the third sed removes downloadable HTML files (dnld/) and local-only files (local/)
  tree -f -i --noreport --dirsfirst -I 'index.html' -P '*.html' $local | sed -r "s/.{$len}//" | sed '/html$/!d' | sed "/\/dnld\/\|^local/d" | sed -e "s#^#$prefix#"
fi

As can be seen, sed, the stream editor for filtering and transforming text, is used multiple times in the final pipeline. Here is a quick explanation of each step.

There are two index files in the root directory of my site, index_fr.html and index_en.html, while the default index file, index.html, is a symbolic link to one of these files. Some care is needed to ensure that the symbolic link is neither counted nor listed as an HTML file. This is the reason for the -type f option, which ensures that the find command lists only regular files. Similarly, the exclusion flag -I 'index.html' is used in the tree command.
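For instance, the effect of -type f can be seen directly on the root directory. A quick illustration, where the output lines are representative, assuming the symbolic link described above:

michel@hp:~$ find /var/www/html/michel/ -maxdepth 1 -name 'index*.html'
/var/www/html/michel/index.html
/var/www/html/michel/index_en.html
/var/www/html/michel/index_fr.html
michel@hp:~$ find /var/www/html/michel/ -maxdepth 1 -type f -name 'index*.html'
/var/www/html/michel/index_en.html
/var/www/html/michel/index_fr.html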

There are other complications on my site. Some HTML files are not posts but are meant to be downloaded. The web site on my desktop computer also contains a directory, called local, with files not copied to the publicly available site. These anomalies explain why the sum of French and English HTML files is less than the total number of HTML files. Since I do not want these files in the list of HTML files generated by the script, I have included yet another pass through sed.

The script can be downloaded: sitestats. Save the file in a directory in the search path, such as ~/.local/bin, and make it executable. Don't forget to eliminate the | sed "/\/dnld\/\|^local/d" pipe if there is no need to deal with dnld/ and local/ directories.

michel@hp:~$ chmod +x .local/bin/sitestats

Typically, I use that script as follows.

michel@hp:~$ sitestats -l -p '' > htmlfiles.csv

This file contains a list of all the HTML files in the web site in alphabetical order, except for the final ten files, which are the files in the root directory.

3d/first_3d_prints_en.html
3d/intro_openscad_01_en.html
ha/ahsdk/ahsdk-downloads_en.html
...
index_en.html
index_fr.html
major_incident_en.html
major_incident_fr.html

I imported that file into a spreadsheet which will be used to track some information on a file by file basis. In one column, I enter the date the file was last checked with the Nu Html Checker (see next section); the date the links in the file were last checked goes in another column. The second example shows how to use sitestats to produce a list of HTML files that can be used to locally check the complete web site with Nu Html Checker and LinkChecker from the command line, as will be explained later.

michel@hp:~$ sitestats -l > list.txt

Here is a look at the result.

http://localhost/michel/3d/first_3d_prints_en.html
http://localhost/michel/3d/intro_openscad_01_en.html
http://localhost/michel/ha/ahsdk/ahsdk-downloads_en.html
...
http://localhost/michel/index_en.html
http://localhost/michel/index_fr.html
http://localhost/michel/major_incident_en.html
http://localhost/michel/major_incident_fr.html

It is possible to directly check all the local copies of the HTML files making up the web site with Nu HTML Checker without going through the web server. Here is how to generate the needed list of files.

michel@hp:~$ sitestats -l -p /var/www/html/michel/ > list2.txt

When run this way, the script starts by stripping the /var/www/html/michel/ root from the path of each file and then ends by tacking it back onto the start of each path. I was just too lazy to rework sitestats after I found out that the true path of files could be used by the Nu Html Checker when run from the command line.

If the number of HTML files reported above is incorrect because of downloadable files or local documents, then running the following script against any one of the three lists generated by sitestats will give the correct counts.

#!/bin/bash
# HTML Statistics

usage() {
  echo "$(basename $0) -h | file"
}

# check the argument count before examining $1
if [ "$#" -ne 1 ]; then
  usage
  exit 1
fi

if [ "$1" == "-h" ]; then
  usage
  exit 0
fi

if [ ! -f "$1" ]; then
  echo "$1 does not exist"
  exit 2
fi

echo -n "Number of HTML files: "; wc -l < $1
echo -n "Number of English HTML files: "; grep "_en.html" $1 | wc -l
echo -n "Number of French HTML files: "; grep "_fr.html" $1 | wc -l

Here is an example of the output.

michel@hp:~$ hstats list.txt
Number of HTML files: 320
Number of English HTML files: 210
Number of French HTML files: 110

Nu Html Checker toc

The World Wide Web Consortium (W3C) provides tools for developers, including the Nu Html Checker (a.k.a. v.Nu). The W3C is adamant that its checker does not certify that a web page meets any standard.

The Nu Html Checker should not be used as a means to attempt to unilaterally enforce pass/fail conformance of documents to any particular specifications; it is intended solely as a checker, not as a pass/fail certification mechanism.
...
Why validate [then]?
...
To catch unintended mistakes—mistakes you might have otherwise missed—so that you can fix them.

v.Nu can be used as a web-based tool as shown in the next subsection, but I prefer to install it on my desktop machine to run checks locally. How to do this is shown in the subsequent subsections.

Web-Based Checking toc

Click on the link https://validator.w3.org/nu/ to access the validator. The web-based checker can verify only one file at a time. I prefer to test the local copy of my site, which is on the same desktop machine on which the source code is edited.

Nu Html Checker - checking local file

As can be seen, the file index.html of the copy of the site on the desktop machine is being checked using the file upload method. That is because the validator will not use a local address such as localhost/michel/index.html (localhost could be replaced with 127.0.0.1 or the actual IP address of the desktop machine; it will not change anything).
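Incidentally, the checker also offers a web service interface, so the same upload can be done from the command line. A minimal sketch using curl, where the out=gnu query parameter selects a compact one-message-per-line report:

michel@hp:~$ curl -s -H "Content-Type: text/html; charset=utf-8" --data-binary @/var/www/html/michel/index_en.html "https://validator.w3.org/nu/?out=gnu"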

Nu Html Checker - checking remote file

The same file, available from the web site, can be checked using the address method as shown above. This is not as useful for me because I never correct the HTML file directly. When errors are found, I need to correct the GTML source code, generate the corrected HTML file with the preprocessor, and upload the corrected web page to my web hosting site before verifying the correction. It is much more straightforward to do all this on the desktop machine.

Local Installation of Nu Html Checker toc

The Nu Html Checker can also be installed locally, but it does require a Java runtime environment, version 8 or newer. As it happens, version 11 of OpenJDK is installed on my desktop machine.

michel@hp:~$ apt list --installed | grep openjdk

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

openjdk-11-jre-headless/focal-updates,focal-security,now 11.0.11+9-0ubuntu2~20.04 amd64 [installed]
openjdk-11-jre/focal-updates,focal-security,now 11.0.11+9-0ubuntu2~20.04 amd64 [installed]

So the prerequisite Java run time environment is installed. Otherwise, it can be installed easily with the usual package manager.

michel@hp:~$ sudo apt-get install openjdk-11-jre-headless

Get the latest version of the Nu Html Checker (v.Nu) from https://github.com/validator/validator/releases. Currently this is version 20.6.30. I copied the zip file into a subdirectory of my download directory (called ~/Téléchargements on French-language systems).

michel@hp:~$ cd Téléchargements
michel@hp:~/Téléchargements$ mkdir vnu
michel@hp:~/Téléchargements$ cd vnu
michel@hp:~/Téléchargements/vnu$ wget https://github.com/validator/validator/releases/download/20.6.30/vnu.jar_20.6.30.zip
...
2021-11-11 12:02:13 (36.5 MB/s) - ‘vnu.jar_20.6.30.zip’ saved [28942603/28942603]

I then extracted the content of the archive to my local binary directory ~/.local/bin.

michel@hp:~/Téléchargements/vnu$ unzip vnu.jar_20.6.30.zip -d ~/.local/bin
Archive:  vnu.jar_20.6.30.zip
   creating: /home/michel/.local/bin/dist/
 extracting: /home/michel/.local/bin/dist/index.html
 extracting: /home/michel/.local/bin/dist/LICENSE
 extracting: /home/michel/.local/bin/dist/CHANGELOG.md
 extracting: /home/michel/.local/bin/dist/README.md
 extracting: /home/michel/.local/bin/dist/vnu.jar
michel@hp:~/Téléchargements/vnu$ cd
michel@hp:~$ mv .local/bin/dist .local/bin/vnu

This local copy of the validator can be used immediately, as shown in the next section. However, if you want to start the web server from the menu, then it is best to create a .desktop file. In that case, I suggest getting a copy of the Nu Html Checker icon to be displayed in the system menu.

michel@hp:~$ wget https://validator.w3.org/nu/icon.png -O .local/bin/vnu/icon.png
...
2021-11-14 18:56:08 (59.7 MB/s) - ‘.local/bin/vnu/icon.png’ saved [621/621]

Then create a .desktop file to add the checker to the Mint menu.

michel@hp:~$ nano .local/share/applications/vNuChecker.desktop

Here is the content of the file.

#!/usr/bin/env xdg-open
[Desktop Entry]
Version=21.11.13
Encoding=UTF-8
Type=Application
Exec=/usr/bin/java -Dnu.validator.servlet.bind-address=127.0.0.1 -cp .local/bin/vnu/vnu.jar nu.validator.servlet.Main 8888
Icon=/home/michel/.local/bin/vnu/icon.png
Name=Nu Html Checker
Categories=WebApp;Development;
Terminal=true
Comment=Validate HTML source
Comment[fr_CA]=Valider source HTML

Notice the Terminal=true line. Usually, the terminal is hidden, but showing it was the easiest way I found to stop the checker once done with the application. Otherwise, the process remains in the background until it is explicitly killed or the computer is rebooted.
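Should the process end up running in the background anyway, it can be stopped by matching its command line; a quick sketch:

michel@hp:~$ pkill -f 'nu.validator.servlet.Main'    # stop the background checker JVM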

Local Checking toc

To start the Nu Html Checker web server on the desktop, open a terminal from the system menu or with the keyboard shortcut Ctrl+Alt+T and enter the following command at the system prompt.

michel@hp:~$ java -cp ~/.local/bin/vnu/vnu.jar nu.validator.servlet.Main 8888
nu.validator.servlet.VerifierServletTransaction - Starting static initializer.
...
Checker service started at http://0.0.0.0:8888/

Another possibility is to use the menu entry. Search for Nu Html Checker and click on it. A terminal window will open and the Java program will be launched.

nu.validator.servlet.VerifierServletTransaction - Starting static initializer.
...
Checker service started at http://127.0.0.1:8888/

The difference in the IP address of the service is because of the difference in the way the checker was started: the .desktop file explicitly binds the service to 127.0.0.1, while the bare java command lets it bind to all interfaces (0.0.0.0).
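To get the 127.0.0.1 behaviour from the command line as well, pass the same Java property that the .desktop file uses:

michel@hp:~$ java -Dnu.validator.servlet.bind-address=127.0.0.1 -cp ~/.local/bin/vnu/vnu.jar nu.validator.servlet.Main 8888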

Open the Nu Html Checker in a browser on the same computer using the URL http://localhost:8888. The same page as found at w3.org will be available locally.

Nu Html Checker - local service

As before, the Show source box is checked. When the Check button is pressed not only will errors and warnings be shown, but following that list, the HTML source will be displayed with highlights corresponding to the errors and warnings. This makes it much easier to locate the errors in the corresponding GTML source file.

If the checker was started from the system menu, the process continues to run even after the connection to the application's web server is terminated. It can be stopped by closing the terminal from which the checker was launched or by pressing the Ctrl+C key combination in that terminal.

...
python ./checker.py --bind-address 192.168.0.100 run
java -Dnu.validator.servlet.bind-address=192.168.0.100 -cp vnu.jar nu.validator.servlet.Main 8888
vnu-runtime-image/bin/java -Dnu.validator.servlet.bind-address=192.168.0.100 nu.validator.servlet.Main 8888
vnu-runtime-image\bin\java.exe -Dnu.validator.servlet.bind-address=192.168.0.100 nu.validator.servlet.Main 8888

Checker service started at http://127.0.0.1:8888/
nu.validator.xml.PrudentHttpEntityResolver - http://localhost/michel/program/misc/webclean_en.html
...
^C

Checking from the Command Line toc

It is possible to validate more than one file at a time when using the Nu Html Checker from the command line.

michel@hp:~$ java -jar ~/.local/bin/vnu/vnu.jar http://localhost/michel/about_fr.html http://localhost/michel/index_en.html http://localhost/michel/index_fr.html
"file:http://localhost/michel/about_fr.html":60.1-60.4: error: No “p” element in scope but a “p” end tag seen.
"file:http://localhost/michel/about_fr.html":89.1-89.4: error: No “p” element in scope but a “p” end tag seen.
"file:http://localhost/michel/about_fr.html":137.78-137.83: error: Named character reference was not terminated by a semicolon. (Or “&” should have been escaped as “&amp;”.)
"file:http://localhost/michel/about_fr.html":178.1-178.7: error: Stray end tag “code”.

The HTML files can be passed directly to the checker instead of going through the web server as shown above.

michel@hp:~$ java -jar ~/.local/bin/vnu/vnu.jar /var/www/html/michel/about_fr.html /var/www/html/michel/index_en.html /var/www/html/michel/index_fr.html
"file:/var/www/html/michel/about_fr.html":60.1-60.4: error: No “p” element in scope but a “p” end tag seen.
"file:/var/www/html/michel/about_fr.html":89.1-89.4: error: No “p” element in scope but a “p” end tag seen.
"file:/var/www/html/michel/about_fr.html":137.78-137.83: error: Named character reference was not terminated by a semicolon. (Or “&” should have been escaped as “&amp;”.)
"file:/var/www/html/michel/about_fr.html":178.1-178.7: error: Stray end tag “code”.

If there is no output, then there is no error according to the validator. That is what happened with the index_xx.html files. Obviously, there are errors in the about_fr.html file, which are clearly identified by their line and column coordinates. Checking all files in a directory is easily done. But be careful: this is recursive!

michel@hp:~$ java -jar ~/.local/bin/vnu/vnu.jar --skip-non-html /var/www/html/michel/program/misc/

Note how the actual directory containing the HTML files is specified just as if we were uploading each of the files in the directory as we did in the first example above. Trying to access the HTML files in that same directory through the local web server will not work in this case.

Warning:

When a directory does not contain a default HTML file (typically named index.html) the local web server should not be called upon to obtain the HTML files.

michel@hp:~$ java -jar ~/.local/bin/vnu/vnu.jar --skip-non-html --stdout /var/www/html/michel/program/misc/ | wc -l
648
michel@hp:~$ java -jar ~/.local/bin/vnu/vnu.jar --skip-non-html --stdout http://localhost/michel/program/misc/ | wc -l
2

How can there be only two errors or warnings in the second case? It is because the local web server transmitted a 403 error page, thus blocking access to the HTML files.
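This is easy to confirm by asking the web server for the directory itself; a quick check (the exact status line depends on the server configuration):

michel@hp:~$ curl -sI http://localhost/michel/program/misc/ | head -n 1
HTTP/1.1 403 Forbidden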

Note the addition of the --stdout option because, otherwise, the error and warning messages would have been sent to stderr and wc would not have seen them.
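Redirecting stderr into stdout achieves the same thing without the option; a sketch:

michel@hp:~$ java -jar ~/.local/bin/vnu/vnu.jar --skip-non-html /var/www/html/michel/program/misc/ 2>&1 | wc -l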

Instead of recursively checking all files in a directory and hoping that the checker will see every file, I prefer to supply the list of files to the tool. Unfortunately, I have not found a way to do this directly and had to write a bash script which loops through each filename in the list, passing it on to the checker.

#!/bin/bash
# Nu Html Checker Runner

usage() {
  echo "Usage:"
  echo "  $(basename $0) -h | file"
}

if [ "$#" -ne 1 ]; then
  echo "Error: missing parameter"
  usage
  exit 1
fi

if [ "$1" == "-h" ]; then
  usage
  exit 0
fi

Lines=$(cat $1)
for Line in $Lines
do
  # java -jar ~/.local/bin/vnu/vnu.jar --stdout --verbose "$Line"
  java -jar ~/.local/bin/vnu/vnu.jar --stdout "$Line"
done

Add the --verbose option when running the checker if you want to see the name of each file being checked. As shown, only errors will be displayed on the terminal. As before, I saved that script in the ~/.local/bin/ directory and made it executable with the chmod +x nvu command.
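One drawback of the loop is that a new Java virtual machine is started for every file. Since the checker accepts several documents on one command line, as shown earlier, an untested but plausible alternative is to let GNU xargs hand the whole list to as few invocations as possible:

michel@hp:~$ xargs -a list2.txt java -jar ~/.local/bin/vnu/vnu.jar --stdout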

To test the bash file, I created a file, top_level.txt, with the full path to all the HTML files in /michel/, the top level directory of my personal web site.

/var/www/html/michel/about_en.html
/var/www/html/michel/about_fr.html
/var/www/html/michel/archives_en.html
/var/www/html/michel/archives_fr.html
/var/www/html/michel/downloads_en.html
/var/www/html/michel/downloads_fr.html
/var/www/html/michel/index_en.html
/var/www/html/michel/index_fr.html
/var/www/html/michel/major_incident_en.html
/var/www/html/michel/major_incident_fr.html

The script ran the local copy of Nu Html Checker against every file in that list and reported some errors.

michel@hp:~$ nvu top_level.txt
"file:/var/www/html/michel/major_incident_en.html":54.1-54.6: error: Stray end tag “div”.
"file:/var/www/html/michel/major_incident_en.html":532.11-532.16: error: Saw “<” when expecting an attribute name. Probable cause: Missing “>” immediately before.
"file:/var/www/html/michel/major_incident_en.html":532.11-532.18: error: End tag had attributes.
"file:/var/www/html/michel/major_incident_en.html":532.11-532.18: error: Stray end tag “b,”.
"file:/var/www/html/michel/major_incident_en.html":542.1-542.6: error: Stray end tag “div”.
"file:/var/www/html/michel/major_incident_fr.html":54.1-54.6: error: Stray end tag “div”.
"file:/var/www/html/michel/major_incident_fr.html":533.1-533.6: error: Stray end tag “div”.

Running the script against all files in the web site gave a disheartening total number of errors.

michel@hp:~$ nvu list2.txt | wc -l
19772

To get a better handle on what is going on, I modified the nvu script.

#!/bin/bash
# Nu Html Checker Runner II

usage() {
  echo "Usage:"
  echo "  $(basename $0) -h | file"
}

if [ "$#" -ne 1 ]; then
  echo "Error: missing parameter"
  usage
  exit 1
fi

if [ "$1" == "-h" ]; then
  usage
  exit 0
fi

Lines=$(cat $1)
for Line in $Lines
do
  # count the number of lines reported by the checker for this file
  count=$(java -jar ~/.local/bin/vnu/vnu.jar --stdout "$Line" | wc -l)
  if [ "$count" -ne 0 ]; then
    echo $count $Line
  fi
done

michel@hp:~$ nvu2 list2.txt > results.txt
michel@hp:~$ cat list2.txt | wc -l
321
michel@hp:~$ cat results.txt | wc -l
295
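Since each line of results.txt begins with an error count, a couple of standard utilities give a quick overview of where the problems are concentrated; a sketch:

michel@hp:~$ sort -rn results.txt | head -n 10    # the ten files with the most reported problems
michel@hp:~$ awk '{total += $1} END {print total}' results.txt    # total number of reported problems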

Only 8% of the HTML files on the site passed the syntax check. Clearly, I overestimated my diligence in checking the files. This could be more than the cognitive bias known as the overconfidence effect and borders on illusory superiority. Perhaps it confirms the findings of David Dunning and Justin Kruger. Wanting to assuage the pain to my bruised ego, I imported results.txt into a spreadsheet and discovered that a mere 10 files contained half the errors and 23 files account for two thirds of the errors. This highly skewed distribution of errors may be in large part attributable to the knock-on effect of some errors. Forget the trailing quotation mark on an inline style attribute or a hyperlink reference and chances are that the checker will report another two or three errors that will not need to be fixed. Misspell an internal style name in the <head> section and the error count will be increased by the number of times the style is used in the page. Besides, many so-called errors could just as easily be seen as warnings. They include things such as putting a width attribute in the opening tag of a table cell (as in <td width="18">) instead of using a style sheet. Do these observations manage to rehabilitate my sense of self-worth? Hardly; what of invalid hyperlinks and spelling and grammatical errors? The mind shudders, but these things can also be checked.

Other Syntax Checkers toc

As stated in the introduction, there are numerous HTML syntax checkers. Here are a few that I have looked at.

HTML Tidy toc

Tidy, by the HTML Tidy Advocacy Community Group (HTACG, pronounced H-Task), is a "smart" HTML pretty printer or formatter. By smart I mean that the application will correct common mistakes such as mismatched or missing end tags and so on. See What Tidy does in the documentation for more details. The same document says "It's probable that you already have an outdated version of HTML Tidy. It comes pre-installed on Mac OS X and many distributions of GNU/Linux and other UNIX-type operating systems." However, this is not the case in Linux Mint MATE 20.1.

michel@hp:~$ apt-cache policy tidy
tidy:
  Installed: (none)
  Candidate: 2:5.6.0-11
  Version table:
     2:5.6.0-11 500
        500 http://ubuntu.mirror.iweb.ca focal/universe amd64 Packages

While the repository does contain a tidy package, it is out of date. Accordingly, I downloaded the current .deb package to my download directory (called Téléchargements on French-language systems) and installed it with the dpkg utility.

michel@hp:~$ mkdir Téléchargements/tidy
michel@hp:~$ cd Téléchargements/tidy/
michel@hp:~/Téléchargements/tidy$ wget https://github.com/htacg/tidy-html5/releases/download/5.8.0/tidy-5.8.0-Linux-64bit.deb
--2021-11-12 16:14:40-- https://github.com/htacg/tidy-html5/releases/download/5.8.0/tidy-5.8.0-Linux-64bit.deb
...
2021-11-12 16:14:41 (6.43 MB/s) - ‘tidy-5.8.0-Linux-64bit.deb’ saved [986044/986044]
michel@hp:~/Téléchargements/tidy$ sudo dpkg -i tidy-5.8.0-Linux-64bit.deb
...
michel@hp:~/Téléchargements/tidy$ which tidy
/usr/bin/tidy

As far as I can tell, a man page is not installed, but there is extensive help from the command line; see it with the tidy --help command. Let's test-drive tidy on a file in which Nu Html Checker found no error.

michel@hp:~$ java -jar ~/.local/bin/vnu/vnu.jar /var/www/html/michel/index_en.html
michel@hp:~$ tidy -lang en -e /var/www/html/michel/index_en.html
Info: Document content looks like HTML5
No warnings or errors were found.
...

Note that the -e option ensures the program lists errors only; there is no "pretty printed" output of the source file. Perhaps not surprisingly, tidy also reports that it found no errors in the file. If you want the quiet output typical of Linux utilities, then add the -q option.

michel@hp:~$ java -jar ~/.local/bin/vnu/vnu.jar /var/www/html/michel/index_en.html
michel@hp:~$ tidy -e -q /var/www/html/michel/index_en.html
michel@hp:~$

Now let's compare the two when looking at a file which does have some errors.

michel@hp:~$ java -jar .local/bin/vnu/vnu.jar /var/www/html/michel/program/fpl/translating_fpl_fr.html
"file:/var/www/html/michel/program/fpl/translating_fpl_fr.html":149.1431-149.1436: error: Named character reference was not terminated by a semicolon. (Or “&” should have been escaped as “&amp;”.)
"file:/var/www/html/michel/program/fpl/translating_fpl_fr.html":270.1-270.42: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/var/www/html/michel/program/fpl/translating_fpl_fr.html":317.1-317.45: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/var/www/html/michel/program/fpl/translating_fpl_fr.html":321.1-321.47: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/var/www/html/michel/program/fpl/translating_fpl_fr.html":439.1-439.45: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/var/www/html/michel/program/fpl/translating_fpl_fr.html":475.1-475.45: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/var/www/html/michel/program/fpl/translating_fpl_fr.html":481.1-481.83: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/var/www/html/michel/program/fpl/translating_fpl_fr.html":490.1-490.83: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/var/www/html/michel/program/fpl/translating_fpl_fr.html":507.4-507.48: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/var/www/html/michel/program/fpl/translating_fpl_fr.html":535.4-535.86: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/var/www/html/michel/program/fpl/translating_fpl_fr.html":552.1-552.83: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/var/www/html/michel/program/fpl/translating_fpl_fr.html":584.1-584.83: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/var/www/html/michel/program/fpl/translating_fpl_fr.html":612.1-612.83: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/var/www/html/michel/program/fpl/translating_fpl_fr.html":690.1-690.83: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.

michel@hp:~$ tidy -lang en -e -q /var/www/html/michel/program/fpl/translating_fpl_fr.html
line 149 column 1431 - Warning: entity "&nbsp" doesn't end in ';'
line 270 column 1 - Warning: <img> lacks "alt" attribute
line 317 column 1 - Warning: <img> lacks "alt" attribute
line 321 column 1 - Warning: <img> lacks "alt" attribute
line 439 column 1 - Warning: <img> lacks "alt" attribute
line 475 column 1 - Warning: <img> lacks "alt" attribute
line 481 column 1 - Warning: <img> lacks "alt" attribute
line 490 column 1 - Warning: <img> lacks "alt" attribute
line 507 column 4 - Warning: <img> lacks "alt" attribute
line 535 column 4 - Warning: <img> lacks "alt" attribute
line 552 column 1 - Warning: <img> lacks "alt" attribute
line 584 column 1 - Warning: <img> lacks "alt" attribute
line 612 column 1 - Warning: <img> lacks "alt" attribute
line 690 column 1 - Warning: <img> lacks "alt" attribute

That's comforting: the exact same problems are reported, although what v.Nu flags as errors Tidy reports as warnings. The two programs are not identical by any means. Nu Html Checker can also check CSS stylesheets, while Tidy can check accessibility.

michel@hp:~$ tidy -lang en -e -q -access 0 /var/www/html/michel/index_fr.html
michel@hp:~$ tidy -lang en -e -q -access 1 /var/www/html/michel/index_fr.html
line 45 column 7 - Access: [6.1.1.3]: style sheets require testing (style attribute).
line 61 column 309 - Access: [6.1.1.3]: style sheets require testing (style attribute).
line 69 column 584 - Access: [6.1.1.3]: style sheets require testing (style attribute).
line 311 column 1 - Access: [6.1.1.3]: style sheets require testing (style attribute).
line 312 column 56 - Access: [6.1.1.3]: style sheets require testing (style attribute).
line 329 column 107 - Access: [1.1.1.1]: <img> missing 'alt' text.
line 329 column 107 - Access: [1.1.2.1]: <img> missing 'longdesc' and d-link.

One can look up these Access: [a.b.c.d] codes in the HTML Tidy Accessibility Checker documentation. As can be seen, my site is not up to the better standards in this respect.

Perhaps the most interesting thing about Tidy is its ability to fix common errors. Here is an example of what it can do. First we display the errors in one of my HTML files, then run it through Tidy and test the corrected output again with Nu Html Checker.

michel@hp:/var/www/html/michel$ java -jar ~/.local/bin/vnu/vnu.jar major_incident_en.html
"file:/var/www/html/michel/major_incident_en.html":54.1-54.6: error: Stray end tag “div”.
"file:/var/www/html/michel/major_incident_en.html":532.11-532.16: error: Saw “<” when expecting an attribute name. Probable cause: Missing “>” immediately before.
"file:/var/www/html/michel/major_incident_en.html":532.11-532.18: error: End tag had attributes.
"file:/var/www/html/michel/major_incident_en.html":532.11-532.18: error: Stray end tag “b,”.
"file:/var/www/html/michel/major_incident_en.html":542.1-542.6: error: Stray end tag “div”.
michel@hp:/var/www/html/michel$ tidy major_incident_en.html > fixup.html
michel@hp:/var/www/html/michel$ java -jar ~/.local/bin/vnu/vnu.jar fixup.html
michel@hp:/var/www/html/michel$

That's a very good result. Of course, there's a but. In my case, it is not the HTML output that needs to be corrected; it is the GTML source file used to generate the HTML file that must be fixed. Here is what happens when a GTML source is "corrected" by Tidy.

Original Text:
#define TITLE Web Site Offline in Previous Three Days
#define ORGDATE 2021-09-1
#define ORGVERSION September 1, 2021
##define REVDATE 2019-11-07
##define REVVERSION November 7, 2019
#define MODAUTHOR Michel Deslierres
#define LOCSTYLE .lmargin {margin-left: 15px}
#define LANG en
#define LANGLINK major_incident_fr
#include "2_head.gtt"
#include "2_topmenu.gtt"
##define LEFT ha/rpi/new_stretch_en.html
##define LEFT_TITLE Updating Raspbian to Stretch
##define RIGHT ha/rpi/guide_buster_02_en.html
##define RIGHT_TITLE Home Automation Servers on Raspbian Buster Lite
##define RIGHT2 ha/rpi/guide_buster_03_en.html
##define RIGHT2_TITLE Various Hardware with Raspbian Buster Lite
#include "2_links_top.gtt"
#literal ON

<div class="content">

    C O N T E N T    H E R E

<div class="scrn">
michel@hp:~$ <span class="cmd">ls /dev/tty*</span>
...
/dev/tty18  /dev/tty33  /dev/tty49  /dev/tty7       /dev/ttyS20  /dev/ttyS8
/dev/tty19  /dev/tty34  /dev/tty5   /dev/tty8       /dev/ttyS21  /dev/ttyS9
/dev/tty2   /dev/tty35  /dev/tty50  /dev/tty9       /dev/ttyS22  <b>/dev/ttyUSB0</b>
/dev/tty20  /dev/tty36  /dev/tty51  /dev/ttyprintk  /dev/ttyS23
...

michel@hp:~$ <span class="cmd">dmesg | grep tty</span>
[    0.000000] console [tty0] enabled
[25490.513501] usb 3-14: ch341-uart converter now attached to ttyUSB0
</div>

    C O N T E N T    H E R E

</div>
#literal OFF
#include "2_links_bottom.gtt"
#include "2_foot.gtt"
Tidy Output:

<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.8.0">
<title></title>
</head>
<body>
#define TITLE Web Site Offline in Previous Three Days #define
ORGDATE 2021-09-1 #define ORGVERSION September 1, 2021 ##define
REVDATE 2019-11-07 ##define REVVERSION November 7, 2019 #define
MODAUTHOR Michel Deslierres #define LOCSTYLE .lmargin {margin-left:
15px} #define LANG en #define LANGLINK major_incident_fr #include
"2_head.gtt" #include "2_topmenu.gtt" ##define LEFT
ha/rpi/new_stretch_en.html ##define LEFT_TITLE Updating Raspbian to
Stretch ##define RIGHT ha/rpi/guide_buster_02_en.html ##define
RIGHT_TITLE Home Automation Servers on Raspbian Buster Lite
##define RIGHT2 ha/rpi/guide_buster_03_en.html ##define
RIGHT2_TITLE Various Hardware with Raspbian Buster Lite #include
"2_links_top.gtt" #literal ON <div class="content">

  C O N T E N T    H E R E

<div class="scrn"><span class="cmd">michel@hp:~$ <span class=
"cmd">ls /dev/tty*</span> ... /dev/tty18 /dev/tty33 /dev/tty49
/dev/tty7 /dev/ttyS20 /dev/ttyS8 /dev/tty19 /dev/tty34 /dev/tty5
/dev/tty8 /dev/ttyS21 /dev/ttyS9 /dev/tty2 /dev/tty35 /dev/tty50
/dev/tty9 /dev/ttyS22 <b>/dev/ttyUSB0</b> /dev/tty20 /dev/tty36
/dev/tty51 /dev/ttyprintk /dev/ttyS23 ... michel@hp:~$ <span class=
"cmd">dmesg | grep tty</span> [ 0.000000] console [tty0] enabled
[25490.513501] usb 3-14: ch341-uart converter now attached to
ttyUSB0</span></div>

  C O N T E N T    H E R E

</div>
#literal OFF #include "2_links_bottom.gtt" #include "2_foot.gtt"
</body>
</html>

The header and footer added by Tidy will cause a problem because the 2_head.gtt and 2_foot.gtt templates will be expanded into the proper HTML header and footer, so there will be duplicates. Then all the GTML macro definitions that begin with #define are mangled, because each definition must be on a single line beginning with #define. While it may be possible to fix this problem, I can't see how the other problem visible above would be fixed. The scrn style used with the <div> tag to show terminal commands and results is equivalent to the pre HTML tag, which means that the whitespace between the opening and closing tags must be preserved. Unfortunately, tidy output cannot preserve spaces, line breaks and tabs, which means that all the formatting in a <div class="scrn"> ... </div> block will be lost, as seen above (see Preserving original indenting not possible in the Tidy documentation).

It is unfortunate that I can't use Tidy because I think it would have automatically fixed many of the reported syntax problems.

Dr. Watson toc

Created more than 20 years ago, Dr. Watson is a "free service to analyze your web page on the Internet. You give it the URL of your page and Watson will get a copy of it directly from the web server. Watson can also check out many other aspects of your site, including link validity, download speed, search engine compatibility, and link popularity."

This is a web-based application that cannot be installed locally as far as I can make out. This makes it a bit impractical for checking the many older posts on my site but it could be used to verify new additions to the site. Unfortunately, there is an unspecified size constraint as I found out when I tried to check one of the more popular posts on my site and got the following error.

I'm sorry, but your page is 92124 bytes, which is more than I'm allowed to try and perform certain tasks on. Doing what I can ...

which, it turned out, was not much.

Checking CSS Files toc

The Nu Html Checker can verify HTML, CSS and SVG documents. However, for checking CSS style sheets, I prefer to use the W3C CSS Validation Service because it returns a corrected version of the submitted file. It is a web-based application, but it was not important for me to see whether this validator can be installed locally: I have only 3 CSS style sheets and they are rarely changed.

W3C Markup Validator toc

As far as I can make out, before hosting Nu Html Checker, W3C already had a verification tool called W3C Markup Validation Service. It is "a perl-based CGI script that uses DTD to verify the validity of HTML3, HTML4 and XHTML documents; it also incorporates by reference the NU Validator used to validate HTML5 / HTML LS documents" (source).

If I interpret this correctly, this web-based application validates older HTML3 and HTML4 documents against their DTD, but it uses the Nu Validator when the document is HTML5. If that is accurate, this validator would not be of much use in verifying my site.

Hyperlink Checkers toc

The invalid link is a vexing problem for both users and creators of web content. There are two types of errors related to hyperlinks on my site: those that are entirely my fault and those created by others. Most of the self-inflicted errors are stupid spelling mistakes, simple inversions of letters while typing in a URL, or hurried changes in the name of an id attribute while building the menu found in most of the substantial posts. Careful verification before posting a new web page should eliminate this problem, but "things happen" as "they say" (whoever "they" are, and I do know that they usually say something a bit more scatological). The other common type of error is the disappearing site. Back in 1998, Sir Tim Berners-Lee listed arguments put forth for changing URIs (Uniform Resource Identifiers) and argued their invalidity: Cool URIs don't change. The message has not reached everyone (and that includes me, unfortunately), so many links to outside resources end up pointing to something that no longer exists or that has been given a new address. Try this link https://www.google.com/not-found-file.html to see how Google reports a 404 not found error. My own version https://sigmdel.ca/michel/not-found-file.html is even more terse. Experience has shown that fixing remote site 404 errors can be time consuming because there is no indication whether the wanted resource has been entirely removed or if it remains available on the same host but with a different URL, or on a different host. The latter is an inevitable consequence when individual creators who do not have a personal domain move their site to a different web hosting provider.
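When chasing such errors by hand, it helps to see the raw HTTP status code of a suspect link; curl can fetch just that (adding -L would follow redirects to show where a moved resource ends up):

michel@hp:~$ curl -s -o /dev/null -w '%{http_code}\n' https://www.google.com/not-found-file.html
404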

Numerous link checkers are available, but some will not work for me because, when I created this site, I made a couple of decisions which were not optimal. One was to use relative instead of absolute URLs when linking to other documents on my site. While there are arguments against this practice (Why relative URLs should be forbidden for web developers) and the W3C Link Checker will only work with absolute URLs, this would not have had much impact had I not also decided to use the <base href="/michel/"> HTML element. This seems to confuse many link checkers, especially when it comes to internal id attributes used in links to specific positions within an HTML document.

As before, I am interested in tools that I can install on my desktop machine. In the end, I have installed only two hyperlink checkers and truth be told only one of them works well with my site.

LinkChecker toc

Luckily, LinkChecker, a Python 3 script, can handle links on my site. Version 9.4.0 is available as a package in Debian Buster while the latest version (10.0.1) is available in Debian Bullseye and Sid. Unfortunately, these packages are not available in the standard Mint 20.1 repository, so I decided to install the script in a virtual environment, but there are other methods of installing LinkChecker. The installation was a simple three-step procedure: create a virtual environment, enable it and then install the package within it with pip (actually pip3, since the virtual environment is created with Python 3).

michel@hp:~$ mkvenv linkchecker
creating virtual environment /home/michel/linkchecker
updating virtual environment /home/michel/linkchecker
michel@hp:~$ ve linkchecker
(linkchecker) michel@hp:~$ cd linkchecker
(linkchecker) michel@hp:~/linkchecker$ pip install git+https://github.com/linkchecker/linkchecker.git
Collecting git+https://github.com/linkchecker/linkchecker.git
...
Successfully built LinkChecker
Installing collected packages: urllib3, soupsieve, idna, charset-normalizer, certifi, requests, pyxdg, dnspython, beautifulsoup4, LinkChecker
Successfully installed LinkChecker-10.0.1 beautifulsoup4-4.10.0 certifi-2021.10.8 charset-normalizer-2.0.7 dnspython-2.1.0 idna-3.3 pyxdg-0.27 requests-2.26.0 soupsieve-2.3.1 urllib3-1.26.7

The next step was to set up the configuration file which is called linkcheckerrc and which should be in a directory named .linkchecker in the user's home directory.

(linkchecker) michel@hp:~$ mkdir .linkchecker
(linkchecker) michel@hp:~$ nano .linkchecker/linkcheckerrc

It is an INI file with section headers in square brackets [] and keys which are name and value pairs.

[checking]
# no recursion
recursionlevel=1

[filtering]
# Check all links outside the web site, no recursion no matter the previous setting
checkextern=1

[output]
# Send the HTML formatted output to stdout (probably best to redirect to a file)
log=html
# Uncomment for information about each link, otherwise only problems are shown
#verbose=1

# Enable checking links to local ID
[AnchorCheck]

The recursionlevel=1 key-value pair in the [checking] section ensures that only links in the file provided on the command line are verified. Without it, the default behaviour of LinkChecker is to follow every HTML link, check each linked page's own links, and to do that recursively. This unlimited recursion is desirable when verifying all links in a web site, but it could take a long time, and it would be very discouraging when all that is desired is to check a page about to be added to the site.
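For example, a hypothetical new page can be checked with the configuration defaults, while the -r command line option, used again further below, overrides the recursion level for an occasional full crawl:

(linkchecker) michel@hp:~$ linkchecker http://localhost/michel/newpost_en.html
(linkchecker) michel@hp:~$ linkchecker -r -1 http://localhost/michel/index_en.html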

The next setting, checkextern=1 in the filtering section, ensures that all links to resources outside the web site are verified. No recursive verification of links in external HTML files is ever done, no matter the setting of recursionlevel.

The colour-coded HTML output, as set with the key-value pair log=html, makes it very easy to spot errors especially when the verbose output is enabled. Setting verbose=1 is a good way to verify that the checker is investigating all the links in the HTML source file.

Finally, the AnchorCheck section header enables the AnchorCheck plug-in, which is important for my site. It ensures that internal document links to elements with id names are verified. There are other plug-ins, including one that checks each file with the Nu Html Checker. All the plug-ins can be listed.

(linkchecker) michel@hp:~$ linkchecker --list-plugins
INFO linkcheck.cmdline 2021-11-16 12:58:02,135 MainThread Checking intern URLs only; use --check-extern to check extern URLs.
AnchorCheck
    Checks validity of HTML anchors.
...

Most settings can be set with options on the command line, except for enabling plug-ins, which can only be done in the configuration file. Settings set with command line options take precedence over any settings in the configuration file, and I make good use of this fact in what follows. Here is part of the output when checking one file on the site. Note the -o text command line option used to display the output from LinkChecker more conveniently in the terminal; it overrides the log=html setting in the configuration file.

(linkchecker) michel@hp:~$ linkchecker -o text http://127.0.0.1/michel/program/misc/gfxfont_8bit_fr.html
LinkChecker 10.0.1
Copyright (C) 2000-2016 Bastian Kleineidam, 2010-2021 LinkChecker Authors
LinkChecker comes with ABSOLUTELY NO WARRANTY!
This is free software, and you are welcome to redistribute it under certain conditions.
Look at the file `LICENSE' within this distribution.
Get the newest version at https://linkchecker.github.io/linkchecker/
Write comments and bugs to https://github.com/linkchecker/linkchecker/issues

Start checking at 2021-11-12 18:52:01-003

URL        `http://127.0.0.1/michel/program/misc/gfxfont_8bit_fr.html'
Real URL   http://127.0.0.1/michel/program/misc/gfxfont_8bit_fr.html
Check time 0.044 seconds
D/L time   0.000 seconds
Size       60.89KB
Result     Valid: 200 OK

URL        `javascript:void(0)'
Name       `Citation originale'
Parent URL http://127.0.0.1/michel/program/misc/gfxfont_8bit_fr.html, line 451, col 3
Base       http://127.0.0.1/michel/
Real URL   javascript:void(0)
Info       Javascript URL ignored.
Result     Valid: ignored
...

URL        `downloads_fr.html'
Name       `téléchargements'
Parent URL http://127.0.0.1/michel/program/misc/gfxfont_8bit_fr.html, line 40, col 3
Base       http://127.0.0.1/michel/
Real URL   http://127.0.0.1/michel/downloads_fr.html
Check time 3.294 seconds
Result     Valid: 200 OK
...

URL        `https://github.com/adafruit/Adafruit-GFX-Library/issues/64'
Name       `international character sets #64'
Parent URL http://127.0.0.1/michel/program/misc/gfxfont_8bit_fr.html, line 509, col 38
Base       http://127.0.0.1/michel/
Real URL   https://github.com/adafruit/Adafruit-GFX-Library/issues/64
Check time 3.571 seconds
Result     Valid: 200 OK

Statistics:
Downloaded: 805.85KB.
Content types: 15 image, 33 text, 0 video, 0 audio, 2 application, 0 mail and 1 other.
URL lengths: min=18, max=114, avg=50.

That's it. 51 links in 53 URLs checked. 1 warning found. 0 errors found.
Stopped checking at 2021-11-12 19:15:13-003 (14 seconds)

In the next example, an old post is tested without overriding or adding any settings beyond those in the configuration file shown above. While the output to stdout is redirected to a file, the progress reports shown below are not redirected.

(linkchecker) michel@hp:~$ linkchecker http://localhost/michel/ha/x10/future_en.html > linkchecker_result.html
10 threads active, 10 links queued, 3 links in 23 URLs checked, runtime 1 seconds
5 threads active, 0 links queued, 18 links in 28 URLs checked, runtime 6 seconds
1 thread active, 0 links queued, 22 links in 33 URLs checked, runtime 11 seconds
1 thread active, 0 links queued, 22 links in 33 URLs checked, runtime 16 seconds
1 thread active, 0 links queued, 22 links in 33 URLs checked, runtime 21 seconds
1 thread active, 0 links queued, 22 links in 33 URLs checked, runtime 26 seconds
1 thread active, 0 links queued, 22 links in 33 URLs checked, runtime 31 seconds
1 thread active, 0 links queued, 22 links in 33 URLs checked, runtime 36 seconds
1 thread active, 0 links queued, 22 links in 33 URLs checked, runtime 41 seconds
1 thread active, 0 links queued, 22 links in 33 URLs checked, runtime 46 seconds
1 thread active, 0 links queued, 22 links in 33 URLs checked, runtime 51 seconds
1 thread active, 0 links queued, 22 links in 33 URLs checked, runtime 56 seconds
1 thread active, 0 links queued, 22 links in 33 URLs checked, runtime 1 minute, 1 seconds

The results can then be viewed. Since verbose was not enabled, only warnings (there were none) and errors (there were two) are shown in the result file. The program checked 23 links, many of which referred to other pages on the web site. Within 10 seconds or so, all but one link had been verified. One of those got a 404 error, meaning that the external file no longer exists. The last link timed out; the default timeout value is 60 seconds, which explains why the program ran for slightly more than a minute.

To test my whole site, I timed the following command.

michel@hp:~$ ve linkchecker
(linkchecker) michel@hp:~$ time linkchecker -r -1 http://localhost/michel/index_fr.html > linkcheck_all.html
10 threads active, 41 links queued, 3 links in 54 URLs checked, runtime 1 seconds
...
10 threads active, 4 links queued, 4308 links in 4735 URLs checked, runtime 18 minutes, 11 seconds
1 thread active, 0 links queued, 4331 links in 4749 URLs checked, runtime 18 minutes, 16 seconds

real	18m18,020s
user	2m29,309s
sys	0m2,137s
(linkchecker) michel@hp:~$

Note the -r -1 command line option that sets the recursion level to a negative value, implying that all links on the web site will be checked recursively. I trust that the program keeps a list of visited pages to avoid infinite loops! Obviously, that pitfall was avoided because, in all, over four thousand links were checked in slightly over 18 minutes, with 126 warnings, mostly about invalid anchor names, and 102 errors, such as 404 file not found errors.

What if recursion is turned off and LinkChecker is given the list of files to check? Explicitly, the checker will obtain its list of URLs to check from the file list.txt redirected to stdin.

(linkchecker) michel@hp:~$ time linkchecker --stdin <list.txt >linkcheck_list.html
10 threads active, 414 links queued, 3 links in 427 URLs checked, runtime 1 seconds
10 threads active, 568 links queued, 18 links in 596 URLs checked, runtime 6 seconds
...
1 thread active, 0 links queued, 4708 links in 5142 URLs checked, runtime 19 minutes, 51 seconds
1 thread active, 0 links queued, 4708 links in 5142 URLs checked, runtime 19 minutes, 56 seconds

real	20m1,610s
user	2m11,465s
sys	0m2,457s
(linkchecker) michel@hp:~$

michel@hp:~$ cat list.txt
http://localhost/michel/about_fr.html
http://localhost/michel/program/misc/gfxfont_8bit_fr.html
http://localhost/michel/index_en.html
...

In closing the discussion about this excellent tool, two things bear mentioning. There is a slight problem with Unicode: even when specifying the utf_8 encoding, code points above ASCII 126 will not be displayed correctly. This is a known issue (Encoding strings doesn't work #533), but to be fair it does not materially reduce the value of the program. Also be aware that the "original" LinkChecker by Bastian Kleineidam (wummel) is still available on GitHub. There have been no updates to that code since June 2016 and it only works with Python 2.7.x. It is a bit unfortunate that there is no mention of the newer version anywhere that I could find.

Linklint toc

In the past, I verified links with a Perl script called linklint. Version 2.3.5 is available in the Mint repository and could be installed with the GUI package front ends Synaptic and mintinstall. According to the Linklint - fast html link checker home page, version 2.3.5 dated August 13, 2001 is the latest version. However, this is not the case: going down the list of archives in the download directory, there are 2.3.6.c and 2.3.6.d versions dated 2002-12-12. I installed 2.3.6.d. It's a single Perl file which I copied to ~/.local/bin. Unfortunately, it gets confused by id attributes and reported hundreds of errors.

michel@hp:~$ linklint-2.3.6.d -http -host localhost /michel/@ -doc linktest -htmlonly -limit 3000
Checking links via http://localhost that match: /michel/@
1 seed: /michel/
Seed: /michel/
checking robots.txt for localhost
Checking /michel/
Checking /michel/archives_en.html
Checking /michel/program/esp8266/arduino/watchdogs2_en.html
...
Checking /michel/program/fpl/yweather/program/fpl/install_fpl_en.html
Checking /michel/ha/rpi/mailbag_01_en.html
Processing ...

writing files to linktest
wrote 16 txt files
wrote 14 html files
wrote index file index.html

found 1 default index
found 1 php file
found 90 cgi files
found 892 html files
found 731 image files
found 9 text files
found 40 style-sheet files
found 580 other files
found 271 http links
found 1093 https links
found 2 javascript links
found 2 file links
found 1 unknown link
found 2989 named anchors
----- 7 actions skipped
----- 9 files skipped
ERROR 1510 missing named anchors

Linklint found 2344 files and checked 2235 html files.
There were no missing files. No files had broken links.
1510 errors, no warnings.

michel@hp:~$

No broken links? That seems very unlikely. At the same time, there are 1510 missing named anchors, things like the id attribute of each section and subsection in the posts. A bit more on that later. The falsely optimistic result is in part caused by the fact that not all checks have been performed. Remote URL Checking explains that a second step is required.

michel@hp:~$ linklint-2.3.6.d @@ -doc linkcheck
checking 271 urls ...
41j.com/blog/2016/08/the-esp8266-and-sd-cards/ ok
activehomevista.site88.net/ unknown error (410)
...
writing files to linktest
wrote 9 txt files
wrote 7 html files
wrote index file urlindex.html

found 104 urls: ok
----- 116 urls: moved permanently (301)
----- 16 urls: moved temporarily (302)
ERROR 1 url: access forbidden (403)
ERROR 4 urls: bad request (400)
ERROR 5 urls: could not find ip address
ERROR 13 urls: not found (404)
ERROR 2 urls: timed out connecting to host
ERROR 1 url: timed out waiting for data
ERROR 14 urls: unknown error (307)
ERROR 3 urls: unknown error (410)
 warn 124 urls: not an http link

Linklink checked 271 urls:
104 were ok, 43 failed. 132 urls moved.
7 hosts failed: 8 urls could be retried.
34 files had failed urls.

That result is more in line with the sorry state of my web site. The lack of support for "named anchors" is disappointing because, in both 2.3.6.c and 2.3.6.d versions of the script, the following is found.

# Version 2.3.6 August 26, 2001
# -----------------------------
#   o added -ignore_anchor option
#   o now look for id= attribute
#   o autogenerate parse tree

Let's look at the details in the output file ~/linktest/errorAX.html.

host: localhost
date: Thu, 18 Nov 2021 16:24:17 (local)
Linklint version: 2.3.6

#------------------------------------------------------------
# ERROR 1510 missing named anchors (cross referenced)
#------------------------------------------------------------

/michel/3d/3d/intro_openscad_01_en.html#futureversion
    used in 1 file:
    /michel/3d/intro_openscad_01_en.html
...

That "missing named anchor": /michel/3d/3d/intro_openscad_01_en.html#futureversion looks very suspicious. Here are all the lines in HTML file containing the string futureversion.

<h1 id="toc">Table of Contents</h1>
<ol>
<li><a href="3d/intro_openscad_01_en.html#intro">Solid Geometry</a></li>
...
<li><a href="3d/intro_openscad_01_en.html#futureversion">Future Version</a></li>
</ol>
...
<!-- ++++ Section ++++++++++++++++++++++++++++++++++++ -->
<h1 id="futureversion" class="Section" style="clear: both">
<a href="3d/intro_openscad_01_en.html#futureversion">Future Versions</a>
<a href="3d/intro_openscad_01_en.html#toc"><img src="arrow-up.png" alt="toc" class="up-link-arrow"/></a>
</h1>

There is nothing that obviously accounts for the extraneous 3d/ in the link calculated by linklint. However, I edited the HTML file, removing the <base href="/michel/"> element and adding the "/michel/" prefix to all relative URLs in the file. On running linklint against this modified HTML file, the links to named anchors were all verified as correct. In my mind, that confirms that the <base href=> HTML tag is the source of the confusion.

The conclusion appears to be that linklint is a viable link checker as long as the HTML files do not combine the <base href=> HTML tag with id attributes used to identify named anchors.

Strategy toc

More than 19,000 syntax errors and more than 100 invalid hyperlinks make for an overwhelming task. Just where to begin? So far, I have removed all syntax errors reported by Nu Html Checker in the 10 files in the root of my personal web site at sigmdel.ca/michel. It makes sense that the files reached by clicking on any icon at the top of each page should be error free. But where to go from there? Should I start with the 10 pages which account for half the syntax errors, or should the 10 most visited pages be checked first?

Again, the W3C provides a tool to help establish priorities. Called the Log Validator, it is "a free, simple and step-by-step tool to improve dramatically the quality of your website. Find the most popular invalid documents, broken links, etc., and prioritize the work to get them fixed." The W3C goes beyond that and provides instructions: Making your website valid: a step by step guide.

I did install the tool in my account with my web host. Details can be found in W3C LogValidator in cPanel. The initial results from using that tool were disappointing to say the least.

To start removing errors on this site, I'll begin with the most visited pages and then look at the worst pages. If I were a betting man, I would not wager on ever getting to every page.
