By default fontconvert by Adafruit extracts the glyphs of the 95 printable ASCII characters (code points 32 (0x20) to 126 (0x7E)) from TTF font files and creates a GFXfont header file that can be used with the Adafruit-GFX library. Diacritical letters and other symbols needed in non-English European languages are missing from this set of characters. There are open Issues in the Adafruit-GFX-Library repository about supporting extended ASCII code pages with some proposed solutions, but it is not clear that this has yielded a solution which has found wide acceptance. This post does not pretend to fill this void, but it does show how to generate a GFXfont with all ISO 8859-1 (Latin-1) or ISO 8859-15 (Latin-9) printable glyphs from a TTF font file (provided the latter contains the glyphs of course) and how to easily access these characters in embedded systems.

The ISO 8859 Standard

ISO 8859 (formally ISO/IEC 8859) is a set of standards for 8-bit character encodings that extend the 7-bit ASCII character encoding. The standard is divided into 15 parts, designated 8859-1 to 8859-16 (8859-12 was abandoned), that cover many European languages. The parts have more or less confusing names. Part 1 is named Latin-1 Western European, part 2 is Latin-2 Central European, but by the time part 16 is reached it is called Latin-10 South-Eastern European. Furthermore, they have been revised and cover very similar standards from other authorities. Add vendor variations and additions to the mix and the result can be confusing. Remember the many code pages from the DOS era, such as CP437 (the original IBM PC hardware encoding), CP850 (Latin-1 again!) and CP863 (which was used in French Canada)? Code page 1252, also called Windows-1252, was used in early versions of Microsoft Windows and remains the "most-used single-byte character encoding in the world". It is sometimes labelled ISO-8859-1 although it is a superset of ISO 8859-1 (apparently one is supposed to spot the extra dash that differentiates the two sets of characters). This relative mess was more or less resolved with the advent of Unicode, which did not completely do away with the older standards. The first two blocks of the Unicode character database,

    0000..007F; Basic Latin
    0080..00FF; Latin-1 Supplement

(the first 256 code points) correspond to part 1 of ISO 8859. Here is a table showing all the glyphs in 8859-1 along with their code points (or positions, or 8-bit encodings; there are so many equivalent terms in this area!).

ISO/IEC 8859-1
	x0	x1	x2	x3	x4	x5	x6	x7	x8	x9	xA	xB	xC	xD	xE	xF
0x	control codes
1x	control codes
2x	SP	!	"	#	$	%	&	'	(	)	*	+	,	-	.	/
3x	0	1	2	3	4	5	6	7	8	9	:	;	<	=	>	?
4x	@	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O
5x	P	Q	R	S	T	U	V	W	X	Y	Z	[	\	]	^	_
6x	`	a	b	c	d	e	f	g	h	i	j	k	l	m	n	o
7x	p	q	r	s	t	u	v	w	x	y	z	{	|	}	~	DEL
8x	control codes
9x	control codes
Ax	NBSP	¡	¢	£	¤	¥	¦	§	¨	©	ª	«	¬	-	®	¯
Bx	°	±	²	³	´	µ	¶	·	¸	¹	º	»	¼	½	¾	¿
Cx	À	Á	Â	Ã	Ä	Å	Æ	Ç	È	É	Ê	Ë	Ì	Í	Î	Ï
Dx	Ð	Ñ	Ò	Ó	Ô	Õ	Ö	×	Ø	Ù	Ú	Û	Ü	Ý	Þ	ß
Ex	à	á	â	ã	ä	å	æ	ç	è	é	ê	ë	ì	í	î	ï
Fx	ð	ñ	ò	ó	ô	õ	ö	÷	ø	ù	ú	û	ü	ý	þ	ÿ

In this encoding:

Code point 0x20 (hexadecimal, 32 decimal) is assigned to the printable space character often denoted SP.
Code point 0xA0 (hexadecimal, 160 decimal) is assigned to the non-breaking space character denoted NBSP.
Code point 0xAD (hexadecimal, 173 decimal) is assigned to the soft-hyphen character denoted SHY. It represents a possible hyphen in a word and is shown only if it corresponds to a line break.
65 code points are control codes and not glyphs. CP-1252 contained the same printable characters and replaced the control codes in the 0x80 to 0x9F range with more glyphs "borrowed" from other ISO 8859 tables.

Source: ISO/IEC 8859-1.
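The classification above can be written as a small predicate. This is only a sketch; the names isPrintableLatin1 and countPrintableLatin1 are made up for illustration and do not come from any library.

```cpp
#include <cstdint>

// Hypothetical helper: true if the ISO 8859-1 code point has a glyph.
// Control codes are 0x00-0x1F, 0x7F (DEL) and 0x80-0x9F: 32 + 1 + 32 = 65.
bool isPrintableLatin1(uint8_t cp) {
  if (cp < 0x20) return false;                 // first block of control codes
  if (cp == 0x7F) return false;                // DEL
  if (cp >= 0x80 && cp <= 0x9F) return false;  // second block of control codes
  return true;                                 // includes SP, NBSP and SHY
}

// Count the code points that have a glyph among the 256 possible values.
int countPrintableLatin1() {
  int n = 0;
  for (int cp = 0; cp <= 0xFF; cp++) {
    if (isPrintableLatin1(static_cast<uint8_t>(cp))) n++;
  }
  return n;  // 256 - 65 = 191
}
```

The count is 191 rather than 192 because DEL has no glyph of its own in the standard.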

Adafruit-GFX-Library Proportional Fonts

All characters in a monospaced (fixed-pitch or fixed-width) font are the same width. The letters i and W will occupy the same width, which means that there is a fair amount of blank area around the i while the W will probably appear to be squished a bit. These fonts were widely used in early computer systems when displays were rectangular grids of equally sized cells in which characters would be placed. The fixed width of each character also simplifies the digital representation of each glyph (image) in the character set since each character bitmap will be the same size.

Assume that the nominal size of each glyph is like the "A" shown above: 6 pixels wide and 7 pixels high. Then typically, each character bitmap will be made up of 7 bytes, each byte representing a row of dots of the character and the glyphs for the complete set of n characters will be an array of n consecutive 7 byte arrays.

Finding the bitmap for the ith character in the set is very easy; it is simply at index i*7 from the start of the byte array where everything is counted starting at 0.
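As a sketch of that lookup, here is a hypothetical two-character font stored one byte per row. The array tinyFont, the helper names and the choice of putting the leftmost pixel in bit 7 are assumptions of this example, not something taken from a particular library.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t GLYPH_HEIGHT = 7;  // rows per glyph, one byte per row

// Hypothetical font with two glyphs, 'A' then 'B', 7 bytes each.
// Only the 6 leftmost bits of each byte are used (6 pixel wide cell).
const uint8_t tinyFont[] = {
  0x70, 0x88, 0x88, 0xF8, 0x88, 0x88, 0x88,  // 'A'
  0xF0, 0x88, 0x88, 0xF0, 0x88, 0x88, 0xF0,  // 'B'
};

// Start of the bitmap of the i-th character, counting from 0.
const uint8_t *glyphBitmap(const uint8_t *font, size_t i) {
  return font + i * GLYPH_HEIGHT;
}

// True if the pixel at (col, row) of the i-th character is set.
bool pixelIsSet(const uint8_t *font, size_t i, size_t col, size_t row) {
  return (glyphBitmap(font, i)[row] & (0x80 >> col)) != 0;
}
```

No table of offsets or widths is needed: the index alone locates the glyph.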

Like most display libraries, the Adafruit-GFX-Library includes a default monospaced font, but it also has a provision for adding proportional fonts. These fonts, in which characters can have different widths, are more pleasing to the eye, have always been the norm for printed material and have become ubiquitous in computer systems except for programming environments and other niche contexts. Since the character width is not constant, the glyph bitmaps are of variable size. These could be padded with 0 bits to represent empty columns so that all characters are stored with the same width, and then accessing the ith character bitmap would be as easy as with monospaced fonts. However, this wastes memory if some characters can be more than 8 columns wide, and memory is at a premium on microcontrollers. Furthermore, it would be necessary to store the character width somewhere else. The Adafruit library stores the bitmaps in a continuous sequence and it stores the offset of the start of each character bitmap in another array along with other character attributes, including the width.

Here is the structure stored for each character glyph.

    /// Font data stored PER GLYPH
    typedef struct {
      uint16_t bitmapOffset; /// Pointer into GFXfont->bitmap
      uint8_t  width;        /// Bitmap dimensions in pixels
      uint8_t  height;       /// Bitmap dimensions in pixels
      uint8_t  xAdvance;     /// Distance to advance cursor (x axis)
      int8_t   xOffset;      /// X dist from cursor pos to UL corner
      int8_t   yOffset;      /// Y dist from cursor pos to UL corner
    } GFXglyph;

A GFXfont is a structure with pointers to the bitmap array and to the character attribute array and three other fields:

    /// Data stored for FONT AS A WHOLE
    typedef struct {
      uint8_t  *bitmap;   /// Glyph bitmaps, concatenated
      GFXglyph *glyph;    /// Glyph array
      uint8_t   first;    /// ASCII extents (first char)
      uint8_t   last;     /// ASCII extents (last char)
      uint8_t   yAdvance; /// Newline distance (y axis)
    } GFXfont;

Note the first and last fields. By default, these are respectively 32 (0x20), which is the ASCII code for the space character, and 126 (0x7E), which is the ASCII code for the '~' character. Since the first 32 code points are control codes, nothing is saved into the font bitmap for these codes and there are no glyph entries for them. Consequently, the first GFXglyph entry points to the first printable character in the font bitmap. This also means that it is quite possible to have a font with only a subset of characters, as long as these have sequential code points. For example, in a digital clock project I use a GFXfont with big characters (in terms of width and height) which does not take up much memory because it contains only 11 characters: '0', '1', through to '9' and ':'. So first is 48 (0x30) and last is 58 (0x3A).
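The role of first and last can be sketched with a lookup helper for such a clock font. Everything below except the two structures is hypothetical: the glyph metrics are made up, the bitmap data is omitted, and findGlyph is a name chosen for this example. The sketch also ignores PROGMEM, which a real Arduino font on AVR would use.

```cpp
#include <cstdint>

// Structures as shown above (normally provided by the library header).
typedef struct {
  uint16_t bitmapOffset;
  uint8_t width, height;
  uint8_t xAdvance;
  int8_t xOffset, yOffset;
} GFXglyph;

typedef struct {
  uint8_t *bitmap;
  GFXglyph *glyph;
  uint8_t first, last;
  uint8_t yAdvance;
} GFXfont;

uint8_t clockBitmaps[] = { 0x00 };  // real bitmap data omitted in this sketch
GFXglyph clockGlyphs[] = {          // made-up metrics, 20 bytes per digit
  {   0, 10, 16, 12, 1, -16 },      // '0'
  {  20, 10, 16, 12, 1, -16 },      // '1'
  {  40, 10, 16, 12, 1, -16 },      // '2'
  {  60, 10, 16, 12, 1, -16 },      // '3'
  {  80, 10, 16, 12, 1, -16 },      // '4'
  { 100, 10, 16, 12, 1, -16 },      // '5'
  { 120, 10, 16, 12, 1, -16 },      // '6'
  { 140, 10, 16, 12, 1, -16 },      // '7'
  { 160, 10, 16, 12, 1, -16 },      // '8'
  { 180, 10, 16, 12, 1, -16 },      // '9'
  { 200,  3, 16,  6, 1, -16 },      // ':'
};
GFXfont clockFont = { clockBitmaps, clockGlyphs, 48, 58, 20 };

// Glyph entry for code point cp, or nullptr if the font does not contain it.
const GFXglyph *findGlyph(const GFXfont &font, uint8_t cp) {
  if (cp < font.first || cp > font.last) return nullptr;
  return &font.glyph[cp - font.first];
}
```

With this font, findGlyph(clockFont, '7') returns the eighth entry of the array, while any letter yields nullptr.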

GFXFont with ISO 8859 Character Sets

All ISO 8859 parts contain 32 control codes starting at code point 128 (0x80), which would be wasteful to include in the GFXglyph array. The obvious approach is to skip over these control codes, just as is done for the first 32 control codes in ASCII. The following table shows how I propose to store the glyphs of the printable characters from an ISO 8859 character set, using 8859-1 (Latin-1) as an example.

GFX Latin 1
	x0	x1	x2	x3	x4	x5	x6	x7	x8	x9	xA	xB	xC	xD	xE	xF
0x		!	"	#	$	%	&	'	(	)	*	+	,	-	.	/
1x	0	1	2	3	4	5	6	7	8	9	:	;	<	=	>	?
2x	@	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O
3x	P	Q	R	S	T	U	V	W	X	Y	Z	[	\	]	^	_
4x	`	a	b	c	d	e	f	g	h	i	j	k	l	m	n	o
5x	p	q	r	s	t	u	v	w	x	y	z	{	|	}	~	⍰
6x		¡	¢	£	¤	¥	¦	§	¨	©	ª	«	¬	-	®	¯
7x	°	±	²	³	´	µ	¶	·	¸	¹	º	»	¼	½	¾	¿
8x	À	Á	Â	Ã	Ä	Å	Æ	Ç	È	É	Ê	Ë	Ì	Í	Î	Ï
9x	Ð	Ñ	Ò	Ó	Ô	Õ	Ö	×	Ø	Ù	Ú	Û	Ü	Ý	Þ	ß
Ax	à	á	â	ã	ä	å	æ	ç	è	é	ê	ë	ì	í	î	ï
Bx	ð	ñ	ò	ó	ô	õ	ö	÷	ø	ù	ú	û	ü	ý	þ	ÿ

The index of a character shown in that table is actually the ordinal of the character's GFXglyph entry in the font glyph array. For example, the index of "m" is 0x4D (hexadecimal, 77 decimal). Therefore the GFXglyph entry for "m" will be glyph[77]. However, the code point for "m" is 109 (77 + 32). This way of packing the bitmaps and the array of glyph attributes means that some care will have to be exercised to access the correct glyph. Let cp represent the code point of a character in ISO 8859-1.

`cp` (Latin-1 code point)	Index into GFXglyph array
0 (0x00) ≤ cp < 32 (0x20)	unmapped `cp` (*)
32 (0x20) ≤ cp < 128 (0x80)	`cp` - 32
128 (0x80) ≤ cp < 160 (0xA0)	unmapped `cp` (*)
160 (0xA0) ≤ cp ≤ 255 (0xFF)	`cp` - 64

Perhaps � (U+FFFD REPLACEMENT CHARACTER) could be shown in those cases where a code point does not correspond to a glyph, such as those marked with an * in the table. This is what is done by Web browsers. The Adafruit library already does the translation by 32 (or whatever the code point of the first visible glyph is) when displaying a code point, but clearly things will be more complicated with the lower part of the table. More on that later; for now let's continue the discussion on the glyphs themselves.
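The mapping table can be expressed directly as a function. This is a minimal sketch; glyphIndexLatin1 and the UNMAPPED marker are names invented here for illustration.

```cpp
#include <cstdint>

const int16_t UNMAPPED = -1;  // hypothetical marker: no glyph for this code point

// Index into the GFXglyph array for an ISO 8859-1 code point,
// following the four rows of the table.
int16_t glyphIndexLatin1(uint8_t cp) {
  if (cp < 0x20) return UNMAPPED;  // control codes 0x00..0x1F
  if (cp < 0x80) return cp - 32;   // 0x20..0x7F -> 0..95
  if (cp < 0xA0) return UNMAPPED;  // control codes 0x80..0x9F
  return cp - 64;                  // 0xA0..0xFF -> 96..191
}
```

A caller could substitute index 0x5F, the glyph stored in place of DEL, whenever UNMAPPED is returned.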

I did not know what to expect for three characters. Looking at a GNU FreeFont TTF file, I found a space character at code point 128 (0x80) as the "printable" version of a non-breaking space, and a hyphen at code point 141 (0x8D) for the soft hyphen. I suppose one could forego storing those two character bitmaps and ensure that the bitmapOffset field in their glyph entry points to the bitmap for the space and - respectively. However, it turns out that part 11 of ISO 8859 does not contain the soft hyphen and has a printable character in that position: ญ. I decided that remapping the non-breaking space to save a few bytes in the bitmap array was not worth the effort.

The Latin-9 (ISO 8859-15) character set is better suited for my purposes because it contains a couple of ligatures used in French and not available in Latin 1. It also contains the Euro symbol which can be useful at times.

GFX Latin 9
	x0	x1	x2	x3	x4	x5	x6	x7	x8	x9	xA	xB	xC	xD	xE	xF
0x		!	"	#	$	%	&	'	(	)	*	+	,	-	.	/
1x	0	1	2	3	4	5	6	7	8	9	:	;	<	=	>	?
2x	@	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O
3x	P	Q	R	S	T	U	V	W	X	Y	Z	[	\	]	^	_
4x	`	a	b	c	d	e	f	g	h	i	j	k	l	m	n	o
5x	p	q	r	s	t	u	v	w	x	y	z	{	|	}	~	⍰
6x		¡	¢	£	€	¥	Š	§	š	©	ª	«	¬	-	®	¯
7x	°	±	²	³	Ž	µ	¶	·	ž	¹	º	»	Œ	œ	Ÿ	¿
8x	À	Á	Â	Ã	Ä	Å	Æ	Ç	È	É	Ê	Ë	Ì	Í	Î	Ï
9x	Ð	Ñ	Ò	Ó	Ô	Õ	Ö	×	Ø	Ù	Ú	Û	Ü	Ý	Þ	ß
Ax	à	á	â	ã	ä	å	æ	ç	è	é	ê	ë	ì	í	î	ï
Bx	ð	ñ	ò	ó	ô	õ	ö	÷	ø	ù	ú	û	ü	ý	þ	ÿ

Latin 9 (ISO 8859-15) is almost the same as Latin 1 except for 8 characters: ¤, ¦, ¨, ´, ¸, ¼, ½ and ¾, which are replaced with €, Š, š, Ž, ž, Œ, œ and Ÿ. The code points for these 8 replacement characters in Unicode are greater than 255 and this will cause some problems, as will be seen later on.

Converting TrueType Fonts to Adafruit_GFX Fonts

Getting the glyphs for the font and setting up the GFXfont structures looks like a formidable hurdle. It is actually quite easy to convert a TTF font into a GFXfont thanks to the work of others. A few modifications to the fontconvert.c program provided with the Adafruit-GFX-Library were all that was necessary. The main change made to the utility is that it creates a GFXfont from arbitrarily ordered Unicode code points stored in an array called chartable instead of pulling all glyphs from a given first code point to a given last code point. The source for the modified utility can be downloaded from here: fontconvert8.zip. The content of that archive can be extracted to the fontconvert directory of the GFX library; there are no name conflicts. The source must be compiled to get an executable file. This is quite easily done on most Linux systems with the command

... $ make -f Makefile8

run from the folder containing the fontconvert8.c and Makefile8 files. The executable, fontconvert8, will be created in the same directory. Windows users can consult the guide prepared by Adafruit, A short guide to use fontconvert.c to create your own fonts using MinGW, to compile the application.

Before compiling the utility, it may be necessary to edit the fontconvert8.c source code. Up to three adjustments may be required.

In line 59, choose the glyph to load at code point 0x7F in lieu of the DEL character.

There are four choices.

define DEL_GLYPH	Character Glyph at 0x7F
SPACE_AS_DEL	" "	U+0020 the space character
REPLACEMENT_CHARACTER_AS_DEL	"�"	U+FFFD the Unicode replacement character
APL_QUAD_QUESTION_AS_DEL	"⍰"	U+2370 the APL functional symbol quad question

If the DEL_GLYPH macro is not defined, the glyph of U+007F as defined in the source TTF file will be used, whatever that may be.

In line 75, choose a chartable header file.

The glyphs that will be added to the GFX font are enumerated in a header file. Two ISO 8859 parts are included in the archive: 8859-1.h and 8859-15.h. It should be fairly straightforward to add other desired definitions, which need not correspond to any ISO 8859 definition by the way.

In line 77, set the display resolution

It turns out that 141 dots per inch is the correct resolution for the 2" ILI9225 display that I am using. It is a 176x220 pixel display with a 31.68x39.6 mm (or 1.25x1.56") display area which gives 140.8x141 dpi.

The modified executable is actually simpler to use than the original. It takes only two mandatory parameters: the name of the file containing the TTF font to be used as source and the point size of the font to create.

... $ ./fontconvert8 /usr/share/fonts/truetype/freefont/FreeMono.ttf 9 > FreeMono9pt8b.h

There is no need to specify the first and last command line parameters as in the older utility because those values are implicitly defined by the chartable. The makefonts8.sh script is an adaptation of the script which generates all the fonts found in the Fonts folder of the Adafruit-GFX-Library. You may need to adjust the inpath variable if the FreeFont fonts are in another directory.

In the comment to the right of each GFXglyph entry, the modified utility prints the 8 bit code point of the character, the description as supplied in the chartable header file and its Unicode code point as a reference.

    const GFXglyph FreeMono9pt8bGlyphs[] PROGMEM = {
      {     0,   0,   0,  11,    0,    1 },   // 0x20 ' ' U+0020
      {     0,   2,  11,  11,    4,  -10 },   // 0x21 '!' U+0021
      {     3,   6,   5,  11,    2,  -10 },   // 0x22 '"' U+0022
      {     7,   7,  12,  11,    2,  -10 },   // 0x23 '#' U+0023
      ...
      {  1698,   8,  12,  11,    1,  -11 },   // 0xc8 'LATIN SMALL LETTER E WITH GRAVE' U+00E8
      {  1710,   8,  12,  11,    1,  -11 },   // 0xc9 'LATIN SMALL LETTER E WITH ACUTE' U+00E9
      ...

One could use fontconvert8 to pick out just the digits as I discussed above. However to do that one would have to create a chartable header file. It's a lot faster to simply use the original fontconvert utility for that as long as the Unicode code points of the character set are sequential and less than 256. If consecutive Unicode characters with code points greater than 255 are required, the forked fontconvert.c by Bodmer can be used. I believe that he only needed to change the type of first, last and other code point related variables to int to handle the full Unicode range of characters.

UTF-8 Encoding to GFXFont Latin 1 and Latin 9 Encodings

The string "L'été à ..." ("The summer at ...") which contains common diacritics used in French will not be displayed correctly even if a GFX Latin 1 or a GFX Latin 9 font is installed.

    char msg[] = "L'été à ...";
    display.setCursor(5, 32);
    display.println(msg);
    display.display();

The above will show as L'ãÉtãÉ ãÀ ... on the display. The reason is simple: Unicode is used in the Arduino and PlatformIO environments (at least in Linux) and the actual byte array in msg is UTF-8 encoded.

   L    '    é        t    é        SP   à        .    .    .
  \x4C \x27 \xC3\xA9 \x74 \xC3\xA9 \x20 \xC3\xA0 \x2E \x2E \x2E

Remember that the drawChar function subtracts 32 (0x20) from each character to get its glyph. So, when decoding the above 2 byte UTF-8 encodings, the following occurs.

  0xC3 - 0x20 = 0xA3 (163) which is the position of ã in GFX Latin-1
  0xA9 - 0x20 = 0x89 (137) which is the position of É in GFX Latin-1
  0xA0 - 0x20 = 0x80 (128) which is the position of À in GFX Latin-1

To get around this problem, the string literal can be defined avoiding UTF-8 encoding.

    char msg[] = "L'\xC9t\xC9 \xC0 ...";
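To make the difference between the two strings concrete, the following sketch computes the glyph index that results from subtracting 32 from each byte. The helper glyphIndices and the two string constants are made up for this example; the UTF-8 text is written with escapes so that the result does not depend on the encoding of the source file.

```cpp
#include <string>
#include <vector>

// Glyph indices obtained when every byte of s is treated as one character
// and the code of the first glyph in the font (32) is subtracted.
std::vector<int> glyphIndices(const std::string &s) {
  std::vector<int> out;
  for (unsigned char b : s) out.push_back(b - 0x20);
  return out;
}

// UTF-8 bytes of the start of the message: L ' é t é SP à
const std::string utf8Msg = "L'\xC3\xA9t\xC3\xA9 \xC3\xA0";
// Hand-coded GFX Latin 1 version of the same text
const std::string gfxMsg = "L'\xC9t\xC9 \xC0";
```

For utf8Msg each accented letter produces two indices (0xA3 then 0x89 for é, which are the positions of ã and É), whereas gfxMsg produces the single index 0xA9, the position of é in the GFX Latin 1 table.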

The hexadecimal constants to embed in the string can be found by looking at the GFXglyph array generated with fontconvert8. Hand coding literal strings is not very convenient; it would be better to decode UTF-8 encoded strings. Unlike Bodmer, who added a "utf-8 attribute" to the Adafruit-GFX library, I decided to create UTF-8 to GFX font converters. The advantage of this approach is that it can be used with other libraries that support proportional fonts using the GFXfont structures. Here is the header file gfxlatin1.h, which declares two functions that convert either an Arduino String object or a C string from UTF-8 encoding to GFX Latin 1 encoding.

    #ifndef GFXLATIN1_H
    #define GFXLATIN1_H

    // Replace code points not found in the GFX Latin 1 font with 0x7F
    extern bool showUnmapped;

    // Convert a UTF-8 encoded String object to a GFX Latin 1 encoded String
    String utf8tocp(String s);

    // Convert a UTF-8 encoded string to a GFX Latin 1 encoded string
    // Be careful, the in-situ conversion will "destroy" the UTF-8 string s.
    void utf8tocp(char* s);

    #endif

The header file for UTF-8 encoding to GFX Latin 9, gfxlatin9.h is the same except for Latin 9 appearing where Latin 1 is found in the other file. These functions can then be used to create "wrappers" around text printing functions of the Adafruit GFX library.

    #include <Arduino.h>
    #include <Wire.h>              // using I²C SSD1306 display
    #include <Adafruit_GFX.h>
    #include <Adafruit_SSD1306.h>
    #include "decodeutf8.h"
    #include "gfxlatin1.h"
    #include "FreeSans10pt8b.h"    // GFX Latin 1 font generated with fontconvert8

    void drawUtf8Text(String str) {
      String xtdcp(utf8tocp(str));
      display.print(xtdcp);
    }

As stated already, there are other graphics libraries that use the GFXFont definitions. The TFT_22_ILI9225 library by Johan Cronje (Nkawu) is one of them which I use with 2" ILI9225 based colour TFT displays. The function to print strings on the display is different, but it is just as easy to "wrap".

    #include <Arduino.h>
    #include "SPI.h"               // using SPI ILI9225 display
    #include "TFT_22_ILI9225.h"
    #include "decodeutf8.h"
    #include "gfxlatin1.h"
    #include "FreeSans10pt8b.h"    // GFX Latin 1 font generated with fontconvert8

    void drawUtf8Text(uint16_t x, uint16_t y, String str, uint16_t color) {
      String xtdcp(utf8tocp(str));
      display.drawGFXText(x, y, xtdcp, color);
    }

If the boolean variable showUnmapped is set to true then the conversion routines will print the DEL character (0x7F, GFX code point 0x5F, or more accurately the substitute character that was chosen when creating the GFXfont) in place of invalid or out of bounds characters in the source string. Perhaps this could be of help when debugging a program.

The rest of this section is about technical details that the curious, those that might want to use this approach with other character sets, or those willing to help me improve my weak C/C++ programming skills may want to read.

When first working with the GFX ISO 8859-1 character set, I tested four functions to convert UTF-8 to GFX Latin 1. However, when it came time to work with the ISO 8859-15 character set, and while thinking about a general solution that might work for all character sets, I quickly narrowed down the field to one function, by Bodmer again! It works by converting a UTF-8 8-bit stream into a UCS-2 16-bit stream and then converting some UCS-2 characters into the correct code point in the installed GFX font. Both gfxlatin1.cpp and gfxlatin9.cpp rely on uint16_t decodeUTF8(uint8_t c). The function is actually a state machine that takes an 8-bit value as input and outputs the corresponding 16-bit value in the range 0 to 0xFFFE if a valid UTF-8 character has been found, or outputs 0xFFFF if the state machine needs more input to decode a multi-byte UTF-8 character or if an invalid encoding is found. Here is how the function is used in gfxlatin1.cpp to convert a String.

    String utf8tocp(String s) {
      String r="";
      uint16_t c;
      resetUTF8decoder();
      for (int i=0; i<s.length(); i++) {
        c = decodeUTF8(s.charAt(i));
        if (0x20 <= c && c <= 0x7F)
          r += (char) c;
        else if (0xA0 <= c && c <= 0xFF)
          r += (char) (c - 32);
      }
      return r;
    }

The state machine is reset with the resetUTF8decoder() statement at the start of the decoding process. Printable 7-bit ASCII characters are mapped to themselves, while 32 is subtracted from the printable Latin 1 characters. Remember that the Unicode code point of these 8-bit characters is 64 greater than their index in the glyph array, but the GFX library itself subtracts the code point of the first character in the font (which is " " with the value 32), so only the remaining 32 has to be subtracted here. So all the heavy lifting is done by the decoder function. Just how complicated is it to parse a variable length UTF-8 character? Let's start with the syntax of these characters in Augmented Backus-Naur Form.

UTF8-char	UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1	%x00-7F
UTF8-2	%xC2-DF	%x80-BF
UTF8-3	%xE0	%xA0-BF	%x80-BF
	%xE1-EC	%x80-BF	%x80-BF
	%xED	%x80-9F	%x80-BF
	%xEE-EF	%x80-BF	%x80-BF
UTF8-4	%xF0	%x90-BF	%x80-BF	%x80-BF
	%xF1-F3	%x80-BF	%x80-BF	%x80-BF
	%xF4	%x80-8F	%x80-BF	%x80-BF

Source: Request for Comments: 3629 by F. Yergeau, November 2003.
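Each row of the grammar amounts to a range check on the first byte, a possibly narrower range for the second byte, and the generic range 0x80 to 0xBF for any remaining bytes. A minimal sketch of such a check follows. The function name utf8CharLength is an assumption made for illustration, and this is not the validating decoder included in the downloads.

```cpp
#include <cstddef>
#include <cstdint>

// Length (1 to 4) of the well-formed UTF-8 character starting at s, or 0 if
// the first bytes do not match the RFC 3629 grammar. n is the number of
// bytes available in s.
size_t utf8CharLength(const uint8_t *s, size_t n) {
  if (n == 0) return 0;
  uint8_t b0 = s[0];
  if (b0 <= 0x7F) return 1;                       // UTF8-1
  size_t len;
  uint8_t lo = 0x80, hi = 0xBF;                   // allowed range of 2nd byte
  if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;          // UTF8-2
  else if (b0 == 0xE0) { len = 3; lo = 0xA0; }    // UTF8-3, no overlong forms
  else if (b0 >= 0xE1 && b0 <= 0xEC) len = 3;
  else if (b0 == 0xED) { len = 3; hi = 0x9F; }    // excludes surrogates
  else if (b0 == 0xEE || b0 == 0xEF) len = 3;
  else if (b0 == 0xF0) { len = 4; lo = 0x90; }    // UTF8-4, no overlong forms
  else if (b0 >= 0xF1 && b0 <= 0xF3) len = 4;
  else if (b0 == 0xF4) { len = 4; hi = 0x8F; }    // nothing above U+10FFFF
  else return 0;           // 0x80-0xC1 and 0xF5-0xFF never start a character
  if (n < len) return 0;                          // truncated sequence
  if (s[1] < lo || s[1] > hi) return 0;
  for (size_t i = 2; i < len; i++) {
    if (s[i] < 0x80 || s[i] > 0xBF) return 0;     // UTF8-tail
  }
  return len;
}
```

The function only recognises well-formed sequences; it does not compute the code point.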

Here is a slightly simplified version of the decoder function used by gfxlatin1.cpp and gfxlatin9.cpp.

    uint8_t  decoderState = 0;   // UTF-8 decoder state
    uint16_t decoderBuffer;      // Unicode code-point buffer

    uint16_t decodeUTF8(uint8_t c) {
      if ((c & 0x80) == 0x00) {               // 7 bit Unicode Code Point
        decoderState = 0;
        return (uint16_t) c;
      }

      if (decoderState == 0) {
        if ((c & 0xE0) == 0xC0) {             // 11 bit Unicode code point
          decoderBuffer = ((c & 0x1F)<<6);    // Save first 5 bits
          decoderState = 1;
        } else if ((c & 0xF0) == 0xE0) {      // 16 bit Unicode code point
          decoderBuffer = ((c & 0x0F)<<12);   // Save first 4 bits
          decoderState = 2;
        }
      } else {
        decoderState--;
        if (decoderState == 1)
          decoderBuffer |= ((c & 0x3F)<<6);   // Add next 6 bits of 16 bit code point
        else if (decoderState == 0) {
          decoderBuffer |= (c & 0x3F);        // Add last 6 bits of code point (UTF8-tail)
          return decoderBuffer;
        }
      }
      return 0xFFFF;
    }

Clearly the function does not parse the grammar correctly. UTF-8 encoded strings should never contain the characters \xC0 and \xC1, yet the decoder accepts these as leading bytes of 2 byte encodings. RFC 3629 warns against such errors:

For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding invalid sequences may have security consequences or cause other problems.
...
Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F. This last exploit has actually been used in a widespread virus attacking Web servers in 2001; thus, the security threat is very real.

The decoder does indeed return NUL for the sequence "0xC0 0x80" (the function returns 0xFFFF for 0xC0, which is ignored, and then 0x00, which is the code point for NUL). As for the surrogate pair, it returns two values, 0xD84C and 0xDFB4, which is not U+233B4 but which is nevertheless incorrect. The decoder does exhibit the last identified problem: the string "\x2F\xC0\xAE\x2E\x2F" is converted to the ASCII string "/../". Furthermore, all 4 byte encodings such as "\xF0\x90\x8C\x94" for U+10314 OLD ITALIC LETTER ES '𐌔' return nothing except 0xFFFF.
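These behaviours can be reproduced on a desktop compiler with a small harness. The decoder below is the simplified version shown earlier; the body of resetUTF8decoder() and the helper decodeAll() are assumptions made for this sketch.

```cpp
#include <cstdint>
#include <vector>

uint8_t  decoderState = 0;   // UTF-8 decoder state
uint16_t decoderBuffer = 0;  // Unicode code point buffer

// Assumed implementation: put the state machine back in its initial state.
void resetUTF8decoder() {
  decoderState = 0;
  decoderBuffer = 0;
}

// Simplified decoder as shown in the text.
uint16_t decodeUTF8(uint8_t c) {
  if ((c & 0x80) == 0x00) {              // 7 bit code point
    decoderState = 0;
    return (uint16_t) c;
  }
  if (decoderState == 0) {
    if ((c & 0xE0) == 0xC0) {            // lead byte of a 2 byte encoding
      decoderBuffer = ((c & 0x1F) << 6);
      decoderState = 1;
    } else if ((c & 0xF0) == 0xE0) {     // lead byte of a 3 byte encoding
      decoderBuffer = ((c & 0x0F) << 12);
      decoderState = 2;
    }
  } else {
    decoderState--;
    if (decoderState == 1) {
      decoderBuffer |= ((c & 0x3F) << 6);
    } else if (decoderState == 0) {
      decoderBuffer |= (c & 0x3F);
      return decoderBuffer;
    }
  }
  return 0xFFFF;
}

// Hypothetical helper: feed every byte to the decoder and collect all the
// returned values, including the 0xFFFF "nothing yet" markers.
std::vector<uint16_t> decodeAll(const std::vector<uint8_t> &bytes) {
  std::vector<uint16_t> out;
  resetUTF8decoder();
  for (uint8_t b : bytes) out.push_back(decodeUTF8(b));
  return out;
}
```

Calling decodeAll({0x2F, 0xC0, 0xAE, 0x2E, 0x2F}) yields 0x2F, 0xFFFF, 0x2E, 0x2E, 0x2F, which is "/../" once the 0xFFFF marker is dropped.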

Initially I was a bit taken aback by these errors and I tried to create a "validating" UTF-8 decoder. After some more thought, I decided that the simpler and smaller decoder based on Bodmer code was quite sufficient for usage in a microcontroller. I can't really see that there would be security problems in that context.

There is a complication when it comes to the GFX Latin 9 encoding. The Latin 9 characters not found in Latin 1 all have Unicode code points that are greater than 0xFF. So a function called recode maps those character to GFX Latin 9 code points.

    uint16_t recode(uint8_t b) {
      uint16_t ucs2 = decodeUTF8(b);
      if (ucs2 > 0x7F) {
        switch (ucs2) {
          case 0x0152: return 0xbc;
          case 0x0153: return 0xbd;
          case 0x0160: return 0xa6;
          case 0x0161: return 0xa8;
          case 0x0178: return 0xbe;
          case 0x017D: return 0xb4;
          case 0x017E: return 0xb8;
          case 0x20AC: return 0xa4;
        }
      }
      return ucs2;
    }

Instead of calling decodeUTF8(), utf8tocp uses recode() to decode UTF-8 strings. There is a slight problem with that function. The string "¤€" will be displayed as "€€". That's because the sequence "\xC2\xA4", the UTF-8 encoding for U+00A4 CURRENCY SIGN, character '¤', will be converted to 0xA4 by decodeUTF8 and passed on as such by recode, while the sequence "\xE2\x82\xAC", the UTF-8 encoding for U+20AC EURO SIGN, will be mapped to 0x20AC by decodeUTF8 and then converted to 0xA4 by recode. I don't believe that is really a problem, but those that prefer using the validating UTF-8 decoder might want to define the INVALIDATE_OVERWRITTEN_LATIN_1_CHARS macro to flag the currency sign and the other overwritten Latin 1 characters as unmapped.
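One way such flagging could work is sketched below on already decoded UCS-2 values. The function recodeLatin9, its strict parameter and the UNMAPPED constant are hypothetical; this is not necessarily how the macro is implemented in the downloadable source.

```cpp
#include <cstdint>

const uint16_t UNMAPPED = 0xFFFF;  // same "no character" value the decoder returns

// Hypothetical variant of the Latin 9 recoding step. With strict == true the
// eight Latin 1 characters whose positions were reassigned in ISO 8859-15
// are reported as unmapped instead of being shown as their replacements.
uint16_t recodeLatin9(uint16_t ucs2, bool strict) {
  switch (ucs2) {
    case 0x0152: return 0xBC;  // Œ
    case 0x0153: return 0xBD;  // œ
    case 0x0160: return 0xA6;  // Š
    case 0x0161: return 0xA8;  // š
    case 0x0178: return 0xBE;  // Ÿ
    case 0x017D: return 0xB4;  // Ž
    case 0x017E: return 0xB8;  // ž
    case 0x20AC: return 0xA4;  // €
    case 0x00A4: case 0x00A6: case 0x00A8: case 0x00B4:
    case 0x00B8: case 0x00BC: case 0x00BD: case 0x00BE:
      return strict ? UNMAPPED : ucs2;  // ¤ ¦ ¨ ´ ¸ ¼ ½ ¾
    default:
      return ucs2;
  }
}
```

With strict set to false, both U+00A4 and U+20AC end up as 0xA4, which is the "€€" behaviour described above.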

Other Approaches

Not too long ago, I posted a short hack (Lettres diacritiques françaises avec les polices GFXfont) which was designed to add just a few characters from the Latin-1 Unicode block. It has been withdrawn and replaced with Police GFX avec jeu de caractères arbitraire (which loosely translates to GFX Font with an Arbitrary Character Set), which has a much more flexible approach.

Chris Young wrote Creating Custom Symbol Fonts for Adafruit GFX Library, which could be used as a basis for doing something similar to what is done here. The original fontconvert.c utility could be used to create a second GFXfont from the Latin-1 block. Then his drawSymbol function could be modified to select either the ASCII font or the Latin 1 font to print a character based on its code point. It would probably still be useful to add the UTF-8 conversion routines to easily write UTF-8 strings to the display. In terms of memory usage, 2 blocks of 96 glyphs should be just about the same as one block of 192 glyphs, and the switching from one GFXfont to another seems a bit awkward to me. For those reasons, I don't believe this approach has an advantage over the one proposed here, but it may be a good starting point when first using the GFX Library from Adafruit.

I have already referred to the Bodmer fork of the Adafruit-GFX-Library. Not only did he add UTF-8 support to the library, he also modified the fontconvert.c utility to work with any set of contiguous glyphs in the Unicode repertoire. For my needs this is not quite as useful because some of the displays I use are not supported by the Adafruit library and the characters for Latin 9 are not contiguous or even sequential. However, his UTF-8 conversion routine, which is not dissimilar to the one available in the Arduino playground, proved to be an excellent launching point.

While reading an issue on the Adafruit-GFX-Library repository, international character sets #64, I saw that Peter Jakobs (pljakobs) had tackled extended ASCII character sets. I need to look more closely at how he handled the Latin 2 encoding because he seems to have done the "recoding" in a way that is generally applicable albeit at the cost of using more memory and perhaps with slower conversions. I must also credit him for the idea of creating a wrapper function that takes care of the UTF-8 decoding.

Downloads

This post will end with a list of downloads.

Modified fontconvert utility: fontconvert8.zip.
UTF-8 to GFX Latin1 conversions: gfxlatin1.zip. These are the 4 UTF-8 decoding routines initially experimented with until deciding on a single approach.
Testing environment (platformIO): gfxfont_8bit_v01.zip (includes the content of fontconvert8.zip, the gfxlatin1 and gfxlatin9 libraries, the "flawed" decodeUTF8 and the validatingdecodeUTF8 decoders). The included example programs are really test programs; I obviously need to learn how to use the PlatformIO unit testing capabilities.

As usual the very lax 2 clause BSD licence applies to my source code available in the gfxfont_8bit_v01.zip archive. However, the licences of the original contributors must be respected. Links to what I believe are the original sources can be found in gfxlatin1.cpp contained in the gfxlatin1.zip archive.
