fontconvert by Adafruit extracts the glyphs of the 95 printable ASCII characters (code points 32 (0x20) to 126 (0x7E)) from TTF font files and creates a GFXfont header file that can be used with the Adafruit-GFX library. Diacritical letters and other symbols needed in non-English European languages are missing from this set of characters. There are open Issues in the
Adafruit-GFX-Library repository about supporting extended ASCII code pages with some proposed solutions, but it is not clear that this has yielded a solution which has found wide acceptance. This post does not pretend to fill this void, but it does show how to generate a GFXfont with all ISO 8859-1 (Latin-1) or ISO 8859-15 (Latin-9) printable glyphs from a TTF font file (provided the latter contains the glyphs of course) and how to easily access these characters in embedded systems.
Table of Contents
- The ISO 8859 Standard
- Adafruit-GFX-Library Proportional Fonts
- GFXFont with ISO 8859 Character Sets
- Converting TrueType Fonts to Adafruit_GFX Fonts
- UTF-8 Encoding to GFXFont 8859 Encoding
- Other Approaches
ISO 8859 (formally ISO/IEC 8859) is a set of standards for 8-bit character encoding that extend the 7-bit ASCII character encoding. The encoding is divided into 15 parts, designated 8859-1 to 8859-16 (8859-12 was abandoned), that cover many European languages. The parts have more or less confusing names. Part 1 is named Latin-1 Western European, part 2 is Latin-2 Central European, but by the time part 16 is reached it is called Latin-10 South-Eastern European. Furthermore, they have been revised and cover very similar standards from other authorities. Add vendor variations and additions to the mix and the result can be confusing. Recall the many code pages such as CP437 (the original IBM PC hardware encoding), CP850 (Latin-1 again!) and CP863 (which was used in French Canada) from the DOS era? Code page 1252, also called Windows-1252, which was used in early versions of Microsoft Windows and which remains the "most-used single-byte character encoding in the world", is sometimes labelled ISO-8859-1 although it is a superset of ISO 8859-1 (apparently one is supposed to spot the extra dash that differentiates the two sets of characters). This relative mess was more or less resolved with the advent of Unicode. The latter did not completely do away with the older standards. The first two blocks of the Unicode character database,
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
(the first 256 code points) correspond to part 1 of 8859. Here is a table showing all the glyphs in 8859-1 along with their code points (positions, 8-bit encodings, etc. There are so many equivalent terms in this area!).
In this encoding:
- Code point 0x20 (hexadecimal, 32 decimal) is assigned to the printable space character often denoted SP.
- Code point 0xA0 (hexadecimal, 160 decimal) is assigned to the non-breaking space character denoted NBSP.
- Code point 0xAD (hexadecimal, 173 decimal) is assigned to the soft-hyphen character denoted SHY. It represents a possible hyphen in a word and is shown only if it corresponds to a line break.
- 65 code points are control codes and not glyphs. CP-1252 contained the same printable characters and replaced the control codes in the 0x80 to 0x9F range with more glyphs "borrowed" from other ISO 8859 tables.
Source: ISO/CEI 8859-1.
All characters in a monospaced (fixed-pitch or fixed-width) font are the same width. The letters i and W will occupy the same width, which means a fair amount of blank area around the i and the W will probably appear to be squished a bit. These fonts were widely used in early computer systems when displays were rectangular grids of equally sized cells in which characters would be placed. The fixed width of each character also simplifies digital representation of each glyph (image) in the character set since each character bitmap will be the same size.
Assume that the nominal size of each glyph is like the "A" shown above: 6 pixels wide and 7 pixels high. Then typically, each character bitmap will be made up of 7 bytes, each byte representing a row of dots of the character and the glyphs for the complete set of n characters will be an array of n consecutive 7 byte arrays.
Finding the bitmap for the ith character in the set is very easy; it is simply at index i*7 from the start of the byte array where everything is counted starting at 0.
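A minimal sketch of that lookup, assuming the 7 bytes per glyph layout just described. The names GLYPH_HEIGHT, fontBitmaps and glyphRows, and the tiny two-glyph font, are made up for the illustration.

```cpp
#include <stddef.h>
#include <stdint.h>

// Assumed layout: 7 bytes per glyph, one byte per row, glyphs stored back to back.
const size_t GLYPH_HEIGHT = 7;

// Hypothetical two-glyph font: glyph 0 is blank, glyph 1 is a rough "A".
const uint8_t fontBitmaps[] = {
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x70, 0x88, 0x88, 0xF8, 0x88, 0x88, 0x88,
};

// Returns a pointer to the first row of the ith glyph (i counted from 0).
const uint8_t *glyphRows(const uint8_t *bitmaps, size_t i) {
  return bitmaps + i * GLYPH_HEIGHT;
}
```

No per-glyph bookkeeping is needed here; that is exactly what is lost once glyph widths vary.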
Like most display libraries, the
Adafruit-GFX-Library includes a default monospaced font but it also has provision for adding proportional fonts. These fonts, in which characters can have different widths, are more pleasing to the eye, have always been the norm for printed material and have become ubiquitous in computer systems except for programming environments and other niche contexts. Since the character width is not constant, the glyph bitmaps are of variable size. These could be padded with 0 bits to represent empty columns so that all characters are stored with the same width and then accessing the ith character bitmap would be as easy as with monospaced fonts. However, this wastes memory if some characters can be more than 8 columns wide, and memory is at a premium on microcontrollers. Furthermore, it would be necessary to store the character width somewhere else. The Adafruit library stores the bitmaps in a continuous sequence and it stores the offset of the start of each character bitmap in another array along with other character attributes, including the width.
Here is the structure stored for each character glyph.
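The declarations below show that structure together with the GFXfont structure that refers to it. They are reproduced from memory of the gfxfont.h file in a recent release of the library, so the exact field types are an assumption to be checked against the installed copy. As far as I can tell, older releases declared first and last as 8-bit values.

```cpp
#include <stdint.h>

// Per-glyph attributes: where the bitmap starts and how to place it.
typedef struct {
  uint16_t bitmapOffset; // offset of the glyph's first byte in the bitmap array
  uint8_t width;         // bitmap width in pixels
  uint8_t height;        // bitmap height in pixels
  uint8_t xAdvance;      // distance to move the cursor along x after drawing
  int8_t xOffset;        // x distance from the cursor to the bitmap's upper-left corner
  int8_t yOffset;        // y distance from the cursor to the bitmap's upper-left corner
} GFXglyph;

// The font proper: two pointers and three other fields.
typedef struct {
  uint8_t *bitmap;  // all glyph bitmaps, concatenated
  GFXglyph *glyph;  // array of glyph attributes
  uint16_t first;   // code point of the first glyph
  uint16_t last;    // code point of the last glyph
  uint8_t yAdvance; // distance between two lines of text
} GFXfont;
```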
A GFXfont is a structure with pointers to the bitmap array and to the character attribute array and three other fields: first, last and yAdvance. The range of code points covered by the font is given by the first and
last fields. By default, these are respectively 32 (0x20 hexadecimal) which is the ASCII code for the space character and 126 (0x7E) which is the ASCII code for the '~' character. Since the first 32 code points are control codes, nothing is saved into the font bitmap for these codes and there are no glyph entries for these code points. Consequently, the first
GFXglyph entry points to the first printable character in the font bitmap. This also means that it is quite possible to have a font with only a subset of characters, as long as these have sequential code points. For example, in a digital clock project I use a GFXfont with big characters (in terms of width and height) which does not take up much memory because it contains only 11 characters: '0', '1', through to '9' and ':'. So
first is 48 (0x30) and
last is 58 (0x3A). This is a compact way of storing a subset of all code points as long as they have sequential code points.
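The lookup implied by first and last can be sketched as follows. This is an illustration, not the library's own code, and glyphIndex is a made-up name.

```cpp
#include <stdint.h>

// Returns the index into the GFXglyph array for code point c,
// or -1 when c lies outside the font's [first, last] range.
int glyphIndex(uint16_t first, uint16_t last, uint16_t c) {
  if (c < first || c > last) return -1;
  return c - first;
}
```

With the clock font (first is 48, last is 58), glyphIndex(48, 58, ':') gives 10, the eleventh and last glyph, while any letter gives -1.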
All ISO 8859 parts contain 32 control codes starting at code point 128 (0x80) which would be wasteful to include in the GFXglyph array. The obvious approach is to skip over these control codes just as done for the first 32 control codes in ASCII. So the following table shows how I propose to store the glyphs of the printable characters from an ISO 8859 character set, using 8859-1 (Latin-1) as an example.
|GFX Latin 1|
The index of a character shown in that table is actually the ordinal of the character's
GFXglyph entry in the font
glyph array. For example, the index of "m" is 0x4D (hexadecimal, 77 decimal). Therefore the
GFXglyph entry for "m" will be
glyph[77]. However the code point for "m" is 109 (77+32). This way of packing the bitmaps and the array of glyph attributes means that some care will have to be exercised to access the correct glyph. Let
cp represent the code point of a character in ISO 8859-1.
|Code point||Index into GFXglyph array|
|0 (0x00) ≤ cp < 32 (0x20)||unmapped|
|32 (0x20) ≤ cp < 128 (0x80)||cp - 32|
|128 (0x80) ≤ cp < 160 (0xA0)||unmapped|
|160 (0xA0) ≤ cp ≤ 255 (0xFF)||cp - 64|
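The four ranges translate directly into code. The function below is a sketch with a made-up name, gfxLatin1Index, and it returns -1 for the unmapped ranges as an arbitrary convention.

```cpp
#include <stdint.h>

// Index into the GFXglyph array of a GFX Latin 1 font for the ISO 8859-1
// code point cp, or -1 for the two unmapped ranges of control codes.
int gfxLatin1Index(uint8_t cp) {
  if (cp < 0x20) return -1;        // first block of control codes
  if (cp < 0x80) return cp - 0x20; // ASCII block, including the 0x7F slot
  if (cp < 0xA0) return -1;        // second block of control codes
  return cp - 0x40;                // 0xA0 to 0xFF follow the ASCII block directly
}
```

The largest index is 255 - 64 = 191, so the glyph array holds 192 entries.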
Perhaps � (U+FFFD REPLACEMENT CHARACTER) could be shown in those cases where a code point does not correspond to a glyph, such as those marked with an * in the table. This is what is done by Web browsers. The Adafruit library already does the translation by 32 (or whatever the code point of the first visible glyph is) when displaying a code point, but clearly things will be more complicated with the lower part of the table. More on that later, for now let's continue the discussion on the glyphs themselves.
I did not know what to expect for three characters. Looking at a GNU FreeFont TTF file, I found a space character at code point 128 (0x80) as the "printable" version of a non-breaking space, a hyphen at code point 141 (0x8D) for the soft hyphen. I suppose one could forego storing those two character bitmaps and ensure that the
bitmapOffset field in their
glyph entry points to the bitmap for the space and - respectively. However, it turns out that part 11 of ISO 8859 does not contain the soft hyphen and has a printable character in that position: ญ. I decided that remapping the non-breaking space to save a few bytes in the bitmap array was not worth the effort.
The Latin-9 (ISO 8859-15) character set is better suited for my purposes because it contains a couple of ligatures used in French and not available in Latin 1. It also contains the Euro symbol which can be useful at times.
|GFX Latin 15|
Latin 9 (ISO 8859-15) is almost the same as Latin 1 except for 8 characters: ¤, ¦, ¨, ´, ¸, ¼, ½ and ¾ which are replaced with €, Š, š, Ž, ž, Œ, œ and Ÿ. The code points for these 8 replacement characters in Unicode are greater than 255 and this will cause some problems as will be seen later on.
Getting the glyphs for the font and setting up the GFXfont structures looks like a formidable hurdle. It is actually quite easy to convert a TTF font into a GFXFont thanks to the work of others. A few modifications to the
fontconvert.c program provided with the
Adafruit-GFX-library were all that was necessary. The main change made to the utility is that it creates a GFXfont from arbitrarily ordered Unicode code points stored in an array called
chartable instead of pulling all glyphs from a given first code point to a given last code point. The source for the modified utility can be downloaded from here: fontconvert8.zip. The content of that archive can be extracted to the
fontconvert directory of the GFX library, there are no name conflicts. The source must be compiled to get an executable file. This is quite easily done on most Linux systems with the command
run from the folder containing the
Makefile8 files. The executable,
fontconvert8, will be created in the same directory. Windows users can consult the guide prepared by Adafruit, A short guide to use fontconvert.c to create your own fonts using MinGW, to compile the application.
Before compiling the utility, it may be necessary to edit the
fontconvert8.c source code. Up to three adjustments may be required.
- In line 59, choose the glyph to load at code point 0x7F in lieu of the DEL character.
There are four choices.
|define DEL_GLYPH||Character||Glyph at 0x7F|
|SPACE_AS_DEL||" "||U+0020 the space character|
|REPLACEMENT_CHARACTER_AS_DEL||"�"||U+FFFD the Unicode replacement character|
|APL_QUAD_QUESTION_AS_DEL||"⍰"||U+2370 the APL functional symbol quad question|
If the DEL_GLYPH macro is not defined, the glyph of U+007F as defined in the source TTF file will be used, whatever that may be.
- In line 75, choose a
- In line 77, set the display resolution
The glyphs that will be added to the GFX font are enumerated in a header file. Two ISO 8859 parts are included in the archive
8859-15.h. It should be fairly straightforward to add other desired definitions, which need not correspond to any ISO 8859 part, by the way.
It turns out that 141 dots per inch is the correct resolution for the 2" ILI9225 display that I am using. It is a 176x220 pixel display with a 31.68x39.6 mm (or 1.25x1.56") display area which gives 140.8x141 dpi.
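The same calculation for another display can be sketched as follows. The function name dotsPerInch is made up; the only constant involved is 25.4 mm to the inch.

```cpp
// Pixels per inch along one axis, from the pixel count and the size in mm.
double dotsPerInch(int pixels, double millimetres) {
  return pixels / (millimetres / 25.4);
}
```

For the display above, dotsPerInch(176, 31.68) and dotsPerInch(220, 39.6) both come out at roughly 141.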
The modified executable is actually simpler to use than the original. It takes only two mandatory parameters: the name of the file containing the TTF font to be used as source and the point size of the font to create.
There is no need to specify the
last command line parameters as in the older utility because those values are implicitly defined by the
makefonts8.sh script is an adaptation of the script which generates all the fonts found in the Fonts folder of the Adafruit-GFX-Library. You may need to adjust the
inpath variable if the FreeFont fonts are in another directory.
In the comment to the right of each
GFXglyph entry, the modified utility prints the 8 bit code point of the character, the description as supplied in the
chartable header file and its Unicode code point as a reference.
One could use
fontconvert8 to pick out just the digits as I discussed above. However to do that one would have to create a
chartable header file. It's a lot faster to simply use the original
fontconvert utility for that as long as the Unicode code points of the character set are sequential and less than 256. If consecutive Unicode characters with code points greater than 255 are required, the forked fontconvert.c by Bodmer can be used. I believe that he only needed to change the type of
last and other code point related variables to
int to handle the full Unicode range of characters.
The string "L'été à ..." ("The summer at ...") which contains common diacritics used in French will not be displayed correctly even if a GFX Latin 1 or a GFX Latin 9 font is installed.
The above will show as L'ãÉtãÉ ãÀ ... on the display. The reason is simple: Unicode is used in the Arduino and PlatformIO environments (at least in Linux) and the actual byte array in
msg is UTF-8 encoded.
Remember that the
drawChar function subtracts 32 (0x20) from each character to get its glyph. So, when decoding the above 2 byte UTF-8 encodings, the following occurs.
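The effect can be reproduced off the display with a small sketch. It assumes a GFX Latin 1 font laid out as proposed earlier (96 glyphs for the ASCII block followed by 96 glyphs for 0xA0 to 0xFF) and that the library does nothing more than subtract 32 from each byte. The function name latin1ShownForByte is made up.

```cpp
#include <stdint.h>

// ISO 8859-1 code point of the glyph that a GFX Latin 1 font shows for a
// raw byte b, assuming the library only subtracts 32 to get the glyph
// index. Returns -1 when the byte falls outside the font.
int latin1ShownForByte(uint8_t b) {
  if (b < 0x20) return -1;              // below the first glyph
  int index = b - 0x20;                 // the subtraction described above
  if (index < 96) return index + 0x20;  // ASCII block: shown as itself
  if (index < 192) return index + 0x40; // upper block: wrong by 32 positions
  return -1;                            // past the last glyph
}
```

In UTF-8, é is the two bytes 0xC3 0xA9 and à is 0xC3 0xA0. The byte 0xC3 lands on 0xE3 (ã), 0xA9 on 0xC9 (É) and 0xA0 on 0xC0 (À), which is the garbled text reported above.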
To get around this problem, the string literal can be defined avoiding UTF-8 encoding.
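As an illustration, here is how the test string might be written by hand, assuming the layout proposed earlier where the byte stored for a character in the 0xA0 to 0xFF range is its ISO 8859-1 code point minus 32.

```cpp
#include <stdint.h>

// "L'été à ..." hand-encoded for a GFX Latin 1 font:
// é is 0xE9 in ISO 8859-1, so 0xE9 - 32 = 0xC9; à is 0xE0, so 0xC0.
const char msg[] = "L'\xC9t\xC9 \xC0 ...";
```

Beware that a hexadecimal escape swallows every hexadecimal digit that follows it. When the next letter is one of a to f, the literal has to be split, as in "\xC9" "e".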
The hexadecimal constants to embed in the string can be found by looking at the
GFXglyph generated with
fontconvert8. Hand coding literal strings is not very convenient, it would be better to decode UTF-8 encoded strings. Unlike Bodmer who added a "utf-8 attribute" to the Adafruit-GFX, I decided to create UTF-8 to GFX font converters. The advantage of this approach is that it can be used with other libraries that support proportional fonts using the GFXfont structures. Here is the header file
gfxlatin1.h which declares two functions that convert either an
Arduino String object or a C string from UTF-8 encoding to GFX Latin 1 encoding.
The header file for UTF-8 encoding to GFX Latin 9,
gfxlatin9.h is the same except for Latin 9 appearing where Latin 1 is found in the other file. These functions can then be used to create "wrappers" around text printing functions of the Adafruit GFX library.
As stated already, there are other graphics libraries that use the GFXFont definitions. The TFT_22_ILI9225 library by Johan Cronje (Nkawu) is one of them which I use with 2" ILI9225 based colour TFT displays. The function to print strings on the display is different, but it is just as easy to "wrap".
If the boolean variable
showUnmapped is set to true then the conversion routines will print the DEL character (0x7F, GFX code point 0x5F, or more accurately the substitute character that was chosen when creating the GFXfont) in place of invalid or out of bounds characters in the source string. Perhaps this could be of help when debugging a program.
The rest of this section is about technical details that the curious, those that might want to use this approach with other character sets, or those willing to help me improve my weak C/C++ programming skills may want to read.
When first working with the GFX ISO 8859-1 character set, I tested four functions to convert UTF-8 to GFX Latin 1. However when it came time to work with the ISO 8859-15 character set, and while thinking about a general solution that might work for all character sets, I quickly narrowed down the field to one function, by Bodmer again! It works by converting a UTF-8 8-bit stream into a UCS-2 16-bit stream and then converting some UCS-2 characters into the correct code point in the installed GFX font. Both
gfxlatin9.cpp rely on
uint16_t decodeUTF8(uint8_t c). The function is actually a state machine that takes an 8-bit value as input and outputs the corresponding 16-bit value in the range 0 to 0xFFFE if a valid UTF-8 character has been found or it outputs 0xFFFF if the state machine needs more input to decode a multi-byte UTF-8 character or if an invalid encoding is found. Here is how the function is used in
gfxlatin1.cpp to code a
The state machine is reset with the
resetUTF8decoder() statement at the start of the decoding process. Printable 7-bit ASCII characters are mapped as themselves while 32 is subtracted from the printable Latin 1 characters. Remember that the Unicode code point of these 8-bit characters is 64 greater than their GFX code point but the GFX library subtracts the code point of the first character in the font (which is " " with the value 32). So all the heavy lifting is done by the decoder function. Just how complicated is it to parse a variable length UTF-8 character? Let's start with the syntax of these characters in Augmented Backus-Naur Form.
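|UTF8-char||UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4|
|UTF8-1||%x00-7F|
|UTF8-2||%xC2-DF UTF8-tail|
|UTF8-3||%xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) / %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )|
|UTF8-4||%xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) / %xF4 %x80-8F 2( UTF8-tail )|
|UTF8-tail||%x80-BF|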
Source: Request for Comments: 3629 by F. Yergeau, November 2003.
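Before turning to the decoder itself, the mapping applied to each 16-bit value it returns (ASCII kept as is, 32 subtracted in the 0xA0 to 0xFF range) can be sketched as below. This is not the code found in gfxlatin1.cpp. The names ucsToGfxLatin1 and showUnmappedSketch are made up, and treating 0xFFFF as "store nothing" is an assumption of the sketch.

```cpp
#include <stdint.h>

bool showUnmappedSketch = false; // stand-in for the showUnmapped flag

// Maps one decoded 16-bit value to the byte stored in a GFX Latin 1
// string, or returns 0 when nothing should be stored.
uint8_t ucsToGfxLatin1(uint16_t ucs) {
  if (ucs >= 0x20 && ucs <= 0x7E) return (uint8_t)ucs;          // ASCII kept as is
  if (ucs >= 0xA0 && ucs <= 0xFF) return (uint8_t)(ucs - 0x20); // upper block, less 32
  if (ucs == 0xFFFF) return 0;          // decoder is waiting for more input
  return showUnmappedSketch ? 0x7F : 0; // unmapped: DEL substitute or nothing
}
```

A conversion loop would feed each byte of the source string to the decoder, pass the result to this function and append every non-zero byte to the output.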
Here is a slightly simplified version of the decoder function used by
Clearly the function does not parse the grammar correctly. UTF-8 encoded strings should never contain the characters \xC0 and \xC1, yet the decoder accepts these as leading bytes of 2 byte encodings. RFC 3629 warns against such errors:
For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding invalid sequences may have security consequences or cause other problems.
Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F. This last exploit has actually been used in a widespread virus attacking Web servers in 2001; thus, the security threat is very real.
The decoder does indeed return NUL for the sequence
"0xC0 0x80" (the function returns
0XC0, which is ignored, and
0x00 which is the code point for NUL). As for the surrogate pair, it returns two values,
0xDFB4, which is not
U+233B4 but which is nevertheless incorrect. The decoder does exhibit the last identified problem: the string
"\x2F\xC0\xAE\x2E\x2F" is converted to the ASCII string "/../". Furthermore, all 4 byte encodings such as
"\xF0\x90\x8C\x94" for U+10314 OLD ITALIC LETTER ES '𐌔' return nothing except 0xFFFF.
Initially I was a bit taken aback by these errors and I tried to create a "validating" UTF-8 decoder. After some more thought, I decided that the simpler and smaller decoder based on Bodmer code was quite sufficient for usage in a microcontroller. I can't really see that there would be security problems in that context.
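To make the behaviours listed above concrete, here is a minimal non-validating state machine written for this illustration. It is neither Bodmer's code nor the decoder found in gfxlatin1.cpp, and the names resetDecoderSketch and decodeUtf8Sketch are made up. It only follows the description given earlier: one byte in, a 16-bit value out, 0xFFFF when more input is needed or the byte cannot be used.

```cpp
#include <stdint.h>

static uint8_t pending = 0;  // continuation bytes still expected
static uint16_t partial = 0; // bits collected so far

void resetDecoderSketch() { pending = 0; partial = 0; }

// Feeds one byte to the state machine. Returns a 16-bit code point when
// one is complete, 0xFFFF otherwise. Nothing is validated.
uint16_t decodeUtf8Sketch(uint8_t c) {
  if ((c & 0x80) == 0x00) {          // 0xxxxxxx: 7-bit ASCII
    pending = 0;
    return c;
  }
  if (pending == 0) {
    if ((c & 0xE0) == 0xC0) {        // 110xxxxx: first of 2 bytes
      partial = (uint16_t)((c & 0x1F) << 6);
      pending = 1;
    } else if ((c & 0xF0) == 0xE0) { // 1110xxxx: first of 3 bytes
      partial = (uint16_t)((c & 0x0F) << 12);
      pending = 2;
    }
    // 4 byte lead bytes and stray continuation bytes are simply dropped
    return 0xFFFF;
  }
  pending--;                         // any other byte is taken as a continuation
  partial |= (uint16_t)((c & 0x3F) << (6 * pending));
  return pending == 0 ? partial : 0xFFFF;
}
```

Fed 0xC0 0x80 it returns 0xFFFF then 0x0000, fed 0x2F 0xC0 0xAE 0x2E 0x2F it yields the characters of "/../", and every byte of a 4 byte encoding gives 0xFFFF.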
There is a complication when it comes to the GFX Latin 9 encoding. The Latin 9 characters not found in Latin 1 all have Unicode code points that are greater than
0xFF. So a function called
recode maps those characters to GFX Latin 9 code points.
Instead of calling
recode() to decode UTF-8 strings. There is a slight problem with that function. The string "¤€" will be displayed as "€€". That's because the sequence "\xC2\xA4", the UTF-8 encoding for U+00A4 CURRENCY SIGN, character '¤', will be converted to 0xA4 by
decodeUTF8 and passed on as such by
recode while the sequence "\xE2\x82\xAC", the UTF-8 encoding for U+20AC EURO SIGN will be mapped to 0x20AC by decodeUTF8 and then converted to 0xA4 by
recode. I don't believe that is really a problem but those that prefer using the validating UTF-8 decoder might want to define the
INVALIDATE_OVERWRITTEN_LATIN_1_CHARS macro to flag the currency sign and the other overwritten Latin 1 characters as unmapped.
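A recoding step of this kind can be sketched as follows. It is not the recode function from gfxlatin9.cpp. The name recodeLatin9Sketch, the run-time flag standing in for the macro and the use of 0xFFFF for unmapped characters are all assumptions of the sketch. The eight mappings themselves are those of ISO 8859-15.

```cpp
#include <stdint.h>

// Maps a decoded 16-bit value to its ISO 8859-15 (Latin 9) code point,
// or returns 0xFFFF when the character is not part of Latin 9.
// With invalidateOverwritten set, the eight Latin 1 characters that
// Latin 9 replaced are reported as unmapped instead of passed through.
uint16_t recodeLatin9Sketch(uint16_t ucs, bool invalidateOverwritten = false) {
  switch (ucs) {
    case 0x20AC: return 0xA4; // € replaces ¤
    case 0x0160: return 0xA6; // Š replaces ¦
    case 0x0161: return 0xA8; // š replaces ¨
    case 0x017D: return 0xB4; // Ž replaces ´
    case 0x017E: return 0xB8; // ž replaces ¸
    case 0x0152: return 0xBC; // Œ replaces ¼
    case 0x0153: return 0xBD; // œ replaces ½
    case 0x0178: return 0xBE; // Ÿ replaces ¾
    case 0x00A4: case 0x00A6: case 0x00A8: case 0x00B4:
    case 0x00B8: case 0x00BC: case 0x00BD: case 0x00BE:
      return invalidateOverwritten ? 0xFFFF : ucs;
    default:
      return ucs <= 0xFF ? ucs : 0xFFFF;
  }
}
```

Without the flag, both U+00A4 and U+20AC end up as 0xA4, which is the "¤€" shown as "€€" case described above.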
Not too long ago, I posted a short hack (Lettres diacritiques françaises avec les polices GFXfont) which was designed to add just a few characters from the Latin-1 Unicode block. That approach is not of interest anymore and I hope to soon show how to adapt the current approach for the needs of the French language only.
Chris Young wrote Creating Custom Symbol Fonts for Adafruit GFX Library that could be used as a basis for doing something similar to what is done here. The original
fontconvert.c utility could be used to create a second GFXFont from the Latin-1 block. Then his
drawSymbol function could be modified to select either the ASCII font or the Latin 1 font to print a character based on its code point. It would probably still be useful to add the UTF-8 conversion routines to easily write UTF-8 strings to the display. In terms of memory usage, 2 blocks of 96 glyphs should be just about the same as one block of 192 glyphs and the switching from one GFXfont to another seems a bit awkward to me. For those reasons, I don't believe this approach has an advantage over the one proposed here, but it may be a good point to begin when first using the
GFX Library from Adafruit.
I have already referred to the Bodmer fork of the
Adafruit-GFX-Library. Not only did he add UTF-8 support to the library, he also modified the
fontconvert.c utility to work with any set of contiguous glyphs in the Unicode repertoire. For my needs this is not quite as useful because some of the displays I use are not supported by the Adafruit library and the characters for Latin 9 are not contiguous or even sequential. However, his UTF-8 conversion routine which is not dissimilar to the one available in the Arduino playground, proved to be an excellent launching point.
While reading an issue on the
Adafruit-GFX-Library repository, international character sets #64, I saw that Peter Jakobs (pljakobs) had tackled extended ASCII character sets. I need to look more closely at how he handled the Latin 2 encoding because he seems to have done the "recoding" in a way that is generally applicable albeit at the cost of using more memory and perhaps with slower conversions. I must also credit him for the idea of creating a wrapper function that takes care of the UTF-8 decoding.
This post will end with a list of downloads.
- UTF-8 to GFX Latin1 conversions: gfxlatin1.zip. These are the 4 UTF-8 decoding routines initially experimented with until deciding on a single approach.
- Testing environment (platformIO): gfxfont_8bit_v01.zip (includes the content of fontconvert8.zip, the
gfxlatin9 libraries, the "flawed"
validatingdecodeUTF8 decoders). The included example programs are really test programs; I obviously need to learn how to use the PlatformIO unit testing capabilities.
As usual the very lax 2 clause BSD licence applies to my source code available in the gfxfont_8bit_v01.zip archive. However the license of the original contributors must be respected. Links to what I believe are the original sources can be found in
gfxlatin1.cpp contained in the gfxlatin1.zip archive.