Not very long ago, I uploaded a set of command line utilities to a GitHub repository called poutils. These utilities, which manipulate the .po translation files automatically created by the Lazarus IDE when internationalization is enabled, are written in Free Pascal. However, they were initially written in Delphi in a haphazard fashion over a number of years when I used dxgettext to internationalize some applications. In retrospect, it would have been better to study the GNU gettext translation system more closely before uploading these programs. This text presents some of the information obtained from the GNU project and some experimentation done to identify the way Free Pascal and Lazarus implement gettext.
There is a unit called gettext in the Free Pascal run time library. It provides procedures to translate resource strings from MO files, which are compiled PO files. This is only a small part of what is called the Lazarus implementation of [GNU] gettext below. Hopefully, this will not cause confusion.
Table of Contents
- PO Files
- Simple Lazarus Example
- With a Little Help
- Extended Translations
- Edge Cases
- Message Context and MO Files
- References and Conclusion
PO Files
Have I earned credits for not naming this section "PO Files In Courage"? It was very tempting.
A PO file, also called a gettext catalogue, is a text file which contains a list of strings used in a program along with their translation into another natural language. Each language into which a program has been translated has its own .po file. The original gettext system (and dxgettext) used a complicated hierarchy of directories to identify the natural language found in each file. As is often the case, the Lazarus team simplified the approach and the convention is now to store all translation files in a subdirectory named locale. The natural language of each translation is specified with a language code suffix (usually a two-letter code, but there are exceptions) before the file extension which is always .po. Thus app.fr.po and app.es.po are, respectively, the French and Spanish translations of an application named app.
A translation file is composed of entries for each string to be translated. Each entry must contain a pair of strings: the untranslated string tagged with the msgid keyword followed by its translation marked with the msgstr keyword. When a program is launched, each of its untranslated strings is replaced with its translation except if the translated string happens to be blank. This is a one-time operation and should not impact execution time afterwards. The syntax of the basic entry is as follows.
    msgid "untranslated-string"
    msgstr "translated-string"
The opening and closing quotes are not part of the string but are useful as they allow for strings with leading or trailing spaces. If no translation is available, then the msgstr field of the entry would be the empty string "". While the msgstr field can be an empty string, the msgid field can never be blank, which makes sense. There is a single exception, called the header entry, to be discussed later.
For convenience, long strings can be divided into manageable sized chunks.
I believe the following layout is just as valid but not usually used.
The breaks between chunks have no meaning. The untranslated string will be a concatenation of all the chunks, so the untranslated string in the last example is
By the same token, there is no requirement that the translated string be broken up into the same number of chunks, or that the chunks correspond and so on. Of course it is possible to have multiline strings in a program such as when a label has a three-line caption. This can be done by identifying the ends of lines with the usual
\n escape sequence as shown in the following example.
Again this could be written in chunks, with each chunk ending with the new line escape sequence to make the layout of the string more obvious
As a matter of fact, "untranslated-string and translated-string [respect] the C syntax for a character string, including the surrounding quotes and embedded backslashed escape sequences." according to the GNU documentation. The basic escape sequences \n for a new line, \t for a tabulation, \" for a double quote and \\ for the escape character \ will work in Free Pascal. I have not investigated more esoteric escape sequences.
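To make the chunk and escape rules concrete, here is a minimal Free Pascal sketch that decodes quoted chunks and concatenates them. The names DecodeChunk and JoinChunks are made up for this illustration, and only the four escape sequences listed above are handled. This is not the code used by Lazarus, just a model of the rules described so far.

```pascal
program podecode;

{$mode objfpc}{$H+}

uses
  SysUtils;

{ Decodes one quoted chunk of a msgid or msgstr field.
  Hypothetical helper, written for this illustration only. }
function DecodeChunk(const Chunk: string): string;
var
  s: string;
  i: Integer;
begin
  Result := '';
  s := Trim(Chunk);
  // The surrounding double quotes are not part of the string.
  if (Length(s) >= 2) and (s[1] = '"') and (s[Length(s)] = '"') then
    s := Copy(s, 2, Length(s) - 2);
  i := 1;
  while i <= Length(s) do
  begin
    if (s[i] = '\') and (i < Length(s)) then
    begin
      Inc(i);
      case s[i] of
        'n': Result := Result + #10;  // new line
        't': Result := Result + #9;   // tabulation
        '"': Result := Result + '"';  // double quote
        '\': Result := Result + '\';  // the escape character itself
      else
        // Unknown escape sequence: keep it unchanged.
        Result := Result + '\' + s[i];
      end;
    end
    else
      Result := Result + s[i];
    Inc(i);
  end;
end;

{ Concatenates the decoded chunks; the breaks between chunks have no meaning. }
function JoinChunks(const Chunks: array of string): string;
var
  i: Integer;
begin
  Result := '';
  for i := Low(Chunks) to High(Chunks) do
    Result := Result + DecodeChunk(Chunks[i]);
end;

begin
  // A long string split into two chunks yields a single line.
  WriteLn(JoinChunks(['"This is a long string "', '"split into chunks."']));
  // A three-line caption, one chunk per line of the caption.
  WriteLn(JoinChunks(['"Line 1\n"', '"Line 2\n"', '"Line 3"']));
end.
```

The same decoding applies to msgid and msgstr, which is why the translated string does not have to be split into the same chunks as the untranslated one.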
The GNU specification includes other elements in an entry. Here is a definition of a standard entry.
    white-space
    # translator-comments
    #. extracted-comments
    #: reference…
    #, flag…
    #| msgctxt previous-context
    #| msgid previous-untranslated-string
    msgctxt context
    msgid untranslated-string
    msgstr translated-string
While gettext tools, including the Lazarus implementation, generate a single blank line between entries, I believe this is optional. In GNU gettext all lines that begin with the comment character "#" are optional. In the Lazarus implementation, the reference comment that begins with "#: " is mandatory; otherwise the entry will be treated as a header, or ignored if a header has already been defined. The other types of comments are optional in the Lazarus implementation. The context element (msgctxt) is optional in GNU gettext and Lazarus. As far as I know, the Lazarus IDE does not generate extracted-comments which "xgettext program extracts from the program’s source code." Similarly, I have not seen previous context elements in a .po file generated by the Lazarus IDE but that is not evidence that they are never present.
References are "references to the program's code." In the Lazarus implementation, a reference for an entry is the fully qualified name, in lower case, of the entity that owns the untranslated string. For example, suppose the application has a form of type TForm2 with a label named Label5 whose caption is set to "Value". Then the generated .po file would contain these two entries at a minimum.
Of course, these references are guaranteed to be unique if two conditions are satisfied: the program compiles without error and no human has touched the generated
.po file. Unfortunately, that last condition does not obtain in the author's home.
A message context (msgctxt) is used to resolve ambiguity whenever an untranslated string appears more than once. A unique context should then be added to each entry to differentiate them. The Lazarus IDE automatically generates any required context, but it is just the entry reference in quotes. More on this later.
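The rule for generating a context can be sketched in a few lines of Free Pascal. The record TPoEntry and the functions MakeEntry, IsAmbiguous and EntryText are hypothetical names invented for this example; the sketch only mimics the layout the IDE appears to produce and does not escape quotes inside strings.

```pascal
program ctxsketch;

{$mode objfpc}{$H+}

type
  TPoEntry = record
    Reference: string; // text that follows '#: '
    MsgId: string;     // untranslated string
    MsgStr: string;    // translated string, possibly empty
  end;

function MakeEntry(const AReference, AMsgId, AMsgStr: string): TPoEntry;
begin
  Result.Reference := AReference;
  Result.MsgId := AMsgId;
  Result.MsgStr := AMsgStr;
end;

{ True when more than one entry carries the same untranslated string. }
function IsAmbiguous(const Entries: array of TPoEntry;
  const AMsgId: string): Boolean;
var
  i, n: Integer;
begin
  n := 0;
  for i := Low(Entries) to High(Entries) do
    if Entries[i].MsgId = AMsgId then
      Inc(n);
  Result := n > 1;
end;

{ Text of one entry; msgctxt is the reference in quotes, added only
  when the msgid is ambiguous. }
function EntryText(const Entries: array of TPoEntry; Index: Integer): string;
begin
  Result := '#: ' + Entries[Index].Reference + LineEnding;
  if IsAmbiguous(Entries, Entries[Index].MsgId) then
    Result := Result + 'msgctxt "' + Entries[Index].Reference + '"' + LineEnding;
  Result := Result + 'msgid "' + Entries[Index].MsgId + '"' + LineEnding
    + 'msgstr "' + Entries[Index].MsgStr + '"';
end;

var
  Catalogue: array[0..2] of TPoEntry;
  i: Integer;
begin
  Catalogue[0] := MakeEntry('main.shello', 'Hello', '');
  Catalogue[1] := MakeEntry('tform1.button1.caption', 'Close', '');
  Catalogue[2] := MakeEntry('tform1.caption', 'Hello', '');
  // The two "Hello" entries get a msgctxt line, the "Close" entry does not.
  for i := Low(Catalogue) to High(Catalogue) do
    WriteLn(EntryText(Catalogue, i), LineEnding);
end.
```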
From the GNU gettext documentation, I assume that a flag line would look like this:
where the format-strings (which could be
no-python-format etc.) specify the type of format (things like %s, %d, %2.f and so on) used in the untranslated string. As far as I know, only the
fuzzy keyword is processed by the Lazarus internationalization system, but anything else in the comment will be untouched.
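As an illustration of how a flag comment could be examined, the following Free Pascal sketch tests whether a line starting with "#," contains the fuzzy keyword among its comma-separated flags. HasFuzzyFlag is a hypothetical name, and the sketch assumes that flags are separated by commas as in the GNU documentation; it is not taken from the Lazarus sources.

```pascal
program fuzzyflag;

{$mode objfpc}{$H+}

uses
  SysUtils;

{ Returns True if Line is a flag comment ("#, ...") listing the fuzzy flag. }
function HasFuzzyFlag(const Line: string): Boolean;
var
  Rest, Flag: string;
  p: Integer;
begin
  Result := False;
  if Copy(Line, 1, 2) <> '#,' then
    Exit; // not a flag comment
  Rest := Copy(Line, 3, MaxInt);
  while Rest <> '' do
  begin
    // Isolate the next comma-separated flag.
    p := Pos(',', Rest);
    if p = 0 then
      p := Length(Rest) + 1;
    Flag := Trim(Copy(Rest, 1, p - 1));
    if Flag = 'fuzzy' then
      Exit(True);
    Delete(Rest, 1, p);
  end;
end;

begin
  WriteLn(HasFuzzyFlag('#, fuzzy'));            // TRUE
  WriteLn(HasFuzzyFlag('#, fuzzy, c-format'));  // TRUE
  WriteLn(HasFuzzyFlag('#, c-format'));         // FALSE
  WriteLn(HasFuzzyFlag('#: tform1.caption'));   // FALSE
end.
```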
The header entry, which must be the first entry in the .po file, is the only entry that has an empty msgid. Its msgstr contains metadata. Here is the basic entry created by the IDE.
    msgid ""
    msgstr "Content-Type: text/plain; charset=UTF-8"
It is not rare to encounter headers with much more information
In addition there is usually some information about the translator's identity and about the date of translation and so on. But what's included is beyond the scope of this post. Also outside of what will be examined in this post is the third type of entry which deals with plural forms. This is just laziness on my part, but perhaps the subject will be reexamined in the future.
Simple Lazarus Example
Let's experiment with the Lazarus internationalization capabilities with a simple program that does nothing except display three captions and close itself when the button is pressed. Here is its main form in the IDE designer.
It contains a label and a button with text that will be translated. The caption of the button is set to "Close" at design time, while the caption for the label is set at run time in the
FormShow event. Its value is set equal to the content of the
SHello resource string defined in the source code.
When executed, this is the appearance of the form.
Enable internationalization in the IDE. To do this, bring up the Options for Project: dialog by either pressing the Ctrl+Shift+F11 key combination or navigating through the menu system: Project, then Project Options. Select the i18n option in the left panel and then activate the Enable i18n checkbox. The PO Output Directory should also be specified.
As displayed, I chose languages, but that is just for ease while examining the Lazarus gettext implementation. Normally I use langs for the PO Output Directory and avoid languages and locale because these are the traditional input directories that contain the distributed or manually created translation files. It would not do to let the Lazarus IDE overwrite modified translation files.
Build the project and a file named
test.po will be created in the
languages directory when the program is compiled. Here it is.
This is a correctly formed translation file with a header as the first entry and additional valid entries for the visual components and resource strings. Each of those begins with a reference identified by the
#: comment marker. The reference is the fully qualified name of the text property in lowercase. The
msgid field contains the untranslated string in quotes. The
msgstr field which may eventually contain the translated string is empty. It's up to the translator to supply the missing translations.
Since this file has no translations, it is said to be the translation template. As such, it would probably be named test.pot in standard gettext systems, but Lazarus chose not to use the .pot extension. Perhaps using the extension would be a problem on systems where Microsoft Office uses the extension for other purposes.
The IDE also created a
test.lrj file and saved it alongside the
main.pas source code.
As can be seen, this is a JSON-formatted file. Clearly the value associated with the "name" key is the entry reference as stored in the .po catalogue. Similarly, the value associated with the "value" key is the untranslated string. By changing the caption of label1 to "été" it was possible to determine that the array of bytes associated with the key "sourcebytes" is the UTF-8 encoding of the untranslated string while the "value" is stored as a string with Unicode escape values for all non-ASCII characters.
By changing the label's name to
label2 and noticing no change in the "hash" value, it can be surmised that it is the untranslated string which is hashed.
So that is the content of the file described, but what is its purpose? It is probably used by the IDE to track changes when it rebuilds the files in the
PO Output Directory. A lot more happens behind the scenes when translations are added, but these
.lrj files merit special mention because it "... is very important that you include [
.lrj files] with your source code in the version system you're using, don't add that file to ignored (say .gitignore), else your translations will be broken." (Source: Translations / i18n / localizations for programs)
As a first experiment copy the template file to
test.fr.po within the same directory and provide translations for the resource string and button in the latter:
I used a simple text editor to add the two words "Allô", and "Fermer", but
poedit or a similar translation utility could be used. In that case, expect to have a much bigger header, but that has no real consequence. Here is the command to start the program with the French translation no matter what the language settings for the system are.
This will prove disappointing as neither translation is displayed. The missing ingredient is the DefaultTranslator unit which needs to be added to the uses clause of the main unit.
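For readers who want to reproduce the experiment, here is a sketch of what the main unit could look like once DefaultTranslator is added. It assumes a form class TForm1 holding Label1 and Button1, as the references in the translation files suggest; the handler names FormShow and Button1Click are assumptions. The unit is meant to be compiled inside the Lazarus project because it relies on the main.lfm form file created by the designer.

```pascal
unit main;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, Forms, Controls, Graphics, Dialogs, StdCtrls,
  DefaultTranslator; // selects and loads a translation when the program starts

type

  { TForm1 }

  TForm1 = class(TForm)
    Button1: TButton;  // caption "Close" set at design time
    Label1: TLabel;    // caption set at run time in FormShow
    procedure Button1Click(Sender: TObject);
    procedure FormShow(Sender: TObject);
  end;

var
  Form1: TForm1;

implementation

{$R *.lfm}

resourcestring
  SHello = 'Hello'; // appears as main.shello in the translation files

{ TForm1 }

procedure TForm1.Button1Click(Sender: TObject);
begin
  Close;
end;

procedure TForm1.FormShow(Sender: TObject);
begin
  // By the time the form is shown the resource string has been translated.
  Label1.Caption := SHello;
end;

end.
```

Nothing else in the program refers to DefaultTranslator; including the unit is enough because its work is done when the unit is initialized.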
Now the needed code to search for the
.mo file and to use it to translate the application will be included. Compile the modified source and launch the
test program again.
This time the label and button captions are translated. That shows just how simple it is to internationalize Lazarus programs. The hard work is translating the template into other languages.
Copy the template file to
test.es.po within the same directory and provide translations for the resource string and button in the latter:
Because of my woefully inadequate knowledge of most languages, that complicated translation was done with the help of the Web with all its inherent potential for errors. My apologies to Hispanophones if the translations make no sense. To take advantage of this new translation, there's no need to recompile the application, just launch it with the appropriate language flag:
test --lang es.
This reveals the power of this system. Anyone can translate string resources of an i18n enabled program if supplied with an accurate template file.
The system is actually better than what has been shown. A user will normally not need to specify the language to be used. The DefaultTranslator unit will use the system locale to load the appropriate language file if it exists. The locale of my system is fr_CA which stands for French in Canada. Failing to find a regionalized French version named test.fr_CA.po, the program will load the test.fr.po translation when test is launched without an overriding --lang command line option (see the caveat below).
If a French or Spanish user wanted to see the original language, then the --lang parameter can be used to override the automatic selection of the language file. Specify zebra or any other "language" for which a translation file is not provided. In that case DefaultTranslator will not load a translation file and the default language strings will be displayed. Another way to achieve the same goal is to rename the languages directory to something which is not searched.
With a Little Help
The Lazarus gettext implementation provides some welcome help for translators by suggesting translations when possible. Change the Form1 caption to "Hello" in the IDE with the form designer. Build the program and execute it specifying that the French translation is to be used. The result is probably what one would expect.
The form caption is "Hello" because that is what was specified in the form designer. However, look at the French translation file which has been updated by the IDE.
There is now a suggested translation for the form caption. But that is only a proposal because the
fuzzy flag was added to the
tform1.caption entry. This flag is used by the translation mechanism coded into the application to skip the suggested translation and to use the untranslated string which in this case was "Hello". The flag is also a signal for the translator that the entry needs to be reviewed. Remove the
#, fuzzy line from
test.fr.po and launch the application again. No need to recompile it. The form caption will now be translated.
Note the addition of the context line
msgctxt in the two entries of the PO file that have the same "Hello"
msgid. This will be discussed further on.
The IDE did the same thing, mutatis mutandis, with the Spanish translation file. It would have done the same with all other .po files in the PO Output Directory.
Again, all this is very useful. Presumably, the programmer could pass on the
.po files to the translators and ask them to look at the "fuzzies". They could decide that the suggested translation is correct and just remove the
#, fuzzy line in the entry. Instead they could decide that the suggested translation is not correct in the given circumstance and write in the correct translation in the
msgstr field and then remove the
#, fuzzy flag.
Extended Translations
Instead of removing the fuzzy flag, remove the whole form caption entry in the translation catalogue and run the application. The form caption will still be translated.
Clearly, when faced with a string which does not have a specific translation, the translation system will search for a PO entry with the same untranslated string and, if one is found, it uses that entry's translated string. This does raise a number of questions.
- Is any string used in a program translated if an entry in the PO/MO file with the same msgid and with a translation is found?
No. Only resource strings are translated. These are the string properties of visual components (or objects compiled with the generate type info directive $M or $TYPEINFO?) and all declared resourcestring. That was probably obvious to everyone, I was just dotting i's and crossing t's.
- If there is more than one entry with the same msgid, which one is used to translate an orphaned string?
The first non-empty translation for the given msgid found in the PO/MO file is used.
    msgid ""
    msgstr "Content-Type: text/plain; charset=UTF-8"

    #: main.astring
    msgctxt "main.astring"
    msgid "Hello"
    msgstr "Salut"

    #: main.shello
    msgctxt "main.shello"
    msgid "Hello"
    msgstr "Bonjour"

    #: tform1.button1.caption
    msgid "Close"
    msgstr "Fermer"

    #: tform1.label1.caption
    msgid "Label1"
    msgstr ""
- Is a PO entry with a blank translation the same as no entry at all?
No. Extended translation occurs only if there is no entry with a reference to the resource string. If there is an entry with a reference to tform1.caption then the caption will be set to the msgstr content. If that happens to be blank, then the string will not be translated even if there should be another entry in the PO file with the same untranslated entry.
    msgid ""
    msgstr "Content-Type: text/plain; charset=UTF-8"

    #: main.shello
    msgctxt "main.shello"
    msgid "Hello"
    msgstr "Bonjour"

    #: tform1.button1.caption
    msgid "Close"
    msgstr "Fermer"

    #: tform1.caption
    msgctxt "tform1.caption"
    msgid "Hello"
    msgstr ""

    #: tform1.label1.caption
    msgid "Label1"
    msgstr ""
In other words, when a translation is left blank in the PO/MO file, this is interpreted as an instruction to not translate the string. When there is no PO entry for a resource string, then the system will translate it if it finds a translation among the entries in the PO file.
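The two rules just stated can be written down as a small Free Pascal model. It is only a sketch of the observed behaviour, not the code found in the LCL: TPoEntry, MakeEntry and Translate are names invented here, fuzzy entries are not modelled, and the anomalies reported in the next section show that the real implementation does not always follow these rules.

```pascal
program lookupsketch;

{$mode objfpc}{$H+}

type
  TPoEntry = record
    Reference: string; // text that follows '#: '
    MsgId: string;     // untranslated string
    MsgStr: string;    // translated string, possibly empty
  end;

function MakeEntry(const AReference, AMsgId, AMsgStr: string): TPoEntry;
begin
  Result.Reference := AReference;
  Result.MsgId := AMsgId;
  Result.MsgStr := AMsgStr;
end;

{ Model of the lookup: Reference identifies the resource string and
  Original is its untranslated value. }
function Translate(const Entries: array of TPoEntry;
  const Reference, Original: string): string;
var
  i: Integer;
begin
  // Rule 1: an entry with the right reference decides the outcome alone.
  // A blank msgstr means "do not translate".
  for i := Low(Entries) to High(Entries) do
    if Entries[i].Reference = Reference then
    begin
      if Entries[i].MsgStr <> '' then
        Exit(Entries[i].MsgStr);
      Exit(Original);
    end;
  // Rule 2: no entry for the reference, so use the first non-empty
  // translation of the same untranslated string.
  for i := Low(Entries) to High(Entries) do
    if (Entries[i].MsgId = Original) and (Entries[i].MsgStr <> '') then
      Exit(Entries[i].MsgStr);
  Result := Original;
end;

var
  NoCaptionEntry, BlankCaptionEntry: array[0..2] of TPoEntry;
begin
  // Catalogue without an entry for tform1.caption.
  NoCaptionEntry[0] := MakeEntry('main.astring', 'Hello', 'Salut');
  NoCaptionEntry[1] := MakeEntry('main.shello', 'Hello', 'Bonjour');
  NoCaptionEntry[2] := MakeEntry('tform1.button1.caption', 'Close', 'Fermer');
  WriteLn(Translate(NoCaptionEntry, 'tform1.caption', 'Hello'));    // Salut

  // Catalogue with a blank translation for tform1.caption.
  BlankCaptionEntry[0] := MakeEntry('main.shello', 'Hello', 'Bonjour');
  BlankCaptionEntry[1] := MakeEntry('tform1.button1.caption', 'Close', 'Fermer');
  BlankCaptionEntry[2] := MakeEntry('tform1.caption', 'Hello', '');
  WriteLn(Translate(BlankCaptionEntry, 'tform1.caption', 'Hello')); // Hello
end.
```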
As a programmer I often thought it was clever to simplify the translation file by removing entries to benefit from the automatic translation feature described above. I try not to do that anymore because sometimes "Bonjour" is an appropriate translation of "Hello", but at other times "Allô" or "Salut" might be better. Leaving the entries the gettext system places in the template file and letting translators decide how to best translate duplicated untranslated strings is a much better strategy.
Edge Cases
There is no need to follow me down the following rabbit hole because it's an investigation into what happens if the PO file is not correct. The following table shows the displayed form caption depending on the content of the PO file. The program was compiled only once with the resource string for the form caption, as specified in the
.lfm file, being "Hello". The tests, divided into 6 groups of four, consisted of changes to the PO file only. In the first two groups of tests, the PO file contained only 2 entries. There was entry 0, which is not shown in the table and which is the header. For the first 4 tests, entry 1 is the translation of the form caption. In the first two tests, the
msgid for the caption is the same as its resource string and, not surprisingly, the displayed caption is translated as expected. In the third and fourth tests of the first group, the
msgid is incorrect. Nevertheless, the displayed caption is not the original value of the resource string, which shows the importance of the
reference field in PO files within the Lazarus implementation.
In the second group of tests (5 to 8), there is no entry for the form caption in the PO file. Instead there is an entry with reference to a non-existing entity. Nevertheless, as test 6 shows, this translation will be used to translate the caption. As tests 7 and 8 show, entry 1 has no impact when it contains a
msgid which does not correspond to the caption resource string.
The third and fourth groups of tests (9 to 16) show that an entry with a translation of the form caption resource string has no impact on the translation of the caption if there is an entry with the correct reference to the form caption. And that remains true no matter if the entry for the caption has a translation or not. The order in which the entries appear is of no consequence.
|Entry 1||Entry 2||Displayed Caption|
The last two groups of tests leave me nonplussed. Compare tests 15, 19 and 23. In all three cases, there is a valid entry for the form caption with the correct reference tform1.caption and instructions that the resource string should not be translated although the msgid is not valid. But somehow the presence and order of appearance of the extraneous entry have an impact on the translation shown. This is not the only anomaly that was encountered. Consider this example.
That was surprising because it seemed plausible that given the correct reference to the SHello resource string and its empty msgstr, the label caption should have been untranslated and displayed as "Hello".
Message Context and MO Files
As mentioned above, two entries with the same untranslated string in a PO file are considered ambiguous and the GNU gettext system adds a message context field, called msgctxt, to differentiate the entries. The Lazarus implementation also adds a msgctxt field when the same untranslated string occurs more than once and this field is invariably the entry reference in quotes. This duplication may appear to not be that useful, but in actuality this approach satisfies a GNU gettext requirement while preserving the Lazarus reliance on the reference field.
An MO file (extension .mo) is a binary file compiled from a PO file. In principle the compiled file is smaller and faster than the .po file. The set of tools provided with Lazarus does not include a compiler but the GNU gettext project provides one named msgfmt. It is included by default in Mint 20.1 and probably many other Linux distributions, but it is probably not included in Windows. Maybe that is why the Free Pascal wiki recommends using the poedit translation editor to create MO files.
Both these compilers can be used to generate MO files from PO files created by the Lazarus gettext implementation. However, they ignore the reference field. On the other hand, the msgctxt field is used to provide unique translations for ambiguous entries. By keeping both fields identical, the Lazarus gettext implementation ensures compatibility with the original gettext system. At least that's my interpretation of what is happening and I'll stick to it until evidence to the contrary is provided.
It is telling that Lazarus does not provide a compiler and, more to the point, does not ship with any compiled
.mo files. The many translations of the IDE itself are stored in
.po files only. For my use case, distributing MO files would be a mistake. It only makes sense to do that when one wants to concentrate in one's hands the control for translations which is exactly what I cannot do.
A misleading statement was made about the automatic search for the regionalized version of a national language depending on the system's locale. That is not quite true. As I said, the
LANG environment variable in my system is
fr_CA.UTF-8. Accordingly, the automatic translation mechanism (enabled with the inclusion of the
DefaultTranslator unit in the program's
uses statement) searches for the file
test.fr_CA.UTF-8.po in all the usual places. If such a file is not found, then the search is repeated for a file named
test.fr.po, but a search for
test.fr_CA.po is never performed. I am not sure if this should be seen as normal behaviour or it should be considered "a feature" or even a bug.
This problem came up in 2017 when I wanted to translate console applications. In my unittranslator.pas unit, which replaces LCLTranslator to avoid the latter's dependence on the LCL package, the GetLang function will remove the UTF-8 encoding suffix from the system locale. That works for my needs but I do not know if it is a generally acceptable solution.
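Removing the encoding suffix is easy to sketch in Free Pascal. The functions StripEncoding and LanguageOnly below are hypothetical and written for this text; they are not the GetLang function of unittranslator.pas. The sketch assumes a locale name of the form language_REGION.encoding and makes no attempt to handle modifiers such as @euro.

```pascal
program langsketch;

{$mode objfpc}{$H+}

{ Removes an encoding suffix such as '.UTF-8' from a locale name. }
function StripEncoding(const Locale: string): string;
var
  p: Integer;
begin
  p := Pos('.', Locale);
  if p > 0 then
    Result := Copy(Locale, 1, p - 1)
  else
    Result := Locale;
end;

{ Keeps only the language part of a locale name. }
function LanguageOnly(const Locale: string): string;
var
  p: Integer;
begin
  Result := StripEncoding(Locale);
  p := Pos('_', Result);
  if p > 0 then
    Result := Copy(Result, 1, p - 1);
end;

begin
  // File names that could be tried, in order, for the LANG value fr_CA.UTF-8.
  WriteLn('test.' + StripEncoding('fr_CA.UTF-8') + '.po'); // test.fr_CA.po
  WriteLn('test.' + LanguageOnly('fr_CA.UTF-8') + '.po');  // test.fr.po
end.
```

With such a helper, a regionalized translation named test.fr_CA.po would be tried before falling back on test.fr.po.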
References and Conclusion
The principal reference for translating Lazarus programs is the Translations / i18n / localizations for programs Free Pascal wiki page. Additional information is found in the Everything else about translations page. There are many other wiki pages dedicated to the subject. Do be careful, some of these are old and contain advice that technically may be correct but is nevertheless out of date. Translating Lazarus programs is really as simple as described in the principal reference wiki or as described above.