The first part of my Google Summer of Code project involves creating a library to handle Debian package control files, which, on closer inspection of the Debian Policy Manual, are actually encoded as UTF-8 (the 8-bit Unicode Transformation Format).
Initially, the files just “looked” like ASCII (a rather common issue with a backward-compatible encoding like UTF-8). You see, UTF-8 is, by design, byte-for-byte identical to ASCII unless non-ASCII characters (encoded as bytes with the high bit set) are used; this is so that files using only ASCII characters can still be interpreted as UTF-8.
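To make the compatibility point concrete, here is a minimal sketch in plain C of how one might check whether a buffer is "just ASCII". The function name is_pure_ascii is made up for this illustration and is not part of GLib or any other library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true when no byte in buf[0..len) has its
 * high bit set. Such a buffer is valid ASCII and, byte for byte, is also
 * valid UTF-8 with the same meaning. */
bool is_pure_ascii(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] & 0x80)
            return false;   /* this byte belongs to a multi-byte UTF-8 sequence */
    }
    return true;
}
```

A control file for which this returns true reads the same whether it is treated as ASCII or as UTF-8, which is exactly why the difference is easy to miss.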
All of this means that “multi-byte characters” are possible: characters that require more than one byte to encode, such as those used in many languages other than English. Working in plain C therefore becomes a bit tedious, as you have to handle these cases yourself.
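As a small taste of that tedium, the sketch below counts characters (strictly speaking, code points) in a UTF-8 string rather than bytes. It assumes the input is already valid UTF-8 and does no validation; utf8_codepoints is a name invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts code points in a NUL-terminated string that
 * is ASSUMED to be valid UTF-8. Continuation bytes have the bit pattern
 * 10xxxxxx; every other byte starts a new code point. */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            count++;        /* not a continuation byte, so a new code point */
    }
    return count;
}
```

For a name like "José" encoded in UTF-8, strlen reports 5 bytes while this function reports 4 code points, and every piece of code that indexes or truncates strings has to be aware of that difference.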
I had read about the GNOME Project’s GLib but had not looked at it in any depth until now. Much to my surprise, I discovered that it is an entire framework of portable C code, providing I/O handling, string manipulation and common data structures like trees and hash tables, among other things. It is Unicode-aware too: as far as I can tell, individual characters are handled internally as 32-bit code points (UTF-32) and text is written back out as UTF-8. I’m all about code reuse and, being lazy and not understanding Unicode in depth (after all, there is the common saying that “internationalisation is hard”), I decided that using GLib was the best way to go.
The unfortunate side effect of this is a bit of wasted space: even if every character fits in 8 bits, 32 bits are required to store it, meaning 24 bits of wasted space per character. An 8 KB ASCII file is roughly 8,000 characters, so about 24 KB (192 kilobits) is wasted for this file, which could very well be a moderately complex Debian control file. All in all, it’s not a big deal, and the text can be converted back to UTF-8 later if desired.
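The arithmetic is easy to check. The sketch below assumes every character is ASCII (one byte in UTF-8) and is stored in a 4-byte unit instead; utf32_overhead_bytes is a hypothetical name used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: extra bytes used when n_chars ASCII characters
 * (1 byte each in UTF-8) are each stored in a 32-bit (4-byte) unit. */
size_t utf32_overhead_bytes(size_t n_chars)
{
    assert(n_chars <= SIZE_MAX / 4);        /* guard against overflow */
    const size_t utf8_bytes  = n_chars * 1;
    const size_t utf32_bytes = n_chars * 4;
    return utf32_bytes - utf8_bytes;
}
```

For 8,000 characters this gives 24,000 bytes, which is 192,000 bits of overhead.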
Admittedly, the documentation available online in the GNOME Library (the GLib Reference Manual) leaves something to be desired. In particular, each function is explained in terms of its parameters, input and output, but no trivial example of its use is given. For some functions it is unclear how they are meant to be used. For string manipulation in GLib 2.20, for example, the documentation describes functions with a signature like:
    GString* g_string_append_len(GString *string, const gchar *val, gssize len);
In this case, it’s unclear what the returned GString is useful for. C functions like strcat (don’t use that, by the way; use strlcat instead where it’s available) update the destination string in place, and although strcat also returns a pointer to the destination, that return value is rarely used.
On the other hand, without knowing the intentions of the developers, I can imagine it being designed to ease the creation of bindings for languages that have no notion of returning a result through a pointer argument, such as Perl. In my opinion, though, the particulars of that should be left to the implementation of the bindings; that is, the XS glue code should make provisions for it.
Update: From reading the body of other functions, it seems that the return value is used to allow nested calls, like: g_string_ascii_down(g_string_append_len(…)). It’s sort of neat, really; it shows some design foresight in the library.
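To see why returning the pointer makes nesting possible, here is a stripped-down stand-in written in plain C. To be clear, this is not GLib code: Buf, buf_append_len and buf_ascii_down are hypothetical names invented for this sketch, and the fixed-size buffer is a simplification (a real GString grows as needed). Only the "modify in place, then return the same pointer" pattern is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-capacity string buffer; NOT GLib's GString. */
typedef struct {
    char   data[64];
    size_t len;
} Buf;

/* Appends up to n bytes of val (truncating if the buffer is full) and
 * returns buf itself. val must point to at least n readable bytes. */
Buf *buf_append_len(Buf *buf, const char *val, size_t n)
{
    assert(buf != NULL && val != NULL && buf->len < sizeof buf->data);
    size_t room = sizeof buf->data - 1 - buf->len;
    if (n > room)
        n = room;
    memcpy(buf->data + buf->len, val, n);
    buf->len += n;
    buf->data[buf->len] = '\0';
    return buf;             /* same pointer that was passed in */
}

/* Lowercases ASCII letters in place and returns buf itself. */
Buf *buf_ascii_down(Buf *buf)
{
    assert(buf != NULL);
    for (size_t i = 0; i < buf->len; i++) {
        char c = buf->data[i];
        if (c >= 'A' && c <= 'Z')
            buf->data[i] = (char)(c - 'A' + 'a');
    }
    return buf;
}
```

With a zero-initialised Buf b, the nested call buf_ascii_down(buf_append_len(&b, "Package: Foo", 12)) appends and then lowercases in a single expression, and the pointer it returns is simply &b again. Nothing new is allocated; the return value only exists to make the calls composable.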