Replacement Character

This is not the title of a movie or a novel, but the official name of those annoying black diamonds with white question marks (�) that often show up when surfing the Web. Why do they still appear? The Web isn’t in its infancy anymore, so there should be a way to ensure that browsers display what page creators want visitors to see. The offending character, by the way, is the Unicode code point U+FFFD. Its purpose is “to replace an unknown or unrepresentable character.”

Funny story: In the beginning was ASCII, a set of 128 codes of 7 bits each that represented letters, numbers, and punctuation marks. Soon it was discovered that there were languages other than English. By utilizing all 8 bits of a byte, the coding possibilities were extended to 256. But that wasn’t enough for all the special characters out there, so different groups of extensions were created. In order for devices to understand which character set to use, the correct encoding had to be declared along the chain of devices. So far, so good. Yet with the dawn of interchangeable fonts, an additional problem surfaced: Not all fonts included all characters – hardly a recipe for success.

Later, Unicode became popular. It is a system that assigns a unique value (a code point) to every character and symbol; encodings then store each code point in one or several bytes. Here, as well, several different flavors have sprung up, such as UTF-8 and UTF-16. Today, many if not most web sites and e-mail programs seem to be using UTF-8. And again, fonts often make it impossible to represent all the necessary Unicode code points.
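To make "one or several bytes" concrete, here is a small Python sketch. The sample characters are my own picks for illustration; the only thing it relies on is Python's built-in `str.encode`.

```python
# Sample characters chosen for illustration: a plain letter, an umlaut,
# the euro sign, and an emoji (written as an escape to keep this file ASCII-safe).
samples = ["A", "\u00f6", "\u20ac", "\U0001F600"]

def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs to store a single character."""
    return len(ch.encode("utf-8"))

for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the code point, the byte count, and the raw bytes in hex.
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s), hex {encoded.hex()}")
```

Plain ASCII letters stay at one byte, which is why English text looks the same in ASCII and UTF-8, while everything else grows to two, three, or four bytes.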

The dawn of CMS software introduced yet another problem: The input masks for text had to convert to the correct encoding system, and the database which held the text had to hold it in the same encoding as the rest of the site. A good example of less than satisfactory representation is this snippet from a Yahoo mailing list, a typical result of encoding discrepancies between input, storage, and output:

EX: Fachw�rterbuch Verpackung (Kr�mer)
1990. Kompendium f�r den Verpackungsmarkt der EG. Maschinen, Techniken,
Werkzeuge und Materialien. Sprachen Deutsch, Englisch, Franz�sisch.

So if an encoding declaration is missing or wrong along the way ➜ � (unknown character). And if the encoding declarations are correct but your font doesn’t contain a certain character ➜ � (unrepresentable character).
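The first case is easy to reproduce. The following is a minimal Python sketch, assuming the text was stored as ISO-8859-1 and then read by something that expected UTF-8; I don't know which encodings were actually involved in the mailing-list example, so take the pairing as an assumption.

```python
# One word from the example above, as the author presumably typed it.
original = "Fachw\u00f6rterbuch"  # "Fachwörterbuch"

# Stored as ISO-8859-1: the o-umlaut becomes the single byte 0xF6.
latin1_bytes = original.encode("iso-8859-1")

# Read back as UTF-8: 0xF6 is not a valid UTF-8 start byte, so a lenient
# decoder substitutes U+FFFD, the replacement character.
garbled = latin1_bytes.decode("utf-8", errors="replace")

# Read back with the encoding it was actually written in: no damage.
restored = latin1_bytes.decode("iso-8859-1")

print(garbled)
print(restored)
```

Note that once the � has been written back to storage, the original byte is gone; the fix has to happen before the wrong decoding step, not after it.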

If you are in charge of your website, you can make sure that the encoding declarations and settings are correct. A declaration at the beginning of your HTML or PHP file, in your php.ini file, or even in the .htaccess file will do the trick. Making sure your database stores content with the right encoding is a bit trickier. The main point, however, is to instruct the database to use a particular character set, preferably UTF-8 to cover all bases. You’ll find some explanations and links for further reading here.
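One way to check that a file really is in the encoding you are about to declare is to decode it strictly and see whether that fails. This is only a sketch: `decodes_as` is a helper name I made up, and the sample strings are illustrative.

```python
def decodes_as(body: bytes, declared_charset: str) -> bool:
    """Return True if the bytes are valid in the declared charset."""
    try:
        body.decode(declared_charset)  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True

# The same name saved in two different encodings.
utf8_body = "Kr\u00e4mer".encode("utf-8")          # "Krämer"
latin1_body = "Kr\u00e4mer".encode("iso-8859-1")

print(decodes_as(utf8_body, "utf-8"))    # content matches the declaration
print(decodes_as(latin1_body, "utf-8"))  # content contradicts the declaration
```

The check is only meaningful in one direction. UTF-8 is strict enough to reject most text saved in another encoding, but ISO-8859-1 accepts every possible byte, so `decodes_as(utf8_body, "iso-8859-1")` returns True even though the result would display as garbage.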

There are good reasons to use UTF-8 as the encoding of choice. It covers all Unicode characters – including Chinese, Japanese, Thai, Sanskrit, etc. However, you often find other encodings, especially ISO-8859-1, and you may think that you don’t need the extent to which UTF-8 covers characters. After all, you only write in English and German, for example. Wrong. With ISO-8859-1, character positions 0080 to 009F contain no standard printable characters; what browsers show there usually comes from the Windows-1252 code page and may work only with Windows. In Windows-1252, this area contains, among others, the typographically correct single and double opening and closing quotes, the em dash and en dash, the trademark sign (™), the German single and double opening quotes, and the upper and lower case ligature of o and e, Œ and œ. If ISO-8859-1 is declared and you need one of these characters, it would be wise to write them as numeric entities (e.g. &#338; for Œ) for a better chance that the visiting browser knows what you’re talking about. Here is a chart of all defined ISO-8859-1 characters.
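Both points can be tried out in a few lines of Python. The sketch below uses Python's `cp1252` and `iso-8859-1` codecs; `to_numeric_entities` is a helper name of my own, built on the standard `xmlcharrefreplace` error handler.

```python
# The byte 0x8C sits in the disputed 80-9F range.
byte = b"\x8c"

as_cp1252 = byte.decode("cp1252")      # Windows-1252: the ligature Œ (U+0152)
as_latin1 = byte.decode("iso-8859-1")  # ISO-8859-1: U+008C, an invisible control code

def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric entity."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(repr(as_cp1252), repr(as_latin1))
print(to_numeric_entities("\u0152uvre \u2122"))  # "Œuvre ™" with entities
```

The output of the helper is pure ASCII, so it survives whatever encoding ends up being declared for the page.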

And finally, incorrect coding leading to � is often caused by the application used to prepare the text. As a rule of thumb, you need to prepare your text as UTF-8 without BOM. Yes, BOM! That’s a byte-order mark, a single invisible character (three bytes in UTF-8) often put at the very beginning of a document. If your web page includes PHP and contains a BOM, it may simply not run properly, because the BOM is sent to the browser as output before any PHP code executes. Only use editors that explicitly allow saving without BOM. Do not prepare your text in Word, ever. I am using Notepad++ with very good results. It has an Encoding menu where you can select (or convert to) UTF-8 without BOM.
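If you are not sure whether a file carries a BOM, you can check its first bytes. Here is a minimal Python sketch using the standard `codecs` module; the function names and the sample PHP line are made up for illustration.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 byte-order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if there is one."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data

# Simulate a PHP file that an editor saved as "UTF-8 with BOM".
with_bom = codecs.BOM_UTF8 + "<?php echo 'hi';".encode("utf-8")

print(has_utf8_bom(with_bom))
print(strip_utf8_bom(with_bom))
```

When reading text rather than raw bytes, Python's `utf-8-sig` codec does the same job: it decodes UTF-8 and silently drops a leading BOM if one is present.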

If you’re using WordPress and want to make sure that your Word-prepared text is represented properly, don’t simply copy and paste. That is going to invite all sorts of problems. Make sure you have selected Visual on the top right of your editing window, and if you don’t see a symbol with a clipboard and the letter W, toggle Show/Hide Kitchen Sink. Then click on the clipboard+W icon (Paste from Word) and paste your Word-created text into the pop-up Paste from Word box.

Really? This is 2013! How come character representation is still so ����� difficult?
