Lately I’ve been working with a friend on a daily-deal aggregator. Groupon-like sites are popping up everywhere and the market for aggregators is still fairly unfilled. My project, Alladeals, targets the Swedish daily deals market, and as such it needs to support Swedish characters. In the future it might have to support other languages as well, so I decided that UTF-8 was the way to go. Since most webpages are encoded in UTF-8 these days, it has been fairly painless to work with UTF-8 in PHP. That is, until yesterday.
PHP does not natively support UTF-8: its strings are plain sequences of bytes, and the core string functions operate on bytes. This is important to keep in mind when dealing with UTF-8 encoded data in PHP. Usually I’m pretty good at remembering that, but yesterday I happened upon a bug which could easily have gone unnoticed for months if not for some good luck.
The bug manifested itself in the deal titles. The design is not well suited for really long titles, so it was decided that titles should not exceed 140 characters. To cut the title, the following code was used:
$title = substr($deal['title'], 0, 140);
Catch the error? Remember that PHP does not natively support UTF-8? This means that functions like substr don’t count characters, despite what the PHP manual says:
“the string returned will contain at most length characters beginning from start.”
Rather, substr counts bytes. This works fine for single-byte character encodings, but UTF-8 is a multi-byte encoding, meaning that some characters take more than one byte. Swedish å, ä and ö, for example, take two bytes each. So if the 140th byte of a string falls inside a multi-byte character, the cut lands in the middle of that character, resulting in one of those lovely question-mark-on-a-black-background characters.
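To make the byte counting concrete, here is a small sketch. The word and the variable names are made up for illustration, and the bytes are written as escapes so the result does not depend on how the source file is saved:

```php
<?php
// "blåbär" written with explicit UTF-8 bytes:
// "å" is 0xC3 0xA5 and "ä" is 0xC3 0xA4, two bytes each.
$word = "bl\xC3\xA5b\xC3\xA4r"; // 6 characters, 8 bytes

// strlen() counts bytes, not characters.
$byteLength = strlen($word);

// Cutting after 3 bytes keeps "b", "l" and only the first byte of "å".
$cut = substr($word, 0, 3);

// Hex dump of the result: 62 ("b"), 6c ("l"), c3 (a dangling lead byte).
$cutHex = bin2hex($cut);

echo $byteLength, "\n"; // 8
echo $cutHex, "\n";     // 626cc3
```

The trailing c3 is the first half of "å" with its second byte missing, which is exactly what a browser renders as the replacement character.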
Luckily PHP has the multibyte string extension (mbstring), which implements a lot of the standard string functions in a multi-byte safe way. As long as the extension is enabled, fixing our bug is as easy as changing the code to the following:
$title = mb_substr($deal['title'], 0, 140, 'UTF-8');
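If the same truncation is needed in several places, it could be wrapped in a small helper. This is only a sketch: truncateTitle is a hypothetical name, not something from the Alladeals code, and it assumes PHP 7 or later with mbstring enabled and titles that are already valid UTF-8.

```php
<?php
// Hypothetical helper, not part of the original code base.
function truncateTitle(string $title, int $maxChars = 140): string
{
    // mb_substr() counts characters in the given encoding, not bytes,
    // so the cut never lands inside a multi-byte character.
    return mb_substr($title, 0, $maxChars, 'UTF-8');
}

// 200 copies of "å" (0xC3 0xA5): 200 characters, 400 bytes.
$long = str_repeat("\xC3\xA5", 200);
$short = truncateTitle($long);

echo mb_strlen($short, 'UTF-8'), "\n"; // 140 characters
echo strlen($short), "\n";             // 280 bytes
```

Calling mb_internal_encoding('UTF-8') once at startup is an alternative to passing 'UTF-8' on every call, since the mb_ functions fall back to the internal encoding when the encoding argument is omitted.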
To be honest, this is a stupid bug. One really should keep the mb_ functions in mind, but it happens, and I was lucky it showed up early, before it could affect too many visitors.