Ian Landsman

Home · About Me · Twitter (daily updates) · Most Popular Articles · Search
My primary postings are now on Twitter, please follow me there: @ianlandsman

UTF-8, To Dream the Impossible Dream

If you're in the market for a powerful and user friendly Help Desk solution, please take a look at my company's flagship product HelpSpot.
I've spent the better part of a week trying to figure out how to do UTF-8 in PHP with data stored in MSSQL, MySQL, and PostgreSQL and I'm here to report that it's just about impossible. First off you have PHP which is pretty much a mess when it comes to UTF-8 or really any character encodings. Some functions take charset params some don't, there's 2 different but similar charset libraries (mbstring, iconv) both of which have issues and neither of which is installed by default until PHP 5 and then only iconv, except on FreeBSD (or so I read). It's enough to drive you insane.

If you're building a custom application or an in house application there's enough there that you can make it work, but if you're trying to write a distributable application it's very very close to impossible especially when you factor in the database.

See each database handles things totally different. First you have SQL Server which doesn't even store UTF-8 at all. Instead it stores UCS-2 so right there you'd have to convert your nice UTF-8 to UCS-2 before inserting and when you do selects you'll need to convert it back. Then you have MySQL which as no support in the 3 series, no real support in 4 until 4.1 which is a pretty big limitation in terms of requirements. Finally you have PostgreSQL which I honestly barely got to look into. It seems that it stores UTF-8 so long as you compile it with support for it. I'm not sure if that's the "standard" way to compile it or not.

Hence, I've fallen back to fixing up some issues HelpSpot currently has with ISO-8859-X encodings and making sure things work well on that front. Hopefully at some point in the future these things start to come into line. The word on the street is that PHP6 will make unicode the native format and by that time perhaps Microsoft will have a better way to handle it in SQL Server and the install base of MySQL 4.1+ will be big enough to make the switch.

In case anyone else is looking to make the UTF journey with PHP here are a few links to some of the better resources I found:

1. Great discussion of PHP issues with UTF-8
2. Dokuwiki PHP UTF-8 library
3. Some sample UTF-8 characters
4. Textpatterns UTF page - they appear to have it mostly working though they note that sorting and indexing may not always work correctly. Probably not a huge deal in an open source CMS, but that's not acceptable for HelpSpot where sorting and filtering are perhaps the most important functions the system performs.
Created on 12.12.2005 1:12 pm · Comments (7)


Discussion

The reason for doing all this is to support foreign language character sets correct? And when you say PHP doesn't support UTF-8 very well, you mean that a user inputs some information using their own character set, and you are trying to pass those characters to functions, and PHP isn't recognizing the correct characters - specifically into SQL statements to be stored in a database?

I know this may sound remedial, but I have actually had very little experience trying to deal with anything other than UCS-2 (I suppose) / ISO-8859-X.

Would a change to support these characters mean a complete rewrite of every SQL statement and and most functions in HelpSpot? Should I focus on this problem from the get-go on my recent projects - in your opinion? Of course, this depends on the nature of any project.

Created by Ben Mc on 12.12.2005 2:12 pm

Well it depends grin

It wouldn't require rewriting HelpSpot custom functions other than where those functions use PHP string functions which may potentially mess with the UTF-8 data. The UTF functions I linked to above can help alot with that, though they'll be slower than the pure PHP functions of course.

If you're writing an app for SQL Server only you can probably live with using UCS-2, though I'm not sure how the browsers handle it. I presume OK since most common browsers have very good character/encoding support.

The big issue is if you're distributing the app. If you are then you won't know if the user has access to the iconv and or mbstring functions. So you always need to check if they're available (function_exists(), etc) but worse is what if they don't? Then you could leave it and presume they'll be OK because it's probably in Latin, unless it isn't then you're in trouble.

Also you now have an installation base some of whom have data in UCS-2, some in UTF-8, some in Latin. I just imagine that it would be a nightmare to support always having to find out the database platform, if they have the mbstring/iconv functions, etc.

As for the SQL itself I could capture the statements themselves and convert them, that's not such a big deal other than a small performance hit. The bigger issue was with SQL Server because the returned dataset would need to be looped through and converted to UTF-8 before being passed to any other functions since the other functions would be expecting UTF-8. That's a big peformance hit espectially on pages with several selects. Beyond that you have the general increase in overall complexity.

I didn't even get into the issues with the MySQL and Postgres clients which must also be setup to receive UTF because if they aren't they'll corrupt the data even if the table itself is correctly setup as UTF-8.

So ....... yes you should think about it, but if your app is distributed it will probably hard to do. You will need to have significant requirements for your customers like MySQL 4.1+ which will limit sales.

Created by Ian on 12.12.2005 2:12 pm

Just a confirmation that PHP can handle UTF8 as I have worked on a large PHP/Oracle application using Unicode 100% successfully. However I agree it's nearly impossible to distribute a PHP application (using 3 different DBs) that works perfectly with Unicode. On the project I was working we had full control of the servers, Apache version, php version etc and tweaked quite a bit to make it work.

BTW: mbstring has an override option (func_overload) where if mb is available, mb string functions override their non-mb equivalent automatically (e.g. you call strpos() but mb_strpos() is called). It's not a panacea but it helps somewhat.

Created by Dimitris Giannitsaros on 12.12.2005 2:12 pm

Yeah I definitely think it could be done for an in house application, but for distribution I just don't think it can be done reliable. Maybe if your data storage is file based like dokuwiki so you can control that completely.

Even the overrides for mb require an ini setting I believe right? It can't be set at runtime so then you're into a bunch of support telling people how to change this ini setting, etc etc and you still have the DB issues. Those were really the killers. I had the PHP stuff pretty much figured out. It wasn't pretty but it would have worked, but there was just no way to handle all the DB issues in a reliable manner.

Created by Ian on 12.12.2005 2:12 pm

Interesting post and discussion. Thanks for the links.

I'm working on a PHP app on the moment (Yes, me! A C++ geeeenious, trying to do actual work with kids toys ;-D). The app is a product aimed at being distributable. And each time I do a demo to my client, the conversation goes like this:
Client: Oh! what's that garbage in the description column of these items ?
Serge: I told you last time not to worry, it's only me doing basic i18n tests. Not very successful yet but I'll address that shortly.

But each time I say so, I remind what Joel wrote in the introduction of its (excellent) article about i18n :

"... I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications ..."

http://www.joelonsoftware.com/articles/Unicode.html

I always hoped they missed something. Looks like they didn't :-(

Created by Serge Wautier on 12.12.2005 2:12 pm

No he didn't really miss anything. Again, it can be done but it's going to be ugly and it will probably only work if you focus entirely on one database. Even then though you're going to have some weird requirements and several hoops for your customers to jump through to get the database setup correctly and that's assuming they have access to do it. In a shared hosting environment they're probably out of luck.

Created by Ian on 12.12.2005 2:12 pm

Interesting discussion.

Further to that first link you made to the wiki page, there's now another page summarizing "problem" function in PHP: http://www.phpwact.org/php/i18n/utf-8 - also have some code under CVS at http://sourceforge.net/projects/phputf8 which is intended to help smooth issues on servers you don't control, with "native" UTF-8-aware implementations of common PHP string functions. Should release it some time in the next couple of months.

Would be great to get some of the database specific details with PHP / UTF-8 down on that wiki, if you have time. You need to register (http://www.phpwact.org/?do=register) but otherwise no restrictions. Some possible placeholders;

http://www.phpwact.org/php/i18n/utf-8/mysql
http://www.phpwact.org/php/i18n/utf-8/mssql
http://www.phpwact.org/php/i18n/utf-8/oracle
http://www.phpwact.org/php/i18n/utf-8/pqsql

The wiki is run by Jeff Moore (http://www.procata.com/blog) BTW
-----

Created by Harry Fuecks on 12.12.2005 2:12 pm

 

Leave a Comment

Commenting is not available in this weblog entry.


> RSS 2.0
> Blog Archives (complete list)
> HelpSpot Mailing List

Copyright © by Ian Landsman

Design by Jakob Nielsen