Some time ago we had to migrate a latin1 mysql 5.0.37 database to it’s utf8 counterpart, due to the addition of new languages to the application (russian, polish, …)

In the latin1 database we had 3 cases :
1. legal latin1 characters
2. html entities generated by the browser
3. double encoded characters which infact are utf8 characters :
eg ß => if you put this in hex you’ll see that this is c3 9f => this
converts to the german ß in utf8

the problem was on how to locate those occurences :
* the first case was not a problem since mysqldump would convert them properly
* the second case was not difficult we used the the function of regexp to locate
these eg. name REGEXP ‘&#[0-9]+;’
* the third case was much more difficult to find i’ll dedicate a 2nd post to this
later

Leave a Reply