The ICU library provides facilities for working with Unicode and globalization. The problem is that it is installed system-wide and is not trivial to update. We are going to update only some parts of the ICU database, but the rest of it can be updated in the same way. You just need to know exactly what you need to change.

Abstract
--------

This manual is not bound to PHP; it works for other languages as well. PHP uses the `php-intl` extension, which in turn uses `libicu`, and we are going to update some parts of the `libicu` database. This manual may be less useful for Java users, because Java uses ICU4J and the approach can differ.

We'll try to solve two problems: an outdated timezone database and the conversion of IDN domain names to ASCII (IDN to Punycode conversion, the `idn_to_ascii()` and `idn_to_utf8()` functions in PHP).

To figure out the version of `libicu` installed on your system, run:

```
apt-cache search '^libicu'
# or
apt search '^libicu'
```

In my case it outputs:

```
libicu52 - International Components for Unicode
```

This means ICU version 52 is installed on my system. Note that these commands list the packages available from your repositories; `dpkg -l 'libicu*'` shows which one is actually installed.

Problem #1: Outdated Timezone Problem
---

Here is a code snippet that reproduces the problem (reproducible in `libicu` version 52 and probably 55):

```
<?php
// tztest.php
$date = new DateTime();
// MEDIUM date/time styles (2) and the Gregorian calendar (1).
// 'y' is the calendar year; 'Y' would be the week-based year.
$x = \IntlDateFormatter::create('en', \IntlDateFormatter::MEDIUM, \IntlDateFormatter::MEDIUM,
    'Europe/Moscow', \IntlDateFormatter::GREGORIAN, 'y-MM-dd HH:mm:ss Z');
echo $x->format($date) . PHP_EOL;
```

Expected output: current time in the Europe/Moscow timezone (+0300)

Actual output: +0400

Problem #2: Emoji/Unicode Domain Name Conversion
---

The `idn_to_utf8` and `idn_to_ascii` functions convert domain names between their Unicode (IDN) form and their IDNA ASCII (Punycode) form.
The problem is reproducible in `libicu` up to version 57 (possibly in 58 as well):

```
<?php
// idntest.php
$xn = 'xn--4s9haa.ws';
// Decoding to Unicode, so the TO_UNICODE flag applies here.
$result = idn_to_utf8($xn, IDNA_NONTRANSITIONAL_TO_UNICODE, INTL_IDNA_VARIANT_UTS46);
if ($result === false) {
    throw new \InvalidArgumentException("Could not convert Punycode '$xn' to IDN.");
}
echo "SUCCESS: $result\n";
```

Expected output: SUCCESS

Actual output: Exception 'Could not convert Punycode ... to IDN'

Problem Solution Approach
---

According to the [ICU Data documentation](http://userguide.icu-project.org/icudata), if the `ICU_DATA` environment variable is set, the ICU data will be loaded from that path. For example, if `libicu52` is installed on your system and you run your PHP script like

```
ICU_DATA=/opt/icu php tztest.php
```

then the ICU database will be searched for in the `/opt/icu/icudt52l/` directory. The pattern is `/opt/icu/icudt<version><byte ordering>/`, where:

* `<version>` is the version of libicu installed on your system
* `<byte ordering>` can be `l`, `b` or `e` (little-endian, big-endian, or big-endian EBCDIC). See [Sharing ICU Data Between Platforms](http://userguide.icu-project.org/icudata#TOC-Sharing-ICU-Data-Between-Platforms). For a regular x86_64 platform it will be `l`.

So the algorithm for finding the ICU data is the following:

* If `ICU_DATA` is not set, load the data directly from `libicudata.so.52`
* If `ICU_DATA` is set, try to load the data from `$ICU_DATA/icudt52l`
* If the requested file is not found or the directory doesn't exist, fall back to `libicudata.so.52`

Solving Timezone Problem
------------------------

There is a [great article (in Russian)](https://habrahabr.ru/post/254789/) about that. In short, using `strace` we can find which files are requested by `libicu`:

```
export ICU_DATA=/opt/icu
strace php -f tztest.php
```

(Note that on Debian Jessie, when `strace -o 'output.txt' php -f tztest.php` is used instead of the example above, `libicu` ignores `ICU_DATA`. It would be interesting to figure out why.)

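
The full `strace` output is long, so it may help to reduce it to just the data files that ICU looked for in the override directory and did not find. The following is a minimal sketch, not part of ICU or `strace`: the function name `icu_missing_files` is hypothetical, and it assumes the trace contains lines of the form `stat("<path>", ...) = -1 ENOENT ...`.

```shell
#!/bin/sh
# Hypothetical helper: list ICU data files that were requested from the
# override directory but do not exist there (ENOENT).
# Reads strace output on stdin; $1 is the data directory, e.g. /opt/icu/icudt52l.
icu_missing_files() {
    dir=${1%/}                       # drop a trailing slash, if any
    grep -F "\"$dir/" \
        | grep -F 'ENOENT' \
        | sed -e 's/^[^"]*"//' -e 's/".*$//' -e 's|.*/||' \
        | sort -u                    # keep each file name once
}
```

Usage, with `ICU_DATA` exported as above: `strace php -f tztest.php 2>&1 | icu_missing_files /opt/icu/icudt52l`. The trace is piped from stderr instead of written with `-o`, because of the Debian Jessie behaviour noted above.
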
In the `strace` log you'll see the following:

```
stat("/opt/icu/icudt52l/zoneinfo64.res", 0x7ffe20d1a7c0) = -1 ENOENT (No such file or directory)
stat("/opt/icu/icudt52l/timezoneTypes.res", 0x7ffe20d1a120) = -1 ENOENT (No such file or directory)
stat("/opt/icu/icudt52l/metaZones.res", 0x7ffe20d1a450) = -1 ENOENT (No such file or directory)
... and many more
```

According to the official documentation, [Updating the Time Zone Data](http://userguide.icu-project.org/datetime/timezone#TOC-Updating-the-Time-Zone-Data), we need just 4 files:

* zoneinfo64.res
* windowsZones.res
* timezoneTypes.res
* metaZones.res

They can be downloaded [from the repository](http://source.icu-project.org/repos/icu/data/trunk/tzdata/icunew/). Click on the latest release, then `44`, then `le` for little-endian systems, and place those 4 files into the `/opt/icu/icudt52l/` directory. [The complete download link](http://source.icu-project.org/repos/icu/data/trunk/tzdata/icunew/2017b/44/le/).

Run `tztest.php` again; the output should now contain "+0300" instead of "+0400".

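
The download step above could be scripted roughly as follows. This is only a sketch: the function names `icu_tz_urls` and `icu_tz_fetch` are hypothetical, `curl` is assumed to be installed, and the base URL is the one linked in this manual (the `2017b` release name changes over time and the repository may have moved).

```shell
#!/bin/sh
# The four resource files named in "Updating the Time Zone Data".
TZ_FILES="zoneinfo64.res windowsZones.res timezoneTypes.res metaZones.res"

# Hypothetical helper: print the download URL of each file under a base URL.
icu_tz_urls() {
    base=${1%/}                      # drop a trailing slash, if any
    for f in $TZ_FILES; do
        printf '%s/%s\n' "$base" "$f"
    done
}

# Hypothetical helper: download the four files into the ICU data directory.
# Usage: icu_tz_fetch <base-url> <dest-dir>
icu_tz_fetch() {
    base=${1%/}
    dest=$2
    mkdir -p "$dest" || return 1
    for f in $TZ_FILES; do
        # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects
        curl -fsSL -o "$dest/$f" "$base/$f" || return 1
    done
}
```

For example: `icu_tz_fetch http://source.icu-project.org/repos/icu/data/trunk/tzdata/icunew/2017b/44/le /opt/icu/icudt52l`. Running `icu_tz_urls` with the same base URL first prints the URLs without downloading anything.
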
You can also check the `strace` output to be sure the files are loaded:

```
stat("/opt/icu/icudt52l/metaZones.res", {st_mode=S_IFREG|0644, st_size=40960, ...}) = 0
open("/opt/icu/icudt52l/metaZones.res", O_RDONLY) = 4
stat("/opt/icu/icudt52l/zoneinfo64.res", {st_mode=S_IFREG|0644, st_size=151872, ...}) = 0
open("/opt/icu/icudt52l/zoneinfo64.res", O_RDONLY) = 4
stat("/opt/icu/icudt52l/timezoneTypes.res", {st_mode=S_IFREG|0644, st_size=20032, ...}) = 0
open("/opt/icu/icudt52l/timezoneTypes.res", O_RDONLY) = 4
```

Solving IDN - ASCII Conversion Problem
---

First, run `strace` to see which files are requested by `libicu`:

```
export ICU_DATA=/opt/icu
strace php -f idntest.php
```

Grep for `/opt/icu` and here it is:

```
stat("/opt/icu/icudt52l/uts46.nrm", 0x7fffc812b2c0) = -1 ENOENT (No such file or directory)
```

This file is "Unicode Character Data (Normalization since ICU 4.4)"; see [ICU Data File Formats](http://userguide.icu-project.org/icudata#TOC-ICU-Data-File-Formats) for a fuller description.

Where do you get this file? On the [Download](http://site.icu-project.org/download) page, click on the latest ICU version under the `ICU4C` column (as of mid-2017, [version 59](http://site.icu-project.org/download/59#TOC-ICU4C-Download) is the latest). Find the link to [the repository](http://source.icu-project.org/repos/icu/tags/release-59-1/). The needed `uts46.nrm` file is in `icu4c/source/data/in`. Download it from there (or clone the whole repo, of course) and place it into `/opt/icu/icudt52l/`. That repository directory contains some pre-compiled Unicode data, so you won't need to compile the whole ICU database.

Now if you run the script

```
export ICU_DATA=/opt/icu
php -f idntest.php
```

you'll see the SUCCESS output.
Let's check `strace`:

```
stat("/opt/icu/icudt52l/uts46.nrm", {st_mode=S_IFREG|0644, st_size=60668, ...}) = 0
open("/opt/icu/icudt52l/uts46.nrm", O_RDONLY) = 4
```

Updating Other ICU DB Parts
---

If you need up-to-date versions of other parts of the ICU database, it seems you'll need to compile the whole database yourself. If you don't want to do that, you can use the [ICU Data Library Customizer](http://apps.icu-project.org/datacustom/index.html) tool. As of mid-2017, however, the latest ICU version available in that tool is 57.

Select the ICU version you need (the latest, of course!), click the "Get Data Library" button, download the zip archive and extract it. You'll get a `.dat` file, say `icudt57l.dat`. This file is an archive containing everything. To see what is inside, make sure the `icu-devtools` package is installed, then run:

```
icupkg -l icudt57l.dat
```

To extract a file:

```
icupkg -x uts46.nrm icudt57l.dat
```

So you can extract some files and place them in your server's `/opt/icu/icudt52l` directory.

Tracing further with `strace`, it seems that `libicu` sometimes tries to open the `/opt/icu/icudt52l.dat` file. I tried to supply this file (renaming `icudt57l.dat` to `icudt52l.dat`), but it didn't work as expected, although the file was loaded according to `strace`. The reason could be that `.dat` files have different formats depending on the ICU version.
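
The per-file extraction described in this section can be wrapped in a small loop. This is a sketch under assumptions: the function name `icu_extract_items` is hypothetical, and it relies on the `-d` (destination directory) option of `icupkg` in addition to the `-x` option used above. Whether an item taken from a newer ICU release is accepted by an older `libicu` has to be checked case by case, for example with `strace` as shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: extract selected items from an ICU .dat package
# straight into an ICU data directory.
# Usage: icu_extract_items <package.dat> <dest-dir> <item>...
icu_extract_items() {
    if [ "$#" -lt 3 ]; then
        echo "usage: icu_extract_items <package.dat> <dest-dir> <item>..." >&2
        return 2
    fi
    dat=$1
    dest=$2
    shift 2
    mkdir -p "$dest" || return 1
    for item in "$@"; do
        # -x extracts one item, -d sets the directory it is written to.
        icupkg -d "$dest" -x "$item" "$dat" || return 1
    done
}
```

For example: `icu_extract_items icudt57l.dat /opt/icu/icudt52l uts46.nrm`.
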