Manual URL-decoding

Have you ever wondered why special characters in the browser's address bar appear as %AA-codes? It is possible to translate them back to the actual characters manually, but it is considerably more work than is practical. Usually, you would just say urldecode and let PHP do the calculations.

Anyway, it is an interesting exercise to decode such characters manually. First, go and read an explanation of URL encoding.

Here is how to find the character from a URL encoding - manually! We'll deal with the character 我, which as an HTML entity is "&#25105;" (according to babelfish.altavista.com it means "I" in Chinese) and which is URL-encoded as %E6%88%91.
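As a quick check of the encoding mentioned above, here is a minimal sketch in Python (used here only for illustration; the article itself relies on PHP). It assumes the standard library's urllib.parse.quote, which percent-encodes the UTF-8 bytes of a string, and uses ord() to get the number that goes into a numeric HTML entity.

```python
from urllib.parse import quote

char = "我"

# quote() percent-encodes the UTF-8 bytes of the character.
encoded = quote(char)

# ord() returns the Unicode code point, the number used in a numeric HTML entity.
entity = "&#{};".format(ord(char))

print(encoded)  # %E6%88%91
print(entity)   # &#25105;
```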

To follow the instructions below, you need the Windows calculator or an equivalent, a text editor and the "character map" utility. To open the calculator, choose Start -> Run and type calc; the character map can be opened in the same way by typing charmap.

  1. Open the Windows calculator and change to Scientific mode in the View menu. Then choose "Hex" format and type the hex value from above (simply strip out the % signs): e68891.
  2. Click the "Bin" option to get the binary value of this hexadecimal number.
  3. Copy it and paste it in Notepad. You should get 111001101000100010010001

    This is the binary, UTF-8 encoded string. We want to un-UTF-8 it to find the Unicode value. This procedure was worked out according to the technical documentation for UTF-8.
  4. First, starting at the end of the string, add a line break after every 8 digits. You now have:

    11100110
    10001000
    10010001
  5. From the first line, remove all the initial 1 digits (the 0 that follows them is also part of the marker, but leaving it in does not change the value). From each of the next lines, remove the initial "10". It will now look like this:
    00110
    001000
    010001
  6. Remove the line breaks and put it all on one line again:
    00110001000010001
  7. Copy that whole string and go back to the calculator. It should still be in "Bin" mode, so just paste this new string.
  8. If you now click "Dec" (for decimal or "normal" format), you see the number we would use in HTML entities to get this character - 25105.
  9. Next, click "Hex". The calculator will say "6211". Now open the Windows "character map" utility. Make sure you select a Unicode-enabled font. Activate "Advanced view" if it doesn't show the "Go to Unicode" box. Then, in the "Go to Unicode" box type 6211. Voila, it shows the character you are looking for.
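
The nine steps above can also be sketched in a few lines of code. This is a minimal Python illustration, not part of the original procedure: the function name manual_url_decode is made up for this example, and it assumes the input is exactly one valid percent-encoded UTF-8 character (it does no validation). The comments map each line back to the step it mirrors.

```python
def manual_url_decode(encoded):
    """Decode one percent-encoded UTF-8 character by following the manual steps."""
    # Step 1: strip the % signs to get the hex digits, e.g. "E68891".
    hex_digits = encoded.replace("%", "")

    # Steps 2-3: convert to binary, padded to 8 digits per byte.
    bits = bin(int(hex_digits, 16))[2:].zfill(len(hex_digits) * 4)

    # Step 4: split into lines of 8 digits each.
    octets = [bits[i:i + 8] for i in range(0, len(bits), 8)]

    # Step 5: drop the leading 1s from the first line and the "10" from the others.
    payload = [octets[0].lstrip("1")] + [octet[2:] for octet in octets[1:]]

    # Step 6: put everything on one line again.
    joined = "".join(payload)

    # Steps 7-9: read the digits as a binary number; chr() plays the role of the character map.
    code_point = int(joined, 2)
    return joined, code_point, chr(code_point)


joined, code_point, char = manual_url_decode("%E6%88%91")
print(joined)                   # 00110001000010001
print(code_point)               # 25105
print(format(code_point, "X"))  # 6211
print(char)                     # 我
```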

I'm sure you agree it is simpler to just type <?php echo urldecode('%E6%88%91'); ?> :-)
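
For comparison, the same one-step shortcut exists outside PHP. The sketch below assumes Python's standard urllib.parse.unquote, which decodes percent-escapes as UTF-8 by default, and confirms that the library reaches the same code point as the manual procedure.

```python
from urllib.parse import unquote

# unquote() turns the %XX escapes back into bytes and decodes them as UTF-8.
decoded = unquote("%E6%88%91")

print(decoded)                    # 我
print(ord(decoded))               # 25105
print(format(ord(decoded), "X"))  # 6211
```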

This was written in response to a question on experts-exchange.