When you’re restricted to ASCII, how can you represent more complex things like emojis or non-Latin characters? One answer is Punycode, which is a way to represent Unicode characters in ASCII. However, while you could technically encode the raw bits of Unicode into characters, like Base64, there’s a snag. The Domain Name System (DNS) generally requires that hostnames are case-insensitive, so whether you type in HACKADAY.com, HackADay.com, or just hackaday.com, it all goes to the same place.
[A. Costello] at the University of California, Berkley proposed the idea of Punycode in RFC 3492 in March 2003. It outlines a simple algorithm where all regular ASCII characters are pulled out and stuck on one side with a separator in between, in this case, a hyphen. Then the Unicode characters are encoded and stuck on the end of the string.
First, the numeric codepoint and position in the string are multiplied together. Then the number is encoded as a Base-36 (a-z and 0-9) variable-length integer. For example, a greeting and the Greek for thanks, “Hey, ευχαριστώ” becomes “Hey, -mxahn5algcq2″. Similarly, the beautiful city of München becomes mnchen-3ya.
As you might notice in the Greek example, there is nothing to help the decoder know which base-36 characters belong to which original Unicode symbol. Thanks to the variable-length integers, each significant digit is recognizable, as there are thresholds for what numbers can be encoded. A finite-state machine comes to the rescue. The RFC gives some exemplary pseudocode that outlines the algorithm. It’s pretty clever, utilizing a bias that rolls as the decoding goes along. As it is always increasing, it is a monotonic function with some clever properties.
Of course, to prevent regular URLs from being interpreted as punycodes, URLs have a special little prefix xn-- to let the browser know that it’s a code. This includes all Unicode characters, so emojis are also valid. So why can’t you go to xn--mnchen-3ya.de? If you type it into your browser or click the link, you might see your browser transform that confusing letter soup into a beautiful URL (not all browsers do this). The biggest problem is Unicode itself.
While Unicode offers incredible support for making the hundreds of languages used around the web every day possible and, dare we say, even somewhat straightforward, there are some warts. Cyrillic, zero-width letters and other Unicode oddities allow those with more nefarious intentions to set up a domain that, when rendered, displays as a well-known website. The SSL certificates are valid, and everything else checks out. Cyrillic includes characters that visually look identical to their Latin counterparts but are represented differently. The opportunities for hackers and phishing attempts are too great, and so far, punycodes haven’t been allowed on most domains.
For example, can you tell the difference between these two domains?
Some browsers will render the hover text as the Punycode, and some will keep it as its UTF-8 equivalent. The “a” (U+0061) has been replaced by the Cyrillic “a” (U+0430), which most computers render with the exact same character.
This is an IDN homograph attack, where they’re relying on a user to click on a link that they can’t tell the difference between. In 2001, two security researchers published a paper on the subject, registering “microsoft.com” with Cyrillic characters as a proof of concept. In response, top-level domains were recommended to only accept Unicode characters containing Latin characters and characters from languages used in that country. As a result, many of the common US-based top-level domains don’t accept Unicode domain names at all. At least the non-displayable characters are specifically banded by the ICANN, which avoids a large can of worms, but having visually identical but bit-wise different characters out there leads to confusion.
However, mitigations to these types of attacks are slowly being rolled out. As a first layer of protection, Firefox and Chromium-based browsers only show the non-Punycode version if all the characters are from the same language. Some browsers convert all Unicode URLs to Punycode. Other techniques use optical character recognition (OCR) to determine whether a URL can be interpreted differently. Outside the browser, links sent by text message or in emails, might not have the same smarts, and you won’t know until you’ve opened them in your browser. And by then, it’s too late.
Challenges aside, will Punycodes get their time in the sun? Will Hackaday ever get ☠️📅.com? Who knows. But in the meantime, we can enjoy a clever solution proposed in 2003 to the thorny problem of domain name internationalization that we still haven’t quite solved.