Recently, I’ve been working a lot on functionality involving querying DNS records. Since this was a relatively untouched area for me, it required a lot of manual sanity checks to make sure I understood what was happening. I’ve mostly done them using dig - a wonderful little tool to poke DNS records. If you have not used it, I highly recommend playing with it. Be that as it may, I eventually ran into a problem with certain domains that confounded me.
Digging Internationalized Domains
After dig-ing a bunch of domains for some time, I realized that something was missing in my mental model. Consider a webpage like https://casaè.it/. It’s a fully working webpage: you can visit it and browse it, which means it has DNS records set up. However, when we try to dig it, the response tells us that no such domain exists: we get status: NXDOMAIN and no answer section at all. What gives? How can it not exist if we just visited it?
Well, it turns out there’s something called an Internationalized Domain Name (IDN).
What Are Internationalized Domain Names?
IDNs, put simply, are domain names that contain at least one character outside of the traditional ASCII set [1][2][3]. This includes the top-level domain, too (for example, we could register the TLD .ευ).
IDNs are not that common in the wild. Even today, more than two decades after the original IDN proposal, most domains in the “western” world do not have non-Latin characters. This is understandable, as it’s often easier to remember and type a domain name made up purely of Latin characters, especially if you cater to a multilingual audience where people might not share a common tongue and/or have a substantially different alphabet. The fact that most domain names are made up of ASCII characters also means that some people are not even aware that you can, in fact, have an internationalized domain name.
Working With IDNs: Problem
Just because some domains are not common, it does not mean that they are non-existent. Sooner or later you will come across such domains, and chances are against it being a convenient time. Keeping that in mind, how can we work with them?
As we’ve seen before, we cannot just dig <IDN> and call it a day, as we’ll find no DNS records. That is because a domain name is a string that can contain only letters of the Latin alphabet (a-z, A-Z), digits (0-9), the minus sign (-) and the period (.) [3][4][5].
Or, as Wikipedia puts it:
The DNS, which performs a lookup service to translate mostly user-friendly names into network addresses for locating Internet resources, is restricted in practice to the use of ASCII characters, a practical limitation that initially set the standard for acceptable domain names.
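To make the restriction concrete, here is a minimal Python sketch of such a character check. The helper name is_ldh_name is made up for illustration, and the check covers only the character set described above; it ignores other rules such as label length limits.

```python
import re

# Letters, digits, hyphen and period only: the character set described above.
# Hypothetical helper for illustration; other DNS naming rules are not checked.
_LDH_PATTERN = re.compile(r"[A-Za-z0-9.-]+")


def is_ldh_name(name: str) -> bool:
    """Return True if `name` uses only ASCII letters, digits, '-' and '.'."""
    return _LDH_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("example.com", "casaè.it"):
        print(candidate, is_ldh_name(candidate))
```

A name like casaè.it fails this check because of the è, even though the website behind it clearly works.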
This raises two obvious questions:
- How can domains with non-latin characters exist if non-ASCII characters are not allowed?
- How do we query (e.g., dig) those domains?
Working With IDNs: Solution
The answer to both of these questions is Punycode - a representation of Unicode characters using a subset of ASCII characters [6]. Since domain names do not support Unicode characters, they are encoded in Punycode instead. For example, something like münchen.de would be converted to xn--mnchen-3ya.de. As you can see, the Punycode representation contains only ASCII letters, digits and hyphens. This makes it possible to use the domain in DNS queries.
Digging both forms shows the difference: dig xn--mnchen-3ya.de returns an A record, along with a bunch of additional information, whereas dig münchen.de does not. To be 100% sure, we can try the same with casaè.it, which translates to xn--casa-8oa.it.
The astute reader will notice that the translation is not as simple as substituting Unicode letters with some combination of ASCII characters. The translation algorithm is out of scope for this post, but you can read its details in Wikipedia’s article on Punycode. You can also easily find tools online to do the conversions - just google “punycode converter” or similar. Many programming languages have this encoding built in, too; Python, for example, exposes it through its idna codec.
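A minimal sketch of that conversion, assuming Python's built-in idna codec (it follows the older IDNA 2003 rules, so results may differ from newer implementations for some edge-case characters). The helper names to_ascii and to_unicode are made up for illustration:

```python
def to_ascii(domain: str) -> str:
    """Convert a possibly internationalized domain to its ASCII (xn--) form."""
    # The idna codec encodes each label separately and adds the xn-- prefix.
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (xn--) domain back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    print(to_ascii("münchen.de"))           # xn--mnchen-3ya.de
    print(to_ascii("casaè.it"))             # xn--casa-8oa.it
    print(to_unicode("xn--mnchen-3ya.de"))  # münchen.de
```

Going through the codec rather than hand-rolling the algorithm keeps the per-label handling and the xn-- prefix out of your own code.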
In Java/Scala, the standard library’s java.net.IDN.toASCII method does the same job.
You may wonder what would happen if an ASCII-only domain name were put through the same conversion. In fact, the conversion leaves such a name unchanged.
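This is easy to check with a short sketch, again assuming Python's built-in idna codec; to_ascii is a hypothetical helper name, and example.com is just a placeholder domain:

```python
def to_ascii(domain: str) -> str:
    """Convert a domain to its ASCII form using the built-in idna codec."""
    return domain.encode("idna").decode("ascii")


if __name__ == "__main__":
    plain = "example.com"
    # An ASCII-only name comes back unchanged.
    print(to_ascii(plain) == plain)

    # An already converted name is ASCII too, so converting it twice is harmless.
    encoded = to_ascii("münchen.de")
    print(to_ascii(encoded) == encoded)
```

Because the conversion is a no-op on ASCII input, it is safe to apply it to every domain without first checking whether it is an IDN.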
Thus, if you think you may encounter IDNs in your system, it might be a good idea to run the domain through a Punycode converter first. This will ensure you’ll be able to query the domains in the way DNS expects. It will also save you the embarrassment of assuming a DNS record does not exist when in fact it does, just in a different encoding.
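As a rough sketch of what "convert first, then query" could look like in Python: the helper names dig_command and run_dig are made up for illustration, the conversion relies on the built-in idna codec, and run_dig assumes that dig is installed and on the PATH.

```python
import subprocess
from typing import List


def dig_command(domain: str) -> List[str]:
    """Build a dig invocation for `domain`, converting it to ASCII first."""
    ascii_domain = domain.encode("idna").decode("ascii")
    return ["dig", ascii_domain]


def run_dig(domain: str) -> str:
    """Run dig for `domain` and return its raw output (assumes dig is on PATH)."""
    completed = subprocess.run(
        dig_command(domain), capture_output=True, text=True, check=True
    )
    return completed.stdout


if __name__ == "__main__":
    # Only prints the command; call run_dig(...) to actually perform the query.
    print(dig_command("casaè.it"))
```

Keeping the conversion in one place means the rest of the system can pass domains around in whatever form users typed them.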
Conclusion
DNS is a wonderful system that has stood the test of time. Its usefulness and flexibility cannot be overstated. However, since it is an old system, its handling of modern use cases is not always intuitive, as can be seen with IDNs. Nevertheless, given the restrictions within which the system has to work, I think it’s a testament to DNS engineers’ cleverness and ingenuity that the system can be adapted to unforeseen use cases without massive disruptions to the way DNS works.