Modern email addresses can be in UTF-8
Over on the Fediverse, I noted:
It has been '0' days since someone's email client helpfully let them use a Unicode '‐' instead of an ASCII '-' in dash-separated email addresses. Or perhaps the client automatically used the Unicode character instead of the ASCII dash.
You may not be surprised to hear that email systems, ours included, don't consider the two to be the same. I'm not sure how it even works, although some sending MTAs appear to just send the address as UTF-8.
Specifically, the character in question is Unicode U+2010 Hyphen. The email in question was sent to us using this character in a destination address that actually had the ASCII dash; given that the U+2010 version of the address didn't exist, Exim on our external MX gateway rejected it. These days, Exim's logging is in UTF-8, as is pretty much anything you'll use to read the logs, so the result was pretty confusing to disentangle. To all appearances it looked like our email system had temporarily glitched out and decided that some valid local addresses didn't actually exist.
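Since the two addresses look identical in a UTF-8 terminal, one way to disentangle a log line like this is to list every non-ASCII character in the address. The following is only a minimal Python sketch; the function name and the example addresses are made up for illustration and have nothing to do with Exim itself.

```python
import unicodedata


def describe_non_ascii(address: str) -> list[tuple[int, str, str]]:
    """Return (index, 'U+XXXX', Unicode name) for each non-ASCII character."""
    found = []
    for index, char in enumerate(address):
        if ord(char) > 0x7F:
            # Fall back to a placeholder for code points that have no name.
            name = unicodedata.name(char, "<unnamed>")
            found.append((index, f"U+{ord(char):04X}", name))
    return found


# Hypothetical addresses: the first uses an ASCII dash, the second U+2010.
ascii_address = "some-list@example.org"
lookalike_address = "some\u2010list@example.org"

print(describe_non_ascii(ascii_address))      # []
print(describe_non_ascii(lookalike_address))  # [(4, 'U+2010', 'HYPHEN')]
```

Running something like this over the rejected recipient address makes the substituted character obvious even when the log viewer renders both dashes the same way.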
The answer to my final question, about how this actually works, is RFC 6531: SMTP Extension for Internationalized Email, also known as SMTPUTF8. Exim supports SMTPUTF8 (if built appropriately), and it defaults to advertising this to everyone (per Main configuration and the description of smtputf8_advertise_hosts in it). To simplify, a large part of what SMTPUTF8 support does is that the sender can use UTF-8 in envelope addresses, both MAIL FROM and RCPT TO. Either or both of the local part and the (sub)domain can be in UTF-8, although the resulting DNS label needs to conform with IDNA.
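As a rough illustration of the sending side, here is a minimal Python sketch of how a client might decide whether to add the SMTPUTF8 parameter to its MAIL FROM command. The function names are my own invention, and the sketch only looks at envelope addresses; a real client would also have to consider whether the message headers contain UTF-8 and whether the server advertised SMTPUTF8 in its EHLO response.

```python
def needs_smtputf8(*addresses: str) -> bool:
    """True if any envelope address contains a non-ASCII character."""
    return any(not address.isascii() for address in addresses)


def mail_from_command(sender: str, recipients: list[str]) -> str:
    """Build the MAIL FROM line, adding SMTPUTF8 only when it is needed."""
    command = f"MAIL FROM:<{sender}>"
    if needs_smtputf8(sender, *recipients):
        # The SMTPUTF8 parameter goes on MAIL FROM, even if only a
        # recipient address is what requires it.
        command += " SMTPUTF8"
    return command


# Hypothetical addresses; the second recipient uses U+2010 instead of '-'.
print(mail_from_command("user@example.com", ["some-list@example.org"]))
print(mail_from_command("user@example.com", ["some\u2010list@example.org"]))
```

In Python's own smtplib, the equivalent checks are `has_extn("smtputf8")` on the connection and passing `mail_options=["SMTPUTF8"]` to `sendmail()`.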
Allowing email addresses to use U+2010 hyphens instead of ASCII ones is a trivial use of SMTPUTF8. A potentially much more important one for genuine internationalization is allowing people to have addresses that aren't written only in ASCII, for example because their name itself is not ASCII. Any number of Europeans have accented characters in their names and so might like to have them in their email addresses, and then there are quite a lot of people who don't write their names in any version of the Latin alphabet. SMTPUTF8 accommodates all of them.
Of course not all mail systems out there in the world support SMTPUTF8, so today anyone using such an email address is taking some degree of risk (unless their system automatically handles the situation of a destination mail server not supporting SMTPUTF8 by, for example, rewriting the envelope address and possibly message headers to a known alternate version). But I suspect that the large email providers all support it, and their support for it (and willingness to generate and use email addresses in UTF-8) will push everyone to support it sooner or later.
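To make the fallback idea concrete, here is a hedged Python sketch of what rewriting to a known alternate version could look like. Everything in it is an assumption for illustration: the alternates table, the addresses, and the function name are hypothetical, and Python's built-in "idna" codec implements the older IDNA 2003 rules, so a real system would likely use something more current.

```python
from typing import Optional

# Hypothetical table mapping UTF-8 addresses to known all-ASCII alternates.
ASCII_ALTERNATES = {"j\u00fcrgen@example.org": "juergen@example.org"}


def downgrade_address(
    address: str,
    server_supports_smtputf8: bool,
    alternates: dict[str, str] = ASCII_ALTERNATES,
) -> Optional[str]:
    """Return an address usable with this server, or None if there is none."""
    if server_supports_smtputf8 or address.isascii():
        return address
    if address in alternates:
        return alternates[address]
    local_part, separator, domain = address.rpartition("@")
    if separator and local_part.isascii():
        # Only the domain is non-ASCII, so its IDNA form is an equivalent.
        return f"{local_part}@{domain.encode('idna').decode('ascii')}"
    return None  # A non-ASCII local part with no known alternate.


print(downgrade_address("info@b\u00fccher.example", False))
# info@xn--bcher-kva.example
print(downgrade_address("j\u00fcrgen@example.org", False))
# juergen@example.org
```

The important asymmetry is that a non-ASCII domain always has a mechanical ASCII form, while a non-ASCII local part does not; for the local part, the sender has to know an alternate in advance or give up.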
(I have actually encountered SMTPUTF8 before, but in the time since then I forgot about it.)