Received: from localhost by CS.UTK.EDU with SMTP (cf v2.9s-UTK) id UAA03887; Wed, 6 Mar 1996 20:52:45 -0500
Received: by CS.UTK.EDU (bulk_mailer v1.4); Wed, 6 Mar 1996 20:51:49 -0500
Received: from koobera.math.uic.edu by CS.UTK.EDU with SMTP (cf v2.9s-UTK) id UAA03746; Wed, 6 Mar 1996 20:51:46 -0500
Received: (qmail-queue invoked by uid 666); 7 Mar 1996 01:53:39 GMT
Date: 7 Mar 1996 01:53:39 GMT
Message-ID: <19960307015339.15720.qmail@koobera.math.uic.edu>
From: djb@koobera.math.uic.edu (D. J. Bernstein)
To: drums@cs.utk.edu
Subject: re: Message format document outline

John has suggested that, for something to be ``reasonably considered a
modern Internet MTA,'' it must store all mail messages in CRLF form.
Presumably this is what his secret text is going to say, although he's
going to use words like ``MUST'' and ``MUST NOT'' rather than ``to be
reasonably considered a modern Internet MTA.''

I don't like this. The basic problem is that John has a view of mail
messages that conforms to neither history nor reality. The alleged
virtue of his view is that it permits a certain feature; however, the
feature is unreliable. Explanation follows.

1. Definition: ``Text'' is a sequence of lines, where each line is a
sequence of characters.

There are many ways of _encoding_ text as a single sequence of
characters for storage or transmission. Examples:

   If you impose the restriction that lines not contain LF, you can
   encode text as <line>LF<line>LF...

   If you impose the restriction that no line is longer than 255
   characters, you can encode text as <len><line><len><line>... where
   each <len> is one byte.

   If you impose the restriction that lines not contain CR LF, you can
   encode text as <line>CRLF<line>CRLF...

Each text encoding has advantages. For example, the CRLF encoding
produces a reasonable display on standard terminals. The LF encoding is
easier for programmers than the CRLF encoding, and is blessed by ANSI.
The NUL encoding (see GNU find -print0 and xargs -0) is awesome for
programmers.

2.
Definition: A ``mail message'' is what MTAs transfer.

The traditional view: a mail message is text---a sequence of lines.

John's view: a mail message is a sequence of characters. John
recognizes that users want to interpret messages as text, so he has
selected a text encoding (CRLF, of course) and specified that a sequence
of characters be interpreted as text according to this encoding.

(This disallows texts that contain CRLF inside a line. Fortunately,
users don't seem to care about texts that contain control characters
other than 9, sometimes 8, and occasionally 12.)

3. Definition: A ``binary file'' is an arbitrary sequence of characters.

The alleged virtue of John's view of mail messages is that it permits
binary files to be automatically handled by MTAs---after all, in his
view, a mail message _is_ a binary file.

Don't you hate saying TYPE I or TYPE A in FTP? Isn't it a pain that all
these DOS/Windows programs distinguish between text files and binary
files? Wouldn't networking be so much easier if you could treat
everything as binary?

But, of course, you can't. Here's why.

UNIX users will persist in creating text files in their normal editors
and then mailing those text files to Windows users. I think John accepts
that UNIX will, for the foreseeable future, use the LF encoding.

If the LF on the UNIX end doesn't get converted to a CRLF on the Windows
end, the file will be unreadable. That's corruption.

On the other hand, if the LF _is_ converted to a CRLF, and a UNIX user
follows the same procedure to send a binary file that contains an LF,
_that_ file will be corrupted.

It is therefore patently obvious that, to avoid corrupting the binary
file, one of the users has to DO SOMETHING DIFFERENT.

When I told John this, he disagreed, and claimed that Pine ``already
handles this problem.'' But what Pine does is the moral equivalent of
/bin/file. It _guesses_, based on the binary file's contents, whether
the user actually meant that file as an encoded text file.
Sometimes it guesses wrong and requires user intervention. It is, in
short, unreliable.

4. John presumably intends to enforce his view by declaring, in the RFC
822 update, that a message _is_ a sequence of characters, to be
interpreted as text lines under the CRLF encoding.

This would mean that a UNIX text file containing a properly formatted
header and body is _not_ a valid mail message unless it has a CR at the
end of each line. Under the traditional view, of course, this is utter
confusion between content and encoding.

What I find most disgusting here is that John knows what he's doing. He
really wants to have ``messages in canonical, CRLF-separated form FROM
END TO END'' (emphasis added); that's how his mail software works, and
he wants to cause trouble for anyone who takes a different view.

---Dan
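
The dilemma in section 3 can be illustrated with a minimal Python sketch. It is not taken from any real MTA or from Pine; the helper names `lf_to_crlf` and `crlf_lines` and the sample byte strings are hypothetical, chosen only to show that a single fixed policy (always convert LF, or never convert) mishandles either the text file or the binary file.

```python
def lf_to_crlf(data):
    # Hypothetical gateway step: rewrite every LF byte as CR LF.
    return data.replace(b"\n", b"\r\n")


def crlf_lines(data):
    # How a CRLF-only reader would split a byte sequence into lines.
    parts = data.split(b"\r\n")
    if parts and parts[-1] == b"":
        parts.pop()  # drop the empty piece after the final CRLF
    return parts


# Sample inputs (assumptions for illustration only).
unix_text = b"line one\nline two\n"          # text in the LF encoding
binary_blob = bytes([0x89, 0x0A, 0x00, 0xFF])  # binary data containing an LF

# Policy A: never convert. The binary file survives, but the CRLF
# reader sees the whole text as a single line.
text_a = crlf_lines(unix_text)
blob_a = binary_blob

# Policy B: always convert. The text now splits into two lines, but
# the binary file has gained a byte and no longer matches the original.
text_b = crlf_lines(lf_to_crlf(unix_text))
blob_b = lf_to_crlf(binary_blob)

print(len(text_a), blob_a == binary_blob)
print(len(text_b), blob_b == binary_blob)
```

Neither policy handles both inputs, which is the sense in which one of the users has to do something different: the sender must say whether the bytes are encoded text or a binary file.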