Notepad bug in detail
Tuesday, October 30, 2007 by Jason
I have already given you guys a brief explanation of how the "Bush hid the facts" Notepad bug occurs.
It is because of the IsTextUnicode function. This function is used to guess which type of encoding a file being opened uses (the traditional ANSI encoding or the Unicode encoding).
For long sentences, it's fine. But for a short word, it becomes pretty tricky. Let's go into details.
To store "Hello", Notepad has to convert it to bytes using an encoding. Notepad supports several encodings. Most of them start the file with a specific prefix that tells the program which encoding is in use. The tricky part is that Notepad also supports encodings with NO prefix, such as the traditional ANSI encoding (i.e., "plain ASCII") and the Unicode (little-endian) encoding with no BOM.
Confused? See "Hello" below in the different encodings.
48 65 6C 6C 6F This is the traditional ANSI encoding. (No prefix)
48 00 65 00 6C 00 6C 00 6F 00 This is the Unicode (little-endian) encoding with no BOM. (No prefix)
FF FE 48 00 65 00 6C 00 6C 00 6F 00 This is the Unicode (little-endian) encoding with BOM. (FF FE is prefix)
FE FF 00 48 00 65 00 6C 00 6C 00 6F This is the Unicode (big-endian) encoding with BOM. Notice that this BOM is in the opposite order from the little-endian BOM. (FE FF is prefix)
EF BB BF 48 65 6C 6C 6F This is UTF-8 encoding. The first three bytes are the UTF-8 encoding of the BOM. (EF BB BF is prefix)
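If you want to reproduce the byte listing above yourself, here is a small Python sketch. It is only an illustration, not what Notepad does internally: I am assuming Python's "ascii" codec as a stand-in for the ANSI case (fine for "Hello", which is pure ASCII), and hex_bytes is a helper name I made up for this example.

```python
import codecs

def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex, like the listing above."""
    return " ".join(f"{b:02X}" for b in data)

text = "Hello"

# Assumption: "ascii" stands in for the ANSI code page, since "Hello" is plain ASCII.
encodings = {
    "ANSI (no prefix)": text.encode("ascii"),
    "Unicode little-endian, no BOM": text.encode("utf-16-le"),
    "Unicode little-endian, with BOM": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "Unicode big-endian, with BOM": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "UTF-8, with BOM": text.encode("utf-8-sig"),  # utf-8-sig prepends EF BB BF
}

for name, data in encodings.items():
    print(f"{hex_bytes(data):<36} {name}")
```

The first two entries are the ones to watch: nothing in their bytes says which encoding was used.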
So the prefix cannot always settle the question. A file in traditional ANSI or in Unicode (little-endian) with no BOM carries no marker at all, and in principle its first bytes could even happen to be EF BB BF, which is the UTF-8 prefix, or FF FE, which is the prefix for Unicode (little-endian) with BOM.
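When there is no prefix to trust, Notepad relies on IsTextUnicode to guess from the bytes themselves, and for a short file that guess can be wrong. That is what happens with "Bush hid the facts": the 18 ANSI bytes are taken for nine 16-bit Unicode (little-endian) characters. Remember that no matter which encoding you saved with, Notepad will try to guess again when you open the file.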
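To see what a wrong guess looks like for "Bush hid the facts", here is a rough Python sketch. It does NOT reimplement IsTextUnicode (that function's heuristics are not reproduced here). It only checks for the prefixes listed earlier and then shows what the same bytes turn into if they are read as Unicode (little-endian) with no BOM. The name sniff_prefix is hypothetical, made up for this example.

```python
import codecs

# Prefixes (BOMs) from the listing earlier in the post.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "Unicode (little-endian)"),
    (codecs.BOM_UTF16_BE, "Unicode (big-endian)"),
]

def sniff_prefix(data: bytes):
    """Return the encoding announced by a prefix, or None if there is no prefix."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

# What ends up on disk when the sentence is saved as ANSI (pure ASCII here).
saved = "Bush hid the facts".encode("ascii")

print(len(saved))           # number of bytes in the file
print(sniff_prefix(saved))  # no prefix, so the reader is forced to guess

# If the guess is "Unicode (little-endian), no BOM", every two bytes
# are combined into one 16-bit character.
misread = saved.decode("utf-16-le")
print(len(misread))
print([f"U+{ord(ch):04X}" for ch in misread])
```

The byte count is even, so the bytes pair up cleanly, and each pair lands on a CJK code point (the first one, 42 75, becomes U+7542). That is why the file shows up as a row of Chinese-looking characters instead of the original sentence.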
But I don't see this bug in Windows Vista anymore. So I think the Microsoft guys have either improved the IsTextUnicode function or made Notepad open files in the encoding they were saved in.
Reference : MSDN blog
ok...on a lighter note I tried
bush did the f...s [fully typed though ;) ] and it returned normal. :p