Friday, May 22, 2009

Overhauling the Java UTF-8 charset

The UTF-8 charset implementation, which is available in all JDK/JRE releases from Sun, has been updated recently to reject non-shortest-form UTF-8 byte sequences, because the old implementation might be leveraged in security attacks. Since then I have been asked many times about what this "non-shortest-form" issue is and what the possible impact might be, so here are some answers.

The first question usually goes: "What is the non-shortest-form issue"?

The detailed and official answer is at Unicode Corrigendum #1: UTF-8 Shortest Form. Simply put, the problem is that a Unicode character can be represented in more than one form under what many people think of as "the UTF-8 encoding". When asked what UTF-8 encoding looks like, the simplest explanation is the following bit pattern:

#   Bits   Bit pattern
1    7     0xxxxxxx
2   11     110xxxxx 10xxxxxx
3   16     1110xxxx 10xxxxxx 10xxxxxx
4   21     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The pattern is close, but based on the latest definition of UTF-8 it is actually wrong: it leaves a loophole that allows a single Unicode character to be represented by more than one form.

For the ASCII characters U+0000 through U+007F, for example, the UTF-8 encoding form maintains transparency: they keep their ASCII code values of 0x00..0x7f (in one-byte form) in UTF-8. Based on the preceding pattern, however, these same characters could also be represented in a two-byte form, [c0, 80]..[c1, bf], the "non-shortest form".
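To see why, take 'A' (U+0041, binary 1000001): padded into the two-byte pattern 110xxxxx 10xxxxxx, it becomes 11000001 10000001, that is, the byte pair [c1, 81].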

The following code prints all of the non-shortest two-byte forms of these ASCII characters, if you run it against an "old" version of the JDK/JRE (Java Runtime Environment).


byte[] bb = new byte[2];
// 0xc0 and 0xc1 are the only lead bytes whose two-byte sequences
// all fall into the one-byte ASCII range when decoded
for (int b1 = 0xc0; b1 < 0xc2; b1++) {
    for (int b2 = 0x80; b2 < 0xc0; b2++) {
        bb[0] = (byte) b1;
        bb[1] = (byte) b2;
        // an old JDK/JRE decodes the overlong pair instead of rejecting it
        String cstr = new String(bb, "UTF8");
        char c = cstr.toCharArray()[0];
        System.out.printf("[%02x, %02x] -> U+%04x [%s]%n",
                          b1, b2, c & 0xffff, (c >= 0x20) ? cstr : "ctrl");
    }
}

The output would be as follows:


...
[c0, a0] -> U+0020 [ ]
[c0, a1] -> U+0021 [!]
...
[c0, b6] -> U+0036 [6]
[c0, b7] -> U+0037 [7]
[c0, b8] -> U+0038 [8]
[c0, b9] -> U+0039 [9]
...
[c1, 80] -> U+0040 [@]
[c1, 81] -> U+0041 [A]
[c1, 82] -> U+0042 [B]
[c1, 83] -> U+0043 [C]
[c1, 84] -> U+0044 [D]
...

So, for a string like "ABC", you would have two forms of UTF-8 sequences:


"0x41 0x42 0x43" and "0xc1 0x81 0xc1 0x82 0xc1 0x83"

The Unicode Corrigendum #1: UTF-8 Shortest Form specifies explicitly that "The definition of each UTF specifies the illegal code unit sequences in that UTF. For example, the definition of UTF-8 (D36) specifies that code unit sequences such as [C0, AF] are illegal."

Our old implementation accepts those non-shortest forms when decoding (while it never generates them when encoding). The new UTF_8 charset now rejects the non-shortest-form byte sequences for all BMP characters. You can see the new behavior directly with a CharsetDecoder configured to report malformed input; the sketch below is mine, not JDK code (the class name is invented):
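import java.nio.ByteBuffer;
import java.nio.charset.*;

public class RejectOverlong {
    public static void main(String[] args) throws Exception {
        // overlong two-byte encoding of 'A' (U+0041)
        byte[] overlong = { (byte) 0xc1, (byte) 0x81 };
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT);
        try {
            System.out.println(dec.decode(ByteBuffer.wrap(overlong)));
        } catch (MalformedInputException e) {
            // a fixed JDK/JRE lands here; an old one prints 'A' instead
            System.out.println("rejected: " + e.getMessage());
        }
    }
}

Only the "legal byte sequences" listed below are accepted.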


/* Legal UTF-8 Byte Sequences
 *
 * #  Code Points          Bits  Bit/Byte pattern
 * 1                        7    0xxxxxxx
 *    U+0000..U+007F             00..7F
 * 2                       11    110xxxxx 10xxxxxx
 *    U+0080..U+07FF             C2..DF   80..BF
 * 3                       16    1110xxxx 10xxxxxx 10xxxxxx
 *    U+0800..U+0FFF             E0       A0..BF   80..BF
 *    U+1000..U+FFFF             E1..EF   80..BF   80..BF
 * 4                       21    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 *    U+10000..U+3FFFF           F0       90..BF   80..BF   80..BF
 *    U+40000..U+FFFFF           F1..F3   80..BF   80..BF   80..BF
 *    U+100000..U+10FFFF         F4       80..8F   80..BF   80..BF
 */
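For illustration only, here is a minimal validity check of my own that follows the table above (a sketch, not the JDK's implementation; the class and method names are invented):

public class Utf8Check {
    static boolean isLegal(byte[] b) {
        int i = 0;
        while (i < b.length) {
            int b1 = b[i++] & 0xff;
            int n;                    // continuation bytes expected
            int lo = 0x80, hi = 0xbf; // range allowed for the first continuation byte
            if (b1 <= 0x7f)                    n = 0;
            else if (b1 >= 0xc2 && b1 <= 0xdf) n = 1;
            else if (b1 == 0xe0)             { n = 2; lo = 0xa0; }
            else if (b1 >= 0xe1 && b1 <= 0xef) n = 2;
            else if (b1 == 0xf0)             { n = 3; lo = 0x90; }
            else if (b1 >= 0xf1 && b1 <= 0xf3) n = 3;
            else if (b1 == 0xf4)             { n = 3; hi = 0x8f; }
            else return false; // 0x80..0xc1 and 0xf5..0xff never start a legal sequence
            for (int j = 0; j < n; j++, i++) {
                if (i >= b.length) return false;
                int next = b[i] & 0xff;
                if (j == 0 ? (next < lo || next > hi)
                           : (next < 0x80 || next > 0xbf)) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isLegal(new byte[] { 0x41, 0x42, 0x43 }));         // true: "ABC"
        System.out.println(isLegal(new byte[] { (byte) 0xc1, (byte) 0x81 })); // false: overlong 'A'
    }
}

Note how the special first-continuation-byte ranges for E0, F0, and F4 are exactly what closes the non-shortest-form loophole (and the over-U+10FFFF range) in the naive bit pattern shown earlier.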

The next question usually is: "What would be the issue/problem if we keep using the old version of JDK/JRE?"

First, I'm not a lawyer (oops, I meant I'm not a security expert :-) ), so my word does not count. We consulted our security experts instead. Their conclusion is that while "it is not a security vulnerability in Java SE per se, it may be leveraged to attack systems running software that relies on the UTF-8 charset to reject these non-shortest form of UTF-8 sequences".

A simple scenario that might give you an idea about what the above "may be leveraged to attack..." really means:

  1. A Java application would like to filter the incoming UTF-8 input stream to reject certain keywords, for example "ABC".
  2. Instead of decoding the input UTF-8 byte sequences into Java char representation and then filtering out the keyword string "ABC" at the Java char level, for example:

    String utfStr = new String(bytes, "UTF-8");
    if ("ABC".equals(utfStr)) { ... }
    the application might choose to filter the raw UTF-8 byte sequence "0x41 0x42 0x43" (only) directly against the UTF-8 byte input stream, relying on (assuming) the Java UTF-8 charset to reject any non-shortest form of the target keyword, if there is any.
  3. The consequence is that the non-shortest-form input "0xc1 0x81 0xc1 0x82 0xc1 0x83" will slip through the filter and trigger a possible security vulnerability if the underlying JDK/JRE runtime is an OLD version, as the sketch after this list illustrates.
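To make the scenario concrete, here is a hypothetical sketch of such a flawed byte-level filter (the class and method names are invented for illustration). On an old JDK/JRE the overlong bytes sail past the filter yet still decode to "ABC"; on a fixed one they decode to replacement characters instead:

import java.io.UnsupportedEncodingException;

public class FilterBypass {
    // naive filter: scans only for the shortest-form bytes of "ABC"
    static boolean containsABC(byte[] in) {
        for (int i = 0; i + 2 < in.length; i++) {
            if (in[i] == 0x41 && in[i + 1] == 0x42 && in[i + 2] == 0x43)
                return true;
        }
        return false;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] overlong = { (byte) 0xc1, (byte) 0x81, (byte) 0xc1,
                            (byte) 0x82, (byte) 0xc1, (byte) 0x83 };
        System.out.println("filter sees \"ABC\"? " + containsABC(overlong)); // false
        // old JDK/JRE: prints "ABC"; fixed JDK/JRE: replacement characters
        System.out.println("decoded: " + new String(overlong, "UTF-8"));
    }
}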

So the recommendation is: Update to the latest JDK/JRE releases to avoid the potential risk.

Wait, there is another big bonus for updating: performance.

The UTF-8 charset implementation had not been updated or touched for years. UTF-8 is very widely used as the default encoding for XML, and more and more websites use UTF-8 as their page encoding. Given that, we had taken the defensive position of "don't change it if it works" over the past years.

So Martin and I decided to take this opportunity to give it a speed boost as well. The following data is from one run of my benchmark, which compares the decoding/encoding operations of the new and old implementations under the -server VM. (This is not an official benchmark; it is provided only to give a rough idea of the performance gain.)

The new implementation is much faster, especially when decoding or encoding single bytes (that is, ASCII). The new decoding and encoding are faster under the -client VM as well, but the gap is not as big as under -server. I wanted to show you the best :-)


Method                      Millis(new) Millis(old)
Decoding 1b UTF-8 : 1786 12689
Decoding 2b UTF-8 : 21061 30769
Decoding 3b UTF-8 : 23412 44256
Decoding 4b UTF-8 : 30732 35909
Decoding 1b (direct)UTF-8 : 16015 22352
Decoding 2b (direct)UTF-8 : 63813 82686
Decoding 3b (direct)UTF-8 : 89999 111579
Decoding 4b (direct)UTF-8 : 73126 60366
Encoding 1b UTF-8 : 2528 12713
Encoding 2b UTF-8 : 14372 33246
Encoding 3b UTF-8 : 25734 26000
Encoding 4b UTF-8 : 23293 31629
Encoding 1b (direct)UTF-8 : 18776 19883
Encoding 2b (direct)UTF-8 : 50309 59327
Encoding 3b (direct)UTF-8 : 77006 74286
Encoding 4b (direct)UTF-8 : 61626 66517
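If you want a rough feel for the difference on your own machine, a minimal (and admittedly unscientific) timing loop along the following lines will do. This is an assumption of mine, not the benchmark used above; the buffer size, iteration counts, and class name are arbitrary, and new String(byte[], Charset) needs Java 6 or later:

import java.nio.charset.Charset;

public class DecodeTiming {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        byte[] ascii = new byte[1024];               // "1b" (single-byte) input
        for (int i = 0; i < ascii.length; i++)
            ascii[i] = (byte) ('a' + i % 26);
        // warm up so the JIT compiles the decode path before we measure
        for (int i = 0; i < 10000; i++)
            new String(ascii, utf8);
        long t0 = System.nanoTime();
        for (int i = 0; i < 100000; i++)
            new String(ascii, utf8);
        System.out.printf("1b decode: %d ms%n",
                          (System.nanoTime() - t0) / 1000000);
    }
}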

The new UTF-8 charset implementation has been integrated into JDK 7, OpenJDK 6, JDK 6 update 11 and later, JDK 5.0u17, and 1.4.2_19.

If you are interested in what the change looks like, you can take a peek at the webrev of the new UTF_8.java for OpenJDK7.


