# Friday, 13 June 2014
Malformed UTF-16

While scanning the Java 8 version of the JVM spec for changes, I noticed this new text:

Names of methods, fields, local variables, and formal parameters are stored as unqualified names. An unqualified name must contain at least one Unicode code point and must not contain any of the ASCII characters . ; [ / (that is, period or semicolon or left square bracket or forward slash).

As usual, the spec and the HotSpot implementation don't agree on the details. In particular, the text says that a name must contain at least one Unicode code point, so it matters what a Unicode code point actually is. Suffice it to say that a single UTF-16 code unit in the surrogate range is not a valid Unicode code point, and yet HotSpot accepts such names without any complaint.
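To make "malformed UTF-16" concrete, here is a minimal Java sketch that flags strings containing a surrogate code unit that is not part of a well-formed high/low pair. The class and method names (`Utf16Check`, `isMalformedUtf16`) are made up for illustration; this is not how HotSpot or IKVM.NET validates names.

```java
final class Utf16Check {
    /**
     * Returns true if s contains a surrogate code unit that is not part
     * of a well-formed pair (high surrogate immediately followed by a
     * low surrogate).
     */
    static boolean isMalformedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true; // high surrogate with no low surrogate after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate with no high surrogate before it
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isMalformedUtf16("name"));         // false
        System.out.println(isMalformedUtf16("\uD83D\uDE00")); // false: a valid pair
        System.out.println(isMalformedUtf16("\uD800"));       // true: lone high surrogate
        System.out.println(isMalformedUtf16("\uDC00\uD800")); // true: pair in the wrong order
    }
}
```

A name such as `"\uD800"` is one UTF-16 code unit long, which is exactly the case where the spec text and HotSpot's behavior part ways.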

The spec is, of course, also incomplete, as it does not say what should happen with malformed UTF-16 in general. Or maybe it should just be changed to say "... UTF-16 code unit ..." to match the current behavior and not worry about surrogates at all.

On the .NET side, strings that aren't string literals in the code are stored as UTF-8, and the ECMA CLI spec requires valid UTF-8, but this is only partially enforced, which poses its own set of problems.

After discovering all this, I felt compelled to fix IKVM.NET to behave similarly to HotSpot in this regard and added support for escaping malformed UTF-16 strings when writing metadata and unescaping them when reading it back in.
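The general idea of escaping on write and unescaping on read can be sketched as follows. This is not IKVM.NET's actual escape format, which the post does not describe: the backslash marker, the four-hex-digit encoding, and the names `SurrogateEscaper`, `escape`, `unescape` and `roundTripUtf8` are all assumptions made for illustration. The sketch is in Java, where encoding a lone surrogate to UTF-8 silently replaces it with `?`, so the loss is easy to see.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateEscaper {
    // Hypothetical escape marker; a literal backslash is doubled.
    private static final char ESC = '\\';

    /** Rewrites unpaired surrogates as ESC + 4 hex digits so the result is well-formed UTF-16. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == ESC) {
                sb.append(ESC).append(ESC);
            } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                sb.append(c).append(s.charAt(++i)); // well-formed pair, copy as-is
            } else if (Character.isSurrogate(c)) {
                sb.append(ESC).append(String.format("%04X", (int) c)); // lone surrogate
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Inverse of escape; assumes its input was produced by escape. */
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != ESC) {
                sb.append(c);
            } else if (s.charAt(i + 1) == ESC) {
                sb.append(ESC);
                i++;
            } else {
                sb.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                i += 4;
            }
        }
        return sb.toString();
    }

    /** Simulates storing a string as UTF-8 and reading it back. */
    static String roundTripUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "a\uD800b";
        // Without escaping, the lone surrogate does not survive the UTF-8 round trip.
        System.out.println(roundTripUtf8(name).equals(name));   // false
        // With escaping, the original malformed string is recovered.
        String stored = roundTripUtf8(escape(name));
        System.out.println(unescape(stored).equals(name));      // true
    }
}
```

Any reversible scheme would do here; the only real requirements are that the escaped form is valid for the storage encoding and that the marker itself can be escaped.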

Finally, let's end with a C# example that shows some of the weirdness on the .NET side:

```csharp
using System;
using System.ComponentModel;

[DisplayName("\uD800")]
class Surrogate {
  static void Main() {
    var attr = (DisplayNameAttribute)typeof(Surrogate)
                 .GetCustomAttributes(false)[0];
    Console.WriteLine(attr.DisplayName == "\uD800");
  }
}
```
Friday, 13 June 2014 13:43:41 (W. Europe Daylight Time, UTC+02:00)