# Friday, 13 June 2014
« New Development Snapshot | Main | New Development Snapshot »
Malformed UTF-16

While scanning the Java 8 version of the JVM spec for changes, I noticed this new text:

Names of methods, fields, local variables, and formal parameters are stored as unqualified names. An unqualified name must contain at least one Unicode code point and must not contain any of the ASCII characters . ; [ / (that is, period or semicolon or left square bracket or forward slash).

As per usual, the spec and HotSpot implementation don't agree on the details. In particular the text says that a name must contain at least one Unicode code point. At this point it becomes important what a Unicode code point actually is. Suffice it to say that a single UTF-16 code unit in the surrogate range is not a valid Unicode code point and that such names are accepted by HotSpot without any complaint.

The spec, of course, is also incomplete as it does not address what should happen in case of malformed UTF-16 in general. Or maybe it should just be changed to say "... UTF-16 code unit ..." to match the current behavior and not worry about surrogates at all.

On the .NET side, the strings that aren't string literals in the code are stored as UTF-8 and the ECMA CLI spec requires valid UTF-8, but this is only partially enforced which poses its own set of problems.

After discovering all this, I felt compelled to fix IKVM.NET to behave similarly to HotSpot in this regard and added support for escaping malformed UTF-16 strings when writing metadata and unescaping them when reading it back in.

Finally, let's end with a C# example that shows some of the weirdness on the .NET side:

using System;
using System.ComponentModel;

[DisplayName("\uD800")]
class Surrogate {
  static void Main() {
    var attr = (DisplayNameAttribute)typeof(Surrogate)
                 .GetCustomAttributes(false)[0];
    Console.WriteLine(attr.DisplayName == "\uD800");
  }
}
Friday, 13 June 2014 13:43:41 (W. Europe Daylight Time, UTC+02:00)  #    Comments [2]
Friday, 13 June 2014 19:51:05 (W. Europe Daylight Time, UTC+02:00)
"I felt compelled" - that should be the title for a talk at FOSDEM 2015, and then we should discuss over beers!
Saturday, 14 June 2014 09:15:54 (W. Europe Daylight Time, UTC+02:00)
I might just do that. There's actually a method to my madness (or so I'd claim).
Name
E-mail
Home page

I apologize for the lameness of this, but the comment spam was driving me nuts. In order to be able to post a comment, you need to answer a simple question. Hopefully this question is easy enough not to annoy serious commenters, but hard enough to keep the spammers away.

Anti-Spam Question: What method on java.lang.System returns an object's original hashcode (i.e. the one that would be returned by java.lang.Object.hashCode() if it wasn't overridden)? (case is significant)

Answer:  
Comment (HTML not allowed)  

Live Comment Preview