Since Java 9/11 (9 is not an LTS), String internals were reworked to use either 8 ...

HelloNurse · on March 22, 2022

BMP forever! The most disgusting thing I've read today.

valleyer · on March 22, 2022

Am I misunderstanding? UTF-16 can represent all Unicode characters, not just the BMP.

native_samples · on March 22, 2022

UTF-16 represents a fairly reasonable compromise, not sure what your disgust is for.

UTF-32 (with no BMP concept) doubles the memory usage of most international text and quadruples the memory usage of ASCII text (which is the most common), yet characters outside the BMP are barely used outside of emoji.
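A rough way to check those ratios is to encode a few sample strings and count bytes. This is only a sketch: the class and method names are made up for illustration, and it assumes the JDK in use ships a `UTF-32BE` charset (OpenJDK does, but only newer releases list it in `StandardCharsets`).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: byte counts for the same text in three encodings.
final class EncodingSizes {
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // The BE variant is used so that no byte order mark is added.
    static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Assumption: the runtime provides a charset named "UTF-32BE".
    static int utf32Bytes(String text) {
        return text.getBytes(Charset.forName("UTF-32BE")).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "hello",                 // ASCII
            "\u65E5\u672C\u8A9E",    // three CJK characters inside the BMP
            "\uD83D\uDE00"           // U+1F600, outside the BMP
        };
        for (String s : samples) {
            System.out.printf("UTF-8=%d UTF-16=%d UTF-32=%d%n",
                utf8Bytes(s), utf16Bytes(s), utf32Bytes(s));
        }
    }
}
```

For "hello" this gives 5, 10 and 20 bytes, which is where the "quadruples ASCII" figure comes from when compared with an 8-bit representation.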

Native UTF-8 in memory makes character indexing a non-constant time operation, which would bite people badly in cases where they've written a loop over the indexes. This is of course the point at which you say, ah but what is a character exactly. If you go down this route you end up with Swift and Emoji Flag Calculus classes. The string APIs become incredibly convoluted or inefficient for the common cases. It hardly seems worth any kind of backwards compatibility break for this.
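To illustrate why indexing into UTF-8 is not constant time, here is a minimal sketch of finding the n-th code point in a UTF-8 byte array. The class and method are hypothetical, and the input is assumed to be valid UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a JDK API.
final class Utf8Index {
    // Returns the byte offset at which the n-th code point (0-based) starts.
    // It has to walk the bytes from the start, so the cost grows with n.
    static int byteOffsetOfCodePoint(byte[] utf8, int n) {
        int seen = 0;
        for (int i = 0; i < utf8.length; i++) {
            // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
            boolean startsCodePoint = (utf8[i] & 0xC0) != 0x80;
            if (startsCodePoint) {
                if (seen == n) {
                    return i;
                }
                seen++;
            }
        }
        throw new IndexOutOfBoundsException("code point index " + n);
    }

    public static void main(String[] args) {
        // 'h' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), '!' (1 byte)
        byte[] utf8 = "h\u00E9\u20AC!".getBytes(StandardCharsets.UTF_8);
        for (int n = 0; n < 4; n++) {
            System.out.println(n + " -> byte " + byteOffsetOfCodePoint(utf8, n));
        }
    }
}
```

A loop that calls something like this for every index is quadratic overall, which is the trap described above.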

So Java does the pragmatic thing: String can switch between 8 or 16 bits per "character" and this is basically always good enough. If you care about working with emoji or Egyptian hieroglyphs in memory, then you either have to deal with combining characters or just bite the bullet and decode to UTF-32.
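"Decode to UTF-32" can be as small as the sketch below, which uses the standard `String.codePoints()` stream and the `String(int[], int, int)` constructor. The wrapper class and method names are only illustrative.

```java
// Hypothetical wrapper around standard JDK calls.
final class Utf32Decode {
    // One int per code point, which is effectively UTF-32 in memory.
    static int[] decode(String text) {
        return text.codePoints().toArray();
    }

    // Re-encode an array of code points as a String.
    static String encode(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        int[] cps = decode(text);
        // cps[1] is the whole emoji; text.charAt(1) would be only half of it.
        System.out.printf("U+%X%n", cps[1]);             // prints U+1F600
        System.out.println(encode(cps).equals(text));    // prints true
    }
}
```

Indexing into the `int[]` is constant time per code point, at the cost of four bytes for each one.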

kevincox · on March 22, 2022

> Native UTF-8 in memory makes character indexing a non-constant time operation

The only reason that Java's UTF-16 has constant time indexing is because they use a braindead definition of character which is "UTF-16 code unit".
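Concretely, a Java `char` is one 16-bit UTF-16 unit, so `length()` and `charAt()` count and return halves of surrogate pairs. A small sketch using only standard `String` and `Character` methods (the class and helper names are made up):

```java
// Hypothetical demo class; every call inside is a standard JDK method.
final class CodeUnits {
    // Length as Java reports it: UTF-16 code units. Constant time.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of code points; has to scan for surrogate pairs.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "x\uD83D\uDE00y"; // 'x', U+1F600, 'y'
        System.out.println(codeUnits(s));    // 4
        System.out.println(codePoints(s));   // 3
        // charAt(1) is only the first half of the emoji.
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        // Index of the third code point ('y'), found by scanning from the start.
        System.out.println(s.offsetByCodePoints(0, 2));             // 3
    }
}
```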

If you want constant time character indexing you need to go UTF-32. But obviously the downsides are too great for most users. So in practice everyone uses UTF-8 because it is usually the most memory efficient.

Plus it turns out that character indexing isn't actually that common of an operation, so it is really the right move for almost every application.

remexre · on March 22, 2022

UTF-32 isn't really a solution either, unless you consider a scalar value to be a character; I bet almost nobody wants U+0308 to be "a character"...
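For example, "u" followed by U+0308 is two scalar values but reads as a single "ü". The sketch below counts user-perceived characters with the JDK's `BreakIterator`; the class and method names are hypothetical, and how well `BreakIterator` handles newer emoji sequences depends on the JDK version.

```java
import java.text.BreakIterator;
import java.text.Normalizer;

// Hypothetical demo class built on java.text.
final class CombiningMarks {
    // Counts user-perceived characters using the JDK's character BreakIterator.
    static int perceivedCharacters(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String decomposed = "u\u0308"; // 'u' + U+0308 COMBINING DIAERESIS
        System.out.println(decomposed.codePointCount(0, decomposed.length())); // 2
        System.out.println(perceivedCharacters(decomposed));                   // 1
        // NFC normalization composes the pair into the single code point U+00FC.
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.printf("U+%04X%n", (int) composed.charAt(0));               // U+00FC
    }
}
```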

native_samples · on March 24, 2022

But in practice the Java definition of a character basically always works, because characters that aren't in the BMP are vanishingly rare in real software outside of emoji, and of course, Java long pre-dates emoji.