
Daniel Lemire's points about low-level hardware optimization notwithstanding, it's worth pointing out that binary search (or low-level implementation variants) is optimal only if you know nothing about the data beyond the fact that it is sorted / monotonic.

If you have priors about the data distribution, then it's possible to design algorithms which use that extra information to perform MUCH better. E.g. a human searching a physical paper dictionary can zoom into the right bunch of pages faster than pure idealized binary search; it's a separate matter that it's hard for humans to continue binary search to the very end, and we might default to scanning linearly for the last few iterations (cognitive convenience / affordances of human wetware / etc.).

In mathematical language, searching a sorted list is basically inverting a monotonic function using a closed-loop control algorithm. Often, we could very well construct a suitable cost function and use gradient descent or its accelerated cousins.

More generally, the best bet for solving a problem more efficiently is always to use more information about the specific problem you want to solve, instead of pulling up the solution for an overly abstract representation. That can offer scalable orders-of-magnitude speedups, compared to the constant-factor speedups from just using the hardware better.


I've spent some brainpower on binary search and have not been able to beat this:

https://github.com/protocolbuffers/protobuf/blob/44025909eb7...

1. Check for dense list O(1)

2. Check upper bound

3. Constant trip count binary search

The constant trip count is great for the branch predictor, and the core loop is pretty tightly optimized for the target hardware, avoiding multiplies. Every attempt to get more clever made the loop worse and did not pay for itself. It's hard because it's an array-of-structs format with a 12-byte element size, and mostly pretty small N.
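
To make the pattern concrete, here's a minimal generic sketch of a constant-trip-count lower bound - not the upb code itself, and the Field struct is made up to mirror the 12-byte array-of-structs shape:

    #include <stdint.h>

    // Hypothetical 12-byte entry standing in for the real format.
    typedef struct { uint32_t number; uint8_t payload[8]; } Field;

    // Preconditions (steps 1-2 above already ran): n >= 1 and
    // key <= base[n-1].number. The sequence of `len` values depends
    // only on n, never on the data, so the trip count is fixed.
    static const Field* lower_bound_const_trips(const Field* base,
                                                uint32_t n, uint32_t key) {
        uint32_t len = n;
        while (len > 1) {
            uint32_t half = len / 2;
            // Branchless: compilers typically emit a conditional move.
            if (base[half - 1].number < key) base += half;
            len -= half;
        }
        return base;  // candidate; caller checks base->number == key
    }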


For a pretty small N I've found that less clever can be quite a bit faster. I'd try a linear search - possibly SIMD if you can change the data format to struct-of-arrays. An adaptive approach that uses linear search up to a certain N can also yield some benefit.
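
A sketch of that adaptive idea (the cutoff is invented - in practice you'd pick it by benchmarking on the target hardware):

    #include <stdint.h>

    #define LINEAR_CUTOFF 16  // hypothetical; tune per platform

    // Returns the index of key in sorted a[0..n-1], or -1 if absent.
    static int find_adaptive(const uint32_t* a, int n, uint32_t key) {
        if (n <= LINEAR_CUTOFF) {
            for (int i = 0; i < n; i++) {   // small N: plain scan
                if (a[i] == key) return i;
                if (a[i] > key) break;      // sorted: key can't follow
            }
            return -1;
        }
        int lo = 0, hi = n - 1;             // large N: binary search
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (a[mid] == key) return mid;
            if (a[mid] < key) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }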

The first implementation I encountered was a linear search, starting at the last-found field. Empirically it performed better to do a binary search with early exit and branchless bounds selection, I think due to branch predictor pressure. The data representation could be changed but it's tricky, as there are other traversals that want to go in sorted order, and there are lots of places that pass just one pointer for fields. But I agree any further improvement will probably have to come from that.

SIMD is tricky even with SoA because there is significant latency going between the general registers and the vector units, plus ARM little cores can be configured to share a vector unit with another core.


> SIMD is tricky even with SoA because there is significant latency going between the general registers and the vector units

My experience is mostly limited to AMD64, but libraries like glibc use SIMD in many places for faster linear search. Presumably they've done testing and found it worthwhile.


Yeah, ARM little cores are a very different story - they aren't superscalar out-of-order architectures; they can dispatch up to two operations per cycle.

Big cores are more like that, dispatching 8 or more operations per cycle, but they're also more expensive, larger, etc.


If you control the layout, eytzinger layout typically will give you the best of both worlds. As fast as a linear scan for small N, much faster than binary search over a sorted array for large N.
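
For reference, a sketch of the usual branchless Eytzinger lookup (1-based array in BFS order, children of node k at 2k and 2k+1; __builtin_ffs is the GCC/Clang builtin, and n is assumed < 2^31):

    #include <stdint.h>

    // Returns the Eytzinger index of the first element >= key,
    // or 0 if every element is smaller.
    static uint32_t eytzinger_lower_bound(const uint32_t* a, uint32_t n,
                                          uint32_t key) {
        uint32_t k = 1;
        while (k <= n) {
            // Going right (a[k] < key) appends a 1 bit; left appends a 0.
            k = 2 * k + (a[k] < key);
        }
        // Cancel the trailing right-turns plus the final step, landing
        // on the last node where we went left: the lower bound.
        k >>= __builtin_ffs(~k);
        return k;
    }

The layout wins because each level of the implicit tree is contiguous, so the top levels stay hot in cache and the deeper levels prefetch well.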

I know protobuf code is extremely high quality, but I really can't stand the c-style naming conventions.

I know people train themselves into grokking this and reading and emitting this way, but it sounds like writing "bork bork bork bork" runes to me.

I'm glad Rust feels more like Ruby and Python and that method and field names are legible.

My eyes just glaze over:

    UPB_API_INLINE
    const struct upb_MiniTableField* upb_MiniTable_FindFieldByNumber(
        const struct upb_MiniTable* m, uint32_t number) {
      const uint32_t i = number - 1;  // 0 wraps to UINT32_MAX
    
      // Ideal case: index into dense fields
      if (i < m->UPB_PRIVATE(dense_below)) {
        UPB_ASSERT(m->UPB_ONLYBITS(fields)[i].UPB_ONLYBITS(number) == number);
        return &m->UPB_ONLYBITS(fields)[i];
      }
    
      // Early exit if the field number is out of range.
      uint32_t hi = m->UPB_ONLYBITS(field_count);
      uint32_t lo = m->UPB_PRIVATE(dense_below);
      UPB_ASSERT(hi >= lo);
      uint32_t search_len = hi - lo;
      if (search_len == 0 ||
          number > m->UPB_ONLYBITS(fields)[hi - 1].UPB_ONLYBITS(number)) {
        return NULL;
      }
    
      // Slow case: binary search
      const struct upb_MiniTableField* candidate;
    #ifndef NDEBUG
      candidate = UPB_PRIVATE(upb_MiniTable_ArmOptimizedLowerBound)(
          m, lo, search_len, number);
      UPB_ASSERT(candidate ==
                 UPB_PRIVATE(upb_MiniTable_LowerBound)(m, lo, search_len, number));
    #elif UPB_ARM64_ASM
      candidate = UPB_PRIVATE(upb_MiniTable_ArmOptimizedLowerBound)(
          m, lo, search_len, number);
    #else
      candidate = UPB_PRIVATE(upb_MiniTable_LowerBound)(m, lo, search_len, number);
    #endif
    
      return candidate->UPB_ONLYBITS(number) == number ? candidate : NULL;
    }

> I really can't stand the c-style naming conventions.

Honestly I don't see much difference between

  upb_MiniTable_FindFieldByNumber
and

  upb::MiniTable::FindFieldByNumber

Those are fairly indistinguishable. It's when they start removing letters from words to save... debug symbol bytes or something? That's when c-style naming annoys me.

In their defence "hi" sounds very much like "high" in my mind's ear and "lo" like "low" :)

This is also my pet peeve with a lot of code as well as commands like

    npm -g i package-name 
Like why would you teach people to do this? I understand people needed to save precious bytes in the sixties so we have cat and ls but saving 192 bytes or whatever with shorter variable names is not a worthwhile tradeoff anymore.

What exactly bothers you about this and what would you prefer to see?

I would prefer to see full names like

    npm install --global @scope/PackageName 
At least the hi and lo can get more meaningful names. And over time we can write this in another language with private/scoped methods.

Yeah namespaces and public/private would be quite nice, but C doesn't have them, so they get hacked on via macros and prefixing. The syntax was not the hard part of working or analyzing this code, though.

I think this needs way more "upb" and "UPB" to make it clear that it is, in fact, dealing with UPBs. Whatever these are.

> μpb (often written 'upb') is a small protobuf implementation written in C.

I swear I read an article about treaps but instead of being used to balance the tree, they used the weights to Huffman-encode the search depth to reduce the average access time for heterogeneous fetch frequencies.

I did not bookmark it and about twice a year I go searching for it again. Some say he’s still searching to this day.


Huffman coding assumes your corpus is a string of discrete elements (symbol strings) without any continuous structure (e.g. topology/geometry). With that fairly mild assumption, it gives a recipe to reorganize (transform/encode) your data as a prefix tree, to minimize the bits of information needed to communicate the contents of your corpus, i.e. reducing (on average) the bits of information you need to identify a specific item. To go back to the analogy from my previous comment above: if the function you are inverting via search has long plateaus, then you could simply front-load those as guesses; that's roughly the spirit of Huffman coding, except it eschews monotonicity.
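
A hypothetical sketch of that spirit applied to searching - split at the weighted median of (estimated) access frequencies instead of the midpoint, so frequently fetched keys sit at shallow depths, the same trade Huffman coding makes with code lengths. All names and the prefix-sum representation are inventions for illustration:

    #include <stdint.h>

    // a[0..n-1]: sorted keys; prefix[0..n]: prefix sums of estimated
    // access frequencies (prefix[i] = freq[0] + ... + freq[i-1]).
    // Call with lo = 0, hi = n - 1, n >= 1.
    static int weighted_search(const uint32_t* a, const uint64_t* prefix,
                               int lo, int hi, uint32_t key) {
        while (lo < hi) {
            // Split where the remaining weight is halved, not at the
            // midpoint (linear scan kept only for clarity).
            uint64_t half = (prefix[lo] + prefix[hi + 1]) / 2;
            int mid = lo;
            while (mid + 1 < hi && prefix[mid + 1] < half) mid++;
            if (a[mid] < key) lo = mid + 1;
            else hi = mid;
        }
        return (a[lo] == key) ? lo : -1;
    }

The expected probe depth then tracks the entropy of the access distribution rather than log2(n).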


Sure, but the whole point is that you often don't know anything further about the data.

That's why b-trees are the standard in databases. The data could be anything, and its characteristics could massively change at any time, as you suddenly import a whole bunch of new rows at once.

And while you can certainly design algorithms around e.g. gradient descent to try to accelerate lookup, b-trees are already incredibly fast, and have lots of other benefits like predictable worse-case performance and I/O requirements, supporting range scans, ordered traversal, prefix conditions, etc.

So yes, you can certainly design lookup algorithms that are more efficient for particular data distributions, but they will also often lack other important properties. And b-trees are already so fast, improvements are often negligible -- like even if another algorithm produces a closer initial guess, it may be slower to locate the final item, or it may be faster on average but have horrible worst-case performance that makes it unusable.

Even with a paper dictionary, I've always used pretty much a binary search beyond the first initial guess, which only saves you a couple of hops. And actually, once I get to the right handful of pages I'm probably more linear than I should be, and I'd probably be faster if I tried to do a rigorous binary search, but I have to balance that with how long it takes to flip pages.


Databases often use table statistics to try to do better at generating query plans. I wonder if they use them to make indexes faster as well?

The cost plan is a crude approximation of the actual query cost. Sometimes, the query planner makes a terrible guess. Your resident DBA won't appreciate being paged at 3 AM on a Sunday because of it. A good strategy is to freeze the query plan once you have a sufficient sample size of data in the involved tables.

Unfortunately, "freezing the query plan" isn't something available in many popular databases. It's not supported by e.g. Postgres, MySQL, or SQLite.

> If you have priors about the data distribution, then it's possible to design algorithms which use that extra information to perform MUCH better.

You don't even need priors. See interpolation search, where knowing the position and value of two elements in a sorted list already allows the search to make an educated guess about where the element it's searching for is, by interpolating between those elements.
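
A minimal sketch, assuming sorted uint32_t keys and plain linear interpolation between the current endpoints (expected O(log log n) probes on uniformly distributed keys, but it can degrade badly on skewed data, as discussed below):

    #include <stdint.h>

    static int interpolation_search(const uint32_t* a, int n, uint32_t key) {
        int lo = 0, hi = n - 1;
        while (lo <= hi && key >= a[lo] && key <= a[hi]) {
            if (a[hi] == a[lo])  // flat range: avoid division by zero
                return a[lo] == key ? lo : -1;
            // Probe where the key "should" be if values grow linearly.
            int mid = lo + (int)(((uint64_t)(key - a[lo]) * (uint64_t)(hi - lo))
                                 / (a[hi] - a[lo]));
            if (a[mid] == key) return mid;
            if (a[mid] < key) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;
    }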


> knowing the position and value of two elements in a sorted list

That's a prior about the distribution, if a relatively weak one (in some sense, at least).



This relies on knowledge of the distribution; interpolating on A = [1, 2, 4, 8, 16, ..., 2^(n-1)] is slower than just querying in the middle, i.e. binary search.

> just querying in the middle

It's an interpolation search. You interpolate the values you evaluated by whatever method you'd like. No one forces you to do linear interpolation. You can very easily fit a quadratic polynomial with the last 3 points, for example.

Interpolation search seems to have a convergence rate of log log n on uniformly distributed data. That's pretty efficient.


> it's worth pointing out that binary search (or low-level implementation variants) is the best only if you know nothing about the data beyond the fact that it is sorted / monotonic

Also if you do not learn anything about the data while performing the binary search, no? Like, if you are constantly below the estimate, you could guess that the distribution is biased toward large values and adjust your guess based on this prediction.


> Also [IFF] you do not learn anything about the data while performing the binary search, no?

Yes, absolutely!

I forgot to share this general perspective above, and it's too late to edit, so I'll add it here...

Since binary search assumes only monotonicity, splitting your interval into two equal parts extracts one bit of information per step, and any other choice would extract less information on average. One bit of information per step is how you end up needing log(n) steps to find the answer.

To accelerate your search, you basically need to extract those log(n) bits as fast as you can. You can think of that as leveraging both the prior, and everything you learn along the way -- to adaptively design each step to be the optimal experiment to extract maximum amount of information. And adaptive local models of your search space (gradient / hessian / etc) allow you to extract many more bits of information from each query / experiment, provided the function you are inverting has some local structure.
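
To spell out the bit counting (just the standard information-theoretic argument): a query that splits the remaining candidates into fractions p and 1-p yields on average

    H(p) = -p*log2(p) - (1-p)*log2(1-p) bits,  with H(p) <= 1 and equality iff p = 1/2

and locating one item among n requires log2(n) bits in total, so without extra information no comparison strategy can beat ~log2(n) steps on average.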

PS: That is why we leverage these ideas to "search" for the optimum, among a space of solutions.


For a list of sorted values with no other knowledge, the binary search is optimal. Provably so: it is simple information theory on binary information.

You can do better if the list is stable by reusing information.

But gathering that information during searches is going to require great complexity to leverage, as searches are an irregular information gathering scheme.

So create RAM lookup structures for speedups up front.

1) Create a table that maps the first 8 bits to upper and lower indexes in the list. Then binary search over the last 8 bits. That cuts the search time in half.

2) Go all the way, and create an array of 32,768 indexes, with all-1s entries for misses. Either way, search returns in O(1).

Stable lists allow for sliding parametric trade-offs between RAM lookup vs. binary search - from full lookup to full binary.
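
A sketch of option 1, assuming 16-bit keys (all names invented): a 256-entry table maps the top byte to the slice of the sorted array holding that prefix, and binary search then only has to resolve the low byte.

    #include <stdint.h>

    typedef struct { uint32_t lo, hi; } Range;  // half-open [lo, hi)

    // Built once, up front, from the stable sorted array a[0..n-1].
    static void build_top_byte_index(const uint16_t* a, uint32_t n,
                                     Range idx[256]) {
        uint32_t i = 0;
        for (int b = 0; b < 256; b++) {
            idx[b].lo = i;
            while (i < n && (a[i] >> 8) == b) i++;
            idx[b].hi = i;
        }
    }

    static int find(const uint16_t* a, const Range idx[256], uint16_t key) {
        uint32_t lo = idx[key >> 8].lo, hi = idx[key >> 8].hi;
        while (lo < hi) {                   // at most ~8 halvings left
            uint32_t mid = (lo + hi) / 2;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        return (lo < idx[key >> 8].hi && a[lo] == key) ? (int)lo : -1;
    }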


It's not possible to learn anything about other elements when performing binary search, _except_ the only thing there is to learn: if the target is before or after the recently compared element.

If we would guess that there is a bias in the distribution based on recently seen elements, the guess is at least as likely to be wrong as it is to be right. And if we guess incorrectly, in the worst case, the algorithm degrades to a linear scan.

Unless we have prior knowledge. For example: if there is a particular distribution, or if we know we're dealing with integers without any repetition (i.e. each element is strictly greater than the previous one), etc.


> It's not possible to learn anything about other elements when performing binary search, _except_ the only thing there is to learn: if the target is before or after the recently compared element.

You have another piece of information, you don't only know if the element was before or after the compared element. You can also know the delta between what you looked at and what you're looking for. And you also have the delta from the previous item you looked at.


And you always start off knowing the total length of the array, and the width of the datatype.

Actually deciding what to do with that information without incurring a bunch more cache misses in the process may be tricky.


Is the disconnect here that in many datasets there is some implicit distribution? For example if we are searching for english words we can assume that the number of words or sentences starting with "Q" or "Z" is very small while the ones starting with "T" are many. Or if the first three lookups in a binary search all start with "T" we are probably being asked to search just the "T" section of a dictionary.

Depending on the problem space such assumptions can prove right enough to be worth using despite sometimes being wrong. Of course if you've got the compute to throw at it (and the problem is large) take the Contact approach: why do one when you can do two in parallel for twice the price (cycles)?


Assuming your key space is anything like randomly distributed.

Thinking about it - yeah, if you can anticipate anything like a random distribution, it's a few extra instructions to reduce the number of values looked up. In the old days that would have been very unlikely to be a good deal, but with so many algorithms dominated by the cache (I've seen more than one case where a clearly less efficient algorithm that reduced memory reads turned out better), I suspect there are a lot of such things that don't go the way we learned them in the stone age.


> If we would guess that there is a bias in the distribution based on recently seen elements, the guess is at least as likely to be wrong as it is to be right.

This is true for abstract and random data. I don't think it's true for real world data.

For example, python's sort function "knows nothing" about the data you're passing in. But, it does look for some shortcuts and these end up saving time, on average.


Fritz Henglein has done some interesting work on fast sorting/grouping. I think Generic Discrimination Sorting and Partitioning Unshared Data in Linear Time[1] is the main paper. Ed Kmett took those ideas and refined them into the discrimination[2] library for Haskell, and gave a very interesting tech talk about it[3].

[1]: https://dl.acm.org/doi/epdf/10.1145/1411203.1411220

[2]: https://hackage.haskell.org/package/discrimination

[3]: https://www.youtube.com/watch?v=cB8DapKQz-I


> In mathematical language, searching a sorted list is basically inverting a monotonic function, by using a closed-loop control algorithm.

Never thought about it this way. Brilliant!


More accurately, binary search is optimal only if you cannot determine distance between the data points (you can compute `<` but not `-`). It's inaccurate to say it's optimal if you know "nothing" about the distribution. If you encounter a point that's much closer to your high pivot vs much closer to your low pivot, there is no possible prior knowledge state uninformed enough to conclude that the best place to search in both cases is in the middle.

>More generally, the best bet to solving a problem more efficiently is always to use more information about the specific problem you want to solve

It is both obvious and profound: the more information you already have, the more information you already have.


For humans, binary searching a dictionary is slower because it requires a different physical action vs. scanning, and we have advanced OCR and page-flipping capabilities. Especially recognizing when something hasn't changed (e.g. still on the Es, keep flipping) is maybe 10ms.

No, the point is you know exactly where T is just by looking at the dictionary (or at least, you learn this if you use a dictionary a lot).

IOW your prior on the data distribution lets you skip the first 4-5 binary chops.


Furthermore, with the vast and immediate knowledge that LLMs have, we could see a proliferation of domain-specific sorting algorithms designed for all types of purposes.

> use that extra information to perform MUCH better

Do you mean using a better estimator for the median value? Or something else?


Say, if you know the function is a polynomial of degree N, with N+1 datapoints you can find it – e.g. with Lagrange's polynomial, although the finite precision of computer numbers might make that more complex.
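
For illustration, a naive sketch of evaluating the Lagrange interpolant through n_pts = N+1 points at a query point x - O(N^2) per evaluation, and numerically delicate in floating point, per the caveat above:

    // xs[i], ys[i]: the N+1 sample points; returns the degree-N
    // interpolating polynomial evaluated at x.
    static double lagrange_eval(const double* xs, const double* ys,
                                int n_pts, double x) {
        double sum = 0.0;
        for (int i = 0; i < n_pts; i++) {
            double term = ys[i];
            for (int j = 0; j < n_pts; j++)
                if (j != i) term *= (x - xs[j]) / (xs[i] - xs[j]);
            sum += term;
        }
        return sum;
    }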

Obvious follow-up that's begging to be asked -- if you like nix and want a lisp, have you tried guix/guile?

Couldn't help riffing off on a tangent from the title (since the article is about diagramming tools)...

Dylan Beattie has a thought-provoking presentation for anyone who believes that "plain text" is a simple / solid substrate for computing: "There's no such thing as plain text" https://www.slideshare.net/slideshow/theres-no-such-thing-as... (you'll find many videos from different conferences)


Haven't watched the videos yet, but from the slides, it looks like part of the issue he was talking about was encodings (there's a slide illustrating UTF-16LE vs UTF-16BE, for example). Thankfully, with UTF-8 becoming the default everywhere (so that you need a really good reason not to use it for any given document), we're back at "yes, there is such a thing as plain text" again. It has a much larger set of valid characters, but if you receive a text file without knowing its encoding, you can just assume it's UTF-8 and have a 99.7% chance of being right.

FINALLY.


The point is, a lot of work went into making that happen. I.e., plain text as it is today is not some inherent property of computing. It is a binary protocol and displaying text through fonts is also not a trivial matter.

So my question is: what are we leaving on the table by over focusing on text? What about graphs and visual elements?


TUIs can include these; see the kitty graphics protocol, implemented by most if not all modern terminals.

https://sw.kovidgoyal.net/kitty/graphics-protocol/


I was not very descriptive, but I was referring to the next layer up of building blocks. Instead of text, we could also express things in hybrid ways with text but also visual nodes that can carry more dense information. The usual response is that those things don't work with text-based tools, but that's my point. Text based tools needed invention and decades of refinement, and they're still not all that great.

And what do we gain by leaving things on the table?

Until you hit a CSV exported by Excel

I should have said "a text file with no byte-order mark". I would hope that Excel's CSV export, if it's writing UTF-16, is writing a byte-order mark first (though I don't have any Excel-exported CSVs lying around right now to check). The byte-order mark is necessary for UTF-16 since it has big-endian and little-endian variants, but unnecessary (and actually harmful in a few situations) for UTF-8.

So naturally, if you assume something is UTF-8 but the first few bytes you encounter are FF FE or FE FF (both of which are illegal in UTF-8), then instead of throwing an error saying "Hey, that's illegal UTF-8, buddy!" you should just reparse in UTF-16 (and you now know the correct byte order to use). In fact, you should read four bytes just to make sure you're not seeing FF FE 00 00, because that would indicate a UTF-32LE document. (Which indicates an ambiguity in UTF-16 that UTF-8 doesn't have: a UTF-16 document that begins with a null byte is likely to be misinterpreted as UTF-32LE.)
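
That sniffing logic is small enough to sketch (the function shape is hypothetical, but the BOM byte sequences are the standard ones; note the UTF-32 checks must come before the UTF-16 ones):

    #include <stddef.h>

    typedef enum { ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE,
                   ENC_UTF32LE, ENC_UTF32BE } Encoding;

    static Encoding sniff_encoding(const unsigned char* p, size_t n) {
        if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0 && p[3] == 0)
            return ENC_UTF32LE;                       // FF FE 00 00
        if (n >= 4 && p[0] == 0 && p[1] == 0 && p[2] == 0xFE && p[3] == 0xFF)
            return ENC_UTF32BE;                       // 00 00 FE FF
        if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return ENC_UTF16LE;
        if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return ENC_UTF16BE;
        return ENC_UTF8;  // default; a UTF-8 BOM (EF BB BF) is harmless here
    }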

Before I go off on too much of a rabbit trail, I have two points I want to make:

1. Since UTF-8 should be the default assumption for any sensible software, a byte-order mark is not needed for UTF-8, but any non-UTF-8 encoding should use a byte-order mark. (And in fact needs a BOM, because both UTF-16 and UTF-32 have LE and BE variants).

2. Excel needs to fix its stupid CSV import/export defaults.


Another classic from Microsoft - the Language Server Protocol is UTF-16. We could be paying that price for the rest of time.

vaxocentrism, or “All the World’s a VAX”

http://www.catb.org/esr/jargon/html/V/vaxocentrism.html


> Thankfully, with UTF-8 becoming the default everywhere (so that you need a really good reason not to use it for any given document), we're back at "yes, there is such a thing as plain text" again.

Whenever I hear this, I hear "all text files should be 50% larger for no reason".

UTF-8 is pretty similar to the old code page system.


Hm? UTF-8 encodes all of ASCII with one byte per character, and is pretty efficient for everything else. I think the only advantage UTF-16 has over UTF-8 is that some ranges (such as Han characters I believe?) are often 3 bytes of UTF-8 while they're 2 bytes of UTF-16. Is that your use case? Seems weird to describe that as "all text files" though?

UTF-8 encodes European glyphs in two bytes and oriental glyphs in three bytes. This is due to the assumption that you're not going to be using oriental glyphs. If you are going to use them, UTF-8 is a very poor choice.

UTF-8 does not encode "European glyphs" in two bytes, no. Most European languages use variations of the latin alphabet, meaning most glyphs in European languages use the 1-byte ASCII subset of UTF-8. The occasional non-ASCII glyph becomes two bytes, that's correct, but that's a much smaller bloat than what you imply.

Anyway, what are you comparing it to, what is your preferred alternative? Do you prefer using code pages so that the bytes in a file have no meaning unless you also supply code page information and you can't mix languages in a text file? Or do you prefer using UTF-16, where all of ASCII is 2 bytes per character but you get a marginal benefit for Han texts?


> Do you prefer using code pages so that the bytes in a file have no meaning unless you also supply code page information?

Yes. Note that this is already how Unicode is supposed to work. See e.g. https://en.wikipedia.org/wiki/Byte_order_mark .

A file isn't meaningful unless you know how to interpret it; that will always be true. Assuming that all files must be in a preexisting format defeats the purpose of having file formats.

> Most European languages use variations of the latin alphabet

If you want to interpret "variations of Latin" really, really loosely, that's true.

Cyrillic and Greek characters get two bytes, even when they are by definition identical to ASCII characters. This bloat is actually worse than the bloat you get by using UTF-8 for Japanese; Cyrillic and Greek will easily fit into one byte.


As someone who has been using Cyrillic writing all my life, I've never noticed this bloat you're speaking of, honestly...

Maybe if you're one of those AI behemoths who works with exabytes of training data, it would make some sense to compress it down by less than 50% (since we're using lots of Latin terms and acronyms and punctuation marks, which all fit in one byte in UTF-8).

On the web and in other kinds of daily text processing, one poorly compressed image or one JavaScript-heavy webshite obliterates all "savings" you would have had in that week by encoding text in something more efficient.

It's the same with databases. I've never seen anyone pick anything other than UTF-8 in the last 10 years at least, even though 99% of what we store there is in Cyrillic. I sometimes run into old databases, which are usually Oracle, that were set up in the 90s and never really upgraded. The data is in some weird encoding that you haven't heard of for decades, and it's always a pain to integrate with them.

I remember the days of codepages. Seeing broken text was the norm. Technically advanced users would quickly learn to guess the correct text encoding by the shapes of glyphs we would see when opening a file. Do not want.


> A file isn't meaningful unless you know how to interpret it; that will always be true.

There are multiple levels of meaning, though; character encoding is just one part of it. For example, a text file might be plain text, or HTML, or JSON, or a C source code, etc; a binary file might be DER, or IFF, or ZIP, etc; and then there will be e.g. what kind of data a JSON or DER or IFF contains and how that level of the data is interpreted, etc.

> Cyrillic and Greek characters get two bytes, even when they are by definition identical to ASCII characters.

Whether or not they are identical to ASCII characters depends on the character set and on other things, such as what they are being used for; the definition of "identical" is not so simple as you make it seem. Unicode defines them as not identical, which is appropriate for some uses but is wrong for other uses. (Unicode also defines some characters as identical even though in some uses it would be more appropriate to treat them as not identical, too. So, Unicode is both ways bad.)

> This bloat is actually worse than the bloat you get by using UTF-8 for Japanese; Cyrillic and Greek will easily fit into one byte.

I agree with that (although I think UTF-8 should not be used for Japanese either), but it isn't because of which characters are considered "identical" or not. There are problems with Unicode in general regardless of which encoding you use.


> ... (although I think UTF-8 should not be used for Japanese either) ...

The people putting up websites in Japanese disagree with you, it would seem. According to Wikipedia (in the Shift JIS article), as of March 2026 99% of websites in the .jp domain were in UTF-8, with only 1% being in Shift JIS.

Japan used to have two different encodings in common use, Shift JIS (usually used on Windows) and EUC-JP (more common on Unix servers). This resulted in characters being misinterpreted often enough that they coined the word mojibake to describe the phenomenon of text coming out completely garbled. These days, it seems Japanese website makers are more than happy to accept a slight inefficiency in encoding size, because what they gain from that is never having to see mojibake again.


If they are misinterpreted, it is because the character encoding is not declared properly.

I still sometimes see mojibake in Japanese web pages, but sometimes it works; if it works, it is because the character encoding is declared properly.

In my opinion, EUC-JP is a generally better encoding of JIS (especially in e.g. C source code, which should not use Shift-JIS but EUC-JP is OK), but Shift-JIS does have some benefits in some circumstances (such as making a character grid with one byte per character cell; if using Shift-JIS for a Pascal source code then you should use (* *) instead of { } for comments please).


> If they are misinterpreted, it is because the character encoding is not declared properly.

OR because the software is buggy, or making assumptions about encoding and not checking them (which also counts as "buggy", of course). You can declare the encoding all you like, it won't protect you against the stupid decisions that other people make in writing their software. (See Excel, for example).

Yes, if you declare your encoding properly, things should work. Most of the time. And if you're using any encoding that is not the worldwide default (which these days is UTF-8), then you definitely should declare the encoding. But you'll still occasionally hit badly-written software that doesn't even think about other encodings and doesn't handle them properly. The only defense against that situation, where you declare your encoding properly and it still doesn't work, is to just use the encoding that the software was written to expect, which is almost certainly the worldwide default.


UTF-8 does not require a byte order mark. The byte order mark is a technical necessity born from UTF-16 and a desire to store UTF-16 in a machine's native endianness.

The byte order mark has has no relation to code pages.

I don't think you know what you're talking about and I do not think further engagement with you is fruitful. Bye.

EDIT: okay since you edited your comment to add the part about Greek and Cyrillic after I responded, I'll respond to that too. Notice how I did not say "all European languages". Norwegian, Swedish, French, Danish, Spanish, German, English, Polish, Italian, and many other European languages have writing systems where typical texts are "mostly ASCII with a few special symbols and diacritics here and there". Yes, Greek and Cyrillic are exceptions. That does not invalidate my point.


Unicode could have just been encoded statefully with a "current code page" mark byte.

With UTF encodings and emojis we can't have random access to characters anyway, so why not go the whole way?


Yikes. That would lose the ability to know the meaning of the current bytes, or misinterpret them badly, if you happen to get one critical byte dropped or mangled in transmission. At least UTF-8 is self-syncing: if you end up starting to read in the middle of a non-rewindable stream whose beginning has already passed, you can identify the start of the next valid codepoint sequence unambiguously, and then end up being able to sync up with the stream, and you're guaranteed not to have to read more than 4 bytes (6 bytes when UTF-8 was originally designed) in order to find a sync point.
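
The resynchronization step really is tiny, because continuation bytes are exactly those matching 10xxxxxx (a sketch):

    // Skip forward to the next byte that can start a scalar value:
    // anything except a 0b10xxxxxx continuation byte.
    static const unsigned char* utf8_resync(const unsigned char* p,
                                            const unsigned char* end) {
        while (p < end && (*p & 0xC0) == 0x80)
            p++;
        return p;  // ASCII byte, lead byte, or end of buffer
    }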

But if you have to rely on a byte that may have already gone past? No way to pick up in the middle of a stream and know what went before.


We've already lost all that with emojis and other characters in supplementary planes.

No, we haven't. You can start at any byte in a UTF-8 document and resume reading coherent text. If you start reading from the middle of a multi code point sequence, then the first couple of glyphs may be wrong, for example you may see a lone skin tone modifier rendered as a beige blob where the author intended a smiley face with that skin tone. But these multi code point sequences are short, and the garbled text is bounded to the rest of the multi code point sequence. The entire rest of the document will be perfectly readable.

Compare this to missing a code page indicator. It will garble the whole section until the next code page indicator, often the whole rest of the document. The fact that you're even comparing these two situations as if they're the same is frankly ridiculous.


A huge, central part of UTF-8's design is that you can start decoding it from any arbitrary offset; it is self-synchronizing.

Unicode had support for language tag codepoints. They still exist but have long been deprecated. They were intended to deal with glyph variants, especially with regards to Han unification.

UTF-8 may still be a good choice for Japanese text, though.

For one thing, pure text is often not the only thing in the file. Markup is often present, and most markup syntaxes (such as HTML or XML) use characters from the ASCII range for the markup, so those characters are one byte (but would be two bytes in UTF-16).

Back when the UTF-8 Everywhere manifesto (https://utf8everywhere.org/) was being written, they took the Japanese-language Wikipedia article on Japan, and compared the size of its HTML source between UTF-8 and UTF-16 (scroll down to section 6 to see the results I'm about to cite). UTF-8 was 767 KB, UTF-16 was 1186 KB - a bit more than 50% larger than UTF-8. The space savings from the HTML markup outweighed the extra bytes from having a less-efficient encoding of Japanese text.

Then they did a copy-and-paste of just the Japanese text into a text file, to give UTF-16 the biggest win. There, the UTF-8 text was 222 KB while the UTF-16 encoding got it down to 176 KB - a 21% win for UTF-16, but not the 50% win you would have expected from a naive comparison, because Japanese text still uses many characters from the ASCII set (space, punctuation...) and so there are still some single-byte UTF-8 characters in there. And once the files were compressed, both UTF-8 and UTF-16 were nearly the same size (83 KB vs 76 KB), which means there's little efficiency gain anyway if your content is being served over a gzip'ed connection.

So in theory, UTF-8 could be up to 50% larger than UTF-16 for Japanese, Chinese, or Korean text (or any of the other languages that fit into the higher part of the basic multilingual plane). But in practice, even giving the UTF-16 text every possible advantage, they only saw a 20% improvement over UTF-8.

Which is not nearly enough to justify all the extra cost of suddenly not knowing what encoding your text file is in any more, not when we've finally reached the point of being able to open a text file and just know the encoding.

P.S. I didn't even mention the Shift JIS encoding, and there's a reason I didn't. I've never had to use it "for real", but I've read about it. No. No thank you. No. Shudder. I'm not knocking the cleverness of it; it was entirely necessary back when all you had was 8 bits to work with.

But let me put it this way: it's not a coincidence that Japan invented a word (mojibake) to represent what happens when you see text interpreted in the wrong encoding. There were multiple variations of Shift JIS (and there was also EUC-JP, just to throw extra confusion into the works), so Japanese people saw garbled text all the time as it moved from one computer running Windows, to an email server likely running Unix, to another computer running Windows... it was a big mess.

It's also not a coincidence that (according to Wikipedia) 99.1% of Japanese websites (defined as "in the .jp domain") are encoded in UTF-8, while Shift JIS is used by only 1% (probably about 0.95% rounded up) of .jp websites.

So in practice, nearly everyone in Japan would rather have slightly less efficient encoding of text, but know for a fact that their text will be read correctly on the other end.


I can't tell what the argument is just from the slideshow. The main point appears to be that code pages, UTF-16, etc are all "plain text" but not really.

If that really was the argument, then it is, in 2026, obsolete; UTF-8 is everywhere.


He has a YouTube channel; there's a talk on there.

He also discusses code pages etc.

I don't think the thesis is wrong. E.g. when I think plain text I think ASCII, so we're already disagreeing about what 'plain text' is. His point isn't that we don't have a standard, it's that we've had multiple standards for what we think is the most basic of formats, with lots of hidden complications.


Tell that to programmers writing code to extract data from PCL print streams by stripping escape sequences and processing the result as "plain text" (in multiple incompatible extended ASCII encodings specified by the stripped escape sequences), or anyone exporting data from Excel in "CSV (Comma delimited) (*.csv)" format.

UTF-8 is everywhere. Until it's not. And it's impossible to distinguish UTF-8 from any other extended ASCII encoding given a sample containing only ASCII characters, so there's still no reliable way to process data that can only be described as "plain text".


Nice. I've used the phrase before, with the vague notion that a proper talk must already exist.

I read that article a long time ago, and for me it's a hard disagree. A system as complex and quirky as Unicode can never be considered "plain", and even today it is common for many apps to have something Unicode-related break. ASCII is still the only text system that will really work well everywhere, which I consider a must for calling something plain text.

And yes, ASCII means mostly limiting things to English but for many environments that's almost expected. I would even defend this not being a native English speaker myself.


I feel like that isn’t exactly a very useful definition of plaintext. If you mean “ASCII” say ASCII.

Plain text is text intended to be interpreted as bytes that map simply to characters. Complexity is irrelevant.


https://en.wikipedia.org/wiki/Plaintext

  With the advent of computing, the term plaintext expanded beyond human-readable documents to mean any data, including binary files, in a form that can be viewed or used without requiring a key or other decryption device. Information—a message, document, file, etc.—if to be communicated or stored in an unencrypted form is referred to as plaintext.
https://csrc.nist.gov/glossary/term/plaintext

    Unencrypted information that may be input to an encryption operation. Note: Plain text is not a synonym for clear text. See clear text.

    Intelligible data that has meaning and can be understood without the application of decryption.

Unfortunately no, Unicode is not simply a mapping of bytes to characters. It is a mapping of numbers to code points, and in some cases you can even get the same characters with multiple code point sequences - e.g. "é" as the single code point U+00E9 or as U+0065 followed by the combining accent U+0301 (not a very good mapping!). Then you need to convert numbers to bytes, so aside from Unicode you also need an encoding. And there are multiple choices. So what would be "plain text" then? UTF-16? UTF-8? If so, with or without BOM? It can't be all of them. For something to really be "plain text" it has to be the same thing to everyone...

> Unfortunately no, Unicode is not simply a mapping of bytes to characters. It is a mapping of numbers to code points, and in some cases you can even get the same characters with multiple code point sequences (not a very good mapping!).

It is worse than that; you can also get different characters with the same code points, and also same code points and characters that should be different according to some uses, and also different code points and characters that should be same according to some uses, etc.


I agree with you that Unicode is too complicated and messy, although it also shows that whether or not something is considered "plain" is itself too difficult.

Unicode has caused many problems (although it was common for m17n and i18n to be not working well before Unicode either). One problem is making some programs no longer 8-bit clean.

Unicode might be considered in two ways: (1) Unicode is an approximation of multiple other character sets, (2) All character sets are an encoding of a subset of Unicode. At best, if Unicode is used at all, it should be used as (1) (as a last resort), but it is too common for Unicode to be used as (2) (as a first resort), which is not good in my opinion.

(I mostly avoid Unicode in my software, although it is also often the case (and, in many (but not all) programs, should be the case) that it only cares about ASCII but does not prevent you from using any other character encodings that are compatible with ASCII.)

> ASCII is still the only text system that will really work well everywhere, which I consider a must for calling something plain text.

Yes, it does work well (almost) everywhere.

Supersets of ASCII are also common, including UTF-8, and PC character set, ISO 2022 (if ASCII is the initially selected G0 set, which it is in the ASN.1 Graphic string and General string types, as well in most terminal emulators), EUC-JP, etc. In these cases, ASCII will also usually work well.

However, as another comment mentions, and I agree with them, that if you mean "ASCII" then it is what you should say, rather than "plain text" which does not tell you what the character encoding is. That other comment says:

> Plain text is text intended to be interpreted as bytes that map simply to characters.

However, it is not always so clear and simple what "characters" is, depending on the character sets and what language you are writing. And then, there are also control characters, to be considered, so it is again not quite so "plain".

> And yes, ASCII means mostly limiting things to English but for many environments that's almost expected. I would even defend this not being a native English speaker myself.

In my opinion, it depends on the context and usage. One character set (regardless of which one it is) cannot be suitable for all purposes. However, for many purposes, ASCII is suitable (including C source code; you might put l10n in a separate file).

You should have proper m17n (in the contexts where it is appropriate, which is not necessarily all files), but Unicode is not a good way to do it.


Not all character sets encode subsets of Unicode.

Two well-known counterexamples that come immediately to mind:

1. Mac OS Roman includes a non-Unicode Apple logo.

2. The Atari ST character set includes two non-Unicode characters that combine to create an Atari logo, and 4 non-Unicode characters that combine to create a picture of J.R. "Bob" Dobbs [1].

[1] https://en.wikipedia.org/wiki/J._R._%22Bob%22_Dobbs


Yes, I know that (I was aware of both of these cases, as well as others). In those cases, there are characters that do not correspond to any Unicode characters.

Nevertheless, what I was saying is that many programs seem to be designed as though other character sets do encode subsets of Unicode, but actually they are different character sets and are not Unicode.

However, what I meant was, in addition to things like the examples that you have, other less obvious cases. Even if characters do correspond to Unicode (and sometimes there is more than one way to do it, which is the case for several PC characters), they are not necessarily supposed to work in the same way.

At best, Unicode could be used as an approximation if the other character sets cannot be used, although sometimes there are other ways to do it.


But at the same time the motor is extremely finicky/fragile in the source of energy (negentropy) it will accept, while natural life is extremely hardy and adaptable.

I wonder how much of machine-like "efficiency" is actually "overfitting" at the cost of robustness.


Who are we to say the mechanisms of biology are “overfit”? Maybe it would be nice if our personal machinery was more robust, but that robustness comes at an evolutionary cost. The greater force that is life on earth as a system for regulating planetary energy dissipation does not care about the needs of the individual. It does not care about the fashions of the millennia. Its sights are set much farther and its history much deeper than that

The need to reproduce and repair our bodies is a big trade-off.

Electric motors are sort of like hermit crab shells - hard and long-lasting, but they only exist because they piggyback off of a living species.


I mean, we are getting to the point where we can build self-completing loops of machinery.

For more complicated organisms, robustness comes in the form of cellular turnover, and regenerative healing in response to injury, at least in youth. I wonder though if single celled organisms have or even need such a function.

Individual cells absolutely do have mechanisms to repair damage: https://pmc.ncbi.nlm.nih.gov/articles/PMC5664224/

They can even repair their own DNA: https://en.wikipedia.org/wiki/DNA_repair


That is a fair point to be honest! I guess when you have a 20min lifetime you can probably compromise on reliability in favour of extra efficiency.

Every race, the engine of a top-fuel dragster only completes about 900 revolutions and then has to be rebuilt! https://www.motortrend.com/features/top-fuel-dragsters

Or perhaps we are "overfitting" for robustness over time and also as a group (over society) vs just individual robustness at any given moment.

We also have extremely robust livers that allow us to eat a super wide variety of food as energy inputs


Check out shpool, whose tagline is "Think tmux, then aim... lower" :-)

https://github.com/shell-pool/shpool


> pretty diligent about applying search blocklists, closing hacking loopholes, and reading model outputs to catch unanticipated hacks. If we wanted to, we could choose to close our eyes and plug our ears and report higher scores for Terminal-bench, SWE-bench, etc. that technically comply with the reference implementation but aren't aligned with real value delivered to users

Of course, but that's the difference between sins of commission and sins of omission. The question is what "pretty diligent" actually translates to in practice. How many people will encourage delays in a model release or post-training improvement waiting "for more thorough evaluation"? How many popularized AI results can you vouch for on this?

The zeitgeist is to celebrate bias for action, avoiding analysis paralysis and shipping things (esp. with conference driven research culture, even before we get into thorny questions of market dynamics), so even if we have a few pockets of meticulous excellence, the incentive structure pushes towards making the whole field rot.


Is the cost of laying fiber (via a public utility) to each household counted as part of the monthly internet bill, or is that funded separately? (eg. as part of property taxes)


Yes, and that becomes more intuitive when you "un-curry" the nested lambdas into a single lambda with twice the number of arguments. The point is that the state of a constant does not depend whatsoever on the state of the (rest of the) world, however much of that state piles on.


But that also just means that BEVs will become way cheaper as companies rush to optimize value and capture users at more attractive prices.


Yes. Cars have fixed cost ranges for people, so the end result is pretty much predetermined - electric cars settling at the same price and quality points that ICE cars occupy today.


> what's wrong with legitimately not knowing what e.g. the data structure will end up looking?

But that's not what the above comment said.

> Just let it run, check debugger/stdout/localhost page and adjust: "Oh, right, the entries are missing canonical IDs, but at the same time there are already all the comments in them, forgot they would be there

So you did have an expectation that the entries should have some canonical IDs, and anticipated/desired a certain specific behavior of the system.

Which is basically the meaning of "what will the output be?" when simplified for programming novices at university.

