sanxiyn · 2026-04-25T08:57:27 1777107447

As I understand, Rice's theorem does not apply because neural networks are not Turing-complete.

calf · 2026-04-25T10:37:22 1777113442

I'm not sure I agree with that. Technically, even my PC is not Turing-complete because its hard drive is finite. Yet there is an informal sense in which Rice's theorem is still relevant at the level of the PC abstraction, as we are all taught that "virus checkers are, strictly speaking, impossible". This is a subtle point that needs further clarification from CS theorists, which I am not.

Neural networks in general are Turing models. Human brains are in the abstract Turing complete as well, as a simple example. LLMs being run iteratively in an unbounded loop may be "effectively Turing complete" for this simple reason, as well.

Regardless, any theory purporting to be foundational ought to explicitly address this demarcation. Unless practitioners think computability and formal complexity are not scientific foundations for CS.

lou1306 · 2026-04-25T11:33:30 1777116810

But most "normal" neural networks are feed-forward, so they are guaranteed to terminate in a bounded amount of time. This rules Turing completeness right out. And even recurrent NNs can be "unfolded" into feed-forward equivalents for any fixed number of steps, so they are not TC either.

You need a memory element the network can interact with, just like an ALU by itself is not TC, but a barebones stateful CPU (ALU + registers) is.
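A minimal sketch of the unfolding argument, assuming a toy scalar recurrent cell (the names `cell`, `run_recurrent`, and `unfold`, and the weights, are hypothetical and chosen only for illustration). It shows that a recurrent loop over a fixed number of steps computes the same value as a fixed-depth feed-forward stack with shared weights, and that the unfolded version only accepts inputs of that one length:

```python
import math


def cell(h, x, w_h=0.5, w_x=1.0):
    """One recurrent step: new hidden state from old state and current input."""
    return math.tanh(w_h * h + w_x * x)


def run_recurrent(xs, h0=0.0):
    """Apply the cell in a loop; the step count is determined by the input length."""
    h = h0
    for x in xs:
        h = cell(h, x)
    return h


def unfold(steps):
    """Build a feed-forward function with exactly `steps` layers sharing one cell."""
    layers = [cell for _ in range(steps)]  # depth is fixed at construction time

    def feed_forward(xs, h0=0.0):
        if len(xs) != len(layers):
            raise ValueError("unfolded network accepts exactly %d inputs" % len(layers))
        h = h0
        for layer, x in zip(layers, xs):
            h = layer(h, x)
        return h

    return feed_forward


xs = [0.1, -0.3, 0.7]
net3 = unfold(3)
print(math.isclose(net3(xs), run_recurrent(xs)))
```

The depth of `net3` is fixed when it is built, so it always finishes in a bounded number of steps; anything longer needs either a deeper unfolding or external memory.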

calf · 2026-04-27T09:55:53 1777283753

I already addressed this type of misargument in my first paragraph. Another way of looking at it is, if NNs are so time bounded then they cannot be computationally powerful at all. Which is really strange.

sanxiyn · 2026-04-22T04:39:28 1776832768

To some extent yes, but models are good at reverse engineering, so it isn't as great an advantage as you might think.

sanxiyn · 2026-04-22T04:36:06 1776832566

The last three CVEs are collections of bugs. CVE-2026-6784 is a collection of 55 bugs. CVE-2026-6785 is a collection of 154 bugs. CVE-2026-6786 is a collection of 107 bugs.

As for credits, I think bugs are ultimately credited to people, and this time Mozilla people used Mythos, as opposed to Anthropic people using Opus or Mythos.

sanxiyn · 2026-04-21T23:12:13 1776813133

Why do you think so? I didn't get that impression.

sanxiyn · 2026-04-21T22:56:50 1776812210

The gap between formal and informal has been pointed out as an Achilles' heel of formal methods from the dawn of the field, so the critique is not particularly new. The standard reference is Social Processes and Proofs of Theorems and Programs (1979), which is worth reading.

rramadass · 2026-04-22T04:39:51 1776832791

Nice; thanks for the pointer to the paper (I had not known of it).

The key to understanding and usage of Formal Methods is to realize that it is a way of thinking at many different levels. You can choose whatever level seems intuitive and accessible to you.

The must-read paper On Formal Methods Thinking in Computer Science Education posits three levels, which I have highlighted here - https://news.ycombinator.com/item?id=46298961

Carroll Morgan's classic (In-)Formal Methods: The Lost Art --- A Users’ Manual and his recent book on the same are also relevant here - https://news.ycombinator.com/item?id=45490017

sanxiyn · 2026-03-27T09:06:39 1774602399

Well, that's because this post is about Gen 13. In the FL2 post (presumably on the same Gen 12 servers), they say 25% lower latency.

sanxiyn · 2026-03-27T03:37:54 1774582674

In this case the code is public and you can see they are not cheating in that sense.

DetroitThrow · 2026-03-27T05:13:46 1774588426

The harness seems extremely benchmark-specific, which gives them a huge advantage over what most models can use. This isn't a qualifying score for that reason.

Here is the ARC-AGI-3 specific harness by the way - lots of challenge information encoded inside: https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...

Davidzheng · 2026-03-27T04:08:24 1774584504

I agree it's not cheating in that restricted sense. But I'm not really convinced that it can't be cheating in a more general sense. You can try like 10^10 variations of harnesses and select the one that performs best. And probably if you then look at it, it will not look like it's necessarily cheating. But you have biased the estimator by selecting the harness according to the value.
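A small simulation of that selection effect, as a sketch under stated assumptions: every harness variant has the same true score and only differs by Gaussian evaluation noise (the function name `mean_best` and the numbers 0.30 and 0.05 are made up for illustration, not taken from any real benchmark). Reporting the best of many variants then overstates the true score even though no single variant is rigged:

```python
import random


def mean_best(n_harnesses, trials=200, true_score=0.30, noise=0.05, seed=0):
    """Average, over many trials, of the best observed score among n variants.

    Each variant's observed score is the same true score plus independent
    Gaussian noise, so any gap above true_score comes purely from selection.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(trials):
        total += max(true_score + rng.gauss(0.0, noise) for _ in range(n_harnesses))
    return total / trials


for n in (1, 10, 100):
    print(n, round(mean_best(n), 3))
```

With one variant the average stays near the true score; as the number of variants tried grows, the reported maximum drifts upward, which is the bias in the estimator.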

SchemaLoad · 2026-03-27T03:41:56 1774582916

Once the model has seen the questions and answers in the training stage, the questions are worthless. Only a test using previously unseen questions has merit.

lambda · 2026-03-27T03:46:52 1774583212

They aren't training new models for this. This is an agent harness for Opus 4.6.

measurablefunc · 2026-03-27T03:59:02 1774583942

All traffic is monitored, all signal sources are eventually incorporated into the training set in one way or another. The person you're responding to is correct, even a single API call to any AI provider is sufficient to discount future results from the same provider.

stale2002 · 2026-03-27T04:11:44 1774584704

ok! So if someone uses an existing, checkpointed, open source model then the answer is yes the results are valid and it doesn't matter that the tests are public.

measurablefunc · 2026-03-27T04:35:52 1774586152

Yes, assuming the checkpoint was before the announcement & public availability of the test set.

raincole · 2026-03-27T05:33:21 1774589601

You live in a conspiracy world. Those AI providers don't update the models that fast. You can try asking them to solve ARC-AGI-3 without a harness yourself and see them struggle just as they did yesterday.

measurablefunc · 2026-03-27T06:23:17 1774592597

Which part is the conspiracy? Be as concrete as possible.

bberrry · 2026-03-27T09:19:34 1774603174

They are definitely cheating, they have crafted prompts[1] that explain the game rules rather than have the model explore and learn.

1. https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...

versteegen · 2026-03-27T12:26:30 1774614390

Where do you see that? I only skimmed the prompts but don't see any aspects of any of the games explained in there. There are a few hints which are legitimate prior knowledge about games in general, though some look too inflexible to me. Prior knowledge ("Core priors") is a critical requirement of the ARC series; read the reports.

sanxiyn · 2026-03-27T01:58:05 1774576685

There is. The official leaderboard is without a harness, and the community leaderboard is with a harness. Read the ARC-AGI-3 Technical Paper for details.

falcor84 · 2026-03-27T02:09:25 1774577365

I went through the technical paper again, and while they explain why they decided against the harness, I disagree with them - my take is that if harnesses are overfitting, then they should be penalized on the hidden test set.

Anyway, searching both in ARC-AGI's paper and website and directly on kaggle, I failed to find a with-harness leaderboard; can you please give the link?

sanxiyn · 2026-03-27T02:34:44 1774578884

Here it is: https://arcprize.org/leaderboard/community

falcor84 · 2026-03-28T00:17:36 1774657056

Ah, it's based on this repo [0] and there's only 1 non-example submission there [1], from 2 weeks ago (so it only covers the preview games), and their schema doesn't have a field to show that it's only the preview, nor does the thing properly parse the score or cost into the table. And the biggest thing is that apparently there's no validation whatsoever - submissions are never run on the hidden test games, so it is essentially useless as a comparison.

[0] https://github.com/arcprize/ARC-AGI-Community-Leaderboard

[1] https://github.com/arcprize/ARC-AGI-Community-Leaderboard/bl...

sanxiyn · 2026-03-26T21:49:58 1774561798

That works well. Anthropic wrote a writeup on it.

https://www.anthropic.com/engineering/harness-design-long-ru...

sanxiyn · 2026-03-26T21:48:55 1774561735

Yes, but a programming language with a proverbial sufficiently smart compiler. That is very useful.

Quekid5 · 2026-03-26T22:11:59 1774563119

Try writing an exhaustive spec for anything non-trivial and you might see the problem.

scuff3d · 2026-03-27T04:21:11 1774585271

Been saying this for a while now. I work in aerospace, and I can tell you from firsthand experience that software engineers don't know what designing a spec is.

Aero, mechanical, and electrical engineers spend years designing a system. Design, requirements, reviews, redesign, more reviews, more requirements. Every single corner of the system is well understood before anything gets made. It's a detailed, time consuming, arduous process.

Software engineers think they can duplicate that process with a few skills and a weekend planning session with Claude Code. Because implementation is cheaper we don't have to go as hard as the mechanical and electrical folks, but to properly spec a system is still a massive amount of up front effort.

sn9 · 2026-03-28T23:34:12 1774740852

And software isn't as constrained by physics as hardware, which massively expands both the design space as well as how many ways things can go wrong.