From a verification-topology angle, what makes algotune.io contamination-resistant? Is it because the correctness oracle is a performance metric (which can't be memorized) rather than a fixed test suite that can be?
I addressed this in my reply to kelseyfrog above. The short version: the production work is proprietary, the tooling I used to do the analysis is open source.
Yeah, it's "my code lives in another (gas) town, you wouldn't know her". Same for some undisclosed open-source projects.
I can't imagine letting any LLMs do 500+ hours of autonomous work on any code at my company, or even on my own project (hundreds of thousands of lines of unreviewable slop? no thank you). Especially given the number of features you claim they implement from scratch.
I also don't believe anything about "2 agents running for 12 hours", given how fast they exhaust context, become extremely stupid, completely ignore most of their previous work on subsequent runs, and happily disregard explicit instructions, despite any "guardrails".
Funnily enough literally right now in my current session Claude has "forgotten" most instructions from its global memory *and* its local CLAUDE.md
Hi, I'm the original author and I can clarify a few things.
The 543 hours are the agent compute hours, not me at the keyboard. The pipeline runs autonomously, the agents execute in parallel, and the gates verify the output. Most of the prompts are agent-to-agent, not human-to-agent.
On the timeline: I have a BSCS (1995) and MSCS (1997) with a specialty in distributed systems. I actually worked my way through school doing this work so I didn't need loans. Let's call it almost 35 years.
The terminology has evolved but the architecture hasn't changed as much as people think.
I'm the author of that post. Thank you for your feedback.
The production code is proprietary work for clients, so I can't link to it directly. But the tooling I built to support the pipeline is open source: the log analyzer that computed these statistics.
There are a couple of other in-flight projects I will open source soon, created by this process, but they aren't out yet.
The research page is about the methodology because that's what generalizes. The specific microservices I ship are just microservices.
I've been running a multi-agent software development pipeline for a while now and I've reached the same conclusion: it's a distributed systems problem.
My approach has been more pragmatic than theoretical: I break work into sequential stages (plan, design, code) with verification gates. Each gate has deterministic checks (compile, lint, etc) and an agentic reviewer for qualitative assessment.
Collectively, this looks like a distributed system. The artifacts reflect the shared state.
The author's point about external validation converting misinterpretations into detectable failures is exactly what I've found empirically. You can't make the agent reliable on its own, but you can make the protocol reliable by checking at every boundary.
The deterministic gates provide a hard floor of guarantees. The agentic gates provide soft probabilistic assertions.
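The two kinds of gates described above can be sketched roughly like this. This is a minimal illustration, not the author's actual pipeline: `GateResult`, the check commands, and the injected `reviewer` callable are all hypothetical names chosen for the sketch.

```python
# Sketch of a stage gate: deterministic checks (hard floor) run as shell
# commands; the agentic check (soft probabilistic assertion) is an injected
# reviewer callable standing in for an LLM call.
import subprocess
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateResult:
    passed: bool
    reason: str

def deterministic_gate(checks: list[tuple[str, list[str]]]) -> GateResult:
    """Run each named shell command; any nonzero exit fails the gate."""
    for name, cmd in checks:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            return GateResult(False, f"{name} failed: {proc.stderr[:200]}")
    return GateResult(True, "all deterministic checks passed")

def agentic_gate(artifact: str, reviewer: Callable[[str], bool]) -> GateResult:
    """Qualitative assessment: the reviewer judges the artifact holistically."""
    ok = reviewer(artifact)
    return GateResult(ok, "reviewer approved" if ok else "reviewer rejected")
```

An artifact only advances to the next stage if both gates pass; the deterministic gate runs first because it is cheap and its verdict is final.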
Exactly this. I'm writing my own little orchestrator and memory system, and because I have a modest number of workflows, I'm taking the time to specify them deterministically, describe them as a DAG (with gotos for the inevitable loops), and generate deterministic orchestration code. I'm trying to make the tool calls as clear and comprehensive as possible (don't make Opus convert a PDF; have a script do that and give it the text instead), and I'm putting in all the usual state tracking. I assume a ~20% task failure rate, so I can simply wipe and repeat failed tasks.
Small model and (where still required) human in the loop steps for deterministic workflows can solve a surprisingly large number of problems and don't depend on the models to be consistent or not to fail.
Just invest heavily in adversarial agents and quality gates and apply transforms on intermediate artifacts that can be validated for some dimensions of quality to minimize drift.
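The wipe-and-retry DAG orchestration described above can be sketched in a few lines with the standard library's topological sorter. This is an illustration under stated assumptions, not the commenter's actual code: tasks are modeled as callables returning success, and the "wipe" step is reduced to a comment.

```python
# Sketch: deterministic orchestration over a task DAG, retrying failed
# tasks from scratch (assuming roughly a 20% per-task failure rate).
from graphlib import TopologicalSorter

def run_dag(tasks, deps, max_attempts=3):
    """tasks: name -> callable returning True on success.
    deps: name -> set of prerequisite task names."""
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        for attempt in range(1, max_attempts + 1):
            if tasks[name]():
                break
            # wipe the failed task's workspace here, then repeat it cleanly
            print(f"{name}: attempt {attempt} failed, wiping and retrying")
        else:
            raise RuntimeError(f"{name} failed after {max_attempts} attempts")
```

Because each task is wiped before a retry, a flaky agent run never leaves half-finished state behind; the orchestrator, not the model, owns consistency.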
But I think the coordination problem is subtler than version control implies. In the (plan, design, code) pipeline they aren't collaborating on the same artifact. They're producing different artifacts that are all expressions of the same intent in different spaces: a plan in natural language, a design in a structured spec, code in a formal language.
Different artifacts which are different projections in different Chomsky levels but all from the same thing: user intent.
The coordination challenge is keeping these consistent with each other as each stage transforms the prior projection into the new one. That's where the gates earn their place: they verify that each transformation preserves the intent from the previous stage.
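The boundary-checking idea above can be made concrete with a small sketch. Everything here is illustrative: the `judge` callable stands in for an LLM prompt along the lines of "does this design cover every point in this plan?", and the stage names are placeholders, not anyone's real protocol.

```python
# Sketch: gate each stage transition by asking whether the new artifact is
# still a faithful projection of the previous one.
def check_transformation(prev_artifact: str, next_artifact: str, judge) -> bool:
    """True if next_artifact preserves the intent of prev_artifact."""
    return judge(prev_artifact, next_artifact)

def run_pipeline(intent: str, stages, judge):
    """stages: list of (name, transform) applied in order, e.g.
    plan -> design -> code. A failed boundary check stops the pipeline."""
    prev = intent
    for name, transform in stages:
        nxt = transform(prev)
        if not check_transformation(prev, nxt, judge):
            raise ValueError(f"{name} lost intent at the gate")
        prev = nxt
    return prev
```

The point is that each gate only compares adjacent projections, so a misinterpretation surfaces at the boundary where it was introduced rather than compounding downstream.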
I created my own framework. Long ago it started as shell scripts that I used in conjunction with aider. It was a very manual process.
It's grown over time to be a full MCP and CLI with stages and gates defined in YAML. I was thinking about open sourcing it but since the code grew organically I would need to do extensive cleanup to make it presentable.
Non-coders often think all engineers do is write code. They don't realize how many more hours are spent making sure the code we write is correct, from many angles. Functional bugs? Easy to maintain? Cost optimized? Meets user expectations?
When they have a machine that cranks out code, and honestly pretty good code, they think it's the same thing.
Eventually though I suspect many people will discover what many studies over many decades have shown: the most expensive part of software is maintenance.
This will eventually be the problem for your client. If I were in your shoes, I'd probably start laying the foundation for that. If they're willing to bear the cost, then fine. If not, then ultimately you're set up for a clean exit when you can no longer get the software to run as they want in a timely fashion.
I see the user has updated the post, showing it was their own tool doing the damage.
These things happen. They happened before coding agents, they happen now. I've done plenty of damage with my own ten fingers on the keyboard without any help from an LLM.
This is exactly why I develop on a Mac with Time Machine. It has saved my bacon many times. Both from things I did and from things Claude did. I've had several recent incidents that went like this:
"me: Claude, did you delete X?"
"claude: Yes, sorry, I shouldn't have done that. I can reconstruct it."
[Narrator: no, claude cannot reconstruct it.]
"me: Should I just restore it from Time Machine?"
"claude: Yes! That's perfect!"
I swear I can feel a sense of relief from Claude when I tell it I can just restore from backup.
Everyone is comparing this to Playwright but it's solving a different problem. Playwright checks structural properties, like does element X exist, is it visible, etc. That's useful but it can't tell you whether the page actually looks right.
I built something similar that takes a screenshot and uses a multi-modal LLM to evaluate it against a design mock. It catches a completely different class of error. The DOM can be structurally perfect and still look nothing like what was intended. Colors wrong, layout shifted, spacing off, components overlapping. No amount of DOM assertions will catch that.
These are two different kinds of gates: structural, which are fast and deterministic, and stochastic, which are slow but catch an entirely different class of problem. There is very little overlap between the two sets of issues, and you want to catch both.
That way I can invest a lot of time getting the mock just right, then let the agents "make it so".
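The two-layer gate described above might look something like this. This is a sketch, not the commenter's tool: the multimodal evaluation is injected as a callable (`evaluate_with_llm` is a hypothetical stand-in for whatever vision model you call) so the gate's structure is visible without a live API.

```python
# Sketch: a two-layer visual gate. Structural DOM checks are fast and
# deterministic; the stochastic check hands a screenshot plus the design
# mock to a multimodal LLM (injected here as a plain callable).
from typing import Callable

def visual_gate(dom_checks: list[Callable[[], bool]],
                screenshot: bytes,
                mock: bytes,
                evaluate_with_llm: Callable[[bytes, bytes], bool]) -> dict:
    structural_ok = all(check() for check in dom_checks)  # cheap, deterministic
    # Only pay for the slow, stochastic check once the structure passes.
    visual_ok = structural_ok and evaluate_with_llm(screenshot, mock)
    return {"structural": structural_ok, "visual": visual_ok}
```

Ordering the layers this way means the expensive LLM call only runs on pages that are already structurally sound, which is where a "looks nothing like the mock" failure is actually informative.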
It's the whole tool that's important, not so much how you get the screenshots. That's what I'm saying: this is headed in the right direction, it just falls a little short of what I do, where I get tons of value over and above just Playwright (or whatever gets the screenshot).
The critical part is that viewed at a high level, this method tests something different, which means it catches different errors.