From a verification-topology angle, what makes algotune.io contamination-resistant? Is it because the correctness oracle is a performance metric (which can't be memorized) rather than a fixed test suite that can be?
I addressed this in my reply to kelseyfrog above. The short version: the production work is proprietary, the tooling I used to do the analysis is open source.
Yeah, it's "my code lives in another (gas) town, you wouldn't know her". Same for some undisclosed open-source projects.
I can't imagine letting any LLMs do 500+ hours of autonomous work on any code at my company, or even on my own project (hundreds of thousands of lines of unreviewable slop? no thank you). Especially given the number of features you claim they implement from scratch.
I also don't believe anything about "2 agents running for 12 hours", given how fast they exhaust context, become extremely stupid, completely ignore most of their previous work on subsequent runs, and happily disregard explicit instructions, despite any "guardrails".
Funnily enough literally right now in my current session Claude has "forgotten" most instructions from its global memory *and* its local CLAUDE.md
Hi, I'm the original author and I can clarify a few things.
The 543 hours are the agent compute hours, not me at the keyboard. The pipeline runs autonomously, the agents execute in parallel, and the gates verify the output. Most of the prompts are agent-to-agent, not human-to-agent.
On the timeline: I have a BSCS (1995) and MSCS (1997) with a specialty in distributed systems. I actually worked my way through school doing this work so I didn't need loans. Let's call it almost 35 years.
The terminology has evolved but the architecture hasn't changed as much as people think.
I'm the author of that post. Thank you for your feedback.
The production code is proprietary work for clients, so I can't link to it directly. But the tooling I built to support the pipeline is open source: the log analyzer that computed these statistics.
There are a couple of other in-flight projects I will open source soon, created by this process, but they aren't out yet.
The research page is about the methodology because that's what generalizes. The specific microservices I ship are just microservices.
I've been running a multi-agent software development pipeline for a while now and I've reached the same conclusion: it's a distributed systems problem.
My approach has been more pragmatic than theoretical: I break work into sequential stages (plan, design, code) with verification gates. Each gate has deterministic checks (compile, lint, etc) and an agentic reviewer for qualitative assessment.
Collectively, this looks like a distributed system. The artifacts reflect the shared state.
The author's point about external validation converting misinterpretations into detectable failures is exactly what I've found empirically. You can't make the agent reliable on its own, but you can make the protocol reliable by checking at every boundary.
The deterministic gates provide a hard floor of guarantees. The agentic gates provide soft probabilistic assertions.
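The two kinds of gates described above can be sketched roughly like this. This is a minimal illustration, not the author's actual pipeline: `GateResult`, the check commands, and the injected `reviewer` callable are all hypothetical names chosen for the sketch.

```python
# Sketch of a stage gate: deterministic checks (hard floor) run as shell
# commands; the agentic check (soft probabilistic assertion) is an injected
# reviewer callable standing in for an LLM call.
import subprocess
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateResult:
    passed: bool
    reason: str

def deterministic_gate(checks: list[tuple[str, list[str]]]) -> GateResult:
    """Run each named shell command; any nonzero exit fails the gate."""
    for name, cmd in checks:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            return GateResult(False, f"{name} failed: {proc.stderr[:200]}")
    return GateResult(True, "all deterministic checks passed")

def agentic_gate(artifact: str, reviewer: Callable[[str], bool]) -> GateResult:
    """Qualitative assessment: the reviewer judges the artifact holistically."""
    ok = reviewer(artifact)
    return GateResult(ok, "reviewer approved" if ok else "reviewer rejected")
```

An artifact only advances to the next stage if both gates pass; the deterministic gate runs first because it is cheap and its verdict is final.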
Exactly this. I'm writing my own little orchestrator and memory system, and because I have a modest number of workflows, I'm taking the time to specify them deterministically, describe them as a DAG (with gotos for the inevitable loops), and generate deterministic orchestration code. I'm trying to make the tool calls as clear and comprehensive as possible (don't make Opus convert a PDF; have a script do that and give it the text instead), and I'm putting in all the usual state tracking. I assume a ~20% task failure rate, so I can simply wipe and repeat failed tasks.
Small model and (where still required) human in the loop steps for deterministic workflows can solve a surprisingly large number of problems and don't depend on the models to be consistent or not to fail.
Just invest heavily in adversarial agents and quality gates and apply transforms on intermediate artifacts that can be validated for some dimensions of quality to minimize drift.
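The wipe-and-retry DAG orchestration described above can be sketched in a few lines with the standard library's topological sorter. This is an illustration under stated assumptions, not the commenter's actual code: tasks are modeled as callables returning success, and the "wipe" step is reduced to a comment.

```python
# Sketch: deterministic orchestration over a task DAG, retrying failed
# tasks from scratch (assuming roughly a 20% per-task failure rate).
from graphlib import TopologicalSorter

def run_dag(tasks, deps, max_attempts=3):
    """tasks: name -> callable returning True on success.
    deps: name -> set of prerequisite task names."""
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        for attempt in range(1, max_attempts + 1):
            if tasks[name]():
                break
            # wipe the failed task's workspace here, then repeat it cleanly
            print(f"{name}: attempt {attempt} failed, wiping and retrying")
        else:
            raise RuntimeError(f"{name} failed after {max_attempts} attempts")
```

Because each task is wiped before a retry, a flaky agent run never leaves half-finished state behind; the orchestrator, not the model, owns consistency.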
But I think the coordination problem is subtler than version control implies. In the (plan, design, code) pipeline they aren't collaborating on the same artifact. They're producing different artifacts that are all expressions of the same intent in different spaces: a plan in natural language, a design in a structured spec, code in a formal language.
Different artifacts which are different projections in different Chomsky levels but all from the same thing: user intent.
The coordination challenge is keeping these consistent with each other as each stage transforms the prior projection into the new one. That's where the gates earn their place: they verify that each transformation preserves the intent from the previous stage.
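The boundary-checking idea above can be made concrete with a small sketch. Everything here is illustrative: the `judge` callable stands in for an LLM prompt along the lines of "does this design cover every point in this plan?", and the stage names are placeholders, not anyone's real protocol.

```python
# Sketch: gate each stage transition by asking whether the new artifact is
# still a faithful projection of the previous one.
def check_transformation(prev_artifact: str, next_artifact: str, judge) -> bool:
    """True if next_artifact preserves the intent of prev_artifact."""
    return judge(prev_artifact, next_artifact)

def run_pipeline(intent: str, stages, judge):
    """stages: list of (name, transform) applied in order, e.g.
    plan -> design -> code. A failed boundary check stops the pipeline."""
    prev = intent
    for name, transform in stages:
        nxt = transform(prev)
        if not check_transformation(prev, nxt, judge):
            raise ValueError(f"{name} lost intent at the gate")
        prev = nxt
    return prev
```

The point is that each gate only compares adjacent projections, so a misinterpretation surfaces at the boundary where it was introduced rather than compounding downstream.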
I created my own framework. Long ago it started as shell scripts that I used in conjunction with aider. It was a very manual process.
It's grown over time to be a full MCP and CLI with stages and gates defined in YAML. I was thinking about open sourcing it but since the code grew organically I would need to do extensive cleanup to make it presentable.
Non-coders often think all engineers do is write code. They don't realize how many more hours are spent making sure the code we write is correct, from many angles. Functional bugs? Easy to maintain? Cost optimized? Meets user expectations?
When they have a machine that cranks out code, and honestly pretty good code, they think it's the same thing.
Eventually though I suspect many people will discover what many studies over many decades have shown: the most expensive part of software is maintenance.
This will eventually be the problem for your client. If I were in your shoes, I'd probably start laying the foundation for that. If they're willing to bear the cost, then fine. If not, then ultimately you're set up for a clean exit when you can no longer get the software to run as they want in a timely fashion.
I see the user has updated the post, showing it was their own tool doing the damage.
These things happen. They happened before coding agents, they happen now. I've done plenty of damage with my own ten fingers on the keyboard without any help from an LLM.
This is exactly why I develop on a Mac with Time Machine. It has saved my bacon many times. Both from things I did and from things Claude did. I've had several recent incidents that went like this:
"me: Claude, did you delete X?"
"claude: Yes, sorry, I shouldn't have done that. I can reconstruct it."
[Narrator: no, claude cannot reconstruct it.]
"me: Should I just restore it from Time Machine?"
"claude: Yes! That's perfect!"
I swear I can feel a sense of relief from Claude when I tell it I can just restore from backup.
Everyone is comparing this to Playwright but it's solving a different problem. Playwright checks structural properties, like does element X exist, is it visible, etc. That's useful but it can't tell you whether the page actually looks right.
I built something similar that takes a screenshot and uses a multi-modal LLM to evaluate it against a design mock. It catches a completely different class of error. The DOM can be structurally perfect and still look nothing like what was intended. Colors wrong, layout shifted, spacing off, components overlapping. No amount of DOM assertions will catch that.
These are two different kinds of gates: structural, which are fast and deterministic, and stochastic, which are slow but catch an entirely different class of problem. There is very little overlap between the two sets of issues, and you want to catch both.
That way I can invest a lot of time getting the mock just right, then let the agents "make it so".
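The two-layer gate described above might look something like this. This is a sketch, not the commenter's tool: the multimodal evaluation is injected as a callable (`evaluate_with_llm` is a hypothetical stand-in for whatever vision model you call) so the gate's structure is visible without a live API.

```python
# Sketch: a two-layer visual gate. Structural DOM checks are fast and
# deterministic; the stochastic check hands a screenshot plus the design
# mock to a multimodal LLM (injected here as a plain callable).
from typing import Callable

def visual_gate(dom_checks: list[Callable[[], bool]],
                screenshot: bytes,
                mock: bytes,
                evaluate_with_llm: Callable[[bytes, bytes], bool]) -> dict:
    structural_ok = all(check() for check in dom_checks)  # cheap, deterministic
    # Only pay for the slow, stochastic check once the structure passes.
    visual_ok = structural_ok and evaluate_with_llm(screenshot, mock)
    return {"structural": structural_ok, "visual": visual_ok}
```

Ordering the layers this way means the expensive LLM call only runs on pages that are already structurally sound, which is where a "looks nothing like the mock" failure is actually informative.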
It's the whole tool that's important, not so much how you get the screenshots. That's what I'm saying: this is headed in the right direction, it just falls a little short of what I do, where I get tons of value over and above just Playwright (or whatever gets the screenshot).
The critical part is that viewed at a high level, this method tests something different, which means it catches different errors.