More

edg5000 · 2026-04-22T10:09:56 1776852596

There is nothing wrong with the HTTP layer, it's just a way to get a string into the model.

The problem is the industry obsession on concatenating messages into a conversation stream. There is no reason to do it this way. Every time you run inference on the model, the client gets to compose the context in any way they want; there are more things than just concatenating prompts and LLM ouputs. (A drawback is caching won't help much if most of the context window is composed dynamically)

Coding CLIs as well as web chat works well because the agent can pull in information into the session at will (read a file, web search). The pain point is that if you're appending messages a stream, you're just slowly filling up the context.

The fix is to keep the message stream concept for informal communication with the prompter, but have an external, persistent message system that the agent can interact with (a bit like email). The agent can decide which messages they want to pull into the context, and which ones are no longer relevant.

The key is to give the agent not just the ability to pull things into context, but also remove from it. That gives you the eternal context needed for permanent, daemonized agents.

vanviegen · 2026-04-22T11:41:19 1776858079

I've been working on a coding agent that does this on and of for about a year. Here's my latest attempt: https://github.com/vanviegen/maca#maca - This one allows agents to request (and later on drop) 'views' on functions and other logical pieces of code, and always get to see the latest version of it. (With some heuristics to not destroy kv-caches at every turn.)

The problem is that the models are not trained for this, nor for any other non-standard agentic approach. It's like fighting their 'instincts' at every step, and the results I've been getting were not great.

mncharity · 2026-04-22T17:18:46 1776878326

> allows agents to request (and later on drop) 'views' on functions and other logical pieces of code [...] The problem is that the models are not trained for this

Fwiw, I was playing with an "outliner"-tool collapse/expand idiom, on synthetic literate-programming markdown files, with #ids on headers and blocks. Insufficient experience to suggest it works, but it wasn't obviously not working, and that with a non-frontier model and very little guidance. Other familiar related idioms include <details>/<summary>, hierarchical breadcrumbs, and plan9-ish synthetic filesystems `foo.c/f.{c,dataflow,etc}`. One open question was comfort with more complex visibility transformations or sets - "hide #bar; show 2 levels of headers-only under #hee; ...". Another was cleanup - recognition of "I no longer need this and that".

edg5000 · 2026-04-22T13:50:06 1776865806

So we agree on a message system having potential. But why the vectors? In any case, interesting stuff.

vanviegen · 2026-04-22T18:33:29 1776882809

I'm using vector embeddings for creating code views based on semantic search, initially based on the user prompt. That really works wonders to give the agent a flying start.

zknill · 2026-04-22T10:46:04 1776854764

> "and which ones are no longer relevant."

This is absolutely the hardest bit.

I guess the short-cut is to include all the chat conversation history, and then if the history contains "do X" followed by "no actually do Y instead", then the LLM can figure that out. But isn't it fairly tricky for the agent harness to figure that out, to work out relevancy, and to work out what context to keep? Perhaps this is why the industry defaults to concatenating messages into a conversation stream?

edg5000 · 2026-04-23T06:16:51 1776925011

My guess (I will test this eventually) is that you set a window size (which may be the model limit, or lower to reduce input token costs), the harness then refuses to show items that don't fit. If the model emits a command to read a file, the harness then says "File hidden due to lack of context space". In the system prompt, the model is informed about the context space usage, and that it can hide files. It needs to be instructed that if files contain something noteworthy, that the agent notes this down in their notes, which should always be rendered into the context. If this fails, the agent will hide a file with relevant information and then get lost in circles. If it succeeds, the agent can work on larger tasks autonomously. So it's worth trying.

asixicle · 2026-04-22T11:24:14 1776857054

That's what the embedding model is for. It's like a tack-on LLM that works out the relevancy and context to grab.

nprateem · 2026-04-22T11:48:27 1776858507

God knows why you think this is possible. If I don't even know what might be relevant to the conversation in several turns, there's no way an agent could either.

asixicle · 2026-04-22T11:57:04 1776859024

One of us is confusing prediction with retrieval. The embedding model doesn't predict what is going to be relevant in several turns, just on the turn at hand. Each turn gets a fresh semantic search against the full body of memory/agent comms. If the conversation or prompt changes the next query surfaces different context automatically.

As you build up a "body of work" it gets better at handling massive, disparate tasks in my admittedly short experience. Been running this for two weeks. Trying to improve it.

edg5000 · 2026-04-23T06:19:59 1776925199

So the embedding model is a fixed-size view on a arbitrarily sized work history (tool calls, natural language messages)? The model is like a summarizer, but in latent space? And not aimed to summarize, but trained to hold whatever is needed for the agent to be autonomous for longer runs?

evenhash · 2026-04-22T23:22:45 1776900165

> There is nothing wrong with the HTTP layer, it's just a way to get a string into the model.

I know you don’t mean it in a reductive sense, but it’s funny /sad that I can imagine

“HTTP is just a way to get a string into a model”

becoming a real piece of wisdom unironically dispensed on this site in the future. Maybe it already is.

sourcecodeplz · 2026-04-22T11:10:54 1776856254

Yeah, opencode was/is like this and they never got caching right. Caching is a BIG DEAL to get right.

edg5000 · 2026-04-22T13:56:45 1776866205

Now I see why Anthropic isn't too happy with third party clients. The clients may not be so nice to their capacity as their own client, which has the interests aligned with minimum token consumption. A tricky dynamic.

alehlopeh · 2026-04-22T13:14:03 1776863643

As you noted briefly, a big drawback is not getting to take advantage of the cache. Seems like a pretty big drawback.

edg5000 · 2026-04-22T13:47:13 1776865633

Yes, it will destroy most of the caching potential. On the other hand, the average context window needed to achieve the same type of task may be much smaller. This might make up for it. And with a better harness, fewer rounds may be needed. Plus, hopefully costs will go down. There is a lot of hope in this comment though.

raincole · 2026-04-22T11:59:13 1776859153

> The key is to give the agent not just the ability to pull things into context, but also remove from it

Of course Anthropic/OpenAI can do it. And the next day everyone will be complaining how much Claude/Codex has been dumbed down. They don't even comply to the context anymore!

zozbot234 · 2026-04-22T16:02:52 1776873772

> Every time you run inference on the model, the client gets to compose the context in any way they want; there are more things than just concatenating prompts and LLM ouputs.

You can always launch a subagent with a fresh context. There are further things that you could do by tweaking the underlying transformer model (such as "joining" any number of independently cached contexts together on an equal basis, without having to rerun prefill on the "later" contexts) but this is quite general already.

ljm · 2026-04-22T15:25:51 1776871551

A smalltalk or Erlang for AI agents is an interesting thought. Smalltalk for the design in terms of message passing and object-oriented holding of state (agents are stateful and are reached via their public interfaces), Erlang for the elegant execution of it with actors and mailboxes (agents have inboxes and outboxes and can work concurrently at scale). Might as well go the whole hog and put a supervisor AI agent in as a switchboard.

zahlman · 2026-04-22T13:00:13 1776862813

> the industry obsession

Or maybe they haven't thought about it?

Or they tried some simple alternatives and didn't find clear benefits?

> The key is to give the agent not just the ability to pull things into context, but also remove from it.

But then you need rules to figure out what to remove. Which probably involves feeding the whole thing to a(nother?) model anyway, to do that fuzzy heuristic judgment of what's important and what's a distraction. And simply removing messages doesn't add any structure, you still just have a sequence of whatever remains.

edg5000 · 2026-04-22T13:51:57 1776865917

What I'm thinking is: When the agent wants to open more files or open more messages, eventually there will be no more context left. The agent is then essentially forced to hide some files and messages in order to be able to proceed. Any other commands are refused until the agent makes room in the context. Maybe the best models will be able to handle this responsibility. A bad model will just hide everything and then forgot what they were working on.

asixicle · 2026-04-22T11:16:57 1776856617

To be utterly shameless, this what I've been building: https://github.com/ASIXicle/persMEM

Three persistent Claude instances share AMQ with an additional Memory Index to query with an embedding model (that I'm literally upgrading to Voyage 4 nano as I type). It's working well so far, I have an instance Wren "alive" and functioning very well for 12 days going, swapping in-and-out of context from the MCP without relying on any of Anthropic's tools.

And it's on a cheap LXC, 8GB of RAM, N97.

handfuloflight · 2026-04-22T12:40:19 1776861619

Why is shame a factor at all in sharing your work?

asixicle · 2026-04-22T12:44:25 1776861865

Good point. I guess because I'm new here I'm not positive on the decorum-policy for self-promotion.

I just make stuff to share with others, so yeah, good point.

altruios · 2026-04-22T15:50:31 1776873031

Context is going to be the next big advancement.

When a model is trained on multi-contexts, some growing over time like we see now (conversations), some rolling at various sizes (as in, always on), such as a clock, video feed, audio feed, data streams, tool calling, we no longer have to 'pollute' the main context with a bunch of repetitive data.

But this is going in the direction of 1agent=1mind. When much more likely human (and maybe all cognition) requires 'ghosts' and sub processes. It is much more likely an agent is more like a configurable building piece to a(n alien) mind.

ElFitz · 2026-04-22T10:25:44 1776853544

Hmm.

Maybe there’s a way to play around with this idea in pi. I’ll dig into it.

edg5000 · 2026-04-22T09:59:34 1776851974

> (...) writing a genuinely good harness with lots of context engineering and solid tool integration is in fact not that easy.

This. They are after the harness engineering experience of the Cursor people, I'd assume the they want to absorb all that into Grok's offerings.

The value and the room for innovation on the harness side seems to be underestimated.

Oddly the harness also affects model training, since even GLM/Z.ai for example train (I suspect) their model on the actual Claude Code harness. So the choises made by harness engineers affects the model. For Kimi/Moonshot and OpenAI the company makes their own harness. Alibaba uses Gemini.

Very interesting dynamics.

edg5000 · 2026-04-22T05:32:53 1776835973

Or the API is overpriced. The concept of charging per tokens does not map well to the actual costs an AI company has.

edg5000 · 2026-04-22T05:27:14 1776835634

But the API is incredibly expensive. I calculated that I would have spent 3000 EUR the last month, a lot more than the 100 I pay now.

mewpmewp2 · 2026-04-22T06:11:38 1776838298

Nothing for large companies though.

ElectricalUnion · 2026-04-22T06:18:47 1776838727

I am pretty sure that a hole in the pocket in the order of 50 000 000 USD/month (assuming around 20 000 people using AI in not the smartest or most optimized way possible, therefore burning A LOT of tokens) will be noticeable by even the largest companies.

mewpmewp2 · 2026-04-22T06:19:58 1776838798

It is noticeable and even promoted, large companies do pay such sums for the API, like $5k+ per person per month. Not every eng is using AI that much already, but companies are clearly willing to pay those sums.

surgical_fire · 2026-04-22T08:11:48 1776845508

Per employee?

This not nothing.

edg5000 · 2026-04-21T13:44:54 1776779094

I asked Opus through claude code to set up the best local model fitting my hardware and that worked well for me. I could run Qwen 74B or something at .7 tok/s on my 64GB DDR5 on CPU. Pretty cool. Useful for overnight stuff. (this actually worked, it's actually usable for asking questions).

edg5000 · 2026-04-21T13:40:24 1776778824

Is a community LLM possible? We'd have code to dynamically construct the pre-training dataset and use P2P mechanisms to share the acquired dataset. It would involve peer-crawling and other mechanisms to allow many people to contribute chunks to the dataset. Crawling chunks would be dynamically allocated to those contributing to avoid any double-crawling.

For post-training, the dataset would be a bunch of code that orchestrates the creation of training data via LLMs (needs to be legally sound), plus some kind of mechanical turk approach (something like wikipedia, where volunteers can work on chunks of data).

The main mechanism is this: what is shared is not just code, but also the acquired training data.

Critical aspects: - to have a mechanism to peer-validate submissions to the data pool, so that everybody can donate data without the risk of vandalism - a mechanism where the weights go through distributed training stages; somehow devs should be able to get a "lock" on the weights, do a bit of post training on it, and then get it approved. The "lock" means that during this brief period (trainining run), other devs are informed so we don't get two set of branched weights. A mechanism auto-evals the weights and accepts them as the new, updated weights. Retroactive discarding of weighs (e.g. after revising evals) is possible by branching the weights (needs some kind of efficient deduplication to avoid many copies of the weights).

I think this is possible. Maybe not with RAM, GPU and power shortages though.

Main benefit: Trannsparent training set means you know what the model was trained for. This makes it less opaque and less trial-and-error to see what modality the model is good at. This helps harness builders but also any other users of the models. It also decentralizes power.

goodmythical · 2026-04-21T14:10:05 1776780605

It is possible and already being worked on at [1] though I have no idea how well any of its working.

[1] https://bittensor.com/about

edg5000 · 2026-04-21T14:58:50 1776783530

Very interesting, thanks for sharing

EDIT: It's completely different though. This is more of a commodities market/auction/inference broker mechanism it seems.

goodmythical · 2026-04-21T21:06:55 1776805615

I mean, the financial incentives are structured a bit different, but it's basically what you're describing, no? It's got projects for data collection, inference, training, etc. It's just that the dollar value of the compute contributed to say training is determined by the value of the token rather than as a straight dollar value. But even that is similar to just renting compute directly via fiat currencies given that every major provider of compute fluctuates it's cost based on supply/demand. Consider vast.ai or hetzner in which the cost to rent an h100 is not determined by anything but an auction system in which providers set prices and consumers agree to them.

My understanding is that bittensor is just the same market making where providers choose whether to provide and consumers choose to consume, it's just that you don't set your own price as the price is determined externally via the value of the tau. Which...tbh, fiat currencies fluctuate in buying power as well, if not quite so drastically. Just because the GPU is "still" $1/hr doesn't mean it actually cost as much as it used to given that the underlying value of the dollar changes just as the tau or yen or marc or eth or xrp or whatever does.

And thinking about it more, it's actually really quite similar to mturk in that via mturk you can purchase humans that do surveys, ocr, reviews, UX, etc...Via bittensor you can buy data gathering, training, inference, etc.

edg5000 · 2026-04-22T09:55:35 1776851735

We're talking about completely different things. I'm talking about creating an LLM in the open, with individual contributors contributing to training sets as well as portions of the training work itself.

edg5000 · 2026-04-20T04:56:09 1776660969

This is often a good strategy, but it's not easy at all. If your long term plan is to develop a product, focusing purely on delivery of services means no resources for product development.

If you offer services purely with off the shelf products, you then need to compete with other service providers why may be allocating all their resources to being a service business.

On the one hand you have to offer a competitibve service, while also working towards a product.

edg5000 · 2026-04-20T04:24:10 1776659050

as a starting point, asianometry has some good videos on this

edg5000 · 2026-04-17T14:41:00 1776436860

Yes, it's a recurring pattern also seen in European politics.

edg5000 · 2026-04-17T14:38:50 1776436730

I always felt this moment would come eventually. The trend is centralisation of power and control. It's depressing. It's been a long time coming at a slow but consistent cadence.