dhedlund's comments

dhedlund · 2026-04-20T06:01:58 1776664918

One flag that you might have missed. If you have sparse files such as qcow disk images, you might also need `-S`, `--sparse` or it will expand the file to its full disk size when copied to the destination.

fxtentacle · 2026-04-20T07:10:14 1776669014

Thanks :)

dhedlund · 2026-04-19T15:20:55 1776612055

The "Picking delaySeconds" section is quite enlightening.

I feel like this explains about a quarter to half of my token burn. It was never really clear to me whether tool calls in an agent session would keep the context hot or whether I would have to pay the entire context loading penalty after each call; from my perspective it's one request. I have Claude routinely do large numbers of sequential tool calls, or have long running processes with fairly large context windows. Ouch.

> The Anthropic prompt cache has a 5-minute TTL. Sleeping past 300 seconds means the next wake-up reads your full conversation context uncached — slower and more expensive. So the natural breakpoints:

> - *Under 5 minutes (60s–270s)*: cache stays warm. Right for active work — checking a build, polling for state that's about to change, watching a process you just started.

> - *5 minutes to 1 hour (300s–3600s)*: pay the cache miss. Right when there's no point checking sooner — waiting on something that takes minutes to change, or genuinely idle.

> *Don't pick 300s.* It's the worst-of-both: you pay the cache miss without amortizing it. If you're tempted to "wait 5 minutes," either drop to 270s (stay in cache) or commit to 1200s+ (one cache miss buys a much longer wait). Don't think in round-number minutes — think in cache windows.

> For idle ticks with no specific signal to watch, default to *1200s–1800s* (20–30 min). The loop checks back, you don't burn cache 12× per hour for nothing, and the user can always interrupt if they need you sooner.

> Think about what you're actually waiting for, not just "how long should I sleep." If you kicked off an 8-minute build, sleeping 60s burns the cache 8 times before it finishes — sleep ~270s twice instead.

> The runtime clamps to [60, 3600], so you don't need to clamp yourself.

Definitely not clear if you're only used to the subscription plan that every single interaction triggers a full context load. It's all one session session to most people. So long as they keep replying quickly, or queue up a long arc of work, then there's probably a expectation that you wouldn't incur that much context loading cost. But this suggests that's not at all true.

wongarsu · 2026-04-20T00:22:11 1776644531

They really should have just set the cache window to 5:30 or some other slightly odd number instead of using all those tokens to tell claude not to pick one of the most common timeout values

stingraycharles · 2026-04-20T02:10:17 1776651017

This is somewhat obvious if you realize that HTTP is a stateless protocol and Anthropic also needs to re-load the entire context every time a new request arrives.

The part that does get cached - attention KVs - is significantly cheaper.

If you read documentation on this, they (and all other LLM providers) make this fairly clear.

dhedlund · 2026-04-20T05:42:19 1776663739

For people who spend a significant amount of time understanding how LLMs and the associated harnesses work, sure. For the majority of people who just want to use it, it's not quite so obvious.

The interface strongly suggests that you're having a running conversation. Tool calls are a non-interactive part of that conversation; the agent is still just crunching away to give you an answer. From the user's perspective, the conversation feels less like stateless HTTP where the next paragraph comes from a random server, and more like a stateful websocket where you're still interacting with the original server that retains your conversation in memory as it's working.

Unloading the conversation after 5 minutes idling can make sense to most users, which is why the current complaints in HN threads tend to align with that 1 hour to 5 minute timeout change. But I suspect a significant amount of what's going on is with people who:

* don't realize that tool calls really add up, especially when context windows are larger.

* had things take more than 5 minutes in a single conversation, such as a large context spinning up subagents that are each doing things that then return a response after 5+ minutes. With the more recent claude code changes, you're conditioned to feel like it's 5 minutes of human idle time for the session. They don't warn you that the same 5 minute rule applies to tool calls, and I'd suspect longer-running delegations to subagents.

NitpickLawyer · 2026-04-20T04:23:59 1776659039

Unless I'm parsing your reply very badly, I see no world in which anything dealing with HTTP would be more expensive than dealing with kv cache (loading from "cold" storage, deciding which compute unit to load it into, doing the actual computations for the next call, etc).

stingraycharles · 2026-04-20T10:45:05 1776681905

No, that’s not the issue. What people fail to understand is that every request - eg every message you send, but also tool call responses - require the entire conversation history to be sent, and the LLM providers need to reprocess things.

The attention part of LLMs (that is, for every token, how much their attention is to all other tokens) is cached in a KV cache.

You can imagine that with large context windows, the overhead becomes enormous (attention has exponential complexity).

dhedlund · 2026-04-14T03:13:39 1776136419

This reminds me a bit of monoliths vs microservices. People would see microservices as the next new shiny thing and bring it with them to their next job, or read a great blog post that sounds great in theory, but falls apart in practice. People would see it as as purely architectural decision. But the reality was that you had to have the organizational structure to support that development model or you'd find out that it just doesn't scale the way you expect and introduces its own sets of problems. My experience is that most teams that didn't have large orgs got bogged down by the weight of microservices (or things called "microservices"). It required a lot of tooling and orchestration to manage. But there was this promise that you could easily just rewrite that microservice from scratch or change languages and nobody would notice or care.

LLM-generated code feels the same. Reviewing LLM-generated code when it's in the context of a monolith is more taxing than reviewing it in the context of the microservice; the blast radius is larger and the risk is greater, as you can make decisions around how important that service actually is for system-wide stability with microservices. You can effectively not care for some services, and can go back and iterate or rewrite it several times over. But more importantly, the organizational structures that are needed to support microservice like architectures effectively also feel like the organizational structures that are needed to support LLM-generated codebases effectively; more silo-ing, more ownership, more contract and spec-based communication between teams, etc. Teams might become one person and an agent in that org structure. But communication and responsibilities feel like they're require something similar to what is needed to support microservices...just that services are probably closer in size to what many companies end up building when they try to build microservices.

And then there are majestic monoliths, very well curated monoliths that feel like a monorepo of services with clear design and architecture. If they've been well managed, these are also likely to work well for agents, but still suffer the same cognitive overhead when reviewing their work because organizationally people working on or reviewing code for these projects are often still responsible for more than just a narrow slice, with a lot of overlap with other devs, requiring more eyes and buy-in for each change as a result.

The organizational structures that we have in place for today might be forced to adapt over time, to silo in ways that ownership and responsibility narrow to fit within what we can juggle mentally. Or they'll be forced to slow down an accept the limitations of the organizational structure. Personal projects have been the area that people have had a lot of success with for LLMs, which feels closer to smaller siloed teams. Open-source collaboration with LLM PRs feels like it falls apart for the same cognitive overhead reasons as existing team structures that adopt AI.