Why does the system work like that? Is the cache local, or on Claude's servers?
Why not store the prompt cache to disk when it goes cold for a certain period of time, and then when a long-lived, cold conversation gets re-initiated, you can re-hydrate the cache from disk. Purge the cached prompts from disk after X days of inactivity, and tell users they cannot resume conversations over X days without burning budget.
The cache is on Anthropic's servers; it's like a freeze-frame of the LLM's inner workings at that point in time, and the LLM can pick up directly from this saved state. As you can guess, this saved state contains bits of the underlying model (their secret sauce), so it cannot be saved locally...
Maybe they could let users store an encrypted copy of the cache? Since the users wouldn't have Anthropic's keys, it wouldn't leak any information about the model (beyond perhaps its number of parameters judging by the size).
I'm unsure of the sizes needed for a prompt cache, but I suspect it's several gigs in size (some fraction of the model weight size). How would the user upload this every time they resumed an old idle session? And are they going to save /every/ session you do this with?
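As a rough sanity check on that size guess, here's a back-of-envelope KV-cache estimate. Every number below is a hypothetical configuration picked for illustration (a large model with grouped-query attention), not anything Anthropic has published:

```python
# Back-of-envelope KV-cache size for a hypothetical transformer.
# All parameters are assumptions, not any real model's config.
n_layers = 80        # decoder layers
n_kv_heads = 8       # KV heads (grouped-query attention)
head_dim = 128       # dimension per attention head
bytes_per_elem = 2   # fp16/bf16 storage

# Each token stores one K vector and one V vector per layer.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def cache_gb(context_tokens: int) -> float:
    """Total KV-cache size in GB for a given context length."""
    return context_tokens * bytes_per_token / 1e9

print(f"{bytes_per_token} bytes per token")               # 327680
print(f"100k-token context: {cache_gb(100_000):.1f} GB")  # 32.8 GB
```

Under these assumptions a long conversation really does land in the "several gigs" range, which is consistent with the upload-cost concern above.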
They could let you nominate an S3 bucket (or Azure/GCP/etc equivalent). Instead of dropping data from the cache, they encrypt it and save it to the bucket; on a cache miss they check the bucket and try to reload from it. You pay for the bucket; you control the expiry time for it; if it costs too much you just turn it off.
A few gigs of disk is not that expensive. Imo they should allocate every paying user (at least) one disk cache slot that never expires. Use it for their most recent long chat (a very short question-and-answer that could easily be replayed shouldn't evict a long convo).
What's lost in this thread is that these caches are in very tight supply: they live in the memory of the GPUs running inference. The GPUs must process all the tokens in the conversation (expensive), and continuing the conversation can then leverage the GPU cache to avoid re-processing the full context up to that point. But GPUs are in super tight supply, so if a thread has been dead for a while, they need to re-use the GPU for other customers.
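That eviction pressure can be sketched with a toy LRU prompt cache. This is purely illustrative (the class, names, and structure are all made up); it just shows why the coldest conversation is the one that loses its slot when capacity runs out:

```python
import hashlib
from collections import OrderedDict

class PrefixCache:
    """Toy LRU prompt cache. Hypothetical sketch, not any real
    serving system: it only illustrates cold-thread eviction."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._entries = OrderedDict()  # prefix-hash -> cached KV state

    @staticmethod
    def _key(tokens: tuple) -> str:
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def lookup(self, tokens: tuple):
        k = self._key(tokens)
        if k in self._entries:
            self._entries.move_to_end(k)  # mark as recently used
            return self._entries[k]
        return None  # cache miss: must re-prefill the whole context

    def store(self, tokens: tuple, kv_state) -> None:
        k = self._key(tokens)
        self._entries[k] = kv_state
        self._entries.move_to_end(k)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the coldest thread
```

With `max_entries=2`, storing a third conversation silently evicts whichever of the first two was touched least recently, and its next `lookup` comes back as a miss: the "dead thread burns budget" behavior the thread is discussing.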
Encryption can only ensure the confidentiality of a message from a non-trusted third party, but when that non-trusted third party happens to be your own machine hosting Claude Code, it is pointless. You could always dump the keys used to encrypt/decrypt the cache from your machine's memory and use them to reconstruct the model weights.
jetbalsa said that the cache is on Anthropic's server, so the encryption and decryption would be server-side. You'd never see the encryption key; Anthropic would just give you an encrypted dump of the cache that would otherwise live on its servers, then decrypt it with their own key when you replay the copy.
This is also an oversimplification. If I understand the issue correctly, the notification with the message contents was what was cached locally and then accessed. The same vulnerability would exist with Signal if you had notifications configured to display the full message contents. In this case, it has nothing to do with either Apple or Signal.
I think Martin isn't wrong here, but I've seen firsthand AI produce "lazy" code where the answer was actually more code.
A concrete example: I had a set of Python models that defined a database schema for a given set of logical concepts.
I added a new logical concept to the system, very analogous to the existing logical set. Claude decided that it should just re-use the existing model set, which worked in theory, but caused the consumers to have to do all sorts of gymnastics to do type inference at runtime. It "worked", but it was definitely the wrong layer of abstraction.
Is more code really bad? For humans, yes, we want things abstracted, but sometimes it may make more sense to actually repeat yourself. If a machine is writing and maintaining the code, do we need that extra layer now?
In the olden days we used Duff's devices and manually unrolled loops with duplicated code that we wrote ourselves.
Now, the compiler is "smart" enough to understand your intent and actually generates duplicated assembly code. You don't care that it's duplicated because the compiler is doing it for you.
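The hand-unrolling pattern being described looks something like this. A toy Python sketch, shown only to illustrate the duplicated code we used to write ourselves (the actual performance payoff of unrolling lives at the C/assembly level, not in Python):

```python
def sum_rolled(xs):
    """The straightforward loop: one addition per iteration."""
    total = 0
    for x in xs:
        total += x
    return total

def sum_unrolled_4x(xs):
    """Hand-unrolled 4x: the duplicated-by-hand style of the old days."""
    total = 0
    n = len(xs)
    i = 0
    # Main body: four additions per iteration, written out explicitly.
    while i + 4 <= n:
        total += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3]
        i += 4
    # Leftover tail (the part Duff's device handles cleverly in C).
    while i < n:
        total += xs[i]
        i += 1
    return total
```

Both produce the same result; the unrolled version just trades readability for fewer loop iterations, which is exactly the trade a modern compiler now makes invisibly on your behalf.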
I've had some projects recently where I was using an LLM and needed a few snippets of non-trivial computational geometry. In the old days, I'd have to go search for a library, get permission from compliance to import it, and then convert my domain representations of stuff into the formats that library needed. All of that would have been cheaper than writing the code myself, but it was non-trivial.
Now the LLM can write for me only the stuff I need (no extra big library to import) and it will use the data in the format I stored it in (no needing to translate data structures). The canon says the "right" way to do it would be to have a geometry library to prevent repeated code, but here I have a self contained function that "just works".
This kind of thinking only works as long as the machine can actually fix its own errors.
I've had several bugs that required manual intervention (yes, even with $YOUR_FAVORITE_MODEL -- I've tried them all at this point). After the first few sessions of deleting countless lines of pointless cruft, I quickly learned the benefits of preemptively trimming down the code by hand.
We have confidence in the extra code a compiler generates because it's deterministic. We don't have that confidence with LLMs, neither the ones that wrote the code nor the ones reading it.
Anecdotally, I've always been reasonably pleased with their products. I think I've owned a couple of powerbanks, and a USB/HDMI hub. Of the <Insert_random_smattering_of_letters> brand names on Amazon, I do tend to lean towards them a bit more.
edit: having said all of that, relating to this article, I don't want AI anywhere near the products of theirs I'm currently buying.
I've been digging into some of the history of modern cryptography as I seek to better explain how an app of mine works under the hood. It has been fascinating. I didn't expect AES-256-GCM's history to involve a duel where someone literally died. I'm hoping to write a multi-part series digging more specifically into the history of the cryptography space.
We'll keep dangerous devices like the SuperBox in our homes if it helps us get access to free movies and TV.
We'll use single-use plastics, even if we know they're bad for the environment, because they're just so damn easy.
We'll let AI run that thing for us, because it's just too easy.
A whole generation has grown up without knowing what it was like to infect your computer with AIDS trying to download an MP3, and it shows. That caution will come back, just at a terrible cost.
More generically, our species' Achilles heel is our inability to factor in the long-term cost of negative externalities when evaluating processes that yield short-term positive results.
This. From simple personal choices to the market economy and politics. With games we're introduced to cheat codes pretty early in our lives. Some people outgrow them, some don't. Too bad our systems encourage their use, whether it's a time-to-market thing, cutting costs, or the next election.
Just because there's a chance of something bad happening doesn't mean it's worth abandoning all convenience and workflow improvements, though. If no one ever used workflow tools that could access the contents of their emails because of the risk of a leak, it's possible the productivity loss across society would be much worse than the loss from security incidents (like this one). There are pros and cons to things. It's not wrong to choose something just because it has a small risk associated with it.
This is a great idea, but respectfully, if you're going to get traction you need to be the one instigating getting people to talk to you. Have a pitch, have an explicit ask, and be willing to put effort into making it happen.