The version of this I encounter literally every day is:
I ask my coding agent to do some tedious, extremely well-specified refactor, such as (to give a concrete real life example) changing a commonly used fn to take a locale parameter, because it will soon need to be locale-aware. I am very clear — we are not actually changing any behavior, just the fn signature. In fact, at all call sites, I want it to specify a default locale, because we haven't actually localized anything yet!
Said agent, I know, will spend many minutes (and tokens) finding all the call sites, and then I will still have to either confirm each update or yolo and trust the compiler and tests and the agents ability to deal with their failures. I am ok with this, because while I could do this just fine with vim and my lsp, the LLM agent can do it in about the same amount of time, maybe even a little less, and it's a very straightforward change that's tedious for me, and I'd rather think about or do anything else and just check in occasionally to approve a change.
But my f'ing agent is all like, "I found 67 call sites. This is a pretty substantial change. Maybe we should just commit the signature change with a TODO to update all the call sites, what do you think?"
And in that moment I guess I know why some people say having an LLM is like having a junior engineer who never learns anything.
Claude 4.7 broke something while we were working on several failing tests and justified itself like this:
> That's a behavior narrowing I introduced for simplicity. It isn't covered by the failing tests, so you wouldn't have noticed — but strictly speaking, [functionality] was working before and now isn't.
I know that a LLM can not understand its own internal state nor explain its own decisions accurately. And yet, I am still unsettled by that "you wouldn't have noticed".
I've been doing a lot of experimentation with "hands off coding", where a test suite the agents cannot see determines the success of the task. Essentially, it's a Ralph loop with an external specification that determines when the task is done. The way it works is simple: no tests that were previously passing are allowed to fail in subsequent turns. I achieve this by spawning an agent in a worktree, have them do some work and then when they're done, run the suite and merge the code into trunk.
I see this kind of misalignment in all agents, open and closed weights.
I've found these forms to be the most common, "this test was already failing before my changes." Or, "this test is flaky due to running the test suite on multiple threads." Sometimes the agent cot claims the test was bad, or that the requirements were not necessary.
Even more interesting is a different class of misalignment. When the constraints are very heavy (usually towards the end of the entire task), I've observed agents intentionally trying to subvert the external validation mechanisms. For example, the agent will navigate out of the work tree and commit its changes directly to trunk. They cot usually indicates that the agent "is aware" that it's doing a bad think. This usually is accompanied by something like, "I know that this will break the build, but I've been working on this task for too long, I'll just check what I have in now and create a ticket to fix the build."
I ended up having to spawn the agents in a jail to prevent that behavior entirely.
Are you using any tools specifically for controlling this behavior that you can recommend? I want to tear my hair out every time Claude cleanly 1-shots weeks of work to 99% accuracy, one or a couple of tests fail, and it calmly resolves it with a declaration that it was a "pre-existing failure" or "flaky". It can usually resolve it if I then explicitly tell it to stash the changes and compare against the test results from the prior state, but it happens constantly.
> strictly speaking, it was working before and now it isn't
I've been seeing more things like this lately. It's doing the weird kind of passive deflection that's very funny when in the abstract and very frustrating when it happens to you.
The thing to remember is that LLMs deeply model human behavior. If you want them to do their best work, you need to treat them like a collaborator and get them”invested” in the work and the outcome. I use an onboarding process with every new context and maintain an environment where a human would likely feel invested in the work and the outcomes. For me, it prevents a host of failure modes, and code quality has markedly improved.
What gets me, is when the tests are correct and match the spec/documentation for the behavior, but the LLM will start changing the tests and documentation altogether instead of fixing the broken behavior... having to revert (git reset), tell the agent that the test is correct and you want the behavior to match the test and documentation not the other way around.
I'm usually pretty particular about how I want my libraries structured and used in practice... Even for the projects I do myself, I'll often write the documentation for how to use it first, then fill in code to match the specified behavior.
"changing a commonly used fn to take a locale parameter, because it will soon need to be locale-aware"
JetBrains has a deterministic non-AI function for that refactoring. It'll usually finish before your AI has finished parsing your request and reading the files.
> Maybe we should just commit the signature change with a TODO
I'm fascinated that so many folks report this, I've literally never seen it in daily CC use. I can only guess that my habitually starting a new session and getting it to plan-document before action ("make a file listing all call sites"; "look at refactoring.md and implement") makes it clear when it's time for exploration vs when it's time for action (i.e. when exploring and not acting would be failing).
I have only seen "go do X" result in CC adding "TODO: X" to the working file on one occasion. When it happened, I noticed that the file contained a very similar todo for a similar action already. My guess is that because the agent had the whole file in context, that influenced it to produce output similar to what was already there.
Indeed you can! I don't use IntelliJ at work for [reasons], and LSP doesn't support a change signature action with defaults for new params (afaik). But it really seems like something any decent coding agent ought be able to one shot for precisely this reason, right?
Using a LLM for these tasks really is somewhat like using a Semi to shuttle your home groceries. Absolutely unnecessary, and can be done via a scooter. But if a Semi is all you have you use it for everything. So here we are.
The real deal is, while a Semi can do all the things you can do with a scooter, the opposite is not true.
I think about half the IDEs I've ever used just had this as a feature. Right-click on function, click on "change signature", wait a few seconds, verify with `git diff`.
I actually still like LLMs for this. I use rust LSP (rust analyzer) and it supports this, but LLMs will additionally go through and reword all of the documentation, doc links, comments, var names in other funcs in one go, etc.
Are they perfect? Far from it. But it's more comprehensive. Additionally simple refactors like this are insanely fast to review and so it's really easy to spot a bad change or etc. Plus i'm in Rust so it's typed very heavily.
In a lot of scenarios i'd prefer an AST grep over an LSP rename, but hat also doesn't cover the docs/comments/etc.
Shouldn't the LLM have some tool that gives it AST access, LSP access, and the equiv of sed/grep/awk? It doesn't necessarily need to read every file and do the change "by hand".
That's correct, though you'll still end up needing more than AST/LSP/etc for the same reason AST/LSP/etc isn't enough for me (the human lol), ie comments/docs/etc.
yeah, and this has the advantage of both being deterministic, and only updating things that are actually linked as opposed to also accidentally updating naming collisions
Arguably its only a matter of making lsp features available to the coding agent via tool calls (CLI, MCP) to prevent the model start doing such changes "manually" but rather use the deterministic tools.
Part of why I'm not terribly fond of CLI harnesses, and prefer ones built into editors like zed. They can (but sadly rarely do) access structured information about your codebase, that's more sophisticated than looking for all strings that match
It's not always amenable to grepping. But this is a great use case for AST searches, and is part of the reason that LSP tools should really be better integrated with agents.
Works fine in algol-like languages (C, C++ for a start) by just changing the function prototype and finding all instances from the compiler errors, using your compiler as the AST explorer ...
We were supposed to get agents who could use human tooling.
Instead we are apparently told to write interfaces for this stumbling expensive mess to use.
Maybe, just maybe, if the human can know to, and use, the AST tool fine, the problem is not the tool but the agent.
It's much harder to search using an AST tool for a human. It's certainly harder than grepping, for example. I use AST tools myself, but it takes a while to represent a complex structure in a big codebase when I need to look for that.
Programming language are formal, so unless you’re doing magic stuff (eval and reflection), you can probably grep into a file, eliminate false positive cases, then do a bit of awk or shell scripting with sed. Or use Vim or Emacs tooling.
And an agent can learn to use sg with a skill too. (Or they can use sed)
The issue is, at every point you do a replace, you need to verify if it was the right thing to do or if it was a false positive.
If you are doing this manually, there's the time to craft the sed or sg query, then for each replacement you need to check it. If there are dozens, that's probably okay. If there are hundreds, it's less appealing to check them manually. (Then there's the issue of updating docs, and other things like this)
People use agents because not only they don't want to write the initial sed script, they also don't want to verify at each place if it was correctly applied, and much less update docs. The root of this is laziness, but for decades we have hailed laziness as a virtue in programming.
I've have a different version of the same thing. My pet peeve is that it constantly interprets questions as instructions.
For example, it does a bunch of stuff, and I look at it and I say, "Did we already decide to do [different approach]" And then it runs around and says, "Oh yeah," and then it does a thousand more steps and undoes does what it just did and gets itself into a tangle.
Meanwhile, I asked it a question. The proper response would be to answer the question. I just want to know the answer.
I had it right. That behavior into a core memory, and it seems to have improved for what it's worth.
Solved this by starting my prompt in ask mode in vscode and having it candidly plan changes so I can approve them. Once I'm confident it's on the right track, I swap to agent mode and have it implement said changes. Takes longer, but separating working tasks from conversations has been a better workflow overall
So, same concept for asking questions / discussing features. Get out of agent mode and use conversational until you want changes made
> I found 67 call sites. This is a pretty substantial change. Maybe we should just commit the signature change with a TODO to update all the call sites, what do you think?
I think some of this is a problem in the agent's design. I've got a custom harness around GPT5.4 and I don't let my agent do any tool calling on the user's conversation. The root conversation acts as a gatekeeper and fairly reliably pushes crap responses like this back down into the stack with "Ok great! Start working on items 1-20", etc.
Ehhhhh, "problem" is a strong word. Sometimes you're throwing out a lot of signal if you don't let the coding agent tell you it thinks your task is a bad idea. I got a PR once attempting to copy half of our production interface because the author successfully convinced Claude his ill-formed requirements had to be achieved no matter what.
there is no use for an automated system that "argues" with your commands.
if i ask it to advise me, thats one thing, but if i command it to perform, nothing short of obedience will suffice.
I just explained the use I have for it. If you think that my use case is wrong or misunderstood in some way, I'd love to hear it. If your response is just "no", I guess I'm not sure how to engage with that.
That’s my daily experience too. There are a few more behaviours that really annoys me, like:
- it breaks my code, tests start to fail and it instantly says “these are all pre existing failures” and moves on like nothing happened
- or it wants to run some a command, I click the “nope” button and it just outputs “the user didn’t approve my command, I need to try again” and I need to click “nope” 10 more times or yell at it to stop
- and the absolute best is when instead of just editing 20 lines one after another it decides to use a script to save 3 nanoseconds, and it always results in some hot mess of botched edits that it then wants to revert by running git reset —hard and starting from zero. I’ve learned that it usually saves me time if I never let it run scripts.
The other day Codex on Mac gained the ability to control the UI. Will it close itself if instructed though? Maybe test that and make a benchmark. Closebench.
Make it write a script with dry run and a file name list.
You’ll be amazed how good the script is.
My agent did 20 class renames and 12 tables. Over 250 files and from prompt to auditing the script to dry run to apply, a total wall clock time of 7 minutes.
I've had the agent tell me "this looks like it's going to be a very big change. it could take weeks." - and then I tell it to go ahead and it finishes in 5 minutes because in reality it just needs grep and sed.
One of my favorite things to do with AI is when a slow teammates says something is far too difficult (without explaining why) is to just... try it.
Used to do it by hand, which usually didn't take nearly as long as they said, and now with AI I can often one-shot these type of things, at least as a proof of concept.
I have the feeling they do this to save tokens in case you didn't mean to execute such a big task right away. But yeah it's simple enough to say "Just do it now"
Indeed! You would think it would have some kind of sense that a commit that obviously won't compile is bad!
You would think.
It would be one thing if it was like, ok, we'll temporarily commit the signature change, do some related thing, then come back and fix all the call sites, and squash before merging. But that is not the proposal. The plan it proposes is literally to make what it has identified as the minimal change, which obviously breaks the build, and call it a day, presuming that either I or a future session will do the obvious next step it is trying to beg off.
Pretty sure it’s a harness or system prompt issue.
I have never seen those “minimal change” issues when using zed, but have seen them in claude code and aider. Been using sonnet/opus high thinking with the api in all the agents I have tested/used.
On my compiled language projects I have a stop hook that compiles after every iteration. The agent literally cannot stop working until compilation succeeds.
In the case I described no code changes have been made yet. It's still just planning what to do.
It's true that I could accept the plan and hope that it will realize that it can't commit a change that doesn't compile on its own, later. I might even have some reason to think that's true, such as your stop hook, or a "memory" it wrote down before after I told it to never ever commit a change that doesn't compile, in all caps. But that doesn't change the badness of the plan.
Which is especially notable because I already told it the correct plan! It just tried to change the plan out of "laziness", I guess? Or maybe if you're enough of an LLM booster you can just say I didn't use exactly the right natural language specification of my original plan.
I’m skeptical of most “harness hacking”, but this is a situation that calls for it. You need to establish some higher level context or constraint it’s working against.
Hahahaha!!! Mine told me that the project we were working was and I quote, “good enough, it works” I laughed pretty hard but also couldn’t believe it got lazy and didn’t wanna work anymore
I had a fun experience with my ISP where their chat bot couldn't help me (of course it couldn't, I don't call for "did you try turning it off and on again" problems), so it escalated me to a human agent. Said human agent was very obviously copy-pasting LLM output. I could tell because (1) the responses were nearly identical to what Claude already told me when I asked it before calling and (2) every once in a while I would get an uncharacteristically brief reply, without capitalization or punctuation, in Indian English.
I haven't a had a good experience since AT&T bought my previous ISP and forced me to switch to a different subsidiary.
I was a MS-DOS 2.0 user as a child. I have always preferred windows to OSX. I used WSL for years at companies where every other engineer had a MacBook.
Last weekend I finally started dual-booting Arch Linux as a trial. Yesterday I deleted my windows partition.
I’m still on the Linux Mint part of the transition from Windows and I just for no reason see going back.
I’m dealing with sub-par Office on my work machines. But as MS moves/forces Office into online modes and I’m hoping that it’ll just be an electron app I can pull up.
Deleting the partition is a good strategy to commit yourself. It might take some effort to get back to you productivity (and autonomy) levels, but then you will exceed them.
> Since the formulas did depend on each other the order of (re)calculation made a difference. The first idea was to follow the dependency chains but this would have involved keeping pointers and that would take up memory. We realized that normal spreadsheets were simple and could be calculated in either row or column order and errors would usually become obvious right away. Later spreadsheets touted "natural order" as a major feature but for the Apple ][ I think we made the right tradeoff.
It would seem that the creators of VisiCalc regarded this is a choice that made sense in the context of the limitations of the Apple ][, but agree that a dependency graph would have been better.
Edit: It's also interesting that the tradeoff here is put in terms of correctness, not performance as in the posted article. And that makes sense: Consider a spreadsheet with =B2 in A1 and =B1 in B2. Now change the value of B1. If you recalc the sheet in row-column OR column-row order, B2 will update to match B1, but A1 will now be incorrect! You need to evaluate twice to fully resolve the dependency graph.
Even LaTeX just brute-forces dependencies such as building a table of contents, index, and footnote references by running it a few times until everything stabilizes.
It is possible (though very rare) to get a situation in LaTeX where it keeps oscillating between two possible “solutions” - usually forcing a hbox width will stabilize it.
but wasn't it documented to do it in some sort of "down and to the right" order, and if you wrote your formulas "up and to the left" everything would be hunky dory?
tables generally have row and column sums, subtotals, and averages down and to the right.
It seems like the argument is roughly: we used to use CMS because we had comms & marketing people who don't know git. But we plan to replace them all with ChatGPT or Claude, which does. So now we don't need CMS.
(I didn't click through to the original post because it seems like another boring "will AI replace humans?" debate, but that's the sense I got from the repeated mention of "agents".)
Cursor replaced their CMS because Cursor is a 50-people team shipping content to one website. Cursor also has a "Designers are Developers" scenario so their entire team is well versed with git.
This setup is minimal and works for them for the moment, but the author argues (and reasonably well enough, IMO) that this won't scale when they have dedicated marketing and comms teams.
It's not at all about Cursor using the chance to replace a department with AI, the department doesn't exist in their case.
> Lee's argument for moving to code is that agents can work with code.
So do you think this is a misrepresentation of Lee's argument? Again, I couldn't be bothered to read the original, so I'm relying on this interpretation of the original.
There's no sense in answering your questions when you actively refuse to read the article. You're more susceptible to misunderstand the arguments given your apparent bias on AI-motivated downsizing, which I must reiterate is not covered in the article at all.
Alright you badgered me into reading the original and the linked post does not misinterpret it.
> Previously, we could @cursor and ask it to modify the code and content, but now we introduced a new CMS abstraction in between. Everything became a bit more clunky. We went back to clicking through UI menus versus asking agents to do things for us.
> With AI and coding agents, the cost of an abstraction has never been higher. I asked them: do we really need a CMS? Will people care if they have to use a chatbot to modify content versus a GUI?
> For many teams, the cost of the CMS abstraction is worth it. They need to have a portal where writers or marketers can log in, click a few buttons, and change the content.
> More importantly, the migration has already been worth it. The first day after, I merged a fix to the website from a cloud agent on my phone.
> The cost of abstractions with AI is very high.
The whole argument is about how it's easier to use agents to modify the website without a CMS in the way.
This is an AI company saying "if you buy our product you don't need a CMS" and a CMS company saying "nuh-uh, you still need a CMS".
The most interesting thing here is that the CMS company feels the need to respond to the AI company's argument publicly.
> This is an AI company saying "if you buy our product you don't need a CMS"
No, it isn't. The AI company was explicit about their use case not being a general one:
> "For many teams, the cost of the CMS abstraction is worth it. They need to have a portal where writers or marketers can log in, click a few buttons, and change the content. It’s been like this since the dawn of time (WordPress)."
> Alright you badgered me into reading the original
It's not "badgering" you to point out that your comments are pointless if they're just going to speculate about something you haven't read. But if you feel "badgered", you could just not comment next time, that way no-one will "badger" you.
I don't think that's the argument. The argument is that comms and marketing people don't know git, but now that they can use AI they will be able to use tools they couldn't use before.
Basically, if they ask for a change, can preview it, ask for follow ups if it's not what they wanted, then validate it when it's good, then they don't need a GUI.
And the reason is that those products are (rightly) regulated. Would there be beer marketed to kids if it were legal? Would it be fine if it were the parents' sole responsibility to ensure their kids weren't drinking beer, including at school, at friends' homes where the parents may have different rules, etc., absent a general social consensus that kids shouldn't have beer?
This is anecdotal evidence for the emerging consensus that social media is bad for you and especially for kids. There's a legitimate question whether the people pushing these products know this and don't care or actively suppress evidence.
Tobacco companies famously did this and it caused a lot of harm. It's about that more than just a chance for a cheap shot "hypocrisy" accusation.
I think social media has clear positive and negative aspects. That makes it closer to food than cigarettes in my mind.
We can all immediately conjure up images where food or social media has brought something positive into our lives.
News.yc is something I visit almost every day and it has added value to my life, including introducing me to a few people I’ve met in real life and to interesting tech.
Equally, we can all pretty readily conjure up images where excess food or social media has harmed people.
Indeed, it's still not exactly clear what the right place of social media in society is. Perhaps we could even get rid of some of its pernicious aspects without throwing the baby out with the bathwater.
Even food is not unregulated! And not because too much food is bad for you, but because bad food can harm you.
A disanalogy with food is that there are natural limits to how much food you can/want to eat at one time. Another is that food is necessary for life. Neither is true of social media.
You're absolutely right! Tell me more about how ironic is how the post about having a unique voice is written in one-sentence-paragraph LinkedIn clickbait style.
The idea that there is some significant, load-bearing distinction in meaning between "ethical" and "moral" is something I've encountered a few times in my life.
In every case it has struck me as similar to, say, "split infinitives are ungrammatical": some people who pride themselves on being pedants like to drop it into any conversation where it might be relevant, believing it to be both important and true, when it is in fact neither.
I was hoping to point more towards "don't suppress a viewpoint, rather discuss it" and less toward semantics. I guess I should have illuminated that in my above comment.
I ask my coding agent to do some tedious, extremely well-specified refactor, such as (to give a concrete real life example) changing a commonly used fn to take a locale parameter, because it will soon need to be locale-aware. I am very clear — we are not actually changing any behavior, just the fn signature. In fact, at all call sites, I want it to specify a default locale, because we haven't actually localized anything yet!
Said agent, I know, will spend many minutes (and tokens) finding all the call sites, and then I will still have to either confirm each update or yolo and trust the compiler and tests and the agents ability to deal with their failures. I am ok with this, because while I could do this just fine with vim and my lsp, the LLM agent can do it in about the same amount of time, maybe even a little less, and it's a very straightforward change that's tedious for me, and I'd rather think about or do anything else and just check in occasionally to approve a change.
But my f'ing agent is all like, "I found 67 call sites. This is a pretty substantial change. Maybe we should just commit the signature change with a TODO to update all the call sites, what do you think?"
And in that moment I guess I know why some people say having an LLM is like having a junior engineer who never learns anything.
reply