
Similar to bragging about LOC, I have noticed in my own field of computational fluid dynamics that some vibe coders brag about how large or rigorous their test suites are. The problem is that whenever I look more closely at the tests, they are not outstanding and are less rigorous than my own manually created tests. There are often big gaps in vibe-coded tests. I don't care if you have 1 million tests. 1 million easy tests, or 1 million tests that don't cover the right parts of the code, aren't worth much.

Yes, I've found tests are the one thing I need to write. I then also need to be sure to keep 'git diff'ing the tests, to make sure Claude doesn't decide to 'fix' the tests when its code doesn't work.

When I am rigorous about the tests, Claude has done an amazing job implementing some tricky algorithms from some difficult academic papers, saving me time overall, but it does require more babysitting than I would like.


Give Claude a separate user, and make the tests not writable for it. Generally, you should limit Claude to write access only for the specific things it needs to edit; this will also save you tokens, because it will fail faster when it goes off the rails.
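A minimal sketch of this setup on Linux, assuming ACL tools (setfacl) are available; the user name claude-agent and the directory layout are illustrative, not from any particular project:

```shell
# Create a dedicated, unprivileged user for the agent (name illustrative)
sudo useradd --system --create-home claude-agent

# Writable: only the source tree it needs to edit
sudo setfacl -R -m u:claude-agent:rwX src/

# Read-only: the tests, so the agent can run them but not "fix" them
sudo setfacl -R -m u:claude-agent:rX tests/
```

Then run the agent as that user (e.g., via sudo -u claude-agent), and any attempt to rewrite the tests fails immediately.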

You don't even need a separate user if you're on Linux (or WSL): just use the sandbox feature, which lets you specify allowed directories for reading and/or writing.

The sandbox is powered by bubblewrap (used by Flatpaks) so I trust it.


You might want to look into property-based testing, e.g., python-hypothesis if you use that language. It's great, and it even finds minimal counter-examples.
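The property-based style looks roughly like this with Hypothesis (a sketch with a toy run-length-encoding function; nothing here is from a real project): you state an invariant, and the library generates inputs trying to falsify it, shrinking any failure to a minimal counter-example.

```python
from hypothesis import given, strategies as st

def run_length_encode(s: str) -> list:
    """Toy function under test: run-length encode a string."""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def run_length_decode(pairs) -> str:
    return "".join(ch * n for ch, n in pairs)

@given(st.text())
def test_roundtrip(s):
    # Property: decode(encode(s)) == s for any string Hypothesis generates.
    assert run_length_decode(run_length_encode(s)) == s

test_roundtrip()  # runs the property over many generated inputs
```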

The “red/green TDD” approach (i.e., actual TDD) and mutation testing (which LLMs can help with) are good ways to keep those tests under control.

Not gonna help with the test code quality, but at least the tests are going to be relevant.
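Mutation testing can be illustrated in miniature without any tooling (real tools such as mutmut automate this over many mutants; the clamp function and single hand-made mutant here are illustrative): introduce a deliberate bug and check that the suite notices.

```python
def run_tests(ns):
    """A tiny 'suite': True only if every assertion on clamp holds."""
    clamp = ns["clamp"]
    return clamp(5) == 5 and clamp(-3) == 0 and clamp(99) == 10

original = "def clamp(x):\n    return max(0, min(10, x))\n"
mutant = original.replace("min", "max")  # one hand-made mutant

# A good suite passes on the original and "kills" (fails on) the mutant.
for label, src in [("original", original), ("mutant", mutant)]:
    ns = {}
    exec(src, ns)
    print(label, "survives the suite:", run_tests(ns))
```

If a mutant survives, the suite has a gap: no test depended on the behavior the mutation broke.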


If you start with the failing tests, you can give them, plus the spec, to another agent (human or silicon) for review.

It's a bit like pre-registering your study in medicine.


It's a struggle to get LLMs to generate tests that aren't entirely stupid.

Like grepping source code for a string, or assert(1==1, true).

You have to have a curated list of every kind of test not to write or you get hundreds of pointless-at-best tests.


What I've observed in computational fluid dynamics is that LLMs seem to grab common validation cases used often in the literature, regardless of the relevance to the problem at hand. "Lid-driven cavity" cases were used by the two vibe coded simulators I commented on at r/cfd, for instance. I never liked the lid-driven cavity problem because it rarely ever resembles an actual use case. A way better validation case would be an experiment on the same type of problem the user intends to solve. I think the lid-driven cavity problem is often picked in the literature because the geometry is easy to set up, not because it's relevant or particularly challenging. I don't know if this problem is due to vibe coders not actually having a particular use case in mind or LLMs overemphasizing what's common.

LLMs seem to also avoid checking the math of the simulator. In CFD, this is called verification. The comparisons are almost exclusively against experiments (validation), but it's possible for a model to be implemented incorrectly and for calibration of the model to hide that fact. It's common to check the order-of-accuracy of the numerical scheme to test whether it was implemented correctly, but I haven't seen any vibe coders do that. (LLMs definitely know about that procedure as I've asked multiple LLMs about it before. It's not an obscure procedure.)
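For the curious, the order-of-accuracy check works roughly like this: run the scheme on three systematically refined grids and see whether the error shrinks at the formal rate. A sketch using a synthetic second-order error model (the numbers are illustrative, not from a real solver):

```python
import math

def observed_order(f_coarse, f_medium, f_fine, r):
    """Observed order of accuracy from solutions on three grids with a
    constant refinement ratio r (the standard Richardson-style estimate)."""
    return math.log(abs(f_coarse - f_medium) / abs(f_medium - f_fine)) / math.log(r)

# Synthetic solutions with error ~ C*h**2, i.e., a second-order scheme.
exact, C = 1.0, 0.3
f_coarse, f_medium, f_fine = (exact + C * h**2 for h in (0.4, 0.2, 0.1))

p = observed_order(f_coarse, f_medium, f_fine, r=2)
print(p)  # ~2: the observed order matches the formal order
```

If the observed order falls well below the formal order as the grid is refined, something in the implementation (or the boundary treatment) is likely wrong, even if the pictures look plausible.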


Both of these points seem like things it would be easy to instruct an LLM about to shape its testing strategy.

I think so too. In case it was unclear: I don't use LLMs for coding at the moment and was just commenting on what I've seen from others who do in computational fluid dynamics.

Edit: Let me add that while I think it would be easy to instruct an LLM to do what I'd like, LLMs don't do these things by default despite them being recognized as best practices, and I'm not confident in LLMs getting the data or references right for validation tests. My own experience is that LLMs are pretty bad at reproducing citations, and they tend to miss a lot of the literature.


> You have to have a curated list of every kind of test not to write

This should be distilled into a tool: some kind of AST-based code analyzer/linter that fails if it sees stupid test structures.

Just having it in plain English in a HOW-TO-TEST.md file is hit and miss.
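A minimal sketch of what such a checker could look like, using Python's ast module; the single rule here (flagging assertions over constants) is illustrative, not a complete rule set:

```python
import ast

def find_trivial_asserts(source: str) -> list:
    """Line numbers of assert statements whose condition is a constant or
    a comparison between constants -- tests that can never fail."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assert):
            test = node.test
            trivial = isinstance(test, ast.Constant) or (
                isinstance(test, ast.Compare)
                and isinstance(test.left, ast.Constant)
                and all(isinstance(c, ast.Constant) for c in test.comparators)
            )
            if trivial:
                flagged.append(node.lineno)
    return flagged

code = "def test_x():\n    assert True\n    assert 1 == 1\n    assert f() == 4\n"
print(find_trivial_asserts(code))  # [2, 3]
```

Wired into CI as a required check, this fails the build deterministically instead of hoping the model re-reads a markdown file.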


> have a curated list of every kind of test not to write

I've seen a lot of people interact with LLMs like this and I'm skeptical.

It's not how you'd "teach" a human (effectively). Teaching (humans) with positive examples is generally much more effective than with negative examples. You'd show them examples of good tests to write, discuss the properties you want, etc...

I try to interact with LLMs the same way. I certainly wouldn't say I've solved "how to interact with LLMs" but it seems to at least mostly work - though I haven't done any (pseudo-)scientific comparison testing or anything.

I'm curious if anyone else has opinions on what the best approach is here? Especially if backed up by actual data.


It's going to be difficult for anyone to have any more "data" than you already do. It's early days for all of us. It's not like there's anyone with 20 years of 2026 AI coding assistant experience.

However we can say based on the architecture of the LLMs and how they work that if you want them to not do something, you really don't want to mention the thing you don't want them to do at all. Eventually the negation gets smeared away and the thing you don't want them to do becomes something they consider. You want to stay as positive as possible and flood them with what you do want them to do, so they're too busy doing that to even consider what you didn't want them to do. You just plain don't want the thing you don't want in their vector space at all, not even with adjectives hanging on them.


I don't have much data to go on (in accordance with what 'jerf wrote), however I offer a high-level, abstract perspective.

The ideal set of outcomes exists as a tiny subspace of a high-dimensional space of possible solutions. Almost all of those solutions are bad. Giving negative examples removes some specific bits of the possibility space from consideration[0] - not very useful, since almost everything else that remains is bad too. Giving positive examples narrows the search down to where the good solutions are likely to be - drastically more effective.

A more humane intuition[1], something I've observed as a parent and also through introspection: when I tell my kid to do something and they don't understand WTF it is that I want, they'll do something weird and entirely undesirable. If I tell them, "don't do that - and also don't do [some other thing they haven't even thought of yet]", it's not going to improve the outcome; even repeated attempts at correction don't seem effective. In contrast, if I tell (or better, show) them what to do, they usually get the idea quickly, and whatever random experiments/play they invent is more likely to still be helpful.

--

[0] - While paradoxically also highlighting them - it's the "don't think of a pink elephant" phenomenon.

[1] - Yes, I love anthropomorphizing LLMs, because it works.


It's not a person. Unlike a person, it has a tremendous "memory" of everything ever done that its creators could get access to.

If I tell it what to do, I bias it towards doing those things and limit its ability to think of things I didn't think of myself, which is what I want in testing. In separate passes, sure: a pass where I prescribe types and specific tests is effective. But I also want it to think of things I didn't, and a prompt like "write excellent tests that don't break these rules..." is how you get that.


Two things:

1. Tests have always been about the function of the application, but also about communicating what should be occurring to the larger team, or to yourself six months down the road.

With automated software development the communication with the LLM itself is a much larger part of it so I feel like it's "ok" to have lots of easy tests that are less about rigor and more about "yes this is how this should work"

2. Ideally we're going to get to the point where the tooling allows for adversarial agents with one writing code and one writing tests. Even for now just popping open a separate terminal window and generating+running tests in it from your main coding terminal is helpful.


The trick is crafting the minimal number of tests.

It is like reward hacking, where the reward function - in this case, the tests - is exploited to achieve its goals. The model wants to declare victory and be rewarded, so the tests it writes are not critical of the code under test. This behavior is probably in the RL training data, though I am of course merely speculating.

I do CFD in my day job, though not for electronics cooling. I don't think this is as easy as you imagine. It's relatively easy to make pretty pictures, but just because the picture is pretty doesn't mean that it's physically accurate or mathematically correct. Lack of resolution could be an issue, but there are plenty of more subtle problems as well. Jet impingement is known to cause problems with turbulence models, though some models claim to solve the issue. Plus, turbulence modeling isn't always predictive, and might require a certain amount of calibration any time a model is used in a new scenario. Add on top of that the fact that the computational cost of these simulations is often extremely high, even with turbulence models. Maybe people building PCs have plenty of unused CPUs and GPUs, though.

Unfortunately, I don't think CFD and turbulence modeling are things that you can just start doing well without learning a lot before starting.


You are probably right, my only exposure to CFD was through listening in at conferences, haha. It seems neat, though. They always had the coolest pictures.

I wonder, could there be any play in the fact that PC cases tend to be a little less general than just any 3D model? There are only so many cases. Plus, most of the parts are rectangular, and most of the surfaces are aligned to the same set of axes.

Cabling might be a problem.


Location: United States (Open to any US location)

Remote: Yes, open to remote, hybrid, or in office

Willing to relocate: Yes

Technologies: Fortran, Python (Matplotlib, Numpy, Pandas, Scipy), OpenMP, Git/GitHub, Linux, Bash, others...

Résumé/CV: Available on request

Email: 7b8ci3kl@trettel.us

GitHub: https://github.com/btrettel

Personal website: http://trettel.us/

I'm Ben Trettel, an experienced mechanical engineer with a PhD, specializing in computational fluid dynamics, design optimization, and verification & validation of computer simulations.

I am particularly interested in opportunities to build cutting-edge physical products where computational simulation and design optimization are key.


Cool! Are you interested in defense? If so, we're based in Austin, TX and in the current YC batch.


I agree, my immediate reaction was that mechanical engineer is not a trades worker.

I majored in mechanical engineering at college. We had a required programming class. A lot of people like myself already knew how to program before we took the class too. We also had a required electronics class. My experience is that most folks with CS degrees would be surprised by the breadth of what mechanical/aerospace/chemical/etc. engineers learn.


Location: United States (Open to any US location)

Remote: Yes, open to remote, hybrid, or in office

Willing to relocate: Yes

Technologies: Fortran, Python (Matplotlib, Numpy, Pandas, Scipy), OpenMP, Git/GitHub, Linux, Bash, others...

Résumé/CV: Available on request

Email: 7b8ci3kl@trettel.us

GitHub: https://github.com/btrettel

Personal website: http://trettel.us/

I'm Ben Trettel, an experienced mechanical engineer with a PhD, specializing in computational fluid dynamics, design optimization, and verification & validation of computer simulations. Also, I am knowledgeable about patent law from time spent at the USPTO as a patent examiner.

I am particularly interested in opportunities to build cutting-edge physical products where computational simulation and design optimization are key.


Nice work. I made a similar (but much less capable) Python script [1] for my own use before, and I can say that a tool like this is useful for keeping the docs in sync with the code.

My script only detects whether a checksum for a segment of code doesn't match, using directives placed in the code (not a separate file as you've done). For example:

    #tripwire$ begin 094359D3 Update docs section blah if necessary.
    [...]
    #tripwire$ end
Also, my script knows nothing about pull requests and is basically a linter. So it's definitely not as capable.

[1] https://github.com/btrettel/flt/blob/main/py/tripwire.py
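The core of such a check can be sketched in a few lines (this is a simplified illustration of the idea, not the linked script): find each guarded region, recompute its checksum, and warn when it has drifted.

```python
import re
import zlib

# Matches the directive format above: an 8-hex-digit CRC32 and a reminder
# message, guarding the region up to the matching end directive.
DIRECTIVE = re.compile(
    r"#tripwire\$ begin (?P<crc>[0-9A-F]{8}) (?P<msg>[^\n]*)\n"
    r"(?P<body>.*?)#tripwire\$ end",
    re.DOTALL,
)

def check_tripwires(source: str) -> list:
    """Return a warning for each guarded region whose checksum no longer
    matches, reminding you to update the corresponding docs."""
    warnings = []
    for m in DIRECTIVE.finditer(source):
        actual = format(zlib.crc32(m.group("body").encode()), "08X")
        if actual != m.group("crc"):
            warnings.append(f"{m.group('msg')} (expected {m.group('crc')}, got {actual})")
    return warnings
```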

***

Edit: I just checked my notes. I might have got the idea for my script from this earlier Hacker News comment: https://news.ycombinator.com/item?id=25423514


This is really cool, had no idea someone had solved a similar problem this way. The checksum idea is genius!!


Where's the report referred to here? I'm doing Google searches including `site:morganstanley.com` for a bunch of quotes in this article and I can't find any single report that contains all of what's mentioned. I couldn't find anything by browsing their website either. I'm wondering if a lot of this is AI hallucination.


A directory hierarchy works well for me. I've described my setup online before:

https://academia.stackexchange.com/a/173314/31143

https://www.reddit.com/r/datacurator/comments/p75xlu/how_i_o...

I don't read everything I have from start to finish. A lot of this is for future reference.

Since that StackExchange post, I'm now up to about 36.6K PDF files in 4.4K directories, with 14.5K symlinks so I can put files in multiple directories.

I also have a separate version-controlled repo with notes on a bunch of subjects. I'm planning to eventually merge my PDF hierarchy and the notes to have a unified system. It's going to have to be done in stages.


How many GB is your PDF collection? Have you considered sharing it more widely?

I know about Sci-Hub, Anna's Archive, etc., but I'm not so interested in a giant landfill containing all papers ever written. I'm much more interested in a curated collection of useful papers.


The root directory of the archive is 142 GB. It's not only PDFs, but it is mostly PDFs. It includes many things that were never online and some things that were online at one point but are not any longer.

For copyright reasons I cannot share the entire thing as-is. I have plans to share most of the notes in there and bibliographic data for most directories. Doing so would be a major project in itself, as this was never designed for that. I have some information in there I would prefer to keep private that's going to have to be filtered out, and I would prefer to clean some of it up to be in a more "presentable" state.

As for how useful you'd find it, I think that depends entirely on the overlap between my interests and yours.

You might be interested in this project of mine: https://github.com/btrettel/specialized-bibs


> As for how useful you'd find it, I think that depends entirely on the overlap between my interests and yours.

If that specialized-bibs repo is any indication, there seems to be reasonable overlap.

> For copyright reasons I can not share the entire thing as-is.

Of course. But if you'd like to store a non-encrypted backup copy on my system, I would be happy to offer my data storage services free of charge.

Alternatively: I'm training an LLM and it's transformative fair use.

My email is in my profile.

> I have some information I would prefer to keep private in there that's going to have to be filtered out, and I would prefer to clean some of it up to be in a more "presentable" state.

Totally understandable. If you ever get it into an acceptable state, please shoot me an email and I'll be happy to help out logistically.


That’s an impressive and thoughtfully structured system, especially at that scale. The use of symlinks and a separate version-controlled notes repository makes a lot of sense for long-term archival.

I’m curious — when working with such a large collection, how do you typically rediscover material or connect related ideas across different parts of the hierarchy? Do you rely primarily on directory structure, full-text search, or your notes as the main index?

And as you move toward merging the PDFs and notes into a unified system, do you see the notes becoming the central navigation layer, or will the directory structure remain primary?


It's mostly navigating the PDF directories or notes repository, full-text search of my notes, or (less frequently) searching Zotero for bibliographic data. I don't use tagging for this and I'll address full-text search of the documents in a bit. I can't say that either direct navigation or text search of the notes is dominant as I do a lot of both. Having multiple ways to find information is good for redundancy as if one way fails, you can try another. So I don't think the balanced approach I have will change in the future.

For navigating the directories, I have a Python script called cdref that will search the directory names, which has proved to be very useful. If there's one match, it'll go directly to that directory, and if there are multiple, a TUI will pop up and allow me to select the directory I want.

I haven't found full-text search of the documents themselves to be particularly useful because terminology varies, frequently what I'm looking for isn't in the text (could be a figure, for instance), and probably thousands of my documents haven't been OCRed. I think that relying too heavily on full-text search of the documents assumes that other people will organize information in a way useful to me, which isn't realistic [1]. Full-text search of the documents is a part of my system, still, but it's mostly used to find things to put in the directories or notes so that I can easily find the documents again without having to remember the right keywords. (Though I also often keep track of useful keywords.)

Often I won't remember where I keep some things or even if I have a directory or note on something at all. So I might accidentally create a redundant directory or note. But frequently I later realize that and use it as an opportunity to increase the connectivity of my directories and notes through symlinks. Then if I go to the "wrong" place, a symlink will send me where I should go. And if something pops into my head as related, I add a symlink or a note in the README file for a particular directory. (The README files in the directories are separate from the version controlled notes but will eventually merge, as I indicated.) Over the years, I've accumulated a lot of connections like this.

With all of this said, I think the important thing is to find a system that works for you that you can slowly scale over time. It doesn't need to look like my system. I've iteratively developed a system that works for me over 10+ years at this point. The scale is easy if you have a system you contribute a bit to on a regular basis over a long period of time.

[1] I've also been looking into building a large local bibliographic database, in part as an alternative to online scientific search engines like Google Scholar, because I don't want to assume such services will always be available.


I'm reminded of something I read recently about disclosure of AI use in scientific papers [1]:

> Authors should be asked to indicate categories of AI use (e.g., literature discovery, data analysis, code generation, language editing), not narrate workflows or share prompts. This standardization reduces ambiguity, minimizes burden, and creates consistent signals for editors without inviting overinterpretation. Crucially, such declarations should be routine and neutral, not framed as exceptional or suspicious.

I think that sharing at least some of the prompts is a reasonable thing to do/require. I log every prompt I make to an LLM. Still, I think this is a discussion worth having.

[1] https://scholarlykitchen.sspnet.org/2026/02/03/why-authors-a...


This is totally infeasible.

If I have a vibe coded project with 175k lines of python, there would be genuinely thousands and thousands of prompts to hundreds of agents, some fed into one another.

What's the worth of digging through that? What do you learn? How would you know that I shared all of them?


> I log every prompt I make to an LLM.

How many do you have in the log total?


I have a daily journal where I put every online post I make. I include anything I send to an LLM on my own time in there. (I have a separate log on my work computer, though I don't log my work prompts.) Likely I miss a few posts/prompts, but this should capture the vast majority.

A few caveats: I'm not a heavy LLM user (this is probably what you're getting at) and the following is a low estimate. Often, I'll save the URL only for the first prompt and just put all subsequent prompts under that one URL.

Anyhow, running a simple grep command suggests that I have at least 82 prompts saved.

In my view, it would be better to organize saved prompts by project. This system was not set up with prompt disclosure in mind, so getting prompts for any particular project would be annoying. The point is more to keep track of what I'm thinking of at a point in time.

Right now, I don't think there are tools to properly "share the prompts" at the scale you mentioned in your other comment, but I think we will have those tools in the future. This is a real and tractable problem.

> Whats the worth of digging through that? What do you learn? How would you know that I shared all of them?

The same questions could be asked for the source code of any large scale project. The answers to the first two are going to depend on the project. I've learned quite a bit from looking at source code, personally, and I'm sure I could learn a lot from looking at prompts. As for the third question, there's no guarantee.


I can't reply to the other comment, but here goes:

This is one (1) conversation: https://chatgpt.com/share/69991d7e-87fc-8002-8c0e-2b38ed6673...

It has 9 "prompts". On just the issue of path re-writing, that's probably one of a dozen conversations, NOT INCLUDING prompts fed into an LLM that existed just to strip spaces and newlines caused by copying things out of a TUI.

It's ok for things to be different than they used to be. It's ok for "prompts" to have been a meaningful unit of analysis 2 years ago but pointless today.


No the same question CANNOT be asked of source code because it can execute.

You might as well ask for a record of the conversations between two engineers while code was being written. That's what the chat is. I have a pre-pre-alpha project which already has potentially hundreds of "prompts" - really, turns in continuing conversations. Some of them with one kind of embedded agent, some with another. Some with agents on the web with no project access.

Sometimes I would have conversations about plans that I drop. Do I include those if no code came out of them, but my perspective changed, or the agent's context changed so that later work was possible?

I don't mean to be dismissive, but maybe you don't have the necessary perspective to understand what you're asking for.


> maybe you don't have the necessary perspective to understand what you're asking for

Please don't cross into personal attack. You're making fine points, and that's enough.

https://news.ycombinator.com/newsguidelines.html

Btw, I think this is a particularly good point: "You might as well ask for a record of the conversations between two engineers while code was being written. That's what the chat is."

That's a good reframing. I can see why it might be impractical to share all of that, hard to make sense of as a reader, and too onerous to demand of submitters.

Since you have experience in this area, I'd like to hear your view on what we could reasonably require submitters to share, given that the flood of generated Github repos is creating a lot of low-quality submissions that don't gratify curiosity and thus don't fit the spirit of either Show HN or HN in general.

Some people would say "just ban them", but I'd rather find a way to adapt to this wave, since it is the largest technical development in a long time, and the price of opposing it is obsolescence.


"maybe you don't have the necessary perspective to understand what you're asking for"

This is in no way a personal attack. It's just a statement that's true. I didn't imply anything about them or their character or limitations, but they might not have the necessary perspective if that's the question they are asking.

I think it's critically important people figure out what they want to learn from what's being shared.

What do you need from submitters here? Even setting aside the burden of supplying it, what do you hope to learn?


> I think it's critically important people figure out what they want to learn from what's being shared.

> What do you need from submitters here? Even setting aside the burden of supplying it, what do you hope to learn?

I appreciate your comments on this - they are the most interesting responses I've seen so far about this question (so I hope the meta stuff doesn't get too much in the way).

The hope is to make the submissions of AI-generated Show HNs more interesting than they are when someone submits just a repo with generated code and a generated README.

The question is what could, at least in principle, be supplied that could have this desired effect.


(I thought I'd fork my reply to keep the meta stuff separate from the interesting stuff)

I believe you that it wasn't your intention, but when you address someone in the second person while commenting negatively on their perspective and understanding, it's going to land with a lot of readers (as it did with me) as personally pejorative. It's common for commenters (me too of course) not to perceive the provocations in their own posts, while being extra sensitive to the provocations in others' posts. If the skew is 10x both ways, that's quite a combination. It's necessary to remember and compensate for the skew, a la "objects in the mirror are closer than they appear".

Edit: total coincidence but I just noticed https://news.ycombinator.com/item?id=47115097 and made a similar reply there. I thought you might find this amusing, as I did.


> maybe you don't have the necessary perspective to understand what you're asking for.

I disagree. Thinking about this more, I can give an example from my time working as a patent examiner at the USPTO. We were required to include detailed search logs, which were primarily autogenerated using the USPTO's internal search tools. Basically, every query I made was listed. Often this was hundreds of queries for a particular application. You could also add manual entries. Looking at other examiners' search logs was absolutely useful to learn good queries, and I believe primary examiners checked the search logs to evaluate the quality of the search before posting office actions (primary examiners had to review the work of junior examiners like myself). With the right tools, this is useful and not burdensome, I think. Like prompts, this doesn't include the full story (the search results are obviously important too but excluded from the logs), but that doesn't stop the search logs from being useful.

> You might as well ask for a record of the conversations between two engineers while code was being written.

No, that's not typically logged, so it would be very burdensome. LLM prompts and responses, if not automatically logged, can easily be automatically logged.


> LLM prompts and responses, if not automatically logged, can easily be automatically logged.

What will you do with what you’ve logged? Where is “the prompt” when the chat is a chat? What prompt “made” the software?

If you’re assuming that it is prompt > generation > release, that’s not a correct model at all. The model is *much* closer to conversations between engineers which you’ve indicated would be burdensome to log and noisy to review.


> What will you do with what you’ve logged?

Could be a wide variety of things. I'd be interested in how rigorously a piece of software was developed, or whether I can learn any prompting tricks.

> Where is “the prompt” when the chat is a chat?

> The model is much closer to conversations between engineers which you’ve indicated would be burdensome to log and noisy to review.

I disagree. Yes, prompts build on responses to past prompts, and prompts alone are not the full story. But exactly the same thing is true at the USPTO if you replace "prompts" with "search queries" and no one is claiming that their autogenerated search logs are burdensome.

Also, the burden in actual conversations would come from the fact that such conversations are often not recorded in the first place. And now that I think about it, some organizations do record many meetings, so it might be easier than I'm thinking.

> What prompt “made” the software?

All of them.


This was interesting as I face a lot of issues maintaining my own notes accumulated over the past 15 or so years. The approach discussed might work great for the OP, but I'm skeptical this would work for me. I've found a lot of value in the "boring" maintenance tasks. Thinking about where to place something has caused me to make exactly the sort of connections the OP wrote about wanting to find. I work with a combination of a directory hierarchy, text search, and links (symlinks, URLs, bibliographic citations) which serve the same purpose as the tags and links the OP discussed. Links are how I express a lot of connections, in fact. So I don't see the organizing as some sort of non-core operation that's "labor" and not "thinking". For me, it's both.

