If you have a test that fails 50% of the time, is that test valuable or not? A 50% failure rate alone looks like a coin toss, but by itself it does not tell us whether the test is noise or whether it separates bad states from good ones. For a test to be useful it needs a positive Youden's statistic (https://en.wikipedia.org/wiki/Youden%27s_J_statistic): sensitivity + specificity - 1. A 50% failure rate alone does not let us calculate sensitivity and specificity.
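To make this concrete, here is a minimal sketch (the function name and the counts are my own illustration, not from the article) showing that two tests with the same 50% overall failure rate can have completely different Youden's J:

```python
def youdens_j(tp, fn, tn, fp):
    """Youden's J = sensitivity + specificity - 1, from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate: fraction of bad states caught
    specificity = tn / (tn + fp)  # true negative rate: fraction of good states passed
    return sensitivity + specificity - 1

# Two tests, each flagging ("failing") 50 out of 100 runs overall:
# one fails exactly on the bad states...
print(youdens_j(tp=50, fn=0, tn=50, fp=0))   # 1.0: perfectly informative
# ...the other is a coin toss that ignores the state entirely.
print(youdens_j(tp=25, fn=25, tn=25, fp=25)) # 0.0: no discriminative power
```

Both tests fire half the time, but only the first one tells you anything — which is why the failure rate alone cannot settle the question.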
I see a similar problem with this article: the author notices that LLMs produce a lot of errors, then concludes that they are useless and produce only a simulacrum of work. The author makes an interesting observation about how LLMs disrupt the way we judge knowledge work. But when he concludes that LLMs do only a simulacrum of work, that is where his argument fails.
Gee, a thing by a guy, with a name. What are you saying exactly? So the test in question is a test the LLM is asked to carry out, right? Then your point is that if it's a load of vacuous flannel 49% of the time, but meaningful 51% of the time, on average this is genuine work so we can't complain about the 49%?
Wait, you're probably talking about the test of discarding a report based on something superficial like spelling errors. Which fails with LLMs due to their basic conman personalities and smooth talking. And therefore ..?
> For a test to be useful it needs to have positive Youden’s statistic
This is not true as stated. I'd try to gloss over the absolutes relative to the context, but if I'm totally honest, I'm not sure I understand what idea you're trying to communicate.
I don't know - it looks like an interesting idea - but... I am struggling to put this politely. When I go into the repo and find out that it does stuff like lip syncing of talking avatars, I start to wonder what percentage of the development effort goes into marketing.
The idea is for non-tech people to relate to agents through a human-style interaction - that part is actually only a relatively small piece of the system, but it brings it to life for people.
It's a way to encapsulate the personality and expertise - at least that's the idea :)
This is the third LLM wiki on the front page in 24 hours!
Obviously it is a hot topic. I have my own horse in that race - so I might not be objective - but I've compiled a wishlist for these systems: https://zby.github.io/commonplace/notes/designing-agent-memo...
I wish there was a chance for collaboration - everybody coding their own system seems like a lot of effort duplication.
Your notes look really interesting, thanks. I'm curious - from the prose style it's clear they were written by an LLM. For design notes like this, do you have a sort of mental TODO to go back and write them up in your own words, to make sure they really capture your own opinions?
Overall the knowledgebase is a mixture of these. I have this disclaimer on the first page:
This KB is itself agent-operated: a human directs the inquiry, AI agents draft, connect, and maintain the notes. The framework for building knowledge bases is documented using that framework.
I hope it is enough - I've seen many people get angry about people publishing LLM-generated work.
love the "Borrowable Ideas" section. would definitely suggest borrowing them.
full disclosure: we started as a context infra company (nex.ai) long, long before Karpathy even came up with the LLM wiki idea, and have barely exposed any of that stuff to WUPHF, but we are starting to open some of it up now. glad to see the concerns in the comparison are things that our context infra already built for.
still, happy to collab & share learnings, and of course avoid duplication.
I mean honestly this stuff is in roll-your-own territory now. Run QMD on an Obsidian vault and that's like 80% of the way there, and you can probably do that in < 2 hours.
This report lists failures of some AI systems. They look consequential - but the company does not seem to care. This is very strange - how can that be? I really like AI products, they help me all the time - but I know I need to take their failure modes into account and be careful. Lots of organisations don't seem to do that calculation. Will competition root them out? I don't know - I am very enthusiastic about AI - but ever since the LangChain situation I can see that what gets adopted is often something with a lot of flaws. The more careful developers, who notice the flaws and try to find proper solutions, fail because it takes time to do the design well. This is not a new thing - there were Betamax mourners for decades - but it seems that the hype machine is now more and more powerful.
What I meant was how LangChain dominated the LLM frameworks scene because it was loaded with VC money. That was just the beginning - things have normalised now - but I believe it did a lot of damage at that early stage by sucking up all the oxygen.
I wish the scene were more collaborative, instead of everyone writing their own. But I guess this is the LLM curse - it is too easy to start. I am afraid it will all go in the LangChain direction, with VC money funding designs that are not yet ready and solidifying choices that would normally be superseded.
I am open to changing these instructions - it cannot be just about making your system look better - but I'll try to incorporate genuine ideas on how to improve these reviews.
> Everybody is building their own llm-wiki systems these days
there is a dark side to this. my coworker is insistent that his variant of this is going to become the team's backbone and i can't get him to stop even when i showed him a page of beyond-wrong answers. he straight up doesn't understand that having a knowledge base != claude now sees all of it at once and can consider the endless breadth and shades of gray that make up human decisions. he's 100% convinced that claude grepping through the files is foolproof and won't miss any details lol
i personally just stopped messing with grand knowledge base ideas and these techs. i think everyone's shooting a lil too high and can't fully define what exactly they're after.
so i stepped back and i keep claude there for a very black and white need. claude's there to speed up stuff in domains i know well so i can guardrail him with massive success and have him code pieces im simply too lazy to code myself or alley oop something im struggling with. in a tortoise and the hare parable kind of way im the only guy here who isnt getting huge gotcha holes from AI in the solutions im delivering. all polished with the same attention to detail ive always had. i've just found these grand wiki everything ideas are just not yielding what people think they're yielding. for whatever reason i'm still the meatware layer thats a better index in the end if ive done my homework. perhaps something is lost when we cede a huge chunk of our journeys to seek information. i've still yet to be impressed by any "claude tied all these things together and found this insight this is insane" moments, every single time ive pointed out that any of that could've been a report.
Hey zby, if you're collecting these, Hjarni (hjarni.com) would fit your source-only tier alongside Fintool and Supermemory. Hosted SaaS with MCP built in, hierarchical LLM instructions (global/team/container/note), and a shared-note protocol for Claude/ChatGPT multi-agent workflows. Happy to write up a page in whatever shape you want.
The wishlist doc you linked is good, would be up for collaborating on that.
This is a really cool list and repository of ideas. It seems like the focus of the work is on making knowledge legible to AI. I wonder if you (or others) have done a similar level of thinking about the inverse – making AI more legible to humans?