Hacker News | polotics's comments

"Surgical "is the kind of wordage that LLMs seem to love to output. I have had to put in my .md file the explicit statement that the word "surgical" should only be used when referring to an actual operation at the block...

You're right, they are tools. That's kind of the point. PAL is a subprocess that runs a Python expression. Z3 is a constraint solver. Regex is regex. Calling them "surgical" is just about when they fire, not what they are. The model generates correctly 90%+ of the time; the guardrails only trigger on the 7 specific patterns we found in the tape. To be clear, the ~8.0 score is the raw model with zero augmentation: no tools, no tricks, just the naive wrapper. The guardrail projections are documented separately. All the code is in the article for anyone who wants to review it.
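For a concrete picture of what "just about when they fire" means, here is a minimal sketch, with all names hypothetical and not from the actual codebase: a regex detects one known failure pattern (a wrong arithmetic claim) in a draft answer and routes only that span to a PAL-style subprocess eval, leaving everything else untouched.

```python
import re
import subprocess
import sys

# Hypothetical guardrail sketch: the regex encodes ONE failure pattern;
# drafts that don't match pass through completely unmodified.
ARITHMETIC = re.compile(
    r"(\d+(?:\.\d+)?)\s*([+\-*/])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)"
)

def pal_eval(expr: str) -> float:
    """Evaluate an arithmetic expression in a fresh Python subprocess, PAL-style."""
    out = subprocess.run(
        [sys.executable, "-c", f"print({expr})"],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

def guardrail(draft: str) -> str:
    """Leave the draft alone unless the known bad pattern fires."""
    m = ARITHMETIC.search(draft)
    if m is None:
        return draft  # the common case: no guardrail triggers
    a, op, b, claimed = m.groups()
    actual = pal_eval(f"{a} {op} {b}")
    if abs(actual - float(claimed)) < 1e-9:
        return draft  # the model's arithmetic was right; do nothing
    return draft.replace(claimed, str(actual), 1)

print(guardrail("So 17 * 23 = 401."))  # wrong claim gets patched by PAL
print(guardrail("no arithmetic here"))  # untouched
```

The point of the structure is that the tool is only in the loop when the pattern matches, which is the sense in which the firing, not the tool, is narrow.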

The core issue is that the LLM is using rhetoric to try to convince or persuade you. That's what you need to tell it not to do.

Which will not work. Don't think of pink genitalia, I mean an elephant...

In this day and age, without serious evidence that the software presented has seen some real usage, or at least has a good reviewable regression test suite, sadly the assumption may be that this is a slopcoded brainwave. The ASCII diagram doesn't help. Also, maybe explain the design more.

Fair. "Does consolidation actually improve recall quality on a running system?" is exactly the benchmark I haven't published, and it's the one that would settle the question.

What I do have right now:

- 1178 core unit tests, including CRDT convergence property tests via proptest (for any sequence of ops, the final state is order-independent)

- Chaos test harness: a Docker'd 3-node cluster with leader-kill / network-partition / kill-9 scenarios (tests/chaos/ in the repo)

- cargo-fuzz targets against the wire protocol and oplog deserializer

- Live usage: running on my 3-node homelab cluster with two real tenants (small: a TV-writing agent and another experiment) for the past few weeks. Caught a real production self-deadlock during this period (v0.5.8), which is what triggered the 42-task hardening sprint.

What I don't have and should: a recall-quality-over-time benchmark. Something like: seed 5,000 memories with known redundancy and contradictions, measure recall precision@10 before and after think(), and publish the curve. That's the evidence you're asking for, and you're right that it's missing. I'll run that and post the numbers in a follow-up.
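The shape of that benchmark is simple enough to sketch. This is a hypothetical outline only, where `store.recall()` and `store.think()` stand in for the real API, not the actual harness:

```python
# Hypothetical sketch of the missing recall-quality benchmark: seed memories
# with known-relevant sets per query, then compare average precision@10
# before and after a consolidation pass.
def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k retrieved items that are actually relevant.
    Divides by the number actually returned when fewer than k come back."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for item in top if item in relevant) / len(top)

def benchmark(store, queries, ground_truth, k=10):
    """Average precision@k over a query set, for one snapshot of the store."""
    scores = [
        precision_at_k(store.recall(q), ground_truth[q], k)
        for q in queries
    ]
    return sum(scores) / len(scores)

# Intended usage (store/API names are placeholders):
# before = benchmark(store, queries, ground_truth)
# store.think()   # run consolidation
# after = benchmark(store, queries, ground_truth)
# publish (before, after) and the curve over repeated think() passes
```

Seeding the 5,000 memories with planted redundancy and contradictions is what makes `ground_truth` knowable in advance, which is the whole trick of the setup.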

Fair point on the ASCII diagram too: the website has proper rendering (yantrikdb.com), but the README should have an SVG.

Appreciate the pushback — this is more useful than encouragement.


I kind of agree with the comment here that a lot of the stuff happening around comes out of an idea without proof that the project has a meaningful result. A compacting-memory bench is not something difficult to pull off, but I'm also having difficulty understanding what the outcome would be on a running system.

I have been using the memory system while building it. I have a central server, and all my workspaces are connected to it via the MCP server. This changed everything for me. But that's me. Now I don't have to repeat things, the agent knows my preferences, can connect different projects I am working on without me asking, and it knows my infra so it can plan the test deployments and such on its own. That is roughly what I was aiming for.

Is there already a name for the effect where grandiose plans somehow appear to be more feasible than the simple, mundane, step-by-step issue resolution that, even though it clearly stands in the way of said grand plans, is not deemed worth investing thought and effort in?

When it comes to software projects, my pet name for it is the "big-bang theory", but in the article's domain that's kind of already taken.


I use the term "30,000 foot view" a lot: https://nanoglobals.com/glossary/30000-foot-view/

It appeals to me because if you've ever taken a flight you can see how the details get progressively erased as you climb. Details that matter for a lot of reasons, even if you can't see them.


It's also called "vision". It's what provides and powers direction at large and long-term scales. Those "simple" and "mundane step-by-step issues" are just chores by themselves, yet at the same time they may become stepping stones in the context of a well-thought-out vision that people buy into and rally behind.


Possible, but unlikely. To organise such a stunt and keep it undetected, you're going to need better consiglieri than what Sam's got, I presume.

Like another commenter wrote... anyone can cast a fireball. Sam has been called a sociopath by many who know him personally. So it seems more likely than it might be otherwise.

It's not "boots on the ground" if it's a rescue mission, I guess.



I think you will like Robert Sapolsky's lectures on YouTube...

AGI is here? A few weeks ago, Yann LeCun once more presented his PoV on how current LLMs fail: https://youtu.be/nqDHPpKha_A?is=sQsO57UWwR8LGZkW

It's in French... so in my own words:

1) Still unreliable at logic and general inference: try and try again seems to be SoTA...

2) Comically bad at pro-activity and taking the right initiative: eg. "You're right to be upset."

3) Most likely already reaching the end of the line in terms of available good training data: looking at the posted article here, I would tend to agree...


The problem is that LeCun was obviously wrong on LLMs before. You have to take what he says with the caveat that he probably talks about these in a purist (academic) way. Most of the "downsides" and "failures" are not really happening in the real world, or if they happen, they're eventually fixed / improved.

~2 years ago he made 3 statements that he considered failures at the time, and he was quite adamant that they were real problems:

1. LLMs can't do math

2. LLMs can't plan

3. (autoregressive) LLMs can't maintain a long session because errors compound as you generate more tokens.

ALL of these were obviously overcome by the industry. Today we have experts in their fields using LLMs for heavy, hard math (Tao, Knuth, etc.); anyone who's used a coding agent can tell you that they can indeed plan, follow that plan, edit it, and generally complete it; and the long-session stuff is again obvious (agentic systems often remain useful at >100k ctx length).

So yeah, I really hope one of Yann, Ilya or Fei-Fei can come up with something better than transformers, but take anything they say with a grain of salt until they do. They often speak about more abstract, academic downsides, not necessarily what we see in practice. And don't dismiss the amount of money and brainpower going into making LLMs useful, even if from an academic PoV it seems like we're bashing a square peg into a round hole. If it fits, it fits...


As a sizable share of the market is going to want to use this for local LLMs, I do not think this is that misleading.

Most people I know are not using TinyGrad for inference, but CUDA or Vulkan (neither of which are provided here).

Ollama got some first-mover advantage at a time when actually building and git-pulling llama.cpp was a bit of a moat. The devs' Docker past probably made them overestimate how much mindshare they could lay claim to. However, no one really could have known how quickly things would evolve... Now I mostly recommend LM Studio to people.

What does unsloth-studio bring on top?


LM Studio has been around longer; I've used it for about three years. I'd also agree it is generally a better beginner choice, then and now.

Unsloth Studio is more featureful (well-integrated tool calling, web search, and code execution being the headline features), and comes from the people consistently making some of the best GGUF quants of all popular models. It is also well documented, easy to set up, and has good fine-tuning support.


LM Studio isn't free/libre/open source software, which misses the point of using open weights and open source LLMs in the first place.


Disagree, there are a lot of reasons to use open source local LLMs that aren't related to free/libre/oss principles. Privacy being a major one.


If you care about privacy, making sure the closed-source software does not call home is a concern...


I run Little Snitch[1] on my Mac, and I haven't seen LM Studio make any calls that I feel like it shouldn't be making.

Point it to a local models folder, and you can firewall the entire app if you feel like it.

Digressing, but the issue with open source software is that most OSS projects don't understand UX. UX requires a strong hand and opinionated decision-making about whether or not something belongs front-and-center, and it's something developers struggle with. The only counterexample I can think of is Blender, and it's a rare exception, sadly not the norm.

LM Studio manages the backend well, hides its complexities, and serves as a good front end for downloading and managing models. Since I download the models to a shared common location, if I don't want to deal with the LM Studio UX, I can easily use the downloaded models with direct llama.cpp, llama-swap and mlx_lm calls.

[1]: https://obdev.at


Hamas was financed, helped to grow, fostered as an entity, and rid of its more middle-of-the-road competitors by two you-know-who parties in this farcical tragedy, parties that comically keep identifying themselves as one another's mortal enemy... glad I never set foot in that madhouse in 20 years and most likely never will again.

