bfeynman's comments | Hacker News

I feel bad that people have to read this. It's complete puffery, made up for clicks, and the biggest thing is the pure bravado with which a company says, "Hey, let's just waste a ton of money, all for a potential blog and marketing piece." This is not really automated in any fashion. I was dubious at first, but then I saw the screencaps showing the devs interacting with Luna via a Slack workflow with a human in the loop — meaning they're literally just proxying their own behavior through an LLM. This is no different than anyone who consults AI for any decision with context. To get even more technical about the fallacy: this is not automation, as there is data leakage at every step where there is a human in the loop. A broken clock is right twice a day; an LLM could cycle through 100 guesses to pick a number, but don't market that as an oracle. Aside from that, you could just look at the pictures and context (retail in SF) and assume making a profit here would be near impossible. An actual AI CEO would probably have immediately canceled the lease.

> I was dubious at first, but then I saw the screencaps showing the devs interacting with Luna via a Slack workflow with a human in the loop — meaning they're literally just proxying their own behavior through an LLM. This is no different than anyone who consults AI for any decision with context.

A human can be in the loop if the human is exactly executing the orders of the AI. It's still the AI making all the decisions, which is the point of the experiment: the goal was never to see whether agents can handle every interaction necessary to run a business (picking up the phone, placing orders, etc.). That's also why Luna hired humans.


That is ... not correct? This is a classic example of data leakage: the yes/no responses are signals feeding back to the model, influencing (and here, basically guiding) future decisions.

It's not data leakage.

If the experiment is to see how the AI behaves on its own, then of course it needs to know the outcomes of its decisions (either automatically, or fed to it by a human), which of course influence its next decisions. This is providing the AI with retained memory, which is essential to the experiment. It's similar to an AI writing code which it then runs, parsing the logs to see the outcome and make improvements. (It is not _retrained_ on those outcomes, and neither is that the case here; but it can reference them in stored memory.)
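
The loop described above can be sketched in a few lines. This is a hypothetical illustration (the function names and problems are made up, and `run_agent_step` stands in for a real LLM call): outcomes accumulate in a memory the agent can reference on later steps, while the model's weights never change.

```python
def run_agent_step(memory, problem):
    # Stand-in for an LLM call whose prompt includes prior outcomes.
    return {"problem": problem, "context_used": list(memory)}

memory = []     # retained outcomes live in context/storage, not in model weights
decisions = []
for problem in ["price the muffins", "restock the fridge"]:
    decision = run_agent_step(memory, problem)
    decisions.append(decision)
    # The outcome is observed by a tool, or typed in by a human operator.
    outcome = f"outcome of '{problem}'"
    memory.append(outcome)  # the only thing that changes between steps
```

The first step sees an empty memory; every later step sees the accumulated outcomes, which is the "retained memory" point above.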


How is it not analogous to data leakage? The claim is that the system works autonomously, or at minimum could, but there is effectively signal via human-in-the-loop feedback. That's leakage into test-time evaluation. Also, the coding analogy is misapplied, in that the LLM is using its own signals autonomously in the environment. Using a Kalman filter on an ICBM with its own sensors is analogous to the coding agent and is autonomous. A system where a human is course correcting based on signals/sensor data is what's presented here, and that is not autonomous.
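
For reference, the "system corrects itself from its own sensor stream" case in the analogy is what a Kalman-style update does. A minimal 1-D sketch (toy noise parameters, not any real guidance code):

```python
def kalman_1d(measurements, q=1e-3, r=0.25):
    """Track a scalar state from noisy readings; no human relays the data."""
    x, p = 0.0, 1.0          # state estimate and its variance
    for z in measurements:
        p += q               # predict: uncertainty grows between readings
        k = p / (p + r)      # gain: how much to trust the new sensor reading
        x += k * (z - x)     # update the estimate from the sensor itself
        p *= (1 - k)         # uncertainty shrinks after incorporating it
    return x

est = kalman_1d([1.1, 0.9, 1.05, 0.98])  # converges toward the true value ~1.0
```

The filter consumes its own measurement stream autonomously; the argument above is that a human typing in the readings would be a different system.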

> A system where a human is course correcting based on signals/sensor data

The human isn't course correcting; the agent is course correcting based on the feedback data. The human is just inputting the feedback data to the agent in cases where the agent isn't able to access that data itself (because the tooling isn't yet in place for that).

data leakage would be the following:

  - agent makes a prediction for problem A based on training data 
  - feedback from the result is fed back to the agent 
  - agent regenerates a prediction for problem A, incorporating the feedback
but in this case:

  - agent makes a decision on Problem A based on training data 
  - feedback from the decision is fed back to the agent 
  - agent makes a decision for Problem B (not revisiting Problem A), a new problem that is dependent on the outcome of Problem A
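
The two flows above can be contrasted in a toy sketch (hypothetical names; `decide` stands in for the agent). In the leakage-style flow the same problem is re-answered after seeing its own result; in the sequential flow A's outcome only informs a new problem B:

```python
def decide(problem, feedback_seen):
    # Stand-in for the agent; the decision depends on how much feedback it has seen.
    return f"decision({problem}|seen={len(feedback_seen)})"

# Leakage-style: problem A is revisited with its own outcome in hand.
feedback = []
first = decide("A", feedback)
feedback.append("result of A")
redo = decide("A", feedback)   # same problem, regenerated after its own result

# Sequential (the flow described above): A's outcome informs only problem B.
feedback = []
a = decide("A", feedback)
feedback.append("result of A")
b = decide("B", feedback)      # a new problem; A is never re-answered
```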

[flagged]


I appreciated the analysis given by the other commenter, so I'm glad they didn't take that lazy way out.

The submitter appears to be a co-founder of the company the article is about (omitted from the HN account bio), and the article is misleading to the point of lying.

This company now has a strong negative reputation in my mind that I will gladly share with others.


This is Hacker News. It should be filled with curious people who are willing to express their opinions and points of view. To tell someone to just punitively flag something and then "move on" is absurdly reductive and small-minded.

A stopped clock is right twice a day; a broken one can be wrong forever. Just saying.

Isn't what the leading labs are currently chasing not pretraining and massive parameter counts, but enriched, deep fine-tuning and post-training for agentic tasks/coding? MoE with new post-training paradigms lets smaller models perform quite well, and they're much more pragmatic to scale inference with. Given that, this choice seems super odd, as the frontier labs seem to stay neck and neck, and I don't even see Grok being used in any benchmarks because of how poorly it performs.


Nice read, but it falls into a vast reductionist trap: a lot of survivorship bias dressed up as design philosophy or strategic bets. The context of decisions made decades ago != now; people were working under different constraints, etc. Trying to frame the avionics example as "subtractive" innovation is the most egregious: transistors were over 1000x smaller, and weight wasn't even a consideration.


> a lot of survivorship bias dressed up as design philosophy or strategic bets

I wish more people realized this


Robinhood did the exact same thing; it's more for marketing reach and distribution. Wouldn't be surprised if in a few years they let it go or spin it down; they're just paying for a funnel/some narrative control.


Pretty horrifying. I only use it as a lightweight wrapper and will most likely move away from it entirely. Not worth the risk.


Even just having an import statement for it is enough to trigger the malware in 1.82.8.
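
Why a bare import is enough: in Python (and most module systems), all top-level statements in a module execute the first time it is imported, no attribute access required. A minimal demonstration with a harmless stand-in module (this is illustrative; it is not the actual compromised package):

```python
import pathlib
import sys
import tempfile
import textwrap

# Write a fake "package" whose top-level code has a visible side effect.
pkg_dir = pathlib.Path(tempfile.mkdtemp())
(pkg_dir / "innocuous.py").write_text(textwrap.dedent("""
    executed = True   # stand-in for a malicious payload
    print("side effect ran at import time")
"""))
sys.path.insert(0, str(pkg_dir))

import innocuous  # the module body (i.e. the payload) has already run by now
```

Nothing in the importing code ever calls a function from the module; the import statement alone triggers execution.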


Not to mention they are using 3p APIs for everything: Gemini, reranking, etc.


I think I've lost count of how many of these startups I've seen. But what I really can't fathom is the pricing, which is completely out of band. You can already talk to files directly with Gemini; just wrapping other APIs makes no sense. This is even stuff you can now easily codegen entire solutions for, especially object-storage-based ones. I don't see any actual value add or differentiators here. It's obviously not that secure, and the ingestion pipeline/connectors are also commodity.


You're right that you can chat with files using Gemini or a codegen'd RAG pipeline, and that does work well for a lot of teams.

The problem that Captain really addresses comes when production pipelines need to run continuously over large file corpora with fast incremental indexing and reliable latency. The maintenance required in these situations is often quite significant.

Captain focuses specifically on making sure the retrieval layer can operate smoothly so folks don't have to scale & maintain the infrastructure themselves.
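
The incremental-indexing idea mentioned above can be sketched with a content-hash check (a minimal illustration, not Captain's actual implementation; `refresh` and the embedding placeholder are hypothetical): only files whose bytes changed since the last pass get re-embedded, so steady-state runs touch the delta rather than the whole corpus.

```python
import hashlib
import pathlib
import tempfile

index = {}  # path -> (content hash, embedding placeholder)

def refresh(paths):
    """Re-index only files whose content changed; return the ones touched."""
    updated = []
    for p in paths:
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        if index.get(p, (None, None))[0] != digest:
            index[p] = (digest, f"embedding({p.name})")  # stand-in for an embed call
            updated.append(p)
    return updated

root = pathlib.Path(tempfile.mkdtemp())
doc = root / "doc.txt"
doc.write_text("v1")
first = refresh([doc])    # cold start: everything is indexed
second = refresh([doc])   # unchanged: nothing is re-embedded
doc.write_text("v2")
third = refresh([doc])    # only the changed file is touched
```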


For use cases where the increased value ~= 20%, the cost of the distraction with that low of a margin is a hard sell. (Just based on your intro, that was my read.)

No dispute with the core idea; I think you are on the right track, but the pitch isn't compelling. People looking for these kinds of AI solutions tend to favor simplicity, and ~80% is fine, because the overall perceived productivity improvement is 5-10x, with such wide error bars that the approximate gain is just not worth maximizing for right now.

You might be a few months-to-years early, or should target people who have maxed out because they cannot retrieve from their second brain effectively. Most folks I've talked to are just trying to keep up; optimization/efficiency is not on their radar.


Does anyone else find that the way they've written this is full of marketing hubris that definitely misconstrues how most people would interpret it? Codex isn't a "model" that is self-improving; it's GPT writing code for a wrapper program that also uses GPT. Sure, it's a kind of neat loop for development, but why are they anthropomorphizing it so much? People designing chips don't say that computers are self-evolving; even Anthropic just says that Claude (the model) writes most of Claude Code. Heck, you could use Claude or any LLM to write code for Codex.


A lot of puffery here describing constraints and actually messy problems that are most likely all just being thrown into the context for an LLM agent... None of the case studies demonstrate complex scheduling at all; they are all individual serial threads. Buffers, preferences, and options are all simple. The hard part of scheduling is when you have multiple pending invites or invitations that have to be resolved and tracked: if someone asks for a meeting on a day you already have a pending invite for, how far away that day is, how important the relationship is, etc.


The concurrent resolution problem you're describing is exactly what we deal with. When a staffing coordinator has 15 interviews to book across shared interviewers, confirming one cascades into others. We track pending holds, rank by urgency, and when a confirmation on one thread invalidates a proposal on another, Vela detects the conflict and re-proposes.
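
The cascade can be sketched as follows (a hypothetical toy structure, not Vela's actual code): confirming one hold invalidates any other pending proposal that claims the same interviewer slot, and those threads then get re-proposed.

```python
# Pending holds: thread id -> (interviewer, slot). Two threads share a slot.
pending = {
    "thread-1": ("alice", "tue-10am"),
    "thread-2": ("alice", "tue-10am"),   # clashes with thread-1's hold
    "thread-3": ("bob",   "tue-10am"),   # different interviewer, no clash
}

def confirm(thread_id):
    """Book one hold; detect clashing proposals and re-propose them."""
    booked = pending.pop(thread_id)
    invalidated = [t for t, slot in pending.items() if slot == booked]
    for t in invalidated:
        # Stand-in for a real re-proposal step (would pick a fresh free slot).
        pending[t] = ("alice", "wed-2pm")
    return invalidated

clashes = confirm("thread-1")  # thread-2 is invalidated and re-proposed
```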

The only other alternative is a booking link, but that slows down business, doesn't work in many real-life situations, and more :)


Fair feedback that the case studies don't show this well - they're simplified to demonstrate the flow. The multi-party dependency resolution is happening underneath but we could surface that better.

On the LLM point - agreed that context window alone doesn't cut it. The coordination and state management layer sits outside the model. We learned that the hard way early on.


OpenClaw, while cool, just allowed a larger tranche of technophiles who didn't necessarily have all the skills/understanding or time to do a bunch of things that have been readily available for over 1.5 years. There is value in that, but there is a huge surge in the number of people who are now able to take advantage of the novelty. Reminds me of when Hugging Face came out with transformers and all of a sudden you no longer needed to wrestle with Anaconda and the order of installation for all the deps.

