Hacker News | himata4113's comments

My friends and I call it CveLab because there was a time when there was a critical security update every week, or even multiple times a week.

so what they're saying is that Co-Authored-By claude@anthropic.com is overloading their systems?

and that azure cannot scale fast enough to handle the load so they're embracing multi-cloud as a company... owned by microsoft?

woah. what am I reading.


AI is the new DNS when it comes to service failure.

That's not a battery, that's a reusable bomb. Good thing they also figured out how to keep them from having runaway reactions.

It's just a 92kWh battery. There are many cars with 100kWh or more on the market already. And that's only a fraction of the energy stored in an average gas tank (upwards of 500kWh). A combustion car just loses most of that energy to heat from actual explosions. From a physics perspective, a normal car is a much bigger bomb than even the longest range EV.
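The comparison is easy to sanity-check with rough numbers (the 9.5 kWh/L energy density and the 50 L tank size are my own assumptions, not figures from the thread):

```python
# Rough energy comparison: EV battery vs. a tank of gasoline.
GASOLINE_KWH_PER_LITER = 9.5   # ~34 MJ/L, a common approximation
tank_liters = 50               # assumed mid-size fuel tank
battery_kwh = 92               # the battery discussed above

tank_kwh = GASOLINE_KWH_PER_LITER * tank_liters
print(f"gas tank: ~{tank_kwh:.0f} kWh, battery: {battery_kwh} kWh")
print(f"the tank stores ~{tank_kwh / battery_kwh:.1f}x more energy")
```

So even a modest tank holds several times the energy of the longest-range EV pack; the difference is how quickly and violently that energy can be released.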

Batteries are much less powerful bombs than fuel tanks, because they cannot produce nearly as large a volume of gas.

Batteries are dangerous mainly as sources of fire that is difficult to extinguish. For instance, extinguishing with water may actually cause an explosion, via gas produced by the decomposition of water.

Most lithium-based batteries are more dangerous than other batteries not because they are batteries, but because they use an organic electrolyte instead of a water-based electrolyte. So their electrolyte is a fuel, which may explode when the battery catches fire.

However, there is much less electrolyte in a battery than fuel in a fuel tank, so the volume of expanding gas during an explosion is much less.


The technology OpenAI sells is actually not that good for kill bots; we have Boston Dynamics for that. I mean, to be real here, they're already better than human soldiers: deploying 100 of the doggies and letting them run loose could wipe out any fortified group.

Especially if you include things that are not normally acceptable such as suicide bombers, poison gas, etc.

Also, real modern warfare has shown that cheap drones dominate. So unless we get a kill-bot that can withstand explosives while staying lightweight and operable, kill-bots would have to have a kill/death ratio of around 100 to break even (drones sit at a KD of 1.0 or less).
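The break-even intuition is just cost parity (the unit costs below are invented for illustration only):

```python
# Cost-parity sanity check for the "KD of 100" claim.
drone_cost = 500        # assumed cheap FPV drone, USD
killbot_cost = 50_000   # assumed legged kill-bot, USD

# To destroy the same enemy value per dollar spent, the pricier platform
# needs a kill/death ratio higher by the same factor its unit cost is higher.
breakeven_kd = killbot_cost / drone_cost
print(breakeven_kd)  # 100.0
```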


Counterpoint: Killbots are vulnerable to smaller, cheaper bots deployed in defensive positions.

That's why ARC-AGI-3 doesn't allow the use of a harness. The model has to create the harness instead.

Seems completely backwards to me. This is like judging Formula 1 just by the raw power of the engine. The rest of the car has just as much engineering, if not more.

ARC-AGI is testing raw intelligence, like the raw power of a Formula 1 engine. The rest of the car is the harness.

Maybe there is a complex relationship between harness, model and the emergent perceived intelligence we just can't access by isolating the model alone to evaluate "raw intelligence". I don't think it's absurd to imagine a model that by itself wouldn't be that impressive, but would outperform other models given the right harness. It's also not absurd to think of a model that has incredible raw intelligence, but would not scale much with different harnesses. Model performance given different scenarios depend a LOT on dataset and training strategies, so we need to account for these complex relationships, otherwise measuring "raw intelligence" would be the next AI benchmark that is purely for show.

The model is not allowed to create a harness either, I think.

It can, it just has to be within the same 'session'. But it's mostly limited to scratch notes AFAIK, since there's no Python or bash; if there's no way to execute code, there's no real way to build a harness.

I run agents en masse and they've deleted my database at least a dozen times. I just don't really care, since I always run agents on a snapshot basis: agents work on a snapshot of the database that then needs to be reconciled, which often makes the agent realize "wait, that would delete all of the data."

Telling the agents what the (sensitive) action will result in is how you avoid such issues, but you shouldn't be running agents with production data anyway.

But because people will continue to do so, explaining to the agent what the command will do is the way forward.
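That snapshot discipline can be sketched in a few lines (the sqlite setup, file names, and `careless_agent` are illustrative, not from the thread):

```python
import os
import shutil
import sqlite3
import tempfile

def run_agent_on_snapshot(db_path: str, agent_fn) -> str:
    """Hand the agent a throwaway copy of the database; destructive actions
    only touch the snapshot, which is reconciled back (or discarded) later."""
    snap = os.path.join(tempfile.mkdtemp(), "snapshot.db")
    shutil.copy(db_path, snap)  # the agent never sees the real file
    agent_fn(snap)              # may DROP TABLE, DELETE, whatever
    return snap                 # diff/reconcile against db_path afterwards

# Example: a careless agent drops a table, but only on the snapshot.
def careless_agent(path: str) -> None:
    with sqlite3.connect(path) as con:
        con.execute("DROP TABLE IF EXISTS users")
```

The point is that the destructive statement only ever runs against the copy; the real file stays untouched until you deliberately reconcile.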


ask it 10 times.

MASSIVE ADVERSARIAL x50

Doesn't Clippy with unstable features enabled catch most if not all of these cases automatically? This seems like more work to do the same thing Clippy does.

I do see a value in validating constraints, but the examples are either too simple or I'm too dumb.


I have written complex proofs for distributed systems using Verus which certainly can't be expressed with Clippy.

"April 10, 2026"

I don't blame you, it took me a while to find the date.


I already felt that gemini 3 proved what is possible if you train a model for efficiency. If I had to guess the pro and flash variants are 5x to 10x smaller than opus and gpt-5 class models.

They produce a drastically lower amount of tokens to solve a problem, but they don't seem to have put enough effort into refining their reasoning and execution: they produce broken tool calls and generally struggle with 'agentic' tasks. But for raw problem solving without tools or search they match Opus and GPT while presumably being a fraction of the size.

I feel like google will surprise everyone with a model that will be an entire generation beyond SOTA at some point in time once they go from prototyping to making a model that's not a preview model anymore. All models up till now feel like they're just prototypes that were pushed to GA just so they have something to show to investors and to integrate into their suite as a proof of concept.


> If I had to guess the pro and flash variants are 5x to 10x smaller than opus and gpt-5 class models.

I really doubt it, especially for Pro. If anything I wouldn't be surprised if their hardware lets them run bigger models more cheaply and quickly than the others. Pro is probably smaller than GPT 5.4 and Opus 4.6 (looks like 4.7 decreased in size), but 5x seems way too much. IMO Gemini 3 Pro is the most "intelligent" in an all-round human way, especially in the humanities. It's highly knowledgeable and undeniably the number one model at producing natural text in a large number of (human!) languages. The difference becomes especially large for more niche languages. That does not suggest a smaller model; more the opposite. The top 4 models at multilinguality are all Google: 1. 3 Pro, 2. 3 Flash, 3. 2.5 Pro, 4. 2.5 Flash. Even the biggest OpenAI and Anthropic models can't compete in that dimension.

It's definitely weaker at math and much worse at agentic things. The Gemini chat app is also lightyears behind; it's barely different from ChatGPT at release over 3 years ago. These things make it feel much weaker than it is.


Regarding Anthropic: they used to make the best multilingual and generalist models, so it's a policy thing, not a capability issue. Claude 3 was best at this, including dead and low-resource languages. Neither modern Claude nor Gemini is remotely close to what Claude 3 was capable of (e.g. zero-shot writing styles). Anthropic basically reversed their "character training" policy and started optimizing their models for code generation at the cost of everything else, starting with Sonnet 3.5. Claude 4 took a huge hit in multilingual ability.

GPT, on the other hand, was always terrible at languages, except for the short-lived gpt-4.5-preview.

All modern models including Gemini have bugs in basic language coherency - random language switching, self-correction attempts resulting in hallucinations etc. I speculate it's a problem with heavy RL with rewards and policies not optimized for creative writing.


I've never ever had Gemini over the API switch languages in translation tasks and that's across more than 10 language pairs and 6 figures of calls, across both short and long outputs. Maybe your languages are even lower resource ones, though we do include Central Asian languages.

The Chinese models are very prone to it, they love to mix them up.

I've seen it in chat, but IMO that's more of a system prompt/harness issue.

I'll admit I don't remember Claude 3, the oldest data I have seems to be 3.5. And at that time Gemini 1.5 Pro did a much better job across all of our language pairs, it wasn't close.


This always bothers me because models will almost never see text that is mostly English with a little of another language mixed in during training (the opposite happens, of course), and certainly not in RL data. Why do they occasionally language switch?

The benchmarks don’t seem to say that language ability has gotten worse?

That's the thing with benchmarks, without evals and actual hands-on experience they can give you false confidence. Claude now sounds almost clinical, and is unable to speak in different styles as easily. Claude 4+ uses a lot more constructions borrowed from English than Claude 3, especially in Slavic languages where they sound unnatural.

And most modern models eventually glitch out in longer texts, spitting a few garbage tokens in a random language (Telugu, Georgian, Ukrainian, totally unrelated), then continuing in the main language like nothing happened. It's rare but it happens. Samplers do not help with this, you need a second run to spellcheck it. This wasn't a problem in older models; it's a widespread issue that roughly correlates with the introduction of reasoning.

Another new failure mode is self-correction in complicated texts that need reading comprehension: if the model hallucinates an incorrect fact and spots it, it tries to justify or explain it immediately. Which is much more awkward than leaving it incorrect, and those hallucinations are also more common now (maybe because the model learns to make those mistakes together with the correction? I don't know.)

Btw samplers do in fact help with this. Random tokens deep in your output context are due to accumulated sampling errors from using shit samplers like top_p and top_k with temperature.

Use a full distribution aware sampler like p-less decoding, top-H, or top-n sigma, and this goes away
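For context, a minimal sketch of the top-n sigma idea (my own paraphrase with invented parameter names; see the actual paper for the exact formulation):

```python
import numpy as np

def top_n_sigma_sample(logits, n=1.0, temperature=1.0, rng=None):
    """Keep only tokens whose logit lies within n standard deviations of
    the max logit, then sample from the renormalized softmax. Because the
    cutoff is computed on the raw logit distribution, it adapts to how
    peaked or flat the model's prediction is."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    cutoff = logits.max() - n * logits.std()
    masked = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```

Unlike a fixed top_k, the number of surviving tokens varies per step, so low-probability garbage tokens in the long tail are hard-masked rather than merely down-weighted.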

Yes the paper for this will be up for review at NeurIPS this year.


Not disputing this might be true, but this seems like something that should be capturable in a multi-lingual benchmark.

Maybe it's just something that people aren't bothered with?


Basically everyone who experiments with creative writing is keenly aware of that (e.g. roleplayers); it's just that the devs who have experience training models for it (Anthropic, DeepMind) don't bother doing this anymore since there's no money in it.

>this seems like something that should be capturable in a multi-lingual benchmark

Creative writing benchmarks just don't have good objectives to measure against. In particular, valid but inauthentic language constructions can't be captured well if your LLM judge lacks the fidelity to capture them to begin with, which I think is what typically happens.

An easy litmus test would be making a selected character in a story speak Ebonics or Haitian Creole or TikTok. Claude 3 Opus was light years ahead of any model in authenticity in using them, and it was immediately obvious in a side-by-side comparison with any model including Claude 3.5+. Nuances of Polish or Russian profanities/mat or British obscenities are always the hardest for any model (they tend to either swear like dockers or tone it down, lacking the eloquence), but Opus 3 was also ahead in any of those.


There are no real benchmarks of how "natural/idiomatic" output is in a multitude of languages.

"Multilingual benchmarks" are usually something like "How good is it at a multiple choice exam like the SAT in language X". This is a completely unrelated metric.


then there should be such a benchmark :)

3/3.1 Pro appears to have knowledge about eccentric topics with no obvious sources that often turns out to be right.

It does hallucinate a lot though, and is the most affected by context rot in multi-turn conversations.


Agreed on both, especially hallucination. That's what makes their chat app even worse, it's very opaque about web search and sources, so you can't tell whether it's a hallucination.

AI Studio should be their default app.

generally speaking

ultra ~ mythos ~ gpt-4.5 ~ 4x behemoth

pro ~ opus ~ 2x maverick

flash ~ sonnet ~ scout ~ other 20-30b active Chinese models


> They produce a drastically lower amount of tokens to solve a problem, but they don't seem to have put enough effort into refining their reasoning and execution: they produce broken tool calls and generally struggle with 'agentic' tasks. But for raw problem solving without tools or search they match Opus and GPT while presumably being a fraction of the size.

Agreed, Gemini-cli is terrible compared to CC and even Codex.

But Google is clearly prioritizing to have the best AI to augment and/or replace traditional search. That's their bread and butter. They'll be in a far better place to monetize that than anyone else. They've got a 1B+ user lead on anyone - and even adding in all LLMs together, they still probably have more query volume than everyone else put together.

I hope they start prioritizing Gemini-cli, as I think they'd force a lot more competition into the space.


> Agreed, Gemini-cli is terrible compared to CC and even Codex.

Using it with opencode I don't find the actual model to cause worse results with tool calling versus Opus/GPT. This could be a harness problem more than a model problem?

I do prefer the overall results with GPT 5.4, which seems to catch more bugs in reviews that Gemini misses and produce cleaner code overall.

(And no, I can't quantify any of that, just "vibes" based)


I wonder what I am missing, because I can use gemini-cli with English descriptions of features or entire projects and it just cranks out the code. Built a bunch of stuff with it. Can't think of anything it's currently lacking.

>> Can't think of anything it's currently lacking.

Speed? The pro models are slow for me

The 3.1 Pro model is good and I don't recognise the GP's complaint of broken tool calls, but I'm only using it via the Gemini CLI harness; sounds like they might be hosting their own agentic loop?


Same. I've built dozens of small tools and scripts and never felt the need to try something else.

also, for incorporating into gsuite, youtube, maps, gcp and their other winning apps and behind-the-scenes infra...

Not only that, Google has an advantage because they don't need to always generate a response.

When a lot of people ask the same thing they can just index the questions, like results on the search engine, and recalculate them only every so often.


I thought the same for a long time, borderline unusable with loops/bizarre decisions compared to Claude Code and later Codex.

But I picked it up again about a month ago and I have been quite impressed. Haven’t hit any of those frustrating QoL issues yet it was famous for and I’ve been using it a few hours a day.

Maybe it will let me down sooner or later but so far it has been working really well for me and is pretty snappy with the auto model selection.

After cancelling my Claude Pro plan months ago due to Anthropic enshittification I’ve been nervous relying solely on Codex in case they do the same, so I’ve been glad to have it available on my Google One plan.


Google doesn't need to give a shit, because so much of the internet is infested with Google ad trackers and AdWords, and everybody uses Chrome, so they will continue to make billions even without AI. Facebook did the same with their pixel so they could soak up data.

Gemini will be dead in 2 years and there'll be something else, but the ad and search company will remain given that they basically own the world wide web.

Except now, so much of the WWW is filled with AI slop that it breaks the system.


Whichever shitty model they're using for search is so much better than the free offerings from the other companies. It's not even close. It's not going anywhere.

IIRC when Gemini 3 Pro came out it was considered to be just about on par with whatever version of Claude was out then (4?). Now Gemini 3 is looking long in the tooth. Considering how many Chinese models have been released since then, and at least 2 or 3 versions of Claude, it's starting to look like Google is kind of sitting still here. Maybe you're right and they'll surprise us soon with a large step improvement over what they currently have. Note: I do realize that there's been a Gemini 3.1 release, but it didn't seem like a noticeable change from 3.

As other people are saying here: the Gemini models are mostly terrible at tool use and long context management. And maybe not quite as good with finicky "detail" parts of coding generally.

Where they excel is just total holistic _knowledge_ about the world. I don't like "talking" to it, because I kind of hate its tone, but I find Gemini generally extremely useful for research and analysis tasks and looking up information.


People who say Gemini is bad at long contexts are so wrong.

You can put a whole 50,000–70,000 LOC codebase into Gemini 3.1 Pro's context, making it 800,000+ tokens, give it a detailed task, and ask for the whole changed files back, and it will execute it sometimes in one shot, sometimes in two. E.g., depending on the stack you work with, you can show it all the errors at once so it can fix everything in a single reply.

Yes, it will give you back 5-15 files, up to 4000 LOC total, with only the relevant parts changed.

This is a terribly inefficient way to burn $10 of tokens in 20 minutes, but the attention and 1:1 context retention are truly amazing.

PS: At the same time it is bad at tool use, but this has nothing to do with context.


This! And with AI Studio you get a couple of free calls per day (it has gotten less and less). I have had days where I could get 100 USD worth of tokens from AI Studio for free: 1M tokens in and great code out.

You can even turn most of the censorship off in the AI studio (but not the hidden top_k of 64 they force in there).

AI studio is where you go if you want an actually good mostly uncensored model. Gemini 3.1 is fully and somehow still quietly coomer approved.


Gemini had the best long context support for the longest time, and even now at >400k tokens it's still got the best long context recall.

Gemini is just not trained for autonomy/tool use/agentic behavior to the same degree as the other frontier models. Goog seems to emphasize video/images/scientific+world knowledge.


My experience is it advertises large context and then just becomes incoherent and confused as it climbs to fill that context.

e.g. it sucks at general tool use but sucks even more at it after a chunk of time in a session. One frustrating situation is to watch it go into a loop trying and failing to edit source files.

I often wonder how my old coworkers from Google get by, if this is the agentic coding they have available to them for working on projects in google3. But I suspect the models they work with have been fine-tuned on Google's custom tooling and perform better?


Their "preview" naming is pretty arbitrary. It's just their way to avoid making any availability or persistence promises, let alone guarantees. It's also a PR tactic to mask any failures by pretending it's beta quality.

I really wonder what I’m missing with Gemini. It’s a second rate model for me at best. I find it okay (not great) at collecting information and completely useless at agentic tasks. It’s like it’s always drunk. When the Claude credits expire in Antigravity, I’m done for the day.

> They produce a drastically lower amount of tokens to solve a problem

I LOLed at this because of the constant death loops that don't even solve the problem at all.


Yeah, it doesn't even make sense how they got through their benchmarks without death loops. Gemini-cli even has a hotfix to break the model out of such death loops. But if you were to ignore this bug/quirk that will be fixed in the next patch release, my point still stands.

I get much better results with it using a different toolset. Give it Serena and it mostly works, and it is less likely to hit a death loop.

I feel like the gemini-cli app is missing some tools for making sure the session history is actually valid.


> the pro and flash variants are 5x to 10x smaller than opus and gpt-5 class models.

The rumor is that Gemini Pro is the largest model being served today (or at least was prior to Mythos)

Source: some podcast where they were discussing TPU vs Nvidia cluster topologies, and how Google is exploiting their topology to allow this. But I can't remember exactly which podcast, so hopefully someone else will know.


Am I tripping or is this an AI reply? Like it barely has anything to do with the article other than both are related to AI

An AI reply would be more relevant to the headline/article; humans often write something tangential since we have more going on in our heads than just the context at hand, while AI can't ignore context.

Google uses these chips to create gemini, I simply used this as an excuse to rant and predict the future.

> a model that will be an entire generation beyond SOTA

That model would then be SOTA.

Tautologically you can't be better than SOTA


SOTA at that time*

Interesting mix of words: "I felt" -> "proved" -> "guess". One of those is not like the others!

I guess I felt pretty uncertain that day which proved that a lack of sleep is bad for your mental cognition.



Is your friend on the JAX team?

I'm really struggling with terrible bloating today, but I deemed it too dangerous to release.

Thank you for your sacrifice. Could you speak to my dog please? You may wish to yell from a distance, actually.
