You can meaningfully test whether one slot machine hits the jackpot more often than another; it's just that the methodology should involve a large number of repeats rather than a few anecdotes. Some LLM leaderboard sites do this with blind comparisons.
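A minimal sketch of what "a large number of repeats" buys you: with enough pulls you can run an ordinary two-proportion z-test on the win rates. The counts here are made-up illustration data, not real slot-machine or leaderboard numbers.

```python
# Hypothetical example: comparing two "slot machines" (or two models in a
# blind comparison) by win rate over many trials, via a two-proportion z-test.
from math import erf, sqrt

def two_proportion_z(wins_a, n_a, wins_b, n_b):
    """Return (z, two-sided p-value) for H0: both win rates are equal."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    p_pool = (wins_a + wins_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail of the standard normal, using erf for the CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# 1000 pulls each: machine A pays out 60 times, machine B only 40 times.
z, p = two_proportion_z(60, 1000, 40, 1000)
print(z, p)
```

With a few anecdotes (say 10 pulls each) the same 6% vs 4% gap would be statistically indistinguishable; at 1000 pulls each it clears the usual 0.05 threshold.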
I'd imagine it's not that they lacked the time to email linux-distros, but that they were unaware they were supposed to do so.
Feels like the more sensible process would be for kernel maintainers to announce when a version contains a fix for a high-impact security vulnerability and for distro maintainers to pay attention to that. Could be done without revealing what the vulnerability actually is in most cases, trusting the kernel maintainer's judgement. There does seem to be a public linux-cve-announce mailing list.
> Could be done without revealing what the vulnerability actually is in most cases
No it can’t. The bad actors that should actually worry most people are actively combing through commits on mainstream codebases, using a combination of automation/AI and manual review to pluck vulns out by their remediations.
The patch itself can be made to look fairly innocuous, as was done here. That won't always prevent bad actors from finding the vulnerability, but it seems better to at least not increase that risk unnecessarily.
I’m saying it doesn’t matter how innocuous you try to make the patch when there are known bad actors directly evaluating every commit for “so did this close a vuln”, using both AI and human expertise.
This is true no matter what, but the comment I’m replying to was also pitching that maintainers actively call out that the patch includes a high sev security fix:
> for kernel maintainers to announce when a version contains a fix for a high-impact security vulnerability
I'm suggesting that less information about the vulnerability could be circulated than under the current process, not more, because distro maintainers can trust a bare "version X contains a fix for a high-impact security vulnerability" coming from a kernel maintainer, whereas they'd need some information/proof of that claim if it came from an outsider.
The information exposed in the current process was: code changes in the git commits and a commit message that did not mention the vulnerability.
In the current model, attackers are actively looking at all commits as potential vulnerabilities, regardless of what anybody says or doesn’t say about them.
You can’t make the commits not exist, or not be visible, because that’s a core part of how the kernel is developed and released.
So anything you do with notifications to distro maintainers about the vuln, or the existence of a vuln, or a nudge to patch with no context, or whatever, is totally irrelevant and does not change the calculus: the moment the fix is committed, bad actors who were not already aware notice it.
This is, of course, to say nothing of bad actors who had already found the vulnerability on their own.
From the paper, they're using a structured JSON schema mode as opposed to freeform answers, so it can't. Models do typically caveat their answers for questions like this, in my experience.
They'll qualify their answers in English but as the article mentions, if your prompt asks for a confidence score, that "uncertainty" doesn't translate into low numerical confidence.
Quantifying their own confidence is also something they're not good at, and the format would prevent them from refusing to do so, or from prefacing it with a caveat, if that's what you'd want of them. It doesn't help that the response format seems backwards: it asks for confidence, then the carbs estimate, then observations/notes, rather than letting the model base the carbs estimate on the observations/notes and then the confidence estimate on both of those.
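To make the ordering point concrete, here's a hypothetical sketch of the two schemas (field names are mine, not the paper's). Since generation is autoregressive, left to right, a field emitted earlier can't be conditioned on fields emitted later:

```python
# Hypothetical schemas illustrating why field order matters for an
# autoregressive model: earlier fields cannot depend on later ones.
import json

# Roughly the ordering described: confidence is generated before the
# estimate and the observations it ought to be based on.
schema_as_described = {
    "type": "object",
    "properties": {
        "confidence": {"type": "number"},
        "carbs_estimate_g": {"type": "number"},
        "notes": {"type": "string"},
    },
    "required": ["confidence", "carbs_estimate_g", "notes"],
}

# Reversed: observations first, then an estimate based on them, then a
# confidence that can take both into account.
schema_reordered = {
    "type": "object",
    "properties": {
        "notes": {"type": "string"},
        "carbs_estimate_g": {"type": "number"},
        "confidence": {"type": "number"},
    },
    "required": ["notes", "carbs_estimate_g", "confidence"],
}

print(json.dumps(schema_reordered, indent=2))
```

Whether a given provider's structured-output mode actually emits fields in schema order is implementation-dependent, so treat this as the intent, not a guarantee.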
> They'll qualify their answers in English but [...]
That the default user-facing chat, as a normal user would use it, gives a warning is the key part IMO. I don't think the expectation that there's no "wrong way" to use the model necessarily extends to API usage with a long custom system prompt and a restricted output format.
I'd go further than that and say for me personally, the fact it's just a file is a selling point, not a "good enough" concession. I can just put passwords.kdbx alongside my notes.txt and other files (originally on a thumbdrive, now on my FTP server) - no additional setup required.
There will be people who use multiple devices but don't already have a good way to access files across them, but even then I'm not fully convinced that SaaS specifically for syncing [notes/passwords/photos/...] really is the most convenient option for them, as opposed to just being a well-marketed local maximum. Easy to add one more subscription, easy to suck it up when terms changes forbid you from syncing your laptop, easy to pray you're not affected by recurring breaches, ... but I'd suspect it often (not always) adds up to more hassle overall.
> presumably this compromise was only found out because a lot of people did update
This was supposedly discovered by "Socket researchers", and the product they're selling is proactive scanning to detect/block malicious packages, so I'd assume this would've been discovered even if no regular users had updated.
But I'd claim even for malware that's only discovered due to normal users updating, it'd generally be better to reduce the number of people affected with a slow roll-out (which should happen somewhat naturally if everyone sets, or doesn't set, their cooldown based on their own risk tolerance/threat model) rather than everyone jumping onto the malicious package at once and having way more people compromised than was necessary for discovery of the malware.
The cooldown is a defence against malicious actors compromising the release infrastructure.
Having the forge control it half-defeats the point; the attackers who gained permission to push a malicious release might well have also gained permission to mark it as "urgent security hotfix, install immediately, 0 cooldown".
I have not heard anyone seriously discuss that cooldown prevents compromise of the forge itself. It’s a concern but not the pressing concern today.
And no, however compromised packages end up on the forge, that is not the same thing as marking one "urgent security hotfix", which would require manual approval from the forge maintainers, not an automated process. The only automated processes would be a blackout period, where automated scanners try to find issues, and a cool-off period, where the release is rolled out progressively to 100% of all projects that depend on it over the course of a few days or a week.
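A minimal sketch of the staged rollout just described: a fixed blackout window followed by a linear ramp to 100% of dependents. All the numbers are illustrative, not any registry's actual policy.

```python
# Hypothetical staged-rollout curve: no dependents see the release during
# the blackout, then the visible fraction ramps linearly to 100%.
def rollout_fraction(hours_since_release, blackout_hours=24, ramp_hours=7 * 24):
    """Fraction of dependent projects (0.0..1.0) offered the new release."""
    if hours_since_release < blackout_hours:
        return 0.0  # blackout: only automated scanners examine the release
    elapsed = hours_since_release - blackout_hours
    return min(1.0, elapsed / ramp_hours)

for h in (0, 24, 108, 192, 400):
    print(h, rollout_fraction(h))
```

The point of the ramp is that if malware is caught partway through (by scanners or by the unlucky early fraction of users), most dependents never installed it at all.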
By "release infrastructure" I didn't mean gain admin access to github.com, I meant gaining the credentials to push out a release of that particular package.
To my understanding, Office (or "Microsoft 365") itself becoming "Copilot" was just confused messaging about the "Office Hub" app/shortcut being repurposed.
The article quote was being given as the supposed source for "Claude Code also found one thousand false positive bugs, which developers spent three months to rule out", so should substantiate that claim - which it doesn't.
If the claim was instead just "a good portion of the hundreds more potential bugs it found might be false positives", then sure.
I think Meta's position as a large company under (rightfully) a lot of media scrutiny fundamentally prevents it from creating a successful "metaverse". It'll be pushed towards being overly corporate/sanitized and centrally controlled to meet expectations of managing misinformation, player safety, etc., as opposed to the less restricted conditions that produced the web. Smaller companies (like VRChat) or individual hobbyists can get away with more, and generally have less cynical motivations.
Microsoft was in the middle of the biggest antitrust case in history (both in the US and the EU) and successfully launched the Xbox in that time. They had Halo and local multiplayer up to 8 players across 2 connected consoles requiring no internet. Meta didn't have anything besides a naked desire to pursue the end (monetize the user) before the means (a product people wanted).
If the idea is that laws must be motivated by a negative occurrence rather than be preemptive, then that'd follow, yeah (if counting job loss as a reason to ban something, which I think is questionable). But note akersten is saying that it's normal for laws to be preemptive in both cases.
> The commercial bots seamlessly traverse between AI, auto-respond and human. It's very much an ensemble method.
This seems unlikely to me, given it'd increase costs and the response times would make it obvious.
The messages presented in the original source appear to be people expecting to be talking to a real person, likely on a dating app. The relation to AI is only speculative, and mostly in the direction of "my messages may be used to train a chatbot to replace my job of deceiving people" - which is plausible.
> That's why people pay for it over just downloading an abliterated model from hf with system prompt hacking.
I'd assume convenience, fine-tuning, and using a larger model than it's feasible for most people to run locally.