If I'm understanding this right, this presupposes that the models were pre-trained on unfiltered data like the "floor" models, so when comparing the "retail" and uncensored models they obviously won't match the floor because they weren't trained on the same data in the first place.
To me it stands to reason that a model that has only seen a limited amount of smut, hate speech, etc. can't just start writing that stuff at the same level just because it no longer refuses to do it.
The reason uncensored models are popular is that they treat the user as an adult; nobody wants to ask the model a question and have it refuse because it deemed the situation too dangerous or whatever. For example, if you're using a Gemma model on a plane or somewhere without internet and ask for medical advice, it refuses to answer because it insists you seek professional medical assistance.
> speculative decoding which, generally speaking, is not the same quality as serving the model without it.
I've never heard of ANY speculative decoding that wasn't lossless. If it were lossy it'd be called something else.
This page is just a port of DFLASH to GGUF format. It only implements greedy decoding, like you said, so the outputs will be inferior, but not inferior to greedy decoding on the original model. Though that's just a matter of implementing temperature, top_k, etc.
Same reason why prompt processing is faster than text generation.
When you already know the tokens ahead of time, you can calculate the probabilities of all of them in one batched pass, which amortizes the weight reads and saves significant memory bandwidth. This won't work if you're already compute bound, so people with Macs etc. won't get as much benefit from this.
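To make the mechanism concrete, here's a toy sketch of greedy speculative decoding (my assumption of how the DFLASH/gguf port works, not its actual code): a cheap draft model proposes k tokens, the target model checks them all in one batched pass, and you keep the longest prefix the target agrees with, so the output is identical to pure greedy decoding on the target model.

```python
# Toy sketch of greedy speculative decoding. `target` and `draft` are
# functions mapping a token list to that model's greedy next token.
def speculative_step(target, draft, prefix, k=4):
    # Draft proposes k tokens autoregressively (cheap).
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)

    # Target verifies all k positions. Because the candidate tokens are
    # known in advance, a real implementation scores them in ONE batched
    # forward pass, reading the weights once instead of k times.
    accepted = []
    ctx = list(prefix)
    for t in proposal:
        expected = target(ctx)          # the big model's greedy choice
        if expected == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)   # first mismatch: take the target's token
            break
    else:
        accepted.append(target(ctx))    # all k accepted: one bonus token
    return accepted
```

Note the lossless property: every emitted token is exactly what the target model would have produced greedily; the draft only decides how many tokens you get per weight read.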
Are Macs/etc compute bound with their 'it fits in unified memory' language models? Certainly by the time you're streaming weights from SSD you must be back in a bandwidth-bound regime.
From what I understood, if we're talking about a single user on a Mac (not batching), you're rarely compute bound in the first place. More rows per pass are nearly free that way, since the cores were sitting idle anyway.
If that’s wrong I would certainly appreciate being corrected, though. But if it’s right, a 2.9x speed-up after rejected tokens, nearly for free, sounds amazing.
That will depend on the model, but they'll hit compute limits before a typical GPU does in almost all cases. Macs will still see a speedup from this, just not as big as the one reported.
Official sites make things worse on purpose after getting any sort of traction because they can't stop chasing profits.
I don't watch sports, but my father watches soccer. He really only cares about one team and the national games from our home country. He was spending over $100/month to be able to watch the games, and they weren't even in his native language. Now he pays $80/year for a pirate IPTV service, and not only can he watch the games anywhere he wants, he also gets native-language commentary for the games, national TV channels like news, etc.
When pirates can charge you money and offer a superior service, it absolutely is a service problem. You can claim that the realities of licensing and whatnot don't allow official channels to provide the best service they can, but that's not true in this case. When the same provider is splitting one team's game broadcasts into different packages, you know they're just trying to extract as much money as possible.
IDK the deal with scanlator sites nowadays, but I assume the official sites can provide more timely translations for manga since they can access the source material before anyone has seen it. I know most popular manga gets translated within hours of release, but if you're following some more niche stuff it can be several days. I also know a lot of scanlators have patreon pages so it's not like the demand from paying customers for translated media isn't there.
Not the parent, but I can guess from watching mostly from the sidelines.
They introduced a 1M context model semi-transparently without realizing the effects it would have, then refused to "make it right" to the customer, which is a trait most people expect from a business when they spend money on it, especially in the US, and especially when the money spent is often in the thousands of dollars.
Unless Anthropic has some secret sauce, I refuse to believe that their models perform anywhere near as well at >300k context sizes as they do at 100k. People don't realize it, but even a small drop in success rate becomes very noticeable if you're used to having near 100%, i.e. 99% -> 95% is more noticeable than 55% -> 50%.
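The arithmetic behind that intuition: over a multi-step agent task the per-step success rate compounds, and what users feel is the failure rate. A quick illustration (my numbers, not from the post):

```python
# 99% -> 95% means 5x as many per-step errors (1% -> 5%), while
# 55% -> 50% is only ~1.1x more errors (45% -> 50%). Compounded over a
# multi-step task, the first drop is dramatic and the second is invisible.
def task_success(per_step, steps=20):
    """Chance a 20-step task finishes with no failed step."""
    return per_step ** steps

high = task_success(0.99), task_success(0.95)  # ~0.82 vs ~0.36
low  = task_success(0.55), task_success(0.50)  # both effectively zero already
```

So a 4-point drop near the ceiling cuts a 20-step task's success rate by more than half, while the same drop near 50% changes nothing you'd notice.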
I got my first Claude sub last month (it expires in 4 days) and I've used it on some biggish projects with opencode. It went from compacting after 5-10 questions to just expanding the context window. I personally notice it deteriorating somewhere between 200-300k tokens, and after that I either fork a previous context or start a new one, because at that size even compacting seems to generate subpar summaries. It currently no longer works with opencode, so I can't attest to how well it worked the past week or so.
If the 1M model introduction is at fault for this mass user perception that the models are getting worse, then it's Anthropic's fault for introducing confusion into the ecosystem. Even if there were zero problems introduced and the 1M model was perfect, if your response when users complain is to blame the user, then don't expect the user to be happy. Nobody wants to hear "you're holding it wrong", but it seems Anthropic is trying to be the Apple of LLMs in all the wrong ways as well.
Especially since Codex faced the same issue but the team decided to explicitly default to only ~200k context to avoid surprises and degradation for users.
Instead of asking the model "Here's this codebase, report any vulnerability." you ask "Here's this codebase, report any vulnerability in module/main.c."
The model can still explore references and other files inside the codebase, but you start a new context/session for each file in the codebase.
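The per-file loop described above can be sketched like this. `ask_model` is a hypothetical stand-in for whatever LLM call you use; the point is simply that every file gets a brand-new context, so the session never grows beyond one file's scope.

```python
# Sketch of a per-file audit loop: one fresh LLM session per source file.
# `ask_model` is a hypothetical placeholder, not a real API.
from pathlib import Path

def audit_codebase(root, ask_model, pattern="*.c"):
    """Run one fresh audit prompt per matching file; return {path: report}."""
    reports = {}
    for path in sorted(Path(root).rglob(pattern)):
        rel = path.relative_to(root)
        prompt = (
            f"Here's this codebase, report any vulnerability in {rel}.\n\n"
            f"{path.read_text()}"
        )
        reports[str(rel)] = ask_model(prompt)  # fresh context: no carried history
    return reports
```

In a real agent setup you'd also let each session open referenced files on demand, but the reset-per-file structure is the part that keeps the context small.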
Honestly, that's the only way I've ever been able to trust the output. Once you go beyond the scope of one file it really degrades. But within a single file I've seen amazing results.
Aren't you supposed to include as many _preconditions_ as you can (in the form of test cases, or function constraints like the "assert" macro in C) in your prompt describing the input for a particular program file, before asking the AI to analyze the file?
Please read my reply to one of the authors of Angr, a binary analysis tool. Here is an excerpt:
> A "brute-force" algorithm (an exhaustive search, in other words) is the easiest way to find an answer to almost any engineering problem. But it often must be optimized before being computed. The optimization may be done by an AI agent based on neural nets, or a learning Mealy machine.
> Isn't it interesting what is more efficient: neural nets or a learning Mealy machine?
...Then I describe what a learning Mealy machine is. And then:
> Some interesting engineering (and scientific) problems are:
> - finding an input for a program that hacks it;
> - finding a machine code for a controller of a bipedal robot, which makes it able to work in factories;
Anyone familiar with the literature know if anyone has tried figuring out why we don't add "speaker" embeddings? So we'd have an embedding purely for system/assistant/user/tool, maybe even for turn number if, e.g., multiple tools are called in a row. Surely it would perform better than expecting the attention matrix to look for special tokens, no?
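The idea in shapes (an illustration of the proposal, not any existing model's architecture): add a learned per-role vector to each token embedding, so role information rides on every position instead of only on special tokens.

```python
# Sketch of "speaker" embeddings: each token's input vector is its normal
# token embedding plus a learned vector for its speaker role. numpy stands
# in for a real trainable embedding layer; sizes are toy values.
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_roles = 100, 8, 4            # roles: system/user/assistant/tool

tok_emb  = rng.normal(size=(vocab, d_model))   # ordinary token embedding table
role_emb = rng.normal(size=(n_roles, d_model)) # one learned vector per role

def embed(token_ids, role_ids):
    """Every position gets token vector + its speaker's role vector."""
    return tok_emb[token_ids] + role_emb[role_ids]

# A short chat: tokens 5, 6 from the user (role 1), token 7 from the
# assistant (role 2). Same token id under a different role embeds differently.
x = embed(np.array([5, 6, 7]), np.array([1, 1, 2]))
```

A turn-number embedding would just be a second small table added the same way.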
You can charge $10 on the account and get unlimited requests. I abused this last week with Nemotron Super to test out some stuff and made probably over 10,000 requests over a couple of days and didn't get blocked or anything, except for 5xx errors and slowdowns.
It's probably another ASR model that focuses on benchmarks and simple uses instead of more challenging real use cases.
I upload edited gameplay VODs of Twitch streams to YouTube, and use whisper-large-v3 to provide subtitles for accessibility reasons (YouTube's own auto-subtitles suck, though they've been getting better).
My checklist for a good ASR model for my use case is:
1. Have timestamp support.
2. Support overlapping speakers.
3. Accurate transcripts that don't coalesce half words/interrupted sentences.
4. Support non-verbal stuff like [coughs], [groans], [laughs], [sighs], etc.
5. Allow context injection of non-trivial sizes (10k+ words).
1 is obvious because without timestamps we can't have subtitles, and forced alignment fails too often.
2 is crucial for real-world scenarios because in the real world people talk over each other all the time; in my case it's a streamer talking over gameplay audio, or the streamer having guests over. When two people speak, the transcript either ignores one of them or, in the worst case, both of them.
3 and 4 are an accessibility thing: if you're deaf or hard of hearing, a more literal transcript of what's being said conveys better how the speaker is speaking. If all subtitles come out perfectly "spell-checked", it's clear your model is overfit to the benchmarks.
5 is not a requirement per se, more of a nice-to-have. In my use case the streamer is often reading stream chat, so feeding the model the list of users who recently talked, recent chat messages, text on screen, etc. would make for more accurate transcripts.
I've tried many models, and the closest that fulfill my needs are LLM-style models on top of forced alignment. That's too slow, so I've been sticking with whisper, because with whisperx I can get a transcript in 5 minutes with just a single command.
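For reference, the single-command workflow looks roughly like this (flag names are from memory of recent whisperX releases and may differ by version; check `whisperx --help`):

```shell
# Transcribe + word-align + diarize in one shot; --diarize labels speakers
# and needs a Hugging Face token for the pyannote models.
whisperx vod_audio.wav --model large-v3 --language en \
    --diarize --hf_token "$HF_TOKEN" --output_format srt
```

The resulting .srt can be uploaded to YouTube as-is as a caption track.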
One thing all these models do (including whisper) is omit entire sentences, which is the worst thing a model can do.