ICYMI unsloth has had some major breakthroughs today with the Qwen3.5 local mode...

danielhanchen · 2026-02-28T11:54:56 1772279696

Oh I didn't expect this to be on HN haha - but yes for our new benchmarks for Qwen3.5, we devised a slightly different approach for quantization which we plan to roll out to all new models from now on!

nnx · 2026-02-28T13:19:34 1772284774

Can you describe what is this slightly different approach and why it should work on all models?

hedora · 2026-02-28T18:26:51 1772303211

Nice! Your stuff ran LLMs extremely well on < $500 boxes (24-32GB ram) with iGPUS before this update.

I’m eager to try it out, especially if 16GB is viable now.

gundmc · 2026-03-01T03:46:33 1772336793

The 5080 is 16GB VRAM, not system memory. I don't think you can get 24-32GB VRAM in a $500 box

Kayou · 2026-02-28T09:53:39 1772272419

Wait, the Q4 quantization which is more than 20GB fits in your 16GB GPU ? I didn't know that was possible, I was always restricting myself to smaller model than the VRAM I had

Maxious · 2026-02-28T10:38:09 1772275089

Yep. These Mixture of Experts models are well suited for paging in only the relevant data for a certain task https://huggingface.co/blog/moe

There's some experiments of just removing or merging experts post training to shrink models even more https://bknyaz.github.io/blog/2026/moe/

vlovich123 · 2026-02-28T17:55:13 1772301313

MoE is not suited for paging because it’s essentially a random expert per token. It only improves throughput because you reduce the memory bandwidth requirements for generating a token since 1/n of the weights are accessed per token (but a different 1/n on each loop).

Now shrinking them sure, but I’ve seen nothing that indicates you can just page weights in and out without cratering your performance like you would with a non MoE model

FuckButtons · 2026-02-28T19:00:06 1772305206

Not entirely true, it’s random access within the relevant subset of experts and since concepts are clustered you actually have a much higher probability of repeatedly accessing the same subset of experts more frequently.

vlovich123 · 2026-03-01T01:26:17 1772328377

It’s called mixture of experts but it’s not that concepts map cleanly or even roughly to different experts. Otherwise you wouldn’t get a new expert on every token. You have to remember these were designed to improve throughput in cloud deployments where different GPUs load an expert. There you precisely want each expert to handle randomly to improve your GPU utilization rate. I have not heard anyone training local MoE models to aid sharding.

cagenut · 2026-03-01T02:10:04 1772331004

is there anywhere good to read/follow to get operational clarity on this stuff?

my current system of looking for 1 in 1000 posts on HN or 1 in 100 on r/locallama is tedious.

p1esk · 2026-03-01T04:51:55 1772340715

Ask any of the models to explain this to you

bee_rider · 2026-02-28T16:10:18 1772295018

That blog post was super interesting. It is neat that he can select experts and control the routing in the model—not having played with the models in detail, tended to assume the “mixing” in mixture of experts was more like a blender, haha. The models are still quite lumpy I guess!

segmondy · 2026-02-28T10:00:32 1772272832

llama.cpp is designed for partial offloading, the most important part of the model will be loaded into the GPU and the rest on system ram. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without having that much GPU vram.

pyuser583 · 2026-03-01T05:33:24 1772343204

How much do you use?

I have lots of trouble figuring out what the limits are of a system with x amount of vram and y amounts of ram. How do you determine this?

fc417fc802 · 2026-03-01T09:00:19 1772355619

Ideally you'd have (parameter count) * (bits per parameter) VRAM for the entire (presumably quantized, don't forget to account for that) model. So very approximately 16 GiB for a 34B model quantized to 4 bits per parameter.

You can spill to RAM in which case you at least want enough for a single active expert but really that's going to tank performance. If you're only "a bit" short of the full model the difference might not be all that large.

These things are memory bandwidth limited so if you check out RAM, VRAM, and PCIe bandwidth what I wrote above should make sense.

Also you should just ask your friendly local LLM these sorts of questions.

pyuser583 · 2026-03-02T11:56:49 1772452609

I usually do ask the llm what parameters to use. But that’s why I know so little about parameters!

nurettin · 2026-02-28T11:43:15 1772278995

This is why they say "A3B" meaning only 3B is active at a time, limiting VRAM usage.

Koffiepoeder · 2026-02-28T11:44:21 1772279061

The A3B part in the name stands for `Active 3B`, so for the inference jobs a core 3B is used in conjunction with another subpart of the model, based on the task (MoE, mixture of experts). If you use these models mostly for related/similar tasks, that means you can make do with a lot less than the 35B params in active RAM. These models are therefore also sometimes called sparse models.

roxolotl · 2026-02-28T12:24:37 1772281477

What method are you using to do that? I’ve been playing with llama.cpp a lot lately and trying to figure out the cleanest options for getting a solid context window on 32gb vram and 64gb system ram.

jychang · 2026-02-28T12:38:59 1772282339

32GB vram is more than enough for Qwen 3.5 35b

You can just load the Q4_K_XL model like normal, and put all tensors on GPU without any -ot or --cpu-moe flags.

If you need a massive context for some reason where model+kv cache won't fit in 32gb, then use -ot to move the ffn moe experts for 1-2 layers into RAM. You'll get a speed hit (due to loading params from slower RAM instead of fast VRAM) but it'll work.

roxolotl · 2026-02-28T13:11:41 1772284301

Nice ok I’ll play with that. I’m mostly just learning what’s possible. Qwen 3.5 35b has been great without any customizations but it’s interesting to learn what the options are.

mirekrusin · 2026-02-28T11:12:04 1772277124

2x RTX 4090, Q8, 256k context, 110 t/s

instagib · 2026-02-28T17:37:00 1772300220

1 4090, Qwen3.5-35B-A3B-UD-MXFP4_MOE, 64k context, 122 t/s. Llama.cpp

mirekrusin · 2026-03-01T05:00:35 1772341235

I believe it's mentioned that MXFP4 performs surprisingly bad, you may want to try other Q4s.

RS-232 · 2026-02-28T12:23:59 1772281439

That’s intriguing. I have the same card, maybe I should give it a go. Curious about your CPU/RAM/storage capacity as well.

Any resources for configuring the local setup?

My entire home media stack is a single compose file in a WSL distro so it would be cool if local LLM worked the same way.

cpburns2009 · 2026-02-28T13:37:34 1772285854

Does llama.cpp support Qwen3.5 yet? When I tried it before, it failed saying "qwen35moe" is an unsupported architecture.

hnfong · 2026-02-28T14:42:10 1772289730

Yes, but make sure you grab the latest llama.cpp release

New model archs usually involve code changes.

sowbug · 2026-03-01T04:07:44 1772338064

If you're running Ollama, you'll have to wait a little longer for its embedded version of llama.cpp to catch up. It can be a couple days or weeks behind.

cpburns2009 · 2026-02-28T15:36:41 1772293001

Awesome! It looks like the llama.cpp-hip AUR was updated today to b8179, and it works.

reactordev · 2026-02-28T13:42:48 1772286168

You would need the Dynamic 2.0 GGUF as discussed in the article.

But mmmmmm, Q8_K_XL looks mighty nice.

jychang · 2026-02-28T09:51:10 1772272270

Not really breakthroughs, more like bugfixes for their broken first batch.

danielhanchen · 2026-02-28T12:09:38 1772280578

No this is false - unsure if you saw our new blog - https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks which shows SOTA on nearly all bits, and we shared all our research as well

jychang · 2026-02-28T12:20:07 1772281207

Yeah, I saw that yesterday. The blog post does not explain why/how the Qwen 3.5 quants uploaded on 2/27 are different from the files uploaded on 2/24.

Old 2/24 Q4_K_XL commit (pre bugfix files): https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/commit/7...

Questions for a postmortem that the blog post left unanswered:

- Why the change? Is it just to improve PPL/KLD? Sure, we can assume PPL and KLD are not perfect benchmarks. If yes, then why change the quantization anyways? Or was the old 2/24 quant actually much worse performing in the real world?I presume the Q4_K_XL quant using mxfp4 was the issue? If the 2/24 files having a lower PPL is an actual issue due to low quality tensors, then why not just say that?

- What were the main tensors that had the quantizations changed from 2/24 to 2/27? Did you now quantize attention tensors differently? Or perhaps ssm? T

- What was it changed from? Was it changed from mxfp4 or q4_k to q8, or something else?

A quick sentence in the blog post saying "ok, we've confirmed that using mxfp4 (or q3 or whatever) in the attention/ssm/biases/norms/etc is a bad idea, we had that in our old models on 2/24 and our new models today are better" that would make it clear. As it's written, it's trying to both say "PPL/KLD don't actually reflect real world quality" and "we changed our quant to increase PPL/KLD" at the same time, which seems contradictory.

zargon · 2026-02-28T18:05:37 1772301937

Explain what about that statement is false. Your original Q4_K_XL quant was broken. People noticing that it was a total outlier among other quants is what prompted this "research". Your own data proves that your new release fixes the bugs of your original, in order to match AesSedai's PPL. Fixing bugs is great. Searching for the best quant mix is helpful. I use your quants and appreciate your work. But whitewashing this situation dilutes trust and good will.