MobiusHorizons · 2026-04-20T14:47:40 1776696460

The article claims the photo was shared in a private group chat. I think the image only became public because of the arrest and subsequent media interest. BDA from social media posts is a very real risk, but that is not what happened here.

rasz · 2026-04-21T12:24:13 1776774253

Do we really want to dwell on sharing stuff in private group chats? cough alcoholic idiot cough Yemen

MobiusHorizons · 2026-04-15T01:25:41 1776216341

If you actually mean offshore as in located in a different country, or especially a different continent, then that is a terrible idea for latency in many forms of computation. There are acceptable use cases, e.g. when round trips are infrequent and average latency is already high, like batch workloads or some forms of LLM workload, but even then closer compute is pretty much always a better experience.
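To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The model (wall time = compute time + sequential round trips × RTT) is a simplification, and the RTT figures and workload sizes are made-up assumptions for illustration, not measurements.

```python
# Rough model: wall time = compute time + round_trips * RTT.
# All numbers below are illustrative assumptions, not measurements.

def total_seconds(compute_s: float, round_trips: int, rtt_ms: float) -> float:
    """Wall-clock time for a job that makes `round_trips` sequential round trips."""
    return compute_s + round_trips * rtt_ms / 1000.0

NEARBY_RTT_MS = 10.0     # assumed: data center in the same region
OFFSHORE_RTT_MS = 150.0  # assumed: data center on another continent

# Chatty interactive workload: little compute, many sequential round trips.
chatty_near = total_seconds(0.5, 200, NEARBY_RTT_MS)
chatty_far = total_seconds(0.5, 200, OFFSHORE_RTT_MS)

# Batch workload: lots of compute, only a couple of round trips.
batch_near = total_seconds(600.0, 2, NEARBY_RTT_MS)
batch_far = total_seconds(600.0, 2, OFFSHORE_RTT_MS)

print(f"chatty: {chatty_near:.1f}s nearby vs {chatty_far:.1f}s offshore")
print(f"batch:  {batch_near:.2f}s nearby vs {batch_far:.2f}s offshore")
```

Under these assumed numbers the chatty workload gets roughly 12x slower offshore, while the batch job barely notices. Nearby is still never worse.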

MobiusHorizons · 2026-04-14T01:26:32 1776129992

I mean yes, but that would require demand, right? The demand is currently for building data centers. Should you just wait around for better things to be in demand? It’s definitely a strategy, but it doesn’t seem like the obviously necessary strategy.

bdangubic · 2026-04-14T11:42:55 1776166975

If someone wants to build a DC out in the middle of nowhere in a non-populated area of whatever state, be my guest, I won't complain. I was specifically talking about where I live ( https://www.datacentermap.com/usa/virginia/ ): a few of the top-10 most affluent counties in the USA are here, but so are the most DCs in the world. There is a huge demand here for things other than data centers that would bring a lot more to the economy than buildings guarded and surrounded by fences that you have to drive next to every day.

MobiusHorizons · 2026-04-13T00:07:05 1776038825

From the readme:

> Note: This project is still heavily in development and is at an early stage.

> Compiling and running simple shaders works, and a significant portion of the core library also compiles.

> However, many things aren't implemented yet. That means that while being technically usable, this project is not yet production-ready.

Also, projects like Rust GPU are built on top of projects like CUDA and ROCm. They aren’t alternatives, they are abstractions over the top.

shmerl · 2026-04-13T00:55:45 1776041745

I think Rust GPU is built on top of Vulkan + SPIR-V as their main foundation, not on top of CUDA or ROCm.

What I meant is more the language for writing GPU programs themselves, not necessarily the machinery right below it. Vulkan is worth advancing for that.

I.e. CUDA and ROCm focus on a C++ dialect as the GPU language. Rust GPU does that with Rust and also relies on Vulkan, without tying it to any specific GPU type.

markisus · 2026-04-13T02:51:23 1776048683

The article mentions Triton for this purpose. I don’t think you will get maxed out performance on the hardware though because abstraction layers won’t let you access the fastest possible path.

shmerl · 2026-04-13T03:10:11 1776049811

> I don’t think you will get maxed out performance on the hardware though because abstraction layers won’t let you access the fastest possible path.

You could argue the same about CPU architectures, no? Yet compilers solve this pretty well most of the time.

fc417fc802 · 2026-04-13T11:24:58 1776079498

Sort of, not really. Compilers are fantastic for the typical stuff, and that includes the compilers in the CUDA/ROCm/Vulkan/etc stacks. But on the CPU, for the rare critical bits where you care about every last cycle or other inane details for whatever reason, you're often all but forced to fall back on intrinsics and microarch-specific code paths.

shmerl · 2026-04-13T15:08:17 1776092897

Yeah, that's why I said most of the time. Sometimes even for CPUs things need assembly. But no one stops you from using GPU assembly either when needed, I suppose? It probably should not be the default approach.

MobiusHorizons · 2026-04-11T05:40:11 1775886011

I think you are conflating microcode with micro-ops. The distinction matters a lot for understanding the fundamental workings of the CPU. Microcode is an alternative to a completely hard-coded instruction decoder. It allows tweaking the behavior in the rest of the CPU for a given instruction without re-making the chip. Micro-ops are a way to break complex instructions into multiple independently executing instructions, and in the case of x86 I think comparing them to RISC is completely apt.

The way I understand it, back in the day when the RISC vs CISC battle started, CPUs were being pipelined for performance, but the complexity of the CISC instructions most CPUs had at the time directly impacted how fast that pipeline could be made. The RISC innovation was changing the ISA by breaking complex instructions with sources and destinations in memory into sequences of simpler loads and stores, and adding a lot more registers to hold the temporary values for computation. RISC allowed shorter pipelines (lower cost of branches or other pipeline flushes) that could also run at higher frequencies because of the relative simplicity.

What Intel did went much further than just microcode. They broke up the loads and stores into micro-ops, using hidden registers to store the intermediates. This allowed them to profit from the innovations that RISC represented without changing the user-facing ISA. That internal load-store architecture is what people typically mean by the RISC hiding inside x86 (although I will admit most of them don't understand the nuance). Of course Intel also added out-of-order execution to the mix, so the CPU is no longer a fixed-length pipeline but more like a series of queues waiting for their inputs to be ready.
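As a toy illustration of that load/ALU/store split, here is a Python sketch. Real x86 decoders are proprietary and far more involved. The `MicroOp` type, the `crack` function, and the `tmp0` hidden-register name are all made up for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MicroOp:
    kind: str    # "load", "store", or "alu:<op>"
    dst: str
    srcs: tuple

def crack(op: str, dst: str, src: str) -> list:
    """Split one two-operand instruction into load/ALU/store micro-ops.

    Memory operands are written like "[rbx]". "tmp0" stands in for a hidden
    register that is not visible in the ISA. Hypothetical toy model only.
    """
    def is_mem(operand: str) -> bool:
        return operand.startswith("[")

    uops = []
    a, b = dst, src
    if is_mem(src):
        # Memory source: load it into the hidden register first.
        uops.append(MicroOp("load", "tmp0", (src,)))
        b = "tmp0"
    elif is_mem(dst):
        # Memory destination: read-modify-write through the hidden register.
        uops.append(MicroOp("load", "tmp0", (dst,)))
        a = "tmp0"
    uops.append(MicroOp("alu:" + op, a, (a, b)))
    if is_mem(dst):
        uops.append(MicroOp("store", dst, (a,)))
    return uops

for uop in crack("add", "[rbx]", "rax"):
    print(uop)
```

A register-to-register `add rax, rcx` stays a single ALU micro-op in this model, while `add [rbx], rax` becomes a load, an add on the hidden register, and a store, which is the load-store shape a RISC ISA would make you write out by hand.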

These days high-performance RISC architectures contain all the same architectural elements as x86 CPUs (including micro-ops and extra registers), and the primary difference is the instruction decoding. I believe AMD even designed (but never released) an ARM CPU [1] that put a RISC instruction decoder in front of what I believe was the Zen 1 backend.

[1]: https://en.wikipedia.org/wiki/AMD_K12

MobiusHorizons · 2026-04-11T03:49:42 1775879382

That's not really how it works. There are only a few companies on the planet that are licensed to create their own cores that can run ARM instructions. This is an artificial constraint, though, and at present China is (as far as I know) cut off from those licenses. Everyone else that makes ARM chips is taking the core design directly from ARM and integrating it with other pieces (called IP) like IO controllers, power management, GPU, and accelerators like NPUs to make a system on a chip. But with RISC-V lots of Chinese companies have been making their own core designs, which leads to design flexibility that is not generally available (and certainly not cost-effective) on ARM.

MobiusHorizons · 2026-04-11T03:36:22 1775878582

Compiling the code is not the issue. The hard part is the system integration, most notably the boot process and peripherals. It's not actually hard to compile code for any given ARM or x86 target. Even much less open ecosystems like IBM mainframes have free and open source compilers (e.g. GCC). The ISA is just how computation happens. But you have to boot the system and get data in and out for the system to be actually useful, and pretty much all of that contains vendor-specific quirks. It's really only the x86 world where that got so standardized across manufacturers, and that was mostly because people were initially trying to make compatible clones of the IBM PC.

MobiusHorizons · 2026-02-09T17:04:13 1770656653

There will always be many gaps in people’s knowledge. You start with what you need to understand, and typically dive deeper only when it is necessary. Where it starts to be a problem in my mind is when people have no curiosity about what’s going on underneath, or even worse, start to get superstitious about avoiding holes in the abstraction without the willingness to dig a little and find out why.

MobiusHorizons · 2026-02-09T06:21:56 1770618116

I mean you can always make things slower. There are lots of non-optimizing or low-optimizing compilers that are _MUCH_ faster than this. TCC is probably the most famous example, but hardly the only alternative C compiler with performance somewhere between -O1 and -O2 in GCC. By comparison, as I understand it, CCC has performance worse than -O0, which is honestly a bit surprising to me, since -O0 should not be a hard-to-achieve target. As I understand it, at -O0 C is basically just macro expanding into assembly with a bit of order of operations thrown in. I don't believe it even does register allocation.
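To show what I mean by "macro expanding with a bit of order of operations", here is a toy sketch in Python. It is not what GCC actually does at -O0. It borrows Python's own `ast` parser for operator precedence and emits illustrative x86-64-style assembly text, and the `emit` and `compile_expr` names are made up for this example. Every intermediate goes through the stack, so there is no register allocation at all.

```python
import ast

# Map Python AST operator nodes to (illustrative) x86-64 mnemonics.
OPS = {ast.Add: "add", ast.Sub: "sub", ast.Mult: "imul"}

def emit(node, out):
    """Append code that leaves the value of `node` on top of the stack."""
    if isinstance(node, ast.Constant):
        out.append(f"    push {node.value}")
    elif isinstance(node, ast.Name):
        # Treat each variable as a memory location named after it.
        out.append(f"    push qword [{node.id}]")
    elif isinstance(node, ast.BinOp) and type(node.op) in OPS:
        emit(node.left, out)    # left operand ends up below...
        emit(node.right, out)   # ...the right operand on the stack
        out.append("    pop rcx")
        out.append("    pop rax")
        out.append(f"    {OPS[type(node.op)]} rax, rcx")
        out.append("    push rax")
    else:
        raise ValueError(f"unsupported expression: {ast.dump(node)}")

def compile_expr(source):
    """Translate an arithmetic expression into a list of assembly lines."""
    tree = ast.parse(source, mode="eval")
    out = []
    emit(tree.body, out)
    out.append("    pop rax")   # final result in rax
    return out

print("\n".join(compile_expr("a + b * 2")))
```

Each AST node expands to a fixed template with no knowledge of its neighbors, which is why the output is full of redundant push/pop pairs. That is roughly the baseline I would expect any compiler to reach.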

MobiusHorizons · 2026-02-08T16:43:04 1770568984

That would be true of one using a libc, but in a boot sector you only have the BIOS, so the atoi being referenced is the one defined in C near the beginning of the article.

zahlman · 2026-02-08T18:32:32 1770575552

Ah, I somehow skipped over that exact code block on first read.