I think they also sold HA-SMR (host-aware SMR, a mix of drive-managed and host-managed) drives to datacenters, but I couldn't find any recent news, so I assume those aren't that common.
I believe the latest on zoned devices is that the explicit zone-operations API is being scrapped in favor of letting the OS merely provide a "region hint", separating writes by the expected lifespan of the data stored. That could be as simple as different ZFS datasets using different hints, if one dataset is archival and another contains actively-overwritten data.
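For flavor, the closest thing Linux already has is the per-file write-lifetime hint API (`fcntl` with `F_SET_RW_HINT`). A rough sketch, with the constants hardcoded from the kernel UAPI headers since Python's `fcntl` module doesn't reliably export them; whether the hint does anything depends on kernel and filesystem support:

```python
import fcntl
import os
import struct
import tempfile

# Constants from include/uapi/linux/fcntl.h, hardcoded because Python's
# fcntl module may not export them on all versions.
F_SET_RW_HINT = 1036  # F_LINUX_SPECIFIC_BASE (1024) + 12
RWH_WRITE_LIFE = {"none": 1, "short": 2, "medium": 3, "long": 4, "extreme": 5}

def set_write_hint(fd: int, lifetime: str) -> bool:
    """Tag an open file with its data's expected lifetime.

    Returns False if the kernel or filesystem doesn't support hints."""
    try:
        fcntl.fcntl(fd, F_SET_RW_HINT, struct.pack("Q", RWH_WRITE_LIFE[lifetime]))
        return True
    except OSError:
        return False

# Hypothetical usage: an archival file vs. frequently-rewritten data.
with tempfile.NamedTemporaryFile() as archive:
    set_write_hint(archive.fileno(), "extreme")  # rarely rewritten
```

A dataset-level hint, as described above, would amount to applying something like this to everything the dataset writes.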
> ZFS has a lot of genuine potential to work well with SMR drives, but that potential depends on glue code that nobody has written yet.
...and I'm not sure that will change. The thing is, the "drive-managed SMR" models I've managed to get my hands on don't have a zoned storage API. I'm not sure how a filesystem (or application that directly goes to the block device layer) is supposed to work with them intelligently.
There are supposed to be host-managed and hybrid host+drive-managed SMR drives out there, but AFAICT they're only sold to enterprises/hyperscalers (maybe even only the latter). And those buyers use custom proprietary software (e.g. Google's D servers).
...so who would write/use this glue code? Unless the manufacturers decide to contract filesystem developers as a lead-up to wider sale of these things, or something to that effect, I don't see it happening.
> ...and I'm not sure that will change. The thing is, the "drive-managed SMR" models I've managed to get my hands on don't have a zoned storage API. I'm not sure how a filesystem (or application that directly goes to the block device layer) is supposed to work with them intelligently.
Mostly by making large sequential writes. F2FS is a lot faster on a consumer drive-managed SMR drive than other filesystems (at least until you need to fsck, upgrade the kernel, or do any of the other actions that utterly suck with F2FS).
Maybe, but I'm not sure I trust that to be enough. The drive's proprietary SMR-management firmware might still end up doing something really wasteful under normal usage patterns, in a way that's hard to diagnose and that wouldn't be necessary if the application were in control.
Let's imagine an NVR that simply uses the drive as a ring buffer for media data. [1] The application uses the whole drive at the block device layer, no filesystem, and all the metadata is stored separately on an SSD. This should be the absolute best case for SMR, but even then I'm not confident it would work well.

Eventually the drive fills and starts rewriting. On each write it now has to guarantee the bytes just beyond your write stay the same; my understanding is it has some minimum write size, or always does some fixed-length overwrite, or something to that effect. Large sequential writes surely help, and the device presumably has a generous RAM-based write cache, but even so I suspect it will end up regularly reading and rewriting a bunch of data that would probably have been overwritten soon anyway, and constantly spilling to the CMR area of the drive.
If the application were in control, it'd probably ensure there's some "don't care" buffer between the write zone and the read zone to avoid this. But I don't think there's any way to tell many of these drive-managed SMR HDDs to do that.
[1] You might want a ring per camera stream so you use less read bandwidth on playback of a single stream, and you might want the ability to redact bits of recording in the middle, but let's ignore that.
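For illustration, the host-side ring buffer described above might look like this, with a hypothetical "don't care" guard gap kept between the write head and anything readers touch. It's a sketch against an ordinary temp file; a real NVR would open the raw block device (ideally with O_DIRECT) and use the drive's actual zone size rather than the assumed constants here:

```python
import os
import tempfile

ZONE = 4 * 1024 * 1024   # assumed zone-sized sequential write unit
GUARD_ZONES = 2          # assumed "don't care" gap between head and readers

class RingBuffer:
    """Append-only ring over a fixed-size device or file, written only in
    zone-sized sequential chunks so the drive never sees small random
    overwrites. Metadata lives elsewhere (e.g. on an SSD)."""

    def __init__(self, fd: int, size: int):
        assert size % ZONE == 0 and size > GUARD_ZONES * ZONE
        self.fd, self.size, self.head = fd, size, 0

    def append(self, chunk: bytes) -> None:
        assert len(chunk) == ZONE
        os.pwrite(self.fd, chunk, self.head)  # strictly sequential, wraps
        self.head = (self.head + ZONE) % self.size

    def oldest_readable(self) -> int:
        # Readers stay GUARD_ZONES behind the write head, so the drive is
        # free to clobber the shingled tracks just past the newest write.
        return (self.head + GUARD_ZONES * ZONE) % self.size

# Demo: a 4-zone ring that wraps around once.
with tempfile.TemporaryFile() as f:
    ring = RingBuffer(f.fileno(), 4 * ZONE)
    for _ in range(5):
        ring.append(b"\0" * ZONE)
```

The point of the guard gap is exactly the "don't care" buffer mentioned above: data inside it is sacrificial, so the drive never needs to preserve the tracks shingled under the write head.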
> now it has to guarantee the bytes just beyond your write stay the same [...] "don't care" buffer between the write zone and the read zone to avoid this.
If the firmware is competent the drive will have TRIM support.
Also it could start overwriting a zone even when there's old live data in it, as long as it buffers a few megabytes to stay at least a track away from the old live data.
And even if it does have to read back the whole zone, it could keep that in RAM to keep the performance impact from being too bad. (Ideally with some kind of power loss protection.)
> If the firmware is competent the drive will have TRIM support.
Sure, if false then false is a true statement.
It's not competent, though? IIRC my ST8000DM004 did not support TRIM, and I think that's typical of the genre of drives on the market.
I have no faith these drives will not degenerate to the stupidest possible access pattern; their firmware is opaque and bad and their interface is limited.
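For what it's worth, you can at least check what the kernel sees without vendor tools: `lsblk --discard` reports it, or you can read the sysfs queue attributes directly. A small sketch (the device name is whatever your drive shows up as):

```python
def supports_discard(dev: str = "sda") -> bool:
    """True if the kernel advertises TRIM/discard for the block device.

    Reads the same sysfs attribute that `lsblk --discard` reports;
    a value of 0 (or a missing device) means no discard support."""
    path = f"/sys/block/{dev}/queue/discard_max_bytes"
    try:
        with open(path) as f:
            return int(f.read()) > 0
    except (FileNotFoundError, ValueError):
        return False
```

Note this only tells you the drive *advertises* TRIM, not that the firmware does anything useful with it.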
I found a comment saying "Sadly it’s 2024 and WD seem to be the only manufacturer actually implementing TRIM."
And I'd expect it to be more common on the higher end drives where someone might intentionally choose SMR to get more space, as opposed to the <=8TB market where SMR is a cost saving they don't want anyone to notice.
> And I'd expect it to be more common on the higher end drives where someone might intentionally choose SMR to get more space, as opposed to the <=8TB market where SMR is a cost saving they don't want anyone to notice.
Yeah, "don't want anyone to notice" is probably a big factor, and in fairness the drive I mentioned was of the surprise SMR scandal era. (8TB was decently large then though I think.)
I guess the hidden complexity of firmware-managed storage fundamentally applies to SSDs too, but they have more wiggle room there thanks to the inherently better performance of the media, and for whatever reason the average firmware quality appears better IMHO, with some notable exceptions: https://news.ycombinator.com/item?id=32031243
I've tried to find it, but it was almost certainly a 10+ year old video and I watched it 3-4 years ago. I do vividly recall screaming at the screen, though, trying in vain to clear up the confusion.
It was a simple mistake, one I've made myself many times. You're so primed to hear A that you hear A even though the other party says E.
Anyway, this was of course for host-managed drives, and I haven't heard much talk about that since in ZFS leadership meetings or similar.
ZFS supports 16MB record sizes, and there has been talk of bumping that up as storage grows, so perhaps one day you could imagine 128-256MB record sizes for storing large blobs, which would fit SMR better.
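Back-of-the-envelope on why bigger records would fit: if an in-place update of one record can force a read-modify-write of its whole shingled zone, worst-case write amplification is roughly zone size over record size. (The 256 MB zone size here is a hypothetical round number; real drives vary, and drive caching softens the worst case.)

```python
ZONE_MB = 256  # hypothetical SMR zone size; real drives vary

def worst_case_amplification(record_mb: float, zone_mb: float = ZONE_MB) -> float:
    """MB rewritten per record updated, assuming one dirty record forces a
    read-modify-write of its entire zone (worst case, no write caching)."""
    return zone_mb / record_mb

print(worst_case_amplification(16))   # 16 MB records -> 16.0x
print(worst_case_amplification(256))  # zone-sized records -> 1.0x
```

With records as big as the zone, a record rewrite maps one-to-one onto a zone rewrite, which is the access pattern SMR is actually good at.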
But yeah, would need to write code for that and ZFS ain't exactly overflowing with devs.
ZFS has a lot of genuine potential to work well with SMR drives, but that potential depends on glue code that nobody has written yet.
I think BTRFS is halfway there? It already has a zoned mode for host-managed devices.