Hey! I'm Nick, and I work on Integrity at OpenAI. These checks are part of how w...

vlovich123 · 2026-03-30T03:39:46 1774841986

That still doesn’t explain why you can’t even start typing until that check completes. You could hold the outbound request from being processed until then. But preventing typing seems like it’s just worse UX, and the problem will fail to appear in any metrics you can track because you have no way of measuring “how quickly would the user have submitted their request without all this other stuff in the way”.

Said another way, if done in the background the user wouldn’t even notice unless they typed and submitted their query before the check completed. In the realistic scenario this would complete before they even submit their request.

mike_hearn · 2026-03-30T10:05:08 1774865108

I developed the first version of Google's equivalent of this (albeit theirs actually computes a constantly rotating key from the environment, it doesn't just hard-code it in the program!).

The reason it has to block until it's loaded is that otherwise the signal being missing doesn't imply automation. The user might have just typed before it loaded. If you know a legit user will always deliver the data, you can use the absence of it to infer something about what's happening on the client. You can obviously track metrics like "key event occurred before bot detection script did" without using it as an automation signal, just for monitoring.
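A rough sketch of that inference, to make the logic concrete. `Session`, `verdict`, and `typed_before_ready` are hypothetical names invented for illustration; this is not how Google or OpenAI implement it, only the reasoning in the paragraph above written out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    script_ready_at: Optional[float]  # None if the detection script never reported
    first_key_at: Optional[float]     # None if no key event was seen


def verdict(s: Session, input_blocked_until_ready: bool) -> str:
    """What a missing signal can be taken to mean under each policy."""
    if s.script_ready_at is not None:
        return "signal present"
    if input_blocked_until_ready:
        # A real user could not have typed without the script running,
        # so the absence of the signal is informative.
        return "likely automated"
    # Without blocking, the user may simply have typed before it loaded.
    return "ambiguous"


def typed_before_ready(s: Session) -> bool:
    """Monitoring-only metric: a key event occurred before the script was ready."""
    if s.first_key_at is None:
        return False
    return s.script_ready_at is None or s.first_key_at < s.script_ready_at
```

The second function is the "just for monitoring" metric: it can be logged without feeding into the automation verdict.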

fc417fc802 · 2026-03-30T12:18:39 1774873119

That doesn't make sense. The server would wait to process anything until after you received the signal. If it doesn't arrive within a reasonable period of time that tells you something, the same as right now.

If you mean that you can infer client side tampering with the page contents you could still do that - permit typing but don't permit the submit action on the client. The user presses enter but nothing happens until the check is complete. There you go, now you can tell if the page was tampered with (not that it makes much difference tbh).

mike_hearn · 2026-03-30T15:12:50 1774883570

The typing actions have to be observed by JavaScript. It's not different to any other JS blocking page load because it's needed for the site to work, that's just how the web works.

electroly · 2026-03-30T15:45:26 1774885526

This doesn't seem to be the same thing. The article isn't about being unable to type before JavaScript starts executing. If I understand correctly, you're unable to type until a network request to Cloudflare returns. The question is: why not allow typing during that network request? JavaScript is running and it's observing the keystrokes. Everyone understands that you can't use a React application until JavaScript is running. They're asking why the network request doesn't happen in the background with the user optimistically allowed to type while waiting for it to return.

(Separately, I don't think the article has adequately demonstrated this claim. They just make the claim in the title. The actual article only shows that some network request is made, and that the request happens after the React app is loaded, but not that they prevent input until it returns. Maybe it's obvious from using it, but they didn't demonstrate it.)

mike_hearn · 2026-03-30T16:24:43 1774887883

The network request to Cloudflare is part of the JavaScript (in effect).

electroly · 2026-03-30T16:40:20 1774888820

I don't think that's true in this case; the React application loads first, fully initializes, and then sends its state via Cloudflare request. It can't happen at the same time, by design. It has to happen serially. The article's claim is that you can't type during this second request. Frankly, I wonder if this is actually true at all. The article did not demonstrate this, and there's no problem if you can actually interact as soon as the React application is running. ChatGPT running abuse prevention and React applications requiring JavaScript to work are both uncontroversial, I think.

mike_hearn · 2026-03-30T18:19:53 1774894793

OK, I haven't looked at the exact sequencing here. But generally, once the action goes back to the anti-abuse service for checking the user can't be allowed to change what they're submitting. The view the anti-abuse system saw has to match what the app server sees.
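One generic way to get that "same view" property is to bind the check's verdict to a digest of the exact payload it inspected. The sketch below is an assumption for illustration only: `SECRET`, `issue_token`, and `verify` are made-up names, and nothing here describes what OpenAI or Cloudflare actually do.

```python
import hashlib
import hmac

SECRET = b"demo-only-secret"  # hypothetical key shared by the two services


def issue_token(payload: bytes) -> str:
    """Anti-abuse side: sign a digest of exactly what it inspected."""
    digest = hashlib.sha256(payload).digest()
    return hmac.new(SECRET, digest, hashlib.sha256).hexdigest()


def verify(payload: bytes, token: str) -> bool:
    """App-server side: accept only the payload that was actually checked."""
    return hmac.compare_digest(issue_token(payload), token)


token = issue_token(b"what is 2+2?")
unchanged_ok = verify(b"what is 2+2?", token)        # same bytes as checked
edited_ok = verify(b"what is 2+2? (edited)", token)  # changed after the check
```

With a scheme like this, any edit made after the check invalidates the token, which is why the submitted content would have to be frozen once it goes out for checking.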

root_axis · 2026-03-30T15:18:05 1774883885

Why can't you allow typing and just consume the state of the text input as the initial state of the js logic?

arccy · 2026-03-30T15:43:42 1774885422

how you type is also part of the signal

susupro1 · 2026-03-30T12:28:32 1774873712

This perfectly explains the trade-off. But from a pure UX perspective, freezing the input pipeline feels uniquely hostile. They could buffer the keystrokes invisibly in the background instead of locking the cursor, which creates the jarring perception that the site is actively fighting the user.

toinewx · 2026-03-30T14:27:50 1774880870

can you reformulate your message?

gavinray · 2026-03-30T15:22:29 1774884149

Mike is saying that if you allow users to type before the scripts are fully loaded, there is no way to tell the difference between a human and bot.

Blocking until load means that human interaction is physically impossible, so you are certain that any input before that is automated.

If you allow typing, this distinction vanishes

LtWorf · 2026-03-30T15:53:13 1774885993

Load fewer scripts so it doesn't take that long?

matchagaucho · 2026-03-30T17:36:26 1774892186

Keyboard response feels 10x slower in ChatGPT Projects (possibly for reasons other than react state).

p-e-w · 2026-03-30T06:09:25 1774850965

Many cloud products now continuously send themselves the input you type while you are typing it, to squeeze the maximum possible amount of data from your interactions.

I don’t know whether ChatGPT is one of those products, but if it is, that behavior might be a side effect of blocking the input pipeline until verification completes. It might be that they want to get every single one of your keystrokes, but only after checking that you’re not a bot.

davidkunz · 2026-03-30T06:28:03 1774852083

It's still possible to let users already type from the beginning, just delay sending the characters until checks are complete. Hold them in memory until then.
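A minimal sketch of that idea, modeled in Python with asyncio rather than browser JavaScript. `GatedInput`, `slow_check`, and `demo` are hypothetical names, not anything from ChatGPT's code; the point is only that keystrokes are accepted immediately while the outbound send waits on the check.

```python
import asyncio


class GatedInput:
    """Accepts keystrokes at once; holds the send until a check passes."""

    def __init__(self, check):
        self.buffer = []   # characters typed so far, held in memory
        self.sent = []     # messages that were actually sent
        # Start the check in the background (needs a running event loop).
        self._check = asyncio.create_task(check)

    def type_text(self, text):
        # Typing never waits on the check.
        self.buffer.append(text)

    async def submit(self):
        # Only the send is gated: wait for the check, then flush the buffer.
        passed = await self._check
        if not passed:
            raise PermissionError("check failed; nothing was sent")
        message = "".join(self.buffer)
        self.buffer.clear()
        self.sent.append(message)
        return message


async def slow_check(delay=0.05):
    # Hypothetical stand-in for the verification round trip.
    await asyncio.sleep(delay)
    return True


async def demo():
    box = GatedInput(slow_check())
    box.type_text("hel")       # typed before the check has finished
    box.type_text("lo")
    return await box.submit()  # resolves once the check completes


result = asyncio.run(demo())
```

This does not answer the objection raised elsewhere in the thread that how you type may itself be part of the signal; it only shows that buffering and gating the send are separable.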

miyuru · 2026-03-30T06:44:24 1774853064

Instagram was uploading the images while the user was adding post details, back in 2012!

https://news.ycombinator.com/item?id=3913919

No one seems to use or care about their own product anymore. They only use dashboards and metrics, which do not explain the full situation.

AlecSchueler · 2026-03-30T06:59:47 1774853987

That makes total sense from a UX perspective though, the ChatGPT thing does not.

scottyah · 2026-03-30T18:48:50 1774896530

there were a lot of helpdesk chats doing the same, so you could see users typing messages, then deleting words, etc before hitting send.

Imustaskforhelp · 2026-03-30T17:45:40 1774892740

This was actually one of the reasons why Instagram felt smooth.

Another thing: Facebook/Instagram have also detected when a person uploads an image and then deletes it, recognized that they are insecure, and, in the case of TEENAGE girls, recorded that in their profile (that they are insecure) and shown them beauty products....

I really like telling this example because people in real life/even online get so shocked, I mean they know facebook is bad but they don't know this bad.

[Also a bit offtopic, but I really like how in the item?id=3913919 the 391 came twice :-) , it's a good item id ]

mort96 · 2026-03-30T09:59:02 1774864742

I just checked the network inspector, the only thing it does per key press is to generate an autocomplete list. It doesn't seem too hard to wait with the autocomplete generation until after whichever checks you run pass.

andai · 2026-03-30T07:59:59 1774857599

I wondered if ChatGPT streams my message to the GPU while I type it, because the response comes weirdly fast after I submit the message. But I don't know much about how this stuff works.

aabhay · 2026-03-30T13:38:01 1774877881

Likely prefix caching among many other things

m3kw9 · 2026-03-30T17:44:57 1774892697

Because of the way they have the server architecture set up and how it loads the screen. You don’t even want all the bots hitting servers

dncornholio · 2026-03-30T14:45:10 1774881910

You cannot know what verifications they use. I could argue the disabled textbox is some sort of part of the verification process. Humans will click on it while bots won't.

root_axis · 2026-03-30T15:19:30 1774883970

Seems like a trivially simple verification to defeat.

YetAnotherNick · 2026-03-30T15:27:46 1774884466

You can defeat all client side verification by definition if you know what verification is run.

QEDCTrL · 2026-03-30T15:08:56 1774883336

Sounds like anti-distillation to me. But, know what? Meh.

mcmcmc · 2026-03-30T15:37:38 1774885058

I’d be inclined to agree with the “meh” if their entire product weren’t built off pirated content

deadbabe · 2026-03-30T12:08:47 1774872527

Remember you’re talking to a vibe coder who just stares at code being printed out by AI.

mcmcmc · 2026-03-30T15:40:49 1774885249

That’s a big assumption. It’s a brand new account, might be a bot. PR/astroturfing is a great use case for agentic AI

Imnimo · 2026-03-29T22:38:50 1774823930

It's interesting to me that OpenAI considers scraping to be a form of abuse.

DrinkyBird · 2026-03-30T12:49:54 1774874994

It’s funny because the first AI scraper I remember blocking was OpenAI’s, as it got stuck in a loop somehow and was impacting the performance of a wiki I run. All to violate every clause of the CC BY-NC-SA license of the content it was scraping :)

raincole · 2026-03-30T05:07:05 1774847225

Quite sure even literal thieves would consider thievery a form of abuse.

mcmcmc · 2026-03-30T15:42:43 1774885363

What’s being stolen? AI output isn’t copyrightable, and it’s not like they’re ripping pages out of a book

plutokras · 2026-03-30T18:27:03 1774895223

They can train on the outputs i.e. distillation attacks.

duped · 2026-03-30T14:11:34 1774879894

Engineers working on AI and AI enthusiasts are seemingly incapable of seeing the harm they cause, so I disagree.

It is difficult to get a man to understand something, when his salary depends on his not understanding it.

littlestymaar · 2026-03-30T05:34:28 1774848868

Yeah, they know it's bad, they just don't think the rules apply to them.

mapt · 2026-03-30T12:23:44 1774873424

The rules are that a large corporate AI company is able to scrape literally everything, and will use the full force of the law and any technology they can come up with to prevent you as an individual or a startup from doing so. Because having the audacity to try to exploit your betters would be "Theft".

vbezhenar · 2026-03-30T09:43:39 1774863819

They know that the rules apply to them. They hope that they can avoid being caught.

skeeter2020 · 2026-03-30T13:40:21 1774878021

Small mitigation (by no way absolving them): isolated developers, different teams. Another way: they see "stealing" of their compute directly in their devop tools every day, but are several abstractions away from doing the same thing to other people.

catoc · 2026-03-30T06:44:26 1774853066

It’s only bad if you’re a closed, for-profit entity

</sarcasm>

lukan · 2026-03-30T08:08:05 1774858085

Was that sarcasm? Speaking of it, what parts of OpenAI are still open?

catoc · 2026-03-30T08:15:42 1774858542

I know, always hard to tell on HN. Added the relevant declarative tag

reactordev · 2026-03-30T10:51:44 1774867904

The front door…

splatter9859 · 2026-03-30T16:43:26 1774889006

They never have and feel they are above reproach. Anytime Altman opens his mouth that's apparent. It's for the good of humanity dontcha know. LOL

kamban · 2026-03-30T06:03:52 1774850632

You nailed it.

tedsanders · 2026-03-30T07:35:19 1774856119

For what it's worth, the big AI companies do have opt out mechanisms for scraping and search.

OpenAI documents how to opt out of scraping here: https://developers.openai.com/api/docs/bots

Anthropic documents how to opt out of scraping here: https://privacy.claude.com/en/articles/8896518-does-anthropi...

I'm not sure if Gemini lets you opt out without also delisting you from Google search rankings.
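For readers who have not seen one, a robots.txt opt-out can be checked locally with Python's standard `urllib.robotparser`. The user-agent token `GPTBot` and the example.org URL below are assumptions for illustration; the vendor pages linked above are the authority on the actual tokens.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one crawler token, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.org/wiki/Page"
gptbot_allowed = parser.can_fetch("GPTBot", url)       # matched by the first group
other_allowed = parser.can_fetch("SomeBrowser", url)   # falls through to "*"
```

This only expresses the site's wishes; it has an effect only on crawlers that actually read and honor the file.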

foresterre · 2026-03-30T07:57:55 1774857475

I think opt-outs are a bit backwards, ethically speaking. Instead of asking for permission, they take unless you tell them to no longer do it from now on.

I can imagine their models have been trained on a lot of websites before opt outs became a thing, and the models will probably incorporate that for forever.

But at least for websites there's an opt-out, even if only for the big AI companies. Open source code never even got that option ;).

kneel25 · 2026-03-30T12:38:26 1774874306

> a lot of websites

It was a dataset of the entirety of the public internet from the very beginning that bypassed paywalls etc, there’s virtually nothing they haven’t scraped.

qaadika · 2026-03-30T14:03:26 1774879406

> the big AI companies do have opt out mechanisms for scraping and search.

PRESS RELEASE: UNITED BURGLARS SOCIETY

The United Burglars Society understands that being burgled may be inconvenient for some. In response, UBS has introduced the Opt-Out system for those who wish not to be burgled.

Please understand that each burglar is an independent contractor, so those wishing not to be burgled should go to the website for each burglar in their area and opt-out there. UBS is not responsible for unwanted burglaries due to failing to opt-out.

netdevphoenix · 2026-03-30T09:56:51 1774864611

Performing an automated action on a website that has not consented is the problem. OpenAI showing you how to opt-out is backwards. Consent comes first.

Bit concerning that some professional engineers don't understand this given the sensitive systems they interact with.

subscribed · 2026-03-30T13:45:17 1774878317

Just respect the bloody robots.txt and hold your horses. Ask your precious product built on the relentless, hostile scraping to devise a strategy that doesn't look like a cancer growth.

keybored · 2026-03-30T10:03:26 1774865006

Death by a thousand opt-outs.

jordanb · 2026-03-30T14:37:25 1774881445

They don't want anyone to take that which they have rightfully stolen.

altmanaltman · 2026-03-30T19:01:28 1774897288

Well at least they have 1 person working on "Integrity" so can't be too bad

splatter9859 · 2026-03-30T16:42:23 1774888943

Exactly! How dare you have access to their stolen content in the midst of them doing the same.

axegon_ · 2026-03-30T07:36:37 1774856197

The levels of irony that shouldn't be possible...

ProofHouse · 2026-03-29T23:44:50 1774827890

The irony is thick

sabedevops · 2026-03-29T22:45:41 1774824341

Seriously. The hypocrisy is staggering!

wiseowise · 2026-03-30T09:28:52 1774862932

Church, politicians, moralists are all the biggest hypocrites that want to teach you something.

newsoftheday · 2026-03-30T15:05:29 1774883129

I agree on politicians, no idea what a "moralist" is supposed to be but there are good and bad churches and church goers; lumping all church goers into one category calling them hypocrites is wrong. There are many good churches and church goers who help people and their communities.

zer00eyz · 2026-03-29T23:00:22 1774825222

" Integrity at OpenAI .. protect ... abuse like bots, scraping, fraud "

Did you mean to use the word hypocrisy? If not, I'm happy to have said it.

I just want to note, that it is well covered how good the support is for actual malware...

RobotToaster · 2026-03-30T09:41:58 1774863718

"You're trying to kidnap what I've rightfully stolen!"

gib444 · 2026-03-30T10:02:28 1774864948

And have absolutely no reservations about making such an obvious statement on a public forum

Aurornis · 2026-03-30T00:25:53 1774830353

I interpreted scraping to mean in the context of this:

> we want to keep free and logged-out access available for more users

I have no doubt that many people see the free ChatGPT access as a convenient target for browser automation to get their own free ChatGPT pseudo-API.

lelanthran · 2026-03-30T08:59:45 1774861185

> I have no doubt that many people see the free ChatGPT access as a convenient target for browser automation to get their own free ChatGPT pseudo-API.

Not that hard - ChatGPT itself wrote me a FF extension that opened a websocket to a localhost port, then ChatGPT wrote the Python program to listen on that websocket port, as well as another port for commands.

Just a handful of commands implemented in the extension is enough for my bash scripts to open the tab to ChatGPT, target specific elements, like the input, add some text to it, target the relevant chat button, click it, etc.

I've used it on other pages (mostly for test scripts that don't require me to install the whole jungle just to get a banana, as all the current Playwright-type products do). Too afraid to use it on ChatGPT, Gemini, Claude, etc. because if they detect that the browser is being driven by bash scripts they can terminate my account.

That's an especially high risk for Gemini - I have other google accounts that I won't want to be disabled.

wolvoleo · 2026-03-30T06:15:35 1774851335

This is bad why? Well yeah for openai because all they want it to be is a free teaser to get people hooked and then enshittify.

Morally I don't see any issues with it really.

rsrsrs86 · 2026-03-30T15:18:27 1774883907

nikitaga · 2026-03-29T23:53:02 1774828382

Scraping static content from a website at near-zero marginal cost to its server, vs scraping an expensive LLM service provided for free, are different things.

The former relies on fairly controversial ideas about copyright and fair use to qualify as abuse, whereas the latter is direct financial damage – by your own direct competitors no less.

It's fun to poke at a seeming hypocrisy of the big bad, but the similarity in this case is quite superficial.

PunchyHamster · 2026-03-30T01:29:30 1774834170

> Scraping static content from a website at near-zero marginal cost to its server, vs scraping an expensive LLM service provided for free, are different things.

I bet people being fucking DDOSed by AI bots disagree

Also the fucking ignorance assuming it's "static content" and not something needing code running

remus · 2026-03-30T07:42:37 1774856557

I think the parent is just pointing out that these things lie on a spectrum. I have a website that consists largely of static content and the (significant) scraping which occurs doesn't impact the site for general users so I don't mind (and means I get good, up to date answers from LLMs on the niche topic my site covers). If it did have an impact on real users, or cost me significant money, I would feel pretty differently.

0xEF · 2026-03-30T08:35:35 1774859735

Putting everything on a spectrum is what got us into this mess of zero regulation and moving goal posts. It's slippery slope thinking no matter which way we cut it, because every time someone calls for a stop sign to be put up after giving an inch, the very people who would have to stop will argue tirelessly for the extra mile.

Aerroon · 2026-03-30T12:23:10 1774873390

What mess are you talking about? The existence of LLMs? I think it's pretty neat that I can now get answers to questions I have.

This is something I couldn't have done before, because people very often don't have the patience to answer questions. Even Google ended up in loops of "just use Google" or "closed. This is a duplicate of X, but X doesn't actually answer the question" or references to dead links.

Are there downsides to this? Sure, but imo AI is useful.

butlike · 2026-03-30T16:39:31 1774888771

It's just repackaged Google results masquerading as an 'answer.' PageRank pulled results and displayed the first 10 relevant links and the LLM pulls tokens and displays the first relevant tokens to the query.

Just prompt it.

daveidol · 2026-03-30T10:50:46 1774867846

I’d argue putting everything in terms of black and white is the bigger issue than understanding nuance

instig007 · 2026-03-30T11:07:10 1774868830

Generalizing with "everything", "all", etc exclusive markers is exactly the kind of black/white divide you're arguing against. What happened to your nuanced reality within a single sentence? Not everything is black and white, but some situations are.

fc417fc802 · 2026-03-30T12:24:17 1774873457

The person he's replying to argued against putting things on a spectrum. Does that not imply painting everything in black and white? Thus his response seems perfectly sensible to me.

instig007 · 2026-03-30T18:57:20 1774897040

He argued against putting things in a spectrum in many instances where that would be wrong, including the case under the question. What's your argument against that idea? LLM'ed too much lately?

Den_VR · 2026-03-30T02:47:02 1774838822

I miss the www where the .html was written in vim or notepad.

mghackerlady · 2026-03-30T13:20:24 1774876824

It still can be. Do it. Go make your website in M$ Frontpage, for all I care

butlike · 2026-03-30T16:42:05 1774888925

Shameless plug: My music homepage follows the HTML 2.0 spec and is written by hand

https://sampleoffline.com/

mghackerlady · 2026-03-30T17:05:56 1774890356

heck yeah B)

consp · 2026-03-30T07:19:13 1774855153

Just did that for a test frontend for a module I needed to build (not my primary job, so I don't know anything about UI, but running in browsers was a requirement), so basic HTML with the bare minimum of JS and all DOM. Colleagues were very surprised. And yes, vim is still the go-to editor and will be for a long time, now that all "IDEs" are pushing "AI" slop everywhere.

holler · 2026-03-30T03:03:21 1774839801

ahh yes, fresh off reading "Html For Dummies" I made my first tripod.com site

sdsd · 2026-03-30T03:43:51 1774842231

For me it was making a petpage for my neopets using https://lissaexplains.com/

It's still up in all its glory.

DigiEggz · 2026-03-30T05:28:27 1774848507

This is great! The name reference also made me smile.

eloisius · 2026-03-30T06:39:22 1774852762

Also wild that from the tech bro perspective, the cost of journalism is just how much data transfer costs for the finished article. Authors spend their blood, sweat and tears writing and then OpenAI comes to Hoover it up without a care in the world about license, copyright or what constitutes fair use. But don’t you dare scrape their slop.

lelanthran · 2026-03-30T08:48:50 1774860530

> Also wild that from the tech bro perspective, the cost of journalism is just how much data transfer costs for the finished article.

Exactly. I think the unfairness can be mitigated if models trained on public information, or on data generated by a model trained on public information, or with either of those two in their ancestry, must be made public.

Then we don't have to hit (for example) Anthropic, we can download and use the models as we see fit without Anthropic whining that the users are using too much capacity.

mikkupikku · 2026-03-30T10:25:00 1774866300

[flagged]

jazzyjackson · 2026-03-30T13:36:52 1774877812

The library's archive is not a service provided by the newspaper

mikkupikku · 2026-03-30T16:15:19 1774887319

So? If the newspaper's website is willing to serve the documents, what's the problem?

The point is, if you're pleading with others to respect ""intellectual property"" then you're a worm serving corporate interests against your own.

jazzyjackson · 2026-03-30T21:29:21 1774906161

I may be a worm but at least I respect that others might have a different take on how best to make creative work an attainable way of life since before copyright law it was basically "have a wealthy patron who steered if not outright commissioned what you would produce"

1718627440 · 2026-03-30T11:08:33 1774868913

Off topic, but why is a DoS considered something to act on, often by just shutting down the service altogether? That results in the same DoS, just caused by the operator rather than by congestion. Actually it's worse, because now the requests will never be answered, rather than answered after some delay. Why is the default not to just don't do anything?

pocksuppet · 2026-03-30T12:39:34 1774874374

It keeps the other projects hosted on the same server or network online. Blackhole routes are pushed upstream to the really big networks and they push them to their edge routers, so traffic to the affected IPs is dropped near the sender's ISP and doesn't cause network congestion.

DDoSers who really want to cause damage now target random IPs in the same network as their actual target. That way, it can't be blackholed without blackholing the entire hosting provider.

echoangle · 2026-03-30T11:25:29 1774869929

I think some people use hosting that is paid per request/load, so having crawlers make unwanted requests costs them money.

ImPostingOnHN · 2026-03-30T11:39:15 1774870755

> Why is the default not to just don't do anything?

Because ingress and compute costs often increase with every request, to the point where AI bot requests rack up bills of hundreds or thousands of dollars more than the hobbyist operator was expecting to spend.

eru · 2026-03-30T03:44:41 1774842281

> I bet people being fucking DDOSed by AI bots disagree

Are you sure it's a DDoS and not just a DoS?

MattJ100 · 2026-03-30T06:57:05 1774853825

Yes, it is. The worst offenders hammer us (and others) with thousands upon thousands of requests, and each request uses unique IP addresses making all per-IP limits useless.

We implemented an anti-bot challenge and it helped for a while. Then our server collapsed again recently. The perf command showed that the actual TLS handshakes inside nginx were using over 50% of our server's CPU, starving other stuff on the machine.

It's a DDoS.
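To see why per-IP limits stop helping in that situation, here is a toy model. `PerIpLimiter` and `served` are hypothetical names, the addresses are made up, and the limiter is a single fixed window with no reset, which is simpler than anything a real server would run.

```python
from collections import defaultdict


class PerIpLimiter:
    """Toy fixed-window limiter: at most `max_per_window` requests per IP."""

    def __init__(self, max_per_window):
        self.max_per_window = max_per_window
        self.counts = defaultdict(int)

    def allow(self, ip):
        self.counts[ip] += 1
        return self.counts[ip] <= self.max_per_window


def served(ips, limit=10):
    """Number of requests that get past the limiter for a sequence of source IPs."""
    limiter = PerIpLimiter(limit)
    return sum(limiter.allow(ip) for ip in ips)


# 1000 requests from one address: only the first 10 are served.
single_source = served(["203.0.113.7"] * 1000)

# 1000 requests, each from a different address: every one is served.
rotating = served([f"10.{i // 256}.{i % 256}.1" for i in range(1000)])
```

Each address stays under its own limit, so the limiter never fires even though the total load is the same.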

troyvit · 2026-03-30T04:58:45 1774846725

You should see Cloudflare's control panel for AI bot blocking. There are dozens of different AI bots you can choose to block, and that doesn't even count the different ASNs they might use. So in this case I'd say that a DDoS is a decent description. It's not as bad as every home router on the eastern seaboard or something, but it's pretty bad.

Bilal_io · 2026-03-30T03:55:14 1774842914

Uncoordinated DDoS, when multiple search and AI companies are hammering your server.

catoc · 2026-03-30T06:50:18 1774853418

> Are you sure it's a DDoS and not just a DoS?

I think these days it’s ‘DAIS’, as in your site just DAIS - from Distributed/Damned AI Scraping

SolarNet · 2026-03-30T03:54:20 1774842860

When every AI company does it from multiple data centers... yes it's distributed.

lm411 · 2026-03-30T05:01:39 1774846899

> Also the fucking ignorance assuming it's "static content" and not something needing code running

Wild eh.

If it's not ai now, it's by default labelled "static content" and "near-zero marginal cost".

littlestymaar · 2026-03-30T05:32:30 1774848750

What's a database after all.

nikitaga · 2026-03-30T12:12:34 1774872754

All this reactionary outrage in the comments is funny. And lame.

Yes, for the vast majority of the internet, serving traffic is near zero marginal cost. Not for LLMs though – those requests are orders of magnitude more expensive.

This isn't controversial at all, it's a well understood fact, outside of this irrationally angry thread at least. I don't know, maybe you don't understand the economic term "marginal cost", thus not understanding the limited scope of my statement.

If such DDOSes as you mention were common, such a scraping strategy would not have worked for the scraper at all. But no, they're rare edge cases, from a combination of shoddy scrapers and shoddy website implementations, including the lack of even basic throttling for expensive-to-serve resources.

The vast majority of websites handle AI traffic fine though, either because they don't have expensive to serve resources, or because they properly protect such resources from abuse.

If you're an edge case who is harmed by overly aggressive scrapers, take countermeasures. Everyone with that problem should, that's neither new nor controversial.

ipaddr · 2026-03-30T12:59:33 1774875573

"such DDOSes as you mention were common, such a scraping strategy would not have worked for the scraper at all"

They are common. The strategy works for the llm but not for the website owner or users who can't use a site during this attack.

The majority of sites are not handling AI fine. Getting DDoSed only part of the time is not acceptable. Countermeasures like blocking huge ranges can help but also lock out legitimate users.

nikitaga · 2026-03-30T20:15:22 1774901722

> They are common

Any actual evidence of the alleged scope of this problem, or just anecdotes from devs who are mad at AI, blown out of proportion?

ipaddr · 2026-03-30T21:15:24 1774905324

Love AI, so it can't be that. Not devs, website owners. Yes, ask AI for stats.

fireflash38 · 2026-03-30T12:52:00 1774875120

It's not a cost for me to scrape LLM.

It is a cost for me for LLM to scrape me.

Why should I care about costs that they have when they don't care about the costs I have?

grayhatter · 2026-03-30T12:51:45 1774875105

The extent of the utilization is new.

The number of bots that try to hide who they are, and don't bother to even check robots.txt is new.

juliangmp · 2026-03-30T18:57:34 1774897054

"They are rare edge cases" are we on the same internet?

expedition32 · 2026-03-30T14:00:57 1774879257

One euro is marginal for me for someone else it is their daily meal.

not2b · 2026-03-30T01:10:35 1774833035

I understand why OpenAI is trying to reduce its costs, but it simply isn't true that AI crawlers aren't creating very significant load, especially those crawlers that ignore robots.txt and hide their identities. This is direct financial damage and it's particularly hard on nonprofit sites that have been around a long time.

zer00eyz · 2026-03-30T04:42:21 1774845741

> but it simply isn't true that AI crawlers aren't creating very significant load.

And how much of this is users who are tired of walled gardens and enshittification? We murdered RSS, APIs and the "open web" in the name of profit and lock-in.

There is a path where "AI" turns into an ouroboros, tech eating itself, before being scaled down to run on end user devices.

stingraycharles · 2026-03-30T02:41:36 1774838496

These are ChatGPT and Claude Desktop crawlers we’re talking about? Or what is it exactly? Are these really creating significant load while not honoring robots.txt?

Genuinely interested.

63stack · 2026-03-30T09:06:56 1774861616

Is this the first time you are reading HN? Every day there are posts from people describing how AI crawlers are hammering their sites, with no end. Filtering user agents doesn't work because they spoof it, filtering IPs doesn't work because they use residential IPs. Robots.txt is a summer child's dream.

miki123211 · 2026-03-30T08:48:40 1774860520

They seem to mostly be third-party upstarts with too much money to burn, willing to do what it takes to get data, probably in hopes of later selling it to big labs. Maaaybe Chinese AI labs too, I wouldn't put it past them.

OpenAI et al seem to mostly be well-behaved.

cruffle_duffle · 2026-03-30T02:54:55 1774839295

I bet dollars to doughnuts that 95% of the traffic is from Claude and ChatGPT desktop / mobile and not literal content scraping for training.

crote · 2026-03-30T04:10:46 1774843846

That wouldn't explain the 1000x increase in traffic for extremely obscure content, or seeing it download every single page on a classic web forum.

duttish · 2026-03-30T06:52:19 1774853539

And doing it over, and over, and over and over again. Because sure it didn't change in the last 8 years but maybe it's changed since yesterday's scrape?

cicko · 2026-03-30T06:19:46 1774851586

Interesting how other people's cost is "near-zero marginal cost" while yours is "an expensive LLM service". Also, others' rights are "fairly controversial ideas about copyright and fair use" while yours is "direct financial damage". I like how you frame this.

lm411 · 2026-03-30T03:18:05 1774840685

That is ridiculous.

You imply that "an expensive llm service" is harmed by abuse, but, every other service is not? Because their websites are "static" and "near-zero marginal cost"?

You have no clue what you are talking about.

camillomiller · 2026-03-30T05:23:49 1774848229

Well he’s a simp

sandeepkd · 2026-03-30T02:36:35 1774838195

Let's not try to qualify the wrongs by picking a metric and evaluating just one side of it. A static website owner could be running with a very small budget and the scraping from bots can bring down their business too. The chances of a static website owner burning through their own life savings are probably higher.

expedition32 · 2026-03-30T07:55:15 1774857315

Perhaps the long play is to destroy all small hobby websites until only an AI-directed web is left.

miki123211 · 2026-03-30T08:51:37 1774860697

If you're truly running a static site, you can run it for free, no matter how much traffic you're getting.

GitHub Pages is one way, but there are other platforms offering similar services. Static content just isn't that expensive to host.

The troubles start when you're actually running something dynamic that pretends to be static, like WordPress or MediaWiki. You can still reduce costs significantly with CDNs / caching, but many don't bother and then complain.

ezrast · 2026-03-30T15:36:07 1774884967

Setting aside the notion that a site presenting live-editability as its entire core premise is "pretending to be static", do the actual folks at Wikimedia, who have been running a top 10 website successfully for many years, and who have a caching system that worked well in the environment it was designed for, and who found that that system did not, in fact, trivialize the load of AI scraping, have any standing to complain? Or must they all just be bad at their jobs?

https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-th...

jazzyjackson · 2026-03-30T13:40:40 1774878040

It's true it can be done, but many business owners are not hip to Cloudflare R2 buckets or GitHub Pages. Many are still paying for a whole dedicated server to run Apache (and WordPress!) to serve static files. These sites will go down when hammered by unscrupulous bots.

alsetmusic · 2026-03-30T02:33:33 1774838013

Have you not seen the multiple posts that have reached the front page of HN with people taking self-hosted Git repos offline or having their personal blogs hammered to hell? Cause if you haven't, they definitely exist and get voted up by the community.

AmbroseBierce · 2026-03-30T04:49:13 1774846153

It's not like those models are expensive because of the usefulness that they extracted from scraping others without permission, right? You are not even scratching the surface of the hypocrisy.

VadimPR · 2026-03-30T06:10:01 1774851001

Getting scraped by abusive bots who bring down the website because they overload the DB with unique queries is not marginal. I spent a good half of last year with extra layers of caching, CloudFlare, you name it because our little hobby website kept getting DDoS'd by the bots scraping the web for training data.

Never in 15 years of running the website did we have such issues, and you can be sure that cache layers were in place already for it to last this long.

wolvoleo · 2026-03-30T05:59:19 1774850359

It's more ironic because without all the scraping OpenAI has done, there would have been no ChatGPT.

Also, it's not just the cost of the bandwidth and processing. Information has value too. Otherwise they wouldn't bother scraping it in the first place. They compete directly with the websites featuring their training data and thus they are taking away value from them just as the bots do from ChatGPT.

In fact the more I think of it, I think it's exactly the same thing.

expedition32 · 2026-03-30T07:59:01 1774857541

This leads me to thinking: I ask chatGPT a question and they get the answer from gamefaqs.

But what happens if gamefaqs disappears because of lack of traffic?

Can LLM actually create or only regurgitate content.

Aerroon · 2026-03-30T13:15:56 1774876556

>Can LLM actually create or only regurgitate content.

Contrary to what others say, LLMs can create content. If you have a private repo you can ask the LLM to look at it and answer questions based on that. You can also have it write extra code. Both of these are examples of something that did not exist before.

In terms of gamefaqs, I could theoretically see an LLM play a game and based on that write about the game. This is theoretical, because currently LLMs are nowhere near capable enough to play video games.

wolvoleo · 2026-03-30T08:26:43 1774859203

It will remain in their scraped data so they can keep including it in their later training datasets if they wish. However it won't be able to do live internet searches anymore. And it will not generate new content of course. Especially not based on games released after the site goes down, so it doesn't know. Though it could of course correlate data from other sources that talk about the game in question.

stefanka · 2026-03-30T08:08:27 1774858107

They cannot create original content.

wolvoleo · 2026-03-30T08:27:59 1774859279

Well they can make some up, like hallucination. That's an additional problem: when the original site that provided the training data is gone, how can they verify the AI output to make sure it's correct?

unsungNovelty · 2026-03-30T12:21:29 1774873289

"near-zero marginal costs". For whom exactly????

https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...

lelanthran · 2026-03-30T08:29:56 1774859396

I don't think a rule along the lines of "Doing $FOO to a corporate is forbidden, but doing $FOO to a charitable initiative is fine" is at all fair.

What "$FOO" actually is, is irrelevant. I'm curious how you would convince people that this sort of rule is fair.

The corp can always ban users who break ToS, after all. They don't need any help. The charitable initiative can't actually do that, can they?

ungreased0675 · 2026-03-30T11:39:03 1774870743

You’re describing the tragedy of the commons. No single raindrop thinks it’s responsible for the flood.

grishka · 2026-03-30T08:27:27 1774859247

> Scraping static content from a website at near-zero marginal cost to its server

It's not possible to know in advance what is static and what is not. I have some rather stubborn bots make several requests per second to my server, completely ignoring robots.txt and rel="nofollow", using residential IPs and browser user-agents. It's just a mild annoyance for me, although I did try to block them, but I can imagine it might be a real problem for some people.

I'm not against my website getting scraped, I believe being able to do that is an important part of what the web is, but please have some decency.

xmcqdpt2 · 2026-03-30T11:29:16 1774870156

AI providers also claim to have small marginal costs. The cost of tokens is supposedly based on pricing in model training, so not that different from e.g. your server costs being low but the content production costs being high. And in many cases AI companies are direct competitors (artists, musicians etc.)

(TBH it's not clear to me that their marginal costs are low. They seem to pick based on narrative.)

ori_b · 2026-03-30T13:47:38 1774878458

My website serving git that only works from Plan 9 is serving about a terabyte of web traffic monthly. Each page load is about 10 to 30 kilobytes. Do you think there's enough organic, non-scraper interest in the site that scrapers are a near-zero part of the cost?
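To put rough numbers on that: a back-of-the-envelope sketch, assuming "about a terabyte" means 10**12 bytes and a 30-day month (both are assumptions, not figures from the site). It works out to somewhere between roughly 33 million and 100 million page loads a month, on the order of 13 to 39 requests per second around the clock.

```python
# Rough arithmetic only. Both constants below are assumptions for illustration.
MONTHLY_BYTES = 10**12               # "about a terabyte", taken as decimal TB
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumes a 30-day month


def page_loads(page_bytes: int) -> int:
    """Number of whole page loads that fit into the monthly traffic."""
    return MONTHLY_BYTES // page_bytes


for kb in (10, 30):
    loads = page_loads(kb * 1000)
    per_second = loads / SECONDS_PER_MONTH
    print(f"{kb} KB/page: ~{loads:,} loads/month, ~{per_second:.0f}/second")
```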

the_sleaze_ · 2026-03-30T04:14:09 1774844049

60% of our traffic is bot, on average. Sometimes almost 100%.

razingeden · 2026-03-30T01:25:17 1774833917

It is direct financial damage if my server’s not on an unmetered connection — after years of bills coming in around $3/mo I got a surprise >$800 bill on a site nobody on earth appears to care about besides AI scrapers.

It hasn’t even been updated in years so hell if I know why it needs to be fetched constantly and aggressively, but fuck every single one of these companies now whining about bots scraping and victimizing them, here’s my violin.

gzread · 2026-03-30T05:31:08 1774848668

If you can identify the scraper you should have a valid legal case to recover damages.

thisislife2 · 2026-03-30T14:37:34 1774881454

Only if they had a robots.txt for their site.

razingeden · 2026-03-30T14:58:21 1774882701

I hadn’t even considered that. Don’t know why that comment is greyed out or downvoted.

It’s a static site that hasn’t been updated since 2016, so it has since been moved to Cloudflare R2, where it’s getting a $0.00 bill, and it now has a disallow / directive. I’m not sure if it’s being obeyed because the CF dash still says it’s getting 700-1300 hits a day even with all the anti-bot, “cf managed robots” stuff for AI crawlers in there.

The content is so dry and irrelevant I just can’t even fathom 1/100th of that being legitimate human interest but I thought these things just vacuumed up and stole everyone’s content instead of nailing their pages constantly?
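For reference, this is roughly what a disallow / rule looks like from a well-behaved crawler's side, sketched with Python's standard `urllib.robotparser`. The robots.txt text, the `ExampleBot` name and the example.com URLs are made up for illustration. The check runs in the crawler, so a bot that never performs it is not stopped by the file at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies, for illustration only.
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"
DISALLOW_PRIVATE = "User-agent: *\nDisallow: /private/\n"


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler would consider `url` fetchable."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a placeholder user agent, not a real crawler.
print(allowed(DISALLOW_ALL, "ExampleBot", "https://example.com/page.html"))
print(allowed(DISALLOW_PRIVATE, "ExampleBot", "https://example.com/page.html"))
```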

gzread · 2026-03-30T15:02:32 1774882952

No, it's still illegal to DDoS sites that don't have robots.txt.

thisislife2 · 2026-03-30T15:46:21 1774885581

You are right, I hadn't considered that aspect.

not_your_vase · 2026-03-30T05:06:24 1774847184

  > net-zero marginal cost

Lol, you single-handedly created a market for Anubis, and in the past 3 years the cloudflare captchas have multiplied by at least 10-fold, now they are even on websites that were very vocal against it. Many websites are still drowning - gnu family regularly only accessible through wayback machine.

Spare me your tears.

foobiekr · 2026-03-30T14:33:49 1774881229

You are, of course, ignoring the production costs of the static content that OpenAI is stealing.

Stop justifying their anti-social behavior because it lines your pockets.

SkiFire13 · 2026-03-30T05:47:37 1774849657

> Scraping static content

How do you know the content is static?

bakugo · 2026-03-29T23:55:08 1774828508

The cost is so marginal that many, many websites have been forced to add cloudflare captchas or PoW checks before letting anyone access them, because the server would slow to a crawl from 1000 scrapers hitting it at once otherwise.

mcfedr · 2026-03-30T16:24:38 1774887878

I'm sure the copyright holders would consider your use of their content as direct financial damage

heyethan · 2026-03-30T02:41:15 1774838475

I think this also explains why the checks are moving up the stack.

If the real cost is in actually running the app or the model, then just verifying a browser isn’t enough anymore. You need to verify that the expensive part actually happened.

Otherwise you’re basically protecting the cheapest layer while the expensive one is still exposed.

swagmoney1606 · 2026-03-30T02:09:25 1774836565

And yet I have to pay in my time and cash to handle the constant ddos'es from the constant LLM scraping

gmerc · 2026-03-30T07:04:19 1774854259

It’s not for techbros to decide at what threshold of theft it’s actually theft. “My GPU time is more valuable than your CPU time” isn’t a thing and Wikipedia’s latest numbers on scraping show that marginal costs at scale are a valid concern.

nozzlegear · 2026-03-30T02:35:36 1774838136

Are they, actually?

make3 · 2026-03-30T05:10:53 1774847453

Absolutely not, the former relies on controversial ideas to qualify as legal.

Stealing the content from the whole planet & actively reducing the incentive to visit the sites without financial restitution is pretty bad.

AtlasBarfed · 2026-03-30T00:59:40 1774832380

Because you say it is?

I obviously disagree. I mean, on top of this we are talking about not-open OpenAI.

nickphx · 2026-03-30T10:55:41 1774868141

Speak for yourself.

karlshea · 2026-03-30T00:59:21 1774832361

I don’t know what world you live in but it’s not this one.

platybubsy · 2026-03-30T07:04:20 1774854260

Bait or genuine techbro? Hard to say

andrepd · 2026-03-30T14:19:54 1774880394

> Scraping static content from a website at near-zero marginal cost to its server

The gall. https://weirdgloop.org/blog/clankers

nslsm · 2026-03-29T23:56:59 1774828619

The issue is that there are so many awful webmasters that have websites that take hundreds of milliseconds to generate and are brought down by a couple requests a second.

bakugo · 2026-03-30T00:16:39 1774829799

OpenAI must be the most awful webmasters of all, then, to need such sophisticated protections.

miki123211 · 2026-03-30T08:43:28 1774860208

It's not scraping they're concerned about, it's abusing free GPU resources to (anonymously) generate (abusive) content.

heyethan · 2026-03-30T02:14:08 1774836848

I think the distinction is less about scraping itself, and more about marginal cost.

Scraping static pages is cheap for both sides. Scraping an LLM-backed service effectively externalizes compute costs onto the provider.

Same behavior, very different economics.

crote · 2026-03-30T04:37:35 1774845455

Very few websites are truly static. Something like a Wordpress website still does a nontrivial amount of compute and DB calls - especially when you don't hit a cache.

There's also the cost asymmetry to take into account. Running an obscure hobby forum on a $5 / month VPS (or cloud equivalent) is quite doable, having that suddenly balloon to $500 / month is a Really Big Deal. Meanwhile, the LLM company scraping it has hundreds of millions of VC funding, they aren't going to notice they are burning a few million because their crappy scraper keeps hammering websites over and over again.

everdrive · 2026-03-29T21:43:40 1774820620

It's getting to the point where a user needs at minimum two browsers. One to allow all this horrendous client checking so that crucial services work, and another browser to attempt to prevent tracking users across the web.

Nick, I understand the practical realities regarding why you'd need to try to tamp down on some bot traffic, but do you see a world where users are not forced to choose between privacy and functionality?

mememememememo · 2026-03-29T22:44:23 1774824263

Local models for privacy.

You want to go to the world's best hotel? You are gonna be on their CCTV. Staying at home is crappier but private.

Unfortunately for the first time Moore's law isn't helping (e.g. give a poor person an old laptop and install Linux, they will be fine). They can do that and all good except no LLM.

karlgkk · 2026-03-29T23:30:16 1774827016

> You want to go to the world's best hotel? You are gonna be on their CCTV.

ironically, in high end hotels, there's often a lot less cctv. not none. just less. rich people enjoy privacy

xtajv · 2026-03-30T11:50:57 1774871457

In hotels of all tax brackets, you usually get a room key.

And the salient difference is that CCTV is simply defense-in-depth, not a primary means for authentication.

Barbing · 2026-03-29T23:58:41 1774828721

So they’re not just hidden better? Does make sense.

Well, I can use the world‘s best safety deposit box without being on CCTV while I pass secrets in and out of it, right? Just not for free.

Bummer, this sounds like it is about to turn into a Monero ad (“let us pay privately”)

wolvoleo · 2026-03-30T06:12:16 1774851136

Probably not even hidden because rich people are also catching a lot of legal winds, in which case the hotel has no choice but to provide the material. Better not to have it in the first place. You don't want your hotel cams listed as evidence in a 500M$ divorce case I guess.

Also are hidden cameras even legal? I know here in EU they aren't.

nozzlegear · 2026-03-30T02:39:29 1774838369

> Staying at home is crappier but private.

Doesn't make sense, my home is much more preferable to a hotel

hedora · 2026-03-30T05:13:33 1774847613

With any luck, local models will be too (soon).

littlestymaar · 2026-03-30T05:36:28 1774848988

My local models didn't get >20h of outage this quarter like Claude did so in a way it's already the case.

0x3f · 2026-03-29T21:49:34 1774820974

Meet me in a cafe and I will sign a JWT saying you're not a bot. You can submit this to whoever will accept it.
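For the curious, a minimal sketch of what signing such a token could look like, using only the Python standard library and the HS256 (HMAC-SHA256) variant of JWT. The claim names (`sub`, `human`) and the secret are invented for illustration; nothing here is an existing attestation scheme.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, as JWT uses."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_jwt(claims: dict, secret: bytes) -> str:
    """Build a compact HS256 JWT: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    ]
    signing_input = ".".join(parts)
    sig = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"


def verify_jwt(token: str, secret: bytes) -> bool:
    """Check only the signature; a real verifier would also check claims."""
    if token.count(".") != 2:
        return False
    signing_input, sig = token.rsplit(".", 1)
    expected = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected).encode("utf-8"), sig.encode("utf-8"))


# Hypothetical claims and secret, for illustration only.
token = sign_jwt({"sub": "someone-i-met-in-a-cafe", "human": True}, b"demo-secret")
print(token)
print(verify_jwt(token, b"demo-secret"))
```

Note that HS256 is symmetric, so anyone able to verify the token could also forge one. For "whoever will accept it" you would want an asymmetric algorithm such as RS256 or ES256, where verifiers only hold the public key.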

magicseth · 2026-03-29T21:59:45 1774821585

If Apple approves it, I've got a solution: a keyboard that attests to your humanity https://typed.by/magicseth/2451#2NyGLfAQxmqRiAOTlaX7ma3G4d1o...

mzajc · 2026-03-29T22:10:51 1774822251

Brilliant! Just the thing we want: more hardware attestation, more deanonymization, less user control, all diligently orchestrated in a repository where the only contributor is Anthropic Claude [0]. Comes complete with a misaligned ASCII diagram in the README to show how much effort the humans behind it put in!

Yes, even their "humanifesto" is LLM output, and is written almost exclusively in the "it's not X <emdash> it's Y" style.

[0]: https://github.com/magicseth/keywitness/graphs/contributors

delish · 2026-03-29T22:28:20 1774823300

Those are all situationally-valid criticisms, but I've long thought the ability to have smartphones' cameras cryptographically sign photos is good when available. The use case is demonstrating a photo wasn't doctored, and that it came from a device associated with e.g. a journalist, who maintains a public key. Of course, it should be optional.

magicseth · 2026-03-29T23:43:14 1774827794

Yes! That's what I'm getting at. This protocol optionally allows you to sign with your private key, but you don't have to for the protocol to provide utility. It could just be enough to say "if you trust magicseth's binary and apple, then this was typed one letter at a time"

There's nothing stopping folks from typing a message an LLM wrote one at a time, but the idea of increasing the human cost of sending messages is an interesting one, or at least I thought :-(

johnmaguire · 2026-03-30T02:53:32 1774839212

The problem is that it's not optional to end-users if sites enforce its use.

hedora · 2026-03-30T05:16:33 1774847793

The other problem is that the device or company might decide not to attest for you.

For instance, the employee at Apple that decided to pull ICE Block from the store could decide that the "admissible in court" bit should be false if it looks like a police officer is in frame.

Similarly, the keyboard could decide your social credit score is too low, and just stop attesting. A court could order this behavior.

Or, you could fail mandatory age / id verification because your credit card expired, and then all the above + more could happen! Good luck getting through to credit card tech support at that point...

magicseth · 2026-03-29T23:35:44 1774827344

Hi! I want anonymity! I also want to be able to prove what level of effort has been put in to something. I think there's room for both. This is an encrypted proof that I wrote something on a keyboard that tracks fingers. The protocol allows you to optionally sign it with your identity, but that isn't strictly required.

It is an attempt at putting something into the conversation more than just "OSS is broken because there are too many slop PRs." What if OSS required a human to attest that they actually looked at the code they're submitting? This tool could help with that.

Yes LLMs were used greatly in the production of this prototype!

It doesn't change the goal of the experiment! or its potential utility! Do you see any potential area in your world where some piece of this is valuable?

Arainach · 2026-03-29T22:48:39 1774824519

> Yes, even their "humanifesto" is LLM output, and is written almost exclusively in the "it's not X <emdash> it's Y" style.

....no. There's not a single occurrence of that.

https://keywitness.io/manifesto

There are six emdashes on that page. NONE of them are "it's not X it's Y".

> Emails, messages, essays, code reviews, love letters — all suspect.

> We believe this can be solved — not by detecting AI, but by proving humanity.

> KeyWitness captures cryptographic proof at the point of input — the keyboard.

> When you seal a message, the keyboard builds a W3C Verifiable Credential — a self-contained proof that can be verified by anyone, anywhere, without trusting us or any central authority.

> That's an alphabet of 774 symbols — each carrying log2(774) ≈ 9.6 bits. 27 emoji for 256 bits.

> They're a declaration: this message was written by a person — one of the diverse, imperfect, irreplaceable humans who still choose to type their own words.

Clarifications: 4

Continuation from a list: 1

Could just be a comma: 1

"It's not X -- it's Y": 0.

If you're going to make lazy commentary about good writing being AI, please at least be sure that you're reading the content and saying accurate things.

magicseth · 2026-03-29T23:37:57 1774827477

It is largely written by iteration with an LLM! No need to speculate or analyze em dashes :-)

The emoji idea was mine. I like it :-) unfortunately it doesn't work in places like HN that strip out emoji. So I had to make a base64 encoding option.

The goal was to create an effective encryption key for the url hash (so it doesn't get sent to the server). And encoding skin tone with human emojis allows a super dense bit/visual character encoding that ALSO is a cute reference to the humans I'm trying to center with this project!
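The density claim is easy to check. A quick sketch using the 774-symbol alphabet size quoted from the manifesto above, with base64 (64 symbols) as the comparison point; the helper name is mine, not the project's.

```python
import math

KEY_BITS = 256  # size of the key being encoded


def symbols_needed(alphabet_size: int, bits: int = KEY_BITS) -> int:
    """Smallest number of symbols from the alphabet that can carry `bits` bits."""
    return math.ceil(bits / math.log2(alphabet_size))


for name, size in (("emoji alphabet", 774), ("base64", 64)):
    per_symbol = math.log2(size)
    print(f"{name}: {per_symbol:.2f} bits/symbol, "
          f"{symbols_needed(size)} symbols for {KEY_BITS} bits")
```

So the quoted figure of 27 emoji for 256 bits holds up, against 43 characters for unpadded base64.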

josephg · 2026-03-29T23:00:12 1774825212

> We believe this can be solved — not by detecting AI, but by proving humanity

“It's not X -- it's Y": 1

dandellion · 2026-03-29T23:02:50 1774825370

It's either a bot, or someone who writes exactly like a bot. I don't care which it is, both go to the discard pile.

magicseth · 2026-03-29T23:38:11 1774827491

phew!

arrowsmith · 2026-03-30T02:01:11 1774836071

It’s a product for people who need help telling whether text was written by AI.

Maybe they deliberately write it like that, to filter out people who aren’t the target market?

arrowsmith · 2026-03-30T01:59:37 1774835977

From their “how it works” page:

> The server stores an encrypted blob it can't decrypt. We couldn't read your messages even if we wanted to. That's not a policy — it's math.

If you can’t tell that this is AI slop then maybe KeyWitness does solve a real problem after all.

Velocifyer · 2026-03-29T22:56:06 1774824966

magicseth · 2026-03-29T23:40:18 1774827618

Oh you think it's stupid? It was an attempt to encode an encryption key that isn't sent to the server in a way that is minimally invasive. The skin tone emojis allow pretty high byte density, and also are cute!

Sorry it doesn't meet your needs.

There is irony in having an ai generated humanifesto. Could it be intentional? hmm?

Is there no irony in deriding a project for being potentially LLM generated, when its goal is to aid people in differentiating? :shrug:

Terretta · 2026-03-29T23:58:24 1774828704

The first widely distributed and open source version of this typist timing validation idea I saw (and incorporated into my own software at the time) was released by Michael Crichton as part of a password 2nd-factor checker (1st factor a known phrase or even your name, the 2nd factor being your idiosyncratic typing pattern) in Creative Computing magazine that printed the code.

Original here: https://archive.org/details/sim_creative-computing_1984-06_1...

arrowsmith · 2026-03-30T02:11:13 1774836673

You’re getting a negative reaction from others but I share this feedback in good faith: I don’t understand what problem your product is supposed to solve.

Yeah I guess the cryptographic stuff sounds vaguely impressive although it’s been a long time since I had to think about cryptography in detail. But what is this _for_? I’m going to buy an expensive keyboard so that I can send messages to someone and they’ll know it’s really me – but it has to be someone who a) doesn’t trust me or any of our existing communication channels and b) cares enough to verify using this weird software? Oh and it’s important they know I sent it from a particular device out of the many I could be using?

Who is that person? What would I be sending them? What is the scenario where we would both need this?

Also the server can’t read the message but the decryption key is in the URL? So anyone with the URL can still read it? Then why even bother encrypting it?

Maybe this is one of those cases where I’m so far outside your target market that it was never supposed to make sense to me but I feel like I’m missing something here. Or maybe you need to work on your elevator pitch.

Just sharing my honest reaction.
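On the key-in-the-URL question: the part of a URL after `#` (the fragment) is not included in the HTTP request, which is presumably what "doesn't get sent to the server" relies on. A small sketch with Python's `urllib.parse`; the URL and key material are made up. It also shows the other half of the point: anyone holding the full URL still holds the key.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Path and query as they would appear in the HTTP request line.

    The fragment is deliberately left out, since clients do not send it.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


# Hypothetical URL, for illustration only.
url = "https://example.com/m/2451?v=1#hypothetical-key-material"
print("server sees:  ", request_target(url))
print("stays local:  ", urlsplit(url).fragment)
```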