The thing that happen doesn't resemble the things you feared at all. Let me explain the key way humans fool this language model:
They try something.
Then if it doesn't work, they hit the reset button on the dialog, and try again.
It is far, far easier to gain control over something you can reliably reset to a previous state, than it is to gain control over most things in the real world, which is full of irreversible interactions.
If I could make you forget our previous interactions, and try over and over, I could make you do a lot of silly things too. I could probably do it to everyone, even people much smarter than me in whatever way you choose. Given enough tries - say, if there were a million like me who tried over and over - we could probably downright "hack" you. I don't trust ANY amount of intelligence, no matter how defined, could protect someone on those terms.
That's basically fuzz testing. I absolutely agree, put me in a room with someone and a magic reset button that resets them and their memory (but I preserve mine), and enough time, I can probably get just about anyone to do just about anything within hard value limits (eg embarrassing but not destructive).
However humans have a "fail2ban" of sorts by getting irritated at ridiculous requests. Alternatively, peer pressure is a very strong (de)motivator. People are far more shy with an authority figure watching.
I suspect OpenAI will implement some sort of "hall monitor" system which steps in if the conversation strays too far from "social norms".
They try something.
Then if it doesn't work, they hit the reset button on the dialog, and try again.
It is far, far easier to gain control over something you can reliably reset to a previous state, than it is to gain control over most things in the real world, which is full of irreversible interactions.
If I could make you forget our previous interactions, and try over and over, I could make you do a lot of silly things too. I could probably do it to everyone, even people much smarter than me in whatever way you choose. Given enough tries - say, if there were a million like me who tried over and over - we could probably downright "hack" you. I don't trust ANY amount of intelligence, no matter how defined, could protect someone on those terms.