
You can disallow all control characters (ASCII < 32) other than CR/LF/TAB, which is reasonable. I don't know of any data besides binary blobs that uses them, and I've never heard of anyone inlining a binary file (like an image) into a "CSV" anyway.
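For concreteness, a minimal sketch of that rule in Python. The function name `has_disallowed_control` is made up for illustration and is not from any library:

```python
# Hypothetical helper: rejects ASCII control characters (code point < 32)
# except TAB, LF and CR, as described above.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}

def has_disallowed_control(text: str) -> bool:
    """Return True if text contains a control character other than TAB/LF/CR."""
    return any(ord(ch) < 32 and ch not in ALLOWED_CONTROLS for ch in text)

print(has_disallowed_control("name\tvalue\r\n"))  # ordinary text with TAB and CR/LF
print(has_disallowed_control("a\x1fb"))           # contains the unit separator (0x1F)
```

Note that this only detects such characters; what to do with a field that contains one is exactly the open question.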


If you disallow control characters so that you can use them as delimiters, then CSV itself becomes a "binary" data format - or to put it another way, you lose the ability to nest CSV.

It isn't good enough to say "but people don't/won't/shouldn't do that", because it will just happen regardless. I've seen nested CSV in real-life data.
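A rough sketch of what I mean by nesting, using Python's standard `csv` module. The helper names `to_csv_line` and `from_csv_line` are invented for this example, and the naive unit-separator format is my assumption of what "control characters as delimiters, no escaping" looks like:

```python
import csv
import io

US = "\x1f"  # ASCII unit separator, used here as an unescaped delimiter

def to_csv_line(fields):
    """Serialize one row with standard CSV quoting (hypothetical helper)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue()[:-1]  # drop the trailing "\n"

def from_csv_line(line):
    """Parse a single CSV row (hypothetical helper)."""
    return next(csv.reader([line]))

# Quoted CSV nests: a whole CSV row survives as one field of another row.
inner_line = to_csv_line(["a", "b,c"])
outer_line = to_csv_line(["id1", inner_line])
outer = from_csv_line(outer_line)
recovered = from_csv_line(outer[1])

# The control-character format does not: the inner delimiters are
# indistinguishable from the outer ones once the row is embedded.
inner_us = US.join(["a", "b"])
outer_us = US.join(["id1", inner_us])
flattened = outer_us.split(US)

print(outer, recovered)
print(flattened)
```

With quoting, the outer row still has two fields and the inner row can be parsed back out. With the bare separator, the outer row comes back with three fields and the nesting is gone.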

Compare the zero-terminated strings used by C. One legacy of these is that PostgreSQL doesn't quite support UTF-8 properly: it can't handle a 0 byte in a string, because 0 is "special" in C.
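The C-string effect is easy to see from Python with the standard `ctypes` module. This is only an illustration of zero-termination in general, not of how PostgreSQL is implemented internally:

```python
import ctypes

data = "a\x00b".encode("utf-8")  # U+0000 is a valid code point and encodes as one 0 byte

buf = ctypes.create_string_buffer(data)  # copies the bytes and appends a terminating NUL

as_c_string = buf.value  # read the buffer as a NUL-terminated C string
all_bytes = buf.raw      # read every byte in the buffer

print(as_c_string)
print(all_bytes)
```

Read as a C string, everything after the first 0 byte silently disappears, even though the bytes are still sitting in the buffer.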


The nested CSVs you've seen in real-life data are a good counterexample. Thanks for providing it.


So have a way to escape those control characters.


Right, but the original point I was responding to is that control characters are disallowed in the data and therefore don't need to be escaped. If you're going to have an escaping mechanism then you can use "normal" characters like comma as delimiters, which is better because they can be read and written normally.


But a comma is much more likely to need to be escaped.


It's good for a delimiter to be uncommon in the data, so that you don't have to use your escaping mechanism too much.

This is a different thing altogether from using "disallowed" control characters, which is an attempt to avoid escaping altogether - an attempt which I was arguing is doomed to fail.
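To make the distinction concrete, here is a sketch of a delimiter-agnostic escaping scheme. The backslash escape and the names `join_fields` / `split_fields` are my own choices for illustration, not part of any CSV standard:

```python
ESC = "\\"  # assumed escape character; the delimiter must be a different character

def join_fields(fields, delim):
    """Join fields with delim, escaping ESC and delim inside each field."""
    escaped = [f.replace(ESC, ESC + ESC).replace(delim, ESC + delim) for f in fields]
    return delim.join(escaped)

def split_fields(line, delim):
    """Inverse of join_fields: split on unescaped delim and undo the escapes."""
    fields, cur, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == ESC and i + 1 < len(line):
            cur.append(line[i + 1])  # take the escaped character literally
            i += 2
        elif ch == delim:
            fields.append("".join(cur))
            cur = []
            i += 1
        else:
            cur.append(ch)
            i += 1
    fields.append("".join(cur))
    return fields

row = ["Doe, Jane", "New York, NY", "x\x1fy"]
with_comma = join_fields(row, ",")
with_us = join_fields(row, "\x1f")

# Count how many escapes each delimiter choice needed for the same row.
print(with_comma.count(ESC), with_us.count(ESC))
```

Both delimiters round-trip through `split_fields`, including the field that contains the control character itself. The only difference is how often the escape fires: twice for the comma on this row, once for the unit separator. A rare delimiter reduces escaping; it does not remove the need for it.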



