Great post, digestible even without experience in Cap'n Proto specifics.
I really like that they talk in terms of “features carrying their own weight” and survey real-world code to see which cases features solve for, and how surprisingly little certain features are actually used.
A side effect of dropping features when designing for code generation (in this case) is that it makes things more concise for everyone else as well, including the users.
There are generally no easy answers, but as a principle I think the burden of proof for new features should be high. “Oh it’s nice” is not enough, unless it’s trivial syntactic sugar. It should really be at least a strong need and ideally multiple needs per feature, so to speak.
Premature optimization is the root of all evil - again. Applies to performance and features. Unless you can prove (not formally, but for yourself) that a feature will make things better, don't implement it.
I think one of the most desirable and under-appreciated goals of schema languages and serialization formats is safety. These tools are typically used in places that deal with untrusted inputs, and features and design choices can go a long way in either exposing or shielding developers from potential safety bugs.
My read of Cap’n Proto didn’t make it sound like safety was the highest priority. At least not above performance.
* The very first use case for which Cap'n Proto was designed was to be the protocol that Sandstorm.io used to talk between sandbox and supervisor -- an explicitly adversarial security scenario.
* The documentation explicitly calls out how implementations should manage resource-exhaustion problems like unbounded recursion depth (stack overflow risk), where many serialization formats leave these things as the app's problem.
* The implementation has been fuzz-tested multiple ways, including as part of Google's oss-fuzz.
* When there are security bugs, I issue advisories like this:
* The primary aim of the entire project is to be a Capability-Based Security RPC protocol.
That's what "Cap" in the name comes from. The zero-copy serialization is actually a bonus feature.
I agree entirely, and this is one of my single greatest frustrations with the majority of the current popular IDLs/schema languages.
ASN.1 is hilariously bad in a lot of ways, but one thing it gets absolutely right is strong typing and being able to express constraints (ranges, values dependent on other values). That combined with a canonicalized encoding form (DER) goes a long way in making various error states unrepresentable.
The problem with LDAP, etc. is that they all permit BER, which is a looser superset of DER. It includes (among other things) the ability to represent indefinite-length fields, which are the single biggest source of exploitable bugs in a typical application of ASN.1. Without that, the exploitable surface of DER is much smaller (and especially when implemented in a memory-safe language).
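To make the indefinite-length point concrete, here's a toy sketch (not a real ASN.1 library) of parsing a DER length octet. The BER-only indefinite form (`0x80`) is rejected up front, as is the non-minimal long form that DER forbids:

```python
def read_der_length(data, pos):
    """Parse a DER length at data[pos]; return (length, next_pos).

    Rejects the BER indefinite-length form (0x80) and non-minimal
    long-form encodings, both of which DER forbids.
    """
    first = data[pos]
    if first == 0x80:
        raise ValueError("indefinite length: BER-only, forbidden in DER")
    if first < 0x80:
        return first, pos + 1          # short form: length fits in 7 bits
    num_octets = first & 0x7F          # long form: low bits count octets
    octets = data[pos + 1:pos + 1 + num_octets]
    if len(octets) != num_octets:
        raise ValueError("truncated length")
    length = int.from_bytes(octets, "big")
    if length < 0x80:
        raise ValueError("non-minimal long form: forbidden in DER")
    return length, pos + 1 + num_octets
```

Because every valid value has exactly one encoding, a DER parser never has to scan for an end-of-contents marker, which removes the class of bugs the parent comment describes.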
I've written an ASN.1 parser. The problem isn't the specification (though it is definitely a kitchen sink spec). The problem is the majority of ASN.1 code was written before the year 2000.
ASN.1 started in 1984. That means there are decades of shitty implementations, written well before adversarial input was considered a factor.
There’s a wide set of best practices (use only DER for encoding, avoid legacy string types, etc.) that are widely applied in cryptographic applications, although I don’t know if anybody has written them down explicitly.
More generally: this wasn’t intended to be an endorsement of ASN.1 per se! It was only to say that it got some things right, things that Cap’n Proto and Protobuf appear to have eschewed. I’m not sure it is the right IDL for modern purposes, but I think it’s a useful piece of reference material.
Check out DFDL/Apache Daffodil. [0] A large portion of the development team is working on it specifically for use in a cybersecurity context. (Disclaimer: I was one such contributor, although I'm presently not working on Daffodil.)
Having said that, DFDL fails pretty miserably by the standards set in the article. The main design goal was to be able to describe as many existing data formats as possible, which means the spec is massive and supports a lot of bad ideas.
Despite having its 1.0 release in 2015, and being the most complete implementation, Apache Daffodil still does not fully implement the DFDL spec. And it is not an easy code base to jump into and understand.
> and serialization formats is safety. These tools are typically used in places that deal with untrusted inputs, and features and design choices can go a long way in either exposing or shielding developers from potential safety bugs
My potentially incorrect understanding is that Cap'n Proto's zero copy nature means the serialization format IS the in-memory representation, which means that if you build a Cap'n Proto object on top of non-zeroed memory you can leak data in the padding when transmitting. [Presumably not an issue if the packed encoding is used rather than the zero-copy one]
A MessageBuilder allocates a single large buffer, writes into it, and only allocates further if that buffer is exhausted. If you use a preallocated buffer you can avoid allocation entirely. Very different from Protobuf which allocates strings, arrays, and sub-messages all as separate heap objects.
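Not the actual C++ implementation, but the allocation strategy described above can be sketched roughly like this (names are mine, for illustration only):

```python
class ArenaBuilder:
    """Toy sketch of MessageBuilder-style allocation: one big buffer,
    bump-pointer writes, and a fresh segment only when the current
    one is exhausted."""

    def __init__(self, first_segment_size=8192):
        self.segments = [bytearray(first_segment_size)]
        self.used = 0  # bytes consumed in the current (last) segment

    def allocate(self, nbytes):
        """Reserve nbytes; return (segment, offset) to write into."""
        seg = self.segments[-1]
        if self.used + nbytes > len(seg):
            # Current buffer exhausted: add a new, larger segment.
            self.segments.append(bytearray(max(nbytes, 2 * len(seg))))
            self.used = 0
            seg = self.segments[-1]
        offset = self.used
        self.used += nbytes
        return seg, offset
```

The contrast with per-field heap objects is that the common case here is a pointer bump, and a preallocated first segment can make the whole message allocation-free.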
hmm, i don’t understand how schema-languages can do anything about that though. after all, you just serialize/de-serialize based on provided inputs, GIGO if you will.
safety w.r.t bad/malicious inputs should be a ‘higher level’ concern afaik.
Security is a concern for every layer. It's not magic pixie dust that can be sprinkled on top of software to render it secure!
A while ago I read a great article about how the Adobe PDF serialization format is nearly impossible to secure because it allows inherently unsafe constructs.
For example, it allows cross-references that are basically just arbitrary unaligned pointers. It uses many different alignment and padding algorithms. It has length-prefixed and not-length prefixed sections. Etc, etc...
Apparently it was a serious research exercise to make a safe PDF parser, and they only covered a fraction of the full spec!
To put things in perspective: Originally, PDF allowed arbitrary code execution as a core feature, allowing the output of shell commands to be used as document content.
Teams like Chromium's and Firefox's have just given up and now parse PDF using a sandboxed JavaScript VM because it's too hard to do safely with C++. They parse HTML and JavaScript with C++, but not PDF. Think about that.
A similar issue caused Log4j, where a "format string parser" contained a vulnerability because it was too flexible and allowed network requests to be triggered by user-controlled data.
Even trivial, "surely it must be safe" formats like XML and JSON are riddled with security issues, such as different layers in a microservice architecture having different handling semantics for duplicate keys, null values, etc... This can result in exploits such as authentication and authorization tokens being interpreted by a system one way, but a different way by a different system. For real-world attacks along these lines, search for "request smuggling".
Serialization and parsing are security minefields and it is dangerously naive to just hand-wave that away.
> Serialization and parsing are security minefields and it is dangerously naive to just hand-wave that away.
well, i am not hand-waving them away, i am not sure what the serialization framework can possibly _do_ to make things secure during serialization ?
when execution of user-supplied code is allowed (in the examples that you have outlined above), surely, the layer _executing_ the code cannot really do anything about it ! perhaps you actually did intend to `rm -rf /` ?
policy checking, enforcement etc. has to happen at a higher / different layer. i am not sure why mechanism and policy are being conflated here.
in the same way, you gave the serialization layer a 10mb or whatever sized input to serialize, sure... you get a valid serialized output etc. maybe there is a genuine usecase for that in some context or another f.e. when serializing say image files, or something else etc. etc.
> I am not sure what the serialization framework can possibly _do_ to make things secure during serialization
Loads of things!
A strict specification that can only be interpreted one way goes very far. E.g.: a machine-readable BNF grammar file or something similar with no ambiguities.
A conformance test suite covering corner-cases is surprisingly effective, even with a supposedly perfect spec.
"Be strict with what you generate and lax with what you accept" has been demonstrated over and over again to be a disaster over the long-term in an ecosystem of many groups. Be strict always with what is accepted, not just generated!
Speaking of being strict: schema validation is essential. Strong typing for scalars helps a lot.
The actual implementations of the spec can obviously have a wide range of security features. Never allowing arbitrary type instantiation is critical, yet is a mistake that keeps reoccurring much like SQL injection.
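As a toy illustration of the "strong typing for scalars" point (a hypothetical validator, not any real IDL's API): constraints like ranges can be checked at the deserialization boundary, so downstream code never sees an out-of-range value at all.

```python
def validate(value, *, type_, minimum=None, maximum=None):
    """Validate a scalar against a declared type and optional range --
    the kind of constraint ASN.1 can express and most IDLs can't."""
    # Reject wrong types; bool is excluded since bool subclasses int.
    if not isinstance(value, type_) or isinstance(value, bool):
        raise TypeError(f"expected {type_.__name__}, "
                        f"got {type(value).__name__}")
    if minimum is not None and value < minimum:
        raise ValueError(f"{value} below minimum {minimum}")
    if maximum is not None and value > maximum:
        raise ValueError(f"{value} above maximum {maximum}")
    return value

# e.g. a schema field declared as INTEGER (0..10)
validate(5, type_=int, minimum=0, maximum=10)
```

When the schema compiler emits these checks automatically, the "error state" simply cannot cross the parsing boundary, which is the unrepresentability argument made upthread.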
> I am not sure what the serialization framework can possibly _do_ to make things secure during serialization
>> Loads of things!
>> A strict specification that can only be interpreted one way goes very far. E.g.: a machine-readable BNF grammar file or something similar with no ambiguities.
once again, that is not the domain of the serialization framework ! it is a policy which needs to be established and enforced at input / output layer by the entity which implements it.
a serialization framework should just serialize and deserialize objects to / from an i/o 'channel' f.e. file, network, etc. shackling it with specification / enforcement of security etc. policies seems conflating one concern with another.
Unfortunately these are lessons that have to be learned over and over. Anything based on JSON is generally suspect. If you see the terms "quick" or "simple" in some marketing splash-page, assume the author has not thought about the hard problems like security and long-term interoperability.
Similarly, if you find yourself hand-rolling RPC client code and calling methods on something like "HttpClient" manually, you've done it wrong. That code should have been spat out by a code-generator from a schema.
The schema language might, for example, allow you to specify that an input string/blob must be smaller than 10MB and refuse to deserialize it if it is longer; the same goes for array/list/vector lengths.
The limit might apply to some specific part of the message, rather than the whole. You can't check this without actually deserialising, or at least doing most of the same work.
I believe I have already justified why it might be useful at the protocol/schema level in ways that cannot be replicated at the application level: to eagerly fail on expensive (eg memory) deserialization.
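A sketch of what "eagerly fail" can look like in practice (hypothetical wire format, the 10MB limit standing in for a schema-declared constraint): check the declared length prefix against the limit *before* allocating or reading the payload.

```python
import struct

MAX_BLOB = 10 * 1024 * 1024  # 10 MB, as declared by the hypothetical schema

def read_blob(stream, limit=MAX_BLOB):
    """Read a length-prefixed blob, rejecting oversized declarations
    before allocating anything for the payload."""
    header = stream.read(4)
    if len(header) != 4:
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack(">I", header)  # big-endian u32 prefix
    if length > limit:
        raise ValueError(f"declared size {length} exceeds limit {limit}")
    payload = stream.read(length)
    if len(payload) != length:
        raise ValueError("truncated payload")
    return payload
```

The application layer can't replicate this check: by the time it sees the blob, the memory has already been allocated. Only the deserializer sees the declared size first.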
Knowing that it's from a few years ago gives helpful information to the potential reader.
1) Someone may have read the essay before, remembered the point, but forgotten the exact title. It has, after all, been posted on HN twice before. Simply reading the comment that it's from 4 years ago may be enough to confirm that it's the same article, as it did with me.
While the essay has the date right at the top, in general there is no consistency about where people post a date, or if there is one at all.
I know that I sometimes skim the HN comments first rather than follow a page link because then I don't need to worry about cookie notices, requests to subscribe to a newsletter, etc. This essay does not have those, but enough do that it's a general nuisance.
2) Software changes over time. Cap'n Proto's Schema language may have changed since then - perhaps as a response to this essay - making this essay of historic interest only. While that is not likely the case, it is an example of how having the date may provide helpful information.
This is useful enough that it is accepted practice to add the date to older information. For example, the previous two HN submissions to this essay both include (2019) in the title. See https://news.ycombinator.com/from?site=zenhack.net
The submission guidelines contain no such allowance. They specify a very limited set of cases where the title of a submission can be modified from the web page title.
You can see dang, one of the moderators, write: "It's the convention on HN to add the year to a title when the article isn't from this year." at https://news.ycombinator.com/item?id=27867435 ,
In retrospect, my original comment was probably too terse. It's not obvious if I was suggesting that "(2019)" be added to the title, or if I was claiming that the article was outdated for some reason. That's my fault.
It's common on HN to note the date for reposts. The mods will usually add the year to the title if it's not included. So that's probably all that CoastalCoder was signaling.
Users are not advised to do this themselves. The submission guidelines contain no such allowance. They specify a very limited set of cases where the title of a submission can be modified from the web page title.
Guidelines are not strict rules, as Captain Barbossa pointed out, but let's assume they are.
Since moderators have promoted this practice for at least a decade, it's much more likely your interpretation of the guidelines which is wrong.
Any moderately good rules lawyer would point out the guideline is "please use the original title" not "please use only the original title" nor the even stronger "use only the original title", which turns the polite request into a requirement.
Adding the year is not editorializing. Nor is adding "[video]" or "[pdf]", nor even adding the conference name and year as https://news.ycombinator.com/item?id=37216895 did 11 hours ago.
Since there appears to be a conflict between your understanding of the guidelines and that of the moderators, you should take it up with them at hn@ycombinator.com in hopes of resolving the issue.
After all, the guidelines say "If you're worried about abuse, email hn@ycombinator.com and we'll look at the data." ;)
recency trolling; it's trivial to find the date of publication, but some people seem to think they can get karma points by their selfless public service...
No, some people go to Hacker News to read things that are new. But people love reposting so the compromise is to put the date in the title for old posts so we can easily skip over them.
Why do all these serialization frameworks like protobufs, flatbuffers, capnproto, etc. have bespoke schema languages? Why not just use JSON-based schema so that you don't need custom parsers for it? It would definitely make metaprogramming easier as well.
>Cap’n Proto unions are not first class types. Instead, they are fields of structs
In a binary data description format I designed, I didn't include unions because they are just a special case of what I called "conditional sequences". But then I added them, because it is very difficult or impossible to figure out whether a set of conditional sequences corresponds to a union (all must have the same size, which is only known at encoding/decoding time, and only one must be present), which you need to know when converting into more restrictive formats that only support unions.