A Critique of the Cap'n Proto Schema Language (2019) (zenhack.net)
117 points by signa11 on Aug 20, 2023 | hide | past | favorite | 72 comments


Great post, digestible without experience in Cap'n Proto specifics.

I really like that they talk in terms of “features carrying their own weight” and survey real world code to see what cases features solve for, and how (surprisingly little) certain features are actually used.

A side effect of dropping features when designing for code generation (in this case) is that it makes things more concise for everyone else as well, including the users.

There are generally no easy answers, but as a principle I think the burden of proof for new features should be high. “Oh it’s nice” is not enough, unless it’s trivial syntactic sugar. It should really be at least a strong need and ideally multiple needs per feature, so to say.


Premature optimization is the root of all evil, again. It applies to performance and to features alike. Unless you can prove (not formally, but to your own satisfaction) that a feature will make things better, don't implement it.


I think one of the most desirable and under-appreciated goals of schema languages and serialization formats is safety. These tools are typically used in places that deal with untrusted inputs, and features and design choices can go a long way in either exposing or shielding developers from potential safety bugs.

My reading of Cap'n Proto didn't make it sound like safety was the highest priority. At least not above performance.


With all due respect, you read completely wrong.

* The very first use case for which Cap'n Proto was designed was to be the protocol that Sandstorm.io used to talk between sandbox and supervisor -- an explicitly adversarial security scenario.

* The documentation explicitly calls out how implementations should manage resource-exhaustion problems like excessive recursion depth (a stack-overflow risk), where many serialization formats leave these things as the app's problem.

* The implementation has been fuzz-tested multiple ways, including as part of Google's oss-fuzz.

* When there are security bugs, I issue advisories like this:

https://github.com/capnproto/capnproto/tree/v2/security-advi...

* The primary aim of the entire project is to be a Capability-Based Security RPC protocol. That's what "Cap" in the name comes from. The zero-copy serialization is actually a bonus feature.

(I'm the author of Cap'n Proto.)


I agree entirely, and this is one of my single greatest frustrations with the majority of the current popular IDLs/schema languages.

ASN.1 is hilariously bad in a lot of ways, but one thing it gets absolutely right is strong typing and being able to express constraints (ranges, values dependent on other values). That combined with a canonicalized encoding form (DER) goes a long way in making various error states unrepresentable.
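To make that concrete, here is a minimal Python sketch (not real ASN.1 tooling; the function name is invented) of what a subtype constraint like `INTEGER (0..65535)` buys you: the decoder itself rejects out-of-range values, so application code never sees them.

```python
def decode_constrained_int(raw: bytes, lo: int, hi: int) -> int:
    """Decode a big-endian signed integer and enforce a declared range,
    in the spirit of ASN.1's "INTEGER (lo..hi)" subtype constraints."""
    value = int.from_bytes(raw, "big", signed=True)
    if not (lo <= value <= hi):
        raise ValueError(f"value {value} outside declared range {lo}..{hi}")
    return value
```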


Except that ASN.1 is egregiously hard to check for wonky values because of its complex parsing.

Exactly how many vulnerabilities have been exploited in LDAP, SNMP, etc. because ASN.1 is so terrible?


ASN.1 isn’t an encoding; DER is.

The problem with LDAP, etc. is that they all permit BER, which is a looser superset of DER. It includes (among other things) the ability to represent indefinite-length fields, which are the single biggest source of exploitable bugs in a typical application of ASN.1. Without that, the exploitable surface of DER is much smaller (and especially when implemented in a memory-safe language).


I've written an ASN.1 parser. The problem isn't the specification (though it is definitely a kitchen sink spec). The problem is the majority of ASN.1 code was written before the year 2000.

ASN.1 started in 1984. That means there are decades of shitty implementations, written well before adversarial input was considered a factor.


Is there a reasonable subset of ASN.1 that could get traction nowadays if specified separately?


There’s a wide set of best practices (use only DER for encoding, avoid legacy string types, etc.) that are widely applied in cryptographic applications, although I don’t know if anybody has written them down explicitly.

More generally: this wasn’t intended to be an endorsement of ASN.1 per se! It was only to say that it got some things right, things that Cap’n Proto and Protobuf appear to have eschewed. I’m not sure it is the right IDL for modern purposes, but I think it’s a useful piece of reference material.


I had to use it at work in a C++ environment and ended up settling on patching a copy of https://github.com/vlm/asn1c

Can't say I'd feel confident putting any of this stuff in a public service. Too complex and prone to bugs.


Check out DFDL/Apache Daffodil. [0] A large portion of the development team is working on it specifically for use in a cybersecurity context. (Disclaimer: I was one such contributor, although I am presently not working on Daffodil.)

Having said that, DFDL fails pretty miserably by the standards set in the article. The main design goal was to be able to describe as many existing data formats as possible, which means the spec is massive and supports a lot of bad ideas.

Despite having its 1.0 release in 2015, and being the most complete implementation, Apache Daffodil still does not fully implement the DFDL spec. And it is not an easy code base to jump into and understand.

[0] https://daffodil.apache.org/


> and serialization formats is safety. These tools are typically used in places that deal with untrusted inputs, and features and design choices can go a long way in either exposing or shielding developers from potential safety bugs

My potentially incorrect understanding is that Cap'n Proto's zero copy nature means the serialization format IS the in-memory representation, which means that if you build a Cap'n Proto object on top of non-zeroed memory you can leak data in the padding when transmitting. [Presumably not an issue if the packed encoding is used rather than the zero-copy one]
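The hazard is easy to demonstrate. Below is a Python sketch (the 16-byte record layout is invented for illustration) of a fixed-layout encoder that reuses a scratch buffer: if it writes only the declared fields, whatever stale bytes sit in the padding go out on the wire.

```python
RECORD_SIZE = 16  # hypothetical layout: 8-byte id, 8 bytes of padding

def encode_unsafe(buf: bytearray, record_id: int) -> bytes:
    # Writes only the id field; bytes 8..16 keep whatever was there before.
    buf[0:8] = record_id.to_bytes(8, "little")
    return bytes(buf[:RECORD_SIZE])

def encode_safe(buf: bytearray, record_id: int) -> bytes:
    buf[:RECORD_SIZE] = b"\x00" * RECORD_SIZE  # zero the padding first
    buf[0:8] = record_id.to_bytes(8, "little")
    return bytes(buf[:RECORD_SIZE])

scratch = bytearray(b"SECRET-PASSWORD!")  # stale data from a prior use
leaked = encode_unsafe(scratch, 42)       # padding carries "ASSWORD!"
clean = encode_safe(scratch, 42)          # padding is all zeros
```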


It's only zero copy to parse/read. The builders allocate all over the place.


A MessageBuilder allocates a single large buffer, writes into it, and only allocates further if that buffer is exhausted. If you use a preallocated buffer you can avoid allocation entirely. Very different from Protobuf which allocates strings, arrays, and sub-messages all as separate heap objects.
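The allocation strategy being described looks roughly like this (illustrative Python with invented names, not the actual Cap'n Proto API): one buffer, a bump pointer, and a growth path that only kicks in when the buffer is exhausted.

```python
class ArenaBuilder:
    """Arena-style message builder: all objects live in one buffer."""

    def __init__(self, first_segment_size: int = 1024):
        self.buf = bytearray(first_segment_size)
        self.used = 0

    def allocate(self, nbytes: int) -> int:
        """Reserve nbytes, growing the single buffer only if exhausted."""
        if self.used + nbytes > len(self.buf):
            self.buf.extend(b"\x00" * max(nbytes, len(self.buf)))
        offset = self.used
        self.used += nbytes
        return offset

    def write(self, offset: int, data: bytes) -> None:
        self.buf[offset : offset + len(data)] = data
```

Contrast with a Protobuf-style builder, where each string, array, and sub-message would be its own heap object.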


hmm, i don’t understand how schema languages can do anything about that though. after all, you just serialize/de-serialize based on provided inputs; GIGO, if you will.

safety w.r.t bad/malicious inputs should be a ‘higher level’ concern afaik.


Safety means: garbage in, error out.


right, and it should not be left to the serialization layer for that.


Security is a concern for every layer. It's not magic pixie dust that can be sprinkled on top of software to render it secure!

A while ago I read a great article about how the Adobe PDF serialization format is nearly impossible to secure because it allows inherently unsafe constructs.

For example, it allows cross-references that are basically just arbitrary unaligned pointers. It uses many different alignment and padding algorithms. It has length-prefixed and not-length prefixed sections. Etc, etc...

Apparently it was a serious research exercise to make a safe PDF parser, and they only covered a fraction of the full spec!

To put things in perspective: Originally, PDF allowed arbitrary code execution as a core feature, allowing the output of shell commands to be used as document content.

Most people like the Chromium and Firefox teams have just given up and now parse PDF using a sandboxed JavaScript VM because it's too hard to do it safely with C++. They parse HTML and JavaScript with C++, but not PDF. Think about that.

A similar issue caused Log4j, where a "format string parser" contained a vulnerability because it was too flexible and allowed network requests to be triggered by user-controlled data.

Even trivial, "surely it must be safe" formats like XML and JSON are riddled with security issues, such as different layers in a microservice architecture having different handling semantics for duplicate keys, null values, etc... This can result in exploits such as authentication and authorization tokens being interpreted by a system one way, but a different way by a different system. For real-world attacks along these lines, search for "request smuggling".

Serialization and parsing are security minefields and it is dangerously naive to just hand-wave that away.

See: https://seriot.ch/projects/parsing_json.html
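The duplicate-key hazard is reproducible in a few lines. Python's `json.loads` silently keeps the last duplicate key; a parser elsewhere in the pipeline may keep the first, and that disagreement is exactly the smuggling vector. A strict loader can reject duplicates via `object_pairs_hook`:

```python
import json

def reject_duplicates(pairs):
    """object_pairs_hook that refuses objects with repeated keys."""
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError(f"duplicate key: {key!r}")
        obj[key] = value
    return obj

doc = '{"role": "user", "role": "admin"}'
lax = json.loads(doc)        # {'role': 'admin'} -- last key silently wins
strict = lambda s: json.loads(s, object_pairs_hook=reject_duplicates)
```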


> Serialization and parsing are security minefields and it is dangerously naive to just hand-wave that away.

well, i am not hand-waving them away, i am not sure what the serialization framework can possibly _do_ to make things secure during the serialization ?

when execution of user-supplied code is allowed (in the examples that you have outlined above), surely, the layer _executing_ the code cannot really do anything about it ! perhaps you actually did intend to `rm -rf /` ?

policy checking, enforcement etc. has to happen at a higher / different layer. i am not sure why mechanism and policy are being conflated here.

in the same way, you gave the serialization layer a 10mb or whatever sized input to serialize, sure...you get a valid serialized output etc. maybe there is a genuine usecase for that in some context or another f.e. when serializing say image files, or something else etc. etc.

[edit] : minor comment.


> I am not sure what can the serialization framework possibly _do_ to make things secure during the serialization

Loads of things!

A strict specification that can only be interpreted one way goes very far. E.g.: a machine-readable BNF grammar file or something similar with no ambiguities.

A conformance test suite covering corner-cases is surprisingly effective, even with a supposedly perfect spec.

"Be strict with what you generate and lax with what you accept" has been demonstrated over and over again to be a disaster over the long-term in an ecosystem of many groups. Be strict always with what is accepted, not just generated!

Speaking of being strict: schema validation is essential. Strong typing for scalars helps a lot.

The actual implementations of the spec can obviously have a wide range of security features. Never allowing arbitrary type instantiation is critical, yet this is a mistake that keeps recurring, much like SQL injection.

Etc, etc...
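On the type-instantiation point, a sketch of the safe pattern (illustrative Python; the names are invented): the decoder maps wire tags to a closed allow-list of constructors instead of instantiating whatever class name appears in the input, which is the mistake behind many pickle- and gadget-chain-style exploits.

```python
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int

# Closed set, fixed at build time -- never derived from the payload.
ALLOWED_TYPES = {"Person": Person}

def decode(payload: dict):
    ctor = ALLOWED_TYPES.get(payload.get("type"))
    if ctor is None:
        raise ValueError(f"refusing to instantiate {payload.get('type')!r}")
    return ctor(**payload["fields"])
```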


> I am not sure what can the serialization framework possibly _do_ to make things secure during the serialization

>> Loads of things!

>> A strict specification that can only be interpreted one way goes very far. E.g.: a machine-readable BNF grammar file or something similar with no ambiguities.

once again, that is not the domain of the serialization framework ! it is a policy which needs to be established and enforced at input / output layer by the entity which implements it.

a serialization framework should just serialize and deserialize objects to / from an i/o 'channel' f.e. file, network, etc. shackling it with specification / enforcement of security etc. policies seems conflating one concern with another.


what's the best modern alternative that is designed in this way?


gRPC ticks most of the checkboxes.

Unfortunately these are lessons that have to be learned over and over. Anything based on JSON is generally suspect. If you see the terms "quick" or "simple" in some marketing splash-page, assume the author has not thought about the hard problems like security and long-term interoperability.

Similarly, if you find yourself hand-rolling RPC client code and calling methods on something like "HttpClient" manually, you've done it wrong. That code should have been spat out by a code-generator from a schema.


> gRPC ticks most of the checkboxes.

huh :) ! gRPC is a 'r-p-c' framework, and uses protobuf for serialization. you should be comparing protobuf to cap'nproto.


it depends on what type of safety.

The schema language might for example allow you to specify that an input string/blob should be smaller than 10MB and refuse to deserialize it if it is longer, same for array/list/vector length.
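As a sketch of how that plays out at the wire level (illustrative Python; the length-prefixed framing and the limit constant are hypothetical, not any real schema language's output), the decoder can fail on the declared length *before* allocating anything:

```python
import struct

# Limit declared in the (hypothetical) schema for this field.
MAX_BLOB_BYTES = 10 * 1024 * 1024

def read_blob(stream) -> bytes:
    """Read one length-prefixed blob, rejecting oversized lengths
    before any allocation happens."""
    header = stream.read(4)
    if len(header) != 4:
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack("<I", header)
    if length > MAX_BLOB_BYTES:
        raise ValueError(f"blob of {length} bytes exceeds schema limit")
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("truncated blob")
    return data
```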


It feels like a check against an input size of 10MB is something you would do well before deserialization, no?


The limit might apply to some specific part of the message, rather than the whole. You can't check this without actually deserialising, or at least doing most of the same work.


not if it is a message you receive from a third party.

A concrete example might be a batching third party client: the app sends N messages in a single batch and each message has its own size limit.


You would, but others might not. Defense in depth.


> ... allow you to specify that an input string/blob should be smaller than 10MB and refuse to deserialize it if it is longer ...

why ? are there no cases where serializing even larger file is valid ?


sure, a lot of cases; I suspect that S3's upload limits are different from imgur's.


and feel free to do that in _your_ application. don’t shackle others with the limitations of your domain.

mechanism vs policy and all that.


I believe I have already justified why it might be useful at the protocol/schema level in ways that cannot be replicated at the application level: to eagerly fail on expensive (eg memory) deserialization.


Disregard for safety and security in serialization is one of the most common, if not the most common, cause for security vulnerabilities.


Rest in peace, Ian, my friend.


For context: it seems that the author passed away last July

https://www.winchesteruu.org/2023/07/25/joys-sorrows-and-tra...



"Last" in this context refers to the year. Last July would be July 2022; July 2023 is this July. July 2023 will become last July at the start of 2024.


Thank you. Sorry for my bad English. "This July" from now on :)


OMG. I'm so sorry to see this news!

RIP zenhack


2019


yes ?


Knowing that it's from a few years ago gives helpful information to the potential reader.

1) Someone may have read the essay before, remembered the point, but forgotten the exact title. It has, after all, been posted on HN twice before. Simply reading the comment that it's from 4 years ago may be enough to confirm that it's the same article, as it did with me.

While the essay has the date right at the top, in general there is no consistency about where people post a date, or if there is one at all.

I know that I sometimes skim the HN comments first rather than follow a page link because then I don't need to worry about cookie notices, requests to subscribe to a newsletter, etc. This essay does not have those, but enough do that it's a general nuisance.

2) Software changes over time. Cap'n Proto's Schema language may have changed since then - perhaps as a response to this essay - making this essay of historic interest only. While that is not likely the case, it is an example of how having the date may provide helpful information.

This is useful enough that it is accepted practice to add the date to older information. For example, the previous two HN submissions to this essay both include (2019) in the title. See https://news.ycombinator.com/from?site=zenhack.net


Yeah, that ^^^^ was pretty much my reasoning. Thanks for articulating it so clearly.

Honestly I didn't expect my original comment to be controversial.


The submission guidelines contain no such allowance. They specify a very limited set of cases where the title of a submission can be modified from the web page title.


Thanks. I wasn't aware that putting a date in the title might violate the guidelines.


It does not.

You can see dang, one of the moderators, write: "It's the convention on HN to add the year to a title when the article isn't from this year." at https://news.ycombinator.com/item?id=27867435 ,

write "On HN, it's the convention to add the year when a story is more than a year and a bit old." at https://news.ycombinator.com/item?id=16306830

and write "It's just the convention to add the year in parens (and for commenters to mention when we missed one)." at https://news.ycombinator.com/item?id=8244765 .

Found via https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... .

If orhmeh09 has issue with how to interpret the submission guidelines, I suggest contacting hn@ycombinator.com to resolve the issue.


The guidelines are here. It's unreasonable to expect people to search for them in a comment. https://news.ycombinator.com/newsguidelines.html


As this is identical to orhmeh09's comment at https://news.ycombinator.com/item?id=37215578 I point to my response at https://news.ycombinator.com/item?id=37221400 .


Thanks for clarifying the rules.

In retrospect, my original comment was probably too terse. It's not obvious if I was suggesting that "(2019)" be added to the title, or if I was claiming that the article was outdated for some reason. That's my fault.


Knowing that it's from a few years ago gives helpful information to the potential reader.

because it's so difficult to find the date of publication before people waste their precious time goofing off from their overpaid jobs.


does that invalidate the criticisms?


It's common on HN to note the date for reposts. The mods will usually add the year to the title if it's not included. So that's probably all that CoastalCoder was signaling.


Users are not advised to do this themselves. The submission guidelines contain no such allowance. They specify a very limited set of cases where the title of a submission can be modified from the web page title.


There is a long-established convention to include the year if the article is a year and a bit old.

See dang, one of the moderators, affirm this multiple times at https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... , most recently 17 days ago.


The guidelines are here. It's unreasonable to expect people to search for them in a comment. https://news.ycombinator.com/newsguidelines.html


Guidelines are not strict rules, as Captain Barbossa pointed out, but let's assume they are.

Since moderators have promoted this practice for at least a decade, it's much more likely your interpretation of the guidelines which is wrong.

Any moderately good rules lawyer would point out the guideline is "please use the original title" not "please use only the original title" nor the even stronger "use only the original title", which turns the polite request into a requirement.

Adding the year is not editorializing. Nor is adding "[video]" or "[pdf]", nor even adding the conference name and year as https://news.ycombinator.com/item?id=37216895 did 11 hours ago.

Since there appears to be a conflict between your understanding of the guidelines and that of the moderators, you should take it up with them at hn@ycombinator.com in hopes of resolving the issue.

After all, the guidelines say "If you're worried about abuse, email hn@ycombinator.com and we'll look at the data." ;)


recency trolling; it's trivial to find the date of publication, but some people seem to think they can get karma points by their selfless public service...


No, some people go to Hacker News to read things that are new. But people love reposting so the compromise is to put the date in the title for old posts so we can easily skip over them.


such people must not be easily disappointed


Why do all these serialization frameworks like protobufs, flatbuffers, capnproto, etc. have bespoke schema languages? Why not just use JSON-based schema so that you don't need custom parsers for it? It would definitely make metaprogramming easier as well.


Because JSON-based schemas are excessively verbose, making them painful to read and write, to the point where really nobody wants to do so.

Consider:

    struct Person {
      name @0 :Text;
      age @1 :UInt16;
    }
vs.:

    "declarations": [
      {
        "name": "Person",
        "kind": "struct",
        "fields": [
          {
            "name": "name",
            "type": "text",
            "ordinal": 0
          }, {
            "name": "age",
            "type": "uint16",
            "ordinal": 1
          }
        ]
      }
    ]


Also, JSON schemas can represent the exact same type in multiple ways, meaning some form of normalization is required beforehand.


Why not just use JSON for everything? Why isn't /etc/passwd JSON?


Avro uses json for its schema language.


But it also has a schema language, and avro.reflect, which can construct a schema from a Java class.

Avro schemas in JSON are very unreadable in my experience.

Another bad thing is something like ["int", "null"] gets formatted over 2 lines by standard JSON formatters, making it even harder to read.


> But also has a schema language,

Really? I admittedly don't know a whole lot about avro, but everything I've been able to find about avro schemas uses json.

Edit: I found it: https://avro.apache.org/docs/1.11.1/idl-language/


Yeah JSON is usually the way it's written. IIRC the IDL even lacks some features of JSON.

But the fact it exists means at least some people find JSON verbose.


>Cap’n Proto unions are not first class types. Instead, they are fields of structs

In some binary data description format, I didn't include unions because they are just a special case of what I called "conditional sequences". But then I added them, because it is very difficult or impossible to figure out whether a set of conditional sequences corresponds to a union (all must have the same size, which is only known at encoding/decoding time, and only one must be present), which you need to know when converting into more restrictive formats that only support unions.


Indeed, there's a good reason why sums (unions) and products (records) are the most primitive abstractions in various logics and mathematics.


Is there any serialization format that handles (discriminated) unions well?



