
Cap'n Proto 1.0

kentonv on 28 Jul 2023

It’s been a little over ten years since the first release of Cap’n Proto, on April 1, 2013. Today I’m releasing version 1.0 of Cap’n Proto’s C++ reference implementation.

Don’t get too excited! There’s not actually much new. Frankly, I should have declared 1.0 a long time ago – probably around version 0.6 (in 2017) or maybe even 0.5 (in 2014). I didn’t, mostly because there were a few advanced features (like three-party handoff, or shared-memory RPC) that I always felt I wanted to finish before 1.0, but they just kept not reaching the top of my priority list. But the reality is that Cap’n Proto has been relied upon in production for a long time. In fact, you are using Cap’n Proto right now, to view this site, which is served by Cloudflare, which uses Cap’n Proto extensively (and is also my employer, although they used Cap’n Proto before they hired me). Cap’n Proto is used to encode millions (maybe billions) of messages and gigabits (maybe terabits) of data every single second of every day. As for those still-missing features, the real world has seemingly proven that they aren’t actually that important. (I still do want to complete them though.)

Ironically, what finally motivated the 1.0 release is the desire to start working on 2.0. But again, don’t get too excited! Cap’n Proto 2.0 is not slated to be a revolutionary change. Rather, there are a number of changes we (the Cloudflare Workers team) would like to make to Cap’n Proto’s C++ API, and its companion, the KJ C++ toolkit library. Over the ten years these libraries have been available, I have kept their APIs pretty stable, despite being 0.x versioned. But for 2.0, we want to make some sweeping backwards-incompatible changes, in order to fix some footguns and improve developer experience for those on our team.

Some users probably won’t want to keep up with these changes. Hence, I’m releasing 1.0 now as a sort of “long-term support” release. We’ll backport bugfixes as appropriate to the 1.0 branch for the long term, so that people who aren’t interested in changes can just stick with it.

What’s actually new in 1.0?

Again, not a whole lot has changed since the last version, 0.10. But there are a few things worth mentioning:

A number of optimizations were made to improve performance of Cap’n Proto RPC. These include reducing the amount of memory allocation done by the RPC implementation and KJ I/O framework, adding the ability to elide certain messages from the RPC protocol to reduce traffic, and doing better buffering of small messages that are sent and received together to reduce syscalls. These are incremental improvements.
Breaking change: Previously, servers could opt into allowing RPC cancellation by calling context.allowCancellation() after a call was delivered. In 1.0, opting into cancellation is instead accomplished using an annotation on the schema (the allowCancellation annotation defined in c++.capnp). We made this change after observing that in practice, we almost always wanted to allow cancellation, but we almost always forgot to do so. The schema-level annotation can be set on a whole file at a time, which makes it much harder to forget. Moreover, the dynamic opt-in required a lot of bookkeeping that had a noticeable performance impact in practice; switching to the annotation provided a performance boost. For users that never used context.allowCancellation() in the first place, there’s no need to change anything when upgrading to 1.0 – cancellation is still disallowed by default. (If you are affected, you will see a compile error. If there’s no compile error, you have nothing to worry about.)
KJ now uses kqueue() to handle asynchronous I/O on systems that have it (macOS and BSD derivatives). KJ has historically always used epoll on Linux, but until now had used a slower poll()-based approach on other Unix-like platforms.
KJ’s HTTP client and server implementations now support the CONNECT method.
A new class capnp::RevocableServer was introduced to assist in exporting RPC wrappers around objects whose lifetimes are not controlled by the wrapper. Previously, avoiding use-after-free bugs in such scenarios was tricky.
Many, many smaller bug fixes and improvements. See the PR history for details.
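The practical difference between these event backends is mostly about where the list of watched file descriptors lives. The following is a minimal, hypothetical sketch (not KJ's actual event loop; the class name ReadinessWaiter is invented for illustration) that selects epoll on Linux, kqueue on macOS and the BSDs, and poll() elsewhere. With epoll and kqueue, interest is registered with the kernel once and remembered; with poll(), the whole interest list is handed to the kernel on every wait, which is part of why it scales worse as the number of descriptors grows.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>
#include <unistd.h>

// Pick a readiness backend at compile time (hypothetical sketch, not KJ code).
#if defined(__linux__)
#include <sys/epoll.h>
#define SKETCH_EPOLL 1
#elif defined(__APPLE__) || defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__NetBSD__)
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#define SKETCH_KQUEUE 1
#else
#include <poll.h>
#endif

// Waits for registered file descriptors to become readable.
class ReadinessWaiter {
 public:
  ReadinessWaiter() {
#if defined(SKETCH_EPOLL)
    fd_ = epoll_create1(0);
#elif defined(SKETCH_KQUEUE)
    fd_ = kqueue();
#endif
#if defined(SKETCH_EPOLL) || defined(SKETCH_KQUEUE)
    if (fd_ < 0) throw std::runtime_error("failed to create event queue");
#endif
  }
  ~ReadinessWaiter() {
    if (fd_ >= 0) close(fd_);
  }
  ReadinessWaiter(const ReadinessWaiter&) = delete;
  ReadinessWaiter& operator=(const ReadinessWaiter&) = delete;

  // epoll/kqueue: the kernel keeps this registration.
  // poll(): we keep the list ourselves and resend it on every wait().
  void watchReadable(int fd) {
#if defined(SKETCH_EPOLL)
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(fd_, EPOLL_CTL_ADD, fd, &ev) < 0) throw std::runtime_error("epoll_ctl failed");
#elif defined(SKETCH_KQUEUE)
    struct kevent change;
    EV_SET(&change, fd, EVFILT_READ, EV_ADD, 0, 0, 0);
    if (kevent(fd_, &change, 1, nullptr, 0, nullptr) < 0) throw std::runtime_error("kevent failed");
#else
    pollfd p;
    p.fd = fd;
    p.events = POLLIN;
    p.revents = 0;
    polled_.push_back(p);
#endif
  }

  // Returns readable descriptors, blocking up to timeoutMs (must be >= 0 here).
  std::vector<int> wait(int timeoutMs) {
    std::vector<int> ready;
#if defined(SKETCH_EPOLL)
    epoll_event events[16];
    int n = epoll_wait(fd_, events, 16, timeoutMs);
    for (int i = 0; i < n; ++i) ready.push_back(events[i].data.fd);
#elif defined(SKETCH_KQUEUE)
    struct kevent events[16];
    timespec ts;
    ts.tv_sec = timeoutMs / 1000;
    ts.tv_nsec = (timeoutMs % 1000) * 1000000L;
    int n = kevent(fd_, nullptr, 0, events, 16, &ts);
    for (int i = 0; i < n; ++i) ready.push_back(static_cast<int>(events[i].ident));
#else
    // The entire interest list is copied into the kernel on each call: O(n) per wait.
    int n = poll(polled_.data(), polled_.size(), timeoutMs);
    for (int i = 0; n > 0 && i < static_cast<int>(polled_.size()); ++i) {
      if (polled_[i].revents & POLLIN) ready.push_back(polled_[i].fd);
    }
#endif
    return ready;
  }

 private:
  int fd_ = -1;
#if !defined(SKETCH_EPOLL) && !defined(SKETCH_KQUEUE)
  std::vector<pollfd> polled_;
#endif
};
```

A real event loop also handles write readiness, signals, and edge- versus level-triggered semantics, all of which this sketch leaves out.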
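The lifetime problem that capnp::RevocableServer is meant to help with can be illustrated without Cap'n Proto at all. The sketch below is a hypothetical, single-threaded model of the idea and not the real RevocableServer API; the names RevocableRef, RevocableOwner, and Counter are invented for this example. Clients hold a shared pointer to a forwarding wrapper, while whoever owns the real object revokes the wrapper before the object is destroyed, so a late call fails with an exception instead of touching freed memory.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical forwarding reference that can be revoked (not the capnp API).
template <typename T>
class RevocableRef {
 public:
  explicit RevocableRef(T& target) : target_(&target) {}

  // After revoke(), every access throws instead of dereferencing a dangling pointer.
  void revoke() { target_ = nullptr; }
  bool revoked() const { return target_ == nullptr; }

  T& get() {
    if (target_ == nullptr) throw std::logic_error("target object was revoked");
    return *target_;
  }

 private:
  T* target_;
};

// Held by the owner of the real object. Declare it after the object so that it
// is destroyed first, revoking the reference before the object goes away.
template <typename T>
class RevocableOwner {
 public:
  explicit RevocableOwner(T& target)
      : ref_(std::make_shared<RevocableRef<T>>(target)) {}
  ~RevocableOwner() { ref_->revoke(); }
  RevocableOwner(const RevocableOwner&) = delete;
  RevocableOwner& operator=(const RevocableOwner&) = delete;

  // Clients (e.g. an RPC layer) may keep this pointer alive arbitrarily long.
  std::shared_ptr<RevocableRef<T>> share() const { return ref_; }

 private:
  std::shared_ptr<RevocableRef<T>> ref_;
};

// Example object whose lifetime the wrapper does not control.
struct Counter {
  int value = 0;
  int increment() { return ++value; }
};
```

Declaring the owner after the object it guards relies on C++ destroying local variables in reverse order of declaration, so revocation always happens before the object is freed.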

What’s planned for 2.0?

The changes we have in mind for version 2.0 of Cap’n Proto’s C++ implementation are mostly NOT related to the protocol itself, but rather to the C++ API and especially to KJ, the C++ toolkit library that comes with Cap’n Proto. These changes are motivated by our experience building a large codebase on top of KJ: namely, the Cloudflare Workers runtime, workerd.

KJ is a C++ toolkit library, arguably comparable to things like Boost, Google’s Abseil, or Facebook’s Folly. I started building KJ at the same time as Cap’n Proto in 2013, at a time when C++11 was very new and most libraries were not really designing around it yet. The intent was never to create a new standard library, but rather to address specific needs I had at the time. But over many years, I ended up building a lot of stuff. By the time I joined Cloudflare and started the Workers Runtime, KJ already featured a powerful async I/O framework, HTTP implementation, TLS bindings, and more.

Of course, KJ has nowhere near as much stuff as Boost or Abseil, and nowhere near as much engineering effort behind it. You might argue, therefore, that it would have been better to choose one of those libraries to build on. However, KJ had a huge advantage: that we own it, and can shape it to fit our specific needs, without having to fight with anyone to get those changes upstreamed.

One example among many: KJ’s HTTP implementation features the ability to “suspend” the state of an HTTP connection, after receiving headers, and transfer it to a different thread or process to be resumed. This is an unusual thing to want, but is something we needed for resource management in the Workers Runtime. Implementing this required some deep surgery in KJ HTTP and definitely adds complexity. If we had been using someone else’s HTTP library, would they have let us upstream such a change?
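What has to be captured when "suspending" a connection is easier to see in code. The following hypothetical sketch (not KJ's HTTP API; SuspendedRequest, suspendAfterHeaders, and resume are invented names) parses an HTTP/1.1 request head into a plain, movable value. The subtle part is bufferedBody: a buffered reader often pulls more bytes off the socket than just the headers, and any body bytes already read must travel with the snapshot, or the side that resumes the request would see a truncated body.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical snapshot of a request paused right after its headers were parsed.
struct SuspendedRequest {
  std::string method;
  std::string url;
  std::map<std::string, std::string> headers;
  // Bytes already read past the end of the headers. They belong to the body,
  // so they must be handed over together with the parsed headers.
  std::string bufferedBody;
};

// Minimal parser: no validation, no header folding, case-sensitive names.
SuspendedRequest suspendAfterHeaders(const std::string& raw) {
  size_t headEnd = raw.find("\r\n\r\n");
  if (headEnd == std::string::npos) throw std::runtime_error("headers incomplete");

  SuspendedRequest req;
  std::string head = raw.substr(0, headEnd);
  req.bufferedBody = raw.substr(headEnd + 4);

  // Request line: METHOD SP URL SP VERSION
  size_t lineEnd = head.find("\r\n");
  std::string requestLine = head.substr(0, lineEnd);
  size_t sp1 = requestLine.find(' ');
  size_t sp2 = requestLine.find(' ', sp1 + 1);
  if (sp1 == std::string::npos || sp2 == std::string::npos) {
    throw std::runtime_error("bad request line");
  }
  req.method = requestLine.substr(0, sp1);
  req.url = requestLine.substr(sp1 + 1, sp2 - sp1 - 1);

  // Header lines: "Name: value"
  size_t pos = (lineEnd == std::string::npos) ? head.size() : lineEnd + 2;
  while (pos < head.size()) {
    size_t next = head.find("\r\n", pos);
    if (next == std::string::npos) next = head.size();
    std::string line = head.substr(pos, next - pos);
    size_t colon = line.find(':');
    if (colon != std::string::npos) {
      size_t valueStart = line.find_first_not_of(' ', colon + 1);
      req.headers[line.substr(0, colon)] =
          valueStart == std::string::npos ? std::string() : line.substr(valueStart);
    }
    pos = next + 2;
  }
  return req;
}

// Stand-in for resuming elsewhere: it takes ownership of the snapshot by value.
// In a real system this would run on another thread or in another process.
std::string resume(SuspendedRequest req) {
  return req.method + " " + req.url + " host=" + req.headers.at("Host") +
         " body=" + req.bufferedBody;
}
```

Because the snapshot is an ordinary value with no references back into the original connection object, moving it to another thread (or serializing it for another process) does not leave dangling state behind.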

That said, even though we own KJ, we’ve still tried to avoid making any change that breaks third-party users, and this has held back some changes that would probably benefit Cloudflare Workers. We have therefore decided to “fork” it. Version 2.0 is that fork.

Development of version 2.0 will take place on Cap’n Proto’s new v2 branch. The master branch will become the 1.0 LTS branch, so that existing projects which track master are not disrupted by our changes.

We don’t yet know all the changes we want to make as we’ve only just started thinking seriously about it. But here are some ideas we’ve had so far:

We will require a compiler with support for C++20, or maybe even C++23. Cap’n Proto 1.0 only requires C++14.
In particular, we will require a compiler that supports C++20 coroutines, as lots of KJ async code will be refactored to rely on coroutines. This should both make the code clearer and improve performance by reducing memory allocations. However, coroutine support is still spotty – as of this writing, GCC seems to hit an internal compiler error (ICE) on KJ’s coroutine implementation.
Cap’n Proto’s RPC API, KJ’s HTTP APIs, and others are likely to be revised to make them more coroutine-friendly.
kj::Maybe will become more ergonomic. It will no longer overload nullptr to represent the absence of a value; we will introduce kj::none instead. KJ_IF_MAYBE will no longer produce a pointer, but instead a reference (a trick that becomes possible by utilizing C++17 features).
We will drop support for compiling with exceptions disabled. KJ’s coding style uses exceptions as a form of software fault isolation, or “catchable panics”, such that errors can cause the “current task” to fail out without disrupting other tasks running concurrently. In practice, this ends up affecting every part of how KJ-style code is written. And yet, since the beginning, KJ and Cap’n Proto have been designed to accommodate environments where exceptions are turned off at compile time, using an elaborate system to fall back to callbacks and distinguish between fatal and non-fatal exceptions. In practice, maintaining this ability has been a drag on development – no-exceptions mode is constantly broken and must be tediously fixed before each release. Even when the tests are passing, it’s likely that a lot of KJ’s functionality realistically cannot be used in no-exceptions mode due to bugs and fragility. Today, I would strongly recommend against anyone using this mode except maybe for the most basic use of Cap’n Proto’s serialization layer. Meanwhile, though, I’m honestly not sure if anyone uses this mode at all! In theory I would expect many people do, since many people choose to use C++ with exceptions disabled, but I’ve never actually received a single question or bug report related to it. It seems very likely that this was wasted effort all along. By removing support, we can simplify a lot of stuff and probably do releases more frequently going forward.
Similarly, we’ll drop support for no-RTTI mode and other exotic modes that are a maintenance burden.
We may revise KJ’s approach to reference counting, as the current design has proven to be unintuitive to many users.
We will fix a longstanding design flaw in kj::AsyncOutputStream, where EOF is currently signaled by destroying the stream. Instead, we’ll add an explicit end() method that returns a Promise. Destroying the stream without calling end() will signal an erroneous disconnect. (There are several other aesthetic improvements I’d like to make to the KJ stream APIs as well.)
We may want to redesign several core I/O APIs to be a better fit for Linux’s new-ish io_uring event notification paradigm.
The RPC implementation may switch to allowing cancellation by default. As discussed above, this is opt-in today, but in practice I find it’s almost always desirable, and disallowing it can lead to subtle problems.
And so on.
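As an illustration of what the kj::Maybe changes could look like in practice, here is a hypothetical, self-contained sketch (it needs C++17). None, none, Maybe, ptrOrNull, and IF_SOME are stand-ins invented for this example and are not KJ's real definitions. It shows an explicit none value replacing nullptr, and a macro that uses C++17 if-statements with initializers to bind a reference to the contained value rather than a pointer.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical explicit "absent" value, used instead of nullptr.
struct None {};
inline constexpr None none{};

// Hypothetical Maybe<T>, backed by std::optional for brevity.
template <typename T>
class Maybe {
 public:
  Maybe(None) {}
  Maybe(T value) : value_(std::move(value)) {}
  bool operator==(None) const { return !value_.has_value(); }
  // Helper used by IF_SOME below.
  T* ptrOrNull() { return value_ ? &*value_ : nullptr; }

 private:
  std::optional<T> value_;
};

// Binds `name` as a reference (not a pointer) to the contained value.
// Variables declared in an if-initializer stay in scope in the else branch,
// and the "if (...; false) {} else" shape lets a user-written trailing `else`
// attach to the outer if. Only use with lvalues: the pointer must not
// outlive a temporary.
#define IF_SOME(name, exp)                                           \
  if (auto* name##_ptr_ = (exp).ptrOrNull(); name##_ptr_ != nullptr) \
    if (auto& name = *name##_ptr_; false) {                          \
    } else

int valueOr(Maybe<int>& m, int fallback) {
  IF_SOME(value, m) {
    return value;  // `value` is an int&, so no dereference is needed
  }
  return fallback;
}

// Because `value` is a reference, writes go straight into the Maybe.
void increment(Maybe<int>& m) {
  IF_SOME(value, m) { ++value; }
}

std::string describe(Maybe<std::string>& m) {
  std::string out;
  IF_SOME(s, m) {
    out = "some(" + s + ")";
  } else {
    out = "none";
  }
  return out;
}
```

The nested-if shape may trigger a "dangling else" warning on some compilers; a production macro would also need a unique variable name per expansion.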
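The proposed kj::AsyncOutputStream change separates "finished cleanly" from "went away". Below is a simplified, synchronous sketch of that contract. It is hypothetical: OutputStream and Sink are invented names, and in the design described above end() would return a Promise rather than complete immediately. A stream destroyed without calling end() reports an abort to the receiving side instead of a clean EOF.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical receiving side that records how the stream finished.
struct Sink {
  enum class Status { Open, CleanEof, Aborted };
  Status status = Status::Open;
  std::string data;
};

// Simplified, synchronous model of an output stream with an explicit end().
class OutputStream {
 public:
  explicit OutputStream(std::shared_ptr<Sink> sink) : sink_(std::move(sink)) {}
  OutputStream(const OutputStream&) = delete;
  OutputStream& operator=(const OutputStream&) = delete;

  // Destruction without end() is treated as an erroneous disconnect, not EOF.
  ~OutputStream() {
    if (!ended_) sink_->status = Sink::Status::Aborted;
  }

  void write(const std::string& bytes) {
    if (ended_) throw std::logic_error("write() after end()");
    sink_->data += bytes;
  }

  // Explicitly signals a clean EOF to the receiver.
  void end() {
    ended_ = true;
    sink_->status = Sink::Status::CleanEof;
  }

 private:
  std::shared_ptr<Sink> sink_;
  bool ended_ = false;
};
```

With this contract, a receiver can no longer mistake a sender that crashed or was cancelled mid-body for one that finished successfully.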

It’s worth noting that at present, there is no plan to make any backwards-incompatible changes to the serialization format or RPC protocol. The changes being discussed only affect the C++ API. Applications written in other languages are completely unaffected by all this.

It’s likely that a formal 2.0 release will not happen for some time – probably a few years. I want to make sure we get through all the really big breaking changes we want to make, before we inflict update pain on most users. Of course, if you’re willing to accept breakages, you can always track the v2 branch. Cloudflare Workers releases from v2 twice a week, so it should always be in good working order.