Discuss on Groups View on GitHub

Encoding Spec

Organization

64-bit Words

For the purpose of Cap’n Proto, a “word” is defined as 8 bytes, or 64 bits. Since alignment of data is important, all objects (structs, lists, and blobs) are aligned to word boundaries, and sizes are usually expressed in terms of words. (Primitive values are aligned to a multiple of their size within a struct or list.)

Messages

The unit of communication in Cap’n Proto is a “message”. A message is a tree of objects, with the root always being a struct.

Physically, messages may be split into several “segments”, each of which is a flat blob of bytes. Typically, a segment must be loaded into a contiguous block of memory before it can be accessed, so that the relative pointers within the segment can be followed quickly. However, when a message has multiple segments, it does not matter where those segments are located in memory relative to each other; inter-segment pointers are encoded differently, as we’ll see later.

Ideally, every message would have only one segment. However, there are a few reasons why splitting a message into multiple segments may be convenient:

The first word of the first segment of the message is always a pointer pointing to the message’s root struct.

Objects

Each segment in a message contains a series of objects. For the purpose of Cap’n Proto, an “object” is any value which may have a pointer pointing to it. Pointers can only point to the beginning of objects, not into the middle, and no more than one pointer can point at each object. Thus, objects and the pointers connecting them form a tree, not a graph. An object is itself composed of primitive data values and pointers, in a layout that depends on the kind of object.

At the moment, there are three kinds of objects: structs, lists, and far-pointer landing pads. Blobs might also be considered to be a kind of object, but are encoded identically to lists of bytes.

Value Encoding

Primitive Values

The built-in primitive types are encoded as follows:

Primitive types must always be aligned to a multiple of their size. Note that since the size of a Bool is one bit, this means eight Bool values can be encoded in a single byte – this differs from C++, where the bool type takes a whole byte.

Enums

Enums are encoded the same as UInt16.

Object Encoding

Blobs

The built-in blob types are encoded as follows:

Structs

A struct value is encoded as a pointer to its content. The content is split into two sections: data and pointers, with the pointer section appearing immediately after the data section. This split allows structs to be traversed (e.g., copied) without knowing their type.

A struct pointer looks like this:

lsb                      struct pointer                       msb
+-+-----------------------------+---------------+---------------+
|A|             B               |       C       |       D       |
+-+-----------------------------+---------------+---------------+

A (2 bits) = 0, to indicate that this is a struct pointer.
B (30 bits) = Offset, in words, from the end of the pointer to the
    start of the struct's data section.  Signed.
C (16 bits) = Size of the struct's data section, in words.
D (16 bits) = Size of the struct's pointer section, in words.

Fields are positioned within the struct according to an algorithm with the following principles:

Field offsets are computed by the Cap’n Proto compiler. The precise algorithm is too complicated to describe here, but you need not implement it yourself, as the compiler can produce a compiled schema format which includes offset information.

Default Values

A default struct is always all-zeros. To achieve this, fields in the data section are stored xor’d with their defined default values. An all-zero pointer is considered “null”; accessor methods for pointer fields check for null and return a pointer to their default value in this case.

There are several reasons why this is desirable:

Zero-sized structs.

As stated above, a pointer whose bits are all zero is considered a null pointer, not a struct of zero size. To encode a struct of zero size, set A, C, and D to zero, and set B (the offset) to -1.

Historical explanation: A null pointer is intended to be treated as equivalent to the field’s default value. Early on, it was thought that a zero-sized struct was a suitable synonym for null, since interpreting an empty struct as any struct type results in a struct whose fields are all default-valued. So, the pointer encoding was designed such that a zero-sized struct’s pointer would be all-zero, so that it could conveniently be overloaded to mean “null”.

However, it turns out there are two important differences between a zero-sized struct and a null pointer. First, applications often check for null explicitly when implementing optional fields. Second, an empty struct is technically equivalent to the default value for the struct type, whereas a null pointer is equivalent to the default value for the particular field. These are not necessarily the same.

It therefore became necessary to find a different encoding for zero-sized structs. Since the struct has zero size, the pointer’s offset can validly point to any location so long as it is in-bounds. Since an offset of -1 points to the beginning of the pointer itself, it is known to be in-bounds. So, we use an offset of -1 when the struct has zero size.

Lists

A list value is encoded as a pointer to a flat array of values.

lsb                       list pointer                        msb
+-+-----------------------------+--+----------------------------+
|A|             B               |C |             D              |
+-+-----------------------------+--+----------------------------+

A (2 bits) = 1, to indicate that this is a list pointer.
B (30 bits) = Offset, in words, from the end of the pointer to the
    start of the first element of the list.  Signed.
C (3 bits) = Size of each element:
    0 = 0 (e.g. List(Void))
    1 = 1 bit
    2 = 1 byte
    3 = 2 bytes
    4 = 4 bytes
    5 = 8 bytes (non-pointer)
    6 = 8 bytes (pointer)
    7 = composite (see below)
D (29 bits) = Size of the list:
    when C <> 7: Number of elements in the list.
    when C = 7: Number of words in the list, not counting the tag word
    (see below).

The pointed-to values are tightly-packed. In particular, Bools are packed bit-by-bit in little-endian order (the first bit is the least-significant bit of the first byte).

When C = 7, the elements of the list are fixed-width composite values – usually, structs. In this case, the list content is prefixed by a “tag” word that describes each individual element. The tag has the same layout as a struct pointer, except that the pointer offset (B) instead indicates the number of elements in the list. Meanwhile, section (D) of the list pointer – which normally would store this element count – instead stores the total number of words in the list (not counting the tag word). The reason we store a word count in the pointer rather than an element count is to ensure that the extents of the list’s location can always be determined by inspecting the pointer alone, without having to look at the tag; this may allow more-efficient prefetching in some use cases. The reason we don’t store struct lists as a list of pointers is because doing so would take significantly more space (an extra pointer per element) and may be less cache-friendly.

In the future, we could consider implementing matrixes using the “composite” element type, with the elements being fixed-size lists rather than structs. In this case, the tag would look like a list pointer rather than a struct pointer. As of this writing, no such feature has been implemented.

A struct list must always be written using C = 7. However, a list of any element size (except C = 1, i.e. 1-bit) may be decoded as a struct list, with each element being interpreted as being a prefix of the struct data. For instance, a list of 2-byte values (C = 3) can be decoded as a struct list where each struct has 2 bytes in their “data” section (and an empty pointer section). A list of pointer values (C = 6) can be decoded as a struct list where each struct has a pointer section with one pointer (and an empty data section). The purpose of this rule is to make it possible to upgrade a list of primitives to a list of structs, as described under the protocol evolution rules. (We make a special exception that boolean lists cannot be upgraded in this way due to the unreasonable implementation burden.) Note that even though struct lists can be decoded from any element size (except C = 1), it is NOT permitted to encode a struct list using any type other than C = 7 because doing so would interfere with the canonicalization algorithm.

Inter-Segment Pointers

When a pointer needs to point to a different segment, offsets no longer work. We instead encode the pointer as a “far pointer”, which looks like this:

lsb                        far pointer                        msb
+-+-+---------------------------+-------------------------------+
|A|B|            C              |               D               |
+-+-+---------------------------+-------------------------------+

A (2 bits) = 2, to indicate that this is a far pointer.
B (1 bit) = 0 if the landing pad is one word, 1 if it is two words.
    See explanation below.
C (29 bits) = Offset, in words, from the start of the target segment
    to the location of the far-pointer landing-pad within that
    segment.  Unsigned.
D (32 bits) = ID of the target segment.  (Segments are numbered
    sequentially starting from zero.)

If B == 0, then the “landing pad” of a far pointer is normally just another pointer, which in turn points to the actual object.

If B == 1, then the “landing pad” is itself another far pointer that is interpreted differently: This far pointer (which always has B = 0) points to the start of the object’s content, located in some other segment. The landing pad is itself immediately followed by a tag word. The tag word looks exactly like an intra-segment pointer to the target object would look, except that the offset is always zero.

The reason for the convoluted double-far convention is to make it possible to form a new pointer to an object in a segment that is full. If you can’t allocate even one word in the segment where the target resides, then you will need to allocate a landing pad in some other segment, and use this double-far approach. This should be exceedingly rare in practice since pointers are normally set to point to new objects, not existing ones.

Capabilities (Interfaces)

When using Cap’n Proto for RPC, every message has an associated “capability table” which is a flat list of all capabilities present in the message body. The details of what this table contains and where it is stored are the responsibility of the RPC system; in some cases, the table may not even be part of the message content.

A capability pointer, then, simply contains an index into the separate capability table.

lsb                    capability pointer                     msb
+-+-----------------------------+-------------------------------+
|A|              B              |               C               |
+-+-----------------------------+-------------------------------+

A (2 bits) = 3, to indicate that this is an "other" pointer.
B (30 bits) = 0, to indicate that this is a capability pointer.
    (All other values are reserved for future use.)
C (32 bits) = Index of the capability in the message's capability
    table.

In rpc.capnp, the capability table is encoded as a list of CapDescriptors, appearing along-side the message content in the Payload struct. However, some use cases may call for different approaches. A message that is built and consumed within the same process need not encode the capability table at all (it can just keep the table as a separate array). A message that is going to be stored to disk would need to store a table of SturdyRefs instead of CapDescriptors.

Serialization Over a Stream

When transmitting a message, the segments must be framed in some way, i.e. to communicate the number of segments and their sizes before communicating the actual data. The best framing approach may differ depending on the medium – for example, messages read via mmap or shared memory may call for a different approach than messages sent over a socket or a pipe. Cap’n Proto does not attempt to specify a framing format for every situation. However, since byte streams are by far the most common transmission medium, Cap’n Proto does define and implement a recommended framing format for them.

When transmitting over a stream, the following should be sent. All integers are unsigned and little-endian.

Packing

For cases where bandwidth usage matters, Cap’n Proto defines a simple compression scheme called “packing”. This scheme is based on the observation that Cap’n Proto messages contain lots of zero bytes: padding bytes, unset fields, and high-order bytes of small-valued integers.

In packed format, each word of the message is reduced to a tag byte followed by zero to eight content bytes. The bits of the tag byte correspond to the bytes of the unpacked word, with the least-significant bit corresponding to the first byte. Each zero bit indicates that the corresponding byte is zero. The non-zero bytes are packed following the tag.

For example, here is some typical Cap’n Proto data (a struct pointer (offset = 2, data size = 3, pointer count = 2) followed by a text pointer (offset = 6, length = 53)) and its packed form:

unpacked (hex):  08 00 00 00 03 00 02 00   19 00 00 00 aa 01 00 00
packed (hex):  51 08 03 02   31 19 aa 01

In addition to the above, there are two tag values which are treated specially: 0x00 and 0xff.

Examples:

unpacked (hex):  00 (x 32 bytes)
packed (hex):  00 03

unpacked (hex):  8a (x 32 bytes)
packed (hex):  ff 8a (x 8 bytes) 03 8a (x 24 bytes)

Notice that both of the special cases begin by treating the tag as if it weren’t special. This is intentionally designed to make encoding faster: you can compute the tag value and encode the bytes in a single pass through the input word. Only after you’ve finished with that word do you need to check whether the tag ended up being 0x00 or 0xff.

It is possible to write both an encoder and a decoder which only branch at the end of each word, and only to handle the two special tags. It is not necessary to branch on every byte. See the C++ reference implementation for an example.

Packing is normally applied on top of the standard stream framing described in the previous section.

Compression

When Cap’n Proto messages may contain repetitive data (especially, large text blobs), it makes sense to apply a standard compression algorithm in addition to packing. When CPU time is scarce, we recommend LZ4 compression. Otherwise, zlib is slower but will compress more.

Canonicalization

Cap’n Proto messages have a well-defined canonical form. Cap’n Proto encoders are NOT required to output messages in canonical form, and in fact they will almost never do so by default. However, it is possible to write code which canonicalizes a Cap’n Proto message without knowing its schema.

A canonical Cap’n Proto message must adhere to the following rules:

Note that Cap’n Proto 0.5 introduced the rule that struct lists must always be encoded using C = 7 in the list pointer. Prior versions of Cap’n Proto allowed struct lists to be encoded using any element size, so that small structs could be compacted to take less than a word per element, and many encoders in fact implemented this. Unfortunately, this “optimization” made canonicalization impossible without knowing the schema, which is a significant obstacle. Therefore, the rules have been changed in 0.5, but data written by previous versions may not be possible to canonicalize.

Security Considerations

A naive implementation of a Cap’n Proto reader may be vulnerable to attacks based on various kinds of malicious input. Implementations MUST guard against these.

Pointer Validation

Cap’n Proto readers must validate pointers, e.g. to check that the target object is within the bounds of its segment. To avoid an upfront scan of the message (which would defeat Cap’n Proto’s O(1) parsing performance), validation should occur lazily when the getter method for a pointer is called, throwing an exception or returning a default value if the pointer is invalid.

Amplification attack

A message containing cyclic (or even just overlapping) pointers can cause the reader to go into an infinite loop while traversing the content.

To defend against this, as the application traverses the message, each time a pointer is dereferenced, a counter should be incremented by the size of the data to which it points. If this counter goes over some limit, an error should be raised, and/or default values should be returned. We call this limit the “traversal limit” (or, sometimes, the “read limit”).

The C++ implementation currently defaults to a limit of 64MiB, but allows the caller to set a different limit if desired. Another reasonable strategy is to set the limit to some multiple of the original message size; however, most applications should place limits on overall message sizes anyway, so it makes sense to have one check cover both.

List amplification: A list of Void values or zero-size structs can have a very large element count while taking constant space on the wire. If the receiving application expects a list of structs, it will see these zero-sized elements as valid structs set to their default values. If it iterates through the list processing each element, it could spend a large amount of CPU time or other resources despite the message being small. To defend against this, the “traversal limit” should count a list of zero-sized elements as if each element were one word instead. This rule was introduced in the C++ implementation in commit 1048706.

Stack overflow DoS attack

A message with deeply-nested objects can cause a stack overflow in typical code which processes messages recursively.

To defend against this, as the application traverses the message, the pointer depth should be tracked. If it goes over some limit, an error should be raised. The C++ implementation currently defaults to a limit of 64 pointers, but allows the caller to set a different limit.