Remove minprotobuf/protobuf for use of protobuf_serialization. #199
Conversation
var byteAddresses: seq[seq[byte]] = newSeq[seq[byte]](addresses.len)
for a in 0 ..< addresses.len:
  byteAddresses[a] = addresses[a].data.buffer
Protobuf.encode(SerializableRequest(
nicer here would be something like (not exact syntax, just for the idea):
Protobuf.encode(SerializableRequest{...})
  .reqType(..)
  .connectionProperties(proc (x: auto):
    x.peerOrAddress(...)
  )
protobuf is "naturally" streamable on writing (except the length prefix) - requiring the construction of the entire object up-front, in memory, just to write a message is quite the hurdle.
Right. So there is a writeField API, which I debated using, except it loses most of the benefits of this library and instead mimics the existing minprotobuf API, which this is explicitly meant to replace.
The main problem with porting the existing type definitions is the way values are serialized. Addresses are serialized as a seq[byte], but in reality, they're an Object(Object(seq[byte])). This can be... solved by using an inline pragma of sorts. That is, don't construct the outer object, but treat its children as immediate children. That requires some... field number trickery, but is possible.
That said, then we restart the whole discussion about what pragmas we want to include and how to keep this library concise.
I also would like to note that, although this uses Protobuf.encode from nim-serialization, the API does fully expose the streams. It would also be possible to work out a cast-based solution, where we cast to equivalent objects, yet with their own pragmas, so we can serialize the start of the object, then manually serialize the addresses in the middle, and then finish with the end. That has its own set of problems and concerns, yet would offer no copies combined with a minimal amount of field manipulation.
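For context, here is a rough sketch of the nesting being discussed; the type and field names are only approximations of the real nim-libp2p definitions, not taken from this PR:

type
  VBuffer = object
    buffer: seq[byte]            # the raw multiaddress bytes

  MultiAddress = object
    data: VBuffer                # nested object wrapping the bytes

  SerializableRequest = object
    reqType: uint32              # proto field 1
    addrs: seq[seq[byte]]        # proto field 2: the wire format wants plain bytes

# byteAddresses[a] = addresses[a].data.buffer flattens MultiAddress -> seq[byte]
# by hand; an "inline" pragma would let the encoder skip the outer objects and
# serialize the inner buffer directly under the outer field number.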
well, the thing is that writing a few fields of protobuf should be a completely trivial thing to do - the object definitions serve as canonical sources of truth in this case for the field number and type, but beyond that, it's a very high mental tax to pay to bring in hundreds of lines of code just to... write a few bytes to an array - the previous code is a lot more digestible from this point of view.
Instead of being something trivial to audit, analyse and debug, it becomes a black box that does magic that is completely opaque and unrelated to the general notion of "put these key-values in this seq".
the thing here is that the data we're trying to encode does not exist as a single object to begin with - when you're reading, you can't really do much better than putting the data in an intermediary "transport" object because of out-of-order fields and last-field-overwrites etc - but that's not really the case when writing.
that said, you won't do much better with a streaming api: because of the varint length prefix, you must know the length of the serialized data and either you have to make two passes or buffer everything in a seq - since most protobufs are small, it's customary to go with the latter (even though more advanced frameworks will have support for the former as well)
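As a concrete illustration of the buffer-then-prefix approach described above (plain varint arithmetic, no particular library API assumed):

proc putUVarint(n: uint64): seq[byte] =
  ## Encodes n as a protobuf varint (7 bits per byte, MSB = continuation).
  var x = n
  while true:
    if x < 0x80:
      result.add byte(x)
      break
    result.add byte((x and 0x7F) or 0x80)
    x = x shr 7

proc lengthPrefixed(payload: seq[byte]): seq[byte] =
  ## One full buffering pass: the payload must already be in memory
  ## before its length can be written in front of it.
  result = putUVarint(uint64 payload.len)
  result.add payload

# lengthPrefixed(@[byte 0x08, 0x01]) == @[byte 2, 0x08, 0x01]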
The "black box" solution has an advantage though. The auditing can be considered "modular". One auditing team/effort can verify that the ProtoBuf library implements the rules of the format correctly and another team/effort can verify that LibP2P is using ProtoBuf as specified.
With that said, the writeField set of APIs in nim-serialization exists precisely because constructing a whole object is sometimes undesirable. The P2P protocol layer and Chronicles, for example, simulate the writing of an entire object by writing its corresponding fields one by one (the fields are accepted as parameters to the RPC procs). It's possible to define some high-level syntax-sugar APIs in nim-serialization to make this more accessible to the user code.
Regarding streamed writing, this is handled efficiently in the library due to the support for "delayed writes" in FastStreams. This solves the problem of varint length prefixes without buffering to a sequence or introducing a "measuring pass".
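A self-contained illustration of what field-by-field writing amounts to on the wire, reusing putUVarint from the sketch above; this only shows the idea behind a writeField-style API, not nim-serialization's actual implementation:

proc writeVarintField(buf: var seq[byte], fieldNum: int, value: uint64) =
  buf.add putUVarint(uint64(fieldNum shl 3) or 0)   # key: field number, wire type 0 (varint)
  buf.add putUVarint(value)

proc writeBytesField(buf: var seq[byte], fieldNum: int, value: seq[byte]) =
  buf.add putUVarint(uint64(fieldNum shl 3) or 2)   # key: field number, wire type 2 (bytes)
  buf.add putUVarint(uint64 value.len)              # per-field length prefix
  buf.add value

var msg: seq[byte]
msg.writeVarintField(1, 4)                  # field 1: request type
for a in [@[byte 1, 2], @[byte 3, 4]]:      # stand-ins for byteAddresses from the diff above
  msg.writeBytesField(2, a)                 # field 2: repeated bytes, one call per address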
I'd like to note why these rigid type definitions exist. They provide a single, canonical, centralized definition of every object, without the use of wrapper functions spread out through the entire codebase. While the library itself may be considered a bit of a black box, as it is an external library, the API it provides is consistent and the type definitions are explicit.
As for the length prefixing, as zah said, delayed writes solve that problem.
For some reason, when I wrote my previous comment about casts, I completely forgot fields can be unordered. If you mark those arrays as don't serialize, create a stream and call newProtobufWriter with it, manually doing the delayed write + prefix, you can use the writeField API to finish it off. Then you don't need any casts. That means we can just use the existing type definitions, yet lose a bit of safety without wrapper functions.
That said, while you can write like that, you can't read like that. It'd require iterating over the entire stream to find the field, or at least parsing it into a buffer table of int, seq[byte]. As this library doesn't iterate over the entire stream for every field, nor provide a buffer table, you'll still need a wrapper object to read into.
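For reading, the wrapper ends up looking roughly like this; how field numbers are attached (via pragmas or otherwise) is an assumption about protobuf_serialization's syntax, not taken from its docs:

type
  RequestTransport = object
    reqType: uint32          # proto field 1
    addrs: seq[seq[byte]]    # proto field 2; converted to MultiAddress afterwards

# Decoding fills in the whole object because fields may arrive out of order and
# a later occurrence of a field overwrites an earlier one, e.g. (call shape
# assumed): let req = Protobuf.decode(input, RequestTransport)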
efficiently, how?
The buffering mechanism in FastStreams divides the stream into "pages". A page is allocated and written to only once, but you can create multiple "fat pointers" to the same page. Each such pointer represents a sub-span of the page.
When you do a delayed write, you "split" the page by creating a new fat pointer pointing to the same buffer, but now it has an adjustable start position that will be determined once the delayed write is completed.
Since the fat pointers are stored in a Deque, this usually doesn't require any new allocations. In the end, when all delayed writes are completed, the outstanding page spans are flushed to the output device, potentially using some vectored I/O API. Please note that in this setup, every byte is written to exactly once and you never need to copy any buffer.
The same mechanism can be used for implementing a zero-copy multiplexer. The incoming bytes exist in sharable pages. Something like Mplex only needs to determine where the boundaries between the sub-streams are and push the same buffers with adjusted readable spans to the next layer.
In async pipes and channels, the pages can flow from one stream to another, or they can be shared between a reader and a writer that only need to adjust two offsets - writtenTo and consumedTo.
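A conceptual model of the page/span mechanism described above; the names and fields are illustrative, not FastStreams internals:

import std/deques

type
  Page = ref object
    data: seq[byte]            # allocated and written to only once

  PageSpan = object
    page: Page                 # "fat pointer" into a shared page
    startPos: int              # adjustable once a delayed write completes
    endPos: int

  PagedBuffer = object
    spans: Deque[PageSpan]     # splitting a page = pushing another span,
                               # usually with no new allocation

proc flush(b: PagedBuffer): seq[byte] =
  # Walk the spans in order; a real implementation could hand these ranges
  # to a vectored I/O call instead of concatenating them.
  for s in b.spans:
    result.add s.page.data[s.startPos ..< s.endPos]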
ok.. so.. I don't quite understand the difference between that and a seq and a length_counter in a local variable (minus the paging) though?
ie if the final destination is disk or the network, ultimately, you will need to buffer the whole chunk in some temporary location (memory, fancy swap-backed pages with layers of fat pointers, whatever), then write the length, then write the data to the actual sink (with writev if you want to be fancy)?
There will be a difference in the async channel scenario (because you'll then have to copy the sequence to a buffer that will be sent to a different thread).
Another difference will be in the case where your length-prefixed content is smallish. You wouldn't want to send it out to the network immediately, so you'll first write it to the temporary sequence, then to a higher-level buffer in the stream and finally it will go out to the I/O device.
OTOH, larger content written to a temporary sequence will suffer from re-allocations and the buffer copying associated with that.
Maybe it's not clear how writing the length works in the end. In the stream buffers, you've left a "hole" when the delayed write was started. You have specified the maximum size for the length prefix and once you know its final size, you close the hole by writing the bytes in the right place in the stream buffer and then you adjust the page span boundaries in such a way that some bytes remain effectively unused.
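A minimal sketch of closing that hole in a flat buffer; FastStreams does this with page spans rather than one seq, so this only shows the arithmetic:

const maxVarintLen = 10                    # enough for any uint64 varint

proc closeHole(buf: var seq[byte], payloadLen: int): int =
  ## Writes the varint for payloadLen into the reserved hole at the start of
  ## buf, right-aligned against the payload, and returns the offset where the
  ## valid data begins; buf[0 ..< result] stays effectively unused.
  var tmp: seq[byte]
  var x = uint64 payloadLen
  while true:
    if x < 0x80:
      tmp.add byte(x)
      break
    tmp.add byte((x and 0x7F) or 0x80)
    x = x shr 7
  let start = maxVarintLen - tmp.len
  for i, b in tmp:
    buf[start + i] = b
  result = start

var buf = newSeq[byte](maxVarintLen)       # reserve the "hole" up front
buf.add @[byte 0x08, 0x01]                 # payload is written after the hole
let msgStart = closeHole(buf, buf.len - maxVarintLen)
# buf[msgStart .. ^1] is the final length-prefixed message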
The protobuf-serialization library has had PB2 semantics added via status-im/nim-protobuf-serialization#5. This PR has been updated to use the new library.
Checks failed as I forgot to add protobuf_serialization to the Nimble file. Even once that happens, this will fail until n-p-s 5 is merged. Before I edit the Nimble file, should I add "protobuf_serialization", in preparation for addition to the package list, or the status-im URL, since it's not currently in the package list?
not sure if I understand correctly, you're asking if you can add the protobuf serialization package entry to the nimble file?
Asking if I should add protobuf-serialization via "protobuf-serialization", which requires a PR to the Nimble package list, or via the URL, which does not.
I would say we use a link for now, that way we're not blocked by upstream.
Does so via the GitHub URL as it's not in the Nimble package directory yet. Doesn't work currently, as nim-protobuf-serialization 0.2.0 has yet to be merged into its master.
Obsoleted by #838
Reliant on status-im/nim-protobuf-serialization#2 being merged and added to the Nimble package registry. If the blocking library requires further modifications, to its handling of proto2/proto3/required, this PR will need the appropriate adjustments.