proposal: encoding/json: opt-in for true streaming support #33714
Comments
Thanks for filing this issue. There have indeed been previous discussions around this topic, but they've all been in separate places. The most recent discussion I remember is https://go-review.googlesource.com/c/go/+/135595, which included a working implementation for the encoder, along with benchmark numbers. Edit: just realised you link it above as well.

I assume that this proposal is mainly driven by performance. If that's the case, what's the expected win from such API changes and internal refactors? It's hard to make a decision without experimental numbers. For example, if the wins on the current benchmarks are within a few percent, I'd say it's not worth the extra complexity and the large rewrite.

I'd also say that you should look at a recent master, or at least a 1.13 tag, when experimenting with changes.
Yes: improving performance, and reducing the cumbersome code that works around the current limitations. As an example, I've written some pretty ugly code using the Tokenizer interface to read large JSON responses from CouchDB.

This exact benefit is tricky to measure accurately with a Go benchmark suite. That said, I expect there is room for some easily measured performance gains. I'll try to put together some benchmarks and add them to this issue.
Good suggestion, and of course for any serious testing, I will do that.
Fair enough. With the encoder, if one wants to stream lots of elements, it's been suggested before to do something like:
I understand that this is harder to do with the decoder, as you then have to deal with the tokenizer, like you did. So it seems like your "large JSON" problem is more about decoding than encoding - is that correct?
I think both problems are worth solving. I don't know which is "bigger". Probably for my own use case, decoding is more painful (if only because working with the Tokenizer is more cumbersome). It seems historically more people have complained about the decoding side, too.

To provide a real-world example (again from CouchDB): to upload a file attachment, you include the following value in your JSON document*:
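Per the CouchDB documentation, an inline attachment embeds the file content as a base64 string in the `data` field, roughly like this (a reconstruction based on the CouchDB docs, not the original snippet from this comment):

```json
{
  "_id": "mydoc",
  "_attachments": {
    "photo.jpg": {
      "content_type": "image/jpeg",
      "data": "<base64-encoded file content>"
    }
  }
}
```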
In this scenario, the ideal situation would probably be to read the content of the files directly from disk and stream it to the network, rather than buffering internally.

I hope that all makes sense :)

*This isn't the only way to upload attachments--there are methods that don't require bloating your payload with base64, but I think this example still illustrates the point.
This isn't the most interesting benchmark yet, but it's what I could throw together quickly, based on my previous experimentation: flimzy#1
It would be helpful to see:
I'm not sure how informative this would be, given that the other libraries (that I know of) take a vastly different approach, making any benchmarks against them an apples/oranges comparison. For example, json-iterator exposes special functions for every data type, to avoid reflection. The other leading third-party json libraries use code generation and/or don't support streaming.
This would be informative.
Why is it not worth benchmarking libraries with streaming support that use "vastly different" approaches? Any project concerned with JSON performance would consider them.
That depends on the project's needs. Obviously some people feel that the different approaches are useful, or the libraries wouldn't be used. I don't think the standard library is likely to adopt the techniques used by those libraries (and for good reason), so I'm not sure what value such benchmarks provide to this discussion. If you're just curious about benchmarks, most of these projects provide them. See here, for example.
I have some more benchmarks to share. Still nothing earth-shattering, but building on the previous work I mentioned above.

Before (standard implementation):
After (streaming implementation):
And what about memory demands? Isn't that the principal benefit of streaming encoding/decoding?
I think such an API would be useful. |
Hi all, we kicked off a discussion for a possible "encoding/json/v2" package that addresses the spirit of this proposal. |
Overview
I have long wanted proper streaming support in the `encoding/json` library. I've been doing some homework to understand the current state of things, and I think I've come to grips with most of it.

A number of previous issues relate to this topic: #7872, #11046, #12001, #14140
In a nutshell: the library implicitly guarantees that marshaling will never write an incomplete JSON object due to an error, and that during unmarshaling, it will never pass an incomplete JSON message to `UnmarshalJSON`. This seems a reasonable, conservative default, but it is not always the desired behavior.

Work toward this has been done on a couple of occasions, but abandoned or stalled for various reasons. See https://go-review.googlesource.com/c/go/+/13818/ and https://go-review.googlesource.com/c/go/+/135595
See also my related post on golang-nuts: https://groups.google.com/d/msg/golang-nuts/ABD4fTkP4Nc/bliIAAAeAQAJ
The problem to be solved
Dealing with large JSON structures is inefficient, due to the internal buffering done by `encoding/json`. `json.NewEncoder` and `json.NewDecoder` appear to offer streaming benefits, but this is mostly an idiomatic advantage, not a performance one, as internal buffering still takes place.

To elaborate: when encoding, even with `json.Encoder`, the entire object is marshaled into memory before it is written to the `io.Writer`. This proposal allows writing the JSON output immediately, rather than waiting for the entire process to complete successfully first.

The same problem occurs in reverse: when reading a large JSON object, you cannot begin processing the result until the entire result is received.
A naïve solution
I believe a simple solution (simple from the perspective of a consumer of the library--the internal changes are not so simple) would be to add two interfaces:
During (un)marshaling, where `encoding/json` looks for `json.Marshaler` and `json.Unmarshaler` respectively, it will now look for (and possibly prefer) the new interfaces instead. Wrapping either the old or new interfaces to work as the other is a trivial matter.

With this change, and the requisite internal changes, it would be possible to begin streaming large JSON data to a server immediately, from within a `MarshalJSONStream()` implementation, for instance.

The drawback is that it violates the above-mentioned promise of complete reads and writes, even with errors.
Making it Opt-in
To accommodate this requirement, I believe it would be possible to expose the streaming functionality only via the `json.Encoder` and `json.Decoder` implementations, and only when `SetDirect*` (name TBD, borrowed from https://go-review.googlesource.com/c/go/+/135595/8/src/encoding/json/stream.go#283) is enabled. So, further, two new functions would be added to the public API.

The default behavior, even when a type implements one of the new `Stream*` interfaces, will be to operate on an entire JSON object at once. That is to say, the Encoder will internally buffer `MarshalJSONStream`'s output and process any error before continuing, and a Decoder will read an entire JSON object into a buffer, then pass it to `UnmarshalJSONStream` only if there are no errors.

However, when `SetDirect*` is enabled, the library will bypass this internal buffering, allowing for immediate streaming to/from the source/destination.

Enabling streaming with the `SetDirect*` toggle could be enough to provide a benefit for many users, even without the use of the additional interfaces above.

Toggling `SetDirect*` on will, of course, enable streaming for all types, not just those which implement the new interfaces above, so this could be considered a separate part of the proposal. In my opinion, this alone would be worth implementing, even if the new interface types above are done later or never.

Internals
CLs 13818 and 135595 can serve as informative references for this part of the discussion. I've also done some digging in the `encoding/json` package (as of 1.12) recently, for more current context.

A large number of internal changes will be necessary to allow for this. I started playing around with a few internals, and I believe this is doable, but it will mean a lot of code churn, so it will need to be done carefully, in small steps with good code review.
As an exercise, I have successfully rewritten `indent()` to work with streams, rather than on byte slices, and began doing the same with `compact()`. The `encodeState` type would need to work with a standard `io.Writer` rather than specifically a `bytes.Buffer`. This seems to be a bigger change, but not technically difficult. I know there are other changes needed--I haven't done a complete audit of the code.

An open question is how these changes might impact performance. My benchmarks after changing `indent()` showed no change in performance, but it wasn't a particularly rigorous test.

With the internals rewritten to support streams, it's then just a matter of doing the internal buffering at the appropriate place, such as at API boundaries (i.e. in `Marshal()` and `Unmarshal()`), rather than as a built-in fundamental concept. Then, as described above, that buffering would be turned off when the streaming options are enabled.

Final comments
To be clear, I am interested in working on this. I’m not just trying to throw out a “nice to have, now would somebody do this for me?” type of proposal. But I want to make sure I fully understand the history and context of this situation before I head too far down this rabbit hole.
I'm curious to hear the opinions of others who have been around longer. Perhaps such a proposal was already discussed (and possibly rejected?) in greater length than I can find in the above linked tickets. If so, please point me to the relevant conversation(s).
I am aware of several third-party libraries that offer some support like this, but most have various drawbacks (relying on code generation, or over-complex APIs). I would love to see this kind of support in the standard library.
If this general direction is approved, I think the first step is to break it into smaller parts that can be accomplished incrementally. I have given this thought, but so as not to jump the gun too much, will withhold my thoughts for a while, to allow proper discussion.
And one last aside: CL 13818 also added support for marshaling channels. That may or may not be a good idea (my personal feeling: probably not), but that can be addressed separately.