Large column value not split into chunks #91
Hello, I browsed the copy-to implementation here - https://github.com/brianc/node-pg-copy-streams/blob/master/copy-to.js#L41. I fear that some people using the library currently expect it to output full rows, so we would have to think about backward compatibility if we were to modify this behaviour. I did not check the postgres documentation about this, but I remember that there were cases where postgres itself was handling full rows. Can I ask what your use case is?
That is a very valid fear. I'm wondering whether the abstraction used here is ideal. I was under the impression that the contract for streams is that they deliver a larger whole efficiently in small chunks and avoid holding anything in memory in its entirety; one effectively receives parts of the end result. For a contract that emits full rows, would an EventEmitter-based approach be a more intuitive solution? There one could explicitly state what is arriving:
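For illustration, a hypothetical row-oriented interface could look like the following sketch (none of these names exist in pg-copy-streams; the point is only that the event name states the unit of delivery):

```js
// Hypothetical sketch only -- this API is not part of pg-copy-streams.
// An EventEmitter makes the unit of delivery explicit via the event name.
const { EventEmitter } = require('events')

const copyEvents = new EventEmitter()

copyEvents.on('row', (row) => {
  // `row` would be one complete row, buffered by the producer before emission
  console.log('received a full row of %d bytes', row.length)
})
copyEvents.on('end', () => console.log('copy complete'))

// a producer would feed it, e.g. copyEvents.emit('row', someRowBuffer)
```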
To stream an image stored in a BYTEA column to the client through a Node.js server.
Inspecting this further: in theory, the pg-copy-streams implementation could strip the control data and push the payload from those 64kB chunks forward as they arrive. Would you be open to this kind of solution? I could create a PR.
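A rough sketch of that idea, assuming the input already consists of whole backend protocol messages (which a real implementation cannot assume, since message headers can straddle chunk boundaries):

```js
const { Transform } = require('stream')

// Simplified sketch: extract CopyData ('d') messages from a chunk and push
// only their payload, without waiting for row boundaries. The source of raw
// protocol bytes is assumed; partial messages are not handled here.
class CopyDataPayload extends Transform {
  _transform(chunk, _encoding, callback) {
    let offset = 0
    while (offset + 5 <= chunk.length) {
      const code = chunk[offset] // message type byte
      const length = chunk.readUInt32BE(offset + 1) // length field includes its own 4 bytes
      if (code === 0x64 /* 'd' = CopyData */) {
        this.push(chunk.slice(offset + 5, offset + 1 + length))
      }
      offset += 1 + length // skip to the next message
    }
    callback()
  }
}

// usage (hypothetical): rawProtocolBytes.pipe(new CopyDataPayload()).pipe(destination)
```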
Hello, I was re-reading the protocol spec regarding COPY, and I think that the current implementation was misled by this sentence about CopyData: "Data that forms part of a COPY data stream. Messages sent from the backend will always correspond to single data rows, but messages sent by frontends might divide the data stream arbitrarily."
Since the goal of this library is to have high throughput and low memory requirements, I could envision a major 3.0 version that would break backward compatibility and more or less keep the chunk size that postgres outputs. Some people might not be happy, but we could document it and recommend a simple streaming csv parser that would aggregate the csv-compatible chunks - https://csv.js.org/parse/ might do the trick (see the sketch after this comment); we need to test it and add a warning with an upgrade path in the changelog. I am not in favor of outputting 'row' events, since we would need to buffer the whole row to pass it in the event, which is exactly what we want to avoid. If you want to give it a try and send a PR, I will find time to review it. As a reminder, we will have to check the following boxes before a 3.0 can be published:
Regarding the storage of images as BYTEA inside postgres, I have been tinkering for some time with ideas around this (small images as a BYTEA field, or big images split over many rows in a table with an offset for re-ordering). Maybe the problem you uncovered can lead to an acceptable "1 row per image" solution, whatever the image size. But there might still be a problem with > 1GB images, because oversized fields are TOASTed - https://www.postgresql.org/docs/9.5/storage-toast.html . Have you already encountered this issue? Keep me posted on this.
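Sketch of the csv parser recommendation above: let postgres emit CSV and reassemble full rows downstream, whatever the chunking. The table name and parser options are assumptions, and the csv-parse import shown is the v4 style (v5 exports { parse }).

```js
const { Client } = require('pg')
const { to: copyTo } = require('pg-copy-streams')
const parse = require('csv-parse') // csv-parse v4 style; v5 uses: const { parse } = require('csv-parse')

async function main() {
  const client = new Client()
  await client.connect()

  // ask postgres for CSV so a downstream parser can find the row boundaries
  const copyStream = client.query(copyTo('COPY my_table TO STDOUT WITH (FORMAT csv)'))
  const parser = parse() // re-assembles full rows regardless of how chunks were cut

  parser.on('data', (row) => {
    // `row` is an array of column values for one complete row
  })
  parser.on('end', () => client.end())
  parser.on('error', (err) => { console.error(err); client.end() })

  copyStream.pipe(parser)
}

main().catch(console.error)
```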
Ohhh yeah, it makes sense to emit fixed-size chunks. It would be a subtle backwards-compatibility break, but I think it's probably better overall for efficiency, as @jeromew said. I agree with all his thoughts FWIW! 😄
Hello @pspi, have you started working on a PR for this? I could do it, but I do not want to collide with your work.
@jeromew Yep, I'm working on it. The radio silence was because I don't have anything to show yet. |
@pspi Great! Thanks for your work on this.
@jeromew You know what, you probably know your way around this module far better than I do, and this issue looks like a tricky nut to crack. If you're eager, why don't you give it a shot as well? We could join forces and compare notes off-channel. You can find my contact info in my profile.
OK, I'll take a look in parallel. I'll send you an email.
After reading the code once again, I better understand the problem: rows are extracted only when a full row has been seen, so when there is a huge row (>64kB), the buffer that we prefix to the pg chunk keeps growing until a full row can be seen. This is probably the degradation that you experience with a huge BYTEA, since the module needs to buffer at least the whole BYTEA before pushing it.
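The shape of the problem is roughly the following (a simplified, self-contained sketch using a newline as the row boundary, not the actual copy-to.js code):

```js
// Simplified sketch of the behaviour described above (not the real copy-to.js):
// the leftover buffer is prefixed to every incoming chunk, and nothing is
// pushed until a complete row (here: a newline) has been seen.
let remainder = Buffer.alloc(0)
const pushed = []

function onChunk(chunk) {
  remainder = Buffer.concat([remainder, chunk]) // keeps growing for a huge row
  let newline
  while ((newline = remainder.indexOf(0x0a)) !== -1) {
    pushed.push(remainder.slice(0, newline + 1)) // only full rows go out
    remainder = remainder.slice(newline + 1)
  }
  // for a multi-megabyte BYTEA value no newline shows up for many chunks,
  // so the whole value sits in `remainder` and is re-concatenated every time
}

// three chunks that together contain a single row
onChunk(Buffer.from('aaaa'))
onChunk(Buffer.from('bbbb'))
onChunk(Buffer.from('cccc\n'))
console.log(pushed.length) // 1 -- nothing was emitted until the row completed
```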
I am temporarily re-opening this issue to make sure we have a common understanding of what is happening.
but after reading @brianc do you confirm that even with the line https://github.com/brianc/node-pg-copy-streams/blob/master/copy-to.js#L30 This would mean that even with I suppose that this does not trigger the same memory/GC issue that @pspi was seeing because Maybe we can find a solution for this
I'll close this as I think it is fixed with 4.0.0.
When reading a large column value, only one large chunk covering the complete value gets emitted. I was expecting the stream to emit multiple smaller chunks.
Test
Output
For large values (for example, a BYTEA column that holds a file), emitting a single chunk slows down performance significantly, because the emitted Buffer has to be allocated for several megabytes. To give you an idea, running this same test with larger field sizes:
Versions: pg 8.0.2, pg-copy-streams 2.2.2, node v13.8.0, PostgreSQL 12.1 on x86_64-apple-darwin16.7.0
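A minimal way to observe the chunking behaviour described above is to log the size of each chunk emitted by the copy-to stream; the table and column names below are assumptions, not the original test:

```js
// Hypothetical reproduction sketch -- table/column names are assumptions.
const { Client } = require('pg')
const { to: copyTo } = require('pg-copy-streams')

async function main() {
  const client = new Client()
  await client.connect()

  const stream = client.query(copyTo('COPY (SELECT payload FROM files WHERE id = 1) TO STDOUT'))

  stream.on('data', (chunk) => {
    // with 2.2.2 this logs one multi-megabyte chunk for a large BYTEA value
    // instead of a series of ~64kB chunks
    console.log('chunk of %d bytes', chunk.length)
  })
  stream.on('end', () => client.end())
  stream.on('error', (err) => {
    console.error(err)
    client.end()
  })
}

main().catch(console.error)
```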